Commit Graph

6460 Commits

Author SHA1 Message Date
Darrick J. Wong
b521c89027 xfs: fix sign handling problem in xfs_bmbt_diff_two_keys
In xfs_bmbt_diff_two_keys, we perform a signed int64_t subtraction with
two unsigned 64-bit quantities.  If the second quantity is actually the
"maximum" key (all ones) as used in _query_all, the subtraction
effectively becomes addition of two positive numbers and the function
returns incorrect results.  Fix this with explicit comparisons of the
unsigned values.  Nobody needs this now, but the online repair patches
will need this to work properly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28 08:31:01 -07:00
Darrick J. Wong
7380e8fec1 xfs: don't return _QUERY_ABORT from xfs_rmap_has_other_keys
The xfs_rmap_has_other_keys helper aborts the iteration as soon as it
has an answer.  Don't let this abort leak out to callers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-08-28 08:31:01 -07:00
Darrick J. Wong
c94613feef xfs: fix maxicount division by zero error
In xfs_ialloc_setup_geometry, it's possible for a malicious/corrupt fs
image to set an unreasonably large value for sb_inopblog which will
cause ialloc_blks to be zero.  If sb_imax_pct is also set, this results
in a division by zero error in the second do_div call.  Therefore, force
maxicount to zero if ialloc_blks is zero.

Note that the kernel metadata verifiers will catch the garbage inopblog
value and abort the fs mount long before it tries to set up the inode
geometry; this is needed to avoid a crash in xfs_db while setting up the
xfs_mount structure.

Found by fuzzing sb_inopblog to 122 in xfs/350.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2019-08-28 08:31:01 -07:00
Darrick J. Wong
519e5869d5 xfs: bmap scrub should only scrub records once
The inode block mapping scrub function does more work for btree format
extent maps than is absolutely necessary -- first it will walk the bmbt
and check all the entries, and then it will load the incore tree and
check every entry in that tree, possibly for a second time.

Simplify the code and decrease check runtime by separating the two
responsibilities.  The bmbt walk will make sure the incore extent
mappings are loaded, check the shape of the bmap btree (via xchk_btree)
and check that every bmbt record has a corresponding incore extent map;
and the incore extent map walk takes all the responsibility for checking
the mapping records and cross referencing them with other AG metadata.

This enables us to clean up some messy parameter handling and reduce
redundant code.  Rename a few functions to make the split of
responsibilities clearer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-08-26 17:43:15 -07:00
zhengbin
71912e08e0 xfs: remove excess function parameter description in 'xfs_btree_sblock_v5hdr_verify'
Fixes gcc warning:

fs/xfs/libxfs/xfs_btree.c:4475: warning: Excess function parameter 'max_recs' description in 'xfs_btree_sblock_v5hdr_verify'
fs/xfs/libxfs/xfs_btree.c:4475: warning: Excess function parameter 'pag_max_level' description in 'xfs_btree_sblock_v5hdr_verify'

Fixes: c5ab131ba0 ("libxfs: refactor short btree block verification")
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 17:43:15 -07:00
Dave Chinner
f8f9ee4794 xfs: add kmem_alloc_io()
Memory we use to submit for IO needs strict alignment to the
underlying driver contraints. Worst case, this is 512 bytes. Given
that all allocations for IO are always a power of 2 multiple of 512
bytes, the kernel heap provides natural alignment for objects of
these sizes and that suffices.

Until, of course, memory debugging of some kind is turned on (e.g.
red zones, poisoning, KASAN) and then the alignment of the heap
objects is thrown out the window. Then we get weird IO errors and
data corruption problems because drivers don't validate alignment
and do the wrong thing when passed unaligned memory buffers in bios.

TO fix this, introduce kmem_alloc_io(), which will guaranteeat least
512 byte alignment of buffers for IO, even if memory debugging
options are turned on. It is assumed that the minimum allocation
size will be 512 bytes, and that sizes will be power of 2 mulitples
of 512 bytes.

Use this everywhere we allocate buffers for IO.

This no longer fails with log recovery errors when KASAN is enabled
due to the brd driver not handling unaligned memory buffers:

# mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 17:43:15 -07:00
Dave Chinner
d916275aa4 xfs: get allocation alignment from the buftarg
Needed to feed into the allocation routine to guarantee the memory
buffers we add to bios are correctly aligned to the underlying
device.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 17:43:14 -07:00
Dave Chinner
0ad95687c3 xfs: add kmem allocation trace points
When trying to correlate XFS kernel allocations to memory reclaim
behaviour, it is useful to know what allocations XFS is actually
attempting. This information is not directly available from
tracepoints in the generic memory allocation and reclaim
tracepoints, so these new trace points provide a high level
indication of what the XFS memory demand actually is.

There is no per-filesystem context in this code, so we just trace
the type of allocation, the size and the allocation constraints.
The kmem code also doesn't include much of the common XFS headers,
so there are a few definitions that need to be added to the trace
headers and a couple of types that need to be made common to avoid
needing to include the whole world in the kmem code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 17:43:14 -07:00
Tetsuo Handa
707e0ddaf6 fs: xfs: Remove KM_NOSLEEP and KM_SLEEP.
Since no caller is using KM_NOSLEEP and no callee branches on KM_SLEEP,
we can remove KM_NOSLEEP and replace KM_SLEEP with 0.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26 12:06:22 -07:00
Darrick J. Wong
1fb254aa98 xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT
Benjamin Moody reported to Debian that XFS partially wedges when a chgrp
fails on account of being out of disk quota.  I ran his reproducer
script:

# adduser dummy
# adduser dummy plugdev

# dd if=/dev/zero bs=1M count=100 of=test.img
# mkfs.xfs test.img
# mount -t xfs -o gquota test.img /mnt
# mkdir -p /mnt/dummy
# chown -c dummy /mnt/dummy
# xfs_quota -xc 'limit -g bsoft=100k bhard=100k plugdev' /mnt

(and then as user dummy)

$ dd if=/dev/urandom bs=1M count=50 of=/mnt/dummy/foo
$ chgrp plugdev /mnt/dummy/foo

and saw:

================================================
WARNING: lock held when returning to user space!
5.3.0-rc5 #rc5 Tainted: G        W
------------------------------------------------
chgrp/47006 is leaving the kernel with locks still held!
1 lock held by chgrp/47006:
 #0: 000000006664ea2d (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0xd2/0x290 [xfs]

...which is clearly caused by xfs_setattr_nonsize failing to unlock the
ILOCK after the xfs_qm_vop_chown_reserve call fails.  Add the missing
unlock.

Reported-by: benjamin.moody@gmail.com
Fixes: 253f4911f2 ("xfs: better xfs_trans_alloc interface")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
2019-08-22 20:55:54 -07:00
Ira Weiny
b68271609c fs/xfs: Fix return code of xfs_break_leased_layouts()
The parens used in the while loop would result in error being assigned
the value 1 rather than the intended errno value.

This is required to return -ETXTBSY from follow on break_layout()
changes.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-19 18:15:28 -07:00
Darrick J. Wong
5d888b481e xfs: fix reflink source file racing with directio writes
While trawling through the dedupe file comparison code trying to fix
page deadlocking problems, Dave Chinner noticed that the reflink code
only takes shared IOLOCK/MMAPLOCKs on the source file.  Because
page_mkwrite and directio writes do not take the EXCL versions of those
locks, this means that reflink can race with writer processes.

For pure remapping this can lead to undefined behavior and file
corruption; for dedupe this means that we cannot be sure that the
contents are identical when we decide to go ahead with the remapping.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-08-18 18:53:25 -07:00
Christoph Hellwig
4529e6d7a6 xfs: compat_ioctl: use compat_ptr()
For 31-bit s390 user space, we have to pass pointer arguments through
compat_ptr() in the compat_ioctl handler.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-16 18:42:59 -07:00
Christoph Hellwig
314e01a6d7 xfs: fall back to native ioctls for unhandled compat ones
Always try the native ioctl if we don't have a compat handler.  This
removes a lot of boilerplate code as 'modern' ioctls should generally
be compat clean, and fixes the missing entries for the recently added
FS_IOC_GETFSLABEL/FS_IOC_SETFSLABEL ioctls.

Fixes: f7664b3197 ("xfs: implement online get/set fs label")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-16 18:42:59 -07:00
Darrick J. Wong
8612de3f7b xfs: don't crash on null attr fork xfs_bmapi_read
Zorro Lang reported a crash in generic/475 if we try to inactivate a
corrupt inode with a NULL attr fork (stack trace shortened somewhat):

RIP: 0010:xfs_bmapi_read+0x311/0xb00 [xfs]
RSP: 0018:ffff888047f9ed68 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff888047f9f038 RCX: 1ffffffff5f99f51
RDX: 0000000000000002 RSI: 0000000000000008 RDI: 0000000000000012
RBP: ffff888002a41f00 R08: ffffed10005483f0 R09: ffffed10005483ef
R10: ffffed10005483ef R11: ffff888002a41f7f R12: 0000000000000004
R13: ffffe8fff53b5768 R14: 0000000000000005 R15: 0000000000000001
FS:  00007f11d44b5b80(0000) GS:ffff888114200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000ef6000 CR3: 000000002e176003 CR4: 00000000001606e0
Call Trace:
 xfs_dabuf_map.constprop.18+0x696/0xe50 [xfs]
 xfs_da_read_buf+0xf5/0x2c0 [xfs]
 xfs_da3_node_read+0x1d/0x230 [xfs]
 xfs_attr_inactive+0x3cc/0x5e0 [xfs]
 xfs_inactive+0x4c8/0x5b0 [xfs]
 xfs_fs_destroy_inode+0x31b/0x8e0 [xfs]
 destroy_inode+0xbc/0x190
 xfs_bulkstat_one_int+0xa8c/0x1200 [xfs]
 xfs_bulkstat_one+0x16/0x20 [xfs]
 xfs_bulkstat+0x6fa/0xf20 [xfs]
 xfs_ioc_bulkstat+0x182/0x2b0 [xfs]
 xfs_file_ioctl+0xee0/0x12a0 [xfs]
 do_vfs_ioctl+0x193/0x1000
 ksys_ioctl+0x60/0x90
 __x64_sys_ioctl+0x6f/0xb0
 do_syscall_64+0x9f/0x4d0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f11d39a3e5b

The "obvious" cause is that the attr ifork is null despite the inode
claiming an attr fork having at least one extent, but it's not so
obvious why we ended up with an inode in that state.

Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204031
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2019-08-12 09:32:44 -07:00
Darrick J. Wong
858b44dc62 xfs: remove more ondisk directory corruption asserts
Continue our game of replacing ASSERTs for corrupt ondisk metadata with
EFSCORRUPTED returns.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2019-08-12 09:32:44 -07:00
Tetsuo Handa
294fc7a4c8 fs: xfs: xfs_log: Don't use KM_MAYFAIL at xfs_log_reserve().
When the system is close-to-OOM, fsync() may fail due to -ENOMEM because
xfs_log_reserve() is using KM_MAYFAIL. It is a bad thing to fail writeback
operation due to user-triggerable OOM condition. Since we are not using
KM_MAYFAIL at xfs_trans_alloc() before calling xfs_log_reserve(), let's
use the same flags at xfs_log_reserve().

  oom-torture: page allocation failure: order:0, mode:0x46c40(GFP_NOFS|__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_COMP), nodemask=(null)
  CPU: 7 PID: 1662 Comm: oom-torture Kdump: loaded Not tainted 5.3.0-rc2+ #925
  Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
  Call Trace:
   dump_stack+0x67/0x95
   warn_alloc+0xa9/0x140
   __alloc_pages_slowpath+0x9a8/0xbce
   __alloc_pages_nodemask+0x372/0x3b0
   alloc_slab_page+0x3a/0x8d0
   new_slab+0x330/0x420
   ___slab_alloc.constprop.94+0x879/0xb00
   __slab_alloc.isra.89.constprop.93+0x43/0x6f
   kmem_cache_alloc+0x331/0x390
   kmem_zone_alloc+0x9f/0x110 [xfs]
   kmem_zone_alloc+0x9f/0x110 [xfs]
   xlog_ticket_alloc+0x33/0xd0 [xfs]
   xfs_log_reserve+0xb4/0x410 [xfs]
   xfs_trans_reserve+0x1d1/0x2b0 [xfs]
   xfs_trans_alloc+0xc9/0x250 [xfs]
   xfs_setfilesize_trans_alloc.isra.27+0x44/0xc0 [xfs]
   xfs_submit_ioend.isra.28+0xa5/0x180 [xfs]
   xfs_vm_writepages+0x76/0xa0 [xfs]
   do_writepages+0x17/0x80
   __filemap_fdatawrite_range+0xc1/0xf0
   file_write_and_wait_range+0x53/0xa0
   xfs_file_fsync+0x87/0x290 [xfs]
   vfs_fsync_range+0x37/0x80
   do_fsync+0x38/0x60
   __x64_sys_fsync+0xf/0x20
   do_syscall_64+0x4a/0x1c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: eb01c9cd87 ("[XFS] Remove the xlog_ticket allocator")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-03 09:36:43 -07:00
Jia-Ju Bai
afa1d96d14 xfs: Fix possible null-pointer dereferences in xchk_da_btree_block_check_sibling()
In xchk_da_btree_block_check_sibling(), there is an if statement on
line 274 to check whether ds->state->altpath.blk[level].bp is NULL:
    if (ds->state->altpath.blk[level].bp)

When ds->state->altpath.blk[level].bp is NULL, it is used on line 281:
    xfs_trans_brelse(..., ds->state->altpath.blk[level].bp);
        struct xfs_buf_log_item *bip = bp->b_log_item;
        ASSERT(bp->b_transp == tp);

Thus, possible null-pointer dereferences may occur.

To fix these bugs, ds->state->altpath.blk[level].bp is checked before
being used.

These bugs are found by a static analysis tool STCheck written by us.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-07-30 11:28:20 -07:00
Darrick J. Wong
2e616d9f9c xfs: fix stack contents leakage in the v1 inumber ioctls
Explicitly initialize the onstack structures to zero so we don't leak
kernel memory into userspace when converting the in-core inumbers
structure to the v1 inogrp ioctl structure.  Add a comment about why we
have to use memset to ensure that the padding holes in the structures
are set to zero.

Fixes: 5f19c7fc68 ("xfs: introduce v5 inode group structure")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2019-07-28 21:12:32 -07:00
Linus Torvalds
366a4e38b8 Also new for 5.3:
- Bring fs/xfs/libxfs/xfs_trans_inode.c in sync with userspace libxfs.
 - Convert the xfs administrator guide to rst and move it into the
   official admin guide under Documentation
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0t6wIACgkQ+H93GTRK
 tOuCqQ//TN2065e+0avyxuwNgUQIS73ujC/R8KivO3f9PlhMzjiMhgCBiIurGsi5
 8rInaCDr4L0DRL67W+3d72iWAc1eSP+l/zp2vEMUs/2YgDTGli5JfoWR0JeuKaiD
 7ymSxPvShfrwDwidMn6S7tXvcEjUEJrms4jeuERBrHwZMOvxzMiM3910oG12AfbM
 Fipvn9eDTmaInHX9T9nuBhvK077/1bXF1J2r/4XieOuhPnNjV5mLH3fJnPswu3uL
 RCSS1WYuJMBV/EQbLCgiM47dgVAtig8QSacqfpGtAZrDLAXy/Y6T+AwvclPFTE5c
 HlHLHf1+zI6FB8t2BNtnapMrRm1wmhCCHEEJLa+iHENY+J113R0I7c6Jscs6tnJd
 oW253Mc9juiB3bnC3iIAgcXrny68qPPpwozzYktQ9XsPxz5XEDEsWBpQS9rzoJRk
 u1zaU3VUxAO2sTJpKE4OXs5DLWvM5C5Yz8jbPXX3U1GBtNzk8EM9SmI0nA1+OVCA
 B4kbN6jKWMsG13bIIb16TbfkrhAucK/09Pbp2/4spayf6PYFci6E6Bg5NuVp2iaH
 n7oIk5WaobFPDbtdIaG27awAX0UCgItFrPMqLN+bBqeiYhNPuxc+ycmyb16VlRWf
 ob0OzWm6+nRJCNGQ1yBd3flILAPxV/Fa7Fegtf/HIDKCLfDW4xM=
 =+DS3
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.3-merge-13' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs cleanups from Darrick Wong:
 "We had a few more lateish cleanup patches come in for 5.3 -- a couple
  of syncups with the userspace libxfs code and a conversion of the XFS
  administrator's guide to ReST format.

  Summary:

   - Bring fs/xfs/libxfs/xfs_trans_inode.c in sync with userspace
     libxfs.

   - Convert the xfs administrator guide to rst and move it into the
     official admin guide under Documentation"

* tag 'xfs-5.3-merge-13' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  Documentation: filesystem: Convert xfs.txt to ReST
  xfs: sync up xfs_trans_inode with userspace
  xfs: move xfs_trans_inode.c to libxfs/
2019-07-18 11:18:00 -07:00
Linus Torvalds
f8c3500cd1 - virtio_pmem: The new virtio_pmem facility introduces a paravirtualized
persistent memory device that allows a guest VM to use DAX mechanisms to
   access a host-file with host-page-cache. It arranges for MAP_SYNC to
   be disabled and instead triggers a host fsync() when a 'write-cache
   flush' command is sent to the virtual disk device.
 
 - Miscellaneous small fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdMHwpAAoJEB7SkWpmfYgCUYoP/3vcgYBAaXNksyALF0iowPoP
 z4J0KoaOA1CzRFEQtCWUQa84CWj+XoSewwSeyrIkqKQvx/gghXblK+GVjVzBn0BD
 hmmiKr8af4DdxfzYdEXJp65cCpIiVMaJiGr20Aj9ObwvWJb4QZbz9q7hnPt6KgiI
 jVND3BpP3OERb4ZFcibdmJT5foKooMcXVG6+luVe+hc1+ZZQxJBsBaqie4brQIFq
 j59NX3HfHH2fr1vVwnVH0CO4tgbgYg9wZ2EivGu6wBWvORjrr7KiSSbOYP68EBtd
 lUoNps+vQtGnfXGwNzAjp1wuknrQYYh4/KMKjep7hiZD39rgyvBpbHbyynKzQCWV
 REe8cXr/nwphsENvBAUBiqY999EWVIxdT2iaVaSA6K/31JQAC5AFyxVK/P2Ke1SK
 rvePZ++iLQ1o4phTxQPNlVUqF9jOrFVVICGwMDqaqSkOsD9YKQdFClfOF/1ntlDz
 V0bs+Y0Pe8AJCd9ESep4X+vHAWRRIb4EQIuwLaX8RJoY+r1fGye9RPthpYYzvXKp
 DI2iJztFO3anzj2i9htNPUFIaiUmIhzEvG32O2If2yc5FL02hMpHPoFx6vHhe6s3
 f8OJ+olsJK+/IIrV8+DHqYvhzylOYIhmRTvIxIxaNDPHkhR1i2RDQ6KKK1YZmsr8
 MjAZ+Ym0GadDivs+wcM6
 =uAMG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "Primarily just the virtio_pmem driver:

   - virtio_pmem

     The new virtio_pmem facility introduces a paravirtualized
     persistent memory device that allows a guest VM to use DAX
     mechanisms to access a host-file with host-page-cache. It arranges
     for MAP_SYNC to be disabled and instead triggers a host fsync()
     when a 'write-cache flush' command is sent to the virtual disk
     device.

   - Miscellaneous small fixups"

* tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  virtio_pmem: fix sparse warning
  xfs: disable map_sync for async flush
  ext4: disable map_sync for async flush
  dax: check synchronous mapping is supported
  dm: enable synchronous dax
  libnvdimm: add dax_dev sync flag
  virtio-pmem: Add virtio pmem driver
  libnvdimm: nd_region flush callback support
  libnvdimm, namespace: Drop uuid_t implementation detail
2019-07-18 10:52:08 -07:00
Linus Torvalds
9637d51734 for-linus-20190715
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0s1ZEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiCEEACE9H/pXoegTTWIVPVajMlsa19UHIeilk4N
 GI7oKSiirQEMZnAOmrEzgB4/0zyYQsVypys0gZlYUD3GJVsXDT3zzjNXL5NpVg/O
 nqwSGWMHBSjWkLbaM40Pb2QLXsYgveptNL+9PtxrgtoYPoT5/+TyrJMFrRfi72EK
 WFeNDKOu6aJxpJ26JSsckJ0gluKeeEpRoEqsgHGIwaMIGHQf+b+ikk7tel5FAIgA
 uDwwD+Oxsdgh/ChsXL0d90GkcbcSp6GQ7GybxVmw/tPijx6mpeIY72xY3Zx+t8zF
 b71UNk6NmCKjOPO/6fiuYKKTYw+KhzlyEKO0j675HKfx2AhchEwKw0irp4yUlydA
 zxWYmz4U7iRgktJtymv3J4FEQQ3S6d1EnuQkQNX1LwiOsEsfzhkWi+7jy7KFhZoJ
 AqtYzqnOXvLx92q0vloj06HtK6zo+I/MINldy0+qn9lq0N0VF+dctyztAHLsF7P6
 pUtS6i7l1JSFKAmMhC31sIj5TImaehM2e/TWMUPEDZaO96oKCmQwOF1oiloc6vlW
 h4xWsxP/9zOFcWNyPzy6Vo3JUXWRvFA7K+jV3Hsukw6rVHiNCGVYGSlTv8Roi5b7
 I4ggu9R2JOGyku7UIlL50IRxEyjAp11LaO8yHhcCnRB65rmyBuNMQNcfOsfxpZ5Y
 1mtSNhm5TQ==
 =g8xI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "A later pull request with some followup items. I had some vacation
  coming up to the merge window, so certain things items were delayed a
  bit. This pull request also contains fixes that came in within the
  last few days of the merge window, which I didn't want to push right
  before sending you a pull request.

  This contains:

   - NVMe pull request, mostly fixes, but also a few minor items on the
     feature side that were timing constrained (Christoph et al)

   - Report zones fixes (Damien)

   - Removal of dead code (Damien)

   - Turn on cgroup psi memstall (Josef)

   - block cgroup MAINTAINERS entry (Konstantin)

   - Flush init fix (Josef)

   - blk-throttle low iops timing fix (Konstantin)

   - nbd resize fixes (Mike)

   - nbd 0 blocksize crash fix (Xiubo)

   - block integrity error leak fix (Wenwen)

   - blk-cgroup writeback and priority inheritance fixes (Tejun)"

* tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
  MAINTAINERS: add entry for block io cgroup
  null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
  block: Limit zone array allocation size
  sd_zbc: Fix report zones buffer allocation
  block: Kill gfp_t argument of blkdev_report_zones()
  block: Allow mapping of vmalloc-ed buffers
  block/bio-integrity: fix a memory leak bug
  nvme: fix NULL deref for fabrics options
  nbd: add netlink reconfigure resize support
  nbd: fix crash when the blksize is zero
  block: Disable write plugging for zoned block devices
  block: Fix elevator name declaration
  block: Remove unused definitions
  nvme: fix regression upon hot device removal and insertion
  blk-throttle: fix zero wait time for iops throttled group
  block: Fix potential overflow in blk_report_zones()
  blkcg: implement REQ_CGROUP_PUNT
  blkcg, writeback: Implement wbc_blkcg_css()
  blkcg, writeback: Add wbc->no_cgroup_owner
  blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
  ...
2019-07-15 21:20:52 -07:00
Eric Sandeen
79ba2a2185 xfs: sync up xfs_trans_inode with userspace
Add an XFS_ICHGTIME_CREATE case to xfs_trans_ichgtime() to keep in
sync with userspace.  (Currently no kernel caller sends this flag.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-07-15 08:10:34 -07:00
Eric Sandeen
3f6d70e885 xfs: move xfs_trans_inode.c to libxfs/
Userspace now has an identical xfs_trans_inode.c which it has already
moved to libxfs/ so do the same move for kernelspace.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-07-15 08:10:18 -07:00
Linus Torvalds
4ce9d181eb New stuff for 5.3:
- Refactor inode geometry calculation into a single structure instead of
   open-coding pieces everywhere.
 - Add online repair to build options.
 - Remove unnecessary function call flags and functions.
 - Claim maintainership of various loose xfs documentation and header
   files.
 - Use struct bio directly for log buffer IOs instead of struct xfs_buf.
 - Reduce log item boilerplate code requirements.
 - Merge log item code spread across too many files.
 - Further distinguish between log item commits and cancellations.
 - Various small cleanups to the ag small allocator.
 - Support cgroup-aware writeback
 - libxfs refactoring for mkfs cleanup
 - Remove unneeded #includes
 - Fix a memory allocation miscalculation in the new log bio code
 - Fix bisection problems
 - Fix a crash in ioend processing caused by tripping over freeing of
   preallocated transactions
 - Split out a generic inode walk mechanism from the bulkstat code, hook
   up all the internal users to use the walking code, then clean up
   bulkstat to serve only the bulkstat ioctls.
 - Add a multithreaded iwalk implementation to speed up quotacheck on
   fast storage with many CPUs.
 - Remove unnecessary return values in logging teardown functions.
 - Supplement the bstat and inogrp structures with new bulkstat and
   inumbers structures that have all the fields we need for v5
   filesystem features and none of the padding problems of their
   predecessors.
 - Wire up new ioctls that use the new structures with a much simpler
   bulk_ireq structure at the head instead of the pointerhappy mess we
   had before.
 - Enable userspace to constrain bulkstat returns to a single AG or a
   single special inode so that we can phase out a lot of geometry
   guesswork in userspace.
 - Reduce memory consumption and zeroing overhead in extended attribute
   scrub code.
 - Fix some behavioral regressions in the new bulkstat backend code.
 - Fix some behavioral regressions in the new log bio code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0mGqwACgkQ+H93GTRK
 tOvUGA//d9+SVDaisbFzwxQVHX2/MPJExKA56kFE4PvSQhIpvPZeGNXUXCZfQXGt
 Hkwm/uEVUdndH9ERo8Z4L2nmT1WQCt3ONUihHn/aE8NSh/g1Bb1J+/t2Wr+Cdqic
 sgU5imhEoMTh0cknoFcGg9PdAJ7ZOv4mr2wzDrj3odq0etP63U+zabxxlX2Lg0VW
 3EXu6YhY+bmq54jdyc4xIL930opfYC64FDVmG7iNK4xlX1M2wJjdEGEP5fEchxf8
 XgsWgJqxRyxJWVH3KTbpXREQkrbj+uDM8MqqFF9uYzjUt2UHc9H206BxJCUfWo7/
 HGhTQmWqXmYMlA+ngEhJ3lcwsgAmHnnLL4YczbprgrQcwIbxMkpw2leEWUGOiLhH
 HdNCnJNVgPcAtmjjDxENqGUR19qslrexgSH2eUrFZFM15/3GHMYT3wdEHyqiZDya
 LiTHllcLDvauUmv8df51jJ+l5dd2Zj9c5iLQUOV5nRGVUj7Fyi7kAuFL+4SVcIbI
 VJVFgtHKsSwKYRPMqb7iTtDnTHkwZNTtQQ1Ptpe+TUH5ngTmqNn5EQVeoUtG6HUi
 0xEaA2MrhoA3VkHk1zFfzStlR8g1YsX5ajBq7luotYY+i3cC4oHLlrvIx9zI7elI
 WMc6Wnn7ozetwehYROVXvgsV+wiqC4D7yWGsnFyhqaS1QakwIM0=
 =VEdn
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.3-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "In this release there are a significant amounts of consolidations and
  cleanups in the log code; restructuring of the log to issue struct
  bios directly; new bulkstat ioctls to return v5 fs inode information
  (and fix all the padding problems of the old ioctl); the beginnings of
  multithreaded inode walks (e.g. quotacheck); and a reduction in memory
  usage in the online scrub code leading to reduced runtimes.

   - Refactor inode geometry calculation into a single structure instead
     of open-coding pieces everywhere.

   - Add online repair to build options.

   - Remove unnecessary function call flags and functions.

   - Claim maintainership of various loose xfs documentation and header
     files.

   - Use struct bio directly for log buffer IOs instead of struct
     xfs_buf.

   - Reduce log item boilerplate code requirements.

   - Merge log item code spread across too many files.

   - Further distinguish between log item commits and cancellations.

   - Various small cleanups to the ag small allocator.

   - Support cgroup-aware writeback

   - libxfs refactoring for mkfs cleanup

   - Remove unneeded #includes

   - Fix a memory allocation miscalculation in the new log bio code

   - Fix bisection problems

   - Fix a crash in ioend processing caused by tripping over freeing of
     preallocated transactions

   - Split out a generic inode walk mechanism from the bulkstat code,
     hook up all the internal users to use the walking code, then clean
     up bulkstat to serve only the bulkstat ioctls.

   - Add a multithreaded iwalk implementation to speed up quotacheck on
     fast storage with many CPUs.

   - Remove unnecessary return values in logging teardown functions.

   - Supplement the bstat and inogrp structures with new bulkstat and
     inumbers structures that have all the fields we need for v5
     filesystem features and none of the padding problems of their
     predecessors.

   - Wire up new ioctls that use the new structures with a much simpler
     bulk_ireq structure at the head instead of the pointerhappy mess we
     had before.

   - Enable userspace to constrain bulkstat returns to a single AG or a
     single special inode so that we can phase out a lot of geometry
     guesswork in userspace.

   - Reduce memory consumption and zeroing overhead in extended
     attribute scrub code.

   - Fix some behavioral regressions in the new bulkstat backend code.

   - Fix some behavioral regressions in the new log bio code"

* tag 'xfs-5.3-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (100 commits)
  xfs: chain bios the right way around in xfs_rw_bdev
  xfs: bump INUMBERS cursor correctly in xfs_inumbers_walk
  xfs: don't update lastino for FSBULKSTAT_SINGLE
  xfs: online scrub needn't bother zeroing its temporary buffer
  xfs: only allocate memory for scrubbing attributes when we need it
  xfs: refactor attr scrub memory allocation function
  xfs: refactor extended attribute buffer pointer functions
  xfs: attribute scrub should use seen_enough to pass error values
  xfs: allow single bulkstat of special inodes
  xfs: specify AG in bulk req
  xfs: wire up the v5 inumbers ioctl
  xfs: wire up new v5 bulkstat ioctls
  xfs: introduce v5 inode group structure
  xfs: introduce new v5 bulkstat structure
  xfs: rename bulkstat functions
  xfs: remove various bulk request typedef usage
  fs: xfs: xfs_log: Change return type from int to void
  xfs: poll waiting for quotacheck
  xfs: multithreaded iwalk implementation
  xfs: refactor INUMBERS to use iwalk functions
  ...
2019-07-12 17:17:51 -07:00
Linus Torvalds
5010fe9f09 New for 5.3:
- Standardize parameter checking for the SETFLAGS and FSSETXATTR ioctls
   (which were the file attribute setters for ext4 and xfs and have now
   been hoisted to the vfs)
 - Only allow the DAX flag to be set on files and directories.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0aJgMACgkQ+H93GTRK
 tOuKkg//SJaxcB63uVPZk9hDraYTmyo9OXRRX6X9WwDKPTWwa88CUwS1ny1QF7Mt
 zMkgzG2/y2Rs9PQ0ARoPbh1hNb2CXnvA+xnzUEev1MW6UN/nTFMZEOPn2ZQ+DxQE
 gg/0U56kKgtjtXzBZVpTgHzSETivdXwHxFW3hiTtyRXg+4ulgDIZLOjN2wRB+Pdb
 X8ZmM6MqKOTbhQEXlw13TlCKBzoMjC1w4UU4rkZPjoSjAaUWiPfrk/XU7qgguf9p
 v1dbSN2dADQ19jzZ1dmggXnlJsRMZjk/ls5rxJlB5DHDbh6YgnA2TE+tYrtH28eB
 uyKfD+RQnMzRVdmH8PsMQRQQFXR2UYyprVP7a6wi6TkB+gytn7sR5uT4sbAhmhcF
 TiTYfYNRXzemHCewyOwOsUE/7oCeiJcdbqiPAHHD/jYLZfRjSXDcGzz3+7ZYZ3GO
 hRxUhpxHPbkmK4T2OxhzReCbRsLN/0BeEcDdLkNWmi2FTh3V1gYzMGkgI9wsVbsd
 pHjoGIHbMPWqktF/obuGq96WVfYBBaWJ6WNzQqKT4dQYAJBW2omxitXQHLpi6cjt
 hG5ncxa3cPpWx4t3Lx2hb0TPS7RyYvuoQIcS/Me2RWioxrwWrgnOqdHFfLEwWpfN
 jRowdWiGgOIsq8hMt7qycmGCXzbgsbaA/7oRqh8TiwM9taPOM4c=
 =uH2E
 -----END PGP SIGNATURE-----

Merge tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong:
 "Here's a patch series that sets up common parameter checking functions
  for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations.

  The goal here is to reduce the amount of behaviorial variance between
  the filesystems where those ioctls originated (ext2 and XFS,
  respectively) and everybody else.

   - Standardize parameter checking for the SETFLAGS and FSSETXATTR
     ioctls (which were the file attribute setters for ext4 and xfs and
     have now been hoisted to the vfs)

   - Only allow the DAX flag to be set on files and directories"

* tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  vfs: only allow FSSETXATTR to set DAX flag on files and dirs
  vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
  vfs: teach vfs_ioc_fssetxattr_check to check project id info
  vfs: create a generic checking function for FS_IOC_FSSETXATTR
  vfs: create a generic checking and prep function for FS_IOC_SETFLAGS
2019-07-12 16:54:37 -07:00
Linus Torvalds
40f06c7995 Changes to copy_file_range for 5.3 from Dave and Amir:
- Create a generic copy_file_range handler and make individual
   filesystems responsible for calling it (i.e. no more assuming that
   do_splice_direct will work or is appropriate)
 - Refactor copy_file_range and remap_range parameter checking where they
   are the same
 - Install missing copy_file_range parameter checking(!)
 - Remove suid/sgid and update mtime like any other file write
 - Change the behavior so that a copy range crossing the source file's
   eof will result in a short copy to the source file's eof instead of
   EINVAL
 - Permit filesystems to decide if they want to handle cross-superblock
   copy_file_range in their local handlers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0BGvAACgkQ+H93GTRK
 tOu2aw/+KGG7PiXm9ED3ZXUppKVddrZMOgqM7mSfHo6TBgW3pJUJcRIhawK0Wz/P
 stgTsOkurHSl3iT3vQyX4GTZvLoGN/rfsRLPxogJptBUqVv3BOrXsrI53f7V/kbm
 rtjlYsgExji7VBUiMTe5kOWWqxyR7B4nXyvY/8rier57rW/8C1I58B0OrxAmTK0k
 rz1e5BtE1dg91xA7cSdEc38FInz8MW8cvsrEzW9vyYY4IVE0PBuhhA1EvryxTrAZ
 hfthHFfzwxhJkI0mdha8uqNufNWrHLSqiwyjYC7pwAwSQzQPiQz9U17flu+URnfF
 kXaR5LdXbBP3pl46RdthrfuonWsv612cC1Qwfjs8PBG9lG7b9PGJ40MGVTiw7LlQ
 924/03ho0zAnV0E8Qn5O9nPshQNDJhwhzMS39EmMyFKb1D5XGzdMV0gDdIfx6hdO
 HDbw6VQ33S59gvk7v/gxsFB5Bs4PKfamHx/QmwQwpqWM5XExcr0yJ90OTBtAuY4r
 S+9gwG6uED3aPh8HbQ5UgnA8bZmMmi8AkcBvqJ9GgNw5SbZl0oyv9Sj6JNpoOejV
 8y9JkhoZUxqiihnKTw/vtMrj5RCOfifNBjMSwrShfLdLKtK0AZl1mXC0/1Q3VnEQ
 TUcyRHEzrtHgJ9/AK9xIyDNvNYzvHSLZj7maoZZumgQa2FOFrmw=
 =qM44
 -----END PGP SIGNATURE-----

Merge tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull copy_file_range updates from Darrick Wong:
 "This fixes numerous parameter checking problems and inconsistent
  behaviors in the new(ish) copy_file_range system call.

  Now the system call will actually check its range parameters
  correctly; refuse to copy into files for which the caller does not
  have sufficient privileges; update mtime and strip setuid like file
  writes are supposed to do; and allows copying up to the EOF of the
  source file instead of failing the call like we used to.

  Summary:

   - Create a generic copy_file_range handler and make individual
     filesystems responsible for calling it (i.e. no more assuming that
     do_splice_direct will work or is appropriate)

   - Refactor copy_file_range and remap_range parameter checking where
     they are the same

   - Install missing copy_file_range parameter checking(!)

   - Remove suid/sgid and update mtime like any other file write

   - Change the behavior so that a copy range crossing the source file's
     eof will result in a short copy to the source file's eof instead of
     EINVAL

   - Permit filesystems to decide if they want to handle
     cross-superblock copy_file_range in their local handlers"

* tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fuse: copy_file_range needs to strip setuid bits and update timestamps
  vfs: allow copy_file_range to copy across devices
  xfs: use file_modified() helper
  vfs: introduce file_modified() helper
  vfs: add missing checks to copy_file_range
  vfs: remove redundant checks from generic_remap_checks()
  vfs: introduce generic_file_rw_checks()
  vfs: no fallback for ->copy_file_range
  vfs: introduce generic_copy_file_range()
2019-07-10 20:32:37 -07:00
Christoph Hellwig
488ca3d8d0 xfs: chain bios the right way around in xfs_rw_bdev
We need to chain the earlier bios to the later ones, so that
submit_bio_wait waits on the bio that all the completions are
dispatched to.

Fixes: 6ad5b3255b ("xfs: use bios directly to read and write the log recovery buffers")
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-07-10 10:04:16 -07:00
Darrick J. Wong
0df5c39b3e xfs: bump INUMBERS cursor correctly in xfs_inumbers_walk
There's a subtle unit conversion error when we increment the INUMBERS
cursor at the end of xfs_inumbers_walk.  If there's an inode chunk at
the very end of the AG /and/ the AG size is a perfect power of two, the
startino of that last chunk (which is in units of AG inodes) will be 63
less than (1 << agino_log).  If we add XFS_INODES_PER_CHUNK to the
startino, we end up with a startino that's larger than (1 << agino_log)
and when we convert that back to fs inode units we'll rip off that upper
bit and wind up back at the start of the AG.

Fix this by converting to units of fs inodes before adding
XFS_INODES_PER_CHUNK so that we'll harmlessly end up pointing to the
next AG.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-09 08:06:20 -07:00
Darrick J. Wong
211bbf3c38 xfs: don't update lastino for FSBULKSTAT_SINGLE
The kernel test robot found a regression of xfs/054 in the conversion of
bulkstat to use the new iwalk infrastructure -- if a caller set *lastip
= 128 and invoked FSBULKSTAT_SINGLE, the bstat info would be for inode
128, but *lastip would be increased by the kernel to 129.

FSBULKSTAT_SINGLE never incremented lastip before, so it's incorrect to
make such an update to the internal lastino value now.

Fixes: 2810bd6840 ("xfs: convert bulkstat to new iwalk infrastructure")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-06 21:05:10 -07:00
Pankaj Gupta
b21fec4140 xfs: disable map_sync for async flush
Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and xfs.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-05 15:19:10 -07:00
Darrick J. Wong
036f463fe1 xfs: online scrub needn't bother zeroing its temporary buffer
The xattr scrubber functions use the temporary memory buffer either for
storing bitmaps or for testing if attribute value extraction works.  The
bitmap code always zeroes what it needs and the value extraction sets
the buffer contents, so it's not necessary to waste CPU time zeroing on
allocation.

Note that while we never read the contents that the attr value
extraction function sets, we do need to call it to check the remote
attribute header and CRCs to check for corruption.

A flame graph analysis showed that we were spending 7% of a xfs_scrub
run (the whole program, not just the attr scrubber itself) allocating
and zeroing 64k segments needlessly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:56 -07:00
Darrick J. Wong
6d6ccedd76 xfs: only allocate memory for scrubbing attributes when we need it
In examining a flame graph of time spent running xfs_scrub on various
filesystems, I noticed that we spent nearly 7% of the total runtime on
allocating a zeroed 65k buffer for every SCRUB_TYPE_XATTR invocation.
We do this even if none of the attribute values were anywhere near 64k
in size, even if there were no attribute blocks to check space on, and
even if it just turns out there are no attributes at all.

Therefore, rearrange the xattr buffer setup code to support reallocating
with a bigger buffer and redistribute the callers of that function so
that we only allocate memory just prior to needing it, and only allocate
as much as we need.  If we can't get memory with the ILOCK held we'll
bail out with EDEADLOCK which will allocate the maximum memory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:56 -07:00
Darrick J. Wong
0081675933 xfs: refactor attr scrub memory allocation function
Move the code that allocates memory buffers for the extended attribute
scrub code into a separate function so we can reduce memory allocations
in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:55 -07:00
Darrick J. Wong
3addd24880 xfs: refactor extended attribute buffer pointer functions
Replace the open-coded attribute buffer pointer calculations with helper
functions to make it more obvious what we're doing with our freeform
memory allocation w.r.t. either storing xattr values or computing btree
block free space.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:55 -07:00
Darrick J. Wong
2c3b83d7ca xfs: attribute scrub should use seen_enough to pass error values
When we're iterating all the attributes using the built-in xattr
iterator, we can use the seen_enough variable to pass error codes back
to the main scrub function instead of flattening them into 0/1.  This
will be used in a more exciting fashion in upcoming patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:54 -07:00
Darrick J. Wong
bf3cb39447 xfs: allow single bulkstat of special inodes
Create a new bulk ireq flag that enables userspace to ask us for a
special inode number instead of interpreting @ino as a literal inode
number.  This enables us to query the root inode easily.

The reason for adding the ability to query specifically the root
directory inode is that certain programs (xfsdump and xfsrestore) want
to confirm when they've been pointed to the root directory.  The
userspace code assumes the root directory is always the first result
from calling bulkstat with lastino == 0, but this isn't true if the
(initial btree roots + initial AGFL + inode alignment padding) is itself
long enough to be allocated to new inodes if all of those blocks should
happen to be free at the same time.  Rather than make userspace guess
at internal filesystem state, we provide a direct query.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-04 07:52:24 -07:00
Darrick J. Wong
13d59a2a61 xfs: specify AG in bulk req
Add a new xfs_bulk_ireq flag to constrain the iteration to a single AG.
If the passed-in startino value is zero then we start with the first
inode in the AG that the user passes in; otherwise, we iterate only
within the same AG as the passed-in inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-04 07:52:23 -07:00
Darrick J. Wong
fba9760a43 xfs: wire up the v5 inumbers ioctl
Wire up the v5 INUMBERS ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:28 -07:00
Darrick J. Wong
0448b6f488 xfs: wire up new v5 bulkstat ioctls
Wire up the new v5 BULKSTAT ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:27 -07:00
Darrick J. Wong
5f19c7fc68 xfs: introduce v5 inode group structure
Introduce a new "v5" inode group structure that fixes the alignment
and padding problems of the existing structure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:27 -07:00
Darrick J. Wong
7035f9724f xfs: introduce new v5 bulkstat structure
Introduce a new version of the in-core bulkstat structure that supports
our new v5 format features.  This structure also fills the gaps in the
previous structure.  We leave wiring up the ioctls for the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:26 -07:00
Darrick J. Wong
8bfe9d1810 xfs: rename bulkstat functions
Rename the bulkstat functions to 'fsbulkstat' so that they match the
ioctl names.  We will be introducing a new set of bulkstat/inumbers
ioctls soon, and it will be important to keep the names straight.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:26 -07:00
Darrick J. Wong
6f71fb6838 xfs: remove various bulk request typedef usage
Remove xfs_bstat_t, xfs_fsop_bulkreq_t, xfs_inogrp_t, and similarly
named compat typedefs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:25 -07:00
Hariprasad Kelam
a7a9250e18 fs: xfs: xfs_log: Change return type from int to void
Change return types of below functions as they never fails
xfs_log_mount_cancel
xlog_recover_cancel
xlog_recover_cancel_intents

fix below issue reported by coccicheck
fs/xfs/xfs_log_recover.c:4886:7-12: Unneeded variable: "error". Return
"0" on line 4926

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-07-03 08:21:58 -07:00
Darrick J. Wong
3e5a428b26 xfs: poll waiting for quotacheck
Create a pwork destroy function that uses polling instead of
uninterruptible sleep to wait for work items to finish so that we can
touch the softlockup watchdog.  IOWs, gross hack.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 08:21:58 -07:00
Darrick J. Wong
40786717c8 xfs: multithreaded iwalk implementation
Create a parallel iwalk implementation and switch quotacheck to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 07:33:26 -07:00
Darrick J. Wong
677717fbd4 xfs: refactor INUMBERS to use iwalk functions
Now that we have generic functions to walk inode records, refactor the
INUMBERS implementation to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:06 -07:00
Darrick J. Wong
04b8fba2e1 xfs: refactor iwalk code to handle walking inobt records
Refactor xfs_iwalk_ag_start and xfs_iwalk_ag so that the bits that are
particular to bulkstat (trimming the start irec, starting inode
readahead, and skipping empty groups) can be controlled via flags in the
iwag structure.

This enables us to add a new function to walk all inobt records which
will be used for the new INUMBERS implementation in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:06 -07:00
Darrick J. Wong
2b5eb82601 xfs: refactor xfs_iwalk_grab_ichunk
In preparation for reusing the iwalk code for the inogrp walking code
(aka INUMBERS), move the initial inobt lookup and retrieval code out of
xfs_iwalk_grab_ichunk so that we call the masking code only when we need
to trim out the inodes that came before the cursor in the inobt record
(aka BULKSTAT).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:06 -07:00
Darrick J. Wong
688f7c3678 xfs: clean up long conditionals in xfs_iwalk_ichunk_ra
Refactor xfs_iwalk_ichunk_ra to avoid long conditionals.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:06 -07:00
Darrick J. Wong
5e29f3b720 xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino
Now that the inode chunk grabbing function is a static function in the
iwalk code, change its behavior so that @agino is the inode where we
want to /start/ the iteration.  This reduces cognitive friction with the
callers and simplifes the code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong
da1d9e5912 xfs: move bulkstat ichunk helpers to iwalk code
Now that we've reworked the bulkstat code to use iwalk, we can move the
old bulkstat ichunk helpers to xfs_iwalk.c.  No functional changes here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong
938c710d99 xfs: calculate inode walk prefetch more carefully
The existing inode walk prefetch is based on the old bulkstat code,
which simply allocated 4 pages worth of memory and prefetched that many
inobt records, regardless of however many inodes the caller requested.
65536 inodes is a lot to prefetch (~32M on x64, ~512M on arm64) so let's
scale things down a little more intelligently based on the number of
inodes requested, etc.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong
2810bd6840 xfs: convert bulkstat to new iwalk infrastructure
Create a new ibulk structure incore to help us deal with bulk inode stat
state tracking and then convert the bulkstat code to use the new iwalk
iterator.  This disentangles inode walking from bulk stat control for
simpler code and enables us to isolate the formatter functions to the
ioctl handling code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong
f16fe3ecde xfs: bulkstat should copy lastip whenever userspace supplies one
When userspace passes in a @lastip pointer we should copy the results
back, even if the @ocount pointer is NULL.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong
ebd126a651 xfs: convert quotacheck to use the new iwalk functions
Convert quotacheck to use the new iwalk iterator to dig through the
inodes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong
a211432c27 xfs: create simplified inode walk function
Create a new iterator function to simplify walking inodes in an XFS
filesystem.  This new iterator will replace the existing open-coded
walking that goes on in various places.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong
5bb46e3e18 xfs: create iterator error codes
Currently, xfs doesn't have generic error codes defined for "stop
iterating"; we just reuse the XFS_BTREE_QUERY_* return values.  This
looks a little weird if we're not actually iterating a btree index.
Before we start adding more iterators, we should create general
XFS_ITER_{CONTINUE,ABORT} return values and define the XFS_BTREE_QUERY_*
ones from that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong
ca29be7534 vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
Move the extent size hint checks that aren't xfs-specific to the vfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-07-01 08:25:36 -07:00
Darrick J. Wong
f991492ed1 vfs: teach vfs_ioc_fssetxattr_check to check project id info
Standardize the project id checks for FSSETXATTR.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-07-01 08:25:35 -07:00
Darrick J. Wong
7b0e492e6b vfs: create a generic checking function for FS_IOC_FSSETXATTR
Create a generic checking function for the incoming FS_IOC_FSSETXATTR
fsxattr values so that we can standardize some of the implementation
behaviors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-07-01 08:25:35 -07:00
Ming Lei
79d08f89bb block: fix .bi_size overflow
'bio->bi_iter.bi_size' is 'unsigned int', which at most hold 4G - 1
bytes.

Before 07173c3ec2 ("block: enable multipage bvecs"), one bio can
include very limited pages, and usually at most 256, so the fs bio
size won't be bigger than 1M bytes most of times.

Since we support multi-page bvec, in theory one fs bio really can
be added > 1M pages, especially in case of hugepage, or big writeback
with too many dirty pages. Then there is chance in which .bi_size
is overflowed.

Fixes this issue by using bio_full() to check if the added segment may
overflow .bi_size.

Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 07173c3ec2 ("block: enable multipage bvecs")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-01 08:18:54 -06:00
Christoph Hellwig
73d30d4874 xfs: remove XFS_TRANS_NOFS
Instead of a magic flag for xfs_trans_alloc, just ensure all callers
that can't relclaim through the file system use memalloc_nofs_save to
set the per-task nofs flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-30 09:05:17 -07:00
Christoph Hellwig
fe64e0d26b xfs: simplify xfs_ioend_can_merge
Compare the block layer status directly instead of converting it to
an errno first.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-30 09:05:17 -07:00
Christoph Hellwig
7dbae9fbde xfs: allow merging ioends over append boundaries
There is no real problem merging ioends that go beyond i_size into an
ioend that doesn't.  We just need to move the append transaction to the
base ioend.  Also use the opportunity to use a real error code instead
of the magic 1 to cancel the transactions, and write a comment
explaining the scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-30 09:05:17 -07:00
Christoph Hellwig
0290d9c1e5 xfs: fix a comment typo in xfs_submit_ioend
The fail argument is long gone, update the comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-30 09:05:17 -07:00
Christoph Hellwig
1fdafce55c xfs: remove the unused xfs_count_page_state declaration
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-30 09:05:17 -07:00
Christoph Hellwig
89b171acb2 xfs: fix iclog allocation size
Properly allocate the space for the bio_vecs instead of just one byte
per bio_vec.

Fixes: 79b54d9bfc ("xfs: use bios directly to write log buffers")
Reported-by: syzbot+b75afdbe271a0d7ac4f6@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 21:02:45 -07:00
Eric Sandeen
250d4b4c40 xfs: remove unused header files
There are many, many xfs header files which are included but
unneeded (or included twice) in the xfs code, so remove them.

nb: xfs_linux.h includes about 9 headers for everyone, so those
explicit includes get removed by this.  I'm not sure what the
preference is, but if we wanted explicit includes everywhere,
a followup patch could remove those xfs_*.h includes from
xfs_linux.h and move them into the files that need them.
Or it could be left as-is.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:43 -07:00
Christoph Hellwig
adfb5fb46a xfs: implement cgroup aware writeback
Link every newly allocated writeback bio to cgroup pointed to by the
writeback control structure, and charge every byte written back to it.

Tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:22 -07:00
Christoph Hellwig
a247373596 xfs: simplify xfs_chain_bio
Move setting up operation and write hint to xfs_alloc_ioend, and
then just copy over all needed information from the previous bio
in xfs_chain_bio and stop passing various parameters to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:22 -07:00
Darrick J. Wong
f327a00745 xfs: account for log space when formatting new AGs
When we're writing out a fresh new AG, make sure that we don't list an
internal log as free and that we create the rmap for the region.  growfs
never does this, but we will need it when we hook up mkfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-06-28 19:30:21 -07:00
Darrick J. Wong
8d90857cff xfs: refactor free space btree record initialization
Refactor the code that populates the free space btrees of a new AG so
that we can avoid code duplication once things start getting
complicated.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-06-28 19:30:21 -07:00
Brian Foster
7e36a3a63d xfs: always update params on small allocation
xfs_alloc_ag_vextent_small() doesn't update the output parameters in
the event of an AGFL allocation. Instead, it updates the
xfs_alloc_arg structure directly to complete the allocation.

Update both args and the output params to provide consistent
behavior for future callers.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:20 -07:00
Brian Foster
6691cd9267 xfs: skip small alloc cntbt logic on NULL cursor
The small allocation helper is implemented in a way that is fairly
tightly integrated to the existing allocation algorithms. It expects
a cntbt cursor beyond the end of the tree, attempts to locate the
last record in the tree and only attempts an AGFL allocation if the
cntbt is empty.

The upcoming generic algorithm doesn't rely on the cntbt processing
of this function. It will only call this function when the cntbt
doesn't have a big enough extent or is empty and thus AGFL
allocation is the only remaining option. Tweak
xfs_alloc_ag_vextent_small() to handle a NULL cntbt cursor and skip
the cntbt logic. This facilitates use by the existing allocation
code and new code that only requires an AGFL allocation attempt.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:20 -07:00
Brian Foster
c63cdd4fc9 xfs: move small allocation helper
Move the small allocation helper further up in the file to avoid the
need for a function declaration. The remaining declarations will be
removed by followup patches. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:19 -07:00
Brian Foster
2a4f35f984 xfs: clean up small allocation helper
xfs_alloc_ag_vextent_small() is kind of a mess. Clean it up in
preparation for future changes. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:30:19 -07:00
Christoph Hellwig
caeaea9858 xfs: merge xfs_trans_bmap.c into xfs_bmap_item.c
Keep all bmap item related code together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:29:42 -07:00
Christoph Hellwig
3cfce1e3ce xfs: merge xfs_trans_rmap.c into xfs_rmap_item.c
Keep all rmap item related code together in one file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:29:41 -07:00
Christoph Hellwig
effd5e96e7 xfs: merge xfs_trans_refcount.c into xfs_refcount_item.c
Keep all the refcount item related code together in one file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:29:41 -07:00
Christoph Hellwig
81f4004173 xfs: merge xfs_trans_extfree.c into xfs_extfree_item.c
Keep all the extree item related code together in one file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:28:17 -07:00
Christoph Hellwig
73f0d23633 xfs: merge xfs_bud_init into xfs_trans_get_bud
There is no good reason to keep these two functions separate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:36 -07:00
Christoph Hellwig
60883447f4 xfs: merge xfs_rud_init into xfs_trans_get_rud
There is no good reason to keep these two functions separate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:36 -07:00
Christoph Hellwig
ebeb8e0629 xfs: merge xfs_cud_init into xfs_trans_get_cud
There is no good reason to keep these two functions separate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:35 -07:00
Christoph Hellwig
9c5e7c2ae3 xfs: merge xfs_efd_init into xfs_trans_get_efd
There is no good reason to keep these two functions separate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:35 -07:00
Christoph Hellwig
95cf0e4a0d xfs: remove a pointless comment duplicated above all xfs_item_ops instances
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:34 -07:00
Christoph Hellwig
89ae379d56 xfs: use a list_head for iclog callbacks
Replace the hand grown linked list handling and cil context attachment
with the standard list_head structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:34 -07:00
Christoph Hellwig
efe2330fdc xfs: remove the xfs_log_item_t typedef
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:33 -07:00
Christoph Hellwig
b3b14aacc6 xfs: don't cast inode_log_items to get the log_item
The cast is not type safe, and we can just dereference the first
member instead to start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:33 -07:00
Christoph Hellwig
9ce632a28a xfs: add a flag to release log items on commit
We have various items that are released from ->iop_comitting.  Add a
flag to just call ->iop_release from the commit path to avoid tons
of boilerplate code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:32 -07:00
Christoph Hellwig
ddf92053e4 xfs: split iop_unlock
The iop_unlock method is called when comitting or cancelling a
transaction.  In the latter case, the transaction may or may not be
aborted.  While there is no known problem with the current code in
practice, this implementation is limited in that any log item
implementation that might want to differentiate between a commit and a
cancellation must rely on the aborted state.  The aborted bit is only
set when the cancelled transaction is dirty, however.  This means that
there is no way to distinguish between a commit and a clean transaction
cancellation.

For example, intent log items currently rely on this distinction.  The
log item is either transferred to the CIL on commit or released on
transaction cancel. There is currently no possibility for a clean intent
log item in a transaction, but if that state is ever introduced a cancel
of such a transaction will immediately result in memory leaks of the
associated log item(s).  This is an interface deficiency and landmine.

To clean this up, replace the iop_unlock method with an iop_release
method that is specific to transaction cancel.  The existing
iop_committing method occurs at the same time as iop_unlock in the
commit path and there is no need for two separate callbacks here.
Overload the iop_committing method with the current commit time
iop_unlock implementations to eliminate the need for the latter and
further simplify the interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:32 -07:00
Christoph Hellwig
195cd83d1b xfs: don't use xfs_trans_free_items in the commit path
While commiting items looks very similar to freeing them on error it is
a different operation, and they will diverge a bit soon.

Split out the commit case from xfs_trans_free_items, inline it into
xfs_log_commit_cil and give it a separate trace point.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:31 -07:00
Christoph Hellwig
8e4b20ea83 xfs: remove the dummy iop_push implementation for inode creation items
This method should never be called, so don't waste code on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:31 -07:00
Christoph Hellwig
e8b78db77d xfs: don't require log items to implement optional methods
Just check if they are present first.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:30 -07:00
Christoph Hellwig
d15cbf2f38 xfs: stop using XFS_LI_ABORTED as a parameter flag
Just pass a straight bool aborted instead of abusing XFS_LI_ABORTED as a
flag in function parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:30 -07:00
Christoph Hellwig
086252c34b xfs: fix a trivial comment typo in xfs_trans_committed_bulk
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:29 -07:00
Christoph Hellwig
dbd329f1e4 xfs: add struct xfs_mount pointer to struct xfs_buf
We need to derive the mount pointer from a buffer in a lot of place.
Add a direct pointer to short cut the pointer chasing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:29 -07:00
Christoph Hellwig
8124b9b601 xfs: remove the b_io_length field in struct xfs_buf
This field is now always idential to b_length.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:28 -07:00
Christoph Hellwig
e99b4bd0cb xfs: properly type the b_log_item field in struct xfs_buf
Now that the log code doesn't abuse this field any more we can
declare it as a struct xfs_buf_log_item pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:28 -07:00
Christoph Hellwig
0564501ff5 xfs: remove unused buffer cache APIs
Now that the log code uses bios directly we can drop various special
cases in the buffer cache code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:27 -07:00
Christoph Hellwig
6e9b3dd80f xfs: stop using bp naming for log recovery buffers
Now that we don't use struct xfs_buf to hold log recovery buffer rename
the related functions and variables to just talk of a buffer instead of
using the bp name that we usually use for xfs_buf related functionality.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:27 -07:00
Christoph Hellwig
6ad5b3255b xfs: use bios directly to read and write the log recovery buffers
The xfs_buf structure is basically used as a glorified container for
a memory allocation in the log recovery code.  Replace it with a
call to kmem_alloc_large and a simple abstraction to read into or
write from it synchronously using chained bios.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:26 -07:00
Christoph Hellwig
18ffb8c3f0 xfs: return an offset instead of a pointer from xlog_align
This simplifies both the helper and the callers.  We lost a bit of
size sanity checking, but that is already covered by KASAN if needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:26 -07:00
Christoph Hellwig
1058d0f5ee xfs: move the log ioend workqueue to struct xlog
Move the workqueue used for log I/O completions from struct xfs_mount
to struct xlog to keep it self contained in the log code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: destroy the log workqueue after ensuring log ios are done]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:25 -07:00
Christoph Hellwig
79b54d9bfc xfs: use bios directly to write log buffers
Currently the XFS logging code uses the xfs_buf structure and
associated APIs to write the log buffers to disk.  This requires
various special cases in the log code and is generally not very
optimal.

Instead of using a buffer just allocate a kmem_alloc_larger region for
each log buffer, and use a bio and bio_vec array embedded in the iclog
structure to write the buffer to disk.  This also allows for using
the bio split and chaining case to deal with the case of a log
buffer wrapping around the end of the log.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: don't split if/else with an #endif]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:25 -07:00
Christoph Hellwig
2d15d2c0e0 xfs: make use of the l_targ field in struct xlog
Use the slightly shorter way to get at the buftarg for the log device
wherever we can in the log and log recovery code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:24 -07:00
Christoph Hellwig
abca1f33f8 xfs: remove the syncing argument from xlog_verify_iclog
The only caller unconditionally passes true here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:24 -07:00
Christoph Hellwig
9b0489c1d1 xfs: update both stat counters together in xlog_sync
Just a small bit of code tidying up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:23 -07:00
Christoph Hellwig
db0a6faf93 xfs: factor out iclog size calculation from xlog_sync
Split out another self-contained bit of code from xlog_sync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:23 -07:00
Christoph Hellwig
5693384805 xfs: factor out splitting of an iclog from xlog_sync
Split out a self-contained chunk of code from xlog_sync that calculates
the split offset for an iclog that wraps the log end and bumps the
cycles for the second half.

Use the chance to bring some sanity to the variables used to track the
split in xlog_sync by not changing the count variable, and instead use
split as the offset for the split and use those to calculate the
sizes and offsets for the two write buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:22 -07:00
Christoph Hellwig
94860a301b xfs: factor out log buffer writing from xlog_sync
Replace the not very useful xlog_bdstrat wrapper with a new version that
that takes care of all the common logic for writing log buffers.  Use
the opportunity to avoid overloading the buffer address with the log
relative address, and to shed the unused return value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:22 -07:00
Christoph Hellwig
1f9489be02 xfs: don't use REQ_PREFLUSH for split log writes
If we have to split a log write because it wraps the end of the log we
can't just use REQ_PREFLUSH to flush before the first log write,
as the writes might get reordered somewhere in the I/O stack.  Issue
a manual flush in that case so that the ordering of the two log I/Os
doesn't matter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:21 -07:00
Christoph Hellwig
366fc4b898 xfs: remove XLOG_STATE_IOABORT
This value is the only flag in ic_state, which we otherwise use as
a state.  Switch it to a new debug-only field and also report and
actual error in the buffer in the I/O completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:21 -07:00
Christoph Hellwig
9bff313253 xfs: reformat xlog_get_lowest_lsn
Reformat xlog_get_lowest_lsn to our usual style.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:20 -07:00
Christoph Hellwig
4f62282a36 xfs: cleanup xlog_get_iclog_buffer_size
We don't really need all the messy branches in the function, as it
really does three things, out of which 2 are common for all branches:

 1) set up mount point log buffer size and count values if not already
    done from mount options
 2) calculate the number of log headers
 3) set up all the values in struct xlog based on the above

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:20 -07:00
Christoph Hellwig
76ce9823ac xfs: remove the l_iclog_size_log field from struct xlog
This field is never used, so we can simply kill it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:19 -07:00
Christoph Hellwig
72945d86dd xfs: make mem_to_page available outside of xfs_buf.c
Rename the function to kmem_to_page and move it to kmem.h together
with our kmem_large allocator that may either return kmalloced or
vmalloc pages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:19 -07:00
Christoph Hellwig
ce89755cdf xfs: renumber XBF_WRITE_FAIL
Assining a numerical value that is not close to the flags
defined near by is just asking for conflicts later on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:18 -07:00
Christoph Hellwig
153fd7b57c xfs: remove the never used _XBF_COMPOUND flag
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:18 -07:00
Christoph Hellwig
1e85a3670d xfs: remove the no-op spinlock_destroy stub
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28 19:27:17 -07:00
Darrick J. Wong
5467b34bd1 xfs: move xfs_ino_geometry to xfs_shared.h
The inode geometry structure isn't related to ondisk format; it's
support for the mount structure.  Move it to xfs_shared.h.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-06-28 19:25:35 -07:00
Christoph Hellwig
ff896738be block: return from __bio_try_merge_page if merging occured in the same page
We currently have an input same_page parameter to __bio_try_merge_page
to prohibit merging in the same page.  The rationale for that is that
some callers need to account for every page added to a bio.  Instead of
letting these callers call twice into the merge code to account for the
new vs existing page cases, just turn the paramter into an output one that
returns if a merge in the same page occured and let them act accordingly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-17 09:33:02 -06:00
Eric Sandeen
f5b999c03f xfs: remove unused flag arguments
There are several functions which take a flag argument that is
only ever passed as "0," so remove these arguments.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12 09:00:00 -07:00
Christoph Hellwig
76dee76921 xfs: remove the debug-only q_transp field from struct xfs_dquot
The field is only used for a few assertations.  Shrink the dqout
structure instead, similarly to what commit f3ca87389d
("xfs: remove i_transp") did for the xfs_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12 08:59:59 -07:00
Christoph Hellwig
f9a196ee5a xfs: merge xfs_buf_zero and xfs_buf_iomove
xfs_buf_zero is the only caller of xfs_buf_iomove.  Remove support
for copying from or to the buffer in xfs_buf_iomove and merge the
two functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12 08:59:59 -07:00
Eric Sandeen
8c9ce2f707 xfs: remove unused flags arg from getsb interfaces
The flags value is always passed as 0 so remove the argument.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12 08:59:58 -07:00
Eric Sandeen
d03a2f1b9f xfs: include WARN, REPAIR build options in XFS_BUILD_OPTIONS
The XFS_BUILD_OPTIONS string, shown at module init time and
in modinfo output, does not currently include all available
build options.  So, add in CONFIG_XFS_WARN and CONFIG_XFS_REPAIR.

It has been suggested in some quarters
That this is not enough.
Well ...

Anybody who would like to see this in a sysfs file can send
a patch.  :)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12 08:37:40 -07:00
Darrick J. Wong
4b4d98cca3 xfs: finish converting to inodes_per_cluster
Finish converting all the old inode_cluster_size >> inopblog users to
inodes_per_cluster.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-06-12 08:37:40 -07:00
Darrick J. Wong
490d451fa5 xfs: fix inode_cluster_size rounding mayhem
inode_cluster_size is supposed to represent the size (in bytes) of an
inode cluster buffer.  We avoid having to handle multiple clusters per
filesystem block on filesystems with large blocks by openly rounding
this value up to 1 FSB when necessary.  However, we never reset
inode_cluster_size to reflect this new rounded value, which adds to the
potential for mistakes in calculating geometries.

Fix this by setting inode_cluster_size to reflect the rounded-up size if
needed, and special-case the few places in the sparse inodes code where
we actually need the smaller value to validate on-disk metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-06-12 08:37:40 -07:00
Darrick J. Wong
494dba7b27 xfs: refactor inode geometry setup routines
Migrate all of the inode geometry setup code from xfs_mount.c into a
single libxfs function that we can share with xfsprogs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-06-12 08:37:40 -07:00
Darrick J. Wong
ef32595999 xfs: separate inode geometry
Separate the inode geometry information into a distinct structure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-06-12 08:37:40 -07:00
Amir Goldstein
8c3f406c09 xfs: use file_modified() helper
Note that by using the helper, the order of calling file_remove_privs()
after file_update_mtime() in xfs_file_aio_write_checks() has changed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-09 10:06:19 -07:00
Linus Torvalds
01047631df Changes since last update:
- Fix some forgotten strings in a log debugging function
 - Fix incorrect unit conversion in online fsck code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlz1Ur0ACgkQ+H93GTRK
 tOshcA//X+tEFZGF+oosQ3h/IJqr9pODZfVf76mQIpg/5IU7WUzONGjTsjiHyH93
 zC2qw7l7m/3LyCSXfuJfuWOCyPXRRj7Dv+iUCbjmAh0OAn6Alpa1VN9jxABTeTuY
 xeCoNdRBCg6wF2XLLVgEN+VEdrFc7F7x/4OyoW5dmmBQakNOWlVgLtQqO+7gK/hS
 Qc+xw9ekKvkceHi//NXJTSXAZ0EfntzNZ/Fg41O2cztWV4oKxl2a4ej+j0g8/u9e
 rabTP54RH8dNJIejRmWU3dxz/w7OHSOO84LW/Q4LMKykfhLV6lhPL25igy1eLNhY
 OpcCWQio4ZKBwd8UoXCH0HA4dprusPipI1JIXp/YDHGFb0PumhkZaO/1xM0G5J+o
 3n1A1m7MW16jXAwlwBq0A6UNUFGyMfhfQiCluikBwkpR5dhWQoLQ5/IWZxs7ibUy
 0M5POwnYUsx7xcx3/TjUsLaGbBrqD/F4LXXGmxVJQa6Dt6B0ctyLDnjN08mNwR0e
 4Kvck4yxUHrUoXGaOhtddZmloVcrY58CwwHPabwoA3qC39ktKvp5PPrTNPiao8Ew
 oILteFrpQxPfg1AItXO/QMcGvj7ragHSA2O0XqD5Tm61jWErfpIAyRQ89oXT5GUt
 Z52uOI164u/8T4S/HUvi9tqYuZ+xudG4krIClDCXBaww1W2xNjo=
 =WFWW
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.2-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Here are a couple more bug fixes for 5.2. Changes since last update:

   - Fix some forgotten strings in a log debugging function

   - Fix incorrect unit conversion in online fsck code"

* tag 'xfs-5.2-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: inode btree scrubber should calculate im_boffset correctly
  xfs: fix broken log reservation debugging
2019-06-06 12:36:54 -07:00
Darrick J. Wong
025197ebb0 xfs: inode btree scrubber should calculate im_boffset correctly
The im_boffset field is in units of bytes, whereas XFS_INO_OFFSET
returns a value in units of inodes.  Convert the units so that scrub on
a 64k-block filesystem works correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-06-03 09:18:40 -07:00
Darrick J. Wong
d31d718528 xfs: fix broken log reservation debugging
xlog_print_tic_res() is supposed to print a human readable string for
each element of the log ticket reservation array.  Unfortunately, I
forgot to update the string array when we added rmap & reflink support,
so the debug message prints "region[3]: (null) - 352 bytes" which isn't
useful at all.  Add the missing elements and add a build check so that
we don't forget again to add a string when adding a new XLOG_REG_TYPE.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-05-24 07:32:01 -07:00
Linus Torvalds
4dde821e42 Fixes for 5.1:
- Fix an accounting mistake where we included the log space when
   calculating the reserve space for metadata expansion.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlzkEpIACgkQ+H93GTRK
 tOuuXg//Zjuz7KxWtfMRNvNPL3w9YTuu0FyvlQRwFeIKOu9oc2bq+3JIEjakFNM0
 C4htke3Q0CWEokLGNMxjTg53/IbWlrq2G3mszjyQbH/xTidO+ULm6A6z7aIYa4Wj
 hShpHnkd12F543nUgsDUg4+caHrfqlU2ZOkpn7HtmVsCASzNOeBBINSFDUAGqouB
 gWAq17PeBwnUMJLBFvOW4CMcfYfxOkhM4XsAnmJn1+Mk+CVVichgXy8HsxIqOlGT
 wUbt4iVYLcoAd9dsRkTpiCE+ni5AtfyF2aCtMb39Vm18wzGhokIvHlH+IVqdR0k8
 fUoexMPx7XsDWbm0MR5n8uirQUfL6RIBxxVdSwFRNvngmYiAdvxiNlC+rzZm+kdb
 r4tKk+SNXMIonv73bmwBBEuu4LRhb77Ls4jc+Q/6oRp1N6lf16QlmPzAyTmu0jRH
 Uxe3Nqa0PlerjiU6sgC45Dtym3loF9Cvun2f8JJHEYrapVpDgmWdH+l5bQoENpV5
 GH5T90RjjEE+L2PGg83xVBLIBElJ05npzT73SkyfMacBPTR4ziYNX2blPnHFU+es
 VyZG6IqyjINTsiS3Y13SBd658BhEUrSsxaMJo3zwOMJRfVcqEPOIa+DLPt3Mok5i
 Pw382Ym4aNdHNTO0g83XOdtyHt8qAxM330Sayiw2ac/FGwtnoEM=
 =C52a
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.2-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:
 "Fix an accounting mistake where we included the log space when
  calculating the reserve space for metadata expansion"

* tag 'xfs-5.2-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't reserve per-AG space for an internal log
2019-05-23 11:18:18 -07:00
Thomas Gleixner
ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
Darrick J. Wong
5cd213b0fe xfs: don't reserve per-AG space for an internal log
It turns out that the log can consume nearly all the space in an AG, and
when this happens this it's possible that there will be less free space
in the AG than the reservation would try to hide.  On a debug kernel
this can trigger an ASSERT in xfs/250:

XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319

The log is permanently allocated, so we know we're never going to have
to expand the btrees to hold any records associated with the log space.
We therefore can treat the space as if it doesn't exist.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2019-05-20 11:25:39 -07:00
Masahiro Yamada
9cc342f6c4 treewide: prefix header search paths with $(srctree)/
Currently, the Kbuild core manipulates header search paths in a crazy
way [1].

To fix this mess, I want all Makefiles to add explicit $(srctree)/ to
the search paths in the srctree. Some Makefiles are already written in
that way, but not all. The goal of this work is to make the notation
consistent, and finally get rid of the gross hacks.

Having whitespaces after -I does not matter since commit 48f6e3cf5b
("kbuild: do not drop -I without parameter").

[1]: https://patchwork.kernel.org/patch/9632347/

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-05-18 11:49:57 +09:00
Linus Torvalds
67a2422239 for-5.2/block-20190507
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzR0AAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpo0MD/47D1kBK9rGzkAwIz1Jkh1Qy/ITVaDJzmHJ
 UP5uncQsgKFLKMR1LbRcrWtmk2MwFDNULGbteHFeCYE1ypCrTgpWSp5+SJluKd1Q
 hma9krLSAXO9QiSaZ4jafshXFIZxz6IjakOW8c9LrT80Ze47yh7AxiLwDafcp/Jj
 x6NW790qB7ENDtfarDkZk14NCS8HGLRHO5B21LB+hT0Kfbh0XZaLzJdj7Mck1wPA
 VT8hL9mPuA++AjF7Ra4kUjwSakgmajTa3nS2fpkwTYdztQfas7x5Jiv7FWxrrelb
 qbabkNkWKepcHAPEiZR7o53TyfCucGeSK/jG+dsJ9KhNp26kl1ci3frl5T6PfVMP
 SPPDjsKIHs+dqFrU9y5rSGhLJqewTs96hHthnLGxyF67+5sRb5+YIy+dcqgiyc/b
 TUVyjCD6r0cO2q4v9VhwnhOyeBUA9Rwbu8nl7JV5Q45uG7qI4BC39l1jfubMNDPO
 GLNGUUzb6ER7z6lYINjRSF2Jhejsx8SR9P7jhpb1Q7k/VvDDxO1T4FpwvqWFz9+s
 Gn+s6//+cA6LL+42eZkQjvwF2CUNE7TaVT8zdb+s5HP1RQkZToqUnsQCGeRTrFni
 RqWXfW9o9+awYRp431417oMdX/LvLGq9+ZtifRk9DqDcowXevTaf0W2RpplWSuiX
 RcCuPeLAVg==
 =Ot0g
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Nothing major in this series, just fixes and improvements all over the
  map. This contains:

   - Series of fixes for sed-opal (David, Jonas)

   - Fixes and performance tweaks for BFQ (via Paolo)

   - Set of fixes for bcache (via Coly)

   - Set of fixes for md (via Song)

   - Enabling multi-page for passthrough requests (Ming)

   - Queue release fix series (Ming)

   - Device notification improvements (Martin)

   - Propagate underlying device rotational status in loop (Holger)

   - Removal of mtip32xx trim support, which has been disabled for years
     (Christoph)

   - Improvement and cleanup of nvme command handling (Christoph)

   - Add block SPDX tags (Christoph)

   - Cleanup/hardening of bio/bvec iteration (Christoph)

   - A few NVMe pull requests (Christoph)

   - Removal of CONFIG_LBDAF (Christoph)

   - Various little fixes here and there"

* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
  block: fix mismerge in bvec_advance
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  blk-mq: always free hctx after request queue is freed
  blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  block: fix function name in comment
  nvmet: protect discovery change log event list iteration
  nvme: mark nvme_core_init and nvme_core_exit static
  nvme: move command size checks to the core
  nvme-fabrics: check more command sizes
  nvme-pci: check more command sizes
  nvme-pci: remove an unneeded variable initialization
  nvme-pci: unquiesce admin queue on shutdown
  nvme-pci: shutdown on timeout during deletion
  nvme-pci: fix psdt field for single segment sgls
  nvme-multipath: don't print ANA group state by default
  nvme-multipath: split bios with the ns_head bio_set before submitting
  ...
2019-05-07 18:14:36 -07:00
Eric Sandeen
910832697c xfs: change some error-less functions to void types
There are several functions which have no opportunity to return
an error, and don't contain any ASSERTs which could be argued
to be better constructed as error cases.  So, make them voids
to simplify the callers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-05-01 20:26:30 -07:00
Christoph Hellwig
2b070cfe58 block: remove the i argument to bio_for_each_segment_all
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:13 -06:00
Darrick J. Wong
75efa57d0b xfs: add online scrub for superblock counters
Teach online scrub how to check the filesystem summary counters.  We use
the incore delalloc block counter along with the incore AG headers to
compute expected values for fdblocks, icount, and ifree, and then check
that the percpu counter is within a certain threshold of the expected
value.  This is done to avoid having to freeze or otherwise lock the
filesystem, which means that we're only checking that the counters are
fairly close, not that they're exactly correct.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-30 08:19:13 -07:00
Christoph Hellwig
9407928575 xfs: don't parse the mtpt mount option
The text isn't really any more useful than the default unknown option
handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-30 08:19:13 -07:00
Darrick J. Wong
710d707d2f xfs: always rejoin held resources during defer roll
During testing of xfs/141 on a V4 filesystem, I observed some
inconsistent behavior with regards to resources that are held (i.e.
remain locked) across a defer roll.  The transaction roll always gives
the defer roll function a new transaction, even if committing the old
transaction fails.  However, the defer roll function only rejoins the
held resources if the transaction commit succeedied.  This means that
callers of defer roll have to figure out whether the held resources are
attached to the transaction being passed back.

Worse yet, if the defer roll was part of a defer finish call, we have a
third possibility: the defer finish could pass back a dirty transaction
with dirty held resources and an error code.

The only sane way to handle all of these scenarios is to require that
the code that held the resource either cancel the transaction before
unlocking and releasing the resources, or use functions that detach
resources from a transaction properly (e.g.  xfs_trans_brelse) if they
need to drop the reference before committing or cancelling the
transaction.

In order to make this so, change the defer roll code to join held
resources to the new transaction unconditionally and fix all the bhold
callers to release the held buffers correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-30 08:19:13 -07:00
Brian Foster
1749d1ea89 xfs: add missing error check in xfs_prepare_shift()
xfs_prepare_shift() fails to check the error return from
xfs_flush_unmap_range(). If the latter fails, that could lead to an
insert/collapse range operation over a delalloc range, which is not
supported.

Add an error check and return appropriately. This is reproduced
rarely by generic/475.

Fixes: 7f9f71be84 ("xfs: extent shifting doesn't fully invalidate page cache")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:56 -07:00
Darrick J. Wong
47cd97b5b2 xfs: scrub should check incore counters against ondisk headers
In theory, the incore per-AG structure counters should match the ones on
disk, so check that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:56 -07:00
Darrick J. Wong
9a1f3049f4 xfs: allow scrubbers to pause background reclaim
The forthcoming summary counter patch races with regular filesystem
activity to compute rough expected values for the counters.  This design
was chosen to avoid having to freeze the entire filesystem to check the
counters, but while that's running we'd prefer to minimize background
reclamation activity to reduce the perturbations to the incore free
block count.  Therefore, provide a way for scrubbers to disable
background posteof and cowblock reclamation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:56 -07:00
Darrick J. Wong
ed30dcbd90 xfs: rename the speculative block allocation reclaim toggle functions
"reclaim" is used throughout the icache code to mean reclamation of
incore inode structures.  It's also used for two helper functions that
toggle background deletion of speculative preallocations.  Separate
the second of the two uses to make things less confusing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:55 -07:00
Darrick J. Wong
9fe82b8c42 xfs: track delayed allocation reservations across the filesystem
Add a percpu counter to track the number of blocks directly reserved for
delayed allocations on the data device.  This counter (in contrast to
i_delayed_blks) does not track allocated CoW staging extents or anything
going on with the realtime device.  It will be used in the upcoming
summary counter scrub function to check the free block counts without
having to freeze the filesystem or walk all the inodes to find the
delayed allocations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26 12:28:55 -07:00
Darrick J. Wong
f60be90fc9 xfs: fix broken bhold behavior in xrep_roll_ag_trans
In xrep_roll_ag_trans, the transaction roll will always set sc->tp to
the new transaction, even if committing the old one fails.  A bare
transaction roll leaves the buffer(s) locked but not joined to the new
transaction, so it's not necessary to release the hold if the roll
fails.  Remove the incorrect xfs_trans_bhold_release calls.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-26 12:28:55 -07:00
Darrick J. Wong
3de5eab3fd xfs: unlock inode when xfs_ioctl_setattr_get_trans can't get transaction
We passed an inode into xfs_ioctl_setattr_get_trans with join_flags
indicating which locks are held on that inode.  If we can't allocate a
transaction then we need to unlock the inode before we bail out, like
all the other error paths do.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-23 08:36:23 -07:00
Darrick J. Wong
078f4a7d31 xfs: kill the xfs_dqtrx_t typedef
There's only a few uses left, so just kill the typedef while we're at
it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-04-23 08:36:23 -07:00
Darrick J. Wong
394aafdc15 xfs: widen inode delalloc block counter to 64-bits
Widen the incore inode's i_delayed_blks counter to be a 64-bit integer.
This is necessary to fix an integer overflow problem that can be
reproduced easily now that we use the counter to track blocks that are
assigned to the inode in memory but not on disk.  This includes actual
delalloc reservations as well as real extents in the COW fork that
are waiting to be remapped into the data fork.

These 'delayed mapping' blocks can easily exceed 2^32 blocks if one
creates a very large sparse file of size approximately 2^33 bytes with
one byte written every 2^23 bytes, sets a very large COW extent size
hint of 2^23 blocks, reflinks the first file into a second file, and
then writes a single byte every 2^23 blocks in the original file.

When this happens, we'll try to create approximately 1024 2^23 extent
reservations in the COW fork, which will overflow the counter and cause
problems.

Note that on x64 we end up filling a 4-byte gap in the structure so this
doesn't increase the incore size.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-04-23 08:36:23 -07:00
Darrick J. Wong
903b1fc273 xfs: widen quota block counters to 64-bit integers
Widen the incore quota transaction delta structure to treat block
counters as 64-bit integers.  This is a necessary addition so that we
can widen the i_delayed_blks counter to be a 64-bit integer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-04-23 08:36:23 -07:00
Darrick J. Wong
1fdeaea4d9 xfs: abort unaligned nowait directio early
Dave Chinner noticed that xfs_file_dio_aio_write returns EAGAIN without
dropping the IOLOCK when its deciding not to wait, which means that we
leak the IOLOCK there.  Since we now make unaligned directio always
wait, we have the opportunity to bail out before trying to take the
lock, which should reduce the overhead of this never-gonna-work case
considerably while also solving the dropped lock problem.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-04-23 08:36:23 -07:00
Brian Foster
362f5e745a xfs: assert that we don't enter agfl freeing with a non-permanent transaction
Block allocation requires a permanent transaction for deferred AGFL
frees.  Add an assert in the block allocation path to make explicit and
obvious to future callers the requirement of a transaction with a
permanent reservation.

Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: split this out from the previous patch per hch request]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-23 08:36:23 -07:00
Brian Foster
945c941fcd xfs: make tr_growdata a permanent transaction
The growdata transaction is used by growfs operations to increase
the data size of the filesystem. Part of this sequence involves
extending the size of the last preexisting AG in the fs, if
necessary. This is implemented by freeing the newly available
physical range to the AG.

tr_growdata is not a permanent transaction, however, and block
allocation transactions must be permanent to handle deferred frees
of AGFL blocks. If the grow operation extends an existing AG that
requires AGFL fixing, assert failures occur due to a populated dfops
list on a non-permanent transaction and the AGFL free does not
occur. This is reproduced (rarely) by xfs/104.

Change tr_growdata to a permanent transaction with a default log
count. This increases initial transaction reservation size, but
growfs is an infrequent and non-performance critical operation and
so should have minimal impact.

Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment to the assert]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-22 16:28:45 -07:00
Darrick J. Wong
3994fc4895 xfs: merge adjacent io completions of the same type
It's possible for pagecache writeback to split up a large amount of work
into smaller pieces for throttling purposes or to reduce the amount of
time a writeback operation is pending.  Whatever the reason, XFS can end
up with a bunch of IO completions that call for the same operation to be
performed on a contiguous extent mapping.  Since mappings are extent
based in XFS, we'd prefer to run fewer transactions when we can.

When we're processing an ioend on the list of io completions, check to
see if the next items on the list are both adjacent and of the same
type.  If so, we can merge the completions to reduce transaction
overhead.

On fast storage this doesn't seem to make much of a difference in
performance, though the number of transactions for an overnight xfstests
run seems to drop by ~5%.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-16 10:01:58 -07:00
Darrick J. Wong
2840824370 xfs: remove unused m_data_workqueue
Now that we're no longer using m_data_workqueue, remove it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-16 10:01:58 -07:00
Darrick J. Wong
cb357bf3d1 xfs: implement per-inode writeback completion queues
When scheduling writeback of dirty file data in the page cache, XFS uses
IO completion workqueue items to ensure that filesystem metadata only
updates after the write completes successfully.  This is essential for
converting unwritten extents to real extents at the right time and
performing COW remappings.

Unfortunately, XFS queues each IO completion work item to an unbounded
workqueue, which means that the kernel can spawn dozens of threads to
try to handle the items quickly.  These threads need to take the ILOCK
to update file metadata, which results in heavy ILOCK contention if a
large number of the work items target a single file, which is
inefficient.

Worse yet, the writeback completion threads get stuck waiting for the
ILOCK while holding transaction reservations, which can use up all
available log reservation space.  When that happens, metadata updates to
other parts of the filesystem grind to a halt, even if the filesystem
could otherwise have handled it.

Even worse, if one of the things grinding to a halt happens to be a
thread in the middle of a defer-ops finish holding the same ILOCK and
trying to obtain more log reservation having exhausted the permanent
reservation, we now have an ABBA deadlock - writeback completion has a
transaction reserved and wants the ILOCK, and someone else has the ILOCK
and wants a transaction reservation.

Therefore, we create a per-inode writeback io completion queue + work
item.  When writeback finishes, it can add the ioend to the per-inode
queue and let the single worker item process that queue.  This
dramatically cuts down on the number of kworkers and ILOCK contention in
the system, and seems to have eliminated an occasional deadlock I was
seeing while running generic/476.

Testing with a program that simulates a heavy random-write workload to a
single file demonstrates that the number of kworkers drops from
approximately 120 threads per file to 1, without dramatically changing
write bandwidth or pagecache access latency.

Note that we leave the xfs-conv workqueue's max_active alone because we
still want to be able to run ioend processing for as many inodes as the
system can handle.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
4fb7951fde xfs: scrub should only cross-reference with healthy btrees
Skip cross-referencing with a btree if the health report tells us that
it's known to be bad.  This should reduce the dmesg spew considerably.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
4860a05d24 xfs: scrub/repair should update filesystem metadata health
Now that we have the ability to track sick metadata in-core, make scrub
and repair update those health assessments after doing work.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
160b5a7845 xfs: hoist the already_fixed variable to the scrub context
Now that we no longer memset the scrub context, we can move the
already_fixed variable into the scrub context's state flags instead of
passing around pointers to separate stack variables.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
f8c2a2257c xfs: collapse scrub bool state flags into a single unsigned int
Combine all the boolean state flags in struct xfs_scrub into a single
unsigned int, because we're going to be adding more state flags soon.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
9d71e15586 xfs: refactor scrub context initialization
It's a little silly how the memset in scrub context initialization
forces us to declare stack variables to preserve context variables
across a retry.  Since the teardown functions already null out most of
the ephemeral state (buffer pointers, btree cursors, etc.), just skip
the memset and move the initialization as needed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16 10:01:57 -07:00
Darrick J. Wong
89d139d5ad xfs: report inode health via bulkstat
Use space in the bulkstat ioctl structure to report any problems
observed with the inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:58 -07:00
Darrick J. Wong
1302c6a24f xfs: report AG health via AG geometry ioctl
Use the AG geometry info ioctl to report health status too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
c23232d409 xfs: report fs and rt health via geometry structure
Use our newly expanded geometry structure to report the overall fs and
realtime health status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
7cd5006bdb xfs: add a new ioctl to describe allocation group geometry
Add a new ioctl to describe an allocation group's geometry.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Dave Chinner
1b6d968de2 xfs: bump XFS_IOC_FSGEOMETRY to v5 structures
Unfortunately, the V4 XFS_IOC_FSGEOMETRY structure is out of space so we
can't just add a new field to it. Hence we need to bump the definition
to V5 and and treat the V4 ioctl and structure similar to v1 to v3.

While doing this, clean up all the definitions associated with the
XFS_IOC_FSGEOMETRY ioctl.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: forward port to 5.1, expand structure size to 256 bytes]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
519841c207 xfs: clear BAD_SUMMARY if unmounting an unhealthy filesystem
If we know the filesystem metadata isn't healthy during unmount, we want
to encourage the administrator to run xfs_repair right away.  We can't
do this if BAD_SUMMARY will cause an unclean log unmount to force
summary recalculation, so turn it off if the fs is bad.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
39353ff6e9 xfs: replace the BAD_SUMMARY mount flag with the equivalent health code
Replace the BAD_SUMMARY mount flag with calls to the equivalent health
tracking code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Darrick J. Wong
6772c1f112 xfs: track metadata health status
Add the necessary in-core metadata fields to keep track of which parts
of the filesystem have been observed and which parts were observed to be
unhealthy, and print a warning at unmount time if we have unfixed
problems.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14 18:15:57 -07:00
Wang Shilong
2bf9d264ef xfs,fstrim: fix to return correct minlen
This patch tries to address two problems:

1) return @minlen we used to trim to
user space.

2) return EINVAL if granularity is larger than
avg size, even most of cases, granularity is small(4K),
but if devices return a lager granularity for some reaons
(testing, bugs etc), fstrim should return failure directly.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:57 -07:00
Brian Foster
1ca89fbc48 xfs: don't account extra agfl blocks as available
The block allocation AG selection code has parameters that allow a
caller to perform multiple allocations from a single AG and
transaction (under certain conditions). The parameters specify the
total block allocation count required by the transaction and the AG
selection code selects and locks an AG that will be able to satisfy
the overall requirement. If the available block accounting
calculation turns out to be inaccurate and a subsequent allocation
call fails with -ENOSPC, the resulting transaction cancel leads to
filesystem shutdown because the transaction is dirty.

This exact problem can be reproduced with a highly parallel space
consumer and fsstress workload running long enough to a large
filesystem against -ENOSPC conditions. A bmbt block allocation
request made for inode extent to bmap format conversion after an
extent allocation is expected to be satisfied by the same AG and the
same transaction as the extent allocation. The bmbt block allocation
fails, however, because the block availability of the AG has changed
since the AG was selected (outside of the blocks used for the extent
itself).

The inconsistent block availability calculation is caused by the
deferred block freeing behavior of the AGFL. This immediately
removes extra blocks from the AGFL to free up AGFL slots, but rather
than immediately freeing such blocks as was done in the past, the
block free is deferred such that said blocks are not available for
allocation until the current transaction commits. The AG selection
logic currently considers all AGFL blocks as available and executes
shortly before any extra AGFL blocks are freed. This means the block
availability of the current AG can change before the first
allocation even occurs, but in practice a failure is more likely to
manifest via a subsequent allocation because extent allocation
usually has a contiguity requirement larger than a single block that
can't be satisfied from the AGFL.

In general, XFS prefers operational robustness to absolute
allocation efficiency. In other words, we prefer to return -ENOSPC
slightly earlier at the expense of not being able to allocate every
last block in an AG to avoid this kind of problem. As such, update
the AG block availability calculation to consider extra AGFL blocks
as unavailable since they are immediately removed following the
calculation and will not become available until the current
transaction commits.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:57 -07:00
Brian Foster
22fedd80b6 xfs: shutdown after buf release in iflush cluster abort path
If xfs_iflush_cluster() fails due to corruption, the error path
issues a shutdown and simulates an I/O completion to release the
buffer. This code has a couple small problems. First, the shutdown
sequence can issue a synchronous log force, which is unsafe to do
with buffer locks held. Second, the simulated I/O completion does not
guarantee the buffer is async and thus is unlocked and released.

For example, if the last operation on the buffer was a read off disk
prior to the corruption event, XBF_ASYNC is not set and the buffer
is left locked and held upon return. This results in a memory leak
as shown by the following message on module unload:

 BUG xfs_buf (...): Objects remaining in xfs_buf on __kmem_cache_shutdown()

Fix both of these problems by setting XBF_ASYNC on the buffer prior
to the simulated I/O error and performing the shutdown immediately
after ioend processing when the buffer has been released.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:56 -07:00
Brian Foster
545aa41f5c xfs: wake commit waiters on CIL abort before log item abort
XFS shutdown deadlocks have been reproduced by fstest generic/475.
The deadlock signature involves log I/O completion running error
handling to abort logged items and waiting for an inode cluster
buffer lock in the buffer item unpin handler. The buffer lock is
held by xfsaild attempting to flush an inode. The buffer happens to
be pinned and so xfs_iflush() triggers an async log force to begin
work required to get it unpinned. The log force is blocked waiting
on the commit completion, which never occurs and thus leaves the
filesystem deadlocked.

The root problem is that aborted log I/O completion pots commit
completion behind callback completion, which is unexpected for async
log forces. Under normal running conditions, an async log force
returns to the caller once the CIL ctx has been formatted/submitted
and the commit completion event triggered at the tail end of
xlog_cil_push(). If the filesystem has shutdown, however, we rely on
xlog_cil_committed() to trigger the completion event and it happens
to do so after running log item unpin callbacks. This makes it
unsafe to invoke an async log force from contexts that hold locks
that might also be required in log completion processing.

To address this problem, wake commit completion waiters before
aborting log items in the log I/O completion handler. This ensures
that an async log force will not deadlock on held locks if the
filesystem happens to shutdown. Note that it is still unsafe to
issue a sync log force while holding such locks because a sync log
force explicitly waits on the force completion, which occurs after
log I/O completion processing.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:56 -07:00
Brian Foster
4d09807f20 xfs: fix use after free in buf log item unlock assert
The xfs_buf_log_item ->iop_unlock() callback asserts that the buffer
is unlocked when either non-stale or aborted. This assert occurs
after the bli refcount has been dropped and the log item potentially
freed. The aborted check is thus a potential use after free. This
problem has been reproduced with KASAN enabled via generic/475.

Fix up xfs_buf_item_unlock() to query aborted state before the bli
reference is dropped to prevent a potential use after free.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-04-14 18:15:56 -07:00
Christoph Hellwig
72deb455b5 block: remove CONFIG_LBDAF
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures.  These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time.  Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06 10:48:35 -06:00
Brian Foster
2032a8a27b xfs: serialize unaligned dio writes against all other dio writes
XFS applies more strict serialization constraints to unaligned
direct writes to accommodate things like direct I/O layer zeroing,
unwritten extent conversion, etc. Unaligned submissions acquire the
exclusive iolock and wait for in-flight dio to complete to ensure
multiple submissions do not race on the same block and cause data
corruption.

This generally works in the case of an aligned dio followed by an
unaligned dio, but the serialization is lost if I/Os occur in the
opposite order. If an unaligned write is submitted first and
immediately followed by an overlapping, aligned write, the latter
submits without the typical unaligned serialization barriers because
there is no indication of an unaligned dio still in-flight. This can
lead to unpredictable results.

To provide proper unaligned dio serialization, require that such
direct writes are always the only dio allowed in-flight at one time
for a particular inode. We already acquire the exclusive iolock and
drain pending dio before submitting the unaligned dio. Wait once
more after the dio submission to hold the iolock across the I/O and
prevent further submissions until the unaligned I/O completes. This
is heavy handed, but consistent with the current pre-submission
serialization for unaligned direct writes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-03-26 08:37:55 -07:00
Darrick J. Wong
ed79dac98c xfs: prohibit fstrim in norecovery mode
The xfs fstrim implementation uses the free space btrees to find free
space that can be discarded.  If we haven't recovered the log, the bnobt
will be stale and we absolutely *cannot* use stale metadata to zap the
underlying storage.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2019-03-25 08:03:29 -07:00
Darrick J. Wong
4b0bce30f3 xfs: always init bma in xfs_bmapi_write
Always init the tp/ip fields of bma in xfs_bmapi_write so that the
bmapi_finish at the bottom never trips over null transaction or inode
pointers.

Coverity-id: 1443964
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-03-19 08:16:54 -07:00
Darrick J. Wong
a72e9d8d69 xfs: fix btree scrub checking with regards to root-in-inode
In xchk_btree_check_owner, we can be passed a null buffer pointer.  This
should only happen for the root of a root-in-inode btree type, but we
should program defensively in case the btree cursor state ever gets
screwed up and we get a null buffer anyway.

Coverity-id: 1438713
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-03-19 08:16:41 -07:00
Darrick J. Wong
228de124f2 xfs: dabtree scrub needs to range-check level
Make sure scrub's dabtree iterator function checks that we're not
going deeper in the stack than our cursor permits.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-03-19 08:16:21 -07:00
Brian Foster
6958d11f77 xfs: don't trip over uninitialized buffer on extent read of corrupted inode
We've had rather rare reports of bmap btree block corruption where
the bmap root block has a level count of zero. The root cause of the
corruption is so far unknown. We do have verifier checks to detect
this form of on-disk corruption, but this doesn't cover a memory
corruption variant of the problem. The latter is a reasonable
possibility because the root block is part of the inode fork and can
reside in-core for some time before inode extents are read.

If this occurs, it leads to a system crash such as the following:

 BUG: unable to handle kernel paging request at ffffffff00000221
 PF error: [normal kernel read fault]
 ...
 RIP: 0010:xfs_trans_brelse+0xf/0x200 [xfs]
 ...
 Call Trace:
  xfs_iread_extents+0x379/0x540 [xfs]
  xfs_file_iomap_begin_delay+0x11a/0xb40 [xfs]
  ? xfs_attr_get+0xd1/0x120 [xfs]
  ? iomap_write_begin.constprop.40+0x2d0/0x2d0
  xfs_file_iomap_begin+0x4c4/0x6d0 [xfs]
  ? __vfs_getxattr+0x53/0x70
  ? iomap_write_begin.constprop.40+0x2d0/0x2d0
  iomap_apply+0x63/0x130
  ? iomap_write_begin.constprop.40+0x2d0/0x2d0
  iomap_file_buffered_write+0x62/0x90
  ? iomap_write_begin.constprop.40+0x2d0/0x2d0
  xfs_file_buffered_aio_write+0xe4/0x3b0 [xfs]
  __vfs_write+0x150/0x1b0
  vfs_write+0xba/0x1c0
  ksys_pwrite64+0x64/0xa0
  do_syscall_64+0x5a/0x1d0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The crash occurs because xfs_iread_extents() attempts to release an
uninitialized buffer pointer as the level == 0 value prevented the
buffer from ever being allocated or read. Change the level > 0
assert to an explicit error check in xfs_iread_extents() to avoid
crashing the kernel in the event of localized, in-core inode
corruption.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-03-17 15:21:49 -07:00
Linus Torvalds
de578188ed Changes since last update:
- Fix some clang/smatch/sparse warnings about uninitialized variables.
 - Clean up some typedef usage.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlyH+rEACgkQ+H93GTRK
 tOsoYA/+MBjGB3rbDAZfY7/tlTnQS4Yc7XsBz9C0SvLul0/CbcTPM1w8CO0clH3d
 OxemwdqpoxZkcet3Sv1m2/sr6PVM+7r6f2vn0j19iPOI5soF0hX4XLnvkNhFZbm0
 cl25rO9GuXcG7U7iLXdjyGrXNc+8Hy5kmZJzx3MA7DPTjkEQGgWrB4XIgvNnv0k9
 cIfJtuC9FKFO1/+6oTWid1v+HCPea7m8ORosWgh0q6S9noPAE63vDbesrxHpI3i4
 TLu5L3r6IXHzLRuCcDcB7aPu98L9eLhrBSBqEuiFlkf03ASJqAO4jMarV73WSdvO
 YR1CcWaOGO1W6VRp67N9iLw5WZxplG9n0NaecM1w70g84wSimNmmtBnzHNnfIa8P
 ZopsLJgflQV18qcmjWTnzeNF5RvAu7tQRLLmzJkLiZjQzmk9mr+t41MeIybho9eZ
 zDs8ePN56pUJ6xqaLFTx4MdUkJ8LlllOqsKa7tILu1w76ClGEtSGo48Y/eog+aAu
 MIOAjFY9esUNdVlMu8fsa83DWg31AlwTPYQ5nlrRQ1Xk1GGPAr8lzfQTOG+NI1qo
 eWM8NRqaFDYI/1Ruy3keOsAfuNQkOiNrLz8ge3xH9Y10+meMejoaOLgWMdnyDlYZ
 WxmhlYkmVycpZXmm9lR9Dt7qKLg+6texQccbkNUjPwrA8bTs5Ek=
 =YIiT
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.1-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs cleanups from Darrick Wong:
 "Here's a few more cleanups that trickled in for the merge window.

  It's all fixes for static checker complaints and slowly unwinding
  typedef usage. The four patches here have gone through a few days
  worth of fstest runs with no new problems observed.

  Summary:

   - Fix some clang/smatch/sparse warnings about uninitialized
     variables.

   - Clean up some typedef usage"

* tag 'xfs-5.1-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: clean up xfs_dir2_leaf_addname
  xfs: zero initialize highstale and lowstale in xfs_dir2_leaf_addname
  xfs: clean up xfs_dir2_leafn_add
  xfs: Zero initialize highstale and lowstale in xfs_dir2_leafn_add
2019-03-15 13:55:30 -07:00
Darrick J. Wong
6ef50fe9af xfs: clean up xfs_dir2_leaf_addname
Remove typedefs and consolidate local variable initialization.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2019-03-12 09:19:38 -07:00
Darrick J. Wong
f51fac6892 xfs: zero initialize highstale and lowstale in xfs_dir2_leaf_addname
Smatch complains about the following:

fs/xfs/libxfs/xfs_dir2_leaf.c:848 xfs_dir2_leaf_addname() error:
uninitialized symbol 'lowstale'.

fs/xfs/libxfs/xfs_dir2_leaf.c:849 xfs_dir2_leaf_addname() error:
uninitialized symbol 'highstale'.

I don't think there's any incorrect behavior associated with the
uninitialized variable, but as the author of the previous zero-init
patch points out, it's best not to be passing around pointers to
uninitialized stack areas.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2019-03-10 11:41:31 -07:00
Darrick J. Wong
79622c7ce6 xfs: clean up xfs_dir2_leafn_add
Remove typedefs and consolidate local variable initialization.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2019-03-08 14:24:43 -08:00
Nathan Chancellor
7be73fa1c1 xfs: Zero initialize highstale and lowstale in xfs_dir2_leafn_add
When building with -Wsometimes-uninitialized, Clang warns:

fs/xfs/libxfs/xfs_dir2_node.c:481:6: warning: variable 'lowstale' is
used uninitialized whenever 'if' condition is false
[-Wsometimes-uninitialized]
fs/xfs/libxfs/xfs_dir2_node.c:481:6: warning: variable 'highstale' is
used uninitialized whenever 'if' condition is false
[-Wsometimes-uninitialized]

While it isn't technically wrong, it isn't a problem in practice because
highstale and lowstale are only initialized in xfs_dir2_leafn_add when
compact is not zero then they are passed to xfs_dir3_leaf_find_entry,
where they are initialized before use when compact is zero. Regardless,
it's better not to be passing around uninitialized stack memory so zero
initialize these variables, which silences this warning.

Link: https://github.com/ClangBuiltLinux/linux/issues/393
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-03-08 14:24:43 -08:00
Linus Torvalds
80201fe175 for-5.1/block-20190302
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t
 yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR
 HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/
 ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl
 3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q
 8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn
 QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP
 HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh
 fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c
 fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq
 qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE
 DRVVhYpocw==
 =VBaU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "Not a huge amount of changes in this round, the biggest one is that we
  finally have Mings multi-page bvec support merged. Apart from that,
  this pull request contains:

   - Small series that avoids quiescing the queue for sysfs changes that
     match what we currently have (Aleksei)

   - Series of bcache fixes (via Coly)

   - Series of lightnvm fixes (via Mathias)

   - NVMe pull request from Christoph. Nothing major, just SPDX/license
     cleanups, RR mp policy (Hannes), and little fixes (Bart,
     Chaitanya).

   - BFQ series (Paolo)

   - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
     for the fast path (Jianchao)

   - fops->iopoll() added for async IO polling, this is a feature that
     the upcoming io_uring interface will use (Christoph, me)

   - Partition scan loop fixes (Dongli)

   - mtip32xx conversion from managed resource API (Christoph)

   - cdrom registration race fix (Guenter)

   - MD pull from Song, two minor fixes.

   - Various documentation fixes (Marcos)

   - Multi-page bvec feature. This brings a lot of nice improvements
     with it, like more efficient splitting, larger IOs can be supported
     without growing the bvec table size, and so on. (Ming)

   - Various little fixes to core and drivers"

* tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
  block: fix updating bio's front segment size
  block: Replace function name in string with __func__
  nbd: propagate genlmsg_reply return code
  floppy: remove set but not used variable 'q'
  null_blk: fix checking for REQ_FUA
  block: fix NULL pointer dereference in register_disk
  fs: fix guard_bio_eod to check for real EOD errors
  blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
  block: optimize bvec iteration in bvec_iter_advance
  block: introduce mp_bvec_for_each_page() for iterating over page
  block: optimize blk_bio_segment_split for single-page bvec
  block: optimize __blk_segment_map_sg() for single-page bvec
  block: introduce bvec_nth_page()
  iomap: wire up the iopoll method
  block: add bio_set_polled() helper
  block: wire up block device iopoll method
  fs: add an iopoll method to struct file_operations
  loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
  loop: do not print warn message if partition scan is successful
  block: bounce: make sure that bvec table is updated
  ...
2019-03-08 14:12:17 -08:00
Luis R. Rodriguez
1b9598c8fb xfs: fix reporting supported extra file attributes for statx()
statx(2) notes that any attribute that is not indicated as supported by
stx_attributes_mask has no usable value. Commit 5f955f26f3 ("xfs: report
crtime and attribute flags to statx") added support for informing userspace
of extra file attributes but forgot to list these flags as supported
making reporting them rather useless for the pedantic userspace author.

$ git describe --contains 5f955f26f3
v4.11-rc6~5^2^2~2

Fixes: 5f955f26f3 ("xfs: report crtime and attribute flags to statx")
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment reminding people to keep attributes_mask up to date]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-03-01 08:57:25 -08:00
Darrick J. Wong
3d129e1be3 xfs: fix backwards endian conversion in scrub
Fix a backwards endian conversion of a constant.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2019-02-25 10:16:42 -08:00
Darrick J. Wong
c1a4447f5e xfs: fix uninitialized error variables
smatch complained about some uninitialized error returns, so fix those.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2019-02-25 10:16:41 -08:00
Darrick J. Wong
4f29e10d68 xfs: rework breaking of shared extents in xfs_file_iomap_begin
Rework the data flow in xfs_file_iomap_begin where we decide if we have
to break shared extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-25 09:26:18 -08:00
Darrick J. Wong
affe250a08 xfs: don't pass iomap flags to xfs_reflink_allocate_cow
Don't pass raw iomap flags to xfs_reflink_allocate_cow; signal our
intention with a boolean argument.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-25 09:04:31 -08:00
Christoph Hellwig
81214bab58 iomap: wire up the iopoll method
Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device.  Also refactor the common direct I/O bio submission code into a
nice little helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modified to use bio_set_polled().

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24 08:20:17 -07:00
Colin Ian King
081a8ae2a5 xfs: fix uninitialized error variable
A previous commit removed the initialization of variable 'error' to zero,
and can cause a bogus error return.  This occurs when error contains a
non-zero garbage value and the call to xchk_should_terminate detects a
pending fatal signal and checks for a zero error before setting it
to -EAGAIN. Fix the issue by initializing error to zero.

Fixes: b9454fe056 ("xfs: clean up the inode cluster checking in the inobt scrub")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
66ae56a53f xfs: introduce an always_cow mode
Add a mode where XFS never overwrites existing blocks in place.  This
is to aid debugging our COW code, and also put infatructure in place
for things like possible future support for zoned block devices, which
can't support overwrites.

This mode is enabled globally by doing a:

    echo 1 > /sys/fs/xfs/debug/always_cow

Note that the parameter is global to allow running all tests in xfstests
easily in this mode, which would not easily be possible with a per-fs
sysfs file.

In always_cow mode persistent preallocations are disabled, and fallocate
will fail when called with a 0 mode (with our without
FALLOC_FL_KEEP_SIZE), and not create unwritten extent for zeroed space
when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.

There are a few interesting xfstests failures when run in always_cow
mode:

 - generic/392 fails because the bytes used in the file used to test
   hole punch recovery are less after the log replay.  This is
   because the blocks written and then punched out are only freed
   with a delay due to the logging mechanism.
 - xfs/170 will fail as the already fragile file streams mechanism
   doesn't seem to interact well with the COW allocator
 - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
   the file system is badly fragmented, but there is not much we
   can do to avoid that when always writing out of place
 - xfs/205 fails because overwriting a file in always_cow mode
   will require new space allocation and the assumption in the
   test thus don't work anymore.
 - xfs/326 fails to modify the file at all in always_cow mode after
   injecting the refcount error, leading to an unexpected md5sum
   after the remount, but that again is expected

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
c4feb0b194 xfs: report IOMAP_F_SHARED from xfs_file_iomap_begin_delay
No user of it in the iomap code at the moment, but we should not
actively report wrong information if we can trivially get it right.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
26b91c728b xfs: make COW fork unwritten extent conversions more robust
If we have racing buffered and direct I/O COW fork extents under
writeback can have been moved to the data fork by the time we call
xfs_reflink_convert_cow from xfs_submit_ioend.  This would be mostly
harmless as the block numbers don't change by this move, except for
the fact that xfs_bmapi_write will crash or trigger asserts when
not finding existing extents, even despite trying to paper over this
with the XFS_BMAPI_CONVERT_ONLY flag.

Instead of special casing non-transaction conversions in the already
way too complicated xfs_bmapi_write just add a new helper for the much
simpler non-transactional COW fork case, which simplify ignores not
found extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
db46e604ad xfs: merge COW handling into xfs_file_iomap_begin_delay
Besides simplifying the code a bit this allows to actually implement
the behavior of using COW preallocation for non-COW data mentioned
in the current comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
12df89f28f xfs: also truncate holes covered by COW blocks
This only matters if we want to write data through the COW fork that is
not actually an overwrite of existing data.  Reasons for that are
speculative COW fork allocations using the cowextsize, or a mode where
we always write through the COW fork.  Currently both can't actually
happen, but I plan to enable them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
78f0cc9d55 xfs: don't use delalloc extents for COW on files with extsize hints
While using delalloc for extsize hints is generally a good idea, the
current code that does so only for COW doesn't help us much and creates
a lot of special cases.  Switch it to use real allocations like we
do for direct I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
60271ab79d xfs: fix SEEK_DATA for speculative COW fork preallocation
We speculatively allocate extents in the COW fork to reduce
fragmentation.  But when we write data into such COW fork blocks that
do now shadow an allocation in the data fork SEEK_DATA will not
correctly report it, as it only looks at the data fork extents.
The only reason why that hasn't been an issue so far is because
we even use these speculative COW fork preallocations over holes in
the data fork at all for buffered writes, and blocks in the COW
fork that are written by direct writes are moved into the data
fork immediately at I/O completion time.

Add a new set of iomap_ops for SEEK_HOLE/SEEK_DATA which looks into
both the COW and data fork, and reports all COW extents as unwritten
to the iomap layer.  While this isn't strictly true for COW fork
extents that were already converted to real extents, the practical
semantics that you can't read data from them until they are moved
into the data fork are very similar, and this will force the iomap
layer into probing the extents for actually present data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
16be143373 xfs: make xfs_bmbt_to_iomap more useful
Move checking for invalid zero blocks and setting of various iomap flags
into this helper.  Also make it deal with "raw" delalloc extents to
avoid clutter in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Darrick J. Wong
15baadf72c xfs: fix xfs_buf magic number endian checks
Create a separate magic16 check function so that we don't run afoul of
static checkers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-18 09:38:41 -08:00
Christoph Hellwig
7588cbeec6 xfs: retry COW fork delalloc conversion when no extent was found
While we can only truncate a block under the page lock for the current
page, there is no high-level synchronization for moving extents from the
COW to the data fork.  This means that for example we can have another
thread doing a direct I/O completion that moves extents from the COW to
the data fork race with writeback.  While this race is very hard to hit
the always_cow seems to reproduce it reasonably well, and it also exists
without that.  Because of that there is a chance that a delalloc
conversion for the COW fork might not find any extents to convert.  In
that case we should retry the whole block lookup and now find the blocks
in the data fork.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
19c8e4e258 xfs: remove the truncate short cut in xfs_map_blocks
Now that we properly handle the race with truncate in the delalloc
allocator there is no need to short cut this exceptional case earlier
on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
4ad765edb0 xfs: move xfs_iomap_write_allocate to xfs_aops.c
This function is a small wrapper only used by the writeback code, so
move it together with the writeback code and simplify it down to the
glorified do { } while loop that is now is.

A few bits intentionally got lost here: no need to call xfs_qm_dqattach
because quotas are always attached when we create the delalloc
reservation, and no need for the imap->br_startblock == 0 check given
that xfs_bmapi_convert_delalloc already has a WARN_ON_ONCE for exactly
that condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
125851ac92 xfs: move stat accounting to xfs_bmapi_convert_delalloc
This way we can actually count how many bytes got converted and how many
calls we need, unlike in the caller which doesn't have the detailed
view.

Note that this includes a slight change in behavior as the
xs_xstrat_quick is now bumped for every allocation instead of just the
one covering the requested writeback offset, which makes a lot more
sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
491ce61e93 xfs: move transaction handling to xfs_bmapi_convert_delalloc
No need to deal with the transaction and the inode locking in the
caller. Note that we also switch to passing whichfork as the second
paramter, matching what most related functions do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
d8ae82e394 xfs: split XFS_BMAPI_DELALLOC handling from xfs_bmapi_write
Delalloc conversion has traditionally been part of our function to
allocate blocks on disk (first xfs_bmapi, then xfs_bmapi_write), but
delalloc conversion is a little special as we really do not want
to allocate blocks over holes, for which we don't have reservations.

Split the delalloc conversions into a separate helper to keep the
code simple and structured.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
c8b54673b3 xfs: factor out two helpers from xfs_bmapi_write
We want to be able to reuse them for the upcoming dedidcated delalloc
convert routine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
b101e3342a xfs: simplify the xfs_bmap_btree_to_extents calling conventions
Move boilerplate code from the callers into xfs_bmap_btree_to_extents:

 - exit early without failure if we don't need to convert to the
   extent format
 - assert that we have a btree cursor
 - don't reinitialize the passed in logflags argument

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:53 -08:00
Christoph Hellwig
b4e29032f2 xfs: remove the s_maxbytes checks in xfs_map_blocks
We already ensure all data fits into s_maxbytes in the write / fault
path.  The only reason we have them here is that they were copy and
pasted from xfs_bmapi_read when we stopped using that function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:53 -08:00
Christoph Hellwig
be225fec72 xfs: remove the io_type field from the writeback context and ioend
The io_type field contains what is basically a summary of information
from the inode fork and the imap.  But we can just as easily use that
information directly, simplifying a few bits here and there and
improving the trace points.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:53 -08:00
Jens Axboe
6fb845f0e7 Linux 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
 yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
 A6Ktjw==
 =scY4
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc6' into for-5.1/block

Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.

* tag 'v5.0-rc6': (525 commits)
  Linux 5.0-rc6
  x86/mm: Make set_pmd_at() paravirt aware
  MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
  blk-mq: remove duplicated definition of blk_mq_freeze_queue
  Blk-iolatency: warn on negative inflight IO counter
  blk-iolatency: fix IO hang due to negative inflight counter
  MAINTAINERS: unify reference to xen-devel list
  x86/mm/cpa: Fix set_mce_nospec()
  futex: Handle early deadlock return correctly
  futex: Fix barrier comment
  net: dsa: b53: Fix for failure when irq is not defined in dt
  blktrace: Show requests without sector
  mips: cm: reprime error cause
  mips: loongson64: remove unreachable(), fix loongson_poweroff().
  sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
  geneve: should not call rt6_lookup() when ipv6 was disabled
  KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
  KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
  kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
  signal: Better detection of synchronous signals
  ...
2019-02-15 08:43:59 -07:00
Ming Lei
07173c3ec2 block: enable multipage bvecs
This patch pulls the trigger for multi-page bvecs.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
6dc4f100c1 block: allow bio_for_each_segment_all() to iterate over multi-page bvec
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.

Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Darrick J. Wong
c4a6bf7f6c xfs: don't ever put nlink > 0 inodes on the unlinked list
When XFS creates an O_TMPFILE file, the inode is created with nlink = 1,
put on the unlinked list, and then the VFS sets nlink = 0 in d_tmpfile.
If we crash before anything logs the inode (it's dirty incore but the
vfs doesn't tell us it's dirty so we never log that change), the iunlink
processing part of recovery will then explode with a pile of:

XFS: Assertion failed: VFS_I(ip)->i_nlink == 0, file:
fs/xfs/xfs_log_recover.c, line: 5072

Worse yet, since nlink is nonzero, the inodes also don't get cleaned up
and they just leak until the next xfs_repair run.

Therefore, change xfs_iunlink to require that inodes being put on the
unlinked list have nlink == 0, change the tmpfile callers to instantiate
nodes that way, and set the nlink to 1 just prior to calling d_tmpfile.
Fix the comment for xfs_iunlink while we're at it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-14 22:42:57 -08:00
Darrick J. Wong
15a268d9f2 xfs: reserve blocks for ifree transaction during log recovery
Log recovery frees all the inodes stored in the unlinked list, which can
cause expansion of the free inode btree.  The ifree code skips block
reservations if it thinks there's a per-AG space reservation, but we
don't set up the reservation until after log recovery, which means that
a finobt expansion blows up in xfs_trans_mod_sb when we exceed the
transaction's block reservation.

To fix this, we set the "no finobt reservation" flag to true when we
create the xfs_mount and only set it to false if we confirm that every
AG had enough free space to put aside for the finobt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-02-14 22:42:57 -08:00
Darrick J. Wong
e1f6ca1138 xfs: rename m_inotbt_nores to m_finobt_nores
Rename this flag variable to imply more strongly that it's related to
the free inode btree (finobt) operation.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-02-14 22:42:57 -08:00
Darrick J. Wong
3b50086f0c xfs: don't overflow xattr listent buffer
For VFS listxattr calls, xfs_xattr_put_listent calls
__xfs_xattr_put_listent twice if it sees an attribute
"trusted.SGI_ACL_FILE": once for that name, and again for
"system.posix_acl_access".  Unfortunately, if we happen to run out of
buffer space while emitting the first name, we set count to -1 (so that
we can feed ERANGE to the caller).  The second invocation doesn't check that
the context parameters make sense and overwrites the byte before the
buffer, triggering a KASAN report:

==================================================================
BUG: KASAN: slab-out-of-bounds in strncpy+0xb3/0xd0
Write of size 1 at addr ffff88807fbd317f by task syz/1113

CPU: 3 PID: 1113 Comm: syz Not tainted 5.0.0-rc6-xfsx #rc6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0xcc/0x180
 print_address_description+0x6c/0x23c
 kasan_report.cold.3+0x1c/0x35
 strncpy+0xb3/0xd0
 __xfs_xattr_put_listent+0x1a9/0x2c0 [xfs]
 xfs_attr_list_int_ilocked+0x11af/0x1800 [xfs]
 xfs_attr_list_int+0x20c/0x2e0 [xfs]
 xfs_vn_listxattr+0x225/0x320 [xfs]
 listxattr+0x11f/0x1b0
 path_listxattr+0xbd/0x130
 do_syscall_64+0x139/0x560

While we're at it we add an assert to the other put_listent to avoid
this sort of thing ever happening to the attrlist_by_handle code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-14 09:36:52 -08:00
Brian Foster
670105de15 xfs: compile time offset checks for common v4/v5 metadata
The v5 superblock format added various metadata fields (such as crc,
metadata lsn, owner uuid, etc.) to v4 metadata headers or created
new v5 headers for blocks where no such headers existed on v4. Where
v4 headers did exist, the v5 structures are careful to place v4
metadata at the original location. For example, the magic value is
expected to be at the same location in certain blocks to facilitate
version detection.

While failure of this invariant is likely to cause severe and
obvious problems at runtime, we can detect this condition at compile
time via the more recently added on-disk format check
infrastructure. Since there is no runtime cost, add some offset
checks that start with v5 structure definitions, traverse down to
the first bit of common metadata with v4 and ensure that common
metadata is at the expected offset. Note that we don't care about
blocks which had no v4 header because there is no common metadata in
those cases. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
9228d751eb xfs: use buf ops magic to detect btree block type
Now that we encode block magic numbers in all the buffer ops, use that
for block type detection in the ag header repair code instead of
encoding magics directly in the repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
4260baac62 xfs: add magic numbers to dquot buffer ops
Add dquot magic numbers to the buffer ops type, in case we ever want to
use them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
2bfe7069f7 xfs: add inode magic to inode verifier
Use xfs_verify_magic to check the magic numbers of inodes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Brian Foster
8764f98351 xfs: factor xfs_da3_blkinfo verification into common helper
With the verifier magic value helper in place, we've left a bit more
duplicate code across the verifiers that involve struct
xfs_da3_blkinfo. This includes the da node, xattr leaf and dir leaf
verifiers, all of which perform similar checks for v4 and v5
filesystems.

Create a common helper to verify an xfs_da3_blkinfo structure,
taking care to only access v5 fields where appropriate, and refactor
the aforementioned verifiers to use the helper. No functional
changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
39708c20ab xfs: miscellaneous verifier magic value fixups
Most buffer verifiers have hardcoded magic value checks
conditionalized on the version of the filesystem. The magic value
field of the verifier structure facilitates abstraction of some of
this code. Populate the ->magic field of various verifiers to take
advantage of this abstraction. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
09f420197d xfs: use verifier magic field in dir2 leaf verifiers
The dir2 leaf verifiers share the same underlying structure
verification code, but implement six accessor functions to multiplex
the code across the two verifiers. Further, the magic value isn't
sufficiently abstracted such that the common helper has to manually
fix up the magic from the caller on v5 filesystems.

Use the magic field in the verifier structure to eliminate the
duplicate code and clean this all up. No functional change.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
b8f8980166 xfs: distinguish between bnobt and cntbt magic values
The allocation btree verifiers share code that is unable to detect
cross-tree magic value corruptions such as a bnobt block with a
cntbt magic value. Populate the b_ops->magic field of the associated
verifier structures such that the structure verifier can check the
magic value against the expected value based on tree type.

The btree level check requires knowledge of the tree type to
determine the appropriate maximum value. This was previously part of
the hardcoded magic value checks. With that code removed, peek at
the first magic value in the verifier to determine the expected tree
type of the current block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
27df4f5045 xfs: split up allocation btree verifier
Similar to the inode btree verifier, the same allocation btree
verifier structure is shared between the by-bno (bnobt) and by-size
(cntbt) btrees. This prevents the ability to distinguish magic
values between them. Separate the verifier into two, one for each
tree, and assign them appropriately. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
8473fee340 xfs: distinguish between inobt and finobt magic values
The inode btree verifier code is shared between the inode btree and
free inode btree because the underlying metadata formats are
essentially equivalent. A side effect of this is that the verifier
cannot determine whether a particular btree block should have an
inobt or finobt magic value.

This logic allows an unfortunate xfs_repair bug to escape detection
where certain level > 0 nodes of the finobt are stamped with inobt
magic by xfs_repair finobt reconstruction. This is fortunately not a
severe problem since the inode btree magic values do not contribute
to any changes in kernel behavior, but we do need a means to detect
and prevent this problem in the future.

Add a field to xfs_buf_ops to store the v4 and v5 superblock magic
values expected by a particular verifier. Add a helper to check an
on-disk magic value against the value expected by the verifier. Call
the helper from the shared [f]inobt verifier code for magic value
verification. This ensures that the inode btree blocks each have the
appropriate magic value based on specific tree type and superblock
version.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
01e68f40bf xfs: create a separate finobt verifier
The inobt verifier is reused for the inobt and finobt, which
prevents the ability to distinguish between magic values on a
per-tree basis. Create a separate finobt structure in preparation
for changes to enforce the appropriate magic value for the
associated tree. This patch has no functional change.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
e34d3e74eb xfs: always check magic values in on-disk byte order
Most verifiers that check on-disk magic values convert the CPU
endian magic value constant to disk endian to facilitate compile
time optimization of the byte swap and reduce the need for runtime
byte swaps in buffer verifiers. Several buffer verifiers do not
follow this pattern. Update those verifiers for consistency.

Also fix up a random typo in the inode readahead verifier name.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
75d0230314 xfs: clarify documentation for the function to reverify buffers
Improve the documentation around xfs_buf_ensure_ops, which is the
function that is responsible for cleaning up the b_ops state of buffers
that go through xrep_findroot_block but don't match anything.  Rename
the function to xfs_buf_reverify.

[darrick: this started off as bfoster mods of a previous patch of mine,
but the renaming part is now this separate patch.]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
9b24717979 xfs: cache unlinked pointers in an rhashtable
Use a rhashtable to cache the unlinked list incore.  This should speed
up unlinked processing considerably when there are a lot of inodes on
the unlinked list because iunlink_remove no longer has to traverse an
entire bucket list to find which inode points to the one being removed.

The incore list structure records "X.next_unlinked = Y" relations, with
the rhashtable using Y to index the records.  This makes finding the
inode X that points to a inode Y very quick.  If our cache fails to find
anything we can always fall back on the old method.

FWIW this drastically reduces the amount of time it takes to remove
inodes from the unlinked list.  I wrote a program to open a lot of
O_TMPFILE files and then close them in the same order, which takes
a very long time if we have to traverse the unlinked lists.  With the
ptach, I see:

+ /d/t/tmpfile/tmpfile
Opened 193531 files in 6.33s.
Closed 193531 files in 5.86s

real    0m12.192s
user    0m0.064s
sys     0m11.619s
+ cd /
+ umount /mnt

real    0m0.050s
user    0m0.004s
sys     0m0.030s

And without the patch:

+ /d/t/tmpfile/tmpfile
Opened 193588 files in 6.35s.
Closed 193588 files in 751.61s

real    12m38.853s
user    0m0.084s
sys     12m34.470s
+ cd /
+ umount /mnt

real    0m0.086s
user    0m0.000s
sys     0m0.060s

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
4664c66c91 xfs: add tracepoints for high level iunlink operations
Add tracepoints so we can associate high level operations with low level
updates.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
b1d2a068ea xfs: refactor inode update in iunlink_remove
In xfs_iunlink_remove we have two identical calls to
xfs_iunlink_update_inode, so move it out of the if statement to simplify
the code some more.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
23ffa52cc7 xfs: refactor unlinked list search and mapping to a separate function
There's a loop that searches an unlinked bucket list to find the inode
that points to a given inode.  Hoist this into a separate function;
later we'll use our iunlink backref cache to bypass the slow list
operation.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
f2fc16a3d7 xfs: refactor inode unlinked pointer update functions
Hoist the functions that update an inode's unlinked pointer updates into
a helper.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
86bfd3750f xfs: strengthen AGI unlinked inode bucket pointer checks
Strengthen our checking of the AGI unlinked pointers when we start to
use them for updating the metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
9a4a511864 xfs: refactor AGI unlinked bucket updates
Split the AGI unlinked bucket updates into a separate function.  No
functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
7d36c19538 xfs: add xfs_verify_agino_or_null helper
Add a new helper to check that a per-AG inode pointer is either null or
points somewhere valid within that AG.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
5837f62592 xfs: clean up iunlink functions
Fix some indentation issues with the iunlink functions and reorganize
the tops of the functions to be identical.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-11 16:07:01 -08:00
Brian Foster
c2b3164320 xfs: use the latest extent at writeback delalloc conversion time
The writeback delalloc conversion code is racy with respect to
changes in the currently cached file mapping outside of the current
page. This is because the ilock is cycled between the time the
caller originally looked up the mapping and across each real
allocation of the provided file range. This code has collected
various hacks over the years to help combat the symptoms of these
races (i.e., truncate race detection, allocation into hole
detection, etc.), but none address the fundamental problem that the
imap may not be valid at allocation time.

Rather than continue to use race detection hacks, update writeback
delalloc conversion to a model that explicitly converts the delalloc
extent backing the current file offset being processed. The current
file offset is the only block we can trust to remain once the ilock
is dropped because any operation that can remove the block
(truncate, hole punch, etc.) must flush and discard pagecache pages
first.

Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc()
mechanism to request allocation of the entire delalloc extent
backing the current offset instead of assuming the extent passed by
the caller is unchanged. Record the range specified by the caller
and apply it to the resulting allocated extent so previous checks by
the caller for COW fork overlap are not lost. Finally, overload the
bmapi delalloc flag with the range reval flag behavior since this is
the only use case for both.

This ensures that writeback always picks up the correct
and current extent associated with the page, regardless of races
with other extent modifying operations. If operating on a data fork
and the COW overlap state has changed since the ilock was cycled,
the caller revalidates against the COW fork sequence number before
using the imap for the next block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
627209fbcc xfs: create delalloc bmapi wrapper for full extent allocation
The writeback delalloc conversion code is racy with respect to
changes in the currently cached file mapping. This stems from the
fact that the bmapi allocation code requires a file range to
allocate and the writeback conversion code assumes the range of the
currently cached mapping is still valid with respect to the fork. It
may not be valid, however, because the ilock is cycled (potentially
multiple times) between the time the cached mapping was populated
and the delalloc conversion occurs.

To facilitate a solution to this problem, create a new
xfs_bmapi_delalloc() wrapper to xfs_bmapi_write() that takes a file
(FSB) offset and attempts to allocate whatever delalloc extent backs
the offset. Use a new bmapi flag to cause xfs_bmapi_write() to set
the range based on the extent backing the bno parameter unless bno
lands in a hole. If bno does land in a hole, fall back to the
current behavior (which may result in an error or quietly skipping
holes in the specified range depending on other parameters). This
patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
3b35089807 xfs: remove superfluous writeback mapping eof trimming
Now that the cached writeback mapping is explicitly invalidated on
data fork changes, the EOF trimming band-aid is no longer necessary.
Remove xfs_trim_extent_eof() as well since it has no other users.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
d9252d526b xfs: validate writeback mapping using data fork seq counter
The writeback code caches the current extent mapping across multiple
xfs_do_writepage() calls to avoid repeated lookups for sequential
pages backed by the same extent. This is known to be slightly racy
with extent fork changes in certain difficult to reproduce
scenarios. The cached extent is trimmed to within EOF to help avoid
the most common vector for this problem via speculative
preallocation management, but this is a band-aid that does not
address the fundamental problem.

Now that we have an xfs_ifork sequence counter mechanism used to
facilitate COW writeback, we can use the same mechanism to validate
consistency between the data fork and cached writeback mappings. On
its face, this is somewhat of a big hammer approach because any
change to the data fork invalidates any mapping currently cached by
a writeback in progress regardless of whether the data fork change
overlaps with the range under writeback. In practice, however, the
impact of this approach is minimal in most cases.

First, data fork changes (delayed allocations) caused by sustained
sequential buffered writes are amortized across speculative
preallocations. This means that a cached mapping won't be
invalidated by each buffered write of a common file copy workload,
but rather only on less frequent allocation events. Second, the
extent tree is always entirely in-core so an additional lookup of a
usable extent mostly costs a shared ilock cycle and in-memory tree
lookup. This means that a cached mapping reval is relatively cheap
compared to the I/O itself. Third, spurious invalidations don't
impact ioend construction. This means that even if the same extent
is revalidated multiple times across multiple writepage instances,
we still construct and submit the same size ioend (and bio) if the
blocks are physically contiguous.

Update struct xfs_writepage_ctx with a new field to hold the
sequence number of the data fork associated with the currently
cached mapping. Check the wpc seqno against the data fork when the
mapping is validated and reestablish the mapping whenever the fork
has changed since the mapping was cached. This ensures that
writeback always uses a valid extent mapping and thus prevents lost
writebacks and stale delalloc block problems.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
9f9bc034b8 xfs: update fork seq counter on data fork changes
The sequence counter in the xfs_ifork structure is only updated on
COW forks. This is because the counter is currently only used to
optimize out repetitive COW fork checks at writeback time.

Tweak the extent code to update the seq counter regardless of the
fork type in preparation for using this counter on data forks as
well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:00 -08:00
Marco Benatto
d519da41e2 xfs: Introduce XFS_PTAG_VERIFIER_ERROR panic mask
Currently we have a few PTAGs in place allowing us to transform a filesystem
error in a BUG() call.  However, we don't have a panic tag for corrupt
metadata, so introduce XFS_PTAG_VERIFIER_ERROR so that the administrator can
use the fs.xfs.panic_mask sysctl knob to convert any error detected by buffer
verifiers into a kernel panic.

Signed-off-by: Marco Benatto <mbenatto@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[darrick: light editing of commit message]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:00 -08:00
YueHaibing
e88db81645 xfs: remove duplicated xfs_defer.h
Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:00 -08:00
Darrick J. Wong
654805367d xfs: check attribute name validity
Check extended attribute entry names for invalid characters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:40 -08:00
Darrick J. Wong
e5d7d51b34 xfs: check directory name validity
Check directory entry names for invalid characters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:40 -08:00
Darrick J. Wong
87c9607df2 xfs: fix off-by-one error in rtbitmap cross-reference
Fix an off-by-one error in the realtime bitmap "is used" cross-reference
helper function if the realtime extent size is a single block.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:40 -08:00
Darrick J. Wong
f8c1d7023e xfs: scrub should flag dir/attr offsets that aren't mappable with xfs_dablk_t
Teach scrub to flag extent maps that exceed the range that can be mapped
with a xfs_dablk_t.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:40 -08:00
Darrick J. Wong
3258cb208c xfs: abort xattr scrub if fatal signals are pending
The extended attribute scrubber should abort the "read all attrs" loop
if there's a fatal signal pending on the process.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
f9e63342b8 xfs: consolidate scrub dinode mapping code into a single function
Move all the confusing dinode mapping code that's split between
xchk_iallocbt_check_cluster and xchk_iallocbt_check_cluster_ifree into
the first function so that it's clearer how we find the dinode for a
given inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
4539b8a780 xfs: scrub big block inode btrees correctly
Teach scrub how to handle the case that there are one or more inobt
records covering a given inode cluster.  This fixes the operation on big
block filesystems (e.g. 64k blocks, 512 byte inodes).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
b9454fe056 xfs: clean up the inode cluster checking in the inobt scrub
The code to check inobt records against inode clusters is a mess of
poorly named variables and unnecessary parameters.  Clean the
unnecessary inode number parameters out of _check_cluster_freemask in
favor of computing them inside the function instead of making the caller
do it.  In xchk_iallocbt_check_cluster, rename the variables to make it
more obvious just what chunk_ino and cluster_ino represent.

Add a tracepoint to make it easier to track each inode cluster as we
scrub it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
a1954242fa xfs: hoist inode cluster checks out of loop
Hoist the inode cluster checks out of the inobt record check loop into
a separate function in preparation for refactoring of that loop.  No
functional changes here; that's in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
22234c62f9 xfs: check inobt record alignment on big block filesystems
On a big block filesystem, there may be multiple inobt records covering
a single inode cluster.  These records obviously won't be aligned to
cluster alignment rules, and they must cover the entire cluster.  Teach
scrub to check for these things.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
c050fdfeb5 xfs: check the ir_startino alignment directly
In xchk_iallocbt_rec, check the alignment of ir_startino by converting
the inode cluster block alignment into units of inodes instead of the
other way around (converting ir_startino to blocks).  This prevents us
from tripping over off-by-one errors in ir_startino which are obscured
by the inode -> block conversion.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:38 -08:00
Darrick J. Wong
435dcf0787 xfs: never try to scrub more than 64 inodes per inobt record
Make sure we never check more than XFS_INODES_PER_CHUNK inodes for any
given inobt record since there can be more than one inobt record mapped
to an inode cluster.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:38 -08:00
Darrick J. Wong
add46b3b02 xfs: set buffer ops when repair probes for btree type
In xrep_findroot_block, we work out the btree type and correctness of a
given block by calling different btree verifiers on root block
candidates.  However, we leave the NULL b_ops while ->verify_read
validates the block, which means that if the verifier calls
xfs_buf_verifier_error it'll crash on the null b_ops.  Fix it to set
b_ops before calling the verifier and unsetting it if the verifier
fails.

Furthermore, improve the documentation around xfs_buf_ensure_ops, which
is the function that is responsible for cleaning up the b_ops state of
buffers that go through xrep_findroot_block but don't match anything.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-03 14:03:59 -08:00
Brian Foster
465fa17f4a xfs: end sync buffer I/O properly on shutdown error
As of commit e339dd8d8b ("xfs: use sync buffer I/O for sync delwri
queue submission"), the delwri submission code uses sync buffer I/O
for sync delwri I/O. Instead of waiting on async I/O to unlock the
buffer, it uses the underlying sync I/O completion mechanism.

If delwri buffer submission fails due to a shutdown scenario, an
error is set on the buffer and buffer completion never occurs. This
can cause xfs_buf_delwri_submit() to deadlock waiting on a
completion event.

We could check the error state before waiting on such buffers, but
that doesn't serialize against the case of an error set via a racing
I/O completion. Instead, invoke I/O completion in the shutdown case
regardless of buffer I/O type.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-03 14:03:06 -08:00
Brian Foster
aa6ee4ab69 xfs: eof trim writeback mapping as soon as it is cached
The cached writeback mapping is EOF trimmed to try and avoid races
between post-eof block management and writeback that result in
sending cached data to a stale location. The cached mapping is
currently trimmed on the validation check, which leaves a race
window between the time the mapping is cached and when it is trimmed
against the current inode size.

For example, if a new mapping is cached by delalloc conversion on a
blocksize == page size fs, we could cycle various locks, perform
memory allocations, etc.  in the writeback codepath before the
associated mapping is eventually trimmed to i_size. This leaves
enough time for a post-eof truncate and file append before the
cached mapping is trimmed. The former event essentially invalidates
a range of the cached mapping and the latter bumps the inode size
such the trim on the next writepage event won't trim all of the
invalid blocks. fstest generic/464 reproduces this scenario
occasionally and causes a lost writeback and stale delalloc blocks
warning on inode inactivation.

To work around this problem, trim the cached writeback mapping as
soon as it is cached in addition to on subsequent validation checks.
This is a minor tweak to tighten the race window as much as possible
until a proper invalidation mechanism is available.

Fixes: 40214d128e ("xfs: trim writepage mapping to within eof")
Cc: <stable@vger.kernel.org> # v4.14+
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-03 14:02:49 -08:00
Julia Lawall
90be9b86da xfs: xfs_fsops: drop useless LIST_HEAD
Drop LIST_HEAD where the variable it declares is never used.

Commit 0410c3bb2b ("xfs: factor ag btree root block
initialisation") stopped using buffer_list and started using a
buffer list in an aghdr_init_data structure, but the declaration
of buffer_list was not removed.

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
  ... when != x
// </smpl>

Fixes: 0410c3bb2b ("xfs: factor ag btree root block initialisation")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-29 10:47:58 -08:00
Julia Lawall
89be677b6b xfs: xfs_buf: drop useless LIST_HEAD
Drop LIST_HEAD where the variable it declares has never
been used.

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
  ... when != x
// </smpl>

Fixes: 26f1fe858f ("xfs: reduce lock hold times in buffer writeback")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-29 10:47:34 -08:00
Omar Sandoval
65eed012d1 xfs: reallocate realtime summary cache on growfs
At mount time, we allocate m_rsum_cache with the number of realtime
bitmap blocks. However, xfs_growfs_rt() can increase the number of
realtime bitmap blocks. Using the cache after this happens may access
out of the bounds of the cache. Fix it by reallocating the cache in this
case.

Fixes: 355e353213 ("xfs: cache minimum realtime summary level")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-21 18:45:18 -08:00
Darrick J. Wong
86d163dbfe xfs: stringify scrub types in ftrace output
Use __print_symbolic to print the scrub type in ftrace output.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19 14:02:01 -08:00
Darrick J. Wong
c494213f30 xfs: stringify btree cursor types in ftrace output
Use __print_symbolic to print the btree type in ftrace output.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19 14:02:01 -08:00
Darrick J. Wong
0357d21a6c xfs: move XFS_INODE_FORMAT_STR mappings to libxfs
Move XFS_INODE_FORMAT_STR to libxfs so that we don't forget to keep it
updated, and add necessary TRACE_DEFINE_ENUM.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19 14:02:01 -08:00
Darrick J. Wong
05c753c4cf xfs: move XFS_AG_BTREE_CMP_FORMAT_STR mappings to libxfs
Move XFS_AG_BTREE_CMP_FORMAT_STR to libxfs so that we don't forget to
keep it updated, and TRACE_DEFINE_ENUM the values while we're at it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19 14:02:01 -08:00
Darrick J. Wong
85f8dff00a xfs: fix symbolic enum printing in ftrace output
ftrace's __print_symbolic() has a (very poorly documented) requirement
that any enum values used in the symbol to string translation table be
wrapped in a TRACE_DEFINE_ENUM so that the enum value can be encoded in
the ftrace ring buffer.  Fix this unsatisfied requirement.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19 14:02:01 -08:00
Darrick J. Wong
7af8150f99 xfs: fix function pointer type in ftrace format
Use %pS instead of %pF in ftrace strings so that we record the actual
function address instead of the function descriptor.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19 14:02:00 -08:00
Nick Bowler
a9d25bde1e xfs: Fix x32 ioctls when cmd numbers differ from ia32.
Several ioctl structs change size between native 32-bit (ia32) and x32
applications, because x32 follows the native 64-bit (amd64) integer
alignment rules and uses 64-bit time_t.  In these instances, the ioctl
number changes so userspace simply gets -ENOTTY.  This scenario can be
handled by simply adding more cases.

Looking at the different ioctls implemented here:

- All the ones marked 'No size or alignment issue on any arch' should
  presumably all be fine.

- All the ones under BROKEN_X86_ALIGNMENT are different under integer
  alignment rules.  Since x32 matches amd64 here, we just need both
  sets of cases handled.

- XFS_IOC_SWAPEXT has both integer alignment differences and time_t
  differences.  Since x32 matches amd64 here, we need to add a case
  which calls the native implementation.

- The remaining ioctls have neither 64-bit integers nor time_t, so
  x32 matches ia32 here and no change is required at this level.  The
  bulkstat ioctl implementations have some pointer chasing which is
  handled separately.

Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-18 10:55:21 -08:00
Nick Bowler
7ca860e3c1 xfs: Fix bulkstat compat ioctls on x32 userspace.
The bulkstat family of ioctls are problematic on x32, because there is
a mixup of native 32-bit and 64-bit conventions.  The xfs_fsop_bulkreq
struct contains pointers and 32-bit integers so that matches the native
32-bit layout, and that means the ioctl implementation goes into the
regular compat path on x32.

However, the 'ubuffer' member of that struct in turn refers to either
struct xfs_inogrp or xfs_bstat (or an array of these).  On x32, those
structures match the native 64-bit layout.  The compat implementation
writes out the 32-bit version of these structures.  This is not the
expected format for x32 userspace, causing problems.

Fortunately the functions which actually output these xfs_inogrp and
xfs_bstat structures have an easy way to select which output format
is required, so we just need a little tweak to select the right format
on x32.

Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-18 10:55:20 -08:00
Nick Bowler
c456d64449 xfs: Align compat attrlist_by_handle with native implementation.
While inspecting the ioctl implementations, I noticed that the compat
implementation of XFS_IOC_ATTRLIST_BY_HANDLE does not do exactly the
same thing as the native implementation.  Specifically, the "cursor"
does not appear to be written out to userspace on the compat path,
like it is on the native path.

This adjusts the compat implementation to copy out the cursor just
like the native implementation does.  The attrlist cursor does not
require any special compat handling.  This fixes xfstests xfs/269
on both IA-32 and x32 userspace, when running on an amd64 kernel.

Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Fixes: 0facef7fb0 ("xfs: in _attrlist_by_handle, copy the cursor back to userspace")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-18 10:55:20 -08:00
Darrick J. Wong
64bafd2f1e xfs: require both realtime inodes to mount
Since mkfs always formats the filesystem with the realtime bitmap and
summary inodes immediately after the root directory, we should expect
that both of them are present and loadable, even if there isn't a
realtime volume attached.  There's no reason to skip this if rbmino ==
NULLFSINO; in fact, this causes an immediate crash if the there /is/ a
realtime volume and someone writes to it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-12-13 12:03:45 -08:00
Omar Sandoval
355e353213 xfs: cache minimum realtime summary level
The realtime summary is a two-dimensional array on disk, effectively:

u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]

rsum[log][bbno] is the number of extents of size 2**log which start in
bitmap block bbno.

xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
rsum[log][bbno] != 0 for any log level. However, the summary array is
stored in row-major order (i.e., like an array in C), so all of these
entries are not adjacent, but rather spread across the entire summary
file. In the worst case (a full bitmap block), xfs_rtany_summary() has
to check every level.

This means that on a moderately-used realtime device, an allocation will
waste a lot of time finding, reading, and releasing buffers for the
realtime summary. In particular, one of our storage services (which runs
on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
xfs_trans_brelse() called from xfs_rtany_summary().

One solution would be to also store the summary with the dimensions
swapped. However, this would require a disk format change to a very old
component of XFS.

Instead, we can cache the minimum size which contains any extents. We do
so lazily; rather than guaranteeing that the cache contains the precise
minimum, it always contains a loose lower bound which we tighten when we
read or update a summary block. This only uses a few kilobytes of memory
and is already serialized via the realtime bitmap and summary inode
locks, so the cost is minimal. With this change, the same workload only
spends 0.2% of its CPU cycles in the realtime allocator.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12 08:47:17 -08:00
Darrick J. Wong
2c2d9d3a20 xfs: count inode blocks correctly in inobt scrub
A big block filesystem might require more than one inobt record to cover
all the inodes in the block.  In these cases it is not correct to round
the irec count up to the nearest block because this causes us to
overestimate the number of inode blocks we expect to find.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-12-12 08:47:17 -08:00
Darrick J. Wong
c1b4a321ed xfs: precalculate cluster alignment in inodes and blocks
Store the inode cluster alignment information in units of inodes and
blocks in the mount data so that we don't have to keep recalculating
them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:47:17 -08:00
Darrick J. Wong
83dcdb4469 xfs: precalculate inodes and blocks per inode cluster
Store the number of inodes and blocks per inode cluster in the mount
data so that we don't have to keep recalculating them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:47:17 -08:00
Darrick J. Wong
43004b2a8d xfs: add a block to inode count converter
Add new helpers to convert units of fs blocks into inodes, and AG blocks
into AG inodes, respectively.  Convert all the open-coded conversions
and XFS_OFFBNO_TO_AGINO(, , 0) calls to use them, as appropriate.  The
OFFBNO_TO_AGINO macro is retained for xfs_repair.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:47:16 -08:00
Darrick J. Wong
7280fedaf3 xfs: remove xfs_rmap_ag_owner and friends
Owner information for static fs metadata can be defined readonly at
build time because it never changes across filesystems.  This enables us
to reduce stack usage (particularly in scrub) because we can use the
statically defined oinfo structures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:47:16 -08:00
Darrick J. Wong
66e3237e72 xfs: const-ify xfs_owner_info arguments
Only certain functions actually change the contents of an
xfs_owner_info; the rest can accept a const struct pointer.  This will
enable us to save stack space by hoisting static owner info types to
be const global variables.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:47:16 -08:00
Darrick J. Wong
02b100fb83 xfs: streamline defer op type handling
There's no need to bundle a pointer to the defer op type into the defer
op control structure.  Instead, store the defer op type enum, which
enables us to shorten some of the lines.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:47:16 -08:00
Darrick J. Wong
bc9f2b7c8a xfs: idiotproof defer op type configuration
Recently, we forgot to port a new defer op type to xfsprogs, which
caused us some userspace pain.  Reorganize the way we make libxfs
clients supply defer op type information so that all type information
has to be provided at build time instead of risky runtime dynamic
configuration.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:47:16 -08:00
Dave Chinner
43feeea88c xfs: zero length symlinks are not valid
A log recovery failure has been reproduced where a symlink inode has
a zero length in extent form. It was caused by a shutdown during a
combined fstress+fsmark workload.

The underlying problem is the issue in xfs_inactive_symlink(): the
inode is unlocked between the symlink inactivation/truncation and
the inode being freed. This opens a window for the inode to be
written to disk before it xfs_ifree() removes it from the unlinked
list, marks it free in the inobt and zeros the mode.

For shortform inodes, the fix is simple. xfs_ifree() clears the data
fork state, so there's no need to do it in xfs_inactive_symlink().
This means the shortform fork verifier will not see a zero length
data fork as it mirrors the inode size through to xfs_ifree()), and
hence if the inode gets written back and the fork verifiers are run
they will still see a fork that matches the on-disk inode size.

For extent form (remote) symlinks, it is a little more tricky. Here
we explicitly set the inode size to zero, so the above race can lead
to zero length symlinks on disk. Because the inode is unlinked at
this point (i.e. on the unlinked list) and unreferenced, it can
never be seen again by a user. Hence when we set the inode size to
zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
inodes to be of zero length, and so this avoids all the problems of
zero length symlinks ever hitting the disk. It also avoids the
problem of needing to handle zero length symlink inodes in log
recovery to replay the extent free intents and the remaining
deferops to free the extents the symlink used.

Also add a couple of asserts to warn us if zero length symlinks end
up in either the symlink create or inactivation paths.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12 08:47:15 -08:00
Colin Ian King
8c4ce794ee xfs: clean up indentation issues, remove an unwanted space
There is a statement that has an unwanted space in the indentation.
Remove it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12 08:46:20 -08:00
Pan Bian
fe5ed6c22e xfs: libxfs: move xfs_perag_put late
The function xfs_alloc_get_freelist calls xfs_perag_put to drop the
reference. However, pag->pagf_btreeblks is read and written after the
put operation. This patch moves the put operation later.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
[darrick: minor changelog edits]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12 08:46:20 -08:00
Darrick J. Wong
d6f215f359 xfs: split up the xfs_reflink_end_cow work into smaller transactions
In xfs_reflink_end_cow, we allocate a single transaction for the entire
end_cow operation and then loop the CoW fork mappings to move them to
the data fork.  This design fails on a heavily fragmented filesystem
where an inode's data fork has exactly one more extent than would fit in
an extents-format fork, because the unmap can collapse the data fork
into extents format (freeing the bmbt block) but the remap can expand
the data fork back into a (newly allocated) bmbt block.  If the number
of extents we end up remapping is large, we can overflow the block
reservation because we reserved blocks assuming that we were adding
mappings into an already-cleared area of the data fork.

Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
and the data fork can hold at most 7 extents before needing to convert
to btree format; and that blocks A-P are discontiguous single-block
extents:

   0......7
D: ABCDEFGH
C: IJKLMNOP

When a write to file blocks 0-7 completes, we must remap I-P into the
data fork.  We start by removing H from the btree-format data fork.  Now
we have 7 extents, so we convert the fork to extents format, freeing the
bmbt block.   We then move P into the data fork and it now has 8 extents
again.  We must convert the data fork back to btree format, requiring a
block allocation.  If we repeat this sequence for blocks 6-5-4-3-2-1-0,
we'll need a total of 8 block allocations to remap all 8 blocks.  We
reserved only enough blocks to handle one btree split (5 blocks on a 4k
block filesystem), which means we overflow the block reservation.

To fix this issue, create a separate helper function to remap a single
extent, and change _reflink_end_cow to call it in a tight loop over the
entire range we're completing.  As a side effect this also removes the
size restrictions on how many extents we can end_cow at a time, though
nobody ever hit that.  It is not reasonable to reserve N blocks to remap
N blocks.

Note that this can be reproduced after ~320 million fsx ops while
running generic/938 (long soak directio fsx exerciser):

XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
<machine registers snipped>
Call Trace:
 xfs_trans_dup+0x211/0x250 [xfs]
 xfs_trans_roll+0x6d/0x180 [xfs]
 xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
 xfs_defer_finish_noroll+0xdf/0x740 [xfs]
 xfs_defer_finish+0x13/0x70 [xfs]
 xfs_reflink_end_cow+0x2c6/0x680 [xfs]
 xfs_dio_write_end_io+0x115/0x220 [xfs]
 iomap_dio_complete+0x3f/0x130
 iomap_dio_rw+0x3c3/0x420
 xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
 xfs_file_write_iter+0x8b/0xc0 [xfs]
 __vfs_write+0x193/0x1f0
 vfs_write+0xba/0x1c0
 ksys_write+0x52/0xc0
 do_syscall_64+0x50/0x160
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12 08:46:19 -08:00
Eric Sandeen
7d048df4e9 xfs: fix inverted return from xfs_btree_sblock_verify_crc
xfs_btree_sblock_verify_crc is a bool so should not be returning
a failaddr_t; worse, if xfs_log_check_lsn fails it returns
__this_address which looks like a boolean true (i.e. success)
to the caller.

(interestingly xfs_btree_lblock_verify_crc doesn't have the issue)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-04 08:50:49 -08:00
Darrick J. Wong
a579121f94 xfs: fix PAGE_MASK usage in xfs_free_file_space
In commit e53c4b598, I *tried* to teach xfs to force writeback when we
fzero/fpunch right up to EOF so that if EOF is in the middle of a page,
the post-EOF part of the page gets zeroed before we return to userspace.
Unfortunately, I missed the part where PAGE_MASK is ~(PAGE_SIZE - 1),
which means that we totally fail to zero if we're fpunching and EOF is
within the first page.  Worse yet, the same PAGE_MASK thinko plagues the
filemap_write_and_wait_range call, so we'd initiate writeback of the
entire file, which (mostly) masked the thinko.

Drop the tricky PAGE_MASK and replace it with correct usage of PAGE_SIZE
and the proper rounding macros.

Fixes: e53c4b598 ("xfs: ensure post-EOF zeroing happens after zeroing part of a file")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-12-04 08:50:49 -08:00
Ye Yin
de7243057e fs/xfs: fix f_ffree value for statfs when project quota is set
When project is set, we should use inode limit minus the used count

Signed-off-by: Ye Yin <dbyin@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-26 15:01:37 -08:00
Dave Chinner
9230a0b65b xfs: delalloc -> unwritten COW fork allocation can go wrong
Long saga. There have been days spent following this through dead end
after dead end in multi-GB event traces. This morning, after writing
a trace-cmd wrapper that enabled me to be more selective about XFS
trace points, I discovered that I could get just enough essential
tracepoints enabled that there was a 50:50 chance the fsx config
would fail at ~115k ops. If it didn't fail at op 115547, I stopped
fsx at op 115548 anyway.

That gave me two traces - one where the problem manifested, and one
where it didn't. After refining the traces to have the necessary
information, I found that in the failing case there was a real
extent in the COW fork compared to an unwritten extent in the
working case.

Walking back through the two traces to the point where the CWO fork
extents actually diverged, I found that the bad case had an extra
unwritten extent in it. This is likely because the bug it led me to
had triggered multiple times in those 115k ops, leaving stray
COW extents around. What I saw was a COW delalloc conversion to an
unwritten extent (as they should always be through
xfs_iomap_write_allocate()) resulted in a /written extent/:

xfs_writepage:        dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
xfs_iext_remove:      dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
xfs_bmap_pre_update:  dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex

Basically, Cow fork before:

	0 1            32          52
	+H+DDDDDDDDDDDD+UUUUUUUUUUU+
	   PREV		RIGHT

COW delalloc conversion allocates:

	  1	       32
	  +uuuuuuuuuuuu+
	  NEW

And the result according to the xfs_bmap_post_update trace was:

	0 1            32          52
	+H+wwwwwwwwwwwwwwwwwwwwwwww+
	   PREV

Which is clearly wrong - it should be a merged unwritten extent,
not an unwritten extent.

That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
the bug.

It takes the old delalloc extent (PREV) and adds the length of the
RIGHT extent to it, takes the start block from NEW, removes the
RIGHT extent and then updates PREV with the new extent.

What it fails to do is update PREV.br_state. For delalloc, this is
always XFS_EXT_NORM, while in this case we are converting the
delayed allocation to unwritten, so it needs to be updated to
XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
the resultant extent is always written.

And that's the bug I've been chasing for a week - a bmap btree bug,
not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
with the recent in core extent tree scalability enhancements.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21 10:10:53 -08:00
Dave Chinner
2c307174ab xfs: flush removing page cache in xfs_reflink_remap_prep
On a sub-page block size filesystem, fsx is failing with a data
corruption after a series of operations involving copying a file
with the destination offset beyond EOF of the destination of the file:

8093(157 mod 256): TRUNCATE DOWN        from 0x7a120 to 0x50000 ******WWWW
8094(158 mod 256): INSERT 0x25000 thru 0x25fff  (0x1000 bytes)
8095(159 mod 256): COPY 0x18000 thru 0x1afff    (0x3000 bytes) to 0x2f400
8096(160 mod 256): WRITE    0x5da00 thru 0x651ff        (0x7800 bytes) HOLE
8097(161 mod 256): COPY 0x2000 thru 0x5fff      (0x4000 bytes) to 0x6fc00

The second copy here is beyond EOF, and it is to sub-page (4k) but
block aligned (1k) offset. The clone runs the EOF zeroing, landing
in a pre-existing post-eof delalloc extent. This zeroes the post-eof
extents in the page cache just fine, dirtying the pages correctly.

The problem is that xfs_reflink_remap_prep() now truncates the page
cache over the range that it is copying it to, and rounds that down
to cover the entire start page. This removes the dirty page over the
delalloc extent from the page cache without having written it back.
Hence later, when the page cache is flushed, the page at offset
0x6f000 has not been written back and hence exposes stale data,
which fsx trips over less than 10 operations later.

Fix this by changing xfs_reflink_remap_prep() to use
xfs_flush_unmap_range().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21 10:10:53 -08:00
Dave Chinner
7f9f71be84 xfs: extent shifting doesn't fully invalidate page cache
The extent shifting code uses a flush and invalidate mechainsm prior
to shifting extents around. This is similar to what
xfs_free_file_space() does, but it doesn't take into account things
like page cache vs block size differences, and it will fail if there
is a page that it currently busy.

xfs_flush_unmap_range() handles all of these cases, so just convert
xfs_prepare_shift() to us that mechanism rather than having it's own
special sauce.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-20 10:36:19 -08:00
Dave Chinner
c08768977b xfs: finobt AG reserves don't consider last AG can be a runt
The last AG may be very small comapred to all other AGs, and hence
AG reservations based on the superblock AG size may actually consume
more space than the AG actually has. This results on assert failures
like:

XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
[   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
[   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
[   48.934939]  xfs_mountfs+0x5b3/0x920
[   48.935804]  xfs_fs_fill_super+0x462/0x640
[   48.936784]  ? xfs_test_remount_options+0x60/0x60
[   48.937908]  mount_bdev+0x178/0x1b0
[   48.938751]  mount_fs+0x36/0x170
[   48.939533]  vfs_kern_mount.part.43+0x54/0x130
[   48.940596]  do_mount+0x20e/0xcb0
[   48.941396]  ? memdup_user+0x3e/0x70
[   48.942249]  ksys_mount+0xba/0xd0
[   48.943046]  __x64_sys_mount+0x21/0x30
[   48.943953]  do_syscall_64+0x54/0x170
[   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Hence we need to ensure the finobt per-ag space reservations take
into account the size of the last AG rather than treat it like all
the other full size AGs.

Note that both refcountbt and rmapbt already take the size of the AG
into account via reading the AGF length directly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-20 10:36:11 -08:00
Dave Chinner
d43aaf1685 xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
When retrying a failed inode or dquot buffer,
xfs_buf_resubmit_failed_buffers() clears all the failed flags from
the inde/dquot log items. In doing so, it also drops all the
reference counts on the buffer that the failed log items hold. This
means it can drop all the active references on the buffer and hence
free the buffer before it queues it for write again.

Putting the buffer on the delwri queue takes a reference to the
buffer (so that it hangs around until it has been written and
completed), but this goes bang if the buffer has already been freed.

Hence we need to add the buffer to the delwri queue before we remove
the failed flags from the log items attached to the buffer to ensure
it always remains referenced during the resubmit process.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-20 10:36:01 -08:00
Dave Chinner
d61fa8cbf3 xfs: uncached buffer tracing needs to print bno
Useless:

xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_submit:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_hold:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_iowait:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_iodone:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_iowait_done:  dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_rele:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...

Useful:


xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_submit:       dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_hold:         dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_iowait:       dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_iodone:       dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_iowait_done:  dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_rele:         dev 253:32 bno 0x200b5 nblks 0x1 ...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-20 10:35:05 -08:00
Eric Biggers
da034bcc6a xfs: make xfs_file_remap_range() static
xfs_file_remap_range() is only used in fs/xfs/xfs_file.c, so make it
static.

This addresses a gcc warning when -Wmissing-prototypes is enabled.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-19 13:30:38 -08:00
Brian Foster
59e4293149 xfs: fix shared extent data corruption due to missing cow reservation
Page writeback indirectly handles shared extents via the existence
of overlapping COW fork blocks. If COW fork blocks exist, writeback
always performs the associated copy-on-write regardless if the
underlying blocks are actually shared. If the blocks are shared,
then overlapping COW fork blocks must always exist.

fstests shared/010 reproduces a case where a buffered write occurs
over a shared block without performing the requisite COW fork
reservation.  This ultimately causes writeback to the shared extent
and data corruption that is detected across md5 checks of the
filesystem across a mount cycle.

The problem occurs when a buffered write lands over a shared extent
that crosses an extent size hint boundary and that also happens to
have a partial COW reservation that doesn't cover the start and end
blocks of the data fork extent.

For example, a buffered write occurs across the file offset (in FSB
units) range of [29, 57]. A shared extent exists at blocks [29, 35]
and COW reservation already exists at blocks [32, 34]. After
accommodating a COW extent size hint of 32 blocks and the existing
reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
blocks of reservation at offset 0 and returns with COW reservation
across the range of [0, 34]. The associated data fork extent is
still [29, 35], however, which isn't fully covered by the COW
reservation.

This leads to a buffered write at file offset 35 over a shared
extent without associated COW reservation. Writeback eventually
kicks in, performs an overwrite of the underlying shared block and
causes the associated data corruption.

Update xfs_reflink_reserve_cow() to accommodate the fact that a
delalloc allocation request may not fully cover the extent in the
data fork. Trim the data fork extent appropriately, just as is done
for shared extent boundaries and/or existing COW reservations that
happen to overlap the start of the data fork extent. This prevents
shared/010 failures due to data corruption on reflink enabled
filesystems.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-19 13:30:38 -08:00
Dave Chinner
837514f7a4 xfs: fix overflow in xfs_attr3_leaf_verify
generic/070 on 64k block size filesystems is failing with a verifier
corruption on writeback or an attribute leaf block:

[   94.973083] XFS (pmem0): Metadata corruption detected at xfs_attr3_leaf_verify+0x246/0x260, xfs_attr3_leaf block 0x811480
[   94.975623] XFS (pmem0): Unmount and run xfs_repair
[   94.976720] XFS (pmem0): First 128 bytes of corrupted metadata buffer:
[   94.978270] 000000004b2e7b45: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
[   94.980268] 000000006b1db90b: 00 00 00 00 00 81 14 80 00 00 00 00 00 00 00 00  ................
[   94.982251] 00000000433f2407: 22 7b 5c 82 2d 5c 47 4c bb 31 1c 37 fa a9 ce d6  "{\.-\GL.1.7....
[   94.984157] 0000000010dc7dfb: 00 00 00 00 00 81 04 8a 00 0a 18 e8 dd 94 01 00  ................
[   94.986215] 00000000d5a19229: 00 a0 dc f4 fe 98 01 68 f0 d8 07 e0 00 00 00 00  .......h........
[   94.988171] 00000000521df36c: 0c 2d 32 e2 fe 20 01 00 0c 2d 58 65 fe 0c 01 00  .-2.. ...-Xe....
[   94.990162] 000000008477ae06: 0c 2d 5b 66 fe 8c 01 00 0c 2d 71 35 fe 7c 01 00  .-[f.....-q5.|..
[   94.992139] 00000000a4a6bca6: 0c 2d 72 37 fc d4 01 00 0c 2d d8 b8 f0 90 01 00  .-r7.....-......
[   94.994789] XFS (pmem0): xfs_do_force_shutdown(0x8) called from line 1453 of file fs/xfs/xfs_buf.c. Return address = ffffffff815365f3

This is failing this check:

                end = ichdr.freemap[i].base + ichdr.freemap[i].size;
                if (end < ichdr.freemap[i].base)
>>>>>                   return __this_address;
                if (end > mp->m_attr_geo->blksize)
                        return __this_address;

And from the buffer output above, the freemap array is:

	freemap[0].base = 0x00a0
	freemap[0].size = 0xdcf4	end = 0xdd94
	freemap[1].base = 0xfe98
	freemap[1].size = 0x0168	end = 0x10000
	freemap[2].base = 0xf0d8
	freemap[2].size = 0x07e0	end = 0xf8b8

These all look valid - the block size is 0x10000 and so from the
last check in the above verifier fragment we know that the end
of freemap[1] is valid. The problem is that end is declared as:

	uint16_t	end;

And (uint16_t)0x10000 = 0. So we have a verifier bug here, not a
corruption. Fix the verifier to use uint32_t types for the check and
hence avoid the overflow.

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=201577
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-06 07:50:50 -08:00
Darrick J. Wong
bdec055bb9 xfs: print buffer offsets when dumping corrupt buffers
Use DUMP_PREFIX_OFFSET when printing hex dumps of corrupt buffers
because modern Linux now prints a 32-bit hash of our 64-bit pointer when
using DUMP_PREFIX_ADDRESS:

00000000b4bb4297: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
00000005ec77e26: 00 00 00 00 02 d0 5a 00 00 00 00 00 00 00 00 00  ......Z.........
000000015938018: 21 98 e8 b4 fd de 4c 07 bc ea 3c e5 ae b4 7c 48  !.....L...<...|H

This is totally worthless for a sequential dump since we probably only
care about tracking the buffer offsets and afaik there's no way to
recover the actual pointer from the hashed value.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-11-06 07:50:50 -08:00
Christophe JAILLET
132bf67237 xfs: Fix error code in 'xfs_ioc_getbmap()'
In this function, once 'buf' has been allocated, we unconditionally
return 0.
However, 'error' is set to some error codes in several error handling
paths.
Before commit 232b51948b ("xfs: simplify the xfs_getbmap interface")
this was not an issue because all error paths were returning directly,
but now that some cleanup at the end may be needed, we must propagate the
error code.

Fixes: 232b51948b ("xfs: simplify the xfs_getbmap interface")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-06 07:50:50 -08:00
Linus Torvalds
c2aa1a444c vfs: rework data cloning infrastructure
Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use
 a common .remap_file_range method and supply generic bounds and sanity checking
 functions that are shared with the data write path. The current VFS
 infrastructure has problems with rlimit, LFS file sizes, file time stamps,
 maximum filesystem file sizes, stripping setuid bits, etc and so they are
 addressed in these commits.
 
 We also introduce the ability for the ->remap_file_range methods to return short
 clones so that clones for vfs_copy_file_range() don't get rejected if the entire
 range can't be cloned. It also allows filesystems to sliently skip deduplication
 of partial EOF blocks if they are not capable of doing so without requiring
 errors to be thrown to userspace.
 
 All existing filesystems are converted to user the new .remap_file_range method,
 and both XFS and ocfs2 are modified to make use of the new generic checking
 infrastructure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJb29gEAAoJEK3oKUf0dfodpOAQAL2VbHjvKXEwNMDTKscSRMmZ
 Z0xXo3gamFKQ+VGOqy2g2lmAYQs9SAnTuCGTJ7zIAp7u+q8gzUy5FzKAwLS4Id6L
 8siaY6nzlicfO04d0MdXnWz0f3xykChgzfdQfVUlUi7WrDioBUECLPmx4a+USsp1
 DQGjLOZfoOAmn2rijdnH9RTEaHqg+8mcTaLN9TRav4gGqrWxldFKXw2y6ouFC7uo
 /hxTRNXR9VI+EdbDelwBNXl9nU9gQA0WLOvRKwgUrtv6bSJohTPsmXt7EbBtNcVR
 cl3zDNc1sLD1bLaRLEUAszI/33wXaaQgom1iB51obIcHHef+JxRNG/j6rUMfzxZI
 VaauGv5EIvtaKN0LTAqVVLQ8t2MQFYfOr8TykmO+1UFog204aKRANdVMHDSjxD/0
 dTGKJGcq+HnKQ+JHDbTdvuXEL8sUUl1FiLjOQbZPw63XmuddLKFUA2TOjXn6htbU
 1h1MG5d9KjGLpabp2BQheczD08NuSmcrOBNt7IoeI3+nxr3HpMwprfB9TyaERy9X
 iEgyVXmjjc9bLLRW7A2wm77aW64NvPs51wKMnvuNgNwnCewrGS6cB8WVj2zbQjH1
 h3f3nku44s9ctNPSBzb/sJLnpqmZQ5t0oSmrMSN+5+En6rNTacoJCzxHRJBA7z/h
 Z+C6y1GTZw0euY6Zjiwu
 =CE/A
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull vfs dedup fixes from Dave Chinner:
 "This reworks the vfs data cloning infrastructure.

  We discovered many issues with these interfaces late in the 4.19 cycle
  - the worst of them (data corruption, setuid stripping) were fixed for
  XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all
  the problems was needed. That rework is the contents of this pull
  request.

  Rework the vfs_clone_file_range and vfs_dedupe_file_range
  infrastructure to use a common .remap_file_range method and supply
  generic bounds and sanity checking functions that are shared with the
  data write path. The current VFS infrastructure has problems with
  rlimit, LFS file sizes, file time stamps, maximum filesystem file
  sizes, stripping setuid bits, etc and so they are addressed in these
  commits.

  We also introduce the ability for the ->remap_file_range methods to
  return short clones so that clones for vfs_copy_file_range() don't get
  rejected if the entire range can't be cloned. It also allows
  filesystems to sliently skip deduplication of partial EOF blocks if
  they are not capable of doing so without requiring errors to be thrown
  to userspace.

  Existing filesystems are converted to user the new remap_file_range
  method, and both XFS and ocfs2 are modified to make use of the new
  generic checking infrastructure"

* tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
  xfs: remove [cm]time update from reflink calls
  xfs: remove xfs_reflink_remap_range
  xfs: remove redundant remap partial EOF block checks
  xfs: support returning partial reflink results
  xfs: clean up xfs_reflink_remap_blocks call site
  xfs: fix pagecache truncation prior to reflink
  ocfs2: remove ocfs2_reflink_remap_range
  ocfs2: support partial clone range and dedupe range
  ocfs2: fix pagecache truncation prior to reflink
  ocfs2: truncate page cache for clone destination file before remapping
  vfs: clean up generic_remap_file_range_prep return value
  vfs: hide file range comparison function
  vfs: enable remap callers that can handle short operations
  vfs: plumb remap flags through the vfs dedupe functions
  vfs: plumb remap flags through the vfs clone functions
  vfs: make remap_file_range functions take and return bytes completed
  vfs: remap helper should update destination inode metadata
  vfs: pass remap flags to generic_remap_checks
  vfs: pass remap flags to generic_remap_file_range_prep
  vfs: combine the clone and dedupe into a single remap_file_range
  ...
2018-11-02 09:33:08 -07:00
Darrick J. Wong
bf4a1fcf0b xfs: remove [cm]time update from reflink calls
Now that the vfs remap helper dirties the inode [cm]time for us, xfs no
longer needs to do that on its own.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:47:48 +11:00
Darrick J. Wong
3fc9f5e409 xfs: remove xfs_reflink_remap_range
Since xfs_file_remap_range is a thin wrapper, move the contents of
xfs_reflink_remap_range into the shell.  This cuts down on the vfs
calls being made from internal xfs code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:47:26 +11:00
Darrick J. Wong
7a6ccf004e xfs: remove redundant remap partial EOF block checks
Now that we've moved the partial EOF block checks to the VFS helpers, we
can remove the redundant functionality from XFS.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:47:16 +11:00
Darrick J. Wong
3f68c1f562 xfs: support returning partial reflink results
Back when the XFS reflink code only supported clone_file_range, we were
only able to return zero or negative error codes to userspace.  However,
now that copy_file_range (which returns bytes copied) can use XFS'
clone_file_range, we have the opportunity to return partial results.
For example, if userspace sends a 1GB clone request and we run out of
space halfway through, we at least can tell userspace that we completed
512M of that request like a regular write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:47:06 +11:00
Darrick J. Wong
9f04aaffdd xfs: clean up xfs_reflink_remap_blocks call site
Move the offset <-> blocks unit conversions into
xfs_reflink_remap_blocks to make the call site less ugly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:46:50 +11:00
Darrick J. Wong
4918ef4ea0 xfs: fix pagecache truncation prior to reflink
Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache.  Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them.  So, round the start offset
down and the end offset up to page boundaries.  We already wrote all
the dirty data so the larger range shouldn't be a problem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:46:33 +11:00
Darrick J. Wong
8c5c836bd6 vfs: clean up generic_remap_file_range_prep return value
Since the remap prep function can update the length of the remap
request, we can change this function to return the usual return status
instead of the odd behavior it has now.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:42:24 +11:00
Darrick J. Wong
42ec3d4c02 vfs: make remap_file_range functions take and return bytes completed
Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on.  This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value.  For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.

Neither clone ioctl can take advantage of this, alas.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:49 +11:00
Darrick J. Wong
8dde90bca6 vfs: remap helper should update destination inode metadata
Extend generic_remap_file_range_prep to handle inode metadata updates
when remapping into a file.  If the operation can possibly alter the
file contents, we must update the ctime and mtime and remove security
privileges, just like we do for regular file writes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:41 +11:00
Darrick J. Wong
a91ae49bba vfs: pass remap flags to generic_remap_file_range_prep
Plumb the remap flags through the filesystem from the vfs function
dispatcher all the way to the prep function to prepare for behavior
changes in subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:28 +11:00
Darrick J. Wong
2e5dfc99f2 vfs: combine the clone and dedupe into a single remap_file_range
Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation.  The differences between the two can
be made in the prep functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:21 +11:00
Darrick J. Wong
a83ab01a62 vfs: rename vfs_clone_file_prep to be more descriptive
The vfs_clone_file_prep is a generic function to be called by filesystem
implementations only.  Rename the prefix to generic_ and make it more
clear that it applies to remap operations, not just clones.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:08 +11:00
Darrick J. Wong
1383a7ed67 vfs: check file ranges before cloning files
Move the file range checks from vfs_clone_file_prep into a separate
generic_remap_checks function so that all the checks are collected in a
central location.  This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:40:31 +11:00
Christoph Hellwig
96987eea53 xfs: cancel COW blocks before swapext
We need to make sure we have no outstanding COW blocks before we swap
extents, as there is nothing preventing us from having preallocated COW
delalloc on either inode that swapext is called on.  That case can
easily be reproduced by running generic/324 in always_cow mode:

[  620.760572] XFS: Assertion failed: tip->i_delayed_blks == 0, file: fs/xfs/xfs_bmap_util.c, line: 1669
[  620.761608] ------------[ cut here ]------------
[  620.762171] kernel BUG at fs/xfs/xfs_message.c:102!
[  620.762732] invalid opcode: 0000 [#1] SMP PTI
[  620.763272] CPU: 0 PID: 24153 Comm: xfs_fsr Tainted: G        W         4.19.0-rc1+ #4182
[  620.764203] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
[  620.765202] RIP: 0010:assfail+0x20/0x28
[  620.765646] Code: 31 ff e8 83 fc ff ff 0f 0b c3 48 89 f1 41 89 d0 48 c7 c6 48 ca 8d 82 48 89 fa 38
[  620.767758] RSP: 0018:ffffc9000898bc10 EFLAGS: 00010202
[  620.768359] RAX: 0000000000000000 RBX: ffff88012f14ba40 RCX: 0000000000000000
[  620.769174] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828560d9
[  620.769982] RBP: ffff88012f14b300 R08: 0000000000000000 R09: 0000000000000000
[  620.770788] R10: 000000000000000a R11: f000000000000000 R12: ffffc9000898bc98
[  620.771638] R13: ffffc9000898bc9c R14: ffff880130b5e2b8 R15: ffff88012a1fa2a8
[  620.772504] FS:  00007fdc36e0fbc0(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000
[  620.773475] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  620.774168] CR2: 00007fdc3604d000 CR3: 0000000132afc000 CR4: 00000000000006f0
[  620.774978] Call Trace:
[  620.775274]  xfs_swap_extent_forks+0x2a0/0x2e0
[  620.775792]  xfs_swap_extents+0x38b/0xab0
[  620.776256]  xfs_ioc_swapext+0x121/0x140
[  620.776709]  xfs_file_ioctl+0x328/0xc90
[  620.777154]  ? rcu_read_lock_sched_held+0x50/0x60
[  620.777694]  ? xfs_iunlock+0x233/0x260
[  620.778127]  ? xfs_setattr_nonsize+0x3be/0x6a0
[  620.778647]  do_vfs_ioctl+0x9d/0x680
[  620.779071]  ? ksys_fchown+0x47/0x80
[  620.779552]  ksys_ioctl+0x35/0x70
[  620.780040]  __x64_sys_ioctl+0x11/0x20
[  620.780530]  do_syscall_64+0x4b/0x190
[  620.780927]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  620.781467] RIP: 0033:0x7fdc364d0f07
[  620.781900] Code: b3 66 90 48 8b 05 81 5f 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 28
[  620.784044] RSP: 002b:00007ffe2a766038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  620.784896] RAX: ffffffffffffffda RBX: 0000000000000025 RCX: 00007fdc364d0f07
[  620.785667] RDX: 0000560296ca2fc0 RSI: 00000000c0c0586d RDI: 0000000000000005
[  620.786398] RBP: 0000000000000025 R08: 0000000000001200 R09: 0000000000000000
[  620.787283] R10: 0000000000000432 R11: 0000000000000246 R12: 0000000000000005
[  620.788051] R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000006
[  620.788927] Modules linked in:
[  620.789340] ---[ end trace 9503b7417ffdbdb0 ]---
[  620.790065] RIP: 0010:assfail+0x20/0x28
[  620.790642] Code: 31 ff e8 83 fc ff ff 0f 0b c3 48 89 f1 41 89 d0 48 c7 c6 48 ca 8d 82 48 89 fa 38
[  620.793038] RSP: 0018:ffffc9000898bc10 EFLAGS: 00010202
[  620.793609] RAX: 0000000000000000 RBX: ffff88012f14ba40 RCX: 0000000000000000
[  620.794317] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828560d9
[  620.795025] RBP: ffff88012f14b300 R08: 0000000000000000 R09: 0000000000000000
[  620.795778] R10: 000000000000000a R11: f000000000000000 R12: ffffc9000898bc98
[  620.796675] R13: ffffc9000898bc9c R14: ffff880130b5e2b8 R15: ffff88012a1fa2a8
[  620.797782] FS:  00007fdc36e0fbc0(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000
[  620.798908] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  620.799594] CR2: 00007fdc3604d000 CR3: 0000000132afc000 CR4: 00000000000006f0
[  620.800424] Kernel panic - not syncing: Fatal exception
[  620.801191] Kernel Offset: disabled
[  620.801597] ---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:55 +11:00
Brian Foster
efc3289cf8 xfs: clear ail delwri queued bufs on unmount of shutdown fs
In the typical unmount case, the AIL is forced out by the unmount
sequence before the xfsaild task is stopped. Since AIL items are
removed on writeback completion, this means that the AIL
->ail_buf_list delwri queue has been drained. This is not always
true in the shutdown case, however.

It's possible for buffers to sit on a delwri queue for a period of
time across submission attempts if said items are locked or have
been relogged and pinned since first added to the queue. If the
attempt to log such an item results in a log I/O error, the error
processing can shutdown the fs, remove the item from the AIL, stale
the buffer (dropping the LRU reference) and clear its delwri queue
state. The latter bit means the buffer will be released from a
delwri queue on the next submission attempt, but this might never
occur if the filesystem has shutdown and the AIL is empty.

This means that such buffers are held indefinitely by the AIL delwri
queue across destruction of the AIL. Aside from being a memory leak,
these buffers can also hold references to in-core perag structures.
The latter problem manifests as a generic/475 failure, reproducing
the following asserts at unmount time:

  XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0,
	file: fs/xfs/xfs_mount.c, line: 151
  XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0,
	file: fs/xfs/xfs_mount.c, line: 132

To prevent this problem, clear the AIL delwri queue as a final step
before xfsaild() exit. The !empty state should never occur in the
normal case, so add an assert to catch unexpected problems going
forward.

[dgc: add comment explaining need for xfs_buf_delwri_cancel() after
 calling xfs_buf_delwri_submit_nowait().]

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:49 +11:00
Carlos Maiolino
26ca39015e xfs: use offsetof() in place of offset macros for __xfsstats
Most offset macro mess is used in xfs_stats_format() only, and we can
simply get the right offsets using offsetof(), instead of several macros
to mark the offsets inside __xfsstats structure.

Replace all XFSSTAT_END_* macros by a single helper macro to get the
right offset into __xfsstats, and use this helper in xfs_stats_format()
directly.

The quota stats code, still looks a bit cleaner when using XFSSTAT_*
macros, so, this patch also defines XFSSTAT_START_XQMSTAT and
XFSSTAT_END_XQMSTAT locally to that code. This also should prevent
offset mistakes when updates are done into __xfsstats.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:39 +11:00
Carlos Maiolino
41657e5507 xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat
The addition of FIBT, RMAP and REFCOUNT changed the offsets into
__xfssats structure.

This caused xqmstat_proc_show() to display garbage data via
/proc/fs/xfs/xqmstat, once it relies on the offsets marked via macros.

Fix it.

Fixes: 00f4e4f9 xfs: add rmap btree stats infrastructure
Fixes: aafc3c24 xfs: support the XFS_BTNUM_FINOBT free inode btree type
Fixes: 46eeb521 xfs: introduce refcount btree definitions
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:34 +11:00
Dave Chinner
37fd167824 xfs: fix use-after-free race in xfs_buf_rele
When looking at a 4.18 based KASAN use after free report, I noticed
that racing xfs_buf_rele() may race on dropping the last reference
to the buffer and taking the buffer lock. This was the symptom
displayed by the KASAN report, but the actual issue that was
reported had already been fixed in 4.19-rc1 by commit e339dd8d8b
("xfs: use sync buffer I/O for sync delwri queue submission").

Despite this, I think there is still an issue with xfs_buf_rele()
in this code:

        release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
        spin_lock(&bp->b_lock);
        if (!release) {
.....

If two threads race on the b_lock after both dropping a reference
and one getting dropping the last reference so release = true, we
end up with:

CPU 0				CPU 1
atomic_dec_and_lock()
				atomic_dec_and_lock()
				spin_lock(&bp->b_lock)
spin_lock(&bp->b_lock)
<spins>
				<release = true bp->b_lru_ref = 0>
				<remove from lists>
				freebuf = true
				spin_unlock(&bp->b_lock)
				xfs_buf_free(bp)
<gets lock, reading and writing freed memory>
<accesses freed memory>
spin_unlock(&bp->b_lock) <reads/writes freed memory>

IOWs, we can't safely take bp->b_lock after dropping the hold
reference because the buffer may go away at any time after we
drop that reference. However, this can be fixed simply by taking the
bp->b_lock before we drop the reference.

It is safe to nest the pag_buf_lock inside bp->b_lock as the
pag_buf_lock is only used to serialise against lookup in
xfs_buf_find() and no other locks are held over or under the
pag_buf_lock there. Make this clear by documenting the buffer lock
orders at the top of the file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:29 +11:00
Allison Henderson
068f985a9e xfs: Add attibute remove and helper functions
This patch adds xfs_attr_remove_args. These sub-routines remove
the attributes specified in @args. We will use this later for setting
parent pointers as a deferred attribute operation.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:23 +11:00
Allison Henderson
2f3cd80919 xfs: Add attibute set and helper functions
This patch adds xfs_attr_set_args and xfs_bmap_set_attrforkoff.
These sub-routines set the attributes specified in @args.
We will use this later for setting parent pointers as a deferred
attribute operation.

[dgc: remove attr fork init code from xfs_attr_set_args().]
[dgc: xfs_attr_try_sf_addname() NULLs args.trans after commit.]
[dgc: correct sf add error handling.]

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:16 +11:00
Allison Henderson
4c74a56b9d xfs: Add helper function xfs_attr_try_sf_addname
This patch adds a subroutine xfs_attr_try_sf_addname
used by xfs_attr_set.  This subrotine will attempt to
add the attribute name specified in args in shortform,
as well and perform error handling previously done in
xfs_attr_set.

This patch helps to pre-simplify xfs_attr_set for reviewing
purposes and reduce indentation.  New function will be added
in the next patch.

[dgc: moved commit to helper function, too.]

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:50 +11:00
Allison Henderson
e2421f0b5f xfs: Move fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
This patch moves fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
since xfs_attr.c is in libxfs.  We will need these later in
xfsprogs.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:45 +11:00
Dave Chinner
56668a5cc4 xfs: issue log message on user force shutdown
The kernel only issues a log message that it's been shut down when
the filesystem triggers a shutdown itself. Hence there is no trace
in the log when a shutdown is triggered manually from userspace.
This can make it hard to see sequence of events in the log when
things go wrong, so make sure we always log a message when a
shutdown is run.

While there, clean up the logic flow so we don't have to continually
check if the shutdown trigger was user initiated before logging
shutdown messages.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:39 +11:00
Darrick J. Wong
38b6238eb6 xfs: fix buffer state management in xrep_findroot_block
We don't handle buffer state properly in online repair's findroot
routine.  If a buffer already has b_ops set, we don't ever want to touch
that, and we don't want to call the read verifiers on a buffer that
could be dirty (CRCs are only recomputed during log checkpoints).

Therefore, be more careful about what we do with a buffer -- if someone
else already attached ops that are not the ones for this btree type,
just ignore the buffer.  We only attach our btree type's buf ops if it
matches the magic/uuid and structure checks.

We also modify xfs_buf_read_map to allow callers to set buffer ops on a
DONE buffer with NULL ops so that repair doesn't leave behind buffers
which won't have buffers attached to them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:35 +11:00
Darrick J. Wong
1aff5696f3 xfs: always assign buffer verifiers when one is provided
If a caller supplies buffer ops when trying to read a buffer and the
buffer doesn't already have buf ops assigned, ensure that the ops are
assigned to the buffer and the verifier is run on that buffer.

Note that current XFS code is careful to assign buffer ops after a
xfs_{trans_,}buf_read call in which ops were not supplied.  However, we
should apply ops defensively in case there is ever a coding mistake; and
an upcoming repair patch will need to be able to read a buffer without
assigning buf ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:30 +11:00
Darrick J. Wong
1002ff45ef xfs: xrep_findroot_block should reject root blocks with siblings
In xrep_findroot_block, if we find a candidate root block with sibling
pointers or sibling blocks on the same tree level, we should not return
that block as a tree root because root blocks cannot have siblings.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:26 +11:00
Adam Borowski
dddde68b8f xfs: add a define for statfs magic to uapi
Needed by userspace programs that call fstatfs().

It'd be natural to publish XFS_SB_MAGIC in uapi, but while these two
have identical values, they have different semantic meaning: one is
an enum cookie meant for statfs, the other a signature of the
on-disk format.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:19 +11:00
Christoph Hellwig
4831822ff1 xfs: print dangling delalloc extents
Instead of just asserting that we have no delalloc space dangling
in an inode that gets freed print the actual offenders for debug
mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:11 +11:00
Christoph Hellwig
032dc923b2 xfs: fix fork selection in xfs_find_trim_cow_extent
We should want to write directly into the data fork for blocks that don't
have an extent in the COW fork covering them yet.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:19:58 +11:00
Christoph Hellwig
d392bc81bb xfs: remove the unused trimmed argument from xfs_reflink_trim_around_shared
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:19:48 +11:00
Christoph Hellwig
fc439464e3 xfs: remove the unused shared argument to xfs_reflink_reserve_cow
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:19:37 +11:00
Christoph Hellwig
0365c5d6c3 xfs: handle zeroing in xfs_file_iomap_begin_delay
We only need to allocate blocks for zeroing for reflink inodes,
and for we currently have a special case for reflink files in
the otherwise direct I/O path that I'd like to get rid of.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:19:26 +11:00
Christoph Hellwig
daa79baefc xfs: remove suport for filesystems without unwritten extent flag
The option to enable unwritten extents was made default in 2003,
removed from mkfs in 2007, and cannot be disabled in v5.  We also
rely on it for a lot of common functionality, so filesystems without
it will run a completely untested and buggy code path.  Enabling the
support also is a simple bit flip using xfs_db, so legacy file
systems can still be brought forward.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:18:58 +11:00
Christoph Hellwig
97e5a6e6dc xfs: remove XFS_IO_INVALID
The invalid state isn't any different from a hole, so merge the two
states.  Use the more descriptive hole name, but keep it as the first
value of the enum to catch uninitialized fields.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:17:50 +11:00
Dave Chinner
b39989009b xfs: fix data corruption w/ unaligned reflink ranges
When reflinking sub-file ranges, a data corruption can occur when
the source file range includes a partial EOF block. This shares the
unknown data beyond EOF into the second file at a position inside
EOF, exposing stale data in the second file.

XFS only supports whole block sharing, but we still need to
support whole file reflink correctly.  Hence if the reflink
request includes the last block of the souce file, only proceed with
the reflink operation if it lands at or past the destination file's
current EOF. If it lands within the destination file EOF, reject the
entire request with -EINVAL and make the caller go the hard way.

This avoids the data corruption vector, but also avoids disruption
of returning EINVAL to userspace for the common case of whole file
cloning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-06 11:44:39 +10:00
Dave Chinner
dceeb47b0e xfs: fix data corruption w/ unaligned dedupe ranges
A deduplication data corruption is Exposed by fstests generic/505 on
XFS. It is caused by extending the block match range to include the
partial EOF block, but then allowing unknown data beyond EOF to be
considered a "match" to data in the destination file because the
comparison is only made to the end of the source file. This corrupts
the destination file when the source extent is shared with it.

XFS only supports whole block dedupe, but we still need to appear to
support whole file dedupe correctly.  Hence if the dedupe request
includes the last block of the souce file, don't include it in the
actual XFS dedupe operation. If the rest of the range dedupes
successfully, then report the partial last block as deduped, too, so
that userspace sees it as a successful dedupe rather than return
EINVAL because we can't dedupe unaligned blocks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-06 11:44:19 +10:00
Darrick J. Wong
7debbf015f xfs: update ctime and remove suid before cloning files
Before cloning into a file, update the ctime and remove sensitive
attributes like suid, just like we'd do for a regular file write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-05 19:05:41 +10:00
Darrick J. Wong
410fdc72b0 xfs: zero posteof blocks when cloning above eof
When we're reflinking between two files and the destination file range
is well beyond the destination file's EOF marker, zero any posteof
speculative preallocations in the destination file so that we don't
expose stale disk contents.  The previous strategy of trying to clear
the preallocations does not work if the destination file has the
PREALLOC flag set.

Uncovered by shared/010.

Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-05 19:04:27 +10:00
Darrick J. Wong
0d41e1d28c xfs: refactor clonerange preparation into a separate helper
Refactor all the reflink preparation steps into a separate helper
that we'll use to land all the upcoming fixes for insufficient input
checks.

This rework also moves the invalidation of the destination range to
the prep function so that it is done before the range is remapped.
This ensures that nobody can access the data in range being remapped
until the remap is complete.

[dgc: fix xfs_reflink_remap_prep() return value and caller check to
handle vfs_clone_file_prep_inodes() returning 0 to mean "nothing to
do". ]

[dgc: make sure length changed by vfs_clone_file_prep_inodes() gets
propagated back to XFS code that does the remapping. ]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-05 19:04:22 +10:00
Dave Chinner
e55ec4ddbe xfs: fix error handling in xfs_bmap_extents_to_btree
Commit 01239d77b9 ("xfs: fix a null pointer dereference in
xfs_bmap_extents_to_btree") attempted to fix a null pointer
dreference when a fuzzing corruption of some kind was found.
This fix was flawed, resulting in assert failures like:

XFS: Assertion failed: ifp->if_broot == NULL, file: fs/xfs/libxfs/xfs_bmap.c, line: 715
.....
Call Trace:
  xfs_bmap_extents_to_btree+0x6b9/0x7b0
  __xfs_bunmapi+0xae7/0xf00
  ? xfs_log_reserve+0x1c8/0x290
  xfs_reflink_remap_extent+0x20b/0x620
  xfs_reflink_remap_blocks+0x7e/0x290
  xfs_reflink_remap_range+0x311/0x530
  vfs_dedupe_file_range_one+0xd7/0xe0
  vfs_dedupe_file_range+0x15b/0x1a0
  do_vfs_ioctl+0x267/0x6c0

The problem is that the error handling code now asserts that the
inode fork is not in btree format before the error handling code
undoes the modifications that put the fork back in extent format.
Fix this by moving the assert back to after the xfs_iroot_realloc()
call that returns the fork to extent format, and clean up the jump
labels to be meaningful.

Also, returning ENOSPC when xfs_btree_get_bufl() fails to
instantiate the buffer that was allocated (the actual fix in the
commit mentioned above) is incorrect. This is a fatal error - only
an invalid block address or a filesystem shutdown can result in
failing to get a buffer here.

Hence change this to EFSCORRUPTED so that the higher layer knows
this was a corruption related failure and should not treat it as an
ENOSPC error.  This should result in a shutdown (via cancelling a
dirty transaction) which is necessary as we do not attempt to clean
up the (invalid) block that we have already allocated.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-01 08:11:07 +10:00
Brian Foster
ec2ed0b5e9 xfs: remove invalid log recovery first/last cycle check
One of the first steps of log recovery is to check for the special
case of a zeroed log. If the first cycle in the log is zero or the
tail portion of the log is zeroed, the head is set to the first
instance of cycle 0. xlog_find_zeroed() includes a sanity check that
enforces that the first cycle in the log must be 1 if the last cycle
is 0. While this is true in most cases, the check is not totally
valid because it doesn't consider the case where the filesystem
crashed after a partial/out of order log buffer completion that
wraps around the end of the physical log.

For example, consider a filesystem that has completed most of the
first cycle of the log, reaches the end of the physical log and
splits the next single log buffer write into two in order to wrap
around the end of the log. If these I/Os are reordered, the second
(wrapped) I/O completes and the first happens to fail, the log is
left in a state where the last cycle of the log is 0 and the first
cycle is 2. This causes the xlog_find_zeroed() sanity check to fail
and prevents the filesystem from mounting. This situation has been
reproduced on particular systems via repeated runs of generic/475.

This is an expected state that log recovery already knows how to
deal with, however. Since the log is still partially zeroed, the
head is detected correctly and points to a valid tail. The
subsequent stale block detection clears blocks beyond the head up to
the tail (within a maximum range), with the express purpose of
clearing such out of order writes. As expected, this removes the out
of order cycle 2 blocks at the physical start of the log.

In other words, the only thing that prevents a clean mount and
recovery of the filesystem in this scenario is the specific (last ==
0 && first != 1) sanity check in xlog_find_zeroed(). Since the log
head/tail are now independently validated via cycle, log record and
CRC checks, this highly specific first cycle check is of dubious
value. Remove it and rely on the higher level validation to
determine whether log content is sane and recoverable.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:50:41 +10:00
Eric Sandeen
339e1a3fcd xfs: validate inode di_forkoff
Verify the inode di_forkoff, lifted from xfs_repair's
process_check_inode_forkoff().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:50:13 +10:00
Christoph Hellwig
f5f3f959b7 xfs: skip delalloc COW blocks in xfs_reflink_end_cow
The iomap direct I/O code issues a single ->end_io call for the whole
I/O request, and if some of the extents cowered needed a COW operation
it will call xfs_reflink_end_cow over the whole range.

When we do AIO writes we drop the iolock after doing the initial setup,
but before the I/O completion.  Between dropping the lock and completing
the I/O we can have a racing buffered write create new delalloc COW fork
extents in the region covered by the outstanding direct I/O write, and
thus see delalloc COW fork extents in xfs_reflink_end_cow.  As
concurrent writes are fundamentally racy and no guarantees are given we
can simply skip those.

This can be easily reproduced with xfstests generic/208 in always_cow
mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:49:58 +10:00
Eric Sandeen
f369a13cea xfs: don't treat unknown di_flags2 as corruption in scrub
xchk_inode_flags2() currently treats any di_flags2 values that the
running kernel doesn't recognize as corruption, and calls
xchk_ino_set_corrupt() if they are set.  However, it's entirely possible
that these flags were set in some newer kernel and are quite valid,
but ignored in this kernel.

(Validators don't care one bit about unknown di_flags2.)

Call xchk_ino_set_warning instead, because this may or may not actually
indicate a problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:49:00 +10:00
YueHaibing
2863c2ebc4 xfs: remove duplicated include from alloc.c
Remove duplicated include xfs_alloc.h

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:48:21 +10:00
Christoph Hellwig
0065b54119 xfs: don't bring in extents in xfs_bmap_punch_delalloc_range
This function is only used to punch out delayed allocations on I/O
failure, which means we need to have read the extents earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:47:46 +10:00
Dave Chinner
df30707791 xfs: fix transaction leak in xfs_reflink_allocate_cow()
When xfs_reflink_allocate_cow() allocates a transaction, it drops
the ILOCK to perform the operation. This Introduces a race condition
where another thread modifying the file can perform the COW
allocation operation underneath us. This result in the retry loop
finding an allocated block and jumping straight to the conversion
code. It does not, however, cancel the transaction it holds and so
this gets leaked. This results in a lockdep warning:

================================================
WARNING: lock held when returning to user space!
4.18.5 #1 Not tainted
------------------------------------------------
worker/6123 is leaving the kernel with locks still held!
1 lock held by worker/6123:
 #0: 000000009eab4f1b (sb_internal#2){.+.+}, at: xfs_trans_alloc+0x17c/0x220

And eventually the filesystem deadlocks because it runs out of log
space that is reserved by the leaked transaction and never gets
released.

The logic flow in xfs_reflink_allocate_cow() is a convoluted mess of
gotos - it's no surprise that it has bug where the flow through
several goto jumps then fails to clean up context from a non-obvious
logic path. CLean up the logic flow and make sure every path does
the right thing.

Reported-by: Alexander Y. Fomichev <git.user@gmail.com>
Tested-by: Alexander Y. Fomichev <git.user@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200981
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: slight refactor]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:47:15 +10:00
Dave Chinner
8683edb775 xfs: avoid lockdep false positives in xfs_trans_alloc
We've had a few reports of lockdep tripping over memory reclaim
context vs filesystem freeze "deadlocks". They all have looked
to be false positives on analysis, but it seems that they are
being tripped because we take freeze references before we run
a GFP_KERNEL allocation for the struct xfs_trans.

We can avoid this false positive vector just by re-ordering the
operations in xfs_trans_alloc(). That is. we need allocate the
structure before we take the freeze reference and enter the GFP_NOFS
allocation context that follows the xfs_trans around. This prevents
lockdep from seeing the GFP_KERNEL allocation inside the transaction
context, and that prevents it from triggering the freeze level vs
alloc context vs reclaim warnings.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-09-29 13:46:21 +10:00
Brian Foster
95808459b1 xfs: refactor xfs_buf_log_item reference count handling
The xfs_buf_log_item structure has a reference counter with slightly
tricky semantics. In the common case, a buffer is logged and
committed in a transaction, committed to the on-disk log (added to
the AIL) and then finally written back and removed from the AIL. The
bli refcount covers two potentially overlapping timeframes:

 1. the bli is held in an active transaction
 2. the bli is pinned by the log

The caveat to this approach is that the reference counter does not
purely dictate the lifetime of the bli. IOW, when a dirty buffer is
physically logged and unpinned, the bli refcount may go to zero as
the log item is inserted into the AIL. Only once the buffer is
written back can the bli finally be freed.

The above semantics means that it is not enough for the various
refcount decrementing contexts to release the bli on decrement to
zero. xfs_trans_brelse(), transaction commit (->iop_unlock()) and
unpin (->iop_unpin()) must all drop the associated reference and
make additional checks to determine if the current context is
responsible for freeing the item.

For example, if a transaction holds but does not dirty a particular
bli, the commit may drop the refcount to zero. If the bli itself is
clean, it is also not AIL resident and must be freed at this time.
The same is true for xfs_trans_brelse(). If the transaction dirties
a bli and then aborts or an unpin results in an abort due to a log
I/O error, the last reference count holder is expected to explicitly
remove the item from the AIL and release it (since an abort means
filesystem shutdown and metadata writeback will never occur).

This leads to fairly complex checks being replicated in a few
different places. Since ->iop_unlock() and xfs_trans_brelse() are
nearly identical, refactor the logic into a common helper that
implements and documents the semantics in one place. This patch does
not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:45:26 +10:00
Brian Foster
23420d05e6 xfs: clean up xfs_trans_brelse()
xfs_trans_brelse() is a bit of a historical mess, similar to
xfs_buf_item_unlock(). It is unnecessarily verbose, has snippets of
commented out code, inconsistency with regard to stale items, etc.

Clean up xfs_trans_brelse() to use similar logic and flow as
xfs_buf_item_unlock() with regard to bli reference count handling.
This patch makes no functional changes, but facilitates further
refactoring of the common bli reference count handling code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:45:02 +10:00
Brian Foster
d9183105ca xfs: don't unlock invalidated buf on aborted tx commit
xfstests generic/388,475 occasionally reproduce assertion failures
in xfs_buf_item_unpin() when the final bli reference is dropped on
an invalidated buffer and the buffer is not locked as it is expected
to be. Invalidated buffers should remain locked on transaction
commit until the final unpin, at which point the buffer is removed
from the AIL and the bli is freed since stale buffers are not
written back.

The assert failures are associated with filesystem shutdown,
typically due to log I/O errors injected by the test. The
problematic situation can occur if the shutdown happens to cause a
race between an active transaction that has invalidated a particular
buffer and an I/O error on a log buffer that contains the bli
associated with the same (now stale) buffer.

Both transaction and log contexts acquire a bli reference. If the
transaction has already invalidated the buffer by the time the I/O
error occurs and ends up aborting due to shutdown, the transaction
and log hold the last two references to a stale bli. If the
transaction cancel occurs first, it treats the buffer as non-stale
due to the aborted state: the bli reference is dropped and the
buffer is released/unlocked. The log buffer I/O error handling
eventually calls into xfs_buf_item_unpin(), drops the final
reference to the bli and treats it as stale. The buffer wasn't left
locked by xfs_buf_item_unlock(), however, so the assert fails and
the buffer is double unlocked. The latter problem is mitigated by
the fact that the fs is shutdown and no further damage is possible.

->iop_unlock() of an invalidated buffer should behave consistently
with respect to the bli refcount, regardless of aborted state. If
the refcount remains elevated on commit, we know the bli is awaiting
an unpin (since it can't be in another transaction) and will be
handled appropriately on log buffer completion. If the final bli
reference of an invalidated buffer is dropped in ->iop_unlock(), we
can assume the transaction has aborted because invalidation implies
a dirty transaction. In the non-abort case, the log would have
acquired a bli reference in ->iop_pin() and prevented bli release at
->iop_unlock() time. In the abort case the item must be freed and
buffer unlocked because it wasn't pinned by the log.

Rework xfs_buf_item_unlock() to simplify the currently circuitous
and duplicate logic and leave invalidated buffers locked based on
bli refcount, regardless of aborted state. This ensures that a
pinned, stale buffer is always found locked when eventually
unpinned.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:44:40 +10:00
Brian Foster
d5a2e2893d xfs: remove last of unnecessary xfs_defer_cancel() callers
Now that deferred operations are completely managed via
transactions, it's no longer necessary to cancel the dfops in error
paths that already cancel the associated transaction. There are a
few such calls lingering throughout the codebase.

Remove all remaining unnecessary calls to xfs_defer_cancel(). This
leaves xfs_defer_cancel() calls in two places. The first is the call
in the transaction cancel path itself, which facilitates this patch.
The second is made via the xfs_defer_finish() error path to provide
consistent error semantics with transaction commit. For example,
xfs_trans_commit() expects an xfs_defer_finish() failure to clean up
the dfops structure before it returns.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:41:58 +10:00
Darrick J. Wong
ae29478766 xfs: don't crash the vfs on a garbage inline symlink
The VFS routine that calls ->get_link blindly copies whatever's returned
into the user's buffer.  If we return a NULL pointer, the vfs will
crash on the null pointer.  Therefore, return -EFSCORRUPTED instead of
blowing up the kernel.

[dgc: clean up with hch's suggestions]

Reported-by: wen.xu@gatech.edu
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:40:40 +10:00
Linus Torvalds
d9a185f8b4 overlayfs update for 4.19
This contains two new features:
 
  1) Stack file operations: this allows removal of several hacks from the
     VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.
 
  2) Metadata only copy-up: when file is on lower layer and only metadata is
     modified (except size) then only copy up the metadata and continue to
     use the data from the lower file.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
 PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
 fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
 =QDXH
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:
 "This contains two new features:

   - Stack file operations: this allows removal of several hacks from
     the VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.

   - Metadata only copy-up: when file is on lower layer and only
     metadata is modified (except size) then only copy up the metadata
     and continue to use the data from the lower file"

* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
  ovl: Enable metadata only feature
  ovl: Do not do metacopy only for ioctl modifying file attr
  ovl: Do not do metadata only copy-up for truncate operation
  ovl: add helper to force data copy-up
  ovl: Check redirect on index as well
  ovl: Set redirect on upper inode when it is linked
  ovl: Set redirect on metacopy files upon rename
  ovl: Do not set dentry type ORIGIN for broken hardlinks
  ovl: Add an inode flag OVL_CONST_INO
  ovl: Treat metacopy dentries as type OVL_PATH_MERGE
  ovl: Check redirects for metacopy files
  ovl: Move some dir related ovl_lookup_single() code in else block
  ovl: Do not expose metacopy only dentry from d_real()
  ovl: Open file with data except for the case of fsync
  ovl: Add helper ovl_inode_realdata()
  ovl: Store lower data inode in ovl_inode
  ovl: Fix ovl_getattr() to get number of blocks from lower
  ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
  ovl: Copy up meta inode data from lowest data inode
  ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
  ...
2018-08-21 18:19:09 -07:00
Linus Torvalds
c22fc16d17 Changes since last update:
- Fix an uninitialized variable
 - Don't use obviously garbage AG header counters to calculate
   transaction reservations
 - Trigger icount recalculation on bad icount when monting.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlty8tsACgkQ+H93GTRK
 tOtoKw/+OeCaY6jZc2JoztBwLSUJsMYQ0R8Wsj5GRb4bVp9b0zes7RJMFU03nCtj
 XuE4Rhdsx+6+QZQKxTq/Z6lrKHEjF0kL1EVGHtL46Inr+Z+Rr4bLBG6NV1o0dg7B
 CR1IqW5vYcZ7Vrk9ko/RXVXtuCIxBS5jSW/S/uFT95Y4lVMAf/2asR/OoYt5ZVE3
 17CUfWRifiSGoBQpjtfZd63F23XlEEusiErC5iS9rUbE2qC9FxP9EuvoUP5M/n01
 nLS34Fjw7X739AiwHbf10fQPOvBr7atTazCXskjy4gbwqIWTmuhbF4ieTU1OfTI8
 ozhvYomBYLiZbsEYBhVCs09VEnIfHmf2HoLh//efGE8VEvoQllxdn/g2PQekoPAn
 M7VnRUXCTvaLI8IE2d3Ed1VWm0OTea09xqEiNpB0XGjegim9pXuf6t/zbe4R0vJy
 YLBgQT8XRPw5ZgCnBbxvZOXXxQtAqKnqZzYSWGxlHJhhduKVeKMqerhP0nn0ui8g
 wAOmOe3XEoyLfSY8WY0ACEEEA00pAwErerwVEFLCpaKTh5GOY4i3OBdqcZOtXacn
 f5oIeG9HZKAXKkOTGwpq1zGHTOYhz4mxAYhodRFiEE8rXHDa9odUWQ/iG0zgZaO6
 19xznXjXkVWVg0QJqQJi6SbEkkrAEFtFRYH+VPTgWM/1tg47a14=
 =+0Eq
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix an uninitialized variable

 - Don't use obviously garbage AG header counters to calculate
   transaction reservations

 - Trigger icount recalculation on bad icount when mounting

* tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: fix WARN_ON_ONCE on uninitialized variable
  xfs: sanity check ag header values in xrep_calc_ag_resblks
  xfs: recalculate summary counters at mount time if icount is bad
2018-08-21 18:15:47 -07:00
Dave Jiang
e1fb4a0864 dax: remove VM_MIXEDMAP for fsdax and device dax
This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/

VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map.  The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations.  In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.

Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.

DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().

This patch has been tested against ndctl unit test.  It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.

Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Linus Torvalds
781fca5b10 Changes for 4.19:
- Use extent maps to track pagecache page status instead of bufferhead
   state.
 - Refactor pagecache read and write paths to use the new iomap library
   functions, which enable us to drop the old bufferhead code for
   pagesize == blocksize filesystems.
 - Set up parallel per-block-per-page metadata to track subpage
   information that was tracked by buffer heads, which enables us to drop
   the old bufferhead code for pagesize > blocksize filesystems.
 - Tie a deferred ops control structure to a transaction so that we can
   take advantage of an upper-level dfops without having to plumb pointer
   passing through the code.
 - Refactor the deferred ops code to track deferred ops as part of the
   transaction structure (instead of as a separate data structure) so
   that we can simplify the scoping rules around defer_ops.
 - Refactor twisty delwri buffer submission code to avoid deadlocks.
 - Shorten and fix indenting problems in the scrub code.
 - Detect obviously bad summary counts at mount and fix them.
 - Directly associate deferred ops control structure with a transaction
   so that callers no longer have to manage it themselves.
 - Remove a couple of IRIX-era inode macros.
 - Remove the long-deprecated 'barrier' and 'nobarrier' mount options.
 - Clean up the inode fork structure a bit.
 - Check for bad fs summary counter values in the superblock.
 - Reduce COW fork lookups during writeback.
 - Refactor the deferred ops control structures into the transaction
   structure, thereby eliminating the need for transaction users to
   handle the deferred ops as a separate data structure.
 - Add the ability to repair AG headers online.
 - Fix a crash due to insufficient return value checking.
 - Various fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAltwVGoACgkQ+H93GTRK
 tOt3bw//WaG1mR44Oo/mhf27MaEJK74LXeViqCH4Sdk10gujClTVnl6h33ChAyEi
 7BT4x1JtwM6xOh7nPsXQy/besVxWadjQcTtAz/3U2wJoFyOX2+I27SAawrmX6jfR
 Hi1DxXFFK7z/8YvuZqYl3vTgxMNb7bLAUybe2sYX8q+vrQaUvl9eLQlHSaT3sxrc
 /lBkog1dYmbw3yjLnWYpQtC0I6Pa3ZuG/S2vpeJ2H5MADtzrRNjuC9MHZJW7tIGm
 +rCLm0agk8yFkEA84VvS5Afee3TppY/JBaYlsvG1rp3bs0fELAJFnzS4g/QDbbsX
 HAKPcMICJksF4C9y0Xb7wXPz/4PKur5/OSuGXN4QtOivOEoAdWfh2PLInqAjo/Le
 mO92PdkBucfVqJzfEC2q2QAnGIaJlG8txhAz87wZ1YfZDQQlJDy385Z9GQXfUpy5
 /1xH7V0cze1ZBSxWSddSFg0gCtaWSerfp0CmAG3A+HWKIN6c/ZNSCrqdq0DBC99D
 qOn6ThjckZWGvz/KV5xBr/KvUYOpSeEyREtgcAN008TiUaNy4nOhWV2xgLGuPY/J
 ed4V2B9qVbq+l+sZyzukB8cmOXmcCey6omwJ7LqZzoTWTAtTQtM2MwhaQFUWtQG8
 mCqPXJp1XyL24sn0bI1t2NuKgQcs6QEQWX3zN4DA6I+N9+sTDqo=
 =2G+i
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.19-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "This is the second part of the XFS changes for 4.19.

  The biggest changes are the removal of buffer heads frm XFS, a massive
  reworking of the deferred transaction operations handling code, the
  removal of the long defunct barrier/nobarrier mount options, and the
  addition of a few more online repair functions.

  Summary:

   - Use extent maps to track pagecache page status instead of
     bufferhead state.

   - Refactor pagecache read and write paths to use the new iomap
     library functions, which enable us to drop the old bufferhead code
     for pagesize == blocksize filesystems.

   - Set up parallel per-block-per-page metadata to track subpage
     information that was tracked by buffer heads, which enables us to
     drop the old bufferhead code for pagesize > blocksize filesystems.

   - Tie a deferred ops control structure to a transaction so that we
     can take advantage of an upper-level dfops without having to plumb
     pointer passing through the code.

   - Refactor the deferred ops code to track deferred ops as part of the
     transaction structure (instead of as a separate data structure) so
     that we can simplify the scoping rules around defer_ops.

   - Refactor twisty delwri buffer submission code to avoid deadlocks.

   - Shorten and fix indenting problems in the scrub code.

   - Detect obviously bad summary counts at mount and fix them.

   - Directly associate deferred ops control structure with a
     transaction so that callers no longer have to manage it themselves.

   - Remove a couple of IRIX-era inode macros.

   - Remove the long-deprecated 'barrier' and 'nobarrier' mount options.

   - Clean up the inode fork structure a bit.

   - Check for bad fs summary counter values in the superblock.

   - Reduce COW fork lookups during writeback.

   - Refactor the deferred ops control structures into the transaction
     structure, thereby eliminating the need for transaction users to
     handle the deferred ops as a separate data structure.

   - Add the ability to repair AG headers online.

   - Fix a crash due to insufficient return value checking.

   - Various fixes and cleanups"

* tag 'xfs-4.19-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (155 commits)
  xfs: fix a null pointer dereference in xfs_bmap_extents_to_btree
  xfs: remove b_last_holder & associated macros
  iomap: Switch to offset_in_page for clarity
  xfs: Close race between direct IO and xfs_break_layouts()
  xfs: repair the AGI
  xfs: repair the AGFL
  xfs: repair the AGF
  xfs: remove dead error handling code in xfs_dquot_disk_alloc()
  xfs: use WRITE_ONCE to update if_seq
  xfs: fix a comment in xfs_log_reserve
  xfs: only validate summary counts on primary superblock
  xfs: substitute spaces with tabs
  xfs: fold dfops into the transaction
  xfs: always defer agfl block frees
  xfs: pass transaction to xfs_defer_add()
  xfs: replace xfs_defer_ops ->dop_pending with on-stack list
  xfs: cancel dfops on xfs_defer_finish() error
  xfs: clean out superfluous dfops dop params/vars
  xfs: drop dop param from xfs_defer_op_type ->finish_item() callback
  xfs: automatic dfops inode relogging
  ...
2018-08-14 08:56:02 -07:00
Darrick J. Wong
1fc25f51d7 xfs: sanity check ag header values in xrep_calc_ag_resblks
Check the values we read in from the AG headers when calculating the
block reservations for a repair transaction.  If they're obviously
wrong, substitute worst case assumptions (rather than ENOSPC on a bogus
reservation request).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-14 08:17:02 -07:00
Linus Torvalds
161fa27ff2 Merge branch 'iomap-4.19-merge' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull fs iomap refactoring from Darrick Wong:
 "This is the first part of the XFS changes for 4.19.

  Christoph and Andreas coordinated some refactoring work on the iomap
  code in preparation for removing buffer heads from XFS and porting
  gfs2 to iomap. I'm sending this small pull request ahead of the main
  XFS merge to avoid holding up gfs2 unnecessarily"

* 'iomap-4.19-merge' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: add inline data support to iomap_readpage_actor
  iomap: support direct I/O to inline data
  iomap: refactor iomap_dio_actor
  iomap: add initial support for writes without buffer heads
  iomap: add an iomap-based readpage and readpages implementation
  iomap: add private pointer to struct iomap
  iomap: add a page_done callback
  iomap: generic inline data handling
  iomap: complete partial direct I/O writes synchronously
  iomap: mark newly allocated buffer heads as new
  fs: factor out a __generic_write_end helper
2018-08-13 22:29:03 -07:00
Linus Torvalds
0ea97a2d61 Merge branch 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs icache updates from Al Viro:

 - NFS mkdir/open_by_handle race fix

 - analogous solution for FUSE, replacing the one currently in mainline

 - new primitive to be used when discarding halfway set up inodes on
   failed object creation; gives sane warranties re icache lookups not
   returning such doomed by still not freed inodes. A bunch of
   filesystems switched to that animal.

 - Miklos' fix for last cycle regression in iget5_locked(); -stable will
   need a slightly different variant, unfortunately.

 - misc bits and pieces around things icache-related (in adfs and jfs).

* 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  jfs: don't bother with make_bad_inode() in ialloc()
  adfs: don't put inodes into icache
  new helper: inode_fake_hash()
  vfs: don't evict uninitialized inode
  jfs: switch to discard_new_inode()
  ext2: make sure that partially set up inodes won't be returned by ext2_iget()
  udf: switch to discard_new_inode()
  ufs: switch to discard_new_inode()
  btrfs: switch to discard_new_inode()
  new primitive: discard_new_inode()
  kill d_instantiate_no_diralias()
  nfs_instantiate(): prevent multiple aliases for directory inode
2018-08-13 20:25:58 -07:00
Darrick J. Wong
00d22a1c36 xfs: recalculate summary counters at mount time if icount is bad
Since the sb write verifier trips on bad icounts, we should also force a
mount time recalculation of the summary counters if the icount is bad.
This helps us avoid blowing up at freeze/unmount time when the bad
counter gets written back out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-13 07:58:27 -07:00
Shan Hai
01239d77b9 xfs: fix a null pointer dereference in xfs_bmap_extents_to_btree
Fuzzing tool reports a write to null pointer error in the
xfs_bmap_extents_to_btree, fix it by bailing out on encountering
a null pointer.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-12 08:37:31 -07:00
Eric Sandeen
fa6c668d80 xfs: remove b_last_holder & associated macros
The old lock tracking infrastructure in xfs using the b_last_holder
field seems to only be useful if you can get into the system with a
debugger; it seems that the existing tracepoints would be the way to
go these days, and this old infrastructure can be removed.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-12 08:37:31 -07:00
Dave Jiang
e25ff835af xfs: Close race between direct IO and xfs_break_layouts()
This patch is the duplicate of ross's fix for ext4 for xfs.

If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
xfs_break_layouts() => ___wait_var_event(), the waiting function
xfs_wait_dax_page() will never be called.  This means that
xfs_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-12 08:37:31 -07:00
Darrick J. Wong
13942aa94a xfs: repair the AGI
Rebuild the AGI header items with some help from the rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-08-10 11:44:31 -07:00
Darrick J. Wong
0e93d3f43e xfs: repair the AGFL
Repair the AGFL from the rmap data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-08-10 11:44:31 -07:00
Darrick J. Wong
f9ed6debca xfs: repair the AGF
Regenerate the AGF from the rmap data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-08-10 11:44:31 -07:00
Brian Foster
73971b172a xfs: remove dead error handling code in xfs_dquot_disk_alloc()
Colin Ian King reports that commit 82ff27bc52 ("xfs: automatic dfops
buffer relogging") leaves around some dead error handling code in
xfs_dquot_disk_alloc(). This was discovered via Coverity scan.

Since the associated commit eliminates the act of joining a buffer
to a dfops, this intermediate error state is no longer possible and
the error handling code can be removed. Since the caller cancels the
transaction on error, which cancels the dfops, eliminate the
unnecessary xfs_defer_cancel() call and error handling labels.

Fixes: 82ff27bc52 ("xfs: automatic dfops buffer relogging")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-07 10:57:13 -07:00
Christoph Hellwig
2ba090d521 xfs: use WRITE_ONCE to update if_seq
This adds ordering of the updates and makes sure we always see the if_seq
update before the extent tree is modified.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-07 10:57:12 -07:00
Al Viro
5bef915104 new helper: inode_fake_hash()
open-coded in a quite a few places...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:32 -04:00
Huang Chong
a0e336ba3e xfs: fix a comment in xfs_log_reserve
Fix the comment in xfs_log_reserve to avoid confusing.

Signed-of-by: Huang Chong <huang.chong@zte.com.cn>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-03 08:17:54 -07:00
Darrick J. Wong
1f31c98d65 xfs: only validate summary counts on primary superblock
Skip the summary counter checks for secondary superblocks and inprogress
primary superblocks because mkfs has always written those out with
zeroed summary counters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-03 08:17:35 -07:00
Thomas Bianchi
c2b6e1591b xfs: substitute spaces with tabs
Inside xfs_attr_shortform_list removes spaces at the beginnig of the line
and replaces with tabs.
Issue found by checkpatch.

ERROR: code indent should use tabs where possible

Signed-off-by: Thomas Bianchi <thomas.bianchi8@gmail.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
9d9e623385 xfs: fold dfops into the transaction
struct xfs_defer_ops has now been reduced to a single list_head. The
external dfops mechanism is unused and thus everywhere a (permanent)
transaction is accessible the associated dfops structure is as well.

Remove the xfs_defer_ops structure and fold the list_head into the
transaction. Also remove the last remnant of external dfops in
xfs_trans_dup().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
c03edc9e49 xfs: always defer agfl block frees
The AGFL fixup code conditionally defers block frees from the free
list based on whether the current transaction has an associated
xfs_defer_ops structure. Now that dfops is embedded in the
transaction and the internal dfops is used unconditionally, this
invariant is always true.

Remove the now dead logic to check for ->t_dfops in
xfs_alloc_fix_freelist() and unconditionally defer AGFL block frees.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
0f37d1780c xfs: pass transaction to xfs_defer_add()
The majority of remaining references to struct xfs_defer_ops in XFS
are associated with xfs_defer_add(). At this point, there are no
more external xfs_defer_ops users left. All instances of
xfs_defer_ops are embedded in the transaction, which means we can
safely pass the transaction down to the dfops add interface.

Update xfs_defer_add() to receive the transaction as a parameter.
Various subsystems implement wrappers to allocate and construct the
context specific data structures for the associated deferred
operation type. Update these to also carry the transaction down as
needed and clean up unused dfops parameters along the way.

This removes most of the remaining references to struct
xfs_defer_ops throughout the code and facilitates removal of the
structure.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix unused variable warnings with ftrace disabled]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
1ae093cbea xfs: replace xfs_defer_ops ->dop_pending with on-stack list
The xfs_defer_ops ->dop_pending list is used to track active
deferred operations once intents are logged. These items must be
aborted in the event of an error. The list is populated as intents
are logged and items are removed as they complete (or are aborted).

Now that xfs_defer_finish() cancels on error, there is no need to
ever access ->dop_pending outside of xfs_defer_finish(). The list is
only ever populated after xfs_defer_finish() begins and is either
completed or cancelled before it returns.

Remove ->dop_pending from xfs_defer_ops and replace it with a local
list in the xfs_defer_finish() path. Pass the local list to the
various helpers now that it is not accessible via dfops. Note that
we have to check for NULL in the abort case as the final tx roll
occurs outside of the scope of the new local list (once the dfops
has completed and thus drained the list).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
9b1f4e9831 xfs: cancel dfops on xfs_defer_finish() error
The current semantics of xfs_defer_finish() require the caller to
call xfs_defer_cancel() on error. This is slightly inconsistent with
transaction commit error handling where a failed commit cleans up
the transaction before returning.

More significantly, the only requirement for exposure of
->dop_pending outside of xfs_defer_finish() is so that
xfs_defer_cancel() can drain it on error. Since the only recourse of
xfs_defer_finish() errors is cancellation, mirror the transaction
logic and cancel remaining dfops before returning from
xfs_defer_finish() with an error.

Beside simplifying xfs_defer_finish() semantics, this ensures that
xfs_defer_finish() always returns with an empty ->dop_pending and
thus facilitates removal of the list from xfs_defer_ops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
60f31a609e xfs: clean out superfluous dfops dop params/vars
The dfops code still passes around the xfs_defer_ops pointer
superfluously in a few places. Clean this up wherever the
transaction will suffice.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
7dbddbaccd xfs: drop dop param from xfs_defer_op_type ->finish_item() callback
The dfops infrastructure ->finish_item() callback passes the
transaction and dfops as separate parameters. Since dfops is always
part of a transaction, the latter parameter is no longer necessary.
Remove it from the various callbacks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
a8198666fb xfs: automatic dfops inode relogging
Inodes that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While inodes are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for inodes with ili_lock_flags == 0.

Replace the xfs_defer_ijoin() infrastructure with such detection and
automatic relogging of held inodes. This eliminates the need for the
per-dfops inode list, replaced by an on-stack variant in
xfs_defer_trans_roll().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
82ff27bc52 xfs: automatic dfops buffer relogging
Buffers that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While buffers are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for held buffers.

Replace the xfs_defer_bjoin() infrastructure with such detection and
automatic relogging of held buffers. This eliminates the need for
the per-dfops buffer list, replaced by an on-stack variant in
xfs_defer_trans_roll().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
488c919a5b xfs: add missing defer ijoins for held inodes
Log items that require relogging during deferred operations
processing are explicitly joined to the associated dfops via the
xfs_defer_*join() helpers. These calls imply that the associated
object is "held" by the transaction such that when rolled, the item
can be immediately joined to a follow up transaction. For buffers,
this means the buffer remains locked and held after each roll. For
inodes, this means that the inode remains locked.

Failure to join a held item to the dfops structure means the
associated object pins the tail of the log while dfops processing
completes, because the item never relogs and is not unlocked or
released until deferred processing completes.

Currently, all buffers that are held in transactions (XFS_BLI_HOLD)
with deferred operations are explicitly joined to the dfops. This is
not the case for inodes, however, as various contexts defer
operations to transactions with held inodes without explicit joins
to the associated dfops (and thus not relogging).

While this is not a catastrophic problem, it is not ideal. Given
that we want to eventually relog such items automatically during
dfops processing, start by explicitly adding these missing
xfs_defer_ijoin() calls. A call is added everywhere an inode is
joined to a transaction without transferring lock ownership and
said transaction runs deferred operations.

All xfs_defer_ijoin() calls will eventually be replaced by automatic
dfops inode relogging. This patch essentially implements the
behavior change that would otherwise occur due to automatic inode
dfops relogging.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
1214f1cf66 xfs: replace dop_low with transaction flag
The dop_low field enables the low free space allocation mode when a
previous allocation has detected difficulty allocating blocks. It
has historically been part of the xfs_defer_ops structure, which
means if enabled, it remains enabled across a set of transactions
until the deferred operations have completed and the dfops is reset.

Now that the dfops is embedded in the transaction, we can save a bit
more space by using a transaction flag rather than a standalone
boolean. Drop the ->dop_low field and replace it with a transaction
flag that is set at the same points, carried across rolling
transactions and cleared on completion of deferred operations. This
essentially emulates the behavior of ->dop_low and so should not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
ce356d6477 xfs: pass transaction to dfops reset/move helpers
All callers pass ->t_dfops of the associated transactions. Refactor
the helpers to receive the transactions and facilitate further
cleanups between xfs_defer_ops and xfs_trans.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
7279aa13b8 xfs: remove unused __xfs_defer_cancel() internal helper
With no more external dfops users, there is no need for an
xfs_defer_ops cancel wrapper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
fbfa977d25 xfs: use transaction for intent recovery instead of raw dfops
Log intent recovery is the last user of an external (on-stack)
dfops. The pattern exists because the dfops is used to collect
additional deferred operations queued during the whole recovery
sequence. The dfops is finished with a new transaction after intent
recovery completes.

We already have a mechanism to create an empty, container-like
transaction to support the scrub infrastructure. We can reuse that
mechanism here to drop the final user of external dfops. This
facilitates folding dfops state (i.e., dop_low) into the
transaction, the elimination of now unused external dfops support
and also eliminates the only caller of __xfs_defer_cancel().

Replace the on-stack dfops with an empty transaction and pass it
around to the various helpers that queue and finish deferred
operations during intent recovery.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
98719051e7 xfs: refactor internal dfops initialization
The current transaction allocation code conditionally initializes
the ->t_dfops indirection pointer. Transaction commit/cancel check
the validity of the pointer to determine whether to finish/cancel
the internal dfops.

This disallows the ability to use the internal dfops list as a
temporary container (via xfs_trans_alloc_empty()). Refactor
transaction allocation to always initialize ->t_dfops and check
permanent reservation state on transaction commit/cancel.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Darrick J. Wong
56830d6cc1 xfs: check da node magic in _node_lookup_int
Before we start processing what we /think/ is a da3 node block, actually
check the magic to make sure that we're looking at a node block.  This
way we won't blow the asserts in _node_hdr_from_disk on corrupted
metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-01 07:42:43 -07:00
Darrick J. Wong
611995db2c xfs: use a local variable for magic number in xfs_da3_node_lookup_int
Use a local variable for the block magic number checks instead of
abusing blk->magic.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-01 07:42:18 -07:00
Darrick J. Wong
0c60d3aa0e xfs: refactor log recovery check
Add a predicate to decide if the log is actively in recovery and use
that instead of open-coding a pagf_init check in the attr leaf verifier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-01 07:40:48 -07:00
Darrick J. Wong
ff23f4af7e xfs: move extent busy tree initialization to xfs_initialize_perag
Move the per-AG busy extent tree initialization to the per-ag structure
initialization since we don't want online repair to leak the old tree.
We only deconstruct the tree at unmount time, so this should be safe.
This also enables us to eliminate the commented out initialization in
the xfsprogs libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-31 13:18:09 -07:00
Christoph Hellwig
e666aa37f4 xfs: avoid COW fork extent lookups in writeback if the fork didn't change
Used the per-fork sequence counter to avoid lookups in the writeback code
unless the COW fork actually changed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-31 13:18:09 -07:00
Christoph Hellwig
745b3f76d1 xfs: maintain a sequence count for inode fork manipulations
Add a simple 32-bit unsigned integer as the sequence count for
modifications to the extent list in the inode fork.  This will be
used to optimize away extent list lookups in the writeback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
9e037cb797 xfs: check for unknown v5 feature bits in superblock write verifier
Make sure we never try to write the superblock with unknown feature bits
set.  We checked those at mount time, so if they're set now then memory
is corrupt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
69775fd15d xfs: verify icount in superblock write
Add a helper predicate to check the inode count for sanity, then use it
in the superblock write verifier to inspect sb_icount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00
Bill O'Donnell
8756a5af18 libxfs: add more bounds checking to sb sanity checks
Current sb verifier doesn't check bounds on sb_fdblocks and sb_ifree.
Add sanity checks for these parameters.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
[darrick: port to refactored sb validation predicates]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
eca383fcd6 xfs: refactor superblock verifiers
Split the superblock verifier into the common checks, the read-time
checks, and the write-time check functions.  No functional changes, but
we're setting up to add more write-only checks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
86d969b425 xfs: refactor the xrep_extent_list into xfs_bitmap
As mentioned previously, the xrep_extent_list basically implements a
bitmap with two functions: set and disjoint union.  Rename all these
functions to xfs_bitmap to shorten the name and make it more obvious
what we're doing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-31 13:18:08 -07:00
Christoph Hellwig
51d6269030 xfs: introduce a new xfs_inode_has_cow_data helper
We have a few places that already check if an inode has actual data in
the COW fork to avoid work on reflink inodes that do not actually have
outstanding COW blocks.  There are a few more places that can avoid
working if doing the same check, so add a documented helper for this
condition and use it in all places where it makes sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-30 07:57:48 -07:00
Christoph Hellwig
3ba738df25 xfs: remove the xfs_ifork_t typedef
We only have a few more callers left, so seize the opportunity and kill
it off.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-30 07:57:48 -07:00
Christoph Hellwig
1216b58b35 xfs: simplify xfs_idata_realloc
Streamline the code and take advantage of the fact that kmem_realloc
through krealloc will be have like a normal allocation if passing in a
NULL old pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-30 07:57:48 -07:00
Christoph Hellwig
fcacbc3f51 xfs: remove if_real_bytes
The field is only used for asserts, and to track if we really need to do
realloc when growing the inode fork data.  But the krealloc function
already performs this check internally, so there is no need to keep track
of the real allocation size.

This will free space in the inode fork for keeping a sequence counter of
changes to the extent list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-30 07:57:48 -07:00
Darrick J. Wong
bc270b53e6 xfs: move the repair extent list into its own file
Move the xrep_extent_list code into a separate file.  Logically, this
data structure is really just a clumsy bitmap, and in the next patch
we'll make this more obvious.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-29 22:37:09 -07:00
Darrick J. Wong
ebcbef3a61 xfs: pass transaction lock while setting up agresv on cyclic metadata
Pass a tranaction pointer through to all helpers that calculate the
per-AG block reservation.  Online repair will use this to reinitialize
per-ag reservations while it still holds all the AG headers locked to
the repair transaction.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-29 22:37:08 -07:00
Eric Sandeen
1c02d502c2 xfs: remove deprecated barrier/nobarrier mount
The barrier mount options have been no-ops and deprecated since

4cf4573 xfs: deprecate barrier/nobarrier mount option

i.e. kernel 4.10 / December 2016, with a stated deprecation schedule
after v4.15.  Should be fair game to remove them now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:17 -07:00
Darrick J. Wong
44a8736bd2 xfs: clean up IRELE/iput callsites
Replace the IRELE macro with a proper function so that we can do proper
typechecking and so that we can stop open-coding iput in scrub, which
means that we'll be able to ftrace inode lifetimes going through scrub
correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-26 10:15:16 -07:00
Darrick J. Wong
89c3e8cf3c xfs: kill IHOLD
Nobody uses this macro, get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-07-26 10:15:16 -07:00
Brian Foster
b277c37f43 xfs: bypass final dfops roll in trans commit path
Once xfs_defer_finish() has completed all deferred operations, it
checks the dirty state of the transaction and rolls it once more to
return a clean transaction for the caller. This primarily to cover
the case where repeated xfs_defer_finish() calls are made in a loop
and we need to make sure that the caller starts the next iteration
with a clean transaction. Otherwise we risk transaction reservation
overrun.

This final transaction roll is not required in the transaction
commit path, however, because the transaction is immediately
committed and freed after dfops completion. Refactor the final roll
into a separate helper such that we can avoid it in the transaction
commit path.  Lift the dfops reset as well so dfops remains valid
until after the last call to xfs_defer_trans_roll(). The reset is
also unnecessary in the transaction commit path because the
transaction is about to complete.

This eliminates unnecessary regrants of transactions where the
associated transaction roll can be replaced by a transaction commit.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:16 -07:00
Brian Foster
9e28a242be xfs: drop unnecessary xfs_defer_finish() dfops parameter
Every caller of xfs_defer_finish() now passes the transaction and
its associated ->t_dfops. The xfs_defer_ops parameter is therefore
no longer necessary and can be removed.

Since most xfs_defer_finish() callers also have to consider
xfs_defer_cancel() on error, update the latter to also receive the
transaction for consistency. The log recovery code contains an
outlier case that cancels a dfops directly without an available
transaction. Retain an internal wrapper to support this outlier case
for the time being.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:16 -07:00
Brian Foster
d5cca7eb24 xfs: remove unnecessary dfops init calls in xattr code
Each xfs_defer_init() call in the xattr code uses the internal dfops
reference. In addition, a successful xfs_defer_finish() always
returns with a reset xfs_defer_ops structure.

Given that along with the fact that every xfs_defer_init() call in
the xattr code is followed up by an xfs_defer_finish(), the former
calls are no longer necessary and can be removed.

Note that the xfs_defer_init() call in the remote value copy loop of
xfs_attr_rmtval_set() is not followed by a finish, but the dfops is
unused in this instance.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:15 -07:00
Brian Foster
c8eac49ef7 xfs: remove all boilerplate defer init/finish code
At this point, the transaction subsystem completely manages deferred
items internally such that the common and boilerplate
xfs_trans_alloc() -> xfs_defer_init() -> xfs_defer_finish() ->
xfs_trans_commit() sequence can be replaced with a simple
transaction allocation and commit.

Remove all such boilerplate deferred ops code. In doing so, we
change each case over to use the dfops in the transaction and
specifically eliminate:

- The on-stack dfops and associated xfs_defer_init() call, as the
  internal dfops is initialized on transaction allocation.
- xfs_bmap_finish() calls that precede a final xfs_trans_commit() of
  a transaction.
- xfs_defer_cancel() calls in error handlers that precede a
  transaction cancel.

The only deferred ops calls that remain are those that are
non-deterministic with respect to the final commit of the associated
transaction or are open-coded due to special handling.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:15 -07:00
Brian Foster
91ef75b657 xfs: use internal dfops during [b|c]ui recovery
bmap and refcount intent processing associates a dfops from the
caller with a local transaction to collect all deferred items for
post-processing. Use the internal dfops in both of these functions
and move the deferred items to the parent dfops before the
transaction commits.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:15 -07:00
Brian Foster
9c6bb0cf7b xfs: use internal dfops in attr code
Remove the unnecessary on-stack dfops structure and use the internal
transaction dfops instead. The lower level xattr code already
appropriately accesses ->t_dfops throughout.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:14 -07:00
Brian Foster
1e5ae1995a xfs: use internal dfops in cow blocks cancel
All callers either explicitly initialize a dfops or pass a
transaction with an internal dfops. Drop the hacky old dfops
replacement logic and use the one associated with the transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:14 -07:00
Brian Foster
e021a2e5fc xfs: support embedded dfops in transaction
The dfops structure used by multi-transaction operations is
typically stored on the stack and carried around by the associated
transaction. The lifecycle of dfops does not quite match that of the
transaction, but they are tightly related in that the former depends
on the latter.

The relationship of these objects is tight enough that we can avoid
the cumbersome boilerplate code required in most cases to manage
them separately by just embedding an xfs_defer_ops in the
transaction itself. This means that a transaction allocation returns
with an initialized dfops, a transaction commit finishes pending
deferred items before the tx commit, a transaction cancel cancels
the dfops before the transaction and a transaction dup operation
transfers the current dfops state to the new transaction.

The dup operation is slightly complicated by the fact that we can no
longer just copy a dfops pointer from the old transaction to the new
transaction. This is solved through a dfops move helper that
transfers the pending items and other dfops state across the
transactions. This also requires that transaction rolling code
always refer to the transaction for the current dfops reference.

Finally, to facilitate incremental conversion to the internal dfops
and continue to support the current external dfops mode of
operation, create the new ->t_dfops_internal field with a layer of
indirection. On allocation, ->t_dfops points to the internal dfops.
This state is overridden by callers who re-init a local dfops on the
transaction. Once ->t_dfops is overridden, the external dfops
reference is maintained as the transaction rolls.

This patch adds the fundamental ability to support an internal
dfops. All codepaths that perform deferred processing continue to
override the internal dfops until they are converted over in
subsequent patches.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:14 -07:00
Brian Foster
44fd294681 xfs: pack holes in xfs_defer_ops and xfs_trans
Both structures have holes due to member alignment. Move dop_low to
the end of xfs_defer ops to sanitize the cache line alignment and
move t_flags to save 8 bytes in xfs_trans.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:13 -07:00
Brian Foster
509308b413 xfs: reset dfops to initial state after finish
xfs_defer_init() is currently used in two particular situations. The
first and most obvious case is raw initialization of an
xfs_defer_ops struct. The other case is partial reinit of
xfs_defer_ops on reuse due to iteration.

Most instances of the first case will be replaced by a single init
of a dfops embedded in the transaction. Init calls are still
technically required for the second case because the dfops may have
low space mode enabled or have joined items that need to be reset
before the dfops should be reused.

Since the current dfops usage expects either a final transaction
commit after xfs_defer_finish() or xfs_defer_init() if dfops is to
be reused, we can shift some of the init logic into
xfs_defer_finish() such that the latter returns with a reinitialized
dfops. This eliminates the second dependency noted above such that a
dfops is immediately ready for reuse after an xfs_defer_finish()
without the need to change any calling code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:13 -07:00
Brian Foster
83200bfac6 xfs: remove unused deferred ops committed field
dop_committed is set when deferred item processing rolls the
transaction at least once, but is only ever accessed in tracepoints.
The transaction roll/commit events are already available via
independent tracepoints, so remove the otherwise unused field.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:13 -07:00
Brian Foster
03f4e4b26c xfs: make deferred processing safe for embedded dfops
xfs_defer_finish() has a couple quirks that are not safe with
respect to the upcoming internal dfops functionality. First,
xfs_defer_finish() attaches the passed in dfops structure to
->t_dfops and caches and restores the original value. Second, it
continues to use the initial dfops reference before and after the
transaction roll.

These behaviors assume that dop is an independent memory allocation
from the transaction itself, which may not always be true once
transactions begin to use an embedded dfops structure. In the latter
model, dfops processing creates a new xfs_defer_ops structure with
each transaction and the associated state is migrated across to the
new transaction.

Fix up xfs_defer_finish() to handle the possibility of the current
dfops changing after a transaction roll. Since ->t_dfops is used
unconditionally in this path, it is no longer necessary to
attach/restore ->t_dfops and pass it explicitly down to
xfs_defer_trans_roll(). Update dop in the latter function and the
caller to ensure that it always refers to the current dfops
structure.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:12 -07:00
Brian Foster
dcbd44f799 xfs: fix transaction leak on remote attr set/remove failure
The xattr remote value set/remove handlers both clear args.trans in
the error path without having cancelled the transaction. This leaks
the transaction, causes warnings around returning to userspace with
locks held and leads to system lockups or other general problems.

The higher level xfs_attr_[set|remove]() functions already detect
and cancel args.trans when set in the error path. Drop the NULL
assignments from the rmtval handlers and allow the callers to clean
up the transaction correctly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:12 -07:00
Brian Foster
a61acc3c78 xfs: use ->t_dfops in log recovery intent processing
xlog_finish_defer_ops() processes the deferred operations collected
over the entire intent recovery sequence. We can't xfs_defer_init()
here because the dfops is already populated. Attach it manually and
eliminate the last caller of xfs_defer_finish() that doesn't pass
->t_dfops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:12 -07:00
Brian Foster
02dff7bf81 xfs: pull up dfops from xfs_itruncate_extents()
xfs_itruncate_extents[_flags]() uses a local dfops with a
transaction provided by the caller. It uses hacky ->t_dfops
replacement logic to avoid stomping over an already populated
->t_dfops.

The latter never occurs for current callers and the logic itself is
not really appropriate. Clean this up by updating all callers to
initialize a dfops and to use that down in xfs_itruncate_extents().
This more closely resembles the upcoming logic where dfops will be
embedded within the transaction. We can also replace the
xfs_defer_init() in the xfs_itruncate_extents_flags() loop with an
assert. Both dfops and firstblock should be in a valid state
after xfs_defer_finish() and the inode joined to the dfops is fixed
throughout the loop.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:12 -07:00
Eric Sandeen
d4a34e1655 xfs: properly handle free inodes in extent hint validators
When inodes are freed in xfs_ifree(), di_flags is cleared (so extent size
hints are removed) but the actual extent size fields are left intact.
This causes the extent hint validators to fail on freed inodes which once
had extent size hints.

This can be observed (for example) by running xfs/229 twice on a
non-crc xfs filesystem, or presumably on V5 with ikeep.

Fixes: 7d71a67 ("xfs: verify extent size hint is valid in inode verifier")
Fixes: 02a0fda ("xfs: verify COW extent size hint is valid in inode verifier")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-24 11:34:52 -07:00
Darrick J. Wong
f467cad95f xfs: force summary counter recalc at next mount
Use the "bad summary count" mount flag from the previous patch to skip
writing the unmount record to force log recovery at the next mount,
which will recalculate the summary counters for us.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:01 -07:00
Darrick J. Wong
53235f2215 xfs: refactor unmount record write
Refactor the writing of the unmount record into a separate helper.  No
functionality changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:01 -07:00
Darrick J. Wong
2e9e6481e2 xfs: detect and fix bad summary counts at mount
Filippo Giunchedi complained that xfs doesn't even perform basic sanity
checks of the fs summary counters at mount time.  Therefore, recalculate
the summary counters from the AGFs after log recovery if the counts were
bad (or we had to recover the fs).  Enhance the recalculation routine to
fail the mount entirely if the new values are also obviously incorrect.

We use a mount state flag to record the "bad summary count" state so
that the (subsequent) online fsck patches can detect subtlely incorrect
counts and set the flag; clear it userspace asks for a repair; or force
a recalculation at the next mount if nobody fixes it by unmount time.

Reported-by: Filippo Giunchedi <fgiunchedi@wikimedia.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:01 -07:00
Darrick J. Wong
032d91f982 xfs: fix indentation and other whitespace problems in scrub/repair
Now that we've shortened everything, fix up all the indentation and
whitespace problems.  There are no functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23 09:08:01 -07:00
Darrick J. Wong
1d8a748a8a xfs: shorten struct xfs_scrub_context to struct xfs_scrub
Shorten the name of the online fsck context structure.  Whitespace
damage will be fixed by a subsequent patch.  There are no functional
changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
b5e2196e9c xfs: shorten xfs_repair_ prefix to xrep_
Shorten all the metadata repair xfs_repair_* symbols to xrep_.
Whitespace damage will be fixed by a subsequent patch.  There are no
functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
c517b3aa02 xfs: shorten xfs_scrub_ prefix
Shorten all the metadata checking xfs_scrub_ prefixes to xchk_.  After
this, the only xfs_scrub* symbols are the ones that pertain to both
scrub and repair.  Whitespace damage will be fixed in a subsequent
patch.  There are no functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
ef97ef26d2 xfs: clean up xfs_btree_del_cursor callers
Less trivial cleanups of the error argument to xfs_btree_del_cursor;
these require some minor code refactoring.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
0b04b6b875 xfs: trivial xfs_btree_del_cursor cleanups
The error argument to xfs_btree_del_cursor already understands the
"nonzero for error" semantics, so remove pointless error testing in the
callers and pass it directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
81b549aa62 xfs: return from _defer_finish with a clean transaction
The following assertion was seen on generic/051:

XFS: Assertion failed: tp->t_firstblock == NULLFSBLOCK, file: fs/xfs/libxfs5
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 20757 Comm: fsstress Not tainted 4.18.0-rc4+ #3969
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/4
RIP: 0010:assfail+0x23/0x30
Code: c3 66 0f 1f 44 00 00 48 89 f1 41 89 d0 48 c7 c6 88 e0 8c 82 48 89 fa
RSP: 0018:ffff88012dc43c08 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88012dc43ca0 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828480eb
RBP: ffff88012aa92758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: f000000000000000 R12: 0000000000000000
R13: ffff88012dc43d48 R14: ffff88013092e7e8 R15: 0000000000000014
FS:  00007f8d689b8e80(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8d689c7000 CR3: 000000012ba6a000 CR4: 00000000000006e0
Call Trace:
 xfs_defer_init+0xff/0x160
 xfs_reflink_remap_extent+0x31b/0xa00
 xfs_reflink_remap_blocks+0xec/0x4a0
 xfs_reflink_remap_range+0x3a1/0x650
 xfs_file_dedupe_range+0x39/0x50
 vfs_dedupe_file_range+0x218/0x260
 do_vfs_ioctl+0x262/0x6a0
 ? __se_sys_newfstat+0x3c/0x60
 ksys_ioctl+0x35/0x60
 __x64_sys_ioctl+0x11/0x20
 do_syscall_64+0x4b/0x190
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The root cause of the assertion failure is that xfs_defer_finish doesn't
roll the transaction after processing all the deferred items.  Therefore
it returns a dirty transaction to the caller, which leaves the caller at
risk of exceeding the transaction reservation if it logs more items.

Brian Foster's patchset to move the defer_ops firstblock into the
transaction requires t_firstblock == NULLFSBLOCK upon defer_ops
initialization, which is how this was noticed at all.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
65cfcc3897 xfs: check leaf attribute block freemap in verifier
Check the leaf attribute freemap when we're verifying the block.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:00 -07:00
Carlos Maiolino
5089eafffb libxfs: Fix a couple of sparse complaintis
No significant changes, just silence a couple of sparse errors.

Using cpu_to_be32(NULLAGINO), the NULLAGINO constant will be encoded in
BE as a constant, avoiding a BE -> CPU conversion every iteraction of
the loop, if be32_to_cpu(agi->agi_unlinked[i]) was used instead.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17 14:25:58 -07:00
Gustavo A. R. Silva
e4e542a683 xfs: use swap macro in xfs_dir2_leafn_rebalance
Make use of the swap macro and remove unnecessary variable *tmp*. This
makes the code easier to read and maintain. Also, slightly refactor some
code.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17 14:25:57 -07:00
Gustavo A. R. Silva
897992b7e3 xfs_bmap_util: use swap macro
Make use of the swap macro and remove some unnecessary variables.
This makes the code easier to read and maintain. Also, reduces the
stack usage.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17 14:25:57 -07:00
Gustavo A. R. Silva
1d5bebbafc xfs_attr_leaf: use swap macro in xfs_attr3_leaf_rebalance
Make use of the swap macro and remove some unnecessary variables.
This makes the code easier to read and maintain. Also, reduces the
stack usage.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17 14:25:57 -07:00
Darrick J. Wong
fa248de98a xfs: don't assume a left rmap when allocating a new rmap
The original rmap code assumed that there would always be at least one
rmap in the rmapbt (the AG sb/agf/agi) and so errored out if it didn't
find one.  This assumption isn't true for the rmapbt repair function
(and it won't be true for realtime rmap either), so remove the check and
just deal with the situation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-07-17 14:25:57 -07:00
Carlos Maiolino
efe8032773 xfs: Initialize variables in xfs_alloc_get_rec before using them
Make sure we initialize *bno and *len, before jumping to out_bad_rec
label, and risk calling xfs_warn() with uninitialized variables.

Coverity: 100898
Coverity: 1437081
Coverity: 1437129
Coverity: 1437191
Coverity: 1437201
Coverity: 1437212
Coverity: 1437341
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:36 -07:00
Eric Sandeen
a4722a643f xfs: remove unused iolock arg from xfs_break_dax_layouts
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:36 -07:00
Brian Foster
bb00b6f1e2 xfs: kill __xfs_buf_submit_common()
Now that there is only one caller, fold the common submission helper
into __xfs_buf_submit().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:35 -07:00
Brian Foster
6af88cda00 xfs: combine [a]sync buffer submission apis
The buffer I/O submission path consists of separate function calls
per type. The buffer I/O type is already controlled via buffer
state (XBF_ASYNC), however, so there is no real need for separate
submission functions.

Combine the buffer submission functions into a single function that
processes the buffer appropriately based on XBF_ASYNC. Retain an
internal helper with a conditional wait parameter to continue to
support batched !XBF_ASYNC submission/completion required by delwri
queues.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:35 -07:00
Brian Foster
e339dd8d8b xfs: use sync buffer I/O for sync delwri queue submission
If a delwri queue occurs of a buffer that sits on a delwri queue
wait list, the queue sets _XBF_DELWRI_Q without changing the state
of ->b_list. This occurs, for example, if another thread beats the
current delwri waiter thread to the buffer lock after I/O
completion. Once the waiter acquires the lock, it removes the buffer
from the wait list and leaves a buffer with _XBF_DELWRI_Q set but
not populated on a list. This results in a lost buffer submission
and in turn can result in assert failures due to _XBF_DELWRI_Q being
set on buffer reclaim or filesystem lockups if the buffer happens to
cover an item in the AIL.

This problem has been reproduced by repeated iterations of xfs/305
on high CPU count (28xcpu) systems with limited memory (~1GB). Dirty
dquot reclaim races with an xfsaild push of a separate dquot backed
by the same buffer such that the buffer sits on the reclaim wait
list at the time xfsaild attempts to queue it. Since the latter
dquot has been flush locked but the underlying buffer not submitted
for I/O, the dquot pins the AIL and causes the filesystem to
livelock.

This race is essentially made possible by the buffer lock cycle
involved with waiting on a synchronous delwri queue submission.
Close the race by using synchronous buffer I/O for respective delwri
queue submission. This means the buffer remains locked across the
I/O and so is inaccessible from other contexts while in the
intermediate wait list state. The sync buffer I/O wait mechanism is
factored into a helper such that sync delwri buffer submission and
serialization are batched operations.

Designed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:34 -07:00
Brian Foster
eaebb515f1 xfs: refactor buffer submission into a common helper
Sync and async buffer submission both do generally similar things
with a couple odd exceptions. Refactor the core buffer submission
code into a common helper to isolate buffer submission from
completion handling of synchronous buffer I/O.

This patch does not change behavior. It is a step towards support
for using synchronous buffer I/O via synchronous delwri queue
submission.

Designed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:34 -07:00
Brian Foster
5fdd97944e xfs: remove xfs_defer_init() firstblock param
All but one caller of xfs_defer_init() passes in the ->t_firstblock
of the associated transaction. The one outlier is
xlog_recover_process_intents(), which simply passes a dummy value
because a valid pointer is required. This firstblock variable can
simply be removed.

At this point we could remove the xfs_defer_init() firstblock
parameter and initialize ->t_firstblock directly. Even that is not
necessary, however, because ->t_firstblock is automatically
reinitialized in the new transaction on a transaction roll. Since
xfs_defer_init() should never occur more than once on a particular
transaction (since the corresponding finish will roll it), replace
the reinit from xfs_defer_init() with an assert that verifies the
transaction has a NULLFSBLOCK firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:33 -07:00
Brian Foster
9c3bf5da80 xfs: use ->t_firstblock in inode inactivate
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:32 -07:00
Brian Foster
f537538921 xfs: use ->t_firstblock in extent swap
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:32 -07:00
Brian Foster
381d592848 xfs: use ->t_firstblock in reflink cow block cancel
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:31 -07:00
Brian Foster
fb91f4b5d6 xfs: replace no-op firstblock init with ->t_firstblock
xfs_refcount_recover_cow_leftovers() has no need for a firstblock
variable and so passes an unrelated xfs_fsblock_t to
xfs_defer_init() to avoid declaring one. Replace this no-op
initialization with ->t_firstblock. This will be optimized away by
the removal of the xfs_defer_init() firstblock param.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:31 -07:00
Brian Foster
058529c5f5 xfs: use ->t_firstblock in dq alloc
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:30 -07:00
Brian Foster
64396ff2c2 xfs: remove xfs_alloc_arg firstblock field
The xfs_alloc_arg.firstblock field is used to control the starting
agno for an allocation. The structure already carries a pointer to
the transaction, which carries the current firstblock value.

Remove the field and access ->t_firstblock directly in the
allocation code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:30 -07:00
Brian Foster
cf612de732 xfs: remove xfs_btree_cur private firstblock field
The bmbt cursor private structure has a firstblock field that is
used to maintain locking order on bmbt allocations. The field holds
an actual firstblock value (as opposed to a pointer), so it is
initialized on cursor creation, updated on allocation and then the
value is transferred back to the source before the cursor is
destroyed.

This value is always transferred from and back to the ->t_firstblock
field. Since xfs_btree_cur already carries a reference to the
transaction, we can remove this field from xfs_btree_cur and the
associated copying. The bmbt allocations will update the value in
the transaction directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:29 -07:00
Brian Foster
280253d213 xfs: remove bmap format helpers firstblock params
The bmap format helpers receive firstblock via ->t_firstblock. Drop
the param and access it directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:29 -07:00
Brian Foster
92f9da30f5 xfs: remove bmap extent add helper firstblock params
The add extent helpers all receive firstblock via ->t_firstblock.
Drop the parameter and access it directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:28 -07:00
Brian Foster
94c07b4dba xfs: remove xfs_bmalloca firstblock field
The xfs_bmalloca.firstblock field carries the firstblock value from
the transaction into the bmap infrastructure. It's initialized in
one place from ->t_firstblock, so drop the field and access
->t_firstblock directly throughout the bmap code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:28 -07:00
Brian Foster
4b77a088d7 xfs: use ->t_firstblock in bmap extent split
Also remove the unnecessary xfs_bmap_split_extent_at() parameter.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:27 -07:00
Brian Foster
333f950c89 xfs: remove bmap insert/collapse firstblock param
The only callers pass ->t_firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:27 -07:00
Brian Foster
2af5284253 xfs: remove xfs_bunmapi() firstblock param
All callers pass ->t_firstblock from the current transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:25 -07:00
Brian Foster
a7beabeae2 xfs: remove xfs_bmapi_write() firstblock param
All callers pass ->t_firstblock from the current transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:25 -07:00
Brian Foster
d0a9d79572 xfs: use ->t_firstblock in insert/collapse range
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:24 -07:00
Brian Foster
580c4ff948 xfs: use ->t_firstblock in xfs_bmapi_remap()
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:24 -07:00
Brian Foster
372837978d xfs: use ->t_firstblock for all xfs_bunmapi() callers
Convert all xfs_bunmapi() callers to ->t_firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:23 -07:00
Brian Foster
650919f131 xfs: use ->t_firstblock for all xfs_bmapi_write() callers
Convert all xfs_bmapi_write() users to ->t_firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:23 -07:00
Brian Foster
766139032f xfs: use ->t_firstblock in xattr ops
Similar to the dirops code, the xattr code uses an on-stack
firstblock variable for the various operations. This code rolls the
underlying transaction in various places, however, which means we
cannot simply replace the local firstblock vars with ->t_firstblock.
Doing so (without further changes) would invalidate the memory
pointed to by xfs_da_args.firstblock as soon as the first
transaction rolls.

To avoid this problem, remove xfs_da_args.firstblock and replace all
such accesses with ->t_firstblock at the same time. This ensures
that accesses to the current firstblock always occur through the
current transaction rather than a potentially invalid xfs_da_args
pointer.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:22 -07:00
Brian Foster
825d75cd8c xfs: use ->t_firstblock in attrfork add
Note that this codepath is a user of struct xfs_da_args. Switch it
over to ->t_firstblock in preparation to remove
xfs_da_args.firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:21 -07:00
Brian Foster
381eee69f8 xfs: remove firstblock param from xfs dir ops
All callers of the xfs_dir_*() functions pass ->t_firstblock as the
firstblock parameter. Drop the parameter and access ->t_firstblock
directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:21 -07:00
Brian Foster
f16dea54b7 xfs: use ->t_firstblock in dir ops
Callers of the xfs_dir_*() functions currently pass an on-stack
firstblock variable. While the dirops infrastructure carries a
pointer to this variable, it never rolls the transaction and so it
is safe to use ->t_firstblock instead.

Fix up the various xfs_dir_*() callers to use ->t_firstblock. Also
remove the unnecessary parameter for xfs_cross_rename().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:20 -07:00
Brian Foster
bba59c5e4b xfs: add firstblock field to xfs_trans
A firstblock var is typically allocated and initialized along with
xfs_defer_ops structures and passed around independent from the
associated transaction. To facilitate combining the two, add an
optional ->t_firstblock field to xfs_trans that can be used in place
of an on-stack variable.

The firstblock value follows the lifetime of the transaction, so
initialize it on allocation and when a transaction rolls.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:20 -07:00
Brian Foster
3ae2d89174 xfs: allow null firstblock in xfs_bmapi_write() when tp is null
xfs_bmapi_write() always expects a valid firstblock pointer. It
immediately dereferences the pointer to help determine how to
initialize the bma.minleft field. The remaining accesses are
related to modifying btree format forks, which is only relevant for
!COW fork callers.

The reflink code passes a NULL transaction to xfs_bmapi_write() in a
couple places that do COW fork unwritten conversion. The purpose of
the firstblock field is to track the first block allocation in the
current transaction, so technically firstblock should not be
required for these callers either.

Tweak xfs_bmapi_write() to initialize the bma correctly without
accessing the firstblock pointer if no transaction is provided in
the first place. Update the reflink callers to pass NULL instead of
otherwise unused firstblock references.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:19 -07:00
Brian Foster
bcd2c9f335 xfs: refactor dfops init to attach to transaction
Most callers of xfs_defer_init() immediately attach the dfops
structure to a transaction. Add a transaction parameter to eliminate
much of this boilerplate code. This also helps self-document the
fact that many codepaths now expect a dfops pointer implicitly via
xfs_trans->t_dfops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:19 -07:00
Brian Foster
d5669ed581 xfs: use ->t_dfops in reflink cow recover path
Use ->t_dfops of the leftover COW reservation cleanup transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:18 -07:00
Brian Foster
27356a063a xfs: use ->t_dfops in cancel cow blocks operation
Use ->t_dfops of the transaction from the caller. Reset it before we
return to avoid leaks of local stack memory.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:18 -07:00
Brian Foster
7a7943c7e0 xfs: use ->t_dfops for rmap extent swap operations
xfs_swap_extent_rmap() uses a local dfops instance with a
transaction from the caller. Since there is only one caller, pull
the dfops structure into the caller and attach it to the
transaction. This avoids the need to clear ->t_dfops to prevent
invalid stack memory access.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:17 -07:00
Brian Foster
ed7ef8e55c xfs: remove unused btree cursor bc_private.a.dfops field
The xfs_btree_cur.bc_private.a.dfops field is only ever initialized
by the refcountbt cursor init function. The only caller of that
function with a non-NULL dfops is from deferred completion context,
which already has attached to ->t_dfops.

In addition to that, the only actual reference of a.dfops is the
cursor duplication function, which means the field is effectively
unused.

Remove the dfops field from the bc_private.a union. Any future users
can acquire the dfops from the transaction. This patch does not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:17 -07:00
Brian Foster
42b394a925 xfs: remove xfs_btree_cur bmbt dfops field
All assignments of xfs_btree_cur.bc_private.b.dfops originate from
->t_dfops. Replace accesses of the former with the latter and remove
the unnecessary field. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:16 -07:00
Brian Foster
81ba8f3e94 xfs: remove dfops param from internal bmap extent helpers
All callers of the various bmap extent helpers now use ->t_dfops.
Remove the unnecessary dfops params and access ->t_dfops directly.
This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:16 -07:00
Brian Foster
f4a9cf97fa xfs: use ->t_dfops for collapse/insert range operations
Use ->t_dfops for the collapse and insert range transactions. These
are the only callers of the respective bmap helpers, so replace the
unnecessary dfops parameters with direct accesses to ->t_dfops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:15 -07:00
Brian Foster
3e3673e302 xfs: remove struct xfs_bmalloca dfops field
Now that bma.dfops is only assigned from ->t_dfops, replace all
accesses to the former with the latter and remove the unnecessary
field. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:14 -07:00
Brian Foster
ff3edf255d xfs: remove xfs_bmapi_remap() dfops param
All xfs_bmapi_remap() callers already use ->t_dfops. Note that
deferred completion context unconditionally sets ->t_dfops if it
hasn't already been set by the caller. Remove the unnecessary
parameter and access ->t_dfops directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:14 -07:00
Brian Foster
ccd9d91148 xfs: remove xfs_bunmapi() dfops param
Now that all xfs_bunmapi() callers use ->t_dfops, remove the
unnecessary parameter and access ->t_dfops directly. This patch does
not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:13 -07:00
Brian Foster
4bcfa613a0 xfs: use ->t_dfops for all xfs_bunmapi() callers
Use ->t_dfops for all remaining xfs_bunmapi() callers. This prepares
the latter to no longer require a dfops parameter.

Note that xfs_itruncate_extents_flags() associates a local dfops
with a transaction provided from the caller. Since there are
multiple callers, set and reset ->t_dfops before the function
returns to avoid exposure of stack memory to the caller.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:13 -07:00
Brian Foster
6e702a5dcb xfs: remove xfs_bmapi_write() dfops param
Now that all callers use ->t_dfops, the xfs_bmapi_write() dfops
parameter is no longer necessary. Remove it and access ->t_dfops
directly. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:12 -07:00
Brian Foster
175d1a013e xfs: use ->t_dfops for all xfs_bmapi_write() callers
Attach ->t_dfops for all remaining callers of xfs_bmapi_write().
This prepares the latter to no longer require a separate dfops
parameter.

Note that xfs_symlink() already uses ->t_dfops. Fix up the local
references for consistency.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:12 -07:00
Brian Foster
2ba1372125 xfs: use ->t_dfops in dqalloc transaction
xfs_dquot_disk_alloc() receives a transaction from the caller and
passes a local dfops along to xfs_bmapi_write(). If we attach this
dfops to the transaction, we have to make sure to clear it before
returning to avoid invalid access of stack memory.

Since xfs_qm_dqread_alloc() is the only caller, pull dfops into the
caller and attach it to the transaction to eliminate this pattern
entirely.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:11 -07:00
Brian Foster
32a9b7c65c xfs: replace xfs_da_args->dfops accesses with ->t_dfops and remove
Now that xfs_da_args->dfops is always assigned from a ->t_dfops
pointer (or one that is immediately attached), replace all
downstream accesses of the former with the latter and remove the
field from struct xfs_da_args.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:11 -07:00
Brian Foster
d76e6ce8ed xfs: use ->t_dfops in extent split tx and remove param
Attach the local dfops to ->t_dfops of the extent split transaction.
Since this is the only caller of xfs_bmap_split_extent_at(), remove
the dfops parameter as well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:10 -07:00
Brian Foster
0bd6207f83 xfs: remove dfops param in attr fork add path
Now that the attribute fork add tx carries dfops along with the
transaction, it is unnecessary to pass it down the stack. Remove the
dfops parameter and access ->t_dfops directly where necessary. This
patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:10 -07:00
Brian Foster
40d03ac6aa xfs: use ->t_dfops for attr set/remove operations
Attach the local dfops to the transaction allocated for xattr add
and remove operations. Add an earlier initialization in
xfs_attr_remove() to ensure the structure is valid if it remains
unused at transaction commit time.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:09 -07:00
Brian Foster
813d08cb6d xfs: use ->t_dfops for recovery of [b|c]ui log items
Log recovery passes down a central dfops structure to recovery
handlers for bui and cui log items. Each of these handlers allocates
and commits a transaction and defers any remaining operations to be
completed by the main recovery sequence.

Since dfops outlives the transaction in this context, set and clear
->t_dfops appropriately such that the *_finish_item() paths and
below (i.e., xfs_bmapi*()) can expect to find the dfops in the
transaction without it being committed with the dfops attached. This
is required because transaction commit expects that an associated
dfops is finished and in this context the dfops may be populated at
commit time.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:09 -07:00
Brian Foster
c9cfdb3811 xfs: remove dfops param from high level dirname calls
All callers of the directory create, rename and remove interfaces
already associate the dfops with the transaction. Drop the dfops
parameters in these calls in preparation for further cleanups in the
layers below. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:08 -07:00
Brian Foster
0e0417f3e5 xfs: remove dfops parameter from ifree call stack
The inode free callchain starting in xfs_inactive_ifree() already
associates its dfops with the transaction. It still passes the dfops
on the stack down through xfs_difree_inobt(), however.

Clean up the call stack and reference dfops directly from the
transaction. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:07 -07:00
Brian Foster
6aa6718439 xfs: rename xfs_trans ->t_agfl_dfops to ->t_dfops
The ->t_agfl_dfops field is currently used to defer agfl block frees
from associated transaction contexts. While all known problematic
contexts have already been updated to use ->t_agfl_dfops, the
broader goal is defer agfl frees from all callers that already use a
deferred operations structure. Further, the transaction field
facilitates a good amount of code clean up where the transaction and
dfops have historically been passed down through the stack
separately.

Rename the field to something more generic to prepare to use it as
such throughout XFS. This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:07 -07:00
Brian Foster
8a74938649 xfs: cow unwritten conversion uses uninitialized dfops
A couple COW fork unwritten extent conversion helpers pass an
uninitialized dfops pointer to xfs_bmapi_write(). This does not
cause problems because conversion does not use a transaction or the
dfops structure for the COW fork.  Drop the uninitialized usage of
dfops in these codepaths and pass NULL along to xfs_bmapi_write()
instead.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:06 -07:00
Christoph Hellwig
98c1a7c0ec xfs: update my copyrights for the writeback and iomap code
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:06 -07:00
Christoph Hellwig
82cb14175e xfs: add support for sub-pagesize writeback without buffer_heads
Switch to using the iomap_page structure for checking sub-page uptodate
status and track sub-page I/O completion status, and remove large
quantities of boilerplate code working around buffer heads.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:05 -07:00
Christoph Hellwig
ac8ee54669 xfs: allow writeback on pages without buffer heads
Disable the IOMAP_F_BUFFER_HEAD flag on file systems with a block size
equal to the page size, and deal with pages without buffer heads in
writeback.  Thanks to the previous refactoring this is basically trivial
now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:04 -07:00
Christoph Hellwig
8e1f065bea xfs: refactor the tail of xfs_writepage_map
Rejuggle how we deal with the different error vs non-error and have
ioends vs not have ioend cases to keep the fast path streamlined, and
the duplicate code at a minimum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:04 -07:00
Christoph Hellwig
1b65d3dd2d xfs: remove xfs_start_page_writeback
This helper only has two callers, one of them with a constant error
argument.  Remove it to make pending changes to the code a little easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:03 -07:00
Christoph Hellwig
6d465e8953 xfs: move all writeback buffer_head manipulation into xfs_map_at_offset
This keeps it in a single place so it can be made otional more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:03 -07:00
Christoph Hellwig
3faed66764 xfs: don't look at buffer heads in xfs_add_to_ioend
Calculate all information for the bio based on the passed in information
without requiring a buffer_head structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:02 -07:00
Christoph Hellwig
889c65b3f6 xfs: remove the imap_valid flag
Simplify the way we check for a valid imap - we know we have a valid
mapping after xfs_map_blocks returned successfully, and we know we can
call xfs_imap_valid on any imap, as it will always fail on a
zero-initialized map.

We can also remove the xfs_imap_valid function and fold it into
xfs_map_blocks now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:02 -07:00
Christoph Hellwig
3345746ef3 xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly
xfs_bmapi_read adds zero value in xfs_map_blocks.  Replace it with a
direct call to the low-level extent lookup function.

Note that we now always pass a 0 length to the trace points as we ask
for an unspecified len.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:02 -07:00
Christoph Hellwig
060d4eaa0b xfs: remove xfs_reflink_find_cow_mapping
We only have one caller left, and open coding the simple extent list
lookup in it allows us to make the code both more understandable and
reuse calculations and variables already present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:01 -07:00
Christoph Hellwig
c3a2f9fff1 xfs: remove the now unused XFS_BMAPI_IGSTATE flag
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:01 -07:00
Dave Chinner
e2f6ad4624 xfs: make xfs_writepage_map extent map centric
xfs_writepage_map() iterates over the bufferheads on a page to decide
what sort of IO to do and what actions to take.  However, when it comes
to reflink and deciding when it needs to execute a COW operation, we no
longer look at the bufferhead state but instead we ignore than and look
up internal state held in the COW fork extent list.

This means xfs_writepage_map() is somewhat confused. It does stuff, then
ignores it, then tries to handle the impedence mismatch by shovelling the
results inside the existing mapping code.  It works, but it's a bit of a
mess and it makes it hard to fix the cached map bug that the writepage
code currently has.

To unify the two different mechanisms, we first have to choose a direction.
That's already been set - we're de-emphasising bufferheads so they are no
longer a control structure as we need to do taht to allow for eventual
removal.  Hence we need to move away from looking at bufferhead state to
determine what operations we need to perform.

We can't completely get rid of bufferheads yet - they do contain some
state that is absolutely necessary, such as whether that part of the page
contains valid data or not (buffer_uptodate()).  Other state in the
bufferhead is redundant:

	BH_dirty - the page is dirty, so we can ignore this and just
		write it
	BH_delay - we have delalloc extent info in the DATA fork extent
		tree
	BH_unwritten - same as BH_delay
	BH_mapped - indicates we've already used it once for IO and it is
		mapped to a disk address. Needs to be ignored for COW
		blocks.

The BH_mapped flag is an interesting case - it's supposed to indicate that
it's already mapped to disk and so we can just use it "as is".  In theory,
we don't even have to do an extent lookup to find where to write it too,
but we have to do that anyway to determine we are actually writing over a
valid extent.  Hence it's not even serving the purpose of avoiding a an
extent lookup during writeback, and so we can pretty much ignore it.
Especially as we have to ignore it for COW operations...

Therefore, use the extent map as the source of information to tell us
what actions we need to take and what sort of IO we should perform.  The
first step is to have xfs_map_blocks() set the io type according to what
it looks up.  This means it can easily handle both normal overwrite and
COW cases.  The only thing we also need to add is the ability to return
hole mappings.

We need to return and cache hole mappings now for the case of multiple
blocks per page.  We no longer use the BH_mapped to indicate a block over
a hole, so we have to get that info from xfs_map_blocks().  We cache it so
that holes that span two pages don't need separate lookups.  This allows us
to avoid ever doing write IO over a hole, too.

Now that we have xfs_map_blocks() returning both a cached map and the type
of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
the bufferhead control. It's also much simplified because it doesn't need
to explicitly handle COW operations.  Instead of iterating bufferheads, it
iterates blocks within the page and then looks up what per-block state is
required from the appropriate bufferhead.  It then validates the cached
map, and if it's not valid, we get a new map.  If we don't get a valid map
or it's over a hole, we skip the block.

At this point, we have to remap the bufferhead via xfs_map_at_offset().
As previously noted, we had to do this even if the buffer was already
mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
written delalloc extents beyond EOF can be reported at XFS_IO_OVERWRITE.
Bufferheads that span such regions still need their BH_Delay flags cleared
and their block numbers calculated, so we now unconditionally map each
bufferhead before submission.

But wait! There's more - remember the old "treat unwritten extents as
holes on read" hack?  Yeah, that means we can have a dirty page with
unmapped, unwritten bufferheads that contain data!  What makes these so
special is that the unwritten "hole" bufferheads do not have a valid block
device pointer, so if we attempt to write them xfs_add_to_ioend() blows
up. So we make xfs_map_at_offset() do the "realtime or data device"
lookup from the inode and ignore what was or wasn't put into the
bufferhead when the buffer was instantiated.

The astute reader will have realised by now that this code treats
unwritten extents in multiple-blocks-per-page situations differently.
If we get any combination of unwritten blocks on a dirty page that contain
valid data in the page, we're going to convert them to real extents.  This
can actually be a win, because it means that pages with interleaving
unwritten and written blocks will get converted to a single written extent
with zeros replacing the interspersed unwritten blocks.  This is actually
good for reducing extent list and conversion overhead, and it means we
issue a contiguous IO instead of lots of little ones.  The downside is
that we use up a little extra IO bandwidth.  Neither of these seem like a
bad thing given that spinning disks are seek sensitive, and SSDs/pmem have
bandwidth to burn and the lower Io latency/CPU overhead of fewer, larger
IOs will result in better performance on them...

As a result of all this, the only state we actually care about from the
bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to
pass some information to the bio via xfs_add_to_ioend(), but that is
trivial to separate and pass explicitly.  This means we really only need
1 bit of state per block per page from the buffered write path in the
writeback path.  Everything else we do with the bufferhead is purely to
make the buffered IO front end continue to work correctly. i.e we've
pretty much marginalised bufferheads in the writeback path completely.

Signed-off-By: Dave Chinner <dchinner@redhat.com>
[hch: forward port, refactor and split off bits into other commits]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:00 -07:00
Christoph Hellwig
6a4c950136 xfs: rename the offset variable in xfs_writepage_map
Calling it file_offset makes the usage more clear, especially with
a new poffset variable that will be added soon for the offset inside
the page.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:00 -07:00
Christoph Hellwig
5c665e5b5a xfs: remove xfs_map_cow
We can handle the existing cow mapping case as a special case directly
in xfs_writepage_map, and share code for allocating delalloc blocks
with regular I/O in xfs_map_blocks.  This means we need to always
call xfs_map_blocks for reflink inodes, but we can still skip most of
the work if it turns out that there is no COW mapping overlapping the
current block.

As a subtle detail we need to start caching holes in the wpc to deal
with the case of COW reservations between EOF.  But we'll need that
infrastructure later anyway, so this is no big deal.

Based on a patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:59 -07:00
Christoph Hellwig
fca8c80542 xfs: remove xfs_reflink_trim_irec_to_next_cow
We already have to check for overlapping COW extents everytime we
come back to a page in xfs_writepage_map / xfs_map_cow, so this
additional trim is not required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:59 -07:00
Christoph Hellwig
a7b28f72ab xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks
We want to be able to use the extent state as a reliably indicator for
the type of I/O, and stop using the buffer head state.  For this we
need to stop using the XFS_BMAPI_IGSTATE so that we don't see merged
extents of different types.

Based on a patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:59 -07:00
Christoph Hellwig
c57371a16d xfs: don't clear imap_valid for a non-uptodate buffers
Finding a buffer that isn't uptodate doesn't invalidate the mapping for
any given block.  The last_sector check will already take care of starting
another ioend as soon as we find any non-update buffer, and if the current
mapping doesn't include the next uptodate buffer the xfs_imap_valid check
will take care of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:58 -07:00
Christoph Hellwig
91cdfd1761 xfs: do not set the page uptodate in xfs_writepage_map
We already track the page uptodate status based on the buffer uptodate
status, which is updated whenever reading or zeroing blocks.

This code has been there since commit a ptool commit in 2002, which
claims to:

    "merge" the 2.4 fsx fix for block size < page size to 2.5.  This needed
    major changes to actually fit.

and isn't present in other writepage implementations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:58 -07:00
Christoph Hellwig
d438017757 xfs: move locking into xfs_bmap_punch_delalloc_range
Both callers want the same looking, so do it only once.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:57 -07:00
Christoph Hellwig
0362572138 xfs: simplify xfs_aops_discard_page
Instead of looking at the buffer heads to see if a block is delalloc just
call xfs_bmap_punch_delalloc_range on the whole page - this will leave
any non-delalloc block intact and handle the iteration for us.  As a side
effect one more place stops caring about buffer heads and we can remove the
xfs_check_page_type function entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:57 -07:00
Christoph Hellwig
8b2e77c163 xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages
For file systems with a block size that equals the page size we never do
partial reads, so we can use the buffer_head-less iomap versions of
readpage and readpages without conflicting with the buffer_head structures
create later in write_begin.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:56 -07:00
Darrick J. Wong
c2efdfc100 Merge branch 'iomap-4.19-merge' into xfs-4.19-merge 2018-07-11 22:24:40 -07:00
Miklos Szeredi
87eb5eb242 vfs: dedupe: rationalize args
Clean up f_op->dedupe_file_range() interface.

1) Use loff_t for offsets and length instead of u64
2) Order the arguments the same way as {copy|clone}_file_range().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-06 23:57:03 +02:00
Miklos Szeredi
5740c99e9d vfs: dedupe: return int
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-06 23:57:03 +02:00
Carlos Maiolino
9991274fdd xfs: Initialize variables in xfs_alloc_get_rec before using them
Make sure we initialize *bno and *len, before jumping to out_bad_rec
label, and risk calling xfs_warn() with uninitialized variables.

Coverity: 100898
Coverity: 1437081
Coverity: 1437129
Coverity: 1437191
Coverity: 1437201
Coverity: 1437212
Coverity: 1437341
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-28 06:56:23 -07:00
Darrick J. Wong
d8cb5e4237 xfs: fix fdblocks accounting w/ RMAPBT per-AG reservation
In __xfs_ag_resv_init we incorrectly calculate the amount by which to
decrease fdblocks when reserving blocks for the rmapbt.  Because rmapbt
allocations do not decrease fdblocks, we must decrease fdblocks by the
entire size of the requested reservation in order to achieve our goal of
always having enough free blocks to satisfy an rmapbt expansion.

This is in contrast to the refcountbt/finobt, which /do/ subtract from
fdblocks whenever they allocate a block.  For this allocation type we
preserve the existing behavior where we decrease fdblocks only by the
requested reservation minus the size of the existing tree.

This fixes the problem where the available block counts reported by
statfs change across a remount if there had been an rmapbt size change
since mount time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-06-24 12:00:12 -07:00
Darrick J. Wong
e53c4b5983 xfs: ensure post-EOF zeroing happens after zeroing part of a file
If a user asks us to zero_range part of a file, the end of the range is
EOF, and not aligned to a page boundary, invoke writeback of the EOF
page to ensure that the post-EOF part of the page is zeroed.  This
ensures that we don't expose stale memory contents via mmap, if in a
clumsy manner.

Found by running generic/127 when it runs zero_range and mapread at EOF
one after the other.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
a3a374bf18 xfs: fix off-by-one error in xfs_rtalloc_query_range
In commit 8ad560d256 ("xfs: strengthen rtalloc query range checks")
we strengthened the input parameter checks in the rtbitmap range query
function, but introduced an off-by-one error in the process.  The call
to xfs_rtfind_forw deals with the high key being rextents, but we clamp
the high key to rextents - 1.  This causes the returned results to stop
one block short of the end of the rtdev, which is incorrect.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
232d0a24b0 xfs: fix uninitialized field in rtbitmap fsmap backend
Initialize the extent count field of the high key so that when we use
the high key to synthesize an 'unknown owner' record (i.e. used space
record) at the end of the queried range we have a field with which to
compute rm_blockcount.  This is not strictly necessary because the
synthesizer never uses the rm_blockcount field, but we can shut up the
static code analysis anyway.

Coverity-id: 1437358
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
5bd88d1539 xfs: recheck reflink state after grabbing ILOCK_SHARED for a write
The reflink iflag could have changed since the earlier unlocked check,
so if we got ILOCK_SHARED for a write and but we're now a reflink inode
we have to switch to ILOCK_EXCL and relock.

This helps us avoid blowing lock assertions in things like generic/166:

XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/xfs_reflink.c, line: 383
WARNING: CPU: 1 PID: 24707 at fs/xfs/xfs_message.c:104 assfail+0x25/0x30 [xfs]
Modules linked in: deadline_iosched dm_snapshot dm_bufio ext4 mbcache jbd2 dm_flakey xfs libcrc32c dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: scsi_debug]
CPU: 1 PID: 24707 Comm: xfs_io Not tainted 4.18.0-rc1-djw #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:assfail+0x25/0x30 [xfs]
Code: ff 0f 0b c3 90 66 66 66 66 90 48 89 f1 41 89 d0 48 c7 c6 e8 ef 1b a0 48 89 fa 31 ff e8 54 f9 ff ff 80 3d fd ba 0f 00 00 75 03 <0f> 0b c3 0f 0b 66 0f 1f 44 00 00 66 66 66 66 90 48 63 f6 49 89 f9
RSP: 0018:ffffc90006423ad8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880030b65e80 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa01b0447
RBP: ffffc90006423c10 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88003d43fc30 R11: f000000000000000 R12: ffff880077cda000
R13: 0000000000000000 R14: ffffc90006423c30 R15: ffffc90006423bf9
FS:  00007feba8986800(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000138ab58 CR3: 000000003d40a000 CR4: 00000000000006a0
Call Trace:
 xfs_reflink_allocate_cow+0x24c/0x3d0 [xfs]
 xfs_file_iomap_begin+0x6d2/0xeb0 [xfs]
 ? iomap_to_fiemap+0x80/0x80
 iomap_apply+0x5e/0x130
 iomap_dio_rw+0x2e0/0x400
 ? iomap_to_fiemap+0x80/0x80
 ? xfs_file_dio_aio_write+0x133/0x4a0 [xfs]
 xfs_file_dio_aio_write+0x133/0x4a0 [xfs]
 xfs_file_write_iter+0x7b/0xb0 [xfs]
 __vfs_write+0x16f/0x1f0
 vfs_write+0xc8/0x1c0
 ksys_pwrite64+0x74/0x90
 do_syscall_64+0x56/0x180
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
f62cb48e43 xfs: don't allow insert-range to shift extents past the maximum offset
Zorro Lang reports that generic/485 blows an assert on a filesystem with
512 byte blocks.  The test tries to fallocate a post-eof extent at the
maximum file size and calls insert range to shift the extents right by
two blocks.  On a 512b block filesystem this causes startoff to overflow
the 54-bit startoff field, leading to the assert.

Therefore, always check the rightmost extent to see if it would overflow
prior to invoking the insert range machinery.

Reported-by: zlang@redhat.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200137
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
aafe12cee0 xfs: don't trip over negative free space in xfs_reserve_blocks
If we somehow end up with a filesystem that has fewer free blocks than
the blocks set aside to avoid ENOSPC deadlocks, it's possible that the
free space calculation in xfs_reserve_blocks will spit out a negative
number (because percpu_counter_sum returns s64).  We fail to notice
this negative number and set fdblks_delta to it.  Now we increment
fdblocks(!) and the unsigned type of m_resblks means that we end up
setting a ridiculously huge m_resblks reservation.

Avoid this comedy of errors by detecting the negative free space and
returning -ENOSPC.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
10ee25268e xfs: allow empty transactions while frozen
In commit e89c041338 ("xfs: implement the GETFSMAP ioctl") we
created the ability to obtain empty transactions.  These transactions
have no log or block reservations and therefore can't modify anything.
Since they're also NO_WRITECOUNT they can run while the fs is frozen,
so we don't need to WARN_ON about that usage.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:35 -07:00
Dave Chinner
e53946dbd3 xfs: xfs_iflush_abort() can be called twice on cluster writeback failure
When a corrupt inode is detected during xfs_iflush_cluster, we can
get a shutdown ASSERT failure like this:

XFS (pmem1): Metadata corruption detected at xfs_symlink_shortform_verify+0x5c/0xa0, inode 0x86627 data fork
XFS (pmem1): Unmount and run xfs_repair
XFS (pmem1): xfs_do_force_shutdown(0x8) called from line 3372 of file fs/xfs/xfs_inode.c.  Return address = ffffffff814f4116
XFS (pmem1): Corruption of in-memory data detected.  Shutting down filesystem
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8a88
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8ef9
XFS (pmem1): Please umount the filesystem and rectify the problem(s)
XFS: Assertion failed: xfs_isiflocked(ip), file: fs/xfs/xfs_inode.h, line: 258
.....
Call Trace:
 xfs_iflush_abort+0x10a/0x110
 xfs_iflush+0xf3/0x390
 xfs_inode_item_push+0x126/0x1e0
 xfsaild+0x2c5/0x890
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30

Essentially, xfs_iflush_abort() has been called twice on the
original inode that that was flushed. This happens because the
inode has been flushed to teh buffer successfully via
xfs_iflush_int(), and so when another inode is detected as corrupt
in xfs_iflush_cluster, the buffer is marked stale and EIO, and
iodone callbacks are run on it.

Running the iodone callbacks walks across the original inode and
calls xfs_iflush_abort() on it. When xfs_iflush_cluster() returns
to xfs_iflush(), it runs the error path for that function, and that
calls xfs_iflush_abort() on the inode a second time, leading to the
above assert failure as the inode is not flush locked anymore.

This bug has been there a long time.

The simple fix would be to just avoid calling xfs_iflush_abort() in
xfs_iflush() if we've got a failure from xfs_iflush_cluster().
However, xfs_iflush_cluster() has magic delwri buffer handling that
means it may or may not have run IO completion on the buffer, and
hence sometimes we have to call xfs_iflush_abort() from
xfs_iflush(), and sometimes we shouldn't.

After reading through all the error paths and the delwri buffer
code, it's clear that the error handling in xfs_iflush_cluster() is
unnecessary. If the buffer is delwri, it leaves it on the delwri
list so that when the delwri list is submitted it sees a shutdown
fliesystem in xfs_buf_submit() and that marks the buffer stale, EIO
and runs IO completion. i.e. exactly what xfs+iflush_cluster() does
when it's not a delwri buffer. Further, marking a buffer stale
clears the _XBF_DELWRI_Q flag on the buffer, which means when
submission of the buffer occurs, it just skips over it and releases
it.

IOWs, the error handling in xfs_iflush_cluster doesn't need to care
if the buffer is already on a the delwri queue or not - it just
needs to mark the buffer stale, EIO and run completions. That means
we can just use the easy fix for xfs_iflush() to avoid the double
abort.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-21 23:31:38 -07:00
Dave Chinner
23fcb3340d xfs: More robust inode extent count validation
When the inode is in extent format, it can't have more extents that
fit in the inode fork. We don't currenty check this, and so this
corruption goes unnoticed by the inode verifiers. This can lead to
crashes operating on invalid in-memory structures.

Attempts to access such a inode will now error out in the verifier
rather than allowing modification operations to proceed.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a typedef, add some braces and breaks to shut up compiler warnings]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-21 23:25:57 -07:00
Christoph Hellwig
e2ac836307 xfs: simplify xfs_bmap_punch_delalloc_range
Instead of using xfs_bmapi_read to find delalloc extents and then punch
them out using xfs_bunmapi, opencode the loop to iterate over the extents
and call xfs_bmap_del_extent_delay directly.  This both simplifies the
code and reduces the number of extent tree lookups required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-21 23:24:38 -07:00
Christoph Hellwig
c03cea4214 iomap: add initial support for writes without buffer heads
For now just limited to blocksize == PAGE_SIZE, where we can simply read
in the full page in write begin, and just set the whole page dirty after
copying data into it.  This code is enabled by default and XFS will now
be feed pages without buffer heads in ->writepage and ->writepages.

If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
path will still be used, this both helps the transition in XFS and
prepares for the gfs2 migration to the iomap infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-20 09:32:41 -07:00
Linus Torvalds
7a932516f5 vfs/y2038: inode timestamps conversion to timespec64
This is a late set of changes from Deepa Dinamani doing an automated
 treewide conversion of the inode and iattr structures from 'timespec'
 to 'timespec64', to push the conversion from the VFS layer into the
 individual file systems.
 
 There were no conflicts between this and the contents of linux-next
 until just before the merge window, when we saw multiple problems:
 
 - A minor conflict with my own y2038 fixes, which I could address
   by adding another patch on top here.
 - One semantic conflict with late changes to the NFS tree. I addressed
   this by merging Deepa's original branch on top of the changes that
   now got merged into mainline and making sure the merge commit includes
   the necessary changes as produced by coccinelle.
 - A trivial conflict against the removal of staging/lustre.
 - Multiple conflicts against the VFS changes in the overlayfs tree.
   These are still part of linux-next, but apparently this is no longer
   intended for 4.18 [1], so I am ignoring that part.
 
 As Deepa writes:
 
   The series aims to switch vfs timestamps to use struct timespec64.
   Currently vfs uses struct timespec, which is not y2038 safe.
 
   The series involves the following:
   1. Add vfs helper functions for supporting struct timepec64 timestamps.
   2. Cast prints of vfs timestamps to avoid warnings after the switch.
   3. Simplify code using vfs timestamps so that the actual
      replacement becomes easy.
   4. Convert vfs timestamps to use struct timespec64 using a script.
      This is a flag day patch.
 
   Next steps:
   1. Convert APIs that can handle timespec64, instead of converting
      timestamps at the boundaries.
   2. Update internal data structures to avoid timestamp conversions.
 
 Thomas Gleixner adds:
 
   I think there is no point to drag that out for the next merge window.
   The whole thing needs to be done in one go for the core changes which
   means that you're going to play that catchup game forever. Let's get
   over with it towards the end of the merge window.
 
 [1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
 MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
 g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
 L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
 Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
 LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
 nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
 wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
 c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
 tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
 WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
 A3HkjIBrKW5AgQDxfgvm
 =CZX2
 -----END PGP SIGNATURE-----

Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
 "This is a late set of changes from Deepa Dinamani doing an automated
  treewide conversion of the inode and iattr structures from 'timespec'
  to 'timespec64', to push the conversion from the VFS layer into the
  individual file systems.

  As Deepa writes:

   'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timepec64
       timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
       becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
       This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
       timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

  Thomas Gleixner adds:

   'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
  pstore: Remove bogus format string definition
  vfs: change inode times to use struct timespec64
  pstore: Convert internal records to timespec64
  udf: Simplify calls to udf_disk_stamp_to_time
  fs: nfs: get rid of memcpys for inode times
  ceph: make inode time prints to be long long
  lustre: Use long long type to print inode time
  fs: add timespec64_truncate()
2018-06-15 07:31:07 +09:00
Linus Torvalds
a205f0c974 Changes since last update:
- Strengthen metadata checking to avoid ASSERTing on bad disk contents
 - Validate btree records that are being retrieved for clients
 - Strengthen root inode verification
 - Convert license blurbs to SPDX tags
 - Enable changing DAX flag on directories
 - Fix some writeback deadlocks in reflink
 - Refactor out some old xfs helpers
 - Move type verifiers to a separate file
 - Fix some fuzzer crashes
 - Various other bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsfUegACgkQ+H93GTRK
 tOv0sw/9HW63zhuzGs9uihmyGvtqNeeUBfKc+3ovJ80wnhYOa0n8yYCBORS1EaMS
 YPk74IbD0yAak0H9ePdOpont43gGVTDox6/K8+6rPFnWtX30Z2/ckb6BWE4UfoW8
 QeojpB2+aS6fqfO1wcSb3i//XRu4h90ORQY0xNkHYcN4GWwIDwPCyBf+AT9HH1E+
 GFHtB3QWANZg6LRT7X0GVgz5r68lzyxX1WisJ4uAm0NwKR5zVb9NWFCcOszQ45Ky
 +YFw4kfgithbIHlwTpo3LrvQk7+cBhlSpWuASZOYjugxcQ2d85B/+9mF/QDnLOey
 ddbO6WK+wo0KZImpFvOOQZY07cO7vtWwkWHraz0PkUdaEab5rcnooLoJg9UTMZa4
 WT8wM8CrX1kkFvJQCuAMV9jblovjETeYhHfG8ak8Z/lWc3WEnEBUFQiO9ZVQdiAv
 B02xMmpOkfi0fqRCg6li9u3CJtN+2vxPiNEME3lz5zdY5aE2aXSmCspvP3aPVZMt
 y1fZ90u5NONz6Q9WrIh0plEru4oynhwVuqRrnVRDPCT4X64IZXuf/fBmYqrfZGmJ
 K45P/LQDvfcHj3xBLhfkKv5OpXtyYgDtLSBNqYAYrcGS4sW7Z4Ts8ohqcOhF1OqR
 g3mFp75aO4Ekw6hFbg9CRX13G4mu80BmnRKDVwFjThkl6d0Xyxw=
 =SD3u
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs updates from Darrick Wong:
 "Here's the second round of patches for XFS for 4.18. Most of the
  commits are small cleanups, bug fixes, and continued strengthening of
  metadata verifiers; the bulk of the diff is the conversion of the
  fs/xfs/ tree to use SPDX tags.

  This series has been run through a full xfstests run over the weekend
  and through a quick xfstests run against this morning's master, with
  no major failures reported.

  Summary:

   - Strengthen metadata checking to avoid ASSERTing on bad disk
     contents

   - Validate btree records that are being retrieved for clients

   - Strengthen root inode verification

   - Convert license blurbs to SPDX tags

   - Enable changing DAX flag on directories

   - Fix some writeback deadlocks in reflink

   - Refactor out some old xfs helpers

   - Move type verifiers to a separate file

   - Fix some fuzzer crashes

   - Various other bug fixes"

* tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits)
  xfs: update incore per-AG inode count
  xfs: replace do_mod with native operations
  xfs: don't call xfs_da_shrink_inode with NULL bp
  xfs: clean up MIN/MAX
  xfs: move various type verifiers to common file
  xfs: xfs_reflink_convert_cow() memory allocation deadlock
  xfs: setup VFS i_rwsem lockdep state correctly
  xfs: fix string handling in label get/set functions
  xfs: convert to SPDX license tags
  xfs: validate btree records on retrieval
  xfs: push corruption -> ESTALE conversion to xfs_nfs_get_inode()
  xfs: verify root inode more thoroughly
  xfs: verify COW extent size hint is valid in inode verifier
  xfs: verify extent size hint is valid in inode verifier
  xfs: catch bad stripe alignment configurations
  iomap: fsync swap files before iterating mappings
  xfs: use xfs_trans_getsb in xfs_sync_sb_buf
  xfs: don't assert on corrupted unlinked inode list
  xfs: explicitly pass buffer size to xfs_corruption_error
  xfs: don't assert when on-disk btree pointers are garbage
  ...
2018-06-12 15:49:00 -07:00
Darrick J. Wong
89e9b5c091 xfs: update incore per-AG inode count
For whatever reason we never actually update pagi_count (the in-core
perag inode count) when we allocate or free inode chunks.  Online scrub
is going to use it, so we need to fix the accounting.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-06-11 21:53:52 -07:00
Linus Torvalds
7d3bf613e9 libnvdimm for 4.18
* DAX broke a fundamental assumption of truncate of file mapped pages.
   The truncate path assumed that it is safe to disconnect a pinned page
   from a file and let the filesystem reclaim the physical block. With DAX
   the page is equivalent to the filesystem block. Introduce
   dax_layout_busy_page() to enable filesystems to wait for pinned DAX
   pages to be released. Without this wait a filesystem could allocate
   blocks under active device-DMA to a new file.
 
 * DAX arranges for the block layer to be bypassed and uses
   dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
   However, the memcpy_mcsafe() facility is available through the pmem
   block driver. In order to safely handle media errors, via the DAX
   block-layer bypass, introduce copy_to_iter_mcsafe().
 
 * Fix cache management policy relative to the ACPI NFIT Platform
   Capabilities Structure to properly elide cache flushes when they are not
   necessary. The table indicates whether CPU caches are power-fail
   protected. Clarify that a deep flush is always performed on
   REQ_{FUA,PREFLUSH} requests.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
 6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
 maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
 aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
 LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
 FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
 wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
 KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
 fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
 2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
 PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
 4GBvt9ri9i/Ww/A+ppWs
 =4bfw
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This adds a user for the new 'bytes-remaining' updates to
  memcpy_mcsafe() that you already received through Ingo via the
  x86-dax- for-linus pull.

  Not included here, but still targeting this cycle, is support for
  handling memory media errors (poison) consumed via userspace dax
  mappings.

  Summary:

   - DAX broke a fundamental assumption of truncate of file mapped
     pages. The truncate path assumed that it is safe to disconnect a
     pinned page from a file and let the filesystem reclaim the physical
     block. With DAX the page is equivalent to the filesystem block.
     Introduce dax_layout_busy_page() to enable filesystems to wait for
     pinned DAX pages to be released. Without this wait a filesystem
     could allocate blocks under active device-DMA to a new file.

   - DAX arranges for the block layer to be bypassed and uses
     dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
     However, the memcpy_mcsafe() facility is available through the pmem
     block driver. In order to safely handle media errors, via the DAX
     block-layer bypass, introduce copy_to_iter_mcsafe().

   - Fix cache management policy relative to the ACPI NFIT Platform
     Capabilities Structure to properly elide cache flushes when they
     are not necessary. The table indicates whether CPU caches are
     power-fail protected. Clarify that a deep flush is always performed
     on REQ_{FUA,PREFLUSH} requests"

* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
  dax: Use dax_write_cache* helpers
  libnvdimm, pmem: Do not flush power-fail protected CPU caches
  libnvdimm, pmem: Unconditionally deep flush on *sync
  libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
  acpi, nfit: Remove ecc_unit_size
  dax: dax_insert_mapping_entry always succeeds
  libnvdimm, e820: Register all pmem resources
  libnvdimm: Debug probe times
  linvdimm, pmem: Preserve read-only setting for pmem devices
  x86, nfit_test: Add unit test for memcpy_mcsafe()
  pmem: Switch to copy_to_iter_mcsafe()
  dax: Report bytes remaining in dax_iomap_actor()
  dax: Introduce a ->copy_to_iter dax operation
  uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
  xfs, dax: introduce xfs_break_dax_layouts()
  xfs: prepare xfs_break_layouts() for another layout type
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  mm, fs, dax: handle layout changes to pinned dax mappings
  mm: fix __gup_device_huge vs unmap
  mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
  ...
2018-06-08 17:21:52 -07:00
Dan Williams
b56845794e Merge branch 'for-4.18/dax' into libnvdimm-for-next 2018-06-08 15:16:40 -07:00
Dave Chinner
0703a8e1c1 xfs: replace do_mod with native operations
do_mod() is a hold-over from when we have different sizes for file
offsets and and other internal values for 40 bit XFS filesystems.
Hence depending on build flags variables passed to do_mod() could
change size. We no longer support those small format filesystems and
hence everything is of fixed size theses days, even on 32 bit
platforms.

As such, we can convert all the do_mod() callers to platform
optimised modulus operations as defined by linux/math64.h.
Individual conversions depend on the types of variables being used.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:52 -07:00
Eric Sandeen
bb3d48dcf8 xfs: don't call xfs_da_shrink_inode with NULL bp
xfs_attr3_leaf_create may have errored out before instantiating a buffer,
for example if the blkno is out of range.  In that case there is no work
to do to remove it, and in fact xfs_da_shrink_inode will lead to an oops
if we try.

This also seems to fix a flaw where the original error from
xfs_attr3_leaf_create gets overwritten in the cleanup case, and it
removes a pointless assignment to bp which isn't used after this.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199969
Reported-by: Xu, Wen <wen.xu@gatech.edu>
Tested-by: Xu, Wen <wen.xu@gatech.edu>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:52 -07:00
Dave Chinner
9bb54cb56a xfs: clean up MIN/MAX
Get rid of the MIN/MAX macros and just use the native min/max macros
directly in the XFS code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:52 -07:00
Dave Chinner
86210fbeba xfs: move various type verifiers to common file
New verification functions like xfs_verify_fsbno() and
xfs_verify_agino() are spread across multiple files and different
header files. They really don't fit cleanly into the places they've
been put, and have wider scope than the current header includes.

Move the type verifiers to a new file in libxfs (xfs-types.c) and
the prototypes to xfs_types.h where they will be visible to all the
code that uses the types.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Dave Chinner
4a2d01b076 xfs: xfs_reflink_convert_cow() memory allocation deadlock
xfs_reflink_convert_cow() manipulates the incore extent list
in GFP_KERNEL context in the IO submission path whilst holding
locked pages under writeback. This is a memory reclaim deadlock
vector. This code is not in a transaction, so any memory allocations
it makes aren't protected via the memalloc_nofs_save() context that
transactions carry.

Hence we need to run this call under memalloc_nofs_save() context to
prevent potential memory allocations from being run as GFP_KERNEL
and deadlocking.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Dave Chinner
ef215e394e xfs: setup VFS i_rwsem lockdep state correctly
When lockdep is enabled, it changes the type of the inode i_rwsem
semaphore before unlocking a newly instantiated inode. THere is the
possibility that there is already a waiter on that inode lock by the
time we unlock the new inode, so having lockdep re-initialise the
lock is a vector for trouble.

Avoid this whole situation by setting up the i_rwsem lockdep class
at the same time we set up the XFS inode i_ilock classes and so the
VFS doesn't have to change the lock class itself when it is
potentially unsafe.

This change is necessary because the equivalent fixes to the VFS code
made in commit 1e2e547a93 ("do d_instantiate/unlock_new_inode
combinations safely") are not relevant to XFS as it has it's own
internal inode cache lookup and instantiation routines.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Arnd Bergmann
4bb8b65a04 xfs: fix string handling in label get/set functions
[sandeen: fix subject, avoid copy-out of uninit data in getlabel]

gcc-8 reports two warnings for the newly added getlabel/setlabel code:

fs/xfs/xfs_ioctl.c: In function 'xfs_ioc_getlabel':
fs/xfs/xfs_ioctl.c:1822:38: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]
  strncpy(label, sbp->sb_fname, sizeof(sbp->sb_fname));
                                      ^
In function 'strncpy',
    inlined from 'xfs_ioc_setlabel' at /git/arm-soc/fs/xfs/xfs_ioctl.c:1863:2,
    inlined from 'xfs_file_ioctl' at /git/arm-soc/fs/xfs/xfs_ioctl.c:1918:10:
include/linux/string.h:254:9: error: '__builtin_strncpy' output may be truncated copying 12 bytes from a string of length 12 [-Werror=stringop-truncation]
  return __builtin_strncpy(p, q, size);

In both cases, part of the problem is that one of the strncpy()
arguments is a fixed-length character array with zero-padding rather
than a zero-terminated string. In the first one case, we also get an
odd warning about sizeof-pointer-memaccess, which doesn't seem right
(the sizeof is for an array that happens to be the same as the second
strncpy argument).

To work around the bogus warning, I use a plain 'XFSLABEL_MAX' for
the strncpy() length when copying the label in getlabel. For setlabel(),
using memcpy() with the correct length that is already known avoids
the second warning and is slightly simpler.

In a related issue, it appears that we accidentally skip the trailing
\0 when copying a 12-character label back to user space in getlabel().
Using the correct sizeof() argument here copies the extra character.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85602
Fixes: f7664b3197 ("xfs: implement online get/set fs label")
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Martin Sebor <msebor@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 14:17:53 -07:00
Dave Chinner
0b61f8a407 xfs: convert to SPDX license tags
Remove the verbose license text from XFS files and replace them
with SPDX tags. This does not change the license of any of the code,
merely refers to the common, up-to-date license files in LICENSES/

This change was mostly scripted. fs/xfs/Makefile and
fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
and modified by the following command:

for f in `git grep -l "GNU General" fs/xfs/` ; do
	echo $f
	cat $f | awk -f hdr.awk > $f.new
	mv -f $f.new $f
done

And the hdr.awk script that did the modification (including
detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
is as follows:

$ cat hdr.awk
BEGIN {
	hdr = 1.0
	tag = "GPL-2.0"
	str = ""
}

/^ \* This program is free software/ {
	hdr = 2.0;
	next
}

/any later version./ {
	tag = "GPL-2.0+"
	next
}

/^ \*\// {
	if (hdr > 0.0) {
		print "// SPDX-License-Identifier: " tag
		print str
		print $0
		str=""
		hdr = 0.0
		next
	}
	print $0
	next
}

/^ \* / {
	if (hdr > 1.0)
		next
	if (hdr > 0.0) {
		if (str != "")
			str = str "\n"
		str = str $0
		next
	}
	print $0
	next
}

/^ \*/ {
	if (hdr > 0.0)
		next
	print $0
	next
}

// {
	if (hdr > 0.0) {
		if (str != "")
			str = str "\n"
		str = str $0
		next
	}
	print $0
}

END { }
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 14:17:53 -07:00
Dave Chinner
9e6c08d4a8 xfs: validate btree records on retrieval
So we don't check the validity of records as we walk the btree. When
there are corrupt records in the free space btree (e.g. zero
startblock/length or beyond EOAG) we just blindly use it and things
go bad from there. That leads to assert failures on debug kernels
like this:

XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 450
....
Call Trace:
 xfs_alloc_fixup_trees+0x368/0x5c0
 xfs_alloc_ag_vextent_near+0x79a/0xe20
 xfs_alloc_ag_vextent+0x1d3/0x330
 xfs_alloc_vextent+0x5e9/0x870

Or crashes like this:

XFS (loop0): xfs_buf_find: daddr 0x7fb28 out of range, EOFS 0x8000
.....
BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8
....
Call Trace:
 xfs_bmap_add_extent_hole_real+0x67d/0x930
 xfs_bmapi_write+0x934/0xc90
 xfs_da_grow_inode_int+0x27e/0x2f0
 xfs_dir2_grow_inode+0x55/0x130
 xfs_dir2_sf_to_block+0x94/0x5d0
 xfs_dir2_sf_addname+0xd0/0x590
 xfs_dir_createname+0x168/0x1a0
 xfs_rename+0x658/0x9b0

By checking that free space records pulled from the trees are
within the valid range, we catch many of these corruptions before
they can do damage.

This is a generic btree record checking deficiency. We need to
validate the records we fetch from all the different btrees before
we use them to catch corruptions like this.

This patch results in a corrupt record emitting an error message and
returning -EFSCORRUPTED, and the higher layers catch that and abort:

 XFS (loop0): Size Freespace BTree record corruption in AG 0 detected!
 XFS (loop0): start block 0x0 block count 0x0
 XFS (loop0): Internal error xfs_trans_cancel at line 1012 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x42a/0x670
 .....
 Call Trace:
  dump_stack+0x85/0xcb
  xfs_trans_cancel+0x19f/0x1c0
  xfs_create+0x42a/0x670
  xfs_generic_create+0x1f6/0x2c0
  vfs_create+0xf9/0x180
  do_mknodat+0x1f9/0x210
  do_syscall_64+0x5a/0x180
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
.....
 XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1013 of file fs/xfs/xfs_trans.c.  Return address = ffffffff81500868
 XFS (loop0): Corruption of in-memory data detected.  Shutting down filesystem

Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:12:00 -07:00
Dave Chinner
29cad0b3ed xfs: push corruption -> ESTALE conversion to xfs_nfs_get_inode()
In xfs_imap_to_bp(), we convert a -EFSCORRUPTED error to -EINVAL if
we are doing an untrusted lookup. This is done because we need
failed filehandle lookups to report -ESTALE to the caller, and it
does this by converting -EINVAL and -ENOENT errors to -ESTALE.

The squashing of EFSCORRUPTED in imap_to_bp makes it impossible for
for xfs_iget(UNTRUSTED) callers to determine the difference between
"inode does not exist" and "corruption detected during lookup". We
realy need that distinction in places calling xfS_iget(UNTRUSTED),
so move the filehandle error case handling all the way out to
xfs_nfs_get_inode() where it is needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
541b5acc85 xfs: verify root inode more thoroughly
When looking up the root inode at mount time, we don't actually do
any verification to check that the inode is allocated and accounted
for correctly in the INOBT. Make the checks on the root inode more
robust by making it an untrusted lookup. This forces the inode
lookup to use the inode btree to verify the inode is allocated
and mapped correctly to disk. This will also have the effect of
catching a significant number of AGI/INOBT related corruptions in
AG 0 at mount time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
02a0fda875 xfs: verify COW extent size hint is valid in inode verifier
There are rules for vald extent size hints. We enforce them when
applications set them, but fuzzers violate those rules and that
screws us over. Validate COW extent size hint rules in the inode
verifier to catch this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
7d71a671a2 xfs: verify extent size hint is valid in inode verifier
There are rules for vald extent size hints. We enforce them when
applications set them, but fuzzers violate those rules and that
screws us over.

This results in alignment assertion failures when setting up
allocations such as this in direct IO:

XFS: Assertion failed: ap->length, file: fs/xfs/libxfs/xfs_bmap.c, line: 3432
....
Call Trace:
 xfs_bmap_btalloc+0x415/0x910
 xfs_bmapi_write+0x71c/0x12e0
 xfs_iomap_write_direct+0x2a9/0x420
 xfs_file_iomap_begin+0x4dc/0xa70
 iomap_apply+0x43/0x100
 iomap_file_buffered_write+0x62/0x90
 xfs_file_buffered_aio_write+0xba/0x300
 __vfs_write+0xd5/0x150
 vfs_write+0xb6/0x180
 ksys_write+0x45/0xa0
 do_syscall_64+0x5a/0x180
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

And from xfs_db:

core.extsize = 10380288

Which is not an integer multiple of the block size, and so violates
Rule #7 for setting extent size hints. Validate extent size hint
rules in the inode verifier to catch this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
fa4ca9c557 xfs: catch bad stripe alignment configurations
When stripe alignments are invalid, data alignment algorithms in the
allocator may not work correctly. Ensure we catch superblocks with
invalid stripe alignment setups at mount time. These data alignment
mismatches are now detected at mount time like this:

XFS (loop0): SB stripe unit sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_read_verify+0xab/0x110, xfs_sb block 0xffffffffffffffff
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
0000000091c2de02: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 10 00  XFSB............
0000000023bff869: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000000cdd8c893: 17 32 37 15 ff ca 46 3d 9a 17 d3 33 04 b5 f1 a2  .27...F=...3....
000000009fd2844f: 00 00 00 00 00 00 00 04 00 00 00 00 00 00 06 d0  ................
0000000088e9b0bb: 00 00 00 00 00 00 06 d1 00 00 00 00 00 00 06 d2  ................
00000000ff233a20: 00 00 00 01 00 00 10 00 00 00 00 01 00 00 00 00  ................
000000009db0ac8b: 00 00 03 60 e1 34 02 00 08 00 00 02 00 00 00 00  ...`.4..........
00000000f7022460: 00 00 00 00 00 00 00 00 0c 09 0b 01 0c 00 00 19  ................
XFS (loop0): SB validate failed with error -117.

And the mount fails.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Deepa Dinamani
95582b0083 vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
  current_time ( ... )
  {
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
  ...
- return timespec_trunc(
+ return timespec64_trunc(
  ... );
  }

@ depends on patch @
identifier xtime;
@@
 struct \( iattr \| inode \| kstat \) {
 ...
-       struct timespec xtime;
+       struct timespec64 xtime;
 ...
 }

@ depends on patch @
identifier t;
@@
 struct inode_operations {
 ...
int (*update_time) (...,
-       struct timespec t,
+       struct timespec64 t,
...);
 ...
 }

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
 fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
 ...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
  ) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
 fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
|
 node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 stat->xtime = node2->i_xtime1;
|
 stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
 node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-06-05 16:57:31 -07:00
Linus Torvalds
6567af78ac Changes for 4.18:
- Strengthen inode number and structure validation when allocating inodes.
 - Reduce pointless buffer allocations during cache miss
 - Use FUA for pure data O_DSYNC directio writes
 - Various iomap refactorings
 - Strengthen quota metadata verification to avoid unfixable broken quota
 - Make AGFL block freeing a deferred operation to avoid blowing out
   transaction reservations when running complex operations
 - Get rid of the log item descriptors to reduce log overhead
 - Fix various reflink bugs where inodes were double-joined to
   transactions
 - Don't issue discards when trimming unwritten extents
 - Refactor incore dquot initialization and retrieval interfaces
 - Fix some locking problmes in the quota scrub code
 - Strengthen btree structure checks in scrub code
 - Rewrite swapfile activation to use iomap and support unwritten extents
 - Make scrub exit to userspace sooner when corruptions or
   cross-referencing problems are found
 - Make scrub invoke the data fork scrubber directly on metadata inodes
 - Don't do background reclamation of post-eof and cow blocks when the fs
   is suspended
 - Fix secondary superblock buffer lifespan hinting
 - Refactor growfs to use table-dispatched functions instead of long
   stringy functions
 - Move growfs code to libxfs
 - Implement online fs label getting and setting
 - Introduce online filesystem repair (in a very limited capacity)
 - Fix unit conversion problems in the realtime freemap iteration
   functions
 - Various refactorings and cleanups in preparation to remove buffer
   heads in a future release
 - Reimplement the old bmap call with iomap
 - Remove direct buffer head accesses from seek hole/data
 - Various bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsR9dEACgkQ+H93GTRK
 tOv0dw//cBwRgY4jhC6b9oMk2DNRWUiTt1F2yoqr28661GPo124iXAMLIwJe1DiV
 W/qpN3HUz7P46xKOVY+MXaj0JIDFxJ8c5tHAQMH/TkDc49S+mkcGyaoPJ39hnc6u
 yikG+Hq4m0YWhHaeUhKTe8pnhXBaziz5A2NtKtwh6lPOIW+Wds51T77DJnViqADq
 tZzmAq8fS9/ELpxe0Th/2D7iTWCr2c3FLsW2KgbbNvQ4e34zVE1ix1eBtEzQE+Mm
 GUjdQhYVS1oCzqZfCxJkzR4R/1TAFyS0FXOW7PHo8FAX/kas9aQbRlnHSAQ/08EE
 8Z2p3GsFip7dgmd6O6nAmFAStW6GRvgyycJ7Y+Y0IsJj6aDp9OxhRExyF+uocJR9
 b9ChOH6PMEtRB/RRlBg66pbS61abvNGutzl61ZQZGBHEvL3VqDcd68IomdD5bNSB
 pXo6mOJIcKuXsghZszsHAV9uuMe4zQAMbLy7QH6V8LyWeSAG9hTXOT9EA4MWktEJ
 SCQFf7RRPgU5pEAgOS8LgKrawqnBaqFcFvkvWsQhyiltTFz29cwxH7tjSXYMAOFE
 W+RMp8kbkPnGOaJJeKxT+/RGRB534URk0jIEKtRb679xkEF3HE58exXEVrnojJq6
 0m712+EYuZSYhFBwrvEnQjNHr0x2r/A/iBJZ6HhyV0aO1RWm4n4=
 =11pr
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "New features this cycle include the ability to relabel mounted
  filesystems, support for fallocated swapfiles, and using FUA for pure
  data O_DSYNC directio writes. With this cycle we begin to integrate
  online filesystem repair and refactor the growfs code in preparation
  for eventual subvolume support, though the road ahead for both
  features is quite long.

  There are also numerous refactorings of the iomap code to remove
  unnecessary log overhead, to disentangle some of the quota code, and
  to prepare for buffer head removal in a future upstream kernel.

  Metadata validation continues to improve, both in the hot path
  veifiers and the online filesystem check code. I anticipate sending a
  second pull request in a few days with more metadata validation
  improvements.

  This series has been run through a full xfstests run over the weekend
  and through a quick xfstests run against this morning's master, with
  no major failures reported.

  Summary:

   - Strengthen inode number and structure validation when allocating
     inodes.

   - Reduce pointless buffer allocations during cache miss

   - Use FUA for pure data O_DSYNC directio writes

   - Various iomap refactorings

   - Strengthen quota metadata verification to avoid unfixable broken
     quota

   - Make AGFL block freeing a deferred operation to avoid blowing out
     transaction reservations when running complex operations

   - Get rid of the log item descriptors to reduce log overhead

   - Fix various reflink bugs where inodes were double-joined to
     transactions

   - Don't issue discards when trimming unwritten extents

   - Refactor incore dquot initialization and retrieval interfaces

   - Fix some locking problmes in the quota scrub code

   - Strengthen btree structure checks in scrub code

   - Rewrite swapfile activation to use iomap and support unwritten
     extents

   - Make scrub exit to userspace sooner when corruptions or
     cross-referencing problems are found

   - Make scrub invoke the data fork scrubber directly on metadata
     inodes

   - Don't do background reclamation of post-eof and cow blocks when the
     fs is suspended

   - Fix secondary superblock buffer lifespan hinting

   - Refactor growfs to use table-dispatched functions instead of long
     stringy functions

   - Move growfs code to libxfs

   - Implement online fs label getting and setting

   - Introduce online filesystem repair (in a very limited capacity)

   - Fix unit conversion problems in the realtime freemap iteration
     functions

   - Various refactorings and cleanups in preparation to remove buffer
     heads in a future release

   - Reimplement the old bmap call with iomap

   - Remove direct buffer head accesses from seek hole/data

   - Various bug fixes"

* tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
  fs: use ->is_partially_uptodate in page_cache_seek_hole_data
  fs: remove the buffer_unwritten check in page_seek_hole_data
  fs: move page_cache_seek_hole_data to iomap.c
  xfs: use iomap_bmap
  iomap: add an iomap-based bmap implementation
  iomap: add a iomap_sector helper
  iomap: use __bio_add_page in iomap_dio_zero
  iomap: move IOMAP_F_BOUNDARY to gfs2
  iomap: fix the comment describing IOMAP_NOWAIT
  iomap: inline data should be an iomap type, not a flag
  mm: split ->readpages calls to avoid non-contiguous pages lists
  mm: return an unsigned int from __do_page_cache_readahead
  mm: give the 'ret' variable a better name __do_page_cache_readahead
  block: add a lower-level bio_add_page interface
  xfs: fix error handling in xfs_refcount_insert()
  xfs: fix xfs_rtalloc_rec units
  xfs: strengthen rtalloc query range checks
  xfs: xfs_rtbuf_get should check the bmapi_read results
  xfs: xfs_rtword_t should be unsigned, not signed
  dax: change bdev_dax_supported() to support boolean returns
  ...
2018-06-05 13:24:20 -07:00
Eric Sandeen
89c2e71123 xfs: use xfs_trans_getsb in xfs_sync_sb_buf
Use xfs_trans_getsb rather than reaching right in for
mp->m_sb_bp; I think this is more correct, and it facilitates
building this libxfs code in userspace as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
d2e7366542 xfs: don't assert on corrupted unlinked inode list
Use the per-ag inode number verifiers to detect corrupt lists and error
out, instead of using ASSERTs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
2551a53053 xfs: explicitly pass buffer size to xfs_corruption_error
Explicitly pass the buffer length to xfs_corruption_error() instead of
assuming XFS_CORRUPTION_DUMP_LEN so that we avoid dumping off the end
of the buffer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
85ae01098c xfs: don't assert when on-disk btree pointers are garbage
Don't ASSERT when we encounter bad on-disk btree pointers in the debug
check functions.  Log the error to leave breadcrumbs and let the upper
layers deal with it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
e63a1008ee xfs: strengthen btree pointer checks before use
Instead of ASSERTing on null btree pointers in xfs_btree_ptr_to_daddr,
use the new block number verifiers to ensure that the btree pointer
doesn't point to any sensitive areas (AG headers, past-EOFS) and return
-EFSCORRUPTED if this is the case.  Remove the ASSERT because on-disk
corruptions shouldn't trigger ASSERTs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
4cbae4b816 xfs: introduce xfs_btree_debug_check_ptr
Make xfs_btree_check_ptr a non-debug function and introduce a new _debug
version that only runs when #ifdef DEBUG.   This will enable us to reuse
the checking logic with other parts of the btree code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
e4f45eff86 xfs: check directory bestfree information in the verifier
Create a variant of xfs_dir2_data_freefind that is suitable for use in a
verifier.  Because _freefind is called by the verifier, we simply
duplicate the _freefind function, convert the ASSERTs to return
__this_address, and modify the verifier to call our new function.  Once
we've made it impossible for directory blocks with bad bestfree data to
make it into the filesystem we can remove the DEBUG code from the
regular _freefind function.

Underlying argument: corruption of on-disk metadata should return
-EFSCORRUPTED instead of blowing ASSERTs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:04 -07:00
Darrick J. Wong
924cade4df xfs: don't return garbage buffers in xfs_da3_node_read
If we're reading a node in a dir/attr btree and the buffer comes off the
disk with a magic number we don't recognize, don't ASSERT and don't set
a garbage buffer type (0 also triggers ASSERTs).  Instead, report the
corruption, release the buffer, and return -EFSCORRUPTED because that's
what the dabtree is -- corrupt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
1f5c071d19 xfs: don't ASSERT on short form btree root pointer of zero
Don't ASSERT if the short form btree root pointer is zero.  Now that we
use xfs_verify_agbno to check all short form btree pointers, we'll let
that log the error and pass it to the upper layers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
eeee0d6a9b xfs: btree lookup shouldn't ASSERT on empty btree nodes
If a btree lookup encounters an empty btree node or an empty btree leaf
on a multi-level btree, that's evidence of a corrupt on-disk btree.
Therefore, we should return -EFSCORRUPTED to the upper levels, not an
ASSERT failure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
a37f7b127e xfs: xfs_alloc_get_rec should return EFSCORRUPTED for obvious bnobt corruption
Return -EFSCORRUPTED when the bnobt/cntbt return obviously corrupt
values, rather than letting them bounce around in the internal code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
b3986010ce xfs: remove redundant ASSERT on insufficient bestfree length in _leaf_addname
In xfs_dir2_leaf_addname we ASSERT if the length of the unused space
described by bestfree[0] is less the amount of space we wish to consume.
Immediately after it is a call to xfs_dir2_data_use_free where the
offset parameter is offset of the unused space and the length parameter
is the amount of space we wish to consume.  Both values (and the unused
space pointer) are passed into xfs_dir2_data_check_free, which also
validates that the region of unused space is big enough to cover the
space we wish to consume.  This is effectively the same check that the
ASSERT covers, and since a check failure results in a corruption message
being logged we can remove the ASSERT.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:29 -07:00
Darrick J. Wong
17ba2cc7b5 xfs: don't assert when reporting on-disk corruption while loading btree
Don't bother ASSERTing when we're already going to log and return the
corruption status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:29 -07:00
Darrick J. Wong
aaacdd257f xfs: don't forbid setting dax flag on directories if device doesn't dax
On a directory, the DAX flag is merely a hint that files created in the
directory should have the DAX flag set at creation time.  We don't care
if the underlying device supports DAX or not because directory metadata
are always cached in DRAM.  We don't care if new files get the flag even
if the device doesn't support DAX because we always check for DAX
support before setting the VFS flag (S_DAX).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:29 -07:00
Linus Torvalds
b058efc1ac Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache lookup cleanups from Al Viro:
 "Cleaning ->lookup() instances up - mostly d_splice_alias() conversions"

* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (29 commits)
  switch the rest of procfs lookups to d_splice_alias()
  procfs: switch instantiate_t to d_splice_alias()
  don't bother with tid_fd_revalidate() in lookups
  proc_lookupfd_common(): don't bother with instantiate unless the file is open
  procfs: get rid of ancient BS in pid_revalidate() uses
  cifs_lookup(): switch to d_splice_alias()
  cifs_lookup(): cifs_get_inode_...() never returns 0 with *inode left NULL
  9p: unify paths in v9fs_vfs_lookup()
  ncp_lookup(): use d_splice_alias()
  hfsplus: switch to d_splice_alias()
  hfs: don't allow mounting over .../rsrc
  hfs: use d_splice_alias()
  omfs_lookup(): report IO errors, use d_splice_alias()
  orangefs_lookup: simplify
  openpromfs: switch to d_splice_alias()
  xfs_vn_lookup: simplify a bit
  adfs_lookup: do not fail with ENOENT on negatives, use d_splice_alias()
  adfs_lookup_byname: .. *is* taken care of in fs/namei.c
  romfs_lookup: switch to d_splice_alias()
  qnx6_lookup: switch to d_splice_alias()
  ...
2018-06-04 13:46:22 -07:00
Linus Torvalds
cf626b0da7 Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull procfs updates from Al Viro:
 "Christoph's proc_create_... cleanups series"

* 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
  xfs, proc: hide unused xfs procfs helpers
  isdn/gigaset: add back gigaset_procinfo assignment
  proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
  tty: replace ->proc_fops with ->proc_show
  ide: replace ->proc_fops with ->proc_show
  ide: remove ide_driver_proc_write
  isdn: replace ->proc_fops with ->proc_show
  atm: switch to proc_create_seq_private
  atm: simplify procfs code
  bluetooth: switch to proc_create_seq_data
  netfilter/x_tables: switch to proc_create_seq_private
  netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
  neigh: switch to proc_create_seq_data
  hostap: switch to proc_create_{seq,single}_data
  bonding: switch to proc_create_seq_data
  rtc/proc: switch to proc_create_single_data
  drbd: switch to proc_create_single
  resource: switch to proc_create_seq_data
  staging/rtl8192u: simplify procfs code
  jfs: simplify procfs code
  ...
2018-06-04 10:00:01 -07:00
Dave Chinner
9f96cc958e xfs: verify AGI unlinked list contains valid blocks
The heads of tha AGI unlinked list are only scanned on debug
kernels when the verifier runs. Change that to always scan the heads
and validate that the inode numbers are valid.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-03 16:12:16 -07:00
Christoph Hellwig
b84e772299 xfs: use iomap_bmap
Switch to the iomap based bmap implementation to get rid of one of the
last users of xfs_get_blocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:33 -07:00
Dave Chinner
16858f7c21 xfs: fix error handling in xfs_refcount_insert()
generic/475 fired an assert failure just after the filesystem was
shut down:

XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_refcount.c, line: 182
.....
Call Trace:
 xfs_refcount_insert+0x151/0x190
 xfs_refcount_adjust_extents.constprop.11+0x9c/0x470
 xfs_refcount_adjust.constprop.10+0xb0/0x270
 xfs_refcount_finish_one+0x25a/0x420
 xfs_trans_log_finish_refcount_update+0x2a/0x40
 xfs_refcount_update_finish_item+0x35/0xa0
 xfs_defer_finish+0x15e/0x4d0
 xfs_reflink_remap_extent+0x1bc/0x610
 xfs_reflink_remap_blocks+0x6e/0x280
 xfs_reflink_remap_range+0x311/0x530
 vfs_clone_file_range+0x119/0x200
 ....

If xfs_btree_insert() returns an error, the corruption check fires
instead of passing the error back the caller. The corruption check
should be after we've checked for an error, not before, thereby
avoiding assert failures if the filesystem shuts down during a
refcount btree record insert.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
a0e5c435ba xfs: fix xfs_rtalloc_rec units
All the realtime allocation functions deal with space on the rtdev in
units of realtime extents.  However, struct xfs_rtalloc_rec confusingly
uses the word 'block' in the name, even though they're really extents.

Fix the naming problem and fix all the unit handling problems in the two
existing users.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
8ad560d256 xfs: strengthen rtalloc query range checks
Strengthen the rtalloc range query checks to make sure that the keys do
not run off the end of the realtime device inappropriately.  Note that
the query range functions require units of rt extents, not blocks,
despite the type name.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
a03f1641c7 xfs: xfs_rtbuf_get should check the bmapi_read results
The xfs_rtbuf_get function should check the block mapping it gets back
from bmapi_read.  If there are no mappings or the mapping isn't a real
extent, we should return -EFSCORRUPTED rather than trying to read a
garbage value.  We also require realtime bitmap blocks to be real,
written allocations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
2483113f3d xfs: xfs_rtword_t should be unsigned, not signed
xfs_rtword_t is used for bit manipulations in the realtime bitmap file.
Since we're performing bit shifts with this type, we don't want sign
extension and we don't want to be left shifting negative quantities
because that's undefined behavior.

This also shuts up these UBSAN warnings:
UBSAN: Undefined behaviour in fs/xfs/libxfs/xfs_rtbitmap.c:833:48
signed integer overflow:
-2147483648 - 1 cannot be represented in type 'int'

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Dave Jiang
80660f2025 dax: change bdev_dax_supported() to support boolean returns
The function return values are confusing with the way the function is
named. We expect a true or false return value but it actually returns
0/-errno.  This makes the code very confusing. Changing the return values
to return a bool where if DAX is supported then return true and no DAX
support returns false.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-31 08:58:34 -07:00
Darrick J. Wong
ba23cba9b3 fs: allow per-device dax status checking for filesystems
Change bdev_dax_supported so it takes a bdev parameter.  This enables
multi-device filesystems like xfs to check that a dax device can work for
the particular filesystem.  Once that's in place, actually fix all the
parts of XFS where we need to be able to distinguish between datadev and
rtdev.

This patch fixes the problem where we screw up the dax support checking
in xfs if the datadev and rtdev have different dax capabilities.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-05-31 08:58:33 -07:00
Kent Overstreet
e292d7bc63 xfs: convert to bioset_init()/mempool_init()
Convert XFS to embedded bio sets.

Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Darrick J. Wong
d25522f10c xfs: repair superblocks
If one of the backup superblocks is found to differ seriously from
superblock 0, write out a fresh copy from the in-core sb.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:15 -07:00
Darrick J. Wong
7e85bc6c87 xfs: add helpers to attach quotas to inodes
Add a helper routine to attach quota information to inodes that are
about to undergo repair.  If that fails, we need to schedule a
quotacheck for the next mount but allow the corrupted metadata repair to
continue.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:15 -07:00
Darrick J. Wong
04a2b7b254 xfs: recover AG btree roots from rmap data
Add a helper function to help us recover btree roots from the rmap data.
Callers pass in a list of rmap owner codes, buffer ops, and magic
numbers.  We iterate the rmap records looking for owner matches, and
then read the matching blocks to see if the magic number & uuid match.
If so, we then read-verify the block, and if that passes then we retain
a pointer to the block with the highest level, assuming that by the end
of the call we will have found the root.  This will be used to reset the
AGF/AGI btree root fields during their rebuild procedures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
12c6510e2f xfs: add helpers to dispose of old btree blocks after a repair
Now that we've plumbed in the ability to construct a list of dead btree
blocks following a repair, add more helpers to dispose of them.  This is
done by examining the rmapbt -- if the btree was the only owner we can
free the block, otherwise it's crosslinked and we can only remove the
rmapbt record.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
64a39d876e xfs: add helpers to collect and sift btree block pointers during repair
Add some helpers to assemble a list of fs block extents.  Generally,
repair functions will iterate the rmapbt to make a list (1) of all
extents owned by the nominal owner of the metadata structure; then they
will iterate all other structures with the same rmap owner to make a
list (2) of active blocks; and finally we have a subtraction function to
subtract all the blocks in (2) from (1), with the result that (1) is now
a list of blocks that were owned by the old btree and must be disposed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
73d6b42aa4 xfs: add helpers to allocate and initialize fresh btree roots
Add a pair of helper functions to allocate and initialize fresh btree
roots.  The repair functions will use these as part of recreating
corrupted metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
0a9633fa2f xfs: add helpers to deal with transaction allocation and rolling
For repairs, we need to reserve at least as many blocks as we think
we're going to need to rebuild the data structure, and we're going to
need some helpers to roll transactions while maintaining locks on the AG
headers so that other threads cannot wander into the middle of a repair.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
51863d7dd7 xfs: grab the per-ag structure whenever relevant
Grab and hold the per-AG data across a scrub run whenever relevant.
This helps us avoid repeated trips through rcu and the radix tree
in the repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Souptick Joarder
05edd888d1 fs: xfs: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handlers.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-29 10:46:03 -07:00
Darrick J. Wong
2e050e648a xfs: fix inobt magic number check
In commit a6a781a58b ("xfs: have buffer verifier functions
report failing address") the bad magic number return was ported
incorrectly.

Fixes: a6a781a58b
Reported-by: syzbot+08ab33be0178b76851c8@syzkaller.appspotmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-05-29 10:46:03 -07:00
Arnd Bergmann
5ef03dbd91 xfs, proc: hide unused xfs procfs helpers
These two functions now trigger a warning when CONFIG_PROC_FS is disabled:

fs/xfs/xfs_stats.c:128:12: error: 'xqmstat_proc_show' defined but not used [-Werror=unused-function]
 static int xqmstat_proc_show(struct seq_file *m, void *v)
            ^~~~~~~~~~~~~~~~~
fs/xfs/xfs_stats.c:118:12: error: 'xqm_proc_show' defined but not used [-Werror=unused-function]
 static int xqm_proc_show(struct seq_file *m, void *v)
            ^~~~~~~~~~~~~

Previously, they were referenced from an unused 'static const' structure,
which is silently dropped by gcc.

We can address the warning by adding the same #ifdef around them that
hides the reference.

Fixes: 3f3942aca6 ("proc: introduce proc_create_single{,_data}")
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-25 20:43:08 -04:00
Al Viro
b113a6d3cf xfs_vn_lookup: simplify a bit
have all post-xfs_lookup() branches converge on d_splice_alias()

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-22 14:27:57 -04:00
Dan Williams
d6dc57e251 xfs, dax: introduce xfs_break_dax_layouts()
xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
for busy / pinned dax pages and waits for those pages to go idle before
any potential extent unmap operation.

dax_layout_busy_page() handles synchronizing against new page-busy
events (get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path which will eventually block on the xfs inode
lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
busy page it returns it for xfs to wait for the page-idle event that
will fire when the page reference count reaches 1 (recall ZONE_DEVICE
pages are idle at count 1, see generic_dax_pagefree()).

While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
deadlock the process that might be trying to elevate the page count of
more pages before arranging for any of them to go idle. I.e. the typical
case of submitting I/O is that iov_iter_get_pages() elevates the
reference count of all pages in the I/O before starting I/O on the first
page. The process of elevating the reference count of all pages involved
in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.

Although XFS_MMAPLOCK_EXCL is dropped while waiting, XFS_IOLOCK_EXCL is
held while sleeping. We need this to prevent starvation of the truncate
path as continuous submission of direct-I/O could starve the truncate
path indefinitely if the lock is dropped.

Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 07:19:08 -07:00
Dan Williams
69eb5fa10e xfs: prepare xfs_break_layouts() for another layout type
When xfs is operating as the back-end of a pNFS block server, it
prevents collisions between local and remote operations by requiring a
lease to be held for remotely accessed blocks. Local filesystem
operations break those leases before writing or mutating the extent map
of the file.

A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.

BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of
layout breaking.

Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
do not collide with local writes. Additionally, layouts are broken in
the BREAK_UNMAP case to make sure the layout-holder has a consistent
view of the file's extent map. While BREAK_WRITE breaks can be satisfied
be recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require
waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL.

After this refactoring xfs_break_layouts() becomes the entry point for
coordinating both types of breaks. Finally, xfs_break_leased_layouts()
becomes just the BREAK_WRITE handler.

Note that the unlock tracking is needed in a follow on change. That will
coordinate retrying either break handler until both successfully test
for a lease break while maintaining the lock state.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 07:19:08 -07:00
Dan Williams
c63a8eae63 xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
In preparation for adding coordination between extent unmap operations
and busy dax-pages, update xfs_break_layouts() to permit it to be called
with the mmap lock held. This lock scheme will be required for
coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
pages mapped into the file's address space). Breaking dax layouts will
be added to xfs_break_layouts() in a future patch, for now this preps
the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to
xfs_break_layouts().

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 07:19:08 -07:00
Eric Sandeen
f7664b3197 xfs: implement online get/set fs label
The GET ioctl is trivial, just return the current label.

The SET ioctl is more involved:
It transactionally modifies the superblock to write a new filesystem
label to the primary super.

A new variant of xfs_sync_sb then writes the superblock buffer
immediately to disk so that the change is visible from userspace.

It then invalidates any page cache that userspace might have previously
read on the block device so that i.e. blkid can see the change
immediately, and updates all secondary superblocks as userspace relable
does.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[darrick: use dchinner's new xfs_update_secondary_sbs function]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-16 08:50:16 -07:00
Christoph Hellwig
3f3942aca6 proc: introduce proc_create_single{,_data}
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Dave Chinner
49dd56f26e xfs: factor the ag length extension code into libxfs
Growfs currently manually codes the extension of the last AG in a
filesytem during the growfs process. Factor that out of the growfs
code and move it into libxfs along with teh rest of the AG header
modification code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
b16817b66b xfs: move growfs core to libxfs
So it can be shared with userspace (e.g. mkfs) easily.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
8125147288 xfs: rework secondary superblock updates in growfs
Right now we wait until we've committed changes to the primary
superblock before we initialise any of the new secondary
superblocks. This means that if we have any write errors for new
secondary superblocks we end up with garbage in place rather than
zeros or even an "in progress" superblock to indicate a grow
operation is being done.

To ensure we can write the secondary superblocks, initialise them
earlier in the same loop that initialises the AG headers. We stamp
the new secondary superblocks here with the old geometry, but set
the "sb_inprogress" field to indicate that updates are being done to
the superblock so they cannot be used.  This will result in the
secondary superblock fields being updated or triggering errors that
will abort the grow before we commit any permanent changes.

This also means we can change the update mechanism of the secondary
superblocks.  We know that we are going to wholly overwrite the
information in the struct xfs_sb in the buffer, so there's no point
reading it from disk. Just allocate an uncached buffer, zero it in
memory, stamp the new superblock structure in it and write it out.
If we fail to write it out, then we'll leave the existing sb (old or
new w/ inprogress) on disk for repair to deal with later.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
83a7f86e39 xfs: separate secondary sb update in growfs
This happens after all the transactions to update the superblock
occur, and errors need to be handled slightly differently. Seperate
out the code into it's own function, and clean up the error goto
stack in the core growfs code as it is now much simpler.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
87444b8c26 xfs: make imaxpct changes in growfs separate
When growfs changes the imaxpct value of the filesystem, it runs
through all the "change size" growfs code, whether it needs to or
not. Separate out changing imaxpct into it's own function and
transaction to simplify the rest of the growfs code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
532ff647d8 xfs: turn ag header initialisation into a table driven operation
There's still more cookie cutter code in setting up each AG header.
Separate all the variables into a simple structure and iterate a
table of header definitions to initialise everything.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
0410c3bb2b xfs: factor ag btree root block initialisation
Cookie cutter code, easily factored.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
9aebe805a5 xfs: convert growfs AG header init to use buffer lists
We currently write all new AG headers synchronously, which can be
slow for large grow operations. All we really need to do is ensure
all the headers are on disk before we run the growfs transaction, so
convert this to a buffer list and a delayed write operation. We
block waiting for the delayed write buffer submission to complete,
so this will fulfill the requirement to have all the buffers written
correctly before proceeding.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
cce77bcf48 xfs: factor out AG header initialisation from growfs core
The intialisation of new AG headers is mostly common with the
userspace mkfs code and growfs in the kernel, so start factoring it
out so we can move it to libxfs and use it in both places.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Dave Chinner
879de98ead xfs: one-shot cached buffers
For the new growfs work, we want to ensure that we serialise
secondary superblock updates with other operations (e.g. scrub)
correctly, but we don't want to cache the buffers for long term
reuse. We need cached buffers for serialisation, however.

To solve this, introduce a "oneshot" buffer which will be marshalled
through the cache but then released once the last current reference
goes away. If the buffer is already cached, then we ignore the
"one-shot" behaviour and leave the buffer in the state it was prior
to the one-shot command being run. This means we don't perturb
either the working set or existing cached buffer state by a one-shot
operation.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 18:12:51 -07:00
Darrick J. Wong
84d42ea6b6 xfs: implement the metadata repair ioctl flag
Plumb in the pieces necessary to make the "scrub" subfunction of
the scrub ioctl actually work.  This means that we make the IFLAG_REPAIR
flag to the scrub ioctl actually do something, and we add an errortag
knob so that xfstests can force the kernel to rebuild a metadata
structure even if there's nothing wrong with it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
718fa74b15 xfs: create tracepoints for online repair
These tracepoints will be used to debug the online repair routines.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
7644bd988d xfs: teach xfs_bmapi_remap to accept some bmapi flags
Teach xfs_bmapi_remap how to map in unwritten extent and to skip rmap
updates.  This enables us to rebuild real and unwritten extents from the
rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
7cf199ba5a xfs: make xfs_bmapi_remapi work with attribute forks
Add a new flags argument to xfs_bmapi_remapi so that we can pass BMAPI
flags into the function.  This enables us to pass in BMAPI_ATTRFORK so
that we can remap things into the attribute fork.  Eventually the
online repair code will use this to rebuild attribute forks, so make it
non-static.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
9f3a080ef1 xfs: hoist xfs_scrub_agfl_walk to libxfs as xfs_agfl_walk
This function is basically a generic AGFL block iterator, so promote it
to libxfs ahead of online repair wanting to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
ddd10c2fe2 xfs: avoid ABBA deadlock when scrubbing parent pointers
In normal operation, the XFS convention is to take an inode's iolock
and then allocate a transaction.  However, when scrubbing parent inodes
this is inverted -- we allocated the transaction to do the scrub, and
now we're trying to grab the parent's iolock.  This can lead to ABBA
deadlocks: some thread grabbed the parent's iolock and is waiting for
space for a transaction while our parent scrubber is sitting on a
transaction trying to get the parent's iolock.

Therefore, convert all iolock attempts to use trylock; if that fails,
they can use the existing mechanisms to back off and try again.

The ABBA deadlock didn't happen with a non-repair scrub because the
transactions don't reserve any space, but repair scrubs require
reservation in order to update metadata.  However, any other concurrent
metadata update (e.g. directory create in the parent) could also induce
this deadlock with the parent scrubber.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
517b32b7fa xfs: scrub the data fork of the realtime inodes
The realtime bitmap and summary inodes live on the metadata device, so
we can scrub their data forks with the regular scrubbers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
87d9d609c2 xfs: quota scrub should use bmapbtd scrubber
Replace the quota scrubber's open-coded data fork scrubber with a
redirected call to the bmapbtd scrubber.  This strengthens the quota
scrub to include all the cross-referencing that it does.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
8bc763c24d xfs: don't continue scrub if already corrupt
If we've already decided that something is corrupt, we might as well
abort all the loops and exit as quickly as possible.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 18:12:50 -07:00
Darrick J. Wong
eac69e1676 xfs: refactor quota limits initialization
Replace all the if (!error) weirdness with helper functions that follow
our regular coding practices, and factor out the ternary expression soup.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
689e11c84b xfs: superblock scrub should use short-lived buffers
Secondary superblocks are rarely used, so create a helper to read a
given non-primary AG's superblock and ensure that it won't stick around
hogging memory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
8389f3ffa2 xfs: skip scrub xref if corruption already noted
Don't bother looking for cross-referencing problems if the metadata is
already corrupt or we've already found a cross-referencing problem.
Since we added a helper function for flags testing, convert existing
users to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Dave Chinner
c9fbd7bbc2 xfs: clear sb->s_fs_info on mount failure
We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.

Essentially, the machine was under memory pressure when the mount
was being run, xfs_fs_fill_super() failed after allocating the
xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
freed the xfs_mount, but the sb->s_fs_info field still pointed to
the freed memory. Hence when the superblock shrinker then ran
it fell off the bad pointer.

With the superblock shrinker problem fixed at teh VFS level, this
stale s_fs_info pointer is still a problem - we use it
unconditionally in ->put_super when the superblock is being torn
down, and hence we can still trip over it after a ->fill_super
call failure. Hence we need to clear s_fs_info if
xfs-fs_fill_super() fails, and we need to check if it's valid in
the places it can potentially be dereferenced after a ->fill_super
failure.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 17:57:05 -07:00
Dave Chinner
dae5cd8118 xfs: add mount delay debug option
Similar to log_recovery_delay, this delay occurs between the VFS
superblock being initialised and the xfs_mount being fully
initialised. It also poisons the per-ag radix tree node so that it
can be used for triggering shrinker races during mount
such as the following:

<run memory pressure workload in background>

$ cat dirty-mount.sh
#! /bin/bash

umount -f /dev/pmem0
mkfs.xfs -f /dev/pmem0
mount /dev/pmem0 /mnt/test
rm -f /mnt/test/foo
xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo
umount /dev/pmem0

# let's crash it now!
echo 30 > /sys/fs/xfs/debug/mount_delay
mount /dev/pmem0 /mnt/test
echo 0 > /sys/fs/xfs/debug/mount_delay
umount /dev/pmem0
$ sudo ./dirty-mount.sh
.....
[   60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G      D W        4.16.0-rc5-dgc #440
[   60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320
[   60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282
[   60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a
[   60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000
[   60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5
[   60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598
[   60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458
[   60.390827] FS:  00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
[   60.392253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0
[   60.394519] Call Trace:
[   60.395252]  radix_tree_gang_lookup_tag+0xc4/0x130
[   60.395948]  xfs_perag_get_tag+0x37/0xf0
[   60.396522]  xfs_reclaim_inodes_count+0x32/0x40
[   60.397178]  xfs_fs_nr_cached_objects+0x11/0x20
[   60.397837]  super_cache_count+0x35/0xc0
[   60.399159]  shrink_slab.part.66+0xb1/0x370
[   60.400194]  shrink_node+0x7e/0x1a0
[   60.401058]  try_to_free_pages+0x199/0x470
[   60.402081]  __alloc_pages_slowpath+0x3a1/0xd20
[   60.403729]  __alloc_pages_nodemask+0x1c3/0x200
[   60.404941]  cache_grow_begin+0x20b/0x2e0
[   60.406164]  fallback_alloc+0x160/0x200
[   60.407088]  kmem_cache_alloc+0x111/0x4e0
[   60.408038]  ? xfs_buf_rele+0x61/0x430
[   60.408925]  kmem_zone_alloc+0x61/0xe0
[   60.409965]  xfs_inode_alloc+0x24/0x1d0
.....


Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 17:57:05 -07:00
Brian Foster
4e529339af xfs: factor out nodiscard helpers
The changes to skip discards of speculative preallocation and
unwritten extents introduced several new wrapper functions through
the bunmapi -> extent free codepath to reduce churn in all of the
associated callers. In several cases, these wrappers simply toggle a
single flag to skip or not skip discards for the resulting blocks.

The explicit _nodiscard() wrappers for such an isolated set of
callers is a bit overkill. Kill off these wrappers and replace with
the calls to the underlying functions in the contexts that need to
control discard behavior. Retain the wrappers that preserve the
original calling conventions to serve the original purpose of
reducing code churn.

This is a refactoring patch and does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
67482129cd iomap: add a swapfile activation function
Add a new iomap_swapfile_activate function so that filesystems can
activate swap files without having to use the obsolete and slow bmap
function.  This enables XFS to support fallocate'd swap files and
swap files on realtime devices.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
d6b636ebb1 xfs: halt auto-reclamation activities while rebuilding rmap
Rebuilding the reverse-mapping tree requires us to quiesce all inodes in
the filesystem, so we must stop background reclamation of post-EOF and
CoW prealloc blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
95eb308caa xfs: add BMAPI_NORMAP flag to perform block remapping without updating rmapbt
Add a new flag, XFS_BMAPI_NORMAP, which will perform file block
remapping without updating the rmapbt.  This will be used by the repair
code to reconstruct bmbts from the rmapbt, in which case we don't want
the rmapbt update.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
08daa3ccf5 xfs: add repair helpers for the reference count btree
Add a couple of functions to the refcount btree and generic btree code
that will be used to repair the refcountbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
4d4f86b49f xfs: add repair helpers for the reverse mapping btree
Add a couple of functions to the reverse mapping btree that will be used
to repair the rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
7f8f1313d9 xfs: expose various functions to repair code
Expose various helpers that the repair code will want to use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
14861c4740 xfs: add helpers to calculate btree size
Add a bunch of helper functions that calculate the sizes of various
btrees.  These will be used to repair btrees and btree headers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
9d9c90286a xfs: refactor scrub transaction allocation function
Since the transaction allocation helper is about to become more complex,
move it to common.c and remove the redundant parameters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
08a3a692ef xfs: btree scrub should check minrecs
Strengthen the btree block header checks to detect the number of records
being less than the btree type's minimum record count.  Certain blocks
are allowed to violate this constraint -- specifically any btree block
at the top of the tree can have fewer than minrecs records.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
631fc955bd xfs: clean up scrub usage of KM_NOFS
All scrub code runs in transaction context, which means that memory
allocations are automatically run in PF_MEMALLOC_NOFS context.  It's
therefore unnecessary to pass in KM_NOFS to allocation routines, so
clean them all out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
eb41c93fef xfs: avoid ilock games in the quota scrubber
Refactor the quota scrubber to take the quotaofflock and grab the quota
inode in the setup function so that we can treat quota in the same
"scrub in the context of this inode" (i.e. sc->ip) manner as we treat
any other inode.  We do have to drop the quota inode's ILOCK_EXCL to use
dqiterate, but since dquots have their own individual locks the ILOCK
wasn't helping us anyway.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-15 17:57:05 -07:00
Darrick J. Wong
554ba96540 xfs: refactor dquot iteration
Create a helper function to iterate all the dquots of a given type in
the system, and refactor the dquot scrub to use it.  This will get more
use in the quota repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-15 17:56:59 -07:00
Darrick J. Wong
28b9060bd8 xfs: rename on-disk dquot counter zap functions
The function 'xfs_qm_dqiterate' doesn't iterate dquots at all, it
iterates all dquot blocks of a quota inode and clears the counters.
Therefore, change the name to something more descriptive so that we can
introduce a real dquot iterator later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
30ab2dcf2c xfs: replace XFS_QMOPT_DQALLOC with a simple boolean
DQALLOC is only ever used with xfs_qm_dqget*, and the only flag that the
_dqget family of functions cares about is DQALLOC.  Therefore, change
it to a boolean 'can alloc?' flag for the dqget interfaces where that
makes sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
114e73ccfa xfs: remove direct calls to _qm_dqread
The quota initialization code needs an "uncached" variant of _dqget to
read in default quota limits and timers before the dquot cache is fully
set up.  We've already split up _dqget into its component pieces so
create a fourth variant to address this need, and make dqread internal
to xfs_dquot.c again.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
d63192c890 xfs: refactor xfs_qm_dqtobp and xfs_qm_dqalloc
Separate the disk dquot read and allocation functionality into
two helper functions, then refactor dqread to call them directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
617cd5c12c xfs: refactor incore dquot initialization functions
Create two incore dquot initialization functions that will help us to
disentangle dqget and dqread.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
0fcef1270f xfs: fetch dquots directly during quotacheck
Quotacheck only runs during mount, which means that there are no other
processes in the system that could be doing chown or chproj.  Therefore
there's no potential for racing to attach dquots to the inode so we can
drop all the ILOCK and race detection bits from quotacheck.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
4882c19d2a xfs: split out dqget for inodes from regular dqget
There are two uses of dqget here -- one is to return the dquot for a
given type and id, and the other is to return the dquot for a given type
and inode.  Those are two separate things, so split them into two
smaller functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:48 -07:00
Darrick J. Wong
c14cfccabe xfs: remove unnecessary xfs_qm_dqattach parameter
The flags argument is always zero, get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
d7103eeb00 xfs: delegate dqget input checks to helper function
Move the dqget input checks to a separate function in preparation for
splitting up the dqget functionality.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
cc2047c4d0 xfs: refactor dquot cache handling
Delegate the dquot cache handling (radix tree lookup and insertion) to
separate helper functions so that we can continue to simplify the body
of dqget.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
2e330e76e0 xfs: refactor XFS_QMOPT_DQNEXT out of existence
There's only one caller of DQNEXT and its semantics can be moved into a
separate function, so create the function and get rid of the flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
609001bca4 xfs: don't spray logs when dquot flush/purge fail
When dquot flush or purge fail there's no need to spam the logs, we've
already logged the IO error or fs shutdown that caused the flush
failures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:47 -07:00
Darrick J. Wong
7b6b50f55c xfs: release new dquot buffer on defer_finish error
In commit efa092f3d4 "[XFS] Fixes a bug in the quota code when
allocating a new dquot record", we allocate a new dquot block, grab a
buffer to initialize it, and return the locked initialized dquot buffer
to the caller for further in-core dquot initialization.  Unfortunately,
if the _bmap_finish errored out, _qm_dqalloc would also error out
without bothering to free the (locked) buffer.  Leaking a locked buffer
caused hangs in generic/388 when quotas are enabled.

Furthermore, the _bmap_finish -> _defer_finish conversion in
310a75a3c6 ("xfs: change xfs_bmap_{finish,cancel,init,free} ->
xfs_defer_*") failed to observe that the buffer was held going into
_defer_finish and therefore failed to notice that the buffer lock is
/not/ maintained afterwards.  Now that we can bjoin a buffer to a
defer_ops, use this mechanism to ensure that the buffer stays locked
across the _defer_finish.  Release the holds and locks on the buffer as
appropriate if we have to error out.

There is a subtlety here for the caller in that the buffer emerges
locked and held to the transaction, so if the _trans_commit fails we
have to release the buffer explicitly.  This fixes the unmount hang
in generic/388 when quotas are enabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:47 -07:00
Brian Foster
84ca484ecf xfs: don't discard on free of unwritten extents
Unwritten extents by definition have not been written to until they
are converted to normal written extents. If unwritten extents are
freed from a file, it is therefore guaranteed that the blocks have
not been written to since allocation (note that zero range punches
and reallocates blocks).

To cut down on online discards generated from workloads that make
use of preallocation, skip discards of extents if they are in the
unwritten state when the extent is freed.

Note that this optimization does not apply to log recovery, during
which all freed extents are discarded if online discard is enabled.
Also note that it may be possible for a filesystem crash to occur
after write completion of an unwritten extent but before unwritten
conversion such that the extent remains unwritten after log
recovery. Since this pseudo-inconsistency may already be possible
after a crash (consider writing to recently allocated blocks where
the allocation transaction is lost after a crash), this change
shouldn't introduce any fundamental limitations that don't already
exist. In short, on storage stacks where discards are important,
it's good practice to run an occasional fstrim even with online
discard enabled in the filesystem, particularly after a crash.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:47 -07:00
Brian Foster
13b86fc337 xfs: skip online discard during eofblocks trims
We've had reports of online discard operations being sent from XFS
on write-only workloads. These discards occur as a result of
eofblocks trims that can occur after a large file copy completes.

These discards are slightly confusing for users who might be paying
close attention to online discards (i.e., vdo) due to performance
sensitivity. They also happen to be spurious because freed post-eof
blocks by definition have not been written to during the current
allocation cycle.

Update xfs_free_eofblocks() to skip discards that are purely
attributed to eofblocks trims. This cuts down the number of spurious
discards that may occur on write-only workloads due to normal
preallocation activity.

Note that discards of post-eof extents can still occur from other
codepaths that do not isolate handling of post-eof blocks from those
within eof. For example, file unlinks and truncates may still cause
discards for any file blocks affected by the operation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:47 -07:00
Brian Foster
fcb762f5de xfs: add bmapi nodiscard flag
Freed extents are unconditionally discarded when online discard is
enabled. Define XFS_BMAPI_NODISCARD to allow callers to bypass
discards when unnecessary. For example, this will be useful for
eofblocks trimming.

This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
e6631f8554 xfs: get rid of the log item descriptor
It's just a connector between a transaction and a log item. There's
a 1:1 relationship between a log item descriptor and a log item,
and a 1:1 relationship between a log item descriptor and a
transaction. Both relationships are created and terminated at the
same time, so why do we even have the descriptor?

Replace it with a specific list_head in the log item and a new
log item dirtied flag to replace the XFS_LID_DIRTY flag.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix up deferred agfl intent finish_item use of LID_DIRTY]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
1a2ebf835a xfs: add some more debug checks to buffer log item reuse
Just to make sure the item isn't associated with another
transaction when we try to reuse it.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
844e5e74c1 xfs: fix double ijoin in xfs_reflink_clear_inode_flag()
xfs_reflink_clear_inode_flag double-joins an inode to a transaction,
which is not allowed.  Fix that and document that the caller must have
already joined it.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edit out trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
c5295c6aad xfs: fix double ijoin in xfs_reflink_cancel_cow_range
xfs_reflink_cancel_cow_range joins an inode twice to the same
transaction.  This is not allowed, so fix it and document that the
callers of xfs_reflink_cancel_cow_blocks() must have already joined the
inode to the permanent transaction passed in.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edited the commit log to remove trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
3565b660e5 xfs: fix double ijoin in xfs_inactive_symlink_rmt()
xfs_inactive_symlink_rmt() does something nasty - it joins an inode
into a transaction it is already joined to. This means the inode can
have multiple log item descriptors attached to the transaction for
it. This breaks teh 1:1 mapping that is supposed to exist
between the log item and log item descriptor.

This results in the log item being processed twice during
transaction commit and CIL formatting, and there are lots of other
potential issues tha arise from double processing of log items in
the transaction commit state machine.

In this case, the inode is already held by the rolling transaction
returned from xfs_defer_finish(), so there's no need to join it
again.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
d686d12d23 xfs: don't assert fail with AIL lock held
Been hitting AIL ordering assert failures recently, but been unable
to trace them down because the system immediately hangs up onteh
spinlock that was held when this assert fires:

XFS: Assertion failed: XFS_LSN_CMP(prev_lip->li_lsn, lip->li_lsn) <= 0, file: fs/xfs/xfs_trans_ail.c, line: 52

Move the assertions outside of the spinlock so the corpse can
be dissected. Thanks to Brian Foster for supplying a clean
way of doing this.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
e632a5690c xfs: adder caller IP to xfs_defer* tracepoints
So it's clear in the trace where they are being called from.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
ba18781b91 xfs: add tracing to high level transaction operations
Because currently we have no idea what the transaction context we
are operating in is, and I need to know that information to track
down bugs in multiple log item joins to transactions.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:46 -07:00
Dave Chinner
22525c17ed xfs: log item flags are racy
The log item flags contain a field that is protected by the AIL
lock - the XFS_LI_IN_AIL flag. We use non-atomic RMW operations to
set and clear these flags, but most of the updates and checks are
not done with the AIL lock held and so are susceptible to update
races.

Fix this by changing the log item flags to use atomic bitops rather
than be reliant on the AIL lock for update serialisation.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10 08:56:41 -07:00
Darrick J. Wong
52101dfe56 xfs: add missing rmap error return
xfs_rmap_lookup_le_range can return errors, so we need to check for
them and bail out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10 08:56:41 -07:00
Darrick J. Wong
cec572561a xfs: bmap debugging should never panic the system
Don't panic() the system if the bmap records are garbage, just call
ASSERT which gives us the same backtrace but enables developers to
control if the system goes down or not.  This makes debugging with
generic/388 much easier because it won't reboot the machine midway
through a run just because btree_read_bufl returns EIO when the fs has
already shut down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-09 10:04:02 -07:00
Brian Foster
8804630e1e xfs: defer agfl frees from directory op transactions
Directory operations can perform block allocations as entries are
added/removed from directories. Defer AGFL block frees from the
remaining directory operation transactions. This covers the hard
link, remove and rename operations.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
8b922f0e6a xfs: defer frees from common inode allocation paths
Inode allocation can require block allocation for physical inode
chunk allocation, inode btree record insertion, and/or directory
block allocation for entry insertion. Any of these block allocation
requests can require AGFL fixups prior to the actual allocation.
Update the common file creation transacions to defer AGFL frees from
these contexts to avoid too much log reservation consumption
per-transaction.

Since these transactions are already passed down through the btree
cursors and da_args structure, this simply requires to attach dfops
to the transaction. Note that this covers tr_create, tr_mkdir and
tr_symlink. Other transactions such as tr_create_tmpfile do not
already make use of deferred operations and so are left alone for
the time being.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
658f8f9511 xfs: defer agfl frees from inode inactivation
XFS inode chunks are already freed via deferred operations (which
now also defer AGFL block frees), but inode btree blocks are freed
directly in the associated context. This has been known to lead to
log reservation overruns in particular workloads where an inobt
block free may require several AGFL block frees (and thus several
allocation btree modifications) before the inobt block itself is
actually freed.

To avoid this problem, defer the frees of any AGFL blocks before the
inobt block free takes place. This requires passing the dfops from
xfs_inactive_ifree() down through the inobt ->[alloc|free]_block()
callouts, which essentially only requires to attach the dfops to the
transaction since it is already carried all the way through to the
inobt update and allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
2bc5eba8b6 xfs: defer agfl block frees from deferred ops processing context
Now that AGFL block frees are deferred when dfops is set in the
transaction, start deferring AGFL block frees from contexts that are
known to push the limits of existing log reservations.

The first such context is deferred operation processing itself. This
primarily targets deferred extent frees (such as file extents and
inode chunks), but in doing so covers all allocation operations that
occur in deferred operation processing context.

Update xfs_defer_finish() to set and reset ->t_agfl_dfops across the
processing sequence. This means that any AGFL block frees due to
allocation events result in the addition of new EFIs to the dfops
rather than being processed immediately. xfs_defer_finish() rolls
the transaction at least once more to process the frees of the AGFL
blocks back to the allocation btrees and returns once the AGFL is
rectified.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
f8f2835a9c xfs: defer agfl block frees when dfops is available
The AGFL fixup code executes before every block allocation/free and
rectifies the AGFL based on the current, dynamic allocation
requirements of the fs. The AGFL must hold a minimum number of
blocks to satisfy a worst case split of the free space btrees caused
by the impending allocation operation. The AGFL is also updated to
maintain the implicit requirement for a minimum number of free slots
to satisfy a worst case join of the free space btrees.

Since the AGFL caches individual blocks, AGFL reduction typically
involves multiple, single block frees. We've had reports of
transaction overrun problems during certain workloads that boil down
to AGFL reduction freeing multiple blocks and consuming more space
in the log than was reserved for the transaction.

Since the objective of freeing AGFL blocks is to ensure free AGFL
free slots are available for the upcoming allocation, one way to
address this problem is to release surplus blocks from the AGFL
immediately but defer the free of those blocks (similar to how
file-mapped blocks are unmapped from the file in one transaction and
freed via a deferred operation) until the transaction is rolled.
This turns AGFL reduction into an operation with predictable log
reservation consumption.

Add the capability to defer AGFL block frees when a deferred ops
list is available to the AGFL fixup code. Add a dfops pointer to the
transaction to carry dfops through various contexts to the allocator
context. Deferring AGFL frees is  conditional behavior based on
whether the transaction pointer is populated. The long term
objective is to reuse the transaction pointer to clean up all
unrelated callchains that pass dfops on the stack along with a
transaction and in doing so, consistently defer AGFL blocks from the
allocator.

A bit of customization is required to handle deferred completion
processing because AGFL blocks are accounted against a per-ag
reservation pool and AGFL blocks are not inserted into the extent
busy list when freed (they are inserted when used and released back
to the AGFL). Reuse the majority of the existing deferred extent
free infrastructure and customize it appropriately to handle AGFL
blocks.

Note that this patch only adds infrastructure. It does not change
behavior because no callers have been updated to pass ->t_agfl_dfops
into the allocation code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:02 -07:00
Brian Foster
4223f659dd xfs: create agfl block free helper function
Refactor the AGFL block free code into a new helper such that it can
be invoked from deferred context. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
72c5c5f6d0 xfs: print specific dqblk that failed verifiers
Rather than printing the top of the buffer that held a corrupted dqblk,
restructure things to print out the specific one that failed by pushing
the calls to the verifier_error function down into the verifier which
iterates over the buffer and detects the error.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
7224fa482a xfs: add full xfs_dqblk verifier
Add an xfs_dqblk verifier so that it can check the uuid on V5 filesystems;
it calls the existing xfs_dquot_verify verifier to validate the
xfs_disk_dquot_t contained inside it.  This lets us move the uuid
verification out of the crc verifier, which makes little sense.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
48fa1db87f xfs: pass full xfs_dqblk to repair during quotacheck
It's a bit dicey to pass in the smaller xfs_disk_dquot and then cast it to
something larger; pass in the full xfs_dqblk so we know the caller has sent
us the right thing.  Rename the function to xfs_dqblk_repair for
clarity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
57ab324553 xfs: check type in quota verifier during quotacheck
During quotacheck we send in the quota type, so verify that as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Eric Sandeen
e381a0f6c2 xfs: remove unused flags arg from xfs_dquot_verify
Long ago the flags argument was used to determine whether to issue warnings
about corruptions, but that's done elsewhere now and the flag is unused
here, so remove it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Dave Chinner
dfa03a5f80 xfs: clean up locking in xfs_file_iomap_begin
Rather than checking what kind of locking is needed in a helper
function and then jumping through hoops to do the locking in line,
move the locking to the helper function that does all the checks
and rename it to xfs_ilock_for_iomap().

This also allows us to hoist all the nonblocking checks up into the
locking helper, further simplifier the code flow in
xfs_file_iomap_begin() and making it easier to understand.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Dave Chinner
d064178094 xfs: simplify xfs_file_iomap_begin() logic
The current logic that determines whether allocation should be done
has grown somewhat spaghetti like with the addition of IOMAP_NOWAIT
functionality. Separate out each of the different cases into single,
obvious checks to get rid most of the nested IOMAP_NOWAIT checks
in the allocation logic.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Dave Chinner
4f8ff44ba0 iomap: iomap_dio_rw() handles all sync writes
Currently iomap_dio_rw() only handles (data)sync write completions
for AIO. This means we can't optimised non-AIO IO to minimise device
flushes as we can't tell the caller whether a flush is required or
not.

To solve this problem and enable further optimisations, make
iomap_dio_rw responsible for data sync behaviour for all IO, not
just AIO.

In doing so, the sync operation is now accounted as part of the DIO
IO by inode_dio_end(), hence post-IO data stability updates will no
long race against operations that serialise via inode_dio_wait()
such as truncate or hole punch.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:01 -07:00
Dave Chinner
ed5c3e66a3 xfs: move generic_write_sync calls inwards
To prepare for iomap iinfrastructure based DSYNC optimisations.

While moving the code araound, move the XFS write bytes metric
update for direct IO into xfs_dio_write_end_io callback so that we
always capture the amount of data written via AIO+DIO. This fixes
the problem where queued AIO+DIO writes are not accounted to this
metric.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Dave Chinner
b027d4c97b xfs: don't retry xfs_buf_find on XBF_TRYLOCK failure
When looking at an event trace recently, I noticed that non-blocking
buffer lookup attempts would fail on cached locked buffers and then
run the slow cache-miss path. This means we are doing an xfs_buf
allocation, lookup and free unnecessarily every time we avoid
blocking on a locked buffer.

Fix this by changing _xfs_buf_find() to return an error status to
the caller to indicate that we failed the lock attempt rather than
just returning a NULL. This allows the higher level code to
discriminate between a cache miss and an cache hit that we failed to
lock.

This also allows us to return a -EFSCORRUPTED state if we are asked
to look up a block number outside the range of the filesystem in
_xfs_buf_find(), which moves us one step closer to being able to
handle such errors in a more graceful manner at the higher levels.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Dave Chinner
8925a3dc47 xfs: make xfs_buf_incore out of line
Move xfs_buf_incore out of line and make it the only way to look up
a buffer in the buffer cache from outside the buffer cache. Convert
the external users of _xfs_buf_find() to xfs_buf_incore() and make
_xfs_buf_find() static.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: actually rename xfs_incore -> xfs_buf_incore]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Eric Sandeen
e443523d19 xfs: trace ATTR flags in xattr tracepoints
This will trace i.e. the ATTR_SECURE/ATTR_CREATE/ATTR_REPLACE
flags as well as the OP_FLAGS.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Dave Chinner
8b26984dbd xfs: validate allocated inode number
When we have corrupted free inode btrees, we can attempt to
allocate inodes that we know are already allocated. Catch allocation
of these inodes and report corruption as early as possible to
prevent corruption propagation or deadlocks.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Dave Chinner
afca6c5b25 xfs: validate cached inodes are free when allocated
A recent fuzzed filesystem image cached random dcache corruption
when the reproducer was run. This often showed up as panics in
lookup_slow() on a null inode->i_ops pointer when doing pathwalks.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
....
Call Trace:
 lookup_slow+0x44/0x60
 walk_component+0x3dd/0x9f0
 link_path_walk+0x4a7/0x830
 path_lookupat+0xc1/0x470
 filename_lookup+0x129/0x270
 user_path_at_empty+0x36/0x40
 path_listxattr+0x98/0x110
 SyS_listxattr+0x13/0x20
 do_syscall_64+0xf5/0x280
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

but had many different failure modes including deadlocks trying to
lock the inode that was just allocated or KASAN reports of
use-after-free violations.

The cause of the problem was a corrupt INOBT on a v4 fs where the
root inode was marked as free in the inobt record. Hence when we
allocated an inode, it chose the root inode to allocate, found it in
the cache and re-initialised it.

We recently fixed a similar inode allocation issue caused by inobt
record corruption problem in xfs_iget_cache_miss() in commit
ee457001ed ("xfs: catch inode allocation state mismatch
corruption"). This change adds similar checks to the cache-hit path
to catch it, and turns the reproducer into a corruption shutdown
situation.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in comment]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09 10:04:00 -07:00
Darrick J. Wong
021ba8e98f xfs: cap the length of deduplication requests
Since deduplication potentially has to read in all the pages in both
files in order to compare the contents, cap the deduplication request
length at MAX_RW_COUNT/2 (roughly 1GB) so that we have /some/ upper bound
on the request length and can't just lock up the kernel forever.  Found
by running generic/304 after commit 1ddae54555b62 ("common/rc: add
missing 'local' keywords").

Reported-by: matorola@gmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-05-02 09:21:33 -07:00
Darrick J. Wong
7b38460dc8 xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE
Kanda Motohiro reported that expanding a tiny xattr into a large xattr
fails on XFS because we remove the tiny xattr from a shortform fork and
then try to re-add it after converting the fork to extents format having
not removed the ATTR_REPLACE flag.  This fails because the attr is no
longer present, causing a fs shutdown.

This is derived from the patch in his bug report, but we really
shouldn't ignore a nonzero retval from the remove call.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119
Reported-by: kanda.motohiro@gmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 19:10:15 -07:00
Darrick J. Wong
7d83fb1425 xfs: prevent creating negative-sized file via INSERT_RANGE
During the "insert range" fallocate operation, i_size grows by the
specified 'len' bytes.  XFS verifies that i_size + len < s_maxbytes, as
it should.  But this comparison is done using the signed 'loff_t', and
'i_size + len' can wrap around to a negative value, causing the check to
incorrectly pass, resulting in an inode with "negative" i_size.  This is
possible on 64-bit platforms, where XFS sets s_maxbytes = LLONG_MAX.
ext4 and f2fs don't run into this because they set a smaller s_maxbytes.

Fix it by using subtraction instead.

Reproducer:
    xfs_io -f file -c "truncate $(((1<<63)-1))" -c "finsert 0 4096"

Fixes: a904b1ca57 ("xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate")
Cc: <stable@vger.kernel.org> # v4.1+
Originally-From: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix signed integer addition overflow too]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 17:29:11 -07:00
Eric Sandeen
2c4306f719 xfs: set format back to extents if xfs_bmap_extents_to_btree
If xfs_bmap_extents_to_btree fails in a mode where we call
xfs_iroot_realloc(-1) to de-allocate the root, set the
format back to extents.

Otherwise we can assume we can dereference ifp->if_broot
based on the XFS_DINODE_FMT_BTREE format, and crash.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 17:10:17 -07:00
Eric Sandeen
b42db0860e xfs: enhance dinode verifier
Add several more validations to xfs_dinode_verify:

- For LOCAL data fork formats, di_nextents must be 0.
- For LOCAL attr fork formats, di_anextents must be 0.
- For inodes with no attr fork offset,
  - format must be XFS_DINODE_FMT_EXTENTS if set at all
  - di_anextents must be 0.

Thanks to dchinner for pointing out a couple related checks I had
forgotten to add.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199377
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 17:10:17 -07:00
Linus Torvalds
80aa76bcd3 Changes since last update:
- Cleanup unnecessary function call parameters
 - Fix a use-after-free bug when aborting logging intents
 - Refactor filestreams state data to avoid use-after-free bug
 - Fix incorrect removal of cow extents when truncating extended
   attributes.
 - Refactor open-coded __set_page_dirty in favor of using vfs function.
 - Fix a deadlock when fstrim and fs shutdown race.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJazZ/HAAoJEPh/dxk0SrTrGeYP/Asis7MZ3TWfzsPHwK6EoH+w
 q1VnKVMqjSRnE8DYmx8w3tMqny2qg9klLkT1SRRz9Htr5CER5XqIX6ZYlQpVKx2R
 ycS+I1l/V/kqEmAgDgSK3DZ5uMgKHfo0w7GbKFDg69YDztN3yNBpDfUZ4VqA/Ua2
 ADaeMYkJh6oBB5yZAWGyAJEM5TieqQppqyc+WLhbORyjEFreUTTmLqbzLPnceSih
 rQLg0noDwZK0XqbwndYXGTNKKoQtiJalnZP18DhH4zOr+FH03i/gmlU3w+ANl3eX
 IEE0TR2EkNHLZeduz7xT7ZHUOo0TaGObBK8CJFSojibQ/HooVjLcFQResd3q0coN
 WTkbTuxHwMqk2IujKRTqli/saENhvFrOrm/nYTFPw9+3GpRt0iLrXPSBbeMjmsZG
 XntdimPEHywjYrdW10VRH+6E6tvQiC/tl3abBuXdaEOJs1KZmPYNt42EF0ZQ5Xs5
 IeDOhPLuuyUwRf12RVA9WS6xGdMR0+foMqncXZNcAzxQeAVfUNEtASNMTNIOO2H5
 xD34/1ooFJnwT755VLT9U/qUd5CHtWO3AkH+9RVCpKWTaEOSY+gV6ZgmINv9npur
 Vw2xnZwxysegVlu76uct/b/sy/8J3OKYIoSYwbt6Zxi0M1WF9zanJ9pKCsBYdL6A
 xWoMr0X1yFGcYJozqwJU
 =uGz0
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs updates from Darrick Wong:
 "Most of these are code cleanups, but there are a couple of notable
  use-after-free bug fixes.

  This series has been run through a full xfstests run over the week and
  through a quick xfstests run against this morning's master, with no
  major failures reported.

   - clean up unnecessary function call parameters

   - fix a use-after-free bug when aborting logging intents

   - refactor filestreams state data to avoid use-after-free bug

   - fix incorrect removal of cow extents when truncating extended
     attributes.

   - refactor open-coded __set_page_dirty in favor of using vfs
     function.

   - fix a deadlock when fstrim and fs shutdown race"

* tag 'xfs-4.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  Force log to disk before reading the AGF during a fstrim
  Export __set_page_dirty
  xfs: only cancel cow blocks when truncating the data fork
  xfs: non-scrub - remove unused function parameters
  xfs: remove filestream item xfs_inode reference
  xfs: fix intent use-after-free on abort
  xfs: Remove "committed" argument of xfs_dir_ialloc
2018-04-12 13:28:22 -07:00
Matthew Wilcox
f82b376413 export __set_page_dirty
XFS currently contains a copy-and-paste of __set_page_dirty().  Export
it from buffer.c instead.

Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Carlos Maiolino
8c81dd46ef Force log to disk before reading the AGF during a fstrim
Forcing the log to disk after reading the agf is wrong, we might be
calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held.

This can cause a deadlock when racing a fstrim with a filesystem
shutdown.

The deadlock has been identified due a miscalculation bug in device-mapper
dm-thin, which returns lack of space to its users earlier than the device itself
really runs out of space, changing the device-mapper volume into an error state.

The problem happened while filling the filesystem with a single file,
triggering the bug in device-mapper, consequently causing an IO error
and shutting down the filesystem.

If such file is removed, and fstrim executed before the XFS finishes the
shut down process, the fstrim process will end up holding the buffer
lock, and going to sleep on the cil wait queue.

At this point, the shut down process will try to wake up all the threads
waiting on the cil wait queue, but for this, it will try to hold the
same buffer log already held my the fstrim, locking up the filesystem.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-10 22:39:04 -07:00
Matthew Wilcox
fbbb450904 Export __set_page_dirty
XFS currently contains a copy-and-paste of __set_page_dirty().  Export
it from buffer.c instead.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-10 22:39:01 -07:00
Linus Torvalds
9f3a0941fb libnvdimm for 4.17
* A rework of the filesytem-dax implementation provides for detection of
   unmap operations (truncate / hole punch) colliding with in-progress
   device-DMA. A fix for these collisions remains a work-in-progress
   pending resolution of truncate latency and starvation regressions.
 
 * The of_pmem driver expands the users of libnvdimm outside of x86 and
   ACPI to describe an implementation of persistent memory on PowerPC with
   Open Firmware / Device tree.
 
 * Address Range Scrub (ARS) handling is completely rewritten to account for
   the fact that ARS may run for 100s of seconds and there is no platform
   defined way to cancel it. ARS will now no longer block namespace
   initialization.
 
 * The NVDIMM Namespace Label implementation is updated to handle label
   areas as small as 1K, down from 128K.
 
 * Miscellaneous cleanups and updates to unit test infrastructure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJazDt5AAoJEB7SkWpmfYgCqGMQALLwdPeY87cUK7AvQ2IXj46B
 lJgeVuHPzyQDbC03AS5uUYnnU3I5lFd7i4y7ZrywNpFs4lsb/bNmbUpQE5xp+Yvc
 1MJ/JYDIP5X4misWYm3VJo85N49+VqSRgAQk52PBigwnZ7M6/u4cSptXM9//c9JL
 /NYbat6IjjY6Tx49Tec6+F3GMZjsFLcuTVkQcREoOyOqVJE4YpP0vhNjEe0vq6vr
 EsSWiqEI5VFH4PfJwKdKj/64IKB4FGKj2A5cEgjQBxW2vw7tTJnkRkdE3jDUjqtg
 xYAqGp/Dqs4+bgdYlT817YhiOVrcr5mOHj7TKWQrBPgzKCbcG5eKDmfT8t+3NEga
 9kBlgisqIcG72lwZNA7QkEHxq1Omy9yc1hUv9qz2YA0G+J1WE8l1T15k1DOFwV57
 qIrLLUypklNZLxvrzNjclempboKc4JCUlj+TdN5E5Y6pRs55UWTXaP7Xf5O7z0vf
 l/uiiHkc3MPH73YD2PSEGFJ8m8EU0N8xhrcz3M9E2sHgYCnbty1Lw3FH0/GhThVA
 ya1mMeDdb8A2P7gWCBk1Lqeig+rJKXSey4hKM6D0njOEtMQO1H4tFqGjyfDX1xlJ
 3plUR9WBVEYzN5+9xWbwGag/ezGZ+NfcVO2gmy6yXiEph796BxRAZx/18zKRJr0m
 9eGJG1H+JspcbtLF9iHn
 =acZQ
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This cycle was was not something I ever want to repeat as there were
  several late changes that have only now just settled.

  Half of the branch up to commit d2c997c0f1 ("fs, dax: use
  page->mapping to warn...") have been in -next for several releases.
  The of_pmem driver and the address range scrub rework were late
  arrivals, and the dax work was scaled back at the last moment.

  The of_pmem driver missed a previous merge window due to an oversight.
  A sense of obligation to rectify that miss is why it is included for
  4.17. It has acks from PowerPC folks. Stephen reported a build failure
  that only occurs when merging it with your latest tree, for now I have
  fixed that up by disabling modular builds of of_pmem. A test merge
  with your tree has received a build success report from the 0day robot
  over 156 configs.

  An initial version of the ARS rework was submitted before the merge
  window. It is self contained to libnvdimm, a net code reduction, and
  passing all unit tests.

  The filesystem-dax changes are based on the wait_var_event()
  functionality from tip/sched/core. However, late review feedback
  showed that those changes regressed truncate performance to a large
  degree. The branch was rewound to drop the truncate behavior change
  and now only includes preparation patches and cleanups (with full acks
  and reviews). The finalization of this dax-dma-vs-trnucate work will
  need to wait for 4.18.

  Summary:

   - A rework of the filesytem-dax implementation provides for detection
     of unmap operations (truncate / hole punch) colliding with
     in-progress device-DMA. A fix for these collisions remains a
     work-in-progress pending resolution of truncate latency and
     starvation regressions.

   - The of_pmem driver expands the users of libnvdimm outside of x86
     and ACPI to describe an implementation of persistent memory on
     PowerPC with Open Firmware / Device tree.

   - Address Range Scrub (ARS) handling is completely rewritten to
     account for the fact that ARS may run for 100s of seconds and there
     is no platform defined way to cancel it. ARS will now no longer
     block namespace initialization.

   - The NVDIMM Namespace Label implementation is updated to handle
     label areas as small as 1K, down from 128K.

   - Miscellaneous cleanups and updates to unit test infrastructure"

* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  libnvdimm, of_pmem: workaround OF_NUMA=n build error
  nfit, address-range-scrub: add module option to skip initial ars
  nfit, address-range-scrub: rework and simplify ARS state machine
  nfit, address-range-scrub: determine one platform max_ars value
  powerpc/powernv: Create platform devs for nvdimm buses
  doc/devicetree: Persistent memory region bindings
  libnvdimm: Add device-tree based driver
  libnvdimm: Add of_node to region and bus descriptors
  libnvdimm, region: quiet region probe
  libnvdimm, namespace: use a safe lookup for dimm device name
  libnvdimm, dimm: fix dpa reservation vs uninitialized label area
  libnvdimm, testing: update the default smart ctrl_temperature
  libnvdimm, testing: Add emulation for smart injection commands
  nfit, address-range-scrub: introduce nfit_spa->ars_state
  libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
  nfit, address-range-scrub: fix scrub in-progress reporting
  dax, dm: allow device-mapper to operate without dax support
  dax: introduce CONFIG_DAX_DRIVER
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  ext2, dax: introduce ext2_dax_aops
  ...
2018-04-10 10:25:57 -07:00
Darrick J. Wong
4919d42ab6 xfs: only cancel cow blocks when truncating the data fork
In xfs_itruncate_extents, only cancel cow blocks and clear the reflink
flag if we were asked to truncate the data fork.  Attr fork blocks
cannot be shared, so this makes no sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-04-10 08:28:33 -07:00
Dan Williams
e13e75b86e Merge branch 'for-4.17/dax' into libnvdimm-for-next 2018-04-09 10:50:17 -07:00
Eric Sandeen
a1f69417c6 xfs: non-scrub - remove unused function parameters
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-09 10:23:42 -07:00
Christoph Hellwig
7fcd3efa1e xfs: remove filestream item xfs_inode reference
The filestreams allocator stores an xfs_fstrm_item structure in the MRU to
cache inode number to agno mappings for a particular length of time.  Each
xfs_fstrm_item contains the internal MRU structure, an inode pointer and
agno value.

The inode pointer stored in the xfs_fstrm_item is not referenced, however,
which means the inode itself can be removed and reclaimed before the MRU
item is freed. If this occurs, xfs_fstrm_free_func() can access freed or
unrelated memory through xfs_fstrm_item->ip and crash.

The obvious solution is to grab an inode reference for xfs_fstrm_item.
The filestream mechanism only actually uses the inode pointer as a means
to access the xfs_mount, however.  Rather than add unnecessary
complexity, simplify the implementation to store an xfs_mount pointer in
struct xfs_mru_cache, and pass it to the free callback.  This also
requires updates to the tracepoint class to provide the associated data
via parameters rather than the inode and a minor hack to peek at the MRU
key to establish the inode number at free time.

Based on debugging work and an earlier patch from Brian Foster, who
also wrote most of this changelog.

Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-09 10:23:39 -07:00
Linus Torvalds
9022ca6b11 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff, including Christoph's I_DIRTY patches"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: move I_DIRTY_INODE to fs.h
  ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
  fs: fold open_check_o_direct into do_dentry_open
  vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
  vfs: make sure struct filename->iname is word-aligned
  get rid of pointless includes of fs_struct.h
  [poll] annotate SAA6588_CMD_POLL users
2018-04-06 11:07:08 -07:00
Linus Torvalds
672a9c1069 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  kfifo: fix inaccurate comment
  tools/thermal: tmon: fix for segfault
  net: Spelling s/stucture/structure/
  edd: don't spam log if no EDD information is present
  Documentation: Fix early-microcode.txt references after file rename
  tracing: Block comments should align the * on each line
  treewide: Fix typos in printk
  GenWQE: Fix a typo in two comments
  treewide: Align function definition open/close braces
2018-04-05 11:56:35 -07:00
Dave Chinner
0612d11663 xfs: fix intent use-after-free on abort
When an intent is aborted during it's initial commit through
xfs_defer_trans_abort(), there is a use after free. The current
report is for a RUI  through this path in generic/388:

 Freed by task 6274:
  __kasan_slab_free+0x136/0x180
  kmem_cache_free+0xe7/0x4b0
  xfs_trans_free_items+0x198/0x2e0
  __xfs_trans_commit+0x27f/0xcc0
  xfs_trans_roll+0x17b/0x2a0
  xfs_defer_trans_roll+0x6ad/0xe60
  xfs_defer_finish+0x2a6/0x2140
  xfs_alloc_file_space+0x53a/0xf90
  xfs_file_fallocate+0x5c6/0xac0
  vfs_fallocate+0x2f5/0x930
  ioctl_preallocate+0x1dc/0x320
  do_vfs_ioctl+0xfe4/0x1690

The problem is that the RUI has two active references - one in the
current transaction, and another held by the defer_ops structure
that is passed to the RUD (intent done) so that both the intent and
the intent done structures are freed on commit of the intent done.

Hence during abort, we need to release the intent item, because the
defer_ops reference is released separately via ->abort_intent
callback. Fix all the intent code to do this correctly.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-02 20:08:27 -07:00
Chandan Rajendra
c959025eda xfs: Remove "committed" argument of xfs_dir_ialloc
xfs_dir_ialloc() rolls the current transaction when allocation of a new
inode required the space manager to perform an allocation and replinish
the Inode btree.

None of the callers of xfs_dir_ialloc() need to know if the
transaction was committed. Hence this commit removes the "committed"
argument of xfs_dir_ialloc.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-02 15:47:43 -07:00
Dan Williams
6e2608dfd9 xfs, dax: introduce xfs_dax_aops
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-03-30 11:34:55 -07:00
Eric Sandeen
dc1baa715b xfs: do not log/recover swapext extent owner changes for deleted inodes
Today if we run xfs_fsr and crash[1], log replay can fail because
the recovery code tries to instantiate the donor inode from
disk to replay the swapext, but it's been deleted and we get
verifier failures when we try to read the inode off disk with
i_mode == 0.

This fixes both sides: We don't log the swapext change if the
inode has been deleted, and we don't try to recover it either.

[1] or if systemd doesn't cleanly unmount root, as it is wont
    to do ...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-29 10:19:15 -07:00
Brian Foster
72c44e35f0 xfs: clean up xfs_mount allocation and dynamic initializers
Most of the generic data structures embedded in xfs_mount are
dynamically initialized immediately after mp is allocated. A few
fields are left out and initialized during the xfs_mountfs()
sequence, after mp has been attached to the superblock.

To clean this up and help prevent premature access of associated
fields, refactor xfs_mount allocation and all dependent init calls
into a new helper. This self-documents that all low level data
structures (i.e., locks, trees, etc.) should be initialized before
xfs_mount is attached to the superblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-26 08:54:15 -07:00
Joe Perches
447a5647c9 treewide: Align function definition open/close braces
Some functions definitions have either the initial open brace and/or
the closing brace outside of column 1.

Move those braces to column 1.

This allows various function analyzers like gnu complexity to work
properly for these modified functions.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2018-03-26 11:13:09 +02:00
Dave Chinner
fa4493f0d9 xfs: remove dead inode version setting code
We can only get into the branch if CRCs are enabled, so there's no
need to check inside the branch for CRCs being enabled....

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:09 -07:00
Dave Chinner
ee457001ed xfs: catch inode allocation state mismatch corruption
We recently came across a V4 filesystem causing memory corruption
due to a newly allocated inode being setup twice and being added to
the superblock inode list twice. From code inspection, the only way
this could happen is if a newly allocated inode was not marked as
free on disk (i.e. di_mode wasn't zero).

Running the metadump on an upstream debug kernel fails during inode
allocation like so:

XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inod=
e.c, line: 838
 ------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:114!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/0=
1/2014
RIP: 0010:assfail+0x28/0x30
RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
FS:  00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000=
000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
Call Trace:
 xfs_ialloc+0x383/0x570
 xfs_dir_ialloc+0x6a/0x2a0
 xfs_create+0x412/0x670
 xfs_generic_create+0x1f7/0x2c0
 ? capable_wrt_inode_uidgid+0x3f/0x50
 vfs_mkdir+0xfb/0x1b0
 SyS_mkdir+0xcf/0xf0
 do_syscall_64+0x73/0x1a0
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Extracting the inode number we crashed on from an event trace and
looking at it with xfs_db:

xfs_db> inode 184452204
xfs_db> p
core.magic = 0x494e
core.mode = 0100644
core.version = 2
core.format = 2 (extents)
core.nlinkv2 = 1
core.onlink = 0
.....

Confirms that it is not a free inode on disk. xfs_repair
also trips over this inode:

.....
zero length extent (off = 0, fsbno = 0) in ino 184452204
correcting nextents for inode 184452204
bad attribute fork in inode 184452204, would clear attr fork
bad nblocks 1 for inode 184452204, would reset to 0
bad anextents 1 for inode 184452204, would reset to 0
imap claims in-use inode 184452204 is free, would correct imap
would have cleared inode 184452204
.....
disconnected inode 184452204, would move to lost+found

And so we have a situation where the directory structure and the
inobt thinks the inode is free, but the inode on disk thinks it is
still in use. Where this corruption came from is not possible to
diagnose, but we can detect it and prevent the kernel from oopsing
on lookup. The reproducer now results in:

$ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
mkdir: cannot create directory =E2=80=98/mnt/scratch/00=E2=80=99: File ex=
ists
mkdir: cannot create directory =E2=80=98/mnt/scratch/01=E2=80=99: File ex=
ists
mkdir: cannot create directory =E2=80=98/mnt/scratch/03=E2=80=99: Structu=
re needs cleaning
mkdir: cannot create directory =E2=80=98/mnt/scratch/04=E2=80=99: Input/o=
utput error
mkdir: cannot create directory =E2=80=98/mnt/scratch/05=E2=80=99: Input/o=
utput error
....

And this corruption shutdown:

[   54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not=
 marked free on disk
[   54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 =
of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x425/0x670
[   54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #=
443
[   54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO=
S 1.10.2-1 04/01/2014
[   54.852859] Call Trace:
[   54.853531]  dump_stack+0x85/0xc5
[   54.854385]  xfs_trans_cancel+0x197/0x1c0
[   54.855421]  xfs_create+0x425/0x670
[   54.856314]  xfs_generic_create+0x1f7/0x2c0
[   54.857390]  ? capable_wrt_inode_uidgid+0x3f/0x50
[   54.858586]  vfs_mkdir+0xfb/0x1b0
[   54.859458]  SyS_mkdir+0xcf/0xf0
[   54.860254]  do_syscall_64+0x73/0x1a0
[   54.861193]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[   54.862492] RIP: 0033:0x7fb73bddf547
[   54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000=
000000000053
[   54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73=
bddf547
[   54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffda=
a55449a
[   54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a=
8670dd0
[   54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 000000000=
00001ff
[   54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffda=
a553500
[   54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1=
024 of file fs/xfs/xfs_trans.c.  Return address = ffffffff814cd050
[   54.882790] XFS (loop0): Corruption of in-memory data detected.  Shutt=
ing down filesystem
[   54.884597] XFS (loop0): Please umount the filesystem and rectify the =
problem(s)

Note that this crash is only possible on v4 filesystemsi or v5
filesystems mounted with the ikeep mount option. For all other V5
filesystems, this problem cannot occur because we don't read inodes
we are allocating from disk - we simply overwrite them with the new
inode information.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:09 -07:00
Darrick J. Wong
b83e4c3ced xfs: xfs_scrub_iallocbt_xref_rmap_inodes should use xref_set_corrupt
In xfs_scrub_iallocbt_xref_rmap_inodes we're checking inodes against
rmap records, so we should use xfs_scrub_btree_xref_set_corrupt if we
encounter discrepancies here so that we know that it's a cross
referencing error, not necessarily a corruption in the inobt itself.

The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.

It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt".  The same reasoning applies to "xfs: record inode
buf errors as a xref error in inobt scrubber".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:09 -07:00
Darrick J. Wong
5927268f5a xfs: flag inode corruption if parent ptr doesn't get us a real inode
If a directory's parent inode pointer doesn't point to an inode, the
directory should be flagged as corrupt.  Enable IGET_UNTRUSTED here so
that _iget will return -EINVAL if the inobt does not confirm that the
inode is present and allocated and we can flag the directory corruption.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
6a96c56505 xfs: don't accept inode buffers with suspicious unlinked chains
When we're verifying inode buffers, sanity-check the unlinked pointer.
We don't want to run the risk of trying to purge something that's
obviously broken.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
8bb82bc12a xfs: move inode extent size hint validation to libxfs
Extent size hint validation is used by scrub to decide if there's an
error, and it will be used by repair to decide to remove the hint.
Since these use the same validation functions, move them to libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
1b44a6aecc xfs: record inode buf errors as a xref error in inobt scrubber
During the inode btree scrubs we try to confirm the freemask bits
against the inode records.  If the inode buffer read fails, this is a
cross-referencing error, not a corruption of the inode btree itself.
Use the xref_process_error call here.  Found via core.version middlebit
fuzz in xfs/415.

The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.

It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
7e56d9eaea xfs: remove xfs_buf parameter from inode scrub methods
Now that we no longer do raw inode buffer scrubbing, the bp parameter is
no longer used anywhere we're dealing with an inode, so remove it and
all the useless NULL parameters that go with it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
d0018ad889 xfs: inode scrubber shouldn't bother with raw checks
The inode scrubber tries to _iget the inode prior to running checks.
If that _iget call fails with corruption errors that's an automatic
fail, regardless of whether it was the inode buffer read verifier,
the ifork verifier, or the ifork formatter that errored out.

Therefore, get rid of the raw mode scrub code because it's not needed.
Found by trying to fix some test failures in xfs/379 and xfs/415.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
5e777b62b0 xfs: bmap scrubber should do rmap xref with bmap for sparse files
When we're scanning an extent mapping inode fork, ensure that every rmap
record for this ifork has a corresponding bmbt record too.  This
(mostly) provides the ability to cross-reference rmap records with bmap
data.  The rmap scrubber cannot do the xref on its own because that
requires taking an ilock with the agf lock held, which violates our
locking order rules (inode, then agf).

Note that we only do this for forks that are in btree format due to the
increased complexity; or forks that should have data but suspiciously
have zero extents because the inode could have just had its iforks
zapped by the inode repair code and now we need to reclaim the old
extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Darrick J. Wong
6edb181053 xfs: refactor inode buffer verifier error logging
When the inode buffer verifier encounters an error, it's much more
helpful to print a buffer from the offending inode instead of just the
start of the inode chunk buffer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Darrick J. Wong
90a58f9571 xfs: refactor inode verifier error logging
Refactor some of the inode verifier failure logging call sites to use
the new xfs_inode_verifier_error method which dumps the offending buffer
as well as the code location of the failed check.  This trims the
output, makes it clearer to the admin that repair must be run, and gives
the developers more details to work from.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Darrick J. Wong
30b0984d91 xfs: refactor bmap record validation
Refactor the bmap validator into a more complete helper that looks for
extents that run off the end of the device, overflow into the next AG,
or have invalid flag states.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Darrick J. Wong
6915ef35c0 xfs: sanity-check the unused space before trying to use it
In xfs_dir2_data_use_free, we examine on-disk metadata and ASSERT if
it doesn't make sense.  Since a carefully crafted fuzzed image can cause
the kernel to crash after blowing a bunch of assertions, let's move
those checks into a validator function and rig everything up to return
EFSCORRUPTED to userspace.  Found by lastbit fuzzing ltail.bestcount via
xfs/391.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Brian Foster
a27ba2607e xfs: detect agfl count corruption and reset agfl
The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.

This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.

Instead, avoid unnecessarily complex detection logic and and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.

Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.

This allows kernels that include the packing fix commit 96f859d52b
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:06 -07:00
Christoph Hellwig
3e4da466bf xfs: unwind the try_again loop in xfs_log_force
Instead split out a __xfs_log_fore_lsn helper that gets called again
with the already_slept flag set to true in case we had to sleep.

This prepares for aio_fsync support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:06 -07:00
Christoph Hellwig
93806299b5 xfs: refactor xfs_log_force_lsn
Use the the smallest possible loop as preable to find the correct iclog
buffer, and then use gotos for unwinding to straighten the code.

Also fix the top of function comment while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:06 -07:00
Ingo Molnar
793057e1c7 vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-19 01:07:42 -04:00
Christoph Hellwig
df79b81b2e xfs: minor cleanup for xfs_reflink_end_cow
Use xfs_iext_prev_extent to skip to the previous extent instead of
opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
1d4352de51 xfs: minor cleanup for xfs_get_blocks
Simplify the control flow a bit in preparation for O_ATOMIC-related
changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
f5c54717bf xfs: remove xfs_zero_range
This helper doesn't add any real value over just calling iomap_zero_range
directly, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
c7dbe3f2c4 xfs: assert that xfs_reflink_allocate_cow is called with XFS_ILOCK_EXCL
Now that we convert COW preallocations from unwritten to real on every
call this function needs to be called with the ilock held exclusively.

Fortunately we already do that, but update the assert to match.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
7d9df3c163 xfs: don't use XFS_BMAPI_ENTRIRE in xfs_get_blocks
There is no reason to get a mapping bigger than what we were asked for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
5bcffe300c xfs: fix the check for COW extents in xfs_swap_extents
i_cnextents does not include delayed allocated extents, so switch
to the inode fork size check that we already use in other places
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
e6b9657056 xfs: refactor xfs_log_force
Streamline the conditionals so that it is more obvious which specific case
form the top of the function comments is being handled.  Use gotos only
for early returns.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
656de4ffaf xfs: merge _xfs_log_force_lsn and xfs_log_force_lsn
Switch to a single interface for flushing the log to a specific LSN, which
gives consistent trace point coverage and a less confusing interface.

The was only a single user of the previous xfs_log_force_lsn function,
which now also passes a NULL log_flushed argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
60e5bb7844 xfs: merge _xfs_log_force and xfs_log_force
Switch to a single interface for flushing the whole log, which gives
consistent trace point coverage, and removes the unused log_flushed
argument for the previous _xfs_log_force callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
2b56c2857f xfs: remove the unused log_flushed variable in xfs_extent_busy_flush
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
4f7aa2fd8c xfs: remove an outdated comment for xfs_inode_item_committing
The function now does something, and that something is central to our
inode logging scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:51 -07:00
Christoph Hellwig
2022ab36fe xfs: remove misleading comment text on xfs_inode_item_unlock
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:51 -07:00
Brian Foster
0ab32086d0 xfs: account only rmapbt-used blocks against rmapbt perag res
The rmapbt perag metadata reservation reserves blocks for the
reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
the agfl and perag accounting is updated as blocks are allocated
from the allocation btrees, the reservation actually accounts blocks
as they are allocated to (or freed from) the agfl rather than the
rmapbt itself.

While this works for blocks that are eventually used for the rmapbt,
not all agfl blocks are destined for the rmapbt. Blocks that are
allocated to the agfl (and thus "reserved" for the rmapbt) but then
used by another structure leads to a growing inconsistency over time
between the runtime tracking of rmapbt usage vs. actual rmapbt
usage. Since the runtime tracking thinks all agfl blocks are rmapbt
blocks, it essentially believes that less future reservation is
required to satisfy the rmapbt than what is actually necessary.

The inconsistency is rectified across mount cycles because the perag
reservation is initialized based on the actual rmapbt usage at mount
time. The problem, however, is that the excessive drain of the
reservation at runtime opens a window to allocate blocks for other
purposes that might be required for the rmapbt on a subsequent
mount. This problem can be demonstrated by a simple test that runs
an allocation workload to consume agfl blocks over time and then
observe the difference in the agfl reservation requirement across an
unmount/mount cycle:

  mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
  ...
  ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
  umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
  mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194

As the above tracepoints show, the reservation requirement reduces
from 3194 blocks to 2956 blocks as the workload runs.  Without any
other changes in the filesystem, the same reservation requirement
jumps from 2956 to 3052 blocks over a umount/mount cycle.

To address this divergence, update the RMAPBT reservation to account
blocks used for the rmapbt only rather than all blocks filled into
the agfl. This patch makes several high-level changes toward that
end:

1.) Reintroduce an AGFL reservation type to serve as an accounting
    no-op for blocks allocated to (or freed from) the AGFL.
2.) Invoke RMAPBT usage accounting from the actual rmapbt block
    allocation path rather than the AGFL allocation path.

The first change is required because agfl blocks are considered free
blocks throughout their lifetime. The perag reservation subsystem is
invoked unconditionally by the allocation subsystem, so we need a
way to tell the perag subsystem (via the allocation subsystem) to
not make any accounting changes for blocks filled into the AGFL.

The second change causes the in-core RMAPBT reservation usage
accounting to remain consistent with the on-disk state at all times
and eliminates the risk of leaving the rmapbt reservation
underfilled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster
2159286335 xfs: rename agfl perag res type to rmapbt
The AGFL perag reservation type accounts all allocations that feed
into (or are released from) the allocation group free list (agfl).
The purpose of the reservation is to support worst case conditions
for the reverse mapping btree (rmapbt). As such, the agfl
reservation usage accounting only considers rmapbt usage when the
in-core counters are initialized at mount time.

This implementation inconsistency leads to divergence of the in-core
and on-disk usage accounting over time. In preparation to resolve
this inconsistency and adjust the AGFL reservation into an rmapbt
specific reservation, rename the AGFL reservation type and
associated accounting fields to something more rmapbt-specific. Also
fix up a couple tracepoints that incorrectly use the AGFL
reservation type to pass the agfl state of the associated extent
where the raw reservation type is expected.

Note that this patch does not change perag reservation behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster
b3fed43482 xfs: account format bouncing into rmapbt swapext tx reservation
The extent swap mechanism requires a unique implementation for
rmapbt enabled filesystems. Because the rmapbt tracks extent owner
information, extent swap must individually unmap and remap each
extent between the two inodes.

The rmapbt extent swap transaction block reservation currently
accounts for the worst case bmapbt block and rmapbt block
consumption based on the extent count of each inode. There is a
corner case that exists due to the extent swap implementation that
is not covered by this reservation, however.

If one of the associated inodes is just over the max extent count
used for extent format inodes (i.e., the inode is in btree format by
a single extent), the unmap/remap cycle of the extent swap can
bounce the inode between extent and btree format multiple times,
almost as many times as there are extents in the inode (if the
opposing inode happens to have one less, for example). Each back and
forth cycle involves a block free and allocation, which isn't a
problem except for that the initial transaction reservation must
account for the total number of block allocations performed by the
chain of deferred operations. If not, a block reservation overrun
occurs and the filesystem shuts down.

Update the rmapbt extent swap block reservation to check for this
situation and add some block reservation slop to ensure the entire
operation succeeds. We'd never likely require reservation for both
inodes as fsr wouldn't defrag the file in that case, but the
additional reservation is constrained by the data fork size so be
cautious and check for both.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster
3e78b9a468 xfs: shutdown if block allocation overruns tx reservation
The ->t_blk_res_used field tracks how many blocks have been used in
the current transaction. This should never exceed the block
reservation (->t_blk_res) for a particular transaction. We currently
assert this condition in the transaction block accounting code, but
otherwise take no additional action should this situation occur.

The overrun generally has no effect if space ends up being available
and the associated transaction commits. If the transaction is
duplicated, however, the current block usage is used to determine
the remaining block reservation to be transferred to the new
transaction. If usage exceeds reservation, this calculation
underflows and creates a transaction with an invalid and excessive
reservation. When the second transaction commits, the release of
unused blocks corrupts the in-core free space counters. With lazy
superblock accounting enabled, this inconsistency eventually
trickles to the on-disk superblock and corrupts the filesystem.

Replace the transaction block usage accounting assert with an
explicit overrun check. If the transaction overruns the reservation,
shutdown the filesystem immediately to prevent corruption. Add a new
assert to xfs_trans_dup() to catch any callers that might induce
this invalid state in the future.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Matthew Wilcox
57e8095611 xfs: Rename xa_ elements to ail_
This is a simple rename, except that xa_ail becomes ail_head.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:56 -07:00
Dave Chinner
a78ee256c3 xfs: convert XFS_AGFL_SIZE to a helper function
The AGFL size calculation is about to get more complex, so lets turn
the macro into a function first and remove the macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: forward port to newer kernel, simplify the helper]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Darrick J. Wong
6231848c3a xfs: check for cow blocks before trying to clear them
There's no point in allocating a transaction and locking the inode in
preparation to clear cow blocks if there actually are any cow fork
extents.  Therefore, move the xfs_reflink_cancel_cow_range hunk to
xfs_inactive and check the cow ifp first.  This makes inode reclamation
run faster.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Darrick J. Wong
3f883f5bb1 xfs: convert a few more directory asserts to corruption
Yet another round of playing whack-a-mole with directory code that
asserts on corrupt on-disk metadata when it really should be returning
-EFSCORRUPTED instead of ASSERTing.  Found by a xfs/391 crash while
lastbit fuzzing of ltail.bestcount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Darrick J. Wong
8241f7f983 xfs: don't iunlock the quota ip when quota block
In xfs_qm_dqalloc, we join the locked quota inode to the transaction we
use to allocate blocks.  If the allocation or mapping fails, we're not
allowed to unlock the inode because the transaction code is in charge of
unlocking it for us.  Therefore, remove the iunlock call to avoid
blowing asserts about unbalanced locking + mount hang.

Found by corrupting the AGF and allocating space in the filesystem
(quotacheck) immediately after mount.  The upcoming agfl wrapping fixup
test will trigger this scenario.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Vratislav Bendel
19957a1816 xfs: Correctly invert xfs_buftarg LRU isolation logic
Due to an inverted logic mistake in xfs_buftarg_isolate()
the xfs_buffers with zero b_lru_ref will take another trip
around LRU, while isolating buffers with non-zero b_lru_ref.

Additionally those isolated buffers end up right back on the LRU
once they are released, because b_lru_ref remains elevated.

Fix that circuitous route by leaving them on the LRU
as originally intended.

Signed-off-by: Vratislav Bendel <vbendel@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:56 -07:00
Dave Chinner
4df0f7f145 xfs: fix transaction allocation deadlock in IO path
xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
while holding pages locked for writeback in the ->writepages path.
The memory allocation is allowed to wait on pages under writeback,
and so can wait on pages that are tagged as writeback by the
caller.

This affects both pre-IO submission and post-IO submission paths.
Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
xfs_iomap_write_unwritten() already does the right thing, but the
others don't. Fix them.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Fixes: 281627df3e ("xfs: log file size updates at I/O completion time")
Fixes: 43caeb187d ("xfs: move mappings from cow fork to data fork after copy-write)"
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Christoph Hellwig
c3b1b13190 xfs: implement the lazytime mount option
Use the VFS dirty inode tracking for lazytime inodes only, and just
log them in ->dirty_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Nikolay Borisov
bcab2ebfa1 xfs: Remove dead code from inode recover function
The memcpy is guarded by a check which is performed a right before we
call xfs_log_dinode_to_disk. At this point we are sure this check will
always be false otherwise we would have errored out. So let's remove
this dead weight.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Carlos Maiolino
e157ebdcb3 Cleanup old XFS_BTREE_* traces
Remove unused legacy btree traces from IRIX era.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Eric Sandeen
4603fa744c xfs: remove unused m_dmevmask from xfs_mount struct
The dmevmask structure member is a dmapi leftover; it's
set here and there but never actually used.  Remove it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Dave Chinner
cb0a8d2302 xfs: fall back to vmalloc when allocation log vector buffers
When using large directory blocks, we regularly see memory
allocations of >64k being made for the shadow log vector buffer.
When we are under memory pressure, kmalloc() may not be able to find
contiguous memory chunks large enough to satisfy these allocations
easily, and if memory is fragmented we can potentially stall here.

TO avoid this problem, switch the log vector buffer allocation to
use kmem_alloc_large(). This will allow failed allocations to fall
back to vmalloc and so remove the dependency on large contiguous
regions of memory being available. This should prevent slowdowns
and potential stalls when memory is low and/or fragmented.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Christoph Hellwig
ff3d8b9c4c xfs: don't block on the ilock for RWF_NOWAIT
Fix xfs_file_iomap_begin to trylock the ilock if IOMAP_NOWAIT is passed,
so that we don't block io_submit callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-01 14:12:45 -08:00
Christoph Hellwig
af5b5afe9a xfs: don't start out with the exclusive ilock for direct I/O
There is no reason to take the ilock exclusively at the start of
xfs_file_iomap_begin for direct I/O, given that it will be demoted
just before calling xfs_iomap_write_direct anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-01 14:12:12 -08:00
Christoph Hellwig
172ed391f6 xfs: don't allocate COW blocks for zeroing holes or unwritten extents
The iomap zeroing interface is smart enough to skip zeroing holes or
unwritten extents.  Don't subvert this logic for reflink files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-01 14:10:31 -08:00
Chengguang Xu
5b4c845ea4 xfs: fix potential memory leak in mount option parsing
When specifying string type mount option (e.g., logdev)
several times in a mount, current option parsing may
cause memory leak. Hence, call kfree for previous one
in this case.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-02-26 10:02:13 -08:00
Darrick J. Wong
b31c2bdcd8 xfs: reserve blocks for refcount / rmap log item recovery
During log recovery, the per-AG reservations aren't yet set up, so log
recovery has to reserve enough blocks to handle all possible btree
splits.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-02-22 14:41:25 -08:00
Eric Sandeen
86516eff3b xfs: use memset to initialize xfs_scrub_agfl_info
Apparently different gcc versions have competing and
incompatible notions of how to initialize at declaration,
so just give up and fall back to the time-tested memset().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-02-22 14:41:25 -08:00
Linus Torvalds
e237f98a9c Changes since last update:
- Print scrub build status in the xfs build info.
  - Explicitly call out the remaining two scenarios where we don't
    support
    reflink and never have.
  - Remove EXPERIMENTAL tag from reverse mapping btree!
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJaeJjKAAoJEPh/dxk0SrTrZP8P/RT0bcKc1PkmonX6rZBYa9OB
 Mz5X7TpVRsXtZPtGSNM3IBIubjIVEZ/f3s5CZefN08uV8s+AFBjEAdHmeAiGtT/X
 qakQyvsBJ3mEyVsMyzuI7eu4TU3/5Xad7kSp9TFPnXfW8z09Z4GygyGVJPRqpKRQ
 liFzh8BIVgS/IFcpTL+6wKEHdAHEuyz6u/78ylgCtLMuiNiMY1mYv/+U2f7dEV3u
 yiRY4oHGQfOiw1aXy3EO2WUdSKcAQwIJIEsLOllYQRe3f5W2milflFCJF9RoEEuE
 OLmur4PBwFWpTfLVl1BqGa6rr/nhaY1y7Lyy3mVrmv0QiHlnNM/BQ5UKICZJdx5O
 8Ai4ZyaJ5Q/nQxA6USOBHSlkeexMOH82i7gJCCfPtYqW1l0QjStLcoTYjWXa/0u9
 ULEkdnocNm/HSCIGocFrd6dzOKR8TxJDVh3DxIFo8VjTj/XI57+ePfbZT7J+0vuB
 elhKcho87xKHeF1RQfsVdgh+518GGAXp5zZjAJ3P/6GpxuB9sa+ShEEtR7OzSf0K
 sfkXw3P/tH9ladBxWvMC6Gx0tSUSUTAUeYSbfOC1wRio7iI7sf8Gl8SkU65y4RdE
 ZhQp8M4i2+vt9JS/E/mbAVxKIn1iF7L9ZiWlycJXyuqFf7bv1uBXG+tTE7lM7nJA
 YjSmXBWN5j6kxQeUR0NE
 =U54J
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs updates from Darrick Wong:
 "As promised, here's a (much smaller) second pull request for the
  second week of the merge cycle. This time around we have a couple
  patches shutting off unsupported fs configurations, and a couple of
  cleanups.

  Last, we turn off EXPERIMENTAL for the reverse mapping btree, since
  the primary downstream user of that information (online fsck) is now
  upstream and I haven't seen any major failures in a few kernel
  releases.

  Summary:

   - Print scrub build status in the xfs build info.

   - Explicitly call out the remaining two scenarios where we don't
     support reflink and never have.

   - Remove EXPERIMENTAL tag from reverse mapping btree!"

* tag 'xfs-4.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: remove experimental tag for reverse mapping
  xfs: don't allow reflink + realtime filesystems
  xfs: don't allow DAX on reflink filesystems
  xfs: add scrub to XFS_BUILD_OPTIONS
  xfs: fix u32 type usage in sb validation function
2018-02-05 13:35:56 -08:00
Linus Torvalds
23aedc4b9b Only miscellaneous cleanups and bug fixes for ext4 this cycle.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlp16xMACgkQ8vlZVpUN
 gaP1IAf8C48AKVnqy6ftFphzV1CdeGHDwJLL63lChs97fNr1mxo5TZE/6vdYB55j
 k7C7huQ582cEiGWQJ0U4/+En0hF85zkAk5mTfnSao5BqxLr9ANsAocwBUNBXdFSp
 B7IyMo4Dct7NCkwfmKLPRcEqZ49vwyv99TqM/9wUkgUStkTjPT7bhHgarB6VPbhp
 BxoXVnFYgU0sZN0y71IBt8ngWqCK6j7fjw3gsl37oEenG3/h3SO0H9ih1FrysX8S
 VOwwLJq6vfAgEwQvZACnBwWKDYsZpH7akNp9WGeDMByo28t514RNRjIi0mvLHEZa
 h72I8Sb3bwHO9MJNvHFe/0b1Say4vw==
 =dxAX
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Only miscellaneous cleanups and bug fixes for ext4 this cycle"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: create ext4_kset dynamically
  ext4: create ext4_feat kobject dynamically
  ext4: release kobject/kset even when init/register fail
  ext4: fix incorrect indentation of if statement
  ext4: correct documentation for grpid mount option
  ext4: use 'sbi' instead of 'EXT4_SB(sb)'
  ext4: save error to disk in __ext4_grp_locked_error()
  jbd2: fix sphinx kernel-doc build warnings
  ext4: fix a race in the ext4 shutdown path
  mbcache: make sure c_entry_count is not decremented past zero
  ext4: no need flush workqueue before destroying it
  ext4: fixed alignment and minor code cleanup in ext4.h
  ext4: fix ENOSPC handling in DAX page fault handler
  dax: pass detailed error code from dax_iomap_fault()
  mbcache: revert "fs/mbcache.c: make count_objects() more robust"
  mbcache: initialize entry->e_referenced in mb_cache_entry_create()
  ext4: fix up remaining files with SPDX cleanups
2018-02-03 13:49:22 -08:00
Darrick J. Wong
76883f7988 xfs: remove experimental tag for reverse mapping
Reverse mapping has had a while to soak, so remove the experimental tag.
Now that we've landed space metadata cross-referencing in scrub, the
feature actually has a purpose.

Reject rmap filesystems with an rt device until the code to support it
is actually implemented.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-02-01 21:07:26 -08:00
Darrick J. Wong
c14632ddac xfs: don't allow reflink + realtime filesystems
We don't support realtime filesystems with reflink either, so fail
those mounts.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-02-01 21:06:16 -08:00
Darrick J. Wong
b6e03c10bf xfs: don't allow DAX on reflink filesystems
Now that reflink is no longer experimental, reject attempts to mount
with DAX until that whole mess gets sorted out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-02-01 21:06:15 -08:00
Eric Sandeen
494370ccaa xfs: add scrub to XFS_BUILD_OPTIONS
Advertise this config option along with the others.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-02-01 21:06:15 -08:00
Darrick J. Wong
131fa58d39 xfs: fix u32 type usage in sb validation function
Don't use u32, use uint32_t, because this won't work in xfsprogs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-01-31 20:39:20 -08:00
Linus Torvalds
255442c938 Documentation updates for 4.16. New stuff includes refcount_t
documentation, errseq documentation, kernel-doc support for nested
 structure definitions, the removal of lots of crufty kernel-doc support for
 unused formats, SPDX tag documentation, the beginnings of a manual for
 subsystem maintainers, and lots of fixes and updates.
 
 As usual, some of the changesets reach outside of Documentation/ to effect
 kerneldoc comment fixes.  It also adds the new LICENSES directory, of which
 Thomas promises I do not need to be the maintainer.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJab11TAAoJEI3ONVYwIuV6i1UP/1LgGPHW9Ygq5qaLFbReZd/u
 Mx/orrhHX0PdkbCCE+CbL8Vm1m4UKFDTBdlpk3s542zxeeG0ZBXuTnvq4Kyk+cTN
 p4/vsIEzk/Ih13/glGE5MlV+EjiEK+8hK69TIUj7bAyuHmpzofjRz9/1M6RLDGDC
 HY6UI58AXG0yOQWMWCGRMYpQAFUGij2equ7Doe1ugXRq14dx7V4RsOhI140iRk7t
 bquAq1rS2fXniiuPFmLBUe4dWW28isVa/Vl/aXcaWQDKMyT0OLhjOMW36wWKqtPi
 WdVCpHv1NLZNyZZr9S3kvfOwW+BUqpEzfVwssyBLW4h0tsnIx0U0HVhSTY8/TvFZ
 QD9yCSana4LB/e5CHXIX5lBHbjHxf+rETXqVV4MgwDaMvM3mCo4X6WUTJDmZADo6
 vQISEKeb4su5uWAbc9T9xwRSLhZnFVdJ/QuYdNQ5+EpFJYLhzQ9eBvEz6JstSIXL
 p9ASBiPNY3ulpVZ8q0JOHJRBhq5mHJH6Dy8achzbILy2l/ZI4b8lJ53mw9II04cp
 puF96E6HpvuZ8Tgjjrg9U3ZdxXNrUgc/tjk2ZDkyTglk1XF2jKSq2tiNSZ3oLrJm
 XqJPnpCeyJM5UDvwkIBzgC41WEHwe8uvoNbUnc4X7UJSZegFzcSLQXf5qaprHS5k
 XeQ7sbd+S+jzVVjFi0W5
 =Z15Z
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.16' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "Documentation updates for 4.16.

  New stuff includes refcount_t documentation, errseq documentation,
  kernel-doc support for nested structure definitions, the removal of
  lots of crufty kernel-doc support for unused formats, SPDX tag
  documentation, the beginnings of a manual for subsystem maintainers,
  and lots of fixes and updates.

  As usual, some of the changesets reach outside of Documentation/ to
  effect kerneldoc comment fixes. It also adds the new LICENSES
  directory, of which Thomas promises I do not need to be the
  maintainer"

* tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
  linux-next: docs-rst: Fix typos in kfigure.py
  linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
  Documentation: Fix misconversion of #if
  docs: add index entry for networking/msg_zerocopy
  Documentation: security/credentials.rst: explain need to sort group_list
  LICENSES: Add MPL-1.1 license
  LICENSES: Add the GPL 1.0 license
  LICENSES: Add Linux syscall note exception
  LICENSES: Add the MIT license
  LICENSES: Add the BSD-3-clause "Clear" license
  LICENSES: Add the BSD 3-clause "New" or "Revised" License
  LICENSES: Add the BSD 2-clause "Simplified" license
  LICENSES: Add the LGPL-2.1 license
  LICENSES: Add the LGPL 2.0 license
  LICENSES: Add the GPL 2.0 license
  Documentation: Add license-rules.rst to describe how to properly identify file licenses
  scripts: kernel_doc: better handle show warnings logic
  fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
  doc: md: Fix a file name to md-fault.c in fault-injection.txt
  errseq: Add to documentation tree
  ...
2018-01-31 19:25:25 -08:00
Linus Torvalds
20c59c71ae New in this version:
- Log faulting code locations when verifiers fail, for improved diagnosis
    of corrupt filesystems.
  - Implement metadata verifiers for local format inode fork data.
  - Online scrub now cross-references metadata records with other metadata.
  - Refactor the fs geometry ioctl generation functions.
  - Harden various metadata verifiers.
  - Fix various accounting problems.
  - Fix uncancelled transactions leaking when xattr functions fail.
  - Prevent the copy-on-write speculative preallocation garbage collector
    from racing with writeback.
  - Emit log reservation type information as trace data so that we can
    compare against xfsprogs.
  - Fix some erroneous asserts in the online scrub code.
  - Clean up the transaction reservation calculations.
  - Fix various minor bugs in online scrub.
  - Log complaints about mixed dio/buffered writes once per day and less
    noisily than before.
  - Refactor buffer log item lists to use list_head.
  - Break PNFS leases before reflinking blocks.
  - Reduce lock contention on reflink source files.
  - Fix some quota accounting problems with reflink.
  - Fix a serious corruption problem in the direct cow write code where we
    fed bad iomaps to the vfs iomap consumers.
  - Various other refactorings.
  - Remove EXPERIMENTAL tag from reflink!
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJabz1mAAoJEPh/dxk0SrTrZ2YQAJDPbmq6efgIwXc8J7wf1SzI
 Djh9bQNfMllP6d6UfIsmWsktVvW8koIJ8I9gZLKjMREd7/UGlrhBvzEQT95X8JFb
 6U+gAODOcRfRitDoISm4FRcxFo77B3OkmuzTM1sV6Z1On5qfMufmlDMg3CZbsB8b
 i/32BJb/r7AaU6Nfg/no0XPHi+5hdi1NhswM7i3mjqj83LPdobwE9lh2BaT0GZn0
 gJs6zijPNfkg1+LFtciIk7PCcVlO49aLpKE1iP2UrUVYBuWcQmm97SiZgvydFGxg
 48nIBQ6CJ3y1sR5USjejZZT0fAY37IAvlCfC9JCFrwqzSbxSMCCgyf8hhBLjGc25
 EyEi9fuDdHS+Im4+5kb/vtdRfyoim5KwHGRpN6ZtqH8hYizFu3su9LsgHCXfGoI3
 ehPgxWeQY9f+dUyJE060n/SF3uIw8+OnLtU7axxx4yvFiUuRgI4U0pLhpJdeRu3x
 ms1GZDgvhzsvX4h3b0Svv4Y2UHygvMYT1CR/gG9iXbFzUdg5wFJJ8dqgnnqoRfLT
 HnWOw93NTz62csxE+3RobYlNGNIeNBD0NjZiQsPKLuuVeJqT9llkL0/B7pKPYxQb
 KoDDkf/azgmH1gUs1XlDmPF5FE8DObeOMoXYn+693LpIMlewwqsyC3Ytu9+VJ6TZ
 X2+OAuTRGP+LYD6FNnEP
 =HL5B
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "This merge cycle, we're again some substantive changes to XFS.

  Metadata verifiers have been restructured to provide more detail about
  which part of a metadata structure failed checks, and we've enhanced
  the new online fsck feature to cross-reference extent allocation
  information with the other metadata structures. With this pull, the
  metadata verification part of online fsck is more or less finished,
  though the feature is still experimental and still disabled by
  default.

  We're also preparing to remove the EXPERIMENTAL tag from a couple of
  features this cycle. This week we're committing a bunch of space
  accounting fixes for reflink and removing the EXPERIMENTAL tag from
  reflink; I anticipate that we'll be ready to do the same for the
  reverse mapping feature next week. (I don't have any pending fixes for
  rmap; however I wish to remove the tags one at a time.)

  This giant pile of patches has been run through a full xfstests run
  over the weekend and through a quick xfstests run against this
  morning's master, with no major failures reported. Let me know if
  there's any merge problems -- git merge reported that one of our
  patches touched the same function as the i_version series, but it
  resolved things cleanly.

  Summary:

   - Log faulting code locations when verifiers fail, for improved
     diagnosis of corrupt filesystems.

   - Implement metadata verifiers for local format inode fork data.

   - Online scrub now cross-references metadata records with other
     metadata.

   - Refactor the fs geometry ioctl generation functions.

   - Harden various metadata verifiers.

   - Fix various accounting problems.

   - Fix uncancelled transactions leaking when xattr functions fail.

   - Prevent the copy-on-write speculative preallocation garbage
     collector from racing with writeback.

   - Emit log reservation type information as trace data so that we can
     compare against xfsprogs.

   - Fix some erroneous asserts in the online scrub code.

   - Clean up the transaction reservation calculations.

   - Fix various minor bugs in online scrub.

   - Log complaints about mixed dio/buffered writes once per day and
     less noisily than before.

   - Refactor buffer log item lists to use list_head.

   - Break PNFS leases before reflinking blocks.

   - Reduce lock contention on reflink source files.

   - Fix some quota accounting problems with reflink.

   - Fix a serious corruption problem in the direct cow write code where
     we fed bad iomaps to the vfs iomap consumers.

   - Various other refactorings.

   - Remove EXPERIMENTAL tag from reflink!"

* tag 'xfs-4.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (94 commits)
  xfs: remove experimental tag for reflinks
  xfs: don't screw up direct writes when freesp is fragmented
  xfs: check reflink allocation mappings
  iomap: warn on zero-length mappings
  xfs: treat CoW fork operations as delalloc for quota accounting
  xfs: only grab shared inode locks for source file during reflink
  xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes
  xfs: reflink should break pnfs leases before sharing blocks
  xfs: don't clobber inobt/finobt cursors when xref with rmap
  xfs: skip CoW writes past EOF when writeback races with truncate
  xfs: preserve i_rdev when recycling a reclaimable inode
  xfs: refactor accounting updates out of xfs_bmap_btalloc
  xfs: refactor inode verifier corruption error printing
  xfs: make tracepoint inode number format consistent
  xfs: always zero di_flags2 when we free the inode
  xfs: call xfs_qm_dqattach before performing reflink operations
  xfs: bmap code cleanup
  Use list_head infra-structure for buffer's log items list
  Split buffer's b_fspriv field
  Get rid of xfs_buf_log_item_t typedef
  ...
2018-01-31 10:18:00 -08:00
Linus Torvalds
a4b7fd7d34 inode->i_version rework for v4.16
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJabwjlAAoJEAAOaEEZVoIVeEEP/R84kZJjlZV/vNmFFvY46jM+
 0hpMHXRNym+nW1Du1CKNkesEUAY8ACAQIyzJh63Q72341QTDdz3+asHwPYRNOqdC
 PgryidPieojkNKQg+h7dmoKYlYh1xiCicvn66Q5PFb9B0lH36twekOK4X1qqJj8Z
 breRmRoFLka9looMSuYgwbErts023fmASalvGum6T0ZM/7F9hUj4O3OsQtKTLUNM
 VQ+gLJTQrUqrgzvWUwq3WTMa9YAaKP4oad8nsglNSpiVLG7WtURr5HokW9hAziqL
 k99Y+K2ni1wZJlNGJAyV7PyEG2ieI5Xn+LzM2RM+SndD1QHF2QXACmSTDYfL51k5
 G2RsKeTZvQPtX4qx9+vnCp/4oV6JduvCaq2Mt8SQb9nYZxKjs85TNLrARJv+85eQ
 zP0OTxlH1Gfu3j36n3cny4XemyMYYF4hCFYfRPqTGst37fgLBtfIfUSQ6jedoCK2
 Xcyb6ukGXMh6If/A7DSy91hvSSPrWSH7TPPsbfLy6o+wUOtpAGR4eXVlEuAiXrzc
 gnoAz85oIMUQae66LrdrPk1NyE59qOb24g/yU5gyRBSpi2+/aoboNCKaD73tgs/C
 XIMwGXLYmqkcud7IBQF0tHHiM+jsEkbSM4LUqRXSnqMdwNnS18Z4Q+JKqpdP0cii
 eRdenDvUfu8Gu1Y9vWBv
 =iihN
 -----END PGP SIGNATURE-----

Merge tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull inode->i_version rework from Jeff Layton:
 "This pile of patches is a rework of the inode->i_version field. We
  have traditionally incremented that field on every inode data or
  metadata change. Typically this increment needs to be logged on disk
  even when nothing else has changed, which is rather expensive.

  It turns out though that none of the consumers of that field actually
  require this behavior. The only real requirement for all of them is
  that it be different iff the inode has changed since the last time the
  field was checked.

  Given that, we can optimize away most of the i_version increments and
  avoid dirtying inode metadata when the only change is to the i_version
  and no one is querying it. Queries of the i_version field are rather
  rare, so we can help write performance under many common workloads.

  This patch series converts existing accesses of the i_version field to
  a new API, and then converts all of the in-kernel filesystems to use
  it. The last patch in the series then converts the backend
  implementation to a scheme that optimizes away a large portion of the
  metadata updates when no one is looking at it.

  In my own testing this series significantly helps performance with
  small I/O sizes. I also got this email for Christmas this year from
  the kernel test robot (a 244% r/w bandwidth improvement with XFS over
  DAX, with 4k writes):

    https://lkml.org/lkml/2017/12/25/8

  A few of the earlier patches in this pile are also flowing to you via
  other trees (mm, integrity, and nfsd trees in particular)".

* tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: (22 commits)
  fs: handle inode->i_version more efficiently
  btrfs: only dirty the inode in btrfs_update_time if something was changed
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
  fs: only set S_VERSION when updating times if necessary
  IMA: switch IMA over to new i_version API
  xfs: convert to new i_version API
  ufs: use new i_version API
  ocfs2: convert to new i_version API
  nfsd: convert to new i_version API
  nfs: convert to new i_version API
  ext4: convert to new i_version API
  ext2: convert to new i_version API
  exofs: switch to new i_version API
  btrfs: convert to new i_version API
  afs: convert to new i_version API
  affs: convert to new i_version API
  fat: convert to new i_version API
  fs: don't take the i_lock in inode_inc_iversion
  fs: new API for handling inode->i_version
  ntfs: remove i_version handling
  ...
2018-01-29 13:33:53 -08:00
Christoph Hellwig
1e369b0e19 xfs: remove experimental tag for reflinks
But reject reflink + DAX file systems for now until the code to
support reflinks on DAX is actually implemented.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: port to 4.16]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:24 -08:00
Darrick J. Wong
6d8a45ce29 xfs: don't screw up direct writes when freesp is fragmented
xfs_bmap_btalloc is given a range of file offset blocks that must be
allocated to some data/attr/cow fork.  If the fork has an extent size
hint associated with it, the request will be enlarged on both ends to
try to satisfy the alignment hint.  If free space is fragmentated,
sometimes we can allocate some blocks but not enough to fulfill any of
the requested range.  Since bmapi_allocate always trims the new extent
mapping to match the originally requested range, this results in
bmapi_write returning zero and no mapping.

The consequences of this vary -- buffered writes will simply re-call
bmapi_write until it can satisfy at least one block from the original
request.  Direct IO overwrites notice nmaps == 0 and return -ENOSPC
through the dio mechanism out to userspace with the weird result that
writes fail even when we have enough space because the ENOSPC return
overrides any partial write status.  For direct CoW writes the situation
was disastrous because nobody notices us returning an invalid zero-length
wrong-offset mapping to iomap and the write goes off into space.

Therefore, if free space is so fragmented that we managed to allocate
some space but not enough to map into even a single block of the
original allocation request range, we should break the alignment hint in
order to guarantee at least some forward progress for the direct write.
If we return a short allocation to iomap_apply it'll call back about the
remaining blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:24 -08:00
Darrick J. Wong
9f37bd11b4 xfs: check reflink allocation mappings
There's a really bad bug in xfs_reflink_allocate_cow -- if bmapi_write
can return a zero error code but no mappings.  This happens if there's
an extent size hint (which causes allocation requests to be rounded to
extsz granularity internally), but there wasn't a big enough chunk of
free space to start filling at the extsz granularity and fill even one
block of the range that we actually requested.

In any case, if we got no mappings we can't possibly do anything useful
with the contents of imap, so we must bail out with ENOSPC here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:24 -08:00
Darrick J. Wong
4b4c1326fd xfs: treat CoW fork operations as delalloc for quota accounting
Since the CoW fork only exists in memory, it is incorrect to update the
on-disk quota block counts when we modify the CoW fork.  Unlike the data
fork, even real extents in the CoW fork are only delalloc-style
reservations (on-disk they're owned by the refcountbt) so they must not
be tracked in the on disk quota info.  Ensure the i_delayed_blks
accounting reflects this too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
01c2e13dca xfs: only grab shared inode locks for source file during reflink
Reflink and dedupe operations remap blocks from a source file into a
destination file.  The destination file needs exclusive locks on all
levels because we're updating its block map, but the source file isn't
undergoing any block map changes so we can use a shared lock.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
7c2d238ac6 xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes
Refactor xfs_lock_two_inodes to take separate locking modes for each
inode.  Specifically, this enables us to take a SHARED lock on one inode
and an EXCL lock on the other.  The lock class (MMAPLOCK/ILOCK) must be
the same for each inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
1364b1d4b5 xfs: reflink should break pnfs leases before sharing blocks
Before we share blocks between files, we need to break the pnfs leases
on the layout before we start slicing and dicing the block map.  The
structure of this function sets us up for the lock contention reduction
in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
c47b74fb2d xfs: don't clobber inobt/finobt cursors when xref with rmap
Even if we can't use the inobt/finobt cursors to count the number of
inode btree blocks, we are never allowed to clobber the cursor of the
btree being checked, so don't do this.  Found by fuzzing level = ones
in xfs/364.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
70c57dcd60 xfs: skip CoW writes past EOF when writeback races with truncate
Every so often we blow the ASSERT(type != XFS_IO_COW) in xfs_map_blocks
when running fsstress, as we do in generic/269.  The cause of this is
writeback racing with truncate -- writeback doesn't take the iolock, so
truncate can sneak in to decrease i_size and truncate page cache while
writeback is gathering buffer heads to schedule writeout.

If we hit this race on a block that has a CoW mapping, we'll get a valid
imap from the CoW fork but the reduced i_size trims the mapping to zero
length (which makes it invalid), so we call xfs_map_blocks to try again.
This doesn't do much anyway, since any mapping we get out of that will
also be invalid, so we might as well skip the assert and just stop.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:23 -08:00
Amir Goldstein
acd1d71598 xfs: preserve i_rdev when recycling a reclaimable inode
Commit 66f364649d ("xfs: remove if_rdev") moved storing of rdev
value for special inodes to VFS inodes, but forgot to preserve the
value of i_rdev when recycling a reclaimable xfs_inode.

This was detected by xfstest overlay/017 with inodex=on mount option
and xfs base fs. The test does a lookup of overlay chardev and blockdev
right after drop caches.

Overlayfs inodes hold a reference on underlying xfs inodes when mount
option index=on is configured. If drop caches reclaim xfs inodes, before
it relclaims overlayfs inodes, that can sometimes leave a reclaimable xfs
inode and that test hits that case quite often.

When that happens, the xfs inode cache remains broken (zere i_rdev)
until the next cycle mount or drop caches.

Fixes: 66f364649d ("xfs: remove if_rdev")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
751f3767c2 xfs: refactor accounting updates out of xfs_bmap_btalloc
Move all the inode and quota accounting updates out of xfs_bmap_btalloc
in preparation for fixing some quota accounting problems with copy on
write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-01-29 07:27:23 -08:00
Darrick J. Wong
22431bf3df xfs: refactor inode verifier corruption error printing
Refactor inode verifier error reporting into a non-libxfs function so
that we aren't encoding the message format in libxfs.  This also
changes the kernel dmesg output to resemble buffer verifier errors
more closely.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:22 -08:00
Darrick J. Wong
67a3f6d014 xfs: make tracepoint inode number format consistent
Fix all the inode number formats to be consistently (0x%llx) in all
trace point definitions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:22 -08:00
Darrick J. Wong
beaae8cd58 xfs: always zero di_flags2 when we free the inode
Always zero the di_flags2 field when we free the inode so that we never
end up with an on-disk record for an unallocated inode that also has the
reflink iflag set.  This is in keeping with the general principle that
only files can have the reflink iflag set, even though we'll zero out
di_flags2 if we ever reallocate the inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:22 -08:00
Darrick J. Wong
09ac862397 xfs: call xfs_qm_dqattach before performing reflink operations
Ensure that we've attached all the necessary dquots before performing
reflink operations so that quota accounting is accurate.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-29 07:27:22 -08:00
Shan Hai
6ca30729c2 xfs: bmap code cleanup
Remove the extent size hint and realtime inode relevant code from
the xfs_bmapi_reserve_delalloc since it is not called on the inode
with extent size hint set or on a realtime inode.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:22 -08:00
Carlos Maiolino
643c8c05e7 Use list_head infra-structure for buffer's log items list
Now that buffer's b_fspriv has been split, just replace the current
singly linked list of xfs_log_items, by the list_head infrastructure.

Also, remove the xfs_log_item argument from xfs_buf_resubmit_failed_buffers(),
there is no need for this argument, once the log items can be walked
through the list_head in the buffer.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor style cleanups]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:22 -08:00
Carlos Maiolino
fb1755a645 Split buffer's b_fspriv field
By splitting the b_fspriv field into two different fields (b_log_item
and b_li_list). It's possible to get rid of an old ABI workaround, by
using the new b_log_item field to store xfs_buf_log_item separated from
the log items attached to the buffer, which will be linked in the new
b_li_list field.

This way, there is no more need to reorder the log items list to place
the buf_log_item at the beginning of the list, simplifying a bit the
logic to handle buffer IO.

This also opens the possibility to change buffer's log items list into a
proper list_head.

b_log_item field is still defined as a void *, because it is still used
by the log buffers to store xlog_in_core structures, and there is no
need to add an extra field on xfs_buf just for xlog_in_core.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor style changes]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:22 -08:00
Carlos Maiolino
70a2065533 Get rid of xfs_buf_log_item_t typedef
Take advantage of the rework on xfs_buf log items list, to get rid of
ths typedef for xfs_buf_log_item.

This patch also fix some indentation alignment issues found along the way.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-29 07:27:22 -08:00
Jeff Layton
d17260fd5f xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
If XFS_ILOG_CORE is already set then go ahead and increment it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
2018-01-29 06:42:21 -05:00
Jeff Layton
f0e2828062 xfs: convert to new i_version API
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
2018-01-29 06:42:21 -05:00
Darrick J. Wong
75d4a13b1f xfs: fix non-debug build compiler warnings
Fix compiler warning on non-debug build

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-01-17 21:00:47 -08:00
Darrick J. Wong
4bb73d0147 xfs: check sb_agblocks and sb_agblklog when validating superblock
Currently, we don't check sb_agblocks or sb_agblklog when we validate
the superblock, which means that we can fuzz garbage values into those
values and the mount succeeds.  This leads to all sorts of UBSAN
warnings in xfs/350 since we can then coerce other parts of xfs into
shifting by ridiculously large values.

Once we've validated agblocks, make sure the agcount makes sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-01-17 21:00:47 -08:00
Darrick J. Wong
be78ff0e72 xfs: recheck reflink / dirty page status before freeing CoW reservations
Eryu Guan reported seeing occasional hangs when running generic/269 with
a new fsstress that supports clonerange/deduperange.  The cause of this
hang is an infinite loop when we convert the CoW fork extents from
unwritten to real just prior to writing the pages out; the infinite
loop happens because there's nothing in the CoW fork to convert, and so
it spins forever.

The fundamental issue here is that when we go to perform these CoW fork
conversions, we're supposed to have an extent waiting for us, but the
low space CoW reaper has snuck in and blown them away!  There are four
conditions that can dissuade the reaper from touching our file -- no
reflink iflag; dirty page cache; writeback in progress; or directio in
progress.  We check the four conditions prior to taking the locks, but
we neglect to recheck them once we have the locks, which is how we end
up whacking the writeback that's in progress.

Therefore, refactor the four checks into a helper function and call it
once again once we have the locks to make sure we really want to reap
the inode.  While we're at it, add an ASSERT for this weird condition so
that we'll fail noisily if we ever screw this up again.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Tested-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-01-17 21:00:47 -08:00
Darrick J. Wong
a5f460b168 xfs: check that br_blockcount doesn't overflow
xfs_bmbt_irec.br_blockcount is declared as xfs_filblks_t, which is an
unsigned 64-bit integer.  Though the bmbt helpers will never set a value
larger than 2^21 (since the underlying on-disk extent record has a
length field that is only 21 bits wide), we should be a little defensive
about checking that a bmbt record doesn't exceed what we're expecting or
overflow into the next AG.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-01-17 21:00:47 -08:00