Commit Graph

51456 Commits

Author SHA1 Message Date
David Howells
989782dcdc afs: Overhaul cell database management
Overhaul the way that the in-kernel AFS client keeps track of cells in the
following manner:

 (1) Cells are now held in an rbtree to make walking them quicker and RCU
     managed (though this is probably overkill).

 (2) Cells now have a manager work item that:

     (A) Looks after fetching and refreshing the VL server list.

     (B) Manages cell record lifetime, including initialisation and
     	 destruction.

     (C) Manages cell record caching whereby records are kept around for a
     	 certain time after last use and then destroyed.

     (D) Manages the FS-Cache index cookie for a cell.  It is not permitted
     	 for a cookie to be in use twice, so we have to be careful to not
     	 allow a new cell record to exist at the same time as an old record
     	 of the same name.

 (3) Each AFS network namespace is given a manager work item that manages
     the cells within it, maintaining a single timer to prod cells into
     updating their DNS records.

     This uses the timer_reduce() facility to make the timer expire at the
     soonest timed event that needs happening.

 (4) When a module is being unloaded, cells and cell managers are now
     counted out using dec_after_work() to make sure the module text is
     pinned until after the data structures have been cleaned up.

 (5) Each cell's VL server list is now protected by a seqlock rather than a
     semaphore.
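
The "soonest timed event" behaviour in (3) can be illustrated with a minimal
userspace sketch.  It is not the kernel's timer code: struct sketch_timer and
sketch_timer_reduce() are hypothetical names, and tick wraparound is ignored.
The only point shown is that an armed timer may be pulled earlier but is never
pushed later.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kernel timer. */
struct sketch_timer {
	bool pending;          /* is the timer armed? */
	unsigned long expires; /* absolute expiry time, in ticks */
};

/*
 * Arm the timer, or pull its expiry earlier.  An already-armed timer is
 * never pushed later.  Returns true if the expiry was changed.
 * Wraparound of the tick counter is deliberately not handled here.
 */
bool sketch_timer_reduce(struct sketch_timer *t, unsigned long expires)
{
	if (t->pending && t->expires <= expires)
		return false;
	t->pending = true;
	t->expires = expires;
	return true;
}
```

Each cell can then call such a helper with its own next deadline, and the
shared timer ends up set to the earliest of them.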

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells
be080a6f43 afs: Overhaul permit caching
Overhaul permit caching in AFS by making it per-vnode and sharing permit
lists where possible.

When most of the fileserver operations are called, they return a status
structure indicating the (revised) details of the vnode or vnodes involved
in the operation.  This includes the access mask derived from the ACL
(named CallerAccess in the protocol definition file).  This is cacheable
and if the ACL changes, the server will tell us that it is breaking the
callback promise, at which point we can discard the currently cached
permits.

With this patch, the afs_permits structure has, at the end, an array of
{ key, CallerAccess } elements, sorted by key pointer.  This is then cached
in a hash table so that it can be shared between vnodes with the same
access permits.

Permit lists can only be shared if they contain the exact same set of
key->CallerAccess mappings.
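
As a rough illustration of why sorting by key pointer helps, the sketch below
sorts two permit lists and then compares them element by element.  The struct
and function names are hypothetical and do not match the kernel's afs_permits
layout; hashing and reference counting are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one { key, CallerAccess } element. */
struct sketch_permit {
	const void *key;     /* identity of the caller's key */
	unsigned int access; /* CallerAccess mask for that key */
};

/* qsort() comparator: order elements by the numeric value of the key pointer. */
int sketch_permit_cmp(const void *a, const void *b)
{
	uintptr_t ka = (uintptr_t)((const struct sketch_permit *)a)->key;
	uintptr_t kb = (uintptr_t)((const struct sketch_permit *)b)->key;

	return (ka > kb) - (ka < kb);
}

void sketch_permits_sort(struct sketch_permit *p, size_t n)
{
	qsort(p, n, sizeof(*p), sketch_permit_cmp);
}

/*
 * Two lists are shareable only if they hold exactly the same
 * key->access mappings.  Both lists must already be sorted.
 */
bool sketch_permits_same(const struct sketch_permit *a, size_t na,
			 const struct sketch_permit *b, size_t nb)
{
	size_t i;

	if (na != nb)
		return false;
	for (i = 0; i < na; i++)
		if (a[i].key != b[i].key || a[i].access != b[i].access)
			return false;
	return true;
}
```

Because both lists are in the same canonical order, equality is a single
linear pass, which also makes the list contents suitable as a hash key.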

Note that the table is global rather than being per-net_ns.  If the keys
in a permit list cross net_ns boundaries, there is no problem sharing the
cached permits, since the permits are just integer masks.

Since permit lists pin keys, the permit cache also makes it easier for a
future patch to find all occurrences of a key and remove them by means of
setting the afs_permits::invalidated flag and then clearing the appropriate
key pointer.  In such an event, memory barriers will need adding.

Lastly, the permit caching is skipped if the server has sent either a
vnode-specific or an entire-server callback since the start of the
operation.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells
c435ee3455 afs: Overhaul the callback handling
Overhaul the AFS callback handling by the following means:

 (1) Don't give up callback promises on vnodes that we are no longer using,
     rather let them just expire on the server or let the server break
     them.  This is actually more efficient for the server as the callback
     lookup is expensive if there are lots of extant callbacks.

 (2) Only give up the callback promises we have from a server when the
     server record is destroyed.  Then we can just give up *all* the
     callback promises on it in one go.

 (3) Servers can end up being shared between cells if cells are aliased, so
     don't add all the vnodes being backed by a particular server into a
     big FID-indexed tree on that server as there may be duplicates.

     Instead have each volume instance (~= superblock) register an interest
     in a server as it starts to make use of it and use this to allow the
     processor for callbacks from the server to find the superblock and
     thence the inode corresponding to the FID being broken by means of
     ilookup_nowait().

 (4) Rather than iterating over the entire callback list when a mass-break
     comes in from the server, maintain a counter of mass-breaks in
     afs_server (cb_seq) and make afs_validate() check it against the copy
     in afs_vnode.

     It would be nice not to have to take a read_lock whilst doing this,
     but that's tricky without using RCU.

 (5) Save a ref on the fileserver we're using for a call in the afs_call
     struct so that we can access its cb_s_break during call decoding.

 (6) Write-lock around callback and status storage in a vnode and read-lock
     around getattr so that we don't see the status mid-update.
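
The counter scheme in (4) can be sketched in userspace as follows.  All names
are hypothetical stand-ins rather than the real afs_server and afs_vnode
fields, and the locking discussed above is omitted.  A mass-break only bumps
one counter; each vnode notices lazily when it is next validated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-server state. */
struct sketch_server {
	unsigned int cb_break_count; /* bumped on every mass-break */
};

/* Hypothetical stand-in for the per-vnode state. */
struct sketch_vnode {
	const struct sketch_server *server;
	unsigned int cb_break_seen; /* server counter when the promise was recorded */
	bool cb_promised;           /* do we hold a callback promise at all? */
};

/* Record a fresh promise, remembering the server's current counter. */
void sketch_record_promise(struct sketch_vnode *v)
{
	v->cb_break_seen = v->server->cb_break_count;
	v->cb_promised = true;
}

/* The server broke all callbacks: O(1), no walk over the vnodes. */
void sketch_mass_break(struct sketch_server *s)
{
	s->cb_break_count++;
}

/* Validation: the promise holds only if no mass-break happened since. */
bool sketch_promise_valid(const struct sketch_vnode *v)
{
	return v->cb_promised && v->cb_break_seen == v->server->cb_break_count;
}
```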

This has the following consequences:

 (1) Data invalidation isn't seen until someone calls afs_validate() on a
     vnode.  Unfortunately, we need to use a key to query the server, but
     getting one from a background thread is tricky without caching loads
     of keys all over the place.

 (2) Mass invalidation isn't seen until someone calls afs_validate().

 (3) Callback breaking is going to hit the inode_hash_lock quite a bit.
     Could this be replaced with rcu_read_lock() since inodes are destroyed
     under RCU conditions?

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells
d0676a1678 afs: Rename struct afs_call server member to cm_server
Rename the server member of struct afs_call to cm_server as we're only
going to be using it for incoming calls for the Cache Manager service.
This makes it easier to differentiate from the pointer to the target server
for the client, which will point to a different structure to allow for
callback handling.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells
03dc2cfca5 afs: Fix the afs_uuid struct to make the char-sized fields signed
In AFS's encoding of a UUID, the eight 'char' fields are all signed, so
represent them with __s8 rather than __u8.  This makes the compiler
sign-extend them correctly when XDR-encoding them.
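
The difference shows up when an 8-bit field is widened to the 32-bit word
that XDR uses on the wire.  The helpers below are a hypothetical userspace
sketch (int8_t and uint8_t stand in for __s8 and __u8) and leave out the
byte-order conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Widen a signed 8-bit field: the sign bit is copied into the upper 24 bits. */
uint32_t sketch_widen_signed(int8_t v)
{
	return (uint32_t)(int32_t)v;
}

/* Widen an unsigned 8-bit field: the upper 24 bits are always zero. */
uint32_t sketch_widen_unsigned(uint8_t v)
{
	return (uint32_t)v;
}
```

A byte with the top bit set, such as 0x80, therefore becomes 0xFFFFFF80 when
treated as signed but 0x00000080 when treated as unsigned.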

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells
f4b3526d83 afs: Connect up the CB.ProbeUuid
The handler for the CB.ProbeUuid operation in the cache manager is
implemented, but isn't listed in the switch-statement of operation
selection, so won't be used.  Fix this by adding it.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:18 +00:00
David Howells
33cd7f2b76 afs: Potentially return call->reply[0] from afs_make_call()
If call->ret_reply0 is set, return call->reply[0] on success.  Change the
return type of afs_make_call() to long so that this can be passed back
without bit loss and then cast to a pointer if required.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells
97e3043ad8 afs: Condense afs_call's reply{,2,3,4} into an array
Condense struct afs_call's reply anchor members - reply{,2,3,4} - into an
array.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells
f780c8ea0e afs: Consolidate abort_to_error translators
The AFS abort code space is shared across all services, so there's no need
for separate abort_to_error translators for each service.

Consolidate them into a single function and remove the function pointers
for them.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells
3838d3ecde afs: Allow IPv6 address specification of VL servers
Allow VL server specifications to be given IPv6 addresses as well as IPv4
addresses, for example as:

	echo add foo.org 1111:2222:3333:0:4444:5555:6666:7777 >/proc/fs/afs/cells

Note that ':' is the expected separator between IPv4 addresses, but if a
',' is detected or no '.' is detected in the string, the delimiter is
switched to ','.
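
The delimiter rule can be written down as a small userspace sketch.
sketch_pick_delim() is a hypothetical helper, not the function used in the
patch; it only encodes the rule as stated above.

```c
#include <assert.h>
#include <string.h>

/*
 * Choose the delimiter for a VL server address list:
 * ':' by default, but ',' if the string contains a ',' or contains
 * no '.' at all (in which case it cannot be a plain IPv4 list).
 */
char sketch_pick_delim(const char *list)
{
	if (strchr(list, ',') || !strchr(list, '.'))
		return ',';
	return ':';
}
```

With this rule a lone IPv6 address is never split on its own ':' characters,
while existing ':'-separated IPv4 lists keep working.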

This also works with DNS AFSDB or SRV record strings fetched by upcall from
userspace.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells
4d9df9868f afs: Keep and pass sockaddr_rxrpc addresses rather than in_addr
Keep and pass sockaddr_rxrpc addresses around rather than keeping and
passing in_addr addresses to allow for the use of IPv6 and non-standard
port numbers in future.

This also allows the port and service_id fields to be removed from the
afs_call struct.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells
ad6a942a9e afs: Update the cache index structure
Update the cache index structure in the following ways:

 (1) Don't use the volume name followed by the volume type as levels in the
     cache index.  Volumes can be renamed.  Use the volume ID instead.

 (2) Don't store the VLDB data for a volume in the tree.  If the volume
     database should be cached locally, then it should be done in a separate
     tree.

 (3) Expand the volume ID stored in the cache to 64 bits.

 (4) Expand the file/vnode ID stored in the cache to 96 bits.

 (5) Increment the cache structure version number to 1.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells
91a90380ef afs: Add some protocol defs
Add some protocol definitions, including max field lengths, flag defs, an
XDR-encoded UUID def, more VL operation IDs and more fileserver abort
codes.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells
9ed900b116 afs: Push the net ns pointer to more places
Push the network namespace pointer to more places in AFS, including the
afs_server structure (which doesn't hold a ref on the netns).

In particular, afs_put_cell() now requires a net ns parameter so that
it can safely alter the netns after decrementing the cell usage count - the
cell will be deallocated by a background thread after being cached for a
period, which means that it's not safe to access it after reducing its
usage count.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:17 +00:00
David Howells
49566f6f06 afs: Note the cell in the superblock info also
Keep a reference to the cell in the superblock info structure in addition
to the volume and net pointers.  This will make it easier to clean up in a
future patch in which afs_put_volume() will need the cell pointer.

Whilst we're at it, make the cell and volume getting functions return a
pointer to the object got to make the call sites look neater.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:16 +00:00
David Howells
59fa1c4a9f afs: Fix server reaping
Fix server reaping and make sure it's all done before we start trying to
purge cells, given that servers currently pin cells.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:16 +00:00
David Howells
e3b2ffe0f0 afs: Close the rxrpc socket only after purging the servers
Close the rxrpc socket only after we've purged the server records (and also
cell and volume records which might refer to servers) so that we can give
up the callbacks on each server.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:16 +00:00
David Howells
f044c8847b afs: Lay the groundwork for supporting network namespaces
Lay the groundwork for supporting network namespaces (netns) in the AFS
filesystem by moving various global features to a network-namespace struct
(afs_net) and providing an instance of this as a temporary global variable
that everything uses via accessor functions for the moment.

The following changes have been made:

 (1) Store the netns in the superblock info.  This will be obtained from
     the mounter's nsproxy on a manual mount and inherited from the parent
     superblock on an automount.

 (2) The cell list is made per-netns.  It can be viewed through
     /proc/net/afs/cells and also be modified by writing commands to that
     file.

 (3) The local workstation cell is set per-ns in /proc/net/afs/rootcell.
     This is unset by default.

 (4) The 'rootcell' module parameter, which sets a cell and VL server list,
     modifies the init net namespace, thereby allowing an AFS root fs to be
     theoretically used.

 (5) The volume location lists and the file lock manager are made
     per-netns.

 (6) The AF_RXRPC socket and associated I/O bits are made per-ns.

The various workqueues remain global for the moment.

Changes still to be made:

 (1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced
     from the old name.

 (2) A per-netns subsys needs to be registered for AFS into which it can
     store its per-netns data.

 (3) Rather than the AF_RXRPC socket being opened on module init, it needs
     to be opened on the creation of a superblock in that netns.

 (4) The socket needs to be closed when the last superblock using it is
     destroyed and all outstanding client calls on it have been completed.
     This prevents a reference loop on the namespace.

 (5) It is possible that several namespaces will want to use AFS, in which
     case each one will need its own UDP port.  These can either be set
     through /proc/net/afs/cm_port or the kernel can pick one at random.
     The init_ns gets 7001 by default.

Other issues that need resolving:

 (1) The DNS keyring needs net-namespacing.

 (2) Where do upcalls go (eg. DNS request-key upcall)?

 (3) Need something like open_socket_in_file_ns() syscall so that AFS
     command line tools attempting to operate on an AFS file/volume have
     their RPC calls go to the right place.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:16 +00:00
David Howells
5e4def2038 Pass mode to wait_on_atomic_t() action funcs and provide default actions
Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
extra argument and make it 'unsigned int' throughout.

Also, consolidate a bunch of identical action functions into a default
function that can do the appropriate thing for the mode.

Also, change the argument name in the bit_wait*() function declarations to
reflect the fact that it's the mode and not the bit number.

[Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
should be done differently, though he's not immediately sure as to how]

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
cc: Ingo Molnar <mingo@kernel.org>
2017-11-13 15:38:16 +00:00
David Howells
81445e63e6 Merge remote-tracking branch 'tip/timers/core' into afs-next
These AFS patches need the timer_reduce() patch from timers/core.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:36:33 +00:00
Chao Yu
88d8ff976a quota: be aware of error from dquot_initialize
Commit 6184fc0b8d ("quota: Propagate error from ->acquire_dquot()")
failed to handle the error from dquot_initialize() in dquot_file_open(); fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-11-13 10:36:56 +01:00
Arnd Bergmann
df27067e60 pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()
__getnstimeofday() is a rather odd interface, with a number of quirks:

- The caller may come from NMI context, but the implementation is not NMI safe.
  One way to get there from NMI is:

      NMI handler:
        something bad
          panic()
            kmsg_dump()
              pstore_dump()
                 pstore_record_init()
                   __getnstimeofday()

- The calling conventions are different from any other timekeeping functions,
  to deal with returning an error code during suspended timekeeping.

Address the above issues by using a completely different method to get the
time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior
when timekeeping is suspended: it returns the time at which it got
suspended. As Thomas Gleixner explained, this is safe, as
ktime_get_real_fast_ns() does not call into the clocksource driver that
might be suspended.

The result can easily be transformed into a timespec structure. Since
ktime_get_real_fast_ns() was not exported to modules, add the export.

The pstore behavior for the suspended case changes slightly, as it now
stores the timestamp at which timekeeping was suspended instead of storing
a zero timestamp.

This change is not addressing y2038-safety, that's subject to a more
complex follow up patch.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Colin Cross <ccross@android.com>
Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de
2017-11-12 15:05:52 +01:00
Greg Edwards
67f2519fe2 fs: guard_bio_eod() needs to consider partitions
guard_bio_eod() needs to look at the partition capacity, not just the
capacity of the whole device, when determining if truncation is
necessary.

[   60.268688] attempt to access beyond end of device
[   60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
[   60.268693] buffer_io_error: 2 callbacks suppressed
[   60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read

Fixes: 74d46992e0 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Cc: stable@vger.kernel.org # v4.13
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:55:57 -07:00
Bart Van Assche
3a0a529971 block, scsi: Make SCSI quiesce and resume work reliably
The contexts from which a SCSI device can be quiesced or resumed are:
* Writing into /sys/class/scsi_device/*/device/state.
* SCSI parallel (SPI) domain validation.
* The SCSI device power management methods. See also scsi_bus_pm_ops.

It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why SCSI
devices are quiesced while the hibernation image is being written or
restored. The SCSI core quiesces devices through scsi_device_quiesce()
and resumes them through scsi_device_resume(). In the SDEV_QUIESCE state
execution of non-preempt requests is deferred. This is realized by
returning BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced
SCSI devices. Prevent a full queue from blocking the submission of power
management requests by deferring allocation of non-preempt requests for
devices in the quiesced state. This patch has been tested by running
the following commands and by verifying that after each resume the
fio job was still running:

for ((i=0; i<10; i++)); do
  (
    cd /sys/block/md0/md &&
    while true; do
      [ "$(<sync_action)" = "idle" ] && echo check > sync_action
      sleep 1
    done
  ) &
  pids=($!)
  for d in /sys/class/block/sd*[a-z]; do
    bdev=${d#/sys/class/block/}
    hcil=$(readlink "$d/device")
    hcil=${hcil#../../../}
    echo 4 > "$d/queue/nr_requests"
    echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
    fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \
      --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16       \
      --iodepth_batch=1 --thread --loops=$((2**31)) &
    pids+=($!)
  done
  sleep 1
  echo "$(date) Hibernating ..." >>hibernate-test-log.txt
  systemctl hibernate
  sleep 10
  kill "${pids[@]}"
  echo idle > /sys/block/md0/md/sync_action
  wait
  echo "$(date) Done." >>hibernate-test-log.txt
done

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
LiFan
0dd99ca76f f2fs: validate before set/clear free nat bitmap
In flush_nat_entries, all dirty nats will be flushed and, if
their new address isn't NULL_ADDR, their bitmaps will be updated.
The free_nid_count of the bitmaps will be increased regardless
of whether the nats have already been occupied before.
This could lead to a wrong free_nid_count.
So this patch checks the status of the bits before actually
setting/clearing them.

Fixes: 586d1492f3 ("f2fs: skip scanning free nid bitmap of full NAT blocks")
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-10 17:35:07 -08:00
Chao Yu
19526d74cf f2fs: avoid opened loop codes in __add_ino_entry
__add_ino_entry always succeeds: the ENOMEM failure case is already
handled by using the __GFP_NOFAIL flag, so the additional open-coded
loop is unnecessary. Remove it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-10 11:50:12 -08:00
Darrick J. Wong
2d1d1da3d9 xfs: on failed mount, force-reclaim inodes after unmounting quota controls
When mounting fails, we must force-reclaim inodes (and disable delayed
reclaim) /after/ the realtime and quota control have let go of the
realtime and quota inodes.  Without this, we corrupt the timer list and
cause other weird problems.

Found by xfs/376 fuzzing u3.bmbt[0].lastoff on an rmap filesystem to
force a bogus post-eof extent reclaim that causes the fs to go down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-09 19:27:33 -08:00
Darrick J. Wong
d44b47fdd1 xfs: check the uniqueness of the AGFL entries
Make sure we don't list a block twice in the agfl by copying the
contents of the AGFL to an array, sorting it, and looking for
duplicates.  We can easily check that the number of agfl entries we see
actually matches the flcount, so do that too.
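
A minimal userspace sketch of the copy, sort and scan approach is shown
below.  The function names are hypothetical and this is not the scrub code
itself; for simplicity an allocation failure is reported the same way as a
failed check, which real code would want to distinguish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for 32-bit block numbers. */
int sketch_cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/*
 * Return true if the number of entries seen matches the expected count
 * and no block number appears twice.  The caller's array is not modified.
 */
bool sketch_agfl_ok(const uint32_t *agfl, size_t nr, size_t flcount)
{
	uint32_t *copy;
	size_t i;
	bool ok = true;

	if (nr != flcount)
		return false;
	if (nr == 0)
		return true;

	copy = malloc(nr * sizeof(*copy));
	if (!copy)
		return false; /* simplification: treated like a failed check */
	memcpy(copy, agfl, nr * sizeof(*copy));
	qsort(copy, nr, sizeof(*copy), sketch_cmp_u32);

	/* After sorting, any duplicates must be adjacent. */
	for (i = 1; i < nr; i++) {
		if (copy[i] == copy[i - 1]) {
			ok = false;
			break;
		}
	}
	free(copy);
	return ok;
}
```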

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-09 19:27:32 -08:00
Darrick J. Wong
65a7935ddc xfs: remove u_int* type usage
Use the uint* types instead of the u_int* types.  This will (hopefully)
pair with an xfsprogs cleanup.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-09 15:50:29 -08:00
Christoph Hellwig
ae82968ee9 xfs: handle zero entries case in xfs_iext_rebalance_leaf
And also rename fill to nr_entries to match the rest of the code.

Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-09 14:08:54 -08:00
Christoph Hellwig
3e27c418a7 xfs: add comments documenting the rebalance algorithm
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-09 14:08:54 -08:00
Christoph Hellwig
b9aee1d5fe xfs: trivial indentation fixup for xfs_iext_remove_node
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-09 14:08:54 -08:00
Christoph Hellwig
f1be313697 xfs: remove a superflous assignment in xfs_iext_remove_node
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-09 14:08:54 -08:00
Christoph Hellwig
fc258f4b8b xfs: add some comments to xfs_iext_insert/xfs_iext_insert_node
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-09 14:08:53 -08:00
Christoph Hellwig
43d193aa02 xfs: fix number of records handling in xfs_iext_split_leaf
Fix to check the correct value, and remove a duplicate handling of the
uneven record number split algorithm.

Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-09 14:08:53 -08:00
Hyunchul Lee
4f0a03d34d f2fs: apply write hints to select the type of segments for buffered write
Write hints help F2FS to determine which type of segments would be
selected for buffered write.

This patch implements the mapping from write hints to segment types
as shown below.

  hints               segment type
  -----               ------------
  WRITE_LIFE_SHORT    CURSEG_HOT_DATA
  WRITE_LIFE_EXTREME  CURSEG_COLD_DATA
  others              CURSEG_WARM_DATA

The F2FS policy for hot/cold separation has precedence over these hints,
and hints are not applied to in-place updates.
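
The table above amounts to a three-way switch.  The sketch below uses
hypothetical SKETCH_* enumerators as stand-ins for the kernel's WRITE_LIFE_*
and CURSEG_* constants and does not model the hot/cold policy that takes
precedence.

```c
#include <assert.h>

/* Hypothetical stand-ins for the write lifetime hints. */
enum sketch_write_hint {
	SKETCH_LIFE_NOT_SET,
	SKETCH_LIFE_NONE,
	SKETCH_LIFE_SHORT,
	SKETCH_LIFE_MEDIUM,
	SKETCH_LIFE_LONG,
	SKETCH_LIFE_EXTREME
};

/* Hypothetical stand-ins for the data segment types. */
enum sketch_segment {
	SKETCH_HOT_DATA,
	SKETCH_WARM_DATA,
	SKETCH_COLD_DATA
};

/* Map a write hint to a segment type, following the table above. */
enum sketch_segment sketch_hint_to_segment(enum sketch_write_hint hint)
{
	switch (hint) {
	case SKETCH_LIFE_SHORT:
		return SKETCH_HOT_DATA;
	case SKETCH_LIFE_EXTREME:
		return SKETCH_COLD_DATA;
	default:
		return SKETCH_WARM_DATA;
	}
}
```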

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-09 10:18:16 -08:00
Chao Yu
2fbaa25fde f2fs: introduce scan_curseg_cache for cleanup
Commit 4ac912427c ("f2fs: introduce free nid bitmap") copied code
from __build_free_nids() into scan_free_nid_bits(). The copies are redundant,
so introduce one common function, scan_curseg_cache, for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-09 10:12:26 -08:00
Fan Li
9745657449 f2fs: optimize the way of traversing free_nid_bitmap
We call scan_free_nid_bits only when there aren't many
free nids left, which means that marked bits in free_nid_bitmap
are supposed to be few, so using find_next_bit_le is more
efficient in such a case.
According to my tests, using find_next_bit_le instead of
test_bit_le cuts the traversal time down to one
third of the original.
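
The reason a find-next-bit scan wins on a sparse bitmap is that it skips a
whole zero word per step instead of testing every bit.  The following is a
userspace sketch only: the sketch_* names are hypothetical, it ignores the
little-endian layout that the _le helpers handle, and it assumes the
__builtin_ctzl() builtin provided by GCC and Clang.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Return the index of the next set bit at or after 'start', or 'nbits' if none. */
size_t sketch_find_next_bit(const unsigned long *map, size_t nbits, size_t start)
{
	while (start < nbits) {
		size_t word = start / SKETCH_BITS_PER_WORD;
		size_t off = start % SKETCH_BITS_PER_WORD;
		unsigned long w = map[word] >> off; /* bits at or after 'start' */

		if (w) {
			size_t bit = start + (size_t)__builtin_ctzl(w);
			return bit < nbits ? bit : nbits;
		}
		/* Nothing left in this word: jump to the start of the next one. */
		start = (word + 1) * SKETCH_BITS_PER_WORD;
	}
	return nbits;
}

/* Baseline: test every single bit. */
size_t sketch_count_by_test_bit(const unsigned long *map, size_t nbits)
{
	size_t i, n = 0;

	for (i = 0; i < nbits; i++)
		if (map[i / SKETCH_BITS_PER_WORD] & (1UL << (i % SKETCH_BITS_PER_WORD)))
			n++;
	return n;
}

/* Same result, but only visits set bits and word boundaries. */
size_t sketch_count_by_find_next(const unsigned long *map, size_t nbits)
{
	size_t n = 0;
	size_t bit = sketch_find_next_bit(map, nbits, 0);

	while (bit < nbits) {
		n++;
		bit = sketch_find_next_bit(map, nbits, bit + 1);
	}
	return n;
}
```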

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-09 09:43:09 -08:00
Fan Li
74986213ad f2fs: keep scanning until enough free nids are acquired
In the current version, after scan_free_nid_bits, the scan is over if
nid_cnt[FREE_NID] != 0. In most cases, there are still free nids in the
free list during the scan, and scan_free_nid_bits usually can't increase
nid_cnt[FREE_NID]. As a result, __build_free_nids is called many times
without solving the shortage of free nids. This patch fixes that.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-09 09:41:08 -08:00
Tim Hansen
478f8da0f7 fs/xfs: Remove NULL check before kmem_cache_destroy
kmem_cache_destroy already checks for null values.

Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-09 09:23:47 -08:00
Darrick J. Wong
4da4b10b5b xfs: only check da node header padding on v5 filesystems
It turns out that we only started zeroing a new da btree node's block
header on v5 filesystems.  Prior to that, we just wouldn't set anything
at all, which means that the pad field never got set and would retain
whatever happened to be in memory.

Therefore, we can only check the pad for zeroness on v5 filesystems.
shared/006 on a v4 filesystem exposes this scrub bug.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-09 09:10:45 -08:00
Darrick J. Wong
a605e86912 xfs: fix btree scrub deref check
The btree scrubber has some custom code to retrieve and check a btree
block via xfs_btree_lookup_get_block.  This function will either return
an error code (verifiers failed) or leave *pblock untouched (bad
pointer).  Since we previously set *pblock to NULL, we need to check
*pblock, not pblock, to trigger the early bailout.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-09 09:10:45 -08:00
Darrick J. Wong
72f76f7364 xfs: fix uninitialized return values in scrub code
Fix smatch complaints about uninitialized return codes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-09 09:10:45 -08:00
Darrick J. Wong
0a1e1567b3 xfs: pass inode number to xfs_scrub_ino_set_{preen,warning}
There are two ways to scrub an inode -- calling xfs_iget and checking
the raw inode core, or by loading the inode cluster buffer and checking
the on-disk contents directly.  The second method is only useful if
_iget fails the verifiers; when this is the case, sc->ip is NULL and
calling the tracepoint will cause a system crash.

Therefore, pass the raw inode number directly into the _preen and
_warning functions.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-09 09:10:45 -08:00
Darrick J. Wong
35ce852334 xfs: refactor the directory data block bestfree checks
In a directory data block, the zeroth bestfree item must point to the
longest free space.  Therefore, when we check the bestfree block's
records against the data blocks, we only need to compare with bf[0] and
don't need the loop.

The weird loop was most probably the result of an earlier refactoring
gone bad.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-09 09:10:45 -08:00
Theodore Ts'o
2325306802 ext4: improve smp scalability for inode generation
->s_next_generation is protected by s_next_gen_lock but its usage
pattern is very primitive.  We don't actually need sequentially
increasing new generation numbers, so let's use prandom_u32() instead.

Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-11-08 22:23:20 -05:00
Mimi Zohar
bb02b186d0 ima: call ima_file_free() prior to calling fasync
The file hash is calculated and written out as an xattr after
calling fasync().  In order for the file data and metadata to be
written out to disk at the same time, this patch calculates the
file hash and stores it as an xattr before calling fasync.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2017-11-08 15:16:36 -05:00
Mimi Zohar
46cdc6d533 vfs: fix mounting a filesystem with i_version
The mount i_version flag is not enabled in the new sb_flags.  This patch
adds the missing SB_I_VERSION flag.

Fixes: e462ec5 "VFS: Differentiate mount flags (MS_*) from internal
       superblock flags"
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2017-11-08 15:16:36 -05:00
Ingo Molnar
8a103df440 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 10:17:15 +01:00
Greg Kroah-Hartman
2b2d8788dd debugfs: Remove redundant license text
Now that the SPDX tag is in all debugfs files, that identifies the
license in a specific and legally-defined manner.  So the extra GPL text
wording can be removed as it is no longer needed at all.

This is done on a quest to remove the 700+ different ways that files in
the kernel describe the GPL license text.  And there's unneeded stuff
like the address (sometimes incorrect) for the FSF which is never
needed.

No copyright headers or other non-license-description text was removed.

Cc: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:03 +01:00
Greg Kroah-Hartman
3bce94fd5f debugfs: add SPDX identifiers to all debugfs files
It's good to have SPDX identifiers in all files to make it easier to
audit the kernel tree for correct licenses.

Update the debugfs files with the correct SPDX license identifier
based on the license text in the file itself.  The SPDX identifier is a
legally binding shorthand, which can be used instead of the full boiler
plate text.

This work is based on a script and data from Thomas Gleixner, Philippe
Ombredanne, and Kate Stewart.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:03 +01:00
Nicolai Stange
7d39bc50c4 debugfs: defer debugfs_fsdata allocation to first usage
Currently, __debugfs_create_file allocates one struct debugfs_fsdata
instance for every file created. However, there are potentially many
debugfs files around, most of which are never touched by userspace.

Thus, defer the allocations to the first usage, i.e. to the first
debugfs_file_get().

A dentry's ->d_fsdata starts out to point to the "real", user provided
fops. After a debugfs_fsdata instance has been allocated (and the real
fops pointer has been moved over into its ->real_fops member),
->d_fsdata is changed to point to it from then on. The two cases are
distinguished by setting BIT(0) for the real fops case.
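
The pointer tagging and lazy allocation described above can be sketched
as follows. This is a single-threaded user-space illustration, not the
debugfs implementation: REAL_FOPS_BIT, struct fops, struct fsdata and
lazy_file_get() are hypothetical names, and the race handling that the
commit message mentions is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define REAL_FOPS_BIT ((uintptr_t)1)	/* hypothetical name for BIT(0) */

struct fops { int dummy; };		/* stand-in for struct file_operations */
struct fsdata {				/* stand-in for struct debugfs_fsdata */
	const struct fops *real_fops;
	int active_users;
};

/* Tag a bare fops pointer so it can be told apart from a fsdata pointer. */
static void *tag_real_fops(const struct fops *f)
{
	assert(((uintptr_t)f & REAL_FOPS_BIT) == 0);	/* relies on alignment */
	return (void *)((uintptr_t)f | REAL_FOPS_BIT);
}

static int is_real_fops(const void *d_fsdata)
{
	return ((uintptr_t)d_fsdata & REAL_FOPS_BIT) != 0;
}

static const struct fops *untag_real_fops(const void *d_fsdata)
{
	return (const struct fops *)((uintptr_t)d_fsdata & ~REAL_FOPS_BIT);
}

/*
 * First use: allocate the container lazily and move the fops pointer
 * into it. Returns 0 on success, -1 where the kernel would use -ENOMEM.
 */
static int lazy_file_get(void **slot)
{
	if (is_real_fops(*slot)) {
		struct fsdata *fsd = malloc(sizeof(*fsd));

		if (!fsd)
			return -1;
		fsd->real_fops = untag_real_fops(*slot);
		fsd->active_users = 1;	/* the initial reference */
		*slot = fsd;
	}
	((struct fsdata *)*slot)->active_users++;
	return 0;
}
```

Files that are never opened keep only the tagged pointer and never pay
for the allocation.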

struct debugfs_fsdata's foremost purpose is to track active users and to
make debugfs_remove() block until they are done. Since no debugfs_fsdata
instance means no active users, make debugfs_remove() return immediately
in this case.

Take care of possible races between debugfs_file_get() and
debugfs_remove(): either debugfs_remove() must see a debugfs_fsdata
instance and thus wait for possible active users or debugfs_file_get() must
see a dead dentry and return immediately.

Make a dentry's ->d_release(), i.e. debugfs_release_dentry(), check whether
->d_fsdata is actually a debugfs_fsdata instance before kfree()ing it.

Similarly, make debugfs_real_fops() check whether ->d_fsdata is actually
a debugfs_fsdata instance before returning it, otherwise emit a warning.

The set of possible error codes returned from debugfs_file_get() has grown
from -EIO to -EIO and -ENOMEM. Make open_proxy_open() and full_proxy_open()
pass the -ENOMEM onwards to their callers.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:03 +01:00
Nicolai Stange
154b9d7512 debugfs: call debugfs_real_fops() only after debugfs_file_get()
The current implementation of debugfs_real_fops() relies on a
debugfs_fsdata instance to be installed at ->d_fsdata.

With future patches introducing lazy allocation of these, this requirement
will be guaranteed to be fulfilled only in between a
debugfs_file_get()/debugfs_file_put() pair.

The full proxies' fops implemented by debugfs happen to be the only
offenders. Fix them up by moving their debugfs_real_fops() calls past those
to debugfs_file_get().

full_proxy_release() is special as it doesn't invoke debugfs_file_get() at
all. Leave it alone for now.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:03 +01:00
Nicolai Stange
c9afbec270 debugfs: purge obsolete SRCU based removal protection
Purge the SRCU based file removal race protection in favour of the new,
refcount based debugfs_file_get()/debugfs_file_put() API.

Fixes: 49d200deaa ("debugfs: prevent access to removed files' private data")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Nicolai Stange
69d29f9e6a debugfs: convert to debugfs_file_get() and -put()
Convert all calls to the now obsolete debugfs_use_file_start() and
debugfs_use_file_finish() from the debugfs core itself to the new
debugfs_file_get() and debugfs_file_put() API.

Fixes: 49d200deaa ("debugfs: prevent access to removed files' private data")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Nicolai Stange
055ab8e3e3 debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
Currently, debugfs_real_fops() is annotated with a
__must_hold(&debugfs_srcu) sparse annotation.

With the conversion of the SRCU based protection of users against
concurrent file removals to a per-file refcount based scheme, this becomes
wrong.

Drop this annotation.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Nicolai Stange
e9117a5a4b debugfs: implement per-file removal protection
Since commit 49d200deaa ("debugfs: prevent access to removed files'
private data"), accesses to a file's private data are protected from
concurrent removal by covering all file_operations with a SRCU read section
and synchronizing with those before returning from debugfs_remove() by means
of synchronize_srcu().

As pointed out by Johannes Berg, there are debugfs files with forever
blocking file_operations. Their corresponding SRCU read side sections would
block any debugfs_remove() forever as well, even unrelated ones. This
results in a livelock. Because a remover can't cancel any indefinite
blocking within foreign files, this is a problem.

Resolve this by introducing support for more granular protection on a
per-file basis.

This is implemented by introducing an 'active_users' refcount_t to the
per-file struct debugfs_fsdata state. At file creation time, it is set to
one and a debugfs_remove() will drop that initial reference. The new
debugfs_file_get() and debugfs_file_put(), intended to be used in place of
former debugfs_use_file_start() and debugfs_use_file_finish(), increment
and decrement it respectively. Once the count drops to zero,
debugfs_file_put() will signal a completion which is possibly being waited
for from debugfs_remove().
Thus, as long as there is a debugfs_file_get() not yet matched by a
corresponding debugfs_file_put() around, debugfs_remove() will block.
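
The reference counting scheme described above can be sketched as
follows. This is a single-threaded illustration under stated
assumptions: struct file_state and the file_*() helpers are hypothetical
names, a plain int replaces refcount_t, and a boolean flag replaces the
completion that debugfs_remove() would sleep on.

```c
#include <stdbool.h>

struct file_state {
	int  active_users;	/* starts at 1: the creation-time reference */
	bool removal_done;	/* stand-in for the completion */
};

static void file_init(struct file_state *fs)
{
	fs->active_users = 1;
	fs->removal_done = false;
}

/* Fails once the count has reached zero, i.e. the file is going away. */
static bool file_get(struct file_state *fs)
{
	if (fs->active_users == 0)
		return false;
	fs->active_users++;
	return true;
}

static void file_put(struct file_state *fs)
{
	if (--fs->active_users == 0)
		fs->removal_done = true;	/* would signal the completion */
}

/* Drop the initial reference and report whether removal can finish now. */
static bool file_remove(struct file_state *fs)
{
	file_put(fs);
	return fs->removal_done;	/* false: a real remover would block */
}
```

Only users of the file being removed can delay its removal; a blocked
reader of an unrelated file holds a different counter.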

Actual users of debugfs_use_file_start() and -finish() will get converted
to the new debugfs_file_get() and debugfs_file_put() by followup patches.

Fixes: 49d200deaa ("debugfs: prevent access to removed files' private data")
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Nicolai Stange
7c8d469877 debugfs: add support for more elaborate ->d_fsdata
Currently, the user provided fops, "real_fops", are stored directly into
->d_fsdata.

In order to be able to store more per-file state and thus prepare for more
granular file removal protection, wrap the real_fops into a dynamically
allocated container struct, debugfs_fsdata.

A struct debugfs_fsdata gets allocated at file creation and freed from the
newly introduced ->d_release().

Finally, move the implementation of debugfs_real_fops() out of the public
debugfs header such that struct debugfs_fsdata's declaration can be kept
private.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-07 20:25:02 +01:00
Ingo Molnar
8c5db92a70 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:32:44 +01:00
Chao Yu
a5fd505092 f2fs: trace checkpoint reason in fsync()
This patch slightly changes need_do_checkpoint to return detailed
info indicating why a checkpoint is needed, so that the caller can print
it in a trace message.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-06 17:01:20 -08:00
Chao Yu
e8ed90a6d9 f2fs: keep isize once block is reserved cross EOF
Without FADVISE_KEEP_SIZE_BIT, we will try to recover the file size
according to the last non-hole block, so in fallocate() we must set the
FADVISE_KEEP_SIZE_BIT flag as soon as we have preallocated a block beyond
EOF, instead of only after all preallocation has succeeded. Otherwise, the
file size will be incorrect due to the lack of this flag.

Simple testcase to reproduce this:

1. echo 2 > /sys/fs/f2fs/<device>/inject_type
2. echo 10 > /sys/fs/f2fs/<device>/inject_rate
3. run tests/generic/392
4. disable fault injection
5. do remount

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-06 17:00:09 -08:00
Christoph Hellwig
181fdfe662 xfs: mark xlog_verify_dest_ptr STATIC
We already did it in the forward declaration, but not for the function
body itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:57:39 -08:00
Christoph Hellwig
e89fbb5ee1 xfs: mark xlog_recover_check_summary STATIC
We already did it in the forward declaration, but not for the function
body itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:57:39 -08:00
Christoph Hellwig
4483eb566b xfs: mark xfs_btree_check_lblock and xfs_btree_check_ptr static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:57:39 -08:00
Christoph Hellwig
a61a2c8683 xfs: remove unreachable error injection code in xfs_qm_dqget
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:57:39 -08:00
Christoph Hellwig
afd72454e1 xfs: remove unused debug counts for xfs_lock_inodes
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:57:39 -08:00
Christoph Hellwig
4371155e80 xfs: mark xfs_errortag_ktype static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:57:39 -08:00
Christoph Hellwig
88aa5de46b xfs: trivial sparse fixes for the new scrub code
[darrick: fix broken initializer in xfs_scrub_xattr]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:58 -08:00
Christoph Hellwig
fc41e2a193 xfs: always define STATIC to static noinline
Ever since we added the noinline tag there is no good reason to define
away the static for debug builds - we'll get just as good debug
information with or without it, so don't mess up sparse and other
checkers due to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:58 -08:00
Christoph Hellwig
866d7826c9 xfs: move xfs_bmbt_irec and xfs_exntst_t to xfs_types.h
Neither defines an on-disk format, so move them out of xfs_format.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:41 -08:00
Christoph Hellwig
dac9c9b137 xfs: pass struct xfs_bmbt_irec to xfs_bmbt_validate_extent
This removes an unaligned load per extent, as well as the manual poking
into the on-disk extent format.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:41 -08:00
Christoph Hellwig
c38ccf5990 xfs: remove the nr_extents argument to xfs_iext_remove
We only have two places that remove 2 extents at the same time, so unroll
the loop there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:41 -08:00
Christoph Hellwig
0254c2f253 xfs: remove the nr_extents argument to xfs_iext_insert
We only have two places that insert 2 extents at the same time, so unroll
the loop there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:41 -08:00
Christoph Hellwig
6bdcf26ade xfs: use a b+tree for the in-core extent list
Replace the current linear list and the indirection array for the in-core
extent list with a b+tree to avoid the need for larger memory allocations
for the indirection array when lots of extents are present.  The current
extent list implementation leads to heavy pressure on the memory
allocator when modifying files with a high extent count, and can lead
to high latencies because of that.

The replacement is a b+tree with a few quirks.  The leaf nodes directly
store the extent record in two u64 values.  The encoding is a little bit
different from the existing in-core extent records so that the start
offset and length which are required for lookups can be retrieved with
simple mask operations.  The inner nodes store a 64-bit key containing
the start offset in the first half of the node, and the pointers to the
next lower level in the second half.  In either case we walk the node
from the beginning to the end and do a linear search, as that is more
efficient for the low number of cache lines touched during a search
(2 for the inner nodes, 4 for the leaf nodes) than a binary search.
We store termination markers (zero length for the leaf nodes, an
otherwise impossible high bit for the inner nodes) to terminate the key
list / records instead of storing a count to use the available cache
lines as efficiently as possible.
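
The terminated linear search in a leaf can be sketched as follows. The
layout is an assumption made for readability: struct rec stores the
start offset and length unpacked instead of in the two packed u64
values the commit describes, and LEAF_RECS, struct leaf and
leaf_lookup() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define LEAF_RECS 15	/* hypothetical capacity; not stated above */

struct rec {
	uint64_t startoff;
	uint64_t len;		/* len == 0 marks the end of the used records */
};

struct leaf {
	struct rec recs[LEAF_RECS];
};

/*
 * Return the first record that ends after 'off', or NULL if there is
 * none. No count is stored: a zero-length record terminates the scan.
 */
static const struct rec *leaf_lookup(const struct leaf *l, uint64_t off)
{
	for (size_t i = 0; i < LEAF_RECS; i++) {
		const struct rec *r = &l->recs[i];

		if (r->len == 0)
			break;
		if (off < r->startoff + r->len)
			return r;
	}
	return NULL;
}
```

With so few records per node the whole scan stays within a handful of
cache lines, which is the argument the commit makes against a binary
search.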

One quirk of the algorithm is that while we normally split a node half and
half like usual btree implementations we just spill over entries added at
the very end of the list to a new node on its own.  This means we get a
100% fill grade for the common cases of bulk insertion when reading an
inode into memory, and when only sequentially appending to a file.  The
downside is a slightly higher chance of splits on the first random
insertions.
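
The split policy above can be reduced to one decision, sketched here
with a hypothetical helper entries_to_move(); the real code also has to
move the entries and update the parent keys, which is omitted.

```c
#include <stddef.h>

/*
 * Number of existing entries moved to the new right-hand node when a
 * full node holding 'nr' entries is split for an insertion at 'pos'.
 */
static size_t entries_to_move(size_t nr, size_t pos)
{
	if (pos == nr)
		return 0;	/* append: the new entry starts a node of its own */
	return nr / 2;		/* otherwise: ordinary half-and-half split */
}
```

Under this rule a purely sequential append leaves every node but the
last one completely full.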

Both insert and removal manually recurse into the lower levels, but
the bulk deletion of the whole tree is still implemented as a recursive
function call, although one limited by the overall depth and with very
little stack usage in every iteration.

For the first few extents we dynamically grow the list from a single
extent to the next powers of two until we have a first full leaf block
and then build the actual tree.

The code started out based on the generic lib/btree.c code from Joern
Engel based on earlier work from Peter Zijlstra, but has since been
rewritten beyond recognition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:41 -08:00
Christoph Hellwig
135dcc10d6 xfs: allow unaligned extent records in xfs_bmbt_disk_set_all
To make life a little simpler make xfs_bmbt_set_all unaligned access
aware so that we can use it directly on the destination buffer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:41 -08:00
Christoph Hellwig
43518812d2 xfs: remove support for inlining data/extents into the inode fork
Supporting a small bit of data inside the inode fork blows up the fork size
a lot, removing the 32 bytes of inline data halves the effective size of
the inode fork (and it still has a lot of unused padding left), and the
performance of a single kmalloc doesn't show up compared to the size to read
an inode or create one.

It also simplifies the fork management code a lot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:40 -08:00
Christoph Hellwig
b121459c7a xfs: simplify xfs_reflink_convert_cow
Instead of looking up extents to convert and calling xfs_bmapi_write on
each of them just let xfs_bmapi_write handle the full range.  To make
this robust add a new XFS_BMAPI_CONVERT_ONLY that only converts ranges
and never allocates blocks.

[darrick: shorten the stringified CONVERT_ONLY trace flag]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:40 -08:00
Christoph Hellwig
41caabd0ab xfs: iterate backwards in xfs_reflink_cancel_cow_blocks
Match the iteration order for extent deletion in the truncate and
reflink I/O completion path.

This also happens to make implementing the new incore extent list
a lot easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:40 -08:00
Christoph Hellwig
b2b1712a64 xfs: introduce the xfs_iext_cursor abstraction
Add a new xfs_iext_cursor structure to hide the direct extent map
index manipulations. In addition to the existing lookup/get/insert/
remove and update routines new primitives to get the first and last
extent cursor, as well as moving up and down by one extent are
provided.  Also new are convenience to increment/decrement the
cursor and retrieve the new extent, as well as to peek into the
previous/next extent without updating the cursor and last but not
least a macro to iterate over all extents in a fork.
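
The shape of such a cursor API can be sketched as follows. Everything
here is hypothetical and array-backed: struct fork, struct cursor, the
cur_*() helpers and for_each_extent() only illustrate how callers stop
manipulating an index directly; they are not the XFS definitions.

```c
#include <stdbool.h>
#include <stddef.h>

struct extent { unsigned long startoff, len; };
struct fork   { const struct extent *ext; size_t nr; };	/* stand-in */
struct cursor { size_t pos; };		/* callers treat this as opaque */

static void cur_first(struct cursor *c) { c->pos = 0; }
static void cur_next(struct cursor *c)  { c->pos++; }

/* Copy out the extent at the cursor; false when past the last extent. */
static bool cur_get(const struct fork *f, const struct cursor *c,
		    struct extent *out)
{
	if (c->pos >= f->nr)
		return false;
	*out = f->ext[c->pos];
	return true;
}

/* Look at the following extent without moving the cursor. */
static bool cur_peek_next(const struct fork *f, const struct cursor *c,
			  struct extent *out)
{
	struct cursor n = { c->pos + 1 };

	return cur_get(f, &n, out);
}

/* Iterate over all extents in a fork. */
#define for_each_extent(f, c, got) \
	for (cur_first(c); cur_get((f), (c), (got)); cur_next(c))
```

Because callers only see the helpers, the backing store can later
change from an array to a tree without touching them.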

[darrick: rename for_each_iext to for_each_xfs_iext]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:40 -08:00
Christoph Hellwig
906abed501 xfs: iterate over extents in xfs_bmap_extents_to_btree
This actually makes the function very slightly less efficient for now as we
detour through the expanded irec format between the in-core extent format
and the on-disk one instead of just endian swapping them.  But with the
incore extent btree the in-core one will use a different format and the
representation will be entirely hidden.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:40 -08:00
Christoph Hellwig
71565f4b92 xfs: iterate over extents in xfs_iextents_copy
This actually makes the function very slightly less efficient for now as we
detour through the expanded irec format between the in-core extent format
and the on-disk one instead of just endian swapping them.  But with the
incore extent btree the in-core one will use a different format and the
representation will be entirely hidden.  It also happens to make the
function a whole lot more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:39 -08:00
Christoph Hellwig
f36bc228e1 xfs: pass an on-disk extent to xfs_bmbt_validate_extent
This prepares for getting rid of the current in-memory extent format.
At the end of the series we will change the calling convention again
to pass the xfs_bmbt_irec structure once it is available everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:39 -08:00
Christoph Hellwig
4263036100 xfs: treat idx as a cursor in xfs_bmap_collapse_extents
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:39 -08:00
Christoph Hellwig
657fcb2336 xfs: treat idx as a cursor in xfs_bmap_del_extent_*
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:39 -08:00
Christoph Hellwig
a681847796 xfs: treat idx as a cursor in xfs_bmap_add_extent_unwritten_real
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:39 -08:00
Christoph Hellwig
1d2e0089e1 xfs: treat idx as a cursor in xfs_bmap_add_extent_hole_real
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:39 -08:00
Christoph Hellwig
41d196f439 xfs: treat idx as a cursor in xfs_bmap_add_extent_hole_delay
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:38 -08:00
Christoph Hellwig
0d045540ed xfs: treat idx as a cursor in xfs_bmap_add_extent_delay_real
Stop poking before and after the index and just increment or decrement
it while doing our operations on it to prepare for a new extent list
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:38 -08:00
Christoph Hellwig
bf99971c82 xfs: remove a duplicate assignment in xfs_bmap_add_extent_delay_real
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:38 -08:00
Christoph Hellwig
1bfd7618cb xfs: don't create overlapping extents in xfs_bmap_add_extent_delay_real
Two cases in xfs_bmap_add_extent_delay_real currently insert a new
extent before updating the existing one that is being split.  While
this works fine with a simple extent list, a more complex tree can't
easily cope with overlapping extents.  Reshuffle the code a bit to update
the slot of the existing delalloc extent to the new real extent before
inserting the shortened delalloc extent before or after it.  This
avoids the overlapping extents while still allowing to update the
br_startblock field of the delalloc extent with the updated indirect
block reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-06 11:53:38 -08:00
Chao Yu
bb06664a53 f2fs: avoid race in between GC and block exchange
During block exchange in {insert,collapse,move}_range, the page-block
mapping is unstable due to mapping moving or recovery, so no concurrent
cache read operation should rely on such a mapping, nor should any cache
write operation interfere with the block exchange.

So this patch makes background GC aware of that.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:10 -08:00
Fan Li
f6986ede80 f2fs: save a multiplication for last_nid calculation
Use a slightly easier way to calculate last_nid.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:09 -08:00
Chao Yu
2b60311dd1 f2fs: fix summary info corruption
Sometimes, after running generic/270 of fstest, fsck reports summary
info and actual position of block address in direct node becoming
inconsistent.

The root cause is a race between __f2fs_replace_block and change_curseg,
as shown below:

Thread A				Thread B
- __clone_blkaddrs
 - f2fs_replace_block
  - __f2fs_replace_block
   - segnoA = GET_SEGNO(sbi, blkaddrA);
   - type = se->type:=CURSEG_HOT_DATA
   - if (!IS_CURSEG(sbi, segnoA))
         type = CURSEG_WARM_DATA
					- allocate_data_block
					 - allocate_segment
					  - get_ssr_segment
					  - change_curseg(segnoA, CURSEG_HOT_DATA)
   - change_curseg(segnoA, CURSEG_WARM_DATA)
    - reset_curseg
     - __set_sit_entry_type
      - change se->type from CURSEG_HOT_DATA to CURSEG_WARM_DATA

So in the end, the hot curseg is located in segnoA, but the type of segnoA
has become CURSEG_WARM_DATA.

Then if we invoke __f2fs_replace_block(blkaddrB, blkaddrA, true, false),
since blkaddrA is located in segnoA, we will move the warm type curseg to
segnoA, then change its summary cache and write it back to the summary block.

But segnoA is also used by the hot type curseg; once that curseg moves or
is persisted, it overwrites the summary block content with its own stale
summary cache, resulting in an inconsistent state.

This patch tries to fix this issue by introducing a global curseg lock to
avoid the race between __f2fs_replace_block and change_curseg.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:08 -08:00
Chao Yu
0537b81153 f2fs: remove dead code in update_meta_page
After commit a468f0ef51 ("f2fs: use crc and cp version to determine
roll-forward recovery"), last caller of update_meta_page passing @src
with NULL is gone, so remove related dead code there.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:07 -08:00
Chao Yu
dee668c143 f2fs: remove unneeded semicolon
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:06 -08:00
Jeff Layton
d1954ab4c9 f2fs: don't bother with inode->i_version
f2fs does not set the SB_I_VERSION flag, so the i_version will never
be incremented on write. It was recently changed to increment the
i_version on a quota write, which isn't necessary here.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:05 -08:00
Chao Yu
bf34c93d26 f2fs: check curseg space before foreground GC
When we are close to triggering foreground GC and there are only a few
dirty metas, we can log these dirty metas in the space left in opened
segments instead of triggering foreground GC.

With this patch, the total count of foreground GCs triggered by
test/generic/* of the fstest suite drops from 254 to 184.

So let's do the check before foreground GC anyway.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:04 -08:00
Chao Yu
3d26fa6be3 f2fs: use rw_semaphore to protect SIT cache
In some cases users do not update the SIT cache under this lock,
so let's use an rw_semaphore instead of a mutex to improve concurrent
access.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:03 -08:00
Jaegeuk Kim
ea6767337f f2fs: support quota sys files
This patch supports hidden quota files in the system, which will be used for
Android. It requires up-to-date f2fs-tools later than v1.9.0.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:02 -08:00
Jaegeuk Kim
234a968961 f2fs: add quota_ino feature infra
This patch adds quota_ino feature infra to be used for quota files.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:01 -08:00
Fan Li
37a0ab2a3b f2fs: optimize __update_nat_bits
Make three modifications to __update_nat_bits:
1. Move the code that handles the nat with nid 0 out of the loop.
    Such a nat only needs to be dealt with once, at the beginning.
2. Use "nat_index == 0" instead of "start_nid == 0" to decide whether it is the first nat block.
    It's better not to assume that @start_nid is the first nid of the nat block it's in.
3. Use "if (nat_blk->entries[i].block_addr != NULL_ADDR)" to explicitly confirm the value of block_addr.
    Using the constant keeps the code correct even if the value of NULL_ADDR changes.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:42:00 -08:00
Yunlei He
f15194fcfa f2fs: modify for accurate fggc node io stat
modify for accurate fggc node io stat

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:59 -08:00
Yunlong Song
65f1b80b33 Revert "f2fs: handle dirty segments inside refresh_sit_entry"
This reverts commit 5e443818fa

The commit should be reverted because call sequence of below two parts
of code must be kept:
a. update sit information, it needs to be updated before segment
allocation since latter allocation may trigger SSR, and SSR allocation
needs latest valid block information of all segments.
b. update segment status, it needs to be updated after segment allocation
since we can skip updating current opened segment status.

Fixes: 5e443818fa ("f2fs: handle dirty segments inside refresh_sit_entry")
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: remove refresh_sit_entry function]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:58 -08:00
Fan Li
a0761f63ea f2fs: add a function to move nid
This patch adds a new function to move a nid from one state to another.
The move operation is heavily used; by adding a new function for it
we can cut down some branches in several flows.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:58 -08:00
Chao Yu
a2a12b679f f2fs: export SSR allocation threshold
This patch exports min_ssr_segments threshold in sysfs to let user
control triggering SSR allocation flexibly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:57 -08:00
Chao Yu
0ea805129d f2fs: give correct trimmed blocks in fstrim
We already support issuing discards in a specified range during fstrim.
fstrim needs to return to the caller the number of bytes successfully
trimmed in that range, instead of the bytes of invalid blocks that are
scanned in checkpoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:56 -08:00
Chao Yu
d62fe97148 f2fs: support bio allocation error injection
This patch adds support for bio allocation error injection to simulate
out-of-memory test scenarios.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:55 -08:00
Chao Yu
01eccef793 f2fs: support get_page error injection
This patch adds support for get_page error injection to simulate
out-of-memory test scenarios.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:54 -08:00
Yunlong Song
80d4214501 f2fs: support soft block reservation
This patch extends the reserved_blocks sysfs interface to be a soft
threshold, which allows the user to configure it to exceed the currently
available user space. This patch also introduces a new sysfs interface called
current_reserved_blocks, which shows the number of blocks that have
already been reserved.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:52 -08:00
Jaegeuk Kim
bf9c142785 f2fs: handle error case when adding xattr entry
This patch fixes recovering incomplete xattr entries remaining in inline xattr
and xattr block, caused by any kind of errors.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:51 -08:00
Chao Yu
6afc662e68 f2fs: support flexible inline xattr size
In products, more and more features based on file encryption are being
introduced, and their demand for xattr space is increasing. However, inline
xattr has a fixed size of 200 bytes; once the inline xattr space is full,
additional xattr data occupies an extra xattr block, which may cost more
space and cause a performance regression when persisting.

To resolve this issue, it is better to expand the inline xattr size
flexibly according to the user's requirements.

So this patch introduces a new filesystem feature, 'flexible inline xattr',
and a new mount option, 'inline_xattr_size=%u'. Once mkfs enables the
feature, we can use the option to make f2fs support a flexible inline
xattr size.

To support this feature, we add an extra attribute, i_inline_xattr_size, to
the inode layout, indicating how much space inline xattr borrows from the
block address mapping space in the inode layout. With this, we can easily
locate and store flexible-sized inline xattr data in the inode.

Inode disk layout:
  +----------------------+
  | .i_mode              |
  | ...                  |
  | .i_ext               |
  +----------------------+
  | .i_extra_isize       |
  | .i_inline_xattr_size |-----------+
  | ...                  |           |
  +----------------------+           |
  | .i_addr              |           |
  |  - block address or  |           |
  |  - inline data       |           |
  +----------------------+<---+      v
  |    inline xattr      |    +---inline xattr range
  +----------------------+<---+
  | .i_nid               |
  +----------------------+
  |   node_footer        |
  | (nid, ino, offset)   |
  +----------------------+
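
The borrowing shown in the layout above can be sketched as simple arithmetic. This is an illustrative model only: it assumes each i_addr slot is a 4-byte block address and that the inline xattr size is counted in such slots; the helper names and any slot totals passed in are hypothetical.

```c
#include <stdint.h>

/* Assumed for illustration: each i_addr slot is a 4-byte block address. */
#define ADDR_SLOT_BYTES 4u

/* Number of i_addr slots left for block addresses (or inline data) once
 * inline_xattr_slots slots at the tail are lent to inline xattrs. */
uint32_t addr_slots_left(uint32_t total_slots, uint32_t inline_xattr_slots)
{
	if (inline_xattr_slots > total_slots)
		return 0;
	return total_slots - inline_xattr_slots;
}

/* Byte offset, from the start of i_addr, where the inline xattr range begins. */
uint32_t inline_xattr_offset(uint32_t total_slots, uint32_t inline_xattr_slots)
{
	return addr_slots_left(total_slots, inline_xattr_slots) * ADDR_SLOT_BYTES;
}
```

Under these assumptions, the historical fixed size of 200 bytes corresponds to 50 slots, and a larger configured size simply moves the start of the inline xattr range earlier in i_addr.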

Note that we have to consider backward compatibility which reserved
inline_data space, 200 bytes, all the time, reported by Sheng Yong.

Previous inline data or directory always reserved 200 bytes in inode layout,
even if inline_xattr is disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:50 -08:00
Jaegeuk Kim
b4b153f8c2 f2fs: show current cp state
This patch shows whether checkpoint met any error case.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:49 -08:00
Jaegeuk Kim
d8d1389ea1 f2fs: add missing quota_initialize
This patch adds calls to quota_initialize in f2fs_set_acl, f2fs_unlink,
and f2fs_rename.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:48 -08:00
Jaegeuk Kim
8f1572f7ce f2fs: show # of dirty segments via sysfs
This patch adds one sysfs entry to show # of dirty segments which can be
used for gc timing by user.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:47 -08:00
Jaegeuk Kim
1f227a3e21 f2fs: stop all the operations by cp_error flag
This patch uses the cp_error flag instead of RDONLY for quota off.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-05 16:41:43 -08:00
David S. Miller
2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Darrick J. Wong
0dca060c2a xfs: scrub: avoid uninitialized return code
The newly added xfs_scrub_da_btree_block() function has one code path
that returns the 'error' variable without initializing it first, as
shown by this compiler warning:

fs/xfs/scrub/dabtree.c: In function 'xfs_scrub_da_btree_block':
fs/xfs/scrub/dabtree.c:462:9: error: 'error' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Return zero since the caller will exit the scrub code if we don't produce a
buffer pointer.
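
The pattern behind the warning can be reproduced in a small userspace sketch. The functions `read_block` and `scrub_block` are hypothetical stand-ins, not the XFS code; the point is only that a variable returned on an early-exit path must be initialized:

```c
#include <stddef.h>

/* Hypothetical stand-in for reading a block: stores a buffer in *bufp and
 * returns 0, or returns a negative error code. */
int read_block(int blkno, const char **bufp)
{
	static const char data[] = "block";

	if (blkno < 0)
		return -5;	/* model an I/O error */
	*bufp = data;
	return 0;
}

int scrub_block(int blkno, int skip, const char **bufp)
{
	int error = 0;	/* initialized: the early-exit path below returns it */

	*bufp = NULL;
	if (skip)
		goto out;	/* no buffer produced; the caller checks *bufp */

	error = read_block(blkno, bufp);
out:
	return error;
}
```

Returning zero on the early-exit path is safe here because the caller is expected to stop when no buffer pointer was produced.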

Fixes: 7c4a07a424 ("xfs: scrub directory/attribute btrees")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-11-03 09:45:56 -07:00
Eryu Guan
350976ae21 xfs: truncate pagecache before writeback in xfs_setattr_size()
On truncate down, if new size is not block size aligned, we zero the
rest of block to avoid exposing stale data to user, and
iomap_truncate_page() skips zeroing if the range is already in
unwritten state or a hole. Then we writeback from on-disk i_size to
the new size if this range hasn't been written to disk yet, and
truncate page cache beyond new EOF and set in-core i_size.

The problem is that we could write data between di_size and newsize
before removing the page cache beyond newsize, as the extents may
still be in unwritten state right after a buffer write. As such, the
page of data that newsize lies in has not been zeroed by page cache
invalidation before it is written, and xfs_do_writepage() hasn't
triggered its "zero data beyond EOF" case because we haven't
updated in-core i_size yet. Then a subsequent mmap read could see
non-zeros past EOF.

I occasionally see this in fsx runs in fstests generic/112, a
simplified fsx operation sequence is like (assuming 4k block size
xfs):

  fallocate 0x0 0x1000 0x0 keep_size
  write 0x0 0x1000 0x0
  truncate 0x0 0x800 0x1000
  punch_hole 0x0 0x800 0x800
  mapread 0x0 0x800 0x800

where fallocate allocates unwritten extent but doesn't update
i_size, buffer write populates the page cache and extent is still
unwritten, truncate skips zeroing page past new EOF and writes the
page to disk, punch_hole invalidates the page cache, at last mapread
reads the block back and sees non-zero beyond EOF.

Fix it by moving truncate_setsize() to before writeback so the page
cache invalidation zeros the partial page at the new EOF. This also
triggers "zero data beyond EOF" in xfs_do_writepage() at writeback
time, because newsize has been set and page straddles the newsize.

Also fixed the wrong 'end' param of filemap_write_and_wait_range()
call while we're at it, the 'end' is inclusive and should be
'newsize - 1'.
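
The inclusive 'end' convention is easy to get wrong, so here is a small model of the range computation. The struct `byte_range` and the helper `writeback_range` are hypothetical and only illustrate the arithmetic, not the XFS implementation:

```c
#include <stdint.h>

struct byte_range {
	int64_t start;	/* first byte, inclusive */
	int64_t end;	/* last byte, inclusive */
};

/* Range to write back between the on-disk size di_size and newsize:
 * bytes [di_size, newsize - 1].
 * Returns 0 if such a range exists, -1 if it is empty. */
int writeback_range(int64_t di_size, int64_t newsize, struct byte_range *r)
{
	if (newsize <= di_size)
		return -1;
	r->start = di_size;
	r->end = newsize - 1;	/* inclusive end, so not newsize itself */
	return 0;
}
```

With di_size 0x0 and newsize 0x800, as in the fsx sequence above, the range is 0x0 through 0x7ff; passing 0x800 as the end would include one byte past the new EOF.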

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eryu Guan <eguan@redhat.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-03 09:45:56 -07:00
Christoph Hellwig
ea435e1b93 block: add a poll_fn callback to struct request_queue
That way we can also poll non blk-mq queues.  Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Dave Martin
27e64b4be4 regset: Add support for dynamically sized regsets
Currently the regset API doesn't allow for the possibility that
regsets (or at least, the amount of meaningful data in a regset)
may change in size.

In particular, this results in useless padding being added to
coredumps if a regset's current size is smaller than its
theoretical maximum size.

This patch adds a get_size() function to struct user_regset.
Individual regset implementations can implement this function to
return the current size of the regset data.  A regset_size()
function is added to provide callers with an abstract interface for
determining the size of a regset without needing to know whether
the regset is dynamically sized or not.

The only affected user of this interface is the ELF coredump code:
This patch ports ELF coredump to dump regsets with their actual
size in the coredump.  This has no effect except for new regsets
that are dynamically sized and provide a get_size() implementation.
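
A userspace model of the abstraction described above. The names `regset_model`, `regset_size_model` and `half_used` are hypothetical, and the fixed-size fallback of slot count times slot size is an assumption of this sketch:

```c
#include <stddef.h>

struct regset_model;
typedef unsigned int (*get_size_fn)(const struct regset_model *rs);

struct regset_model {
	unsigned int n;		/* number of slots */
	unsigned int size;	/* bytes per slot */
	get_size_fn get_size;	/* optional; NULL for fixed-size regsets */
};

/* Callers ask for the size without knowing whether it is dynamic. */
unsigned int regset_size_model(const struct regset_model *rs)
{
	if (!rs->get_size)
		return rs->n * rs->size;	/* theoretical maximum size */
	return rs->get_size(rs);
}

/* Example dynamic regset: only half of the slots currently hold data. */
unsigned int half_used(const struct regset_model *rs)
{
	return (rs->n / 2) * rs->size;
}
```

A coredump writer built on such a helper would emit only the bytes reported for a dynamic regset, while existing fixed-size regsets behave as before.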

Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: H. J. Lu <hjl.tools@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-11-03 15:24:11 +00:00
Mike Kravetz
ab615a5b87 fs/hugetlbfs/inode.c: fix hwpoison reserve accounting
Calling madvise(MADV_HWPOISON) on a hugetlbfs page will result in bad
(negative) reserved huge page counts.  This may not happen immediately,
but may happen later when the underlying file is removed or filesystem
unmounted.  For example:

  AnonHugePages:         0 kB
  ShmemHugePages:        0 kB
  HugePages_Total:       1
  HugePages_Free:        0
  HugePages_Rsvd:    18446744073709551615
  HugePages_Surp:        0
  Hugepagesize:       2048 kB

In routine hugetlbfs_error_remove_page(), hugetlb_fix_reserve_counts is
called after remove_huge_page.  hugetlb_fix_reserve_counts is designed
to be called only if a failure is returned from
hugetlb_unreserve_pages.  Therefore, call hugetlb_unreserve_pages as
required and only call hugetlb_fix_reserve_counts in the unlikely event
that hugetlb_unreserve_pages returns an error.
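
The huge HugePages_Rsvd value in the example is what an unsigned 64-bit counter shows after dropping one below zero. A minimal sketch of that arithmetic (the helpers `drop_one` and `as_signed` are hypothetical and do not model the hugetlb accounting itself):

```c
#include <stdint.h>

/* Decrement an unsigned 64-bit counter, as an unbalanced accounting step would. */
uint64_t drop_one(uint64_t counter)
{
	return counter - 1;	/* well-defined wraparound for unsigned types */
}

/* Interpret a wrapped counter as a signed quantity for diagnostics. */
int64_t as_signed(uint64_t counter)
{
	if (counter > (uint64_t)INT64_MAX)
		return -(int64_t)(UINT64_MAX - counter) - 1;
	return (int64_t)counter;
}
```

Read this way, 18446744073709551615 is simply -1, which is why the commit text calls the count negative.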

Link: http://lkml.kernel.org/r/20171019230007.17043-2-mike.kravetz@oracle.com
Fixes: 78bb920344 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-03 07:39:19 -07:00
Ashish Samant
105ddc93f0 ocfs2: fstrim: Fix start offset of first cluster group during fstrim
The first cluster group descriptor is not stored at the start of the
group but at an offset from the start.  We need to take this into
account while doing fstrim on the first cluster group.  Otherwise we
will wrongly start fstrim a few blocks after the desired start block and
the range can cross over into the next cluster group and zero out the
group descriptor there.  This can cause filesystem corruption that cannot
be fixed by fsck.

Link: http://lkml.kernel.org/r/1507835579-7308-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-03 07:39:19 -07:00
Huang Ying
b83d7e4323 mm, /proc/pid/pagemap: fix soft dirty marking for PMD migration entry
When the pagetable is walked in the implementation of /proc/<pid>/pagemap,
pmd_soft_dirty() is used for both the PMD huge page map and the PMD
migration entries.  That is wrong, pmd_swp_soft_dirty() should be used
for the PMD migration entries instead because the different page table
entry flag is used.

As a result, /proc/pid/pagemap may report incorrect soft dirty information
for PMD migration entries.

Link: http://lkml.kernel.org/r/20171017081818.31795-1-ying.huang@intel.com
Fixes: 84c3fc4e9c ("mm: thp: check pmd migration entry in common path")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-03 07:39:19 -07:00
Gilad Ben-Yossef
a5186b85fe cifs: move to generic async completion
cifs starts an async. crypto op and waits for its completion.
Move it over to generic code doing the same.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-11-03 22:11:21 +08:00
Gilad Ben-Yossef
d0082e1a7c fscrypt: move to generic async completion
fscrypt starts several async. crypto ops and waits for them to
complete. Move it over to generic code doing the same.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-11-03 22:11:20 +08:00
Christoph Hellwig
a39e596baa xfs: support for synchronous DAX faults
Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
blocks for writing and the inode is pinned, and has dirty fields other
than the timestamps.  In __xfs_filemap_fault() we then detect this case
and call dax_finish_sync_fault() to make sure all metadata is committed,
and to insert the page table entry.

Note that this will also dirty corresponding radix tree entry which is
what we want - fsync(2) will still provide data integrity guarantees for
applications not using userspace flushing. And applications using
userspace flushing can avoid calling fsync(2) and thus avoid the
performance overhead.

[JK: Added VM_SYNC flag handling]

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:26 -07:00
Jan Kara
7b565c9f96 xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
xfs_filemap_pfn_mkwrite() duplicates a lot of __xfs_filemap_fault().
It will also need to handle flushing for synchronous page faults. So
just make that function use __xfs_filemap_fault().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:26 -07:00
Jan Kara
b8a6176c21 ext4: Support for synchronous DAX faults
We return IOMAP_F_DIRTY flag from ext4_iomap_begin() when asked to
prepare blocks for writing and the inode has some uncommitted metadata
changes. In the fault handler ext4_dax_fault() we then detect this case
(through VM_FAULT_NEEDDSYNC return value) and call helper
dax_finish_sync_fault() to flush metadata changes and insert page table
entry. Note that this will also dirty corresponding radix tree entry
which is what we want - fsync(2) will still provide data integrity
guarantees for applications not using userspace flushing. And
applications using userspace flushing can avoid calling fsync(2) and
thus avoid the performance overhead.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:26 -07:00
Jan Kara
497f6926d8 ext4: Simplify error handling in ext4_dax_huge_fault()
If transaction starting fails, just bail out of the function immediately
instead of checking for that condition throughout the function.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:26 -07:00
Jan Kara
71eab6dfd9 dax: Implement dax_finish_sync_fault()
Implement a function that filesystems can call to finish handling of
synchronous page faults. It takes care of syncing the appropriate file range
and inserting the page table entry.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:25 -07:00
Jan Kara
caa51d26f8 dax, iomap: Add support for synchronous faults
Add a flag to iomap interface informing the caller that inode needs
fdatasync(2) for returned extent to become persistent and use it in DAX
fault code so that we don't map such extents into page tables
immediately. Instead we propagate the information that fdatasync(2) is
necessary from dax_iomap_fault() with a new VM_FAULT_NEEDDSYNC flag.
Filesystem fault handler is then responsible for calling fdatasync(2)
and inserting pfn into page tables.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:25 -07:00
Jan Kara
b6fb293f24 mm: Define MAP_SYNC and VM_SYNC flags
Define new MAP_SYNC flag and corresponding VMA VM_SYNC flag. As the
MAP_SYNC flag is not part of LEGACY_MAP_MASK, currently it will be
refused by all MAP_SHARED_VALIDATE map attempts and silently ignored for
everything else.
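
From userspace, a caller would be expected to combine the new flag with MAP_SHARED_VALIDATE so that missing support shows up as an error instead of being silently ignored. The following is a sketch under that assumption; `map_sync_file` is a hypothetical helper, and the fallback flag values are the ones from the generic Linux UAPI headers:

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Map len bytes of path with MAP_SYNC.  Returns 0 and stores the mapping in
 * *out on success, or a negative errno value on failure. */
int map_sync_file(const char *path, size_t len, void **out)
{
	int fd = open(path, O_RDWR);
	void *p;
	int err;

	if (fd < 0)
		return -errno;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	err = (p == MAP_FAILED) ? -errno : 0;
	close(fd);	/* the mapping, if any, stays valid after close */
	if (err)
		return err;
	*out = p;
	return 0;
}
```

A caller that gets an error back can fall back to a plain MAP_SHARED mapping and keep using fsync(2) or msync(2) for durability.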

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:25 -07:00
Jan Kara
f5b7b74876 dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
Currently we dirty radix tree entry whenever dax_insert_mapping_entry()
gets called for a write fault. With synchronous page faults we would
like to insert clean radix tree entry and dirty it only once we call
fdatasync() and update page tables to save some unnecessary cache
flushing. Add 'dirty' argument to dax_insert_mapping_entry() for that.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:25 -07:00
Jan Kara
9a0dd42251 dax: Allow dax_iomap_fault() to return pfn
For synchronous page fault dax_iomap_fault() will need to return PFN
which will then need to be inserted into page tables after fsync()
completes. Add necessary parameter to dax_iomap_fault().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:24 -07:00
Jan Kara
cec04e8c82 dax: Fix comment describing dax_iomap_fault()
Add missing argument description.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:24 -07:00
Jan Kara
302a5e312b dax: Inline dax_pmd_insert_mapping() into the callsite
dax_pmd_insert_mapping() has only one callsite and we will need to
further fine tune what it does for synchronous faults. Just inline it
into the callsite so that we don't have to pass awkward bools around.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:24 -07:00
Jan Kara
1b5a1cb21e dax: Inline dax_insert_mapping() into the callsite
dax_insert_mapping() has only one callsite and we will need to further
fine tune what it does for synchronous faults. Just inline it into the
callsite so that we don't have to pass awkward bools around.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:24 -07:00
Jan Kara
d2c43ef133 dax: Create local variable for vmf->flags & FAULT_FLAG_WRITE test
There are already two users and more are coming.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:23 -07:00
Jan Kara
a0987ad5c5 dax: Create local variable for VMA in dax_iomap_pte_fault()
There are already two users and more are coming.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:23 -07:00
Jan Kara
5e161e4066 dax: Factor out getting of pfn out of iomap
Factor out code to get pfn out of iomap that is shared between PTE and
PMD fault path.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:23 -07:00
Jan Kara
31a6f1a6e5 dax: Simplify arguments of dax_insert_mapping()
dax_insert_mapping() has lots of arguments and many of them are actually
duplicated by passing vm_fault structure as well. Change the function to
take the same arguments as dax_pmd_insert_mapping().

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-03 06:26:23 -07:00
Kees Cook
9b5dfbdd1f fs/ncpfs: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
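
The conversion relies on recovering the enclosing object from a pointer to the timer embedded in it. A userspace sketch of that pattern follows; `timer_model`, `server`, `owner_of` and `timer_model_setup` are stand-ins invented for illustration, not the kernel definitions:

```c
#include <stddef.h>

/* Minimal stand-in for the kernel timer type; not the real definition. */
struct timer_model {
	void (*function)(struct timer_model *t);
};

struct server {
	int timeouts;
	struct timer_model timer;	/* embedded, so the owner can be recovered */
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define owner_of(type, ptr, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Callback receives the timer pointer itself rather than an opaque argument. */
void server_timeout(struct timer_model *t)
{
	struct server *srv = owner_of(struct server, t, timer);

	srv->timeouts++;
}

void timer_model_setup(struct timer_model *t,
		       void (*fn)(struct timer_model *))
{
	t->function = fn;
}
```

Because the callback derives its context from the timer pointer, no separate data argument has to be stored alongside the timer.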

Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-11-02 15:44:09 -07:00
Linus Torvalds
ead751507d License cleanup: add SPDX license identifiers to some files
Many source files in the tree are missing licensing information, which
 makes it harder for compliance tools to determine the correct license.
 
 By default all files without license information are under the default
 license of the kernel, which is GPL version 2.
 
 Update the files which contain no license information with the 'GPL-2.0'
 SPDX license identifier.  The SPDX identifier is a legally binding
 shorthand, which can be used instead of the full boiler plate text.
 
 This patch is based on work done by Thomas Gleixner and Kate Stewart and
 Philippe Ombredanne.
 
 How this work was done:
 
 Patches were generated and checked against linux-4.14-rc6 for a subset of
 the use cases:
  - file had no licensing information in it.
  - file was a */uapi/* one with no licensing information in it,
  - file was a */uapi/* one with existing licensing information,
 
 Further patches will be generated in subsequent months to fix up cases
 where non-standard license headers were used, and references to license
 had to be inferred by heuristics based on keywords.
 
 The analysis to determine which SPDX License Identifier to be applied to
 a file was done in a spreadsheet of side by side results from the
 output of two independent scanners (ScanCode & Windriver) producing SPDX
 tag:value files created by Philippe Ombredanne.  Philippe prepared the
 base worksheet, and did an initial spot review of a few 1000 files.
 
 The 4.13 kernel was the starting point of the analysis with 60,537 files
 assessed.  Kate Stewart did a file by file comparison of the scanner
 results in the spreadsheet to determine which SPDX license identifier(s)
 to be applied to the file. She confirmed any determination that was not
 immediately clear with lawyers working with the Linux Foundation.
 
 Criteria used to select files for SPDX license identifier tagging was:
  - Files considered eligible had to be source code files.
  - Make and config files were included as candidates if they contained >5
    lines of source
  - File already had some variant of a license header in it (even if <5
    lines).
 
 All documentation files were explicitly excluded.
 
 The following heuristics were used to determine which SPDX license
 identifiers to apply.
 
  - when both scanners couldn't find any license traces, file was
    considered to have no license information in it, and the top level
    COPYING file license applied.
 
    For non */uapi/* files that summary was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0                                              11139
 
    and resulted in the first patch in this series.
 
    If that file was a */uapi/* path one, it was "GPL-2.0 WITH
    Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0 WITH Linux-syscall-note                        930
 
    and resulted in the second patch in this series.
 
  - if a file had some form of licensing information in it, and was one
    of the */uapi/* ones, it was denoted with the Linux-syscall-note if
    any GPL family license was found in the file or had no licensing in
    it (per prior point).  Results summary:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|------
    GPL-2.0 WITH Linux-syscall-note                       270
    GPL-2.0+ WITH Linux-syscall-note                      169
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
    LGPL-2.1+ WITH Linux-syscall-note                      15
    GPL-1.0+ WITH Linux-syscall-note                       14
    ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
    LGPL-2.0+ WITH Linux-syscall-note                       4
    LGPL-2.1 WITH Linux-syscall-note                        3
    ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
    ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
 
    and that resulted in the third patch in this series.
 
  - when the two scanners agreed on the detected license(s), that became
    the concluded license(s).
 
  - when there was disagreement between the two scanners (one detected a
    license but the other didn't, or they both detected different
    licenses) a manual inspection of the file occurred.
 
  - In most cases a manual inspection of the information in the file
    resulted in a clear resolution of the license that should apply (and
    which scanner probably needed to revisit its heuristics).
 
  - When it was not immediately clear, the license identifier was
    confirmed with lawyers working with the Linux Foundation.
 
  - If there was any question as to the appropriate license identifier,
    the file was flagged for further research and to be revisited later
    in time.
 
 In total, over 70 hours of logged manual review was done on the
 spreadsheet to determine the SPDX license identifiers to apply to the
 source files by Kate, Philippe, Thomas and, in some cases, confirmation
 by lawyers working with the Linux Foundation.
 
 Kate also obtained a third independent scan of the 4.13 code base from
 FOSSology, and compared selected files where the other two scanners
 disagreed against that SPDX file, to see if there were new insights.  The
 Windriver scanner is based on an older version of FOSSology in part, so
 they are related.
 
 Thomas did random spot checks in about 500 files from the spreadsheets
 for the uapi headers and agreed with SPDX license identifier in the
 files he inspected. For the non-uapi files Thomas did random spot checks
 in about 15000 files.
 
 In the initial set of patches against 4.14-rc6, 3 files were found to have
 copy/paste license identifier errors, and have been fixed to reflect the
 correct identifier.
 
 Additionally Philippe spent 10 hours this week doing a detailed manual
 inspection and review of the 12,461 patched files from the initial patch
 version early this week with:
  - a full scancode scan run, collecting the matched texts, detected
    license ids and scores
  - reviewing anything where there was a license detected (about 500+
    files) to ensure that the applied SPDX license was correct
  - reviewing anything where there was no detection but the patch license
    was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
    SPDX license was correct
 
 This produced a worksheet with 20 files needing minor correction.  This
 worksheet was then exported into 3 different .csv files for the
 different types of files to be modified.
 
 These .csv files were then reviewed by Greg.  Thomas wrote a script to
 parse the csv files and add the proper SPDX tag to the file, in the
 format that the file expected.  This script was further refined by Greg
 based on the output to detect more types of files automatically and to
 distinguish between header and source .c files (which need different
 comment types.)  Finally Greg ran the script using the .csv files to
 generate the patches.
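
 The header versus source distinction mentioned above can be sketched as follows. This is not the script used for the series; `spdx_line` is a hypothetical helper that assumes headers (.h) take a block comment and other C files a line comment:

```c
#include <stdio.h>
#include <string.h>

/* Write the SPDX tag line for path into buf.  Headers (.h) get a block
 * comment, other files a line comment.
 * Returns 0 on success, -1 if buf is too small. */
int spdx_line(const char *path, const char *id, char *buf, size_t len)
{
	const char *dot = strrchr(path, '.');
	int is_header = dot && strcmp(dot, ".h") == 0;
	int n;

	if (is_header)
		n = snprintf(buf, len, "/* SPDX-License-Identifier: %s */", id);
	else
		n = snprintf(buf, len, "// SPDX-License-Identifier: %s", id);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

 A real tool would also need to handle other file types, such as Makefiles and assembly, which use their own comment syntax.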
 
 Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
 Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
 Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull initial SPDX identifiers from Greg KH:
 "License cleanup: add SPDX license identifiers to some files

  Many source files in the tree are missing licensing information, which
  makes it harder for compliance tools to determine the correct license.

  By default all files without license information are under the default
  license of the kernel, which is GPL version 2.

  Update the files which contain no license information with the
  'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
  binding shorthand, which can be used instead of the full boiler plate
  text.

  This patch is based on work done by Thomas Gleixner and Kate Stewart
  and Philippe Ombredanne.

  How this work was done:

  Patches were generated and checked against linux-4.14-rc6 for a subset
  of the use cases:

   - file had no licensing information in it,

   - file was a */uapi/* one with no licensing information in it,

   - file was a */uapi/* one with existing licensing information.

  Further patches will be generated in subsequent months to fix up cases
  where non-standard license headers were used, and references to
  license had to be inferred by heuristics based on keywords.

  The analysis to determine which SPDX License Identifier to be applied
  to a file was done in a spreadsheet of side by side results from
  the output of two independent scanners (ScanCode & Windriver)
  producing SPDX tag:value files created by Philippe Ombredanne.
  Philippe prepared the base worksheet, and did an initial spot review
  of a few thousand files.

  The 4.13 kernel was the starting point of the analysis with 60,537
  files assessed. Kate Stewart did a file by file comparison of the
  scanner results in the spreadsheet to determine which SPDX license
  identifier(s) to be applied to the file. She confirmed any
  determination that was not immediately clear with lawyers working with
  the Linux Foundation.

  Criteria used to select files for SPDX license identifier tagging were:

   - Files considered eligible had to be source code files.

   - Make and config files were included as candidates if they contained
     >5 lines of source

   - File already had some variant of a license header in it (even if <5
     lines).

  All documentation files were explicitly excluded.

  The following heuristics were used to determine which SPDX license
  identifiers to apply.

   - when both scanners couldn't find any license traces, file was
     considered to have no license information in it, and the top level
     COPYING file license applied.

     For non */uapi/* files that summary was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0                                              11139

     and resulted in the first patch in this series.

     If that file was a */uapi/* path one, it was "GPL-2.0 WITH
     Linux-syscall-note"; otherwise it was "GPL-2.0". The results
     were:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0 WITH Linux-syscall-note                        930

     and resulted in the second patch in this series.

   - if a file had some form of licensing information in it, and was one
     of the */uapi/* ones, it was denoted with the Linux-syscall-note if
     any GPL family license was found in the file or had no licensing in
     it (per prior point). Results summary:

       SPDX license identifier                            # files
       ---------------------------------------------------|------
       GPL-2.0 WITH Linux-syscall-note                       270
       GPL-2.0+ WITH Linux-syscall-note                      169
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
       LGPL-2.1+ WITH Linux-syscall-note                      15
       GPL-1.0+ WITH Linux-syscall-note                       14
       ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
       LGPL-2.0+ WITH Linux-syscall-note                       4
       LGPL-2.1 WITH Linux-syscall-note                        3
       ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
       ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

     and that resulted in the third patch in this series.

   - when the two scanners agreed on the detected license(s), that
     became the concluded license(s).

   - when there was disagreement between the two scanners (one detected
     a license but the other didn't, or they both detected different
     licenses) a manual inspection of the file occurred.

   - In most cases a manual inspection of the information in the file
     resulted in a clear resolution of the license that should apply
     (and which scanner probably needed to revisit its heuristics).

   - When it was not immediately clear, the license identifier was
     confirmed with lawyers working with the Linux Foundation.

   - If there was any question as to the appropriate license identifier,
     the file was flagged for further research and to be revisited later
     in time.

  In total, over 70 hours of logged manual review was done on the
  spreadsheet to determine the SPDX license identifiers to apply to the
  source files by Kate, Philippe, Thomas and, in some cases,
  confirmation by lawyers working with the Linux Foundation.

  Kate also obtained a third independent scan of the 4.13 code base from
  FOSSology, and compared selected files where the other two scanners
  disagreed against that SPDX file, to see if there were new insights.
  The Windriver scanner is based on an older version of FOSSology in
  part, so they are related.

  Thomas did random spot checks in about 500 files from the spreadsheets
  for the uapi headers and agreed with SPDX license identifier in the
  files he inspected. For the non-uapi files Thomas did random spot
  checks in about 15000 files.

  In the initial set of patches against 4.14-rc6, 3 files were found to have
  copy/paste license identifier errors, and have been fixed to reflect
  the correct identifier.

  Additionally Philippe spent 10 hours this week doing a detailed manual
  inspection and review of the 12,461 patched files from the initial
  patch version early this week with:

   - a full scancode scan run, collecting the matched texts, detected
     license ids and scores

   - reviewing anything where there was a license detected (about 500+
     files) to ensure that the applied SPDX license was correct

   - reviewing anything where there was no detection but the patch
     license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
     applied SPDX license was correct

  This produced a worksheet with 20 files needing minor correction. This
  worksheet was then exported into 3 different .csv files for the
  different types of files to be modified.

  These .csv files were then reviewed by Greg. Thomas wrote a script to
  parse the csv files and add the proper SPDX tag to the file, in the
  format that the file expected. This script was further refined by Greg
  based on the output to detect more types of files automatically and to
  distinguish between header and source .c files (which need different
  comment types). Finally Greg ran the script using the .csv files to
  generate the patches.

  Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
  Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
  Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  License cleanup: add SPDX license identifier to uapi header files with a license
  License cleanup: add SPDX license identifier to uapi header files with no license
  License cleanup: add SPDX GPL-2.0 license identifier to files with no license
2017-11-02 10:04:46 -07:00
Juerg Haefliger
88a96fa841 jfs: Add missing NULL pointer check in __get_metapage
alloc_metapage can return a NULL pointer so check for that.

Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2017-11-02 09:46:50 -05:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information in it,
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information.

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging were:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note"; otherwise it was "GPL-2.0".  The results were:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.
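
The agree/disagree heuristics above can be sketched roughly as follows.
This is a minimal illustration, not the spreadsheet process that was
actually used: the function names, the use of None for "no license
detected", and the example paths are all assumptions, and the handling of
uapi files that already carry a license is left out.

```python
from fnmatch import fnmatchcase
from typing import Optional, Tuple

SYSCALL_NOTE = "Linux-syscall-note"


def is_uapi(path: str) -> bool:
    """True when the path matches the */uapi/* pattern from the message."""
    return fnmatchcase(path, "*/uapi/*")


def conclude_license(
    path: str,
    scancode: Optional[str],
    windriver: Optional[str],
) -> Tuple[Optional[str], bool]:
    """Return (identifier, needs_manual_review) for one file.

    `scancode` and `windriver` are the identifiers each scanner reported,
    or None when it found no license traces (hypothetical encoding).
    """
    if scancode is None and windriver is None:
        # Neither scanner found anything: the top level COPYING license
        # applies, with the syscall note for */uapi/* files.
        if is_uapi(path):
            return f"GPL-2.0 WITH {SYSCALL_NOTE}", False
        return "GPL-2.0", False
    if scancode == windriver:
        # Both scanners agree: that becomes the concluded license.
        return scancode, False
    # One scanner detected a license the other did not, or they differ:
    # no automatic conclusion, flag the file for manual inspection.
    return None, True


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(conclude_license("kernel/example.c", None, None))
    print(conclude_license("include/uapi/linux/example.h", None, None))
    print(conclude_license("drivers/example.c", "GPL-2.0", None))
```

Returning a separate review flag instead of guessing keeps disagreements
visible, which mirrors the manual inspection step described above.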

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types).  Finally Greg ran the script using the .csv files to
generate the patches.
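
As an illustration of why header and source files need different comment
types, a tagging step might look like the sketch below.  This is not the
script Thomas and Greg used; the function names and the extension-based
dispatch are assumptions, and the '#' fallback for everything that is not
a .c or .h file is a simplification.

```python
import os


def spdx_line(path: str, identifier: str) -> str:
    """Build the SPDX tag line in the comment style the file type expects."""
    tag = f"SPDX-License-Identifier: {identifier}"
    ext = os.path.splitext(path)[1]
    if ext == ".c":
        return f"// {tag}"      # C source files: C++ style comment
    if ext == ".h":
        return f"/* {tag} */"   # header files: C style comment
    # Assumption: treat anything else (Makefiles, scripts) as '#'-commented.
    return f"# {tag}"


def add_spdx_tag(text: str, path: str, identifier: str) -> str:
    """Return `text` with the SPDX line as the first possible line.

    A leading "#!" interpreter line is kept first, with the tag after it.
    """
    line = spdx_line(path, identifier) + "\n"
    lines = text.splitlines(keepends=True)
    if lines and lines[0].startswith("#!"):
        first = lines[0] if lines[0].endswith("\n") else lines[0] + "\n"
        return first + line + "".join(lines[1:])
    return line + text


if __name__ == "__main__":
    # Hypothetical file contents and paths, for illustration only.
    print(add_spdx_tag("int x;\n", "lib/example.c", "GPL-2.0"), end="")
    print(add_spdx_tag("#define X 1\n", "include/linux/example.h", "GPL-2.0"), end="")
```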

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
David S. Miller
ed29668d1a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Smooth Cong Wang's bug fix into 'net-next'.  Basically put
the bulk of the tcf_block_put() logic from 'net' into
tcf_block_put_ext(), but after the offload unbind.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 15:23:39 +09:00
Dave Chinner
5d0eda0307 xfs: convert remaining xfs_sb_version_... checks to bool
Some were missed in the pass that converted the function return
values from int to bool. Update the remaining ones for consistency.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-01 15:03:16 -07:00
Darrick J. Wong
13791d3b83 xfs: scrub extended attribute leaf space
As we walk the attribute btree, explicitly check the structure of the
attribute leaves to make sure the pointers make sense and the freemap is
sensible.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-01 15:03:16 -07:00
Darrick J. Wong
e9e899a2a8 xfs: move error injection tags into their own file
Move the error injection tag names into a libxfs header so that we can
share it between kernel and userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-01 15:03:16 -07:00
Darrick J. Wong
06b1132120 xfs: remove inode log format typedef
Remove xfs_inode_log_format_t now that xfs_inode_log_format is
explicitly padded and therefore is a real on-disk structure.  This
enables xfs/122 to check the size of the structure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-11-01 15:03:16 -07:00