Commit Graph

2220 Commits

Author SHA1 Message Date
xuejiufei
60d663cb52 ocfs2/dlm: add DEREF_DONE message
This series of patches is to fix the dis-order issue of setting/clearing
refmap bit described below.

Node 1                               Node 2(master)
dlmlock
dlm_do_master_request
                                dlm_master_request_handler
                                -> dlm_lockres_set_refmap_bit
dlmlock succeed
dlmunlock succeed

dlm_purge_lockres
                                dlm_deref_handler
                                -> find lock resource is in
                                   DLM_LOCK_RES_SETREF_INPROG state,
                                   so dispatch a deref work
dlm_purge_lockres succeed.

call dlmlock again
dlm_do_master_request
                                dlm_master_request_handler
                                -> dlm_lockres_set_refmap_bit

                                deref work trigger, call
                                dlm_lockres_clear_refmap_bit
                                to clear Node 1 from refmap

                                dlm_purge_lockres succeed

dlm_send_remote_lock_request
                                return DLM_IVLOCKID because
                                the lockres is not exist
BUG if the lockres is $RECOVERY

This series of patches add a new message to keep the order of set and
clear.  Other nodes can purge the lock resource only after the refmap bit
on master is cleared.

This patch is to add DEREF_DONE message and corresponding handler.  Node
can purge the lock resource after receiving this message.  As a new
message is added, so increase the minor number of dlm protocol version.

Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Joseph Qi
39b29af030 ocfs2/dlm: fix a typo in dlmcommon.h
Refer to cluster/tcp.h, NET_MAX_PAYLOAD_BYTES is a typo for
O2NET_MAX_PAYLOAD_BYTES.

Since currently DLM_MIG_LOCKRES_RESERVED is not actually used, it won't
cause any problem.  But we'd better correct it for further use.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
jiangyiwen
bfd97a0320 ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
Commit a75e9ccabd ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so it still exists a
deadlock scenario.

    ocfs2_wake_downconvert_thread
    ocfs2_rw_unlock
    ocfs2_dio_end_io
    dio_complete
    .....
    bio_endio
    req_bio_endio
    ....
    scsi_io_completion
    blk_done_softirq
    __do_softirq
    do_softirq
    irq_exit
    do_IRQ
    ocfs2_osb_dump
    cat /sys/kernel/debug/ocfs2/${uuid}/fs_state

This patch still uses spin_lock_irqsave() - replace spin_lock() to solve
this situation.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
jiangyiwen
4d548f61d6 ocfs2/cluster: replace the interrupt safe spinlocks with common ones
There actually no hardware or software interrupts in the context which
using o2hb_live_lock, so we don't need to worry about race conditions
caused by irq/softirq with spinlock held.  Turning off irq is not good
for system performance after all.  Just replace them with a non
interrupt safe function.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Jan Kara
566e8dfd88 ocfs2: fix return value from ocfs2_page_mkwrite()
ocfs2_page_mkwrite() could mistakenly return error code instead of
mkwrite status value.  Fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Christoph Hellwig
1ae1602de0 configfs: switch ->default groups to a linked list
Replace the current NULL-terminated array of default groups with a linked
list.  This gets rid of lots of nasty code to size and/or dynamically
allocate the array.

While we're at it also provide a conveniant helper to remove the default
groups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Felipe Balbi <balbi@kernel.org>		[drivers/usb/gadget]
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
2016-03-06 16:11:24 +01:00
Guozhonghua
a4a8481ff6 ocfs2: unlock inode if deleting inode from orphan fails
When doing append direct io cleanup, if deleting inode fails, it goes
out without unlocking inode, which will cause the inode deadlock.

This issue was introduced by commit cf1776a9e8 ("ocfs2: fix a tiny
race when truncate dio orohaned entry").

Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>	[4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Jan Kara
5a9530e498 ocfs2: Implement get_next_id()
Implement get_next_id() callback to enable use of Q_GETNEXTQUOTA
quotactl for OCFS2.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-09 13:05:23 +01:00
Christoph Hellwig
187372a3b9 direct-io: always call ->end_io if non-NULL
This way we can pass back errors to the file system, and allow for
cleanup required for all direct I/O invocations.

Also allow the ->end_io handlers to return errors on their own, so that
I/O completion errors can be passed on to the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:40:51 +11:00
xuejiufei
c95a51807b ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup
When recovery master down, dlm_do_local_recovery_cleanup() only remove
the $RECOVERY lock owned by dead node, but do not clear the refmap bit.
Which will make umount thread falling in dead loop migrating $RECOVERY
to the dead node.

Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05 18:10:40 -08:00
Joseph Qi
a4a1dfa4bb ocfs2/cluster: fix memory leak in o2hb_region_release
o2hb_region_release currently doesn't free o2hb_debug_buf
hr_db_elapsed_time and hr_db_pinned malloced in o2hb_debug_create.  Also
we should call debugfs_remove before freeing its data, to prevent the risk
accessing debugfs rightly after its data has been freed.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03 08:28:43 -08:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Tariq Saeed
b1b1e15ef6 ocfs2: NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock
NFS on a 2 node ocfs2 cluster each node exporting dir.  The lock causing
the hang is the global bit map inode lock.  Node 1 is master, has the
lock granted in PR mode; Node 2 is in the converting list (PR -> EX).
There are no holders of the lock on the master node so it should
downconvert to NL and grant EX to node 2 but that does not happen.
BLOCKED + QUEUED in lock res are set and it is on osb blocked list.
Threads are waiting in __ocfs2_cluster_lock on BLOCKED.  One thread
wants EX, rest want PR.  So it is as though the downconvert thread needs
to be kicked to complete the conv.

The hang is caused by an EX req coming into __ocfs2_cluster_lock on the
heels of a PR req after it sets BUSY (drops l_lock, releasing EX
thread), forcing the incoming EX to wait on BUSY without doing anything.
PR has called ocfs2_dlm_lock, which sets the node 1 lock from NL -> PR,
queues ast.

At this time, upconvert (PR ->EX) arrives from node 2, finds conflict
with node 1 lock in PR, so the lock res is put on dlm thread's dirty
listt.

After ret from ocf2_dlm_lock, PR thread now waits behind EX on BUSY till
awoken by ast.

Now it is dlm_thread that serially runs dlm_shuffle_lists, ast, bast, in
that order.  dlm_shuffle_lists ques a bast on behalf of node 2 (which
will be run by dlm_thread right after the ast).  ast does its part, sets
UPCONVERT_FINISHING, clears BUSY and wakes its waiters.  Next,
dlm_thread runs bast.  It sets BLOCKED and kicks dc thread.  dc thread
runs ocfs2_unblock_lock, but since UPCONVERT_FINISHING set, skips doing
anything and reques.

Inside of __ocfs2_cluster_lock, since EX has been waiting on BUSY ahead
of PR, it wakes up first, finds BLOCKED set and skips doing anything but
clearing UPCONVERT_FINISHING (which was actually "meant" for the PR
thread), and this time waits on BLOCKED.  Next, the PR thread comes out
of wait but since UPCONVERT_FINISHING is not set, it skips updating the
l_ro_holders and goes straight to wait on BLOCKED.  So there, we have a
hang! Threads in __ocfs2_cluster_lock wait on BLOCKED, lock res in osb
blocked list.  Only when dc thread is awoken, it will run
ocfs2_unblock_lock and things will unhang.

One way to fix this is to wake the dc thread on the flag after clearing
UPCONVERT_FINISHING

Orabug: 20933419
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Eric Ren <zren@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 17:20:51 -08:00
Linus Torvalds
71e4634e00 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull SCSI target updates from Nicholas Bellinger:
 "The highlights this round include:

   - Introduce configfs support for unlocked configfs_depend_item()
     (krzysztof + andrezej)
   - Conversion of usb-gadget target driver to new function registration
     interface (andrzej + sebastian)
   - Enable qla2xxx FC target mode support for Extended Logins (himansu +
     giridhar)
   - Enable qla2xxx FC target mode support for Exchange Offload (himansu +
     giridhar)
   - Add qla2xxx FC target mode irq affinity notification + selective
     command queuing.  (quinn + himanshu)
   - Fix iscsi-target deadlock in se_node_acl configfs deletion (sagi +
     nab)
   - Convert se_node_acl configfs deletion + se_node_acl->queue_depth to
     proper se_session->sess_kref + target_get_session() usage.  (hch +
     sagi + nab)
   - Fix long-standing race between se_node_acl->acl_kref get and
     get_initiator_node_acl() lookup.  (hch + nab)
   - Fix target/user block-size handling, and make sure netlink reaches
     all network namespaces (sheng + andy)

  Note there is an outstanding bug-fix series for remote I_T nexus port
  TMR LUN_RESET has been posted and still being tested, and will likely
  become post -rc1 material at this point"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (56 commits)
  scsi: qla2xxxx: avoid type mismatch in comparison
  target/user: Make sure netlink would reach all network namespaces
  target: Obtain se_node_acl->acl_kref during get_initiator_node_acl
  target: Convert ACL change queue_depth se_session reference usage
  iscsi-target: Fix potential dead-lock during node acl delete
  ib_srpt: Convert acl lookup to modern get_initiator_node_acl usage
  tcm_fc: Convert acl lookup to modern get_initiator_node_acl usage
  tcm_fc: Wait for command completion before freeing a session
  target: Fix a memory leak in target_dev_lba_map_store()
  target: Support aborting tasks with a 64-bit tag
  usb/gadget: Remove set-but-not-used variables
  target: Remove an unused variable
  target: Fix indentation in target_core_configfs.c
  target/user: Allow user to set block size before enabling device
  iser-target: Fix non negative ERR_PTR isert_device_get usage
  target/fcoe: Add tag support to tcm_fc
  qla2xxx: Check for online flag instead of active reset when transmitting responses
  qla2xxx: Set all queues to 4k
  qla2xxx: Disable ZIO at start time.
  qla2xxx: Move atioq to a different lock to reduce lock contention
  ...
2016-01-20 17:20:53 -08:00
Linus Torvalds
1d3671df72 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fixes and quota cleanups from Jan Kara:
 "Several UDF fixes and some minor quota cleanups"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Check output buffer length when converting name to CS0
  udf: Prevent buffer overrun with multi-byte characters
  quota: constify qtree_fmt_operations structures
  udf: avoid uninitialized variable use
  udf: Fix lost indirect extent block
  udf: Factor out code for creating indirect extent
  udf: limit the maximum number of indirect extents in a row
  udf: limit the maximum number of TD redirections
  fs: make quota/dquot.c explicitly non-modular
  fs: make quota/netlink.c explicitly non-modular
2016-01-15 11:51:51 -08:00
Vladimir Davydov
5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Joseph Qi
3c973b0e71 ocfs2/dlm: cleanup redunant lksb flags in dlmcommon.h
lksb flags are defined both in dlmapi.h and dlmcommon.h.  So clean them
up from dlmcommon.h.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Junxiao Bi
98e141f266 ocfs2: dlm: remove redundant code
Found this when do patch review, remove to make it clear and save a
little cpu time.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Joseph Qi
074a6c655f ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
In ocfs2_orphan_del, currently it finds and deletes entry first, and
then access orphan dir dinode.  This will have a problem once
ocfs2_journal_access_di fails.  In this case, entry will be removed from
orphan dir, but in deed the inode hasn't been deleted successfully.  In
other words, the file is missing but not actually deleted.  So we should
access orphan dinode first like unlink and rename.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
xuejiufei
32e493265b ocfs2/dlm: do not insert a new mle when another process is already migrating
When two processes are migrating the same lockres,
dlm_add_migration_mle() return -EEXIST, but insert a new mle in hash
list.  dlm_migrate_lockres() will detach the old mle and free the new
one which is already in hash list, that will destroy the list.

Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
xuejiufei
bef5502de0 ocfs2/dlm: ignore cleaning the migration mle that is inuse
We have found that migration source will trigger a BUG that the refcount
of mle is already zero before put when the target is down during
migration.  The situation is as follows:

dlm_migrate_lockres
  dlm_add_migration_mle
  dlm_mark_lockres_migrating
  dlm_get_mle_inuse
  <<<<<< Now the refcount of the mle is 2.
  dlm_send_one_lockres and wait for the target to become the
  new master.
  <<<<<< o2hb detect the target down and clean the migration
  mle. Now the refcount is 1.

dlm_migrate_lockres woken, and put the mle twice when found the target
goes down which trigger the BUG with the following message:

  "ERROR: bad mle: ".

Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Goldwyn Rodrigues
1cce4df04f ocfs2: do not lock/unlock() inode DLM lock
DLM does not cache locks.  So, blocking lock and unlock will only make
the performance worse where contention over the locks is high.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
jiangyiwen
1247017f43 ocfs2: fix slot overwritten if storage link down during mount
The following case will lead to slot overwritten.

N1                               N2
mount ocfs2 volume, find and
allocate slot 0, then set
osb->slot_num to 0, begin to
write slot info to disk
                                 mount ocfs2 volume, wait for super lock
write block fail because of
storage link down, unlock
super lock
                                 got super lock and also allocate slot 0
                                 then unlock super lock

mount fail and then dismount,
since osb->slot_num is 0, try to
put invalid slot to disk. And it
will succeed if storage link
restores.
                                 N2 slot info is now overwritten

Once another node say N3 mount, it will find and allocate slot 0 again,
which will lead to mount hung because journal has already been locked by
N2.  so when write slot info failed, invalidate slot in advance to avoid
overwrite slot.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Xue jiufei
c372f2193a ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
dlm_grab() may return NULL when the node is doing unmount.  When doing
code review, we found that some dlm handlers may return error to caller
when dlm_grab() returns NULL and make caller BUG or other problems.
Here is an example:

Node 1                                 Node 2
receives migration message
from node 3, and send
migrate request to others
                                     start unmounting

                                     receives migrate request
                                     from node 1 and call
                                     dlm_migrate_request_handler()

                                     unmount thread unregisters
                                     domain handlers and removes
                                     dlm_context from dlm_domains

                                     dlm_migrate_request_handlers()
                                     returns -EINVAL to node 1
Exit migration neither clearing the
migration state nor sending
assert master message to node 3 which
cause node 3 hung.

Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Joseph Qi
72865d9230 ocfs2: clean up redundant NULL check before iput
Since iput will take care the NULL check itself, NULL check before
calling it is redundant.  So clean them up.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
jiangyiwen
b556014338 ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
Commit f3f854648d ("ocfs2_dlm: Ensure correct ordering of set/clear
refmap bit on lockres") still exists a race which can't ensure the
ordering is exactly correct.

Node1               Node2                    Node3
umount, migrate
lockres to Node2
                    migrate finished,
                    send migrate request
                    to Node3
                                              received migrate request,
                                              create a migration_mle,
                                              respond to Node2.
                    set DLM_LOCK_RES_SETREF_INPROG
                    and send assert master to
                    Node3
                                              delete migration_mle in
                                              assert_master_handler,
                                              Node3 umount without response
                                              dlm_thread purge
                                              this lockres, send drop
                                              deref message to Node2
                    found the flag of
                    DLM_LOCK_RES_SETREF_INPROG
                    is set, dispatch
                    dlm_deref_lockres_worker to
                    clear refmap, but in function of
                    dlm_deref_lockres_worker,
                    only if node in refmap it wait
                    DLM_LOCK_RES_SETREF_INPROG
                    to be cleared. So worker is
                    done successfully

                                              purge lockres, send
                                              assert master response
                                              to Node1, and finish umount
                    set Node3 in refmap, and it
                    won't be cleared forever, thus
                    lead to umount hung

so wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
dlm_deref_lockres_worker.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Julia Lawall
9e62dc096e ocfs2: constify ocfs2_extent_tree_operations structures
The ocfs2_extent_tree_operations structures are never modified, so
declare them as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Xue jiufei
30bee898f8 ocfs2/dlm: fix a race between purge and migration
We found a race between purge and migration when doing code review.
Node A put lockres to purgelist before receiving the migrate message
from node B which is the master.  Node A call dlm_mig_lockres_handler to
handle this message.

dlm_mig_lockres_handler
  dlm_lookup_lockres
  >>>>>> race window, dlm_run_purge_list may run and send
         deref message to master, waiting the response
  spin_lock(&res->spinlock);
  res->state |= DLM_LOCK_RES_MIGRATING;
  spin_unlock(&res->spinlock);
  dlm_mig_lockres_handler returns

  >>>>>> dlm_thread receives the response from master for the deref
  message and triggers the BUG because the lockres has the state
  DLM_LOCK_RES_MIGRATING with the following message:

dlm_purge_lockres:209 ERROR: 6633EB681FA7474A9C280A4E1A836F0F: res
M0000000000000000030c0300000000 in use after deref

Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Junxiao Bi
a84ac334dc ocfs2: o2hb: increase unsteady iterations
When run multiple xattr test of ocfs2-test on a three-nodes cluster,
mount failed sometimes with the following message.

  o2hb: Unable to stabilize heartbeart on region D18B775E758D4D80837E8CF3D086AD4A (xvdb)

Stabilize heartbeat depends on the timing order to mount ocfs2 from
cluster nodes and how fast the tcp connections are established.  So
increase unsteady interations to leave more time for it.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
John Haxby
d6364627ef ocfs2: return non-zero st_blocks for inline data
Some versions of tar assume that files with st_blocks == 0 do not
contain any data and will skip reading them entirely.  See also commit
9206c56155 ("ext4: return non-zero st_blocks for inline data").

Signed-off-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Gang He <ghe@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Norton.Zhu
3eb5bdf0f4 ocfs2: optimize bad declarations and redundant assignment
In ocfs2_parse_options,

a) it's better to declare variables(small size) outside of while loop;

b) 'option' will be set by match_int, 'option = 0;' makes no sense, if
   match_int failed, it just goto bail and return.

Signed-off-by: Norton.Zhu <norton.zhu@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Linus Torvalds
ddf1d6238d Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "Andreas' xattr cleanup series.

  It's a followup to his xattr work that went in last cycle; -0.5KLoC"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xattr handlers: Simplify list operation
  ocfs2: Replace list xattr handler operations
  nfs: Move call to security_inode_listsecurity into nfs_listxattr
  xfs: Change how listxattr generates synthetic attributes
  tmpfs: listxattr should include POSIX ACL xattrs
  tmpfs: Use xattr handler infrastructure
  btrfs: Use xattr handler infrastructure
  vfs: Distinguish between full xattr names and proper prefixes
  posix acls: Remove duplicate xattr name definitions
  gfs2: Remove gfs2_xattr_acl_chmod
  vfs: Remove vfs_xattr_cmp
2016-01-11 13:32:10 -08:00
Linus Torvalds
32fb378437 Merge branch 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs RCU symlink updates from Al Viro:
 "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
  even if the symlink is not an embedded one.

  No changes since the mailbomb on Jan 1"

* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch ->get_link() to delayed_call, kill ->put_link()
  kill free_page_put_link()
  teach nfs_get_link() to work in RCU mode
  teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
  teach shmem_get_link() to work in RCU mode
  teach page_get_link() to work in RCU mode
  replace ->follow_link() with new method that could stay in RCU mode
  don't put symlink bodies in pagecache into highmem
  namei: page_getlink() and page_follow_link_light() are the same thing
  ufs: get rid of ->setattr() for symlinks
  udf: don't duplicate page_symlink_inode_operations
  logfs: don't duplicate page_symlink_inode_operations
  switch befs long symlinks to page_symlink_operations
2016-01-11 13:13:23 -08:00
Julia Lawall
d1b98c23f7 quota: constify qtree_fmt_operations structures
The qtree_fmt_operations structures are never modified, so declare them as
const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-01-04 10:58:35 +01:00
Al Viro
fceef393a5 switch ->get_link() to delayed_call, kill ->put_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-30 13:01:03 -05:00
xuejiufei
cc28d6d80f ocfs2/dlm: clear migration_pending when migration target goes down
We have found a BUG on res->migration_pending when migrating lock
resources.  The situation is as follows.

dlm_mark_lockres_migration
  res->migration_pending = 1;
  __dlm_lockres_reserve_ast
  dlm_lockres_release_ast returns with res->migration_pending remains
      because other threads reserve asts
  wait dlm_migration_can_proceed returns 1
  >>>>>>> o2hb found that target goes down and remove target
          from domain_map
  dlm_migration_can_proceed returns 1
  dlm_mark_lockres_migrating returns -ESHOTDOWN with
      res->migration_pending still remains.

When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
with res->migration_pending.  So clear migration_pending when target is
down.

Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-29 17:45:49 -08:00
Junxiao Bi
b5a8bc338e ocfs2: fix flock panic issue
Commit 4f6563677a ("Move locks API users to locks_lock_inode_wait()")
move flock/posix lock indentify code to locks_lock_inode_wait(), but
missed to set fl_flags to FL_FLOCK which caused the following kernel
panic on 4.4.0_rc5.

  kernel BUG at fs/locks.c:1895!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ocfs2(O) ocfs2_dlmfs(O) ocfs2_stack_o2cb(O) ocfs2_dlm(O) ocfs2_nodemanager(O) ocfs2_stackglue(O) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront
  CPU: 0 PID: 20268 Comm: flock_unit_test Tainted: G           O    4.4.0-rc5-next-20151217 #1
  Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
  task: ffff88007b3672c0 ti: ffff880028b58000 task.ti: ffff880028b58000
  RIP: locks_lock_inode_wait+0x2e/0x160
  Call Trace:
    ocfs2_do_flock+0x91/0x160 [ocfs2]
    ocfs2_flock+0x76/0xd0 [ocfs2]
    SyS_flock+0x10f/0x1a0
    entry_SYSCALL_64_fastpath+0x12/0x71
  Code: e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 89 f3 48 81 ec 88 00 00 00 8b 46 40 83 e0 03 83 f8 01 0f 84 ad 00 00 00 83 f8 02 74 04 <0f> 0b eb fe 4c 8d ad 60 ff ff ff 4c 8d 7b 58 e8 0e 8e 73 00 4d
  RIP  locks_lock_inode_wait+0x2e/0x160
   RSP <ffff880028b5bce8>
  ---[ end trace dfca74ec9b5b274c ]---

Fixes: 4f6563677a ("Move locks API users to locks_lock_inode_wait()")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-29 17:45:49 -08:00
Joseph Qi
5c9ee4cbf2 ocfs2: fix BUG when calculate new backup super
When resizing, it firstly extends the last gd.  Once it should backup
super in the gd, it calculates new backup super and update the
corresponding value.

But it currently doesn't consider the situation that the backup super is
already done.  And in this case, it still sets the bit in gd bitmap and
then decrease from bg_free_bits_count, which leads to a corrupted gd and
trigger the BUG in ocfs2_block_group_set_bits:

    BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits);

So check whether the backup super is done and then do the updates.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-29 17:45:49 -08:00
Krzysztof Opasiak
9a9e3415ed fs: configfs: Drop unused parameter from configfs_undepend_item()
subsys parameter is never used by configfs_undepend_item()
so there is no point in passing it to this function.

Signed-off-by: Krzysztof Opasiak <k.opasiak@samsung.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-12-20 18:03:57 -08:00
Andreas Gruenbacher
1046cb1195 ocfs2: Replace list xattr handler operations
The list operations of the ocfs2 xattr handlers were never called
anywhere.  Remove them and directly check in ocfs2_xattr_list_entry
which attributes should be skipped over instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: ocfs2-devel@oss.oracle.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-13 19:46:00 -05:00
Junxiao Bi
854ee2e944 ocfs2: fix SGID not inherited issue
Commit 8f1eb48758 ("ocfs2: fix umask ignored issue") introduced an
issue, SGID of sub dir was not inherited from its parents dir.  It is
because SGID is set into "inode->i_mode" in ocfs2_get_init_inode(), but
is overwritten by "mode" which don't have SGID set later.

Fixes: 8f1eb48758 ("ocfs2: fix umask ignored issue")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12 10:15:34 -08:00
Al Viro
6b2553918d replace ->follow_link() with new method that could stay in RCU mode
new method: ->get_link(); replacement of ->follow_link().  The differences
are:
	* inode and dentry are passed separately
	* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
	* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.

It's a flagday change - the old method is gone, all in-tree instances
converted.  Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode.  That'll change
in the next commits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:54 -05:00
Al Viro
21fc61c73c don't put symlink bodies in pagecache into highmem
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.

new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases.  page_follow_link_light()
instrumented to yell about anything missed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:36 -05:00
Andreas Gruenbacher
98e9cb5711 vfs: Distinguish between full xattr names and proper prefixes
Add an additional "name" field to struct xattr_handler.  When the name
is set, the handler matches attributes with exactly that name.  When the
prefix is set instead, the handler matches attributes with the given
prefix and with a non-empty suffix.

This patch should avoid bugs like the one fixed in commit c361016a in
the future.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:33:52 -05:00
Junxiao Bi
8f1eb48758 ocfs2: fix umask ignored issue
New created file's mode is not masked with umask, and this makes umask not
work for ocfs2 volume.

Fixes: 702e5bc ("ocfs2: use generic posix ACL infrastructure")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-20 16:17:32 -08:00
Linus Torvalds
9aa3d651a9 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull SCSI target updates from Nicholas Bellinger:
 "This series contains HCH's changes to absorb configfs attribute
  ->show() + ->store() function pointer usage from it's original
  tree-wide consumers, into common configfs code.

  It includes usb-gadget, target w/ drivers, netconsole and ocfs2
  changes to realize the improved simplicity, that now renders the
  original include/target/configfs_macros.h CPP magic for fabric drivers
  and others, unnecessary and obsolete.

  And with common code in place, new configfs attributes can be added
  easier than ever before.

  Note, there are further improvements in-flight from other folks for
  v4.5 code in configfs land, plus number of target fixes for post -rc1
  code"

In the meantime, a new user of the now-removed old configfs API came in
through the char/misc tree in commit 7bd1d4093c ("stm class: Introduce
an abstraction for System Trace Module devices").

This merge resolution comes from Alexander Shishkin, who updated his stm
class tracing abstraction to account for the removal of the old
show_attribute and store_attribute methods in commit 517982229f
("configfs: remove old API") from this pull.  As Alexander says about
that patch:

 "There's no need to keep an extra wrapper structure per item and the
  awkward show_attribute/store_attribute item ops are no longer needed.

  This patch converts policy code to the new api, all the while making
  the code quite a bit smaller and easier on the eyes.

  Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>"

That patch was folded into the merge so that the tree should be fully
bisectable.

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (23 commits)
  configfs: remove old API
  ocfs2/cluster: use per-attribute show and store methods
  ocfs2/cluster: move locking into attribute store methods
  netconsole: use per-attribute show and store methods
  target: use per-attribute show and store methods
  spear13xx_pcie_gadget: use per-attribute show and store methods
  dlm: use per-attribute show and store methods
  usb-gadget/f_serial: use per-attribute show and store methods
  usb-gadget/f_phonet: use per-attribute show and store methods
  usb-gadget/f_obex: use per-attribute show and store methods
  usb-gadget/f_uac2: use per-attribute show and store methods
  usb-gadget/f_uac1: use per-attribute show and store methods
  usb-gadget/f_mass_storage: use per-attribute show and store methods
  usb-gadget/f_sourcesink: use per-attribute show and store methods
  usb-gadget/f_printer: use per-attribute show and store methods
  usb-gadget/f_midi: use per-attribute show and store methods
  usb-gadget/f_loopback: use per-attribute show and store methods
  usb-gadget/ether: use per-attribute show and store methods
  usb-gadget/f_acm: use per-attribute show and store methods
  usb-gadget/f_hid: use per-attribute show and store methods
  ...
2015-11-13 20:04:17 -08:00
Andreas Gruenbacher
d9a82a0403 xattr handlers: Pass handler to operations instead of flags
The xattr_handler operations are currently all passed a file system
specific flags value which the operations can use to disambiguate between
different handlers; some file systems use that to distinguish the xattr
namespace, for example.  In some oprations, it would be useful to also have
access to the handler prefix.  To allow that, pass a pointer to the handler
to operations instead of the flags value alone.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-13 20:34:32 -05:00
Linus Torvalds
2e3078af2c Merge branch 'akpm' (patches from Andrew)
Merge patch-bomb from Andrew Morton:

 - inotify tweaks

 - some ocfs2 updates (many more are awaiting review)

 - various misc bits

 - kernel/watchdog.c updates

 - Some of mm.  I have a huge number of MM patches this time and quite a
   lot of it is quite difficult and much will be held over to next time.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
  selftests: vm: add tests for lock on fault
  mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
  mm: introduce VM_LOCKONFAULT
  mm: mlock: add new mlock system call
  mm: mlock: refactor mlock, munlock, and munlockall code
  kasan: always taint kernel on report
  mm, slub, kasan: enable user tracking by default with KASAN=y
  kasan: use IS_ALIGNED in memory_is_poisoned_8()
  kasan: Fix a type conversion error
  lib: test_kasan: add some testcases
  kasan: update reference to kasan prototype repo
  kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
  kasan: various fixes in documentation
  kasan: update log messages
  kasan: accurately determine the type of the bad access
  kasan: update reported bug types for kernel memory accesses
  kasan: update reported bug types for not user nor kernel memory accesses
  mm/kasan: prevent deadlock in kasan reporting
  mm/kasan: don't use kasan shadow pointer in generic functions
  mm/kasan: MODULE_VADDR is not available on all archs
  ...
2015-11-05 23:10:54 -08:00
Joseph Qi
262d8a8779 ocfs2: clean up unused variable in ocfs2_duplicate_clusters_by_page()
readahead_pages in ocfs2_duplicate_clusters_by_page is defined but not
used, so clean it up.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Joseph Qi
5afc44e2e9 ocfs2: add uuid to ocfs2 thread name for problem analysis
A node can mount multiple ocfs2 volumes.  And if thread names are same for
each volume/domain, it will bring inconvenience when analyzing problems
because we have to identify which volume/domain the messages belong to.

Since thread name will be printed to messages, so add volume uuid or dlm
name to thread name can benefit problem analysis.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
alex chen
b1529a41f7 ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error
In ocfs2_mknod_locked if '__ocfs2_mknod_locke		d' returns an error, we
should reclaim the inode successfully claimed above, otherwise, the
inode never be reused. The case is described below:

ocfs2_mknod
    ocfs2_mknod_locked
        ocfs2_claim_new_inode
                Successfully claim the inode
        __ocfs2_mknod_locked
            ocfs2_journal_access_di
            Failed because of -ENOMEM or other reasons, the inode
                        lockres has not been initialized yet.

    iput(inode)
        ocfs2_evict_inode
            ocfs2_delete_inode
                ocfs2_inode_lock
                    ocfs2_inode_lock_full_nested
                        __ocfs2_cluster_lock
                                Return -EINVAL because of the inode
                                lockres has not been initialized.

                So the following operations are not performed
                ocfs2_wipe_inode
                        ocfs2_remove_inode
                                ocfs2_free_dinode
                                        ocfs2_free_suballoc_bits

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Joseph Qi
0986fe9b50 ocfs2: fix race between mount and delete node/cluster
There is a race case between mount and delete node/cluster, which will
lead o2hb_thread to malfunctioning dead loop.

    o2hb_thread
    {
        o2nm_depend_this_node();
        <<<<<< race window, node may have already been deleted, and then
               enter the loop, o2hb thread will be malfunctioning
               because of no configured nodes found.
        while (!kthread_should_stop() &&
               !reg->hr_unclean_stop && !reg->hr_aborted_start) {
    }

So check the return value of o2nm_depend_this_node() is needed.  If node
has been deleted, do not enter the loop and let mount fail.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Joseph Qi
93d911fcce ocfs2: only take lock if dio entry when recover orphans
We have no need to take inode mutex, rw and inode lock if it is not dio
entry when recover orphans.  Optimize it by adding a flag
OCFS2_INODE_DIO_ORPHAN_ENTRY to ocfs2_inode_info to reduce contention.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Joseph Qi
30edc43c7f ocfs2: do not include dio entry in case of orphan scan
dio entry will only do truncate in case of ORPHAN_NEED_TRUNCATE. So do
not include it when doing normal orphan scan to reduce contention.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Joseph Qi
1d1aff8cf3 ocfs2: improve performance for localalloc
Currently cluster allocation is always trying to find a victim chain (a
chian has most space), and this may lead to poor performance because of
discontiguous allocation in some scenarios.

Our test case is block size 4k, cluster size 1M and mount option with
localalloc=2048 (2G), since a gd is 32256M (about 31.5G) and a localalloc
window is only 2G, creating 50G file will result in 2G from gd0, 2G from
gd1, ...

One way to improve performance is enlarge localalloc window size (max
31104M), but this will make end user feel that about 30G is suddenly
"missing", and localalloc currently do not support steal, which means one
node cannot use another node's localalloc even it is not used in fact.  So
using the last gd to record the allocation and continues with the gd if it
has enough space for a localalloc window can make the allocation as more
contiguous as possible.

Our test result is below (evaluated in IOPS), which is using iometer
running in VM, dynamic vhd virtual disk stored in ocfs2.

IO model                Original   After   Improved(%)
16K60%Write100%Random     703       876     24.59%
8K90%Write100%Random      735       827     12.59%
4K100%Write100%Random     859       915      6.52%
4K100%Read100%Random     2092      2600     24.30%

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Tested-by: Norton Zhu <norton.zhu@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
jiangyiwen
4e357b932a ocfs2: fill in the unused portion of the block with zeros by dio_zero_block()
A simplified test case is (this case from Ryan):
1) dd if=/dev/zero of=/mnt/hello bs=512 count=1 oflag=direct;
2) truncate /mnt/hello -s 2097152
file 'hello' is not exist before test. After this command,
file 'hello' should be all zero. But 512~4096 is some random data.

Setting bh state to new when get a new block, if so,
direct_io_worker()->dio_zero_block() will fill-in the unused portion
of the block with zero.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Norton.Zhu
d162eaad77 ocfs2_direct_IO_write() misses ocfs2_is_overwrite() error code
If ocfs2_is_overwrite failed, ocfs2_direct_IO_write mays till return
success to the caller.

Signed-off-by: Norton.Zhu <norton.zhu@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Linus Torvalds
9576c2f293 File locking related changes for v4.4 (pile #1)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWNsKlAAoJEAAOaEEZVoIVKNMP+QHb96HMNWnMlBE9jwPbBK/2
 yM80sa6wRcbCF519sRFbmOheet4bgNSHixegtUez5kyqyI7Hr0tsRYvIo5/amAWX
 EIh03fZoM+Bgm+dblYivorSrPmmx2UQ9RG6pUbcOPtxdCpQ79tfzVyYVykG5wcb5
 NLSibG9s5USutOXPTatxDqS6P2QwvvWXHR5oX1mkU2W7nQXfHOdQKSuk5CqUeIWx
 JSGIa+plS9fath1Ndu4pJ7atvU8cR0t+VeOqPmGoqqIDyGVbo45XgXZmk0xCxEs9
 XsVSbdGBMAtA63xlZHFROADFNXIosay2zA7mdG0i3IrLRMQr/okQhTqBrFMKmj0m
 cDMDNOs4j4M8JJPkwrJQ3S/1Tnl+zyAuKKTJwgvVnd1tcyTZjs3g77I9e84pSTsp
 chL4FmfeR7dhk+YJgcnbzvnnP7tBbQcV0ET/ILVsDU7bNDujWlcDzYkbbWx70WLa
 KobjmsW/OAGaQugIMA1oGLTexT1u9HtDYOw8JVNBKwlrnPKyFVb8X88gx2Laf34L
 Qa04TdrFseuxbnBGifLyQTsLxgF9QalUo+51J0I4a7G3WX0U2Zuk+ZTbHc6ChhdW
 d0oL2SEyToscRADRL0/u2CUR1dEXkdDXi3pxgvDs5PTJVU+lIy4czp/dI5JrjKUA
 L7O27Kstgoe2GctHn6FI
 =OYAZ
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.4-1' of git://git.samba.org/jlayton/linux

Pull file locking updates from Jeff Layton:
 "The largest series of changes is from Ben who offered up a set to add
  a new helper function for setting locks based on the type set in
  fl_flags.  Dmitry also send in a fix for a potential race that he
  found with KTSAN"

* tag 'locks-v4.4-1' of git://git.samba.org/jlayton/linux:
  locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait
  Move locks API users to locks_lock_inode_wait()
  locks: introduce locks_lock_inode_wait()
  locks: Use more file_inode and fix a comment
  fs: fix data races on inode->i_flctx
  locks: change tracepoint for generic_add_lease
2015-11-05 10:31:29 -08:00
Joseph Qi
b67de018b3 ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
dlm_lockres_put will call dlm_lockres_release if it is the last
reference, and then it may call dlm_print_one_lock_resource and
take lockres spinlock.

So unlock lockres spinlock before dlm_lockres_put to avoid deadlock.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-23 17:55:10 +09:00
Benjamin Coddington
4f6563677a Move locks API users to locks_lock_inode_wait()
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct
locks API function, use the check within locks_lock_inode_wait().  This
allows for some later cleanup.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22 14:57:36 -04:00
Christoph Hellwig
45b997737a ocfs2/cluster: use per-attribute show and store methods
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-10-13 22:17:55 -07:00
Christoph Hellwig
13a83fc909 ocfs2/cluster: move locking into attribute store methods
The test and separate set bit scheme was racy to start with, so move to do
a test_and_set_bit after doing the earlier error checks inside the actual
store methods.  Also remove the locking for the local attribute which
already has a different scheme to synchronize.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-10-13 22:17:53 -07:00
Joseph Qi
012572d4fc ocfs2/dlm: fix deadlock when dispatch assert master
The order of the following three spinlocks should be:
dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock

But dlm_dispatch_assert_master() is called while holding
dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls
dlm_grab() which will take dlm_domain_lock.

Once another thread (for example, dlm_query_join_handler) has already
taken dlm_domain_lock, and tries to take dlm_ctxt->spinlock deadlock
happens.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: "Junxiao Bi" <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22 15:09:53 -07:00
Andrew Morton
e527b22c3f revert "ocfs2/dlm: use list_for_each_entry instead of list_for_each"
Revert commit f83c7b5e9f ("ocfs2/dlm: use list_for_each_entry instead
of list_for_each").

list_for_each_entry() will dereference its `pos' argument, which can be
NULL in dlm_process_recovery_data().

Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Reported-by: Fengguang Wu <fengguang.wu@gmail.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-11 15:21:34 -07:00
Kees Cook
a068acf2ee fs: create and use seq_show_option for escaping
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
46359295a3 ocfs2: clean up redundant NULL checks before kfree
NULL check before kfree is redundant and so clean them up.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joe Perches
7ecef14ab1 ocfs2: neaten do_error, ocfs2_error and ocfs2_abort
These uses sometimes do and sometimes don't have '\n' terminations.  Make
the uses consistently use '\n' terminations and remove the newline from
the functions.

Miscellanea:

o Coalesce formats
o Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Xue jiufei
d0c97d52f5 ocfs2: do not set fs read-only if rec[0] is empty while committing truncate
While appending an extent to a file, it will call these functions:
ocfs2_insert_extent

  -> call ocfs2_grow_tree() if there's no free rec
     -> ocfs2_add_branch add a new branch to extent tree,
        now rec[0] in the leaf of rightmost path is empty
  -> ocfs2_do_insert_extent
     -> ocfs2_rotate_tree_right
       -> ocfs2_extend_rotate_transaction
          -> jbd2_journal_restart if jbd2_journal_extend fail
     -> ocfs2_insert_path
        -> ocfs2_extend_trans
          -> jbd2_journal_restart if jbd2_journal_extend fail
        -> ocfs2_insert_at_leaf
     -> ocfs2_et_update_clusters
Function jbd2_journal_restart() may be called and it may happened that
buffers dirtied in ocfs2_add_branch() are committed
while buffers dirtied in ocfs2_insert_at_leaf() and
ocfs2_et_update_clusters() are not.
So an empty rec[0] is left in rightmost path which will cause
read-only filesystem when call ocfs2_commit_truncate()
with the error message: "Inode %lu has an empty extent record".

This is not a serious problem, so remove the rightmost path when call
ocfs2_commit_truncate().

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
yangwenfang
7f27ec978b ocfs2: call ocfs2_journal_access_di() before ocfs2_journal_dirty() in ocfs2_write_end_nolock()
1: After we call ocfs2_journal_access_di() in ocfs2_write_begin(),
   jbd2_journal_restart() may also be called, in this function transaction
   A's t_updates-- and obtains a new transaction B.  If
   jbd2_journal_commit_transaction() is happened to commit transaction A,
   when t_updates==0, it will continue to complete commit and unfile
   buffer.

   So when jbd2_journal_dirty_metadata(), the handle is pointed a new
   transaction B, and the buffer head's journal head is already freed,
   jh->b_transaction == NULL, jh->b_next_transaction == NULL, it returns
   EINVAL, So it triggers the BUG_ON(status).

thread 1                                          jbd2
ocfs2_write_begin                     jbd2_journal_commit_transaction
ocfs2_write_begin_nolock
  ocfs2_start_trans
    jbd2__journal_start(t_updates+1,
                       transaction A)
    ocfs2_journal_access_di
    ocfs2_write_cluster_by_desc
      ocfs2_mark_extent_written
        ocfs2_change_extent_flag
          ocfs2_split_extent
            ocfs2_extend_rotate_transaction
              jbd2_journal_restart
              (t_updates-1,transaction B) t_updates==0
                                        __jbd2_journal_refile_buffer
                                        (jh->b_transaction = NULL)
ocfs2_write_end
ocfs2_write_end_nolock
    ocfs2_journal_dirty
        jbd2_journal_dirty_metadata(bug)
   ocfs2_commit_trans

2.  In ext4, I found that: jbd2_journal_get_write_access() called by
   ext4_write_end.

ext4_write_begin
    ext4_journal_start
        __ext4_journal_start_sb
            ext4_journal_check_start
            jbd2__journal_start

ext4_write_end
    ext4_mark_inode_dirty
        ext4_reserve_inode_write
            ext4_journal_get_write_access
                jbd2_journal_get_write_access
        ext4_mark_iloc_dirty
            ext4_do_update_inode
                ext4_handle_dirty_metadata
                    jbd2_journal_dirty_metadata

3. So I think we should put ocfs2_journal_access_di before
   ocfs2_journal_dirty in the ocfs2_write_end.  and it works well after my
   modification.

Signed-off-by: vicky <vicky.yangwenfang@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Zhangguanghui <zhang.guanghui@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Tina Ruchandani
40476b8294 ocfs2: use 64bit variables to track heartbeat time
o2hb_elapsed_msecs computes the time taken for a disk heartbeat.
'struct timeval' variables are used to store start and end times.  On
32-bit systems, the 'tv_sec' component of 'struct timeval' will overflow
in year 2038 and beyond.

This patch solves the overflow with the following:

1. Replace o2hb_elapsed_msecs using 'ktime_t' values to measure start
   and end time, and built-in function 'ktime_ms_delta' to compute the
   elapsed time.  ktime_get_real() is used since the code prints out the
   wallclock time.

2. Changes format string to print time as a single 64-bit nanoseconds
   value ("%lld") instead of seconds and microseconds.  This simplifies
   the code since converting ktime_t to that format would need expensive
   computation.  However, the debug log string is less readable than the
   previous format.

Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Suggested by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
ad69482122 ocfs2: fix race between crashed dio and rm
There is a race case between crashed dio and rm, which will lead to
OCFS2_VALID_FL not set read-only.

  N1                              N2
  ------------------------------------------------------------------------
  dd with direct flag
                                  rm file
  crashed with an dio entry left
  in orphan dir
                                  clear OCFS2_VALID_FL in
                                  ocfs2_remove_inode
                                  recover N1 and read the corrupted inode,
                                  and set filesystem read-only

So we skip the inode deletion this time and wait for dio entry recovered
first.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Yiwen Jiang
f57a22ddec ocfs2: avoid access invalid address when read o2dlm debug messages
The following case will lead to a lockres is freed but is still in use.

cat /sys/kernel/debug/o2dlm/locking_state	dlm_thread
lockres_seq_start
    -> lock dlm->track_lock
    -> get resA
                                                resA->refs decrease to 0,
                                                call dlm_lockres_release,
                                                and wait for "cat" unlock.
Although resA->refs is already set to 0,
increase resA->refs, and then unlock
                                                lock dlm->track_lock
                                                    -> list_del_init()
                                                    -> unlock
                                                    -> free resA

In such a race case, invalid address access may occurs.  So we should
delete list res->tracking before resA->refs decrease to 0.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Tariq Saeed
743b5f1434 ocfs2: take inode lock in ocfs2_iop_set/get_acl()
This bug in mainline code is pointed out by Mark Fasheh.  When
ocfs2_iop_set_acl() and ocfs2_iop_get_acl() are entered from VFS layer,
inode lock is not held.  This seems to be regression from older kernels.
The patch is to fix that.

Orabug: 20189959
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Tariq Saeed
3d46a44a0c ocfs2: fix BUG_ON() in ocfs2_ci_checkpointed()
PID: 614    TASK: ffff882a739da580  CPU: 3   COMMAND: "ocfs2dc"
  #0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d
  #1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5
  #2 [ffff882ecc375af0] oops_end at ffffffff815091d8
  #3 [ffff882ecc375b20] die at ffffffff8101868b
  #4 [ffff882ecc375b50] do_trap at ffffffff81508bb0
  #5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5
  #6 [ffff882ecc375c40] invalid_op at ffffffff815116fb
     [exception RIP: ocfs2_ci_checkpointed+208]
     RIP: ffffffffa0a7e940  RSP: ffff882ecc375cf0  RFLAGS: 00010002
     RAX: 0000000000000001  RBX: 000000000000654b  RCX: ffff8812dc83f1f8
     RDX: 00000000000017d9  RSI: ffff8812dc83f1f8  RDI: ffffffffa0b2c318
     RBP: ffff882ecc375d20   R8: ffff882ef6ecfa60   R9: ffff88301f272200
     R10: 0000000000000000  R11: 0000000000000000  R12: ffffffffffffffff
     R13: ffff8812dc83f4f0  R14: 0000000000000000  R15: ffff8812dc83f1f8
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2]
  #8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2]
  #9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2]
#10 [ffff882ecc375e18] ocfs2_downconvert_thread_do_work at ffffffffa0a85445 [ocfs2]
#11 [ffff882ecc375e68] ocfs2_downconvert_thread at ffffffffa0a854de [ocfs2]
#12 [ffff882ecc375ee8] kthread at ffffffff81090da7
#13 [ffff882ecc375f48] kernel_thread_helper at ffffffff81511884
assert is tripped because the tran is not checkpointed and the lock level is PR.

Some time ago, chmod command had been executed. As result, the following call
chain left the inode cluster lock in PR state, latter on causing the assert.
system_call_fastpath
  -> my_chmod
   -> sys_chmod
    -> sys_fchmodat
     -> notify_change
      -> ocfs2_setattr
       -> posix_acl_chmod
        -> ocfs2_iop_set_acl
         -> ocfs2_set_acl
          -> ocfs2_acl_set_mode
Here is how.
1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
1120 {
1247         ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
..
1258         if (!status && attr->ia_valid & ATTR_MODE) {
1259                 status =  posix_acl_chmod(inode, inode->i_mode);

519 posix_acl_chmod(struct inode *inode, umode_t mode)
520 {
..
539         ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);

287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ...
288 {
289         return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL);

224 int ocfs2_set_acl(handle_t *handle,
225                          struct inode *inode, ...
231 {
..
252                                 ret = ocfs2_acl_set_mode(inode, di_bh,
253                                                          handle, mode);

168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
170 {
183         if (handle == NULL) {
                    >>> BUG: inode lock not held in ex at this point <<<
184                 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
185                                            OCFS2_INODE_UPDATE_CREDITS);

ocfs2_setattr.#1247 we unlock and at #1259 call posix_acl_chmod. When we reach
ocfs2_acl_set_mode.#181 and do trans, the inode cluster lock is not held in EX
mode (it should be). How this could have happended?

We are the lock master, were holding lock EX and have released it in
ocfs2_setattr.#1247.  Note that there are no holders of this lock at
this point.  Another node needs the lock in PR, and we downconvert from
EX to PR.  So the inode lock is PR when do the trans in
ocfs2_acl_set_mode.#184.  The trans stays in core (not flushed to disc).
Now another node want the lock in EX, downconvert thread gets kicked
(the one that tripped assert abovt), finds an unflushed trans but the
lock is not EX (it is PR).  If the lock was at EX, it would have flushed
the trans ocfs2_ci_checkpointed -> ocfs2_start_checkpoint before
downconverting (to NULL) for the request.

ocfs2_setattr must not drop inode lock ex in this code path.  If it
does, takes it again before the trans, say in ocfs2_set_acl, another
cluster node can get in between, execute another setattr, overwriting
the one in progress on this node, resulting in a mode acl size combo
that is a mix of the two.

Orabug: 20189959
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Norton.Zhu
72f6fe1fe5 ocfs2: optimize error handling in dlm_request_join
Currently error handling in dlm_request_join is a little obscure, so
optimize it to promote readability.

If packet.code is invalid, reset it to JOIN_DISALLOW to keep it
meaningful.  It only influences the log printing.

Signed-off-by: Norton.Zhu <norton.zhu@huawei.com>
Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Yiwen Jiang
928dda1f94 ocfs2: fix a tiny case that inode can not removed
When running dirop_fileop_racer we found a case that inode
can not removed.

Two nodes, say Node A and Node B, mount the same ocfs2 volume.  Create
two dirs /race/1/ and /race/2/ in the filesystem.

  Node A                            Node B
  rm -r /race/2/
                                    mv /race/1/ /race/2/
  call ocfs2_unlink(), get
  the EX mode of /race/2/
                                    wait for B unlock /race/2/
  decrease i_nlink of /race/2/ to 0,
  and add inode of /race/2/ into
  orphan dir, unlock /race/2/
                                    got EX mode of /race/2/. because
                                    /race/1/ is dir, so inc i_nlink
                                    of /race/2/ and update into disk,
                                    unlock /race/2/
  because i_nlink of /race/2/
  is not zero, this inode will
  always remain in orphan dir

This patch fixes this case by test whether i_nlink of new dir is zero.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Xue jiufei <xuejiufei@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
WeiWei Wang
6ab855a99b ocfs2: add ip_alloc_sem in direct IO to protect allocation changes
In ocfs2, ip_alloc_sem is used to protect allocation changes on the
node.  In direct IO, we add ip_alloc_sem to protect date consistent
between direct-io and ocfs2_truncate_file race (buffer io use
ip_alloc_sem already).  Although inode->i_mutex lock is used to avoid
concurrency of above situation, i think ip_alloc_sem is still needed
because protect allocation changes is significant.

Other filesystem like ext4 also uses rw_semaphore to protect data
consistent between get_block-vs-truncate race by other means, So
ip_alloc_sem in ocfs2 direct io is needed.

Signed-off-by: Weiwei Wang <wangww631@huawei.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Goldwyn Rodrigues
34237681e0 ocfs2: clear the rest of the buffers on error
In case a validation fails, clear the rest of the buffers and return the
error to the calling function.

This also facilitates bubbling up the error originating from ocfs2_error
to calling functions.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Goldwyn Rodrigues
17a5b9ab32 ocfs2: acknowledge return value of ocfs2_error()
Caveat: This may return -EROFS for a read case, which seems wrong.  This
is happening even without this patch series though.  Should we convert
EROFS to EIO?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Goldwyn Rodrigues
7d0fb9148a ocfs2: add errors=continue
OCFS2 is often used in high-availaibility systems.  However, ocfs2
converts the filesystem to read-only at the drop of the hat.  This may
not be necessary, since turning the filesystem read-only would affect
other running processes as well, decreasing availability.

This attempt is to add errors=continue, which would return the EIO to
the calling process and terminate furhter processing so that the
filesystem is not corrupted further.  However, the filesystem is not
converted to read-only.

As a future plan, I intend to create a small utility or extend
fsck.ocfs2 to fix small errors such as in the inode.  The input to the
utility such as the inode can come from the kernel logs so we don't have
to schedule a downtime for fixing small-enough errors.

The patch changes the ocfs2_error to return an error.  The error
returned depends on the mount option set.  If none is set, the default
is to turn the filesystem read-only.

Perhaps errors=continue is not the best option name.  Historically it is
used for making an attempt to progress in the current process itself.
Should we call it errors=eio? or errors=killproc? Suggestions/Comments
welcome.

Sources are available at:
  https://github.com/goldwynr/linux/tree/error-cont

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Xue jiufei
513e2dae94 ocfs2: flush inode data to disk and free inode when i_count becomes zero
Disk inode deletion may be heavily delayed when one node unlink a file
after the same dentry is freed on another node(say N1) because of memory
shrink but inode is left in memory.  This inode can only be freed while
N1 doing the orphan scan work.

However, N1 may skip orphan scan for several times because other nodes
may do the work earlier.  In our tests, it may take 1 hour on 4 nodes
cluster and it hurts the user experience.  So we think the inode should
be freed after the data flushed to disk when i_count becomes zero to
avoid such circumstances.

Signed-off-by: Joyce.xue <xuejiufei@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Sanidhya Kashyap
0f5e7b41f9 ocfs2: trusted xattr missing CAP_SYS_ADMIN check
The trusted extended attributes are only visible to the process which
hvae CAP_SYS_ADMIN capability but the check is missing in ocfs2
xattr_handler trusted list.  The check is important because this will be
used for implementing mechanisms in the userspace for which other
ordinary processes should not have access to.

Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Taesoo kim <taesoo@gatech.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
jiangyiwen
807a790711 ocfs2: set filesytem read-only when ocfs2_delete_entry failed.
In ocfs2_rename, it will lead to an inode with two entried(old and new) if
ocfs2_delete_entry(old) failed.  Thus, filesystem will be inconsistent.

The case is described below:

ocfs2_rename
    -> ocfs2_start_trans
    -> ocfs2_add_entry(new)
    -> ocfs2_delete_entry(old)
        -> __ocfs2_journal_access *failed* because of -ENOMEM
    -> ocfs2_commit_trans

So filesystem should be set to read-only at the moment.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
f83c7b5e9f ocfs2/dlm: use list_for_each_entry instead of list_for_each
Use list_for_each_entry instead of list_for_each to simplify code.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
0e3d9eafb8 ocfs2: remove unneeded code in dlm_register_domain_handlers
The last goto statement is unneeded, so remove it.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
cdd09f49cb ocfs2: fix BUG when o2hb_register_callback fails
In dlm_register_domain_handlers, if o2hb_register_callback fails, it
will call dlm_unregister_domain_handlers to unregister.  This will
trigger the BUG_ON in o2hb_unregister_callback because hc_magic is 0.
So we should call o2hb_setup_callback to initialize hc first.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
914a9b7429 ocfs2: remove unneeded code in ocfs2_dlm_init
status is already initialized and it will only be 0 or negatives in the
code flow.  So remove the unneeded assignment after the lable 'local'.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
3cb2ec43f6 ocfs2: adjust code to match locking/unlocking order
Unlocking order in ocfs2_unlink and ocfs2_rename mismatches the
corresponding locking order, although it won't cause issues, adjust the
code so that it looks more reasonable.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
bf59e6623a ocfs2: clean up unused local variables in ocfs2_file_write_iter
Since commit 86b9c6f3f8 ("ocfs2: remove filesize checks for sync I/O
journal commit") removes filesize checks for sync I/O journal commit,
variables old_size and old_clusters are not actually used any more.  So
clean them up.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Christophe JAILLET
372a447c4b ocfs2: do not log twice error messages
'o2hb_map_slot_data' and 'o2hb_populate_slot_data' are called from only
one place, in 'o2hb_region_dev_write'.  Return value is checked and
'mlog_errno' is called to log a message if it is not 0.

So there is no need to call 'mlog_errno' directly within these functions.
This would result on logging the message twice.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
acf8fdbe6a ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access
When storage network is unstable, it may trigger the BUG in
__ocfs2_journal_access because of buffer not uptodate.  We can retry the
write in this case or return error instead of BUG.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Zhangguanghui <zhang.guanghui@h3c.com>
Tested-by: Zhangguanghui <zhang.guanghui@h3c.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
faaebf18f8 ocfs2: fix several issues of append dio
1) Take rw EX lock in case of append dio.
2) Explicitly treat the error code -EIOCBQUEUED as normal.
3) Set di_bh to NULL after brelse if it may be used again later.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
512f62acbd ocfs2: fix race between dio and recover orphan
During direct io the inode will be added to orphan first and then
deleted from orphan.  There is a race window that the orphan entry will
be deleted twice and thus trigger the BUG when validating
OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.

ocfs2_direct_IO_write
    ...
    ocfs2_add_inode_to_orphan
    >>>>>>>> race window.
             1) another node may rm the file and then down, this node
             take care of orphan recovery and clear flag
             OCFS2_DIO_ORPHANED_FL.
             2) since rw lock is unlocked, it may race with another
             orphan recovery and append dio.
    ocfs2_del_inode_from_orphan

So take inode mutex lock when recovering orphans and make rw unlock at the
end of aio write in case of append dio.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Ryan Ding
aa1057b3de ocfs2: direct write will call ocfs2_rw_unlock() twice when doing aio+dio
ocfs2_file_write_iter() is usng the wrong return value ('written').  This
will cause ocfs2_rw_unlock() be called both in write_iter & end_io,
triggering a BUG_ON.

This issue was introduced by commit 7da839c475 ("ocfs2: use
__generic_file_write_iter()").

Orabug: 21612107
Fixes: 7da839c475 ("ocfs2: use __generic_file_write_iter()")
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Linus Torvalds
e31fb9e005 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext3 removal, quota & udf fixes from Jan Kara:
 "The biggest change in the pull is the removal of ext3 filesystem
  driver (~28k lines removed).  Ext4 driver is a full featured
  replacement these days and both RH and SUSE use it for several years
  without issues.  Also there are some workarounds in VM & block layer
  mainly for ext3 which we could eventually get rid of.

  Other larger change is addition of proper error handling for
  dquot_initialize().  The rest is small fixes and cleanups"

[ I wasn't convinced about the ext3 removal and worried about things
  falling through the cracks for legacy users, but ext4 maintainers
  piped up and were all unanimously in favor of removal, and maintaining
  all legacy ext3 support inside ext4.   - Linus ]

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Don't modify filesystem for read-only mounts
  quota: remove an unneeded condition
  ext4: memory leak on error in ext4_symlink()
  mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition
  ext4: Improve ext4 Kconfig test
  block: Remove forced page bouncing under IO
  fs: Remove ext3 filesystem driver
  doc: Update doc about journalling layer
  jfs: Handle error from dquot_initialize()
  reiserfs: Handle error from dquot_initialize()
  ocfs2: Handle error from dquot_initialize()
  ext4: Handle error from dquot_initialize()
  ext2: Handle error from dquot_initalize()
  quota: Propagate error from ->acquire_dquot()
2015-09-03 12:28:30 -07:00
Linus Torvalds
1081230b74 Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This first core part of the block IO changes contains:

   - Cleanup of the bio IO error signaling from Christoph.  We used to
     rely on the uptodate bit and passing around of an error, now we
     store the error in the bio itself.

   - Improvement of the above from myself, by shrinking the bio size
     down again to fit in two cachelines on x86-64.

   - Revert of the max_hw_sectors cap removal from a revision again,
     from Jeff Moyer.  This caused performance regressions in various
     tests.  Reinstate the limit, bump it to a more reasonable size
     instead.

   - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
     Most devices have huge trim limits, which can cause nasty latencies
     when deleting files.  Enable the admin to configure the size down.
     We will look into having a more sane default instead of UINT_MAX
     sectors.

   - Improvement of the SGP gaps logic from Keith Busch.

   - Enable the block core to handle arbitrarily sized bios, which
     enables a nice simplification of bio_add_page() (which is an IO hot
     path).  From Kent.

   - Improvements to the partition io stats accounting, making it
     faster.  From Ming Lei.

   - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
     file in blk-mq, as well as a fix for a blk-mq timeout race
     condition.

   - Ming Lin has been carrying Kents above mentioned patches forward
     for a while, and testing them.  Ming also did a few fixes around
     that.

   - Sasha Levin found and fixed a use-after-free problem introduced by
     the bio->bi_error changes from Christoph.

   - Small blk cgroup cleanup from Viresh Kumar"

* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
  blk: Fix bio_io_vec index when checking bvec gaps
  block: Replace SG_GAPS with new queue limits mask
  block: bump BLK_DEF_MAX_SECTORS to 2560
  Revert "block: remove artifical max_hw_sectors cap"
  blk-mq: fix race between timeout and freeing request
  blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  Documentation: update notes in biovecs about arbitrarily sized bios
  block: remove bio_get_nr_vecs()
  fs: use helper bio_add_page() instead of open coding on bi_io_vec
  block: kill merge_bvec_fn() completely
  md/raid5: get rid of bio_fits_rdev()
  md/raid5: split bio for chunk_aligned_read
  block: remove split code in blkdev_issue_{discard,write_same}
  btrfs: remove bio splitting and merge_bvec_fn() calls
  bcache: remove driver private bio splitting code
  block: simplify bio_add_page()
  block: make generic_make_request handle arbitrarily sized bios
  blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
  block: don't access bio->bi_error after bio_put()
  block: shrink struct bio down to 2 cache lines again
  ...
2015-09-02 13:10:25 -07:00
Linus Torvalds
089b669506 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk()
  fixes, Documentation and MAINTAINERS updates)"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
  MAINTAINERS: update my e-mail address
  mod_devicetable: add space before */
  scsi: a100u2w: trivial typo in printk
  i2c: Fix typo in i2c-bfin-twi.c
  treewide: fix typos in comment blocks
  Doc: fix trivial typo in SubmittingPatches
  proportions: Spelling s/consitent/consistent/
  dm: Spelling s/consitent/consistent/
  aic7xxx: Fix typo in error message
  pcmcia: Fix typo in locking documentation
  scsi/arcmsr: Fix typos in error log
  drm/nouveau/gr: Fix typo in nv10.c
  [SCSI] Fix printk typos in drivers/scsi
  staging: comedi: Grammar s/Enable support a/Enable support for a/
  Btrfs: Spelling s/consitent/consistent/
  README: GTK+ is a acronym
  ASoC: omap: Fix typo in config option description
  mm: tlb.c: Fix error message
  ntfs: super.c: Fix error log
  fix typo in Documentation/SubmittingPatches
  ...
2015-09-01 18:46:42 -07:00
Greg Kroah-Hartman
5d44f4b348 Merge 4.2-rc6 into char-misc-next
We want the fixes in Linus's tree in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-09 16:28:09 -07:00
Masahiro Yamada
e1c05067c3 treewide: fix typos in comment blocks
Looks like the word "contiguous" is often mistyped.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 14:46:24 +02:00
Joseph Qi
32e5a2a2be ocfs2: fix shift left overflow
When using a large volume, for example 9T volume with 2T already used,
frequent creation of small files with O_DIRECT when the IO is not
cluster aligned may clear sectors in the wrong place.  This will cause
filesystem corruption.

This is because p_cpos is a u32.  When calculating the corresponding
sector it should be converted to u64 first, otherwise it may overflow.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>	[4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:41 +03:00