Commit Graph

604 Commits

Author SHA1 Message Date
Al Viro
40a3cb0d23 d_add_ci(): make sure we don't miss d_lookup_done()
All callers of d_alloc_parallel() must make sure that resulting
in-lookup dentry (if any) will encounter __d_lookup_done() before
the final dput().  d_add_ci() might end up creating in-lookup
dentries; they are fed to d_splice_alias(), which will normally
make sure they meet __d_lookup_done().  However, it is possible
to end up with d_splice_alias() failing with ERR_PTR(-ELOOP)
without having done so.  It takes a corrupted ntfs or case-insensitive
xfs image, but neither should end up with memory corruption...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:29:05 -04:00
Muchun Song
f53bf711d4 mm: dcache: use kmem_cache_alloc_lru() to allocate dentry
Like inode cache, the dentry will also be added to its memcg list_lru.  So
replace kmem_cache_alloc() with kmem_cache_alloc_lru() to allocate dentry.

Link: https://lkml.kernel.org/r/20220228122126.37293-8-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:03 -07:00
Luis Chamberlain
c8c0c239d5 fs: move dcache sysctls to its own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move the dcache sysctl clutter out of kernel/sysctl.c.  This is a
small one-off entry, perhaps later we can simplify this representation,
but for now we use the helpers we have.  We won't know how we can
simplify this further untl we're fully done with the cleanup.

[arnd@arndb.de: avoid unused-function warning]
  Link: https://lkml.kernel.org/r/20211203190123.874239-2-arnd@kernel.org

Link: https://lkml.kernel.org/r/20211129205548.605569-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-22 08:33:36 +02:00
Al Viro
80e5d1ff5d useful constants: struct qstr for ".."
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-15 22:36:45 -04:00
Randy Dunlap
3d742d4b6e fs: delete repeated words in comments
Delete duplicate words in fs/*.c.
The doubled words that are being dropped are:
  that, be, the, in, and, for

Link: https://lkml.kernel.org/r/20201224052810.25315-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:26 -08:00
Linus Torvalds
250a25e7a1 Merge branch 'work.audit' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull RCU-safe common_lsm_audit() from Al Viro:
 "Make common_lsm_audit() non-blocking and usable from RCU pathwalk
  context.

  We don't really need to grab/drop dentry in there - rcu_read_lock() is
  enough. There's a couple of followups using that to simplify the
  logics in selinux, but those hadn't soaked in -next yet, so they'll
  have to go in next window"

* 'work.audit' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make dump_common_audit_data() safe to be called from RCU pathwalk
  new helper: d_find_alias_rcu()
2021-02-22 13:05:30 -08:00
Mauro Carvalho Chehab
961f3c898e fs: fix kernel-doc markups
Two markups are at the wrong place. Kernel-doc only
support having the comment just before the identifier.

Also, some identifiers have different names between their
prototypes and the kernel-doc markup.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/96b1e1b388600ab092331f6c4e88ff8e8779ce6c.1610610937.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2021-01-21 14:06:00 -07:00
Al Viro
bca585d24a new helper: d_find_alias_rcu()
similar to d_find_alias(inode), except that
	* the caller must be holding rcu_read_lock()
	* inode must not be freed until matching rcu_read_unlock()
	* result is *NOT* pinned and can only be dereferenced until
the matching rcu_read_unlock().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-16 15:12:06 -05:00
Hao Li
77573fa310 fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set
If DCACHE_REFERENCED is set, fast_dput() will return true, and then
retain_dentry() have no chance to check DCACHE_DONTCACHE. As a result,
the dentry won't be killed and the corresponding inode can't be evicted.
In the following example, the DAX policy can't take effects unless we
do a drop_caches manually.

  # DCACHE_LRU_LIST will be set
  echo abcdefg > test.txt

  # DCACHE_REFERENCED will be set and DCACHE_DONTCACHE can't do anything
  xfs_io -c 'chattr +x' test.txt

  # Drop caches to make DAX changing take effects
  echo 2 > /proc/sys/vm/drop_caches

What this patch does is preventing fast_dput() from returning true if
DCACHE_DONTCACHE is set. Then retain_dentry() will detect the
DCACHE_DONTCACHE and will return false. As a result, the dentry will be
killed and the inode will be evicted. In this way, if we change per-file
DAX policy, it will take effects automatically after this file is closed
by all processes.

I also add some comments to make the code more clear.

Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10 17:33:17 -05:00
Ahmed S. Darwish
2647537197 vfs: Use sequence counter with associated spinlock
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-19-a.darwish@linutronix.de
2020-07-29 16:14:27 +02:00
Linus Torvalds
cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Ira Weiny
2c567af418 fs: Introduce DCACHE_DONTCACHE
DCACHE_DONTCACHE indicates a dentry should not be cached on final
dput().

Also add a helper function to mark DCACHE_DONTCACHE on all dentries
pointing to a specific inode when that inode is being set I_DONTCACHE.

This facilitates dropping dentry references to inodes sooner which
require eviction to swap S_DAX mode.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-13 08:44:35 -07:00
Christoph Hellwig
32927393dc sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:40 -04:00
Linus Torvalds
5bf9a06a5f Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs cleanups from Al Viro:
 "No common topic, just three cleanups".

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make __d_alloc() static
  fs/namespace: add __user to open_tree and move_mount syscalls
  fs/fnctl: fix missing __user in fcntl_rw_hint()
2019-12-08 11:08:28 -08:00
Linus Torvalds
0aecba6173 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs d_inode/d_flags memory ordering fixes from Al Viro:
 "Fallout from tree-wide audit for ->d_inode/->d_flags barriers use.
  Basically, the problem is that negative pinned dentries require
  careful treatment - unless ->d_lock is locked or parent is held at
  least shared, another thread can make them positive right under us.

  Most of the uses turned out to be safe - the main surprises as far as
  filesystems are concerned were

   - race in dget_parent() fastpath, that might end up with the caller
     observing the returned dentry _negative_, due to insufficient
     barriers. It is positive in memory, but we could end up seeing the
     wrong value of ->d_inode in CPU cache. Fixed.

   - manual checks that result of lookup_one_len_unlocked() is positive
     (and rejection of negatives). Again, insufficient barriers (we
     might end up with inconsistent observed values of ->d_inode and
     ->d_flags). Fixed by switching to a new primitive that does the
     checks itself and returns ERR_PTR(-ENOENT) instead of a negative
     dentry. That way we get rid of boilerplate converting negatives
     into ERR_PTR(-ENOENT) in the callers and have a single place to
     deal with the barrier-related mess - inside fs/namei.c rather than
     in every caller out there.

  The guts of pathname resolution *do* need to be careful - the race
  found by Ritesh is real, as well as several similar races.
  Fortunately, it turns out that we can take care of that with fairly
  local changes in there.

  The tree-wide audit had not been fun, and I hate the idea of repeating
  it. I think the right approach would be to annotate the places where
  we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse
  catch regressions. But I'm still not sure what would be the least
  invasive way of doing that and it's clearly the next cycle fodder"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/namei.c: fix missing barriers when checking positivity
  fix dget_parent() fastpath race
  new helper: lookup_positive_unlocked()
  fs/namei.c: pull positivity check into follow_managed()
2019-12-06 09:06:58 -08:00
Al Viro
2fa6b1e01a fs/namei.c: fix missing barriers when checking positivity
Pinned negative dentries can, generally, be made positive
by another thread.  Conditions that prevent that are
	* ->d_lock on dentry in question
	* parent directory held at least shared
	* nobody else could have observed the address of dentry
Most of the places working with those fall into one of those
categories; however, d_lookup() and friends need to be used
with some care.  Fortunately, there's not a lot of call sites,
and with few exceptions all of those fall under one of the
cases above.

Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
and mountpoint_last().  Another one is lookup_slow() - there
dcache lookup is done with parent held shared, but the result
is used after we'd drop the lock.  The same happens in do_last() -
the lookup (in lookup_one()) is done with parent locked, but
result is used after unlocking.

lookup_fast(), do_last() and mountpoint_last() flat-out reject
negatives.

Most of lookup_dcache() calls are made with parent locked at least
shared; the only exception is lookup_one_len_unlocked().  It might
return pinned negative, needs serious care from callers.  Fortunately,
almost nobody calls it directly anymore; all but two callers have
converted to lookup_positive_unlocked(), which rejects negatives.

lookup_slow() is called by the same lookup_one_len_unlocked() (see
above), mountpoint_last() and walk_component().  In those two negatives
are rejected.

In other words, there is a small set of places where we need to
check carefully if a pinned potentially negative dentry is, in
fact, positive.  After that check we want to be sure that both
->d_inode and type bits in ->d_flags are stable and observed.
The set consists of follow_managed() (where the rejection happens
for lookup_fast(), walk_component() and do_last()), last_mountpoint()
and lookup_positive_unlocked().

Solution:
	1) transition from negative to positive (in __d_set_inode_and_type())
stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
	2) aforementioned 3 places in fs/namei.c fetch ->d_flags with
smp_load_acquire() and bugger off if it type bits say "negative".
That way anyone downstream of those checks has dentry know positive pinned,
with ->d_inode and type bits of ->d_flags stable and observed.

I considered splitting off d_lookup_positive(), so that the checks could
be done right there, under ->d_lock.  However, that leads to massive
duplication of rather subtle code in fs/namei.c and fs/dcache.c.  It's
worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
No matter what, autofs_d_manage()/autofs_d_automount() must live with
the possibility of pinned negative dentry passed their way, becoming
positive under them - that's the intended behaviour when lookup comes
in the middle of automount in progress, so we can't keep them out of
the area that has to deal with those, more's the pity...

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Al Viro
e840093367 fix dget_parent() fastpath race
We are overoptimistic about taking the fast path there; seeing
the same value in ->d_parent after having grabbed a reference
to that parent does *not* mean that it has remained our parent
all along.

That wouldn't be a big deal (in the end it is our parent and
we have grabbed the reference we are about to return), but...
the situation with barriers is messed up.

We might have hit the following sequence:

d is a dentry of /tmp/a/b
CPU1:					CPU2:
parent = d->d_parent (i.e. dentry of /tmp/a)
					rename /tmp/a/b to /tmp/b
					rmdir /tmp/a, making its dentry negative
grab reference to parent,
end up with cached parent->d_inode (NULL)
					mkdir /tmp/a, rename /tmp/b to /tmp/a/b
recheck d->d_parent, which is back to original
decide that everything's fine and return the reference we'd got.

The trouble is, caller (on CPU1) will observe dget_parent()
returning an apparently negative dentry.  It actually is positive,
but CPU1 has stale ->d_inode cached.

Use d->d_seq to see if it has been moved instead of rechecking ->d_parent.
NOTE: we are *NOT* going to retry on any kind of ->d_seq mismatch;
we just go into the slow path in such case.  We don't wait for ->d_seq
to become even either - again, if we are racing with renames, we
can bloody well go to slow path anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15 13:49:04 -05:00
Al Viro
5c8b0dfc6f make __d_alloc() static
no users outside of fs/dcache.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-10-25 14:08:24 -04:00
Qian Cai
5facae4f35 locking/lockdep: Remove unused @nested argument from lock_release()
Since the following commit:

  b4adfe8e05 ("locking/lockdep: Remove unused argument in __lock_release")

@nested is no longer used in lock_release(), so remove it from all
lock_release() calls and friends.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: alexander.levin@microsoft.com
Cc: daniel@iogearbox.net
Cc: davem@davemloft.net
Cc: dri-devel@lists.freedesktop.org
Cc: duyuyang@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: intel-gfx@lists.freedesktop.org
Cc: jack@suse.com
Cc: jlbec@evilplan.or
Cc: joonas.lahtinen@linux.intel.com
Cc: joseph.qi@linux.alibaba.com
Cc: jslaby@suse.com
Cc: juri.lelli@redhat.com
Cc: maarten.lankhorst@linux.intel.com
Cc: mark@fasheh.com
Cc: mhocko@kernel.org
Cc: mripard@kernel.org
Cc: ocfs2-devel@oss.oracle.com
Cc: rodrigo.vivi@intel.com
Cc: sean@poorly.run
Cc: st@kernel.org
Cc: tj@kernel.org
Cc: tytso@mit.edu
Cc: vdavydov.dev@gmail.com
Cc: vincent.guittot@linaro.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:46:10 +02:00
Linus Torvalds
18253e034d Merge branch 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache and mountpoint updates from Al Viro:
 "Saner handling of refcounts to mountpoints.

  Transfer the counting reference from struct mount ->mnt_mountpoint
  over to struct mountpoint ->m_dentry. That allows us to get rid of the
  convoluted games with ordering of mount shutdowns.

  The cost is in teaching shrink_dcache_{parent,for_umount} to cope with
  mixed-filesystem shrink lists, which we'll also need for the Slab
  Movable Objects patchset"

* 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch the remnants of releasing the mountpoint away from fs_pin
  get rid of detach_mnt()
  make struct mountpoint bear the dentry reference to mountpoint, not struct mount
  Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists
  fs/namespace.c: shift put_mountpoint() to callers of unhash_mnt()
  __detach_mounts(): lookup_mountpoint() can't return ERR_PTR() anymore
  nfs: dget_parent() never returns NULL
  ceph: don't open-code the check for dead lockref
2019-07-20 09:15:51 -07:00
Al Viro
9bdebc2bd1 Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists
Currently, running into a shrink list that contains dentries from different
filesystems can cause several unpleasant things for shrink_dcache_parent()
and for umount(2).

The first problem is that there's a window during shrink_dentry_list() between
__dentry_kill() takes a victim out and dropping reference to its parent.  During
that window the parent looks like a genuine busy dentry.  shrink_dcache_parent()
(or, worse yet, shrink_dcache_for_umount()) coming at that time will see no
eviction candidates and no indication that it needs to wait for some
shrink_dentry_list() to proceed further.

That applies for any shrink list that might intersect with the subtree we are
trying to shrink; the only reason it does not blow on umount(2) in the mainline
is that we unregister the memory shrinker before hitting shrink_dcache_for_umount().

Another problem happens if something in a mixed-filesystem shrink list gets
be stuck in e.g. iput(), getting umount of unrelated fs to spin waiting for
the stuck shrinker to get around to our dentries.

Solution:
        1) have shrink_dentry_list() decrement the parent's refcount and
make sure it's on a shrink list (ours unless it already had been on some
other) before calling __dentry_kill().  That eliminates the window when
shrink_dcache_parent() would've blown past the entire subtree without
noticing anything with zero refcount not on shrink lists.
	2) when shrink_dcache_parent() has found no eviction candidates,
but some dentries are still sitting on shrink lists, rather than
repeating the scan in hope that shrinkers have progressed, scan looking
for something on shrink lists with zero refcount.  If such a thing is
found, grab rcu_read_lock() and stop the scan, with caller locking
it for eviction, dropping out of RCU and doing __dentry_kill(), with
the same treatment for parent as shrink_dentry_list() would do.

Note that right now mixed-filesystem shrink lists do not occur, so this
is not a mainline bug.  Howevere, there's a bunch of uses for such
beasts (e.g. the "try and evict everything we can out of given page"
patches; there are potential uses in mount-related code, considerably
simplifying the life in fs/namespace.c, etc.)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-10 07:32:22 -04:00
Amir Goldstein
49246466a9 fsnotify: move fsnotify_nameremove() hook out of d_delete()
d_delete() was piggy backed for the fsnotify_nameremove() hook when
in fact not all callers of d_delete() care about fsnotify events.

For all callers of d_delete() that may be interested in fsnotify events,
we made sure to call one of fsnotify_{unlink,rmdir}() hooks before
calling d_delete().

Now we can move the fsnotify_nameremove() call from d_delete() to the
fsnotify_{unlink,rmdir}() hooks.

Two explicit calls to fsnotify_nameremove() from nfs/afs sillyrename
are also removed. This will cause a change of behavior - nfs/afs will
NOT generate an fsnotify delete event when renaming over a positive
dentry.  This change is desirable, because it is consistent with the
behavior of all other filesystems.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-06-20 14:47:44 +02:00
Thomas Gleixner
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Linus Torvalds
a9fbcd6728 Clean up fscrypt's dcache revalidation support, and other
miscellaneous cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlzSEfQACgkQ8vlZVpUN
 gaNKrQf+O4JCCc8jqhpvUcNr8+DJNhWYpvRo7yDXoWbAyA6eZHV2fTRX5Vw6T8bW
 iQAj9ofkRnakOq6JvnaUyW8eAuRcqellF7HnwFwTxGOpZ1x3UPAV/roKutAhe8sT
 9dA0VxjugBAISbL2AMQKRPYNuzV07D9As6wZRlPuliFVLLnuPG5SseHRhdn3tm1n
 Jwyipu8P6BjomFtfHT25amISaWRx/uGpjTa1fmjwUxIC8EI6V9K6hKNCAUPsk/3g
 m8zEBpBKSmPK66sFPGxddPNGYAyyFluUboQxB7DuSCF7J3cULO8TxRZbsW/5jaio
 ZR8utWezuXnrI80vG/VtCMhqG3398Q==
 =0Bak
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Clean up fscrypt's dcache revalidation support, and other
  miscellaneous cleanups"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: cache decrypted symlink target in ->i_link
  vfs: use READ_ONCE() to access ->i_link
  fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext
  fscrypt: only set dentry_operations on ciphertext dentries
  fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory
  fscrypt: fix race allowing rename() and link() of ciphertext dentries
  fscrypt: clean up and improve dentry revalidation
  fscrypt: use READ_ONCE() to access ->i_crypt_info
  fscrypt: remove WARN_ON_ONCE() when decryption fails
  fscrypt: drop inode argument from fscrypt_get_ctx()
2019-05-07 21:28:04 -07:00
Al Viro
230c6402b1 ovl_lookup_real_one(): don't bother with strlen()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-26 13:13:33 -04:00
Eric Biggers
0bf3d5c160 fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory
Make __d_move() clear DCACHE_ENCRYPTED_NAME on the source dentry.  This
is needed for when d_splice_alias() moves a directory's encrypted alias
to its decrypted alias as a result of the encryption key being added.

Otherwise, the decrypted alias will incorrectly be invalidated on the
next lookup, causing problems such as unmounting a mount the user just
mount()ed there.

Note that we don't have to support arbitrary moves of this flag because
fscrypt doesn't allow dentries with DCACHE_ENCRYPTED_NAME to be the
source or target of a rename().

Fixes: 28b4c26396 ("ext4 crypto: revalidate dentry after adding or removing the key")
Reported-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-17 10:05:51 -04:00
Al Viro
ab1152dd56 unexport d_alloc_pseudo()
No modular uses since introducion of alloc_file_pseudo(),
and the only non-modular user not in alloc_file_pseudo()
had actually been wrong - should've been d_alloc_anon().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-09 19:20:46 -04:00
Al Viro
5467a68cbf dcache: sort the freeing-without-RCU-delay mess for good.
For lockless accesses to dentries we don't have pinned we rely
(among other things) upon having an RCU delay between dropping
the last reference and actually freeing the memory.

On the other hand, for things like pipes and sockets we neither
do that kind of lockless access, nor want to deal with the
overhead of an RCU delay every time a socket gets closed.

So delay was made optional - setting DCACHE_RCUACCESS in ->d_flags
made sure it would happen.  We tried to avoid setting it unless
we knew we need it.  Unfortunately, that had led to recurring
class of bugs, in which we missed the need to set it.

We only really need it for dentries that are created by
d_alloc_pseudo(), so let's not bother with trying to be smart -
just make having an RCU delay the default.  The ones that do
*not* get it set the replacement flag (DCACHE_NORCU) and we'd
better use that sparingly.  d_alloc_pseudo() is the only
such user right now.

FWIW, the race that finally prompted that switch had been
between __lock_parent() of immediate subdirectory of what's
currently the root of a disconnected tree (e.g. from
open-by-handle in progress) racing with d_splice_alias()
elsewhere picking another alias for the same inode, either
on outright corrupted fs image, or (in case of open-by-handle
on NFS) that subdirectory having been just moved on server.
It's not easy to hit, so the sky is not falling, but that's
not the first race on similar missed cases and the logics
for settinf DCACHE_RCUACCESS has gotten ridiculously
convoluted.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-09 19:18:04 -04:00
Waiman Long
af0c9af1b3 fs/dcache: Track & report number of negative dentries
The current dentry number tracking code doesn't distinguish between
positive & negative dentries.  It just reports the total number of
dentries in the LRU lists.

As excessive number of negative dentries can have an impact on system
performance, it will be wise to track the number of positive and
negative dentries separately.

This patch adds tracking for the total number of negative dentries in
the system LRU lists and reports it in the 5th field in the
/proc/sys/fs/dentry-state file.  The number, however, does not include
negative dentries that are in flight but not in the LRU yet as well as
those in the shrinker lists which are on the way out anyway.

The number of positive dentries in the LRU lists can be roughly found by
subtracting the number of negative dentries from the unused count.

Matthew Wilcox had confirmed that since the introduction of the
dentry_stat structure in 2.1.60, the dummy array was there, probably for
future extension.  They were not replacements of pre-existing fields.
So no sane applications that read the value of /proc/sys/fs/dentry-state
will do dummy thing if the last 2 fields of the sysctl parameter are not
zero.  IOW, it will be safe to use one of the dummy array entry for
negative dentry count.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30 11:02:11 -08:00
Waiman Long
1dbd449c99 fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb()
The nr_dentry_unused per-cpu counter tracks dentries in both the LRU
lists and the shrink lists where the DCACHE_LRU_LIST bit is set.

The shrink_dcache_sb() function moves dentries from the LRU list to a
shrink list and subtracts the dentry count from nr_dentry_unused.  This
is incorrect as the nr_dentry_unused count will also be decremented in
shrink_dentry_list() via d_shrink_del().

To fix this double decrement, the decrement in the shrink_dcache_sb()
function is taken out.

Fixes: 4e717f5c10 ("list_lru: remove special case function list_lru_dispose_all."
Cc: stable@kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30 11:02:11 -08:00
Mike Rapoport
57c8a661d9 mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
  Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Vlastimil Babka
2e03b4bc4a dcache: allocate external names from reclaimable kmalloc caches
We can use the newly introduced kmalloc-reclaimable-X caches, to allocate
external names in dcache, which will take care of the proper accounting
automatically, and also improve anti-fragmentation page grouping.

This effectively reverts commit f1782c9bc5 ("dcache: account external
names as indirectly reclaimable memory") and instead passes
__GFP_RECLAIMABLE to kmalloc().  The accounting thus moves from
NR_INDIRECTLY_RECLAIMABLE_BYTES to NR_SLAB_RECLAIMABLE, which is also
considered in MemAvailable calculation and overcommit decisions.

Link: http://lkml.kernel.org/r/20180731090649.16028-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Tetsuo Handa
6cd00a01f0 fs/dcache.c: fix kmemcheck splat at take_dentry_name_snapshot()
Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes
are initialized at __d_alloc(), we can't copy the whole size
unconditionally.

 WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50)
 636f6e66696766732e746d70000000000010000000000000020000000188ffff
  i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u
                                  ^
 RIP: 0010:take_dentry_name_snapshot+0x28/0x50
 RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246
 RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002
 RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60
 RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001
 R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00
 R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000
 FS:  00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0
  take_dentry_name_snapshot+0x28/0x50
  vfs_rename+0x128/0x870
  SyS_rename+0x3b2/0x3d0
  entry_SYSCALL_64_fastpath+0x1a/0xa4
  0xffffffffffffffff

Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Linus Torvalds
4591343e35 Merge branches 'work.misc' and 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Misc cleanups from various folks all over the place

  I expected more fs/dcache.c cleanups this cycle, so that went into a
  separate branch. Said cleanups have missed the window, so in the
  hindsight it could've gone into work.misc instead. Decided not to
  cherry-pick, thus the 'work.dcache' branch"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: dcache: Use true and false for boolean values
  fold generic_readlink() into its only caller
  fs: shave 8 bytes off of struct inode
  fs: Add more kernel-doc to the produced documentation
  fs: Fix attr.c kernel-doc
  removed extra extern file_fdatawait_range

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill dentry_update_name_case()
2018-08-13 21:28:25 -07:00
Linus Torvalds
0ea97a2d61 Merge branch 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs icache updates from Al Viro:

 - NFS mkdir/open_by_handle race fix

 - analogous solution for FUSE, replacing the one currently in mainline

 - new primitive to be used when discarding halfway set up inodes on
   failed object creation; gives sane warranties re icache lookups not
   returning such doomed by still not freed inodes. A bunch of
   filesystems switched to that animal.

 - Miklos' fix for last cycle regression in iget5_locked(); -stable will
   need a slightly different variant, unfortunately.

 - misc bits and pieces around things icache-related (in adfs and jfs).

* 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  jfs: don't bother with make_bad_inode() in ialloc()
  adfs: don't put inodes into icache
  new helper: inode_fake_hash()
  vfs: don't evict uninitialized inode
  jfs: switch to discard_new_inode()
  ext2: make sure that partially set up inodes won't be returned by ext2_iget()
  udf: switch to discard_new_inode()
  ufs: switch to discard_new_inode()
  btrfs: switch to discard_new_inode()
  new primitive: discard_new_inode()
  kill d_instantiate_no_diralias()
  nfs_instantiate(): prevent multiple aliases for directory inode
2018-08-13 20:25:58 -07:00
Al Viro
4c0d7cd5c8 make sure that __dentry_kill() always invalidates d_seq, unhashed or not
RCU pathwalk relies upon the assumption that anything that changes
->d_inode of a dentry will invalidate its ->d_seq.  That's almost
true - the one exception is that the final dput() of already unhashed
dentry does *not* touch ->d_seq at all.  Unhashing does, though,
so for anything we'd found by RCU dcache lookup we are fine.
Unfortunately, we can *start* with an unhashed dentry or jump into
it.

We could try and be careful in the (few) places where that could
happen.  Or we could just make the final dput() invalidate the damn
thing, unhashed or not.  The latter is much simpler and easier to
backport, so let's do it that way.

Reported-by: "Dae R. Jeong" <threeearcat@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-09 18:07:15 -04:00
Al Viro
90bad5e05b root dentries need RCU-delayed freeing
Since mountpoint crossing can happen without leaving lazy mode,
root dentries do need the same protection against having their
memory freed without RCU delay as everything else in the tree.

It's partially hidden by RCU delay between detaching from the
mount tree and dropping the vfsmount reference, but the starting
point of pathwalk can be on an already detached mount, in which
case umount-caused RCU delay has already passed by the time the
lazy pathwalk grabs rcu_read_lock().  If the starting point
happens to be at the root of that vfsmount *and* that vfsmount
covers the entire filesystem, we get trouble.

Fixes: 48a066e72d ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-06 09:13:32 -04:00
Gustavo A. R. Silva
7964410fcf fs: dcache: Use true and false for boolean values
Return statements in functions returning bool should use true or false
instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-05 15:52:44 -04:00
Al Viro
c2b6d621c4 new primitive: discard_new_inode()
We don't want open-by-handle picking half-set-up in-core
struct inode from e.g. mkdir() having failed halfway through.
In other words, we don't want such inodes returned by iget_locked()
on their way to extinction.  However, we can't just have them
unhashed - otherwise open-by-handle immediately *after* that would've
ended up creating a new in-core inode over the on-disk one that
is in process of being freed right under us.

	Solution: new flag (I_CREATING) set by insert_inode_locked() and
removed by unlock_new_inode() and a new primitive (discard_new_inode())
to be used by such halfway-through-setup failure exits instead of
unlock_new_inode() / iput() combinations.  That primitive unlocks new
inode, but leaves I_CREATING in place.

	iget_locked() treats finding an I_CREATING inode as failure
(-ESTALE, once we sort out the error propagation).
	insert_inode_locked() treats the same as instant -EBUSY.
	ilookup() treats those as icache miss.

[Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 15:55:30 -04:00
Al Viro
c971e6a006 kill d_instantiate_no_diralias()
The only user is fuse_create_new_entry(), and there it's used to
mitigate the same mkdir/open-by-handle race as in nfs_mkdir().
The same solution applies - unhash the mkdir argument, then
call d_splice_alias() and if that returns a reference to preexisting
alias, dput() and report success.  ->mkdir() argument left unhashed
negative with the preexisting alias moved in the right place is just
fine from the ->mkdir() callers point of view.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-01 23:18:53 -04:00
Al Viro
63a67a926e kill dentry_update_name_case()
the last user is gone

Spotted-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-23 17:16:44 -04:00
Linus Torvalds
f956d08a56 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Misc bits and pieces not fitting into anything more specific"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: delete unnecessary assignment in vfs_listxattr
  Documentation: filesystems: update filesystem locking documentation
  vfs: namei: use path_equal() in follow_dotdot()
  fs.h: fix outdated comment about file flags
  __inode_security_revalidate() never gets NULL opt_dentry
  make xattr_getsecurity() static
  vfat: simplify checks in vfat_lookup()
  get rid of dead code in d_find_alias()
  it's SB_BORN, not MS_BORN...
  msdos_rmdir(): kill BS comment
  remove rpc_rmdir()
  fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()
2018-06-04 10:14:28 -07:00
Linus Torvalds
06c86e66d6 Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
 "This is the first part of dealing with livelocks etc around
  shrink_dcache_parent()."

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  restore cond_resched() in shrink_dcache_parent()
  dput(): turn into explicit while() loop
  dcache: move cond_resched() into the end of __dentry_kill()
  d_walk(): kill 'finish' callback
  d_invalidate(): unhash immediately
2018-06-04 08:57:36 -07:00
Al Viro
61fec493c9 get rid of dead code in d_find_alias()
All "try disconnected alias if nothing else fits" logics in d_find_alias()
got accidentally disabled by Neil a while ago; for most of the callers it
was the right thing to do, so fixes belong in few callers that *do* want
disconnected aliases.  This just takes the now-dead code in d_find_alias()
out.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-13 12:08:32 -04:00
Al Viro
1e2e547a93 do d_instantiate/unlock_new_inode combinations safely
For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
	lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex.  Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.

	Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode().  All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.

Cc: stable@kernel.org	# 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-11 15:36:37 -04:00
Al Viro
4fb4887140 restore cond_resched() in shrink_dcache_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-19 23:58:48 -04:00
Al Viro
1088a6408c dput(): turn into explicit while() loop
No need to mess with gotos when the code yielded by straight while()
isn't any worse...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15 23:36:58 -04:00
Al Viro
9c5f1d3019 dcache: move cond_resched() into the end of __dentry_kill()
cond_resched() in shrink_dentry_list() is too early

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15 23:36:58 -04:00
Al Viro
3a8e3611e0 d_walk(): kill 'finish' callback
no users left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15 23:36:57 -04:00
Al Viro
ff17fa561a d_invalidate(): unhash immediately
Once that is done, we can just hunt mountpoints down one by one;
no new mountpoints can be added from now on, so we don't need
anything tricky in finish() callback, etc.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15 23:36:57 -04:00
Nikolay Borisov
32785c0539 fs/dcache.c: add cond_resched() in shrink_dentry_list()
As previously reported (https://patchwork.kernel.org/patch/8642031/)
it's possible to call shrink_dentry_list with a large number of dentries
(> 10000).  This, in turn, could trigger the softlockup detector and
possibly trigger a panic.  In addition to the unmount path being
vulnerable to this scenario, at SuSE we've observed similar situation
happening during process exit on processes that touch a lot of dentries.
Here is an excerpt from a crash dump.  The number after the colon are
the number of dentries on the list passed to shrink_dentry_list:

PID 99760: 10722
PID 107530: 215
PID 108809: 24134
PID 108877: 21331
PID 141708: 16487

So we want to kill between 15k-25k dentries without yielding.

And one possible call stack looks like:

4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
5 [ffff8839ece41db0] evict at ffffffff811c3026
6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909

While some of the callers of shrink_dentry_list do use cond_resched,
this is not sufficient to prevent softlockups.  So just move
cond_resched into shrink_dentry_list from its callers.

David said: I've found hundreds of occurrences of warnings that we emit
when need_resched stays set for a prolonged period of time with the
stack trace that is included in the change log.

Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Roman Gushchin
f1782c9bc5 dcache: account external names as indirectly reclaimable memory
I received a report about suspicious growth of unreclaimable slabs on
some machines.  I've found that it happens on machines with low memory
pressure, and these unreclaimable slabs are external names attached to
dentries.

External names are allocated using generic kmalloc() function, so they
are accounted as unreclaimable.  But they are held by dentries, which
are reclaimable, and they will be reclaimed under the memory pressure.

In particular, this breaks MemAvailable calculation, as it doesn't take
unreclaimable slabs into account.  This leads to a silly situation, when
a machine is almost idle, has no memory pressure and therefore has a big
dentry cache.  And the resulting MemAvailable is too low to start a new
workload.

To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
used to track the amount of memory, consumed by external names.  The
counter is increased in the dentry allocation path, if an external name
structure is allocated; and it's decreased in the dentry freeing path.

To reproduce the problem I've used the following Python script:

  import os

  for iter in range (0, 10000000):
      try:
          name = ("/some_long_name_%d" % iter) + "_" * 220
          os.stat(name)
      except Exception:
          pass

Without this patch:
  $ cat /proc/meminfo | grep MemAvailable
  MemAvailable:    7811688 kB
  $ python indirect.py
  $ cat /proc/meminfo | grep MemAvailable
  MemAvailable:    2753052 kB

With the patch:
  $ cat /proc/meminfo | grep MemAvailable
  MemAvailable:    7809516 kB
  $ python indirect.py
  $ cat /proc/meminfo | grep MemAvailable
  MemAvailable:    7749144 kB

[guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB]
  Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com
[guro@fb.com: fix indirectly reclaimable memory accounting]
  Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com
Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Al Viro
cbd4a5bcb2 d_genocide: move export to definition
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:08:21 -04:00
Al Viro
42177007aa fold dentry_lock_for_move() into its sole caller and clean it up
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:49 -04:00
Al Viro
076515fc92 make non-exchanging __d_move() copy ->d_parent rather than swap them
Currently d_move(from, to) does the following:
	* name/parent of from <- old name/parent of to, from hashed there
	* to is unhashed
	* name of to is preserved
	* if from used to be detached, to gets detached
	* if from used to be attached, parent of to <- old parent of from.

That's both user-visibly bogus and complicates reasoning a lot.
Much saner semantics would be
	* name/parent of from <- name/parent of to, from hashed there.
	* to is unhashed
	* name/parent of to is unchanged.

The price, of course, is that old parent of from might lose a reference.
However,
	* all potentially cross-directory callers of d_move() have both
parents pinned directly; typically, dentries themselves are grabbed
only after we have grabbed and locked both parents.  IOW, the decrement
of old parent's refcount in case of d_move() won't reach zero.
	* __d_move() from d_splice_alias() is done to detached alias.
No refcount decrements in that case
	* __d_move() from __d_unalias() *can* get the refcount to zero.
So let's grab a reference to alias' old parent before calling __d_unalias()
and dput() it after we'd dropped rename_lock.

That does make d_splice_alias() potentially blocking.  However, it has
no callers in non-sleepable contexts (and the case where we'd grown
that dget/dput pair is _very_ rare, so performance is not an issue).

Another thing that needs adjustment is unlocking in the end of __d_move();
folded it in.  And cleaned the remnants of bogus ordering from the
"lock them in the beginning" counterpart - it's never been right and
now (well, for 7 years now) we have that thing always serialized on
rename_lock anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:49 -04:00
Al Viro
7a5cf791a7 split d_path() and friends into a separate file
Those parts of fs/dcache.c are pretty much self-contained.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:46 -04:00
Al Viro
43986d63b6 dcache.c: trim includes
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:45 -04:00
John Ogness
8f04da2adb fs/dcache: Avoid a try_lock loop in shrink_dentry_list()
shrink_dentry_list() holds dentry->d_lock and needs to acquire
dentry->d_inode->i_lock. This cannot be done with a spin_lock()
operation because it's the reverse of the regular lock order.
To avoid ABBA deadlocks it is done with a trylock loop.

Trylock loops are problematic in two scenarios:

  1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
     preemptible. As a consequence the i_lock holder can be preempted
     by a higher priority task. If that task executes the trylock loop
     it will do so forever and live lock.

  2) In virtual machines trylock loops are problematic as well. The
     VCPU on which the i_lock holder runs can be scheduled out and a
     task on a different VCPU can loop for a whole time slice. In the
     worst case this can lead to starvation. Commits 47be61845c
     ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45
     ("shrink_dentry_list(): take parent's d_lock earlier") are
     addressing exactly those symptoms.

Avoid the trylock loop by using dentry_kill(). When pruning ancestors,
the same code applies that is used to kill a dentry in dput(). This
also has the benefit that the locking order is now the same. First
the inode is locked, then the parent.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:44 -04:00
Al Viro
f657a666fd get rid of trylock loop around dentry_kill()
In case when trylock in there fails, deal with it directly in
dentry_kill().  Note that in cases when we drop and retake
->d_lock, we need to recheck whether to retain the dentry.
Another thing is that dropping/retaking ->d_lock might have
ended up with negative dentry turning into positive; that,
of course, can happen only once...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:44 -04:00
Al Viro
62d9956cef handle move to LRU in retain_dentry()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:43 -04:00
Al Viro
a338579f2f dput(): consolidate the "do we need to retain it?" into an inlined helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:43 -04:00
Al Viro
8b987a46a1 split the slow part of lock_parent() off
Turn the "trylock failed" part into uninlined __lock_parent().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:42 -04:00
Al Viro
65d8eb5a8f now lock_parent() can't run into killed dentry
all remaining callers hold either a reference or ->i_lock

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:42 -04:00
Al Viro
3b3f09f48b get rid of trylock loop in locking dentries on shrink list
In case of trylock failure don't re-add to the list - drop the locks
and carefully get them in the right order.  For shrink_dentry_list(),
somebody having grabbed a reference to dentry means that we can
kick it off-list, so if we find dentry being modified under us we
don't need to play silly buggers with retries anyway - off the list
it is.

The locking logics taken out into a helper of its own; lock_parent()
is no longer used for dentries that can be killed under us.

[fix from Eric Biggers folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:02 -04:00
Al Viro
c19457f0ae d_delete(): get rid of trylock loop
just grab ->i_lock first; we have a positive dentry, nothing's going
to happen to inode

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-12 11:59:14 -04:00
John Ogness
c1d0c1a2b5 fs/dcache: Move dentry_kill() below lock_parent()
A subsequent patch will modify dentry_kill() to call lock_parent().
Move the dentry_kill() implementation "as is" below lock_parent()
first. This will help simplify the review of the subsequent patch
with dentry_kill() changes.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-12 11:59:13 -04:00
John Ogness
06080d100d fs/dcache: Remove stale comment from dentry_kill()
Commit 0d98439ea3 ("vfs: use lockred "dead" flag to mark unrecoverably
dead dentries") removed the `ref' parameter in dentry_kill() but its
documentation remained. Remove it.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-12 11:59:13 -04:00
Al Viro
0632a9ac7b take write_seqcount_invalidate() into __d_drop()
... and reorder it with making d_unhashed() true.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-12 11:59:12 -04:00
Will Deacon
8cc07c808c fs: dcache: Use READ_ONCE when accessing i_dir_seq
i_dir_seq is subject to concurrent modification by a cmpxchg or
store-release operation, so ensure that the relaxed access in
d_alloc_parallel uses READ_ONCE.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-25 12:51:10 -05:00
Will Deacon
015555fd4d fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
If d_alloc_parallel runs concurrently with __d_add, it is possible for
d_alloc_parallel to continuously retry whilst i_dir_seq has been
incremented to an odd value by __d_add:

CPU0:
__d_add
	n = start_dir_add(dir);
		cmpxchg(&dir->i_dir_seq, n, n + 1) == n

CPU1:
d_alloc_parallel
retry:
	seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
	hlist_bl_lock(b);
		bit_spin_lock(0, (unsigned long *)b); // Always succeeds

CPU0:
	__d_lookup_done(dentry)
		hlist_bl_lock
			bit_spin_lock(0, (unsigned long *)b); // Never succeeds

CPU1:
	if (unlikely(parent->d_inode->i_dir_seq != seq)) {
		hlist_bl_unlock(b);
		goto retry;
	}

Since the simple bit_spin_lock used to implement hlist_bl_lock does not
provide any fairness guarantees, then CPU1 can starve CPU0 of the lock
and prevent it from reaching end_dir_add(dir), therefore CPU1 cannot
exit its retry loop because the sequence number always has the bottom
bit set.

This patch resolves the livelock by not taking hlist_bl_lock in
d_alloc_parallel if the sequence counter is odd, since any subsequent
masked comparison with i_dir_seq will fail anyway.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Naresh Madhusudana <naresh.madhusudana@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-25 12:51:09 -05:00
Al Viro
3b82140963 lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
In case when dentry passed to lock_parent() is protected from freeing only
by the fact that it's on a shrink list and trylock of parent fails, we
could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
between unlocking dentry and locking presumed parent.  We need to recheck
that dentry is alive once we lock both it and parent *and* postpone
rcu_read_unlock() until after that point.  Otherwise we could return
a pointer to struct dentry that already is rcu-scheduled for freeing, with
->d_lock held on it; caller's subsequent attempt to unlock it can end
up with memory corruption.

Cc: stable@vger.kernel.org # 3.12+, counting backports
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-23 20:47:17 -05:00
Linus Torvalds
139351f1f9 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This work from Amir adds NFS export capability to overlayfs. NFS
  exporting an overlay filesystem is a challange because we want to keep
  track of any copy-up of a file or directory between encoding the file
  handle and decoding it.

  This is achieved by indexing copied up objects by lower layer file
  handle. The index is already used for hard links, this patchset
  extends the use to NFS file handle decoding"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (51 commits)
  ovl: check ERR_PTR() return value from ovl_encode_fh()
  ovl: fix regression in fsnotify of overlay merge dir
  ovl: wire up NFS export operations
  ovl: lookup indexed ancestor of lower dir
  ovl: lookup connected ancestor of dir in inode cache
  ovl: hash non-indexed dir by upper inode for NFS export
  ovl: decode pure lower dir file handles
  ovl: decode indexed dir file handles
  ovl: decode lower file handles of unlinked but open files
  ovl: decode indexed non-dir file handles
  ovl: decode lower non-dir file handles
  ovl: encode lower file handles
  ovl: copy up before encoding non-connectable dir file handle
  ovl: encode non-indexed upper file handles
  ovl: decode connected upper dir file handles
  ovl: decode pure upper file handles
  ovl: encode pure upper file handles
  ovl: document NFS export
  vfs: factor out helpers d_instantiate_anon() and d_alloc_anon()
  ovl: store 'has_upper' and 'opaque' as bit flags
  ...
2018-02-05 13:05:20 -08:00
Linus Torvalds
617aebe6a9 Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
 available to be copied to/from userspace in the face of bugs. To further
 restrict what memory is available for copying, this creates a way to
 whitelist specific areas of a given slab cache object for copying to/from
 userspace, allowing much finer granularity of access control. Slab caches
 that are never exposed to userspace can declare no whitelist for their
 objects, thereby keeping them unavailable to userspace via dynamic copy
 operations. (Note, an implicit form of whitelisting is the use of constant
 sizes in usercopy operations and get_user()/put_user(); these bypass all
 hardened usercopy checks since these sizes cannot change at runtime.)
 
 This new check is WARN-by-default, so any mistakes can be found over the
 next several releases without breaking anyone's system.
 
 The series has roughly the following sections:
 - remove %p and improve reporting with offset
 - prepare infrastructure and whitelist kmalloc
 - update VFS subsystem with whitelists
 - update SCSI subsystem with whitelists
 - update network subsystem with whitelists
 - update process memory with whitelists
 - update per-architecture thread_struct with whitelists
 - update KVM with whitelists and fix ioctl bug
 - mark all other allocations as not whitelisted
 - update lkdtm for more sensible test overage
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJabvleAAoJEIly9N/cbcAmO1kQAJnjVPutnLSbnUteZxtsv7W4
 43Cggvokfxr6l08Yh3hUowNxZVKjhF9uwMVgRRg9Nl5WdYCN+vCQbHz+ZdzGJXKq
 cGqdKWgexMKX+aBdNDrK7BphUeD46sH7JWR+a/lDV/BgPxBCm9i5ZZCgXbPP89AZ
 NpLBji7gz49wMsnm/x135xtNlZ3dG0oKETzi7MiR+NtKtUGvoIszSKy5JdPZ4m8q
 9fnXmHqmwM6uQFuzDJPt1o+D1fusTuYnjI7EgyrJRRhQ+BB3qEFZApXnKNDRS9Dm
 uB7jtcwefJCjlZVCf2+PWTOEifH2WFZXLPFlC8f44jK6iRW2Nc+wVRisJ3vSNBG1
 gaRUe/FSge68eyfQj5OFiwM/2099MNkKdZ0fSOjEBeubQpiFChjgWgcOXa5Bhlrr
 C4CIhFV2qg/tOuHDAF+Q5S96oZkaTy5qcEEwhBSW15ySDUaRWFSrtboNt6ZVOhug
 d8JJvDCQWoNu1IQozcbv6xW/Rk7miy8c0INZ4q33YUvIZpH862+vgDWfTJ73Zy9H
 jR/8eG6t3kFHKS1vWdKZzOX1bEcnd02CGElFnFYUEewKoV7ZeeLsYX7zodyUAKyi
 Yp5CImsDbWWTsptBg6h9nt2TseXTxYCt2bbmpJcqzsqSCUwOQNQ4/YpuzLeG0ihc
 JgOmUnQNJWCTwUUw5AS1
 =tzmJ
 -----END PGP SIGNATURE-----

Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardened usercopy whitelisting from Kees Cook:
 "Currently, hardened usercopy performs dynamic bounds checking on slab
  cache objects. This is good, but still leaves a lot of kernel memory
  available to be copied to/from userspace in the face of bugs.

  To further restrict what memory is available for copying, this creates
  a way to whitelist specific areas of a given slab cache object for
  copying to/from userspace, allowing much finer granularity of access
  control.

  Slab caches that are never exposed to userspace can declare no
  whitelist for their objects, thereby keeping them unavailable to
  userspace via dynamic copy operations. (Note, an implicit form of
  whitelisting is the use of constant sizes in usercopy operations and
  get_user()/put_user(); these bypass all hardened usercopy checks since
  these sizes cannot change at runtime.)

  This new check is WARN-by-default, so any mistakes can be found over
  the next several releases without breaking anyone's system.

  The series has roughly the following sections:
   - remove %p and improve reporting with offset
   - prepare infrastructure and whitelist kmalloc
   - update VFS subsystem with whitelists
   - update SCSI subsystem with whitelists
   - update network subsystem with whitelists
   - update process memory with whitelists
   - update per-architecture thread_struct with whitelists
   - update KVM with whitelists and fix ioctl bug
   - mark all other allocations as not whitelisted
   - update lkdtm for more sensible test overage"

* tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
  lkdtm: Update usercopy tests for whitelisting
  usercopy: Restrict non-usercopy caches to size 0
  kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
  kvm: whitelist struct kvm_vcpu_arch
  arm: Implement thread_struct whitelist for hardened usercopy
  arm64: Implement thread_struct whitelist for hardened usercopy
  x86: Implement thread_struct whitelist for hardened usercopy
  fork: Provide usercopy whitelisting for task_struct
  fork: Define usercopy region in thread_stack slab caches
  fork: Define usercopy region in mm_struct slab caches
  net: Restrict unwhitelisted proto caches to size 0
  sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
  sctp: Define usercopy region in SCTP proto slab cache
  caif: Define usercopy region in caif proto slab cache
  ip: Define usercopy region in IP proto slab cache
  net: Define usercopy region in struct proto slab cache
  scsi: Define usercopy region in scsi_sense_cache slab cache
  cifs: Define usercopy region in cifs_request slab cache
  vxfs: Define usercopy region in vxfs_inode slab cache
  ufs: Define usercopy region in ufs_inode_cache slab cache
  ...
2018-02-03 16:25:42 -08:00
Linus Torvalds
8e44e6600c Merge branch 'KASAN-read_word_at_a_time'
Merge KASAN word-at-a-time fixups from Andrey Ryabinin.

The word-at-a-time optimizations have caused headaches for KASAN, since
the whole point is that we access byte streams in bigger chunks, and
KASAN can be unhappy about the potential extra access at the end of the
string.

We used to have a horrible hack in dcache, and then people got
complaints from the strscpy() case.  This fixes it all up properly, by
adding an explicit helper for the "access byte stream one word at a
time" case.

* emailed patches from Andrey Ryabinin <aryabinin@virtuozzo.com>:
  fs: dcache: Revert "manually unpoison dname after allocation to shut up kasan's reports"
  fs/dcache: Use read_word_at_a_time() in dentry_string_cmp()
  lib/strscpy: Shut up KASAN false-positives in strscpy()
  compiler.h: Add read_word_at_a_time() function.
  compiler.h, kasan: Avoid duplicating __read_once_size_nocheck()
2018-02-01 12:20:53 -08:00
Andrey Ryabinin
babcbbc7c4 fs: dcache: Revert "manually unpoison dname after allocation to shut up kasan's reports"
This reverts commit df4c0e36f1.

It's no longer needed since dentry_string_cmp() now uses
read_word_at_a_time() to avoid kasan's reports.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 12:20:21 -08:00
Andrey Ryabinin
bfe7aa6c39 fs/dcache: Use read_word_at_a_time() in dentry_string_cmp()
dentry_string_cmp() performs the word-at-a-time reads from 'cs' and may
read slightly more than it was requested in kmallac().  Normally this
would make KASAN to report out-of-bounds access, but this was
workarounded by commit df4c0e36f1 ("fs: dcache: manually unpoison
dname after allocation to shut up kasan's reports").

This workaround is not perfect, since it allows out-of-bounds access to
dentry's name for all the code, not just in dentry_string_cmp().

So it would be better to use read_word_at_a_time() instead and revert
commit df4c0e36f1.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 12:20:21 -08:00
Linus Torvalds
dc1efc3cfa Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
 "Neil Brown's d_move()/d_path() race fix"

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: close race between getcwd() and d_move()
2018-01-31 19:15:23 -08:00
Linus Torvalds
19e7b5f994 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "All kinds of misc stuff, without any unifying topic, from various
  people.

  Neil's d_anon patch, several bugfixes, introduction of kvmalloc
  analogue of kmemdup_user(), extending bitfield.h to deal with
  fixed-endians, assorted cleanups all over the place..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  alpha: osf_sys.c: use timespec64 where appropriate
  alpha: osf_sys.c: fix put_tv32 regression
  jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
  dcache: delete unused d_hash_mask
  dcache: subtract d_hash_shift from 32 in advance
  fs/buffer.c: fold init_buffer() into init_page_buffers()
  fs: fold __inode_permission() into inode_permission()
  fs: add RWF_APPEND
  sctp: use vmemdup_user() rather than badly open-coding memdup_user()
  snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
  replace_user_tlv(): switch to vmemdup_user()
  new primitive: vmemdup_user()
  memdup_user(): switch to GFP_USER
  eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
  eventfd: fold eventfd_ctx_read() into eventfd_read()
  eventfd: convert to use anon_inode_getfd()
  nfs4file: get rid of pointless include of btrfs.h
  uvc_v4l2: clean copyin/copyout up
  vme_user: don't use __copy_..._user()
  usx2y: don't bother with memdup_user() for 16-byte structure
  ...
2018-01-31 09:25:20 -08:00
Alexey Dobriyan
b35d786b67 dcache: delete unused d_hash_mask
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-25 19:34:30 -05:00
Alexey Dobriyan
854d3e6343 dcache: subtract d_hash_shift from 32 in advance
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-25 19:34:29 -05:00
Miklos Szeredi
f9c34674bc vfs: factor out helpers d_instantiate_anon() and d_alloc_anon()
Those helpers are going to be used by overlayfs to implement
NFS export decode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:59 +01:00
Amir Goldstein
e8f9e5b780 ovl: verify directory index entries on mount
Directory index entries should have 'upper' xattr pointing to the real
upper dir. Verifying that the upper dir file handle is not stale is
expensive, so only verify stale directory index entries on mount if
NFS export feature is enabled.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:53 +01:00
David Windsor
6a9b88204c vfs: Define usercopy region in names_cache slab caches
VFS pathnames are stored in the names_cache slab cache, either inline
or across an entire allocation entry (when approaching PATH_MAX). These
are copied to/from userspace, so they must be entirely whitelisted.

cache object allocation:
    include/linux/fs.h:
        #define __getname()    kmem_cache_alloc(names_cachep, GFP_KERNEL)

example usage trace:
    strncpy_from_user+0x4d/0x170
    getname_flags+0x6f/0x1f0
    user_path_at_empty+0x23/0x40
    do_mount+0x69/0xda0
    SyS_mount+0x83/0xd0

    fs/namei.c:
        getname_flags(...):
            ...
            result = __getname();
            ...
            kname = (char *)result->iname;
            result->name = kname;
            len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
            ...
            if (unlikely(len == EMBEDDED_NAME_MAX)) {
                const size_t size = offsetof(struct filename, iname[1]);
                kname = (char *)result;

                result = kzalloc(size, GFP_KERNEL);
                ...
                result->name = kname;
                len = strncpy_from_user(kname, filename, PATH_MAX);

In support of usercopy hardening, this patch defines the entire cache
object in the names_cache slab cache as whitelisted, since it may entirely
hold name strings to be copied to/from userspace.

This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, add usage trace]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 12:07:50 -08:00
David Windsor
80344266c1 dcache: Define usercopy region in dentry_cache slab cache
When a dentry name is short enough, it can be stored directly in the
dentry itself (instead in a separate kmalloc allocation). These dentry
short names, stored in struct dentry.d_iname and therefore contained in
the dentry_cache slab cache, need to be coped to userspace.

cache object allocation:
    fs/dcache.c:
        __d_alloc(...):
            ...
            dentry = kmem_cache_alloc(dentry_cache, ...);
            ...
            dentry->d_name.name = dentry->d_iname;

example usage trace:
    filldir+0xb0/0x140
    dcache_readdir+0x82/0x170
    iterate_dir+0x142/0x1b0
    SyS_getdents+0xb5/0x160

    fs/readdir.c:
        (called via ctx.actor by dir_emit)
        filldir(..., const char *name, ...):
            ...
            copy_to_user(..., name, namlen)

    fs/libfs.c:
        dcache_readdir(...):
            ...
            next = next_positive(dentry, p, 1)
            ...
            dir_emit(..., next->d_name.name, ...)

In support of usercopy hardening, this patch defines a region in the
dentry_cache slab cache in which userspace copy operations are allowed.

This region is known as the slab cache's usercopy region. Slab caches can
now check that each dynamic copy operation involving cache-managed memory
falls entirely within the slab's usercopy region.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust hunks for kmalloc-specific things moved later]
[kees: adjust commit log, provide usage trace]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 12:07:50 -08:00
NeilBrown
61647823aa VFS: close race between getcwd() and d_move()
d_move() will call __d_drop() and then __d_rehash()
on the dentry being moved.  This creates a small window
when the dentry appears to be unhashed.  Many tests
of d_unhashed() are made under ->d_lock and so are safe
from racing with this window, but some aren't.
In particular, getcwd() calls d_unlinked() (which calls
d_unhashed()) without d_lock protection, so it can race.

This races has been seen in practice with lustre, which uses d_move() as
part of name lookup.  See:
   https://jira.hpdd.intel.com/browse/LU-9735
It could race with a regular rename(), and result in ENOENT instead
of either the 'before' or 'after' name.

The race can be demonstrated with a simple program which
has two threads, one renaming a directory back and forth
while another calls getcwd() within that directory: it should never
fail, but does.  See:
  https://patchwork.kernel.org/patch/9455345/

We could fix this race by taking d_lock and rechecking when
d_unhashed() reports true.  Alternately when can remove the window,
which is the approach this patch takes.

___d_drop() is introduce which does *not* clear d_hash.pprev
so the dentry still appears to be hashed.  __d_drop() calls
___d_drop(), then clears d_hash.pprev.
__d_move() now uses ___d_drop() and only clears d_hash.pprev
when not rehashing.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-28 14:12:09 -05:00
NeilBrown
f1ee616214 VFS: don't keep disconnected dentries on d_anon
The original purpose of the per-superblock d_anon list was to
keep disconnected dentries in the cache between consecutive
requests to the NFS server.  Dentries can be disconnected if
a client holds a file open and repeatedly performs IO on it,
and if the server drops the dentry, whether due to memory
pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches".

This purpose was thwarted by commit 75a6f82a0d ("freeing unlinked
file indefinitely delayed") which caused disconnected dentries
to be freed as soon as their refcount reached zero.

This means that, when a dentry being used by nfsd gets disconnected, a
new one needs to be allocated for every request (unless requests
overlap).  As the dentry has no name, no parent, and no children,
there is little of value to cache.  As small memory allocations are
typically fast (from per-cpu free lists) this likely has little cost.

This means that the original purpose of s_anon is no longer relevant:
there is no longer any need to keep disconnected dentries on a list so
they appear to be hashed.

However, s_anon now has a new use.  When you mount an NFS filesystem,
the dentry stored in s_root is just a placebo.  The "real" root dentry
is allocated using d_obtain_root() and so it kept on the s_anon list.
I don't know the reason for this, but suspect it related to NFSv4
where a mount of "server:/some/path" require NFS to look up the root
filehandle on the server, then walk down "/some" and "/path" to get
the filehandle to mount.

Whatever the reason, NFS depends on the s_anon list and on
shrink_dcache_for_umount() pruning all dentries on this list.  So we
cannot simply remove s_anon.

We could just leave the code unchanged, but apart from that being
potentially confusing, the (unfair) bit-spin-lock which protects
s_anon can become a bottle neck when lots of disconnected dentries are
being created.

So this patch renames s_anon to s_roots, and stops storing
disconnected dentries on the list.  Only dentries obtained with
d_obtain_root() are now stored on this list.  There are many fewer of
these (only NFS and NILFS2 use the call, and only during filesystem
mount) so contention on the bit-lock will not be a problem.

Possibly an alternate solution should be found for NFS and NILFS2, but
that would require understanding their needs first.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-25 20:22:07 -05:00
Yang Shi
9c5650359a vfs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by vfs at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-07 14:23:30 -05:00
Paul E. McKenney
7088efa913 fs/dcache: Use release-acquire for name/length update
The code in __d_alloc() carefully orders filling in the NUL character
of the name (and the length, hash, and the name itself) with assigning
of the name itself.  However, prepend_name() does not order the accesses
to the ->name and ->len fields, other than on TSO systems.  This commit
therefore replaces prepend_name()'s READ_ONCE() of ->name with an
smp_load_acquire(), which orders against the subsequent READ_ONCE() of
->len.  Because READ_ONCE() now incorporates smp_read_barrier_depends(),
prepend_name()'s smp_read_barrier_depends() is removed.  Finally,
to save a line, the smp_wmb()/store pair in __d_alloc() is replaced
by smp_store_release().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
2017-12-04 10:52:52 -08:00
Levin, Alexander (Sasha Levin)
4950276672 kmemcheck: remove annotations
Patch series "kmemcheck: kill kmemcheck", v2.

As discussed at LSF/MM, kill kmemcheck.

KASan is a replacement that is able to work without the limitation of
kmemcheck (single CPU, slow).  KASan is already upstream.

We are also not aware of any users of kmemcheck (or users who don't
consider KASan as a suitable replacement).

The only objection was that since KASAN wasn't supported by all GCC
versions provided by distros at that time we should hold off for 2
years, and try again.

Now that 2 years have passed, and all distros provide gcc that supports
KASAN, kill kmemcheck again for the very same reasons.

This patch (of 4):

Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

[alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
  Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Mark Rutland
66702eb590 locking/atomics, fs/dcache: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.

However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.

It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the dcache code and comments to use
{READ,WRITE}_ONCE() consistently.

----
virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-4-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:00:57 +02:00
Will Deacon
506458efaf locking/barriers: Convert users of lockless_dereference() to READ_ONCE()
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 13:17:33 +02:00
Linus Torvalds
78dcf73421 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ->s_options removal from Al Viro:
 "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
  gets moved to explicit ->show_options(), killing ->s_options off +
  some cosmetic bits around fs/namespace.c and friends. Basically, the
  stuff needed to work with fsmount series with minimum of conflicts
  with other work.

  It's not strictly required for this merge window, but it would reduce
  the PITA during the coming cycle, so it would be nice to have those
  bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  isofs: Fix isofs_show_options()
  VFS: Kill off s_options and helpers
  orangefs: Implement show_options
  9p: Implement show_options
  isofs: Implement show_options
  afs: Implement show_options
  affs: Implement show_options
  befs: Implement show_options
  spufs: Implement show_options
  bpf: Implement show_options
  ramfs: Implement show_options
  pstore: Implement show_options
  omfs: Implement show_options
  hugetlbfs: Implement show_options
  VFS: Don't use save/replace_mount_options if not using generic_show_options
  VFS: Provide empty name qstr
  VFS: Make get_filesystem() return the affected filesystem
  VFS: Clean up whitespace in fs/namespace.c and fs/super.c
  Provide a function to create a NUL-terminated string from unterminated data
2017-07-15 12:00:42 -07:00
Sahitya Tummala
b17c070fb6 fs/dcache.c: fix spin lockup issue on nlru->lock
__list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer
duration if there are more number of items in the lru list.  As per the
current code, it can hold the spin lock for upto maximum UINT_MAX
entries at a time.  So if there are more number of items in the lru
list, then "BUG: spinlock lockup suspected" is observed in the below
path:

  spin_bug+0x90
  do_raw_spin_lock+0xfc
  _raw_spin_lock+0x28
  list_lru_add+0x28
  dput+0x1c8
  path_put+0x20
  terminate_walk+0x3c
  path_lookupat+0x100
  filename_lookup+0x6c
  user_path_at_empty+0x54
  SyS_faccessat+0xd0
  el0_svc_naked+0x24

This nlru->lock is acquired by another CPU in this path -

  d_lru_shrink_move+0x34
  dentry_lru_isolate_shrink+0x48
  __list_lru_walk_one.isra.10+0x94
  list_lru_walk_node+0x40
  shrink_dcache_sb+0x60
  do_remount_sb+0xbc
  do_emergency_remount+0xb0
  process_one_work+0x228
  worker_thread+0x2e0
  kthread+0xf4
  ret_from_fork+0x10

Fix this lockup by reducing the number of entries to be shrinked from
the lru list to 1024 at once.  Also, add cond_resched() before
processing the lru list again.

Link: http://marc.info/?t=149722864900001&r=1&w=2
Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Polakov <apolyakov@beget.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:33 -07:00
Linus Torvalds
b8d4c1f9f4 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc filesystem updates from Al Viro:
 "Assorted normal VFS / filesystems stuff..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  dentry name snapshots
  Make statfs properly return read-only state after emergency remount
  fs/dcache: init in_lookup_hashtable
  minix: Deinline get_block, save 2691 bytes
  fs: Reorder inode_owner_or_capable() to avoid needless
  fs: warn in case userspace lied about modprobe return
2017-07-08 10:50:54 -07:00
Al Viro
49d31c2f38 dentry name snapshots
take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified).  In either case the pointer to stable
string is stored into the same structure.

dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().

Intended use:
	struct name_snapshot s;

	take_dentry_name_snapshot(&s, dentry);
	...
	access s.name
	...
	release_dentry_name_snapshot(&s);

Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-07 20:09:10 -04:00
Pavel Tatashin
3d375d7859 mm: update callers to use HASH_ZERO flag
Update dcache, inode, pid, mountpoint, and mount hash tables to use
HASH_ZERO, and remove initialization after allocations.  In case of
places where HASH_EARLY was used such as in __pv_init_lock_hash the
zeroed hash table was already assumed, because memblock zeroes the
memory.

CPU: SPARC M6, Memory: 7T
Before fix:
  Dentry cache hash table entries: 1073741824
  Inode-cache hash table entries: 536870912
  Mount-cache hash table entries: 16777216
  Mountpoint-cache hash table entries: 16777216
  ftrace: allocating 20414 entries in 40 pages
  Total time: 11.798s

After fix:
  Dentry cache hash table entries: 1073741824
  Inode-cache hash table entries: 536870912
  Mount-cache hash table entries: 16777216
  Mountpoint-cache hash table entries: 16777216
  ftrace: allocating 20414 entries in 40 pages
  Total time: 3.198s

CPU: Intel Xeon E5-2630, Memory: 2.2T:
Before fix:
  Dentry cache hash table entries: 536870912
  Inode-cache hash table entries: 268435456
  Mount-cache hash table entries: 8388608
  Mountpoint-cache hash table entries: 8388608
  CPU: Physical Processor ID: 0
  Total time: 3.245s

After fix:
  Dentry cache hash table entries: 536870912
  Inode-cache hash table entries: 268435456
  Mount-cache hash table entries: 8388608
  Mountpoint-cache hash table entries: 8388608
  CPU: Physical Processor ID: 0
  Total time: 3.244s

Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:33 -07:00
David Howells
cdf01226b2 VFS: Provide empty name qstr
Provide an empty name (ie. "") qstr for general use.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-06 03:27:09 -04:00
Sebastian Andrzej Siewior
6916363f30 fs/dcache: init in_lookup_hashtable
in_lookup_hashtable was introduced in commit 94bdd655ca ("parallel
lookups machinery, part 3") and never initialized but since it is in
the data it is all zeros. But we need this for -RT.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 20:17:14 -04:00
Al Viro
81be24d263 Hang/soft lockup in d_invalidate with simultaneous calls
It's not hard to trigger a bunch of d_invalidate() on the same
dentry in parallel.  They end up fighting each other - any
dentry picked for removal by one will be skipped by the rest
and we'll go for the next iteration through the entire
subtree, even if everything is being skipped.  Morevoer, we
immediately go back to scanning the subtree.  The only thing
we really need is to dissolve all mounts in the subtree and
as soon as we've nothing left to do, we can just unhash the
dentry and bugger off.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 06:52:09 -04:00
Josef Bacik
563f40019d fs: don't set *REFERENCED on single use objects
By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
inode we create.  This is problematic as this means that it takes two
trips through the LRU for any of these objects to be reclaimed,
regardless of their actual lifetime.  With enough pressure from these
caches we can easily evict our working set from page cache with single
use objects.  So instead only set *REFERENCED if we've already been
added to the LRU list.  This means that we've been touched since the
first time we were accessed, and so more likely to need to hang out in
cache.

To illustrate this issue I wrote the following scripts

https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure

on my test box.  It is a single socket 4 core CPU with 16gib of RAM and
I tested on an Intel 2tib NVME drive.  The cache-pressure.sh script
creates a new file system and creates 2 6.5gib files in order to take up
13gib of the 16gib of ram with pagecache.  Then it runs a test program
that reads these 2 files in a loop, and keeps track of how often it has
to read bytes for each loop.  On an ideal system with no pressure we
should have to read 0 bytes indefinitely.  The second thing this script
does is start a fs_mark job that creates a ton of 0 length files,
putting pressure on the system with slab only allocations.  On exit the
script prints out how many bytes were read by the read-file program.
The results are as follows

Without patch:
/mnt/btrfs-test/reads/file1: total read during loops 27262988288
/mnt/btrfs-test/reads/file2: total read during loops 27262976000

With patch:
/mnt/btrfs-test/reads/file2: total read during loops 18640457728
/mnt/btrfs-test/reads/file1: total read during loops 9565376512

This patch results in a 50% reduction of the amount of pages evicted
from our working set.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-03 11:47:05 -04:00