We need information about exports when crossing mountpoints during
lookup or NFSv4 readdir. If we don't already have that information
cached, we may have to ask (and wait for) rpc.mountd.
In both cases we currently hold the i_mutex on the parent of the
directory we're asking rpc.mountd about. We've seen situations where
rpc.mountd performs some operation on that directory that tries to take
the i_mutex again, resulting in deadlock.
With some care, we may be able to avoid that in rpc.mountd. But it
seems better just to avoid holding a mutex while waiting on userspace.
It appears that lookup_one_len is pretty much the only operation that
needs the i_mutex. So we could just drop the i_mutex elsewhere and do
something like
mutex_lock()
lookup_one_len()
mutex_unlock()
In many cases though the lookup would have been cached and not required
the i_mutex, so it's more efficient to create a lookup_one_len() variant
that only takes the i_mutex when necessary.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.
It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.
new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases. page_follow_link_light()
instrumented to yell about anything missed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
that allows to kill the recheck of nd->seq on the way out in
this case, and this check on the way out is left only for
absolute pathnames.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
we already zero it on outermost set_nameidata(), so initialization in
path_init() is pointless and wrong. The same DoS exists on pre-4.2
kernels, but there a slightly different fix will be needed.
Cc: stable@vger.kernel.org # v4.2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
Pull trivial updates from Jiri Kosina:
"Trivial stuff from trivial tree that can be trivially summed up as:
- treewide drop of spurious unlikely() before IS_ERR() from Viresh
Kumar
- cosmetic fixes (that don't really affect basic functionality of the
driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek
- various comment / printk fixes and updates all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
bcache: Really show state of work pending bit
hwmon: applesmc: fix comment typos
Kconfig: remove comment about scsi_wait_scan module
class_find_device: fix reference to argument "match"
debugfs: document that debugfs_remove*() accepts NULL and error values
net: Drop unlikely before IS_ERR(_OR_NULL)
mm: Drop unlikely before IS_ERR(_OR_NULL)
fs: Drop unlikely before IS_ERR(_OR_NULL)
drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
UBI: Update comments to reflect UBI_METAONLY flag
pktcdvd: drop null test before destroy functions
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.
Let's introduce a helper function which makes the restriction explicit and
easier to track. This patch doesn't introduce any functional changes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull userns hardlink capability check fix from Eric Biederman:
"This round just contains a single patch. There has been a lot of
other work this period but it is not quite ready yet, so I am pushing
it until 4.5.
The remaining change by Dirk Steinmetz wich fixes both Gentoo and
Ubuntu containers allows hardlinks if we have the appropriate
capabilities in the user namespace. Security wise it is really a
gimme as the user namespace root can already call setuid become that
user and create the hardlink"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
namei: permit linking with CAP_FOWNER in userns
Attempting to hardlink to an unsafe file (e.g. a setuid binary) from
within an unprivileged user namespace fails, even if CAP_FOWNER is held
within the namespace. This may cause various failures, such as a gentoo
installation within a lxc container failing to build and install specific
packages.
This change permits hardlinking of files owned by mapped uids, if
CAP_FOWNER is held for that namespace. Furthermore, it improves consistency
by using the existing inode_owner_or_capable(), which is aware of
namespaced capabilities as of 23adbe12ef ("fs,userns: Change
inode_capable to capable_wrt_inode_uidgid").
Signed-off-by: Dirk Steinmetz <public@rsjtdrjgfuzkfg.com>
This is hitting us in Ubuntu during some dpkg upgrades in containers.
When upgrading a file dpkg creates a hard link to the old file to back
it up before overwriting it. When packages upgrade suid files owned by a
non-root user the link isn't permitted, and the package upgrade fails.
This patch fixes our problem.
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Leandro Awa writes:
"After switching to version 4.1.6, our parallelized and distributed
workflows now fail consistently with errors of the form:
T34: ./regex.c:39:22: error: config.h: No such file or directory
From our 'git bisect' testing, the following commit appears to be the
possible cause of the behavior we've been seeing: commit 766c4cbfacd8"
Al Viro says:
"What happens is that 766c4cbfac got the things subtly wrong.
We used to treat d_is_negative() after lookup_fast() as "fall with
ENOENT". That was wrong - checking ->d_flags outside of ->d_seq
protection is unreliable and failing with hard error on what should've
fallen back to non-RCU pathname resolution is a bug.
Unfortunately, we'd pulled the test too far up and ran afoul of
another kind of staleness. The dentry might have been absolutely
stable from the RCU point of view (and we might be on UP, etc), but
stale from the remote fs point of view. If ->d_revalidate() returns
"it's actually stale", dentry gets thrown away and the original code
wouldn't even have looked at its ->d_flags.
What we need is to check ->d_flags where 766c4cbfac does (prior to
->d_seq validation) but only use the result in cases where we do not
discard this dentry outright"
Reported-by: Leandro Awa <lawa@nvidia.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911
Fixes: 766c4cbfac ("namei: d_is_negative() should be checked...")
Tested-by: Leandro Awa <lawa@nvidia.com>
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there
is no need to do that again from its callers. Drop it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Fix the following warnings:
Warning(.//fs/namei.c:2422): No description found for parameter 'nd'
Warning(.//fs/namei.c:2422): Excess function parameter 'nameidata'
description in 'path_mountpoint'
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.
Like division by zero .. from an unconnected path can not be given
a useful semantic as there is no predicting at which path component
the code will realize it is unconnected. We certainly can not match
the current behavior as the current behavior is a security hole.
Therefore when encounting .. when following an unconnected path
return -ENOENT.
- Add a function path_connected to verify path->dentry is reachable
from path->mnt.mnt_root. AKA to validate that rename did not do
something nasty to the bind mount.
To avoid races path_connected must be called after following a path
component to it's next path component.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that we can get there in RCU mode, we shouldn't play with
nd->path.dentry->d_inode - it's not guaranteed to be stable.
Use nd->inode instead.
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In RCU mode we might end up with dentry evicted just we check
that it's a directory. In such case we should return ECHILD
rather than ENOTDIR, so that pathwalk would be retries in non-RCU
mode.
Breakage had been introduced in commit b18825a - prior to that
we were looking at nd->inode, which had been fetched before
verifying that ->d_seq was still valid. That form of check
would only be satisfied if at some point the pathname prefix
would indeed have resolved to a non-directory. The fix consists
of checking ->d_seq after we'd run into a non-directory dentry,
and failing with ECHILD in case of mismatch.
Note that all branches since 3.12 have that problem...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only caller that cares about its return value can just
as easily pick it from nd->root_seq itself. We used to just
calculate it and return to caller, but these days we are
storing it in nd->root_seq in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
these guys are always declared next to each other; might as well put
the former (pointer to previous instance) into the latter and simplify
the calling conventions for {set,restore}_nameidata()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) make it reject ERR_PTR() for name
b) make it putname(name) on all other failure exits
c) make it return name on success
again, simplifies the callers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) make it reject ERR_PTR() for name
b) make it putname(name) upon return in all other cases.
seriously simplifies the callers...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Otherwise we are risking a hard error where nonlazy restart would be the right
thing to do; it's a very narrow race with mount --move and most of the time it
ends up being completely harmless, but it's possible to construct a case when
we'll get a bogus hard error instead of falling back to non-lazy walk...
For one thing, when crossing _into_ overmount of parent we need to check for
mount_lock bumps when we get NULL from __lookup_mnt() as well.
For another, and less exotically, we need to make sure that the data fetched
in follow_up_rcu() had been consistent. ->mnt_mountpoint is pinned for as
long as it is a mountpoint, but we need to check mount_lock after fetching
to verify that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
touch_atime is not RCU-safe, and so cannot be called on an RCU walk.
However, in situations where RCU-walk makes a difference, the symlink
will likely to accessed much more often than it is useful to update
the atime.
So split out the test of "Does the atime actually need to be updated"
into atime_needs_update(), and have get_link() unlazy if it finds that
it will need to do that update.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We are almost done - primitives for leaving RCU mode are aware of nd->stack
now, a new primitive for going to non-RCU mode when we have a symlink on hands
added.
The thing we are heavily relying upon is that *any* unlazy failure will be
shortly followed by terminate_walk(), with no access to nameidata in between.
So it's enough to leave the things in a state terminate_walk() would cope with.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We *can't* call that audit garbage in RCU mode - it's doing a weird
mix of allocations (GFP_NOFS, immediately followed by GFP_KERNEL)
and I'm not touching that... thing again.
So if this security sclero^Whardening feature gets triggered when
we are in RCU mode, tough - we'll fail with -ECHILD and have
everything restarted in non-RCU mode. Only to hit the same test
and fail, this time with EACCES and with (oh, rapture) an audit spew
produced.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
very simple - just make path_put() conditional on !RCU.
Note that right now it doesn't get called in RCU mode -
we leave it before getting anything into stack.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
inode_follow_link now takes an inode and rcu flag as well as the
dentry.
inode is used in preference to d_backing_inode(dentry), particularly
in RCU-walk mode.
selinux_inode_follow_link() gets dentry_has_perm() and
inode_has_perm() open-coded into it so that it can call
avc_has_perm_flags() in way that is safe if LOOKUP_RCU is set.
Calling avc_has_perm_flags() with rcu_read_lock() held means
that when avc_has_perm_noaudit calls avc_compute_av(), the attempt
to rcu_read_unlock() before calling security_compute_av() will not
actually drop the RCU read-lock.
However as security_compute_av() is completely in a read_lock()ed
region, it should be safe with the RCU read-lock held.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make use of d_backing_inode() in pathwalk to gain access to an
inode or dentry that's on a lower layer.
Signed-off-by: David Howells <dhowells@redhat.com>
Lift it from link_path_walk(), trailing_symlink(), lookup_last(),
mountpoint_last(), complete_walk() and do_last(). A _lot_ of
those suckers merge.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make trailing_symlink() return the pathname to traverse or ERR_PTR(-E...).
A subtle point is that for "magic" symlinks it returns "" now - that
leads to link_path_walk("", nd), which is immediately returning 0 and
we are back to the treatment of the last component, at whereever the
damn thing has left us.
Reduces the stack footprint - link_path_walk() called on more shallow
stack now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* lift link_path_walk() into callers; moving it down into path_init()
had been a mistake. Stack footprint, among other things...
* do _not_ call path_cleanup() after path_init() failure; on all failure
exits out of it we have nothing for path_cleanup() to do
* have path_init() return pathname or ERR_PTR(-E...)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
we can do fdput() under rcu_read_lock() just fine; all we need to take
care of is fetching nd->inode value first.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Makes the situation much more regular - we avoid a strange state
when the element just after the top of stack is used to store
struct path of symlink, but isn't counted in nd->depth. This
is much more regular, so the normal failure exits, etc., work
fine.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just store it in nd->stack[nd->depth].link right in pick_link().
Now that we make sure of stack expansion in pick_link(), we can
do so...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and don't open-code unlazy_walk() in there - the only reason
for that is to avoid verfication of cached nd->root, which is
trivially avoided by discarding said cached nd->root first.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
rather than letting the callers handle the jump-to-root part of
semantics, do it right in get_link() and return the rest of the
body for the caller to deal with - at that point it's treated
the same way as relative symlinks would be. And return NULL
when there's no "rest of the body" - those are treated the same
as pure jump symlink would be.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of saving name and branching to OK:, where we'll immediately restore
it, and call walk_component() with WALK_PUT|WALK_GET and nd->last_type being
LAST_BIND, which is equivalent to put_link(nd), err = 0, we can just treat
that the same way we'd treat procfs-style "jump" symlinks - do put_link(nd)
and move on.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
when cookie is NULL, put_link() is equivalent to path_put(), so
as soon as we'd set last->cookie to NULL, we can bump nd->depth and
let the normal logics in terminate_walk() to take care of cleanups.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
now that it gets nameidata, no reason to have setting LOOKUP_JUMPED on
mountpoint crossing and calling path_put_conditional() on failures
done in every caller.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
task_struct currently contains two ad-hoc members for use by the VFS:
link_count and total_link_count. These are only interesting to fs/namei.c,
so exposing them explicitly is poor layering. Incidentally, link_count
isn't used anymore, so it can just die.
This patches replaces those with a single pointer to 'struct nameidata'.
This structure represents the current filename lookup of which
there can only be one per process, and is a natural place to
store total_link_count.
This will allow the current "nameidata" argument to all
follow_link operations to be removed as current->nameidata
can be used instead in the _very_ few instances that care about
it at all.
As there are occasional circumstances where pathname lookup can
recurse, such as through kern_path_locked, we always save and old
current->nameidata (if there is one) when setting a new value, and
make sure any active link_counts are preserved.
follow_mount and follow_automount now get a 'struct nameidata *'
rather than 'int flags' so that they can directly access
total_link_count, rather than going through 'current'.
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
instead of a single flag (!= 0 => we want to follow symlinks) pass
two bits - WALK_GET (want to follow symlinks) and WALK_PUT (put_link()
once we are done looking at the name). The latter matters only for
success exits - on failure the caller will discard everything anyway.
Suggestions for better variant are welcome; what this thing aims for
is making sure that pending put_link() is done *before* walk_component()
decides to pick a symlink up, rather than between picking it up and
acting upon it. See the next commit for payoff.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All callers of terminate_walk() are followed by more or less
open-coded eqiuvalent of "do put_link() on everything left
in nd->stack". Better done in terminate_walk() itself, and
when we go for RCU symlink traversal we'll have to do it
there anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
rationale: we'll need to have terminate_walk() do put_link() on
everything, which will mean that in some cases ..._last() will do
put_link() anyway. Easier to have them do it in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
follow_dotdot_rcu() does an equivalent of terminate_walk() on failure;
shifting it into callers makes for simpler rules and those callers
already have terminate_walk() on other failure exits.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only reason why we needed one more was that purely nested
MAXSYMLINKS symlinks could lead to path_init() using that many
entries in addition to nd->stack[0] which it left unused.
That can't happen now - path_init() starts with entry 0 (and
trailing_symlink() is called only when we'd already encountered
one symlink, so no more than MAXSYMLINKS-1 are left).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
get rid of orig_depth - we only use it on error exit to tell whether
to stop doing put_link() when depth reaches 0 (call from path_init())
or when it reaches 1 (call from trailing_symlink()). However, in
the latter case the caller would immediately follow with one more
put_link(). Just keep doing it until the depth reaches zero (and
simplify trailing_symlink() as the result).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Get rid of orig_depth checks in OK: logics. If nd->depth is
zero, we had been called from path_init() and we are done.
If it is greater than 1, we are not done, whether we'd been
called from path_init() or trailing_symlink(). And in
case when it's 1, we might have been called from path_init()
and reached the end of nested symlink (in which case
nd->stack[0].name will point to the rest of pathname and
we are not done) or from trailing_symlink(), in which case
we are done.
Just have trailing_symlink() leave NULL in nd->stack[0].name
and use that to discriminate between those cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make link_path_walk() work with any value of nd->depth on entry -
memorize it and use it in tests instead of comparing with 1.
Don't bother with increment/decrement in path_init().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
move increment of ->depth to the point where we'd discovered
that get_link() has not returned an error, adjust exits
accordingly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
nd->stack[0] is unused until the handling of trailing symlinks and
we want to get rid of that. Having fucked that transformation up
several times, I went for bloody pedantic series of provably equivalent
transformations. Sorry.
Step 1: keep nd->depth higher by one in link_path_walk() - increment upon
entry, decrement on exits, adjust the arithmetics inside and surround the
calls of functions that care about nd->depth value (nd_alloc_stack(),
get_link(), put_link()) with decrement/increment pairs.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only restriction is that on the total amount of symlinks
crossed; how they are nested does not matter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>