Commit Graph

29157 Commits

Author SHA1 Message Date
Eric W. Biederman
5e4a08476b userns: Require CAP_SYS_ADMIN for most uses of setns.
Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns.  With unprivileged user namespaces it
became possible to create new namespaces without privilege.

However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.

Which made the following nasty sequence possible.

pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
	system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
	char path[PATH_MAX];
	snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
	fd = open(path, O_RDONLY);
	setns(fd, 0);
	system("su -");
}

Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-14 16:12:03 -08:00
Eric W. Biederman
98f842e675 proc: Usable inode numbers for the namespace file descriptors.
Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.

A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.

This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.

We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.

I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-20 04:19:49 -08:00
Eric W. Biederman
bf056bfa80 proc: Fix the namespace inode permission checks.
Change the proc namespace files into symlinks so that
we won't cache the dentries for the namespace files
which can bypass the ptrace_may_access checks.

To support the symlinks create an additional namespace
inode with it's own set of operations distinct from the
proc pid inode and dentry methods as those no longer
make sense.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-20 04:19:48 -08:00
Eric W. Biederman
33d6dce607 proc: Generalize proc inode allocation
Generalize the proc inode allocation so that it can be
used without having to having to create a proc_dir_entry.

This will allow namespace file descriptors to remain light
weight entitities but still have the same inode number
when the backing namespace is the same.

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-20 04:19:19 -08:00
Eric W. Biederman
4f326c0064 userns: Allow unprivilged mounts of proc and sysfs
- The context in which proc and sysfs are mounted have no
  effect on the the uid/gid of their files so no conversion is
  needed except allowing the mount.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-20 04:19:18 -08:00
Eric W. Biederman
e9f238c304 procfs: Print task uids and gids in the userns that opened the proc file
Instead of using current_userns() use the userns of the opener
of the file so that if the file is passed between processes
the contents of the file do not change.

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-20 04:18:15 -08:00
Eric W. Biederman
cde1975bc2 userns: Implent proc namespace operations
This allows entering a user namespace, and the ability
to store a reference to a user namespace with a bind
mount.

Addition of missing userns_ns_put in userns_install
from Gao feng <gaofeng@cn.fujitsu.com>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-20 04:18:13 -08:00
Eric W. Biederman
7fa294c899 userns: Allow chown and setgid preservation
- Allow chown if CAP_CHOWN is present in the current user namespace
  and the uid of the inode maps into the current user namespace, and
  the destination uid or gid maps into the current user namespace.

- Allow perserving setgid when changing an inode if CAP_FSETID is
  present in the current user namespace and the owner of the file has
  a mapping into the current user namespace.

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-20 04:17:24 -08:00
Eric W. Biederman
3cdf5b45ff userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
When performing an exec where the binary lives in one user namespace and
the execing process lives in another usre namespace there is the possibility
that the target uids can not be represented.

Instead of failing the exec simply ignore the suid/sgid bits and run
the binary with lower privileges.   We already do this in the case
of MNT_NOSUID so this should be a well tested code path.

As the user and group are not changed this should not introduce any
security issues.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:23 -08:00
Zhao Hongjiang
ae11e0f184 userns: fix return value on mntns_install() failure
Change return value from -EINVAL to -EPERM when the permission check fails.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:22 -08:00
Eric W. Biederman
0c55cfc416 vfs: Allow unprivileged manipulation of the mount namespace.
- Add a filesystem flag to mark filesystems that are safe to mount as
  an unprivileged user.

- Add a filesystem flag to mark filesystems that don't need MNT_NODEV
  when mounted by an unprivileged user.

- Relax the permission checks to allow unprivileged users that have
  CAP_SYS_ADMIN permissions in the user namespace referred to by the
  current mount namespace to be allowed to mount, unmount, and move
  filesystems.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:21 -08:00
Eric W. Biederman
7a472ef4be vfs: Only support slave subtrees across different user namespaces
Sharing mount subtress with mount namespaces created by unprivileged
users allows unprivileged mounts created by unprivileged users to
propagate to mount namespaces controlled by privileged users.

Prevent nasty consequences by changing shared subtrees to slave
subtress when an unprivileged users creates a new mount namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:20 -08:00
Eric W. Biederman
771b137168 vfs: Add a user namespace reference from struct mnt_namespace
This will allow for support for unprivileged mounts in a new user namespace.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:19 -08:00
Eric W. Biederman
8823c079ba vfs: Add setns support for the mount namespace
setns support for the mount namespace is a little tricky as an
arbitrary decision must be made about what to set fs->root and
fs->pwd to, as there is no expectation of a relationship between
the two mount namespaces.  Therefore I arbitrarily find the root
mount point, and follow every mount on top of it to find the top
of the mount stack.  Then I set fs->root and fs->pwd to that
location.  The topmost root of the mount stack seems like a
reasonable place to be.

Bind mount support for the mount namespace inodes has the
possibility of creating circular dependencies between mount
namespaces.  Circular dependencies can result in loops that
prevent mount namespaces from every being freed.  I avoid
creating those circular dependencies by adding a sequence number
to the mount namespace and require all bind mounts be of a
younger mount namespace into an older mount namespace.

Add a helper function proc_ns_inode so it is possible to
detect when we are attempting to bind mound a namespace inode.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:18 -08:00
Eric W. Biederman
a85fb273c9 vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace
Once you are confined to a user namespace applications can not gain
privilege and escape the user namespace so there is no longer a reason
to restrict chroot.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:17 -08:00
Eric W. Biederman
57e8391d32 pidns: Add setns support
- Pid namespaces are designed to be inescapable so verify that the
  passed in pid namespace is a child of the currently active
  pid namespace or the currently active pid namespace itself.

  Allowing the currently active pid namespace is important so
  the effects of an earlier setns can be cancelled.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:14 -08:00
Eric W. Biederman
0a01f2cc39 pidns: Make the pidns proc mount/umount logic obvious.
Track the number of pids in the proc hash table.  When the number of
pids goes to 0 schedule work to unmount the kernel mount of proc.

Move the mount of proc into alloc_pid when we allocate the pid for
init.

Remove the surprising calls of pid_ns_release proc in fork and
proc_flush_task.  Those code paths really shouldn't know about proc
namespace implementation details and people have demonstrated several
times that finding and understanding those code paths is difficult and
non-obvious.

Because of the call path detach pid is alwasy called with the
rtnl_lock held free_pid is not allowed to sleep, so the work to
unmounting proc is moved to a work queue.  This has the side benefit
of not blocking the entire world waiting for the unnecessary
rcu_barrier in deactivate_locked_super.

In the process of making the code clear and obvious this fixes a bug
reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
succeeded and copy_net_ns failed.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:10 -08:00
Eric W. Biederman
17cf22c33e pidns: Use task_active_pid_ns where appropriate
The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
aka ns_of_pid(task_pid(tsk)) should have the same number of
cache line misses with the practical difference that
ns_of_pid(task_pid(tsk)) is released later in a processes life.

Furthermore by using task_active_pid_ns it becomes trivial
to write an unshare implementation for the the pid namespace.

So I have used task_active_pid_ns everywhere I can.

In fork since the pid has not yet been attached to the
process I use ns_of_pid, to achieve the same effect.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:09 -08:00
Eric W. Biederman
ae06c7c83f procfs: Don't cache a pid in the root inode.
Now that we have s_fs_info pointing to our pid namespace
the original reason for the proc root inode having a struct
pid is gone.

Caching a pid in the root inode has led to some complicated
code.  Now that we don't need the struct pid, just remove it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 03:09:35 -08:00
Eric W. Biederman
e656d8a6f7 procfs: Use the proc generic infrastructure for proc/self.
I had visions at one point of splitting proc into two filesystems.  If
that had happened proc/self being the the part of proc that actually deals
with pids would have been a nice cleanup.  As it is proc/self requires
a lot of unnecessary infrastructure for a single file.

The only user visible change is that a mounted /proc for a pid namespace
that is dead now shows a broken proc symlink, instead of being completely
invisible.  I don't think anyone will notice or care.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 03:09:34 -08:00
Eric W. Biederman
499dcf2024 userns: Support fuse interacting with multiple user namespaces
Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data.

The connection between between a fuse filesystem and a fuse daemon is
established when a fuse filesystem is mounted and provided with a file
descriptor the fuse daemon created by opening /dev/fuse.

For now restrict the communication of uids and gids between the fuse
filesystem and the fuse daemon to the initial user namespace.  Enforce
this by verifying the file descriptor passed to the mount of fuse was
opened in the initial user namespace.  Ensuring the mount happens in
the initial user namespace is not necessary as mounts from non-initial
user namespaces are not yet allowed.

In fuse_req_init_context convert the currrent fsuid and fsgid into the
initial user namespace for the request that will be sent to the fuse
daemon.

In fuse_fill_attr convert the uid and gid passed from the fuse daemon
from the initial user namespace into kuids and kgids.

In iattr_to_fattr called from fuse_setattr convert kuids and kgids
into the uids and gids in the initial user namespace before passing
them to the fuse filesystem.

In fuse_change_attributes_common called from fuse_dentry_revalidate,
fuse_permission, fuse_geattr, and fuse_setattr, and fuse_iget convert
the uid and gid from the fuse daemon into a kuid and a kgid to store
on the fuse inode.

By default fuse mounts are restricted to task whose uid, suid, and
euid matches the fuse user_id and whose gid, sgid, and egid matches
the fuse group id.  Convert the user_id and group_id mount options
into kuids and kgids at mount time, and use uid_eq and gid_eq to
compare the in fuse_allow_task.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-14 22:05:33 -08:00
Eric W. Biederman
45634cd8cb userns: Support autofs4 interacing with multiple user namespaces
Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue.

When creating directories and symlinks default the uid and gid of
the mount requester to the global root uid and gid.  autofs4_wait
will update these fields when a mount is requested.

When generating autofsv5 packets report the uid and gid of the mount
requestor in user namespace of the process that opened the pipe,
reporting unmapped uids and gids as overflowuid and overflowgid.

In autofs_dev_ioctl_requester return the uid and gid of the last mount
requester converted into the calling processes user namespace.  When the
uid or gid don't map return overflowuid and overflowgid as appropriate,
allowing failure to find a mount requester to be distinguished from
failure to map a mount requester.

The uid and gid mount options specifying the user and group of the
root autofs inode are converted into kuid and kgid as they are parsed
defaulting to the current uid and current gid of the process that
mounts autofs.

Mounting of autofs for the present remains confined to processes in
the initial user namespace.

Cc: Ian Kent <raven@themaw.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-14 22:05:32 -08:00
Mikulas Patocka
1a25b1c4ce Lock splice_read and splice_write functions
Functions generic_file_splice_read and generic_file_splice_write access
the pagecache directly. For block devices these functions must be locked
so that block size is not changed while they are in progress.

This patch is an additional fix for commit b87570f5d3 ("Fix a crash
when block device is read and block size is changed at the same time")
that locked aio_read, aio_write and mmap against block size change.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-28 10:59:37 -07:00
Linus Torvalds
64b1cbaa10 Power management and ACPI fixes for 3.7-rc3
* Fix for a memory leak in acpi_bind_one() from Jesper Juhl.
 
 * Fix for an error code path memory leak in pm_genpd_attach_cpuidle()
   from Jonghwan Choi.
 
 * Fix for smp_processor_id() usage in preemptible code in powernow-k8 from
   Andreas Herrmann.
 
 * Fix for a suspend-related memory leak in cpufreq stats from Xiaobing Tu.
 
 * Freezer fix for failure to clear PF_NOFREEZE along with PF_KTHREAD
   in flush_old_exec() from Oleg Nesterov.
 
 * acpi_processor_notify() fix from Alan Cox.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJQivlUAAoJEKhOf7ml8uNsPHMP/jGv0umbDl0CrJBjd9eF+Tdt
 52DpJ/c2HjghsmSG26MCsFVO026DcPoO/t7faUVpWiUy/I38TwyjTFMKrxouY5AC
 8p929Gt5yjf/pB7w/P5C3exhv9zSWdVzCZ4rmlt7knBn7vN7jfI5Lv4kaEwAcu4o
 mnGbUVzaFLaiHsKFa8iBOpkdr01Fn9FRINddMQ/+PdiFR+wkqOKMZBExjRoQgS31
 aH90LL5Nfv5pSH126TH5S6GDdAXw0g4eHEfxGjNodEXdmAS+GXrD6QoGabab99ZD
 SaZA41kTv3+ls4Z9uhhpNBgqEDQEWiNVBVfTs0PWTUpemYKlx85Ihdl8PbH1H0TM
 QeHsM3dfHJsfhK/EjUFwx31oWrvfM0Qqw8CxOc/ASm2rpPOZVBgqRzKsqSMQE805
 y3lteaoT9nObnKdy871QmIhAk3fN25u1txCtmNFc0S5VZyiAnD2RVS/a8y93gMse
 5lxSMjOfUqvq3APlz4HIn3YovswjiAOOw0PlD3nK2qLj7tyEVOl5CMyL/05v58wJ
 SeOic10v1oOEDYT3uM+aVERmK9APAsMbcecj7Xd5yqPu8NPx7zrj+z30wr0znS5x
 KWnyZR83F4ZCb00geSZW0LliXuazjj+lWX/TFri9XY6ZdMvmTJOUVKBD8JwhHN0j
 WH9iMOOBTMnJdbnf88sM
 =8Tcc
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-for-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI fixes from Rafael J Wysocki:

 - Fix for a memory leak in acpi_bind_one() from Jesper Juhl.

 - Fix for an error code path memory leak in pm_genpd_attach_cpuidle()
   from Jonghwan Choi.

 - Fix for smp_processor_id() usage in preemptible code in powernow-k8
   from Andreas Herrmann.

 - Fix for a suspend-related memory leak in cpufreq stats from Xiaobing
   Tu.

 - Freezer fix for failure to clear PF_NOFREEZE along with PF_KTHREAD in
   flush_old_exec() from Oleg Nesterov.

 - acpi_processor_notify() fix from Alan Cox.

* tag 'pm+acpi-for-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: missing break
  freezer: exec should clear PF_NOFREEZE along with PF_KTHREAD
  Fix memory leak in cpufreq stats.
  cpufreq / powernow-k8: Remove usage of smp_processor_id() in preemptible code
  PM / Domains: Fix memory leak on error path in pm_genpd_attach_cpuidle
  ACPI: Fix memory leak in acpi_bind_one()
2012-10-26 14:23:35 -07:00
Linus Torvalds
299650cad6 Driver core fixes for 3.7-rc3
Here are a number of firmware core fixes for 3.7, and some other minor fixes.
 And some documentation updates thrown in for good measure.
 
 All have been in the linux-next tree for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlCKwkIACgkQMUfUDdst+ynzUgCfQDwxUw1PVqQyWy7SakpsjFJJ
 8kwAoITyjppn39v1WuZbg0+FZ6JpocyY
 =2mDG
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg Kroah-Hartman:
 "Here are a number of firmware core fixes for 3.7, and some other minor
  fixes.  And some documentation updates thrown in for good measure.

  All have been in the linux-next tree for a while.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  Documentation:Chinese translation of Documentation/arm64/memory.txt
  Documentation:Chinese translation of Documentation/arm64/booting.txt
  Documentation:Chinese translation of Documentation/IRQ.txt
  firmware loader: document kernel direct loading
  sysfs: sysfs_pathname/sysfs_add_one: Use strlcat() instead of strcat()
  dynamic_debug: Remove unnecessary __used
  firmware loader: sync firmware cache by async_synchronize_full_domain
  firmware loader: let direct loading back on 'firmware_buf'
  firmware loader: fix one reqeust_firmware race
  firmware loader: cancel uncache work before caching firmware
2012-10-26 10:24:51 -07:00
Linus Torvalds
561ec64ae6 VFS: don't do protected {sym,hard}links by default
In commit 800179c9b8 ("This adds symlink and hardlink restrictions to
the Linux VFS"), the new link protections were enabled by default, in
the hope that no actual application would care, despite it being
technically against legacy UNIX (and documented POSIX) behavior.

However, it does turn out to break some applications.  It's rare, and
it's unfortunate, but it's unacceptable to break existing systems, so
we'll have to default to legacy behavior.

In particular, it has broken the way AFD distributes files, see

  http://www.dwd.de/AFD/

along with some legacy scripts.

Distributions can end up setting this at initrd time or in system
scripts: if you have security problems due to link attacks during your
early boot sequence, you have bigger problems than some kernel sysctl
setting. Do:

	echo 1 > /proc/sys/fs/protected_symlinks
	echo 1 > /proc/sys/fs/protected_hardlinks

to re-enable the link protections.

Alternatively, we may at some point introduce a kernel config option
that sets these kinds of "more secure but not traditional" behavioural
options automatically.

Reported-by: Nick Bowler <nbowler@elliptictech.com>
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # v3.6
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-26 10:05:07 -07:00
Linus Torvalds
f48d42773b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has our series of fixes for the next rc.  The biggest batch is
  from Jan Schmidt, fixing up some problems in our subvolume quota code
  and fixing btrfs send/receive to work with the new extended inode
  refs."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: do not bug when we fail to commit the transaction
  Btrfs: fix memory leak when cloning root's node
  Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
  Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
  Btrfs: fix deadlock caused by the nested chunk allocation
  btrfs: Return EINVAL when length to trim is less than FSB
  Btrfs: fix memory leak in btrfs_quota_enable()
  Btrfs: send correct rdev and mode in btrfs-send
  Btrfs: extended inode refs support for send mechanism
  Btrfs: Fix wrong error handling code
  Fix a sign bug causing invalid memory access in the ino_paths ioctl.
  Btrfs: comment for loop in tree_mod_log_insert_move
  Btrfs: fix extent buffer reference for tree mod log roots
  Btrfs: determine level of old roots
  Btrfs: tree mod log's old roots could still be part of the tree
  Btrfs: fix a tree mod logging issue for root replacement operations
  Btrfs: don't put removals from push_node_left into tree mod log twice
2012-10-26 09:34:04 -07:00
Linus Torvalds
fec4fba6e4 NFS bugfixes for Linux 3.7
- Fix the NFSv2/v3 kernel statd protocol, which broke due to net namespace
   related changes.
 - Fix a number of races in the SUNRPC TCP disconnect/reconnect code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQieKUAAoJEGcL54qWCgDylBsP/R8UkhkNuL92x+KMSlNG28rr
 EYLIl5XNpBy/xIPhQmxuzV3KjytNzwe/3caNZ/Cl6thS4l/CRiqPKsq62xIOEMKn
 NBsY0ZULzhi2xwcR3BZxgn055PSTMxbQ9a3oeNbTvUbsrk5AnX7br6ol/oW1s7Fi
 ytCLrW/A4zB4DCJ/HMUnnQTCKCBGXBs2VD0ib6lKzr/6OqMO26cbQfHddTZW9BC3
 9qDKDVUQbLr9lNouYZbqWhZ6g/ecIkFyDyxW6ElVELaTNZFo0MTkQMYBgTQPDFd8
 PAxnq/Dcik2/Ls+qQgtqK2gs67RH7oAWvPR34ozA2pVJJj6wojXHbtIpH6IDuwIM
 XvNCOkQNebzB2glf0WFLqUjmoQVvONiDp0zA9CYz5xBgjgA6AlmGsnW5dHKFiAkS
 RxqZKPPinMOBzULOQYcC5Bu/hkXDMtvSi0UwMnuM5IZh1GAadtw+Do+0k6ociRuK
 ALD4tzfV+o9VM+/714eyYyJ7DEL4HgeEEpoQb6tRyNj8YIlibxM3XKGBBzwssI2d
 L2YuWmBE6N+zYUrwOqY1v9hYwh7qayqQiAgIac8e4/oC+UnQM8tfNNABcouA8a3d
 fSxOK+S0DFYTRN1EMDvhasnxUqmilj0Wuw+g6ytL95k8T2Q5b8PdYt6QCBc9h6SJ
 Qb+E/wkMEGAavm6w2T42
 =8Dt4
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS bugfixes from Trond Myklebust:

 - Fix the NFSv2/v3 kernel statd protocol, which broke due to net
   namespace related changes.

 - Fix a number of races in the SUNRPC TCP disconnect/reconnect code.

* tag 'nfs-for-3.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  LOCKD: Clear ln->nsm_clnt only when ln->nsm_users is zero
  LOCKD: fix races in nsm_client_get
  SUNRPC: Get rid of the xs_error_report socket callback
  SUNRPC: Prevent races in xs_abort_connection()
  Revert "SUNRPC: Ensure we close the socket on EPIPE errors too..."
  SUNRPC: Clear the connect flag when socket state is TCP_CLOSE_WAIT
2012-10-25 19:26:16 -07:00
Kees Cook
1217650336 fs/compat_ioctl.c: VIDEO_SET_SPU_PALETTE missing error check
The compat ioctl for VIDEO_SET_SPU_PALETTE was missing an error check
while converting ioctl arguments.  This could lead to leaking kernel
stack contents into userspace.

Patch extracted from existing fix in grsecurity.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: David Miller <davem@davemloft.net>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: PaX Team <pageexec@freemail.hu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25 14:37:53 -07:00
Oleg Nesterov
b40a79591c freezer: exec should clear PF_NOFREEZE along with PF_KTHREAD
flush_old_exec() clears PF_KTHREAD but forgets about PF_NOFREEZE.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-10-25 22:28:12 +02:00
Josef Bacik
c37b2b6269 Btrfs: do not bug when we fail to commit the transaction
We BUG if we fail to commit the transaction when creating a snapshot, which
is just obnoxious.  Remove the BUG_ON().  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-25 15:59:57 -04:00
Liu Bo
7bfdcf7fba Btrfs: fix memory leak when cloning root's node
After cloning root's node, we forgot to dec the src's ref
which can lead to a memory leak.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-25 15:55:21 -04:00
Chris Mason
c657c3ef1a Merge branch 'for-chris-fixed' of git://git.jan-o-sch.net/btrfs-unstable 2012-10-25 15:53:10 -04:00
Josef Bacik
be6aef6049 Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
On a really full file system I was getting ENOSPC back from
btrfs_update_inode when trying to update the parent inode when creating a
snapshot.  Just use the fallback method so we can update the inode and not
have to worry about having a delayed ref.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-25 15:50:18 -04:00
Alex Lyakas
e2d044fe77 Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
This patch also requires a change in the user-space part of "receive".
We need to use "lchown" instead of "chown". We will do this in the
following patch.

Signed-off-by: Alex Lyakas <alex.btrfs@zadarastorage.com>

 	if (S_ISREG(sctx->cur_inode_mode)) {
2012-10-25 15:47:31 -04:00
Miao Xie
671415b7db Btrfs: fix deadlock caused by the nested chunk allocation
Steps to reproduce:
 # mkfs.btrfs -m raid1 <disk1> <disk2>
 # btrfstune -S 1 <disk1>
 # mount <disk1> <mnt>
 # btrfs device add <disk3> <disk4> <mnt>
 # mount -o remount,rw <mnt>
 # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
 Deadlock happened.

It is because of the nested chunk allocation. When we wrote the data
into the filesystem, we would allocate the data chunk because there was
no data chunk in the filesystem. At the end of the data chunk allocation,
we should insert the metadata of the data chunk into the extent tree, but
there was no raid1 chunk, so we tried to lock the chunk allocation mutex to
allocate the new chunk, but we had held the mutex, the deadlock happened.

By rights, we would allocate the raid1 chunk when we added the second device
because the profile of the seed filesystem is raid1 and we had two devices.
But we didn't do that in fact. It is because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't cow the tree, and just inserted the relative metadata into the leaves
which were generated by the first device insertion, and its profile was dup.

So, I fix this problem by commiting the transaction at the end of the first
device insertion.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-25 15:47:00 -04:00
Lukas Czerner
e515c18bfe btrfs: Return EINVAL when length to trim is less than FSB
Currently if len argument in btrfs_ioctl_fitrim() is smaller than
one FSB we will continue and finally return 0 bytes discarded.
However if the length to discard is smaller then file system block
we should really return EINVAL.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2012-10-25 15:46:22 -04:00
Tsutomu Itoh
5b7ff5b3c4 Btrfs: fix memory leak in btrfs_quota_enable()
We should free quota_root before returning from the error
handling code.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-25 15:45:43 -04:00
Arne Jansen
d79e50433b Btrfs: send correct rdev and mode in btrfs-send
When sending a device file, the stream was missing the mode. Also the
rdev was encoded wrongly.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-10-25 15:45:25 -04:00
Jan Schmidt
96b5bd7771 Btrfs: extended inode refs support for send mechanism
This adds support for the new extended inode refs to btrfs send.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-25 15:45:16 -04:00
Stefan Behrens
84167d1905 Btrfs: Fix wrong error handling code
gcc says "warning: comparison of unsigned expression >= 0 is always
true" because i is an unsigned long. And gcc is right this time.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-25 15:40:03 -04:00
Gabriel de Perthuis
661bec6ba8 Fix a sign bug causing invalid memory access in the ino_paths ioctl.
To see the problem, create many hardlinks to the same file (120 should do it),
then look up paths by inode with:

  ls -i
  btrfs inspect inode-resolve -v $ino /mnt/btrfs

I noticed the memory layout of the fspath->val data had some irregularities
(some unnecessary gaps that stop appearing about halfway),
so I'm not sure there aren't any bugs left in it.
2012-10-25 15:39:47 -04:00
Geert Uytterhoeven
66081a7251 sysfs: sysfs_pathname/sysfs_add_one: Use strlcat() instead of strcat()
The warning check for duplicate sysfs entries can cause a buffer overflow
when printing the warning, as strcat() doesn't check buffer sizes.
Use strlcat() instead.

Since strlcat() doesn't return a pointer to the passed buffer, unlike
strcat(), I had to convert the nested concatenation in sysfs_add_one() to
an admittedly more obscure comma operator construct, to avoid emitting code
for the concatenation if CONFIG_BUG is disabled.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-24 15:57:14 -07:00
Trond Myklebust
e498daa812 LOCKD: Clear ln->nsm_clnt only when ln->nsm_users is zero
The current code is clearing it in all cases _except_ when zero.

Reported-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-10-24 10:46:22 -04:00
Trond Myklebust
a4ee8d978e LOCKD: fix races in nsm_client_get
Commit e9406db20f (lockd: per-net
NSM client creation and destruction helpers introduced) contains
a nasty race on initialisation of the per-net NSM client because
it doesn't check whether or not the client is set after grabbing
the nsm_create_mutex.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-10-24 10:46:22 -04:00
Jan Schmidt
01763a2e37 Btrfs: comment for loop in tree_mod_log_insert_move
Emphasis the way tree_mod_log_insert_move avoids adding
MOD_LOG_KEY_REMOVE_WHILE_MOVING operations, depending on the direction of
the move operation.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:40 +02:00
Jan Schmidt
d638108484 Btrfs: fix extent buffer reference for tree mod log roots
In get_old_root we grab a lock on the extent buffer before we obtain a
reference on that buffer. That order is changed now.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:39 +02:00
Jan Schmidt
5b6602e762 Btrfs: determine level of old roots
In btrfs_find_all_roots' termination condition, we compare the level of the
old buffer we got from btrfs_search_old_slot to the level of the current
root node. We'd better compare it to the level of the rewinded root node.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:38 +02:00
Jan Schmidt
834328a849 Btrfs: tree mod log's old roots could still be part of the tree
Tree mod log treated old root buffers as always empty buffers when starting
the rewind operations. However, the old root may still be part of the
current tree at a lower level, with still some valid entries.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:37 +02:00
Linus Torvalds
684baeb1d7 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core kernel fixes from Ingo Molnar:
 "Two small fixes"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation: Reflect the new location of the NMI watchdog info
  nohz: Fix idle ticks in cpu summary line of /proc/stat
2012-10-24 04:07:02 +03:00