linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-14 08:02:07 +00:00

Author	SHA1	Message	Date
Miklos Szeredi	66dbfabf10	ovl: stack fileattr ops Add stacking for the fileattr operations. Add hack for calling security_file_ioctl() for now. Probably better to have a pair of specific hooks for these operations. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-04-12 15:04:29 +02:00
Linus Torvalds	7d6beb71da	idmapped-mounts-v5.12 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg= =yPaw -----END PGP SIGNATURE----- Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. https://systemd.io/HOME_DIRECTORY/ - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: https://github.com/containerd/containerd/pull/4734 - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf https://fosdem.org/2021/schedule/event/containers_idmap/ This comes with an extensive xfstests suite covering both ext4 and xfs: https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: `1d7b902e28` In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at https://github.com/brauner/mount-idmapped that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...	2021-02-23 13:39:45 -08:00
Liangyan	e04527fefb	ovl: fix dentry leak in ovl_get_redirect We need to lock d_parent->d_lock before dget_dlock, or this may have d_lockref updated parallelly like calltrace below which will cause dentry->d_lockref leak and risk a crash. CPU 0 CPU 1 ovl_set_redirect lookup_fast ovl_get_redirect __d_lookup dget_dlock //no lock protection here spin_lock(&dentry->d_lock) dentry->d_lockref.count++ dentry->d_lockref.count++ [ 49.799059] PGD 800000061fed7067 P4D 800000061fed7067 PUD 61fec5067 PMD 0 [ 49.799689] Oops: 0002 [#1] SMP PTI [ 49.800019] CPU: 2 PID: 2332 Comm: node Not tainted 4.19.24-7.20.al7.x86_64 #1 [ 49.800678] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014 [ 49.801380] RIP: 0010:_raw_spin_lock+0xc/0x20 [ 49.803470] RSP: 0018:ffffac6fc5417e98 EFLAGS: 00010246 [ 49.803949] RAX: 0000000000000000 RBX: ffff93b8da3446c0 RCX: 0000000a00000000 [ 49.804600] RDX: 0000000000000001 RSI: 000000000000000a RDI: 0000000000000088 [ 49.805252] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff993cf040 [ 49.805898] R10: ffff93b92292e580 R11: ffffd27f188a4b80 R12: 0000000000000000 [ 49.806548] R13: 00000000ffffff9c R14: 00000000fffffffe R15: ffff93b8da3446c0 [ 49.807200] FS: 00007ffbedffb700(0000) GS:ffff93b927880000(0000) knlGS:0000000000000000 [ 49.807935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 49.808461] CR2: 0000000000000088 CR3: 00000005e3f74006 CR4: 00000000003606a0 [ 49.809113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 49.809758] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 49.810410] Call Trace: [ 49.810653] d_delete+0x2c/0xb0 [ 49.810951] vfs_rmdir+0xfd/0x120 [ 49.811264] do_rmdir+0x14f/0x1a0 [ 49.811573] do_syscall_64+0x5b/0x190 [ 49.811917] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 49.812385] RIP: 0033:0x7ffbf505ffd7 [ 49.814404] RSP: 002b:00007ffbedffada8 EFLAGS: 00000297 ORIG_RAX: 0000000000000054 [ 49.815098] RAX: ffffffffffffffda RBX: 00007ffbedffb640 RCX: 00007ffbf505ffd7 [ 49.815744] RDX: 0000000004449700 RSI: 0000000000000000 RDI: 0000000006c8cd50 [ 49.816394] RBP: 00007ffbedffaea0 R08: 0000000000000000 R09: 0000000000017d0b [ 49.817038] R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000012 [ 49.817687] R13: 00000000072823d8 R14: 00007ffbedffb700 R15: 00000000072823d8 [ 49.818338] Modules linked in: pvpanic cirrusfb button qemu_fw_cfg atkbd libps2 i8042 [ 49.819052] CR2: 0000000000000088 [ 49.819368] ---[ end trace 4e652b8aa299aa2d ]--- [ 49.819796] RIP: 0010:_raw_spin_lock+0xc/0x20 [ 49.821880] RSP: 0018:ffffac6fc5417e98 EFLAGS: 00010246 [ 49.822363] RAX: 0000000000000000 RBX: ffff93b8da3446c0 RCX: 0000000a00000000 [ 49.823008] RDX: 0000000000000001 RSI: 000000000000000a RDI: 0000000000000088 [ 49.823658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff993cf040 [ 49.825404] R10: ffff93b92292e580 R11: ffffd27f188a4b80 R12: 0000000000000000 [ 49.827147] R13: 00000000ffffff9c R14: 00000000fffffffe R15: ffff93b8da3446c0 [ 49.828890] FS: 00007ffbedffb700(0000) GS:ffff93b927880000(0000) knlGS:0000000000000000 [ 49.830725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 49.832359] CR2: 0000000000000088 CR3: 00000005e3f74006 CR4: 00000000003606a0 [ 49.834085] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 49.835792] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Cc: <stable@vger.kernel.org> Fixes: `a6c6065511` ("ovl: redirect on rename-dir") Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-01-28 10:22:48 +01:00
Christian Brauner	549c729771	fs: make helpers idmap mount aware Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:20 +01:00
Christian Brauner	6521f89170	namei: prepare for idmapped mounts The various vfs_*() helpers are called by filesystems or by the vfs itself to perform core operations such as create, link, mkdir, mknod, rename, rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace and pass it down. Afterwards the checks and operations are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:18 +01:00
Tycho Andersen	c7c7a1a18a	xattr: handle idmapped mounts When interacting with extended attributes the vfs verifies that the caller is privileged over the inode with which the extended attribute is associated. For posix access and posix default extended attributes a uid or gid can be stored on-disk. Let the functions handle posix extended attributes on idmapped mounts. If the inode is accessed through an idmapped mount we need to map it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. This has no effect for e.g. security xattrs since they don't store uids or gids and don't perform permission checks on them like posix acls do. Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:17 +01:00
Christian Brauner	2f221d6f7b	attr: handle idmapped mounts When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:16 +01:00
Christian Brauner	21cb47be6f	inode: make init and permission helpers idmapped mount aware The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:16 +01:00
Miklos Szeredi	610afc0bd4	ovl: pass ovl_fs down to functions accessing private xattrs This paves the way for optionally using the "user.overlay." xattr namespace. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-09-02 10:58:49 +02:00
Vivek Goyal	28166ab3c8	ovl: initialize OVL_UPPERDATA in ovl_lookup() Currently ovl_get_inode() initializes OVL_UPPERDATA flag and for that it has to call ovl_check_metacopy_xattr() and check if metacopy xattr is present or not. yangerkun reported sometimes underlying filesystem might return -EIO and in that case error handling path does not cleanup properly leading to various warnings. Run generic/461 with ext4 upper/lower layer sometimes may trigger the bug as below(linux 4.19): [ 551.001349] overlayfs: failed to get metacopy (-5) [ 551.003464] overlayfs: failed to get inode (-5) [ 551.004243] overlayfs: cleanup of 'd44/fd51' failed (-5) [ 551.004941] overlayfs: failed to get origin (-5) [ 551.005199] ------------[ cut here ]------------ [ 551.006697] WARNING: CPU: 3 PID: 24674 at fs/inode.c:1528 iput+0x33b/0x400 ... [ 551.027219] Call Trace: [ 551.027623] ovl_create_object+0x13f/0x170 [ 551.028268] ovl_create+0x27/0x30 [ 551.028799] path_openat+0x1a35/0x1ea0 [ 551.029377] do_filp_open+0xad/0x160 [ 551.029944] ? vfs_writev+0xe9/0x170 [ 551.030499] ? page_counter_try_charge+0x77/0x120 [ 551.031245] ? __alloc_fd+0x160/0x2a0 [ 551.031832] ? do_sys_open+0x189/0x340 [ 551.032417] ? get_unused_fd_flags+0x34/0x40 [ 551.033081] do_sys_open+0x189/0x340 [ 551.033632] __x64_sys_creat+0x24/0x30 [ 551.034219] do_syscall_64+0xd5/0x430 [ 551.034800] entry_SYSCALL_64_after_hwframe+0x44/0xa9 One solution is to improve error handling and call iget_failed() if error is encountered. Amir thinks that this path is little intricate and there is not real need to check and initialize OVL_UPPERDATA in ovl_get_inode(). Instead caller of ovl_get_inode() can initialize this state. And this will avoid double checking of metacopy xattr lookup in ovl_lookup() and ovl_get_inode(). OVL_UPPERDATA is inode flag. So I was little concerned that initializing it outside ovl_get_inode() might have some races. But this is one way transition. That is once a file has been fully copied up, it can't go back to metacopy file again. And that seems to help avoid races. So as of now I can't see any races w.r.t OVL_UPPERDATA being set wrongly. So move settingof OVL_UPPERDATA inside the callers of ovl_get_inode(). ovl_obtain_alias() already does it. So only two callers now left are ovl_lookup() and ovl_instantiate(). Reported-by: yangerkun <yangerkun@huawei.com> Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-02 22:20:25 +02:00
Chengguang Xu	c21c839b84	ovl: whiteout inode sharing Share inode with different whiteout files for saving inode and speeding up delete operation. If EMLINK is encountered when linking a shared whiteout, create a new one. In case of any other error, disable sharing for this super block. Note: ofs->whiteout is protected by inode lock on workdir. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-13 11:11:24 +02:00
Miklos Szeredi	83552eacdf	ovl: fix WARN_ON nlink drop to zero Changes to underlying layers should not cause WARN_ON(), but this repro does: mkdir w l u mnt sudo mount -t overlay -o workdir=w,lowerdir=l,upperdir=u overlay mnt touch mnt/h ln u/h u/k rm -rf mnt/k rm -rf mnt/h dmesg ------------[ cut here ]------------ WARNING: CPU: 1 PID: 116244 at fs/inode.c:302 drop_nlink+0x28/0x40 After upper hardlinks were added while overlay is mounted, unlinking all overlay hardlinks drops overlay nlink to zero before all upper inodes are unlinked. After unlink/rename prevent i_nlink from going to zero if there are still hashed aliases (i.e. cached hard links to the victim) remaining. Reported-by: Phasip <phasip@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-27 16:51:02 +01:00
Amir Goldstein	cad218ab33	ovl: check if upper fs supports RENAME_WHITEOUT As with other required upper fs features, we only warn if support is missing to avoid breaking existing sub-optimal setups. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
Miklos Szeredi	f428884456	ovl: decide if revalidate needed on a per-dentry basis Allow completely skipping ->revalidate() on a per-dentry basis, in case the underlying layers used for a dentry do not themselves have ->revalidate(). E.g. negative overlay dentry has no underlying layers, hence revalidate is unnecessary. Or if lower layer is remote but overlay dentry is pure-upper, then can skip revalidate. The following places need to update whether the dentry needs revalidate or not: - fill-super (root dentry) - lookup - create - fh_to_dentry Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-17 15:04:22 +01:00
lijiazi	1bd0a3aea4	ovl: use pr_fmt auto generate prefix Use pr_fmt auto generate "overlayfs: " prefix. Signed-off-by: lijiazi <lijiazi@xiaomi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-01-22 20:11:41 +01:00
Amir Goldstein	6889ee5a53	ovl: relax WARN_ON() on rename to self In ovl_rename(), if new upper is hardlinked to old upper underneath overlayfs before upper dirs are locked, user will get an ESTALE error and a WARN_ON will be printed. Changes to underlying layers while overlayfs is mounted may result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON(). Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com Fixes: `804032fabb` ("ovl: don't check rename to self") Cc: <stable@vger.kernel.org> # v4.9+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-12-10 16:00:55 +01:00
Linus Torvalds	c884d8ac7f	SPDX update for 5.2-rc6 Another round of SPDX updates for 5.2-rc6 Here is what I am guessing is going to be the last "big" SPDX update for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that were "easy" to determine by pattern matching. The ones after this are going to be a bit more difficult and the people on the spdx list will be discussing them on a case-by-case basis now. Another 5000+ files are fixed up, so our overall totals are: Files checked: 64545 Files with SPDX: 45529 Compared to the 5.1 kernel which was: Files checked: 63848 Files with SPDX: 22576 This is a huge improvement. Also, we deleted another 20000 lines of boilerplate license crud, always nice to see in a diffstat. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQyQYA8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ymnGQCghETUBotn1p3hTjY56VEs6dGzpHMAnRT0m+lv kbsjBGEJpLbMRB2krnaU =RMcT -----END PGP SIGNATURE----- Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull still more SPDX updates from Greg KH: "Another round of SPDX updates for 5.2-rc6 Here is what I am guessing is going to be the last "big" SPDX update for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that were "easy" to determine by pattern matching. The ones after this are going to be a bit more difficult and the people on the spdx list will be discussing them on a case-by-case basis now. Another 5000+ files are fixed up, so our overall totals are: Files checked: 64545 Files with SPDX: 45529 Compared to the 5.1 kernel which was: Files checked: 63848 Files with SPDX: 22576 This is a huge improvement. Also, we deleted another 20000 lines of boilerplate license crud, always nice to see in a diffstat" * tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485 ...	2019-06-21 09:58:42 -07:00
Thomas Gleixner	d2912cb15b	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-06-19 17:09:55 +02:00
Nicolas Schier	253e748339	ovl: fix typo in MODULE_PARM_DESC Change first argument to MODULE_PARM_DESC() calls, that each of them matched the actual module parameter name. The matching results in changing (the 'parm' section from) the output of `modinfo overlay` from: parm: ovl_check_copy_up:Obsolete; does nothing parm: redirect_max:ushort parm: ovl_redirect_max:Maximum length of absolute redirect xattr value parm: redirect_dir:bool parm: ovl_redirect_dir_def:Default to on or off for the redirect_dir feature parm: redirect_always_follow:bool parm: ovl_redirect_always_follow:Follow redirects even if redirect_dir feature is turned off parm: index:bool parm: ovl_index_def:Default to on or off for the inodes index feature parm: nfs_export:bool parm: ovl_nfs_export_def:Default to on or off for the NFS export feature parm: xino_auto:bool parm: ovl_xino_auto_def:Auto enable xino feature parm: metacopy:bool parm: ovl_metacopy_def:Default to on or off for the metadata only copy up feature into: parm: check_copy_up:Obsolete; does nothing parm: redirect_max:Maximum length of absolute redirect xattr value (ushort) parm: redirect_dir:Default to on or off for the redirect_dir feature (bool) parm: redirect_always_follow:Follow redirects even if redirect_dir feature is turned off (bool) parm: index:Default to on or off for the inodes index feature (bool) parm: nfs_export:Default to on or off for the NFS export feature (bool) parm: xino_auto:Auto enable xino feature (bool) parm: metacopy:Default to on or off for the metadata only copy up feature (bool) Signed-off-by: Nicolas Schier <n.schier@avm.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-06-18 15:06:16 +02:00
Amir Goldstein	acf3062a7e	ovl: relax WARN_ON() for overlapping layers use case This nasty little syzbot repro: https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000 Creates overlay mounts where the same directory is both in upper and lower layers. Simplified example: mkdir foo work mount -t overlay none foo -o"lowerdir=.,upperdir=foo,workdir=work" The repro runs several threads in parallel that attempt to chdir into foo and attempt to symlink/rename/exec/mkdir the file bar. The repro hits a WARN_ON() I placed in ovl_instantiate(), which suggests that an overlay inode already exists in cache and is hashed by the pointer of the real upper dentry that ovl_create_real() has just created. At the point of the WARN_ON(), for overlay dir inode lock is held and upper dir inode lock, so at first, I did not see how this was possible. On a closer look, I see that after ovl_create_real(), because of the overlapping upper and lower layers, a lookup by another thread can find the file foo/bar that was just created in upper layer, at overlay path foo/foo/bar and hash the an overlay inode with the new real dentry as lower dentry. This is possible because the overlay directory foo/foo is not locked and the upper dentry foo/bar is in dcache, so ovl_lookup() can find it without taking upper dir inode shared lock. Overlapping layers is considered a wrong setup which would result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON() and leave a pr_warn() instead to cover all cases of failure to get an overlay inode. The error returned from failure to insert new inode to cache with inode_insert5() was changed to -EEXIST, to distinguish from the error -ENOMEM returned on failure to get/allocate inode with iget5_locked(). Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com Fixes: `01b39dcc95` ("ovl: use inode_insert5() to hash a newly...") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-08 13:25:53 +02:00
Amir Goldstein	91ff20f34e	ovl: fix missing override creds in link of a metacopy upper Theodore Ts'o reported a v4.19 regression with docker-dropbox: https://marc.info/?l=linux-fsdevel&m=154070089431116&w=2 "I was rebuilding my dropbox Docker container, and it failed in 4.19 with the following error: ... dpkg: error: error creating new backup file \ '/var/lib/dpkg/status-old': Invalid cross-device link" The problem did not reproduce with metacopy feature disabled. The error was caused by insufficient credentials to set "trusted.overlay.redirect" xattr on link of a metacopy file. Reproducer: echo Y > /sys/module/overlay/parameters/redirect_dir echo Y > /sys/module/overlay/parameters/metacopy cd /tmp mkdir l u w m chmod 777 l u touch l/foo ln l/foo l/link chmod 666 l/foo mount -t overlay none -olowerdir=l,upperdir=u,workdir=w m su fsgqa ln m/foo m/bar [ 21.455823] overlayfs: failed to set redirect (-1) ln: failed to create hard link 'm/bar' => 'm/foo':\ Invalid cross-device link Reported-by: Theodore Y. Ts'o <tytso@mit.edu> Reported-by: Maciej Zięba <maciekz82@gmail.com> Fixes: `4120fe64dc` ("ovl: Set redirect on upper inode when it is linked") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-11-19 16:21:29 +01:00
Miklos Szeredi	5e12758086	ovl: check whiteout in ovl_create_over_whiteout() Kaixuxia repors that it's possible to crash overlayfs by removing the whiteout on the upper layer before creating a directory over it. This is a reproducer: mkdir lower upper work merge touch lower/file mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merge rm merge/file ls -al merge/file rm upper/file ls -al merge/ mkdir merge/file Before commencing with a vfs_rename(..., RENAME_EXCHANGE) verify that the lookup of "upper" is positive and is a whiteout, and return ESTALE otherwise. Reported by: kaixuxia <xiakaixu1987@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `e9be9d5e76` ("overlay filesystem") Cc: <stable@vger.kernel.org> # v3.18	2018-10-31 12:15:23 +01:00
Chengguang Xu	14fa085640	ovl: using posix_acl_xattr_size() to get size instead of posix_acl_to_xattr() There is no functional change but it seems better to get size by calling posix_acl_xattr_size() instead of calling posix_acl_to_xattr() with NULL buffer argument. Additionally, remove unnecessary assignments. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:40 +02:00
Amir Goldstein	0e32992f7f	ovl: remove the 'locked' argument of ovl_nlink_{start,end} It just makes the interface strange without adding any significant value. The only case where locked is false and return value is 0 is in ovl_rename() when new is negative, so handle that case explicitly in ovl_rename(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:40 +02:00
Amir Goldstein	6cd078702f	ovl: fix recursive oi->lock in ovl_link() linking a non-copied-up file into a non-copied-up parent results in a nested call to mutex_lock_interruptible(&oi->lock). Fix this by copying up target parent before ovl_nlink_start(), same as done in ovl_rename(). ~/unionmount-testsuite$ ./run --ov -s ~/unionmount-testsuite$ ln /mnt/a/foo100 /mnt/a/dir100/ WARNING: possible recursive locking detected -------------------------------------------- ln/1545 is trying to acquire lock: 00000000bcce7c4c (&ovl_i_lock_key[depth]){+.+.}, at: ovl_copy_up_start+0x28/0x7d but task is already holding lock: 0000000026d73d5b (&ovl_i_lock_key[depth]){+.+.}, at: ovl_nlink_start+0x3c/0xc1 [SzM: this seems to be a false positive, but doing the copy-up first is harmless and removes the lockdep splat] Reported-by: syzbot+3ef5c0d1a5cb0b21e6be@syzkaller.appspotmail.com Fixes: `5f8415d6b8` ("ovl: persistent overlay inode nlink for...") Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-10-26 23:34:39 +02:00
Miklos Szeredi	6faf05c2b2	ovl: set I_CREATING on inode being created ...otherwise there will be list corruption due to inode_sb_list_add() being called for inode already on the sb list. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `e950564b97` ("vfs: don't evict uninitialized inode") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-22 13:15:25 -07:00
Vivek Goyal	4120fe64dc	ovl: Set redirect on upper inode when it is linked When we create a hardlink to a metacopy upper file, first the redirect on that inode. Path based lookup will not work with newly created link and redirect will solve that issue. Also use absolute redirect as two hardlinks could be in different directores and relative redirect will not work. I have not put any additional locking around setting redirects while introducing redirects for non-dir files. For now it feels like existing locking is sufficient. If that's not the case, we will have add more locking. Following is my rationale about why do I think current locking seems ok. Basic problem for non-dir files is that more than on dentry could be pointing to same inode and in theory only relying on dentry based locks (d->d_lock) did not seem sufficient. We set redirect upon rename and upon link creation. In both the paths for non-dir file, VFS locks both source and target inodes (->i_rwsem). That means vfs rename and link operations on same source and target can't he happening in parallel (Even if there are multiple dentries pointing to same inode). So that probably means that at a time on an inode, only one call of ovl_set_redirect() could be working and we don't need additional locking in ovl_set_redirect(). ovl_inode->redirect is initialized only when inode is created new. That means it should not race with any other path and setting ovl_inode->redirect should be fine. Reading of ovl_inode->redirect happens in ovl_get_redirect() path. And this called only in ovl_set_redirect(). And ovl_set_redirect() already seemed to be protected using ->i_rwsem. That means ovl_set_redirect() and ovl_get_redirect() on source/target inode should not make progress in parallel and is mutually exclusive. Hence no additional locking required. Now, only case where ovl_set_redirect() and ovl_get_redirect() could race seems to be case of absolute redirects where ovl_get_redirect() has to travel up the tree. In that case we already take d->d_lock and that should be sufficient as directories will not have multiple dentries pointing to same inode. So given VFS locking and current usage of redirect, current locking around redirect seems to be ok for non-dir as well. Once we have the logic to remove redirect when metacopy file gets copied up, then we probably will need additional locking. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:15 +02:00
Vivek Goyal	7bb083837d	ovl: Set redirect on metacopy files upon rename Set redirect on metacopy files upon rename. This will help find data dentry in lower dirs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-20 09:56:15 +02:00
Miklos Szeredi	d9854c87f0	ovl: copy up times Copy up mtime and ctime to overlay inode after times in real object are modified. Be careful not to dirty cachelines when not necessary. This is in preparation for moving overlay functionality out of the VFS. This patch shouldn't have any observable effect. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-07-18 15:44:40 +02:00
Amir Goldstein	01b39dcc95	ovl: use inode_insert5() to hash a newly created inode Currently, there is a small window where ovl_obtain_alias() can race with ovl_instantiate() and create two different overlay inodes with the same underlying real non-dir non-hardlink inode. The race requires an adversary to guess the file handle of the yet to be created upper inode and decode the guessed file handle after ovl_creat_real(), but before ovl_instantiate(). This race does not affect overlay directory inodes, because those are decoded via ovl_lookup_real() and not with ovl_obtain_alias(). This patch fixes the race, by using inode_insert5() to add a newly created inode to cache. If the newly created inode apears to already exist in cache (hashed by the same real upper inode), we instantiate the dentry with the old inode and drop the new inode, instead of silently not hashing the new inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:12 +02:00
Miklos Szeredi	dd8ac699ed	ovl: return EIO on internal error EIO better represents an internal error than ENOENT. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Al Viro	f73cc77c3a	ovl: make ovl_create_real() cope with vfs_mkdir() safely vfs_mkdir() may succeed and leave the dentry passed to it unhashed and negative. ovl_create_real() is the last caller breaking when that happens. [amir: split re-factoring of ovl_create_temp() to prep patch add comment about unhashed dir after mkdir add pr_warn() if mkdir succeeds and lookup fails] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Amir Goldstein	137ec526a2	ovl: create helper ovl_create_temp() Also used ovl_create_temp() in ovl_create_index() instead of calling ovl_do_mkdir() directly, so now all callers of ovl_do_mkdir() are routed through ovl_create_real(), which paves the way for Al's fix for non-hashed result from vfs_mkdir(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Miklos Szeredi	95a1c8153a	ovl: return dentry from ovl_create_real() Al Viro suggested to simplify callers of ovl_create_real() by returning the created dentry (or ERR_PTR) from ovl_create_real(). Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Amir Goldstein	471ec5dcf4	ovl: struct cattr cleanups * Rename to ovl_cattr * Fold ovl_create_real() hardlink argument into struct ovl_cattr * Create macro OVL_CATTR() to initialize struct ovl_cattr from mode Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Amir Goldstein	6cf00764b0	ovl: strip debug argument from ovl_do_ helpers It did not prove to be useful. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Amir Goldstein	a8b9e0ceed	ovl: remove WARN_ON() real inode attributes mismatch Overlayfs should cope with online changes to underlying layer without crashing the kernel, which is what xfstest overlay/019 checks. This test may sometimes trigger WARN_ON() in ovl_create_or_link() when linking an overlay inode that has been changed on underlying layer. Remove those WARN_ON() to prevent the stress test from failing. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Amir Goldstein	e7dd0e7134	ovl: whiteout index when union nlink drops to zero With NFS export feature enabled, when overlay inode nlink drops to zero, instead of removing the index entry, replace it with a whiteout index entry. This is needed for NFS export in order to prevent future open by handle from opening the lower file directly. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-24 11:25:56 +01:00
Amir Goldstein	6d0a8a90a5	ovl: take lower dir inode mutex outside upper sb_writers lock The functions ovl_lower_positive() and ovl_check_empty_dir() both take inode mutex on the real lower dir under ovl_want_write() which takes the upper_mnt sb_writers lock. While this is not a clear locking order or layering violation, it creates an undesired lock dependency between two unrelated layers for no good reason. This lock dependency materializes to a false(?) positive lockdep warning when calling rmdir() on a nested overlayfs, where both nested and underlying overlayfs both use the same fs type as upper layer. rmdir() on the nested overlayfs creates the lock chain: sb_writers of upper_mnt (e.g. tmpfs) in ovl_do_remove() ovl_i_mutex_dir_key[] of lower overlay dir in ovl_lower_positive() rmdir() on the underlying overlayfs creates the lock chain in reverse order: ovl_i_mutex_dir_key[] of lower overlay dir in vfs_rmdir() sb_writers of nested upper_mnt (e.g. tmpfs) in ovl_do_remove() To rid of the unneeded locking dependency, move both ovl_lower_positive() and ovl_check_empty_dir() to before ovl_want_write() in rmdir() and rename() implementation. This change spreads the pieces of ovl_check_empty_and_clear() directly inside the rmdir()/rename() implementations so the helper is no longer needed and removed. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-01-19 17:43:23 +01:00
Amir Goldstein	da2e6b7eed	ovl: fix overlay: warning prefix Conform two stray warning messages to the standard overlayfs: prefix. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-12-14 11:14:52 +01:00
Amir Goldstein	f30536f0f9	ovl: update cache version of impure parent on rename ovl_rename() updates dir cache version for impure old parent if an entry with copy up origin is moved into old parent, but it did not update cache version if the entry moved out of old parent has a copy up origin. [SzM] Same for new dir: we updated the version if an entry with origin was moved in, but not if an entry with origin was moved out. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-11-09 10:23:27 +01:00
zhangyi (F)	07f6fff148	ovl: fix rmdir problem on non-merge dir with origin xattr An "origin && non-merge" upper dir may have leftover whiteouts that were created in past mount. overlayfs does no clear this dir when we delete it, which may lead to rmdir fail or temp file left in workdir. Simple reproducer: mkdir lower upper work merge mkdir -p lower/dir touch lower/dir/a mount -t overlay overlay -olowerdir=lower,upperdir=upper,\ workdir=work merge rm merge/dir/a umount merge rm -rf lower/* touch lower/dir () mount -t overlay overlay -olowerdir=lower,upperdir=upper,\ workdir=work merge rm -rf merge/dir Syslog dump: overlayfs: cleanup of 'work/#7' failed (-39) (): if we do not create the regular file, the result is different: rm: cannot remove "dir/": Directory not empty This patch adds a check for the case of non-merge dir that may contain whiteouts, and calls ovl_check_empty_dir() to check and clear whiteouts from upper dir when an empty dir is being deleted. [amir: split patch from ovl_check_empty_dir() cleanup rename ovl_is_origin() to ovl_may_have_whiteouts() check OVL_WHITEOUTS flag instead of checking origin xattr] Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-11-09 10:23:27 +01:00
zhangyi (F)	95e598e7ac	ovl: simplify ovl_check_empty_and_clear() Filter out non-whiteout non-upper entries from list of merge dir entries while checking if merge dir is empty in ovl_check_empty_dir(). The remaining work for ovl_clear_empty() is to clear all entries on the list. [amir: split patch from rmdir bug fix] Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-11-09 10:23:27 +01:00
Amir Goldstein	5820dc0888	ovl: fix missing unlock_rename() in ovl_do_copy_up() Use the ovl_lock_rename_workdir() helper which requires unlock_rename() only on lock success. Fixes: ("fd210b7d67ee ovl: move copy up lock out") Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-10-05 15:53:18 +02:00
Michal Hocko	0ee931c4e3	mm: treewide: remove GFP_TEMPORARY allocation flag GFP_TEMPORARY was introduced by commit `e12ba74d8f` ("Group short-lived and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's primary motivation was to allow users to tell that an allocation is short lived and so the allocator can try to place such allocations close together and prevent long term fragmentation. As much as this sounds like a reasonable semantic it becomes much less clear when to use the highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the context holding that memory sleep? Can it take locks? It seems there is no good answer for those questions. The current implementation of GFP_TEMPORARY is basically GFP_KERNEL \| __GFP_RECLAIMABLE which in itself is tricky because basically none of the existing caller provide a way to reclaim the allocated memory. So this is rather misleading and hard to evaluate for any benefits. I have checked some random users and none of them has added the flag with a specific justification. I suspect most of them just copied from other existing users and others just thought it might be a good idea to use without any measuring. This suggests that GFP_TEMPORARY just motivates for cargo cult usage without any reasoning. I believe that our gfp flags are quite complex already and especially those with highlevel semantic should be clearly defined to prevent from confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and replace all existing users to simply use GFP_KERNEL. Please note that SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and so they will be placed properly for memory fragmentation prevention. I can see reasons we might want some gfp flag to reflect shorterm allocations but I propose starting from a clear semantic definition and only then add users with proper justification. This was been brought up before LSF this year by Matthew [1] and it turned out that GFP_TEMPORARY really doesn't have a clear semantic. It seems to be a heuristic without any measured advantage for most (if not all) its current users. The follow up discussion has revealed that opinions on what might be temporary allocation differ a lot between developers. So rather than trying to tweak existing users into a semantic which they haven't expected I propose to simply remove the flag and start from scratch if we really need a semantic for short term allocations. [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org [akpm@linux-foundation.org: fix typo] [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: drm/i915: fix up] Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Neil Brown <neilb@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-13 18:53:16 -07:00
Miklos Szeredi	4edb83bb10	ovl: constant d_ino for non-merge dirs Impure directories are ones which contain objects with origins (i.e. those that have been copied up). These are relevant to readdir operation only because of the d_ino field, no other transformation is necessary. Also a directory can become impure between two getdents(2) calls. This patch creates a cache for impure directories. Unlike the cache for merged directories, this one only contains entries with origin and is not refcounted but has a its lifetime tied to that of the dentry. Similarly to the merged cache, the impure cache is invalidated based on a version number. This version number is incremented when an entry with origin is added or removed from the directory. If the cache is empty, then the impure xattr is removed from the directory. This patch also fixes up handling of d_ino for the ".." entry if the parent directory is merged. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-07-27 21:54:06 +02:00
Amir Goldstein	ea3dad18dc	ovl: mark parent impure on ovl_link() When linking a file with copy up origin into a new parent, mark the new parent dir "impure". Fixes: `ee1d6d37b6` ("ovl: mark upper dir with type origin entries "impure"") Cc: <stable@vger.kernel.org> # v4.12 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-07-13 22:06:45 +02:00
Amir Goldstein	5f8415d6b8	ovl: persistent overlay inode nlink for indexed inodes With inodes index enabled, an overlay inode nlink counts the union of upper and non-covered lower hardlinks. During the lifetime of a non-pure upper inode, the following nlink modifying operations can happen: 1. Lower hardlink copy up 2. Upper hardlink created, unlinked or renamed over 3. Lower hardlink whiteout or renamed over For the first, copy up case, the union nlink does not change, whether the operation succeeds or fails, but the upper inode nlink may change. Therefore, before copy up, we store the union nlink value relative to the lower inode nlink in the index inode xattr trusted.overlay.nlink. For the second, upper hardlink case, the union nlink should be incremented or decremented IFF the operation succeeds, aligned with nlink change of the upper inode. Therefore, before link/unlink/rename, we store the union nlink value relative to the upper inode nlink in the index inode. For the last, lower cover up case, we simplify things by preceding the whiteout or cover up with copy up. This makes sure that there is an index upper inode where the nlink xattr can be stored before the copied up upper entry is unlink. Return the overlay inode nlinks for indexed upper inodes on stat(2). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-07-04 22:03:19 +02:00
Miklos Szeredi	55acc66182	ovl: add flag for upper in ovl_entry For rename, we need to ensure that an upper alias exists for hard links before attempting the operation. Introduce a flag in ovl_entry to track the state of the upper alias. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-07-04 22:03:18 +02:00
Amir Goldstein	415543d5c6	ovl: cleanup bad and stale index entries on mount Bad index entries are entries whose name does not match the origin file handle stored in trusted.overlay.origin xattr. Bad index entries could be a result of a system power off in the middle of copy up. Stale index entries are entries whose origin file handle is stale. Stale index entries could be a result of copying layers or removing lower entries while the overlay is not mounted. The case of copying layers should be detected earlier by the verification of upper root dir origin and index dir origin. Both bad and stale index entries are detected and removed on mount. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-07-04 22:03:17 +02:00

1 2 3

114 Commits