// SPDX-License-Identifier: GPL-2.0-only
/*
 * fs/libfs.c
 * Library for filesystems writers.
 */

#include <linux/blkdev.h>
#include <linux/export.h>
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/cred.h>
#include <linux/mount.h>
#include <linux/vfs.h>
#include <linux/quotaops.h>
#include <linux/mutex.h>
#include <linux/namei.h>
#include <linux/exportfs.h>
#include <linux/iversion.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h> /* sync_mapping_buffers */
#include <linux/fs_context.h>
#include <linux/pseudo_fs.h>
#include <linux/fsnotify.h>
#include <linux/unicode.h>
#include <linux/fscrypt.h>
#include <linux/pidfs.h>

#include <linux/uaccess.h>

#include "internal.h"

int simple_getattr(struct mnt_idmap *idmap, const struct path *path,
		   struct kstat *stat, u32 request_mask,
		   unsigned int query_flags)
{
	struct inode *inode = d_inode(path->dentry);
	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
	stat->blocks = inode->i_mapping->nrpages << (PAGE_SHIFT - 9);
	return 0;
}
EXPORT_SYMBOL(simple_getattr);

int simple_statfs(struct dentry *dentry, struct kstatfs *buf)
{
	u64 id = huge_encode_dev(dentry->d_sb->s_dev);

	buf->f_fsid = u64_to_fsid(id);
	buf->f_type = dentry->d_sb->s_magic;
	buf->f_bsize = PAGE_SIZE;
	buf->f_namelen = NAME_MAX;
	return 0;
}
EXPORT_SYMBOL(simple_statfs);

/*
 * Retaining negative dentries for an in-memory filesystem just wastes
 * memory and lookup time: arrange for them to be deleted immediately.
 */
int always_delete_dentry(const struct dentry *dentry)
{
	return 1;
}
EXPORT_SYMBOL(always_delete_dentry);

const struct dentry_operations simple_dentry_operations = {
	.d_delete = always_delete_dentry,
};
EXPORT_SYMBOL(simple_dentry_operations);

/*
 * Lookup the data. This is trivial - if the dentry didn't already
 * exist, we know it is negative.  Set d_op to delete negative dentries.
 */
struct dentry *simple_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
{
	if (dentry->d_name.len > NAME_MAX)
		return ERR_PTR(-ENAMETOOLONG);
	if (!dentry->d_sb->s_d_op)
		d_set_d_op(dentry, &simple_dentry_operations);
	d_add(dentry, NULL);
	return NULL;
}
EXPORT_SYMBOL(simple_lookup);

int dcache_dir_open(struct inode *inode, struct file *file)
{
	file->private_data = d_alloc_cursor(file->f_path.dentry);

	return file->private_data ? 0 : -ENOMEM;
}
EXPORT_SYMBOL(dcache_dir_open);

int dcache_dir_close(struct inode *inode, struct file *file)
{
	dput(file->private_data);
	return 0;
}
EXPORT_SYMBOL(dcache_dir_close);

/* parent is locked at least shared */
/*
 * Returns an element of siblings' list.
 * We are looking for <count>th positive after <p>; if
 * found, dentry is grabbed and returned to caller.
 * If no such element exists, NULL is returned.
 */
static struct dentry *scan_positives(struct dentry *cursor,
					struct hlist_node **p,
					loff_t count,
					struct dentry *last)
{
	struct dentry *dentry = cursor->d_parent, *found = NULL;

	spin_lock(&dentry->d_lock);
	while (*p) {
		struct dentry *d = hlist_entry(*p, struct dentry, d_sib);
		p = &d->d_sib.next;
		// we must at least skip cursors, to avoid livelocks
		if (d->d_flags & DCACHE_DENTRY_CURSOR)
			continue;
		if (simple_positive(d) && !--count) {
			spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
			if (simple_positive(d))
				found = dget_dlock(d);
			spin_unlock(&d->d_lock);
			if (likely(found))
				break;
			count = 1;
		}
		if (need_resched()) {
			if (!hlist_unhashed(&cursor->d_sib))
				__hlist_del(&cursor->d_sib);
			hlist_add_behind(&cursor->d_sib, &d->d_sib);
			p = &cursor->d_sib.next;
			spin_unlock(&dentry->d_lock);
			cond_resched();
			spin_lock(&dentry->d_lock);
		}
	}
	spin_unlock(&dentry->d_lock);
	dput(last);
	return found;
}

loff_t dcache_dir_lseek(struct file *file, loff_t offset, int whence)
{
	struct dentry *dentry = file->f_path.dentry;
	switch (whence) {
		case 1:	/* SEEK_CUR */
			offset += file->f_pos;
			fallthrough;
		case 0:	/* SEEK_SET */
			if (offset >= 0)
				break;
			fallthrough;
		default:
			return -EINVAL;
	}
	if (offset != file->f_pos) {
		struct dentry *cursor = file->private_data;
		struct dentry *to = NULL;

		inode_lock_shared(dentry->d_inode);

		if (offset > 2)
			to = scan_positives(cursor, &dentry->d_children.first,
					    offset - 2, NULL);
		spin_lock(&dentry->d_lock);
		hlist_del_init(&cursor->d_sib);
		if (to)
			hlist_add_behind(&cursor->d_sib, &to->d_sib);
		spin_unlock(&dentry->d_lock);
		dput(to);

		file->f_pos = offset;

		inode_unlock_shared(dentry->d_inode);
	}
	return offset;
}
EXPORT_SYMBOL(dcache_dir_lseek);

/*
 * Directory is locked and all positive dentries in it are safe, since
 * for ramfs-type trees they can't go away without unlink() or rmdir(),
 * both impossible due to the lock on directory.
 */

int dcache_readdir(struct file *file, struct dir_context *ctx)
{
	struct dentry *dentry = file->f_path.dentry;
	struct dentry *cursor = file->private_data;
	struct dentry *next = NULL;
	struct hlist_node **p;

	if (!dir_emit_dots(file, ctx))
		return 0;

	if (ctx->pos == 2)
		p = &dentry->d_children.first;
	else
		p = &cursor->d_sib.next;

	while ((next = scan_positives(cursor, p, 1, next)) != NULL) {
		if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
			      d_inode(next)->i_ino,
			      fs_umode_to_dtype(d_inode(next)->i_mode)))
			break;
		ctx->pos++;
		p = &next->d_sib.next;
	}
	spin_lock(&dentry->d_lock);
	hlist_del_init(&cursor->d_sib);
	if (next)
		hlist_add_before(&cursor->d_sib, &next->d_sib);
	spin_unlock(&dentry->d_lock);
	dput(next);

	return 0;
}
EXPORT_SYMBOL(dcache_readdir);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
ssize_t generic_read_dir(struct file *filp, char __user *buf, size_t siz, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return -EISDIR;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(generic_read_dir);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-03-28 09:56:42 +00:00
|
|
|
const struct file_operations simple_dir_operations = {
|
2005-04-16 22:20:36 +00:00
|
|
|
.open = dcache_dir_open,
|
|
|
|
.release = dcache_dir_close,
|
|
|
|
.llseek = dcache_dir_lseek,
|
|
|
|
.read = generic_read_dir,
|
2016-04-20 23:52:15 +00:00
|
|
|
.iterate_shared = dcache_readdir,
|
2010-05-26 15:53:41 +00:00
|
|
|
.fsync = noop_fsync,
|
2005-04-16 22:20:36 +00:00
|
|
|
};
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_dir_operations);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-02-12 08:55:39 +00:00
|
|
|
const struct inode_operations simple_dir_inode_operations = {
|
2005-04-16 22:20:36 +00:00
|
|
|
.lookup = simple_lookup,
|
|
|
|
};
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_dir_inode_operations);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2024-02-17 20:23:47 +00:00
|
|
|
/* 0 is '.', 1 is '..', so always start with offset 2 or more */
|
|
|
|
enum {
|
|
|
|
DIR_OFFSET_MIN = 2,
|
|
|
|
};
|
|
|
|
|
2024-02-17 20:24:16 +00:00
|
|
|
static void offset_set(struct dentry *dentry, long offset)
|
2023-06-30 17:48:49 +00:00
|
|
|
{
|
2024-02-17 20:24:16 +00:00
|
|
|
dentry->d_fsdata = (void *)offset;
|
2023-06-30 17:48:49 +00:00
|
|
|
}
|
|
|
|
|
2024-02-17 20:24:16 +00:00
|
|
|
static long dentry2offset(struct dentry *dentry)
|
2023-06-30 17:48:49 +00:00
|
|
|
{
|
2024-02-17 20:24:16 +00:00
|
|
|
return (long)dentry->d_fsdata;
|
2023-06-30 17:48:49 +00:00
|
|
|
}
|
|
|
|
|
2024-02-17 20:24:16 +00:00
|
|
|
static struct lock_class_key simple_offset_lock_class;
|
2023-07-24 14:43:57 +00:00
|
|
|
|
2023-06-30 17:48:49 +00:00
|
|
|
/**
|
|
|
|
* simple_offset_init - initialize an offset_ctx
|
|
|
|
* @octx: directory offset map to be initialized
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
void simple_offset_init(struct offset_ctx *octx)
|
|
|
|
{
|
2024-02-17 20:24:16 +00:00
|
|
|
mt_init_flags(&octx->mt, MT_FLAGS_ALLOC_RANGE);
|
|
|
|
lockdep_set_class(&octx->mt.ma_lock, &simple_offset_lock_class);
|
2024-02-17 20:23:47 +00:00
|
|
|
octx->next_offset = DIR_OFFSET_MIN;
|
2023-06-30 17:48:49 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* simple_offset_add - Add an entry to a directory's offset map
|
|
|
|
* @octx: directory offset ctx to be updated
|
|
|
|
* @dentry: new dentry being added
|
|
|
|
*
|
2024-02-17 20:24:16 +00:00
|
|
|
* Returns zero on success. @octx and the dentry's offset are updated.
|
2023-06-30 17:48:49 +00:00
|
|
|
* Otherwise, a negative errno value is returned.
|
|
|
|
*/
|
|
|
|
int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
|
|
|
|
{
|
2024-02-17 20:24:16 +00:00
|
|
|
unsigned long offset;
|
2023-06-30 17:48:49 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (dentry2offset(dentry) != 0)
|
|
|
|
return -EBUSY;
|
|
|
|
|
2024-02-17 20:24:16 +00:00
|
|
|
ret = mtree_alloc_cyclic(&octx->mt, &offset, dentry, DIR_OFFSET_MIN,
|
|
|
|
LONG_MAX, &octx->next_offset, GFP_KERNEL);
|
2023-06-30 17:48:49 +00:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
offset_set(dentry, offset);
|
|
|
|
return 0;
|
|
|
|
}
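/*
 * Illustration only, not part of the kernel build. A minimal userspace
 * sketch of the cyclic allocation policy that simple_offset_add() relies
 * on: the search starts at the remembered next_offset, wraps back to the
 * minimum, and a freed offset is not reused until the search comes around
 * again. The toy_* names, the fixed-size bitmap, and TOY_OFFSET_MAX are
 * assumptions made for this sketch; the real code stores dentries in a
 * maple tree and bounds the range with LONG_MAX.
 */

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_OFFSET_MIN 2	/* mirrors DIR_OFFSET_MIN: 0 and 1 are "." and ".." */
#define TOY_OFFSET_MAX 7	/* hypothetical small upper bound for the toy map */

struct toy_offset_ctx {
	bool used[TOY_OFFSET_MAX + 1];
	unsigned long next_offset;
};

static void toy_offset_init(struct toy_offset_ctx *c)
{
	for (unsigned long i = 0; i <= TOY_OFFSET_MAX; i++)
		c->used[i] = false;
	c->next_offset = TOY_OFFSET_MIN;
}

/* Returns the allocated offset, or 0 when every slot is taken. */
static unsigned long toy_offset_alloc(struct toy_offset_ctx *c)
{
	unsigned long span = TOY_OFFSET_MAX - TOY_OFFSET_MIN + 1;
	unsigned long start = c->next_offset;

	assert(start >= TOY_OFFSET_MIN && start <= TOY_OFFSET_MAX + 1);
	for (unsigned long n = 0; n < span; n++) {
		/* Walk forward from the hint, wrapping to the minimum. */
		unsigned long off = TOY_OFFSET_MIN +
				    (start - TOY_OFFSET_MIN + n) % span;

		if (!c->used[off]) {
			c->used[off] = true;
			c->next_offset = off + 1;
			return off;
		}
	}
	return 0;
}

static void toy_offset_free(struct toy_offset_ctx *c, unsigned long off)
{
	if (off >= TOY_OFFSET_MIN && off <= TOY_OFFSET_MAX)
		c->used[off] = false;
}
```

/*
 * In this sketch 0 doubles as the "no offset" value, which matches how
 * dentry2offset() == 0 is treated as "not in the map" above.
 */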
|
|
|
|
|
2024-04-15 15:20:54 +00:00
|
|
|
static int simple_offset_replace(struct offset_ctx *octx, struct dentry *dentry,
|
|
|
|
long offset)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = mtree_store(&octx->mt, offset, dentry, GFP_KERNEL);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
offset_set(dentry, offset);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-06-30 17:48:49 +00:00
|
|
|
/**
|
|
|
|
* simple_offset_remove - Remove an entry from a directory's offset map
|
|
|
|
* @octx: directory offset ctx to be updated
|
|
|
|
* @dentry: dentry being removed
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry)
|
|
|
|
{
|
2024-02-17 20:24:16 +00:00
|
|
|
long offset;
|
2023-06-30 17:48:49 +00:00
|
|
|
|
|
|
|
offset = dentry2offset(dentry);
|
|
|
|
if (offset == 0)
|
|
|
|
return;
|
|
|
|
|
2024-02-17 20:24:16 +00:00
|
|
|
mtree_erase(&octx->mt, offset);
|
2023-06-30 17:48:49 +00:00
|
|
|
offset_set(dentry, 0);
|
|
|
|
}
|
|
|
|
|
2024-02-17 20:23:54 +00:00
|
|
|
/**
|
|
|
|
* simple_offset_empty - Check if a dentry can be unlinked
|
|
|
|
* @dentry: dentry to be tested
|
|
|
|
*
|
|
|
|
* Returns 0 if @dentry is a non-empty directory; otherwise returns 1.
|
|
|
|
*/
|
|
|
|
int simple_offset_empty(struct dentry *dentry)
|
|
|
|
{
|
|
|
|
struct inode *inode = d_inode(dentry);
|
|
|
|
struct offset_ctx *octx;
|
|
|
|
struct dentry *child;
|
|
|
|
unsigned long index;
|
|
|
|
int ret = 1;
|
|
|
|
|
|
|
|
if (!inode || !S_ISDIR(inode->i_mode))
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
index = DIR_OFFSET_MIN;
|
|
|
|
octx = inode->i_op->get_offset_ctx(inode);
|
2024-02-17 20:24:16 +00:00
|
|
|
mt_for_each(&octx->mt, child, index, LONG_MAX) {
|
2024-02-17 20:23:54 +00:00
|
|
|
spin_lock(&child->d_lock);
|
|
|
|
if (simple_positive(child)) {
|
|
|
|
spin_unlock(&child->d_lock);
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_unlock(&child->d_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2024-04-15 15:20:55 +00:00
|
|
|
/**
|
|
|
|
* simple_offset_rename - handle directory offsets for rename
|
|
|
|
* @old_dir: parent directory of source entry
|
|
|
|
* @old_dentry: dentry of source entry
|
|
|
|
* @new_dir: parent directory of destination entry
|
|
|
|
* @new_dentry: dentry of destination
|
|
|
|
*
|
|
|
|
* Caller provides appropriate serialization.
|
|
|
|
*
|
2024-04-15 15:20:56 +00:00
|
|
|
* User space expects the directory offset value of the replaced
|
|
|
|
* (new) directory entry to be unchanged after a rename.
|
|
|
|
*
|
2024-04-15 15:20:55 +00:00
|
|
|
* Returns zero on success, a negative errno value on failure.
|
|
|
|
*/
|
|
|
|
int simple_offset_rename(struct inode *old_dir, struct dentry *old_dentry,
|
|
|
|
struct inode *new_dir, struct dentry *new_dentry)
|
|
|
|
{
|
|
|
|
struct offset_ctx *old_ctx = old_dir->i_op->get_offset_ctx(old_dir);
|
|
|
|
struct offset_ctx *new_ctx = new_dir->i_op->get_offset_ctx(new_dir);
|
2024-04-15 15:20:56 +00:00
|
|
|
long new_offset = dentry2offset(new_dentry);
|
2024-04-15 15:20:55 +00:00
|
|
|
|
|
|
|
simple_offset_remove(old_ctx, old_dentry);
|
2024-04-15 15:20:56 +00:00
|
|
|
|
|
|
|
if (new_offset) {
|
|
|
|
offset_set(new_dentry, 0);
|
|
|
|
return simple_offset_replace(new_ctx, old_dentry, new_offset);
|
|
|
|
}
|
2024-04-15 15:20:55 +00:00
|
|
|
return simple_offset_add(new_ctx, old_dentry);
|
|
|
|
}
|
|
|
|
|
2023-06-30 17:48:49 +00:00
|
|
|
/**
|
|
|
|
* simple_offset_rename_exchange - exchange rename with directory offsets
|
|
|
|
* @old_dir: parent of dentry being moved
|
|
|
|
* @old_dentry: dentry being moved
|
|
|
|
* @new_dir: destination parent
|
|
|
|
* @new_dentry: destination dentry
|
|
|
|
*
|
2024-04-15 15:20:54 +00:00
|
|
|
* This API preserves the directory offset values. Caller provides
|
|
|
|
* appropriate serialization.
|
|
|
|
*
|
2023-06-30 17:48:49 +00:00
|
|
|
* Returns zero on success. Otherwise a negative errno is returned and the
|
|
|
|
* rename is rolled back.
|
|
|
|
*/
|
|
|
|
int simple_offset_rename_exchange(struct inode *old_dir,
|
|
|
|
struct dentry *old_dentry,
|
|
|
|
struct inode *new_dir,
|
|
|
|
struct dentry *new_dentry)
|
|
|
|
{
|
|
|
|
struct offset_ctx *old_ctx = old_dir->i_op->get_offset_ctx(old_dir);
|
|
|
|
struct offset_ctx *new_ctx = new_dir->i_op->get_offset_ctx(new_dir);
|
2024-02-17 20:24:16 +00:00
|
|
|
long old_index = dentry2offset(old_dentry);
|
|
|
|
long new_index = dentry2offset(new_dentry);
|
2023-06-30 17:48:49 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
simple_offset_remove(old_ctx, old_dentry);
|
|
|
|
simple_offset_remove(new_ctx, new_dentry);
|
|
|
|
|
2024-04-15 15:20:54 +00:00
|
|
|
ret = simple_offset_replace(new_ctx, old_dentry, new_index);
|
2023-06-30 17:48:49 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_restore;
|
|
|
|
|
2024-04-15 15:20:54 +00:00
|
|
|
ret = simple_offset_replace(old_ctx, new_dentry, old_index);
|
2023-06-30 17:48:49 +00:00
|
|
|
if (ret) {
|
|
|
|
simple_offset_remove(new_ctx, old_dentry);
|
|
|
|
goto out_restore;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = simple_rename_exchange(old_dir, old_dentry, new_dir, new_dentry);
|
|
|
|
if (ret) {
|
|
|
|
simple_offset_remove(new_ctx, old_dentry);
|
|
|
|
simple_offset_remove(old_ctx, new_dentry);
|
|
|
|
goto out_restore;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_restore:
|
2024-04-15 15:20:54 +00:00
|
|
|
(void)simple_offset_replace(old_ctx, old_dentry, old_index);
|
|
|
|
(void)simple_offset_replace(new_ctx, new_dentry, new_index);
|
2023-06-30 17:48:49 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* simple_offset_destroy - Release offset map
|
|
|
|
* @octx: directory offset ctx that is about to be destroyed
|
|
|
|
*
|
|
|
|
* During fs teardown (eg. umount), a directory's offset map might still
|
|
|
|
* contain entries. mtree_destroy() cleans out anything that remains.
|
|
|
|
*/
|
|
|
|
void simple_offset_destroy(struct offset_ctx *octx)
|
|
|
|
{
|
2024-02-17 20:24:16 +00:00
|
|
|
mtree_destroy(&octx->mt);
|
2023-06-30 17:48:49 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* offset_dir_llseek - Advance the read position of a directory descriptor
|
|
|
|
* @file: an open directory whose position is to be updated
|
|
|
|
* @offset: a byte offset
|
|
|
|
* @whence: enumerator describing the starting position for this update
|
|
|
|
*
|
|
|
|
* SEEK_END, SEEK_DATA, and SEEK_HOLE are not supported for directories.
|
|
|
|
*
|
|
|
|
* Returns the updated read position if successful; otherwise a
|
|
|
|
* negative errno is returned and the read position remains unchanged.
|
|
|
|
*/
|
|
|
|
static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
|
|
|
|
{
|
|
|
|
switch (whence) {
|
|
|
|
case SEEK_CUR:
|
|
|
|
offset += file->f_pos;
|
|
|
|
fallthrough;
|
|
|
|
case SEEK_SET:
|
|
|
|
if (offset >= 0)
|
|
|
|
break;
|
|
|
|
fallthrough;
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2023-11-19 23:56:17 +00:00
|
|
|
/* In this case, ->private_data is protected by f_pos_lock */
|
|
|
|
file->private_data = NULL;
|
2024-02-17 20:24:16 +00:00
|
|
|
return vfs_setpos(file, offset, LONG_MAX);
|
2023-06-30 17:48:49 +00:00
|
|
|
}
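/*
 * Illustration only, not part of the kernel build. A minimal userspace
 * sketch of the whence handling in offset_dir_llseek(): SEEK_CUR is
 * turned into an absolute offset and then shares the SEEK_SET check,
 * while every other whence value, and any negative result, is rejected
 * without touching the position. toy_dir_llseek() is a hypothetical
 * name; the sketch omits the upper bound that vfs_setpos() enforces and
 * does not guard against arithmetic overflow.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/* Returns the new position, or -EINVAL; *pos is updated only on success. */
static long long toy_dir_llseek(long long *pos, long long offset, int whence)
{
	assert(pos != NULL);

	switch (whence) {
	case SEEK_CUR:
		offset += *pos;
		/* fall through */
	case SEEK_SET:
		if (offset >= 0)
			break;
		/* fall through */
	default:
		return -EINVAL;
	}

	*pos = offset;
	return offset;
}
```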
|
|
|
|
|
2024-02-17 20:23:40 +00:00
|
|
|
static struct dentry *offset_find_next(struct offset_ctx *octx, loff_t offset)
|
2023-06-30 17:48:49 +00:00
|
|
|
{
|
2024-02-17 20:24:16 +00:00
|
|
|
MA_STATE(mas, &octx->mt, offset, offset);
|
2023-06-30 17:48:49 +00:00
|
|
|
struct dentry *child, *found = NULL;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
2024-02-17 20:24:16 +00:00
|
|
|
child = mas_find(&mas, LONG_MAX);
|
2023-06-30 17:48:49 +00:00
|
|
|
if (!child)
|
|
|
|
goto out;
|
2023-07-25 18:31:04 +00:00
|
|
|
spin_lock(&child->d_lock);
|
2023-06-30 17:48:49 +00:00
|
|
|
if (simple_positive(child))
|
|
|
|
found = dget_dlock(child);
|
|
|
|
spin_unlock(&child->d_lock);
|
|
|
|
out:
|
|
|
|
rcu_read_unlock();
|
|
|
|
return found;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
|
|
|
|
{
|
|
|
|
struct inode *inode = d_inode(dentry);
|
2024-02-17 20:24:16 +00:00
|
|
|
long offset = dentry2offset(dentry);
|
2023-06-30 17:48:49 +00:00
|
|
|
|
|
|
|
return ctx->actor(ctx, dentry->d_name.name, dentry->d_name.len, offset,
|
|
|
|
inode->i_ino, fs_umode_to_dtype(inode->i_mode));
|
|
|
|
}
|
|
|
|
|
2023-11-19 23:56:17 +00:00
|
|
|
static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
|
2023-06-30 17:48:49 +00:00
|
|
|
{
|
2024-02-17 20:23:40 +00:00
|
|
|
struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
|
2023-06-30 17:48:49 +00:00
|
|
|
struct dentry *dentry;
|
|
|
|
|
|
|
|
while (true) {
|
2024-02-17 20:23:40 +00:00
|
|
|
dentry = offset_find_next(octx, ctx->pos);
|
2023-06-30 17:48:49 +00:00
|
|
|
if (!dentry)
|
2023-11-19 23:56:17 +00:00
|
|
|
return ERR_PTR(-ENOENT);
|
2023-06-30 17:48:49 +00:00
|
|
|
|
|
|
|
if (!offset_dir_emit(ctx, dentry)) {
|
|
|
|
dput(dentry);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2024-02-17 20:23:40 +00:00
|
|
|
ctx->pos = dentry2offset(dentry) + 1;
|
2023-06-30 17:48:49 +00:00
|
|
|
dput(dentry);
|
|
|
|
}
|
2023-11-19 23:56:17 +00:00
|
|
|
return NULL;
|
2023-06-30 17:48:49 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* offset_readdir - Emit entries starting at offset @ctx->pos
|
|
|
|
* @file: an open directory to iterate over
|
|
|
|
* @ctx: directory iteration context
|
|
|
|
*
|
|
|
|
* Caller must hold @file's i_rwsem to prevent insertion or removal of
|
|
|
|
* entries during this call.
|
|
|
|
*
|
|
|
|
* On entry, @ctx->pos contains an offset that represents the first entry
|
|
|
|
* to be read from the directory.
|
|
|
|
*
|
|
|
|
* The operation continues until there are no more entries to read, or
|
|
|
|
* until the ctx->actor indicates there is no more space in the caller's
|
|
|
|
* output buffer.
|
|
|
|
*
|
|
|
|
* On return, @ctx->pos contains an offset that will read the next entry
|
2023-07-25 18:31:04 +00:00
|
|
|
* in this directory when offset_readdir() is called again with @ctx.
|
2023-06-30 17:48:49 +00:00
|
|
|
*
|
|
|
|
* Return values:
|
|
|
|
* %0 - Complete
|
|
|
|
*/
|
|
|
|
static int offset_readdir(struct file *file, struct dir_context *ctx)
|
|
|
|
{
|
|
|
|
struct dentry *dir = file->f_path.dentry;
|
|
|
|
|
|
|
|
lockdep_assert_held(&d_inode(dir)->i_rwsem);
|
|
|
|
|
|
|
|
if (!dir_emit_dots(file, ctx))
|
|
|
|
return 0;
|
|
|
|
|
2023-11-19 23:56:17 +00:00
|
|
|
/* In this case, ->private_data is protected by f_pos_lock */
|
2024-02-17 20:23:47 +00:00
|
|
|
if (ctx->pos == DIR_OFFSET_MIN)
|
2023-11-19 23:56:17 +00:00
|
|
|
file->private_data = NULL;
|
|
|
|
else if (file->private_data == ERR_PTR(-ENOENT))
|
|
|
|
return 0;
|
|
|
|
file->private_data = offset_iterate_dir(d_inode(dir), ctx);
|
2023-06-30 17:48:49 +00:00
|
|
|
return 0;
|
|
|
|
}
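/*
 * Illustration only, not part of the kernel build. A minimal userspace
 * sketch of how iteration resumes across calls: each emitted entry moves
 * the position to its offset plus one, holes left by removed entries are
 * skipped, and a full output buffer stops the walk so that the next call
 * picks up at the entry that did not fit. struct toy_dir, toy_readdir()
 * and TOY_SLOTS are assumptions made for this sketch; the real code looks
 * entries up in the maple tree and reports them through ctx->actor.
 */

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Hypothetical directory: names[i] is the entry at offset i, NULL if none. */
struct toy_dir {
	const char *names[TOY_SLOTS];
};

/*
 * Copies up to @room names starting at offset *pos into @out and returns
 * how many were copied. *pos ends up one past the last emitted offset.
 */
static size_t toy_readdir(const struct toy_dir *dir, unsigned long *pos,
			  const char **out, size_t room)
{
	size_t emitted = 0;

	assert(dir && pos && out);
	for (unsigned long off = *pos; off < TOY_SLOTS; off++) {
		if (!dir->names[off])
			continue;	/* hole left by a removed entry */
		if (emitted == room)
			break;		/* caller's buffer is full */
		out[emitted++] = dir->names[off];
		*pos = off + 1;
	}
	return emitted;
}
```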
|
|
|
|
|
|
|
|
const struct file_operations simple_offset_dir_operations = {
|
|
|
|
.llseek = offset_dir_llseek,
|
|
|
|
.iterate_shared = offset_readdir,
|
|
|
|
.read = generic_read_dir,
|
|
|
|
.fsync = noop_fsync,
|
|
|
|
};
|
|
|
|
|
2019-11-18 14:43:10 +00:00
|
|
|
static struct dentry *find_next_child(struct dentry *parent, struct dentry *prev)
|
|
|
|
{
|
2023-11-07 07:00:39 +00:00
|
|
|
struct dentry *child = NULL, *d;
|
2019-11-18 14:43:10 +00:00
|
|
|
|
|
|
|
spin_lock(&parent->d_lock);
|
2023-11-07 07:00:39 +00:00
|
|
|
d = prev ? d_next_sibling(prev) : d_first_child(parent);
|
|
|
|
hlist_for_each_entry_from(d, d_sib) {
|
2019-11-18 14:43:10 +00:00
|
|
|
if (simple_positive(d)) {
|
|
|
|
spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
|
|
|
|
if (simple_positive(d))
|
|
|
|
child = dget_dlock(d);
|
|
|
|
spin_unlock(&d->d_lock);
|
|
|
|
if (likely(child))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
spin_unlock(&parent->d_lock);
|
|
|
|
dput(prev);
|
|
|
|
return child;
|
|
|
|
}
|
|
|
|
|
|
|
|
void simple_recursive_removal(struct dentry *dentry,
|
|
|
|
void (*callback)(struct dentry *))
|
|
|
|
{
|
|
|
|
struct dentry *this = dget(dentry);
|
|
|
|
while (true) {
|
|
|
|
struct dentry *victim = NULL, *child;
|
|
|
|
struct inode *inode = this->d_inode;
|
|
|
|
|
|
|
|
inode_lock(inode);
|
|
|
|
if (d_is_dir(this))
|
|
|
|
inode->i_flags |= S_DEAD;
|
|
|
|
while ((child = find_next_child(this, victim)) == NULL) {
|
|
|
|
// kill and ascend
|
|
|
|
// update metadata while it's still locked
|
2023-07-05 19:01:21 +00:00
|
|
|
inode_set_ctime_current(inode);
|
2019-11-18 14:43:10 +00:00
|
|
|
clear_nlink(inode);
|
|
|
|
inode_unlock(inode);
|
|
|
|
victim = this;
|
|
|
|
this = this->d_parent;
|
|
|
|
inode = this->d_inode;
|
|
|
|
inode_lock(inode);
|
|
|
|
if (simple_positive(victim)) {
|
|
|
|
d_invalidate(victim); // avoid lost mounts
|
|
|
|
if (d_is_dir(victim))
|
|
|
|
fsnotify_rmdir(inode, victim);
|
|
|
|
else
|
|
|
|
fsnotify_unlink(inode, victim);
|
|
|
|
if (callback)
|
|
|
|
callback(victim);
|
|
|
|
dput(victim); // unpin it
|
|
|
|
}
|
|
|
|
if (victim == dentry) {
|
2023-10-04 18:52:37 +00:00
|
|
|
inode_set_mtime_to_ts(inode,
|
|
|
|
inode_set_ctime_current(inode));
|
2019-11-18 14:43:10 +00:00
|
|
|
if (d_is_dir(dentry))
|
|
|
|
drop_nlink(inode);
|
|
|
|
inode_unlock(inode);
|
|
|
|
dput(dentry);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
inode_unlock(inode);
|
|
|
|
this = child;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(simple_recursive_removal);
|
|
|
|
|
2007-03-05 08:30:28 +00:00
|
|
|
static const struct super_operations simple_super_operations = {
|
|
|
|
.statfs = simple_statfs,
|
|
|
|
};
|
|
|
|
|
2019-03-25 16:38:26 +00:00
|
|
|
static int pseudo_fs_fill_super(struct super_block *s, struct fs_context *fc)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2019-03-25 16:38:23 +00:00
|
|
|
struct pseudo_fs_context *ctx = fc->fs_private;
|
2005-04-16 22:20:36 +00:00
|
|
|
struct inode *root;
|
|
|
|
|
2009-08-18 21:11:08 +00:00
|
|
|
s->s_maxbytes = MAX_LFS_FILESIZE;
|
2008-07-30 05:33:03 +00:00
|
|
|
s->s_blocksize = PAGE_SIZE;
|
|
|
|
s->s_blocksize_bits = PAGE_SHIFT;
|
2019-05-11 15:43:59 +00:00
|
|
|
s->s_magic = ctx->magic;
|
|
|
|
s->s_op = ctx->ops ?: &simple_super_operations;
|
|
|
|
s->s_xattr = ctx->xattr;
|
2005-04-16 22:20:36 +00:00
|
|
|
s->s_time_gran = 1;
|
|
|
|
root = new_inode(s);
|
|
|
|
if (!root)
|
2019-03-25 16:38:26 +00:00
|
|
|
return -ENOMEM;
|
|
|
|
|
2007-05-08 07:32:31 +00:00
|
|
|
/*
|
|
|
|
* since this is the first inode, make it number 1. New inodes created
|
|
|
|
* after this must take care not to collide with it (by passing
|
|
|
|
* max_reserved of 1 to iunique).
|
|
|
|
*/
|
|
|
|
root->i_ino = 1;
|
2005-04-16 22:20:36 +00:00
|
|
|
root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
|
2023-10-04 18:52:37 +00:00
|
|
|
simple_inode_init_ts(root);
|
2019-05-11 15:43:59 +00:00
|
|
|
s->s_root = d_make_root(root);
|
|
|
|
if (!s->s_root)
|
2019-03-25 16:38:26 +00:00
|
|
|
return -ENOMEM;
|
2019-05-11 15:43:59 +00:00
|
|
|
s->s_d_op = ctx->dops;
|
2019-03-25 16:38:23 +00:00
|
|
|
return 0;
|
2019-03-25 16:38:26 +00:00
|
|
|
}
|
2019-05-11 15:43:59 +00:00
|
|
|
|
2019-03-25 16:38:26 +00:00
|
|
|
static int pseudo_fs_get_tree(struct fs_context *fc)
|
|
|
|
{
|
2019-06-02 00:48:55 +00:00
|
|
|
return get_tree_nodev(fc, pseudo_fs_fill_super);
|
2019-03-25 16:38:23 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void pseudo_fs_free(struct fs_context *fc)
|
|
|
|
{
|
|
|
|
kfree(fc->fs_private);
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct fs_context_operations pseudo_fs_context_ops = {
|
|
|
|
.free = pseudo_fs_free,
|
|
|
|
.get_tree = pseudo_fs_get_tree,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Common helper for pseudo-filesystems (sockfs, pipefs, bdev - stuff that
|
|
|
|
* will never be mountable)
|
|
|
|
*/
|
|
|
|
struct pseudo_fs_context *init_pseudo(struct fs_context *fc,
|
|
|
|
unsigned long magic)
|
|
|
|
{
|
|
|
|
struct pseudo_fs_context *ctx;
|
|
|
|
|
|
|
|
ctx = kzalloc(sizeof(struct pseudo_fs_context), GFP_KERNEL);
|
|
|
|
if (likely(ctx)) {
|
|
|
|
ctx->magic = magic;
|
|
|
|
fc->fs_private = ctx;
|
|
|
|
fc->ops = &pseudo_fs_context_ops;
|
2019-03-25 16:38:26 +00:00
|
|
|
fc->sb_flags |= SB_NOUSER;
|
|
|
|
fc->global = true;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2019-03-25 16:38:23 +00:00
|
|
|
return ctx;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2019-03-25 16:38:23 +00:00
|
|
|
EXPORT_SYMBOL(init_pseudo);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-04-05 21:25:09 +00:00
|
|
|
int simple_open(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
if (inode->i_private)
|
|
|
|
file->private_data = inode->i_private;
|
|
|
|
return 0;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_open);
|
2012-04-05 21:25:09 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
|
|
|
|
{
|
2015-03-17 22:26:15 +00:00
|
|
|
struct inode *inode = d_inode(old_dentry);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2023-10-04 18:52:37 +00:00
|
|
|
inode_set_mtime_to_ts(dir,
|
|
|
|
inode_set_ctime_to_ts(dir, inode_set_ctime_current(inode)));
|
2006-10-01 06:29:04 +00:00
|
|
|
inc_nlink(inode);
|
2010-10-23 15:11:40 +00:00
|
|
|
ihold(inode);
|
2005-04-16 22:20:36 +00:00
|
|
|
dget(dentry);
|
|
|
|
d_instantiate(dentry, inode);
|
|
|
|
return 0;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_link);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
int simple_empty(struct dentry *dentry)
|
|
|
|
{
|
|
|
|
struct dentry *child;
|
|
|
|
int ret = 0;
|
|
|
|
|
2011-01-07 06:49:34 +00:00
|
|
|
spin_lock(&dentry->d_lock);
|
2023-11-07 07:00:39 +00:00
|
|
|
hlist_for_each_entry(child, &dentry->d_children, d_sib) {
|
2011-01-07 06:49:33 +00:00
|
|
|
spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
|
|
|
|
if (simple_positive(child)) {
|
|
|
|
spin_unlock(&child->d_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
goto out;
|
2011-01-07 06:49:33 +00:00
|
|
|
}
|
|
|
|
spin_unlock(&child->d_lock);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
ret = 1;
|
|
|
|
out:
|
2011-01-07 06:49:34 +00:00
|
|
|
spin_unlock(&dentry->d_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
return ret;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_empty);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
int simple_unlink(struct inode *dir, struct dentry *dentry)
|
|
|
|
{
|
2015-03-17 22:26:15 +00:00
|
|
|
struct inode *inode = d_inode(dentry);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2023-10-04 18:52:37 +00:00
|
|
|
inode_set_mtime_to_ts(dir,
|
|
|
|
inode_set_ctime_to_ts(dir, inode_set_ctime_current(inode)));
|
2006-10-01 06:29:03 +00:00
|
|
|
drop_nlink(inode);
|
2005-04-16 22:20:36 +00:00
|
|
|
dput(dentry);
|
|
|
|
return 0;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_unlink);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
int simple_rmdir(struct inode *dir, struct dentry *dentry)
|
|
|
|
{
|
|
|
|
if (!simple_empty(dentry))
|
|
|
|
return -ENOTEMPTY;
|
|
|
|
|
2015-03-17 22:26:15 +00:00
|
|
|
drop_nlink(d_inode(dentry));
|
2005-04-16 22:20:36 +00:00
|
|
|
simple_unlink(dir, dentry);
|
2006-10-01 06:29:03 +00:00
|
|
|
drop_nlink(dir);
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_rmdir);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2023-07-05 18:58:11 +00:00
|
|
|
/**
|
|
|
|
* simple_rename_timestamp - update the various inode timestamps for rename
|
|
|
|
* @old_dir: old parent directory
|
|
|
|
* @old_dentry: dentry that is being renamed
|
|
|
|
* @new_dir: new parent directory
|
|
|
|
* @new_dentry: target for rename
|
|
|
|
*
|
|
|
|
* POSIX mandates that the old and new parent directories have their ctime and
|
|
|
|
* mtime updated, and that inodes of @old_dentry and @new_dentry (if any), have
|
|
|
|
* their ctime updated.
|
|
|
|
*/
|
|
|
|
void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
|
|
|
|
struct inode *new_dir, struct dentry *new_dentry)
|
|
|
|
{
|
|
|
|
struct inode *newino = d_inode(new_dentry);
|
|
|
|
|
2023-10-04 18:52:37 +00:00
|
|
|
inode_set_mtime_to_ts(old_dir, inode_set_ctime_current(old_dir));
|
2023-07-05 18:58:11 +00:00
|
|
|
if (new_dir != old_dir)
|
2023-10-04 18:52:37 +00:00
|
|
|
inode_set_mtime_to_ts(new_dir,
|
|
|
|
inode_set_ctime_current(new_dir));
|
2023-07-05 18:58:11 +00:00
|
|
|
inode_set_ctime_current(d_inode(old_dentry));
|
|
|
|
if (newino)
|
|
|
|
inode_set_ctime_current(newino);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(simple_rename_timestamp);
|
|
|
|
|
2021-10-28 09:47:21 +00:00
|
|
|
int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
|
|
|
|
struct inode *new_dir, struct dentry *new_dentry)
|
|
|
|
{
|
|
|
|
bool old_is_dir = d_is_dir(old_dentry);
|
|
|
|
bool new_is_dir = d_is_dir(new_dentry);
|
|
|
|
|
|
|
|
if (old_dir != new_dir && old_is_dir != new_is_dir) {
|
|
|
|
if (old_is_dir) {
|
|
|
|
drop_nlink(old_dir);
|
|
|
|
inc_nlink(new_dir);
|
|
|
|
} else {
|
|
|
|
drop_nlink(new_dir);
|
|
|
|
inc_nlink(old_dir);
|
|
|
|
}
|
|
|
|
}
|
2023-07-05 18:58:11 +00:00
|
|
|
simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
|
2021-10-28 09:47:21 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(simple_rename_exchange);
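/*
 * Illustration only, not part of the kernel build. A minimal userspace
 * sketch of the link-count adjustment in simple_rename_exchange(): a
 * parent's link count changes only when the two parents differ and
 * exactly one of the exchanged entries is a directory, because only then
 * does a ".." reference move from one parent to the other. struct
 * toy_dir_links and toy_exchange_nlink() are hypothetical names.
 */

```c
#include <assert.h>
#include <stdbool.h>

struct toy_dir_links {
	unsigned int nlink;
};

static void toy_exchange_nlink(struct toy_dir_links *old_dir, bool old_is_dir,
			       struct toy_dir_links *new_dir, bool new_is_dir)
{
	assert(old_dir && new_dir);

	/* Same parent, or same kind of entry: subdirectory counts are unchanged. */
	if (old_dir == new_dir || old_is_dir == new_is_dir)
		return;

	if (old_is_dir) {
		old_dir->nlink--;	/* the subdirectory leaves old_dir */
		new_dir->nlink++;	/* its ".." now refers to new_dir */
	} else {
		new_dir->nlink--;
		old_dir->nlink++;
	}
}
```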
|
|
|
|
|
2023-01-13 11:49:17 +00:00
|
|
|
int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
|
2021-01-21 13:19:43 +00:00
|
|
|
struct dentry *old_dentry, struct inode *new_dir,
|
|
|
|
struct dentry *new_dentry, unsigned int flags)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2015-01-29 12:02:35 +00:00
|
|
|
int they_are_dirs = d_is_dir(old_dentry);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2021-10-28 09:47:22 +00:00
|
|
|
if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
|
2016-09-27 09:03:57 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2021-10-28 09:47:22 +00:00
|
|
|
if (flags & RENAME_EXCHANGE)
|
|
|
|
return simple_rename_exchange(old_dir, old_dentry, new_dir, new_dentry);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!simple_empty(new_dentry))
|
|
|
|
return -ENOTEMPTY;
|
|
|
|
|
2015-03-17 22:26:15 +00:00
|
|
|
if (d_really_is_positive(new_dentry)) {
|
2005-04-16 22:20:36 +00:00
|
|
|
simple_unlink(new_dir, new_dentry);
|
2011-07-21 19:49:09 +00:00
|
|
|
if (they_are_dirs) {
|
2015-03-17 22:26:15 +00:00
|
|
|
drop_nlink(d_inode(new_dentry));
|
2006-10-01 06:29:03 +00:00
|
|
|
drop_nlink(old_dir);
|
2011-07-21 19:49:09 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
} else if (they_are_dirs) {
|
2006-10-01 06:29:03 +00:00
|
|
|
drop_nlink(old_dir);
|
2006-10-01 06:29:04 +00:00
|
|
|
inc_nlink(new_dir);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2023-07-05 18:58:11 +00:00
|
|
|
simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_rename);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-05-26 15:05:33 +00:00
|
|
|
/**
|
2010-06-04 09:30:01 +00:00
|
|
|
* simple_setattr - setattr for simple filesystem
|
2023-01-13 11:49:11 +00:00
|
|
|
* @idmap: idmap of the target mount
|
2010-05-26 15:05:33 +00:00
|
|
|
* @dentry: dentry
|
|
|
|
* @iattr: iattr structure
|
|
|
|
*
|
|
|
|
* Returns 0 on success, -error on failure.
|
|
|
|
*
|
2010-06-04 09:30:01 +00:00
|
|
|
* simple_setattr is a simple ->setattr implementation without a proper
|
|
|
|
* implementation of size changes.
|
|
|
|
*
|
|
|
|
* It can either be used for in-memory filesystems or special files
|
|
|
|
* on simple regular filesystems. Anything that needs to change on-disk
|
|
|
|
* or wire state on size changes needs its own setattr method.
|
2010-05-26 15:05:33 +00:00
|
|
|
*/
|
2023-01-13 11:49:11 +00:00
|
|
|
int simple_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
|
2021-01-21 13:19:43 +00:00
|
|
|
struct iattr *iattr)
|
2010-05-26 15:05:33 +00:00
|
|
|
{
|
2015-03-17 22:26:15 +00:00
|
|
|
struct inode *inode = d_inode(dentry);
|
2010-05-26 15:05:33 +00:00
|
|
|
int error;
|
|
|
|
|
2023-01-13 11:49:11 +00:00
|
|
|
error = setattr_prepare(idmap, dentry, iattr);
|
2010-05-26 15:05:33 +00:00
|
|
|
if (error)
|
|
|
|
return error;
|
|
|
|
|
2010-06-04 09:30:04 +00:00
|
|
|
if (iattr->ia_valid & ATTR_SIZE)
|
|
|
|
truncate_setsize(inode, iattr->ia_size);
|
2023-01-13 11:49:11 +00:00
|
|
|
setattr_copy(idmap, inode, iattr);
|
2010-06-04 09:30:01 +00:00
|
|
|
mark_inode_dirty(inode);
|
|
|
|
return 0;
|
2010-05-26 15:05:33 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(simple_setattr);
|
|
|
|
|
2022-04-29 15:49:41 +00:00
|
|
|
static int simple_read_folio(struct file *file, struct folio *folio)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2022-04-29 15:49:41 +00:00
|
|
|
folio_zero_range(folio, 0, folio_size(folio));
|
|
|
|
flush_dcache_folio(folio);
|
|
|
|
folio_mark_uptodate(folio);
|
|
|
|
folio_unlock(folio);
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2007-10-16 08:25:01 +00:00
|
|
|
int simple_write_begin(struct file *file, struct address_space *mapping,
|
2022-02-22 19:31:43 +00:00
|
|
|
loff_t pos, unsigned len,
|
2007-10-16 08:25:01 +00:00
|
|
|
struct page **pagep, void **fsdata)
|
|
|
|
{
|
2023-08-21 14:13:22 +00:00
|
|
|
struct folio *folio;
|
2007-10-16 08:25:01 +00:00
|
|
|
|
2023-08-21 14:13:22 +00:00
|
|
|
folio = __filemap_get_folio(mapping, pos / PAGE_SIZE, FGP_WRITEBEGIN,
|
|
|
|
mapping_gfp_mask(mapping));
|
|
|
|
if (IS_ERR(folio))
|
|
|
|
return PTR_ERR(folio);
|
2007-10-16 08:25:01 +00:00
|
|
|
|
2023-08-21 14:13:22 +00:00
|
|
|
*pagep = &folio->page;
|
2007-10-16 08:25:01 +00:00
|
|
|
|
2023-08-21 14:13:22 +00:00
|
|
|
if (!folio_test_uptodate(folio) && (len != folio_size(folio))) {
|
|
|
|
size_t from = offset_in_folio(folio, pos);
|
2010-01-12 14:18:08 +00:00
|
|
|
|
2023-08-21 14:13:22 +00:00
|
|
|
folio_zero_segments(folio, 0, from,
|
|
|
|
from + len, folio_size(folio));
|
2010-01-12 14:18:08 +00:00
|
|
|
}
|
|
|
|
return 0;
|
2007-10-16 08:25:01 +00:00
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_write_begin);
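/*
 * Illustrative userspace sketch, not part of libfs. It models the
 * partial-write branch of simple_write_begin(): when the folio is not yet
 * uptodate and the write does not cover all of it, everything outside the
 * range about to be written is zeroed. demo_zero_segments() and the plain
 * byte buffer are hypothetical stand-ins for folio_zero_segments() and the
 * folio; the sketch assumes the write range fits inside the buffer.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for folio_zero_segments(): zero the byte
 * ranges [start1, end1) and [start2, end2) of buf.
 */
static void demo_zero_segments(unsigned char *buf, size_t start1, size_t end1,
			       size_t start2, size_t end2)
{
	assert(start1 <= end1 && start2 <= end2);
	memset(buf + start1, 0, end1 - start1);
	memset(buf + start2, 0, end2 - start2);
}

/*
 * Model of the not-uptodate, partial-write case: everything outside
 * [from, from + len) is cleared, the range to be written is left alone.
 */
static void demo_prepare_partial_write(unsigned char *buf, size_t size,
				       size_t from, size_t len)
{
	assert(from + len <= size);
	demo_zero_segments(buf, 0, from, from + len, size);
}
```

/*
 * Zeroing the head and tail up front means the folio never exposes stale
 * memory once the caller has copied its data into the middle range.
 */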
|
2007-10-16 08:25:01 +00:00
|
|
|
|
2010-01-12 13:13:47 +00:00
|
|
|
/**
|
|
|
|
* simple_write_end - .write_end helper for non-block-device FSes
|
2019-10-14 21:12:14 +00:00
|
|
|
* @file: See .write_end of address_space_operations
|
2010-01-12 13:13:47 +00:00
|
|
|
* @mapping: "
|
|
|
|
* @pos: "
|
|
|
|
* @len: "
|
|
|
|
* @copied: "
|
|
|
|
* @page: "
|
|
|
|
* @fsdata: "
|
|
|
|
*
|
|
|
|
* simple_write_end does the minimum needed for updating a page after writing is
|
|
|
|
* done. It has the same API signature as the .write_end of
|
|
|
|
* address_space_operations vector. So it can just be set onto .write_end for
|
|
|
|
* FSes that don't need any other processing. i_mutex is assumed to be held.
|
|
|
|
* Block based filesystems should use generic_write_end().
|
|
|
|
* NOTE: Even though i_size might get updated by this function, mark_inode_dirty
|
|
|
|
* is not called, so a filesystem that actually does store data in .write_inode
|
|
|
|
* should extend on what's done here with a call to mark_inode_dirty() in the
|
|
|
|
* case that i_size has changed.
|
2016-08-30 02:39:56 +00:00
|
|
|
*
|
2022-04-29 15:49:41 +00:00
|
|
|
* Use *ONLY* with simple_read_folio()
|
2010-01-12 13:13:47 +00:00
|
|
|
*/
|
2021-06-29 02:36:09 +00:00
|
|
|
static int simple_write_end(struct file *file, struct address_space *mapping,
|
2007-10-16 08:25:01 +00:00
|
|
|
loff_t pos, unsigned len, unsigned copied,
|
|
|
|
struct page *page, void *fsdata)
|
|
|
|
{
|
2023-08-21 14:13:22 +00:00
|
|
|
struct folio *folio = page_folio(page);
|
|
|
|
struct inode *inode = folio->mapping->host;
|
2010-01-12 13:13:47 +00:00
|
|
|
loff_t last_pos = pos + copied;
|
2007-10-16 08:25:01 +00:00
|
|
|
|
2023-08-21 14:13:22 +00:00
|
|
|
/* zero the stale part of the folio if we did a short copy */
|
|
|
|
if (!folio_test_uptodate(folio)) {
|
2016-08-30 02:39:56 +00:00
|
|
|
if (copied < len) {
|
2023-08-21 14:13:22 +00:00
|
|
|
size_t from = offset_in_folio(folio, pos);
|
2007-10-16 08:25:01 +00:00
|
|
|
|
2023-08-21 14:13:22 +00:00
|
|
|
folio_zero_range(folio, from + copied, len - copied);
|
2016-08-30 02:39:56 +00:00
|
|
|
}
|
2023-08-21 14:13:22 +00:00
|
|
|
folio_mark_uptodate(folio);
|
2016-08-30 02:39:56 +00:00
|
|
|
}
|
2010-01-12 13:13:47 +00:00
|
|
|
/*
|
|
|
|
* No need to use i_size_read() here, the i_size
|
|
|
|
* cannot change under us because we hold the i_mutex.
|
|
|
|
*/
|
|
|
|
if (last_pos > inode->i_size)
|
|
|
|
i_size_write(inode, last_pos);
|
2007-10-16 08:25:01 +00:00
|
|
|
|
2023-08-21 14:13:22 +00:00
|
|
|
folio_mark_dirty(folio);
|
|
|
|
folio_unlock(folio);
|
|
|
|
folio_put(folio);
|
2007-10-16 08:25:01 +00:00
|
|
|
|
|
|
|
return copied;
|
|
|
|
}
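/*
 * Illustrative userspace sketch, not part of libfs. It models the
 * bookkeeping in simple_write_end(): a short copy into a folio that was not
 * uptodate gets its missing tail zeroed, and the file size only ever grows.
 * struct demo_file and demo_write_end() are hypothetical names; a fixed
 * 64-byte buffer stands in for the folio and a plain integer for i_size.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reduced state: one buffer standing in for a folio plus i_size. */
struct demo_file {
	unsigned char data[64];
	int uptodate;
	long long size;
};

/*
 * `copied` of the `len` bytes prepared at in-buffer offset `from` (file
 * position `pos`) actually arrived. Returns the number of bytes accepted.
 */
static size_t demo_write_end(struct demo_file *f, long long pos, size_t from,
			     size_t len, size_t copied)
{
	long long last_pos = pos + (long long)copied;

	assert(copied <= len && from + len <= sizeof(f->data));
	if (!f->uptodate) {
		if (copied < len)	/* clear the part the short copy missed */
			memset(f->data + from + copied, 0, len - copied);
		f->uptodate = 1;
	}
	if (last_pos > f->size)		/* never shrinks the file */
		f->size = last_pos;
	return copied;
}
```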
|
2021-06-29 02:36:09 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Provides ramfs-style behavior: data in the pagecache, but no writeback.
|
|
|
|
*/
|
|
|
|
const struct address_space_operations ram_aops = {
|
2022-04-29 15:49:41 +00:00
|
|
|
.read_folio = simple_read_folio,
|
2021-06-29 02:36:09 +00:00
|
|
|
.write_begin = simple_write_begin,
|
|
|
|
.write_end = simple_write_end,
|
2022-02-09 20:22:13 +00:00
|
|
|
.dirty_folio = noop_dirty_folio,
|
2021-06-29 02:36:09 +00:00
|
|
|
};
|
|
|
|
EXPORT_SYMBOL(ram_aops);
|
2007-10-16 08:25:01 +00:00
|
|
|
|
2007-05-08 07:32:31 +00:00
|
|
|
/*
|
|
|
|
* the inodes created here are not hashed. If you use iunique to generate
|
|
|
|
* unique inode values later for this filesystem, then you must take care
|
|
|
|
* to pass it an appropriate max_reserved value to avoid collisions.
|
|
|
|
*/
|
2010-06-03 09:58:28 +00:00
|
|
|
int simple_fill_super(struct super_block *s, unsigned long magic,
|
2017-03-26 04:15:37 +00:00
|
|
|
const struct tree_descr *files)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct inode *inode;
|
|
|
|
struct dentry *dentry;
|
|
|
|
int i;
|
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
s->s_blocksize = PAGE_SIZE;
|
|
|
|
s->s_blocksize_bits = PAGE_SHIFT;
|
2005-04-16 22:20:36 +00:00
|
|
|
s->s_magic = magic;
|
2007-03-05 08:30:28 +00:00
|
|
|
s->s_op = &simple_super_operations;
|
2005-04-16 22:20:36 +00:00
|
|
|
s->s_time_gran = 1;
|
|
|
|
|
|
|
|
inode = new_inode(s);
|
|
|
|
if (!inode)
|
|
|
|
return -ENOMEM;
|
2007-05-08 07:32:31 +00:00
|
|
|
/*
|
|
|
|
* because the root inode is 1, the files array must not contain an
|
|
|
|
* entry at index 1
|
|
|
|
*/
|
|
|
|
inode->i_ino = 1;
|
2005-04-16 22:20:36 +00:00
|
|
|
inode->i_mode = S_IFDIR | 0755;
|
2023-10-04 18:52:37 +00:00
|
|
|
simple_inode_init_ts(inode);
|
2005-04-16 22:20:36 +00:00
|
|
|
inode->i_op = &simple_dir_inode_operations;
|
|
|
|
inode->i_fop = &simple_dir_operations;
|
2011-10-28 12:13:29 +00:00
|
|
|
set_nlink(inode, 2);
|
2023-11-11 20:56:55 +00:00
|
|
|
s->s_root = d_make_root(inode);
|
|
|
|
if (!s->s_root)
|
2005-04-16 22:20:36 +00:00
|
|
|
return -ENOMEM;
|
|
|
|
for (i = 0; !files->name || files->name[0]; i++, files++) {
|
|
|
|
if (!files->name)
|
|
|
|
continue;
|
2007-05-08 07:32:31 +00:00
|
|
|
|
|
|
|
/* warn if it tries to conflict with the root inode */
|
|
|
|
if (unlikely(i == 1))
|
|
|
|
printk(KERN_WARNING "%s: %s passed in a files array "
|
|
|
|
"with an index of 1!\n", __func__,
|
|
|
|
s->s_type->name);
|
|
|
|
|
2023-11-11 20:56:55 +00:00
|
|
|
dentry = d_alloc_name(s->s_root, files->name);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!dentry)
|
2023-11-11 20:56:55 +00:00
|
|
|
return -ENOMEM;
|
2005-04-16 22:20:36 +00:00
|
|
|
inode = new_inode(s);
|
2011-11-01 13:12:33 +00:00
|
|
|
if (!inode) {
|
|
|
|
dput(dentry);
|
2023-11-11 20:56:55 +00:00
|
|
|
return -ENOMEM;
|
2011-11-01 13:12:33 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
inode->i_mode = S_IFREG | files->mode;
|
2023-10-04 18:52:37 +00:00
|
|
|
simple_inode_init_ts(inode);
|
2005-04-16 22:20:36 +00:00
|
|
|
inode->i_fop = files->ops;
|
|
|
|
inode->i_ino = i;
|
|
|
|
d_add(dentry, inode);
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_fill_super);
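/*
 * Illustrative userspace sketch, not part of libfs. It walks a descriptor
 * array the way the loop in simple_fill_super() does: a NULL name is a hole
 * that is skipped but still consumes an index, an empty string terminates
 * the array, and the array index becomes the inode number, so index 1 would
 * clash with the root inode. struct demo_descr and demo_count_files() are
 * hypothetical, reduced stand-ins for struct tree_descr and the real loop.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct tree_descr (name and mode only). */
struct demo_descr {
	const char *name;
	int mode;
};

/*
 * Returns the number of files that would be created. *clash is set to 1 if
 * index 1 (the root inode number) is populated, 0 otherwise.
 */
static int demo_count_files(const struct demo_descr *files, int *clash)
{
	int i, created = 0;

	assert(files != NULL && clash != NULL);
	*clash = 0;
	for (i = 0; !files->name || files->name[0]; i++, files++) {
		if (!files->name)
			continue;	/* hole: index is consumed, no file */
		if (i == 1)
			*clash = 1;	/* the real code only warns here */
		created++;		/* the real code sets i_ino = i */
	}
	return created;
}
```

/*
 * A caller following this convention would leave indices 0 and 1 as NULL
 * holes and start its real entries at index 2.
 */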
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
static DEFINE_SPINLOCK(pin_fs_lock);
|
|
|
|
|
2006-06-09 13:34:16 +00:00
|
|
|
int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *count)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct vfsmount *mnt = NULL;
|
|
|
|
spin_lock(&pin_fs_lock);
|
|
|
|
if (unlikely(!*mount)) {
|
|
|
|
spin_unlock(&pin_fs_lock);
|
2017-11-27 21:05:09 +00:00
|
|
|
mnt = vfs_kern_mount(type, SB_KERNMOUNT, type->name, NULL);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (IS_ERR(mnt))
|
|
|
|
return PTR_ERR(mnt);
|
|
|
|
spin_lock(&pin_fs_lock);
|
|
|
|
if (!*mount)
|
|
|
|
*mount = mnt;
|
|
|
|
}
|
|
|
|
mntget(*mount);
|
|
|
|
++*count;
|
|
|
|
spin_unlock(&pin_fs_lock);
|
|
|
|
mntput(mnt);
|
|
|
|
return 0;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_pin_fs);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
void simple_release_fs(struct vfsmount **mount, int *count)
|
|
|
|
{
|
|
|
|
struct vfsmount *mnt;
|
|
|
|
spin_lock(&pin_fs_lock);
|
|
|
|
mnt = *mount;
|
|
|
|
if (!--*count)
|
|
|
|
*mount = NULL;
|
|
|
|
spin_unlock(&pin_fs_lock);
|
|
|
|
mntput(mnt);
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_release_fs);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2008-07-04 16:59:51 +00:00
|
|
|
/**
|
|
|
|
* simple_read_from_buffer - copy data from the buffer to user space
|
|
|
|
* @to: the user space buffer to read to
|
|
|
|
* @count: the maximum number of bytes to read
|
|
|
|
* @ppos: the current position in the buffer
|
|
|
|
* @from: the buffer to read from
|
|
|
|
* @available: the size of the buffer
|
|
|
|
*
|
|
|
|
* The simple_read_from_buffer() function reads up to @count bytes from the
|
|
|
|
* buffer @from at offset @ppos into the user space address starting at @to.
|
|
|
|
*
|
|
|
|
* On success, the number of bytes read is returned and the offset @ppos is
|
|
|
|
* advanced by this number. On error, a negative value is returned.
|
|
|
|
**/
|
2005-04-16 22:20:36 +00:00
|
|
|
ssize_t simple_read_from_buffer(void __user *to, size_t count, loff_t *ppos,
|
|
|
|
const void *from, size_t available)
|
|
|
|
{
|
|
|
|
loff_t pos = *ppos;
|
2009-09-18 20:05:42 +00:00
|
|
|
size_t ret;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (pos < 0)
|
|
|
|
return -EINVAL;
|
2009-09-18 20:05:42 +00:00
|
|
|
if (pos >= available || !count)
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
if (count > available - pos)
|
|
|
|
count = available - pos;
|
2009-09-18 20:05:42 +00:00
|
|
|
ret = copy_to_user(to, from + pos, count);
|
|
|
|
if (ret == count)
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EFAULT;
|
2009-09-18 20:05:42 +00:00
|
|
|
count -= ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
*ppos = pos + count;
|
|
|
|
return count;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_read_from_buffer);
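/*
 * Illustrative userspace sketch, not part of libfs. It models how
 * simple_read_from_buffer() clamps the request to the bytes remaining and
 * how it treats a partial copy: -EFAULT only if nothing was copied,
 * otherwise the short count is returned and the position advanced by it.
 * demo_copy_out() is a hypothetical stand-in for copy_to_user() that can be
 * told to stop after `limit` bytes; long long and long stand in for loff_t
 * and ssize_t.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_to_user(): copies at most `limit` bytes and
 * returns the number of bytes that could NOT be copied (0 means success).
 */
static size_t demo_copy_out(void *to, const void *from, size_t n, size_t limit)
{
	size_t done = n < limit ? n : limit;

	memcpy(to, from, done);
	return n - done;
}

/* Userspace model of simple_read_from_buffer() built on the stand-in above. */
static long demo_read_from_buffer(void *to, size_t count, long long *ppos,
				  const void *from, size_t available,
				  size_t limit)
{
	long long pos = *ppos;
	size_t ret;

	assert(to != NULL && from != NULL);
	if (pos < 0)
		return -EINVAL;
	if ((unsigned long long)pos >= available || !count)
		return 0;
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;
	ret = demo_copy_out(to, (const char *)from + pos, count, limit);
	if (ret == count)
		return -EFAULT;		/* nothing was copied at all */
	count -= ret;			/* short copy: report what got through */
	*ppos = pos + (long long)count;
	return (long)count;
}
```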
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-05-01 21:51:22 +00:00
|
|
|
/**
|
|
|
|
* simple_write_to_buffer - copy data from user space to the buffer
|
|
|
|
* @to: the buffer to write to
|
|
|
|
* @available: the size of the buffer
|
|
|
|
* @ppos: the current position in the buffer
|
|
|
|
* @from: the user space buffer to read from
|
|
|
|
* @count: the maximum number of bytes to copy from @from
|
|
|
|
*
|
|
|
|
* The simple_write_to_buffer() function reads up to @count bytes from the user
|
|
|
|
* space address starting at @from into the buffer @to at offset @ppos.
|
|
|
|
*
|
|
|
|
* On success, the number of bytes written is returned and the offset @ppos is
|
|
|
|
* advanced by this number. On error, a negative value is returned.
|
|
|
|
**/
|
|
|
|
ssize_t simple_write_to_buffer(void *to, size_t available, loff_t *ppos,
|
|
|
|
const void __user *from, size_t count)
|
|
|
|
{
|
|
|
|
loff_t pos = *ppos;
|
|
|
|
size_t res;
|
|
|
|
|
|
|
|
if (pos < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
if (pos >= available || !count)
|
|
|
|
return 0;
|
|
|
|
if (count > available - pos)
|
|
|
|
count = available - pos;
|
|
|
|
res = copy_from_user(to + pos, from, count);
|
|
|
|
if (res == count)
|
|
|
|
return -EFAULT;
|
|
|
|
count -= res;
|
|
|
|
*ppos = pos + count;
|
|
|
|
return count;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_write_to_buffer);
|
2010-05-01 21:51:22 +00:00
|
|
|
|
2008-07-04 16:59:51 +00:00
|
|
|
/**
|
|
|
|
* memory_read_from_buffer - copy data from the buffer
|
|
|
|
* @to: the kernel space buffer to read to
|
|
|
|
* @count: the maximum number of bytes to read
|
|
|
|
* @ppos: the current position in the buffer
|
|
|
|
* @from: the buffer to read from
|
|
|
|
* @available: the size of the buffer
|
|
|
|
*
|
|
|
|
* The memory_read_from_buffer() function reads up to @count bytes from the
|
|
|
|
* buffer @from at offset @ppos into the kernel space address starting at @to.
|
|
|
|
*
|
|
|
|
* On success, the number of bytes read is returned and the offset @ppos is
|
|
|
|
* advanced by this number. On error, a negative value is returned.
|
|
|
|
**/
|
2008-06-06 05:46:21 +00:00
|
|
|
ssize_t memory_read_from_buffer(void *to, size_t count, loff_t *ppos,
|
|
|
|
const void *from, size_t available)
|
|
|
|
{
|
|
|
|
loff_t pos = *ppos;
|
|
|
|
|
|
|
|
if (pos < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
if (pos >= available)
|
|
|
|
return 0;
|
|
|
|
if (count > available - pos)
|
|
|
|
count = available - pos;
|
|
|
|
memcpy(to, from + pos, count);
|
|
|
|
*ppos = pos + count;
|
|
|
|
|
|
|
|
return count;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(memory_read_from_buffer);
|
2008-06-06 05:46:21 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Transaction based IO.
|
|
|
|
* The file expects a single write which triggers the transaction, and then
|
|
|
|
* possibly a read which collects the result - which is stored in a
|
|
|
|
* file-local buffer.
|
|
|
|
*/
|
2009-03-25 15:48:35 +00:00
|
|
|
|
|
|
|
void simple_transaction_set(struct file *file, size_t n)
|
|
|
|
{
|
|
|
|
struct simple_transaction_argresp *ar = file->private_data;
|
|
|
|
|
|
|
|
BUG_ON(n > SIMPLE_TRANSACTION_LIMIT);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The barrier ensures that ar->size will really remain zero until
|
|
|
|
* ar->data is ready for reading.
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
ar->size = n;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_transaction_set);
|
2009-03-25 15:48:35 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
char *simple_transaction_get(struct file *file, const char __user *buf, size_t size)
|
|
|
|
{
|
|
|
|
struct simple_transaction_argresp *ar;
|
|
|
|
static DEFINE_SPINLOCK(simple_transaction_lock);
|
|
|
|
|
|
|
|
if (size > SIMPLE_TRANSACTION_LIMIT - 1)
|
|
|
|
return ERR_PTR(-EFBIG);
|
|
|
|
|
|
|
|
ar = (struct simple_transaction_argresp *)get_zeroed_page(GFP_KERNEL);
|
|
|
|
if (!ar)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
|
|
|
spin_lock(&simple_transaction_lock);
|
|
|
|
|
|
|
|
/* only one write allowed per open */
|
|
|
|
if (file->private_data) {
|
|
|
|
spin_unlock(&simple_transaction_lock);
|
|
|
|
free_page((unsigned long)ar);
|
|
|
|
return ERR_PTR(-EBUSY);
|
|
|
|
}
|
|
|
|
|
|
|
|
file->private_data = ar;
|
|
|
|
|
|
|
|
spin_unlock(&simple_transaction_lock);
|
|
|
|
|
|
|
|
if (copy_from_user(ar->data, buf, size))
|
|
|
|
return ERR_PTR(-EFAULT);
|
|
|
|
|
|
|
|
return ar->data;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_transaction_get);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
ssize_t simple_transaction_read(struct file *file, char __user *buf, size_t size, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct simple_transaction_argresp *ar = file->private_data;
|
|
|
|
|
|
|
|
if (!ar)
|
|
|
|
return 0;
|
|
|
|
return simple_read_from_buffer(buf, size, pos, ar->data, ar->size);
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_transaction_read);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
int simple_transaction_release(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
free_page((unsigned long)file->private_data);
|
|
|
|
return 0;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL(simple_transaction_release);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2005-05-18 12:40:59 +00:00
|
|
|
/* Simple attribute files */
|
|
|
|
|
|
|
|
struct simple_attr {
|
2008-02-08 12:20:26 +00:00
|
|
|
int (*get)(void *, u64 *);
|
|
|
|
int (*set)(void *, u64);
|
2005-05-18 12:40:59 +00:00
|
|
|
char get_buf[24]; /* enough to store a u64 and "\n\0" */
|
|
|
|
char set_buf[24];
|
|
|
|
void *data;
|
|
|
|
const char *fmt; /* format for read operation */
|
2006-03-23 11:00:36 +00:00
|
|
|
struct mutex mutex; /* protects access to these buffers */
|
2005-05-18 12:40:59 +00:00
|
|
|
};
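/*
 * Illustrative userspace sketch, not part of libfs. It checks the sizing
 * claim on get_buf: the largest u64 printed in decimal takes 20 digits, so
 * with a trailing newline and the terminating NUL it needs 22 bytes, which
 * fits in 24. The format string is supplied by the caller of
 * simple_attr_open(); "%llu\n" is only an assumed, typical choice here, and
 * demo_format_attr() is a hypothetical helper.
 */

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Same size as get_buf in struct simple_attr. */
#define DEMO_ATTR_BUF_SIZE 24

/*
 * Format val as "%llu\n" and return the number of characters produced
 * (excluding the NUL), as reported by snprintf().
 */
static int demo_format_attr(char buf[DEMO_ATTR_BUF_SIZE], uint64_t val)
{
	int n = snprintf(buf, DEMO_ATTR_BUF_SIZE, "%llu\n",
			 (unsigned long long)val);

	assert(n > 0 && n < DEMO_ATTR_BUF_SIZE);	/* no truncation */
	return n;
}
```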
|
|
|
|
|
|
|
|
/* simple_attr_open is called by an actual attribute open file operation
|
|
|
|
* to set the attribute specific access operations. */
|
|
|
|
int simple_attr_open(struct inode *inode, struct file *file,
|
2008-02-08 12:20:26 +00:00
|
|
|
int (*get)(void *, u64 *), int (*set)(void *, u64),
|
2005-05-18 12:40:59 +00:00
|
|
|
const char *fmt)
|
|
|
|
{
|
|
|
|
struct simple_attr *attr;
|
|
|
|
|
libfs: fix infoleak in simple_attr_read()
Reading from a debugfs file at a nonzero position, without first reading
at position 0, leaks uninitialized memory to userspace.
It's a bit tricky to do this, since lseek() and pread() aren't allowed
on these files, and write() doesn't update the position on them. But
writing to them with splice() *does* update the position:
#define _GNU_SOURCE 1
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
int pipes[2], fd, n, i;
char buf[32];
pipe(pipes);
write(pipes[1], "0", 1);
fd = open("/sys/kernel/debug/fault_around_bytes", O_RDWR);
splice(pipes[0], NULL, fd, NULL, 1, 0);
n = read(fd, buf, sizeof(buf));
for (i = 0; i < n; i++)
printf("%02x", buf[i]);
printf("\n");
}
Output:
5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a30
Fix the infoleak by making simple_attr_read() always fill
simple_attr::get_buf if it hasn't been filled yet.
Reported-by: syzbot+fcab69d1ada3e8d6f06b@syzkaller.appspotmail.com
Reported-by: Alexander Potapenko <glider@google.com>
Fixes: acaefc25d21f ("[PATCH] libfs: add simple attribute files")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200308023849.988264-1-ebiggers@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-08 02:38:49 +00:00
|
|
|
attr = kzalloc(sizeof(*attr), GFP_KERNEL);
|
2005-05-18 12:40:59 +00:00
|
|
|
if (!attr)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
attr->get = get;
|
|
|
|
attr->set = set;
|
2006-09-27 08:50:46 +00:00
|
|
|
attr->data = inode->i_private;
|
2005-05-18 12:40:59 +00:00
|
|
|
attr->fmt = fmt;
|
2006-03-23 11:00:36 +00:00
|
|
|
mutex_init(&attr->mutex);
|
2005-05-18 12:40:59 +00:00
|
|
|
|
|
|
|
file->private_data = attr;
|
|
|
|
|
|
|
|
return nonseekable_open(inode, file);
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL_GPL(simple_attr_open);
|
2005-05-18 12:40:59 +00:00
|
|
|
|
2008-02-08 12:20:28 +00:00
|
|
|
int simple_attr_release(struct inode *inode, struct file *file)
|
2005-05-18 12:40:59 +00:00
|
|
|
{
|
|
|
|
kfree(file->private_data);
|
|
|
|
return 0;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL_GPL(simple_attr_release); /* GPL-only? This? Really? */
|
2005-05-18 12:40:59 +00:00
|
|
|
|
|
|
|
/* read from the buffer that is filled with the get function */
|
|
|
|
ssize_t simple_attr_read(struct file *file, char __user *buf,
|
|
|
|
size_t len, loff_t *ppos)
|
|
|
|
{
|
|
|
|
struct simple_attr *attr;
|
|
|
|
size_t size;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
attr = file->private_data;
|
|
|
|
|
|
|
|
if (!attr->get)
|
|
|
|
return -EACCES;
|
|
|
|
|
2008-02-08 12:20:27 +00:00
|
|
|
ret = mutex_lock_interruptible(&attr->mutex);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2020-03-08 02:38:49 +00:00
|
|
|
if (*ppos && attr->get_buf[0]) {
|
|
|
|
/* continued read */
|
2005-05-18 12:40:59 +00:00
|
|
|
size = strlen(attr->get_buf);
|
2020-03-08 02:38:49 +00:00
|
|
|
} else {
|
|
|
|
/* first read */
|
2008-02-08 12:20:26 +00:00
|
|
|
u64 val;
|
|
|
|
ret = attr->get(attr->data, &val);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
2005-05-18 12:40:59 +00:00
|
|
|
size = scnprintf(attr->get_buf, sizeof(attr->get_buf),
|
2008-02-08 12:20:26 +00:00
|
|
|
attr->fmt, (unsigned long long)val);
|
|
|
|
}
|
2005-05-18 12:40:59 +00:00
|
|
|
|
|
|
|
ret = simple_read_from_buffer(buf, len, ppos, attr->get_buf, size);
|
2008-02-08 12:20:26 +00:00
|
|
|
out:
|
2006-03-23 11:00:36 +00:00
|
|
|
mutex_unlock(&attr->mutex);
|
2005-05-18 12:40:59 +00:00
|
|
|
return ret;
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL_GPL(simple_attr_read);
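The first-read versus continued-read decision can be modelled in userspace roughly as follows. This is a sketch under assumptions, not the kernel implementation: struct attr_model, read_from_buffer and attr_model_read are invented names, the value field stands in for whatever ->get() would return, and locking and user-copy handling are left out. The point it illustrates is that a nonzero position with an empty buffer is treated as a first read, so the buffer is always filled before it is copied out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct attr_model {
	char get_buf[24];          /* zero-initialised, like kzalloc() */
	unsigned long long value;  /* stands in for what ->get() returns */
};

/* Copy up to @len bytes from @src[*pos..size) into @dst, advancing *pos. */
static size_t read_from_buffer(char *dst, size_t len, size_t *pos,
			       const char *src, size_t size)
{
	size_t n;

	if (*pos >= size)
		return 0;
	n = size - *pos;
	if (n > len)
		n = len;
	memcpy(dst, src + *pos, n);
	*pos += n;
	return n;
}

/* Model of the read path: refill unless this continues an earlier read. */
static size_t attr_model_read(struct attr_model *a, char *dst, size_t len,
			      size_t *pos)
{
	size_t size;

	if (*pos && a->get_buf[0]) {
		/* continued read: reuse the text formatted earlier */
		size = strlen(a->get_buf);
	} else {
		/* first read, or nonzero position with an empty buffer */
		int n = snprintf(a->get_buf, sizeof(a->get_buf), "%llu\n",
				 a->value);

		assert(n > 0 && (size_t)n < sizeof(a->get_buf));
		size = (size_t)n;
	}
	return read_from_buffer(dst, len, pos, a->get_buf, size);
}
```

With the buffer still empty and the position already at 1, the model formats the value first and returns the tail of the formatted text rather than untouched buffer contents.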
|
2005-05-18 12:40:59 +00:00
|
|
|
|
|
|
|
/* interpret the buffer as a number to call the set function with */
|
2022-09-19 17:24:16 +00:00
|
|
|
static ssize_t simple_attr_write_xsigned(struct file *file, const char __user *buf,
|
|
|
|
size_t len, loff_t *ppos, bool is_signed)
|
2005-05-18 12:40:59 +00:00
|
|
|
{
|
|
|
|
struct simple_attr *attr;
|
2020-11-22 06:17:19 +00:00
|
|
|
unsigned long long val;
|
2005-05-18 12:40:59 +00:00
|
|
|
size_t size;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
attr = file->private_data;
|
|
|
|
if (!attr->set)
|
|
|
|
return -EACCES;
|
|
|
|
|
2008-02-08 12:20:27 +00:00
|
|
|
ret = mutex_lock_interruptible(&attr->mutex);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2005-05-18 12:40:59 +00:00
|
|
|
ret = -EFAULT;
|
|
|
|
size = min(sizeof(attr->set_buf) - 1, len);
|
|
|
|
if (copy_from_user(attr->set_buf, buf, size))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
attr->set_buf[size] = '\0';
|
2022-09-19 17:24:16 +00:00
|
|
|
if (is_signed)
|
|
|
|
ret = kstrtoll(attr->set_buf, 0, &val);
|
|
|
|
else
|
|
|
|
ret = kstrtoull(attr->set_buf, 0, &val);
|
2020-11-22 06:17:19 +00:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
2009-09-18 20:06:03 +00:00
|
|
|
ret = attr->set(attr->data, val);
|
|
|
|
if (ret == 0)
|
|
|
|
ret = len; /* on success, claim we got the whole input */
|
2005-05-18 12:40:59 +00:00
|
|
|
out:
|
2006-03-23 11:00:36 +00:00
|
|
|
mutex_unlock(&attr->mutex);
|
2005-05-18 12:40:59 +00:00
|
|
|
return ret;
|
|
|
|
}
|
2022-09-19 17:24:16 +00:00
|
|
|
|
|
|
|
ssize_t simple_attr_write(struct file *file, const char __user *buf,
|
|
|
|
size_t len, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return simple_attr_write_xsigned(file, buf, len, ppos, false);
|
|
|
|
}
|
2013-09-16 01:20:49 +00:00
|
|
|
EXPORT_SYMBOL_GPL(simple_attr_write);
|
2005-05-18 12:40:59 +00:00
|
|
|
|
2022-09-19 17:24:16 +00:00
|
|
|
ssize_t simple_attr_write_signed(struct file *file, const char __user *buf,
|
|
|
|
size_t len, loff_t *ppos)
|
|
|
|
{
|
|
|
|
return simple_attr_write_xsigned(file, buf, len, ppos, true);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(simple_attr_write_signed);
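A userspace analogue of the parse step may help show what the is_signed flag changes. This is only a sketch: parse_attr_value is a hypothetical name, and strtoll/strtoull are used as stand-ins for kstrtoll/kstrtoull, with extra checks added to approximate the stricter kernel behaviour (no leading whitespace, no trailing garbage other than one newline, no negative input on the unsigned path).

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of the parse step: base 0 means a
 * "0x" prefix selects hex and a leading "0" selects octal. Returns 0
 * on success or a negative errno value.
 */
static int parse_attr_value(const char *s, bool is_signed,
			    unsigned long long *val)
{
	char *end;

	assert(s && val);

	/* strto*() skip leading whitespace; this model rejects it. */
	if (isspace((unsigned char)*s))
		return -EINVAL;
	/* strtoull() silently accepts "-1"; reject it for unsigned input. */
	if (!is_signed && *s == '-')
		return -EINVAL;

	errno = 0;
	if (is_signed)
		*val = (unsigned long long)strtoll(s, &end, 0);
	else
		*val = strtoull(s, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;
	/* Allow one trailing newline, as written by `echo 5 > file`. */
	if (*end == '\n')
		end++;
	return *end == '\0' ? 0 : -EINVAL;
}
```

On the signed path a negative input is stored in the same unsigned long long, so the ->set() callback is expected to cast it back to a signed type.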
|
|
|
|
|
2023-10-26 20:45:40 +00:00
|
|
|
/**
|
|
|
|
* generic_encode_ino32_fh - generic export_operations->encode_fh function
|
|
|
|
* @inode: the object to encode
|
|
|
|
* @fh: where to store the file handle fragment
|
|
|
|
* @max_len: maximum length to store there (in 4 byte units)
|
|
|
|
* @parent: parent directory inode, if wanted
|
|
|
|
*
|
|
|
|
* This generic encode_fh function assumes that the 32-bit inode number
|
|
|
|
* is suitable for locating an inode, and that the generation number
|
|
|
|
* can be used to check that it is still valid. It places them in the
|
|
|
|
* filehandle fragment where export_decode_fh expects to find them.
|
|
|
|
*/
|
|
|
|
int generic_encode_ino32_fh(struct inode *inode, __u32 *fh, int *max_len,
|
|
|
|
struct inode *parent)
|
|
|
|
{
|
|
|
|
struct fid *fid = (void *)fh;
|
|
|
|
int len = *max_len;
|
|
|
|
int type = FILEID_INO32_GEN;
|
|
|
|
|
|
|
|
if (parent && (len < 4)) {
|
|
|
|
*max_len = 4;
|
|
|
|
return FILEID_INVALID;
|
|
|
|
} else if (len < 2) {
|
|
|
|
*max_len = 2;
|
|
|
|
return FILEID_INVALID;
|
|
|
|
}
|
|
|
|
|
|
|
|
len = 2;
|
|
|
|
fid->i32.ino = inode->i_ino;
|
|
|
|
fid->i32.gen = inode->i_generation;
|
|
|
|
if (parent) {
|
|
|
|
fid->i32.parent_ino = parent->i_ino;
|
|
|
|
fid->i32.parent_gen = parent->i_generation;
|
|
|
|
len = 4;
|
|
|
|
type = FILEID_INO32_GEN_PARENT;
|
|
|
|
}
|
|
|
|
*max_len = len;
|
|
|
|
return type;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(generic_encode_ino32_fh);
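The length negotiation in the encoder can be sketched in userspace as below. Everything here is a model under assumptions: struct fid_model, struct inode_model, encode_fh_model and the FID_* constants are names invented for the sketch, and lengths are counted in 4 byte units as in the kernel-doc above. When the caller's buffer is too small, the model writes the required length back through max_len so the caller can retry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model-local stand-ins for the FILEID_* constants. */
enum { FID_INO32_GEN = 1, FID_INO32_GEN_PARENT = 2, FID_INVALID = 0xff };

struct fid_model {
	uint32_t ino, gen, parent_ino, parent_gen;
};

struct inode_model {
	uint32_t ino, gen;
};

/* @max_len is in 4 byte units: 2 without a parent, 4 with one. */
static int encode_fh_model(const struct inode_model *inode,
			   struct fid_model *fid, int *max_len,
			   const struct inode_model *parent)
{
	int need = parent ? 4 : 2;

	assert(inode && fid && max_len);

	if (*max_len < need) {
		*max_len = need; /* tell the caller how much room is needed */
		return FID_INVALID;
	}

	fid->ino = inode->ino;
	fid->gen = inode->gen;
	if (parent) {
		fid->parent_ino = parent->ino;
		fid->parent_gen = parent->gen;
	}
	*max_len = need;
	return parent ? FID_INO32_GEN_PARENT : FID_INO32_GEN;
}
```

The returned type tells the decoder which fields are present, which is why the parent variant uses a distinct type rather than relying on the length alone.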
|
|
|
|
|
2007-10-21 23:42:05 +00:00
|
|
|
/**
|
|
|
|
* generic_fh_to_dentry - generic helper for the fh_to_dentry export operation
|
|
|
|
* @sb: filesystem to do the file handle conversion on
|
|
|
|
* @fid: file handle to convert
|
|
|
|
* @fh_len: length of the file handle (in 4 byte units)
|
|
|
|
* @fh_type: type of file handle
|
|
|
|
* @get_inode: filesystem callback to retrieve inode
|
|
|
|
*
|
|
|
|
* This function decodes @fid as long as it has one of the well-known
|
|
|
|
* Linux filehandle types and calls @get_inode on it to retrieve the
|
|
|
|
* inode for the object specified in the file handle.
|
|
|
|
*/
|
|
|
|
struct dentry *generic_fh_to_dentry(struct super_block *sb, struct fid *fid,
|
|
|
|
int fh_len, int fh_type, struct inode *(*get_inode)
|
|
|
|
(struct super_block *sb, u64 ino, u32 gen))
|
|
|
|
{
|
|
|
|
struct inode *inode = NULL;
|
|
|
|
|
|
|
|
if (fh_len < 2)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
switch (fh_type) {
|
|
|
|
case FILEID_INO32_GEN:
|
|
|
|
case FILEID_INO32_GEN_PARENT:
|
|
|
|
inode = get_inode(sb, fid->i32.ino, fid->i32.gen);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2008-08-11 13:48:57 +00:00
|
|
|
return d_obtain_alias(inode);
|
2007-10-21 23:42:05 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(generic_fh_to_dentry);
|
|
|
|
|
|
|
|
/**
|
2012-09-05 08:31:29 +00:00
|
|
|
* generic_fh_to_parent - generic helper for the fh_to_parent export operation
|
2007-10-21 23:42:05 +00:00
|
|
|
* @sb: filesystem to do the file handle conversion on
|
|
|
|
* @fid: file handle to convert
|
|
|
|
* @fh_len: length of the file handle (in 4 byte units)
|
|
|
|
* @fh_type: type of file handle
|
|
|
|
* @get_inode: filesystem callback to retrieve inode
|
|
|
|
*
|
|
|
|
* This function decodes @fid as long as it has one of the well-known
|
|
|
|
* Linux filehandle types and calls @get_inode on it to retrieve the
|
|
|
|
* inode for the _parent_ object specified in the file handle if it
|
|
|
|
* is specified in the file handle, or NULL otherwise.
|
|
|
|
*/
|
|
|
|
struct dentry *generic_fh_to_parent(struct super_block *sb, struct fid *fid,
|
|
|
|
int fh_len, int fh_type, struct inode *(*get_inode)
|
|
|
|
(struct super_block *sb, u64 ino, u32 gen))
|
|
|
|
{
|
|
|
|
struct inode *inode = NULL;
|
|
|
|
|
|
|
|
if (fh_len <= 2)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
switch (fh_type) {
|
|
|
|
case FILEID_INO32_GEN_PARENT:
|
|
|
|
inode = get_inode(sb, fid->i32.parent_ino,
|
|
|
|
(fh_len > 3 ? fid->i32.parent_gen : 0));
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2008-08-11 13:48:57 +00:00
|
|
|
return d_obtain_alias(inode);
|
2007-10-21 23:42:05 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(generic_fh_to_parent);
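The parent lookup rules can be illustrated with a userspace sketch that treats the handle as an array of 32-bit words laid out as ino, gen, parent_ino, parent_gen. The function name decode_parent_model and the FID_* constants are assumptions made for the sketch; the real helper calls the filesystem's get_inode callback instead of returning the numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Model-local stand-ins for the FILEID_* constants. */
enum { FID_INO32_GEN = 1, FID_INO32_GEN_PARENT = 2 };

/*
 * Extract the parent's inode number and generation from @words.
 * @fh_len counts 32-bit words. Returns 1 if a parent was found,
 * 0 if the handle does not carry one.
 */
static int decode_parent_model(const uint32_t *words, int fh_len,
			       int fh_type, uint32_t *ino, uint32_t *gen)
{
	assert(words && ino && gen);

	/* Two words only cover the object itself. */
	if (fh_len <= 2)
		return 0;
	if (fh_type != FID_INO32_GEN_PARENT)
		return 0;

	*ino = words[2];
	/* A three-word handle has no parent generation; fall back to 0. */
	*gen = fh_len > 3 ? words[3] : 0;
	return 1;
}
```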
|
|
|
|
|
2010-05-26 15:53:41 +00:00
|
|
|
/**
|
2014-06-04 23:06:27 +00:00
|
|
|
* __generic_file_fsync - generic fsync implementation for simple filesystems
|
|
|
|
*
|
2010-05-26 15:53:41 +00:00
|
|
|
* @file: file to synchronize
|
2014-06-04 23:06:27 +00:00
|
|
|
* @start: start offset in bytes
|
|
|
|
* @end: end offset in bytes (inclusive)
|
2010-05-26 15:53:41 +00:00
|
|
|
* @datasync: only synchronize essential metadata if true
|
|
|
|
*
|
|
|
|
* This is a generic implementation of the fsync method for simple
|
|
|
|
* filesystems which track all non-inode metadata in the buffers list
|
|
|
|
* hanging off the address_space structure.
|
|
|
|
*/
|
2014-06-04 23:06:27 +00:00
|
|
|
int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
|
|
|
|
int datasync)
|
2009-06-07 18:56:44 +00:00
|
|
|
{
|
2010-05-26 15:53:25 +00:00
|
|
|
struct inode *inode = file->f_mapping->host;
|
2009-06-07 18:56:44 +00:00
|
|
|
int err;
|
|
|
|
int ret;
|
|
|
|
|
2017-07-06 11:02:29 +00:00
|
|
|
err = file_write_and_wait_range(file, start, end);
|
2011-07-17 00:44:56 +00:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2016-01-22 20:40:57 +00:00
|
|
|
inode_lock(inode);
|
2009-06-07 18:56:44 +00:00
|
|
|
ret = sync_mapping_buffers(inode->i_mapping);
|
2015-02-02 05:37:00 +00:00
|
|
|
if (!(inode->i_state & I_DIRTY_ALL))
|
2011-07-17 00:44:56 +00:00
|
|
|
goto out;
|
2009-06-07 18:56:44 +00:00
|
|
|
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
|
2011-07-17 00:44:56 +00:00
|
|
|
goto out;
|
2009-06-07 18:56:44 +00:00
|
|
|
|
2010-10-06 08:48:20 +00:00
|
|
|
err = sync_inode_metadata(inode, 1);
|
2009-06-07 18:56:44 +00:00
|
|
|
if (ret == 0)
|
|
|
|
ret = err;
|
2014-06-04 23:06:27 +00:00
|
|
|
|
2011-07-17 00:44:56 +00:00
|
|
|
out:
|
2016-01-22 20:40:57 +00:00
|
|
|
inode_unlock(inode);
|
2017-07-06 11:02:29 +00:00
|
|
|
/* check and advance again to catch errors after syncing out buffers */
|
|
|
|
err = file_check_and_advance_wb_err(file);
|
|
|
|
if (ret == 0)
|
|
|
|
ret = err;
|
|
|
|
return ret;
|
2009-06-07 18:56:44 +00:00
|
|
|
}
|
2014-06-04 23:06:27 +00:00
|
|
|
EXPORT_SYMBOL(__generic_file_fsync);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* generic_file_fsync - generic fsync implementation for simple filesystems
|
|
|
|
* with flush
|
|
|
|
* @file: file to synchronize
|
|
|
|
* @start: start offset in bytes
|
|
|
|
* @end: end offset in bytes (inclusive)
|
|
|
|
* @datasync: only synchronize essential metadata if true
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
|
|
|
int generic_file_fsync(struct file *file, loff_t start, loff_t end,
|
|
|
|
int datasync)
|
|
|
|
{
|
|
|
|
struct inode *inode = file->f_mapping->host;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = __generic_file_fsync(file, start, end, datasync);
|
|
|
|
if (err)
|
|
|
|
return err;
|
2021-01-26 14:52:35 +00:00
|
|
|
return blkdev_issue_flush(inode->i_sb->s_bdev);
|
2014-06-04 23:06:27 +00:00
|
|
|
}
|
2010-05-26 15:53:41 +00:00
|
|
|
EXPORT_SYMBOL(generic_file_fsync);
|
|
|
|
|
2010-07-22 22:03:41 +00:00
|
|
|
/**
|
|
|
|
* generic_check_addressable - Check addressability of file system
|
|
|
|
* @blocksize_bits: log of file system block size
|
|
|
|
* @num_blocks: number of blocks in file system
|
|
|
|
*
|
|
|
|
* Determine whether a file system with @num_blocks blocks (and a
|
|
|
|
* block size of 2**@blocksize_bits) is addressable by the sector_t
|
|
|
|
* and page cache of the system. Return 0 if so, -EFBIG if not, and
* -EINVAL if @blocksize_bits is below 9 or above PAGE_SHIFT.
|
|
|
|
*/
|
|
|
|
int generic_check_addressable(unsigned blocksize_bits, u64 num_blocks)
|
|
|
|
{
|
|
|
|
u64 last_fs_block = num_blocks - 1;
|
2010-08-16 19:10:17 +00:00
|
|
|
u64 last_fs_page =
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
last_fs_block >> (PAGE_SHIFT - blocksize_bits);
|
2010-07-22 22:03:41 +00:00
|
|
|
|
|
|
|
if (unlikely(num_blocks == 0))
|
|
|
|
return 0;
|
|
|
|
|
2016-04-01 12:29:47 +00:00
|
|
|
if ((blocksize_bits < 9) || (blocksize_bits > PAGE_SHIFT))
|
2010-07-22 22:03:41 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2010-08-16 19:10:17 +00:00
|
|
|
if ((last_fs_block > (sector_t)(~0ULL) >> (blocksize_bits - 9)) ||
|
|
|
|
(last_fs_page > (pgoff_t)(~0ULL))) {
|
2010-07-22 22:03:41 +00:00
|
|
|
return -EFBIG;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(generic_check_addressable);
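The arithmetic of the addressability check can be tried out in userspace with the type widths made explicit. This is a sketch, not the kernel function: check_addressable_model is a hypothetical name, and page_shift, sector_max and pgoff_max are parameters standing in for PAGE_SHIFT and the largest values of sector_t and pgoff_t. Unlike the code above, the model validates blocksize_bits before computing the last page index.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Return 0 if @num_blocks blocks of size 2**@blocksize_bits fit within
 * @sector_max 512-byte sectors and @pgoff_max page indices, -EFBIG if
 * they do not, and -EINVAL for an unsupported block size.
 */
static int check_addressable_model(unsigned blocksize_bits,
				   uint64_t num_blocks, unsigned page_shift,
				   uint64_t sector_max, uint64_t pgoff_max)
{
	uint64_t last_block, last_page;

	assert(page_shift >= 9 && page_shift < 64);

	if (num_blocks == 0)
		return 0;
	if (blocksize_bits < 9 || blocksize_bits > page_shift)
		return -EINVAL;

	last_block = num_blocks - 1;
	/* Several blocks share one page when the block size is smaller. */
	last_page = last_block >> (page_shift - blocksize_bits);

	/* One block spans 2**(blocksize_bits - 9) sectors. */
	if (last_block > (sector_max >> (blocksize_bits - 9)) ||
	    last_page > pgoff_max)
		return -EFBIG;
	return 0;
}
```

For example, with 1 KiB blocks, 4 KiB pages and a 32-bit page index, 2^34 blocks still fit (last page index 2^32 - 1), while one more block does not.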
|
|
|
|
|
2010-05-26 15:53:41 +00:00
|
|
|
/*
|
|
|
|
* No-op implementation of ->fsync for in-memory filesystems.
|
|
|
|
*/
|
2011-07-17 00:44:56 +00:00
|
|
|
int noop_fsync(struct file *file, loff_t start, loff_t end, int datasync)
|
2010-05-26 15:53:41 +00:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(noop_fsync);
|
2013-09-16 14:30:04 +00:00
|
|
|
|
2018-03-07 23:26:44 +00:00
|
|
|
ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* iomap based filesystems support direct I/O without need for
|
|
|
|
* this callback. However, it still needs to be set in
|
|
|
|
* inode->a_ops so that open/fcntl know that direct I/O is
|
|
|
|
* generally supported.
|
|
|
|
*/
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(noop_direct_IO);
|
|
|
|
|
2015-12-29 20:58:39 +00:00
|
|
|
/* Because kfree isn't assignment-compatible with void(void*) ;-/ */
|
|
|
|
void kfree_link(void *p)
|
2013-09-16 14:30:04 +00:00
|
|
|
{
|
2015-12-29 20:58:39 +00:00
|
|
|
kfree(p);
|
2013-09-16 14:30:04 +00:00
|
|
|
}
|
2015-12-29 20:58:39 +00:00
|
|
|
EXPORT_SYMBOL(kfree_link);
|
2013-10-03 02:35:11 +00:00
|
|
|
|
|
|
|
struct inode *alloc_anon_inode(struct super_block *s)
|
|
|
|
{
|
|
|
|
static const struct address_space_operations anon_aops = {
|
2022-02-09 20:22:13 +00:00
|
|
|
.dirty_folio = noop_dirty_folio,
|
2013-10-03 02:35:11 +00:00
|
|
|
};
|
|
|
|
struct inode *inode = new_inode_pseudo(s);
|
|
|
|
|
|
|
|
if (!inode)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
|
|
|
inode->i_ino = get_next_ino();
|
|
|
|
inode->i_mapping->a_ops = &anon_aops;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Mark the inode dirty from the very beginning,
|
|
|
|
* that way it will never be moved to the dirty
|
|
|
|
* list because mark_inode_dirty() will think
|
|
|
|
* that it already _is_ on the dirty list.
|
|
|
|
*/
|
|
|
|
inode->i_state = I_DIRTY;
|
|
|
|
inode->i_mode = S_IRUSR | S_IWUSR;
|
|
|
|
inode->i_uid = current_fsuid();
|
|
|
|
inode->i_gid = current_fsgid();
|
|
|
|
inode->i_flags |= S_PRIVATE;
|
2023-10-04 18:52:37 +00:00
|
|
|
simple_inode_init_ts(inode);
|
2013-10-03 02:35:11 +00:00
|
|
|
return inode;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(alloc_anon_inode);
|
2014-08-27 10:49:41 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* simple_nosetlease - generic helper for prohibiting leases
|
|
|
|
* @filp: file pointer
|
|
|
|
* @arg: type of lease to obtain
|
|
|
|
* @flp: new lease supplied for insertion
|
2014-08-22 14:40:25 +00:00
|
|
|
* @priv: private data for lm_setup operation
|
2014-08-27 10:49:41 +00:00
|
|
|
*
|
|
|
|
* Generic helper for filesystems that do not wish to allow leases to be set.
|
|
|
|
* All arguments are ignored and it just returns -EINVAL.
|
|
|
|
*/
|
|
|
|
int
|
2024-01-31 23:02:28 +00:00
|
|
|
simple_nosetlease(struct file *filp, int arg, struct file_lease **flp,
|
2014-08-22 14:40:25 +00:00
|
|
|
void **priv)
|
2014-08-27 10:49:41 +00:00
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(simple_nosetlease);
|
2015-05-02 13:54:06 +00:00
|
|
|
|
2019-04-11 23:16:30 +00:00
|
|
|
/**
|
|
|
|
* simple_get_link - generic helper to get the target of "fast" symlinks
|
|
|
|
* @dentry: not used here
|
|
|
|
* @inode: the symlink inode
|
|
|
|
* @done: not used here
|
|
|
|
*
|
|
|
|
* Generic helper for filesystems to use for symlink inodes where a pointer to
|
|
|
|
* the symlink target is stored in ->i_link. NOTE: this isn't normally called,
|
|
|
|
* since as an optimization the path lookup code uses any non-NULL ->i_link
|
|
|
|
* directly, without calling ->get_link(). But ->get_link() still must be set,
|
|
|
|
* to mark the inode_operations as being for a symlink.
|
|
|
|
*
|
|
|
|
* Return: the symlink target
|
|
|
|
*/
|
2015-11-17 15:20:54 +00:00
|
|
|
const char *simple_get_link(struct dentry *dentry, struct inode *inode,
|
2015-12-29 20:58:39 +00:00
|
|
|
struct delayed_call *done)
|
2015-05-02 13:54:06 +00:00
|
|
|
{
|
2015-11-17 15:20:54 +00:00
|
|
|
return inode->i_link;
|
2015-05-02 13:54:06 +00:00
|
|
|
}
|
2015-11-17 15:20:54 +00:00
|
|
|
EXPORT_SYMBOL(simple_get_link);
|
2015-05-02 13:54:06 +00:00
|
|
|
|
|
|
|
const struct inode_operations simple_symlink_inode_operations = {
|
2015-11-17 15:20:54 +00:00
|
|
|
.get_link = simple_get_link,
|
2015-05-02 13:54:06 +00:00
|
|
|
};
|
|
|
|
EXPORT_SYMBOL(simple_symlink_inode_operations);
|
2015-05-09 20:54:49 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Operations for a permanently empty directory.
|
|
|
|
*/
|
|
|
|
static struct dentry *empty_dir_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
|
|
|
|
{
|
|
|
|
return ERR_PTR(-ENOENT);
|
|
|
|
}
|
|
|
|
|
2023-01-13 11:49:12 +00:00
|
|
|
static int empty_dir_getattr(struct mnt_idmap *idmap,
|
2021-01-21 13:19:43 +00:00
|
|
|
const struct path *path, struct kstat *stat,
|
statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};
	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};
The defined bits in request_mask and stx_mask are:
	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided, and __spare*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
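As a small worked sketch of that sign convention: when the seconds and
nanoseconds fields share a sign, a plain sum yields the signed offset from
the epoch. The struct ts_model type below is a local stand-in that mirrors
the layout quoted above and is not a kernel type; the installed uapi header
should be checked before relying on the sign of tv_nsec.

```c
#include <stdint.h>

/* Local stand-in mirroring the statx_timestamp layout quoted above. */
struct ts_model {
	int64_t tv_sec;
	int32_t tv_nsec;
	int32_t reserved;
};

/* Collapse a timestamp into signed nanoseconds relative to the epoch. */
static int64_t ts_to_ns(const struct ts_model *ts)
{
	return ts->tv_sec * INT64_C(1000000000) + ts->tv_nsec;
}
```

With this convention, half a second before 1970-01-01 00:00:01 UTC and
half a second after it differ only in the sign of both fields.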
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
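A tool that wants to act on these bits could test them as in the following
sketch. The helper names are hypothetical; the STATX_ATTR_* constants are
assumed to come from <linux/stat.h>.

```c
#include <linux/stat.h>	/* STATX_ATTR_* */
#include <stdint.h>

/* Nonzero if walking into the object would trigger an automount. */
static int is_automount_trigger(uint64_t attributes)
{
	return (attributes & STATX_ATTR_AUTOMOUNT) != 0;
}

/* Nonzero if ordinary writes are restricted (immutable or append-only). */
static int is_write_restricted(uint64_t attributes)
{
	return (attributes &
		(STATX_ATTR_IMMUTABLE | STATX_ATTR_APPEND)) != 0;
}
```

The argument is the stx_attributes value from a successful statx() call.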
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
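Because a field may be fabricated when its bit is absent from stx_mask, a
caller should test the returned mask before trusting a value. A minimal
sketch, with a hypothetical helper name and the STATX_* constants assumed
to come from <linux/stat.h>:

```c
#include <linux/stat.h>	/* STATX_* */
#include <stdint.h>

/* Returns nonzero only if every bit in @wanted was reported as valid. */
static int fields_valid(uint32_t stx_mask, uint32_t wanted)
{
	return (stx_mask & wanted) == wanted;
}
```

For instance, a result mask of 0x7ff (as in the sample output of the test
program) covers STATX_BASIC_STATS but not STATX_BTIME, so stx_btime should
be ignored in that case.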
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31 16:46:22 +00:00
			     u32 request_mask, unsigned int query_flags)
{
	struct inode *inode = d_inode(path->dentry);
	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
	return 0;
}

static int empty_dir_setattr(struct mnt_idmap *idmap,
			     struct dentry *dentry, struct iattr *attr)
{
	return -EPERM;
}

static ssize_t empty_dir_listxattr(struct dentry *dentry, char *list, size_t size)
{
	return -EOPNOTSUPP;
}

static const struct inode_operations empty_dir_inode_operations = {
	.lookup		= empty_dir_lookup,
	.permission	= generic_permission,
	.setattr	= empty_dir_setattr,
	.getattr	= empty_dir_getattr,
	.listxattr	= empty_dir_listxattr,
};

static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence)
{
	/* An empty directory has two entries . and .. at offsets 0 and 1 */
	return generic_file_llseek_size(file, offset, whence, 2, 2);
}

static int empty_dir_readdir(struct file *file, struct dir_context *ctx)
{
	dir_emit_dots(file, ctx);
	return 0;
}

static const struct file_operations empty_dir_operations = {
	.llseek		= empty_dir_llseek,
	.read		= generic_read_dir,
	.iterate_shared	= empty_dir_readdir,
	.fsync		= noop_fsync,
};


void make_empty_dir_inode(struct inode *inode)
{
	set_nlink(inode, 2);
	inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO;
	inode->i_uid = GLOBAL_ROOT_UID;
	inode->i_gid = GLOBAL_ROOT_GID;
	inode->i_rdev = 0;
	inode->i_size = 0;
	inode->i_blkbits = PAGE_SHIFT;
	inode->i_blocks = 0;

	inode->i_op = &empty_dir_inode_operations;
	inode->i_opflags &= ~IOP_XATTR;
	inode->i_fop = &empty_dir_operations;
}

bool is_empty_dir_inode(struct inode *inode)
{
	return (inode->i_fop == &empty_dir_operations) &&
		(inode->i_op == &empty_dir_inode_operations);
}

#if IS_ENABLED(CONFIG_UNICODE)
/**
 * generic_ci_d_compare - generic d_compare implementation for casefolding filesystems
 * @dentry:	dentry whose name we are checking against
 * @len:	len of name of dentry
 * @str:	str pointer to name of dentry
 * @name:	Name to compare against
 *
 * Return: 0 if names match, 1 if mismatch, or -ERRNO
 */
static int generic_ci_d_compare(const struct dentry *dentry, unsigned int len,
				const char *str, const struct qstr *name)
{
	const struct dentry *parent;
	const struct inode *dir;
	char strbuf[DNAME_INLINE_LEN];
	struct qstr qstr;

	/*
	 * Attempt a case-sensitive match first. It is cheaper and
	 * should cover most lookups, including all the sane
	 * applications that expect a case-sensitive filesystem.
	 *
	 * This comparison is safe under RCU because the caller
	 * guarantees the consistency between str and len. See
	 * __d_lookup_rcu_op_compare() for details.
	 */
	if (len == name->len && !memcmp(str, name->name, len))
		return 0;

	parent = READ_ONCE(dentry->d_parent);
	dir = READ_ONCE(parent->d_inode);
	if (!dir || !IS_CASEFOLDED(dir))
		return 1;

	/*
	 * If the dentry name is stored in-line, then it may be concurrently
	 * modified by a rename. If this happens, the VFS will eventually retry
	 * the lookup, so it doesn't matter what ->d_compare() returns.
	 * However, it's unsafe to call utf8_strncasecmp() with an unstable
	 * string. Therefore, we have to copy the name into a temporary buffer.
	 */
	if (len <= DNAME_INLINE_LEN - 1) {
		memcpy(strbuf, str, len);
		strbuf[len] = 0;
		str = strbuf;
		/* prevent compiler from optimizing out the temporary buffer */
		barrier();
	}
	qstr.len = len;
	qstr.name = str;

	return utf8_strncasecmp(dentry->d_sb->s_encoding, name, &qstr);
}

/**
 * generic_ci_d_hash - generic d_hash implementation for casefolding filesystems
 * @dentry:	dentry of the parent directory
 * @str:	qstr of name whose hash we should fill in
 *
 * Return: 0 if hash was successful or unchanged, and -EINVAL on error
 */
static int generic_ci_d_hash(const struct dentry *dentry, struct qstr *str)
{
	const struct inode *dir = READ_ONCE(dentry->d_inode);
	struct super_block *sb = dentry->d_sb;
	const struct unicode_map *um = sb->s_encoding;
	int ret;

	if (!dir || !IS_CASEFOLDED(dir))
		return 0;

	ret = utf8_casefold_hash(um, dentry, str);
	if (ret < 0 && sb_has_strict_encoding(sb))
		return -EINVAL;
	return 0;
}

static const struct dentry_operations generic_ci_dentry_ops = {
	.d_hash = generic_ci_d_hash,
	.d_compare = generic_ci_d_compare,
#ifdef CONFIG_FS_ENCRYPTION
	.d_revalidate = fscrypt_d_revalidate,
#endif
};

/**
 * generic_ci_match() - Match a name (case-insensitively) with a dirent.
 * This is a filesystem helper for comparison with directory entries.
 * generic_ci_d_compare should be used in VFS' ->d_compare instead.
 *
 * @parent: Inode of the parent of the dirent under comparison
 * @name: name under lookup.
 * @folded_name: Optional pre-folded name under lookup
 * @de_name: Dirent name.
 * @de_name_len: dirent name length.
 *
 * Test whether a case-insensitive directory entry matches the filename
 * being searched. If @folded_name is provided, it is used instead of
 * recalculating the casefold of @name.
 *
 * Return: > 0 if the directory entry matches, 0 if it doesn't match, or
 * < 0 on error.
 */
int generic_ci_match(const struct inode *parent,
		     const struct qstr *name,
		     const struct qstr *folded_name,
		     const u8 *de_name, u32 de_name_len)
{
	const struct super_block *sb = parent->i_sb;
	const struct unicode_map *um = sb->s_encoding;
	struct fscrypt_str decrypted_name = FSTR_INIT(NULL, de_name_len);
	struct qstr dirent = QSTR_INIT(de_name, de_name_len);
	int res = 0;

	if (IS_ENCRYPTED(parent)) {
		const struct fscrypt_str encrypted_name =
			FSTR_INIT((u8 *) de_name, de_name_len);

		if (WARN_ON_ONCE(!fscrypt_has_encryption_key(parent)))
			return -EINVAL;

		decrypted_name.name = kmalloc(de_name_len, GFP_KERNEL);
		if (!decrypted_name.name)
			return -ENOMEM;
		res = fscrypt_fname_disk_to_usr(parent, 0, 0, &encrypted_name,
						&decrypted_name);
		if (res < 0) {
			kfree(decrypted_name.name);
			return res;
		}
		dirent.name = decrypted_name.name;
		dirent.len = decrypted_name.len;
	}

	/*
	 * Attempt a case-sensitive match first. It is cheaper and
	 * should cover most lookups, including all the sane
	 * applications that expect a case-sensitive filesystem.
	 */

	if (dirent.len == name->len &&
	    !memcmp(name->name, dirent.name, dirent.len))
		goto out;

	if (folded_name->name)
		res = utf8_strncasecmp_folded(um, folded_name, &dirent);
	else
		res = utf8_strncasecmp(um, name, &dirent);

out:
	kfree(decrypted_name.name);
	if (res < 0 && sb_has_strict_encoding(sb)) {
		pr_err_ratelimited("Directory contains filename that is invalid UTF-8");
		return 0;
	}
	return !res;
}
EXPORT_SYMBOL(generic_ci_match);
#endif
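
Both generic_ci_d_compare() and generic_ci_match() above try an exact byte
comparison before paying for a case-insensitive one. The userspace sketch
below illustrates only that ordering. It is not kernel code: the function
name is hypothetical, and ASCII tolower() stands in for the UTF-8
casefolding done by utf8_strncasecmp(). It follows the ->d_compare
convention (0 on match, 1 on mismatch), whereas generic_ci_match() returns
a positive value on match.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 0 if the names match, 1 if they do not. */
static int model_ci_compare(const char *a, size_t alen,
			    const char *b, size_t blen,
			    int casefolded_dir)
{
	size_t i;

	/* Fast path: an exact byte match needs no folding at all. */
	if (alen == blen && !memcmp(a, b, alen))
		return 0;

	/* Outside a casefolded directory, only exact matches count. */
	if (!casefolded_dir)
		return 1;

	/* ASCII-only simplification: folding never changes the length. */
	if (alen != blen)
		return 1;
	for (i = 0; i < alen; i++) {
		if (tolower((unsigned char)a[i]) !=
		    tolower((unsigned char)b[i]))
			return 1;
	}
	return 0;
}
```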

#ifdef CONFIG_FS_ENCRYPTION
static const struct dentry_operations generic_encrypted_dentry_ops = {
	.d_revalidate = fscrypt_d_revalidate,
};
#endif

/**
 * generic_set_sb_d_ops - helper for choosing the set of
 * filesystem-wide dentry operations for the enabled features
 * @sb: superblock to be configured
 *
 * Filesystems supporting casefolding and/or fscrypt can call this
 * helper at mount-time to configure sb->s_d_op to the best set of dentry
 * operations required for the enabled features. The helper must be
 * called after these have been configured, but before the root dentry
 * is created.
 */
void generic_set_sb_d_ops(struct super_block *sb)
{
#if IS_ENABLED(CONFIG_UNICODE)
	if (sb->s_encoding) {
		sb->s_d_op = &generic_ci_dentry_ops;
		return;
	}
#endif
#ifdef CONFIG_FS_ENCRYPTION
	if (sb->s_cop) {
		sb->s_d_op = &generic_encrypted_dentry_ops;
		return;
	}
#endif
}
EXPORT_SYMBOL(generic_set_sb_d_ops);

/**
 * inode_maybe_inc_iversion - increments i_version
 * @inode: inode with the i_version that should be updated
 * @force: increment the counter even if it's not necessary?
 *
 * Every time the inode is modified, the i_version field must be seen to have
 * changed by any observer.
 *
 * If "force" is set or the QUERIED flag is set, then ensure that we increment
 * the value, and clear the queried flag.
 *
 * In the common case where neither is set, then we can return "false" without
 * updating i_version.
 *
 * If this function returns false, and no other metadata has changed, then we
 * can avoid logging the metadata.
 */
bool inode_maybe_inc_iversion(struct inode *inode, bool force)
{
	u64 cur, new;

	/*
	 * The i_version field is not strictly ordered with any other inode
	 * information, but the legacy inode_inc_iversion code used a spinlock
	 * to serialize increments.
	 *
	 * Here, we add full memory barriers to ensure that any de-facto
	 * ordering with other info is preserved.
	 *
	 * This barrier pairs with the barrier in inode_query_iversion()
	 */
	smp_mb();
	cur = inode_peek_iversion_raw(inode);
	do {
		/* If flag is clear then we needn't do anything */
		if (!force && !(cur & I_VERSION_QUERIED))
			return false;

		/* Since lowest bit is flag, add 2 to avoid it */
		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
	return true;
}
EXPORT_SYMBOL(inode_maybe_inc_iversion);

/**
 * inode_query_iversion - read i_version for later use
 * @inode: inode from which i_version should be read
 *
 * Read the inode i_version counter. This should be used by callers that wish
 * to store the returned i_version for later comparison. This will guarantee
 * that a later query of the i_version will result in a different value if
 * anything has changed.
 *
 * In this implementation, we fetch the current value, set the QUERIED flag and
 * then try to swap it into place with a cmpxchg, if it wasn't already set. If
 * that fails, we try again with the newly fetched value from the cmpxchg.
 */
u64 inode_query_iversion(struct inode *inode)
{
	u64 cur, new;

	cur = inode_peek_iversion_raw(inode);
	do {
		/* If flag is already set, then no need to swap */
		if (cur & I_VERSION_QUERIED) {
			/*
			 * This barrier (and the implicit barrier in the
			 * cmpxchg below) pairs with the barrier in
			 * inode_maybe_inc_iversion().
			 */
			smp_mb();
			break;
		}

		new = cur | I_VERSION_QUERIED;
	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
	return cur >> I_VERSION_QUERIED_SHIFT;
}
EXPORT_SYMBOL(inode_query_iversion);
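
The interplay between the two helpers above (the low bit as a "queried"
flag, the counter in the remaining bits) can be modelled in userspace with
C11 atomics. This is only an illustrative sketch: the model_* names and the
QUERIED_* constants are hypothetical stand-ins for the kernel's
I_VERSION_* definitions, and the default sequentially consistent atomics
replace the explicit smp_mb() calls.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUERIED_FLAG	1ULL	/* stand-in for I_VERSION_QUERIED */
#define QUERIED_SHIFT	1	/* stand-in for I_VERSION_QUERIED_SHIFT */
#define INCREMENT	(1ULL << QUERIED_SHIFT)

struct model_inode {
	_Atomic uint64_t version;
};

/* Bump the counter only if forced or if someone has looked at it. */
static bool model_maybe_inc(struct model_inode *in, bool force)
{
	uint64_t cur = atomic_load(&in->version);
	uint64_t next;

	do {
		if (!force && !(cur & QUERIED_FLAG))
			return false;
		/* Clear the flag, then step past it. */
		next = (cur & ~QUERIED_FLAG) + INCREMENT;
	} while (!atomic_compare_exchange_weak(&in->version, &cur, next));
	return true;
}

/* Read the counter and mark it as observed. */
static uint64_t model_query(struct model_inode *in)
{
	uint64_t cur = atomic_load(&in->version);
	uint64_t next;

	do {
		if (cur & QUERIED_FLAG)
			break;
		next = cur | QUERIED_FLAG;
	} while (!atomic_compare_exchange_weak(&in->version, &cur, next));
	return cur >> QUERIED_SHIFT;
}
```

In this model an unforced increment is skipped until a query has set the
flag, which is what lets a filesystem avoid logging metadata that nobody
has observed.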

ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
		ssize_t direct_written, ssize_t buffered_written)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos - buffered_written;
	loff_t end = iocb->ki_pos - 1;
	int err;

	/*
	 * If the buffered write fallback returned an error, we want to return
	 * the number of bytes which were written by direct I/O, or the error
	 * code if that was zero.
	 *
	 * Note that this differs from normal direct-io semantics, which will
	 * return -EFOO even if some bytes were written.
	 */
	if (unlikely(buffered_written < 0)) {
		if (direct_written)
			return direct_written;
		return buffered_written;
	}

	/*
	 * We need to ensure that the page cache pages are written to disk and
	 * invalidated to preserve the expected O_DIRECT semantics.
	 */
	err = filemap_write_and_wait_range(mapping, pos, end);
	if (err < 0) {
		/*
		 * We don't know how much we wrote, so just return the number of
		 * bytes which were direct-written
		 */
		iocb->ki_pos -= buffered_written;
		if (direct_written)
			return direct_written;
		return err;
	}
	invalidate_mapping_pages(mapping, pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
	return direct_written + buffered_written;
}
EXPORT_SYMBOL_GPL(direct_write_fallback);
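
The return-value rules of direct_write_fallback() can be summarised by a
pure function. The sketch below is a userspace model with a hypothetical
name; flush_err stands in for the result of
filemap_write_and_wait_range(), and the page-cache and ki_pos side effects
are deliberately left out.

```c
/*
 * Model of the value returned to the caller:
 *  - buffered fallback failed: report direct bytes, else that error;
 *  - flush of the buffered range failed: report direct bytes, else
 *    the flush error;
 *  - otherwise: report the sum of both parts.
 */
static long fallback_result(long direct_written, long buffered_written,
			    int flush_err)
{
	if (buffered_written < 0)
		return direct_written ? direct_written : buffered_written;
	if (flush_err < 0)
		return direct_written ? direct_written : flush_err;
	return direct_written + buffered_written;
}
```

In both failure cases a partial direct write takes precedence over the
error code, matching the comment in the function above.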

/**
 * simple_inode_init_ts - initialize the timestamps for a new inode
 * @inode: inode to be initialized
 *
 * When a new inode is created, most filesystems set the timestamps to the
 * current time. Add a helper to do this.
 */
struct timespec64 simple_inode_init_ts(struct inode *inode)
{
	struct timespec64 ts = inode_set_ctime_current(inode);

	inode_set_atime_to_ts(inode, ts);
	inode_set_mtime_to_ts(inode, ts);
	return ts;
}
EXPORT_SYMBOL(simple_inode_init_ts);

static inline struct dentry *get_stashed_dentry(struct dentry *stashed)
{
	struct dentry *dentry;

	guard(rcu)();
	dentry = READ_ONCE(stashed);
	if (!dentry)
		return NULL;
	if (!lockref_get_not_dead(&dentry->d_lockref))
		return NULL;
	return dentry;
}

static struct dentry *prepare_anon_dentry(struct dentry **stashed,
					  struct super_block *sb,
					  void *data)
{
	struct dentry *dentry;
	struct inode *inode;
	const struct stashed_operations *sops = sb->s_fs_info;
	int ret;

	inode = new_inode_pseudo(sb);
	if (!inode) {
		sops->put_data(data);
		return ERR_PTR(-ENOMEM);
	}

	inode->i_flags |= S_IMMUTABLE;
	inode->i_mode = S_IFREG;
	simple_inode_init_ts(inode);

	ret = sops->init_inode(inode, data);
	if (ret < 0) {
		iput(inode);
		return ERR_PTR(ret);
	}

	/* Notice when this is changed. */
	WARN_ON_ONCE(!S_ISREG(inode->i_mode));
	WARN_ON_ONCE(!IS_IMMUTABLE(inode));

	dentry = d_alloc_anon(sb);
	if (!dentry) {
		iput(inode);
		return ERR_PTR(-ENOMEM);
	}

	/* Store address of location where dentry's supposed to be stashed. */
	dentry->d_fsdata = stashed;

	/* @data is now owned by the fs */
	d_instantiate(dentry, inode);
	return dentry;
}

static struct dentry *stash_dentry(struct dentry **stashed,
				   struct dentry *dentry)
{
	guard(rcu)();
	for (;;) {
		struct dentry *old;

		/* Assume any old dentry was cleared out. */
		old = cmpxchg(stashed, NULL, dentry);
		if (likely(!old))
			return dentry;

		/* Check if somebody else installed a reusable dentry. */
		if (lockref_get_not_dead(&old->d_lockref))
			return old;

		/* There's an old dead dentry there, try to take it over. */
		if (likely(try_cmpxchg(stashed, &old, dentry)))
			return dentry;
	}
}

/**
 * path_from_stashed - create path from stashed or new dentry
 * @stashed:    where to retrieve or stash dentry
 * @mnt:        mount of the filesystem to use
 * @data:       data to store in inode->i_private
 * @path:       path to create
 *
 * The function tries to retrieve a stashed dentry from @stashed. If the dentry
 * is still valid then it will be reused. If the dentry can't be reused, the
 * function will allocate a new dentry and inode. It will then check again
 * whether it can reuse an existing dentry in case one has been added in the
 * meantime or update @stashed with the newly added dentry.
 *
 * Special-purpose helper for nsfs and pidfs.
 *
 * Return: On success zero and on failure a negative error is returned.
 */
int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data,
		      struct path *path)
{
	struct dentry *dentry;
	const struct stashed_operations *sops = mnt->mnt_sb->s_fs_info;

	/* See if dentry can be reused. */
	path->dentry = get_stashed_dentry(*stashed);
	if (path->dentry) {
		sops->put_data(data);
		goto out_path;
	}

	/* Allocate a new dentry. */
	dentry = prepare_anon_dentry(stashed, mnt->mnt_sb, data);
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);

	/* Added a new dentry. @data is now owned by the filesystem. */
	path->dentry = stash_dentry(stashed, dentry);
	if (path->dentry != dentry)
		dput(dentry);

out_path:
	WARN_ON_ONCE(path->dentry->d_fsdata != stashed);
	WARN_ON_ONCE(d_inode(path->dentry)->i_private != data);
	path->mnt = mntget(mnt);
	return 0;
}

void stashed_dentry_prune(struct dentry *dentry)
{
	struct dentry **stashed = dentry->d_fsdata;
	struct inode *inode = d_inode(dentry);

	if (WARN_ON_ONCE(!stashed))
		return;

	if (!inode)
		return;

	/*
	 * Only replace our own @dentry as someone else might've
	 * already cleared out @dentry and stashed their own
	 * dentry in there.
	 */
	cmpxchg(stashed, dentry, NULL);
}
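
/*
 * Illustrative userspace sketch, not part of the kernel source: the
 * stash/reuse/prune protocol modelled with C11 atomics. All names here
 * (struct obj, obj_get_not_dead(), stash(), prune()) are hypothetical
 * stand-ins. A negative count plays the role of a dead lockref, and the
 * RCU protection that keeps a dying dentry's memory valid in the kernel is
 * assumed away, so the objects must simply outlive the slot.
 */

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: a count below zero means "dead". */
struct obj {
	atomic_int count;
};

/* Model of lockref_get_not_dead(): take a reference unless dead. */
static bool obj_get_not_dead(struct obj *o)
{
	int c = atomic_load(&o->count);

	while (c >= 0) {
		/* On failure, c is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&o->count, &c, c + 1))
			return true;
	}
	return false;
}

/* Model of stash_dentry(): install @fresh or reuse a live object. */
static struct obj *stash(struct obj *_Atomic *slot, struct obj *fresh)
{
	for (;;) {
		struct obj *old = NULL;

		/* Common case: the slot is empty. */
		if (atomic_compare_exchange_strong(slot, &old, fresh))
			return fresh;

		/* Somebody else installed a live object: reuse it. */
		if (obj_get_not_dead(old))
			return old;

		/* A dead object is parked there: try to take the slot over. */
		if (atomic_compare_exchange_strong(slot, &old, fresh))
			return fresh;
	}
}

/* Model of stashed_dentry_prune(): clear the slot only if it holds @o. */
static void prune(struct obj *_Atomic *slot, struct obj *o)
{
	struct obj *expected = o;

	atomic_compare_exchange_strong(slot, &expected, NULL);
}
```

/*
 * In this model, the compare-and-exchange in prune() is what keeps a late
 * prune of a replaced object from wiping out its successor.
 */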
|