When reading uids and gids off the wire convert them to
kuids and kgids.
When putting kuids and kgids onto the wire first convert
them to uids and gids the other side will understand.
When printing kuids and kgids convert them to values in
the initial user namespace then use normal printf formats.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When reading uids and gids off the wire convert them to
kuids and kgids.
When putting kuids and kgids onto the wire first convert
them to uids and gids the other side will understand.
Add an additional failure mode incoming for uids or gids
that are invalid.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When reading uids and gids off the wire convert them to
kuids and kgids.
When putting kuids and kgids onto the wire first convert
them to uids and gids the other side will understand.
Add an additional failure mode for incoming uid or
gids that are invalid.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Convert nfs_map_name_to_uid to return a kuid_t value.
Convert nfs_map_name_to_gid to return a kgid_t value.
Convert nfs_map_uid_to_name to take a kuid_t paramater.
Convert nfs_map_gid_to_name to take a kgid_t paramater.
Tweak nfs_fattr_map_owner_to_name to use a kuid_t intermediate value.
Tweak nfs_fattr_map_group_to_name to use a kgid_t intermediate value.
Which makes these functions properly handle kuids and kgids, including
erroring of the generated kuid or kgid is invalid.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- Use kuid_t and kgit in struct nfsacl_encode_desc.
- Convert from kuids and kgids when generating on the wire values.
- Convert on the wire values to kuids and kgids when read.
- Modify cmp_acl_entry to be type safe comparison on posix acls.
Only acls with type ACL_USER and ACL_GROUP can appear more
than once and as such need to compare more than their tag.
- The e_id field is being removed from posix acls so don't initialize it.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ncpfs does not natively support uids and gids so this conversion was
simply a matter of updating the the type of the mounteduid, the uid
and the gid on the superblock. Fixing the ioctls that read them,
updating the mount option parser and the mount option printer.
Cc: Petr Vandrovec <petr@vandrovec.name>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
When reading dinodes from the disk convert uids and gids
into kuids and kgids to store in vfs data structures.
When writing to dinodes to the disk convert kuids and kgids
in the in memory structures into plain uids and gids.
For now all on disk data structures are assumed to be
stored in the initial user namespace.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Where kuid_t values are compared use uid_eq and where kgid_t values
are compared use gid_eq. This is unfortunately necessary because
of the type safety that keeps someone from accidentally mixing
kuids and kgids with other types.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Remove the QUOTA_USER and QUOTA_GRUP defines. Remove
the last vestigal users of QUOTA_USER and QUOTA_GROUP.
Now that struct kqid is used throughout the gfs2 quota
code the need there is to use QUOTA_USER and QUOTA_GROUP
and the defines are just extraneous and confusing.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- Change qd_id in struct gfs2_qutoa_data to struct kqid.
- Remove the now unnecessary QDF_USER bit field in qd_flags.
- Propopoage this change through the code generally making
things simpler along the way.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- In quota_refresh_user_store convert the user supplied uid
into a kqid and pass it to gfs2_quota_refresh.
- In quota_refresh_group_store convert the user supplied gid
into a kqid and pass it to gfs2_quota_refresh.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Both qd_alloc and qd2offset perform the exact same computation
to get an index from a gfs2_quota_data. Make life a little
simpler and factor out this index computation.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When a quota is queried return the uid or the gid in the mapped into
the caller's user namespace. In addition perform the munged version
of the mapping so that instead of -1 a value that does not map is
reported as the overflowuid or the overflowgid.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Split NO_QUOTA_CHANGE into NO_UID_QUTOA_CHANGE and NO_GID_QUTOA_CHANGE
so the constants may be well typed.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In set_dqblk it is an error to look at fdq->d_id or fdq->d_flags.
Userspace quota applications do not set these fields when calling
quotactl(Q_XSETQLIM,...), and the kernel does not set those fields
when quota_setquota calls set_dqblk.
gfs2 never looks at fdq->d_id or fdq->d_flags after checking
to see if they match the id and type supplied to set_dqblk.
No other linux filesystem in set_dqblk looks at either fdq->d_id
or fdq->d_flags.
Therefore remove these bogus checks from gfs2 and allow normal
quota setting applications to work.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Convert between uid and gids stored in the on the wire format of dlm
locks aka struct ocfs2_meta_lvb and kuids and kgids stored in
inode->i_uid and inode->i_gid.
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Explicitly deal with the different kinds of acls because they need
different conversions.
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- Change c_uid in struct coda_indoe_info from a vuid_t to a kuid_t.
- Initialize c_uid to GLOBAL_ROOT_UID instead of 0.
- Use uid_eq to compare cached kuids.
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Remove the slight chance that uids and gids in coda messages will be
interpreted in the wrong user namespace.
- Only allow processes in the initial user namespace to open the coda
character device to communicate with coda filesystems.
- Explicitly convert the uids in the coda header into the initial user
namespace.
- In coda_vattr_to_attr make kuids and kgids from the initial user
namespace uids and gids in struct coda_vattr that just came from
userspace.
- In coda_iattr_to_vattr convert kuids and kgids into uids and gids
in the intial user namespace and store them in struct coda_vattr for
sending to coda userspace programs.
Nothing needs to be changed with mounts as coda does not support
being mounted in anything other than the initial user namespace.
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Remove the slight chance that pids in coda messages will be
interpreted in the wrong pid namespace.
- Explicitly send all pids in coda messages in the initial pid
namespace.
- Only allow mounts from processes in the initial pid namespace.
- Only allow processes in the initial pid namespace to open the coda
character device to communicate with coda.
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Modify struct afs_file_status to store owner as a kuid_t and group as
a kgid_t.
In xdr_decode_AFSFetchStatus as owner is now a kuid_t and group is now
a kgid_t don't use the EXTRACT macro. Instead perform the work of
the extract macro explicitly. Read the value with ntohl and
convert it to the appropriate type with make_kuid or make_kgid.
Test if the value is different from what is stored in status and
update changed. Update the value in status.
In xdr_encode_AFS_StoreStatus call from_kuid or from_kgid as
we are computing the on the wire encoding.
Initialize uids with GLOBAL_ROOT_UID instead of 0.
Initialize gids with GLOBAL_ROOT_GID instead of 0.
Cc: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
rxrpc sockets only work in the initial network namespace so it isn't
possible to support afs in any other network namespace.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patch reinstates the ack system which withdraw should be using. It
appears to have been accidentally forgotten when the lock module was
merged into GFS2, due to two different sysfs files having the same name.
Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Using /dev/pstore as a mount point for the pstore filesystem is slightly
awkward. We don't normally mount filesystems in /dev/ and the /dev/pstore
file isn't created automatically by anything. While this method will
still work, we can create a persistent mount point in sysfs. This will
put pstore on par with things like cgroups and efivarfs.
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
While looking for kuid_t and kgid_t conversions I found this
structure that has never been used since it was added to the
kernel in 2007. The obvious for this structure to be used
is in xdr_encode_AFS_StoreStatus and that function uses a
small handful of local variables instead.
So remove the unnecessary structure to prevent confusion.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Modify v9fs_get_fsgid_for_create to return a kgid and modify all of
the variables that hold the result of v9fs_get_fsgid_for_create to be
of type kgid_t.
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Change struct v9fs_session_info and the code that popluates it to use
kuids and kgids. When parsing the 9p mount options convert the
dfltuid, dflutgid, and the session uid from the current user namespace
into kuids and kgids. Modify V9FS_DEFUID and V9FS_DEFGUID to be kuid
and kgid values.
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Change struct 9p_fid and it's associated functions to
use kuid_t's instead of uid_t.
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
9p has thre strucrtures that can encode inode stat information. Modify
all of those structures to contain kuid_t and kgid_t values. Modify
he wire encoders and decoders of those structures to use 'u' and 'g' instead of
'd' in the format string where uids and gids are present.
This results in all kuid and kgid conversion to and from on the wire values
being performed by the same code in protocol.c where the client is known
at the time of the conversion.
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Modify the p9_client_rpc format specifiers of every function that
directly transmits a uid or a gid from 'd' to 'u' or 'g' as
appropriate.
Modify those same functions to take kuid_t and kgid_t parameters
instead of uid_t and gid_t parameters.
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Before printing kuid and kgids values convert them into
the initial user namespace.
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Hold the uid and gid for a pending ceph mds request using the types
kuid_t and kgid_t. When a request message is finally created convert
the kuid_t and kgid_t values into uids and gids in the initial user
namespace.
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- In fill_inode() transate uids and gids in the initial user namespace
into kuids and kgids stored in inode->i_uid and inode->i_gid.
- In ceph_setattr() if they have changed convert inode->i_uid and
inode->i_gid into initial user namespace uids and gids for
transmission.
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- Make the uid and gid arguments of send_cap_msg() used to compose
ceph_mds_caps messages of type kuid_t and kgid_t.
- Pass inode->i_uid and inode->i_gid in __send_cap to send_cap_msg()
through variables of type kuid_t and kgid_t.
- Modify struct ceph_cap_snap to store uids and gids in types kuid_t
and kgid_t. This allows capturing inode->i_uid and inode->i_gid in
ceph_queue_cap_snap() without loss and pssing them to
__ceph_flush_snaps() where they are removed from struct
ceph_cap_snap and passed to send_cap_msg().
- In handle_cap_grant translate uid and gids in the initial user
namespace stored in struct ceph_mds_cap into kuids and kgids
before setting inode->i_uid and inode->i_gid.
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Ensure that if nfs_wait_on_sequence() causes our rpc task to wait for
an NFSv4 state serialisation lock, then we also drop the session slot.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
This patch removes the following build warning:
fs/f2fs/node.c: warning: 'nofs' may be used uninitialized in this function
[-Wuninitialized]: => 738:8
Note that this is a false alarm.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Pull f2fs cleanup patches from Al Viro:
f2fs: get rid of fake on-stack dentries
f2fs: switch init_inode_metadata() to passing parent and name separately
f2fs: switch new_inode_page() from dentry to qstr
f2fs: init_dent_inode() should take qstr
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Conflicts:
fs/f2fs/recovery.c
adding compat_ioctl to provide support for backward comptability - 32bit binary
execution on 64bit kernel.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In the SSR case, the max gc cost should be the number of pages in a segment.
Otherwise, f2fs is able to fail getting dirty segments frequently for SSR.
In get_victim_by_default() previously,
while(1) {
...
cost = get_gc_cost(); <- cost is between 0 ~ 512.
...
if (cost == get_max_cost(sbi, &p)) <- max cost is UINT_MAX due to GC_CB type
continue;
if (nsearched++ >= MAX_VICTIM_SEARCH)
break;
}
So, if there are a number of fully valid segments in series, f2fs cannot skip
those segments by comparing the cost and max cost of each segment.
Note that, the cost is the number of valid blocks at the time of the last
checkpoint.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch makes clearer the ambiguous f2fs_gc flow as follows.
1. Remove intermediate checkpoint condition during f2fs_gc
(i.e., should_do_checkpoint() and GC_BLOCKED)
2. Remove unnecessary return values of f2fs_gc because of #1.
(i.e., GC_NODE, GC_OK, etc)
3. Simplify write_checkpoint() because of #2.
4. Clarify the main f2fs_gc flow.
o monitor how many freed sections during one iteration of do_garbage_collect().
o do GC more without checkpoints if we can't get enough free sections.
o do checkpoint once we've got enough free sections through forground GCs.
5. Adopt thread-logging (Slack-Space-Recycle) scheme more aggressively on data
log types. See. get_ssr_segement()
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Instead of evaluating the free_sections and then deciding to return
true/false from that path. We can directly use the evaluation condition
for returning proper value.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce accessor to get the sections based upon the block type
(node,dents...) and modify the functions : should_do_checkpoint,
has_not_enough_free_secs to use this accessor function to get
the node sections and dent sections.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When gc thread creation is failed, mark gc_thread as NULL to avoid
crash while trying to stop invalid thread in stop_gc_thread->kthread_stop.
Instead make it return from:
if (!gc_th)
return;
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Currently GC task is started for each f2fs formatted/mounted device.
But, when we check the task list, using 'ps', there is no distinguishing
factor between the tasks. So, name the task as per the block device just
like the flusher threads.
Also, remove the macro GC_THREAD_NAME and instead use the name: f2fs_gc
to avoid name length truncation, as the command length is 16
-> TASK_COMM_LEN 16 and example name like:
f2fs_gc_task:8:16 -> this exceeds name length
Before Patch for 2 F2FS formatted partitions:
root 28061 0.0 0.0 0 0 ? S 10:31 0:00 [f2fs_gc_task]
root 28087 0.0 0.0 0 0 ? S 10:32 0:00 [f2fs_gc_task]
After Patch:
root 16756 0.0 0.0 0 0 ? S 14:57 0:00 [f2fs_gc-8:18]
root 16765 0.0 0.0 0 0 ? S 14:57 0:00 [f2fs_gc-8:19]
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
1. If f2fs is mounted with background_gc_off option, checking
BG_GC is not redundant.
2. f2fs_balance_fs is checked in f2fs_gc, so this is also redundant.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
F2FS_SET_SB_DIRT is called in inc_page_count and
it is directly called one more time in the next line.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In f2fs, there are two superblocks. So when the first superblock was
invalidate, it should try to check another.
By Jaegeuk Kim:
o Remove a white space for coding style
o Clean up for code readability
o Fix a typo
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In some system PAGE_CACHE_SIZE isn't 4K. So using F2FS_BLKSIZE to judge.
By Jaegeuk Kim:
o f2fs does not support no other 4KB page cache size.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In file status, it can't distinguish between different devices.
So add device name to do this function.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
1. Background
Previously, if f2fs tries to move data blocks of an *evicting* inode during the
cleaning process, it stops the process incompletely and then restarts the whole
process, since it needs a locked inode to grab victim data pages in its address
space. In order to get a locked inode, iget_locked() by f2fs_iget() is normally
used, but, it waits if the inode is on freeing.
So, here is a deadlock scenario.
1. f2fs_evict_inode() <- inode "A"
2. f2fs_balance_fs()
3. f2fs_gc()
4. gc_data_segment()
5. f2fs_iget() <- inode "A" too!
If step #1 and #5 treat a same inode "A", step #5 would fall into deadlock since
the inode "A" is on freeing. In order to resolve this, f2fs_iget_nowait() which
skips __wait_on_freeing_inode() was introduced in step #5, and stops f2fs_gc()
to complete f2fs_evict_inode().
1. f2fs_evict_inode() <- inode "A"
2. f2fs_balance_fs()
3. f2fs_gc()
4. gc_data_segment()
5. f2fs_iget_nowait() <- inode "A", then stop f2fs_gc() w/ -ENOENT
2. Problem and Solution
In the above scenario, however, f2fs cannot finish f2fs_evict_inode() only if:
o there are not enough free sections, and
o f2fs_gc() tries to move data blocks of the *evicting* inode repeatedly.
So, the final solution is to use f2fs_iget() and remove f2fs_balance_fs() in
f2fs_evict_inode().
The f2fs_evict_inode() actually truncates all the data and node blocks, which
means that it doesn't produce any dirty node pages accordingly.
So, we don't need to do f2fs_balance_fs() in practical.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Let's remove the use of page_cache_release() in f2fs, and instead, use
f2fs_put_page(page, 0) which is exactly same but for code readability.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In f2fs_inode_info structure, the description for data_version
has a typo mistake. It should be latest instead of lastes.
So, correcting that.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We can remove unneeded label unlock_out, avoid unnecessary jump
and reorganize the returning conditions in this function.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
After doing a write_checkpoint from garbage collection path if there is still
need to do more garbage collection, gc_more label is used to jump and start
the process again. And in that process, first step before getting victim is to
check if there are not enough free sections, which is already done before
doing a jump to gc_more. We can avoid the redundant call to check free
sections, by checking the gc_type flag which will remain FG_GC(value 1) under
this condition.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch supports ioctl FIFREEZE and FITHAW to snapshot filesystem.
Before calling f2fs_freeze, all writers would be suspended and sync_fs
would be completed. So no f2fs has to do something.
Just background gc operation should be skipped due to generate dirty
nodes and data until unfreeze.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
For the code
> prev = list_entry(orphan->list.prev, typeof(*prev), list);
if orphan->list.prev == head, it can't get the right prev.
And we can use the parameter 'this' to add.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
There is a typo in the ->show_options function for disable_ext_identify.
Fix it to match the spelling from the documentation.
Signed-off-by: Alejandro Martinez Ruiz <alex@nowcomputing.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The fill_zero() from fallocate() calls get_new_data_page() in which calls
reserve_new_block().
The reserve_new_block() should be covered by *DATA_NEW*, one of global locks.
And also, before getting the lock, we should check free sections by calling
f2fs_balance_fs().
If we break this rule, f2fs is able to face with out-of-control free space
management and fall into infinite loop like the following scenario as well.
[f2fs_sync_fs()] [fallocate()]
- write_checkpoint() - fill_zero()
- block_operations() - get_new_data_page()
: grab NODE_NEW - get_dnode_of_data()
: get locked dirty node page
- sync_node_pages()
: try to grab NODE_NEW for data allocation
: trylock and skip the dirty node page
: call sync_node_pages() repeatedly in order to flush all the dirty node
pages!
In order to avoid this, we should grab another global lock such as DATA_NEW
before calling get_new_data_page() in fill_zero().
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch enhances the checkpoint routine to cope with IO errors.
Basically f2fs detects IO errors from end_io_write, and the errors are able to
be occurred during one of data, node, and meta page writes.
In the previous code, when an IO error is occurred during writes, f2fs sets a
flag, CP_ERROR_FLAG, in the raw ckeckpoint buffer which will be written to disk.
Afterwards, write_checkpoint() will check the flag and remount f2fs as a
read-only (ro) mode.
However, even once f2fs is remounted as a ro mode, dirty checkpoint pages are
freely able to be written to disk by flusher or kswapd in background.
In such a case, after cold reboot, f2fs would restore the checkpoint data having
CP_ERROR_FLAG, resulting in disabling write_checkpoint and remounting f2fs as
a ro mode again.
Therefore, let's prevent any checkpoint page (meta) writes once an IO error is
occurred, and remount f2fs as a ro mode right away at that moment.
Reported-by: Oliver Winker <oliver@oli1170.net>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
This patch stores inode->i_rdev into on-disk inode structure.
Alun reported that:
aspire tmp # mount -t f2fs /dev/sdb mnt
aspire tmp # mknod mnt/sda1 b 8 1
aspire tmp # mknod mnt/null c 1 3
aspire tmp # mknod mnt/console c 5 1
aspire tmp # ls -l mnt
total 2
crw-r--r-- 1 root root 5, 1 Jan 22 18:44 console
crw-r--r-- 1 root root 1, 3 Jan 22 18:44 null
brw-r--r-- 1 root root 8, 1 Jan 22 18:44 sda1
aspire tmp # umount mnt
aspire tmp # mount -t f2fs /dev/sdb mnt
aspire tmp # ls -l mnt
total 2
crw-r--r-- 1 root root 0, 0 Jan 22 18:44 console
crw-r--r-- 1 root root 0, 0 Jan 22 18:44 null
brw-r--r-- 1 root root 0, 0 Jan 22 18:44 sda1
In this report, f2fs lost the major/minor numbers of device files after umount.
The reason was revealed that f2fs does not store the inode->i_rdev to the
on-disk inode data structure.
So, as the other file systems do, f2fs also stores i_rdev into the i_addr fields
in on-disk inode structure without any on-disk layout changes.
Note that, this bug is limited to device files made by mknod().
Reported-and-Tested-by: Alun Jones <alun.linux@ty-penguin.org.uk>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If the server reboots after it has replied to our OPEN, but before we
call nfs4_opendata_to_nfs4_state(), then the reboot recovery thread
will not see a stateid for this open, and so will fail to recover it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a mutex to the struct nfs4_state_owner to ensure that delegation
recall doesn't conflict with byte range lock removal.
Note that we nest the new mutex _outside_ the state manager reclaim
protection (nfsi->rwsem) in order to avoid deadlocks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Adjust the return values so that they return EAGAIN to the caller in
cases where we might want to retry the delegation recall after
the state recovery has run.
Note that we can't wait and retry in this routine, because the caller
may be the state manager thread.
If delegation recall fails due to a session or reboot related issue,
also ensure that we mark the stateid as delegated so that
nfs_delegation_claim_opens can find it again later.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server reboots while we are converting a delegation into
OPEN/LOCK stateids as part of a delegation return, the current code
will simply exit with an error. This causes us to lose both
delegation state and locking state (i.e. locking atomicity).
Deal with this by exposing the delegation stateid during delegation
return, so that we can recover the delegation, and then resume
open/lock recovery.
Note that not having to hold the nfs_inode->rwsem across the
calls to nfs_delegation_claim_opens() also fixes a deadlock against
the NFSv4.1 reboot recovery code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We currently have a deadlock in which the state recovery thread
ends up blocking due to one of the locks which it is trying to
recover holding the nfs_inode->rwsem.
The situation is as follows: the state recovery thread is
scheduled in order to recover from a reboot. It immediately
drains the session, forcing all ordinary NFSv4.1 calls to
nfs41_setup_sequence() to be put to sleep. This includes the
file locking process that holds the nfs_inode->rwsem.
When the thread gets to nfs4_reclaim_locks(), it tries to
grab a write lock on nfs_inode->rwsem, and boom...
Fix is to have the lock drop the nfs_inode->rwsem while it is
doing RPC calls. We use a sequence lock in order to signal to
the locking process whether or not a state recovery thread has
run on that inode, in which case it should retry the lock.
Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds a seqcount_t lock for use by the state manager to
signal that an open owner has been recovered. This mechanism will be
used by the delegation, open and byte range lock code in order to
figure out if they need to replay requests due to collisions with
lock recovery.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* acpi-pm: (35 commits)
ACPI / PM: Handle missing _PSC in acpi_bus_update_power()
ACPI / PM: Do not power manage devices in unknown initial states
ACPI / PM: Fix acpi_bus_get_device() check in drivers/acpi/device_pm.c
ACPI / PM: Fix /proc/acpi/wakeup for devices w/o bus or parent
ACPI / PM: Fix consistency check for power resources during resume
ACPI / PM: Expose lists of device power resources to user space
sysfs: Functions for adding/removing symlinks to/from attribute groups
ACPI / PM: Expose current status of ACPI power resources
ACPI / PM: Expose power states of ACPI devices to user space
ACPI / scan: Prevent device add uevents from racing with user space
ACPI / PM: Fix device power state value after transitions to D3cold
ACPI / PM: Use string "D3cold" to represent ACPI_STATE_D3_COLD
ACPI / PM: Sanitize checks in acpi_power_on_resources()
ACPI / PM: Always evaluate _PSn after setting power resources
ACPI / PM: Introduce helper for executing _PSn methods
ACPI / PM: Make acpi_bus_init_power() more robust
ACPI / PM: Fix build for unusual combination of Kconfig options
ACPI / PM: remove leading whitespace from #ifdef
ACPI / PM: Consolidate suspend-specific and hibernate-specific code
ACPI / PM: Move device power management functions to device_pm.c
...
Return EEXISTS if requested file already exists, without this patch open
call will always succeed even if the file exists and user specified
O_CREAT|O_EXCL.
Following test code can be used to verify this patch. Without this patch
executing following test code on 9p mount will result in printing 'test case
failed' always.
main()
{
int fd;
/* first create the file */
fd = open("./file", O_CREAT|O_WRONLY);
if (fd < 0) {
perror("open");
return -1;
}
close(fd);
/* Now opening same file with O_CREAT|O_EXCL should fail */
fd = open("./file", O_CREAT|O_EXCL);
if (fd < 0 && errno == EEXIST)
printf("test case pass\n");
else
printf("test case failed\n");
close(fd);
return 0;
}
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
We do the truncate via setattr request, hence don't pass the O_TRUNC flag in
open request. Without this patch we end up sending zero sized write request
to server when we try to truncate. Some servers (VirtFS) were not handling that
properly.
Reported-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
... is really excessive. First of all, ->readdir() is serialized by
file->f_path.dentry->d_inode->i_mutex; playing with file->f_path.dentry->d_lock
is not buying you anything. Moreover, rdir->mutex is pointless for exactly
the same reason - you'll never see contention on it.
While we are at it, there's no point in having rdir->buf a pointer -
you have it point just past the end of rdir, so it might as well be a flex
array (and no, it's not a gccism).
Absolutely untested patch follows:
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
There are multiple reasons to move away from debugfs. First of all,
we are only using it for a single parameter, and it is much more
complicated to set up (some 30 lines of code compared to 3), and one
more thing that might fail while loading the jbd2 module.
Secondly, as a module paramter it can be specified as a boot option if
jbd2 is built into the kernel, or as a parameter when the module is
loaded, and it can also be manipulated dynamically under
/sys/module/jbd2/parameters/jbd2_debug. So it is more flexible.
Ultimately we want to move away from using jbd_debug() towards
tracepoints, but for now this is still a useful simplification of the
code base.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There are multiple reasons to move away from debugfs. First of all,
we are only using it for a single parameter, and it is much more
complicated to set up (some 30 lines of code compared to 3), and one
more thing that might fail while loading the ext4 module.
Secondly, as a module paramter it can be specified as a boot option if
ext4 is built into the kernel, or as a parameter when the module is
loaded, and it can also be manipulated dynamically under
/sys/module/ext4/parameters/mballoc_debug. So it is more flexible.
Ultimately we want to move away from using mb_debug() towards
tracepoints, but for now this is still a useful simplification of the
code base.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
until the inode has been succesfully allocated. In order to do this,
we need to start the handle in the ext4_new_inode(). So create a new
variant of this function, ext4_new_inode_start_handle(), so the handle
can be created at the last possible minute, before we need to modify
the inode allocation bitmap block.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Operations which modify extended attributes may need extra journal
credits if inline data is used, since there is a chance that some
extended attributes may need to get pushed to an external attribute
block.
Changes to reflect this was made in xattr.c, but they were missed in
fs/ext4/acl.c. To fix this, abstract the calculation of the number of
credits needed for xattr operations to an inline function defined in
ext4_jbd2.h, and use it in acl.c and xattr.c.
Also move the function declarations used in inline.c from xattr.h
(where they are non-obviously hidden, and caused problems since
ext4_jbd2.h needs to use the function ext4_has_inline_data), and move
them to ext4.h.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Jan Kara <jack@suse.cz>
The ext4_unlink() and ext4_rmdir() don't actually release the blocks
associated with the file/directory. This gets done in a separate jbd2
handle called via ext4_evict_inode(). Thus, we don't need to reserve
lots of journal credits for the truncate.
Note that using too many journal credits is non-optimal because it can
leading to the journal transmit getting closed too early, before it is
strictly necessary.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
The migration ioctl creates a temporary inode. Since this inode is
never linked to a directory, we don't need to reserve journal credits
required for modifying the directory.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
The grab_cache_page_write_begin() function can potentially sleep for a
long time, since it may need to do memory allocation which can block
if the system is under significant memory pressure, and because it may
be blocked on page writeback. If it does take a long time to grab the
page, it's better that we not hold an active jbd2 handle.
So grab a handle on the page first, and _then_ start the transaction
handle.
This commit fixes the following long transaction handle hold time:
postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
dirtied_blocks 0
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.
The recommended way for finding the longer-running handles is:
T=/sys/kernel/debug/tracing
EVENT=$T/events/jbd2/jbd2_handle_stats
echo "interval > 5" > $EVENT/filter
echo 1 > $EVENT/enable
./run-my-fs-benchmark
cat $T/trace > /tmp/problem-handles
This will list handles that were active for longer than 20ms. Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation. Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:
postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
dirtied_blocks 0
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Now that we're allowing more DRC entries, it becomes a lot easier to hit
problems with XID collisions. In order to mitigate those, calculate a
checksum of up to the first 256 bytes of each request coming in and store
that in the cache entry, along with the total length of the request.
This initially used crc32, but Chuck Lever and Jim Rees pointed out that
crc32 is probably more heavyweight than we really need for generating
these checksums, and recommended looking at using the same routines that
are used to generate checksums for IP packets.
On an x86_64 KVM guest measurements with ftrace showed ~800ns to use
csum_partial vs ~1750ns for crc32. The difference probably isn't
terribly significant, but for now we may as well use csum_partial.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Stones-thrown-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Move the jbd2 wrapper functions which start and stop handles out of
super.c, where they don't really logically belong, and into
ext4_jbd2.c.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Handles which stay open a long time are problematic when it comes time
to close down a transaction so it can be committed. These tracepoints
will help us determine which ones are the problematic ones, and to
validate whether changes makes things better or worse.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
... sure, it's tempting to just pass dentry. Except that we don't
_have_ anything resembling a real dentry on one of the paths to it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
for one thing, it doesn't (and shouldn't) use anything else from dentry;
for another, on some call chains the dentry is fake and should
be eliminated completely.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs fixes from Chris Mason:
"We've got corner cases for updating i_size that ceph was hitting,
error handling for quotas when we run out of space, a very subtle
snapshot deletion race, a crash while removing devices, and one
deadlock between subvolume creation and the sb_internal code (thanks
lockdep)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: move d_instantiate outside the transaction during mksubvol
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
Btrfs: fix possible stale data exposure
Btrfs: fix missing i_size update
Btrfs: fix race between snapshot deletion and getting inode
Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
Btrfs: do not merge logged extents if we've removed them from the tree
btrfs: don't try to notify udev about missing devices
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because the static function 'release_blocks' is only called
when releasing blocks,it will be more simple and efficient to
call the function 'percpu_counter_add' directly.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We should mark inode dirty after the function dquot_free_block_nodirty
is called.Besides,add a check whether it is necessary to call
dquot_free_block_nodirty functon.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In xfs_ifunlock() there is a call to wake_up_bit() after clearing
the flush lock on the xfs inode. This is not guaranteed to be safe,
as noted in the comments above wake_up_bit() beginning with:
In order for this to function properly, as it uses
waitqueue_active() internally, some kind of memory
barrier must be done prior to calling this.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For some filesystems (e.g. GlusterFS), the cost of performing a
normal readdir and readdirplus are identical. Since adaptively
using readdirplus has no benefit for those systems, give
users/filesystems the option to control adaptive readdirplus use.
v2 of this patch incorporates Miklos's suggestion to simplify the code,
as well as improving consistency of macro names and documentation.
Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Track the delay between when we first request that the commit begin
and when it actually begins, so we can see how much of a gap exists.
In theory, this should just be the remaining scheduling quantuum of
the thread which requested the commit (assuming it was not a
synchronous operation which triggered the commit request) plus
scheduling overhead; however, it's possible that real time processes
might get in the way of letting the kjournald thread from executing.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Dave Sterba triggered a lockdep complaint about lock ordering
between the sb_internal lock and the cleaner semaphore.
btrfs_lookup_dentry() checks for orphans if we're looking up
the inode for a subvolume, and subvolume creation is triggering
the lookup with a transaction running.
This commit moves the d_instantiate after the transaction closes.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When btrfs_qgroup_reserve returned a failure, we were missing a counter
operation for BTRFS_I(inode)->outstanding_extents++, leading to warning
messages about outstanding extents and space_info->bytes_may_use != 0.
Additionally, the error handling code didn't take into account that we
dropped the inode lock which might require more cleanup.
Luckily, all the cleanup code we need is already there and can be shared
with reserve_metadata_bytes, which is exactly what this patch does.
Reported-by: Lev Vainblat <lev@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It can be guranteed that inode->i_sb should not be null in vfs.
So here the check about it is overhead.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We specifically do not update the disk i_size if there are ordered extents
outstanding for any area between the current disk_i_size and our ordered
extent so that we do not expose stale data. The problem is the check we
have only checks if the ordered extent starts at or after the current
disk_i_size, which doesn't take into account an ordered extent that starts
before the current disk_i_size and ends past the disk_i_size. Fix this by
checking if the extent ends past the disk_i_size. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we have an ordered extent before the ordered extent we are currently
completing that is after the current disk_i_size we will put our i_size
update into that ordered extent so that we do not expose stale data. The
problem is that if our disk i_size is updated past the previous ordered
extent we won't update the i_size with the pending i_size update. So check
the pending i_size update and if its above the current disk i_size we need
to go ahead and try to update. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().
And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.
Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.
(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0). So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.
Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.
So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we fail to start a transaction, we need to release the reserved free space
and qgroup space, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
decrease ->sync_writers, because we have not increased it. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
You can run into this problem where if somebody is fsyncing and writing out
the existing extents you will have removed the extent map from the em tree,
but it's still valid for the current fsync so we go ahead and write it. The
problem is we unconditionally try to merge it back into the em tree, but if
we've removed it from the em tree that will cause use after free problems.
Fix this to only merge if we are still a part of the tree. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
struct udf_bitmap has array of buffer pointers attached to it. The code
unnecessarily used s_block_bitmap as a pointer to the array instead of
the standard trick of using 0 length array in the declaration. Change
that to make code more readable and actually shrink the structure by one
pointer.
Signed-off-by: Jan Kara <jack@suse.cz>
For large UDF filesystems with 512-byte blocks the number of necessary
bitmap blocks is larger than 2^16 so s_nr_groups in udf_bitmap overflows
(the number will overflow for filesystems larger than 128 GB with
512-byte blocks). That results in ENOSPC errors despite the filesystem
has plenty of free space.
Fix the problem by changing s_nr_groups' type to 'int'. That is enough
even for filesystems 2^32 blocks (UDF maximum) and 512-byte blocksize.
Reported-and-tested-by: v10lator@myway.de
Signed-off-by: Jan Kara <jack@suse.cz>
These routines are used by server and client code, so having them in a
separate header would be best.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We don't really need to preallocate at all; just allocate and initialize
everything at once, but leave the sc_type field initially 0 to prevent
finding the stateid till it's fully initialized.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When free nfs-client, it must free the ->cl_stateids.
Cc: stable@kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Typical cputime stats infrastructure relies on the timer tick and
its periodic polling on the CPU to account the amount of time
spent by the CPUs and the tasks per high level domains such as
userspace, kernelspace, guest, ...
Now we are preparing to implement full dynticks capability on
Linux for Real Time and HPC users who want full CPU isolation.
This feature requires a cputime accounting that doesn't depend
on the timer tick.
To implement it, this new cputime infrastructure plugs into
kernel/user/guest boundaries to take snapshots of cputime and
flush these to the stats when needed. This performs pretty
much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
and cputime snaphots are synchronized between write and read
side such that the latter can safely retrieve the pending tickless
cputime of a task and add it to its latest cputime snapshot to
return the correct result to the user.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJRBsKnAAoJEIUkVEdQjox3lMgP/2R6DU2f8PyGIao3hne4M3Pu
L3q+mAG53b24Dy014KeW7gd8yv45fE7wp/rs8CGLte9VzbLkRCDSFQPgBuXVagRj
tV5nfAuqD0wHTnA+HhBE3l3C2RKAPGIu79rBpnIR/QIPPl8Z3Dby8YgmxEQKDf8G
j7MEBu2LthSuqEi2ZXemnO5r0oEnQAzAp4TTi/M38k0Fmt59nOGyjLnI+xHYCBMa
1pnz7j3jjR9NJExGu8iVvbo+jupuQngP8qmkLXHvYnj/TEJNwzO1hHVoSwOpjYpS
9ycl+T8IKQLbAkBywLtq3Mzde43xt/t8wYyGZ0oAV+Z7MIpz/9YIfDJwqQeqoNbD
dAdbNjKMbsxCgmrnyqSagfMQg/r3CPZ4vf40TMCaN4gNUJC4Ie+E4kPRKRh59+PB
Ukthmqujn0f40LAa+HXTUuzafd3b0s/ewH+8FuQ6LAG9b5+WnoN8JTJ5u6+ydokO
ZleeOowuRZZEg+abQ8Sm2GRm/BzN29gi/npb//I+ZDXWv/+3yccgsiPjCRzCAAaO
g1RmYryFSRUwHQbGNNypVWVuOLWvrBQ4jqbGO7BBuBByZMSHryKxR6mb+inH3qLE
xIDM9SdSJisc292OzoFKwVZki4MaXaadJXJduVvqYlZQvXXs7eAa4wo3euhtVITD
NLQO5OZXE4oIQmDFb0FV
=1Tzp
-----END PGP SIGNATURE-----
Merge tag 'full-dynticks-cputime-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core
Pull full-dynticks (user-space execution is undisturbed and
receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready,
from Frederic Weisbecker:
"This implements the cputime accounting on full dynticks CPUs.
Typical cputime stats infrastructure relies on the timer tick and
its periodic polling on the CPU to account the amount of time
spent by the CPUs and the tasks per high level domains such as
userspace, kernelspace, guest, ...
Now we are preparing to implement full dynticks capability on
Linux for Real Time and HPC users who want full CPU isolation.
This feature requires a cputime accounting that doesn't depend
on the timer tick.
To implement it, this new cputime infrastructure plugs into
kernel/user/guest boundaries to take snapshots of cputime and
flush these to the stats when needed. This performs pretty
much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
and cputime snaphots are synchronized between write and read
side such that the latter can safely retrieve the pending tickless
cputime of a task and add it to its latest cputime snapshot to
return the correct result to the user."
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull dlm fix from David Teigland:
"Thanks to Jana who reported the problem and was able to test this fix
so quickly."
This fixes an incorrect size check that triggered for CONFIG_COMPAT
whether the code was actually doing compat or not. The incorrect write
size check broke userland (clvmd) when maximum resource name lengths are
used.
* 'fix-max-write' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: check the write size from user
There exists a situation when GC can work in background alone without
any other filesystem activity during significant time.
The nilfs_clean_segments() method calls nilfs_segctor_construct() that
updates superblocks in the case of NILFS_SC_SUPER_ROOT and
THE_NILFS_DISCONTINUED flags are set. But when GC is working alone the
nilfs_clean_segments() is called with unset THE_NILFS_DISCONTINUED flag.
As a result, the update of superblocks doesn't occurred all this time
and in the case of SPOR superblocks keep very old values of last super
root placement.
SYMPTOMS:
Trying to mount a NILFS2 volume after SPOR in such environment ends with
very long mounting time (it can achieve about several hours in some
cases).
REPRODUCING PATH:
1. It needs to use external USB HDD, disable automount and doesn't
make any additional filesystem activity on the NILFS2 volume.
2. Generate temporary file with size about 100 - 500 GB (for example,
dd if=/dev/zero of=<file_name> bs=1073741824 count=200). The size of
file defines duration of GC working.
3. Then it needs to delete file.
4. Start GC manually by means of command "nilfs-clean -p 0". When you
start GC by means of such way then, at the end, superblocks is updated
by once. So, for simulation of SPOR, it needs to wait sometime (15 -
40 minutes) and simply switch off USB HDD manually.
5. Switch on USB HDD again and try to mount NILFS2 volume. As a
result, NILFS2 volume will mount during very long time.
REPRODUCIBILITY: 100%
FIX:
This patch adds checking that superblocks need to update and set
THE_NILFS_DISCONTINUED flag before nilfs_clean_segments() call.
Reported-by: Sergey Alexandrov <splavgm@gmail.com>
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since we dynamically allocate them now, allow the system to call us up
to release them if it gets low on memory. Since these entries aren't
replaceable, only free ones that are expired or that are over the cap.
The the seeks value is set to '1' however to indicate that freeing the
these entries is low-cost.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It's not sufficient to only clean the cache when requests come in. What
if we have a flurry of activity and then the server goes idle? Add a
workqueue job that will clean the cache every RC_EXPIRE period.
Care is taken to only run this when we expect to have entries expiring.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There's no need to keep entries around that we're declaring RC_NOCACHE.
Ditto if there's a problem with the entry.
With this change too, there's no need to test for RC_UNUSED in the
search function. If the entry's in the hash table then it's either
INPROG or DONE.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
With the change to dynamically allocate entries, the cache is never
disabled on the fly. Remove this flag.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The existing code keeps a fixed-size cache of 1024 entries. This is much
too small for a busy server, and wastes memory on an idle one. This
patch changes the code to dynamically allocate and free these cache
entries.
A cap on the number of entries is retained, but it's much larger than
the existing value and now scales with the amount of low memory in the
machine.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
...otherwise, we end up with the list ordering wrong. Currently, it's
not a problem since we skip RC_INPROG entries, but keeping the ordering
strict will be necessary for a later patch that adds a cache cleaner.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Return EINVAL from write if the size is larger than
allowed. Do this before allocating kernel memory for
the bogus size, which could lead to OOM.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Jana Saout <jana@saout.de>
Signed-off-by: David Teigland <teigland@redhat.com>
The ext4 block allocator only maintains buddy bitmaps for chunks which
are less than or equal to one quarter of a block group. That is, for
a file aystem with a 1k blocksize, and where the number of blocks in a
block group is 8192 blocks, the largest chunk size tracked by buddy
bitmaps is 2048 blocks.
For a file system with a 4k blocksize, and where the number of blocks
in a block group is 32768 blocks, the largest chunk size tracked by
buddy bitmaps is 8192 blocks.
To work around this code, mballoc.c before this commit would truncate
allocation requests to the number of blocks in a block group minus 10.
Why 10? Aside from being a completely arbitrary number, it avoids
block allocation to be a power of two larger than 25% of the block
group. If you try to explicitly fallocate 50% of the block group
size, this will demonstrate the problem; the block allocation code
will scan the all of the blocks in the file system with cr==0 (since
the request is for a natural power of two), but then completely fail
for all blocks groups, since the buddy bitmaps don't track chunk sizes
of 50% of the block group.
To fix this, in these we use ext4_mb_complex_scan_group() instead of
ext4_mb_simple_scan_group().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>
commit 626cf23660 "poll: add poll_requested_events()..." enabled us to send the
requested events to the filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
drop_nlink() warns if nlink is already zero. This is triggerable by a buggy
userspace filesystem. The cure, I think, is worse than the disease so disable
the warning.
Reported-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The all pointers within fuse_req must point to valid memory once
fuse_force_forget() returns.
This bug appeared in "fuse: implement NFS-like readdirplus support"
and was never in any official Linux release.
I tested the fuse_force_forget() code path by injecting to fake -ENOMEM and
verified the FORGET operation was called properly in userspace.
Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Later, we'll need more than one call site for this, so break it out
into a new function.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add a preprocessor constant for the expiry time of cache entries, and
move the test for an expired entry into a function. Note that the current
code does not test for RC_INPROG. It just assumes that it won't take more
than 2 minutes to fill out an in-progress entry.
I'm not sure how valid that assumption is though, so let's just ensure
that we never consider an RC_INPROG entry to be expired.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Entries can only get a c_type of RC_REPLBUFF iff they are
RC_DONE. Therefore the test for RC_DONE isn't necessary here.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently we use kmalloc() which wastes a little bit of memory on each
allocation since it's a power of 2 allocator. Since we're allocating a
1024 of these now, and may need even more later, let's create a new
slabcache for them.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The locking rules for cache entries say that locking the cache_lock
isn't needed if you're just touching the current entry. Earlier
in this function we set rp->c_state to RC_UNUSED without any locking,
so I believe it's ok to do the same here.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently, it only stores the first 16 bytes of any address. struct
sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
12 bytes of the address.
Expand the c_addr field to a sockaddr_in6, and cast it to a sockaddr_in
as necessary. Also fix the comparitor to use the existing RPC
helpers for this.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The last orphan in the dnext list has its dnext set to NULL. Because
of that, ubifs_delete_orphan assumes that it is not on the dnext list
and frees it immediately instead ignoring it as a second delete. The
orphan is later freed again by erase_deleted.
This change adds an explicit flag to ubifs_orphan indicating whether
it is pending delete.
Signed-off-by: Adam Thomas <adamthomas1111@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: stable@vger.kernel.org
The last orphan in the cnext list has its cnext set to NULL. Because
of that, ubifs_delete_orphan assumes that it is not on the cnext list
and frees it immediately instead of adding it to the dnext list. The
freed orphan is later modified by write_orph_node.
This can cause various inconsistencies including directory entries
that cannot be removed and this error:
UBIFS error (pid 20685): layout_cnodes: LPT out of space at LEB 14:129009 needing 17, done_ltab 1, done_lsave 1
This is a regression introduced by
"7074e5eb UBIFS: remove invalid reference to list iterator variable".
This change adds an explicit flag to ubifs_orphan indicating whether
it is pending commit.
Signed-off-by: Adam Thomas <adamthomas1111@gmail.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org # v3.6+
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Check for incompatible mount options when using the ext4 file system
driver to mount ext2 or ext3 file systems.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If argument of inode_readahead_blk is too big, we just bail out
without printing any error. Fix this since it could confuse users.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The loop looking for correct mount option entry is more logical if it is
written rewritten as an empty loop looking for correct option entry and then
code handling the option. It also saves one level of indentation for a lot of
code so we can join a couple of split lines.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Several mount option (resuid, resgid, journal_dev, journal_ioprio) are
currently handled before we enter standard option handling loop. I don't
see a reason for this so move them to normal handling loop to make things
more regular.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
It is unnecessary to check i<4 after the loop; just do it before the
break.
Signed-off-by: Cong Ding <dinggnu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when
changing the lg_prealloc_list.
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Commit 2147b1a6a4 resulted in a new smatch warning:
> fs/ext4/move_extent.c:693 mext_replace_branches()
> warn: variable dereferenced before check 'dext' (see line 683)
Fix this by adding a check to make sure dext is non-NULL before we
derefrence it.
Signed-off-by: Akria Fujita <a-fujita@rs.jp.nec.com>
[ modified by tytso to make sure an ext4_error is called ]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use WARN rather than printk followed by WARN_ON(1), for conciseness.
A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression list es;
@@
-printk(
+WARN(1,
es);
-WARN_ON(1);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently, we calculate the attribute set transaction
log space reservation at runtime in two parts:
1) XFS_ATTRSET_LOG_RES() which is calcuated out at mount time.
2) ((ext * (mp)->m_sb.sb_sectsize) + \
(ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \
(128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))))))
which is calculated out at runtime since it depend on the given extent length in blocks.
This patch renamed XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate
that it is figured out at mount time. Introduce XFS_ATTRSETRT_LOG_RES(mp) which would
be used to calculate out the unit of the log space reservation for one block.
In this way, the total runtime space for the given extent length can be figured out by:
XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Make use of XFS_SB_LOG_RES() at xfs_mount_log_sb().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Make use of XFS_SB_LOG_RES() at xfs_log_sbcount().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Introduce a new transaction space reservation XFS_SB_LOG_RES() for
those transactions that need to modify the superblock on disk.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Convert the calculation for end of quotaoff log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Convert the calculation of quota off transaction log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The disk quota allocation log space reservation is calcuated at runtime,
this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For adjusting quota limits transactions, we calculate out the log space
reservation at runtime, this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For the transaction that write the incore superblock changes of quota flags
to disk, it would reserve the same log space to clear/reset quota flags
transaction, hence we can use XFS_TRANS_SBCHANGE_LOG_RES() for it as well.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The transaction log space for clearing/reseting the quota flags
is calculated out at runtime, this patch can figure it out at
mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Refining the existing reservations with xfs_calc_buf_res() in xfs_trans.c
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This patch allocates a block reservation structure before growing
or shrinking a file. Without this structure, the grow or shink code
can reference the bad pointer.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The intent here is to split the processing of the glock lru
list into two parts, so that the selection of glocks and the
disposal are separate functions. The plan is then, that further
updates can then be made to these functions in the future
to improve the selection of glocks and also the efficiency of
glock disposal.
The new feature which this patch brings is sorting the
glocks to be disposed of into glock number (and thus also
disk block number) order. Not all glocks will need i/o in
order to dispose of them, but some will, and at least we'll
generate mostly disk block order i/o now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add a new helper xfs_calc_buf_res() to calcuate out the transaction space
reservations per item. xfs_buf_log_overhead() is used to figure out the
extra space for struct xfs_buf_log_format that gets written into the log
for every buffer as well as a log opheader, i.e. struct xlog_op_header.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
If we remove a missing device, bdev is null, and if we
send that off to btrfs_kobject_uevent we'll panic.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This reverts commit 324d003b0c.
The deadlock turned out to be caused by a workqueue limitation that has
now been worked around in the RPC code (see comment in rpc_free_task).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
- Error reporting in nfs_xdev_mount incorrectly maps all errors to ENOMEM
- Fix an NFSv4 refcounting issue
- Fix a mount failure when the server reboots during NFSv4 trunking discovery
- NFSv4.1 mounts may need to run the lease recovery thread.
- Don't silently fail setattr() requests on mountpoints
- Fix a SUNRPC socket/transport livelock and priority queue issue
- We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJRCpS4AAoJEGcL54qWCgDyqucP/2CTv5leu+X5/0PXBOykAIHg
8oEsTEz7/4IvIxTXzuHDYirMnm/mulfGF6NrdPqfvxpAHqRfVBfLFocfNLMVhQci
97RmBfEEGM22AToYUubML5bIxr0QllV4s9Vmyh/zGDan52y7zNNlZX+v6aLjZbJB
Fbolihpcch6lhQEUNAzK0B0ddimDl9lazx/WTmMOD/JrwOqzA4FJC+YxBe88nfzQ
c6sYyEptBaSirbCOlueqGpv8skB1CLpFJXguXToPXFxpWed6uoGrIwLO7MLUdFpJ
Xw+j8cuv/wjyYJGVKjhW7kXtwK8T7+u4bT2L883R01XYXr8XfkkLON0dgG1X/unk
80mLzCO1+qRdoDSQ4b/V4B0nScPRCJuoZpftjCi2uhKewNcxQPMZ2V5/D7pO3uyE
NdhxByB8D86JfNrIcBcRaxfuiQsurQBDvsDNmWPBZSOmH/dmHqTQGLcIe6N94A0B
c7KFtXrN2MzOl8S68dUpbhftObq9X0oK2oxFFLWRQoqjrFtiDLU5JilV0bEH2CMo
gJX7CRrPoJ2wmNToKdPJRWBYDmtMMIThq3vIpCj1FFzX17r5grJpqCmp/TViu/ER
r8rmJzni+nmHaO1NLJMTCHzLJ8soiKyBq8PZKAbipTJxy+TXI/69jnWzpIQ/pbM/
JN+0tiCcqmUmXsO+hrCP
=lWFV
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Error reporting in nfs_xdev_mount incorrectly maps all errors to
ENOMEM
- Fix an NFSv4 refcounting issue
- Fix a mount failure when the server reboots during NFSv4 trunking
discovery
- NFSv4.1 mounts may need to run the lease recovery thread.
- Don't silently fail setattr() requests on mountpoints
- Fix a SUNRPC socket/transport livelock and priority queue issue
- We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session.
* tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session
SUNRPC: When changing the queue priority, ensure that we change the owner
NFS: Don't silently fail setattr() requests on mountpoints
NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery
NFSv4: Fix NFSv4 trunking discovery
NFSv4: Fix NFSv4 reference counting for trunked sessions
NFS: Fix error reporting in nfs_xdev_mount
Use the same adaptive readdirplus mechanism as NFS:
http://permalink.gmane.org/gmane.linux.nfs/49299
If the user space implementation wants to disable readdirplus
temporarily, it could just return ENOTSUPP. Then kernel will
recall it with readdir.
Signed-off-by: Feng Shuo <steve.shuo.feng@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Commit c69e8d9c0 added rcu lock to fuse/dir.c It was assuming
that 'task' is some other process but in fact this parameter always
equals to 'current'. Inline this parameter to make it more readable
and remove RCU lock as it is not needed when access current process
credentials.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
NFS4ERR_DELAY is a legal reply when we call DESTROY_SESSION. It
usually means that the server is busy handling an unfinished RPC
request. Just sleep for a second and then retry.
We also need to be able to handle the NFS4ERR_BACK_CHAN_BUSY return
value. If the NFS server has outstanding callbacks, we just want to
similarly sleep & retry.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Ensure that any setattr and getattr requests for junctions and/or
mountpoints are sent to the server. Ever since commit
0ec26fd069 (vfs: automount should ignore LOOKUP_FOLLOW), we have
silently dropped any setattr requests to a server-side mountpoint.
For referrals, we have silently dropped both getattr and setattr
requests.
This patch restores the original behaviour for setattr on mountpoints,
and tries to do the same for referrals, provided that we have a
filehandle...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Don't send an extra wakeup to kjournald in the case where we
already have the proper target in j_commit_request, i.e. that
transaction has already been requested for commit.
commit deeeaf13 "jbd2: fix fsync() tid wraparound bug" changed
the logic leading to a wakeup, but it caused some extra wakeups
which were found to lead to a measurable performance regression.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[tytso@mit.edu: reworked check to make it clearer]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.
CC: stable@vger.kernel.org
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
- fix return value when filesystem probe finds no XFS magic, a
regression introduced in 9802182.
- fix stack switch in __xfs_bmapi_allocate by moving the check for stack
switch up into xfs_bmapi_write.
- fix oops in _xfs_buf_find by validating that the requested block is
within the filesystem bounds.
- limit speculative preallocation near ENOSPC.
- fix an unmount hang in xfs_wait_buftarg by freeing the
xfs_buf_log_item in xfs_buf_item_unlock.
- fix a possible use after free with AIO.
- fix xfs_swap_extents after removal of xfs_flushinval_pages, a
regression introduced in fb59581404.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJRBvmgAAoJENaLyazVq6ZOOacP/RilCPsi41NkJqRx1Rs5aRGE
UvinrHfAL/tBupS2JVo1niIilBNJG/cI+lLcpV/P5omLBJfpEu0trzZUSxU7S1Vc
a2M8J0qhmKfBcl70fuCALAxPY52895y+44gufxaH0O5HQDN6tB8n4MMqYGPmS8hz
Ul/q3MO601hVyBHaoYa7BNGS3YG0TCdFGtWcC5tQaR3v7upTLR2ouZrGQ8CV0BBa
Ek1xdxLh4D0fRybSL7lUw64W957iyldoLsEg+zQrE9NSfTE8DSqUG+NPWB0wjPce
ICtmO6TbE5c6q1ScOL3YCC2cmYvjR9mlAHnPy73SqWSIsTqUsVzdibNo+tUJJZ5r
RZf3u6Uri6uKC6Hl4XEtg4LVnnquKosTXfoiHmn+eh0dhYL7sZG0Ya5we5pH5Tmi
P6B2DlfUA1fj4Ne4Asx2d7mwOJaZcLHDZoeCs/Haz2Z6kGVEm7ImyAb1h76uNOZo
l0NFhXJGcOQLyjPtQjl81SGjQmntiIN0Poia3528zjxxGXlBNwAwalkOtdJnk5iN
IaYRKtvIcrdjFvunasiKZIsV/O9w3/mguXlrSqBDgUsKPUc/cq5vLfsa70jYGc2j
M6ldJRRqTvSjkVXc/7SXv4GLt/qbUWa92ESzhZQXEABIjZJUnOZpZuLmSVdRUyZk
+SaGvbMphE0U/4ps3pO/
=wViN
-----END PGP SIGNATURE-----
Merge tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:
"Here are fixes for returning EFSCORRUPTED on probe of a non-xfs
filesystem, the stack switch in xfs_bmapi_allocate, a crash in
_xfs_buf_find, speculative preallocation as the filesystem nears
ENOSPC, an unmount hang, a race with AIO, and a regression with
xfs_fsr:
- fix return value when filesystem probe finds no XFS magic, a
regression introduced in 9802182.
- fix stack switch in __xfs_bmapi_allocate by moving the check for
stack switch up into xfs_bmapi_write.
- fix oops in _xfs_buf_find by validating that the requested block is
within the filesystem bounds.
- limit speculative preallocation near ENOSPC.
- fix an unmount hang in xfs_wait_buftarg by freeing the
xfs_buf_log_item in xfs_buf_item_unlock.
- fix a possible use after free with AIO.
- fix xfs_swap_extents after removal of xfs_flushinval_pages, a
regression introduced in commit fb59581404a."
* tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs:
xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
xfs: Fix possible use-after-free with AIO
xfs: fix shutdown hang on invalid inode during create
xfs: limit speculative prealloc near ENOSPC thresholds
xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
xfs: pull up stack_switch check into xfs_bmapi_write
xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
In func svc_export_parse, the uuid which used kmemdup to alloc will be
changed in func export_update.So the later kfree don't free this memory.
And it can't be free in func svc_export_parse because other place still
used.So put this operation in func svc_export_put.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Instead of using a list of buffers to write ahead of the journal
flush, this now uses a list of inodes and calls ->writepages
via filemap_fdatawrite() in order to achieve the same thing. For
most use cases this results in a shorter ordered write list,
as well as much larger i/os being issued.
The ordered write list is sorted by inode number before writing
in order to retain the disk block ordering between inodes as
per the previous code.
The previous ordered write code used to conflict in its assumptions
about how to write out the disk blocks with mpage_writepages()
so that with this updated version we can also use mpage_writepages()
for GFS2's ordered write, writepages implementation. So we will
also send larger i/os from writeback too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The freeze code has not been looked at a lot recently. Upstream has
moved on, and this is an attempt to catch us back up again. There
is a vfs level interface for the freeze code which can be called
from our (obsolete, but kept for backward compatibility purposes)
sysfs freeze interface. This means freezing this way vs. doing it
from the ioctl should now work in identical fashion.
As a result of this, the freeze function is only called once
and we can drop our own special purpose code for counting the
number of freezes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The locking in gfs2_attach_bufdata() was type specific (data/meta)
which made the function rather confusing. This patch moves the core
of gfs2_attach_bufdata() into trans.c renaming it gfs2_alloc_bufdata()
and moving the locking into gfs2_trans_add_data()/gfs2_trans_add_meta()
As a result all of the locking related to adding data and metadata to
the journal is now in these two functions. This should help to clarify
what is going on, and give us some opportunities to simplify in
some cases.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch copies the body of gfs2_trans_add_bh into the two newly
added gfs2_trans_add_data and gfs2_trans_add_meta functions. We can
then move the .lo_add functions from lops.c into trans.c and call
them directly.
As a result of this, we no longer need to use the .lo_add functions
at all, so that is removed from the log operations structure.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There is little common content in gfs2_trans_add_bh() between the data
and meta classes by the time that the functions which it calls are
taken into account. The intent here is to split this into two
separate functions. Stage one is to introduce gfs2_trans_add_data()
and gfs2_trans_add_meta() and update the callers accordingly.
Later patches will then pull in the content of gfs2_trans_add_bh()
and its dependent functions in order to clean up the code in this
area.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This moves the lo_add function for revokes into trans.c, removing
a function call and making the code easier to read.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This breaks out the LRU scanning function from the shrinker in
preparation for adding other callers to the LRU scanner.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
brelse() and ext4_journal_force_commit() are both inlined and able
to handle NULL.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
After commit 978fef9 (create __ext4_insert_dentry for dir entry
insertion), 'reclen' is not used anymore.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Commit b0336e8d (ext4: calculate and verify checksums of directory
leaf blocks) and commit dbe89444 (ext4: Calculate and verify checksums
for htree nodes) forget to release buffer when checksum failed, at
some places.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
In two places we call WARN_ON() before we print out the debug message,
however we agreed that the WARN_ON() is unnecessary at those places so
remove them.
Also use ext4_warning() instead of ext4_msg() and printk().
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Remove unused variable flags from dump_completed_IO(). The code is
only exercised when EXT4FS_DEBUG is defined.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
So far ext4_writepage() skipped writing pages that had any delayed or
unwritten buffers attached. When blocksize < pagesize this breaks
data=ordered mode guarantees as we can have a page with one freshly
allocated buffer whose allocation is part of the committing
transaction and another buffer in the page which is delayed or
unwritten. So fix this problem by calling ext4_bio_writepage()
anyway. It will submit mapped buffers and leave others alone.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
So far ext4_bio_writepage() unconditionally cleared dirty bit on all
buffers underlying the page. That implicitely assumes we can write all
buffers. So far that is true because callers call into
ext4_bio_writepage() make sure all buffers in the page are mapped but:
a) it's a data corruption bug waiting to happen
b) in data=ordered mode when blocksize < pagesize we do need to write
pages that may have only some of dirty buffers mapped.
So change ext4_bio_writepage() to skip buffers that cannot be written without
clearing their dirty bit.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit fb59581404 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and truncate_pagecache_range() directly.
But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'
Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Commit fb59581404 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and truncate_pagecache_range() directly.
But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'
Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.
CC: xfs@oss.sgi.com
CC: Ben Myers <bpm@sgi.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When the new inode verify in xfs_iread() fails, the create
transaction is aborted and a shutdown occurs. The subsequent unmount
then hangs in xfs_wait_buftarg() on a buffer that has an elevated
hold count. Debug showed that it was an AGI buffer getting stuck:
[ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
The trace of this buffer leading up to the shutdown (trimmed for
brevity) looks like:
xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
And that is the AGI buffer from cold cache read into memory to
transaction abort. You can see at transaction abort the bli is dirty
and only has a single reference. The item is not pinned, and it's
not in the AIL. Hence the only reference to it is this transaction.
The problem is that the xfs_buf_item_unlock() call is dropping the
last reference to the xfs_buf_log_item attached to the buffer (which
holds a reference to the buffer), but it is not freeing the
xfs_buf_log_item. Hence nothing will ever release the buffer, and
the unmount hangs waiting for this reference to go away.
The fix is simple - xfs_buf_item_unlock needs to detect the last
reference going away in this case and free the xfs_buf_log_item to
release the reference it holds on the buffer.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
There is a window on small filesytsems where specualtive
preallocation can be larger than that ENOSPC throttling thresholds,
resulting in specualtive preallocation trying to reserve more space
than there is space available. This causes immediate ENOSPC to be
triggered, prealloc to be turned off and flushing to occur. One the
next write (i.e. next 4k page), we do exactly the same thing, and so
effective drive into synchronous 4k writes by triggering ENOSPC
flushing on every page while in the window between the prealloc size
and the ENOSPC prealloc throttle threshold.
Fix this by checking to see if the prealloc size would consume all
free space, and throttle it appropriately to avoid premature
ENOSPC...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When _xfs_buf_find is passed an out of range address, it will fail
to find a relevant struct xfs_perag and oops with a null
dereference. This can happen when trying to walk a filesystem with a
metadata inode that has a partially corrupted extent map (i.e. the
block number returned is corrupt, but is otherwise intact) and we
try to read from the corrupted block address.
In this case, just fail the lookup. If it is readahead being issued,
it will simply not be done, but if it is real read that fails we
will get an error being reported. Ideally this case should result
in an EFSCORRUPTED error being reported, but we cannot return an
error through xfs_buf_read() or xfs_buf_get() so this lookup failure
may result in ENOMEM or EIO errors being reported instead.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The stack_switch check currently occurs in __xfs_bmapi_allocate,
which means the stack switch only occurs when xfs_bmapi_allocate()
is called in a loop. Pull the check up before the loop in
xfs_bmapi_write() such that the first iteration of the loop has
consistent behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
9802182 changed the return value from EWRONGFS (aka EINVAL)
to EFSCORRUPTED which doesn't seem to be handled properly by
the root filesystem probe.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The argument b_size of mpage_add_bh_to_extent() was bogus since it was
always == blocksize (which we can easily derive from inode->i_blkbits).
Also second branch of condition:
if (nrblocks >= EXT4_MAX_TRANS_DATA) {
} else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) >
EXT4_MAX_TRANS_DATA) {
}
was never taken because (b_size >> mpd->inode->i_blkbits) == 1.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io()
doesn't have to deal with the case when page doesn't have buffers. We
attach buffers to a page in ->write_begin() and ->page_mkwrite() which
covers all places where a page can become dirty.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The function splices i_completed_io_list to its private list
first. From that moment on we don't need any lock for working with
io_end structures because all io_end structure on the list are only
our own. So we can remove the other two lists in the function and free
io_end immediately after we are done with it.
CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
It does not make much sense to have struct work in ext4_io_end_t
because we always use it for only one ext4_io_end_t per inode (the
first one in the i_completed_io list). So just move the structure to
inode itself. This also allows for a small simplification in
processing io_end structures.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We don't support delayed allocation in data=journal mode. So checking for it in
mpage_da_submit_io() doesn't make really sence. If we ever decide to extend
delayed allocation support to data=journal mode, adding
__ext4_journalled_writepage() call will be the least of problems we have to
solve. Most likely we'd have to implement separate writepages call anyways
because we don't have transaction credits for writing more than a single page
so mapping of page buffers would have to be done differently.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we cannot write a page we should use redirty_page_for_writepage()
instead of plain set_page_dirty(). That tells writeback code we have
problems, redirties only the page (redirtying buffers is not needed),
and updates mm accounting of failed page writes.
Also move clearing of buffer dirty flag after io_submit_add_bh(). At that
moment we are sure buffer will be going to disk.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently we sometimes used block_write_full_page() and sometimes
ext4_bio_write_page() for writeback (depending on mount options and call
path). Let's always use ext4_bio_write_page() to simplify things a bit.
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch add supports for indirect file support punching hole. It
is almost the same as ext4_ext_punch_hole. First, we invalidate all
pages between this hole, and then we try to deallocate all blocks of
this hole.
A recursive function is used to handle deallocation of blocks. In
this function, it iterates over the entries in inode's i_blocks or
indirect blocks, and try to free the block for each one of them.
After applying this patch, xfstest #255 will not pass w/o extent because
indirect-based file doesn't support unwritten extents.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The recent commit fb6791d100
included the wrong logic. The lvbptr check was incorrectly
added after the patch was tested.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We do need to start the lease recovery thread prior to waiting for the
client initialisation to complete in NFSv4.1.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ben Greear <greearb@candelatech.com>
Cc: stable@vger.kernel.org [>=3.7]
If walking the list in nfs4[01]_walk_client_list fails, then the most
likely explanation is that the server dropped the clientid before we
actually managed to confirm it. As long as our nfs_client is the very
last one in the list to be tested, the caller can be assured that this
is the case when the final return value is NFS4ERR_STALE_CLIENTID.
Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org [>=3.7]
Tested-by: Ben Greear <greearb@candelatech.com>
The reference counting in nfs4_init_client assumes wongly that it
is safe for nfs4_discover_server_trunking() to return a pointer to a
nfs_client prior to bumping the reference count.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ben Greear <greearb@candelatech.com>
Cc: stable@vger.kernel.org [>=3.7]
Currently, nfs_xdev_mount converts all errors from clone_server() to
ENOMEM, which can then leak to userspace (for instance to 'mount'). Fix that.
Also ensure that if nfs_fs_mount_common() returns an error, we
don't dprintk(0)...
The regression originated in commit 3d176e3fe4
(NFS: Use nfs_fs_mount_common() for xdev mounts)
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.5]
This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
There is no backing store to ramfs and file creation
rules are the same as for any other filesystem so
it is semantically safe to allow unprivileged users
to mount it.
The memory control group successfully limits how much
memory ramfs can consume on any system that cares about
a user namespace root using ramfs to exhaust memory
the memory control group can be deployed.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- The context in which devpts is mounted has no effect on the creation
of ptys as the /dev/ptmx interface has been used by unprivileged
users for many years.
- Only support unprivileged mounts in combination with the newinstance
option to ensure that mounting of /dev/pts in a user namespace will
not allow the options of an existing mount of devpts to be modified.
- Create /dev/pts/ptmx as the root user in the user namespace that
mounts devpts so that it's permissions to be changed.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.
CC: xfs@oss.sgi.com
CC: Ben Myers <bpm@sgi.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When the new inode verify in xfs_iread() fails, the create
transaction is aborted and a shutdown occurs. The subsequent unmount
then hangs in xfs_wait_buftarg() on a buffer that has an elevated
hold count. Debug showed that it was an AGI buffer getting stuck:
[ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
The trace of this buffer leading up to the shutdown (trimmed for
brevity) looks like:
xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
And that is the AGI buffer from cold cache read into memory to
transaction abort. You can see at transaction abort the bli is dirty
and only has a single reference. The item is not pinned, and it's
not in the AIL. Hence the only reference to it is this transaction.
The problem is that the xfs_buf_item_unlock() call is dropping the
last reference to the xfs_buf_log_item attached to the buffer (which
holds a reference to the buffer), but it is not freeing the
xfs_buf_log_item. Hence nothing will ever release the buffer, and
the unmount hangs waiting for this reference to go away.
The fix is simple - xfs_buf_item_unlock needs to detect the last
reference going away in this case and free the xfs_buf_log_item to
release the reference it holds on the buffer.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The most convenient way to expose ACPI power resources lists of a
device is to put symbolic links to sysfs directories representing
those resources into special attribute groups in the device's sysfs
directory. For this purpose, it is necessary to be able to add
symbolic links to attribute groups.
For this reason, add sysfs helper functions for adding/removing
symbolic links to/from attribute groups, sysfs_add_link_to_group()
and sysfs_remove_link_from_group(), respectively.
This change set includes a build fix from David Rientjes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull btrfs fixes from Chris Mason:
"It turns out that we had two crc bugs when running fsx-linux in a
loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
all down. Miao also has a new OOM fix in this v2 pull as well.
Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
and resuming a running balance across drives.
Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.
Arne's patches address problems with subvolume quotas. If the user
destroys quota groups incorrectly the FS will refuse to mount.
The rest are smaller fixes and plugs for memory leaks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
Btrfs: fix repeated delalloc work allocation
Btrfs: fix wrong max device number for single profile
Btrfs: fix missed transaction->aborted check
Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
Btrfs: put csums on the right ordered extent
Btrfs: use right range to find checksum for compressed extents
Btrfs: fix panic when recovering tree log
Btrfs: do not allow logged extents to be merged or removed
Btrfs: fix a regression in balance usage filter
Btrfs: prevent qgroup destroy when there are still relations
Btrfs: ignore orphan qgroup relations
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Btrfs: fix unlock order in btrfs_ioctl_resize
Btrfs: fix "mutually exclusive op is running" error code
Btrfs: bring back balance pause/resume logic
btrfs: update timestamps on truncate()
btrfs: fix btrfs_cont_expand() freeing IS_ERR em
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
Btrfs: fix off-by-one in lseek
...
When usrjquota or grpjquota mount options are specified several times,
we leak memory storing the names. Free the memory correctly.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
In addition, print the error returned from ext4_enable_quotas()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Cc: stable@vger.kernel.org
Pull cifs fixes from Steve French:
"Two small cifs fixes"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
fs/cifs/cifs_dfs_ref.c: fix potential memory leakage
cifs: fix srcip_matches() for ipv6
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
btrfs_queue_worker for this inode, and then it locks the list, checks the
head of the list again. But because we don't delete the first inode that it
deals with before, it will fetch the same inode. As a result, this function
allocates a huge amount of btrfs_delalloc_work structures, and OOM happens.
Fix this problem by splice this delalloc list.
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The max device number of single profile is 1, not 0 (0 means 'as many as
possible'). Fix it.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
First, though the current transaction->aborted check can stop the commit early
and avoid unnecessary operations, it is too early, and some transaction handles
don't end, those handles may set transaction->aborted after the check.
Second, when we commit the transaction, we will wake up some worker threads to
flush the space cache and inode cache. Those threads also allocate some transaction
handles and may set transaction->aborted if some serious error happens.
So we need more check for ->aborted when committing the transaction. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We may access and update transaction->aborted on the different CPUs without
lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating
unsolicited accesses and make sure we can get the right value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed a WARN_ON going off when adding csums because we were going over
the amount of csum bytes that should have been allowed for an ordered
extent. This is a leftover from when we used to hold the csums privately
for direct io, but now we use the normal ordered sum stuff so we need to
make sure and check if we've moved on to another extent so that the csums
are added to the right extent. Without this we could end up with csums for
bytenrs that don't have extents to cover them yet. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For compressed extents, the range of checksum is covered by disk length,
and the disk length is different with ram length, so we need to use disk
length instead to get us the right checksum.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a BUG_ON(ret) that occured during tree log replay. Ret was
-EAGAIN, so what I think happened is that we removed an extent that covered
a bitmap entry and an extent entry. We remove the part from the bitmap and
return -EAGAIN and then search for the next piece we want to remove, which
happens to be an entire extent entry, so we just free the sucker and return.
The problem is ret is still set to -EAGAIN so we trip the BUG_ON(). The
user used btrfs-zero-log so I'm not 100% sure this is what happened so I've
added a WARN_ON() to catch the other possibility. Thanks,
Reported-by: Jan Steffens <jan.steffens@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is a window on small filesytsems where specualtive
preallocation can be larger than that ENOSPC throttling thresholds,
resulting in specualtive preallocation trying to reserve more space
than there is space available. This causes immediate ENOSPC to be
triggered, prealloc to be turned off and flushing to occur. One the
next write (i.e. next 4k page), we do exactly the same thing, and so
effective drive into synchronous 4k writes by triggering ENOSPC
flushing on every page while in the window between the prealloc size
and the ENOSPC prealloc throttle threshold.
Fix this by checking to see if the prealloc size would consume all
free space, and throttle it appropriately to avoid premature
ENOSPC...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When _xfs_buf_find is passed an out of range address, it will fail
to find a relevant struct xfs_perag and oops with a null
dereference. This can happen when trying to walk a filesystem with a
metadata inode that has a partially corrupted extent map (i.e. the
block number returned is corrupt, but is otherwise intact) and we
try to read from the corrupted block address.
In this case, just fail the lookup. If it is readahead being issued,
it will simply not be done, but if it is real read that fails we
will get an error being reported. Ideally this case should result
in an EFSCORRUPTED error being reported, but we cannot return an
error through xfs_buf_read() or xfs_buf_get() so this lookup failure
may result in ENOMEM or EIO errors being reported instead.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
__fuse_direct_io() allocates fuse-requests by calling fuse_get_req(fc, n). The
patch calculates 'n' based on iov[] array. This is useful because allocating
FUSE_MAX_PAGES_PER_REQ page pointers and descriptors for each fuse request
would be waste of memory in case of iov-s of smaller size.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Let fuse_get_user_pages() pack as many iov-s to a single fuse_req as
possible. This is very beneficial in case of iov[] consisting of many
iov-s of relatively small sizes (e.g. PAGE_SIZE).
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch makes preliminary work for the next patch optimizing scatter-gather
direct IO. The idea is to allow fuse_get_user_pages() to pack as many iov-s
to each fuse request as possible. So, here we only rework all related
call-paths to carry iov[] from fuse_direct_IO() to fuse_get_user_pages().
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Previously, anyone who set flag 'argpages' only filled req->pages[] and set
per-request page_offset. This patch re-works all cases where argpages=1 to
fill req->page_descs[] properly.
Having req->page_descs[] filled properly allows to re-work fuse_copy_pages()
to copy page fragments described by req->page_descs[]. This will be useful
for next patches optimizing direct_IO.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The ability to save page pointers along with lengths and offsets in fuse_req
will be useful to cover several iovec-s with a single fuse_req.
Per-request page_offset is removed because anybody who need it can use
req->page_descs[0].offset instead.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fuse_do_ioctl() already calculates the number of pages it's going to use. It is
stored in 'num_pages' variable. So the patch simply uses it for allocating
fuse_req.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The patch allocates as many page pointers in fuse_req as needed to cover
interval [pos .. pos+len-1]. Inline helper fuse_wr_pages() is introduced
to hide this cumbersome arithmetic.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>