This should be a bitwise negate here. It silences a Sparse warning:
fs/nfsd/nfs4xdr.c:693:16: warning: dubious: x & !y
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Current ore_check_io API receives a residual
pointer, to report partial IO. But it is actually
not used, because in a multiple devices IO there
is never a linearity in the IO failure.
On the other hand if every failing device is reported
through a received callback measures can be taken to
handle only failed devices. One at a time.
This will also be needed by the objects-layout-driver
for it's error reporting facility.
Exofs is not currently using the new information and
keeps the old behaviour of failing the complete IO in
case of an error. (No partial completion)
TODO: Use an ore_check_io callback to set_page_error only
the failing pages. And re-dirty write pages.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
All users of the ore will need to check if current code
supports the given layout. For example RAID5/6 is not
currently supported.
So move all the checks from exofs/super.c to a new
ore_verify_layout() to be used by ore users.
Note that any new layout should be passed through the
ore_verify_layout() because the ore engine will prepare
and verify some internal members of ore_layout, and
assumes it's called.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Users like the objlayout-driver would like to only pass
a partial device table that covers the IO in question.
For example exofs divides the file into raid-group-sized
chunks and only serves group_width number of devices at
a time.
The partiality is communicated by setting
ore_componets->first_dev and the array covers all logical
devices from oc->first_dev upto (oc->first_dev + oc->numdevs)
The ore_comp_dev() API receives a logical device index
and returns the actual present device in the table.
An out-of-range dev_index will BUG.
Logical device index is the theoretical device index as if
all the devices of a file are present. .i.e:
total_devs = group_width * mirror_p1 * group_count
0 <= dev_index < total_devs
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Memory conditions and max_bio constraints might cause us to
not comply to the full length of the requested IO. Instead of
failing the complete IO we can issue a shorter read/write and
report how much was actually executed in the ios->length
member.
All users must check ios->length at IO_done or upon return of
ore_read/write and re-issue the reminder of the bytes. Because
other wise there is no error returned like before.
This is part of the effort to support the pnfs-obj layout driver.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
If at read/write_done the actual IO was shorter then requested,
reported in returned ios->length. It is not an error. The reminder
of the pages should just be unlocked but not marked uptodate or
end_page_writeback. They will be re issued later by the VFS.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Move the check and preparation of the ios->kern_buff case to
later inside _write_mirror().
Since read was never used with ios->kern_buff its support is removed
instead of fixed.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Now that each ore_io_state covers only a single raid group.
A single striping_info math is needed. Embed one inside
ore_io_state to cache the calculation results and eliminate
an extra call.
Also the outer _prepare_for_striping is removed since it does nothing.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Usually a single IO is confined to one group of devices
(group_width) and at the boundary of a raid group it can
spill into a second group. Current code would allocate a
full device_table size array at each io_state so it can
comply to requests that span two groups. Needless to say
that is very wasteful, specially when device_table count
can get very large (hundreds even thousands), while a
group_width is usually 8 or 10.
* Change ore API to trim on IO that spans two raid groups.
The user passes offset+length to ore_get_rw_state, the
ore might trim on that length if spanning a group boundary.
The user must check ios->length or ios->nrpages to see
how much IO will be preformed. It is the responsibility
of the user to re-issue the reminder of the IO.
* Modify exofs To copy spilled pages on to the next IO.
This means one last kick is needed after all coalescing
of pages is done.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: revert to using a kthread for AIL pushing
xfs: force the log if we encounter pinned buffers in .iop_pushbuf
xfs: do not update xa_last_pushed_lsn for locked items
that let us do local lock checks before requesting to the server.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
Rename it for better clarity as to what it does and have the caller pass
in just the single type byte. Turn the if statement into a switch and
optimize it by placing the most common message type at the top. Move the
header length check back into cifs_demultiplex_thread in preparation
for adding a new receive phase and normalize the cFYI messages.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Split cifs_lock into several functions and let CIFSSMBLock get pid
as an argument.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
..the length field has only 17 bits.
Cc: <stable@kernel.org>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
Move the iovec handling entirely into read_from_socket. That simplifies
the code and gets rid of the special handling for header reads. With
this we can also get rid of the "goto incomplete_rcv" label in the main
demultiplex thread function since we can now treat header and non-header
receives the same way.
Also, make it return an int (since we'll never receive enough to worry
about the sign bit anyway), and simply make it return the amount of bytes
read or a negative error code.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Add data structures and functions necessary to map a uid and gid to SID.
These functions are very similar to the ones used to map a SID to uid and gid.
This time, instead of storing sid to id mapping sorted on a sid value,
id to sid is stored, sorted on an id.
A cifs upcall sends an id (uid or gid) and expects a SID structure
in return, if mapping was done successfully.
A failed id to sid mapping to EINVAL.
This patchset aims to enable chown and chgrp commands when
cifsacl mount option is specified, especially to Windows SMB servers.
Currently we can't do that. So now along with chmod command,
chown and chgrp work.
Winbind is used to map id to a SID. chown and chgrp use an upcall
to provide an id to winbind and upcall returns with corrosponding
SID if any exists. That SID is used to build security descriptor.
The DACL part of a security descriptor is not changed by either
chown or chgrp functionality.
cifs client maintains a separate caches for uid to SID and
gid to SID mapping. This is similar to the one used earlier
to map SID to id (as part of ID mapping code).
I tested it by mounting shares from a Windows (2003) server by
authenticating as two users, one at a time, as Administrator and
as a ordinary user.
And then attempting to change owner of a file on the share.
Depending on the permissions/privileges at the server for that file,
chown request fails to either open a file (to change the ownership)
or to set security descriptor.
So it all depends on privileges on the file at the server and what
user you are authenticated as at the server, cifs client is just a
conduit.
I compared the security descriptor during chown command to that
what smbcacls sends when it is used with -M OWNNER: option
and they are similar.
This patchset aim to enable chown and chgrp commands when
cifsacl mount option is specified, especially to Windows SMB servers.
Currently we can't do that. So now along with chmod command,
chown and chgrp work.
I tested it by mounting shares from a Windows (2003) server by
authenticating as two users, one at a time, as Administrator and
as a ordinary user.
And then attempting to change owner of a file on the share.
Depending on the permissions/privileges at the server for that file,
chown request fails to either open a file (to change the ownership)
or to set security descriptor.
So it all depends on privileges on the file at the server and what
user you are authenticated as at the server, cifs client is just a
conduit.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Suresh had a typo in his recent patch adding information on
the new oplock_endabled parm. Should be documented as in
directory /sys/module/cifs/parameters not /proc/module/cifs/parameters
Signed-off-by: Steve French <smfrench@gmail.com>
Add mount options backupuid and backugid.
It allows an authenticated user to access files with the intent to back them
up including their ACLs, who may not have access permission but has
"Backup files and directories user right" on them (by virtue of being part
of the built-in group Backup Operators.
When mount options backupuid is specified, cifs client restricts the
use of backup intents to the user whose effective user id is specified
along with the mount option.
When mount options backupgid is specified, cifs client restricts the
use of backup intents to the users whose effective user id belongs to the
group id specified along with the mount option.
If an authenticated user is not part of the built-in group Backup Operators
at the server, access to such files is denied, even if allowed by the client.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
The plan is to deprecate this interface by kernel version 3.4.
Changes since v1
- add a '\n' to the printk.
Reported-by: Alexander Swen <alex@swen.nu>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
Reported-by: Alexander Swen <alex@swen.nu>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
Thus spake Jeff Layton:
"Making that a module parm would allow you to set that parameter at boot
time without needing to add special startup scripts. IMO, all of the
procfile "switches" under /proc/fs/cifs should be module parms
instead."
This patch doesn't alter the default behavior (Oplocks are enabled by
default).
To disable oplocks when loading the module, use
modprobe cifs enable_oplocks=0
(any of '0' or 'n' or 'N' conventions can be used).
To disable oplocks at runtime using the new interface, use
echo 0 > /sys/module/cifs/parameters/enable_oplocks
The older /proc/fs/cifs/OplockEnabled interface will be deprecated
after two releases. A subsequent patch will add an warning message
about this deprecation.
Changes since v2:
- make enable_oplocks a 'bool'
Changes since v1:
- eliminate the use of extra variable by renaming the old one to
enable_oplocks and make it an 'int' type.
Reported-by: Alexander Swen <alex@swen.nu>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
If the server stops sending data while in the middle of sending a
response then we still want to reconnect it if it doesn't come back.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
If msg_controllen is 0, then the socket layer should never touch these
fields. Thus, there's no need to continually reset them. Also, there's
no need to keep this field on the stack for the demultiplex thread, just
make it a local variable in read_from_socket.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
We have two versions of signature generating code. A vectorized and
non-vectorized version. Eliminate a large chunk of cut-and-paste
code by turning the non-vectorized version into a wrapper around the
vectorized one.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
The variable names in this function are so ambiguous that it's very
difficult to know what it's doing. Rename them to make it a bit more
clear.
Also, remove a redundant length check. cifsd checks to make sure that
the rfclen isn't larger than the maximum frame size when it does the
receive.
Finally, change checkSMB to return a real error code (-EIO) when
it finds an error. That will help simplify some coming changes in the
callers.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
server->maxBuf is the maximum SMB size (including header) that the
server can handle. CIFSMaxBufSize is the maximum amount of data (sans
header) that the client can handle. Currently maxBuf is being capped at
CIFSMaxBufSize + the max headers size, and the two values are used
somewhat interchangeably in the code.
This makes little sense as these two values are not related at all.
Separate them and make sure the code uses the right values in the right
places.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
It should be 'CONFIG_CIFS_NFSD_EXPORT'. No-one noticed because that
symbol depends on BROKEN.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Steve French <smfrench@gmail.com>
...as that's more efficient when we know that the lengths are equal.
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Currently pstore write interface employs record id as return
value, but it is not enough because it can't tell caller if
the write operation is successful. Pass the record id back via
an argument pointer and return zero for success, non-zero for
failure.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The result from ipv6_addr_scope() is a set of flags, not a single value,
so we can't just compare the result with IPV6_ADDR_SCOPE_LINKLOCAL.
This patch fixs the problem, and checks for unequal addresses before
scope_id.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When we call xfs_flush_buftarg (generally from sync or umount) it already
is too late to flush the data workqueues, as I/O completion is signalled
for them and we are thus already done with the data we would flush here.
There are places where flushing them might be useful, but the current
sync interface doesn't give us that opportunity.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The calling convention that returns a pointer to a static buffer is
fairly nasty, so just opencode it in the only caller that is left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Use xfs_ioerror_alert instead of opencoding a very similar error
message.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Instead of passing the block number and mount structure explicitly
get them off the bp and fix make the argument order more natural.
Also move it to xfs_buf.c and stop printing the device name given
that we already get the fs name as part of xfs_alert, and we know
what device is operates on because of the caller that gets printed,
finally rename it to xfs_buf_ioerror_alert and pass __func__ as
argument where it makes sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Change _xfs_buf_initialize to allocate the buffer directly and rename it to
xfs_buf_alloc now that is the only buffer allocation routine. Also remove
the xfs_buf_deallocate wrapper around the kmem_zone_free calls for buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
For each call to xfs_buf_stale we call xfs_buf_delwri_dequeue either
directly before or after it, or are guaranteed by the surrounding
conditionals that we are never called on delwri buffers. Simply
this situation by moving the call to xfs_buf_delwri_dequeue into
xfs_buf_stale.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The code is unused and under a config option that doesn't exist, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The code to flush buffers in the umount code is a bit iffy: we first
flush all delwri buffers out, but then might be able to queue up a
new one when logging the sb counts. On a normal shutdown that one
would get flushed out when doing the synchronous superblock write in
xfs_unmountfs_writesb, but we skip that one if the filesystem has
been shut down.
Fix this by moving the delwri list flushing until just before unmounting
the log, and while we're at it also remove the superflous delwri list
and buffer lru flusing for the rt and log device that can never have
cached or delwri buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Tested-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Directories are only updated transactionally, which means fsync only
needs to flush the log the inode is currently dirty, but not bother
with checking for dirty data, non-transactional updates, and most
importanly doesn't have to flush disk caches except as part of a
transaction commit.
While the first two optimizations can't easily be measured, the
latter actually makes a difference when doing lots of fsync that do
not actually have to commit the inode, e.g. because an earlier fsync
already pushed the log far enough.
The new xfs_dir_fsync is identical to xfs_nfs_commit_metadata except
for the prototype, but I'm not sure creating a common helper for the
two is worth it given how simple the functions are.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The AIL push code will issue a log force on ever single push loop
that it exits and has encountered pinned items. It doesn't rescan
these pinned items until it revisits the AIL from the start. Hence
we only need to force the log once per walk from the start of the
AIL to the target LSN.
This results in numbers like this:
xs_push_ail_flush..... 1456
xs_log_force......... 1485
For an 8-way 50M inode create workload - almost all the log forces
are coming from the AIL pushing code.
Reduce the number of log forces by only forcing the log if the
previous walk found pinned buffers. This reduces the numbers to:
xs_push_ail_flush..... 665
xs_log_force......... 682
For the same test.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Stats show that for an 8-way unlink @ ~80,000 unlinks/s we are doing
~1 million cache hit lookups to ~3000 buffer creates. That's almost
3 orders of magnitude more cahce hits than misses, so optimising for
cache hits is quite important. In the cache hit case, we do not need
to allocate a new buffer in case of a cache miss, so we are
effectively hitting the allocator for no good reason for vast the
majority of calls to _xfs_buf_find. 8-way create workloads are
showing similar cache hit/miss ratios.
The result is profiles that look like this:
samples pcnt function DSO
_______ _____ _______________________________ _________________
1036.00 10.0% _xfs_buf_find [kernel.kallsyms]
582.00 5.6% kmem_cache_alloc [kernel.kallsyms]
519.00 5.0% __memcpy [kernel.kallsyms]
468.00 4.5% __ticket_spin_lock [kernel.kallsyms]
388.00 3.7% kmem_cache_free [kernel.kallsyms]
331.00 3.2% xfs_log_commit_cil [kernel.kallsyms]
Further, there is a fair bit of work involved in initialising a new
buffer once a cache miss has occurred and we currently do that under
the rbtree spinlock. That increases spinlock hold time on what are
heavily used trees.
To fix this, remove the initialisation of the buffer from
_xfs_buf_find() and only allocate the new buffer once we've had a
cache miss. Initialise the buffer immediately after allocating it in
xfs_buf_get, too, so that is it ready for insert if we get another
cache miss after allocation. This minimises lock hold time and
avoids unnecessary allocator churn. The resulting profiles look
like:
samples pcnt function DSO
_______ _____ ___________________________ _________________
8111.00 9.1% _xfs_buf_find [kernel.kallsyms]
4380.00 4.9% __memcpy [kernel.kallsyms]
4341.00 4.8% __ticket_spin_lock [kernel.kallsyms]
3401.00 3.8% kmem_cache_alloc [kernel.kallsyms]
2856.00 3.2% xfs_log_commit_cil [kernel.kallsyms]
2625.00 2.9% __kmalloc [kernel.kallsyms]
2380.00 2.7% kfree [kernel.kallsyms]
2016.00 2.3% kmem_cache_free [kernel.kallsyms]
Showing a significant reduction in time spent doing allocation and
freeing from slabs (kmem_cache_alloc and kmem_cache_free).
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
There is no reason to keep a reference to the inode even if we unlock
it during transaction commit because we never drop a reference between
the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref
back into xfs_trans_ijoin - the third argument decides if an unlock
is needed now.
I'm actually starting to wonder if allowing inodes to be unlocked
at transaction commit really is worth the effort. The only real
benefit is that they can be unlocked earlier when commiting a
synchronous transactions, but that could be solved by doing the
log force manually after the unlock, too.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Let the transaction commit unlock the inode before it potentially causes
a synchronous log force.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Only read the LSN we need to push to with the ilock held, and then release
it before we do the log force to improve concurrency.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Only read the LSN we need to push to with the ilock held, and then release
it before we do the log force to improve concurrency.
This also removes the only direct caller of _xfs_trans_commit, thus
allowing it to be merged into the plain xfs_trans_commit again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
XFS_TRANS_SWAPEXT is a transaction type, not a flag for xfs_trans_commit, so
don't pass it in xfs_swap_extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
In xfs_ioc_trim it is possible that computing the last allocation group
to discard might overflow for big start & len values, because the result
might be bigger then xfs_agnumber_t which is 32 bit long. Fix this by not
allowing the start and end block of the range to be beyond the end of the
file system.
Note that if the start is beyond the end of the file system we have to
return -EINVAL, but in the "end" case we have to truncate it to the fs
size.
Also introduce "end" variable, rather than using start+len which which
might be more confusing to get right as this bug shows.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Convert all function prototypes to the short form used elsewhere,
and remove duplicates of comments already placed at the function
body.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Fix a case in xfs_bmap_add_extent_unwritten_real where we aren't
passing the returned error on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
All the parameters passed to xfs_bmap_add_extent_hole_real() are in
the xfs_bmalloca structure now. Just pass the bmalloca parameter to
the function instead of 8 separate parameters.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
All the parameters passed to xfs_bmap_add_extent_delay_real() are in
the xfs_bmalloca structure now. Just pass the bmalloca parameter to
the function instead of 8 separate parameters.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Rather than passing the firstblock and freelist structure around,
embed it into the bmalloca structure and remove it from the function
parameters.
This also enables the minleft parameter to be set only once in
xfs_bmapi_write(), and the freelist cursor directly queried in
xfs_bmapi_allocate to clear it when the lowspace algorithm is
activated.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Rather that putting extent records on the stack and then pointing to
them in the bmalloca structure which is in the same stack frame, put
the extent records directly in the bmalloca structure. This reduces
the number of args that need to be passed around.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
All the variables xfs_bmap_isaeof() is passed are contained within
the xfs_bmalloca structure. Pass that instead.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
There is no real need to the xfs_bmap_add_extent, as the callers
know what kind of extents they need to it. Removing it means
duplicating the extents to btree conversion logic in three places,
but overall it's still much simpler code and quite a bit less code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Add a common helper for finding the last extent in a file.
Largely based on a patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that all the read-only users of xfs_bmapi have been converted to
use xfs_bmapi_read(), we can remove all the read-only handling cases
from xfs_bmapi().
Once this is done, rename xfs_bmapi to xfs_bmapi_write to reflect
the fact it is for allocation only. This enables us to kill the
XFS_BMAPI_WRITE flag as well.
Also clean up xfs_bmapi_write to the style used in the newly added
xfs_bmapi_read/delay functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
To further improve the readability of xfs_bmapi(), factor the
unwritten extent conversion out into a separate function. This
removes large block of logic from the xfs_bmapi() code loop and
makes it easier to see the operational logic flow for xfs_bmapi().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
To further improve the readability of xfs_bmapi(), factor the extent
allocation out into a separate function. This removes a large block
of logic from the xfs_bmapi() code loop and makes it easier to see
the operational logic flow for xfs_bmapi().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We can just call xfs_bmap_add_extent_hole_delay directly to add a
delayed allocated regions to the extent tree, instead of going
through all the complexities of xfs_bmap_add_extent that aren't
needed for this simple case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Delalloc reservations are much simpler than allocations, so give
them a separate bmapi-level interface. Using the previously added
xfs_bmapi_reserve_delalloc we get a function that is only minimally
more complicated than xfs_bmapi_read, which is far from the complexity
in xfs_bmapi. Also remove the XFS_BMAPI_DELAY code after switching
over the only user to xfs_bmapi_delay.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Move the reservation of delayed allocations, and addition of delalloc
regions to the extent trees into a new helper function. For now
this adds some twisted goto logic to xfs_bmapi, but that will be
cleaned up in the following patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now we have xfs_bmapi_read, there is no need for xfs_bmapi_single().
Change the remaining caller over and kill the function.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
xfs_bmapi() currently handles both extent map reading and
allocation. As a result, the code is littered with "if (wr)"
branches to conditionally do allocation operations if required.
This makes the code much harder to follow and causes significant
indent issues with the code.
Given that read mapping is much simpler than allocation, we can
split out read mapping from xfs_bmapi() and reuse the logic that
we have already factored out do do all the hard work of handling the
extent map manipulations. The results in a much simpler function for
the common extent read operations, and will allow the allocation
code to be simplified in another commit.
Once xfs_bmapi_read() is implemented, convert all the callers of
xfs_bmapi() that are only reading extents to use the new function.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
To further improve the readability of xfs_bmapi(), factor the pure
extent map manipulations out into separate functions. This removes
large blocks of logic from the xfs_bmapi() code loop and makes it
easier to see the operational logic flow for xfs_bmapi().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Instead of using a local variable that needs to updated when we modify
the extent map just check ifp->if_bytes directly where we use it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We already have the worst case blocks reserved, so xfs_icsb_modify_counters
won't fail in xfs_bmap_add_extent_delay_real. In fact we've had an assert
to catch this case since day and it never triggered. So remove the code
to try smaller reservations, and just return the error for that case in
addition to keeping the assert.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Both xfs_bmap_add_extent_hole_delay and xfs_bmap_add_extent_hole_real
already contain code to handle the case where there is no extent to
merge with, which is effectively the same as the code duplicated here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
An attribute of inode can be fetched via xfs_vn_getattr() in XFS.
Currently it returns EIO, not negative value, when it failed. As a
result, the system call returns not negative value even though an
error occured. The stat(2), ls and mv commands cannot handle this
error and do not work correctly.
This patch fixes this bug, and returns -EIO, not EIO when an error
is detected in xfs_vn_getattr().
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Fix the incorrect comment in the header of the function
_xfs_buf_find().
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Check the return value of xfs_trans_get_buf() and fail
appropriately.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Check the return value of xfs_buf_get() and fail appropriately.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Return unwritten extent conversion errors to aio_complete.
Skip both unwritten extent conversion and size updates if we had an
I/O error or the filesystem has been shut down.
Return -EIO to the aio/buffer completion handlers in case of a
forced shutdown.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently a buffered reader or writer can add pages to the pagecache
while we are waiting for the iolock in xfs_file_dio_aio_write. Prevent
this by re-checking mapping->nrpages after we got the iolock, and if
nessecary upgrade the lock to exclusive mode. To simplify this a bit
only take the ilock inside of xfs_file_aio_write_checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently xfs_attr_inactive causes a synchronous transactions if we are
removing a file that has any extents allocated to the attribute fork, and
thus makes XFS extremely slow at removing files with out of line extended
attributes. The code looks a like a relict from the days before the busy
extent list, but with the busy extent list we avoid reusing data and attr
extents that have been freed but not commited yet, so this code is just
as superflous as the synchronous transactions for data blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We now have an i_dio_count filed and surrounding infrastructure to wait
for direct I/O completion instead of i_icount, and we have never needed
to iocount waits for buffered I/O given that we only set the page uptodate
after finishing all required work. Thus remove i_iocount, and replace
the actually needed waits with calls to inode_dio_wait.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The current code relies on the xfs_ioend_wait call later on to make sure
all I/O actually has completed. The xfs_ioend_wait call will go away soon,
so prepare for that by using the waiting filemap function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
There is no reason to queue up ioends for processing in user context
unless we actually need it. Just complete ioends that do not convert
unwritten extents or need a size update from the end_io context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We really shouldn't complete AIO or DIO requests until we have finished
the unwritten extent conversion and size update. This means fsync never
has to pick up any ioends as all work has been completed when signalling
I/O completion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
No driver returns ENODEV from it bio completion handler, not has this
ever been documented. Remove the dead code dealing with it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
And also remove the strange local lock and delwri list pointers in a few
functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the xfs_buf_relse from xfs_bwrite and let the caller handle it to
mirror the delwri and read paths.
Also remove the mount pointer passed to xfs_bwrite, which is superflous now
that we have a mount pointer in the buftarg.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Unify the ways we add buffers to the delwri queue by always calling
xfs_buf_delwri_queue directly. The xfs_bdwrite functions is removed and
opencoded in its callers, and the two places setting XBF_DELWRI while a
buffer is locked and expecting xfs_buf_unlock to pick it up are converted
to call xfs_buf_delwri_queue directly, too. Also replace the
XFS_BUF_UNDELAYWRITE macro with direct calls to xfs_buf_delwri_dequeue
to make the explicit queuing/dequeuing more obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Do not transfer a reference held by the caller to the buffer on the list,
or decrement it in xfs_buf_delwri_queue, but instead grab a new reference
if needed, and let the caller drop its own reference. Also move setting
of the XBF_DELWRI and XBF_ASYNC flags into xfs_buf_delwri_queue, and
only do it if needed. Note that for now xfs_buf_unlock already has
XBF_DELWRI, but that will change in the following patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We can just unlock the buffer in the caller, and the decrement of b_hold
would also be needed in the !unlock, we just never hit that case currently
given that the caller handles that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We cannot ever reach xfs_buf_iorequest for a buffer with XBF_DELWRI set,
given that all write handlers make sure that the buffer is remove from
the delwri queue before, and we never do reads with the XBF_DELWRI flag
set (which the code would not handle correctly anyway).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
For append write workloads, extending the file requires a certain
amount of exclusive locking to be done up front to ensure sanity in
things like ensuring that we've zeroed any allocated regions
between the old EOF and the start of the new IO.
For single threads, this typically isn't a problem, and for large
IOs we don't serialise enough for it to be a problem for two
threads on really fast block devices. However for smaller IO and
larger thread counts we have a problem.
Take 4 concurrent sequential, single block sized and aligned IOs.
After the first IO is submitted but before it completes, we end up
with this state:
IO 1 IO 2 IO 3 IO 4
+-------+-------+-------+-------+
^ ^
| |
| |
| |
| \- ip->i_new_size
\- ip->i_size
And the IO is done without exclusive locking because offset <=
ip->i_size. When we submit IO 2, we see offset > ip->i_size, and
grab the IO lock exclusive, because there is a chance we need to do
EOF zeroing. However, there is already an IO in progress that avoids
the need for IO zeroing because offset <= ip->i_new_size. hence we
could avoid holding the IO lock exlcusive for this. Hence after
submission of the second IO, we'd end up this state:
IO 1 IO 2 IO 3 IO 4
+-------+-------+-------+-------+
^ ^
| |
| |
| |
| \- ip->i_new_size
\- ip->i_size
There is no need to grab the i_mutex of the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking these
locks on every direct IO effective serialises them as taking the IO
lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effective it
prevents dispatch of concurrent direct IO writes to the same inode.
And so you can see that for the third concurrent IO, we'd avoid
exclusive locking for the same reason we avoided the exclusive lock
for the second IO.
Fixing this is a bit more complex than that, because we need to hold
a write-submission local value of ip->i_new_size to that clearing
the value is only done if no other thread has updated it before our
IO completes.....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
There is no need to grab the i_mutex of the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking these
locks on every direct IO effective serialises them as taking the IO
lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effective it
prevents dispatch of concurrent direct IO reads to the same inode.
Fix this by taking the IO lock shared to check the page cache state,
and only then drop it and take the IO lock exclusively if there is
work to be done. Hence for the normal direct IO case, no exclusive
locking will occur.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Joern Engel <joern@logfs.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Commit d39454ffe4 adds a strictcache mount
option. This patch allows the display of this mount option in
/proc/mounts when listing shares mounted with the strictcache mount
option.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Currently we have a few issues with the way the workqueue code is used to
implement AIL pushing:
- it accidentally uses the same workqueue as the syncer action, and thus
can be prevented from running if there are enough sync actions active
in the system.
- it doesn't use the HIGHPRI flag to queue at the head of the queue of
work items
At this point I'm not confident enough in getting all the workqueue flags and
tweaks right to provide a perfectly reliable execution context for AIL
pushing, which is the most important piece in XFS to make forward progress
when the log fills.
Revert back to use a kthread per filesystem which fixes all the above issues
at the cost of having a task struct and stack around for each mounted
filesystem. In addition this also gives us much better ways to diagnose
any issues involving hung AIL pushing and removes a small amount of code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We need to check for pinned buffers even in .iop_pushbuf given that inode
items flush into the same buffers that may be pinned directly due operations
on the unlinked inode list operating directly on buffers. To do this add a
return value to .iop_pushbuf that tells the AIL push about this and use
the existing log force mechanisms to unpin it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
If an item was locked we should not update xa_last_pushed_lsn and thus skip
it when restarting the AIL scan as we need to be able to lock and write it
out as soon as possible. Otherwise heavy lock contention might starve AIL
pushing too easily, especially given the larger backoff once we moved
xa_last_pushed_lsn all the way to the target lsn.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The btrfs file defrag code will loop through the extents and
force COW on them. But there is a concurrent truncate in the middle of
the defrag, it might end up defragging the same range over and over
again.
The problem is that writepage won't go through and do anything on pages
past i_size, so the cow won't happen, so the file will appear to still
be fragmented. defrag will end up hitting the same extents again and
again.
In the worst case, the truncate can actually live lock with the defrag
because the defrag keeps creating new ordered extents which the truncate
code keeps waiting on.
The fix here is to make defrag check for i_size inside the main loop,
instead of just once before the looping starts.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I'd rather put more of these sorts of checks into standardized xdr
decoders for the various types rather than have them cluttering up the
core logic in nfs4proc.c and nfs4state.c.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We don't use WANT bits yet--and sending them can probably trigger a
BUG() further down.
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In response to some review comments, get rid of the somewhat obscure
for-loop with bitops, and improve a comment.
Reported-by: Steve Dickson <steved@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Follow those steps:
# mount -o autodefrag /dev/sda7 /mnt
# dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
# sync
# dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc
and then it'll go into a loop: writeback -> defrag -> writeback ...
It's because writeback writes [8K, 200K] and then writes [0, 8K].
I tried to make writeback know if the pages are dirtied by defrag,
but the patch was a bit intrusive. Here I simply set writeback_index
when we defrag a file.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Rename udf_warning to udf_warn for consistency with normal logging
uses of pr_warn.
Rename function udf_warning to _udf_warn.
Remove __func__ from uses and move __func__ to a new udf_warn
macro that calls _udf_warn.
Add \n's to uses of udf_warn, remove \n from _udf_warn.
Coalesce formats.
Reviewed-by: NamJae Jeon <linkinjeon@gmail.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Rename udf_error to udf_err for consistency with normal logging
uses of pr_err.
Rename function udf_err to _udf_err.
Remove __func__ from uses and move __func__ to a new udf_err
macro that calls _udf_err.
Some of the udf_error uses had \n terminations, some did not so
standardize \n's to udf_err uses, remove \n from _udf_err function.
Coalesce udf_err formats.
One message prefixed with udf_read_super is now prefixed with
udf_fill_super.
Reviewed-by: NamJae Jeon <linkinjeon@gmail.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If there is a problem with a scratched disc or loader, it's valuable to know
which error occurred.
Convert some debug messages to udf_error, neaten those messages too.
Add the calculated tag checksum and the read checksum to error message.
Make udf_error a public function and move the logging prototypes together.
Original-patch-by: NamJae Jeon <linkinjeon@gmail.com>
Reviewed-by: NamJae Jeon <linkinjeon@gmail.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jan Kara <jack@suse.cz>
There are no user of EXT3_IOC32_WAIT_FOR_READONLY and also it is
broken. No one set the set_ro_timer, no one wake up us and our
state is set to TASK_INTERRUPTIBLE not RUNNING. So remove it.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This fixes a bug which was introduced in dd68314ccf. The problem
came from the test of the return value of proc_mkdir which is always
false without procfs, and this would initialization of ext4.
Signed-off-by: Fabrice Jouhaud <yargil@free.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_extent_idx.e_block is __le32, so use le32_to_cpu() in
ext4_ext_search_left().
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There are no users of the EXT4_IOC_WAIT_FOR_READONLY ioctl, and it is
also broken. No one sets the set_ro_timer, no one wakes up us and our
state is set to TASK_INTERRUPTIBLE not RUNNING. So remove it.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The comment describing what ext4_ext_search_right() does is incorrect.
We return 0 in *phys when *logical is the 'largest' allocated block,
not smallest.
Fix a few other typos while we're at it.
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
For a long time now orlov is the default block allocator in the
ext4. It performs better than the old one and no one seems to claim
otherwise so we can safely drop it and make oldalloc and orlov mount
option deprecated.
This is a part of the effort to reduce number of ext4 options hence the
test matrix.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Microsoft has a bug with ntlmv2 that requires use of ntlmssp, but
we didn't get the required information on when/how to use ntlmssp to
old (but once very popular) legacy servers (various NT4 fixpacks
for example) until too late to merge for 3.1. Will upgrade
to NTLMv2 in NTLMSSP in 3.2
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Some of the error path in ext4_fill_super don't release the
resouces properly. So this patch just try to release them
in the right way.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In commit 79a77c5ac, we move ext4_mb_init_backend after the allocation
of s_locality_group to avoid memory leak in error path, but there are
still some other error paths in ext4_mb_init that need to do the same
work. So this patch adds all the error patch for ext4_mb_init. And all
the pointers are reset to NULL in case the caller may double free them.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use mpage_readpages() instead of multiple calls to udf_readpage() to reduce the
CPU utilization and make performance higher.
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This quites the sparse noise:
warning: symbol 'ext3_trim_all_free' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
In the pNFS obj-LD the device table at the layout level needs
to point to a device_cache node, where it is possible and likely
that many layouts will point to the same device-nodes.
In Exofs we have a more orderly structure where we have a single
array of devices that repeats twice for a round-robin view of the
device table
This patch moves to a model that can be used by the pNFS obj-LD
where struct ore_components holds an array of ore_dev-pointers.
(ore_dev is newly defined and contains a struct osd_dev *od
member)
Each pointer in the array of pointers will point to a bigger
user-defined dev_struct. That can be accessed by use of the
container_of macro.
In Exofs an __alloc_dev_table() function allocates the
ore_dev-pointers array as well as an exofs_dev array, in one
allocation and does the addresses dance to set everything pointing
correctly. It still keeps the double allocation trick for the
inodes round-robin view of the table.
The device table is always allocated dynamically, also for the
single device case. So it is unconditionally freed at umount.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
The struct ore_striping_info will be used later in other
structures. And ore_calc_stripe_info as well. Rename them
make struct ore_striping_info public. ore_calc_stripe_info
is still static, will be made public on first use.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
The struct pnfs_osd_data_map data_map member of exofs_sb_info was
never used after mount. In fact all it's members were duplicated
by the ore_layout structure. So just remove the duplicated information.
Also removed some stupid, but perfectly supported, restrictions on
layout parameters. The case where num_devices is not divisible by
mirror_count+1 is perfectly fine since the rotating device view
will eventually use all the devices it can get.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
ore_components already has a comps member so this leads
to things like comps->comps which is annoying. the name oc
was already used in new code. So rename all old usage of
ore_components comps => ore_components oc.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This quiets the following sparse noise:
warning: symbol 'exofs_sync_fs' was not declared. Should it be static?
warning: symbol 'exofs_free_sbi' was not declared. Should it be static?
warning: symbol 'exofs_get_parent' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This quiets the sparse noise:
warning: symbol '_calc_trunk_info' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
One thing puzzled me is that in JBOD case, the per-disk writeout
performance is smaller than the corresponding single-disk case even
when they have comparable bdi_thresh. Tracing shows find that in single
disk case, bdi_writeback is always kept high while in JBOD case, it
could drop low from time to time and correspondingly bdi_reclaimable
could sometimes rush high.
The fix is to watch bdi_reclaimable and kick background writeback as
soon as it goes high. This resembles the global background threshold
but in per-bdi manner. The trick is, as long as bdi_reclaimable does
not go high, bdi_writeback naturally won't go low because
bdi_reclaimable+bdi_writeback ~= bdi_thresh.
With less fluctuated writeback pages, JBOD performance is observed to
increase noticeably in various cases.
vmstat:nr_written values before/after patch:
3.1.0-rc4-wo-underrun+ 3.1.0-rc4-bgthresh3+
------------------------ ------------------------
125596480 +25.9% 158179363 JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
61790815 +110.4% 130032231 JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
58853546 -0.1% 58823828 JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
110159811 +24.7% 137355377 JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
69544762 +10.8% 77080047 JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
50644862 +0.5% 50890006 JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
42677090 +28.0% 54643527 JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
47491324 +13.3% 53785605 JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
52548986 +0.9% 53001031 JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
26783091 +36.8% 36650248 JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
35526347 +14.0% 40492312 JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
44670723 -1.1% 44177606 JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
127996037 +22.4% 156719990 JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
57518856 +3.8% 59677625 JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
51919909 +12.2% 58269894 JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
86410514 +79.0% 154660433 JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
40132519 +38.6% 55617893 JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
48423248 +7.5% 52042927 JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
206041046 +44.1% 296846536 JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
72312903 -19.4% 58272885 JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
50635672 -0.5% 50384787 JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
68308534 +115.7% 147324758 JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
57882933 +14.5% 66269621 JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
52183472 +12.8% 58855181 JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
53788956 +94.2% 104460352 JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
44493342 +35.5% 60298210 JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
42641209 +18.9% 50681038 JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Scrub uses a simple tree-enumeration to bring the relevant portions
of the extent- and csum-tree into the page cache before starting the
scrub-I/O. This is now replaced by using the new readahead-API.
During readahead the scrub is being accounted as paused, so it won't
hold off transaction commits.
This change raises the average disk bandwith utilisation on my test
volume from 70% to 90%. On another volume, the time for a test run
went down from 89s to 43s.
Changes v5:
- reada1/2 are now of type struct reada_control *
Signed-off-by: Arne Jansen <sensille@gmx.net>
This adds the hooks needed for readahead. In the readpage_end_io_hook,
the extent state is checked for the EXTENT_READAHEAD flag. Only in this
case the readahead hook is called, to keep the impact on non-ra as low
as possible.
Additionally, a hook for a failed IO is added, otherwise readahead would
wait indefinitely for the extent to finish.
Changes for v2:
- eliminate race condition
Signed-off-by: Arne Jansen <sensille@gmx.net>
This is the implementation for the generic read ahead framework.
To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its on read pointer and all disks will by utilized in parallel.
Also will no two disks read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.
Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
The readahead had a too simple check to determine if a branch from
a node should be checked or not. It now also records the upper bound
of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock
Changes v3:
- protect root->node by rcu
Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add
Signed-off-by: Arne Jansen <sensille@gmx.net>
Add state information for readahead to btrfs_fs_info and btrfs_device
Changes v2:
- don't wait in radix_trees
- add own set of workers for readahead
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Add a READAHEAD extent buffer flag.
Add a function to trigger a read with this flag set.
Changes v2:
- use extent buffer flags instead of extent state flags
Changes v5:
- adapt to changed read_extent_buffer_pages interface
- don't return eb from reada_tree_block_flagged if it has CORRUPT flag set
Signed-off-by: Arne Jansen <sensille@gmx.net>
read_extent_buffer_pages currently has two modes, either trigger a read
without waiting for anything, or wait for the I/O to finish. The former
also bails when it's unable to lock the page. This patch now adds an
additional parameter to allow it to block on page lock, but don't wait
for completion.
Changes v5:
- merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
WAIT_PAGE_LOCK
Change v6:
- fix bug introduced in v5
Signed-off-by: Arne Jansen <sensille@gmx.net>
A user reported a problem where ceph was getting into 100% cpu usage while doing
some writing. It turns out it's because we were doing a short write on a not
uptodate page, which means we'd fall back at one page at a time and fault the
page in. The problem is our position is on the page boundary, so our fault in
logic wasn't actually reading the page, so we'd just spin forever or until the
page got read in by somebody else. This will force a readpage if we end up
doing a short copy. Alexandre could reproduce this easily with ceph and reports
it fixes his problem. I also wrote a reproducer that no longer hangs my box
with this patch. Thanks,
Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This ties nodatasum fixup in scrub together with raid repair patches. While
both series are working fine alone, scrub will report uncorrectable errors
if they occur in a nodatasum extent *and* the page is in the page cache.
Previously, we would have triggered readpage to find good data and do the
repair. However, readpage wouldn't read anything in the case where the page
is up to date in the cache. So, we simply take that good data we have and
call repair_io_failure directly (unless the page in the cache is dirty).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The raid-retry code in inode.c can be generalized so that it works for
metadata as well. Thus, this patch moves it to extent_io.c and makes the
raid-retry code a raid-repair code.
Repair works that way: Whenever a read error occurs and we have more
mirrors to try, note the failed mirror, and retry another. If we find a
good one, check if we did note a failure earlier and if so, do not allow
the read to complete until after the bad sector was written with the good
data we just fetched. As we have the extent locked while reading, no one
can change the data in between.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The error correction code wants to make sure that only the bad mirror is
rewritten. Thus, we need to know which mirror is the bad one. I did not
find a more apropriate field than bi_bdev. But I think using this is fine,
because it is modified by the block layer, anyway, and should not be read
after the bio returned.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The block layer modifies bio->bi_bdev and bio->bi_sector while working on
the bio, they do _not_ come back unmodified in the completion callback.
To call add_page, we need at least some bi_bdev set, which is why the code
was working, previously. With this patch, we use the latest_bdev from
fsinfo instead of the leftover in the bio. This gives us the possibility to
use the bi_bdev field for another purpose.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
btrfs_bio is a bio abstraction able to split and not complete after the last
bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
tracks the mirror_num used to read data which can be used for error
correction purposes.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
these ioctls make use of the new functions initially added for scrub. they
return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and
all paths belonging to an inode (BTRFS_IOC_INO_PATHS).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This removes a FIXME comment and introduces the first part of nodatasum
fixup: It gets the corresponding inode for a logical address and triggers a
regular readpage for the corrupted sector.
Once we have on-the-fly error correction our error will be automatically
corrected. The correction code is expected to clear the newly introduced
EXTENT_DAMAGED flag, making scrub report that error as "corrected" instead
of "uncorrectable" eventually.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Fix the mirror_num determination in scrub_stripe. The rest of the scrub code
did not use mirror_num for anything important and that error went unnoticed.
The nodatasum fixup patch of this set depends on a correct mirror_num.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
While scrubbing, we may encounter various errors. Previously, a logical
address was printed to the log only. Now, all paths belonging to that
address are resolved and printed separately. That should work for hardlinks
as well as reflinks.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In normal operation, scrub is reading data sequentially in large portions.
In case of an i/o error, we try to find the corrupted area(s) by issuing
page sized read requests. With this commit we increment the
unverified_errors counter if all of the small size requests succeed.
Userland patches carrying such conspicous events to the administrator should
already be around.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
These helper functions iterate back references and call a function for each
backref. There is also a function to resolve an inode to a path in the
file system.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In cleanup_volume_info_contents() we kfree(volume_info->UNC); and then
proceed to use that variable on the very next line.
This causes (at least) Coverity Prevent to complain about use-after-free
of that variable (and I guess other checkers may do that as well).
There's not any /real/ problem here since we are just using the value of
the pointer, not actually dereferencing it, but it's still trivial to
silence the tool, so why not?
To me at least it also just seems nicer to defer freeing the variable
until we are entirely done with it in all respects.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
There are numerous broken references to Documentation files (in other
Documentation files, in comments, etc.). These broken references are
caused by typo's in the references, and by renames or removals of the
Documentation files. Some broken references are simply odd.
Fix these broken references, sometimes by dropping the irrelevant text
they were part of.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
That flag no longer makes sense, since we don't look up automount points
as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT
handling was buggy to begin with: it would avoid automounting even for
cases where we really *needed* to do the automount handling, and could
return ENOENT for autofs entries that hadn't been instantiated yet.
With our new non-eager automount semantics, one discussion has been
about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the
newfstatat() and fstatat64() system calls), but it's probably not worth
it: you can always force at least directory automounting by simply
adding the final '/' to the filename, which works for *all* of the stat
family system calls, old and new.
So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a
result of our bad default behavior.
Acked-by: Ian Kent <raven@themaw.net>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The concensus seems to be that system calls such as stat() etc should
not trigger an automount. Neither should the l* versions.
This patch therefore adds a LOOKUP_AUTOMOUNT flag to tag those lookups
that _should_ trigger an automount on the last path element.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ Edited to leave out the cases that are already covered by LOOKUP_OPEN,
LOOKUP_DIRECTORY and LOOKUP_CREATE - all of which also fundamentally
force automounting for their own reasons - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since we've now turned around and made LOOKUP_FOLLOW *not* force an
automount, we want to add the ability to force an automount event on
lookup even if we don't happen to have one of the other flags that force
it implicitly (LOOKUP_OPEN, LOOKUP_DIRECTORY, LOOKUP_PARENT..)
Most cases will never want to use this, since you'd normally want to
delay automounting as long as possible, which usually implies
LOOKUP_OPEN (when we open a file or directory, we really cannot avoid
the automount any more).
But Trond argued sufficiently forcefully that at a minimum bind mounting
a file and quotactl will want to force the automount lookup. Some other
cases (like nfs_follow_remote_path()) could use it too, although
LOOKUP_DIRECTORY would work there as well.
This commit just adds the flag and logic, no users yet, though. It also
doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and
was made irrelevant by the same change that made us not follow on
LOOKUP_FOLLOW.
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"sysfs: use rb-tree for inode number lookup" added a new printk which
causes a new compile warning on s390 (and few other architectures):
fs/sysfs/dir.c: In function 'sysfs_link_sibling':
fs/sysfs/dir.c:63:4: warning: format '%lx' expects argument of type
'long unsigned int', but argument 2 has type 'ino_t' [-Wform
Add an explicit unsigned long cast since ino_t is an unsigned long on
most architectures.
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Use a separate stateid idr per client, and lookup a stateid by first
finding the client, then looking up the stateid relative to that client.
Also some minor refactoring.
This allows us to improve error returns: we can return expired when the
clientid is not found and bad_stateid when the clientid is found but not
the stateid, as opposed to returning expired for both cases.
I hope this will also help to replace the state lock mostly by a
per-client lock, but that hasn't been done yet.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Test_stateid is 4.1-only and only allowed after a sequence operation, so
this check is unnecessary.
Cc: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The idr system is designed exactly for generating id and looking up
integer id's. Thanks to Trond for pointing it out.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* 'for-linus' of git://git.kernel.dk/linux-block:
floppy: use del_timer_sync() in init cleanup
blk-cgroup: be able to remove the record of unplugged device
block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
mm: Add comment explaining task state setting in bdi_forker_thread()
mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
block: simplify force plug flush code a little bit
block: change force plug flush call order
block: Fix queue_flag update when rq_affinity goes from 2 to 1
block: separate priority boosting from REQ_META
block: remove READ_META and WRITE_META
xen-blkback: fixed indentation and comments
xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
This is modeled after the smaps code.
It detects transparent hugepages and then does a single gather_stats()
for the page as a whole. This has two benifits:
1. It is more efficient since it does many pages in a single shot.
2. It does not have to break down the huge page.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gather_pte_stats() does a number of checks on a target page
to see whether it should even be considered for statistics.
This breaks that code out in to a separate function so that
we can use it in the transparent hugepage case in the next
patch.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Christoph Lameter <cl@gentwo.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to teach the numa_maps code about transparent huge pages. The
first step is to teach gather_stats() that the pte it is dealing with
might represent more than one page.
Note that will we use this in a moment for transparent huge pages since
they have use a single pmd_t which _acts_ as a "surrogate" for a bunch
of smaller pte_t's.
I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes
for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to
figure out how many _bytes_ "dirty=1" means, you must first know the
hugetlbfs page size. That's easier said than done especially if you
don't have visibility in to the mount.
But, that's probably a discussion for another day especially since it
would change behavior to fix it. But, just in case anyone wonders why
this patch only passes a '1' in the hugetlb case...
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We
need to reserve space for:
- adjusting the old extent (possibly splitting it)
- adding the new extent
- updating the inode
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I'm not sure why I used a new field for this originally.
Also, the differences between some of these flags are a little subtle;
add some comments to explain.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Yet another open-management regression:
- nfs4_file_downgrade() doesn't remove the BOTH access bit on
downgrade, so the server's idea of the stateid's access gets
out of sync with the client's. If we want to keep an O_RDWR
open in this case, we should do that in the file_put_access
logic rather than here.
- We forgot to convert v4 access to an open mode here.
This logic has proven too hard to get right. In the future we may
consider:
- reexamining the lock/openowner relationship (locks probably
don't really need to take their own references here).
- adding open upgrade/downgrade support to the vfs.
- removing the atomic operations. They're redundant as long as
this is all under some other lock.
Also, maybe some kind of additional static checking would help catch
O_/NFS4_SHARE_ACCESS confusion.
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix sec=ntlmv2/i authentication option during mount of Samba shares.
cifs client was coding ntlmv2 response incorrectly.
All that is needed in temp as specified in MS-NLMP seciton 3.3.2
"Define ComputeResponse(NegFlg, ResponseKeyNT, ResponseKeyLM,
CHALLENGE_MESSAGE.ServerChallenge, ClientChallenge, Time, ServerName)
as
Set temp to ConcatenationOf(Responserversion, HiResponserversion,
Z(6), Time, ClientChallenge, Z(4), ServerName, Z(4)"
is MsvAvNbDomainName.
For sec=ntlmsspi, build_av_pair is not used, a blob is plucked from
type 2 response sent by the server to use in authentication.
I tested sec=ntlmv2/i and sec=ntlmssp/i mount options against
Samba (3.6) and Windows - XP, 2003 Server and 7.
They all worked.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Both these options are started with "rw" - that's why the first one
isn't switched on even if it is specified. Fix this by adding a length
check for "rw" option check.
Cc: <stable@kernel.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
move it to the beginning of the loop.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The name_len variable in CIFSFindNext is a signed int that gets set to
the resume_name_len in the cifs_search_info. The resume_name_len however
is unsigned and for some infolevels is populated directly from a 32 bit
value sent by the server.
If the server sends a very large value for this, then that value could
look negative when converted to a signed int. That would make that
value pass the PATH_MAX check later in CIFSFindNext. The name_len would
then be used as a length value for a memcpy. It would then be treated
as unsigned again, and the memcpy scribbles over a ton of memory.
Fix this by making the name_len an unsigned value in CIFSFindNext.
Cc: <stable@kernel.org>
Reported-by: Darren Lavender <dcl@hppine99.gbr.hp.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://github.com/chrismason/linux:
Btrfs: only clear the need lookup flag after the dentry is setup
BTRFS: Fix lseek return value for error
Btrfs: don't change inode flag of the dest clone file
Btrfs: don't make a file partly checksummed through file clone
Btrfs: fix pages truncation in btrfs_ioctl_clone()
btrfs: fix d_off in the first dirent
Look up closed stateid's in the stateid hash like any other stateid
rather than searching the close lru.
This is simpler, and fixes a bug: currently we handle only the case of a
close that is the last close for a given stateowner, but not the case of
a close for a stateowner that still has active opens on other files.
Thus in a case like:
open(owner, file1)
open(owner, file2)
close(owner, file2)
close(owner, file2)
the final close won't be recognized as a retransmission.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Including the full clientid in the on-the-wire stateid allows more
reliable detection of bad vs. expired stateid's, simplifies code, and
ensures we won't reuse the opaque part of the stateid (as we currently
do when the same openowner closes and reopens the same file).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We can race with readdir and the RCU path walking stuff. This is because we
clear the need lookup flag before actually instantiating the inode. This will
lead the RCU path walk stuff to find a dentry it thinks is valid without a
d_inode attached. So instead unhash the dentry when we first start the lookup,
and then clear the flag after we've instantiated the dentry so we're garunteed
to either try the slow lookup, or have the d_inode set properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The recent reworking of btrfs' lseek lead to incorrect
values being returned. This adds checks for seeking
beyond EOF in SEEK_HOLE and makes sure the error
values come back correct.
Andi Kleen also sent in similar patches.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The dst file will have the same inode flags with dst file after
file clone, and I think it's unexpected.
For example, the dst file will suddenly become immutable after
getting some share of data with src file, if the src is immutable.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To reproduce the bug:
# mount /dev/sda7 /mnt
# dd if=/dev/zero of=/mnt/src bs=4K count=1
# umount /mnt
# mount -o nodatasum /dev/sda7 /mnt
# dd if=/dev/zero of=/mnt/dst bs=4K count=1
# clone_range -s 4K -l 4K /mnt/src /mnt/dst
# echo 3 > /proc/sys/vm/drop_caches
# cat /mnt/dst
# dmesg
...
btrfs no csum found for inode 258 start 0
btrfs csum failed ino 258 off 0 csum 2566472073 private 0
It's because part of the file is checksummed and the other part is not,
and then btrfs will complain checksum is not found when we read the file.
Disallow file clone if src and dst file have different checksum flag,
so we ensure a file is completely checksummed or unchecksummed.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It's a bug in commit f81c9cdc56
(Btrfs: truncate pages from clone ioctl target range)
We should pass the dest range to the truncate function, but not the
src range.
Also move the function before locking extent state.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since the d_off in the first dirent for "." (that originates from
the 4th argument "offset" of filldir() for the 2nd dirent for "..")
is wrongly assigned in btrfs_real_readdir(), telldir returns same
offset for different locations.
| # mkfs.btrfs /dev/sdb1
| # mount /dev/sdb1 fs0
| # cd fs0
| # touch file0 file1
| # ../test
| telldir: 0
| readdir: d_off = 2, d_name = "."
| telldir: 2
| readdir: d_off = 2, d_name = ".."
| telldir: 2
| readdir: d_off = 3, d_name = "file0"
| telldir: 3
| readdir: d_off = 2147483647, d_name = "file1"
| telldir: 2147483647
To fix this problem, pass filp->f_pos (which is loff_t) instead.
| # ../test
| telldir: 0
| readdir: d_off = 1, d_name = "."
| telldir: 1
| readdir: d_off = 2, d_name = ".."
| telldir: 2
| readdir: d_off = 3, d_name = "file0"
:
At the moment the "offset" for "." is unused because there is no
preceding dirent, however it is better to pass filp->f_pos to follow
grammatical usage.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Keep around an unhashed copy of the final stateid after the last close
using an openowner, and when identifying a replay, match against that
stateid instead of just against the open owner id. Free it the next
time the seqid is bumped or the stateowner is destroyed.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
For checking the size of reply before calling a operation,
we need try to get maxsize of the operation's reply.
v3: using new method as Bruce said,
"we could handle operations in two different ways:
- For operations that actually change something (write, rename,
open, close, ...), do it the way we're doing it now: be
very careful to estimate the size of the response before even
processing the operation.
- For operations that don't change anything (read, getattr, ...)
just go ahead and do the operation. If you realize after the
fact that the response is too large, then return the error at
that point.
So we'd add another flag to op_flags: say, OP_MODIFIES_SOMETHING. And for
operations with OP_MODIFIES_SOMETHING set, we'd do the first thing. For
operations without it set, we'd do the second."
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
[bfields@redhat.com: crash, don't attempt to handle, undefined op_rsize_bop]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: Do not allow multiple mounts on same mountpoint when using -o noac
NFS: Fix a typo in nfs_flush_multi
NFSv4: renewd needs to be able to handle the NFS4ERR_CB_PATH_DOWN error
NFSv4: The NFSv4.0 client must send RENEW calls if it holds a delegation
NFSv4: nfs4_proc_renew should be declared static
NFSv4: nfs4_proc_async_renew should use a GFP_NOFS allocation
generic_check_addressable can't deal with hfsplus's larger than page
size allocation blocks, so simply opencode the checks that we actually
need in hfsplus_fill_super.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Reported-by: Pavel Ivanov <paivanof@gmail.com>
Tested-by: Pavel Ivanov <paivanof@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 6596528e39 ("hfsplus: ensure bio requests are not smaller than
the hardware sectors") changed the pointers used for volume header
allocations but failed to free the correct pointers in the error path
path of hfsplus_fill_super() and hfsplus_read_wrapper.
The second hunk came from a separate patch by Pavel Ivanov.
Reported-by: Pavel Ivanov <paivanof@gmail.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We used to get the victim pinned by dentry_unhash() prior to commit
64252c75a2 ("vfs: remove dget() from dentry_unhash()") and ->rmdir()
and ->rename() instances relied on that; most of them don't care, but
ones that used d_delete() themselves do. As the result, we are getting
rmdir() oopses on NFS now.
Just grab the reference before locking the victim and drop it explicitly
after unlocking, same as vfs_rename_other() does.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Simon Kirby <sim@hostway.ca>
Cc: stable@kernel.org (3.0.x)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a window in which the ioend that we call inode_dio_wake on
in xfs_end_io_direct_write is already free. Fix this by storing
the inode pointer in a local variable.
This is a fix for the regression introduced in 3.1-rc by
"fs: move inode_dio_done to the end_io handler".
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
For IPv6 local address, lockd can not callback to client for
missing scope id when binding address at inet6_bind:
324 if (addr_type & IPV6_ADDR_LINKLOCAL) {
325 if (addr_len >= sizeof(struct sockaddr_in6) &&
326 addr->sin6_scope_id) {
327 /* Override any existing binding, if another one
328 * is supplied by user.
329 */
330 sk->sk_bound_dev_if = addr->sin6_scope_id;
331 }
332
333 /* Binding to link-local address requires an interface */
334 if (!sk->sk_bound_dev_if) {
335 err = -EINVAL;
336 goto out_unlock;
337 }
Replacing svc_addr_u by sockaddr_storage, let rqstp->rq_daddr contains more info
besides address.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ cel: since this is server-side, use nfsd4_ prefix instead of nfs4_ prefix. ]
[ cel: implement S_ISVTX filter in bfields-normal form ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are no more users...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The current code is sort of hackish in that it assumes a referral is always
matched to an export. When we add support for junctions that may not be the
case.
We can replace nfsd4_path() with a function that encodes the components
directly from the dentries. Since nfsd4_path is currently the only user of
the 'ex_pathname' field in struct svc_export, this has the added benefit
of allowing us to get rid of that.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
First, we shouldn't care here about the structure of the opaque part of
the stateid. Second, this hash is really dumb. (I'm not sure the
replacement is much better, though--to look at it another patch.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We want delegations to share more with open/lock stateid's, so first
we'll pull out some of the common stuff we want to share.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Move most of this into helper functions. Also move the non-CONFIRM case
into caller, providing a helper function for that purpose.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Do not allow multiple mounts on same mountpoint when using -o noac
When you normally attempt to mount a share twice on the same mountpoint,
a check in do_add_mount causes it to return an error
# mount localhost:/nfsv3 /mnt
# mount localhost:/nfsv3 /mnt
mount.nfs: /mnt is already mounted or busy
However when using the option 'noac', the user is able to mount the same
share on the same mountpoint multiple times. This happens because a
share mounted with the noac option is automatically assigned the 'sync'
flag MS_SYNCHRONOUS in nfs_initialise_sb(). This flag is set after the
check for already existing superblocks is done in sget(). The check for
the mount flags in nfs_compare_mount_options() does not take into
account the 'sync' flag applied later on in the code path. This means
that when using 'noac', a new superblock structure is assigned for every
new mount of the same share and multiple shares on the same mountpoint
are allowed.
ie.
# mount -onoac localhost:/nfsv3 /mnt
can be run multiple times.
The patch checks for noac and assigns the sync flag before sget() is
called to obtain an already existing superblock structure.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a typo which causes an Oops in the RPC layer, when using wsize < 4k.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Sricharan R <r.sricharan@ti.com>
* 'for-linus' of git://github.com/chrismason/linux:
Btrfs: add dummy extent if dst offset excceeds file end in
Btrfs: calc file extent num_bytes correctly in file clone
btrfs: xattr: fix attribute removal
Btrfs: fix wrong nbytes information of the inode
Btrfs: fix the file extent gap when doing direct IO
Btrfs: fix unclosed transaction handle in btrfs_cont_expand
Btrfs: fix misuse of trans block rsv
Btrfs: reset to appropriate block rsv after orphan operations
Btrfs: skip locking if searching the commit root in csum lookup
btrfs: fix warning in iput for bad-inode
Btrfs: fix an oops when deleting snapshots
Commit 37fb3a30b4 ("fuse: fix flock") added in 3.1-rc4 caused flock() to
fail with ENOSYS with the kernel ABI version 7.16 or earlier.
Fix by falling back to testing FUSE_POSIX_LOCKS for ABI versions 7.16
and earlier.
Reported-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
You can see there's no file extent with range [0, 4096]. Check this by
btrfsck:
# btrfsck /dev/sda7
root 5 inode 258 errors 100
...
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
An attribute is not removed by 'setfattr -x attr file' and remains
visible in attr list. This makes xfstests/062 pass again.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we write some data into the data hole of the file(no preallocation for this
hole), Btrfs will allocate some disk space, and update nbytes of the inode, but
the other element--disk_i_size needn't be updated. At this condition, we must
update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size()
return 1).
# mkfs.btrfs /dev/sdb1
# mount /dev/sdb1 /mnt
# touch /mnt/a
# truncate -s 856002 /mnt/a
# dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
# umount /mnt
# btrfsck /dev/sdb1
root 5 inode 257 errors 400
found 32768 bytes used err is 1
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we write some data to the place that is beyond the end of the file
in direct I/O mode, a data hole will be created. And Btrfs should insert
a file extent item that point to this hole into the fs tree. But unfortunately
Btrfs forgets doing it.
The following is a simple way to reproduce it:
# mkfs.btrfs /dev/sdc2
# mount /dev/sdc2 /test4
# touch /test4/a
# dd if=/dev/zero of=/test4/a seek=8 count=1 bs=4K oflag=direct conv=nocreat,notrunc
# umount /test4
# btrfsck /dev/sdc2
root 5 inode 257 errors 100
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The function - btrfs_cont_expand() forgot to close the transaction handle before
it jump out the while loop. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
At the beginning of create_pending_snapshot, trans->block_rsv is set
to pending->block_rsv and is used for snapshot things, however, when
it is done, we do not recover it as will.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While truncating free space cache, we forget to change trans->block_rsv
back to the original one, but leave it with the orphan_block_rsv, and
then with option inode_cache enable, it leads to countless warnings of
btrfs_alloc_free_block and btrfs_orphan_commit_root:
WARNING: at fs/btrfs/extent-tree.c:5711 btrfs_alloc_free_block+0x180/0x350 [btrfs]()
...
WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It's not enough to just search the commit root, since we could be cow'ing the
very block we need to search through, which would mean that its locked and we'll
still deadlock. So use path->skip_locking as well. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
iput() shouldn't be called for inodes in I_NEW state.
We need to mark inode as constructed first.
WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
Call Trace:
[<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0
[<ffffffff8103e805>] warn_slowpath_null+0x15/0x20
[<ffffffff810eaf0b>] iput+0x20b/0x210
[<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0
[<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210
[<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0
[<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0
[<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280
[<ffffffff8105af86>] kthread+0x96/0xa0
[<ffffffff814336d4>] kernel_thread_helper+0x4/0x10
[<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190
[<ffffffff814336d0>] ? gs_change+0xb/0xb
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Konstantin Khlebnikov <khlebnikov@openvz.org>
Tested-by: David Sterba <dsterba@suse.cz>
CC: Josef Bacik <josef@redhat.com>
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can reproduce this oops via the following steps:
$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
$ for ((i=0; i<3; i++)); do btrfs sub snap /mnt/btrfs /mnt/btrfs/s_$i; done
$ rm -fr /mnt/btrfs/*
$ rm -fr /mnt/btrfs/*
then we'll get
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:2264!
[...]
Call Trace:
[<ffffffffa05578c7>] btrfs_rmdir+0xf7/0x1b0 [btrfs]
[<ffffffff81150b95>] vfs_rmdir+0xa5/0xf0
[<ffffffff81153cc3>] do_rmdir+0x123/0x140
[<ffffffff81145ac7>] ? fput+0x197/0x260
[<ffffffff810aecff>] ? audit_syscall_entry+0x1bf/0x1f0
[<ffffffff81153d0d>] sys_unlinkat+0x2d/0x40
[<ffffffff8147896b>] system_call_fastpath+0x16/0x1b
RIP [<ffffffffa054f7b9>] btrfs_orphan_add+0x179/0x1a0 [btrfs]
When it comes to btrfs_lookup_dentry, we may set a snapshot's inode->i_ino
to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID,
while the snapshot's location.objectid remains unchanged.
However, btrfs_ino() does not take this into account, and returns a wrong ino,
and causes the oops.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These modes are not necessarily for OOB only. Particularly, MTD_OOB_RAW
affected operations on in-band page data as well. To clarify these
options and to emphasize that their effect is applied per-operation, we
change the primary prefix to MTD_OPS_.
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@intel.com>
Use a helper to test if a mutex is held instead of a hack with
mutex_trylock().
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Artem Bityutskiy <dedekind1@gmail.com>
kfree() deals gracefully with NULL pointers, so it's pointless to test for
one prior to calling it.
This removes such a test from jffs2_scan_medium().
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Artem Bityutskiy <dedekind1@gmail.com>
* 'for-linus' of git://neil.brown.name/md:
md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.
md/raid1,10: Remove use-after-free bug in make_request.
md/raid10: unify handling of write completion.
Avoid dereferencing a 'request_queue' after last close.
On the last close of an 'md' device which as been stopped, the device
is destroyed and in particular the request_queue is freed. The free
is done in a separate thread so it might happen a short time later.
__blkdev_put calls bdev_inode_switch_bdi *after* ->release has been
called.
Since commit f758eeabeb
bdev_inode_switch_bdi will dereference the 'old' bdi, which lives
inside a request_queue, to get a spin lock. This causes the last
close on an md device to sometime take a spin_lock which lives in
freed memory - which results in an oops.
So move the called to bdev_inode_switch_bdi before the call to
->release.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Currently, there exists a race between delayed allocated writes and
the writeback when bigalloc feature is in use. The race was because we
wanted to determine what blocks in a cluster are under delayed
allocation and we were using buffer_delayed(bh) check for it. But, the
writeback codepath clears this bit without any synchronization which
resulted in a race and an ext4 warning similar to:
EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0
reserved data blocks
The race existed in two places.
(1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from
writeback code path.
(2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where
buffer_delayed(bh) is set.
To fix (1), this patch introduces a new buffer_head state bit -
BH_Da_Mapped. This bit is set under the protection of
EXT4_I(inode)->i_data_sem when we have actually mapped the delayed
allocated blocks during the writeout time. We can now reliably check
for this bit inside ext4_find_delalloc_range() to determine whether
the reservation for the blocks have already been claimed or not.
To fix (2), it was necessary to set buffer_delay(bh) under the
protection of i_data_sem. So, I extracted the very beginning of
ext4_map_blocks into a new function - ext4_da_map_blocks() - and
performed the required setting of bh_delay bit and the quota
reservation under the protection of i_data_sem. These two fixes makes
the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent,
thus removing the race.
Tested: I was able to reproduce the problem by running 'dd' and
'fsync' in parallel. Also, xfstests sometimes used to reproduce this
race. After the fix both my test and xfstests were successful and no
race (warning message) was observed.
Google-Bug-Id: 4997027
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
ext4/inode.c.
Tested: Built and ran the kernel and verified that these tracepoints work.
Also ran xfstests.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Rename the function so it is more clear what is going on. Also rename
the various variables so it's clearer what's happening.
Also fix a missing blocks to cluster conversion when reading the
number of reserved blocks for root.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function really claims a number of free clusters, not blocks, so
rename it so it's clearer what's going on.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function really returns the number of clusters after initializing
an uninitalized block bitmap has been initialized.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This function really counts the free clusters reported in the block
group descriptors, so rename it to reduce confusion.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions. So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.
Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The default group preallocation size had been previously set to 512
blocks/clusters, regardless of the block/cluster size. This is
probably too big for large cluster sizes. So adjust the default so
that it is 2 megabytes or 32 clusters, whichever is larger.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Convert the free_blocks to be free_clusters to make the final revised
bigalloc changes easier to read/understand.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we are truncating (as opposed unlinking) a file, we need to worry
about partial truncates of a file, especially in the light of sparse
files. The changes here make sure that arbitrary truncates of sparse
files works correctly. Yeah, it's messy.
Note that these functions will need to be revisted when the punch
ioctl is integrated --- in fact this commit will probably have merge
conflicts with the punch changes which Allison Henders and the IBM LTC
have been working on. I will need to fix this up when either patch
hits mainline.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we need to allocate a new block in ext4_ext_map_blocks(), the
function needs to see if the cluster has already been allocated.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The ext4_free_blocks() function now has two new flags that indicate
whether a partial cluster at the beginning or the end of the block
extents should be freed or not. That will be up the caller (i.e.,
truncate), who can figure out whether partial clusters at the
beginning or the end of a block range can be freed.
We also have to update the ext4_mb_free_metadata() and
release_blocks_on_commit() machinery to be cluster-based, since it is
used by ext4_free_blocks().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In most of mballoc.c, we do everything in units of clusters, since the
block allocation bitmaps and buddy bitmaps are all denominated in
clusters. The one place where we do deal with absolute block numbers
is in the code that handles the preallocation regions, since in the
case of inode-based preallocation regions, the start of the
preallocation region can't be relative to the beginning of the group.
So this adds a bit of complexity, where pa_pstart and pa_lstart are
block numbers, while pa_free, pa_len, and fe_len are denominated in
units of clusters.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Certain parts of the ext4 code base, primarily in mballoc.c, use a
block group number and offset from the beginning of the block group.
This offset is invariably used to index into the allocation bitmap, so
change the offset to be denominated in units of clusters.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The function ext4_free_blocks_after_init() used to be a #define of
ext4_init_block_bitmap(). This actually made it difficult to
understand how the function worked, and made it hard make changes to
support clusters. So as an initial cleanup, I've separated out the
functionality of initializing block bitmap from calculating the number
of free blocks in the new block group.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Prior to 2.6.38 automount would not trigger on either stat(2) or
lstat(2) on the automount point.
After 2.6.38, with the introduction of the ->d_automount()
infrastructure, stat(2) and others would start triggering automount
while lstat(2), etc. still would not. This is a regression and a
userspace ABI change.
Problem originally reported here:
http://thread.gmane.org/gmane.linux.kernel.autofs/6098
It appears that there was an attempt at fixing various userspace tools
to not trigger the automount. But since the stat system call is
rather common it is impossible to "fix" all userspace.
This patch reverts the original behavior, which is to not trigger on
stat(2) and other symlink following syscalls.
[ It's not really clear what the right behavior is. Apparently Solaris
does the "automount on stat, leave alone on lstat". And some programs
can get unhappy when "stat+open+fstat" ends up giving a different
result from the fstat than from the initial stat.
But the change in 2.6.38 resulted in problems for some people, so
we're going back to old behavior. Maybe we can re-visit this
discussion at some future date - Linus ]
Reported-by: Leonardo Chiquitto <leonardo.lists@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Ian Kent <raven@themaw.net>
Cc: David Howells <dhowells@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes it easier to understand how ext4_init_block_bitmap() works,
and it will assist when we split out ext4_free_blocks_after_init() in
the next commit.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Change the places in fs/ext4/mballoc.c where EXT4_BLOCKS_PER_GROUP are
used to indicate the number of bits in a block bitmap (which is really
a cluster allocation bitmap in bigalloc file systems). There are
still some places in the ext4 codebase where usage of
EXT4_BLOCKS_PER_GROUP needs to be audited/fixed, in code paths that
aren't used given the initial restricted assumptions for bigalloc.
These will need to be fixed before we can relax those restrictions.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
At least initially if the bigalloc feature is enabled, we will not
support non-extent mapped inodes, online resizing, online defrag, or
the FITRIM ioctl. This simplifies the initial implementation.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This adds supports for bigalloc file systems. It teaches the mount
code just enough about bigalloc superblock fields that it will mount
the file system without freaking out that the number of blocks per
group is too big.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The del_gendisk() function uninitializes the disk-specific data
structures, including the bdi structure, without telling anyone
else. Once this happens, any attempt to call mark_buffer_dirty()
(for example, by ext4_commit_super), will cause a kernel OOPS.
Fix this for now until we can fix things in an architecturally correct
way.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
statistics when priting information about idle and iowait times.
This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because
counters are updated periodically.
With NO_HZ things got more tricky because we are not doing idle/iowait
accounting while we are tickless so the value might get outdated.
Users of /proc/stat will notice that by unchanged idle/iowait values
which is then interpreted as 0% idle/iowait time. From the user space
POV this is an unexpected behavior and a change of the interface.
Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the
total idle/iowait time since boot and it doesn't rely on sampling or any
other periodic activity. Fall back to the previous behavior if NO_HZ is
disabled or not configured.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Jones <davej@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* branch 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBIFS: not build debug messages with CONFIG_UBIFS_FS_DEBUG disabled
* branch 'linux-next' of git://git.infradead.org/ubi-2.6:
UBI: do not link debug messages when debugging is disabled
The stateowner has some fields that only make sense for openowners, and
some that only make sense for lockowners, and I find it a lot clearer if
those are separated out.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
While running extended fsx tests to verify the preceeding patches,
a similar bug was also found in the write operation
When ever a write operation begins or ends in a hole,
or extends EOF, the partial page contained in the hole
or beyond EOF needs to be zeroed out.
To correct this the new ext4_discard_partial_page_buffers_no_lock
routine is used to zero out the partial page, but only for buffer
heads that are already unmapped.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
While running extended fsx tests to verify the first
two patches, a similar bug was also found in the
truncate operation.
This bug happens because the truncate routine only zeros
the unblock aligned portion of the last page. This means
that the block aligned portions of the page appearing after
i_size are left unzeroed, and the buffer heads still mapped.
This bug is corrected by using ext4_discard_partial_page_buffers
in the truncate routine to zero the partial page and unmap
the buffer headers.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This make sure we don't end up reusing the unlinked inode object.
The ideal way is to use inode i_generation. But i_generation is
not available in userspace always.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Some of the flags are OS/arch dependent we add a 9p
protocol value which maps to asm-generic/fcntl.h values in Linux
Based on the original patch from Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
We should only update attributes that we can change on stat2inode.
Also do file type initialization in v9fs_init_inode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
d_instantiate marks the dentry positive. So a parallel lookup and mkdir of
the directory can find dentry that doesn't have fid attached. This can result
in both the code path doing v9fs_fid_add which results in v9fs_dentry leak.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
In delayed allocation mode, it's important to only call
ext4_jbd2_file_inode when the file has been extended. This is
necessary to avoid a race which first got introduced in commit
678aaf481, but which was made much more common with the introduction
of the "punch hole" functionality. (Especially when dioread_nolock
was enabled; when I could reliably reproduce this problem with
xfstests #74.)
The race is this: If while trying to writeback a delayed allocation
inode, there is a need to map delalloc blocks, and we run out of space
in the journal, *and* at the same time the inode is already on the
committing transaction's t_inode_list (because for example while doing
the punch hole operation, ext4_jbd2_file_inode() is called), then the
commit operation will wait for the inode to finish all of its pending
writebacks by calling filemap_fdatawait(), but since that inode has
one or more pages with the PageWriteback flag set, the commit
operation will wait forever, and the so the writeback of the inode can
never take place, and the kjournald thread and the writeback thread
end up waiting for each other --- forever.
It's important at this point to recall why an inode is placed on the
t_inode_list; it is to provide the data=ordered guarantees that we
don't end up exposing stale data. In the case where we are truncating
or punching a hole in the inode, there is no possibility that stale
data could be exposed in the first place, so we don't need to put the
inode on the t_inode_list!
The right long-term fix is to get rid of data=ordered mode altogether,
and only update the extent tree or indirect blocks after the data has
been written. Until then, this change will also avoid some
unnecessary waiting in the commit operation.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Allison Henderson <achender@linux.vnet.ibm.com>
Cc: Jan Kara <jack@suse.cz>
This silences some Sparse warnings:
fs/jbd2/transaction.c:135:69: warning: incorrect type in argument 2 (different base types)
fs/jbd2/transaction.c:135:69: expected restricted gfp_t [usertype] flags
fs/jbd2/transaction.c:135:69: got int [signed] gfp_mask
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add debugging information in case jbd2_journal_dirty_metadata() is
called with a buffer_head which didn't have
jbd2_journal_get_write_access() called on it, or if the journal_head
has the wrong transaction in it. In addition, return an error code.
This won't change anything for ocfs2, which will BUG_ON() the non-zero
exit code.
For ext4, the caller of this function is ext4_handle_dirty_metadata(),
and on seeing a non-zero return code, will call __ext4_journal_stop(),
which will print the function and line number of the (buggy) calling
function and abort the journal. This will allow us to recover instead
of bug halting, which is better from a robustness and reliability
point of view.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Move the CLOSE_STATE case into the unique caller that cares about it
rather than putting it in preprocess_seqid_op.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If the user explicitly specifies conflicting mount options for
delalloc or dioread_nolock and data=journal, fail the mount, instead
of printing a warning and continuing (since many user's won't look at
dmesg and notice the warning).
Also, print a single warning that data=journal implies that delayed
allocation is not on by default (since it's not supported), and
furthermore that O_DIRECT is not supported. Improve the text in
Documentation/filesystems/ext4.txt so this is clear there as well.
Similarly, if the dioread_nolock mount option is specified when the
file system block size != PAGE_SIZE, fail the mount instead of
printing a warning message and ignoring the mount option.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch fixes a second punch hole bug found by xfstests 127.
This bug happens because punch hole needs to flush the pages
of the hole to avoid race conditions. But if the end of the
hole is in the same page as i_size, the buffer heads beyond
i_size need to be unmapped and the page needs to be zeroed
after it is flushed.
To correct this, the new ext4_discard_partial_page_buffers
routine is used to zero and unmap the partial page
beyond i_size if the end of the hole appears in the same
page as i_size.
The code has also been optimized to set the end of the hole
to the page after i_size if the specified hole exceeds i_size,
and the code that flushes the pages has been simplified.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
This patch addresses a bug found by xfstests 75, 112, 127
when blocksize = 1k
This bug happens because the punch hole code only zeros
out non block aligned regions of the page. This means that if the
blocks are smaller than a page, then the block aligned regions of
the page inside the hole are left un-zeroed, and their buffer heads
are still mapped. This bug is corrected by using
ext4_discard_partial_page_buffers to properly zero the partial page
at the head and tail of the hole, and unmap the corresponding buffer
heads
This patch also addresses a bug reported by Lukas while working on a
new patch to add discard support for loop devices using punch hole.
The bug happened because of the first and last block number
needed to be cast to a larger data type before calculating the
byte offset, but since now we only need the byte offsets of the
pages, we no longer even need to be calculating the byte offsets
of the blocks. The code to do the block offset calculations is
removed in this patch.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
This patch adds two new routines: ext4_discard_partial_page_buffers
and ext4_discard_partial_page_buffers_no_lock.
The ext4_discard_partial_page_buffers routine is a wrapper
function to ext4_discard_partial_page_buffers_no_lock.
The wrapper function locks the page and passes it to
ext4_discard_partial_page_buffers_no_lock.
Calling functions that already have the page locked can call
ext4_discard_partial_page_buffers_no_lock directly.
The ext4_discard_partial_page_buffers_no_lock function
zeros a specified range in a page, and unmaps the
corresponding buffer heads. Only block aligned regions of the
page will have their buffer heads unmapped. Unblock aligned regions
will be mapped if needed so that they can be updated with the
partial zero out. This function is meant to
be used to update a page and its buffer heads to be zeroed
and unmapped when the corresponding blocks have been released
or will be released.
This routine is used in the following scenarios:
* A hole is punched and the non page aligned regions
of the head and tail of the hole need to be discarded
* The file is truncated and the partial page beyond EOF needs
to be discarded
* The end of a hole is in the same page as EOF. After the
page is flushed, the partial page beyond EOF needs to be
discarded.
* A write operation begins or ends inside a hole and the partial
page appearing before or after the write needs to be discarded
* A write operation extends EOF and the partial page beyond EOF
needs to be discarded
This function takes a flag EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED
which is used when a write operation begins or ends in a hole.
When the EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED flag is used, only
buffer heads that are already unmapped will have the corresponding
regions of the page zeroed.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I don't see the point of having this check in nfs4_preprocess_seqid_op()
when it's only needed by the one caller.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: fix ->write_inode return values
xfs: fix xfs_mark_inode_dirty during umount
xfs: deprecate the nodelaylog mount option
Currently we always redirty an inode that was attempted to be written out
synchronously but has been cleaned by an AIL pushed internall, which is
rather bogus. Fix that by doing the i_update_core check early on and
return 0 for it. Also include async calls for it, as doing any work for
those is just as pointless. While we're at it also fix the sign for the
EIO return in case of a filesystem shutdown, and fix the completely
non-sensical locking around xfs_log_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
(cherry picked from commit 297db93bb74cf687510313eb235a7aec14d67e97)
Signed-off-by: Alex Elder <aelder@sgi.com>
If open fails with any error other than nfserr_replay_me, then the main
nfsd4_proc_compound() loop continues unconditionally to
nfsd4_encode_operation(), which will always call encode_seqid_op_tail.
Thus the condition we check for here does not occur.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are currently a couple races in the seqid replay code: a
retransmission could come while we're still encoding the original reply,
or a new seqid-mutating call could come as we're encoding a replay.
So, extend the state lock over the encoding (both encoding of a replayed
reply and caching of the original encoded reply).
I really hate doing this, and previously added the stateowner
reference-counting code to avoid it (which was insufficient)--but I
don't see a less complicated alternative at the moment.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
During umount we do not add a dirty inode to the lru and wait for it to
become clean first, but force writeback of data and metadata with
I_WILL_FREE set. Currently there is no way for XFS to detect that the
inode has been redirtied for metadata operations, as we skip the
mark_inode_dirty call during teardown. Fix this by setting i_update_core
nanually in that case, so that the inode gets flushed during inode reclaim.
Alternatively we could enable calling mark_inode_dirty for inodes in
I_WILL_FREE state, and let the VFS dirty tracking handle this. I decided
against this as we will get better I/O patterns from reclaim compared to
the synchronous writeout in write_inode_now, and always marking the inode
dirty in some way from xfs_mark_inode_dirty is a better safetly net in
either case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
(cherry picked from commit da6742a5a4cc844a9982fdd936ddb537c0747856)
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that the replay owner is in the cstate we can remove it from a lot
of other individual operations and further simplify
nfs4_preprocess_seqid_op().
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Set the stateowner associated with a replay in one spot in
nfs4_preprocess_seqid_op() and keep it in cstate. This allows removing
a few lines of boilerplate from all the nfs4_preprocess_seqid_op()
callers.
Also turn ENCODE_SEQID_OP_TAIL into a function while we're here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Thanks to Casey for reminding me that 5661 gives a special meaning to a
value of 0 in the stateid's seqid field, so all stateid's should start
out with si_generation 1. We were doing that in the open and lock
cases for minorversion 1, but not for the delegation stateid, and not
for openstateid's with v4.0.
It doesn't *really* matter much for v4.0 or for delegation stateid's
(which never get the seqid field incremented), but we may as well do the
same for all of them.
Reported-by: Casey Bodley <cbodley@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Follow the recommendation from rfc3530bis for stateid generation number
wraparound, simplify some code, and fix or remove incorrect comments.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The values here represent highest slotid numbers. Since slotid's are
numbered starting from zero, the highest should be one less than the
number of slots.
Reported-by: Rick Macklem <rmacklem@uoguelph.ca>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads
that point to directory blocks assigned to the directory inode. However, the
function calls ext4_handle_dirty_metadata with the inode of the file that's
being added to the directory, not the directory inode itself. Therefore,
correct the code to dirty the directory buffers with the directory inode, not
the file inode.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir".
Unfortunately, dir_block belongs to the newly created directory (which is
"inode"), not the parent directory (which is "dir"). Fix the incorrect
association.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
When ext4_rename performs a directory rename (move), dir_bh is a
buffer that is modified to update the '..' link in the directory being
moved (old_inode). However, ext4_handle_dirty_metadata is called with
the old parent directory inode (old_dir) and dir_bh, which is
incorrect because dir_bh does not belong to the parent inode. Fix
this error.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Currently attempts to open a file with O_DIRECT in data=journal mode
causes the open to fail with -EINVAL. This makes it very hard to test
data=journal mode. So we will let the open succeed, but then always
fall back to O_DSYNC buffered writes.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This doesn't make much sense, and it exposes a bug in the kernel where
attempts to create a new file in an append-only directory using
O_CREAT will fail (but still leave a zero-length file). This was
discovered when xfstests #79 was generalized so it could run on all
file systems.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc:stable@kernel.org
The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
in ext4_evict_inode() causes lockdep complaining about potential
deadlock in several places. In most/all of these LOCKDEP complaints
it looks like it's a false positive, since many of the potential
circular locking cases can't take place by the time the
ext4_evict_inode() is called; but since at the very least it may mask
real problems, we need to address this.
This change removes the flush_completed_IO() and i_mutex lock in
ext4_evict_inode(). Instead, we take a different approach to resolve
the software lockup that commit 2581fdc810 intends to fix. Rather
than having ext4-dio-unwritten thread wait for grabing the i_mutex
lock of an inode, we use mutex_trylock() instead, and simply requeue
the work item if we fail to grab the inode's i_mutex lock.
This should speed up work queue processing in general and also
prevents the following deadlock scenario: During page fault,
shrink_icache_memory is called that in turn evicts another inode B.
Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When called with OPEN_STATE, preprocess_seqid_op only returns an open
stateid, hence only an open owner.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We've got some lock-specific code here in nfs4_preprocess_seqid_op which
is only used by nfsd4_lock(). Move it to the caller.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Note that the special handling for the lock stateid case is already done
by nfs4_check_openmode() (as of 0292191417
"nfsd4: fix openmode checking on IO using lock stateid") so we no longer
need these two cases in the caller.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Share some common code, stop doing silly things like initializing a list
head immediately before adding it to a list, etc.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
These appear to be generic (for both open and lock owners), but they're
actually just for open owners. This has confused me more than once.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The server is returning nfserr_resource for both permanent errors and
for errors (like allocation failures) that might be resolved by retrying
later. Save nfserr_resource for the former and use delay/jukebox for
the latter.
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Even if we fail to write a recovery record, we should still mark the
client as having acquired its first state. Otherwise we leave 4.1
clients with indefinite ERR_GRACE returns.
However, an inability to write stable storage records may cause failures
of reboot recovery, and the problem should still be brought to the
server administrator's attention.
So, make sure the error is logged.
These errors shouldn't normally be triggered on a corectly functioning
server--this isn't a case where a misconfigured client could spam the
logs.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A client that wants to execute a file must be able to read it. Read
opens over nfs are therefore implicitly allowed for executable files
even when those files are not readable.
NFSv2/v3 get this right by using a passed-in NFSD_MAY_OWNER_OVERRIDE on
read requests, but NFSv4 has gotten this wrong ever since
dc730e1737 "nfsd4: fix owner-override on
open", when we realized that the file owner shouldn't override
permissions on non-reclaim NFSv4 opens.
So we can't use NFSD_MAY_OWNER_OVERRIDE to tell nfsd_permission to allow
reads of executable files.
So, do the same thing we do whenever we encounter another weird NFS
permission nit: define yet another NFSD_MAY_* flag.
The industry's future standardization on 128-bit processors will be
motivated primarily by the need for integers with enough bits for all
the NFSD_MAY_* flags.
Reported-by: Leonardo Borda <leonardoborda@gmail.com>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The nfsd4 code has a bunch of special exceptions for error returns which
map nfserr_symlink to other errors.
In fact, the spec makes it clear that nfserr_symlink is to be preferred
over less specific errors where possible.
The patch that introduced it back in 2.6.4 is "kNFSd: correct symlink
related error returns.", which claims that these special exceptions are
represent an NFSv4 break from v2/v3 tradition--when in fact the symlink
error was introduced with v4.
I suspect what happened was pynfs tests were written that were overly
faithful to the (known-incomplete) rfc3530 error return lists, and then
code was fixed up mindlessly to make the tests pass.
Delete these unnecessary exceptions.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Zero means "I don't care what kind of file this is". And that's
probably what we want--acls are also settable at least on directories,
and if the filesystem doesn't want them on other objects, leave it to it
to complain.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We allow the fh_verify caller to specify that any object *except* those
of a given type is allowed, by passing a negative type. But only one
caller actually uses it. Open-code that check in the one caller.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A slightly unconventional approach to make the code more compact I could
live with, but let's give the poor reader *some* chance.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The nfsservctl system call is now gone, so we should remove all
linkage for it.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dark space calculation should be 64 bit type-casted, when
assigning to tmp64 (similar to how total_free is calculated).
Overflow will occur for very large flashes.
Signed-off-by: srimugunthan <srimugunthan.dhandapani@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@intel.com>
Purely in-memory filesystems do not use the inode hash as the dcache
tells us if an entry already exists. As a result, they do not call
unlock_new_inode, and thus directory inodes do not get put into a
different lockdep class for i_sem.
We need the different lockdep classes, because the locking order for
i_mutex is different for directory inodes and regular inodes. Directory
inodes can do "readdir()", which takes i_mutex *before* possibly taking
mm->mmap_sem (due to a page fault while copying the directory entry to
user space).
In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
before accessing i_mutex.
The two cases can never happen for the same inode, so no real deadlock
can occur, but without the different lockdep classes, lockdep cannot
understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
can lead to false positives from lockdep like below:
find/645 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac
but task is already holding lock:
(&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>]
vfs_readdir+0x5b/0xb4
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
[<ffffffff8108ac26>] lock_acquire+0xbf/0x103
[<ffffffff814db822>] __mutex_lock_common+0x4c/0x361
[<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45
[<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110
[<ffffffff81111557>] mmap_region+0x258/0x432
[<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306
[<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a
[<ffffffff8100c858>] sys_mmap+0x22/0x24
[<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
-> #0 (&mm->mmap_sem){++++++}:
[<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7
[<ffffffff8108ac26>] lock_acquire+0xbf/0x103
[<ffffffff81109541>] might_fault+0x89/0xac
[<ffffffff81149cff>] filldir+0x6f/0xc7
[<ffffffff811586ea>] dcache_readdir+0x67/0x205
[<ffffffff81149f54>] vfs_readdir+0x7b/0xb4
[<ffffffff8114a073>] sys_getdents+0x7e/0xd1
[<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
This patch moves the directory vs file lockdep annotation into a helper
function that can be called by in-memory filesystems and has hugetlbfs
call it.
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NFSv4 spec does not specify that the server must repeat that error,
so in order to avoid having the delegations revoked, we should handle
it immediately.
Also note that NFS4ERR_CB_PATH_DOWN does in fact renew the lease...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RFC3530 states that if the client holds a delegation, then it is obliged
to continue to send RENEW calls once every lease period in order to allow
the server to return NFS4ERR_CB_PATH_DOWN if the callback path is
unreachable.
This is not required for NFSv4.1, since the server can at any time set
the SEQ4_STATUS_CB_PATH_DOWN_SESSION in any SEQUENCE operation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message
fuse: mark pages accessed when written to
fuse: delete dead .write_begin and .write_end aops
fuse: fix flock
fuse: fix non-ANSI void function notation
FUSE_NOTIFY_INVAL_ENTRY didn't check the length of the write so the
message processing could overrun and result in a "kernel BUG at
fs/fuse/dev.c:629!"
Reported-by: Han-Wen Nienhuys <hanwenn@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: fix tracing builds inside the source tree
xfs: remove subdirectories
xfs: don't expect xfs headers to be in subdirectories
There are cases where suppressing partition scan is useful - e.g. for
lo devices and pseudo SATA devices which advertise to be a disk but
get upset on partition scan (some port multiplier control devices show
such behavior).
This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
regardless of the number of possible partitions. disk_partitionable()
is renamed to disk_part_scan_enabled() as suppressing partition scan
doesn't imply the device can't be partitioned using
BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
directly tests disk_max_parts() to maintain backward-compatibility.
-v2: Updated to make it clear that only partition scan is suppressed
not partitioning itself as suggested by Kay Sievers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule,
and lave REQ_META purely for marking requests as metadata in blktrace.
All existing callers of REQ_META except for XFS are updated to also
set REQ_PRIO for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Replace all occurnanced of the undocumented READ_META with READ | REQ_META
and remove the unused WRITE_META define.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
sysfs: use rb-tree for inode number lookup
This patch makes sysfs use red-black tree for inode number lookup.
Together with a previous patch to use red-black tree for name lookup,
this patch makes all sysfs lookups to have O(log n) complexity.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs: remove s_sibling hacks
s_sibling was used for three different purposes:
1) as a linked list of entries in the directory
2) as a linked list of entries to be deleted
3) as a pointer to "struct completion"
This patch removes the hack and introduces new union u which
holds pointers for cases 2) and 3).
This change is needed for the following patch that removes s_sibling at all
and replaces it with a rb tree.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs: use rb-tree for name lookups
Use red-black tree for name lookups.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs: count subdirectories
This patch introduces a subdirectory counter for each sysfs directory.
Without the patch, sysfs_refresh_inode would walk all entries of the directory
to calculate the number of subdirectories.
This patch improves time of "ls -la /sys/block" when there are 10000 block
devices from 9 seconds to 0.19 seconds.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The file is fs/debugfs/inode.c but the comment says it is file.c.
This patch can fix this little mistake.
Signed-off-by: Harry Wei <harryxiyou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The code really requires the current source directory to be in the
header search path. We already do this if building with an object
tree separate from the source, but it needs to be added manually
if building inside the source. The cflags addition for it accidentally
got removed when collapsing the xfs directory structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
kfree does not clean up indirect allocations in
ceph_fs_client and ceph_options (e.g. snapdir_name).
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
This fixes a regression introduced by commit cdcb725c05 ("Btrfs: check
if there is enough space for balancing smarter"). We can't do 64-bit
divides on 32-bit architectures.
In cases where we need to divide/multiply by 2 we should just left/right
shift respectively, and in cases where theres N number of devices use
do_div. Also make the counters u64 to match up with rw_devices.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Acked-and-tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: flush any pending end_io requests before DIO reads w/dioread_nolock
ext4: fix nomblk_io_submit option so it correctly converts uninit blocks
ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN.
ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode
ext4: Fix ext4_should_writeback_data() for no-journal mode
There is a race between ext4 buffer write and direct_IO read with
dioread_nolock mount option enabled. The problem is that we clear
PageWriteback flag during end_io time but will do
uninitialized-to-initialized extent conversion later with dioread_nolock.
If an O_direct read request comes in during this period, ext4 will return
zero instead of the recently written data.
This patch checks whether there are any pending uninitialized-to-initialized
extent conversion requests before doing O_direct read to close the race.
Note that this is just a bandaid fix. The fundamental issue is that we
clear PageWriteback flag before we really complete an IO, which is
problem-prone. To fix the fundamental issue, we may need to implement an
extent tree cache that we can use to look up pending to-be-converted extents.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Use NUMA aware allocations to reduce latencies and increase throughput.
sunrpc kthreads can use kthread_create_on_node() if pool_mode is
"percpu" or "pernode", and svc_prepare_thread()/svc_init_buffer() can
also take into account NUMA node affinity for memory allocations.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: "J. Bruce Fields" <bfields@fieldses.org>
CC: Neil Brown <neilb@suse.de>
CC: David Miller <davem@davemloft.net>
Reviewed-by: Greg Banks <gnb@fastmail.fm>
[bfields@redhat.com: fix up caller nfs41_callback_up]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There's an incorrect comment here. Also clean up the logic: the
"rdlease" and "wrlease" locals are confusingly named, and don't really
add anything since we can make a decision as soon as we hit one of these
cases.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We currently use a bit in fl_flags to record whether a lease is being
broken, and set fl_type to the type (RDLCK or UNLCK) that it will
eventually have. This means that once the lease break starts, we forget
what the lease's type *used* to be. Breaking a read lease will then
result in blocking read opens, even though there's no conflict--because
the lease type is now F_UNLCK and we can no longer tell whether it was
previously a read or write lease.
So, instead keep fl_type as the original type (the type which we
enforce), and keep track of whether we're unlocking or merely
downgrading by replacing the single FL_INPROGRESS flag by
FL_UNLOCK_PENDING and FL_DOWNGRADE_PENDING flags.
To get this right we also need to track separate downgrade and break
times, to handle the case where a write-leased file gets conflicting
opens first for read, then later for write.
(I first considered just eliminating the downgrade behavior
completely--nfsv4 doesn't need it, and nobody as far as I can tell
actually uses it currently--but Jeremy Allison tells me that Windows
oplocks do behave this way, so Samba will probably use this some day.)
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
F_INPROGRESS isn't exposed to userspace. To me it makes more sense in
fl_flags....
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The set of errors here does *not* agree with the set of errors specified
in the rfc!
While we're there, turn this macros into a function, for the usual
reasons, and move it to the one place where it's actually used.
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
With
$ grep -e UBIFS_FS_DEBUG -e DYNAMIC_DEBUG .config
# CONFIG_UBIFS_FS_DEBUG is not set
CONFIG_DYNAMIC_DEBUG=y
Debug messages are kept in the object files due to the
dynamic_pr_debug() macro, even if they are never going to be printed:
$ make fs/ubifs/super.o
$ strings fs/ubifs/super.o | grep 'compiled on'
compiled on: Aug 11 2011 at 12:21:38
Use plain printk to fix this.
Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@intel.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.1: Return NFS4ERR_BADSESSION to callbacks during session resets
NFSv4.1: Fix the callback 'highest_used_slotid' behaviour
pnfs-obj: Fix the comp_index != 0 case
pnfs-obj: Bug when we are running out of bio
nfs: add missing prefetch.h include
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: set i_size properly when fallocating and we already
btrfs: unlock on error in btrfs_file_llseek()
btrfs: btrfs_permission's RO check shouldn't apply to device nodes
Btrfs: truncate pages from clone ioctl target range
Btrfs: fix uninitialized sync_pending
Btrfs: fix wrong free space information
btrfs: memory leak in btrfs_add_inode_defrag()
Btrfs: use plain page_address() in header fields setget functions
Btrfs: forced readonly when btrfs_drop_snapshot() fails
Btrfs: check if there is enough space for balancing smarter
Btrfs: fix a bug of balance on full multi-disk partitions
Btrfs: fix an oops of log replay
Btrfs: detect wether a device supports discard
Btrfs: force unplugs when switching from high to regular priority bios
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
update cifs version to 1.75
[CIFS] possible memory corruption on mount
cifs: demote cERROR in build_path_from_dentry to cFYI
CIFS cleanup_volume_info_contents() looks like having a memory
corruption problem.
When UNCip is set to "&vol->UNC[2]" in cifs_parse_mount_options(), it
should not be kfree()-ed in cleanup_volume_info_contents().
Introduced in commit b946845a9d
Signed-off-by: J.R. Okajima <hooanon05@yahoo.co.jp>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
xfstests exposed a problem with preallocate when it fallocates a range that
already has an extent. We don't set the new i_size properly because we see that
we already have an extent. This isn't right and we should update i_size if the
space already exists. With this patch we now pass xfstests 075. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There were some unlocks on error missing in a recent patch to
btrfs_file_llseek().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch tightens the read-only access checks in btrfs_permission to
match the constraints in inode_permission. Currently, even though the
device node itself will be unmodified, read-write access to device nodes
is denied to when the device node resides on a read-only subvolume or a
is a file that has been marked read-only by the btrfs conversion utility.
With this patch applied, the check only affects regular files,
directories, and symlinks. It also restructures the code a bit so that
we don't duplicate the MAY_WRITE check for both tests.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
FAT16 support maximum 4GB vol/file size with 64KB cluster size.
Win NT/XP/7 increased the maximum cluster size to 64KB, and file/vol
size increased 4GB also. Although increasing, the file size of linux
FAT is still limited at 2GB.
I found that it is limited by sb->maxbytes(0x7fffffff) when partition
is formatted by FAT16. sb->s_maxbytes in fill_super should be set to
0xffffffff like fat32.
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
The fat_msg function already formats the given message and appends
a newline to it - we don't need to do this in the passed message
string as well, or will end up with a blank line printed in the
kernel log ring buffer.
Also change the loglevel from error to warning.
Signed-off-by: Mihai Moldovan <ionic@ionic.de>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
This fixes a compile warning (unititialized variable) in
the fat filesystem code.
Signed-off-by: Jonas Aberg <jonas.aberg@stericsson.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
For a long time now orlov is the default block allocator in the ext3. It
performs better than the old one and no one seems to claim otherwise so
we can safely drop it and make oldalloc and orlov mount option
deprecated.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Delete nontrivial initialization that is immediately overwritten by the
result of an allocation function.
The semantic match that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
identifier i;
expression e;
@@
(
T i = \(0\|NULL\|ERR_PTR(...)\);
|
-T i = e;
+T i;
)
... when != i
i = \(kzalloc\|kcalloc\|kmalloc\)(...);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
Delete nontrivial initialization that is immediately overwritten by the
result of an allocation function.
The semantic match that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
identifier i;
expression e;
@@
(
T i = \(0\|NULL\|ERR_PTR(...)\);
|
-T i = e;
+T i;
)
... when != i
i = \(kzalloc\|kcalloc\|kmalloc\)(...);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
If there are some inodes in orphan list while a filesystem is being
read-only mounted, we should recommend that peole umount and then
mount it when they try to remount with read-write. But the current
message and comment recommend that they umount and then remount it
which may be slightly misleading.
Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We need to truncate page cache pages for the clone ioctl target range or
else we'll confuse ourselves to no end. If the old data was cached, we
used to still see it (until remount). If the page was partially updated
we used to get a mix of old and new data.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
sync_pending is uninitialized before it be used, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs subtracted the size of the allocated space twice when it allocated
the space from the bitmap in the cluster, it broke the free space information
and led to oops finally.
And this patch also fixes the bug that ctl->free_space was subtracted
without lock.
Reported-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The filesystem turns readonly instead of returning the error to the
caller when detected error in btrfs_drop_snapshot().
and, because the caller doesn't check the error, the function type is
changed to 'void'.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When checking if there is enough space for balancing a block group,
since we do not take raid types into consideration, we do not account
corrent amounts of space that we needed. This makes us do some extra
work before we get ENOSPC.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs recovers from a crash, it may hit the oops below:
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:4580!
[...]
RIP: 0010:[<ffffffffa03df251>] [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs]
[...]
Call Trace:
[<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
[<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs]
[<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs]
[<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs]
[<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs]
[<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs]
[<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
[<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs]
[...]
This comes from that while replaying an inode ref item, we forget to
check those old conflicting DIR_ITEM and DIR_INDEX items in fs/file tree,
then we will come to conflict corners which lead to BUG_ON().
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Tested-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have a problem where if a user specifies discard but doesn't actually support
it we will return EOPNOTSUPP from btrfs_discard_extent. This is a problem
because this gets called (in a fashion) from the tree log recovery code, which
has a nice little BUG_ON(ret) after it, which causes us to fail the tree log
replay. So instead detect wether our devices support discard when we're adding
them and then don't issue discards if we know that the device doesn't support
it. And just for good measure set ret = 0 in btrfs_issue_discard just in case
we still get EOPNOTSUPP so we don't screw anybody up like this again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fan Yong <yong.fan@whamcloud.com> noticed setting
FMODE_32bithash wouldn't work with nfsd v4, as
nfsd4_readdir() checks for 32 bit cookies. However, according to RFC 3530
cookies have a 64 bit type and cookies are also defined as u64 in
'struct nfsd4_readdir'. So remove the test for >32-bit values.
Cc: stable@kernel.org
Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
pstore was using mutex locking to protect read/write access to the
backend plug-ins. This causes problems when pstore is executed in
an NMI context through panic() -> kmsg_dump().
This patch changes the mutex to a spin_lock_irqsave then also checks to
see if we are in an NMI context. If we are in an NMI and can't get the
lock, just print a message stating that and blow by the locking.
All this is probably a hack around the bigger locking problem but it
solves my current situation of trying to sleep in an NMI context.
Tested by loading the lkdtm module and executing a HARDLOCKUP which
will cause the machine to panic inside the nmi handler.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Life is simple for all the kernel terminating types of kmsg_dump
call backs - pstore just saves the tail end of the console log. But
for "oops" the situation is more complex - the kernel may carry on
running (possibly for ever). So we'd like to make the logged copy
of the oops appear in the pstore filesystem - so that the user has
a handle to clear the entry from the persistent backing store (if
we don't, the store may fill with "oops" entries (that are also
safely stashed in /var/log/messages) leaving no space for real
errors.
Current code calls pstore_mkfile() immediately. But this may
not be safe. The oops could have happened with arbitrary locks
held, or in interrupt or NMI context. So allocating memory and
calling into generic filesystem code seems unwise.
This patch defers making the entry appear. At the time
of the oops, we merely set a flag "pstore_new_entry" noting that
a new entry has been added. A periodic timer checks once a minute
to see if the flag is set - if so, it schedules a work queue to
rescan the backing store and make all new entries appear in the
pstore filesystem.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Running the cthon tests on a recent kernel caused this message to pop
occasionally:
CIFS VFS: did not end path lookup where expected namelen is 0
Some added debugging showed that namelen and dfsplen were both 0 when
this occurred. That means that the read_seqretry returned true.
Assuming that the comment inside the if statement is true, this should
be harmless and just means that we raced with a rename. If that is the
case, then there's no need for alarm and we can demote this to cFYI.
While we're at it, print the dfsplen too so that we can see what
happened here if the message pops during debugging.
Cc: stable@kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
A 'path' consists of a starting ino and relative component. Encode even
when there is no relative component. This is primarily needed by the
NFS reexport code.
Signed-off-by: Sage Weil <sage@newdream.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: Do not set cifs/ntfs acl using a file handle (try #4)
[CIFS] Cleanup use of CONFIG_CIFS_STATS2 ifdef to make transport routines more readable
Bug discovered by Jan Kara:
Finally, commit 1449032be1 returned back
the old IO submission code but apparently it forgot to return the old
handling of uninitialized buffers so we unconditionnaly call
block_write_full_page() without specifying end_io function. So AFAICS
we never convert unwritten extents to written in some cases. For
example when I mount the fs as: mount -t ext4 -o
nomblk_io_submit,dioread_nolock /dev/ubdb /mnt and do
int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
char buf[1024];
memset(buf, 'a', sizeof(buf));
fallocate(fd, 0, 0, 16384);
write(fd, buf, sizeof(buf));
I get a file full of zeros (after remounting the filesystem so that
pagecache is dropped) instead of seeing the first KB contain 'a's.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten
should be done simultaneously since ext4_end_io_nolock always clear
the flag and decrease the counter in the same time.
We don't increase i_aiodio_unwritten when setting
EXT4_IO_END_UNWRITTEN so it will go nagative and causes some process
to wait forever.
Part of the patch came from Eric in his e-mail, but it doesn't fix the
problem met by Michael actually.
http://marc.info/?l=linux-ext4&m=131316851417460&w=2
Reported-and-Tested-by: Michael Tokarev<mjt@tls.msk.ru>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Flush inode's i_completed_io_list before calling ext4_io_wait to
prevent the following deadlock scenario: A page fault happens while
some process is writing inode A. During page fault,
shrink_icache_memory is called that in turn evicts another inode
B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.
Also moves ext4_flush_completed_IO and ext4_ioend_wait from
ext4_destroy_inode() to ext4_evict_inode(). During inode deleteion,
ext4_evict_inode() is called before ext4_destroy_inode() and in
ext4_evict_inode(), we may call ext4_truncate() without holding
i_mutex lock. As a result, there is a race between flush_completed_IO
that is called from ext4_ext_truncate() and ext4_end_io_work, which
may cause corruption on an io_end structure. This change moves
ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode()
to ext4_evict_inode() to resolve the race between ext4_truncate() and
ext4_end_io_work during inode deletion.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
ext4_should_writeback_data() had an incorrect sequence of
tests to determine if it should return 0 or 1: in
particular, even in no-journal mode, 0 was being returned
for a non-regular-file inode.
This meant that, in non-journal mode, we would use
ext4_journalled_aops for directories, symlinks, and other
non-regular files. However, calling journalled aop
callbacks when there is no valid handle, can cause problems.
This would cause a kernel crash with Jan Kara's commit
2d859db3e4 ("ext4: fix data corruption in inodes with
journalled data"), because we now dereference 'handle' in
ext4_journalled_write_end().
I also added BUG_ONs to check for a valid handle in the
obviously journal-only aops callbacks.
I tested this running xfstests with a scratch device in
these modes:
- no-journal
- data=ordered
- data=writeback
- data=journal
All work fine; the data=journal run has many failures and a
crash in xfstests 074, but this is no different from a
vanilla kernel.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: replace xfs_buf_geterror() with bp->b_error
xfs: Check the return value of xfs_buf_read() for NULL
"xfs: fix error handling for synchronous writes" revisited
xfs: set cursor in xfs_ail_splice() even when AIL was empty
xfs: Remove the macro XFS_BUFTARG_NAME
xfs: Remove the macro XFS_BUF_TARGET
xfs: Remove the macro XFS_BUF_SET_TARGET
Replace the macro XFS_BUF_ISPINNED with helper xfs_buf_ispinned
xfs: Remove the macro XFS_BUF_SET_PTR
xfs: Remove the macro XFS_BUF_PTR
xfs: Remove macro XFS_BUF_SET_START
xfs: Remove macro XFS_BUF_HOLD
xfs: Remove macro XFS_BUF_BUSY and family
xfs: Remove the macro XFS_BUF_ERROR and family
xfs: Remove the macro XFS_BUF_BFLAGS
Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
annoying subdirectories in the XFS source code. Besides the large
amount of file rename the only changes are to the Makefile, a few
files including headers with the subdirectory prefix, and the binary
sysctl compat code that includes a header under fs/xfs/ from
kernel/.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Fix up some #include directives in preparation for moving a few
header files out of xfs source subdirectories.
Note that "xfs_linux.h" also got its quoting convention for included
files switched.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Since we just checked bp for NULL, it is ok to replace
xfs_buf_geterror() with bp->b_error in these places.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Check the return value of xfs_buf_read() for NULL and return ENOMEM
if it is NULL. This is necessary in a few spots to avoid subsequent
code blindly dereferencing the null buffer pointer.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
e1000e: increase driver version number
e1000e: alternate MAC address update
e1000e: do not disable receiver on 82574/82583
e1000e: alternate MAC address does not work on device id 0x1060
PCnet: Fix section mismatch
bnx2x: disable dcb on 578xx since not supported yet
bnx2x: properly clean indirect addresses
bnx2x: prevent race between undi_unload and load flows
bnx2x: fix select_queue when FCoE is disabled
bnx2x: init FCOE FP only once
ipv4: some rt_iif -> rt_route_iif conversions
net/bridge/netfilter/ebtables.c: use available error handling code
net/netlabel/netlabel_kapi.c: add missing cleanup code
net/irda: sh_sir: tidyup compile warning
net/irda: sh_sir: add missing header
net/irda: sh_irda: add missing header
slcan: ldisc generated skbs are received in softirq context
scm: Capture the full credentials of the scm sender
tcp: initialize variable ecn_ok in syncookies path
drivers/net/wireless/wl1251: add missing kfree
...
Local XATTR_TRUSTED_PREFIX_LEN and XATTR_SECURITY_PREFIX_LEN definitions
redefined ones in 'linux/xattr.h'. This was caused by commit 9d8f13ba3f
("security: new security_inode_init_security API adds function callback")
including 'linux/xattr.h' in 'linux/security.h'.
In file included from include/linux/security.h:39,
from include/net/sock.h:54,
from fs/cifs/cifspdu.h:25,
from fs/cifs/xattr.c:26:
This patch removes the local definitions.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Just like files-layout, blocks & objects layouts are part of the
NFS 4.1 protocol and should be automatically selected if NFS_4_1
is selected. The small problem is that these depend on other
Kernel support being present, while files only depends on NFS
itself.
This patch removes from the user choice the presence of objects
and blocks layout. But makes sure these are selected only if
the depended subsystems are present in the Kernel.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Acked-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit df5e622340 ("ext4: fix deadlock in ext4_symlink() in ENOSPC
conditions") recalculated the number of credits needed for a long
symlink, in the process of splitting it into two transactions. However,
the first credit calculation under-counted because if selinux is
enabled, credits are needed to create the selinux xattr as well.
Overrunning the reservation will result in an OOPS in
jbd2_journal_dirty_metadata() due to this assert:
J_ASSERT_JH(jh, handle->h_buffer_credits > 0);
Fix this by increasing the reservation size.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ae54870a1d ("ext3: Fix lock inversion in ext3_symlink()")
recalculated the number of credits needed for a long symlink, in the
process of splitting it into two transactions. However, the first
credit calculation under-counted because if selinux is enabled, credits
are needed to create the selinux xattr as well.
Overrunning the reservation will result in an OOPS in
journal_dirty_metadata() due to this assert:
J_ASSERT_JH(jh, handle->h_buffer_credits > 0);
Fix this by increasing the reservation size.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
check in set_user() to check for NPROC exceeding via setuid() and
similar functions.
Before the check there was a possibility to greatly exceed the allowed
number of processes by an unprivileged user if the program relied on
rlimit only. But the check created new security threat: many poorly
written programs simply don't check setuid() return code and believe it
cannot fail if executed with root privileges. So, the check is removed
in this patch because of too often privilege escalations related to
buggy programs.
The NPROC can still be enforced in the common code flow of daemons
spawning user processes. Most of daemons do fork()+setuid()+execve().
The check introduced in execve() (1) enforces the same limit as in
setuid() and (2) doesn't create similar security issues.
Neil Brown suggested to track what specific process has exceeded the
limit by setting PF_NPROC_EXCEEDED process flag. With the change only
this process would fail on execve(), and other processes' execve()
behaviour is not changed.
Solar Designer suggested to re-check whether NPROC limit is still
exceeded at the moment of execve(). If the process was sleeping for
days between set*uid() and execve(), and the NPROC counter step down
under the limit, the defered execve() failure because NPROC limit was
exceeded days ago would be unexpected. If the limit is not exceeded
anymore, we clear the flag on successful calls to execve() and fork().
The flag is also cleared on successful calls to set_user() as the limit
was exceeded for the previous user, not the current one.
Similar check was introduced in -ow patches (without the process flag).
v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set security descriptor using path name instead of a file handle.
We can't be sure that the file handle has adequate permission to
set a security descriptor (to modify DACL).
Function set_cifs_acl_by_fid() has been removed since we can't be
sure how a file was opened for writing, a valid request can fail
if the file was not opened with two above mentioned permissions.
We could have opted to add on WRITE_DAC and WRITE_OWNER permissions
to file opens and then use that file handle but adding addtional
permissions such as WRITE_DAC and WRITE_OWNER could cause an
any open to fail.
And it was incorrect to look for read file handle to set a
security descriptor anyway.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Christoph had requested that the stats related code (in
CONFIG_CIFS_STATS2) be moved into helpers to make code flow more
readable. This patch should help. For example the following
section from transport.c
spin_unlock(&GlobalMid_Lock);
atomic_inc(&ses->server->num_waiters);
wait_event(ses->server->request_q,
atomic_read(&ses->server->inFlight)
< cifs_max_pending);
atomic_dec(&ses->server->num_waiters);
spin_lock(&GlobalMid_Lock);
becomes simpler (with the patch below):
spin_unlock(&GlobalMid_Lock);
cifs_num_waiters_inc(server);
wait_event(server->request_q,
atomic_read(&server->inFlight)
< cifs_max_pending);
cifs_num_waiters_dec(server);
spin_lock(&GlobalMid_Lock);
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
PNFS_BLOCK needs BLK_DEV_DM/MD, which is not a dependency for other
pnfs layout drivers. Seperate it out so others can still build when
BLK_DEV_DM/MD is not enabled.
Also change select to depends on to avoid build failures.
Reported-and-tested-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Acked-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xfs: fix for hang during synchronous buffer write error
If removed storage while synchronous buffer write underway,
"xfslogd" hangs.
Detailed log http://oss.sgi.com/archives/xfs/2011-07/msg00740.html
Related work bfc60177f8
"xfs: fix error handling for synchronous writes"
Given that xfs_bwrite actually does the shutdown already after
waiting for the b_iodone completion and given that we actually
found that calling xfs_force_shutdown from inside
xfs_buf_iodone_callbacks was a major contributor the problem
it better to drop this call.
Signed-off-by: Ajeet Yadav <ajeet.yadav.77@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Close a TOCTOU race for mounts done via ecryptfs-mount-private. The mount
source (device) can be raced when the ownership test is done in userspace.
Provide Ecryptfs a means to force the uid check at mount time.
Signed-off-by: John Johansen <john.johansen@canonical.com>
Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
In xfs_ail_splice(), if a cursor is provided it is updated to
point to the last item on the list being spliced into the AIL.
But if the AIL was found to be empty, the cursor (if provided)
is just initialized instead.
There is no reason the empty AIL case needs to be treated any
differently. And treating it the same way allows this code
to be rearranged a bit, with a somewhat tidier result.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
fs/ecryptfs/keystore.c: In function ‘ecryptfs_generate_key_packet_set’:
fs/ecryptfs/keystore.c:1991:28: warning: ‘payload_len’ may be used uninitialized in this function [-Wuninitialized]
fs/ecryptfs/keystore.c:1976:9: note: ‘payload_len’ was declared here
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch fixes the compile error reported at the address:
https://bugzilla.kernel.org/show_bug.cgi?id=40292
The problem arises when compiling eCryptfs as built-in and the 'encrypted'
key type as a module. The patch prevents this combination from being set in
the kernel configuration, by fixing the eCryptfs dependencies.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Reported-by: David Hill <hilld@binarystorm.net>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
When an eCryptfs inode's lower file has been closed, and the pointer has
been set to NULL, return an error when trying to do a lower read or
write rather than calling BUG().
https://bugzilla.kernel.org/show_bug.cgi?id=37292
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
The previous comit made the autofs4 debug printouts check types against
the printout format, and uncovered this bug:
fs/autofs4/waitq.c:106:2: warning: format ‘%08lx’ expects type ‘long unsigned int’, but argument 4 has type ‘autofs_wqt_t’
which is due to the insane type for wait_queue_token. That thing should
be some fixed well-defined size (preferably just 'unsigned int' or
'u32') but for unexplained reasons it is randomly either 'unsigned long'
or 'unsigned int' depending on the architecture.
For now, cast it to 'unsigned long' for printing, the way we do
elsewhere. Somebody else can try to explain the typedef mess.
(There's a reason we don't support excessive use of typedefs in the
kernel: it's usually just a good way of confusing yourself).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use 'pr_debug()' for DPRINTK, which will do the proper type checking on
the arguments (without generating code) even when DEBUG isn't #defined.
Also, use the standard __VA_ARGS__ for the macros, and stop the
pointless abuse of 'do { xyz } while (0)' when the macro is already a
perfectly well-formed single statement.
Reported-by: David Howells <dhowells@redhat.com>
Suggested-by: Joe Perches <joe@perches.com>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As fuse does not use the page cache library functions when userspace
writes to a file, it did not benefit from 'c8236db mm: mark page
accessed before we write_end()' that made sure pages are properly
marked accessed when written to.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Ever since 'ea9b990 fuse: implement perform_write', the .write_begin
and .write_end aops have been dead code.
Their task - acquiring a page from the page cache, sending out a write
request and releasing the page again - is now done batch-wise to
maximize the number of pages send per userspace request.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Commit a9ff4f87 "fuse: support BSD locking semantics" overlooked a
number of issues with supporing flock locks over existing POSIX
locking infrastructure:
- it's not backward compatible, passing flock(2) calls to userspace
unconditionally (if userspace sets FUSE_POSIX_LOCKS)
- it doesn't cater for the fact that flock locks are automatically
unlocked on file release
- it doesn't take into account the fact that flock exclusive locks
(write locks) don't need an fd opened for write.
The last one invalidates the original premise of the patch that flock
locks can be emulated with POSIX locks.
This patch fixes the first two issues. The last one needs to be fixed
in userspace if the filesystem assumed that a write lock will happen
only on a file operned for write (as in the case of the current fuse
library).
Reported-by: Sebastian Pipping <webmaster@hartwork.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fixes following error seen on x86_64 kernel:
ioctl32(openl2tpd:7480): Unknown cmd fd(14) cmd(80487436){t:'t';sz:72} arg(ffa7e6c0) on socket:[105094]
The argument (struct pppol2tp_ioc_stats) uses "aligned_u64" and thus doesn't need
fixups.
Cc: James Chapman <jchapman@katalix.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Al points out that the do_follow_link() helper function really is
misnamed - it's about whether we should try to follow a symlink or not,
not about actually doing the following.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 3567866bf2: "RCUify freeing acls, let check_acl() go ahead in
RCU mode if acl is cached" posix_acl_permission is being called with an
unsupported flag and the permission check fails. This patch fixes the issue.
Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
ore: Make ore its own module
exofs: Rename raid engine from exofs/ios.c => ore
exofs: ios: Move to a per inode components & device-table
exofs: Move exofs specific osd operations out of ios.c
exofs: Add offset/length to exofs_get_io_state
exofs: Fix truncate for the raid-groups case
exofs: Small cleanup of exofs_fill_super
exofs: BUG: Avoid sbi realloc
exofs: Remove pnfs-osd private definitions
nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4
The inode structure layout is largely random, and some of the vfs paths
really do care. The path lookup in particular is already quite D$
intensive, and profiles show that accessing the 'inode->i_op->xyz'
fields is quite costly.
We already optimized the dcache to not unnecessarily load the d_op
structure for members that are often NULL using the DCACHE_OP_xyz bits
in dentry->d_flags, and this does something very similar for the inode
ops that are used during pathname lookup.
It also re-orders the fields so that the fields accessed by 'stat' are
together at the beginning of the inode structure, and roughly in the
order accessed.
The effect of this seems to be in the 1-2% range for an empty kernel
"make -j" run (which is fairly kernel-intensive, mostly in filename
lookup), so it's visible. The numbers are fairly noisy, though, and
likely depend a lot on exact microarchitecture. So there's more tuning
to be done.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gcc tends to generate better code with small integers, including the
DCACHE_xyz flag tests - so move the common ones to be first in the list.
Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
DCACHE_AUTOFS_PENDING values, their users no longer exists in the source
tree.
And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
common case to be a nice straight-line fall-through.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export everything from ore need exporting. Change Kbuild and Kconfig
to build ore.ko as an independent module. Import ore from exofs
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
ORE stands for "Objects Raid Engine"
This patch is a mechanical rename of everything that was in ios.c
and its API declaration to an ore.c and an osd_ore.h header. The ore
engine will later be used by the pnfs objects layout driver.
* File ios.c => ore.c
* Declaration of types and API are moved from exofs.h to a new
osd_ore.h
* All used types are prefixed by ore_ from their exofs_ name.
* Shift includes from exofs.h to osd_ore.h so osd_ore.h is
independent, include it from exofs.h.
Other than a pure rename there are no other changes. Next patch
will move the ore into it's own module and will export the API
to be used by exofs and later the layout driver
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Exofs raid engine was saving on memory space by having a single layout-info,
single pid, and a single device-table, global to the filesystem. Then passing
a credential and object_id info at the io_state level, private for each
inode. It would also devise this contraption of rotating the device table
view for each inode->ino to spread out the device usage.
This is not compatible with the pnfs-objects standard, demanding that
each inode can have it's own layout-info, device-table, and each object
component it's own pid, oid and creds.
So: Bring exofs raid engine to be usable for generic pnfs-objects use by:
* Define an exofs_comp structure that holds obj_id and credential info.
* Break up exofs_layout struct to an exofs_components structure that holds a
possible array of exofs_comp and the array of devices + the size of the
arrays.
* Add a "comps" parameter to get_io_state() that specifies the ids creds
and device array to use for each IO.
This enables to keep the layout global, but the device-table view, creds
and IDs at the inode level. It only adds two 64bit to each inode, since
some of these members already existed in another form.
* ios raid engine now access layout-info and comps-info through the passed
pointers. Everything is pre-prepared by caller for generic access of
these structures and arrays.
At the exofs Level:
* Super block holds an exofs_components struct that holds the device
array, previously in layout. The devices there are in device-table
order. The device-array is twice bigger and repeats the device-table
twice so now each inode's device array can point to a random device
and have a round-robin view of the table, making it compatible to
previous exofs versions.
* Each inode has an exofs_components struct that is initialized at
load time, with it's own view of the device table IDs and creds.
When doing IO this gets passed to the io_state together with the
layout.
While preforming this change. Bugs where found where credentials with the
wrong IDs where used to access the different SB objects (super.c). As well
as some dead code. It was never noticed because the target we use does not
check the credentials.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
ios.c will be moving to an external library, for use by the
objects-layout-driver. Remove from it some exofs specific functions.
Also g_attr_logical_length is used both by inode.c and ios.c
move definition to the later, to keep it independent
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
In future raid code we will need to know the IO offset/length
and if it's a read or write to determine some of the array
sizes we'll need.
So add a new exofs_get_rw_state() API for use when
writeing/reading. All other simple cases are left using the
old way.
The major change to this is that now we need to call
exofs_get_io_state later at inode.c::read_exec and
inode.c::write_exec when we actually know these things. So this
patch is kept separate so I can test things apart from other
changes.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: cope with negative dentries in cifs_get_root
cifs: convert prefixpath delimiters in cifs_build_path_to_root
CIFS: Fix missing a decrement of inFlight value
cifs: demote DFS referral lookup errors to cFYI
Revert "cifs: advertise the right receive buffer size to the server"
The CLOEXE bit is magical, and for performance (and semantic) reasons we
don't actually maintain it in the file descriptor itself, but in a
separate bit array. Which means that when we show f_flags, the CLOEXE
status is shown incorrectly: we show the status not as it is now, but as
it was when the file was opened.
Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
array.
Uli needs this in order to re-implement the pfiles program:
"For normal file descriptors (not sockets) this was the last piece of
information which wasn't available. This is all part of my 'give
Solaris users no reason to not switch' effort. I intend to offer the
code to the util-linux-ng maintainers."
Requested-by: Ulrich Drepper <drepper@akkadia.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WARN_ONCE() is very annoying, in that it shows the stack trace that we
don't care about at all, and also triggers various user-level "kernel
oopsed" logic that we really don't care about. And it's not like the
user can do anything about the applications (sshd) in question, it's a
distro issue.
Requested-by: Andi Kleen <andi@firstfloor.org> (and many others)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Btrfs does bio submissions from a worker thread, and each device
has a list of high priority bios and regular priority bios.
Synchronous writes go to the high priority thread while async writes
go to regular list. This commit brings back an explicit unplug
any time we switch from high to regular priority, which makes it
easier for the block layer to give us low latencies.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The loop around lookup_one_len doesn't handle the case where it might
return a negative dentry, which can cause an oops on the next pass
through the loop. Check for that and break out of the loop with an
error of -ENOENT if there is one.
Fixes the panic reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=727927
Reported-by: TR Bentley <home@trarbentley.net>
Reported-by: Iain Arnell <iarnell@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Regression from 2.6.39...
The delimiters in the prefixpath are not being converted based on
whether posix paths are in effect. Fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=727834
Reported-and-Tested-by: Iain Arnell <iarnell@gmail.com>
Reported-by: Patrick Oltmann <patrick.oltmann@gmx.net>
Cc: Pavel Shilovsky <piastryyy@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached
get rid of boilerplate switches in posix_acl.h
fix block device fallout from ->fsync() changes
In the general raid-group case the truncate was wrong in that
it did not also fix the object length of the neighboring groups.
There are two bad cases in the old code:
1. Space that should be freed was not.
2. If a file That was big is truncated small, then made bigger
again, the holes would not contain zeros but could expose old data.
(If the growing of the file expands to more than a full
groups cycle + group size (> S + T))
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Since the beginning we realloced the sbi structure when a bigger
then one device table was specified. (I know that was really stupid).
Then much later when "register bdi" was added (By Jens) it was
registering the pointer to sbi->bdi before the realloc.
We never saw this problem because up till now the realloc did not
do anything since the device table was small enough to fit in the
original allocation. But once we starting testing with large device
tables (Bigger then 28) we noticed the crash of writeback operating
on a deallocated pointer.
* Avoid the all mess by allocating the device-table as a second array
and get rid of the variable-sized structure and the rest of this
mess.
* Take the chance to clean near by structures and comments.
* Add a needed dprint on startup to indicate the loaded layout.
* Also move the bdi registration to the very end because it will
only fail in a low memory, which will probably fail before hand.
There are many more likely causes to not load before that. This
way the error handling is made simpler. (Just doing this would be
enough to fix the BUG)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
If the client is in the process of resetting the session when it receives
a callback, then returning NFS4ERR_DELAY may cause a deadlock with the
DESTROY_SESSION call.
Basically, if the client returns NFS4ERR_DELAY in response to the
CB_SEQUENCE call, then the server is entitled to believe that the
client is busy because it is already processing that call. In that
case, the server is perfectly entitled to respond with a
NFS4ERR_BACK_CHAN_BUSY to any DESTROY_SESSION call.
Fix this by having the client reply with a NFS4ERR_BADSESSION in
response to the callback if it is resetting the session.
Cc: stable@kernel.org [2.6.38+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, there is no guarantee that we will call nfs4_cb_take_slot() even
though nfs4_callback_compound() will consistently call
nfs4_cb_free_slot() provided the cb_process_state has set the 'clp' field.
The result is that we can trigger the BUG_ON() upon the next call to
nfs4_cb_take_slot().
This patch fixes the above problem by using the slot id that was taken in
the CB_SEQUENCE operation as a flag for whether or not we need to call
nfs4_cb_free_slot().
It also fixes an atomicity problem: we need to set tbl->highest_used_slotid
atomically with the check for NFS4_SESSION_DRAINING, otherwise we end up
racing with the various tests in nfs4_begin_drain_session().
Cc: stable@kernel.org [2.6.38+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There were bugs in the case of partial layout where olo_comp_index
is not zero. This used to work and was tested but one of the later
cleanup SQUASHMEs broke it and was not tested since.
Also add a dprint that specify those received layout parameters.
Everything else was already printed.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we have a situation that the number of pages we want
to encode is bigger then the size of the bio. (Which can
currently happen only when all IO is going to a single device
.e.g group_width==1) then the IO is submitted short and we
report back only the amount of bytes we actually wrote/read
and all is fine. BUT ...
There was a bug that the current length counter was advanced
before the fail to add the extra page, and we come to a situation
that the CDB length was one-page longer then the actual bio size,
which is of course rejected by the osd-target.
While here also fix the bio size calculation, in the case
that we received more then one group of devices.
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix this compile error on s390:
CC [M] fs/nfs/blocklayout/blocklayout.o
fs/nfs/blocklayout/blocklayout.c: In function 'bl_end_io_read':
fs/nfs/blocklayout/blocklayout.c:201:4: error: implicit declaration of function 'prefetchw'
Introduced with 9549ec01 "pnfsblock: bl_read_pagelist".
Cc: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Expand the fs/Kconfig "help" info to clarify why it's a bad idea to
deselect the TMPFS_POSIX_ACL config variable.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove PageSwapBacked (!page_is_file_cache) cases from
add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
go through shmem_add_to_page_cache().
Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
and add a comment on swap entries to invalidate_mapping_pages().
And mincore_page() uses find_get_page() on what might be shmem or a
tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix new kernel-doc warning in fs/dcache.c:
Warning(fs/dcache.c:797): No description found for parameter 'sb'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
if we failed on getting mid entry in cifs_call_async.
Cc: stable@kernel.org
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 9933fc0i (ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and
ext4_kvfree()) intruduced wrappers around k*alloc/vmalloc but introduced
a typo for ext4_kzalloc() by not using kzalloc() but kmalloc().
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
Btrfs: don't call writepages from within write_full_page
Btrfs: Remove unused variable 'last_index' in file.c
Btrfs: clean up for find_first_extent_bit()
Btrfs: clean up for wait_extent_bit()
Btrfs: clean up for insert_state()
Btrfs: remove unused members from struct extent_state
Btrfs: clean up code for merging extent maps
Btrfs: clean up code for extent_map lookup
Btrfs: clean up search_extent_mapping()
Btrfs: remove redundant code for dir item lookup
Btrfs: make acl functions really no-op if acl is not enabled
Btrfs: remove remaining ref-cache code
Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
Btrfs: use wait_event()
Btrfs: check the nodatasum flag when writing compressed files
Btrfs: copy string correctly in INO_LOOKUP ioctl
Btrfs: don't print the leaf if we had an error
btrfs: make btrfs_set_root_node void
Btrfs: fix oops while writing data to SSD partitions
Btrfs: Protect the readonly flag of block group
...
Fix up trivial conflicts (due to acl and writeback cleanups) in
- fs/btrfs/acl.c
- fs/btrfs/ctree.h
- fs/btrfs/extent_io.c
cifs: demote DFS referral lookup errors to cFYI
Now that we call into this routine on every mount, anyone who doesn't
have the upcall configured will get multiple printks about failed lookups.
Reported-and-Tested-by: Martijn Uffing <mp3project@sarijopen.student.utwente.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits)
ext4: prevent memory leaks from ext4_mb_init_backend() on error path
ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode #
ext4: use ext4_msg() instead of printk in mballoc
ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
ext4: use the correct error exit path in ext4_init_inode_table()
ext4: add missing kfree() on error return path in add_new_gdb()
ext4: change umode_t in tracepoint headers to be an explicit __u16
ext4: fix races in ext4_sync_parent()
ext4: Fix overflow caused by missing cast in ext4_fallocate()
ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole
ext4: simplify parameters of reserve_backup_gdb()
ext4: simplify parameters of add_new_gdb()
ext4: remove lock_buffer in bclean() and setup_new_group_blocks()
ext4: simplify journal handling in setup_new_group_blocks()
ext4: let setup_new_group_blocks() set multiple bits at a time
ext4: fix a typo in ext4_group_extend()
ext4: let ext4_group_add_blocks() handle 0 blocks quickly
ext4: let ext4_group_add_blocks() return an error code
ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks()
...
Fix up conflict in fs/ext4/inode.c: commit aacfc19c62 ("fs: simplify
the blockdev_direct_IO prototype") had changed the ext4_ind_direct_IO()
function for the new simplified calling convention, while commit
dae1e52cb1 ("ext4: move ext4_ind_* functions from inode.c to
indirect.c") moved the function to another file.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set
VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock
VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree()
VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree()
switch posix_acl_chmod() to umode_t
switch posix_acl_from_mode() to umode_t
switch posix_acl_equiv_mode() to umode_t *
switch posix_acl_create() to umode_t *
block: initialise bd_super in bdget()
vfs: avoid call to inode_lru_list_del() if possible
vfs: avoid taking inode_hash_lock on pipes and sockets
vfs: conditionally call inode_wb_list_del()
VFS: Fix automount for negative autofs dentries
Btrfs: load the key from the dir item in readdir into a fake dentry
devtmpfs: missing initialialization in never-hit case
hppfs: missing include
* 'pstore-efi' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
efivars: Introduce PSTORE_EFI_ATTRIBUTES
efivars: Use string functions in pstore_write
efivars: introduce utf16_strncmp
efivars: String functions
efi: Add support for using efivars as a pstore backend
pstore: Allow the user to explicitly choose a backend
pstore: Make "part" unsigned
pstore: Add extra context for writes and erases
pstore: Extend API for more flexibility in new backends
In ext4_mb_init(), if the s_locality_group allocation fails it will
currently cause the allocations made in ext4_mb_init_backend() to
be leaked. Moving the ext4_mb_init_backend() allocation after the
s_locality_group allocation avoids that problem.
Signed-off-by: Yu Jian <yujian@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When doing a writepage we call writepages to try and write out any other dirty
pages in the area. This could cause problems where we commit a transaction and
then have somebody else dirtying metadata in the area as we could end up writing
out a lot more than we care about, which could cause latency on anybody who is
waiting for the transaction to completely finish committing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The variable 'last_index' is calculated in the __btrfs_buffered_write
function and passed as a parameter to the prepare_pages function,
but is not used anywhere in the prepare_pages function.
Remove instances of 'last_index' in these functions.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_first_extent_bit() and find_first_extent_bit_state() share
most of the code, and we can just make the former call the latter.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can just use cond_resched_lock().
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These members are not used at all.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
unpin_extent_cache() and add_extent_mapping() shares the same code
that merges extent maps.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
lookup_extent_map() and search_extent_map() can share most of code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
rb_node returned by __tree_search() can be a valid pointer or NULL,
but won't be some errno.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we search a dir item with a specific hash code, we can
just return NULL without further checking if btrfs_search_slot()
returns 1.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since commit f2a97a9dbd
("btrfs: remove all unused functions"), there's no extern functions
at all in ref-cache.c, so just remove the remaining dead code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use wait_event() when possible to avoid code duplication.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If mounting with nodatasum option, we won't csum file data for
general write or direct-io write, and this rule should also be
applied when writing compressed files.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Memory areas [ptr, ptr+total_len] and [name, name+total_len]
may overlap, so it's wrong to use memcpy().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In __btrfs_free_extent we will print the leaf if we fail to find the extent we
wanted, but the problem is if we get an error we won't have a leaf so often this
leads to a NULL pointer dereference and we lose the error that actually
occurred. So only print the leaf if ret > 0, which means we didn't find the
item we were looking for but we didn't error either. This way the error is
preserved.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is fairly trivial - btrfs_set_root_node() - always returns zero so we
can just make it void. All callers ignore the return code now anyway. I
also made sure to check that none of the functions that
btrfs_set_root_node() calls returns an error that we might have needed to
catch and pass back.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Here I have a two SSD-partitions btrfs, and they are defaultly set to
"data=raid0, metadata=raid1", then I try to fill my btrfs partition
till "No space left on device", via "dd if=/dev/zero of=/mnt/btrfs/tmp".
I get an oops panic from kernel BUG at fs/btrfs/extent-tree.c:5199!, which
refers to find_free_extent's
BUG_ON(index != get_block_group_index(block_group));
In SSD mode, in order to find enough space to alloc, we may check the
block_group cache which has been checked sometime before, but the index is not
updated, where it hits the BUG_ON.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Acked-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The access for ro in btrfs_block_group_cache should be protected
because of the racy lock in relocation.
Signed-off-by: Wu Bo <wu.bo@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The set/clear bit and the extent split/merge hooks only ever return 0.
Changing them to return void simplifies the error handling cases later.
This patch changes the hook prototypes, the single implementation of each,
and the functions that call them to return void instead.
Since all four of these hooks execute under a spinlock, they're necessarily
simple.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We passed the wrong value to btrfs_force_ra(). Fix this by changing
the argument of btrfs_force_ra() from last_index to nr_page.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs_unlink_inode() and btrfs_orphan_add() in btrfs_unlink()
are error, the error code is returned to the caller instead of
BUG_ON().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Don't need to check the return value of __btrfs_add_inode_defrag(),
since it will always return 0.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a race during unmount. We need to not only make sure that
the journal is completely written, but that the metadata changes make
it to disk before releasing ipimap and ipbmap.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
CIFS: Cleanup demupltiplex thread exiting code
CIFS: Move mid search to a separate function
CIFS: Move RFC1002 check to a separate function
CIFS: Simplify socket reading in demultiplex thread
CIFS: Move buffer allocation to a separate function
cifs: remove unneeded variable initialization in cifs_reconnect_tcon
cifs: simplify refcounting for oplock breaks
cifs: fix compiler warning in CIFSSMBQAllEAs
cifs: fix name parsing in CIFSSMBQAllEAs
cifs: don't start signing too early
cifs: trivial: goto out here is unnecessary
cifs: advertise the right receive buffer size to the server
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Move reading to separate function and remove csocket variable.
Also change semantic in a little: goto incomplete_rcv only when
we get -EAGAIN (or a familiar error) while reading rfc1002 header.
In this case we don't check for echo timeout when we don't get whole
header at once, as it was before.
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Introduce new helper functions which try kmalloc, and then fall back
to vmalloc if necessary, and use them for allocating and deallocating
s_flex_groups.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This patch lets ext4_init_inode_table() handle errors right.
ext4_init_inode_table() should down_write() alloc_sem which
has been up_write()ed and stop the started journal handle.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
commit 4e34e719e4, that takes the ACL checks to common code,
accidentely broke the build when CONFIG_FS_POSIX_ACL is not set:
CC fs/xfs/linux-2.6/xfs_iops.o
fs/xfs/linux-2.6/xfs_iops.c:1025:14: error: ‘xfs_get_acl’ undeclared here (not in a function)
Fix this by declaring xfs_get_acl a static inline function.
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reorganise shrink_dcache_for_umount_subtree() in light of the demise of
dcache_lock. Without that dcache_lock, there is no need for the batching of
removal of dentries from the system under it (we wanted to make intensive use
of the locked data whilst we held it, but didn't want to hold it for long at a
time).
This works, provided the preceding patch is correct in its removal of locking
on dentry->d_lock on the basis that no one should be locking these dentries any
more as the whole superblock is defunct.
With this patch, the calls to dentry_lru_del() and __d_shrink() are placed at
the point where each dentry is detached handled.
It is possible that, as an alternative, the batching should still be done -
but only for dentry_lru_del() of all a dentry's children in one go. In such a
case, the batching would be done under dcache_lru_lock.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Locks of the dcache_lock were replaced by locks of dentry->d_lock in commits
such as:
23044507832fd6b7f507
as part of the RCU-based pathwalk changes, despite the fact that the caller
(shrink_dcache_for_umount()) notes in the banner comment the reasons that
d_lock is not necessary in these functions:
/*
* destroy the dentries attached to a superblock on unmounting
* - we don't need to use dentry->d_lock because:
* - the superblock is detached from all mountings and open files, so the
* dentry trees will not be rearranged by the VFS
* - s_umount is write-locked, so the memory pressure shrinker will ignore
* any dentries belonging to this superblock that it comes across
* - the filesystem itself is no longer permitted to rearrange the dentries
* in this superblock
*/
So remove these locks. If the locks are actually necessary, then this banner
comment should be altered instead.
The hash table chains are protected by 1-bit locks in the hash table heads, so
those shouldn't be a problem.
Note that to make this work, __d_drop() has to be split so that the RCUwalk
barrier can be avoided. This causes problems otherwise as it has an assertion
that dentry->d_lock is locked - but there is no need for that as no one else
can be trying to access this dentry, except to step over it (and that should
be handled by d_free(), I think).
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the detached-dentry counter from shrink_dcache_for_umount_subtree() as
the value it computes is no longer used as of commit
312d3ca856 which made the nr_dentry counters
summed per-CPU rather than global atomic.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
bd_super is currently reset to NULL in kill_block_super() so we rely on previous
users of the block_device object to initialise this value for the next user.
This quirk was exposed on RHEL5 when a third party filesystem did not always use
kill_block_super() and therefore bd_super wasn't being reset when a block_device
object was recycled within the cache. This may not be a problem upstream but
makes sense to be defensive.
Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
inode_lru_list_del() is expensive because of per superblock lru locking,
while some inodes are not in lru list.
Adding a check in iput_final() can speedup pipe/sockets workloads on
SMP.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some inodes (pipes, sockets, ...) are not hashed, no need to take
contended inode_hash_lock at dismantle time.
nice speedup on SMP machines on socket intensive workloads.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some inodes (pipes, sockets, ...) are not in bdi writeback list.
evict() can avoid calling inode_wb_list_del() and its expensive spinlock
by checking inode i_wb_list being empty or not.
At this point, no other cpu/user can concurrently manipulate this inode
i_wb_list
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Autofs may set the DCACHE_NEED_AUTOMOUNT flag on negative dentries. These
need attention from the automounter daemon regardless of the LOOKUP_FOLLOW flag.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In btrfs we have 2 indexes for inodes. One is for readdir, it's in this nice
sequential order and works out brilliantly for readdir. However if you use ls,
it usually stat's each file it gets from readdir. This is where the second
index comes in, which is based on a hash of the name of the file. So then the
lookup has to lookup this index, and then lookup the inode. The index lookup is
going to be in random order (since its based on the name hash), which gives us
less than stellar performance. Since we know the inode location from the
readdir index, I create a dummy dentry and copy the location key into
dentry->d_fsdata. Then on lookup if we have d_fsdata we use that location to
lookup the inode, avoiding looking up the other directory index. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, we take a sb->s_active reference and a cifsFileInfo reference
when an oplock break workqueue job is queued. This is unnecessary and
more complicated than it needs to be. Also as Al points out,
deactivate_super has non-trivial locking implications so it's best to
avoid that if we can.
Instead, just cancel any pending oplock breaks for this filehandle
synchronously in cifsFileInfo_put after taking it off the lists.
That should ensure that this job doesn't outlive the structures it
depends on.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The recent fix to the above function causes this compiler warning to pop
on some gcc versions:
CC [M] fs/cifs/cifssmb.o
fs/cifs/cifssmb.c: In function ‘CIFSSMBQAllEAs’:
fs/cifs/cifssmb.c:5708: warning: ‘ea_name_len’ may be used uninitialized in
this function
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The code that matches EA names in CIFSSMBQAllEAs is incorrect. It
uses strncmp to do the comparison with the length limited to the
name_len sent in the response.
Problem: Suppose we're looking for an attribute named "foobar" and
have an attribute before it in the EA list named "foo". The
comparison will succeed since we're only looking at the first 3
characters. Fix this by also comparing the length of the provided
ea_name with the name_len in the response. If they're not equal then
it shouldn't match.
Reported-by: Jian Li <jiali@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Sniffing traffic on the wire shows that windows clients send a zeroed
out signature field in a NEGOTIATE request, and send "BSRSPYL" in the
signature field during SESSION_SETUP. Make the cifs client behave the
same way.
It doesn't seem to make much difference in any server that I've tested
against, but it's probably best to follow windows behavior as closely as
possible here.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, we mirror the same size back to the server that it sends us.
That makes little sense. Instead we should be sending the server the
maximum buffer size that we can handle -- CIFSMaxBufSize minus the
4 byte RFC1001 header.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
For invalid extents, find other pages in the same fsblock and write them out.
[pnfsblock: write_begin]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note: When upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.
When upper layer re-send the read/write request to finish the rest
part of the request, pgbase is the position where we should start at.
[pnfsblock: bl_write_pagelist support functions]
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
[SQUASHME: pnfsblock: mds_offset is set in the generic layer]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fixup blksize alignment in bl_setup_layoutcommit]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note: When upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.
When upper layer re-send the read/write request to finish the rest
part of the request, pgbase is the position where we should start at.
[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: read path error handling]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new read_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In blocklayout driver. There are two things happening
while layoutcommit/cleanup.
1. the modified extents are encoded.
2. On cleanup the extents are put back on the layout rw
extents list, for reads.
In the new system where actual xdr encoding is done in
encode_layoutcommit() directly into xdr buffer, these are
the new commit stages:
1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit && bl_cleanup_layoutcommit
(Generic layer provides a void-star to hang it on)
2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on above structure.
FIXME: The code is not yet converted to the new XDR cleanup
3. On cleanup the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. And the commit-extent-list is freed.
Finally allocated structure is freed.
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()]
Signed-off-by: Jim Rees <rees@umich.edu>
[pnfsblock: introduce bl_committing list]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: cleanup_layoutcommit wants a status parameter]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In blocklayout driver. There are two things happening
while layoutcommit/cleanup.
1. the modified extents are encoded.
2. On cleanup the extents are put back on the layout rw
extents list, for reads.
In the new system where actual xdr encoding is done in
encode_layoutcommit() directly into xdr buffer, these are
the new commit stages:
1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit && bl_cleanup_layoutcommit
(Generic layer provides a void-star to hang it on)
2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on above structure.
FIXME: The code is not yet converted to the new XDR cleanup
3. On cleanup the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. And the commit-extent-list is freed.
Finally allocated structure is freed.
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()]
[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: prevent commit list corruption]
[pnfsblock: fix layoutcommit with an empty opaque]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Adds working implementations of various support functions
to handle INVAL extents, needed by writes, such as
bl_mark_sectors_init and bl_is_sector_init.
[pnfsblock: fix 64-bit compiler warnings for extent manipulation]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[Implement release_inval_marks]
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implement bl_find_get_extent(), one of the core extent manipulation
routines.
[pnfsblock: Lookup list entry of layouts and tags in reverse order]
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Jim Rees <rees@umich.edu>
pnfsblock: fix print format warnings for sector_t and size_t
gcc spews warnings about these on x86_64, e.g.:
fs/nfs/blocklayout/blocklayout.c:74: warning: format ‘%Lu’ expects type ‘long long unsigned int’, but argument 2 has type ‘sector_t’
fs/nfs/blocklayout/blocklayout.c:388: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
XDR decodes the block layout payload sent in LAYOUTGET result, storing
the result in an extent list.
[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>