all remaining callers pass LOOKUP_PARENT to it, so
flags argument can die; renamed to kern_path_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The previous patch missed a couple of places where the AIL list
needed locking, so this fixes up those places, plus a comment
is corrected too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Fix for a dumb preadv()/pwritev() compat bug - unlike the native
variants, the compat_... ones forget to check FMODE_P{READ,WRITE}, so
e.g. on pipe the native preadv() will fail with -ESPIPE and compat one
will act as readv() and succeed.
Not critical, but it's a clear bug with trivial fix, so IMO it's OK for
-final.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix for a dumb preadv()/pwritev() compat bug - unlike the native
variants, compat_... ones forget to check FMODE_P{READ,WRITE}, so e.g.
on pipe the native preadv() will fail with -ESPIPE and compat one will
act as readv() and succeed. Not critical, but it's a clear bug with trivial
fix.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: break out of shrink_delalloc earlier
btrfs: fix not enough reserved space
btrfs: fix dip leak
Btrfs: make sure not to return overlapping extents to fiemap
Btrfs: deal with short returns from copy_from_user
Btrfs: fix regressions in copy_from_user handling
Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.
But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations. The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made. This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.
The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.
This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down. This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.
If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.
Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full. The
other writers are able to continue until we get 100%.
This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The new xfs_alert_tag() used a variable named "panic",
and that is to be avoided. Rename it.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Factor out some cut-and-paste code in options parsing.
Saves about 800 bytes on x86-64.
Signed-off-by: Rob Landley <rlandley@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eliminate two mostly duplicate functions (nfs_parse_simple_hostname()
and nfs_parse_protected_hostname()) and instead just make the calling
function (nfs_parse_devname()) do everything.
Signed-off-by: Rob Landley <rlandley@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Account NFS direct-io reads and writes into Task I/O Accounting.
Do it before complition to handle aio.
NFS have unusual direct-io implementation,
thus accounting in generic code does not work.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The new behaviour is enabled using the new module parameter
'nfs4_disable_idmapping'.
Note that if the server rejects an unmapped uid or gid, then
the client will automatically switch back to using the idmapper.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This will be required in order to switch uid/gid mapping back on if the
admin has tried to disable it.
Note that we also propagate NFS4ERR_BADNAME at the same time, in order to
work around a Linux server bug.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allowing stripe_unit==0 causes the client to crash later on
when dividing by zero.
Reported-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we have access to the pointer, clear it immediately after
the put, instead of in caller.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This will make it possible to clear the lseg pointer in the same
function as it is put, instead of in the caller nfs_pageio_doio().
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allows the pnfs filelayout driver to write to the data servers.
Note that COMMIT to data servers will be implemented in a future
patch. To avoid improper behavior, for the moment any WRITE to a data
server that would also require a COMMIT to the data server is sent
NFS_FILE_SYNC.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Any WRITE compound directed to a data server needs to have the
GETATTR calls suppressed.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We grab the lseg sent in from the doio function and attach it to
each struct nfs_write_data created. This is how the lseg will be
sent to the layout driver.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add callback that pnfs layout driver can use to do its own handling
of data server WRITE response.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reorder nfs_write_rpcsetup, preparing for a pnfs entry point.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If a data server is unavailable, go through MDS.
Mark the deviceid containing the data server as a negative cache entry.
Do not try to connect to any data server on a deviceid marked as a negative
cache entry. Mark any layout that tries to use the marked deviceid as failed.
Inodes with a layout marked as fails will not use the layout for I/O, and will
not perform any more layoutgets.
Inodes without a layout will still do layoutget, but the layout will get
marked immediately.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
No need for generic cache with only one user.
Keep a simple hash of deviceids in the filelayout driver.
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use our own async error handler.
Mark the layout as failed and retry i/o through the MDS on specified errors.
Update the mds_offset in nfs_readpage_retry so that a failed short-read retry
to a DS gets correctly resent through the MDS.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Attempt a pNFS file layout read by setting up the nfs_read_data struct and
calling nfs_initiate_read with the data server rpc client and the
filelayout rpc call ops.
Error handling is implemented in a subsequent patch.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Tested-by: Guo Mingyang <guomingyang@nrchpc.ac.cn>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Prepare for filelayout_read_pagelist with helper functions that find the correct
data server, filehandle, and offset.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Tigran Mkrtchyan <tigran@anahit.desy.de>
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce a data server set_client and init session following the
nfs4_set_client and nfs4_init_session convention.
Once a new nfs_client is on the nfs_client_list, the nfs_client cl_cons_state
serializes access to creating an nfs_client struct with matching properties.
Use the new nfs_get_client() that initializes new clients.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Separate the rpc run portion of nfs_read_rpcsetup into a new function
nfs_initiate_read that is called for normal NFS I/O.
Add a pNFS read_pagelist function that is called instead of nfs_intitate_read
for pNFS reads.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the pnfs_update_layout call location to nfs_pageio_do_add_request().
Grab the lseg sent in the doio function to nfs_read_rpcsetup and attach
it to each nfs_read_data so it can be sent to the layout driver.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a pg_test layout driver hook which is used to avoid coelescing I/O across
layout stripes.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Prepare put_lseg and get_lseg to be called from the pNFS I/O code.
Pull common code from pnfs_lseg_locked to call from pnfs_lseg.
Inline pnfs_lseg_locked into it's only caller.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Data servers cannot send nfs4_proc_get_lease_time. but still need to setup
state renewal. Add the NFS_CS_CHECK_LEASE_TIME bit to indicate if the lease
time can be checked.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Data servers not sharing a session with the mount MDS always have an empty
cl_superblocks list.
Replace the cl_superblocks empty list check to see if it is time to shut down
renewd with the NFS_CS_STOP_RENEW bit which is not set by such a data server.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Data servers require a zero stateid seqid, and there is no advantage to not
doing the same for all NFSv4.1
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now nfs_get_client returns an nfs_client ready to be used no matter if it was
found or created.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Prevents an Oops triggered by CB_LAYOUTRECALL and LAYOUTGET race on a
pnfs_layout_hdr first pnfs_layout_segment.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The return values are not used by any callers.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The code was doing nothing more in either branch of the if.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The pnfs code was using throughout the lock order i_lock, cl_lock.
This conflicts with the nfs delegation code. Rework the pnfs code
to avoid taking both locks simultaneously.
Currently the code takes the double lock to add/remove the layout to a
nfs_client list, while atomically checking that the list of lsegs is
empty. To avoid this, we rely on existing serializations. When a
layout is initialized with lseg count equal zero, LAYOUTGET's
openstateid serialization is in effect, making it safe to assume it
stays zero unless we change it. And once a layout's lseg count drops
to zero, it is set as DESTROYED and so will stay at zero.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We do not need to clear the NFS_LAYOUT_BULK_RECALL, as setting it
guarantees that NFS_LAYOUT_DESTROYED will be set once any outstanding
io is finished.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The code could violate the following from RFC5661, section 12.5.3:
"Once a client has no more layouts on a file, the layout stateid is no
longer valid and MUST NOT be used."
This can occur when a layout already has a lseg, starts another
non-everlapping LAYOUTGET, and a CB_LAYOUTRECALL for the existing lseg
is processed before we hit pnfs_layout_process().
Solve by setting, each time the client has no more lsegs for a file, a
flag which blocks further use of the layout and triggers its removal.
This also fixes a second bug which occurs in the same instance as
above. If we actually use pnfs_layout_process, we add the new lseg to
the layout, but the layout has been removed from the nfs_client list
by the intervening CB_LAYOUTRECALL and will not be added back. Thus
the newly acquired lseg will not be properly returned in the event of
a subsequent CB_LAYOUTRECALL.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There have been a number of recent reports that NFSROOT is no longer
working with default mount options, but fails only with certain NICs.
Brian Downing <bdowning@lavos.net> bisected to commit 56463e50 "NFS:
Use super.c for NFSROOT mount option parsing". Among other things,
this commit changes the default mount options for NFSROOT to use TCP
instead of UDP as the underlying transport.
TCP seems less able to deal with NICs that are slow to initialize.
The system logs that have accompanied reports of problems all show
that NFSROOT attempts to establish a TCP connection before the NIC is
fully initialized, and thus the TCP connection attempt fails.
When a TCP connection attempt fails during a mount operation, the
NFS stack needs to fail the operation. Usually user space knows how
and when to retry it. The network layer does not report a distinct
error code for this particular failure mode. Thus, there isn't a
clean way for the RPC client to see that it needs to retry in this
case, but not in others.
Because NFSROOT is used in some environments where it is not possible
to update the kernel command line to specify "udp", the proper thing
to do is change NFSROOT to use UDP by default, as it did before commit
56463e50.
To make it easier to see how to change default mount options for
NFSROOT and to distinguish default settings from mandatory settings,
I've adjusted a couple of areas to document the specifics.
root_nfs_cat() is also modified to deal with commas properly when
concatenating strings containing mount option lists. This keeps
root_nfs_cat() call sites simpler, now that we may be concatenating
multiple mount option strings.
Tested-by: Brian Downing <bdowning@lavos.net>
Tested-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> # 2.6.37
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There are no more external users of nfs4_state_mark_reclaim_nograce() or
nfs4_state_mark_reclaim_reboot(), so mark them as static.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want SEQUENCE status bits to be handled by the state manager in order
to avoid threading issues.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_schedule_state_recovery() should only be used when we need to force
the state manager to check the lease. If we just want to start the
state manager in order to handle a state recovery situation, we should be
using nfs4_schedule_state_manager().
This patch fixes the abuses of nfs4_schedule_state_recovery() by replacing
its use with a set of helper functions that do the right thing.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The log lock is currently used to protect the AIL lists and
the movements of buffers into and out of them. The lists
are self contained and no log specific items outside the
lists are accessed when starting or emptying the AIL lists.
Hence the operation of the AIL does not require the protection
of the log lock so split them out into a new AIL specific lock
to reduce the amount of traffic on the log lock. This will
also reduce the amount of serialisation that occurs when
the gfs2_logd pushes on the AIL to move it forward.
This reduces the impact of log pushing on sequential write
throughput.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2 fallocate wasn't properly checking if a blocks were already allocated.
In write_empty_blocks(), if a page didn't have buffer_heads attached, GFS2
was always treating it as if there were no blocks allocated for that page.
GFS2 now calls gfs2_block_map() to check if the blocks are allocated before
writing them out.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is a small patch that optimizes multiple glock dequeue
operations. It changes the unlock order to be more efficient
and makes it easier for lock debugging tools to unravel. It
also eliminates the need for the temp variable x, although
that would likely be optimized out.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Change the default UBIFS behavior WRT data CRC checking. Currently,
UBIFS checks data CRC when reading, which slows it down quite a bit,
and this is the default option. However, it looks like in average
user does not need this feature and would prefer faster read speed
over extra reliability. And this seems to be de-facto standard that
file-systems do not check data CRC every time they read from the
media.
Thus, make UBIFS default behavior so that it does not check data
CRC. This corresponds to the no_chk_data_crc mount option. Those users
who need extra protection can always enable it using the chk_data_crc
option.
Please, read more information about this feature here:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_checksumming
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Remove debug message level and debug checks Kconfig options as they
proved to be useless anyway. We have sysfs interface which we can
use for fine-grained debugging messages and checks selection, see
Documentation/filesystems/ubifs.txt for mode details.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Running kernel 2.6.37, my PPC-based device occasionally gets an
order-2 allocation failure in UBIFS, which causes the root FS to
become unwritable:
kswapd0: page allocation failure. order:2, mode:0x4050
Call Trace:
[c787dc30] [c00085b8] show_stack+0x7c/0x194 (unreliable)
[c787dc70] [c0061aec] __alloc_pages_nodemask+0x4f0/0x57c
[c787dd00] [c0061b98] __get_free_pages+0x20/0x50
[c787dd10] [c00e4f88] ubifs_jnl_write_data+0x54/0x200
[c787dd50] [c00e82d4] do_writepage+0x94/0x198
[c787dd90] [c00675e4] shrink_page_list+0x40c/0x77c
[c787de40] [c0067de0] shrink_inactive_list+0x1e0/0x370
[c787de90] [c0068224] shrink_zone+0x2b4/0x2b8
[c787df00] [c0068854] kswapd+0x408/0x5d4
[c787dfb0] [c0037bcc] kthread+0x80/0x84
[c787dff0] [c000ef44] kernel_thread+0x4c/0x68
Similar problems were encountered last April by Tomasz Stanislawski:
http://patchwork.ozlabs.org/patch/50965/
This patch implements Artem's suggested fix: fall back to a
mutex-protected static buffer, allocated at mount time. I tested it
by forcing execution down the failure path, and didn't see any ill
effects.
Artem: massaged the patch a little, improved it so that we'd not
allocate the write reserve buffer when we are in R/O mode.
Signed-off-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Fix bug where we currently retry the EXCHANGEID call again, eventhough
we already have a valid clientid. Instead, delay and retry the CREATE_SESSION
call.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The problem was use of an int32, which when converted to a uint64
is sign extended resulting in a fileid that doesn't fit in 32 bits
even though the intent of the function is to fit the fileid into
32 bits.
Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
[Trond: Added an include for compat.h]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
add kmalloc return value check in decode_and_add_ds
Signed-off-by: Stanislav Fomichev <kernel@fomichev.me>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I've been adding in more artificial delays in the NFSv4 commit and close
codepaths to uncover races. The kernel I'm testing has the patch to
close the race in __rpc_wait_for_completion_task that's in Trond's
cthon2011 branch. The reproducer I've been using does this in a loop:
mkdir("DIR");
fd = open("DIR/FILE", O_WRONLY|O_CREAT|O_EXCL, 0644);
write(fd, "abcdefg", 7);
close(fd);
unlink("DIR/FILE");
rmdir("DIR");
The above reproducer shouldn't result in any silly-renaming. However,
when I add a "msleep(100)" just after the nfs_commit_clear_lock call in
nfs_commit_release, I can almost always force one to occur. If I can
force it to occur with that, then it can happen without that delay
given the right timing.
nfs_commit_inode waits for the NFS_INO_COMMIT bit to clear when called
with FLUSH_SYNC set. nfs_commit_rpcsetup on the other hand does not wait
for the task to complete before putting its reference to it, so the last
reference get put in rpc_release task and gets queued to a workqueue.
In this situation, the last open context reference may be put by the
COMMIT release instead of the close() syscall. The close() syscall
returns too quickly and the unlink runs while the d_count is still
high since the COMMIT release hasn't put its dentry reference yet.
Fix this by having rpc_commit_rpcsetup wait for the RPC call to complete
before putting the task reference when FLUSH_SYNC is set. With this, the
last reference is put by the process that's initiating the FLUSH_SYNC
commit and the race is closed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Although they run as rpciod background tasks, under normal operation
(i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
and nfs4_do_close() want to be fully synchronous. This means that when we
exit, we want all references to the rpc_task to be gone, and we want
any dentry references etc. held by that task to be released.
For this reason these functions call __rpc_wait_for_completion_task(),
followed by rpc_put_task() in the expectation that the latter will be
releasing the last reference to the rpc_task, and thus ensuring that the
callback_ops->rpc_release() has been called synchronously.
This patch fixes a race which exists due to the fact that
rpciod calls rpc_complete_task() (in order to wake up the callers of
__rpc_wait_for_completion_task()) and then subsequently calls
rpc_put_task() without ensuring that these two steps are done atomically.
In order to avoid adding new spin locks, the patch uses the existing
waitqueue spin lock to order the rpc_task reference count releases between
the waiting process and rpciod.
The common case where nobody is waiting for completion is optimised for by
checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
reference count is 1: in those cases we drop trying to grab the spin lock,
and immediately free up the rpc_task.
Those few processes that need to put the rpc_task from inside an
asynchronous context and that do not care about ordering are given a new
helper: rpc_put_task_async().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make all three hash tables a consistent size of 1024
rather than 1024, 512, 256. All three tables, for
resources, locks, and lock dir entries, will generally
be filled to the same order of magnitude.
Signed-off-by: David Teigland <teigland@redhat.com>
Change how callbacks are recorded for locks. Previously, information
about multiple callbacks was combined into a couple of variables that
indicated what the end result should be. In some situations, we
could not tell from this combined state what the exact sequence of
callbacks were, and would end up either delivering the callbacks in
the wrong order, or suppress redundant callbacks incorrectly. This
new approach records all the data for each callback, leaving no
uncertainty about what needs to be delivered.
Signed-off-by: David Teigland <teigland@redhat.com>
btrfs_link() will insert 3 items(inode ref, dir name item and dir index item)
into the b+ tree and update 2 items(its inode, and parent's inode) in the b+
tree. So we should reserve space for these 5 items, not 3 items.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs DIO code leaks dip structs when dip->csums allocation
fails; bio->bi_end_io isn't set at the point where the free_ordered
branch is consequently taken, thus bio_endio doesn't call the function
which would free it in the normal case. Fix.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Without this patch, inodes are not promptly freed on last close of an
unlinked file by an nfs client:
client$ mount -tnfs4 server:/export/ /mnt/
client$ tail -f /mnt/FOO
...
server$ df -i /export
server$ rm /export/FOO
(^C the tail -f)
server$ df -i /export
server$ echo 2 >/proc/sys/vm/drop_caches
server$ df -i /export
the df's will show that the inode is not freed on the filesystem until
the last step, when it could have been freed after killing the client's
tail -f. On-disk data won't be deallocated either, leading to possible
spurious ENOSPC.
This occurs because when the client does the close, it arrives in a
compound with a putfh and a close, processed like:
- putfh: look up the filehandle. The only alias found for the
inode will be DCACHE_UNHASHED alias referenced by the filp
this, so it creates a new DCACHE_DISCONECTED dentry and
returns that instead.
- close: closes the existing filp, which is destroyed
immediately by dput() since it's DCACHE_UNHASHED.
- end of the compound: release the reference
to the current filehandle, and dput() the new
DCACHE_DISCONECTED dentry, which gets put on the
unused list instead of being destroyed immediately.
Nick Piggin suggested fixing this by allowing d_obtain_alias to return
the unhashed dentry that is referenced by the filp, instead of making it
create a new dentry.
Leave __d_find_alias() alone to avoid changing behavior of other
callers.
Also nfsd doesn't need all the checks of __d_find_alias(); any dentry,
hashed or unhashed, disconnected or not, should work.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the fallocate path the kernel doesn't check for the immutable/append
flag. It's possible to have a race condition in this scenario: an
application open a file in read/write and it does something, meanwhile
root set the immutable flag on the file, the application at that point
can call fallocate with success. In addition, we don't allow to do any
unreserve operation on an append only file but only the reserve one.
Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux:
nfsd: wrong index used in inner loop
nfsd4: fix bad pointer on failure to find delegation
NFSD: fix decode_cb_sequence4resok
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
nd->inode is not set on the second attempt in path_walk()
unfuck proc_sysctl ->d_compare()
minimal fix for do_filp_open() race
Not all block drivers clear events immediately after reporting. Some
do so in ->revalidate_disk() or other steps during ->open(). There is
a slim chance event poll may happen between the clearing event check
from check_disk_change() and the actual clearing of the events which
would result in spurious events.
Block event checks while block device open is in progress. There is
no need to kick explicit event check afterwards as events are always
checked during open.
-v2: The original patch could have called disk_unblock_events() with
an already released or %NULL @disk causing oops. Fixed by making
sure references are put after disk_unblock_events() is called.
It also makes the error path of __blkdev_get() a bit simpler.
This problem was reported by Jens.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
The block event mechanism currently always checks events when the
device is being closed regardless of the open mode. The intention was
to allow detection of EJECT_REQUEST when a device is closed whether
disk event polling is enabled or not.
This is unnecessary as, for devices of interest, events are checked
from either userland or kernel and in the former case ->check_events()
is performed on open of each poll attempt anyway. Furthermore, this
unconditional event check on close makes the code susceptible to event
loop if the block driver doesn't clear reported events correctly - an
event triggers userland to open and close the device which in turn
causes another event, rinse and repeat.
Check events on close only if it was blocked by excl write open.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Currently, disk_unblock_events() implicitly kick event check if the
block count reaches zero. This behavior is not described in the
comment and hinders with future changes. Make the unblocker
explicitly check events by calling disk_check_events() as necessary.
This patch doesn't cause any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Updating the AGF and transactions counters is duplicated between allocating
and freeing extents. Factor the code into a common helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Pass a xfs_alloc_arg structure to xfs_alloc_compute_aligned and derive
the alignment and minlen paramters from it. This cleans up the existing
callers, and we'll need even more information from the xfs_alloc_arg
in subsequent patches. Based on a patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
This patch ensures that we always wait for glock demotion when
dropping flocks on a file in order to prevent any race
conditions associated with further flock calls or closing
the file.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes a race in deallocating glocks which was introduced
in the RCU glock patch. We need to ensure that the glock count is
kept correct even in the case that there is a race to add a new
glock into the hash table. Also, to avoid having to wait for an
RCU grace period, the glock counter can be decremented before
call_rcu() is called.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Immediately after being synced to disk, cached quotas are zeroed out and a
subsequent access of the cached quotas results in incorrect zero values. This
meant that gfs2 assumed the actual usage to be the zero (or near-zero) usage
values it found in the cached quotas and comparison against warn/limits never
triggered a quota violation.
This patch adds a new flag QDF_REFRESH that is set after a sync so that the
cached quotas are forcefully refreshed from disk on a subsequent access on
seeing this flag set.
Resolves: rhbz#675944
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This directly uses sb->s_fs_info to keep a nilfs filesystem object and
fully removes the intermediate nilfs_sb_info structure. With this
change, the hierarchy of on-memory structures of nilfs will be
simplified as follows:
Before:
super_block
-> nilfs_sb_info
-> the_nilfs
-> cptree --+-> nilfs_root (current file system)
+-> nilfs_root (snapshot A)
+-> nilfs_root (snapshot B)
:
-> nilfs_sc_info (log writer structure)
After:
super_block
-> the_nilfs
-> cptree --+-> nilfs_root (current file system)
+-> nilfs_root (snapshot A)
+-> nilfs_root (snapshot B)
:
-> nilfs_sc_info (log writer structure)
The reason why we didn't design so from the beginning is because the
initial shape also differed from the above. The early hierachy was
composed of "per-mount-point" super_block -> nilfs_sb_info pairs and a
shared nilfs object. On the kernel 2.6.37, it was changed to the
current shape in order to unify super block instances into one per
device, and this cleanup became applicable as the result.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
We leave it at whatever it had been pointing to after the
first link_path_walk() had failed with -ESTALE. Things
do not work well after that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Removes sci->sc_sbi which is a back pointer to nilfs_sb_info struct
from log writer object (nilfs_sc_info).
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Log writer is held by the nilfs_sb_info structure. This moves it into
nilfs object and replaces all uses of NILFS_SC() accessor.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Moves s_next_generation counter and a spinlock protecting it to nilfs
object from nilfs_sb_info structure.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Moves s_inode_lock spinlock and s_dirty_files list to nilfs object
from nilfs_sb_info structure.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This moves four parameter variables on nilfs_sb_info s_resuid,
s_resgid, s_interval and s_watermark to the nilfs object.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Index i was already used in the outer loop
Cc: stable@kernel.org
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Make sure we properly reference count the struct files that a lock
depends on, and release them when the lock stateid is released.
This fixes a major leak of struct files when using locking over nfsv4.
Cc: stable@kernel.org
Reported-by: Rick Koshi <nfs-bug-report@more-right-rudder.com>
Tested-by: Ivo Přikryl <prikryl@eurosat.cz>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Minor cleanup in preparation for a bugfix--moving some code to avoid
forward references, etc. No change in functionality.
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The btrfs fiemap code was incorrectly returning duplicate or overlapping
extents in some cases. cp was blindly trusting this result and we would
end up with a destination file that was bigger than the original because
some bytes were copied twice.
The fix here adjusts our offsets to make sure we're always moving
forward in the fiemap results.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When recovering from unclean reboots UBIFS scans the journal and checks nodes.
If a corrupted node is found, UBIFS tries to check if this is the last node
in the LEB or not. This is is done by checking if there only 0xFF bytes
starting from the next min. I/O unit. However, since now we write in
c->max_write_size, we should actually check for 0xFFs starting from the
next max. write unit.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Switch write-buffers from 'c->min_io_size' to 'c->max_write_size' which
presumably has to be more write speed-efficient. However, when write-buffer
is synchronized, write only the the min. I/O units which contain the
data, do not write whole write-buffer. This is more space-efficient.
Additionally, this patch takes into account that the LEB might not start
from the max. write unit-aligned address.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Currently we assume write-buffer size is always min_io_size. But
this is about to change and write-buffers may be of variable size.
Namely, they will be of max_write_size at the beginning, but will
get smaller when we are approaching the end of LEB.
This is a preparation patch which introduces 'size' field in
the write-buffer structure which carries the current write-buffer
size.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Incorporate the LEB offset information into UBIFS. We'll use this
information in one of the next patches to figure out what are the
max. write size offsets relative to the PEB. So this patch is just
a preparation.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Incorporate maximum write size into the UBIFS description data
structure. This patch just introduces new 'c->max_write_size'
and 'c->max_write_shift' fields as a preparation for the following
patches.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The block integrity subsystem no longer uses the bio_vec slabs so this
code can safely be compiled in.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
a) struct inode is not going to be freed under ->d_compare();
however, the thing PROC_I(inode)->sysctl points to just might.
Fortunately, it's enough to make freeing that sucker delayed,
provided that we don't step on its ->unregistering, clear
the pointer to it in PROC_I(inode) before dropping the reference
and check if it's NULL in ->d_compare().
b) I'm not sure that we *can* walk into NULL inode here (we recheck
dentry->seq between verifying that it's still hashed / fetching
dentry->d_inode and passing it to ->d_compare() and there's no
negative hashed dentries in /proc/sys/*), but if we can walk into
that, we really should not have ->d_compare() return 0 on it!
Said that, I really suspect that this check can be simply killed.
Nick?
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This records the number of used blocks per checkpoint in each
checkpoint entry of cpfile. Even though userland tools can get the
block count via nilfs_get_cpinfo ioctl, it was not updated by the
nilfs2 kernel code. This fixes the issue and makes it available for
userland tools to calculate used amount per checkpoint.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Jiro SEKIBA <jir@unicus.jp>
This is a similar change to those in ext2/ext3 codebase (commit
40a063f669 and a4ae309486, respectively).
The addition of 64k block capability in the rec_len_from_disk and
rec_len_to_disk functions added a bit of math overhead which slows
down file create workloads needlessly when the architecture cannot
even support 64k blocks. This will cut the corner.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
At present, the same warning message can be output twice when nilfs
detected a problem on super blocks:
NILFS warning: broken superblock. using spare superblock.
NILFS warning: broken superblock. using spare superblock.
...
This is because these super blocks are reloaded with the block size
written in a super block if it differs from the first block size, but
this repetition looks somewhat confusing. So, we hint at what is
going on by appending block size information to those messages.
Reported-by: Wakko Warner <wakko@animx.eu.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The current FS_IOC_GETFLAGS/SETFLAGS/GETVERSION will fail if
application is 32 bit and kernel is 64 bit.
This issue is avoidable by adding compat_ioctl method.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Add support for the standard attributes set via chattr and read via
lsattr. These attributes are already in the flags value in the nilfs2
inode, but currently we don't have any ioctl commands that expose them
to the userland.
Collaterally, this adds the FS_IOC_GETVERSION ioctl for getting
i_generation, which allows users to list the file's generation number
with "lsattr -v".
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Nilfs has few rectrictions on which flags may be set on which inodes
like ext2/3/4 filesystems used to be. Specifically DIRSYNC may only
be set on directories and IMMUTABLE and APPEND may not be set on
links. Tighten that to disallow TOPDIR being set on non-directories
and only NODUMP and NOATIME to be set on non-regular file,
non-directories.
This introduces a flags masking function like those of extN and uses
it during inode creation.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
At present, nilfs marks S_NOATIME flag on all inodes. This restricts
nilfs_set_inode_flags function so that it marks S_NOATIME only if a
given inode has an FS_NOATIME_FL flag.
Although nilfs does not support atime yet, touch_atime() still safely
returns on IS_NOATIME check since MS_NOATIME is always set on sb.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Replaces uses of own inode flags (i.e. NILFS_SECRM_FL, NILFS_UNRM_FL,
NILFS_COMPR_FL, and so forth) with common inode flags, and removes the
own flag declarations.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Three functions of the current persistent object allocator,
nilfs_palloc_commit_free_entry, nilfs_palloc_abort_alloc_entry, and
nilfs_palloc_freev functions unconditionally add a counter after doing
clear bit operation on a bitmap block.
If the clear bit operation overlapped, the counter will not add up.
This fixes the issue by making the counter operations conditional.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This fixes the issue that inodes count will not add up after removal
of raw inodes fails. Hence, this prevents possible under flow of the
inodes count.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The members of nfsd4_op_flags, (ALLOWED_WITHOUT_FH | ALLOWED_ON_ABSENT_FS)
equals to ALLOWED_AS_FIRST_OP, maybe that's not what we want.
OP_PUTROOTFH with op_flags = ALLOWED_WITHOUT_FH | ALLOWED_ON_ABSENT_FS,
can't appears as the first operation with out SEQUENCE ops.
This patch modify the wrong value of ALLOWED_WITHOUT_FH etc which
was introduced by f9bb94c4.
Cc: stable@kernel.org
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add a new proc file which lists the encryption types supported
by the kernel's gss_krb5 code.
Newer MIT Kerberos libraries support the assertion of acceptor
subkeys. This enctype information allows user-land (svcgssd)
to request that the Kerberos libraries limit the encryption
types that it uses when generating the subkeys.
Signed-off-by: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently we have the following code in fs/nfsd/vfs.c::nfsd_rename() :
...
host_err = nfsd_break_lease(odentry->d_inode);
if (host_err)
goto out_drop_write;
if (ndentry->d_inode) {
host_err = nfsd_break_lease(ndentry->d_inode);
if (host_err)
goto out_drop_write;
}
if (host_err)
goto out_drop_write;
...
'host_err' is guaranteed to be 0 by the time we test 'ndentry->d_inode'.
If 'host_err' becomes != 0 inside the 'if' statement, then we goto
'out_drop_write'. So, after the 'if' statement there is no way that
'host_err' can be anything but 0, so the test afterwards is just dead
code.
This patch removes the dead code.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
These macros had never been used for several years.
So, remove them.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In case of a nonempty list, the return on error here is obviously bogus;
it ends up being a pointer to the list head instead of to any valid
delegation on the list.
In particular, if nfsd4_delegreturn() hits this case, and you're quite unlucky,
then renew_client may oops, and it may take an embarassingly long time to
figure out why. Facepalm.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
IP: [<ffffffff81292965>] nfsd4_delegreturn+0x125/0x200
...
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(crossport of 1f7bebb9e9
by Andreas Schlick <schlick@lavabit.com>)
When ext3_dx_add_entry() has to split an index node, it has to ensure that
name_len of dx_node's fake_dirent is also zero, because otherwise e2fsck
won't recognise it as an intermediate htree node and consider the htree to
be corrupted.
CC: stable@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When copy_from_user is only able to copy some of the bytes we requested,
we may end up creating a partially up to date page. To avoid garbage in
the page, we need to treat a partial copy as a zero length copy.
This makes the rest of the file_write code drop the page and
retry the whole copy instead of marking the partially up to
date page as dirty.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
cc: stable@kernel.org
Commit 914ee295af fixed deadlocks in
btrfs_file_write where we would catch page faults on pages we had
locked.
But, there were a few problems:
1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
data when the amount to copy is more than 4K and the offset to start
copying from is not page aligned. The result was btrfs_file_write
looping forever retrying the iov_iter_copy_from_user_atomic
We deal with this by changing btrfs_file_write to drop down to single
page copies when iov_iter_copy_from_user_atomic starts returning failure.
2) The btrfs_file_write code was leaking delalloc reservations when
iov_iter_copy_from_user_atomic returned zero. The looping above would
result in the entire filesystem running out of delalloc reservations and
constantly trying to flush things to disk.
3) btrfs_file_write will lock down page cache pages, make sure
any writeback is finished, do the copy_from_user and then release them.
Before the loop runs we check the first and last pages in the write to
see if they are only being partially modified. If the start or end of
the write isn't aligned, we make sure the corresponding pages are
up to date so that we don't introduce garbage into the file.
With the copy_from_user changes, we're allowing the VM to reclaim the
pages after a partial update from copy_from_user, but we're not
making sure the page cache page is up to date when we loop around to
resume the write.
We deal with this by pushing the up to date checks down into the page
prep code. This fits better with how the rest of file_write works.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
cc: stable@kernel.org
The remaining functionality in debug.[ch] is effectively just assert
handling, conditional debug definitions and hex dumping. The hex
dumping and assert function can be moved into the new printk module,
while the rest can be moved into top-level header files. This allows
fs/xfs/support/debug.[ch] to be completely removed from the
codebase.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Once converted, kill the remainder of the cmn_err() interface.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The "cmn_err" part of the function name is no longer relevant. Rename
the function to xfs_alert_fsblock_zero() to match the new logging
API.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Continue to clean up the error logging code by converting all the
callers of xfs_fs_cmn_err() to the new API. Once done, remove the
unused old API function.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The xfs_fs_mount_cmn_err() hides a simple check as to whether the
mount path should output an error or not. Remove the macro and open
code the check.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In certain cases of inode corruption, the xfs_fs_repair_cmn_err()
macro is used to output an extra message in the corruption report.
That extra message is "unmount and run xfs_repair", which really
applies to any corruption report. Each case that this macro is
called (except one) a following call to xfs_corruption_error() is
made to optionally dump more information about the error.
Hence, move the output of "run xfs_repair" to xfs_corruption_error()
so that it is output on all corruption reports. Also, convert the
callers of the repair macro that don't call xfs_corruption_error()
to call it, hence provide consiѕtent error reporting for all cases
where xfs_fs_repair_cmn_err() used to be called.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Continue the conversion of the old cmn_err interface be converting
all the conditional panic tag errors to xfs_alert_tag() and then
removing xfs_cmn_err().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Convert the xfs log operations to use the new error logging
interfaces. This removes the xlog_{warn,panic} wrappers and makes
almost all errors emit the device they belong to instead of just
refering to "XFS".
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Convert the files in fs/xfs/linux-2.6/ to use the new xfs_<level>
logging format that replaces the old Irix inherited cmn_err()
interfaces. While there, also convert naked printk calls to use the
relevant xfs logging function to standardise output format.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
filldir returning an error does *not* mean "skip this entry, try the
next one"...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
In case of directory-overwriting rename(), omfs forgot to mark the
victim doomed, so omfs_evict_inode() didn't free it.
We could fix that by calling omfs_rmdir() for directory victims
instead of doing omfs_unlink(), but it's easier to merge omfs_unlink()
and omfs_rmdir() instead. Note that we have no hardlinks here.
It also makes the checks in omfs_rename() go away - they fold into
what omfs_remove() does when it runs into a directory.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Since omfs directories are hashes of inodes and name is part of
inode, we have to remove inode from old directory before we can
put it into new one / under new name. So instead of
bump i_nlink
call omfs_unlink, which does
omfs_delete_entry()
decrement i_nlink and mark parent dirty in case of success
decrement i_nlink if omfs_unlink failed and hadn't done it itself
let's just call omfs_delete_entry() and dirty the parent ourselves...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
we *do* mark it dirty before, but it doesn't guarantee that we
don't get preempted just before assignment to ->i_ctime, with
inode getting written out before we get CPU back...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: no .snap inside of snapped namespace
libceph: fix msgr standby handling
libceph: fix msgr keepalive flag
libceph: fix msgr backoff
libceph: retry after authorization failure
libceph: fix handling of short returns from get_user_pages
ceph: do not clear I_COMPLETE from d_release
ceph: do not set I_COMPLETE
Revert "ceph: keep reference to parent inode on ceph_dentry"
While running ext4 testing on multiple core, we found there are per
cpu ext4-dio-unwritten threads processing conversion from unwritten
extents to written for IOs completed from async direct IO patch. Per
filesystem is enough, we don't need per cpu threads to work on
conversion.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
The comment is no longer true as (now that the BKL conversion is
finished) a spinlock _is_ now used to protect file_lock_list,
blocked_list and inode->i_flock.
Signed-off-by: Matt Fleming <matt.fleming@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The "bad_page()" page allocator sanity check was reported recently (call
chain as follows):
bad_page+0x69/0x91
free_hot_cold_page+0x81/0x144
skb_release_data+0x5f/0x98
__kfree_skb+0x11/0x1a
tcp_ack+0x6a3/0x1868
tcp_rcv_established+0x7a6/0x8b9
tcp_v4_do_rcv+0x2a/0x2fa
tcp_v4_rcv+0x9a2/0x9f6
do_timer+0x2df/0x52c
ip_local_deliver+0x19d/0x263
ip_rcv+0x539/0x57c
netif_receive_skb+0x470/0x49f
:virtio_net:virtnet_poll+0x46b/0x5c5
net_rx_action+0xac/0x1b3
__do_softirq+0x89/0x133
call_softirq+0x1c/0x28
do_softirq+0x2c/0x7d
do_IRQ+0xec/0xf5
default_idle+0x0/0x50
ret_from_intr+0x0/0xa
default_idle+0x29/0x50
cpu_idle+0x95/0xb8
start_kernel+0x220/0x225
_sinittext+0x22f/0x236
It occurs because an skb with a fraglist was freed from the tcp
retransmit queue when it was acked, but a page on that fraglist had
PG_Slab set (indicating it was allocated from the Slab allocator (which
means the free path above can't safely free it via put_page.
We tracked this back to an nfsv4 setacl operation, in which the nfs code
attempted to fill convert the passed in buffer to an array of pages in
__nfs4_proc_set_acl, which gets used by the skb->frags list in
xs_sendpages. __nfs4_proc_set_acl just converts each page in the buffer
to a page struct via virt_to_page, but the vfs allocates the buffer via
kmalloc, meaning the PG_slab bit is set. We can't create a buffer with
kmalloc and free it later in the tcp ack path with put_page, so we need
to either:
1) ensure that when we create the list of pages, no page struct has
PG_Slab set
or
2) not use a page list to send this data
Given that these buffers can be multiple pages and arbitrarily sized, I
think (1) is the right way to go. I've written the below patch to
allocate a page from the buddy allocator directly and copy the data over
to it. This ensures that we have a put_page free-able page for every
entry that winds up on an skb frag list, so it can be safely freed when
the frame is acked. We do a put page on each entry after the
rpc_call_sync call so as to drop our own reference count to the page,
leaving only the ref count taken by tcp_sendpages. This way the data
will be properly freed when the ack comes in
Successfully tested by myself to solve the above oops.
Note, as this is the result of a setacl operation that exceeded a page
of data, I think this amounts to a local DOS triggerable by an
uprivlidged user, so I'm CCing security on this as well.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: security@kernel.org
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
failure exits on the no-O_CREAT side of do_filp_open() merge with
those of O_CREAT one; unfortunately, if do_path_lookup() returns
-ESTALE, we'll get out_filp:, notice that we are about to return
-ESTALE without having trying to create the sucker with LOOKUP_REVAL
and jump right into the O_CREAT side of code. And proceed to try
and create a file. Usually that'll fail with -ESTALE again, but
we can race and get that attempt of pathname resolution to succeed.
open() without O_CREAT really shouldn't end up creating files, races
or not. The real fix is to rearchitect the whole do_filp_open(),
but for now splitting the failure exits will do.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In a bs=4096 volume, if we call FITRIM with the following parameter as
fstrim_range(start = 102400, len = 134144000, minlen = 10240), with the
following code:
if (len >= EXT3_BLOCKS_PER_GROUP(sb))
len -= (EXT3_BLOCKS_PER_GROUP(sb) - first_block);
else
last_block = first_block + len;
So if len < EXT3_BLOCKS_PER_GROUP while first_block + len >
EXT3_BLOCKS_PER_GROUP, last_block will be set to an overflow value
which exceeds EXT3_BLOCKS_PER_GROUP.
This patch fixes it and adjusts len and last_block accordingly.
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The VFS mount code passes the mount options to the LSM. The LSM will remove
options it understands from the data and the VFS will then pass the remaining
options onto the underlying filesystem. This is how options like the
SELinux context= work. The problem comes in that -o remount never calls
into LSM code. So if you include an LSM specific option it will get passed
to the filesystem and will cause the remount to fail. An example of where
this is a problem is the 'seclabel' option. The SELinux LSM hook will
print this word in /proc/mounts if the filesystem is being labeled using
xattrs. If you pass this word on mount it will be silently stripped and
ignored. But if you pass this word on remount the LSM never gets called
and it will be passed to the FS. The FS doesn't know what seclabel means
and thus should fail the mount. For example an ext3 fs mounted over loop
# mount -o loop /tmp/fs /mnt/tmp
# cat /proc/mounts | grep /mnt/tmp
/dev/loop0 /mnt/tmp ext3 rw,seclabel,relatime,errors=continue,barrier=0,data=ordered 0 0
# mount -o remount /mnt/tmp
mount: /mnt/tmp not mounted already, or bad option
# dmesg
EXT3-fs (loop0): error: unrecognized mount option "seclabel" or missing value
This patch passes the remount mount options to an new LSM hook.
Signed-off-by: Eric Paris <eparis@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
First, this was racy anyway: d_release isn't called until well after the
dentry is unhashed. Second, this runs afoul of the recent dcache change
that clears d_parent prior to calling d_release (949854d0), causing a NULL
pointer dereference.
Signed-off-by: Sage Weil <sage@newdream.net>
This reverts commit 97d79b403e.
This fails to account for d_parent changes due to rename or disconnected
dentries due to submounts or NFS reexports.
Signed-off-by: Sage Weil <sage@newdream.net>
(256 << sizeof(x)) - 1 is not the maximal possible value of x...
In reality, the maximal allowed value for UDF FileLinkCount is
65535.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
if directory has so many subdirectories that its link count is set
to 1 (i.e. "can't tell accurately") and reiserfs_new_inode() fails,
we shouldn't decrement the parent's link count in cleanup path;
that's what DEC_DIR_INODE_NLINK() is for. As it is, we end up
with parent suddenly getting zero i_nlink, with very unpleasant
effects.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6:
of/promtree: allow DT device matching by fixing 'name' brokenness (v5)
x86: OLPC: have prom_early_alloc BUG rather than return NULL
of/flattree: Drop an uninteresting message to pr_debug level
of: Add missing of_address.h to xilinx ehci driver
This introduces a new per-superblock mutex in UFS to replace
the big kernel lock. I have been careful to avoid nested
calls to lock_ufs and to get the lock order right with
respect to other mutexes, in particular lock_super.
I did not make any attempt to prove that the big kernel
lock is not needed in a particular place in the code,
which is very possible.
The mutex has a significant performance impact, so it is only
used on SMP or PREEMPT configurations.
As Nick Piggin noticed, any allocation inside of the lock
may end up deadlocking when we get to ufs_getfrag_block
in the reclaim task, so we now use GFP_NOFS.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Nick Piggin <npiggin@gmail.com>
This removes the BKL in hpfs in a rather awful
way, by making the code only work on uniprocessor
systems without kernel preemption, as suggested
by Andi Kleen.
The HPFS code probably has close to zero remaining
users on current kernels, all archeological uses of
the file system can probably be done with the significant
restrictions.
The hpfs_lock/hpfs_unlock functions are left in the
code, sincen Mikulas has indicated that he is still
interested in fixing it in a better way.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: linux-fsdevel@vger.kernel.org
This message looks like an error (which it isn't) when booting with a
flattened device tree. Remove the message from normal kernel builds.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
vfs_rename_other() does not lock renamed inode with i_mutex. Thus changing
i_nlink in a non-atomic manner (which happens in ext2_rename()) can corrupt
it as reported and analyzed by Josh.
In fact, there is no good reason to mess with i_nlink of the moved file.
We did it presumably to simulate linking into the new directory and unlinking
from an old one. But the practical effect of this is disputable because fsck
can possibly treat file as being properly linked into both directories without
writing any error which is confusing. So we just stop increment-decrement
games with i_nlink which also fixes the corruption.
CC: stable@kernel.org
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Commit 493f3358cb added this call to
xfs_fs_geometry() in order to avoid passing kernel stack data back
to user space:
+ memset(geo, 0, sizeof(*geo));
Unfortunately, one of the callers of that function passes the
address of a smaller data type, cast to fit the type that
xfs_fs_geometry() requires. As a result, this can happen:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
in: f87aca93
Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1
Call Trace:
[<c12991ac>] ? panic+0x50/0x150
[<c102ed71>] ? __stack_chk_fail+0x10/0x18
[<f87aca93>] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs]
Fix this by fixing that one caller to pass the right type and then
copy out the subset it is interested in.
Note: This patch is an alternative to one originally proposed by
Eric Sandeen.
Reported-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
Most of the logging infrastructure in XFS is unneccessary and
designed around the infrastructure supplied by Irix rather than
Linux. To rationalise the logging interfaces, start by introducing
simple printk wrappers similar to the dev_printk() infrastructure.
Later patches will convert code to use this new interface.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Commit 493f3358cb added this call to
xfs_fs_geometry() in order to avoid passing kernel stack data back
to user space:
+ memset(geo, 0, sizeof(*geo));
Unfortunately, one of the callers of that function passes the
address of a smaller data type, cast to fit the type that
xfs_fs_geometry() requires. As a result, this can happen:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
in: f87aca93
Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1
Call Trace:
[<c12991ac>] ? panic+0x50/0x150
[<c102ed71>] ? __stack_chk_fail+0x10/0x18
[<f87aca93>] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs]
Fix this by fixing that one caller to pass the right type and then
copy out the subset it is interested in.
Note: This patch is an alternative to one originally proposed by
Eric Sandeen.
Reported-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
According to the report from Jiro SEKIBA titled "regression in
2.6.37?" (Message-Id: <8739n8vs1f.wl%jir@sekiba.com>), on 2.6.37 and
later kernels, lscp command no longer displays "i" flag on checkpoints
that snapshot operations or garbage collection created.
This is a regression of nilfs2 checkpointing function, and it's
critical since it broke behavior of a part of nilfs2 applications.
For instance, snapshot manager of TimeBrowse gets to create
meaningless snapshots continuously; snapshot creation triggers another
checkpoint, but applications cannot distinguish whether the new
checkpoint contains meaningful changes or not without the i-flag.
This patch fixes the regression and brings that application behavior
back to normal.
Reported-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Jiro SEKIBA <jir@unicus.jp>
Cc: stable <stable@kernel.org> [2.6.37]
According to Russell King, adfs was written to not require the big
kernel lock, and all inode updates are done under adfs_dir_lock.
All other metadata in adfs is read-only and does not require locking.
The use of the BKL is the result of various pushdowns from the VFS
operations.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Stuart Swales <stuart.swales.croftnuisk@gmail.com>
Fix new kernel-doc warning in fs/block_dev.c:
Warning(fs/block_dev.c:937): No description found for parameter 'kill_dirty'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix truncate after open
fuse: fix hang of single threaded fuseblk filesystem
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Check heartbeat mode for kernel stacks only
Ocfs2/refcounttree: Fix a bug for refcounttree to writeback clusters in a right number.
ocfs2: Fix estimate of necessary credits for mkdir
The Patch below removes one to many "n's" in a word..
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Orphan cleanup is currently executed even if the file system has some
number of unknown ROCOMPAT features, which deletes inodes and frees
blocks, which could be very bad for some RO_COMPAT features.
This patch skips the orphan cleanup if it contains readonly compatible
features not known by this ext3 implementation, which would prevent
the fs from being mounted (or remounted) readwrite.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Jan Kara <jack@suse.cz>
vfs_rename_other() does not lock renamed inode with i_mutex. Thus changing
i_nlink in a non-atomic manner (which happens in ext2_rename()) can corrupt
it as reported and analyzed by Josh.
In fact, there is no good reason to mess with i_nlink of the moved file.
We did it presumably to simulate linking into the new directory and unlinking
from an old one. But the practical effect of this is disputable because fsck
can possibly treat file as being properly linked into both directories without
writing any error which is confusing. So we just stop increment-decrement
games with i_nlink which also fixes the corruption.
CC: stable@kernel.org
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Squashfs_get_sb() to squashfs_mount() conversion (commit 152a0836)
results in line over 80 characters.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Pass the dictionary size used to compress datablocks. Using a
dictionary size less than the block size saves memory overhead, in many
cases without adversely affecting compression ratio.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Extend decompressor framework to handle compression options stored in
the filesystem. These options can be used by the relevant decompressor
at initialisation time to over-ride defaults.
The presence of compression options in the filesystem is indicated by
the COMP_OPT filesystem flag. If present the data is read from the
filesystem and passed to the decompressor init function. The decompressor
init function signature has been extended to take this data.
Also update the init function signature in the glib, lzo and xz
decompressor wrappers.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
If no extent conversion is required, wake up any processes waiting for
the page's writeback to be complete and free the ext4_io_end structure
directly in ext4_end_bio() instead of dropping it on the linked list
(which requires taking a spinlock to queue and dequeue the io_end
structure), and waiting for the workqueue to do this work.
This removes an extra scheduling delay before process waiting for an
fsync() to complete gets woken up, and it also reduces the CPU
overhead for a random write workload.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Orphan cleanup is currently executed even if the file system has some
number of unknown ROCOMPAT features, which deletes inodes and frees
blocks, which could be very bad for some RO_COMPAT features,
especially the SNAPSHOT feature.
This patch skips the orphan cleanup if it contains readonly compatible
features not known by this ext4 implementation, which would prevent
the fs from being mounted (or remounted) readwrite.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
nblocks is passed into ext4_truncate_restart_trans() from
ext4_ext_truncate_extend_restart() with a value different from the default
blocks_for_truncate(), but is being ignored.
The two other calls to ext4_truncate_restart_trans() already pass the
default value, which is then being recalculated inside the function.
Fix the problem by using the passed argument.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
This assures that the root inode is not leaked, and that sb->s_root is
NULL, which will prevent generic_shutdown_super() from doing extra
work, including call sync_filesystem, which ultimately results in
ext4_sync_fs() getting called with an uninitialized struct super,
which is the cause of the crash noted in Kernel Bugzilla #26752.
https://bugzilla.kernel.org/show_bug.cgi?id=26752
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fix the FIEMAP ioctl so that it returns all of the page ranges which
are still subject to delayed allocation. We were missing some cases
if the file was sparse.
Reported by Chris Mason <chris.mason@oracle.com>:
>We've had reports on btrfs that cp is giving us files full of zeros
>instead of actually copying them. It was tracked down to a bug with
>the btrfs fiemap implementation where it was returning holes for
>delalloc ranges.
>
>Newer versions of cp are trusting fiemap to tell it where the holes
>are, which does seem like a pretty neat trick.
>
>I decided to give xfs and ext4 a shot with a few tests cases too, xfs
>passed with all the ones btrfs was getting wrong, and ext4 got the basic
>delalloc case right.
>$ mkfs.ext4 /dev/xxx
>$ mount /dev/xxx /mnt
>$ dd if=/dev/zero of=/mnt/foo bs=1M count=1
>$ fiemap-test foo
>ext: 0 logical: [ 0.. 255] phys: 0.. 255
>flags: 0x007 tot: 256
>
>Horray! But once we throw a hole in, things go bad:
>$ mkfs.ext4 /dev/xxx
>$ mount /dev/xxx /mnt
>$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
>$ fiemap-test foo
>< no output >
>
>We've got a delalloc extent after the hole and ext4 fiemap didn't find
>it. If I run sync to kick the delalloc out:
>$sync
>$ fiemap-test foo
>ext: 0 logical: [ 256.. 511] phys: 34048.. 34303
>flags: 0x001 tot: 256
>
>fiemap-test is sitting in my /usr/local/bin, and I have no idea how it
>got there. It's full of pretty comments so I know it isn't mine, but
>you can grab it here:
>
>http://oss.oracle.com/~mason/fiemap-test.c
>
>xfsqa has a fiemap program too.
After Fix, test results are as follows:
ext: 0 logical: [ 256.. 511] phys: 0.. 255
flags: 0x007 tot: 256
ext: 0 logical: [ 256.. 511] phys: 33280.. 33535
flags: 0x001 tot: 256
$ mkfs.ext4 /dev/xxx
$ mount /dev/xxx /mnt
$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
$ sync
$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=3
$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=5
$ fiemap-test foo
ext: 0 logical: [ 256.. 511] phys: 33280.. 33535
flags: 0x000 tot: 256
ext: 1 logical: [ 768.. 1023] phys: 0.. 255
flags: 0x006 tot: 256
ext: 2 logical: [ 1280.. 1535] phys: 0.. 255
flags: 0x007 tot: 256
Tested-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due
to disk being full, a verbose debugging message is printed, even if
the malloc-debug switch has not been enabled. Suppress the debugging
message so that nothing is printed unless malloc-debug has been turned
on.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_bio_write_page(), if the memory allocation for the struct
ext4_io_page fails, it returns with the page's PageWriteback flag set.
This will end up causing the page not to skip writeback in
WB_SYNC_NONE mode, and in WB_SYNC_ALL mode (i.e., on a sync, fsync, or
umount) the writeback daemon will get stuck forever on the
wait_on_page_writeback() function in write_cache_pages_da().
Or, if journalling is enabled and the file gets deleted, it the
journal thread can get stuck in journal_finish_inode_data_buffers()
call to filemap_fdatawait().
Another place where things can get hung up is in
truncate_inode_pages(), called out of ext4_evict_inode().
Fix this by not setting PageWriteback until after we have successfully
allocated the struct ext4_io_page.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Move the initialization of all of the fields of the mpd structure to
write_cache_pages_da(). This simplifies the code considerably.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we have accumulated a contiguous region of memory to be written
out, and the next page can added to this region, don't bother locking
(and then unlocking the page) before writing out the memory. In the
unlikely event that the next page was being written back by some other
CPU, we can also skip waiting that page to finish writeback.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Because the ext4 page writeback codepath had been prematurely calling
clear_page_dirty_for_io(), if it turned out that a particular page
couldn't be written out during a particular pass of
write_cache_pages_da(), the page would have to get redirtied by
calling redirty_pages_for_writeback(). Not only was this wasted work,
but redirty_page_for_writeback() would increment wbc->pages_skipped to
signal to writeback_sb_inodes() that buffers were locked, and that it
should skip this inode until later.
Since this signal was incorrect in ext4's case --- which was caused by
ext4's historically incorrect use of write_cache_pages() ---
ext4_da_writepages() saved and restored wbc->skipped_pages to avoid
confusing writeback_sb_inodes().
Now that we've fixed ext4 to call clear_page_dirty_for_io() right
before initiating the page I/O, we can nuke the page_skipped
save/restore hackery, and breathe a sigh of relief.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Move when we call clear_page_dirty_for_io() to just before we actually
write the page. This simplifies the code somewhat, and avoids marking
pages as clean and then needing to remark them as dirty later.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Eliminate duplicate code, unneeded variables, etc., to make it easier
to understand the code. No behavioral changes were made in this patch.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fold the __mpage_da_writepage() function into write_cache_pages_da().
This will give us opportunities to clean up and simplify the resulting
code.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Now that we've fixed the file corruption bug in commit d50bdd5aa5,
it's time to enable mblk_io_submit by default.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If ext4_da_block_invalidatepages() is called because of a
failure from ext4_map_blocks() in mpage_da_map_and_submit(),
it's supposed to clean up -- including unlock -- all the
pages in the mpd structure. But these values may not match
up, even on a system in which block size == page size:
mpd->b_blocknr != mpd->first_page
mpd->b_size != (mpd->next_page - mpd->first_page)
ext4_da_block_invalidatepages() has been using b_blocknr and
b_size; this patch changes it to use first_page and
next_page.
Tested: I injected a small number (5%) of failures in
ext4_map_blocks() in the case that the flags contain
EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this
kernel. Without this patch, I got hung tasks every time.
With this patch, I see no hangs in many runs of fsstress.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In mpage_da_map_and_submit(), if we have a delayed block
allocation failure from ext4_map_blocks(), we need to mark
the IO as complete, by setting
mpd->io_done = 1;
Otherwise, we could end up submitting the pages in an outer
loop; since they are unlocked on mapping failure in
ext4_da_block_invalidatepages(), this will cause a bug check
in mpage_da_submit_io().
I tested this by injected failures into ext4_map_blocks().
Without this patch, a simple fsstress run will bug check;
with the patch, it works fine.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A race can occur when io_submit() races with io_destroy():
CPU1 CPU2
io_submit()
do_io_submit()
...
ctx = lookup_ioctx(ctx_id);
io_destroy()
Now do_io_submit() holds the last reference to ctx.
...
queue new AIO
put_ioctx(ctx) - frees ctx with active AIOs
We solve this issue by checking whether ctx is being destroyed in AIO
submission path after adding new AIO to ctx. Then we are guaranteed that
either io_destroy() waits for new AIO or we see that ctx is being
destroyed and bail out.
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
lookup_ioctx doesn't implement the rcu lookup pattern properly.
rcu_read_lock does not prevent refcount going to zero, so we might take
a refcount on a zero count ioctx.
Fix the bug by atomically testing for zero refcount before incrementing.
[jack@suse.cz: added comment into the code]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel automatically evaluates partition tables of storage devices.
The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains
a bug that causes a kernel oops on certain corrupted LDM partitions. A
kernel subsystem seems to crash, because, after the oops, the kernel no
longer recognizes newly connected storage devices.
The patch changes ldm_parse_vmdb() to Validate the value of vblk_size.
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Acked-by: Richard Russon <ldm@flatcap.org>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In several places, an epoll fd can call another file's ->f_op->poll()
method with ep->mtx held. This is in general unsafe, because that other
file could itself be an epoll fd that contains the original epoll fd.
The code defends against this possibility in its own ->poll() method using
ep_call_nested, but there are several other unsafe calls to ->poll
elsewhere that can be made to deadlock. For example, the following simple
program causes the call in ep_insert recursively call the original fd's
->poll, leading to deadlock:
#include <unistd.h>
#include <sys/epoll.h>
int main(void) {
int e1, e2, p[2];
struct epoll_event evt = {
.events = EPOLLIN
};
e1 = epoll_create(1);
e2 = epoll_create(2);
pipe(p);
epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
write(p[1], p, sizeof p);
epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
return 0;
}
On insertion, check whether the inserted file is itself a struct epoll,
and if so, do a recursive walk to detect whether inserting this file would
create a loop of epoll structures, which could lead to deadlock.
[nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Reported-by: Nelson Elhage <nelhage@ksplice.com>
Tested-by: Nelson Elhage <nelhage@ksplice.com>
Cc: <stable@kernel.org> [2.6.34+, possibly earlier]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix fiemap bugs with delalloc
Btrfs: set FMODE_EXCL in btrfs_device->mode
Btrfs: make btrfs_rm_device() fail gracefully
Btrfs: Avoid accessing unmapped kernel address
Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
Btrfs: allow balance to explicitly allocate chunks as it relocates
Btrfs: put ENOSPC debugging under a mount option
* 'for-linus' of git://neil.brown.name/md:
md: Fix - again - partition detection when array becomes active
Fix over-zealous flush_disk when changing device size.
md: avoid spinlock problem in blk_throtl_exit
md: correctly handle probe of an 'mdp' device.
md: don't set_capacity before array is active.
md: Fix raid1->raid0 takeover
I'm seeing the following oops when testing afs:
Unable to handle kernel paging request for data at address 0x00000008
...
NIP [c0000000003393b0] .afs_unlink_writeback+0x38/0xc0
LR [c00000000033987c] .afs_put_writeback+0x98/0xec
Call Trace:
[c00000000345f600] [c00000000033987c] .afs_put_writeback+0x98/0xec
[c00000000345f690] [c00000000033ae80] .afs_write_begin+0x6a4/0x75c
[c00000000345f790] [c00000000012b77c] .generic_file_buffered_write+0x148/0x320
[c00000000345f8d0] [c00000000012e1b8] .__generic_file_aio_write+0x37c/0x3e4
[c00000000345f9d0] [c00000000012e2a8] .generic_file_aio_write+0x88/0xfc
[c00000000345fa90] [c0000000003390a8] .afs_file_write+0x10c/0x178
[c00000000345fb40] [c000000000188788] .do_sync_write+0xc4/0x128
[c00000000345fcc0] [c000000000189658] .vfs_write+0xe8/0x1d8
[c00000000345fd70] [c000000000189884] .SyS_write+0x68/0xb0
[c00000000345fe30] [c000000000008564] syscall_exit+0x0/0x40
afs_write_begin hits an error and calls afs_unlink_writeback. In there
we do list_del_init on an uninitialised list.
The patch below initialises ->link when creating the afs_writeback struct.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e1181ee6 "vfs: pass struct file to do_truncate on O_TRUNC
opens" broke the behavior of open(O_TRUNC|O_RDONLY) in fuse. Fuse
assumed that when called from open, a truncate() will be done, not an
ftruncate().
Fix by restoring the old behavior, based on the ATTR_OPEN flag.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Single threaded NTFS-3G could get stuck if a delayed RELEASE reply
triggered a DESTROY request via path_put().
Fix this by
a) making RELEASE requests synchronous, whenever possible, on fuseblk
filesystems
b) if not possible (triggered by an asynchronous read/write) then do
the path_put() in a separate thread with schedule_work().
Reported-by: Oliver Neukum <oneukum@suse.de>
Cc: stable@kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
In ext4_mb_check_group_pa(), the current preallocation space is
replaced with a new preallocation space when the two have the same
distance from the goal block.
This doesn't actually gain us anything, so change things so that the
function only switches to the new preallocation group if its distance
from the goal block is strictly smaller than the current preallocaiton
group's distance from the goal block.
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds comments to ext4_mb_mark_free_simple to make it more
understandable.
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
In __mb_check_buddy(), look at the code below:
591 fstart = -1;
592 buddy = mb_find_buddy(e4b, 0, &max);
593 for (i = 0; i < max; i++) {
594 if (!mb_test_bit(i, buddy)) {
595 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
596 if (fstart == -1) {
597 fragments++;
598 fstart = i;
599 }
600 continue;
601 }
602 fstart = -1;
603 /* check used bits only */
604 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
605 buddy2 = mb_find_buddy(e4b, j, &max2);
606 k = i >> j;
607 MB_CHECK_ASSERT(k < max2);
608 MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
609 }
610 }
611 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
612 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
613
614 grp = ext4_get_group_info(sb, e4b->bd_group);
615 buddy = mb_find_buddy(e4b, 0, &max);
On line 592, buddy is fetched by mb_find_buddy() with order 0, between
line 593 to line 615, buddy is not changed, therefore there is
no need to fetch buddy again from mb_find_buddy() with order 0 again.
We can safely remove the second mb_find_buddy() on line 615.
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
Current code calculate max no matter whether order is zero, it's
unnecessary. This cleanup patch sets max to "1 << (e4b->bd_blkbits
+ 3)" only when order == 0.
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
The new implementation of bd_link_disk_holder() added by 49731baa41
(block: restore multiple bd_link_disk_holder() support) didn't get an
extra reference for the holder_dir kobject of the slave bdev; however,
bdev kills holder_dir on removal, not release, so if the slave bdev is
removed while there are holder links, the holder_dir will be destroyed
while there still are holder links, which leads to oops later when
bd_unlink_disk_order() tries to remove those links.
Make bd_link_disk_holder() grab an extra reference for the slave's
holder_dir and put it in bd_unlink_disk_holder().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
Tested-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a performance improvement to GFS2's dealloc code.
Rather than update the quota file and statfs file for every
single block that's stripped off in unlink function do_strip,
this patch keeps track and updates them once for every layer
that's stripped. This is done entirely inside the existing
transaction, so there should be no risk of corruption.
The other functions that deallocate blocks will be unaffected
because they are using wrapper functions that do the same
thing that they do today.
I tested this code on my roth cluster by creating 200
files in a directory, each of which is 100MB, then on
four nodes, I simultaneously deleted the files, thus competing
for GFS2 resources (but different files). The commands
I used were:
[root@roth-01]# time for i in `seq 1 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
[root@roth-02]# time for i in `seq 2 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
[root@roth-03]# time for i in `seq 3 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
[root@roth-05]# time for i in `seq 4 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
The performance increase was significant:
roth-01 roth-02 roth-03 roth-05
--------- --------- --------- ---------
old: real 0m34.027 0m25.021s 0m23.906s 0m35.646s
new: real 0m22.379s 0m24.362s 0m24.133s 0m18.562s
Total time spent deleting:
old: 118.6s
new: 89.4
For this particular case, this showed a 25% performance increase for
GFS2 unlinks.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When we trim some free blocks in a group of ext3, we should
calculate the free blocks properly and check whether there are
enough freed blocks left for us to trim. Current solution will
only calculate free spaces if they are large for a trim which
is wrong.
Let us see a small example:
a group has 1.5M free which are 300k, 300k, 300k, 300k, 300k.
And minblocks is 1M. With current solution, we have to iterate
the whole group since these 300k will never be subtracted from
1.5M. But actually we should exit after we find the first 2
free spaces since the left 3 chunks only sum up to 900K if we
subtract the first 600K although they can't be trimed.
Cc: Jan Kara <jack@suse.cz>
Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
As we have make the consense in the e-mail[1], the trim start should
be added with first_data_block. So this patch fulfill it and remove
the check for start < first_data_block.
[1] http://www.spinics.net/lists/linux-ext4/msg22737.html
Cc: Jan Kara <jack@suse.cz>
Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
By the commit
b3e19d9 2011-01-07 fs: scale mntget/mntput
vfsmount_lock was introduced around testing mnt_count.
Fix the mis-typed 'unlock'
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are two cases when we call flush_disk.
In one, the device has disappeared (check_disk_change) so any
data will hold becomes irrelevant.
In the oter, the device has changed size (check_disk_size_change)
so data we hold may be irrelevant.
In both cases it makes sense to discard any 'clean' buffers,
so they will be read back from the device if needed.
In the former case it makes sense to discard 'dirty' buffers
as there will never be anywhere safe to write the data. In the
second case it *does*not* make sense to discard dirty buffers
as that will lead to file system corruption when you simply enlarge
the containing devices.
flush_disk calls __invalidate_devices.
__invalidate_device calls both invalidate_inodes and invalidate_bdev.
invalidate_inodes *does* discard I_DIRTY inodes and this does lead
to fs corruption.
invalidate_bev *does*not* discard dirty pages, but I don't really care
about that at present.
So this patch adds a flag to __invalidate_device (calling it
__invalidate_device2) to indicate whether dirty buffers should be
killed, and this is passed to invalidate_inodes which can choose to
skip dirty inodes.
flusk_disk then passes true from check_disk_change and false from
check_disk_size_change.
dm avoids tripping over this problem by calling i_size_write directly
rathher than using check_disk_size_change.
md does use check_disk_size_change and so is affected.
This regression was introduced by commit 608aeef17a which causes
check_disk_size_change to call flush_disk, so it is suitable for any
kernel since 2.6.27.
Cc: stable@kernel.org
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Andrew Patterson <andrew.patterson@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
Michael Leun reported that running parallel opens on a fuse filesystem
can trigger a "kernel BUG at mm/truncate.c:475"
Gurudas Pai reported the same bug on NFS.
The reason is, unmap_mapping_range() is not prepared for more than
one concurrent invocation per inode. For example:
thread1: going through a big range, stops in the middle of a vma and
stores the restart address in vm_truncate_count.
thread2: comes in with a small (e.g. single page) unmap request on
the same vma, somewhere before restart_address, finds that the
vma was already unmapped up to the restart address and happily
returns without doing anything.
Another scenario would be two big unmap requests, both having to
restart the unmapping and each one setting vm_truncate_count to its
own value. This could go on forever without any of them being able to
finish.
Truncate and hole punching already serialize with i_mutex. Other
callers of unmap_mapping_range() do not, and it's difficult to get
i_mutex protection for all callers. In particular ->d_revalidate(),
which calls invalidate_inode_pages2_range() in fuse, may be called
with or without i_mutex.
This patch adds a new mutex to 'struct address_space' to prevent
running multiple concurrent unmap_mapping_range() on the same mapping.
[ We'll hopefully get rid of all this with the upcoming mm
preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
lockbreak" patch in particular. But that is for 2.6.39 ]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Michael Leun <lkml20101129@newton.leun.net>
Reported-by: Gurudas Pai <gurudas.pai@oracle.com>
Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no good reason to require the extra step of providing
a mount option for acl or user_xattr once the feature is configured
on; no other filesystem that I know of requires this.
Userspace patches have set these options in default mount options,
and this patch makes them default in the kernel. At some point
we can start to deprecate the options, perhaps.
For now I've removed default mount option checks in show_options()
to be explicit about what's set, since it's changing the default,
but I'm open to alternatives if desired.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Discard granularity tells us the minimum size of extent that can be
discarded by the device. If the user supplies a minimum extent that
should be discarded (range.minlen) which is smaller than the discard
granularity, increase minlen to the discard granularity, since there's
no point submitting trim requests that the device will reject anyway.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The Btrfs fiemap code wasn't properly returning delalloc extents,
so applications that trust fiemap to decide if there are holes in the
file see holes instead of delalloc.
This reworks the btrfs fiemap code, adding a get_extent helper that
searches for delalloc ranges and also adding a helper for extent_fiemap
that skips past holes in the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
For a device that does not support discard, the FITRIM ioctl returns
-EOPNOTSUPP when blkdev_issue_discard() returns this error code, which
is how the user is informed that the device does not support discard.
If there are no suitable free extents to be trimmed, then FITRIM will
return success even though the device does not support discard, which
could confuse the user. So check explicitly if the device supports
discard and return an error code at the beginning of the FITRIM ioctl
processing.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fix compiler warning
fs/udf/balloc.c: In function 'udf_bitmap_new_block':
fs/udf/balloc.c:273: warning: passing argument 1 of '_find_next_bit_le' from incompatible pointer type
fs/udf/balloc.c:285: warning: passing argument 1 of '_find_next_bit_le' from incompatible pointer type
fs/udf/balloc.c:311: warning: passing argument 1 of '_find_next_bit_le' from incompatible pointer type
fs/udf/balloc.c:325: warning: passing argument 1 of '_find_next_bit_le' from incompatible pointer type
The main fix is to add a cast in ext2_find_next_bit().
As all other usage locations of udf_find_next_one_bit()
directly use bh->b_data (which is a char *), the useless
(char *) cast in line 311 can be removed, too.
Signed-off-by: Dirk Behme <dirk.behme@de.bosch.com>
Signed-off-by: George G. Davis <gdavis@mvista.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently we return iodes from xfs_ialloc with just a single reference held.
But we need two references, as one is dropped during transaction commit and
the second needs to be transfered to the VFS. Change xfs_ialloc to use
xfs_iget plus xfs_trans_ijoin_ref to grab two references to the inode,
and remove the now superflous IHOLD calls from all callers. This also
greatly simplifies the error handling in xfs_create and also allow to remove
xfs_trans_iget as no other callers are left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
During mount we establish references to the RT inodes, which we keep for
the lifetime of the filesystem. Instead of using xfs_trans_iget to grab
additional references when adding RT inodes to transactions use the
combination of xfs_ilock and xfs_trans_ijoin_ref, which archives the same
end result with less overhead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Fix bug introduced in patch
85a56480 NFSD: Update XDR decoders in NFSv4 callback client
Although decode_cb_sequence4resok ignores highest slotid and target highest slotid
it must account for their space in their xdr stream when calling xdr_inline_decode
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Right now we, are relying on the fact that when we attempt to
actually do the discard, blkdev_issue_discar() returns -EOPNOTSUPP
and the user is informed that the device does not support discard.
However, in the case where the we do not hit any suitable free
extent to trim in FITRIM code, it will finish without any error.
This is very confusing, because it seems that FITRIM was successful
even though the device does not actually supports discard.
Solution: Check for the discard support before attempt to search for
free extents.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The FSGEOMETRY_V1 ioctl (and its compat equivalent) calls out to
xfs_fs_geometry() with a version number of 3. This code path does not
fill in the logsunit member of the passed xfs_fsop_geom_t, leading to
the leaking of four bytes of uninitialized stack data to potentially
unprivileged callers.
v2 switches to memset() to avoid future issues if structure members
change, on suggestion of Dave Chinner.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Reviewed-by: Eugene Teo <eugeneteo@kernel.org>
Signed-off-by: Alex Elder <aelder@sgi.com>
In dlm_query_region_handler(), move the kmalloc outside the spinlock.
This allows us to use GFP_KERNEL instead of GFP_ATOMIC.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Right now we, are relying on the fact that when we attempt to
actually do the discard, blkdev_issue_discar() returns -EOPNOTSUPP
and the user is informed that the device does not support discard.
However, in the case where the we do not hit any suitable free
extent to trim in FITRIM code, it will finish without any error.
This is very confusing, because it seems that FITRIM was successful
even though the device does not actually supports discard.
Solution: Check for the discard support before attempt to search for
free extents.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
I cannot disable inode-read-ahead feature of ext4 (on 2.6.37):
# echo 0 > /sys/fs/ext4/sda2/inode_readahead_blks
bash: echo: write error: Invalid argument
On a server with lots of small files and random access this read-ahead makes
performance worse, and I'd like to disable it. I work around this problem
by using value of 1, but it still reads an extra block.
This patch fixes the problem by checking for zero explicitly.
Signed-off-by: Alexander V. Lukyanov <lav@netis.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch fixes the warning "Using plain integer as NULL pointer",
generated by sparse, by replacing the offending 0s with NULL.
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The FSGEOMETRY_V1 ioctl (and its compat equivalent) calls out to
xfs_fs_geometry() with a version number of 3. This code path does not
fill in the logsunit member of the passed xfs_fsop_geom_t, leading to
the leaking of four bytes of uninitialized stack data to potentially
unprivileged callers.
v2 switches to memset() to avoid future issues if structure members
change, on suggestion of Dave Chinner.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Reviewed-by: Eugene Teo <eugeneteo@kernel.org>
Signed-off-by: Alex Elder <aelder@sgi.com>
Compile 2.6.38-rc1 with turning EXT4FS_DEBUG on,
we get following compile warnings. This patch fixes them.
CC fs/ext4/hash.o
CC fs/ext4/resize.o
fs/ext4/resize.c: In function 'setup_new_group_blocks':
fs/ext4/resize.c:233:2: warning: format '%#04llx' expects type 'long long
unsigned int', but argument 3 has type 'long unsigned int'
fs/ext4/resize.c:251:2: warning: format '%#04llx' expects type 'long long
unsigned int', but argument 3 has type 'long unsigned int'
CC fs/ext4/extents.o
CC fs/ext4/ext4_jbd2.o
CC fs/ext4/migrate.o
Reported-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
eCryptfs: Copy up lower inode attrs in getattr
ecryptfs: read on a directory should return EISDIR if not supported
eCryptfs: Handle NULL nameidata pointers
eCryptfs: Revert "dont call lookup_one_len to avoid NULL nameidata"
LANMAN response length was changed to 16 bytes instead of 24 bytes.
Revert it back to 24 bytes.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
The lower filesystem may do some type of inode revalidation during a
getattr call. eCryptfs should take advantage of that by copying the
lower inode attributes to the eCryptfs inode after a call to
vfs_getattr() on the lower inode.
I originally wrote this fix while working on eCryptfs on nfsv3 support,
but discovered it also fixed an eCryptfs on ext4 nanosecond timestamp
bug that was reported.
https://bugs.launchpad.net/bugs/613873
Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
read() calls against a file descriptor connected to a directory are
incorrectly returning EINVAL rather than EISDIR:
[EISDIR]
[XSI] [Option Start] The fildes argument refers to a directory and the
implementation does not allow the directory to be read using read()
or pread(). The readdir() function should be used instead. [Option End]
This occurs because we do not have a .read operation defined for
ecryptfs directories. Connect this up to generic_read_dir().
BugLink: http://bugs.launchpad.net/bugs/719691
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Remove mlog(0,...) and mlog(ML_UPTODATE,...) from
fs/ocfs2/uptodate.c and fs/ocfs2/buffer_head_io.c.
The masklog UPTODATE is removed finally.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Remove mlog(0,...) and mlog(ML_BH_IO,...) from
fs/ocfs2/buffer_head_io.c.
The masklog BH_IO is removed finally.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
ocfs2_iget is used to get/create inode. Only iget5_locked
will give us an inode = NULL. So move this check ahead of
ocfs2_read_locked_inode so that we don't need to check
inode before we read and unlock inode. This is also helpful
for trace event(see the next patch).
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
As there are no such debug information in fs/ocfs2/ioctl.c,
fs/ocfs2/locks.c and fs/ocfs2/sysfile.c, ML_INODE are also
removed.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Change all the "mlog(0," in fs/ocfs2/refcounttree.c to trace events.
And finally remove masklog ML_REFCOUNT.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Since all 4 files, localalloc.c, suballoc.c, alloc.c and
resize.c, which use DISK_ALLOC are changed to trace events,
Remove masklog DISK_ALLOC totally.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
This is the 2nd step to remove the debug info of DISK_ALLOC.
So this patch removes all mlog(0,...) from localalloc.c and adds
the corresponding tracepoints. Different mlogs have different
solutions.
1. Some are replaced with trace event directly.
2. Some are replaced while some new parameters are added.
3. Some are combined into one trace events.
4. Some redundant mlogs are removed.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
This is the first try of replacing debug mlog(0,...) to
trace events. Wengang has did some work in his original
patch
http://oss.oracle.com/pipermail/ocfs2-devel/2009-November/005513.html
But he didn't finished it.
So this patch removes all mlog(0,...) from alloc.c and adds
the corresponding trace events. Different mlogs have different
solutions.
1. Some are replaced with trace event directly.
2. Some are replaced and some new parameters are added since
I think we need to know the btree owner in that case.
3. Some are combined into one trace events.
4. Some redundant mlogs are removed.
What's more, it defines some event classes so that we can use
them later.
Cc: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
About one year ago, Wengang Wang tried some first steps
to add tracepoints to ocfs2. Hiss original patch is here:
http://oss.oracle.com/pipermail/ocfs2-devel/2009-November/005512.html
But as Steven Rostedt indicated in his article
http://lwn.net/Articles/383362/, we'd better have our trace
files resides in fs/ocfs2, so I rewrited the patch using the
method Steven mentioned in that article.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
mlog_exit is used to record the exit status of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.
This patch just try to remove it or change it. So:
1. if all the error paths already use mlog_errno, it is just removed.
Otherwise, it will be replaced by mlog_errno.
2. if it is used to print some return value, it is replaced with
mlog(0,...).
mlog_exit_ptr is changed to mlog(0.
All those mlog(0,...) will be replaced with trace events later.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
ENTRY is used to record the entry of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.
So for mlog_entry_void, we just remove it.
for mlog_entry(...), we replace it with mlog(0,...), and they
will be replace by trace event later.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
It's a best-effort attempt to simplize duplicated codes here.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
In cad3f00, ext4_check_dir_entry was modified by adding some unlikely.
Ted described it as "This function gets called a lot for large
directories, and the answer is almost always 'no, no, there's no problem'.
This means using unlikely() is a good thing."
ext3 added the similar change in commit a4ae309.
So change it accordingly in ocfs2.
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Patch makes use of the hrtimer to track times in ocfs2 lock stats.
The patch is a bit involved to ensure no additional impact on the memory
footprint. The size of ocfs2_inode_cache remains 1280 bytes on 32-bit systems.
A related change was to modify the unit of the max wait time from nanosec to
microsec allowing us to track max time larger than 4 secs. This change
necessitated the bumping of the output version in the debugfs file,
locking_state, from 2 to 3.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Commit 2c442719e9 added some checks for proper
heartbeat mode when the o2cb stack is running. Unfortunately, it didn't
take into account that a userpsace stack could be running. Fix this by only
doing the check if o2cb is in use. This patch allows userspace stacks to
mount the fs again.
Cc: stable@kernel.org
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Current refcounttree codes actually didn't writeback the new pages out in
write-back mode, due to a bug of always passing a ZERO number of clusters
to 'ocfs2_cow_sync_writeback', the patch tries to pass a proper one in.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <jlbec@evilplan.org>
In the rare case that INLINE_DATA, INDEX_DIR, QUOTA, XATTR features are
disabled and both the allocation of the directory inode and the allocation
of the first directory block need to relink allocation group, there need
not be enough credits reserved in a transaction. Fix the estimate.
CC: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
When creating a new dentry we now hold a reference to the parent
inode in the ceph_dentry. This is required due to the new RCU
changes from 949854d0, which set dentry->d_parent to NULL in d_kill before
calling the ->release() callback. If/when that behavior is changed, we can
revert this hack.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
* 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: make sure MAYDAY_INITIAL_TIMEOUT is at least 2 jiffies long
workqueue, freezer: unify spelling of 'freeze' + 'able' to 'freezable'
workqueue: wake up a worker when a rescuer is leaving a gcwq
When __debugfs_remove() fails (because simple_rmdir() fails e.g. when a
directory is not empty), we must not decrement use count of the filesystem
as nothing was in fact deleted.
This fixes use after free caused by debugfs in some cases.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This reverts commit 21edad3220 and commit
93c3fe40c2, which fixed a regression by
the former.
Al Viro pointed out bypassed dcache lookups in
ecryptfs_new_lower_dentry(), misuse of vfs_path_lookup() in
ecryptfs_lookup_one_lower() and a dislike of passing nameidata to the
lower filesystem.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Validate number of blocks in map and remove redundant variable.
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dcache-locking.txt is not exist any more, and the path was not
correct anyway. Fix it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The code finds, the '%' sign in an ipv6 address and copies that to a
buffer allocated on the stack. It then ignores that buffer, and passes
'pct' to simple_strtoul(), which doesn't work right because we're
comparing 'endp' against a completely different string.
Fix it by passing the correct pointer. While we're at it, this is a
good candidate for conversion to strict_strtoul as well.
Cc: stable@kernel.org
Cc: David Howells <dhowells@redhat.com>
Reported-by: Björn JACKE <bj@sernet.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This reverts commit 75f1dc0d07 ("block: check bdev_read_only() from
blkdev_get()"). That commit added stricter checking to make sure
devices that were being used read-only were actually opened in that
mode.
It turns out that the change breaks a bunch of kernel code that opens
block devices. Affected systems include dm, md, and the loop device.
Because strict checking for read-only opens of block devices was not
done before this, the code that opens the devices was opening them
read-write even if they were being used read-only. Auditing all that
code will take time, and new userspace packages for dm, mdadm, etc.
will also be required.
Signed-off-by: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These functions return an nfs status, not a host_err. So don't
try to convert before returning.
This is a regression introduced by
3c726023402a2f3b28f49b9d90ebf9e71151157d; I fixed up two of the callers,
but missed these two.
Cc: stable@kernel.org
Reported-by: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This fixes a bug introduced in d4d77629, where the device added online
(and therefore initialized via btrfs_init_new_device()) would be left
with the positive bdev->bd_holders after unmount. Since d4d77629 we no
longer OR FMODE_EXCL explicitly on blkdev_put(), set it in
btrfs_device->mode.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If shrinking done as part of the online device removal fails add that
device back to the allocation list and increment the rw_devices counter.
This fixes two bugs:
1) we could have a perfectly good device out of alloc list for no good
reason;
2) in the btrfs consisting of two devices, failure in btrfs_rm_device()
could lead to a situation where it was impossible to remove any of the
devices because of the "unable to remove the only writeable device"
error.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When decompressing a chunk of data, we'll copy the data out to
a working buffer if the data is stored in more than one page,
otherwise we'll use the mapped page directly to avoid memory
copy.
In the latter case, we'll end up accessing the kernel address
after we've unmapped the page in a corner case.
Reported-by: Juan Francisco Cantero Hurtado <iam@juanfra.info>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- Check user-specified flags correctly
- Check the inode owership
- Search root item in root tree but not fs tree
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs device shrinking and balancing ends up reallocating all the blocks
in order to allow COW to move them to new destinations. It is somewhat
awkward in terms of ENOSPC because most of the enospc code is built
around the idea that some operation on a reference counted tree triggers
allocations in the non-reference counted trees.
This commit changes the balancing code to deal with enospc by trying to
allocate a new chunk. If that allocation succeeds, we go ahead and
retry whatever failed due to enospc.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
ENOSPC in btrfs is getting to the point where the extra debugging isn't
required. I've put it under mount -o enospc_debug just in case someone
is having difficult problems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When Al moved the nameidata_dentry_drop_rcu_maybe() call into the
do_follow_link function in commit 844a391799 ("nothing in
do_follow_link() is going to see RCU"), he mistakenly left the
BUG_ON(inode != path->dentry->d_inode);
behind. Which would otherwise be ok, but that BUG_ON() really needs to
be _after_ dropping RCU, since the dentry isn't necessarily stable
otherwise.
So complete the code movement in that commit, and move the BUG_ON() into
do_follow_link() too. This means that we need to pass in 'inode' as an
argument (just for this one use), but that's a small thing. And
eventually we may be confident enough in our path lookup that we can
just remove the BUG_ON() and the unnecessary inode argument.
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two spellings in use for 'freeze' + 'able' - 'freezable' and
'freezeable'. The former is the more prominent one. The latter is
mostly used by workqueue and in a few other odd places. Unify the
spelling to 'freezable'.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Steven Whitehouse <swhiteho@redhat.com>
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux:
nfsd: break lease on unlink due to rename
nfsd4: acquire only one lease per file
nfsd4: modify fi_delegations under recall_lock
nfsd4: remove unused deleg dprintk's.
nfsd4: split lease setting into separate function
nfsd4: fix leak on allocation error
nfsd4: add helper function for lease setup
nfsd4: split up nfsd_break_deleg_cb
NFSD: memory corruption due to writing beyond the stat array
NFSD: use nfserr for status after decode_cb_op_status
nfsd: don't leak dentry count on mnt_want_write failure
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
get rid of nameidata_dentry_drop_rcu() calling nameidata_drop_rcu()
drop out of RCU in return_reval
split do_revalidate() into RCU and non-RCU cases
in do_lookup() split RCU and non-RCU cases of need_revalidate
nothing in do_follow_link() is going to see RCU
task_show_regs used to be a debugging aid in the early bringup days
of Linux on s390. /proc/<pid>/status is a world readable file, it
is not a good idea to show the registers of a process. The only
correct fix is to remove task_show_regs.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I add the check on the return value of alloc_extent_map() to several places.
In addition, alloc_extent_map() returns only the address or NULL.
Therefore, check by IS_ERR() is unnecessary. So, I remove IS_ERR() checking.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Memory allocated by calling kstrdup() should be freed.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit bf5fc093c5 refactored
btrfs_ioctl_space_info() and introduced several security issues.
space_args.space_slots is an unsigned 64-bit type controlled by a
possibly unprivileged caller. The comparison as a signed int type
allows providing values that are treated as negative and cause the
subsequent allocation size calculation to wrap, or be truncated to 0.
By providing a size that's truncated to 0, kmalloc() will return
ZERO_SIZE_PTR. It's also possible to provide a value smaller than the
slot count. The subsequent loop ignores the allocation size when
copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR.
The fix changes the slot count type and comparison typecast to u64,
which prevents truncation or signedness errors, and also ensures that we
don't copy more data than we've allocated in the subsequent loop. Note
that zero-size allocations are no longer possible since there is already
an explicit check for space_args.space_slots being 0 and truncation of
this value is no longer an issue.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Mark the cloned backref_node as checked in clone_backref_node()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs tracks uptodate state in an rbtree as well as in the
page bits. This is supposed to enable us to use block sizes other than
the page size, but there are a few parts still missing before that
completely works.
But, our readpage routine trusts this additional range based tracking
of uptodateness, much in the same way the buffer head up to date bits
are trusted for the other filesystems.
The problem is that sometimes we need to allocate memory in order to
split records in the rbtree, even when we are just clearing bits. This
can be difficult when our clearing function is called GFP_ATOMIC, which
can happen in the releasepage path.
So, what happens today looks like this:
releasepage called with GFP_ATOMIC
btrfs_releasepage calls clear_extent_bit
clear_extent_bit fails to allocate ram, leaving the up to date bit set
btrfs_releasepage returns success
The end result is the page being gone, but btrfs thinking the range is
up to date. Later on if someone tries to read that same page, the
btrfs readpage code will return immediately thinking the page is already
up to date.
This commit fixes things to fail the releasepage when we can't clear the
extent state bits. It covers both data pages and metadata tree blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a race where btrfs_releasepage can drop the
page->private contents just as alloc_extent_buffer is setting
up pages for metadata. Because of how the Btrfs page flags work,
this results in us skipping the crc on the page during IO.
This patch sovles the race by waiting until after the extent buffer
is inserted into the radix tree before it sets page private.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
4795bb37ef "nfsd: break lease on unlink,
link, and rename", only broke the lease on the file that was being
renamed, and didn't handle the case where the target path refers to an
already-existing file that will be unlinked by a rename--in that case
the target file should have any leases broken as well.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Instead of acquiring one lease each time another client opens a file,
nfsd can acquire just one lease to represent all of them, and reference
count it to determine when to release it.
This fixes a regression introduced by
c45821d263 "locks: eliminate fl_mylease
callback": after that patch, only the struct file * is used to determine
who owns a given lease. But since we recently converted the server to
share a single struct file per open, if we acquire multiple leases on
the same file from nfsd, it then becomes impossible on unlocking a lease
to determine which of those leases (all of whom share the same struct
file *) we meant to remove.
Thanks to Takashi Iwai <tiwai@suse.de> for catching a bug in a previous
version of this patch.
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Modify fi_delegations only under the recall_lock, allowing us to use
that list on lease breaks.
Also some trivial cleanup to simplify later changes.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If nfsd fails to find an exported via NFS file in the readahead cache, it
should increment corresponding nfsdstats counter (ra_depth[10]), but due to a
bug it may instead write to ra_depth[11], corrupting the following field.
In a kernel with NFSDv4 compiled in the corruption takes the form of an
increment of a counter of the number of NFSv4 operation 0's received; since
there is no operation 0, this is harmless.
In a kernel with NFSDv4 disabled it corrupts whatever happens to be in the
memory beyond nfsdstats.
Signed-off-by: Konstantin Khorenko <khorenko@openvz.org>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Bugs introduced in 85a5648019
"NFSD: Update XDR decoders in NFSv4 callback client"
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: call __jbd2_log_start_commit with j_state_lock write locked
ext4: serialize unaligned asynchronous DIO
ext4: make grpinfo slab cache names static
ext4: Fix data corruption with multi-block writepages support
ext4: fix up ext4 error handling
ext4: unregister features interface on module unload
ext4: fix panic on module unload when stopping lazyinit thread
On an SMP ARM system running ext4, I've received a report that the
first J_ASSERT in jbd2_journal_commit_transaction has been triggering:
J_ASSERT(journal->j_running_transaction != NULL);
While investigating possible causes for this problem, I noticed that
__jbd2_log_start_commit() is getting called with j_state_lock only
read-locked, in spite of the fact that it's possible for it might
j_commit_request. Fix this by grabbing the necessary information so
we can test to see if we need to start a new transaction before
dropping the read lock, and then calling jbd2_log_start_commit() which
will grab the write lock.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.
The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.
Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.
I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.
This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.
Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.
[tytso@mit.edu: Keep the mutex as a hashed array instead
of bloating the ext4 inode]
[tytso@mit.edu: Fix up namespace issues so that global
variables are protected with an "ext4_" prefix.]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In 2.6.37 I was running into oopses with repeated module
loads & unloads. I tracked this down to:
fb1813f4 ext4: use dedicated slab caches for group_info structures
(this was in addition to the features advert unload problem)
The kstrdup & subsequent kfree of the cache name was causing
a double free. In slub, at least, if I read it right it allocates
& frees the name itself, slab seems to do something different...
so in slub I think we were leaking -our- cachep->name, and double
freeing the one allocated by slub.
After getting lost in slab/slub/slob a bit, I just looked at other
sized-caches that get allocated. jbd2, biovec, sgpool all do it
more or less the way jbd2 does. Below patch follows the jbd2
method of dynamically allocating a cache at mount time from
a list of static names.
(This might also possibly fix a race creating the caches with
parallel mounts running).
[Folded in a fix from Dan Carpenter which fixed an off-by-one error in
the original patch]
Cc: stable@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: don't always drop malformed replies on the floor (try #3)
cifs: clean up checks in cifs_echo_request
[CIFS] Do not send SMBEcho requests on new sockets until SMBNegotiate
In commit fa0d7e3de6 ("fs: icache RCU free inodes"), we use rcu free
inode instead of freeing the inode directly. It causes a crash when we
rmmod immediately after we umount the volume[1].
So we need to call rcu_barrier after we kill_sb so that the inode is
freed before we do rmmod. The idea is inspired by Aneesh Kumar.
rcu_barrier will wait for all callbacks to end before preceding. The
original patch was done by Tao Ma, but synchronize_rcu() is not enough
here.
1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2
Tested-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 31e6b01f41 ("fs: rcu-walk for path lookup") we started doing
path lookup using RCU, which then falls back to a careful non-RCU lookup
in case of problems (LOOKUP_REVAL). So do_filp_open() has this "re-do
the lookup carefully" looping case.
However, that means that we must not release the open-intent file data
if we are going to loop around and use it once more!
Fix this by moving the release of the open-intent data to the function
that allocates it (do_filp_open() itself) rather than the helper
functions that can get called multiple times (finish_open() and
do_last()). This makes the logic for the lifetime of that field much
more obvious, and avoids the possible double free.
Reported-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent commit to use cmwq for send and recv threads
dcce240ead introduced problems,
apparently due to multiple workqueue threads. Single threads
make the problems go away, so return to that until we fully
understand the concurrency issues with multiple threads.
Signed-off-by: David Teigland <teigland@redhat.com>
Slight revision to this patch...use min_t() instead of conditional
assignment. Also, remove the FIXME comment and replace it with the
explanation that Steve gave earlier.
After receiving a packet, we currently check the header. If it's no
good, then we toss it out and continue the loop, leaving the caller
waiting on that response.
In cases where the packet has length inconsistencies, but the MID is
valid, this leads to unneeded delays. That's especially problematic now
that the client waits indefinitely for responses.
Instead, don't immediately discard the packet if checkSMB fails. Try to
find a matching mid_q_entry, mark it as having a malformed response and
issue the callback.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
ima_counts_get() updated the readcount and invalidated the PCR,
as necessary. Only update the i_readcount in the VFS layer.
Move the PCR invalidation checks to ima_file_check(), where it
belongs.
Maintaining the i_readcount in the VFS layer, will allow other
subsystems to use i_readcount.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Eric Paris <eparis@redhat.com>
Follow-on patch to 7e90d705 which is already in Steve's tree...
The check for tcpStatus == CifsGood is not meaningful since it doesn't
indicate whether the NEGOTIATE request has been done. Also, clarify
why we're checking for maxBuf == 0.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In order to determine whether an SMBEcho request can be sent
we need to know that the socket is established (server tcpStatus == CifsGood)
AND that an SMB NegotiateProtocol has been sent (server maxBuf != 0).
Without the second check we can send an Echo request during reconnection
before the server can accept it.
CC: JG <jg@cms.ac>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This is a minor patch which fixes the LEB number we print when
corrupted empty space is found.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (33 commits)
Btrfs: Fix page count calculation
btrfs: Drop __exit attribute on btrfs_exit_compress
btrfs: cleanup error handling in btrfs_unlink_inode()
Btrfs: exclude super blocks when we read in block groups
Btrfs: make sure search_bitmap finds something in remove_from_bitmap
btrfs: fix return value check of btrfs_start_transaction()
btrfs: checking NULL or not in some functions
Btrfs: avoid uninit variable warnings in ordered-data.c
Btrfs: catch errors from btrfs_sync_log
Btrfs: make shrink_delalloc a little friendlier
Btrfs: handle no memory properly in prepare_pages
Btrfs: do error checking in btrfs_del_csums
Btrfs: use the global block reserve if we cannot reserve space
Btrfs: do not release more reserved bytes to the global_block_rsv than we need
Btrfs: fix check_path_shared so it returns the right value
btrfs: check return value of btrfs_start_ioctl_transaction() properly
btrfs: fix return value check of btrfs_join_transaction()
fs/btrfs/inode.c: Add missing IS_ERR test
btrfs: fix missing break in switch phrase
btrfs: fix several uncheck memory allocations
...
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: remove checks for ses->status == CifsExiting
cifs: add check for kmalloc in parse_dacl
cifs: don't send an echo request unless NegProt has been done
cifs: enable signing flag in SMB header when server has it on
cifs: Possible slab memory corruption while updating extended stats (repost)
CIFS: Fix variable types in cifs_iovec_read/write (try #2)
cifs: fix length vs. total_read confusion in cifs_demultiplex_thread
Handle block allocation for forceful unstuffing of quota dinode during quota
update using quotactl(). Also fix block reservation for special cases when
quotas cross over block boundaries and update 2 blocks instead of 1.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The rt bitmap and summary inodes do not participate in the normal inode
locking protocol. Instead the rt bitmap inode can be locked in any
transaction involving rt allocations, and the both of the rt inodes can
be locked at the same time. Add specific lockdep subclasses for the rt
inodes to prevent lockdep from blowing up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We can easily set the extsize flag without setting an extent size
hint, or one that evaluates to zero. Historically the di_extsize
field was only used when it was non-zero, but the commit
"Cleanup inode extent size hint extraction"
broke this. Restore the old behaviour, thus fixing xfsqa 090 with
a debug kernel.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently both xfs_rtpick_extent and xfs_rtallocate_extent call
xfs_trans_iget to grab and lock the rt bitmap inode, which results in a
deadlock since the removal of the lock recursion counters in commit
"xfs: simplify inode to transaction joining"
Fix this by acquiring and locking the inode in xfs_bmap_rtalloc before
calling into xfs_rtpick_extent and xfs_rtallocate_extent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
take offset of start position into account when calculating page count.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a corruption problem with the multi-block
writepages submittal change for ext4, from commit
bd2d0210cf ("ext4: use bio
layer instead of buffer layer in mpage_da_submit_io").
(Note that this corruption is not present in 2.6.37 on
ext4, because the corruption was detected after the
feature was merged in 2.6.37-rc1, and so it was turned
off by adding a non-default mount option,
mblk_io_submit. With this commit, which hopefully
fixes the last of the bugs with this feature, we'll be
able to turn on this performance feature by default in
2.6.38, and remove the mblk_io_submit option.)
The ext4 code path to bundle multiple pages for
writeback in ext4_bio_write_page() had a bug: we should
be clearing buffer head dirty flags *before* we submit
the bio, not in the completion routine.
The patch below was tested on 2.6.37 under KVM with the
postgresql script which was submitted by Jon Nelson as
documented in commit 1449032be1.
Without the patch, I'd hit the corruption problem about
50-70% of the time. With the patch, I executed the
script > 100 times with no corruption seen.
I also fixed a bug to make sure ext4_end_bio() doesn't
dereference the bio after the bio_put() call.
Reported-by: Jon Nelson <jnelson@jamponi.net>
Reported-by: Matthias Bayer <jackdachef@gmail.com>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
ses->status is never set to CifsExiting, so these checks are
always false.
Tested-by: JG <jg@cms.ac>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This patch adds more commentaries about UBIFS recovery logic which should
explain the famous UBIFS "corrupt empty space" errors.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
As this function is called in some error paths while not
removing the module, the __exit attribute prevents the kernel
image from linking when btrfs is compiled in statically.
Signed-off-by: Alexey Charkov <alchark@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs_alloc_path() fails, btrfs_free_path() need not be called.
Therefore, it changes the branch ahead.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This has been resulting in a BUT_ON(ret) after btrfs_reserve_extent in
btrfs_cow_file_range. The reason is we don't actually calculate the bytes_super
for a block group until we go to cache it, which means that the space_info can
hand out reservations for space that it doesn't actually have, and we can run
out of data space. This is also a problem if you are using space caching since
we don't ever calculate bytes_super for the block groups. So instead everytime
we read a block group call exclude_super_stripes, which calculates the
bytes_super for the block group so it can be left out of the space_info. Then
whenever caching completes we just call free_excluded_extents so that the super
excluded extents are freed up. Also if we are unmounting and we hit any block
groups that haven't been cached we still need to call free_excluded_extents to
make sure things are cleaned up properly. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we're cleaning up the tree log we need to be able to remove free space from
the block group. The problem is if that free space spans bitmaps we would not
find the space since we're looking for too many bytes. So make sure the amount
of bytes we search for is limited to either the number of bytes we want, or the
number of bytes left in the bitmap. This was tested by a user who was hitting
the BUG() after search_bitmap. With this patch he can now mount his fs.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Exit from parse_dacl if no memory returned from the call to kmalloc.
Signed-off-by: Stanislav Fomichev <kernel@fomichev.me>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We were forming a dirty list, and then queueing cap_snaps for each realm
_and_ its children, regardless of whether the children were already in the
dirty list. This meant we did it twice for some realms. Which in turn
meant we corrupted mdsc->snap_flush_list when the cap_snap was re-added to
the list it was already on, and could trigger an infinite loop.
We were also using recursion to do reach all the children, a no-no when
stack is limited.
Instead, (re)queue any children on the dirty list, avoiding processing
anything twice and avoiding any recursion.
Signed-off-by: Sage Weil <sage@newdream.net>
When the socket to the server is disconnected, the client more or less
immediately calls cifs_reconnect to reconnect the socket. The NegProt
and SessSetup however are not done until an actual call needs to be
made.
With the addition of the SMB echo code, it's possible that the server
will initiate a disconnect on an idle socket. The client will then
reconnect the socket but no NegotiateProtocol request is done. The
SMBEcho workqueue job will then eventually pop, and an SMBEcho will be
sent on the socket. The server will then reject it since no NegProt was
done.
The ideal fix would be to either have the socket not be reconnected
until we plan to use it, or to immediately do a NegProt when the
reconnect occurs. The code is not structured for this however. For now
we must just settle for not sending any echoes until the NegProt is
done.
Reported-by: JG <jg@cms.ac>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs_sign_smb only generates a signature if the correct Flags2 bit is
set. Make sure that it gets set correctly if we're sending an async
call.
This patch fixes:
https://bugzilla.kernel.org/show_bug.cgi?id=28142
Reported-and-Tested-by: JG <jg@cms.ac>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Updating extended statistics here can cause slab memory corruption
if a callback function frees slab memory (mid_entry).
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In get_empty_filp() since 2.6.29, file_free(f) is called with f->f_cred == NULL
when security_file_alloc() returned an error. As a result, kernel will panic()
due to put_cred(NULL) call within RCU callback.
Fix this bug by assigning f->f_cred before calling security_file_alloc().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Variable 'i' should be unsigned long as it's used in circle with num_pages,
and bytes_read/total_written should be ssize_t according to return value.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
hfsplus: fix up a comparism in hfsplus_file_extend
hfsplus: fix two memory leaks in wrapper.c
hfsplus: do not leak buffer on error
hfsplus: fix failed mount handling
debugfs can't be a module, so module_exit() is meaningless for it.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Revert an incorrect hunk from commit b2837fcf49,
"hfsplus: %L-to-%ll, macro correction, and remove unneeded braces"
revert a pointless change of comparism operation argument order, which turned
out to not even be equivalent.
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Currently the error handling in hfsplus_fill_super is a mess, and can
lead to accessing fields in the superblock that haven't been even set
up yet. Fix this by making sure we do not set up sb->s_root until we
have the mount fully set up, and before that do proper step by step
unwinding instead of using hfsplus_put_super as a big hammer.
Reported-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Ext4 features interface was not properly unregistered which led to
problems while unloading/reloading ext4 module. This commit fixes that by
adding proper kobject unregistration code into ext4_exit_fs() as well as
fail-path of ext4_init_fs()
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
https://bugzilla.kernel.org/show_bug.cgi?id=27652
If the lazyinit thread is running, the teardown function
ext4_destroy_lazyinit_thread() has problems:
ext4_clear_request_list();
while (ext4_li_info->li_task) {
wake_up(&ext4_li_info->li_wait_daemon);
wait_event(ext4_li_info->li_wait_task,
ext4_li_info->li_task == NULL);
}
Clearing the request list will cause the thread to exit and free
ext4_li_info, so then we're waiting on something which is getting
freed.
Fix this up by making the thread respond to kthread_stop, and exit,
without the need to wait for that exit in some other homegrown way.
Cc: stable@kernel.org
Reported-and-Tested-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This reverts commit 115e19c535.
Apparently setting inode->bdi to one's own sb->s_bdi stops VFS from
sending *read-aheads*. This problem was bisected to this commit. A
revert fixes it. I'll investigate farther why is this happening for the
next Kernel, but for now a revert.
I'm sending to stable@kernel.org as well, since it exists also in
2.6.37. 2.6.36 is good and does not have this patch.
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some filesystems don't deal well with being asked to map less than
blocksize blocks (GFS2 for example). Since we are always mapping at least
blocksize sections anyway, just make sure len is at least as big as a
blocksize so we don't trip up any filesystems. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FMODE_EXEC is a constant type of fmode_t but was used with normal integer
constants. This results in following warnings from sparse. Fix it using
new macro __FMODE_EXEC.
fs/exec.c:116:58: warning: restricted fmode_t degrades to integer
fs/exec.c:689:58: warning: restricted fmode_t degrades to integer
fs/fcntl.c:777:9: warning: restricted fmode_t degrades to integer
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 95aac7b1cd ("epoll: make epoll_wait() use the hrtimer range
feature") added a performance regression because it uses timespec_add_ns()
with potential very large 'ns' values.
[akpm@linux-foundation.org: s/epoll_set_mstimeout/ep_set_mstimeout/, per Davide]
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Shawn Bohrer <shawn.bohrer@gmail.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org> [2.6.37.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mmap system call grabs a glock when an update to atime maybe
required. It does this in order to ensure that the flags on the
inode are uptodate, but since it will only mark atime for a future
update, an exclusive lock is not required here (one will be taken
later when the actual update is performed).
Also, the lock can be skipped when the mount is marked noatime in
addition to the original check which only looked at the noatime
flag for the inode itself.
This should increase the scalability of the mmap call when multiple
nodes are all mmaping the same file.
Reported-by: Scooter Morris <scooter@cgl.ucsf.edu>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
length at this point is the length returned by the last kernel_recvmsg
call. total_read is the length of all of the data read so far. length
is more or less meaningless at this point, so use total_read for
everything.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: fix length checks in checkSMB
[CIFS] Update cifs minor version
cifs: No need to check crypto blockcipher allocation
cifs: clean up some compiler warnings
cifs: make CIFS depend on CRYPTO_MD4
cifs: force a reconnect if there are too many MIDs in flight
cifs: don't pop a printk when sending on a socket is interrupted
cifs: simplify SMB header check routine
cifs: send an NT_CANCEL request when a process is signalled
cifs: handle cancelled requests better
cifs: fix two compiler warning about uninitialized vars
This fixes an old (2007) selinux regression: filesystem labeling for
/proc/sys returned
-r--r--r-- unknown /proc/sys/fs/file-nr
instead of
-r--r--r-- system_u:object_r:sysctl_fs_t:s0 /proc/sys/fs/file-nr
Events that lead to breaking of /proc/sys/ selinux labeling:
1) sysctl was reimplemented to route all calls through /proc/sys/
commit 77b14db502
[PATCH] sysctl: reimplement the sysctl proc support
2) proc_dir_entry was removed from ctl_table:
commit 3fbfa98112
[PATCH] sysctl: remove the proc_dir_entry member for the sysctl tables
3) selinux still walked the proc_dir_entry tree to apply
labeling. Because ctl_tables don't have a proc_dir_entry, we did
not label /proc/sys/ inodes any more. To achieve this the /proc/sys/
inodes were marked private and private inodes were ignored by
selinux.
commit bbaca6c2e7
[PATCH] selinux: enhance selinux to always ignore private inodes
commit 86a71dbd3e
[PATCH] sysctl: hide the sysctl proc inodes from selinux
Access control checks have been done by means of a special sysctl hook
that was called for read/write accesses to any /proc/sys/ entry.
We don't have to do this because, instead of walking the
proc_dir_entry tree we can walk the dentry tree (as done in this
patch). With this patch:
* we don't mark /proc/sys/ inodes as private
* we don't need the sysclt security hook
* we walk the dentry tree to find the path to the inode.
We have to strip the PID in /proc/PID/ entries that have a
proc_dir_entry because selinux does not know how to label paths like
'/1/net/rpc/nfsd.fh' (and defaults to 'proc_t' labeling). Selinux does
know of '/net/rpc/nfsd.fh' (and applies the 'sysctl_rpc_t' label).
PID stripping from the path was done implicitly in the previous code
because the proc_dir_entry tree had the root in '/net' in the example
from above. The dentry tree has the root in '/1'.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
SELinux would like to implement a new labeling behavior of newly created
inodes. We currently label new inodes based on the parent and the creating
process. This new behavior would also take into account the name of the
new object when deciding the new label. This is not the (supposed) full path,
just the last component of the path.
This is very useful because creating /etc/shadow is different than creating
/etc/passwd but the kernel hooks are unable to differentiate these
operations. We currently require that userspace realize it is doing some
difficult operation like that and than userspace jumps through SELinux hoops
to get things set up correctly. This patch does not implement new
behavior, that is obviously contained in a seperate SELinux patch, but it
does pass the needed name down to the correct LSM hook. If no such name
exists it is fine to pass NULL.
Signed-off-by: Eric Paris <eparis@redhat.com>
The error check of btrfs_start_transaction() is added, and the mistake
of the error check on several places is corrected.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because NULL is returned when the memory allocation fails,
it is checked whether it is NULL.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Convert from create[_singlethread]_workqueue() to alloc_workqueue().
* xfsdatad_workqueue and xfsconvertd_workqueue are identity converted.
Using higher concurrency limit might be useful but given the
complexity of workqueue usage in xfs, proceeding cautiously seems
better.
* xfs_mru_reap_wq is converted to non-ordered workqueue with max
concurrency of 1 as the work items don't require any specific
ordering and already have proper synchronization. It seems it was
singlethreaded to save worker threads, which is no longer a concern.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Cc: Christoph Hellwig <hch@infradead.org>
The maximum number of concurrent work items queued on commit_wq is
bound by the number of active journals. Convert to alloc_workqueue()
and use the default concurrency level so that they can be processed in
parallel.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: reiserfs-devel@vger.kernel.org
ocfs2_quota_wq is not depended upon during memory reclaim and, with
cmwq, there's no reason to use a dedicated workqueue. Drop
ocfs2_quota_wq and use system_wq instead. dqi_sync_work is already
sync canceled on quota disable and no further synchronization is
necessary.
This change makes ocfs2_quota_setup/shutdown() noops. Both functions
removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Convert create_workqueue() to alloc_workqueue(). This is an identity
conversion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
This one isn't really an uninit variable, but for pretty
obscure reasons. Let's make it clearly correct.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The cERROR message in checkSMB when the calculated length doesn't match
the RFC1001 length is incorrect in many cases. It always says that the
RFC1001 length is bigger than the SMB, even when it's actually the
reverse.
Fix the error message to say the reverse of what it does now when the
SMB length goes beyond the end of the received data. Also, clarify the
error message when the RFC length is too big. Finally, clarify the
comments to show that the 512 byte limit on extra data at the end of
the packet is arbitrary.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
btrfs_sync_log returns -EAGAIN when we need full transaction commits
instead of small log commits, but sometimes we were dropping the return
value.
In practice, we check for this a few different ways, but this is still a
bug that can leave off full log commits when we really need them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Xfstests 224 will just sit there and spin for ever until eventually we give up
flushing delalloc and exit. On my box this took several hours. I could not
interrupt this process either, even though we use INTERRUPTIBLE. So do 2 things
1) Keep us from looping over and over again without reclaiming anything
2) If we get interrupted exit the loop
I tested this and the test now exits in a reasonable amount of time, and can be
interrupted with ctrl+c. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Missed one change as per earlier suggestion.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
New compiler warnings that I noticed when building a patchset based
on recent Fedora kernel:
fs/cifs/cifssmb.c: In function 'CIFSSMBSetFileSize':
fs/cifs/cifssmb.c:4813:8: warning: variable 'data_offset' set but not used
[-Wunused-but-set-variable]
fs/cifs/file.c: In function 'cifs_open':
fs/cifs/file.c:349:24: warning: variable 'pCifsInode' set but not used
[-Wunused-but-set-variable]
fs/cifs/file.c: In function 'cifs_partialpagewrite':
fs/cifs/file.c:1149:23: warning: variable 'cifs_sb' set but not used
[-Wunused-but-set-variable]
fs/cifs/file.c: In function 'cifs_iovec_write':
fs/cifs/file.c:1740:9: warning: passing argument 6 of 'CIFSSMBWrite2' from
incompatible pointer type [enabled by default]
fs/cifs/cifsproto.h:337:12: note: expected 'unsigned int *' but argument is
of type 'size_t *'
fs/cifs/readdir.c: In function 'cifs_readdir':
fs/cifs/readdir.c:767:23: warning: variable 'cifs_sb' set but not used
[-Wunused-but-set-variable]
fs/cifs/cifs_dfs_ref.c: In function 'cifs_dfs_d_automount':
fs/cifs/cifs_dfs_ref.c:342:2: warning: 'rc' may be used uninitialized in
this function [-Wuninitialized]
fs/cifs/cifs_dfs_ref.c:278:6: note: 'rc' was declared here
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Recently CIFS was changed to use the kernel crypto API for MD4 hashes,
but the Kconfig dependencies were not changed to reflect this.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reported-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, we allow the pending_mid_q to grow without bound with
SIGKILL'ed processes. This could eventually be a DoS'able problem. An
unprivileged user could a process that does a long-running call and then
SIGKILL it.
If he can also intercept the NT_CANCEL calls or the replies from the
server, then the pending_mid_q could grow very large, possibly even to
2^16 entries which might leave GetNextMid in an infinite loop. Fix this
by imposing a hard limit of 32k calls per server. If we cross that
limit, set the tcpStatus to CifsNeedReconnect to force cifsd to
eventually reconnect the socket and clean out the pending_mid_q.
While we're at it, clean up the function a bit and eliminate an
unnecessary NULL pointer check.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If we kill the process while it's sending on a socket then the
kernel_sendmsg will return -EINTR. This is normal. No need to spam the
ring buffer with this info.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
...just cleanup. There should be no behavior change.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Use the new send_nt_cancel function to send an NT_CANCEL when the
process is delivered a fatal signal. This is a "best effort" enterprise
however, so don't bother to check the return code. There's nothing we
can reasonably do if it fails anyway.
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, when a request is cancelled via signal, we delete the mid
immediately. If the request was already transmitted however, the client
is still likely to receive a response. When it does, it won't recognize
it however and will pop a printk.
It's also a little dangerous to just delete the mid entry like this. We
may end up reusing that mid. If we do then we could potentially get the
response from the first request confused with the later one.
Prevent the reuse of mids by marking them as cancelled and keeping them
on the pending_mid_q list. If the reply comes in, we'll delete it from
the list then. If it never comes, then we'll delete it at reconnect
or when cifsd comes down.
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
fs/cifs/link.c: In function ‘symlink_hash’:
fs/cifs/link.c:58:3: warning: ‘rc’ may be used uninitialized in this
function [-Wuninitialized]
fs/cifs/smbencrypt.c: In function ‘mdfour’:
fs/cifs/smbencrypt.c:61:3: warning: ‘rc’ may be used uninitialized in this
function [-Wuninitialized]
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In ntfs_mft_record_alloc() when mapping the new extent mft record with
map_extent_mft_record() we overwrite @m with the return value and on
error, we then try to use the old @m but that is no longer there as @m
now contains an error code instead so we crash when dereferencing the
error code as if it were a pointer.
The simple fix is to use a temporary variable to store the return value
thus preserving the original @m for later use. This is a backport from
the commercial Tuxera-NTFS driver and is well tested...
Thanks go to Julia Lawall for pointing this out (whilst I had fixed it
in the commercial driver I had failed to fix it in the Linux kernel).
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of doing a BUG_ON(1) in prepare_pages if grab_cache_page() fails, just
loop through the pages we've already grabbed and unlock and release them, then
return -ENOMEM like we should. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Got a report of a box panicing because we got a NULL eb in read_extent_buffer.
His fs was borked and btrfs_search_path returned EIO, but we don't check for
errors so the box paniced. Yes I know this will just make something higher up
the stack panic, but that's a problem for future Josef. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We call use_block_rsv right before we make an allocation in order to make sure
we have enough space. Now normally people have called btrfs_start_transaction()
with the appropriate amount of space that we need, so we just use some of that
pre-reserved space and move along happily. The problem is where people use
btrfs_join_transaction(), which doesn't actually reserve any space. So we try
and reserve space here, but we cannot flush delalloc, so this forces us to
return -ENOSPC when in reality we have plenty of space. The most common symptom
is seeing a bunch of "couldn't dirty inode" messages in syslog. With
xfstests 224 we end up falling back to start_transaction and then doing all the
flush delalloc stuff which causes to hang for a very long time.
So instead steal from the global reserve, which is what this is meant for
anyway. With this patch and the other 2 I have sent xfstests 224 now passes
successfully. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we do btrfs_block_rsv_release, if global_block_rsv is not full we will
release all the extra bytes to global_block_rsv, even if it's only a little
short of the amount of space that we need to reserve. This causes us to starve
ourselves of reservable space during the transaction which will force us to
shrink delalloc bytes and commit the transaction more often than we should. So
instead just add the amount of bytes we need to add to the global reserve so
reserved == size, and then add the rest back into the space_info for general
use. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When running xfstests 224 I kept getting ENOSPC when trying to remove the files,
and this is because we were returning ret from check_path_shared while it was
uninitalized, which isn't right. Fix this to return 0 properly, and now
xfstests 224 doesn't freak out when it tries to clean itself up. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_start_ioctl_transaction() returns ERR_PTR(), not NULL.
So, it is necessary to use IS_ERR() to check the return value.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The error check of btrfs_join_transaction()/btrfs_join_transaction_nolock()
is added, and the mistake of the error check in several places is
corrected.
For more stable Btrfs, I think that we should reduce BUG_ON().
But, I think that long time is necessary for this.
So, I propose this patch as a short-term solution.
With this patch:
- To more stable Btrfs, the part that should be corrected is clarified.
- The panic isn't done by the NULL pointer reference etc. (even if
BUG_ON() is increased temporarily)
- The error code is returned in the place where the error can be easily
returned.
As a long-term plan:
- BUG_ON() is reduced by using the forced-readonly framework, etc.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After the conditional that precedes the following code, inode may be an
ERR_PTR value. This can eg result from a memory allocation failure via the
call to btrfs_iget, and thus does not imply that root is different than
sub_root. Thus, an IS_ERR check is added to ensure that there is no
dereference of inode in this case.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@r@
identifier f;
@@
f(...) { ... return ERR_PTR(...); }
@@
identifier r.f, fld;
expression x;
statement S1,S2;
@@
x = f(...)
... when != IS_ERR(x)
(
if (IS_ERR(x) ||...) S1 else S2
|
*x->fld
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To make btrfs more stable, add several missing necessary memory allocation
checks, and when no memory, return proper errno.
We've checked that some of those -ENOMEM errors will be returned to
userspace, and some will be catched by BUG_ON() in the upper callers,
and none will be ignored silently.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_submit_compressed_read() is lack of memory allocation checks and
corresponding error route.
After this fix, if it comes to "no memory" case, errno will be returned
to userland step by step, and tell users this operation cannot go on.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On recent 2.6.38-rc kernels, connectathon basic test 6 fails on
NFSv4 mounts of OpenSolaris with something like:
> ./test6: readdir
> ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.12' dir entry, pass 0
> ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.82' dir entry, pass 0
> ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.164' dir entry, pass 0
> ./test6: (/mnt/klimt/matisse.test) Test failed with 3 errors
> basic tests failed
> Tests failed, leaving /mnt/klimt mounted
> [cel@matisse cthon04]$
I narrowed the problem down to nfs4_decode_dirent() reporting that the
decode buffer had overflowed while decoding the entries for those
missing files.
verify_attr_len() assumes both it's pointer arguments reside on the
same page. When these arguments point to locations on two different
pages, verify_attr_len() can report false errors. This can happen now
that a large NFSv4 readdir result can span pages.
We have reasonably good checking in nfs4_decode_dirent() anyway, so
it should be safe to simply remove the extra checking.
At a guess, this was introduced by commit 6650239a, "NFS: Don't use
vm_map_ram() in readdir".
Cc: stable@kernel.org [2.6.37]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the decoding of NFSv4 directory entries slightly more efficient
by:
1. Avoiding unnecessary byte swapping when checking XDR booleans,
and
2. Not bumping "p" when its value will be immediately replaced by
xdr_inline_decode()
This commit makes nfs4_decode_dirent() consistent with similar logic
in the other two decode_dirent() functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There is no reason to be freeing the delegation cred in the rcu callback,
and doing so is resulting in a lockdep complaint that rpc_credcache_lock
is being called from both softirq and non-softirq contexts.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
When filling in the middle of a previous delayed allocation in
xfs_bmap_add_extent_delay_real, set br_startblock of the new delay
extent to the right to nullstartblock instead of 0 before inserting
the extent into the ifork (xfs_iext_insert), rather than setting
br_startblock afterward.
Adding the extent into the ifork with br_startblock=0 can lead to
the extent being copied into the btree by xfs_bmap_extent_to_btree
if we happen to convert from extents format to btree format before
updating br_startblock with the correct value. The unexpected
addition of this delay extent to the btree can cause subsequent
XFS_WANT_CORRUPTED_GOTO filesystem shutdown in several
xfs_bmap_add_extent_delay_real cases where we are converting a delay
extent to real and unexpectedly find an extent already inserted.
For example:
911 case BMAP_LEFT_FILLING:
912 /*
913 * Filling in the first part of a previous delayed allocation.
914 * The left neighbor is not contiguous.
915 */
916 trace_xfs_bmap_pre_update(ip, idx, state, _THIS_IP_);
917 xfs_bmbt_set_startoff(ep, new_endoff);
918 temp = PREV.br_blockcount - new->br_blockcount;
919 xfs_bmbt_set_blockcount(ep, temp);
920 xfs_iext_insert(ip, idx, 1, new, state);
921 ip->i_df.if_lastex = idx;
922 ip->i_d.di_nextents++;
923 if (cur == NULL)
924 rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
925 else {
926 rval = XFS_ILOG_CORE;
927 if ((error = xfs_bmbt_lookup_eq(cur, new->br_startoff,
928 new->br_startblock, new->br_blockcount,
929 &i)))
930 goto done;
931 XFS_WANT_CORRUPTED_GOTO(i == 0, done);
With the bogus extent in the btree we shutdown the filesystem at
931. The conversion from extents to btree format happens when the
number of extents in the inode increases above ip->i_df.if_ext_max.
xfs_bmap_extent_to_btree copies extents from the ifork into the
btree, ignoring all delalloc extents which are denoted by
br_startblock having some value of nullstartblock.
SGI-PV: 1013221
Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Commit 368e136 ("xfs: remove duplicate code from dquot reclaim") fails
to unlock the dquot freelist when the number of loop restarts is
exceeded in xfs_qm_dqreclaim_one(). This causes hangs in memory
reclaim.
Rework the loop control logic into an unwind stack that all the
different cases jump into. This means there is only one set of code
that processes the loop exit criteria, and simplifies the unlocking
of all the items from different points in the loop. It also fixes a
double increment of the restart counter from the qi_dqlist_lock
case.
Reported-by: Malcolm Scott <lkml@malc.org.uk>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Failure to commit a transaction into the CIL is not handled
correctly. This currently can only happen when racing with a
shutdown and requires an explicit shutdown check, so it rare and can
be avoided. Remove the shutdown check and make the CIL commit a void
function to indicate it will always succeed, thereby removing the
incorrectly handled failure case.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
The extent size hint can be set to larger than an AG. This means
that the alignment process can push the range to be allocated
outside the bounds of the AG, resulting in assert failures or
corrupted bmbt records. Similarly, if the extsize is larger than the
maximum extent size supported, the alignment process will produce
extents that are too large to fit into the bmbt records, resulting
in a different type of assert/corruption failure.
Fix this by limiting extsize at the time іt is set firstly to be
less than MAXEXTLEN, then to be a maximum of half the size of the
AGs in the filesystem for non-realtime inodes. Realtime inodes do
not allocate out of AGs, so don't have to be restricted by the size
of AGs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
When doing delayed allocation, if the allocation size is for a
maximally sized extent, extent size alignment can push it over this
limit. This results in an assert failure in xfs_bmbt_set_allf() as
the extent length is too large to find in the extent record.
Fix this by ensuring that we allow for space that extent size
alignment requires (up to 2 * (extsize -1) blocks as we have to
handle both head and tail alignment) when limiting the maximum size
of the extent.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Delayed allocation extents can be larger than AGs, so when trying to
convert a large range we may scan every AG inside
xfs_bmap_alloc_nullfb() trying to find an AG with a size larger than
an AG. We should stop when we find the first AG with a maximum
possible allocation size. This causes excessive CPU usage when there
are lots of AGs.
The same problem occurs when doing preallocation of a range larger
than an AG.
Fix the problem by limiting real allocation lengths to the maximum
that an AG can support. This means if we have empty AGs, we'll stop
the search at the first of them. If there are no empty AGs, we'll
still scan them all, but that is a different problem....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
rounddown_power_of_2() returns an undefined result when passed a
value of zero. The specualtive delayed allocation code is doing this
when the inode is zero length. Hence occasionally the preallocation
is much, much larger than is necessary (e.g. 8GB for a 270 _byte_
file). Ensure we don't even pass a zero value to this function so
the result of preallocation is always the desired size.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
After test 139, kmemleak shows:
unreferenced object 0xffff880078b405d8 (size 400):
comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s)
hex dump (first 32 bytes):
60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff `..y....`..y....
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60
[<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0
[<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0
[<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50
[<ffffffff8147cd6b>] xfs_efi_init+0x4b/0xb0
[<ffffffff814a4ee8>] xfs_trans_get_efi+0x58/0x90
[<ffffffff81455fab>] xfs_bmap_finish+0x8b/0x1d0
[<ffffffff814851b4>] xfs_itruncate_finish+0x2c4/0x5d0
[<ffffffff814a970f>] xfs_setattr+0x8df/0xa70
[<ffffffff814b5c7b>] xfs_vn_setattr+0x1b/0x20
[<ffffffff8117dc00>] notify_change+0x170/0x2e0
[<ffffffff81163bf6>] do_truncate+0x66/0xa0
[<ffffffff81163d0b>] sys_ftruncate+0xdb/0xe0
[<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
The cause of the leak is that the "remove" parameter of IOP_UNPIN()
is never set when a CIL push is aborted. This means that the EFI
item is never freed if it was in the push being cancelled. The
problem is specific to delayed logging, but has uncovered a couple
of problems with the handling of IOP_UNPIN(remove).
Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN()
in the CIL commit failure path or the iclog write failure path
because for delayed loging we have no transaction context. Hence we
must only call xfs_trans_del_item() if the log item being unpinned
has an active log item descriptor.
Secondly, xfs_trans_uncommit() does not handle log item descriptor
freeing during the traversal of log items on a transaction. It can
reference a freed log item descriptor when unpinning an EFI item.
Hence it needs to use a safe list traversal method to allow items to
be removed from the transaction during IOP_UNPIN().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: avoid picking MDS that is not active
ceph: avoid immediate cap check after import
ceph: fix flushing of caps vs cap import
ceph: fix erroneous cap flush to non-auth mds
ceph: fix cap_wanted_delay_{min,max} mount option initialization
ceph: fix xattr rbtree search
ceph: fix getattr on directory when using norbytes
Replaced md4 hashing function local to cifs module with kernel crypto APIs.
As a result, md4 hashing function and its supporting functions in
file md4.c are not needed anymore.
Cleaned up function declarations, removed forward function declarations,
and removed a header file that is being deleted from being included.
Verified that sec=ntlm/i, sec=ntlmv2/i, and sec=ntlmssp/i work correctly.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Suppose:
- the source extent is: [0, 100]
- the src offset is 10
- the clone length is 90
- the dest offset is 0
This statement:
new_key.offset = key.offset + destoff - off
will produce such an extent for the dest file:
[ino, BTRFS_EXTENT_DATA_KEY, -10]
, which is obviously wrong.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
fixup, which is allocated when starting page write to fix up the
extent without ORDERED bit set, should be freed after this work
is done.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We must save and free the original kstrdup()'ed pointer
because strsep() modifies its first argument.
Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We missed a memory deallocation in commit 450ba0ea.
If an existing super block is found at mount and there is no
error condition then the pre-allocated tree_root and fs_info
are no not used and are not freeded.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
After returing extents from a cluster to the block group, some
extents in the block group may be mergeable.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
When adding a new extent, we'll firstly see if we can merge
this extent to the left or/and right extent. Extract this as
a helper try_merge_free_space().
As a side effect, we fix a small bug that if the new extent
has non-bitmap left entry but is unmergeble, we'll directly
link the extent without trying to drop it into bitmap.
This also prepares for the next patch.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
When allocating extent entry from a cluster, we should update
the free_space and free_extents fields of the block group.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
If there's no more free space in a bitmap, we should free it.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Remove some duplicated code.
This prepares for the next patch.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
If a block group is smaller than 1GB, the extent entry threadhold
calculation will always set the threshold to 0.
So as free space gets fragmented, btrfs will switch to use bitmap
to manage free space, but then will never switch back to extents
due to this bug.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
aio_wq isn't used during memory reclaim. Convert to alloc_workqueue()
without WQ_MEM_RECLAIM. It's possible to use system_wq but given that
the number of work items is determined from userland and the work item
may block, enforcing strict concurrency limit would be a good idea.
Also, move fput_work to system_wq so that aio_wq is used soley to
throttle the max concurrency of aio work items and fput_work doesn't
interact with other work items.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: linux-aio@kvack.org
As stated in section 2.4 of RFC 5661, subsequent instances of the client need
to present the same co_ownerid. Concatinate the client's IP dot address,
host name, and the rpc_auth pseudoflavor to form the co_ownerid.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The -rt patches change the console_semaphore to console_mutex. As a
result, a quite large chunk of the patches changes all
acquire/release_console_sem() to acquire/release_console_mutex()
This commit makes things use more neutral function names which dont make
implications about the underlying lock.
The only real change is the return value of console_trylock which is
inverted from try_acquire_console_sem()
This patch also paves the way to switching console_sem from a semaphore to
a mutex.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make console_trylock return 1 on success, per Geert]
Signed-off-by: Torben Hohn <torbenh@gmx.de>
Cc: Thomas Gleixner <tglx@tglx.de>
Cc: Greg KH <gregkh@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix potential use of uninitialised variable caused by recent
decompressor code optimisations.
In zlib_uncompress (zlib_wrapper.c) we have
int zlib_err, zlib_init = 0;
...
do {
...
if (avail == 0) {
offset = 0;
put_bh(bh[k++]);
continue;
}
...
zlib_err = zlib_inflate(stream, Z_SYNC_FLUSH);
...
} while (zlib_err == Z_OK);
If continue is executed (avail == 0) then the while condition will be
evaluated testing zlib_err, which is uninitialised first time around the
loop.
Fix this by getting rid of the 'if (avail == 0)' condition test, this
edge condition should not be being handled in the decompressor code, and
instead handle it generically in the caller code.
Similarly for xz_wrapper.c.
Incidentally, on most architectures (bar Mips and Parisc), no
uninitialised variable warning is generated by gcc, this is because the
while condition test on continue is optimised out and not performed
(when executing continue zlib_err has not been changed since entering
the loop, and logically if the while condition was true previously, then
it's still true).
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Reported-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the call to nfs_wcc_update_inode() results in an attribute update, we
need to ensure that the inode's attr_gencount gets bumped too, otherwise
we are not protected against races with other GETATTR calls.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
What we really want to know is the ref count.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Always assign the cb_process_state nfs_client pointer so a processing error
in cb_sequence after the nfs_client is found and referenced returns
a non-NULL cb_process_state nfs_client and the matching nfs_put_client in
nfs4_callback_compound dereferences the client.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The information required to find the nfs_client cooresponding to the incoming
back channel request is contained in the NFS layer. Perform minimal checking
in the RPC layer pg_authenticate method, and push more detailed checking into
the NFS layer where the nfs_client can be found.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Nick Bowler <nbowler@elliptictech.com> reports:
> We were just having some NFS server troubles, and my client machine
> running 2.6.38-rc1+ (specifically, commit 2b1caf6ed7) crashed
> hard (syslog output appended to this mail).
>
> I'm not sure what the exact timeline was or how to reproduce this,
> but the server was rebooted during all this. Since I've never seen
> this happen before, it is possibly a regression from previous kernel
> releases. However, I recently updated my nfs-utils (on the client) to
> version 1.2.3, so that might be related as well.
[ BUG output redacted ]
When done searching, the for_each_host loop in next_host_state() falls
through and returns the final host on the host chain without bumping
it's reference count.
Since the host's ref count is only one at that point, releasing the
host in nlm_host_rebooted() attempts to destroy the host prematurely,
and therefore hits a BUG().
Likely, the original intent of the for_each_host behavior in
next_host_state() was to handle the case when the host chain is empty.
Searching the chain and finding no suitable host to return needs to be
handled as well.
Defensively restructure next_host_state() always to return NULL when
the loop falls through.
Introduced by commit b10e30f6 "lockd: reorganize nlm_host_rebooted".
Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfsacl_encode() allocates memory in certain cases. This of course
is not guaranteed to work.
Since commit 9f06c719 "SUNRPC: New xdr_streams XDR encoder API", the
kernel's XDR encoders can't return a result indicating possibly a
failure, so a memory allocation failure in nfsacl_encode() has become
fatal (ie, the XDR code Oopses) in some cases.
However, the allocated memory is a tiny fixed amount, on the order
of 40-50 bytes. We can easily use a stack-allocated buffer for
this, with only a wee bit of nose-holding.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up.
The nfsacl_encode() and nfsacl_decode() functions return negative
errno values, and each call site verifies that the returned value
is not negative. Change the synopsis of both of these functions
to reflect this usage.
Document the synopsis and return values.
Reported-by: Trond Myklebust <trond.myklebust@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Nick Piggin reports:
> I'm getting use after frees in aio code in NFS
>
> [ 2703.396766] Call Trace:
> [ 2703.396858] [<ffffffff8100b057>] ? native_sched_clock+0x27/0x80
> [ 2703.396959] [<ffffffff8108509e>] ? put_lock_stats+0xe/0x40
> [ 2703.397058] [<ffffffff81088348>] ? lock_release_holdtime+0xa8/0x140
> [ 2703.397159] [<ffffffff8108a2a5>] lock_acquire+0x95/0x1b0
> [ 2703.397260] [<ffffffff811627db>] ? aio_put_req+0x2b/0x60
> [ 2703.397361] [<ffffffff81039701>] ? get_parent_ip+0x11/0x50
> [ 2703.397464] [<ffffffff81612a31>] _raw_spin_lock_irq+0x41/0x80
> [ 2703.397564] [<ffffffff811627db>] ? aio_put_req+0x2b/0x60
> [ 2703.397662] [<ffffffff811627db>] aio_put_req+0x2b/0x60
> [ 2703.397761] [<ffffffff811647fe>] do_io_submit+0x2be/0x7c0
> [ 2703.397895] [<ffffffff81164d0b>] sys_io_submit+0xb/0x10
> [ 2703.397995] [<ffffffff8100307b>] system_call_fastpath+0x16/0x1b
>
> Adding some tracing, it is due to nfs completing the request then
> returning something other than -EIOCBQUEUED, so aio.c
> also completes the request.
To address this, prevent the NFS direct I/O engine from completing
async iocbs when the forward path returns an error without starting
any I/O.
This fix appears to survive ^C during both "xfstest no. 208" and "fsx
-Z."
It's likely this bug has existed for a very long while, as we are seeing
very similar symptoms in OEL 5. Copying stable.
Cc: Stable <stable@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
On Mon, 17 Jan 2011, Mi Jinlong wrote:
>
>
> Jesper Juhl:
> > strrchr() can return NULL if nothing is found. If this happens we'll
> > dereference a NULL pointer in
> > fs/nfs/nfs4filelayoutdev.c::decode_and_add_ds().
> >
> > I tried to find some other code that guarantees that this can never
> > happen but I was unsuccessful. So, unless someone else can point to some
> > code that ensures this can never be a problem, I believe this patch is
> > needed.
> >
> > While I was changing this code I also noticed that all the dprintk()
> > statements, except one, start with "%s:". The one missing the ":" I added
> > it to.
>
> Maybe another one also should be changed at decode_and_add_ds() at line 243:
>
> 243 printk("%s Decoded address and port %s\n", __func__, buf);
>
Missed that one. Thanks.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use for switching on strict cache mode. In this mode the
client reads from the cache all the time it has Oplock Level II,
otherwise - read from the server. As for write - the client stores
a data in the cache in Exclusive Oplock case, otherwise - write
directly to the server.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If we don't have Exclusive oplock we write a data to the server.
Also set invalidate_mapping flag on the inode if we wrote something
to the server. Add cifs_iovec_write to let the client write iovec
buffers through CIFSSMBWrite2.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Replace remaining use of md5 hash functions local to cifs module
with kernel crypto APIs.
Remove header and source file containing those local functions.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Ignore replication or auth frag data if it indicates an MDS that is not
active. This can happen if the MDS shuts down and the client has stale
data about the namespace distribution across the MDS cluster. If that's
the case, fall back to directing the request based on the auth cap (which
should always be accurate).
Signed-off-by: Sage Weil <sage@newdream.net>
WQ_RESCUER is now an internal flag and should only be used in the
workqueue implementation proper. Use WQ_MEM_RECLAIM instead.
This doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
This patch fixes suboptimal UBIFS 'sync_fs()' implementation which causes
flash I/O even if the file-system is synchronized. E.g., a 'printk()'
in the MTD erasure function (e.g., 'nand_erase_nand()') can show that
for every 'sync' shell command UBIFS erases at least one eraseblock.
So '$ while true; do sync; done' will cause huge amount of flash I/O.
The reason for this is that UBIFS commits in 'sync_fs()', and starts the
commit even if there is nothing to commit, e.g., it anyway changes the
log. This patch adds a check in the 'do_commit()' UBIFS functions which
prevents the commit if there is nothing to commit.
Reported-by: Hans J. Koch <hjk@linutronix.de>
Tested-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
Make CIFS mount work in a container.
CIFS: Remove pointless variable assignment in cifs_dfs_do_automount()
Teach cifs about network namespaces, so mounting uses adresses/routing
visible from the container rather than from init context.
A container is a chroot on steroids that changes more than just the root
filesystem the new processes see. One thing containers can isolate is
"network namespaces", meaning each container can have its own set of
ethernet interfaces, each with its own own IP address and routing to the
outside world. And if you open a socket in _userspace_ from processes
within such a container, this works fine.
But sockets opened from within the kernel still use a single global
networking context in a lot of places, meaning the new socket's address
and routing are correct for PID 1 on the host, but are _not_ what
userspace processes in the container get to use.
So when you mount a network filesystem from within in a container, the
mount code in the CIFS driver uses the host's networking context and not
the container's networking context, so it gets the wrong address, uses
the wrong routing, and may even try to go out an interface that the
container can't even access... Bad stuff.
This patch copies the mount process's network context into the CIFS
structure that stores the rest of the server information for that mount
point, and changes the socket open code to use the saved network context
instead of the global network context. I.E. "when you attempt to use
these addresses, do so relative to THIS set of network interfaces and
routing rules, not the old global context from back before we supported
containers".
The big long HOWTO sets up a test environment on the assumption you've
never used ocntainers before. It basically says:
1) configure and build a new kernel that has container support
2) build a new root filesystem that includes the userspace container
control package (LXC)
3) package/run them under KVM (so you don't have to mess up your host
system in order to play with containers).
4) set up some containers under the KVM system
5) set up contradictory routing in the KVM system and the container so
that the host and the container see different things for the same address
6) try to mount a CIFS share from both contexts so you can both force it
to work and force it to fail.
For a long drawn out test reproduction sequence, see:
http://landley.livejournal.com/47024.htmlhttp://landley.livejournal.com/47205.htmlhttp://landley.livejournal.com/47476.html
Signed-off-by: Rob Landley <rlandley@parallels.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In fs/cifs/cifs_dfs_ref.c::cifs_dfs_do_automount() we have this code:
...
mnt = ERR_PTR(-EINVAL);
if (IS_ERR(tlink)) {
mnt = ERR_CAST(tlink);
goto free_full_path;
}
ses = tlink_tcon(tlink)->ses;
rc = get_dfs_path(xid, ses, full_path + 1, cifs_sb->local_nls,
&num_referrals, &referrals,
cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR);
cifs_put_tlink(tlink);
mnt = ERR_PTR(-ENOENT);
...
The assignment of 'mnt = ERR_PTR(-EINVAL);' is completely pointless. If we
take the 'if (IS_ERR(tlink))' branch we'll set 'mnt' again and we'll also
do so if we do not take the branch. There is no way we'll ever use 'mnt'
with the assigned 'ERR_PTR(-EINVAL)' value, so we may as well just remove
the pointless assignment.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Add calls to path-based security hooks into CacheFiles as, unlike inode-based
security, these aren't implicit in the vfs_mkdir() and similar calls.
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Fix new fs/dcache.c kernel-doc warnings:
Warning(fs/dcache.c:184): No description found for parameter 'dentry'
Warning(fs/dcache.c:296): No description found for parameter 'parent'
Warning(fs/dcache.c:1985): No description found for parameter 'dparent'
Warning(fs/dcache.c:1985): Excess function parameter 'parent' description in 'd_validate'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: fix up CIFSSMBEcho for unaligned access
cifs: fix unaligned accesses in cifsConvertToUCS
cifs: clean up unaligned accesses in cifs_unicode.c
cifs: fix unaligned access in check2ndT2 and coalesce_t2
cifs: clean up unaligned accesses in validate_t2
cifs: use get/put_unaligned functions to access ByteCount
cifs: move time field in cifsInodeInfo
cifs: TCP_Server_Info diet
CIFS: Implement cifs_strict_readv (try #4)
CIFS: Implement cifs_file_strict_mmap (try #2)
CIFS: Implement cifs_strict_fsync
CIFS: Make cifsFileInfo_put work with strict cache mode
We can allow a few more cases to use RCU path walking than
originally allowed. It should be possible to also enable
RCU path walking when the glock is already cached. Thats
a bit more complicated though, so left for a future patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Nick Piggin <npiggin@gmail.com>
This has a number of advantages:
- Reduces contention on the hash table lock
- Makes the code smaller and simpler
- Should speed up glock dumps when under load
- Removes ref count changing in examine_bucket
- No longer need hash chain lock in glock_put() in common case
There are some further changes which this enables and which
we may do in the future. One is to look at using SLAB_RCU,
and another is to look at using a per-cpu counter for the
per-sb glock counter, since that is touched twice in the
lifetime of each glock (but only used at umount time).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Make sure that CIFSSMBEcho can handle unaligned fields. Also fix a minor
bug that causes this warning:
fs/cifs/cifssmb.c: In function 'CIFSSMBEcho':
fs/cifs/cifssmb.c:740: warning: large integer implicitly truncated to unsigned type
...WordCount is u8, not __le16, so no need to convert it.
This patch should apply cleanly on top of the rest of the patchset to
clean up unaligned access.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* akpm:
kernel/smp.c: consolidate writes in smp_call_function_interrupt()
kernel/smp.c: fix smp_call_function_many() SMP race
memcg: correctly order reading PCG_USED and pc->mem_cgroup
backlight: fix 88pm860x_bl macro collision
drivers/leds/ledtrig-gpio.c: make output match input, tighten input checking
MAINTAINERS: update Atmel AT91 entry
mm: fix truncate_setsize() comment
memcg: fix rmdir, force_empty with THP
memcg: fix LRU accounting with THP
memcg: fix USED bit handling at uncharge in THP
memcg: modify accounting function for supporting THP better
fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bio
mm: compaction: prevent division-by-zero during user-requested compaction
mm/vmscan.c: remove duplicate include of compaction.h
memblock: fix memblock_is_region_memory()
thp: keep highpte mapped until it is no longer needed
kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT
When using devices that support max_segments > BIO_MAX_PAGES (256), direct
IO tries to allocate a bio with more pages than allowed, which leads to an
oops in dio_bio_alloc(). Clamp the request to the supported maximum, and
change dio_bio_alloc() to reflect that bio_alloc() will always return a
bio when called with __GFP_WAIT and a valid number of vectors.
[akpm@linux-foundation.org: remove redundant BUG_ON()]
Signed-off-by: David Dillow <dillowda@ornl.gov>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
is used to configure any non-standard kernel with a much larger scope than
only small devices.
This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
references to the option throughout the kernel. A new CONFIG_EMBEDDED
option is added that automatically selects CONFIG_EXPERT when enabled and
can be used in the future to isolate options that should only be
considered for embedded systems (RISC architectures, SLOB, etc).
Calling the option "EXPERT" more accurately represents its intention: only
expert users who understand the impact of the configuration changes they
are making should enable it.
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: David Woodhouse <david.woodhouse@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Greg KH <gregkh@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Robin Holt <holt@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: mangle existing header for SMB_COM_NT_CANCEL
cifs: remove code for setting timeouts on requests
[CIFS] cifs: reconnect unresponsive servers
cifs: set up recurring workqueue job to do SMB echo requests
cifs: add ability to send an echo request
cifs: add cifs_call_async
cifs: allow for different handling of received response
cifs: clean up sync_mid_result
cifs: don't reconnect server when we don't get a response
cifs: wait indefinitely for responses
cifs: Use mask of ACEs for SID Everyone to calculate all three permissions user, group, and other
cifs: Fix regression during share-level security mounts (Repost)
[CIFS] Update cifs version number
cifs: move mid result processing into common function
cifs: move locked sections out of DeleteMidQEntry and AllocMidQEntry
cifs: clean up accesses to midCount
cifs: make wait_for_free_request take a TCP_Server_Info pointer
cifs: no need to mark smb_ses_list as cifs_demultiplex_thread is exiting
cifs: don't fail writepages on -EAGAIN errors
CIFS: Fix oplock break handling (try #2)
Commit e462c448fd ("pipe: use event aware wakeups") optimized the pipe
event wakeup calls to avoid wakeups if the events do not match the
requested set.
However, the optimization was buggy, in that it didn't actually use the
correct sets for the events: when we make room for more data to be
written, the pipe poll() routine will return both the POLLOUT _and_
POLLWRNORM bits. Similarly for read.
And most critically, when a pipe is released, that will potentially
result in POLLHUP|POLLERR (depending on whether it was the last reader
or writer), not just the regular POLLIN|POLLOUT.
This bug showed itself as a hung gnome-screensaver-dialog process, stuck
forever (or at least until it was poked by a signal or by being traced)
in a poll() system call.
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move cifsConvertToUCS to cifs_unicode.c where all of the other unicode
related functions live. Have it store mapped characters in 'temp' and
then use put_unaligned_le16 to copy it to the target buffer. Also fix
the comments to match kernel coding style.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Make sure we use get/put_unaligned routines when accessing wide
character strings.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
...and clean up function to reduce indentation.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It's possible that when we access the ByteCount that the alignment
will be off. Most CPUs deal with that transparently, but there's
usually some performance impact. Some CPUs raise an exception on
unaligned accesses.
Fix this by accessing the byte count using the get_unaligned and
put_unaligned inlined functions. While we're at it, fix the types
of some of the variables that end up getting returns from these
functions.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Remove fields that are completely unused, and rearrange struct
according to recommendations by "pahole".
Before:
/* size: 1112, cachelines: 18, members: 49 */
/* sum members: 1086, holes: 8, sum holes: 26 */
/* bit holes: 1, sum bit holes: 7 bits */
/* last cacheline: 24 bytes */
After:
/* size: 1072, cachelines: 17, members: 42 */
/* sum members: 1065, holes: 3, sum holes: 7 */
/* last cacheline: 48 bytes */
...savings of 40 bytes per struct on x86_64. 21 bytes by field removal,
and 19 by reorganizing to eliminate holes.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Read from the cache if we have at least Level II oplock - otherwise
read from the server. Add cifs_user_readv to let the client read into
iovec buffers.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Invalidate inode mapping if we don't have at least Level II oplock.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Invalidate inode mapping if we don't have at least Level II oplock in
cifs_strict_fsync. Also remove filemap_write_and_wait call from cifs_fsync
because it is previously called from vfs_fsync_range. Add file operations'
structures for strict cache mode.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
On strict cache mode when we close the last file handle of the inode we
should set invalid_mapping flag on this inode to prevent data coherency
problem when we open it again but it has been modified on the server.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The NT_CANCEL command looks just like the original command, except for a
few small differences. The send_nt_cancel function however currently takes
a tcon, which we don't have in SendReceive and SendReceive2.
Instead of "respinning" the entire header for an NT_CANCEL, just mangle
the existing header by replacing just the fields we need. This means we
don't need a tcon and allows us to call it from other places.
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Since we don't time out individual requests anymore, remove the code
that we used to use for setting timeouts on different requests.
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If the server isn't responding to echoes, we don't want to leave tasks
hung waiting for it to reply. At that point, we'll want to reconnect
so that soft mounts can return an error to userspace quickly.
If the client hasn't received a reply after a specified number of echo
intervals, assume that the transport is down and attempt to reconnect
the socket.
The number of echo_intervals to wait before attempting to reconnect is
tunable via a module parameter. Setting it to 0, means that the client
will never attempt to reconnect. The default is 5.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Add a function that will send a request, and set up the mid for an
async reply.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In order to incorporate async requests, we need to allow for a more
general way to do things on receive, rather than just waking up a
process.
Turn the task pointer in the mid_q_entry into a callback function and a
generic data pointer. When a response comes in, or the socket is
reconnected, cifsd can call the callback function in order to wake up
the process.
The default is to just wake up the current process which should mean no
change in behavior for existing code.
Also, clean up the locking in cifs_reconnect. There doesn't seem to be
any need to hold both the srv_mutex and GlobalMid_Lock when walking the
list of mids.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Make it use a switch statement based on the value of the midStatus. If
the resp_buf is set, then MID_RESPONSE_RECEIVED is too.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We only want to force a reconnect to the server under very limited and
specific circumstances. Now that we have processes waiting indefinitely
for responses, we shouldn't reach this point unless a reconnect is
already in process. Thus, there's no reason to re-mark the server for
reconnect here.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The client should not be timing out on individual SMB requests. Too much
of the state between client and server is tied to the state of the
socket. If we time out requests and issue spurious disconnects then that
comprimises data integrity.
Instead of doing this complicated dance where we try to decide how long
to wait for a response for particular requests, have the client instead
wait indefinitely for a response. Also, use a TASK_KILLABLE sleep here
so that fatal signals will break out of this waiting.
Later patches will add support for detecting dead peers and forcing
reconnects based on that.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If a DACL has entries for ACEs for SID Everyone and Authenticated Users,
factor in mask in respective entries during calculation of permissions
for all three, user, group, and other.
http://technet.microsoft.com/en-us/library/bb463216.aspx
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Cleanup of the allocated list entries should not call
put_nfs_open_context() on each entry, as the context will
always be NULL, causing an oops.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NTLM response length was changed to 16 bytes instead of 24 bytes
that are sent in Tree Connection Request during share-level security
share mounts. Revert it back to 24 bytes.
Reported-and-Tested-by: Grzegorz Ozanski <grzegorz.ozanski@intel.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In later patches, we're going to need to have finer-grained control
over the addition and removal of these structs from the pending_mid_q
and we'll need to be able to call the destructor while holding the
spinlock. Move the locked sections out of both routines and into
the callers. Fix up current callers of DeleteMidQEntry to call a new
routine that dequeues the entry and then destroys it.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It's an atomic_t and the code accesses the "counter" field in it directly
instead of using atomic_read(). It also is sometimes accessed under a
spinlock and sometimes not. Move it out of the spinlock since we don't need
belt-and-suspenders for something that's just informational.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The cifsSesInfo pointer is only used to get at the server.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The TCP_Server_Info is refcounted and every SMB session holds a
reference to it. Thus, smb_ses_list is always going to be empty when
cifsd is coming down. This is dead code.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If CIFSSMBWrite2 returns -EAGAIN, then the error should be considered
temporary. CIFS should retry the write instead of setting an error on
the mapping and returning.
For WB_SYNC_ALL, just retry the write immediately. In the WB_SYNC_NONE
case, call redirty_page_for_writeback on all of the pages that didn't
get written out and then move on.
Also, fix up the handling of a short write with a successful return
code. MS-CIFS says that 0 bytes_written means ENOSPC or EFBIG. It
doesn't mention what a short, but non-zero write means, so for now
treat it as we would an -EAGAIN return.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When we get oplock break notification we should set the appropriate
value of OplockLevel field in oplock break acknowledge according to
the oplock level held by the client in this time. As we only can have
level II oplock or no oplock in the case of oplock break, we should be
aware only about clientCanCacheRead field in cifsInodeInfo structure.
Also fix bug connected with wrong interpretation of OplockLevel field
during oplock break notification processing.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The NODELAY flag avoids the heuristics that delay cap (issued/wanted)
release. There's no reason for that after we import a cap, and it kills
whatever benefit we get from those delays.
Signed-off-by: Sage Weil <sage@newdream.net>
If we are mid-flush and a cap is migrated to another node, we need to
resend the cap flush message to the new MDS, and do so with the original
flush_seq to avoid leaking across a sync boundary. Previously we didn't
redo the flush (we only flushed newly dirty data), which would cause a
later sync to hang forever.
Signed-off-by: Sage Weil <sage@newdream.net>
The int flushing is global and not clear on each iteration of the loop,
which can cause a second flush of caps to any MDSs with ids greater than
the auth.
Signed-off-by: Sage Weil <sage@newdream.net>
Fix a bunch of
warning: ‘inline’ is not at beginning of declaration
messages when building a 'make allyesconfig' kernel with -Wextra.
These warnings are trivial to kill, yet rather annoying when building with
-Wextra.
The more we can cut down on pointless crap like this the better (IMHO).
A previous patch to do this for a 'allnoconfig' build has already been
merged. This just takes the cleanup a little further.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
In the (impossible, except if there is fs corruption) error path
in gfs2_lookup_by_inum() if the call to gfs2_inode_refresh()
fails, it was leaving the function by calling iput() rather
than iget_failed(). This would cause future lookups of the same
inode to block forever.
This patch fixes the problem by moving the call to gfs2_inode_refresh()
into gfs2_inode_lookup() where iget_failed() is part of the error path
already. Also this cleans up some unreachable code and makes
gfs2_set_iop() static.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When a file gets deleted on GFS2, if a node can't get an exclusive lock on the
file's iopen glock, it punts on actually freeing up the space, because another
node is using the file. When it does this, it needs to drop the iopen glock
from its cache so that the other node can get an exclusive lock on it. Now,
gfs2_delete_inode() sets GL_NOCACHE before dropping the shared lock on the
iopen glock in preparation for grabbing it in the exclusive state. Since the
node needs the glock in the exclusive state, dropping the shared lock from the
cache doesn't slow down the case where no other nodes are using the file.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The latter is called only when both ino and dentry are about to
be freed, so cleaning ->d_fsdata and ->dentry is pointless.
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
split init_ino into new_ino and clean_ino; the former is
what used to be init_ino(NULL, sbi), the latter is for cases
where we passed non-NULL ino. Lose unused arguments.
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's used only to pass the length of symlink body to
autofs4_get_inode() in autofs4_dir_symlink(). We can
bloody well set inode->i_size in autofs4_dir_symlink()
directly and be done with that.
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In all cases we'd set inf->mode to know value just before
passing it to autofs4_get_inode(). That kills the need
to store it in autofs_info and pass it to autofs_init_ino()
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Kill it. Mind you, it's been an obfuscated call of autofs4_init_ino()
ever since 2.3.99pre6-4...
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
gets rid of all ->free()/->u.symlink machinery in autofs; we simply
keep symlink bodies in inode->i_private and free them in ->evict_inode().
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
oz_mode isn't defined any more, use autofs4_oz_mode(sbi) instead.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is a ref count problem in fs/namei.c:do_lookup().
When walking in ref-walk mode, if follow_managed() returns a fail we
need to drop dentry and possibly vfsmount. Clean up properly,
as we do in the other caller of follow_managed().
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The initialization condition in fs/autofs4/expire.c:get_next_positive_dentry()
appears to be incorrect. If prev == NULL I believe that root should be
returned.
Further down, at the current dentry check for it being simple_positive()
it looks like the d_lock for dentry p should be dropped instead of dentry
ret, otherwise when p is assinged to ret we end up with no lock on p and
a lost lock on ret, which leads to a deadlock.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
Btrfs: forced readonly mounts on errors
btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
btrfs: check NULL or not
btrfs: Don't pass NULL ptr to func that may deref it.
btrfs: mount failure return value fix
btrfs: Mem leak in btrfs_get_acl()
btrfs: fix wrong free space information of btrfs
btrfs: make the chunk allocator utilize the devices better
btrfs: restructure find_free_dev_extent()
btrfs: fix wrong calculation of stripe size
btrfs: try to reclaim some space when chunk allocation fails
btrfs: fix wrong data space statistics
fs/btrfs: Fix build of ctree
Btrfs: fix off by one while setting block groups readonly
Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
Btrfs: Add readonly snapshots support
Btrfs: Refactor btrfs_ioctl_snap_create()
btrfs: Extract duplicate decompress code
...
This is a preparational patch which removes the 'c->always_chk_crc' which was
set during mounting and remounting to R/W mode and introduces 'c->mounting'
flag which is set when mounting. Now the 'c->always_chk_crc' flag is the
same as 'c->remounting_rw && c->mounting'.
This patch is a preparation for the next one which will need to know when we
are mounting and remounting to R/W mode, which is exactly what
'c->always_chk_crc' effectively is, but its name does not suite the
next patch. The other possibility would be to just re-name it, but then
we'd end up with less logical flags coverage.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This is a cosmetic patch which re-arranges variables in 'struct ubifs_info'
so that all boolean-like variables which are only changed during mounting or
re-mounting to R/W mode are places together. Then they are turned into
bit-fields, which makes the structure a little bit smaller.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
ecryptfs: remove unnecessary decrypt when extending a file
ecryptfs: Fix ecryptfs_printk() size_t warnings
fs/ecryptfs: Add printf format/argument verification and fix fallout
ecryptfs: fixed testing of file descriptor flags
ecryptfs: test lower_file pointer when lower_file_mutex is locked
ecryptfs: missing initialization of the superblock 'magic' field
ecryptfs: moved ECRYPTFS_SUPER_MAGIC definition to linux/magic.h
ecryptfs: fix truncation error in ecryptfs_read_update_atime
On platforms that call panic() inside their BUG() macro (m68k/sun3, and
all platforms that don't set HAVE_ARCH_BUG), compilation fails with:
| fs/xfs/support/debug.c: In function ‘xfs_cmn_err’:
| fs/xfs/support/debug.c:92: error: called object ‘panic’ is not a function
as the local variable "panic" conflicts with the "panic()" function.
Rename the local variable to resolve this.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch comes from "Forced readonly mounts on errors" ideas.
As we know, this is the first step in being more fault tolerant of disk
corruptions instead of just using BUG() statements.
The major content:
- add a framework for generating errors that should result in filesystems
going readonly.
- keep FS state in disk super block.
- make sure that all of resource will be freed and released at umount time.
- make sure that fter FS is forced readonly on error, there will be no more
disk change before FS is corrected. For this, we should stop write operation.
After this patch is applied, the conversion from BUG() to such a framework can
happen incrementally.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: add cruid= mount option
cifs: cFYI the entire error code in map_smb_to_linux_error
* git://git.infradead.org/mtd-2.6: (59 commits)
mtd: mtdpart: disallow reading OOB past the end of the partition
mtd: pxa3xx_nand: NULL dereference in pxa3xx_nand_probe
UBI: use mtd->writebufsize to set minimal I/O unit size
mtd: initialize writebufsize in the MTD object of a partition
mtd: onenand: add mtd->writebufsize initialization
mtd: nand: add mtd->writebufsize initialization
mtd: cfi: add writebufsize initialization
mtd: add writebufsize field to mtd_info struct
mtd: OneNAND: OMAP2/3: prevent regulator sleeping while OneNAND is in use
mtd: OneNAND: add enable / disable methods to onenand_chip
mtd: m25p80: Fix JEDEC ID for AT26DF321
mtd: txx9ndfmc: limit transfer bytes to 512 (ECC provides 6 bytes max)
mtd: cfi_cmdset_0002: add support for Samsung K8D3x16UxC NOR chips
mtd: cfi_cmdset_0002: add support for Samsung K8D6x16UxM NOR chips
mtd: nand: ams-delta: drop omap_read/write, use ioremap
mtd: m25p80: add debugging trace in sst_write
mtd: nand: ams-delta: select for built-in by default
mtd: OneNAND: lighten scary initial bad block messages
mtd: OneNAND: OMAP2/3: add support for command line partitioning
mtd: nand: rearrange ONFI revision checking, add ONFI 2.3
...
Fix up trivial conflict in drivers/mtd/Kconfig as per DavidW.
Removes an unecessary page decrypt from ecryptfs_begin_write when the
page is beyond the current file size. Previously, the call to
ecryptfs_decrypt_page would result in a read of 0 bytes, but still
attempt to decrypt an entire page. This patch detects that case and
merely zeros the page before marking it up-to-date.
Signed-off-by: Frank Swiderski <fes@chromium.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Commit cb55d21f6fa19d8c6c2680d90317ce88c1f57269 revealed a number of
missing 'z' length modifiers in calls to ecryptfs_printk() when
printing variables of type size_t. This patch fixes those compiler
warnings.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Add __attribute__((format... to __ecryptfs_printk
Make formats and arguments match.
Add casts to (unsigned long long) for %llu.
Signed-off-by: Joe Perches <joe@perches.com>
[tyhicks: 80 columns cleanup and fixed typo]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch prevents the lower_file pointer in the 'ecryptfs_inode_info'
structure to be checked when the mutex 'lower_file_mutex' is not locked.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch initializes the 'magic' field of ecryptfs filesystems to
ECRYPTFS_SUPER_MAGIC.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
[tyhicks: merge with 66cb76666d]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
The definition of ECRYPTFS_SUPER_MAGIC has been moved to the include
file 'linux/magic.h' to become available to other kernel subsystems.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This is similar to the bug found in direct-io not so long ago.
Fix up truncation (ssize_t->int). This only matters with >2G
reads/writes, which the kernel doesn't permit.
Signed-off-by: Edward Shishkin <edward.shishkin@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
The fi_extents_start field of struct fiemap_extent_info is a
user pointer but was not marked as __user. This makes sparse
emit following warnings:
CHECK fs/ioctl.c
fs/ioctl.c:114:26: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:114:26: expected void [noderef] <asn:1>*dst
fs/ioctl.c:114:26: got struct fiemap_extent *[assigned] dest
fs/ioctl.c:202:14: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:202:14: expected void const volatile [noderef] <asn:1>*<noident>
fs/ioctl.c:202:14: got struct fiemap_extent *[assigned] fi_extents_start
fs/ioctl.c:212:27: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:212:27: expected void [noderef] <asn:1>*dst
fs/ioctl.c:212:27: got char *<noident>
Also add 'ufiemap' variable to eliminate unnecessary casts.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This fixed a case that 'sparse' spotted where hpfs_setattr has an error return
that didn't go through it's path that unlocks.
This is against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
version 6313e3c217.
Build tested only, I don't have an hpfs file system to test.
Dave
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
f_flags and f_spare fields were not copied to userspace when
compat_sys_[f]statfs64 called.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The commit 7ed1ee6118 ("Take statfs variants to fs/statfs.c")
separates out statfs syscalls from fs/open.c. Thus the comment
should be changed also.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*@ret_pointer is initialized to @fast_pointer thus the assignment is
redundant.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Fix a kconfig unmet dependency warning.
- Remove the comment that identifies which filesystems use POSIX ACL
utility routines.
- Move the FS_POSIX_ACL symbol outside of the BLOCK symbol if/endif block
because its functions do not depend on BLOCK and some of the filesystems
that use it do not depend on BLOCK.
warning: (GENERIC_ACL && JFFS2_FS_POSIX_ACL && NFSD_V4 && NFS_ACL_SUPPORT && 9P_FS_POSIX_ACL) selects FS_POSIX_ACL which has unmet direct dependencies (BLOCK)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There's an unlikely() in fget_light() that assumes the file ref count
will be 1. Running the annotate branch profiler on a desktop that is
performing daily tasks (running firefox, evolution, xchat and is also part
of a distcc farm), it shows that the ref count is not 1 that often.
correct incorrect % Function File Line
------- --------- - -------- ---- ----
1035099358 6209599193 85 fget_light file_table.c 315
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes. On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions. Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.
This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of various home grown checks that might need updates for new
flags just check for any bit outside the mask of the features supported
by the filesystem. This makes the check future proof for any newly
added flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_add_mount() and mnt_clear_expiry() are not needed outside of
namespace.c anymore, now that namei has finish_automount() to
use.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
That gets rid of the kludge in finish_automount() - we need
to keep refcount on the vfsmount as-is until we evict it from
expiry list.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
mnt_longterm is there only on SMP
Reported-and-tested-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes the following kconfig error after changing
CONFIGFS_FS -> select SYSFS:
fs/sysfs/Kconfig:1:error: recursive dependency detected!
fs/sysfs/Kconfig:1: symbol SYSFS is selected by CONFIGFS_FS
fs/configfs/Kconfig:1: symbol CONFIGFS_FS is selected by OCFS2_FS
fs/ocfs2/Kconfig:1: symbol OCFS2_FS depends on SYSFS
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Bottomley <James.Bottomley@suse.de>
This patch fixes the following kconfig error after changing
CONFIGFS_FS -> select SYSFS:
fs/sysfs/Kconfig:1:error: recursive dependency detected!
fs/sysfs/Kconfig:1: symbol SYSFS is selected by CONFIGFS_FS
fs/configfs/Kconfig:1: symbol CONFIGFS_FS is selected by DLM
fs/dlm/Kconfig:1: symbol DLM depends on SYSFS
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Bottomley <James.Bottomley@suse.de>
This patch changes configfs to select SYSFS to fix the following:
warning: (TARGET_CORE && GFS2_FS) selects CONFIGFS_FS which has unmet direct dependencies (SYSFS)
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Acked-by: Joel Becker <jlbec@evilplan.org>
Fix the build failure in some configurations:
CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2
caused by commit 57cc7215b7 ("headers: kobject.h redux")
We need to include kobject.h here.
Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
sanitize vfsmount refcounting changes
fix old umount_tree() breakage
autofs4: Merge the remaining dentry ops tables
Unexport do_add_mount() and add in follow_automount(), not ->d_automount()
Allow d_manage() to be used in RCU-walk mode
Remove a further kludge from __do_follow_link()
autofs4: Bump version
autofs4: Add v4 pseudo direct mount support
autofs4: Fix wait validation
autofs4: Clean up autofs4_free_ino()
autofs4: Clean up dentry operations
autofs4: Clean up inode operations
autofs4: Remove unused code
autofs4: Add d_manage() dentry operation
autofs4: Add d_automount() dentry operation
Remove the automount through follow_link() kludge code from pathwalk
CIFS: Use d_automount() rather than abusing follow_link()
NFS: Use d_automount() rather than abusing follow_link()
AFS: Use d_automount() rather than abusing follow_link()
Add an AT_NO_AUTOMOUNT flag to suppress terminal automount
...
Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.
Accounting rules for longterm count:
* 1 for each fs_struct.root.mnt
* 1 for each fs_struct.pwd.mnt
* 1 for having non-NULL ->mnt_ns
* decrement to 0 happens only under vfsmount lock exclusive
That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.
For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.
Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc. And we get to keep
the optimization Nick's variant had bought us...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Expiry-related code calls umount_tree() several times with
the same list to collect vfsmounts to. Which is fine, except
that umount_tree() implicitly assumed that the list would
be empty on each call - it moves the victims over there and
then iterates through the list kicking them out. It's *almost*
idempotent, so everything nearly worked. However, mnt->ghosts
handling (and thus expirability checks) had been broken - that
part was not idempotent...
The fix is trivial - use local temporary list, splice it to
the the collector list when we are through.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Filesystem rebalancing (BTRFS_IOC_BALANCE) affects the entire
filesystem and may run uninterruptibly for a long time. This does not
seem to be something that an unprivileged user should be able to do.
Reported-by: Aron Xu <happyaron.xu@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we run low on space we could get a bunch of warnings out of
btrfs_block_rsv_check, but this is mostly just called via the transaction code
to see if we need to end the transaction, it expects to see failures, so let's
not WARN and freak everybody out for no reason. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_read_fs_root_no_radix(), 'root' is not freed if
btrfs_search_slot() returns error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Hi,
In fs/btrfs/inode.c::fixup_tree_root_location() we have this code:
...
if (!path) {
err = -ENOMEM;
goto out;
}
...
out:
btrfs_free_path(path);
return err;
btrfs_free_path() passes its argument on to other functions and some of
them end up dereferencing the pointer.
In the code above that pointer is clearly NULL, so btrfs_free_path() will
eventually cause a NULL dereference.
There are many ways to cut this cake (fix the bug). The one I chose was to
make btrfs_free_path() deal gracefully with NULL pointers. If you
disagree, feel free to come up with an alternative patch.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I happened to pass swap partition as root partition in cmdline,
then kernel panic and tell me about "Cannot open root device".
It is not correct, in fact it is a fs type mismatch instead of 'no device'.
Eventually I found btrfs mounting failed with -EIO, it should be -EINVAL.
The logic in init/do_mounts.c:
for (p = fs_names; *p; p += strlen(p)+1) {
int err = do_mount_root(name, p, flags, root_mount_data);
switch (err) {
case 0:
goto out;
case -EACCES:
flags |= MS_RDONLY;
goto retry;
case -EINVAL:
continue;
}
print "Cannot open root device"
panic
}
SO fs type after btrfs will have no chance to mount
Here fix the return value as -EINVAL
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It seems to me that we leak the memory allocated to 'value' in
btrfs_get_acl() if the call to posix_acl_from_xattr() fails.
Here's a patch that attempts to correct that problem.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.
This patch fixes it by calcing the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is just likely fake chunk allocation.
After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With this patch, we change the handling method when we can not get enough free
extents with default size.
Implementation:
1. Look up the suitable free extent on each device and keep the search result.
If not find a suitable free extent, keep the max free extent
2. If we get enough suitable free extents with default size, chunk allocation
succeeds.
3. If we can not get enough free extents, but the number of the extent with
default size is >= min_stripes, we just change the mapping information
(reduce the number of stripes in the extent map), and chunk allocation
succeeds.
4. If the number of the extent with default size is < min_stripes, sort the
devices by its max free extent's size descending
5. Use the size of the max free extent on the (num_stripes - 1)th device as the
stripe size to allocate the device space
By this way, the chunk allocator can allocate chunks as large as possible when
the devices' space is not enough and make full use of the devices.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- make it return the start position and length of the max free space when it can
not find a suitable free space.
- make it more readability
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two tiny problem:
- One is When we check the chunk size is greater than the max chunk size or not,
we should take mirrors into account, but the original code didn't.
- The other is btrfs shouldn't use the size of the residual free space as the
length of of a dup chunk when doing chunk allocation. It is because the device
space that a dup chunk needs is twice as large as the chunk size, if we use
the size of the residual free space as the length of a dup chunk, we can not
get enough free space. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We cannot write data into files when when there is tiny space in the filesystem.
Reproduce steps:
# mkfs.btrfs /dev/sda1
# mount /dev/sda1 /mnt
# dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
# dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999
(fill the filesystem)
# umount /mnt
# mount /dev/sda1 /mnt
# rm -f /mnt/tmpfile0
# dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
(failed with nospec)
But if we do the last step again, we can write data successfully. The reason of
the problem is that btrfs didn't try to commit the current transaction and
reclaim some space when chunk allocation failed.
This patch fixes it by committing the current transaction to reclaim some
space when chunk allocation fails.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef has implemented mixed data/metadata chunks, we must add those chunks'
space just like data chunks.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field <91>super_kobj<92> has incomplete type
fs/btrfs/ctree.h:1074:17: error: field <91>root_kobj<92> has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2
We need to include kobject.h here.
Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Merge the remaining autofs4 dentry ops tables. It doesn't matter if
d_automount and d_manage are present on something that's not mountable or
holdable as these ops are only used if the appropriate flags are set in
dentry->d_flags.
[AV] switch to ->s_d_op, since now _everything_ on autofs4 is using the
same dentry_operations.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself. follow_automount() will then
do the addition.
This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer. The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.
To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2. One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.
d_automount() should also add the the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used. The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.
This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
as when it is in Ref-walk mode. This permits __follow_mount_rcu() to call
d_manage() directly. d_manage() needs a parameter to indicate that it is in
RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
-ECHILD instead).
autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
accesses it and otherwise request dropping back to ref-walk mode.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove a further kludge from __do_follow_link() as it's no longer required with
the automount code.
This reverts the non-helper-function parts of
051d381259, which breaks union mounts.
Reported-by: vaurora@redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Version 4 of autofs provides a pseudo direct mount implementation
that relies on directories at the leaves of a directory tree under
an indirect mount to trigger mounts.
This patch adds support for that functionality.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It is possible for the check in wait.c:validate_request() to return
an incorrect result if the dentry that was mounted upon has changed
during the callback.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When this function is called the local reference count does't need to
be updated since the dentry is going away and dput definitely must
not be called here.
Also the autofs info struct field inode isn't used so remove it.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are now two distinct dentry operations uses. One for dentrys
that trigger mounts and one for dentrys that do not.
Rationalize the use of these dentry operations and rename them to
reflect their function.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since the use of ->follow_link() has been eliminated there is no
need to separate the indirect and direct inode operations.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove code that is not used due to the use of ->d_automount()
and ->d_manage().
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch required a previous patch to add the ->d_automount()
dentry operation.
Add a function to use the newly defined ->d_manage() dentry operation
for blocking during mount and expire.
Whether the VFS calls the dentry operations d_automount() and d_manage()
is controled by the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags. autofs
uses the d_automount() operation to callback to user space to request
mount operations and the d_manage() operation to block walks into mounts
that are under construction or destruction.
In order to prevent these functions from being called unnecessarily the
DMANAGED_* flags are cleared for cases which would cause this. In the
common case the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags are both
set for dentrys waiting to be mounted. The DMANAGED_TRANSIT flag is
cleared upon successful mount request completion and set during expire
runs, both during the dentry expire check, and if selected for expire,
is left set until a subsequent successful mount request completes.
The exception to this is the so-called rootless multi-mount which has
no actual mount at its base. In this case the DMANAGED_AUTOMOUNT flag
is cleared upon successful mount request completion as well and set
again after a successful expire.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a function to use the newly defined ->d_automount() dentry operation
for triggering mounts instead of doing the user space callback in ->lookup()
and ->d_revalidate().
Note, to be useful the subsequent patch to add the ->d_manage() dentry
operation is also needed so the discussion of functionality is deferred to
that patch.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the automount through follow_link() kludge code from pathwalk in favour
of using d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make CIFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.
[NOTE: THIS IS UNTESTED!]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make NFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make AFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount
point directories. This can be used by fstatat() users to permit the
gathering of attributes on an automount point and also prevent
mass-automounting of a directory of automount points by ls.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).
The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.
The ->d_manage() dentry operation:
int (*d_manage)(struct path *path, bool mounting_here);
takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
It should return 0 if successful and to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.
->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.
follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).
A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.
__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.
Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every tranist from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.
==========================
WHAT THIS MEANS FOR AUTOFS
==========================
Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.
autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.
The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:
mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:
automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
which means that the system is deadlocked.
This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automouter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_lookup() has a path leading from LOOKUP_RCU case to non-RCU
crossing of mountpoints, which breaks things badly. If we
hit need_revalidate: and do nothing in there, we need to come
back into LOOKUP_RCU half of things, not to done: in non-RCU
one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: restore multiple bd_link_disk_holder() support
block cfq: compensate preempted queue even if it has no slice assigned
block cfq: make queue preempt work for queues from different workload
It's indicative of a real problem, and it actually triggers with
autofs4, but the BUG_ON() is excessive. The autofs4 case is being fixed
(to only set d_op in the ->lookup method) but not merged yet. In the
meantime this gets the code limping along.
Reported-by: Alex Elder <aelder@sgi.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux: (62 commits)
nfsd4: fix callback restarting
nfsd: break lease on unlink, link, and rename
nfsd4: break lease on nfsd setattr
nfsd: don't support msnfs export option
nfsd4: initialize cb_per_client
nfsd4: allow restarting callbacks
nfsd4: simplify nfsd4_cb_prepare
nfsd4: give out delegations more quickly in 4.1 case
nfsd4: add helper function to run callbacks
nfsd4: make sure sequence flags are set after destroy_session
nfsd4: re-probe callback on connection loss
nfsd4: set sequence flag when backchannel is down
nfsd4: keep finer-grained callback status
rpc: allow xprt_class->setup to return a preexisting xprt
rpc: keep backchannel xprt as long as server connection
rpc: move sk_bc_xprt to svc_xprt
nfsd4: allow backchannel recovery
nfsd4: support BIND_CONN_TO_SESSION
nfsd4: modify session list under cl_lock
Documentation: fl_mylease no longer exists
...
Fix up conflicts in fs/nfsd/vfs.c with the vfs-scale work. The
vfs-scale work touched some msnfs cases, and this merge removes support
for that entirely, so the conflict was trivial to resolve.
In commit 3e4b3e1f we separated the "uid" mount option such that it
no longer determined the owner of the credential cache by default. When
we did this, we added a new option to cifs.upcall (--legacy-uid) to
try to make it so that it would behave the same was as it did before.
This ignored a rather important point -- the kernel has no way to know
what options are being passed to cifs.upcall, so it doesn't know what
uid it should use to determine whether to match an existing krb5 session.
The simplest solution is to simply add a new "cruid=" mount option that
only governs the uid owner of the credential cache for the mount.
Unfortunately, this means that the --legacy-uid option in cifs.upcall was
ill-considered and is now useless, but I don't see a better way to deal
with this.
A patch for the mount.cifs manpage will follow once this patch has been
accepted.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit e09b457b (block: simplify holder symlink handling) incorrectly
assumed that there is only one link at maximum. dm may use multiple
links and expects block layer to track reference count for each link,
which is different from and unrelated to the exclusive device holder
identified by @holder when the device is opened.
Remove the single holder assumption and automatic removal of the link
and revive the per-link reference count tracking. The code
essentially behaves the same as before commit e09b457b sans the
unnecessary kobject reference count dancing.
While at it, note that this facility should not be used by anyone else
than the current ones. Sysfs symlinks shouldn't be abused like this
and the whole thing doesn't belong in the block layer at all.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Milan Broz <mbroz@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
flush_scheduled_work() is going away. afs needs to make sure all the
works it has queued have finished before being unloaded and there can
be arbitrary number of pending works. Add afs_wq and use it as the
flush domain instead of the system workqueue.
Also, convert cancel_delayed_work() + flush_scheduled_work() to
cancel_delayed_work_sync() in afs_mntpt_kill_timer().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fscache_submit_exclusive_op() adds an operation to the pending list if
other operations are pending. Fix the check for pending ops as n_ops
must be greater than 0 at the point it is checked as it is incremented
immediately before under lock.
Signed-off-by: Akshat Aranya <aranya@nec-labs.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
kernel: fix hlist_bl again
cgroups: Fix a lockdep warning at cgroup removal
fs: namei fix ->put_link on wrong inode in do_filp_open
J. R. Okajima noticed that ->put_link is being attempted on the
wrong inode, and suggested the way to fix it. I changed it a bit
according to Al's suggestion to keep an explicit link path around.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
fs: fix do_last error case when need_reval_dot
nfs: add missing rcu-walk check
fs: hlist UP debug fixup
fs: fix dropping of rcu-walk from force_reval_path
fs: force_reval_path drop rcu-walk before d_invalidate
fs: small rcu-walk documentation fixes
Fixed up trivial conflicts in Documentation/filesystems/porting
When open(2) without O_DIRECTORY opens an existing dir, it should return
EISDIR. In do_last(), the variable 'error' is initialized EISDIR, but it
is changed by d_revalidate() which returns any positive to represent
'the target dir is valid.'
Should we keep and return the initialized 'error' in this case.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
As J. R. Okajima noted, force_reval_path passes in the same dentry to
d_revalidate as the one in the nameidata structure (other callers pass in a
child), so the locking breaks. This can oops with a chrooted nfs mount, for
example. Similarly there can be other problems with revalidating a dentry
which is already in nameidata of the path walk.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
d_revalidate can return in rcu-walk mode even when it returns 0. We can't just
call any old dcache function on rcu-walk dentry (the dentry is unstable, so
even through d_lock can safely be taken, the result may no longer be what we
expect -- careful re-checks would be required). So just drop rcu in this case.
(I missed this conversion when switching to the rcu-walk convention that Linus
suggested)
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
We've long had these pointless #ifdef MSNFS's sprinkled throughout the
code--pointless because MSNFS is always defined (and we give no config
option to make that easy to change). So we could just remove the
ifdef's and compile the resulting code unconditionally.
But as long as we're there: why not just rip out this code entirely?
The only purpose is to implement the "msnfs" export option which turns
on Windows-like behavior in some cases, and:
- the export option isn't documented anywhere;
- the userland utilities (which would need to be able to parse
"msnfs" in an export file) don't support it;
- I don't know how to maintain this, as I don't know what the
proper behavior is; and
- google shows no evidence that anyone has ever used this.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Otherwise a callback that is aborted before it runs will result in a
list_del on an uninitialized list head.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The sync_inodes_sb() function does not have a return value. Remove the
outdated documentation comment.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
be added to page->flags without overflowing (because of the sparse section
bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
has to move the memory hotplug code from _mapcount to lru.next to avoid
any risk of clashes. We can't use lru.next for PG_buddy removal, but
memory hotplug can use lru.next even more easily than the mapcount
instead.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add hugepage stat information to /proc/vmstat and /proc/meminfo.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We'd like to be able to oom_score_adj a process up/down as it
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
or its inherited value at fork. Assuming the thread that has forked it
has oom_score_adj of 0, each process could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.
Alternative considered:
* a setuid binary
* a daemon with CAP_SYS_RESOURCE
Since you don't wan't all processes to be able to reduce their oom_adj, a
setuid or daemon implementation would be complex. The alternatives also
have much higher overhead.
This patch updated from original patch based on feedback from David
Rientjes.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently there is no way to find whether a process has locked its pages
in memory or not. And which of the memory regions are locked in memory.
Add a new field "Locked" to export this information via the smaps file.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to
eliminate code duplication.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hai Shan <shan.hai@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use correct function name, remove incorrect apostrophe
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX. The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
easily end up with non-positive nr_to_write after the function returns, if
the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.
When nr_to_write is <= 0 wb_writeback() decides we need another round of
writeback but this is wrong in some cases! For example when a single
large file is continuously dirtied, we would never finish syncing it
because each pass would be able to write MAX_WRITEBACK_PAGES and inode
dirty timestamp never gets updated (as inode is never completely clean).
Thus __writeback_inodes_sb() would write the redirtied inode again and
again.
Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
do not need nr_to_write in WB_SYNC_ALL mode anyway since
write_cache_pages() does livelock avoidance using page tagging in
WB_SYNC_ALL mode.
This makes wb_writeback() call __writeback_inodes_sb() only once on
WB_SYNC_ALL. The latter function won't livelock because it works on
- a finite set of files by doing queue_io() once at the beginning
- a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging
After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
longer able to stall sync forever.
[fengguang.wu@intel.com: fix locking comment]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Background writeback is easily livelockable in a loop in wb_writeback() by
a process continuously re-dirtying pages (or continuously appending to a
file). This is in fact intended as the target of background writeback is
to write dirty pages it can find as long as we are over
dirty_background_threshold.
But the above behavior gets inconvenient at times because no other work
queued in the flusher thread's queue gets processed. In particular, since
e.g. sync(1) relies on flusher thread to do all the IO for it, sync(1)
can hang forever waiting for flusher thread to do the work.
Generally, when a flusher thread has some work queued, someone submitted
the work to achieve a goal more specific than what background writeback
does. Moreover by working on the specific work, we also reduce amount of
dirty pages which is exactly the target of background writeout. So it
makes sense to give specific work a priority over a generic page cleaning.
Thus we interrupt background writeback if there is some other work to do.
We return to the background writeback after completing all the queued
work.
This may delay the writeback of expired inodes for a while, however the
expired inodes will eventually be flushed to disk as long as the other
works won't livelock.
[fengguang.wu@intel.com: update comment]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This tracks when balance_dirty_pages() tries to wakeup the flusher thread
for background writeback (if it was not started already).
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Check whether background writeback is needed after finishing each work.
When bdi flusher thread finishes doing some work check whether any kind of
background writeback needs to be done (either because
dirty_background_ratio is exceeded or because we need to start flushing
old inodes). If so, just do background write back.
This way, bdi_start_background_writeback() just needs to wake up the
flusher thread. It will do background writeback as soon as there is no
other work.
This is a preparatory patch for the next patch which stops background
writeback as soon as there is other work to do.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephen Rothwell reports that the vfs merge broke the build of ecryptfs.
The breakage comes from commit 66cb76666d ("sanitize ecryptfs
->mount()") which was obviously not even build tested. Tssk, tssk, Al.
This is the minimal build fixup for the situation, although I don't have
a filesystem to actually test it with.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix xattr name comparison in rbtree search for strings that share a prefix.
The *name argument is null terminated, but the xattr name is not, so we
need to use strncmp, but that means adjusting for the case where name is
a prefix of xattr->name.
The corresponding case in __set_xattr() already handles this properly
(although in that case *name is also not null terminated).
Reported-by: Sergiy Kibrik <sakib@meta.ua>
Signed-off-by: Sage Weil <sage@newdream.net>
The norbytes mount option was broken, and when doing getattr
on a directory it return the rbytes instead of the number of
entities. This commit fixes it.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Move squashfs_i() definition out of squashfs.h, this eliminates
the need to #include squashfs_fs_i.h from numerous files.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
As pointed out by Geert Uytterhoeven, "default n" is the default,
no reason to specify it.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
On file system corruption zlib can return Z_STREAM_OK with
input buffers remaining, which will not be released.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Add support for reading file systems compressed with the
XZ compression algorithm.
This patch adds the XZ decompressor wrapper code.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Commit c0204fd2b8 (NFS: Clean up
nfs4_proc_create()) broke NFSv3 exclusive open by removing the code
that passes the O_EXCL flag down to nfs3_proc_create(). This patch
reverts that offending hunk from the original commit.
Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org [2.6.37]
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
fs: add documentation on fallocate hole punching
Gfs2: fail if we try to use hole punch
Btrfs: fail if we try to use hole punch
Ext4: fail if we try to use hole punch
Ocfs2: handle hole punching via fallocate properly
XFS: handle hole punching via fallocate properly
fs: add hole punching to fallocate
vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
fix signedness mess in rw_verify_area() on 64bit architectures
fs: fix kernel-doc for dcache::prepend_path
fs: fix kernel-doc for dcache::d_validate
sanitize ecryptfs ->mount()
switch afs
move internal-only parts of ncpfs headers to fs/ncpfs
switch ncpfs
switch 9p
pass default dentry_operations to mount_pseudo()
switch hostfs
switch affs
switch configfs
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: fix cleanup when trying to mount inexistent image
net/ceph: make ceph_msgr_wq non-reentrant
ceph: fsc->*_wq's aren't used in memory reclaim path
ceph: Always free allocated memory in osdmap_decode()
ceph: Makefile: Remove unnessary code
ceph: associate requests with opening sessions
ceph: drop redundant r_mds field
ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
ceph: add dir_layout to inode
Generate a unique inode numbers for any entries in the cram file system.
For files which did not contain data's (device nodes, fifos and sockets)
the offset of the directory entry inside the cramfs plus 1 will be used as
inode number.
The + 1 for the inode will it make possible to distinguish between a file
which contains no data and files which has data, the later one has a inode
value where the lower two bits are always 0.
It also reimplements the behavior to set the size and the number of block
to 0 for special file, which is the right value for empty files, devices,
fifos and sockets
As a little benefit it will be also more compatible which older mkcramfs,
because it will never use the cramfs_inode->offset for creating a inode
number for special files.
[akpm@linux-foundation.org: trivial comment fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aio_run_iocbs() is not used at all, so get rid of it.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 66fa12c571 ("ieee1394: remove the old IEEE 1394 driver stack")
eliminated the only user of cdev_index(). So it can be removed too.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 34aacb2920 ("procfs: Use generic_file_llseek in /proc/kcore") broke
seeking on /proc/kcore. This changes it back to use default_llseek in
order to restore the original behavior.
The problem with generic_file_llseek is that it only allows seeks up to
inode->i_sb->s_maxbytes, which is 2GB-1 on procfs, where the memory file
offset values in the /proc/kcore PT_LOAD segments may exceed or start
beyond that offset value.
A similar revert was made for /proc/vmcore.
Signed-off-by: Dave Anderson <anderson@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filename is supposed to match procfile name for random junk.
Add __init while I'm at it.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For the common case where a proc entry is being removed and nobody is in
the process of using it, save a LOCK/UNLOCK pair.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a PageSlab() check before adding the _mapcount value to /kpagecount.
page->_mapcount is in a union with the SLAB structure so for pages
controlled by SLAB, page_mapcount() returns nonsense.
Signed-off-by: Petr Holasek <pholasek@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
single_open()'s third argument is for copying into seq_file->private. Use
that, rather than open-coding it.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- ->low_ino is write-once field -- reading it under locks is unnecessary.
- /proc/$PID stuff never reaches pde_put()/free_proc_entry() --
PROC_DYNAMIC_FIRST check never triggers.
- in proc_get_inode(), inode number always matches proc dir entry, so
save one parameter.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For string without format specifiers, use seq_puts().
For seq_printf("\n"), use seq_putc('\n').
text data bss dec hex filename
61866 488 112 62466 f402 fs/proc/proc.o
61729 488 112 62329 f379 fs/proc/proc.o
----------------------------------------------------
-139
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/*/statm code needlessly truncates data from unsigned long to int.
One needs only 8+ TB of RAM to make truncation visible.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use temporary lr for struct latency_record for improved readability and
fewer columns used. Removed trailing space from output.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A call to va_start() must always be followed by a call to va_end() in the
same function. In fs/reiserfs/prints.c::print_block() this is not always
the case. If 'bh' is NULL we'll return without calling va_end().
One could add a call to va_end() before the 'return' statement, but it's
nicer to just move the call to va_start() after the test for 'bh' being
NULL.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Edward Shishkin <edward.shishkin@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'struct befs_disk_data_stream' is huge (~144 bytes) and it's being passed
by value in fs/befs/endian.h::cpu_to_fsrun().
It would be better to pass a pointer.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Will Dyson <will_dyson@pobox.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Send the events the wakeup refers to, so that epoll, and even the new poll
code in fs/select.c can avoid wakeups if the events do not match the
requested set.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>