A deadlock in xfstests 113 was uncovered by commit
d187663ef2
This is because we would not return EIOCBQUEUED for short AIO reads, instead
we'd wait for the DIO to complete and then return the amount of data we
transferred, which would allow our stuff to unlock the remaning amount. But
with this change this no longer happens, so if we have a short AIO read (for
example if we try to read past EOF), we could leave the section from EOF to
the end of where we tried to read locked. Fixing this is tricky since there
is no clear way to know exactly how much data DIO truly submitted for IO, so
to make this less hard on ourselves and less combersome we need to lock the
extents as we try to map them, and then we unlock any areas we didn't
actually map. This makes us completely safe from deadlocks and reliance on
a particular behavior of the DIO code. This also lays the groundwork for
allowing us to use the normal csum storage method for reads which means we
can remove an allocation. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
"trans->transid" is cpu endian but we want to store the data as little
endian. "item->ctime.nsec" is only 32 bits, not 64.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
There are few important bug fixes for LogFS
9f0bbd8 logfs: query block device for number of pages to send with bio
This BUG was found when LogFS was used on KVM. The patch fixes
the problem by asking for underlaying block device the number
of pages to send with each BIO.
41b93bc logfs: maintain the ordering of meta-inode destruction
LogFS maintains file system meta-data in special inodes. These
inodes are releated to each other, therefore they must be
destroyed in a proper order.
ddb24bb logfs: create a pagecache page if it is not present
cd8bfa9 logfs: initialize the number of iovecs in bio
LogFS used to panic when it was created on an encrypted LVM
volume. The patch fixes the problem by properly initializing
the BIO.
d2dcd90 logfs: destroy the reserved inodes while unmounting
Diffstat:
fs/logfs/dev_bdev.c | 15 ++++++++-------
fs/logfs/inode.c | 18 +-----------------
fs/logfs/journal.c | 2 +-
fs/logfs/readwrite.c | 1 +
fs/logfs/segment.c | 2 +-
5 files changed, 12 insertions(+), 26 deletions(-)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQNmnnAAoJEDFA/f+3K+ZNhc0QAJWWtcvqCIU2UHKw0DaQieH8
X+YwoqdH5UBR0vylyoHCgCRSoRuttYEy47DYDhkpUdqad60atDhwIwhFaknlNZbt
+hv2ve6hZTzcs42HZhu+0WUzMkUs0s6Xjuu9JlZND59zk1wWBfttDlKI2/SdjMlm
GnW0nx5gLFZ1rmpaL8bdRpyLXxUvb9FmUc+YWsuinj8A2Lnqg5bNOmJZ3CiMYNVk
UvbHDYJmNaMZndbYeXZtxXtUo9Uk4HsU+7whpmgD+OPz7h+VMOgfXnwkpsgmtbY2
qTXAjyFVpN23zhBTbpCMXvpfbrffdBJBfkEW+2sXqpr9IHbMX0Y9cb/XJ3Ub1Qz+
HRLdBh0iAZewWVRsGOKG2OU1WUrTNzKUAJ795QFL0c0bZsODzQ+5OlcD7rBzEMpq
ioN49UtJWlmckWaott6PinSl/OKWlvz771Zayh9+ttuL3Dvt6coV7K3ns5MI6glN
M9DTMd8GAW2Kdz80EyAjZYWw2M3lbs0/GghJB4ozYg3mrDpBq1w5ouEvSNw8PuJw
yoUEe2xLGpLZdzQDouKMlrai8kohCyMoHKH1Kimx+iG4LZkYLy8VJepY1mfzQbEY
1zSB1nCV/c67qI3fxRdPNmJ7YtjTC9ero0qOouXwSJWwZ+OEikGXOKWnjRutcY1C
Jfbfg6gNqOrZZbYWp1ke
=eS/c
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://github.com/prasad-joshi/logfs_upstream
Pull LogFS bugfixes from Prasad Joshi:
- "logfs: query block device for number of pages to send with bio"
This BUG was found when LogFS was used on KVM. The patch fixes
the problem by asking for underlaying block device the number
of pages to send with each BIO.
- "logfs: maintain the ordering of meta-inode destruction"
LogFS maintains file system meta-data in special inodes. These
inodes are releated to each other, therefore they must be
destroyed in a proper order.
- "logfs: initialize the number of iovecs in bio"
LogFS used to panic when it was created on an encrypted LVM
volume. The patch fixes the problem by properly initializing
the BIO.
Plus a couple more:
- logfs: create a pagecache page if it is not present
- logfs: destroy the reserved inodes while unmounting
* tag 'for-linus' of git://github.com/prasad-joshi/logfs_upstream:
logfs: query block device for number of pages to send with bio
logfs: maintain the ordering of meta-inode destruction
logfs: create a pagecache page if it is not present
logfs: initialize the number of iovecs in bio
logfs: destroy the reserved inodes while unmounting
- fix uninitialised variable in xfs_rtbuf_get()
- unlock the AGI buffer when looping in xfs_dialloc
- check for possible overflow in xfs_ioc_trim
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJQN7FMAAoJENaLyazVq6ZOrwsQAMB2Cg/t/XeiPFD5QtvFX7Ua
9cS3klSo9p29Xr5oeo54b3mF+5dycS61PQu2zdg4XhU5ZqoWoS6xUIFxzo3yWiVi
nC/BbgMQTgkfEQs+2ulmXRgP8A2DdakSDmvAZDQNURZQwH4zykU0tPKD9V009TMW
SaIprIWQqhRrsd9prrv364FgKv3/vyuFL1Agho304o3ynk+2SLDzm3eqdTaXTD9Z
9OGP5Hs5Mh9CZmr46MIpRnoaHppqNRZh7EbDyMFL8kMN8F2yOdJk5MgA1m7gPJ5Z
ADfnv3laxEp04hhD5lazePYJwZlg/KB77Y4YXD5hmTSznIxcQNlfRcMV9ju4FGgr
Nb5SgA/6iBk5dkZyuyIdqCzj1Mt6SimpIWb3+/vcS8+VaqQUsyoufccO0zrgeoIA
/O1PtCe//haFENlD+hMc9QrQSISsaR/YYUv0yz4YcsxKsdmyYpbHdTyFhVjnkKoS
BtA0ZOgWYebgEzr7J1ijW085CrDvAc9jsPujOWmZlBJWcfo1rtVfqhXWll0ms5cr
ZRoWqLDqoOhjlUDqwMN4dk6ENxG4QbSl6AWXXVXrUvtvaNzoi0MBNX3HPt/RVmF5
rGKFvLuIxBHAfD1xYCqtCfP9yzeQzDy1SPdASForEZOyM+RVUnOf9i+1nhbsu/97
AhiW4lozQetWzx+RQwU+
=+FuN
-----END PGP SIGNATURE-----
Merge tag 'for-linus-v3.6-rc4' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:
- fix uninitialised variable in xfs_rtbuf_get()
- unlock the AGI buffer when looping in xfs_dialloc
- check for possible overflow in xfs_ioc_trim
* tag 'for-linus-v3.6-rc4' of git://oss.sgi.com/xfs/xfs:
xfs: check for possible overflow in xfs_ioc_trim
xfs: unlock the AGI buffer when looping in xfs_dialloc
xfs: fix uninitialised variable in xfs_rtbuf_get()
Pull nfsd bugfixes from J. Bruce Fields:
"Particular thanks to Michael Tokarev, Malahal Naineni, and Jamie
Heilman for their testing and debugging help."
* 'for-3.6' of git://linux-nfs.org/~bfields/linux:
svcrpc: fix svc_xprt_enqueue/svc_recv busy-looping
svcrpc: sends on closed socket should stop immediately
svcrpc: fix BUG() in svc_tcp_clear_pages
nfsd4: fix security flavor of NFSv4.0 callback
Pull block-related fixes from Jens Axboe:
- Improvements to the buffered and direct write IO plugging from
Fengguang.
- Abstract out the mapping of a bio in a request, and use that to
provide a blk_bio_map_sg() helper. Useful for mapping just a bio
instead of a full request.
- Regression fix from Hugh, fixing up a patch that went into the
previous release cycle (and marked stable, too) attempting to prevent
a loop in __getblk_slow().
- Updates to discard requests, fixing up the sizing and how we align
them. Also a change to disallow merging of discard requests, since
that doesn't really work properly yet.
- A few drbd fixes.
- Documentation updates.
* 'for-linus' of git://git.kernel.dk/linux-block:
block: replace __getblk_slow misfix by grow_dev_page fix
drbd: Write all pages of the bitmap after an online resize
drbd: Finish requests that completed while IO was frozen
drbd: fix drbd wire compatibility for empty flushes
Documentation: update tunable options in block/cfq-iosched.txt
Documentation: update tunable options in block/cfq-iosched.txt
Documentation: update missing index files in block/00-INDEX
block: move down direct IO plugging
block: remove plugging at buffered write time
block: disable discard request merge temporarily
bio: Fix potential memory leak in bio_find_or_create_slab()
block: Don't use static to define "void *p" in show_partition_start()
block: Add blk_bio_map_sg() helper
block: Introduce __blk_segment_map_sg() helper
fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices
block: split discard into aligned requests
block: reorganize rounding of max_discard_sectors
Pull UDF, ext3 & reiserfs fixes from Jan Kara:
"A couple of fixes (udf, reiserfs, ext3) that accumulated over my
vacation."
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: fix retun value on error path in udf_load_logicalvol
jbd: don't write superblock when unmounting an ro filesystem
reiserfs: fix deadlocks with quotas
quota: Move down dqptr_sem read after initializing default warn[] type at __dquot_alloc_space().
UDF: During mount free lvid_bh before rescanning with different blocksize
udf: fix udf_setsize() for file data in ICB
If range.start or range.minlen is bigger than filesystem size, return
invalid value error. This fixes possible overflow in BTOBB macro when
passed value was nearly ULLONG_MAX.
Signed-off-by: Tomas Racek <tracek@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Also update some commens in the area to make the code easier to read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Commit 91f68c89d8 ("block: fix infinite loop in __getblk_slow")
is not good: a successful call to grow_buffers() cannot guarantee
that the page won't be reclaimed before the immediate next call to
__find_get_block(), which is why there was always a loop there.
Yesterday I got "EXT4-fs error (device loop0): __ext4_get_inode_loc:3595:
inode #19278: block 664: comm cc1: unable to read itable block" on console,
which pointed to this commit.
I've been trying to bisect for weeks, why kbuild-on-ext4-on-loop-on-tmpfs
sometimes fails from a missing header file, under memory pressure on
ppc G5. I've never seen this on x86, and I've never seen it on 3.5-rc7
itself, despite that commit being in there: bisection pointed to an
irrelevant pinctrl merge, but hard to tell when failure takes between
18 minutes and 38 hours (but so far it's happened quicker on 3.6-rc2).
(I've since found such __ext4_get_inode_loc errors in /var/log/messages
from previous weeks: why the message never appeared on console until
yesterday morning is a mystery for another day.)
Revert 91f68c89d8, restoring __getblk_slow() to how it was (plus
a checkpatch nitfix). Simplify the interface between grow_buffers()
and grow_dev_page(), and avoid the infinite loop beyond end of device
by instead checking init_page_buffers()'s end_block there (I presume
that's more efficient than a repeated call to blkdev_max_block()),
returning -ENXIO to __getblk_slow() in that case.
And remove akpm's ten-year-old "__getblk() cannot fail ... weird"
comment, but that is worrying: are all users of __getblk() really
now prepared for a NULL bh beyond end of device, or will some oops??
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.0 3.2 3.4 3.5
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull ceph fixes from Sage Weil:
"Jim's fix closes a narrow race introduced with the msgr changes. One
fix resolves problems with debugfs initialization that Yan found when
multiple client instances are created (e.g., two clusters mounted, or
rbd + cephfs), another one fixes problems with mounting a nonexistent
server subdirectory, and the last one fixes a divide by zero error
from unsanitized ioctl input that Dan Carpenter found."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: avoid divide by zero in __validate_layout()
libceph: avoid truncation due to racing banners
ceph: tolerate (and warn on) extraneous dentry from mds
libceph: delay debugfs initialization until we learn global_id
- NFSv3 mounts need to fail if the FSINFO rpc call fails
- Ensure that the NFS commit cache gets torn down when we unload the
NFS module.
- Fix memory scribble issues when interrupting a LAYOUTGET rpc call
- Fix NFSv4 legacy idmapper regressions
- Fix issues with the NFSv4 getacl command
- Fix a regression when using the legacy "mount -t nfs4"
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQNPyKAAoJEGcL54qWCgDyQlAP/AmnLaJmKMPj5P4GCiA/uxMV
sr0XSSVAv+OgUtRxLNO5PDyunVIIEjwA5nGa2RZEG14+eYwNEeQehaGoaGqoKLhh
ATOCMvHHEs5qSKLckWsZWflf6U/MLimYS3D/hHVUtP0QWbBMtol3ImABs2ht/VUc
HW0wQoEG+5mhbhC87d4ku5E+OHQzYeNNmdX2VkKOmT0CwRd/bfsHxlGwMmcHldP2
TSge+MRhoKNycpy0j7nugHj2pZ12UiX+O8371OlAketnPk0OA+6RNGMyNo49IQAn
VEN9MFWc4ypH1+hJ0OvhzmUA6Zn2/lyQXUtwM1DcXeBmLI4aClb3bT3G3xKNh09O
yLEvkh2aUUlmoTSgP7vDK/Ui3QwVnsaJULvg+gywQJG6qTSFgUQfErNKgJmabHQb
d0EuumlSGnJ7qdC5nmrUp+M/CpbfGh15ax/JnTRGVK7C3jeCm18vc2np+z4EUDp/
USzrvuBu1B8eI0nQNoht0BeNf7sEJGgFIn8BJzxDakP8buXPMqEvFwA8c1JmMeBX
E6pbG2jHqPdrTJtxQTpMRqhA0MxbFlVNi5cCs6U7ifiBqbRWEXyNVgAHTwoIxIrn
ASbL0s8gZH+EMEXNOmPmFiO2NUjBCeSzAjrsTmSpWa13YmGfiroZN8N/iPzLnFf+
FR3SFUnoJHq7x4uLZoAX
=lqfh
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.6-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- NFSv3 mounts need to fail if the FSINFO rpc call fails
- Ensure that the NFS commit cache gets torn down when we unload the
NFS module.
- Fix memory scribble issues when interrupting a LAYOUTGET rpc call
- Fix NFSv4 legacy idmapper regressions
- Fix issues with the NFSv4 getacl command
- Fix a regression when using the legacy "mount -t nfs4"
* tag 'nfs-for-3.6-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv3: Ensure that do_proc_get_root() reports errors correctly
NFSv4: Ensure that nfs4_alloc_client cleans up on error.
NFS: return -ENOKEY when the upcall fails to map the name
NFS: Clear key construction data if the idmap upcall fails
NFSv4: Don't use private xdr_stream fields in decode_getacl
NFSv4: Fix the acl cache size calculation
NFSv4: Fix pointer arithmetic in decode_getacl
NFS: Alias the nfs module to nfs4
NFS: Fix a regression when loading the NFS v4 module
NFSv4.1: Remove a bogus BUG_ON() in nfs4_layoutreturn_done
pnfs-obj: Better IO pattern in case of unaligned offset
NFS41: add pg_layout_private to nfs_pageio_descriptor
pnfs: nfs4_proc_layoutget returns void
pnfs: defer release of pages in layoutget
nfs: tear down caches in nfs_init_writepagecache when allocation fails
Pull assorted fixes - mostly vfs - from Al Viro:
"Assorted fixes, with an unexpected detour into vfio refcounting logics
(fell out when digging in an analog of eventpoll race in there)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
task_work: add a scheduling point in task_work_run()
fs: fix fs/namei.c kernel-doc warnings
eventpoll: use-after-possible-free in epoll_create1()
vfio: grab vfio_device reference *before* exposing the sucker via fd_install()
vfio: get rid of vfio_device_put()/vfio_group_get_device* races
vfio: get rid of open-coding kref_put_mutex
introduce kref_put_mutex()
vfio: don't dereference after kfree...
mqueue: lift mnt_want_write() outside ->i_mutex, clean up a bit
Fix kernel-doc warnings in fs/namei.c:
Warning(fs/namei.c:360): No description found for parameter 'inode'
Warning(fs/namei.c:672): No description found for parameter 'nd'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
As soon as we'd installed the file into descriptor table, it can
get closed by another thread. Freeing ep in process...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When debugging is enabled, we use a temporary on-stack buffer for formatting
the key strings like "(11368871, direntry, 0xcd0750)". The buffer size is
32 bytes and sometimes it is not enough to fit the key string - e.g., when
inode numbers are high. This is not fatal, but the key strings are incomplete
and UBIFS complains like this:
UBIFS assert failed in dbg_snprintf_key at 137 (pid 1)
This is a regression caused by "515315a UBIFS: fix key printing".
Fix the issue by increasing the buffer to 48 bytes.
Reported-by: Michael Hench <michaelhench@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Michael Hench <michaelhench@gmail.com>
Cc: stable@vger.kernel.org [v3.3+]
If "l->stripe_unit" is zero the the mod on the next line will cause a
divide by zero bug. This comes from the copy_from_user() in
ceph_ioctl_set_layout_policy(). Passing 0 is valid, though (it means
"do not change") so avoid the % check in that case.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
If the MDS gives us a dentry and we weren't prepared to handle it,
WARN_ON_ONCE instead of crashing.
Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Commit "d51f17e UBIFS: simplify reply code a bit" introduces a bug with the
following symptoms:
UBIFS error (pid 1): replay_log_leb: first CS node at LEB 3:0 has wrong commit number 0 expected 1
The issue is that we start replaying the log from UBIFS_LOG_LNUM instead
of c->lhead_lnum. This patch fixes that.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
This patch fixes a regression introduced by
"4994297 UBIFS: make ubifs_lpt_init clean-up in case of failure" which
I've hit while running the 'integck -p' test. When remount the file-system
from R/O mode to R/W mode and 'lpt_init_wr()' fails, we free _all_ LPT
resources by calling 'ubifs_lpt_free(c, 0)', even those needed for R/O
mode. This leads to subsequent crashes, e.g., if we try to unmount
the file-system.
Cc: stable@vger.kernel.org [v3.5+]
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Commit d5497fc693 "nfsd4: move rq_flavor
into svc_cred" forgot to remove cl_flavor from the client, leaving two
places (cl_flavor and cl_cred.cr_flavor) for the flavor to be stored.
After that patch, the latter was the one that was updated, but the
former was the one that the callback used.
Symptoms were a long delay on utime(). This is because the utime()
generated a setattr which recalled a delegation, but the cb_recall was
ignored by the client because it had the wrong security flavor.
Cc: stable@vger.kernel.org
Tested-by: Jamie Heilman <jamie@audible.transient.net>
Reported-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
compat_sys_{read,write}v() need the same "pass a copy of file->f_pos" thing
as sys_{read,write}{,v}().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The debugfs directory includes the cluster fsid and our unique global_id.
We need to delay the initialization of the debug entry until we have
learned both the fsid and our global_id from the monitor or else the
second client can't create its debugfs entry and will fail (and multiple
client instances aren't properly reflected in debugfs).
Reported by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
If the rpc call to NFS3PROC_FSINFO fails, then we need to report that
error so that the mount fails. Otherwise we can end up with a
superblock with completely unusable values for block sizes, maxfilesize,
etc.
Reported-by: Yuanming Chen <hikvision_linux@163.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Any pointer that was allocated through nfs_alloc_client() needs to be
freed via a call to nfs_free_client().
Reported-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit d2c127197d caused a regression
in cifs_do_create error handling. Fix this by closing a file handle
in the case of a get_inode_info(_unix) error. Also remove unnecessary
checks for newinode being NULL.
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
While trying to debug a SMB signature related issue with Windows Servers
figured out it might be easier to debug if we print the error code from
cifs_verify_signature(). Also, fix indendation while at it.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
that can cause warning messages. Pavel had initially
suggested a smaller patch around drop_nlink, after
a similar problem was discovered NFS. Protecting
additional places where nlink is touched was
suggested by Jeff Layton and is included in this.
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Pull vfs fixes from Miklos Szeredi.
This mainly fixes some confusion about whether the open 'mode' variable
passed around should contain the full file type (S_IFREG etc)
information or just the permission mode. In particular, the lack of
proper file type information had confused fuse.
* 'vfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
vfs: fix propagation of atomic_open create error on negative dentry
fuse: check create mode in atomic open
vfs: pass right create mode to may_o_create()
vfs: atomic_open(): fix create mode usage
vfs: canonicalize create mode in build_open_flags()
the ones which cause problems for ext4 on RAID --- a performance
problem when mounting very large filesystems, and a kernel OOPS when
doing an rm -rf on large directory hierarchies on fast devices.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJQLlVSAAoJENNvdpvBGATwhOYQAKx22YF9+SHnnVv2GtCQrsE6
N3acE/FoDYiO1/LRa5M6NDO3ZIL4Vqi6409LYFQky/SQL8ziM5CBeLOBD9qPTE1L
AGrzWn4vzZcjEf90ZEBN99fS5Uj1A9Axz2denxy7Hfb2HcSXfcuuVQXpdS42cFo8
gwtlX4jxmPkbjRlAdeqYVBNuWpTZ0a+UyLc4A6v3aLcbzPSJvPYmI7mKCksiSySU
yt0atjMDb56blAQJ2TITdAZN6rQShNzyok2pPfxaLusl5g0Gtjq8sSEPof1PQ1zq
gDFc+kpZvUyPdwQzV3IL8+TodFFZ0x/2OhqoYCTKajROLHGjQFsdkb8EJbnNeDff
EDxIjeVJR0kzSuSNWu+n5028lmEd9Sk7Ykr37cHeUxG8/0SADUqSDQYNhvbOPQsj
iq5dwF79tKjuMqjJrABuWA1ZNgHBISXgyBmHLXgEk3LrgucT9UIg8Zlkhq480SYO
JXhmkO2Ka4UwkVUShoWgRtEzRgUxhINBShs6g67zwm6slS4s2CWHnqhUn6EQe6+r
DY/hoUA8KbdG3Cf5iJBFM2kUO68CDIXeJjocA7JvlouRgQSxmkOceIuk7DusAitM
nHJKAtSNgC/z7yMoNi7YN0S5YYcCebmO1MEPzYSpPH07YwLJVNmh9Fk6BIfb7vi3
vJSQMBrgGrbaXrnhBA7z
=Zulv
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
"The following are all bug fixes and regressions. The most notable are
the ones which cause problems for ext4 on RAID --- a performance
problem when mounting very large filesystems, and a kernel OOPS when
doing an rm -rf on large directory hierarchies on fast devices."
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix kernel BUG on large-scale rm -rf commands
ext4: fix long mount times on very big file systems
ext4: don't call ext4_error while block group is locked
ext4: avoid kmemcheck complaint from reading uninitialized memory
ext4: make sure the journal sb is written in ext4_clear_journal_err()
In some cases when an autofs indirect mount is contained in a file
system that is marked as shared (such as when systemd does the
equivalent of "mount --make-rshared /" early in the boot), mounts
stop expiring.
When this happens the first expiry check on a mountpoint dentry in
autofs_expire_indirect() sees a mountpoint dentry with a higher
than minimal reference count. Consequently the dentry is condidered
busy and the actual expiry check is never done.
This particular check was originally meant as an optimisation to
detect a path walk in progress but with the addition of rcu-walk
it can be ineffective anyway.
Removing the test allows automounts to expire again since the
actual expire check doesn't rely on the dentry reference count.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 968dee7722: "ext4: fix hole punch failure when depth is greater
than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel
crashes when users ran run "rm -rf" on large directory hierarchy on
ext4 filesystems on RAID devices:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710)
Call Trace:
[<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110
[<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0
[<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0
[<ffffffff81207e05>] ext4_truncate+0xf5/0x100
[<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490
[<ffffffff811a1312>] evict+0xa2/0x1a0
[<ffffffff811a1513>] iput+0x103/0x1f0
[<ffffffff81196d84>] do_unlinkat+0x154/0x1c0
[<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40
[<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50
[<ffffffff816135e9>] system_call_fastpath+0x16/0x1b
Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00
RIP [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0
This could be reproduced as follows:
The problem in commit 968dee7722 was that caused the variable 'i' to
be left uninitialized if the truncate required more space than was
available in the journal. This resulted in the function
ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused
ext4_ext_remove_space() to restart the truncate operation after
starting a new jbd2 handle.
Reported-by: Maciej Żenczykowski <maze@google.com>
Reported-by: Marti Raudsepp <marti@juffo.org>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Commit 8aeb00ff85a: "ext4: fix overhead calculation used by
ext4_statfs()" introduced a O(n**2) calculation which makes very large
file systems take forever to mount. Fix this with an optimization for
non-bigalloc file systems. (For bigalloc file systems the overhead
needs to be set in the the superblock.)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
While in ext4_validate_block_bitmap(), if an block allocation bitmap
is found to be invalid, we call ext4_error() while the block group is
still locked. This causes ext4_commit_super() to call a function
which might sleep while in an atomic context.
There's no need to keep the block group locked at this point, so hoist
the ext4_error() call up to ext4_validate_block_bitmap() and release
the block group spinlock before calling ext4_error().
The reported stack trace can be found at:
http://article.gmane.org/gmane.comp.file-systems.ext4/33731
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
This allows the normal error-paths to handle the error, rather than
making a special call to complete_request_key() just for this instance.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Tested-by: William Dauchy <wdauchy@gmail.com>
Cc: stable@vger.kernel.org [>= 3.4]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
idmap_pipe_downcall already clears this field if the upcall succeeds,
but if it fails (rpc.idmapd isn't running) the field will still be set
on the next call triggering a BUG_ON(). This patch tries to handle all
possible ways that the upcall could fail and clear the idmap key data
for each one.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Tested-by: William Dauchy <wdauchy@gmail.com>
Cc: stable@vger.kernel.org [>= 3.4]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of using the private field xdr->p from struct xdr_stream,
use the public xdr_stream_pos().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, we do not take into account the size of the 16 byte
struct nfs4_cached_acl header, when deciding whether or not we should
cache the acl data. Consequently, we will end up allocating an
8k buffer in order to fit a maximum size 4k acl.
This patch adjusts the calculation so that we limit the cache size
to 4k for the acl header+data.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Resetting the cursor xdr->p to a previous value is not a safe
practice: if the xdr_stream has crossed out of the initial iovec,
then a bunch of other fields would need to be reset too.
Fix this issue by using xdr_enter_page() so that the buffer gets
page aligned at the bitmap _before_ we decode it.
Also fix the confusion of the ACL length with the page buffer length
by not adding the base offset to the ACL length...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
This allows distros to remove the line from their modprobe
configuration.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Some systems have a modprobe.d/nfs.conf file that sets an nfs4 alias
pointing to nfs.ko, rather than nfs4.ko. This can prevent the v4 module
from loading on mount, since the kernel sees that something named "nfs4"
has already been loaded. To work around this, I've renamed the modules
to "nfsv2.ko" "nfsv3.ko" and "nfsv4.ko".
I also had to move the nfs4_fs_type back to nfs.ko to ensure that `mount
-t nfs4` still works.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Following a report of a crash during an automount expire I found that
the locking in fs/autofs4/expire.c:get_next_positive_subdir() was wrong.
Not only is the locking wrong but the function is more complex than it
needs to be.
The function is meant to calculate (and dget) the next entry in the list
of directories contained in the root of an autofs mount point (an autofs
indirect mount to be precise). The main problem was that the d_lock of
the owner of the list was not being taken when walking the list, which
lead to list corruption under load. The only other lock that needs to
be taken is against the next dentry candidate so it can be checked for
usability.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If ->atomic_open() returns -ENOENT, we take care to return the create
error (e.g., EACCES), if any. Do the same when ->atomic_open() returns 1
and provides a negative dentry.
This fixes a regression where an unprivileged open O_CREAT fails with
ENOENT instead of EACCES, introduced with the new atomic_open code. It
is tested by the open/08.t test in the pjd posix test suite, and was
observed on top of fuse (backed by ceph-fuse).
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
In case we detect a problem and bail out, we fail to set "ret" to a
nonzero value, and udf_load_logicalvol will mistakenly report success.
Signed-off-by: Nikola Pajkovsky <npajkovs@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This sequence:
results in an IO error when unmounting the RO filesystem. The bug was
introduced by:
commit 9754e39c7b
Author: Jan Kara <jack@suse.cz>
Date: Sat Apr 7 12:33:03 2012 +0200
jbd: Split updating of journal superblock and marking journal empty
which lost some of the magic in journal_update_superblock() which
used to test for a journal with no outstanding transactions.
This is a port of a jbd2 fix by Eric Sandeen.
CC: <stable@vger.kernel.org> # 3.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Verify that the VFS is passing us a complete create mode with the S_IFREG to
atomic open.
Reported-by: Steve <steveamigauk@yahoo.co.uk>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Pass the umask-ed create mode to may_o_create() instead of the original one.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Don't mask S_ISREG off the create mode before passing to ->atomic_open(). Other
methods (->create, ->mknod) also get the complete file mode and filesystems
expect it.
Reported-by: Steve <steveamigauk@yahoo.co.uk>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Userspace can pass weird create mode in open(2) that we canonicalize to
"(mode & S_IALLUGO) | S_IFREG" in vfs_create().
The problem is that we use the uncanonicalized mode before calling vfs_create()
with unforseen consequences.
So do the canonicalization early in build_open_flags().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
CC: stable@vger.kernel.org
The BKL push-down for reiserfs made lock recursion a special case that needs
to be handled explicitly. One of the cases that was unhandled is dropping
the quota during inode eviction. Both reiserfs_evict_inode and
reiserfs_write_dquot take the write lock, but when the journal lock is
taken it only drops one the references. The locking rules are that the journal
lock be acquired before the write lock so leaving the reference open leads
to a ABBA deadlock.
This patch pushes the unlock up before clear_inode and avoids the recursive
locking.
Another ABBA situation can occur when the write lock is dropped while reading
the bitmap buffer while in the quota code. When the lock is reacquired, it
will deadlock against dquot->dq_lock and dqopt->dqio_mutex in the dquot_acquire
path. It's safe to retain the lock across the read and should be cached under
write load.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
sb->s_dqopt->dqptr_sem is used to serialize ops using pointers from inode to
dquots. But for __dquot_alloc_space(), it could be safely moved down after the
default warn[] array got initialized.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If s_lvid_bh is not freed and set to NULL before re-scanning partition
with default block size, we might end up using wrong lvid in case
s_lvid_bh is not updated in udf_load_logicalvolint during rescan.
Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If the new size is larger than the old size and the old file data was
stored in the ICB (iinfo->i_alloc_type == ICBTAG_FLAG_AD_IN_ICB) and the
new size still fits in the ICB, skip the call to udf_extend_file() as it
does not handle this i_alloc_type value (it calls BUG()).
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull btrfs merge fix from Chris Mason:
"This fixes a merge error in rc1. The calls to mnt_want_write should
have been removed."
* 'for-linus-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: remove mnt_want_write call in btrfs_mksubvol
We got a recursive lock in mksubvol because the caller already held
a lock. I think we got into this due to a merge error. Commit a874a63
removed the mnt_want_write call from btrfs_mksubvol and added a
replacement call to mnt_want_write_file in btrfs_ioctl_snap_create_transid.
Commit e7848683 however tried to move all calls to mnt_want_write above
i_mutex. So somewhere while merging this, it got mixed up. The
solution is to remove the mnt_want_write call completely from
mksubvol.
Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Move unplugging for direct I/O from around ->direct_IO() down to
do_blockdev_direct_IO(). This implicitly adds plugging for direct
writes.
CC: Li Shaohua <shli@fusionio.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Do not leak memory by updating pointer with potentially NULL realloc return value.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
disconnected data server) we've been sending layoutreturn calls
while there is potentially still outstanding I/O to the data
servers. The reason we do this is to avoid races between replayed
writes to the MDS and the original writes to the DS.
When this happens, the BUG_ON() in nfs4_layoutreturn_done can
be triggered because it assumes that we would never call
layoutreturn without knowing that all I/O to the DS is
finished. The fix is to remove the BUG_ON() now that the
assumptions behind the test are obsolete.
Reported-by: Boaz Harrosh <bharrosh@panasas.com>
Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>=3.5]
Commit 7572777eef attempted to verify that
the total iovec from the client doesn't overflow iov_length() but it
only checked the first element. The iovec could still overflow by
starting with a small element. The obvious fix is to check all the
elements.
The overflow case doesn't look dangerous to the kernel as the copy is
limited by the length after the overflow. This fix restores the
intention of returning an error instead of successfully copying less
than the iovec represented.
I found this by code inspection. I built it but don't have a test case.
I'm cc:ing stable because the initial commit did as well.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: <stable@vger.kernel.org> [2.6.37+]
Commit 03179fe923 introduced a kmemcheck complaint in
ext4_da_get_block_prep() because we save and restore
ei->i_da_metadata_calc_last_lblock even though it is left
uninitialized in the case where i_da_metadata_calc_len is zero.
This doesn't hurt anything, but silencing the kmemcheck complaint
makes it easier for people to find real bugs.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=45631
(which is marked as a regression).
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
After we transfer set the EXT4_ERROR_FS bit in the file system
superblock, it's not enough to call jbd2_journal_clear_err() to clear
the error indication from journal superblock --- we need to call
jbd2_journal_update_sb_errno() as well. Otherwise, when the root file
system is mounted read-only, the journal is replayed, and the error
indicator is transferred to the superblock --- but the s_errno field
in the jbd2 superblock is left set (since although we cleared it in
memory, we never flushed it out to disk).
This can end up confusing e2fsck. We should make e2fsck more robust
in this case, but the kernel shouldn't be leaving things in this
confused state, either.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
The pdflush thread is long gone, so this patch removes references to pdflush
from UBIFS comments.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The pdflush thread is long gone, so this patch removes references to pdflush
from gfs comments.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from ntfs.
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from hfs.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The pdflush thread is long gone, so this patch removes references to pdflush
from vfs comments.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from various jbd and jbd2.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The pdflush thread is long gone, so this patch removes references to pdflush
from btrfs comments.
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from btrfs.
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The pdflush thread is long gone, so this patch removes references to pdflush
from ext4 comments.
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from ext3.
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from ext3.
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Finally we can kill the 'sync_supers' kernel thread along with the
'->write_super()' superblock operation because all the users are gone.
Now every file-system is supposed to self-manage own superblock and
its dirty state.
The nice thing about killing this thread is that it improves power management.
Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
every 5 seconds no matter what - even if there were no dirty superblocks and
even if there were no file-systems using this service (e.g., btrfs and
journalled ext4 do not need it). So it was wasting power most of the time. And
because the thread was in the core of the kernel, all systems had to have it.
So I am quite happy to make it go away.
Interestingly, this thread is a left-over from the pdflush kernel thread which
was a self-forking kernel thread responsible for all the write-back in old
Linux kernels. It was turned into per-block device BDI threads, and
'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull exofs update from Boaz Harrosh:
"They are all mostly fixes, except the most important patch by Artem
Bityutskiy which removes the use of s_dirt. After this patch s_dirt
can be completely removed from the tree."
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
ore: Fix out-of-bounds access in _ios_obj()
exofs: Use proper max_IO calculations from ore
exofs: Fix __r4w_get_page when offset is beyond i_size
exofs: stop using s_dirt
exofs: readpage_strip: Add a BUG_ON to check for PageLocked(page)
Depending on layout and ARCH, ORE has some limits on max IO sizes
which is communicated on (what else) ore_layout->max_io_length,
which is always stripe aligned.
This was considered as the pg_test boundary for splitting and starting
a new IO.
But in the case of a long IO where the start offset is not aligned
what would happen is that both end of IO[N] and start of IO[N+1]
would be unaligned, causing each IO boundary parity unit to be
calculated and written twice.
So what we do in this patch is split the very start of an unaligned
IO, up to a stripe boundary, and then next IO's can continue fully
aligned til the end.
We might be sacrificing the case where the full unaligned IO would
fit within a single max_io_length, but the sacrifice is well worth
the elimination of double calculation and parity units IO.
Actually the sacrificing is marginal and is almost unmeasurable.
TODO:
If we know the total expected linear segment that will
be received, at pg_init, we could use that information
in many places:
1. blocks-layout get_layout write segment size
2. Better mds-threshold
3. In above situation for a better clean split
I will do this in future submission.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To allow layout driver to pass private information around
pg_init/pg_doio.
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
since the only user of nfs4_proc_layoutget is send_layoutget, which
ignores its return value, there is no reason to return any value.
Signed-off-by: Idan Kedar <idank@tonian.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
we have encountered a bug whereby reading a lot of files (copying
fedora's /bin) from a pNFS mount and hitting Ctrl+C in the middle caused
a general protection fault in xdr_shrink_bufhead. this function is
called when decoding the response from LAYOUTGET. the decoding is done
by a worker thread, and the caller of LAYOUTGET waits for the worker
thread to complete.
hitting Ctrl+C caused the synchronous wait to end and the next thing the
caller does is to free the pages, so when the worker thread calls
xdr_shrink_bufhead, the pages are gone. therefore, the cleanup of these
pages has been moved to nfs4_layoutget_release.
Signed-off-by: Idan Kedar <idank@tonian.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and ensure that we tear down the nfs_commit_data cache too when
unloading the module.
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull two ceph fixes from Sage Weil:
"The first patch fixes up the old crufty open intent code to use the
atomic_open stuff properly, and the second fixes a possible null deref
and memory leak with the crypto keys."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: fix crypto key null deref, memory leak
ceph: simplify+fix atomic_open
eCryptfs mount options do not
- Cleanups in the messaging code
- Better handling of empty files in the lower filesystem to improve usability.
Failed file creations are now cleaned up and empty lower files are converted
into eCryptfs during open().
- The write-through cache changes are being reverted due to bugs that are not
easy to fix. Stability outweighs the performance enhancements here.
- Improvement to the mount code to catch unsupported ciphers specified in the
mount options
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABCgAGBQJQGhVWAAoJENaSAD2qAscKjvcQAMgF66sl8KjwzFCgojh30a4u
1hCXltxUCLjXxjXxyygUHb/pH0xdvC+Ss3FTtsPXAjgYm2lXjjoOmVG8WAvwHHx1
2ZjDo+8fQ4XA8Rl9kYuvt/abF0IssNRK3csTWplR7lpoQ8AWbkpkag1I4WZhibey
cgs/zECl8ACTJ5zQ+AyRGnrssq4jI7xZAKWLK0+KKk7S9yIRI7K/xdz1xK39jGK6
N09Dw3VWY/bMcMq77ZXBtyHdP7KR7wKUtCeQttmCvdf20Ocy6AXzr1FRKvUxionF
Sf31tJim0u9OO8hmy8cjCyWEy9LHnXnSd/5vn+Qd9ok9GvuiYmKw07rbXi/gjhBX
ai5PKtl05WiQgp80BybUYfIY1Hq71MsppNi6h9Zgiid5rEvWWvCBWBWP95G8DTmC
6TwLaCG0rh8uuZyeiVrs3xZQ2IG5Zmu0CX3XGyfsaLvqmQWhtT5ZQVMeMQEEBxyQ
ur9SSU2O/nC8ceLB7fzGmZPTLZUWOuYQnd24NJNK+7j0P+Km7pqmDYpCdwmFpx3C
CQ0gGaJGHeycvBF327bwxPmsPdO4fy+nmEL8vrEXPTyQ3ZVAPvxZK9t8Jpk4UFOl
JSTWFiK0mvgE+5dX2kB0nzN7iD7hcMgmozht50qQP3OUkI1kkBrEHgHVO3KzgUIA
+aRAljeLdelq44JBxjiB
=4vDd
-----END PGP SIGNATURE-----
Merge tag 'ecryptfs-3.6-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull ecryptfs fixes from Tyler Hicks:
- Fixes a bug when the lower filesystem mount options include 'acl',
but the eCryptfs mount options do not
- Cleanups in the messaging code
- Better handling of empty files in the lower filesystem to improve
usability. Failed file creations are now cleaned up and empty lower
files are converted into eCryptfs during open().
- The write-through cache changes are being reverted due to bugs that
are not easy to fix. Stability outweighs the performance
enhancements here.
- Improvement to the mount code to catch unsupported ciphers specified
in the mount options
* tag 'ecryptfs-3.6-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
eCryptfs: check for eCryptfs cipher support at mount
eCryptfs: Revert to a writethrough cache model
eCryptfs: Initialize empty lower files when opening them
eCryptfs: Unlink lower inode when ecryptfs_create() fails
eCryptfs: Make all miscdev functions use daemon ptr in file private_data
eCryptfs: Remove unused messaging declarations and function
eCryptfs: Copy up POSIX ACL and read-only flags from lower mount
Pull CIFS update from Steve French:
"Adds SMB2 rmdir/mkdir capability to the SMB2/SMB2.1 support in cifs.
I am holding up a few more days on merging the remainder of the
SMB2/SMB2.1 enablement although it is nearing review completion, in
order to address some review comments from Jeff Layton on a few of the
subsequent SMB2 patches, and also to debug an unrelated cifs problem
that Pavel discovered."
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Add SMB2 support for rmdir
CIFS: Move rmdir code to ops struct
CIFS: Add SMB2 support for mkdir operation
CIFS: Separate protocol specific part from mkdir
CIFS: Simplify cifs_mkdir call
The initial ->atomic_open op was carried over from the old intent code,
which was incomplete and didn't really work. Replace it with a fresh
method. In particular:
* always attempt to do an atomic open+lookup, both for the create case
and for lookups of existing files.
* fix symlink handling by returning 1 to the VFS so that we can follow
the link to its destination. This fixes a longstanding ceph bug (#2392).
Signed-off-by: Sage Weil <sage@inktank.com>
_ios_obj() is accessed by group_index not device_table index.
The oc->comps array is only a group_full of devices at a time
it is not like ore_comp_dev() which is indexed by a global
device_table index.
This did not BUG until now because exofs only uses a single
COMP for all devices. But with other FSs like PanFS this is
not true.
This bug was only in the write_path, all other users were
using it correctly
[This is a bug since 3.2 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
exofs_max_io_pages should just use the ORE's
calculated layout->max_io_length,
And avoid unnecessary BUGs, calculations made here were
also a layering violation.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
It is very common for the end of the file to be unaligned on
stripe size. But since we know it's beyond file's end then
the XOR should be preformed with all zeros.
Old code used to just read zeros out of the OSD devices, which is a great
waist. But what scares me more about this situation is that, we now have
pages attached to the file's mapping that are beyond i_size. I don't
like the kind of bugs this calls for.
Fix both birds, by returning a global ZERO_PAGE, if offset is beyond
i_size.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Exofs has the '->write_super()' handler and makes some use of the '->s_dirt'
superblock flag, but it really needs neither of them because it never sets
's_dirt' to one which means the VFS never calls its '->write_super()' handler.
Thus, remove both.
Note, I am trying to remove both 's_dirt' and 'write_super()' from VFS
altogether once all users are gone.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
readpage_strip can be called from several code paths all of which
require that the page be locked before any operations are carried
out.
Since we export the exofs_readpage callback to the VFS, add a
BUG_ON to check for PageLocked(page) to make sure that this
understanding is never compromised.
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
For regular file, write operaion used blk_plug function.But for block
file,write operation did not use blk_plug.
This patch is also for write-cache mode for block-device.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull second vfs pile from Al Viro:
"The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
deadlock reproduced by xfstests 068), symlink and hardlink restriction
patches, plus assorted cleanups and fixes.
Note that another fsfreeze deadlock (emergency thaw one) is *not*
dealt with - the series by Fernando conflicts a lot with Jan's, breaks
userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
for massive vfsmount leak; this is going to be handled next cycle.
There probably will be another pull request, but that stuff won't be
in it."
Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
delousing target_core_file a bit
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
fs: Remove old freezing mechanism
ext2: Implement freezing
btrfs: Convert to new freezing mechanism
nilfs2: Convert to new freezing mechanism
ntfs: Convert to new freezing mechanism
fuse: Convert to new freezing mechanism
gfs2: Convert to new freezing mechanism
ocfs2: Convert to new freezing mechanism
xfs: Convert to new freezing code
ext4: Convert to new freezing mechanism
fs: Protect write paths by sb_start_write - sb_end_write
fs: Skip atime update on frozen filesystem
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
fs: Improve filesystem freezing handling
switch the protection of percpu_counter list to spinlock
nfsd: Push mnt_want_write() outside of i_mutex
btrfs: Push mnt_want_write() outside of i_mutex
fat: Push mnt_want_write() outside of i_mutex
...
In commit 3b6e2723f3 ("locks: prevent side-effects of
locks_release_private before file_lock is initialized") we removed the
last user of lm_release_private without removing the field itself.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge Andrew's second set of patches:
- MM
- a few random fixes
- a couple of RTC leftovers
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
rtc/rtc-88pm80x: remove unneed devm_kfree
rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
tmpfs: distribute interleave better across nodes
mm: remove redundant initialization
mm: warn if pg_data_t isn't initialized with zero
mips: zero out pg_data_t when it's allocated
memcg: gix memory accounting scalability in shrink_page_list
mm/sparse: remove index_init_lock
mm/sparse: more checks on mem_section number
mm/sparse: optimize sparse_index_alloc
memcg: add mem_cgroup_from_css() helper
memcg: further prevent OOM with too many dirty pages
memcg: prevent OOM with too many dirty pages
mm: mmu_notifier: fix freed page still mapped in secondary MMU
mm: memcg: only check anon swapin page charges for swap cache
mm: memcg: only check swap cache pages for repeated charging
mm: memcg: split swapin charge function into private and public part
mm: memcg: remove needless !mm fixup to init_mm when charging
mm: memcg: remove unneeded shmem charge type
...
Features include:
- Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into
separate modules.
- Fix Oopses in the NFSv4 idmapper
- Fix a deadlock whereby rpciod tries to allocate a new socket and
ends up recursing into the NFS code due to memory reclaim.
- Increase the number of permitted callback connections.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQGF+vAAoJEGcL54qWCgDy6ncP/jKVl/Yjz6i1b/8QtC0EmYFb
XNwbjZicNupVM98XPLm+sfdWxvoGiHSMbwG8t/hx5Z6CteM18adeGNO1KqP9xQyg
scPvFmqj5VJNHsHztcHIFHFYdpsuJqRKcW3TA2l8AkNOm6uBcnXArRosYN6LrXik
hI/jWcyv8wZuooSQrLf463JK37t/s6LuUQo6jKP4582sbANHvPdLkHFCYg3yt1Sk
2IIo2CLLs2cJUswGJnsHT76Jxfvh21NdGtvydjVjpr4H0/LY5GykBc23AAL9TWj6
KlegDO902hLzEE93xazF59hQZrPuwBi3quEMJ3X0mjdNQWb4G96s/t2hmwpTjWyc
mjpRsuaGlCXDxcrI5dnU52hqKa0Ju1ipbLpghxSVfilRy998xvGp5Yc6ADU+pYnt
uP09lamCyjFEa+gDSqGt7lv4dFmdPJjWZdkXLPv0ah5sXVMYAT8NRLUfiE1+hLLu
zKaM84PHSEvykFeJ2WvjDTgF3L55y3x6L3LN4UkKmfO5nqQf7csPcquU1NfHh25S
jjKGq2SKGoYG1l6FwK3hemup5cnFUEX78E2QeLrmaHg92fJDGrz587i6MPc5jrSe
sOSX/CpuCJd4VhodIo8T00me0GZSHEU7RP2KEc518ZvrPmz13avXeLeG9nYhjuxM
cV9Ex8zh2Y4aiUXCy3uA
=KSix
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull second wave of NFS client updates from Trond Myklebust:
- Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into
separate modules.
- Fix Oopses in the NFSv4 idmapper
- Fix a deadlock whereby rpciod tries to allocate a new socket and ends
up recursing into the NFS code due to memory reclaim.
- Increase the number of permitted callback connections.
* tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: explicitly reject LOCK_MAND flock() requests
nfs: increase number of permitted callback connections.
SUNRPC: return negative value in case rpcbind client creation error
NFS: Convert v4 into a module
NFS: Convert v3 into a module
NFS: Convert v2 into a module
NFS: Keep module parameters in the generic NFS client
NFS: Split out remaining NFS v4 inode functions
NFS: Pass super operations and xattr handlers in the nfs_subversion
NFS: Only initialize the ACL client in the v3 case
NFS: Create a try_mount rpc op
NFS: Remove the NFS v4 xdev mount function
NFS: Add version registering framework
NFS: Fix a number of bugs in the idmapper
nfs: skip commit in releasepage if we're freeing memory for fs-related reasons
sunrpc: clarify comments on rpc_make_runnable
pnfsblock: bail out partial page IO
GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO,
just not of any filesystem data.
The problem is that previously NOFS was correct because that avoids
recursion into the NFS code. With swap-over-NFS, it is no longer correct
as swap IO can lead to this recursion.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the new swapfile a_ops for NFS and hook up ->direct_IO. This
will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under
PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol
->connect() method.
PF_MEMALLOC should allow the allocation of struct socket and related
objects and the early (re)setting of SOCK_MEMALLOC should allow us to
receive the packets required for the TCP connection buildup.
[jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases]
[dfeng@redhat.com: Fix handling of multiple swap files]
[a.p.zijlstra@chello.nl: Original patch]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VM does not like PG_private set on PG_swapcache pages. As suggested
by Trond in http://lkml.org/lkml/2006/8/25/348, this patch disables NFS
data cache revalidation on swap files. as it does not make sense to have
other clients change the file while it is being used as swap. This avoids
setting PG_private on swap pages, since there ought to be no further races
with invalidate_inode_pages2() to deal with.
Since we cannot set PG_private we cannot use page->private which is
already used by PG_swapcache pages to store the nfs_page. Thus augment
the new nfs_page_find_request logic.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace all relevant occurences of page->index and page->mapping in the
NFS client with the new page_file_index() and page_file_mapping()
functions.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
09f363c7 ("vmscan: fix shrinker callback bug in fs/super.c") fixed a
shrinker callback which was returning -1 when nr_to_scan is zero, which
caused excessive slab scanning. But 635697c6 ("vmscan: fix initial
shrinker size handling") fixed the problem, again so we can freely return
-1 although nr_to_scan is zero. So let's revert 09f363c7 because the
comment added in 09f363c7 made an unnecessary rule.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a mmu_gather instead of a temporary linked list for accumulating pages
when we unmap a hugepage range
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more. But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.
For back-compatibility, printk warning information and return 2 to notify
the users that the interface is removed.
Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull nfsd changes from J. Bruce Fields:
"This has been an unusually quiet cycle--mostly bugfixes and cleanup.
The one large piece is Stanislav's work to containerize the server's
grace period--but that in itself is just one more step in a
not-yet-complete project to allow fully containerized nfs service.
There are a number of outstanding delegation, container, v4 state, and
gss patches that aren't quite ready yet; 3.7 may be wilder."
* 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits)
NFSd: make boot_time variable per network namespace
NFSd: make grace end flag per network namespace
Lockd: move grace period management from lockd() to per-net functions
LockD: pass actual network namespace to grace period management functions
LockD: manage grace list per network namespace
SUNRPC: service request network namespace helper introduced
NFSd: make nfsd4_manager allocated per network namespace context.
LockD: make lockd manager allocated per network namespace
LockD: manage grace period per network namespace
Lockd: add more debug to host shutdown functions
Lockd: host complaining function introduced
LockD: manage used host count per networks namespace
LockD: manage garbage collection timeout per networks namespace
LockD: make garbage collector network namespace aware.
LockD: mark host per network namespace on garbage collect
nfsd4: fix missing fault_inject.h include
locks: move lease-specific code out of locks_delete_lock
locks: prevent side-effects of locks_release_private before file_lock is initialized
NFSd: set nfsd_serv to NULL after service destruction
NFSd: introduce nfsd_destroy() helper
...
Pull Ceph changes from Sage Weil:
"Lots of stuff this time around:
- lots of cleanup and refactoring in the libceph messenger code, and
many hard to hit races and bugs closed as a result.
- lots of cleanup and refactoring in the rbd code from Alex Elder,
mostly in preparation for the layering functionality that will be
coming in 3.7.
- some misc rbd cleanups from Josh Durgin that are finally going
upstream
- support for CRUSH tunables (used by newer clusters to improve the
data placement)
- some cleanup in our use of d_parent that Al brought up a while back
- a random collection of fixes across the tree
There is another patch coming that fixes up our ->atomic_open()
behavior, but I'm going to hammer on it a bit more before sending it."
Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
rbd: create rbd_refresh_helper()
rbd: return obj version in __rbd_refresh_header()
rbd: fixes in rbd_header_from_disk()
rbd: always pass ops array to rbd_req_sync_op()
rbd: pass null version pointer in add_snap()
rbd: make rbd_create_rw_ops() return a pointer
rbd: have __rbd_add_snap_dev() return a pointer
libceph: recheck con state after allocating incoming message
libceph: change ceph_con_in_msg_alloc convention to be less weird
libceph: avoid dropping con mutex before fault
libceph: verify state after retaking con lock after dispatch
libceph: revoke mon_client messages on session restart
libceph: fix handling of immediate socket connect failure
ceph: update MAINTAINERS file
libceph: be less chatty about stray replies
libceph: clear all flags on con_close
libceph: clean up con flags
libceph: replace connection state bits with states
libceph: drop unnecessary CLOSED check in socket state change callback
libceph: close socket directly from ceph_con_close()
...
We have no mechanism to emulate LOCK_MAND locks on NFSv4, so explicitly
return -EINVAL if someone requests it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
By default a sunrpc service is limited to (N+3)*20 connections
where N is the number of threads. This is 80 when N==1.
If this number is exceeded a warning is printed suggesting that
the number of threads be increased. However with services which
run a single thread, this is impossible.
For such services there is a ->sv_maxconn setting that can be
used to forcibly increase the limit, and silence the message.
This is used by lockd.
The nfs client uses a sunrpc service to handle callbacks and
it too is single-threaded, so to avoid the useless messages,
and to allow a reasonable number of concurrent connections,
we need to set ->sv_maxconn. 1024 seems like a good number.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that all users are converted, we can remove functions, variables, and
constants defined by the old freezing mechanism.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only missing piece to make freezing work reliably with ext2 is to
stop iput() of unlinked inode from deleting the inode on frozen filesystem.
So add a necessary protection to ext2_evict_inode().
We also provide appropriate ->freeze_fs and ->unfreeze_fs functions.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We convert btrfs_file_aio_write() to use new freeze check. We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.
Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).
CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We change nilfs_page_mkwrite() to provide proper freeze protection for
writeable page faults (we must wait for frozen filesystem even if the
page is fully mapped).
We remove all vfs_check_frozen() checks since they are now handled by
the generic code.
CC: linux-nilfs@vger.kernel.org
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move check in ntfs_file_aio_write_nolock() to ntfs_file_aio_write() and
use new freeze protection.
CC: linux-ntfs-dev@lists.sourceforge.net
CC: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert check in fuse_file_aio_write() to using new freeze protection.
CC: fuse-devel@lists.sourceforge.net
CC: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We update gfs2_page_mkwrite() to use new freeze protection and the transaction
code to use freeze protection while the transaction is running. That is needed
to stop iput() of unlinked file from modifying the filesystem. The rest is
handled by the generic code.
CC: cluster-devel@redhat.com
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new freeze
protection. We also protect several ioctl entry points which were missing the
protection. Finally, we add freeze protection to the journaling mechanism so
that iput() of unlinked inode cannot modify a frozen filesystem.
CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Generic code now blocks all writers from standard write paths. So we add
blocking of all writers coming from ioctl (we get a protection of ioctl against
racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
non-racy freeze protection. We also keep freeze protection on transaction
start to block internal filesystem writes such as removal of preallocated
blocks.
CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We remove most of frozen checks since upper layer takes care of blocking all
writes. We have to handle protection in ext4_page_mkwrite() in a special way
because we cannot use generic block_page_mkwrite(). Also we add a freeze
protection to ext4_evict_inode() so that iput() of unlinked inode cannot modify
a frozen filesystem (we cannot easily instrument ext4_journal_start() /
ext4_journal_stop() with freeze protection because we are missing the
superblock pointer in ext4_journal_stop() in nojournal mode).
CC: linux-ext4@vger.kernel.org
CC: "Theodore Ts'o" <tytso@mit.edu>
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are several entry points which dirty pages in a filesystem. mmap
(handled by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), splice write (generic_file_splice_write),
truncate, and fallocate (these can dirty last partial page - handled inside
each filesystem separately). Protect these places with sb_start_write() and
sb_end_write().
->page_mkwrite() calls are particularly complex since they are called with
mmap_sem held and thus we cannot use standard sb_start_write() due to lock
ordering constraints. We solve the problem by using a special freeze protection
sb_start_pagefault() which ranks below mmap_sem.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It is unexpected to block reading of frozen filesystem because of atime update.
Also handling blocking on frozen filesystem because of atime update would make
locking more complex than it already is. So just skip atime update when
filesystem is frozen like we skip it when filesystem is remounted read-only.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of places where we want freeze protection coincides with the places where
we also have remount-ro protection. So make mnt_want_write() and
mnt_drop_write() (and their _file alternative) prevent freezing as well.
For the few cases that are really interested only in remount-ro protection
provide new function variants.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
vfs_check_frozen() tests are racy since the filesystem can be frozen just after
the test is performed. Thus in write paths we can end up marking some pages or
inodes dirty even though the file system is already frozen. This creates
problems with flusher thread hanging on frozen filesystem.
Another problem is that exclusion between ->page_mkwrite() and filesystem
freezing has been handled by setting page dirty and then verifying s_frozen.
This guaranteed that either the freezing code sees the faulted page, writes it,
and writeprotects it again or we see s_frozen set and bail out of page fault.
This works to protect from page being marked writeable while filesystem
freezing is running but has an unpleasant artefact of leaving dirty (although
unmodified and writeprotected) pages on frozen filesystem resulting in similar
problems with flusher thread as the first problem.
This patch aims at providing exclusion between write paths and filesystem
freezing. We implement a writer-freeze read-write semaphore in the superblock.
Actually, there are three such semaphores because of lock ranking reasons - one
for page fault handlers (->page_mkwrite), one for all other writers, and one of
internal filesystem purposes (used e.g. to track running transactions). Write
paths which should block freezing (e.g. directory operations, ->aio_write(),
->page_mkwrite) hold reader side of the semaphore. Code freezing the filesystem
takes the writer side.
Only that we don't really want to bounce cachelines of the semaphores between
CPUs for each write happening. So we implement the reader side of the semaphore
as a per-cpu counter and the writer side is implemented using s_writers.frozen
superblock field.
[AV: microoptimize sb_start_write(); we want it fast in normal case]
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
which can adapt equally well to fast/slow devices.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQF0/wAAoJECvKgwp+S8JaszsP/16EO5F5mUCOFgncVRp+8R9U
BxuKJ61j2R9ckHA+ngMEg72W5vJQds64cjywZnz6HMr0/+3tXUf4QBbU4/4sCeai
0lpK8MCKgp5KHHCxgO8zyoSaboankUgoDcbSmGJREV1WXoR8VWXsO9gXqiiH9XOe
e8ADjds/YdxkQbOYDRgZKvLzwWS61K9Kwq5/56GASh2uflw7rkJZ38xqvGbo3YiQ
IJJwOUYfJjFadIewYARQmkZZyWeAmtY0ADh15Z8pJt+iY4PgcDDlaWagwUH2Oaoi
vhTFO4KnCjhSpc872et21g/jN/VrcQqzuUF/LUE9rW5irXeDZVCDrCQOuHQ+3Uo5
YuV3rpNABW/LU8AvtIwt9hUunaKUnrXaSluoL9LzH2VpH++JljQeR4yZ62Q5rpRs
z4Bow25p7tlbcIJWzueMPdOFUr5s3P6XfQHoLVRWqN94eJ+Z1DPOgBlOain6cCUN
oivPh4FxZfUscNmX8/7cHaTNEpGzJ0FzLPMNFnGQ2zwG0gk41fnMdWb19aVIE5GL
/Q96TB24k/v+o1lbxGmyPi0L+Aq+NknvNm+p/YJHAAQrIpu5t2hPvka8/m7Nj7tu
c3rM75RKiZkEI+3U6Ws1DhhQPtVcfNIVlNYQMGzlIndtK6T0ByueQ4eN5Z7ltjSE
pS89rv2hyBw7yCaTU6ui
=gZrM
-----END PGP SIGNATURE-----
Merge tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Pull writeback updates from Wu Fengguang:
"Use time based periods to age the writeback proportions, which can
adapt equally well to fast/slow devices."
Fix up trivial conflict in comment in fs/sync.c
* tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Fix some comment errors
block: Convert BDI proportion calculations to flexible proportions
lib: Fix possible deadlock in flexible proportion code
lib: Proportions with flexible period
Features include:
- More preparatory patches for modularising NFSv2/v3/v4.
Split out the various NFSv2/v3/v4-specific code into separate
files
- More preparation for the NFSv4 migration code
- Ensure that OPEN(O_CREATE) observes the pNFS mds threshold parameters
- pNFS fast failover when the data servers are down
- Various cleanups and debugging patches
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQFwYRAAoJEGcL54qWCgDy6YAQAIZuUPFCQbuzWev4L/XA0uhg
0xjr9YBmg57UOs8e7FhS0azJip+D/kmF2Rx5gCMX0QO/IwaEVoms4BS8zjFOr3yL
cPIyY5ErF+WJ25f3Mz8k0wOHTzrOvMdq4/7TQf5hztsrGYFid78sSb3O9ZO9bgb4
QxqhgA0Z4S2vIqeObJAF9BT5UOT/cXfVKV0iTd52ZQVA+ZhqJ2SDlNFsoxnVOweV
qzKOi0FZCPsjH+Z4a/LLdieHG5AYRPhOScBdG7gTFAdkysI8DDtSNCmJ68dMHYRw
OHtyt/X4k9TImfwauV3Eww1sw3NkDswX+dv3GOSDTlMvVBrm0WXWj2zoHZbOB+0Q
ETI9Zpolvd9pADtyfu9c+kCYIR+NdB4lGKd15iZfdEy/CZ0CtbLAbn5Szo2YcwNR
iVjswR0jsbklD+mZjlMeZoG4JX2d9ZRqLs20POz7QQBmdIQ8jb9/SwVj9zGvX0w0
NyVUHTFlGaNwkpo7oeRoWi1xjZWE6OzfoncV/46DOJWb08OKZ4fR4ugMYJWGi7Xx
I4nQrIiHtInqL+sF5Y3wlZ48dufLuIhAcHh7klxgud4GRj7t8j+a99pTrRmpHvvC
4K2Yu69VYcIA7N2RsSLohIllGPFmMwpdar7yERdD0So8/fz/zf7wn0dFFYY11+bL
YNVVGhvNxccMU5EU9YR4
=Lc59
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Features include:
- More preparatory patches for modularising NFSv2/v3/v4. Split out
the various NFSv2/v3/v4-specific code into separate files
- More preparation for the NFSv4 migration code
- Ensure that OPEN(O_CREATE) observes the pNFS mds threshold
parameters
- pNFS fast failover when the data servers are down
- Various cleanups and debugging patches"
* tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (67 commits)
nfs: fix fl_type tests in NFSv4 code
NFS: fix pnfs regression with directio writes
NFS: fix pnfs regression with directio reads
sunrpc: clnt: Add missing braces
nfs: fix stub return type warnings
NFS: exit_nfs_v4() shouldn't be an __exit function
SUNRPC: Add a missing spin_unlock to gss_mech_list_pseudoflavors
NFS: Split out NFS v4 client functions
NFS: Split out the NFS v4 filesystem types
NFS: Create a single nfs_clone_super() function
NFS: Split out NFS v4 server creating code
NFS: Initialize the NFS v4 client from init_nfs_v4()
NFS: Move the v4 getroot code to nfs4getroot.c
NFS: Split out NFS v4 file operations
NFS: Initialize v4 sysctls from nfs_init_v4()
NFS: Create an init_nfs_v4() function
NFS: Split out NFS v4 inode operations
NFS: Split out NFS v3 inode operations
NFS: Split out NFS v2 inode operations
NFS: Clean up nfs4_proc_setclientid() and friends
...
There are two structures in which a count of snapshots are
maintained:
struct ceph_snap_context {
...
u32 num_snaps;
...
}
and
struct ceph_snap_realm {
...
u32 num_prior_parent_snaps; /* had prior to parent_since */
...
u32 num_snaps;
...
}
These fields never take on negative values (e.g., to hold special
meaning), and so are really inherently unsigned. Furthermore they
take their value from over-the-wire or on-disk formatted 32-bit
values.
So change their definition to have type u32, and change some spots
elsewhere in the code to account for this change.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
We re-run the loop but we don't re-set the attrs pointer back to NULL.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
When we detect a mds session reset, close the old ceph_connection before
reopening it. This ensures we clean up the old socket properly and keep
the ceph_connection state correct.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Merge Andrew's first set of patches:
"Non-MM patches:
- lots of misc bits
- tree-wide have_clk() cleanups
- quite a lot of printk tweaks. I draw your attention to "printk:
convert the format for KERN_<LEVEL> to a 2 byte pattern" which
looks a bit scary. But afaict it's solid.
- backlight updates
- lib/ feature work (notably the addition and use of memweight())
- checkpatch updates
- rtc updates
- nilfs updates
- fatfs updates (partial, still waiting for acks)
- kdump, proc, fork, IPC, sysctl, taskstats, pps, etc
- new fault-injection feature work"
* Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
drivers/misc/lkdtm.c: fix missing allocation failure check
lib/scatterlist: do not re-write gfp_flags in __sg_alloc_table()
fault-injection: add tool to run command with failslab or fail_page_alloc
fault-injection: add selftests for cpu and memory hotplug
powerpc: pSeries reconfig notifier error injection module
memory: memory notifier error injection module
PM: PM notifier error injection module
cpu: rewrite cpu-notifier-error-inject module
fault-injection: notifier error injection
c/r: fcntl: add F_GETOWNER_UIDS option
resource: make sure requested range is included in the root range
include/linux/aio.h: cpp->C conversions
fs: cachefiles: add support for large files in filesystem caching
pps: return PTR_ERR on error in device_create
taskstats: check nla_reserve() return
sysctl: suppress kmemleak messages
ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION
ipc: compat: use signed size_t types for msgsnd and msgrcv
ipc: allow compat IPC version field parsing if !ARCH_WANT_OLD_COMPAT_IPC
ipc: add COMPAT_SHMLBA support
...
When we restore file descriptors we would like them to look exactly as
they were at dumping time.
With help of fcntl it's almost possible, the missing snippet is file
owners UIDs.
To be able to read their values the F_GETOWNER_UIDS is introduced.
This option is valid iif CONFIG_CHECKPOINT_RESTORE is turned on, otherwise
returning -EINVAL.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__mem_open() which is called by both /proc/<pid>/environ and
/proc/<pid>/mem ->open() handlers will allow the use of negative offsets.
/proc/<pid>/mem has negative offsets but not /proc/<pid>/environ.
Clean this by moving the 'force FMODE_UNSIGNED_OFFSET flag' to mem_open()
to allow negative offsets only on /proc/<pid>/mem.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the following offset and environment address range check in
environ_read() of /proc/<pid>/environ is buggy:
int this_len = mm->env_end - (mm->env_start + src);
if (this_len <= 0)
break;
Large or negative offsets on /proc/<pid>/environ converted to 'unsigned
long' may pass this check since '(mm->env_start + src)' can overflow and
'this_len' will be positive.
This can turn /proc/<pid>/environ to act like /proc/<pid>/mem since
(mm->env_start + src) will point and read from another VMA.
There are two fixes here plus some code cleaning:
1) Fix the overflow by checking if the offset that was converted to
unsigned long will always point to the [mm->env_start, mm->env_end]
address range.
2) Remove the truncation that was made to the result of the check,
storing the result in 'int this_len' will alter its value and we can
not depend on it.
For kernels that have commit b409e578d ("proc: clean up
/proc/<pid>/environ handling") which adds the appropriate ptrace check and
saves the 'mm' at ->open() time, this is not a security issue.
This patch is taken from the grsecurity patch since it was just made
available.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 898b374af6 ("exec: replace call_usermodehelper_pipe with use
of umh init function and resolve limit"), the core limits recursive
check value was changed from 0 to 1, but the corresponding comments were
not updated.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nearly identical shortname parsing is performed in fat_search_long() and
__fat_readdir(). Extract this code into a function that may be called by
both.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simplify code by providing accessor functions for the directory entry
start cluster fields.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use -ENOMEM return value instead of -EINVAL when kzalloc() fails.
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An fs-thaw ioctl causes deadlock with a chcp or mkcp -s command:
chcp D ffff88013870f3d0 0 1325 1324 0x00000004
...
Call Trace:
nilfs_transaction_begin+0x11c/0x1a0 [nilfs2]
wake_up_bit+0x20/0x20
copy_from_user+0x18/0x30 [nilfs2]
nilfs_ioctl_change_cpmode+0x7d/0xcf [nilfs2]
nilfs_ioctl+0x252/0x61a [nilfs2]
do_page_fault+0x311/0x34c
get_unmapped_area+0x132/0x14e
do_vfs_ioctl+0x44b/0x490
__set_task_blocked+0x5a/0x61
vm_mmap_pgoff+0x76/0x87
__set_current_blocked+0x30/0x4a
sys_ioctl+0x4b/0x6f
system_call_fastpath+0x16/0x1b
thaw D ffff88013870d890 0 1352 1351 0x00000004
...
Call Trace:
rwsem_down_failed_common+0xdb/0x10f
call_rwsem_down_write_failed+0x13/0x20
down_write+0x25/0x27
thaw_super+0x13/0x9e
do_vfs_ioctl+0x1f5/0x490
vm_mmap_pgoff+0x76/0x87
sys_ioctl+0x4b/0x6f
filp_close+0x64/0x6c
system_call_fastpath+0x16/0x1b
where the thaw ioctl deadlocked at thaw_super() when called while chcp was
waiting at nilfs_transaction_begin() called from
nilfs_ioctl_change_cpmode(). This deadlock is 100% reproducible.
This is because nilfs_ioctl_change_cpmode() first locks sb->s_umount in
read mode and then waits for unfreezing in nilfs_transaction_begin(),
whereas thaw_super() locks sb->s_umount in write mode. The locking of
sb->s_umount here was intended to make snapshot mounts and the downgrade
of snapshots to checkpoints exclusive.
This fixes the deadlock issue by replacing the sb->s_umount usage in
nilfs_ioctl_change_cpmode() with a dedicated mutex which protects snapshot
mounts.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The checkpoint deletion ioctl (rmcp ioctl) has potential for breaking
snapshot because it is not fully exclusive with checkpoint mode change
ioctl (chcp ioctl).
The rmcp ioctl first tests if the specified checkpoint is a snapshot or
not within nilfs_cpfile_delete_checkpoint function, and then calls
nilfs_cpfile_delete_checkpoints function to actually invalidate the
checkpoint only if it's not a snapshot. However, the checkpoint can be
changed into a snapshot by the chcp ioctl between these two operations.
In that case, calling nilfs_cpfile_delete_checkpoints() wrongly
invalidates the snapshot, which leads to snapshot list corruption and
snapshot count mismatch.
This fixes the issue by changing nilfs_cpfile_delete_checkpoints() so
that it reconfirms the target checkpoints are snapshot or not.
This second check is exclusive with the chcp operation since it is
protected by an existing semaphore.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->delete_inode(), ->write_super_lockfs(), ->unlockfs() are gone so remove
references to them in the NTFS code. Noticed while cleaning up the
fsfreeze mess.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On minix2 and minix3 usually max_size is 7fffffff and the check in
question prohibits creation of last block spanning right before 7fffffff,
due to downward rounding during the division. Fix it by using
multiplication instead.
[akpm@linux-foundation.org: fix up code layout, use local `sb']
Signed-off-by: Vladimir Serbinenko <phcoder@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert ext4_count_free() to use memweight() instead of table lookup
based counting clear bits implementation. This change only affects the
code segments enabled by EXT4FS_DEBUG.
Note that this memweight() call can't be replaced with a single
bitmap_weight() call, although the pointer to the memory area is aligned
to long-word boundary. Because the size of the memory area may not be a
multiple of BITS_PER_LONG, then it returns wrong value on big-endian
architecture.
This also includes the following change.
- Remove unnecessary map == NULL check in ext4_count_free() which
always takes non-null pointer as the memory area.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>