Commit Graph

29224 Commits

Author SHA1 Message Date
Miao Xie
de6c4115a2 Btrfs: fix unnecessary while loop when searching the free space cache
When we find a bitmap free space entry, we may check whether the previous
extent entry covers the requested offset. But if that previous entry turns out
to be a bitmap entry as well, we keep checking the entry before it in a while
loop. This is unnecessary: an extent entry that sits in front of a bitmap entry
can never cover an offset that lies after that bitmap entry.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:33 -05:00
Josef Bacik
de1ee92ac3 Btrfs: recheck bio against block device when we map the bio
Alex reported a problem where we were writing across chunk boundaries on an
rbd device.  The thing is we do bio_add_page() using logical offsets, but the
physical offset may be different.  So when we map the bio, now check whether
the bio is still valid for the physical offset; if it is not, split the bio up
and redo the bio_add_page() with the physical sector.  This fixes the problem
for Alex and doesn't affect performance in the normal case.  Thanks,

Reported-and-tested-by: Alex Elder <elder@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:32 -05:00
Miao Xie
08e007d2e5 Btrfs: improve the noflush reservation
In some places (such as when evicting an inode), we cannot flush the reserved
space of delalloc, but flushing the delayed directory index and delayed inode
is fine.  However, we didn't try to flush those things and just gave up when
there was not enough space to reserve.  This patch fixes that.

We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we are inside a transaction, we must not flush anything or a deadlock
could occur, so use NO_FLUSH.  If flushing the reserved space of delalloc
would cause a deadlock, use FLUSH_LIMIT.  In all other cases FLUSH_ALL is
used, and we flush everything.
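
The three modes map onto an enum roughly like the following (a sketch; the
names in the patch may differ slightly):

    enum btrfs_reserve_flush_enum {
            BTRFS_RESERVE_NO_FLUSH,         /* inside a transaction: flush nothing */
            BTRFS_RESERVE_FLUSH_LIMIT,      /* delalloc flush could deadlock:
                                               flush only delayed items */
            BTRFS_RESERVE_FLUSH_ALL,        /* safe context: flush everything */
    };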

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:31 -05:00
Miao Xie
561c294d4c Btrfs: fix wrong comment in can_overcommit()
The comment does not match the code. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:30 -05:00
Miao Xie
3fed40cc97 Btrfs: cleanup duplicated division functions
div_factor{_fine} has been implemented twice; clean that up.
Also move the helpers into an independent file named math.h,
because they are common math functions.
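
For reference, a sketch of the kind of helper being consolidated (the usual
btrfs definition of div_factor()):

    /* returns num * factor / 10, e.g. div_factor(x, 8) is 80% of x */
    static inline u64 div_factor(u64 num, int factor)
    {
            if (factor == 10)
                    return num;
            num *= factor;
            do_div(num, 10);
            return num;
    }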

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:30 -05:00
Linus Torvalds
684c9aaebb vfs: fix O_DIRECT read past end of block device
The direct-IO write path already had the i_size checks in mm/filemap.c,
but it turns out the read path did not, and removing the block size
checks in fs/block_dev.c (commit bbec0270bd: "blkdev_max_block: make
private to fs/buffer.c") removed the magic "shrink IO to past the end of
the device" code there.

Fix it by truncating the IO to the size of the block device, like the
write path already does.
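
A minimal sketch of the clamp described above (not the literal patch):

    if (S_ISBLK(inode->i_mode)) {
            loff_t size = i_size_read(inode);   /* size of the block device */

            if (pos >= size)
                    return 0;                   /* read starts past the end */
            if (count > size - pos)
                    count = size - pos;         /* shrink the IO to the device */
    }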

NOTE! I suspect the write path would be *much* better off doing it this
way in fs/block_dev.c, rather than hidden deep in mm/filemap.c.  The
mm/filemap.c code is extremely hard to follow, and has various
conditionals on the target being a block device (ie the flag passed in
to 'generic_write_checks()', along with a conditional update of the
inode timestamp etc).

It is also quite possible that we should treat this whole block device
size as a "s_maxbytes" issue, and try to make the logic even more
generic.  However, in the meantime this is the fairly minimal targeted
fix.

Noted by Milan Broz thanks to a regression test for the cryptsetup
reencrypt tool.

Reported-and-tested-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-08 08:28:26 -08:00
Dan Carpenter
27d7c2a006 vfs: clear to the end of the buffer on partial buffer reads
READ is zero so the "rw & READ" test is always false.  The intended test
was "((rw & RW_MASK) == READ)".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-05 10:32:59 -08:00
Linus Torvalds
57302e0ddf vfs: avoid "attempt to access beyond end of device" warnings
The block device access simplification that avoided accessing the (racy)
block size information (commit bbec0270bd: "blkdev_max_block: make
private to fs/buffer.c") no longer checks the maximum block size in the
block mapping path.

That was _almost_ as simple as just removing the code entirely, because
the readers and writers all check the size of the device anyway, so
under normal circumstances it "just worked".

However, the block size may be such that the end of the device may
straddle one single buffer_head.  At which point we may still want to
access the end of the device, but the buffer we use to access it
partially extends past the end.

The 'bd_set_size()' function intentionally sets the block size to avoid
this, but mounting the device - or setting the block size by hand to
some other value - can modify that block size.

So instead, teach 'submit_bh()' about the special case of the buffer
head straddling the end of the device, and turning such an access into a
smaller IO access, avoiding the problem.

This, btw, also means that unlike before, we can now access the whole
device regardless of device block size setting.  So now, even if the
device size is only 512-byte aligned, we can read and write even the
last sector even when having a much bigger block size for accessing the
rest of the device.

So with this, we could now get rid of the 'bd_set_size()' block size
code entirely - resulting in faster IO for the common case - but that
would be a separate patch.
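
A rough sketch of that clamp, using the 3.7-era bio fields (bi_sector,
bi_size); not the literal patch:

    sector_t maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9;

    if (maxsector && bio->bi_sector + (bio->bi_size >> 9) > maxsector) {
            /* the buffer straddles the end of the device: truncate the
             * IO so it stops at the last real sector */
            bio->bi_size = (maxsector - bio->bi_sector) << 9;
    }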

Reported-and-tested-by: Romain Francoise <romain@orebokech.com>
Reported-and-tested-by: Meelis Roos <mroos@linux.ee>
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-04 08:25:11 -08:00
Linus Torvalds
d3594ea2b3 Merge branch 'block-dev'
Merge 'block-dev' branch.

I was going to just mark everything here for stable and leave it to the
3.8 merge window, but having decided on doing another -rc, I might as
well merge it now.

This removes the bd_block_size_semaphore semaphore that was added in
this release to fix a race condition between block size changes and
block IO, and replaces it with atomicity guarantees in fs/buffer.c
instead, along with simplifying fs/block_dev.c.

This removes more lines than it adds, makes the code generally simpler,
and avoids the latency/rt issues that the block size semaphore
introduced for mount.

I'm not happy with the timing, but it wouldn't be much better doing this
during the merge window and then having some delayed back-port of it
into stable.

* block-dev:
  blkdev_max_block: make private to fs/buffer.c
  direct-io: don't read inode->i_blkbits multiple times
  blockdev: remove bd_block_size_semaphore again
  fs/buffer.c: make block-size be per-page and protected by the page lock
2012-12-03 10:53:25 -08:00
Linus Torvalds
331fee3cd3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A bunch of fixes; the last one is this cycle regression, the rest are
  -stable fodder."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix off-by-one in argument passed by iterate_fd() to callbacks
  lookup_one_len: don't accept . and ..
  cifs: get rid of blind d_drop() in readdir
  nfs_lookup_revalidate(): fix a leak
  don't do blind d_drop() in nfs_prime_dcache()
2012-12-01 13:29:55 -08:00
Linus Torvalds
086486e46e Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Two low risk, small fixes, that fix cifs regressions introduced in
  3.7."

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Fix wrong buffer pointer usage in smb_set_file_info
  cifs: fix writeback race with file that is growing
2012-11-30 16:57:18 -08:00
Al Viro
a77cfcb429 fix off-by-one in argument passed by iterate_fd() to callbacks
Noticed by Pavel Roskin; the thing in his patch I disagree with
was compensating for that shite in callbacks instead of fixing
it once in the iterator itself.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 23:01:30 -05:00
Al Viro
21d8a15ac3 lookup_one_len: don't accept . and ..
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 22:17:21 -05:00
Al Viro
0903a0c849 cifs: get rid of blind d_drop() in readdir
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 22:11:06 -05:00
Al Viro
c44600c9d1 nfs_lookup_revalidate(): fix a leak
We are leaking fattr and fhandle if we decide that the dentry is not to
be invalidated after all (e.g. it happens to be a mountpoint).  Just
free both before that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 22:04:36 -05:00
Al Viro
696199f8cc don't do blind d_drop() in nfs_prime_dcache()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 22:00:51 -05:00
Linus Torvalds
bbec0270bd blkdev_max_block: make private to fs/buffer.c
We really don't want to look at the block size for the raw block device
accesses in fs/block_dev.c, because it may be changing from under us.
So get rid of the max_block logic entirely, since the caller should
already have done it anyway.

That leaves the only user of this function in fs/buffer.c, so move the
whole function there and make it static.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-29 17:48:12 -08:00
Linus Torvalds
ab73857e35 direct-io: don't read inode->i_blkbits multiple times
Since directio can work on a raw block device, and the block size of the
device can change under it, we need to do the same thing that
fs/buffer.c now does: read the block size a single time, using
ACCESS_ONCE().

Reading it multiple times can get different results, which will then
confuse the code because it actually encodes the i_blksize in
relationship to the underlying logical blocksize.
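
The pattern is simply (a sketch):

    /* sample the block size once; a concurrent change can no longer yield
     * two different values within the same direct-IO request */
    unsigned int blkbits = ACCESS_ONCE(inode->i_blkbits);
    unsigned int blocksize_mask = (1 << blkbits) - 1;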

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-29 12:38:44 -08:00
Linus Torvalds
1e8b33328a blockdev: remove bd_block_size_semaphore again
This reverts the block-device direct access code to the previous
unlocked code, now that fs/buffer.c no longer needs external locking.

With this, fs/block_dev.c is back to the original version, apart from a
whitespace cleanup that I didn't want to revert.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-29 10:52:19 -08:00
Linus Torvalds
45bce8f3e3 fs/buffer.c: make block-size be per-page and protected by the page lock
This makes the buffer size handling be a per-page thing, which allows us
to not have to worry about locking too much when changing the buffer
size.  If a page doesn't have buffers, we still need to read the block
size from the inode, but we can do that with ACCESS_ONCE(), so that even
if the size is changing, we get a consistent value.

This doesn't convert all functions - many of the buffer functions are
used purely by filesystems, which in turn results in the buffer size
being fixed at mount-time.  So they don't have the same consistency
issues that the raw device access can have.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-29 10:47:20 -08:00
Pavel Shilovsky
c772aa92b6 CIFS: Fix wrong buffer pointer usage in smb_set_file_info
Commit 6bdf6dbd66 caused a regression
in the setattr codepath that leads to files with wrong attributes.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-11-28 10:02:46 -06:00
Jeff Layton
3a98b86143 cifs: fix writeback race with file that is growing
Commit eddb079deb created a regression in the writepages codepath.
Previously, whenever it needed to check the size of the file, it did so
by consulting the inode->i_size field directly. With that patch, the
i_size was fetched once on entry into the writepages code and that value
was used henceforth.

If the file is changing size though (for instance, if someone is writing
to it or has truncated it), then that value is likely to be wrong. This
can lead to data corruption. Pages past the EOF at the time that the
writepages call was issued may be silently dropped and ignored because
cifs_writepages wrongly assumes that the file must have been truncated
in the interim.

Fix cifs_writepages to fetch the size from the inode->i_size field each
time instead, to properly account for this possibility.
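
The shape of the fix, as a sketch rather than the exact cifs code:

    /* consult the inode's current size every time, not a value sampled
     * once when writepages was entered */
    loff_t isize = i_size_read(mapping->host);

    if (page_offset(page) >= isize) {
            /* only now can we conclude the page really is past EOF */
    }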

Original bug report is here:

    https://bugzilla.kernel.org/show_bug.cgi?id=50991

Reported-and-Tested-by: Maxim Britov <ungifted01@gmail.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-11-27 13:46:12 -06:00
Linus Torvalds
2844a48706 Merge branch 'akpm' (Fixes from Andrew)
Merge misc fixes from Andrew Morton:
 "8 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (8 patches)
  futex: avoid wake_futex() for a PI futex_q
  watchdog: using u64 in get_sample_period()
  writeback: put unused inodes to LRU after writeback completion
  mm: vmscan: check for fatal signals iff the process was throttled
  Revert "mm: remove __GFP_NO_KSWAPD"
  proc: check vma->vm_file before dereferencing
  UAPI: strip the _UAPI prefix from header guards during header installation
  include/linux/bug.h: fix sparse warning related to BUILD_BUG_ON_INVALID
2012-11-26 18:33:33 -08:00
Linus Torvalds
87726c334b Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext3 regression fix from Jan Kara:
 "Fix an ext3 regression introduced during 3.7 merge window.  It leads
  to deadlock if you stress the filesystem in the right way (luckily
  only if blocksize < pagesize)."

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  jbd: Fix lock ordering bug in journal_unmap_buffer()
2012-11-26 17:42:07 -08:00
Jan Kara
4eff96dd52 writeback: put unused inodes to LRU after writeback completion
Commit 169ebd9013 ("writeback: Avoid iput() from flusher thread")
removed the iget-iput pair from inode writeback.  As a side effect, inodes
that are dirty during the iput_final() call won't ever be added to the inode
LRU (iput_final() doesn't add dirty inodes to the LRU, and later when the
inode is cleaned there's no one to add it there).  Thus inodes are
effectively unreclaimable until someone looks them up again.

The practical effect of this bug is limited by the fact that inodes are
pinned by a dentry for long enough that the inode gets cleaned.  But
still the bug can have nasty consequences leading up to OOM conditions
under certain circumstances.  The following can easily reproduce the
problem:

  for (( i = 0; i < 1000; i++ )); do
    mkdir $i
    for (( j = 0; j < 1000; j++ )); do
      touch $i/$j
      echo 2 > /proc/sys/vm/drop_caches
    done
  done

then one needs to run 'sync; ls -lR' to make inodes reclaimable again.

We fix the issue by inserting unused clean inodes into the LRU after
writeback finishes in inode_sync_complete().
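
A sketch of the idea (the check shown here is folded into a helper in the
actual patch, so details may differ):

    /* in inode_sync_complete(), once I_SYNC has been cleared: put a clean,
     * unused inode back on the LRU so memory reclaim can find it again */
    if (!(inode->i_state & (I_DIRTY | I_SYNC | I_FREEING | I_WILL_FREE)) &&
        !atomic_read(&inode->i_count))
            inode_add_lru(inode);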

Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>		[3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-26 17:41:24 -08:00
Stanislav Kinsbursky
05f564849d proc: check vma->vm_file before dereferencing
Commit 7b540d0646 ("proc_map_files_readdir(): don't bother with
grabbing files") switched proc_map_files_readdir() to use @f_mode
directly instead of grabbing a @file reference, but at the same time the
test for @vm_file presence was lost, leading to a NULL dereference.  The
patch brings the test back.
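
The restored test is just a NULL check before the dereference (sketch):

    file = vma->vm_file;
    if (!file)
            continue;       /* this mapping has no backing file: skip it */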

The whole proc_map_files feature is wrapped in CONFIG_CHECKPOINT_RESTORE
(which is set to 'n' by default), so the bug doesn't affect regular
kernels.

The regression is 3.7-rc1 only as far as I can tell.

[gorcunov@openvz.org: provided changelog]
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-26 17:41:24 -08:00
Linus Torvalds
35f95d228e Merge tag 'for-linus-20121123' of git://git.infradead.org/mtd-2.6
Pull MTD fixes from David Woodhouse:
 "The most important part of this is that it fixes a regression in
  Samsung NAND chip detection, introduced by some rework which went into
  3.7.  The initial fix wasn't quite complete, so it's in two parts.  In
  fact the first part is committed twice (Artem committed his own copy
  of the same patch) and I've merged Artem's tree into mine which
  already had that fix.

  I'd have recommitted that to make it somewhat cleaner, but figured by
  this point in the release cycle it was better to merge *exactly* the
  commits which have been in linux-next.

  If I'd recommitted, I'd also omit the sparse warning fix.  But it's
  there, and it's harmless — just marking one function as 'static' in
  onenand code.

  This also includes a couple more fixes for stable: an AB-BA deadlock
  in JFFS2, and an invalid range check in slram."

* tag 'for-linus-20121123' of git://git.infradead.org/mtd-2.6:
  mtd: nand: fix Samsung SLC detection regression
  mtd: nand: fix Samsung SLC NAND identification regression
  jffs2: Fix lock acquisition order bug in jffs2_write_begin
  mtd: onenand: Make flexonenand_set_boundary static
  mtd: slram: invalid checking of absolute end address
  mtd: ofpart: Fix incorrect NULL check in parse_ofoldpart_partitions()
  mtd: nand: fix Samsung SLC NAND identification regression
2012-11-23 15:12:17 -10:00
Jan Kara
25389bb207 jbd: Fix lock ordering bug in journal_unmap_buffer()
Commit 09e05d48 introduced a wait for transaction commit into
journal_unmap_buffer() in the case where we are truncating a buffer undergoing
commit in a page straddling i_size on a filesystem with blocksize < pagesize.
Sadly we forgot to drop the buffer lock before waiting for the transaction
commit, and thus a deadlock is possible when kjournald wants to lock the buffer.

Fix the problem by dropping the buffer lock before waiting for the transaction
commit.  Since we are still holding the page lock (and that is OK), the buffer
cannot disappear under us.
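
A sketch of the reordering (simplified; the jbd locks that also have to be
released are elided):

    get_bh(bh);                     /* keep the buffer pinned across the wait */
    unlock_buffer(bh);              /* let kjournald take the buffer lock */
    log_wait_commit(journal, tid);
    put_bh(bh);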

CC: stable@vger.kernel.org # Wherever commit 09e05d48 was taken
Signed-off-by: Jan Kara <jack@suse.cz>
2012-11-23 15:17:18 +01:00
Linus Torvalds
ca6215dfc7 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull reiserfs and ext3 fixes from Jan Kara:
 "Fixes of reiserfs deadlocks when quotas are enabled (locking there was
  completely busted by BKL conversion) and also one small ext3 fix in
  the trim interface."

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext3: Avoid underflow in ext3_trim_fs()
  reiserfs: Move quota calls out of write lock
  reiserfs: Protect reiserfs_quota_write() with write lock
  reiserfs: Protect reiserfs_quota_on() with write lock
  reiserfs: Fix lock ordering during remount
2012-11-20 18:48:25 -10:00
Lukas Czerner
ae49eeec78 ext3: Avoid underflow in ext3_trim_fs()
Currently, if the len argument in ext3_trim_fs() is smaller than one block,
the 'end' variable underflows.  Avoid that by returning EINVAL if len is
smaller than the file system block size.

Also remove a useless unlikely().
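
The guard described above amounts to something like:

    if (len < sb->s_blocksize)
            return -EINVAL;         /* a sub-block request would underflow 'end' */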

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2012-11-19 21:36:12 +01:00
Jan Kara
7af1168693 reiserfs: Move quota calls out of write lock
Calls into the high-level quota code cannot happen under the write lock.
These calls take dqio_mutex, which ranks above the write lock.  So drop the
write lock before calling back into the quota code.
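
The pattern applied in these paths looks like the following sketch, with
dquot_commit() standing in for any of the high-level quota calls:

    reiserfs_write_unlock(sb);      /* dqio_mutex ranks above the write lock */
    ret = dquot_commit(dquot);
    reiserfs_write_lock(sb);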

CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
2012-11-19 21:34:33 +01:00
Jan Kara
361d94a338 reiserfs: Protect reiserfs_quota_write() with write lock
Calls into the reiserfs journalling code and reiserfs_get_block() need to
be protected by the write lock.  We remove the write lock around calls to the
high-level quota code in the next patch, so these paths would suddenly become
unprotected.

CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
2012-11-19 21:34:33 +01:00
Jan Kara
b9e06ef2e8 reiserfs: Protect reiserfs_quota_on() with write lock
In reiserfs_quota_on() we do quite a bit of work - for example, unpacking
the tail of a quota file.  Thus we have to hold the write lock until the
moment we call back into the quota code.

CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
2012-11-19 21:34:32 +01:00
Jan Kara
3bb3e1fc47 reiserfs: Fix lock ordering during remount
When remounting reiserfs, dquot_suspend() or dquot_resume() can be called.
These functions take dqonoff_mutex, which ranks above the write lock, so we
have to drop the write lock before calling into the quota code.

CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
2012-11-19 21:34:32 +01:00
Al Viro
3587b1b097 fanotify: fix FAN_Q_OVERFLOW case of fanotify_read()
If the FAN_Q_OVERFLOW bit is set in event->mask, the fanotify event
metadata will not contain a valid file descriptor, but
copy_event_to_user() didn't check for that, and unconditionally did an
fd_install() on the file descriptor.

This in turn will cause a BUG_ON() in __fd_install().
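
The missing check amounts to (sketch):

    /* overflow events carry no real file descriptor, so don't install one */
    if (!(event->mask & FAN_Q_OVERFLOW))
            fd_install(fd, f);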

Introduced by commit 352e3b2492 ("fanotify: sanitize failure exits in
copy_event_to_user()")

Mea culpa - missed that path ;-/

Reported-by: Alex Shi <lkml.alex@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-18 09:30:00 -10:00
Linus Torvalds
8d938105e4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc VFS fixes from Al Viro:
 "Remove a bogus BUG_ON() that can trigger spuriously + alpha bits of
  do_mount() constification I'd missed during the merge window."

This pull request came in a week ago; I missed it for some reason.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill bogus BUG_ON() in do_close_on_exec()
  missing const in alpha callers of do_mount()
2012-11-18 09:13:48 -10:00
Linus Torvalds
d28d3730fd Merge tag 'for-linus-v3.7-rc7' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:

 - fix attr tree double split corruption

 - fix broken error handling in xfs_vm_writepage

 - drop buffer io reference when a bad bio is built

* tag 'for-linus-v3.7-rc7' of git://oss.sgi.com/xfs/xfs:
  xfs: drop buffer io reference when a bad bio is built
  xfs: fix broken error handling in xfs_vm_writepage
  xfs: fix attr tree double split corruption
2012-11-18 08:29:34 -10:00
Dave Chinner
d69043c42d xfs: drop buffer io reference when a bad bio is built
Error handling in xfs_buf_ioapply_map() does not handle IO reference
counts correctly. We increment the b_io_remaining count before
building the bio, but then fail to decrement it in the failure case.
This leads to the buffer never running IO completion and releasing
the reference that the IO holds, so at unmount we can leak the
buffer. This leak is captured by this assert failure during unmount:

XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273

This is not a new bug - the b_io_remaining accounting has had this
problem for a long, long time - it's just very hard to get a
zero length bio being built by this code...

Further, the buffer IO error can be overwritten on a multi-segment
buffer by subsequent bio completions for partial sections of the
buffer. Hence we should only set the buffer error status if the
buffer is not already carrying an error status. This ensures that a
partial IO error on a multi-segment buffer will not be lost. This
part of the problem is a regression, however.
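
A sketch of the two parts of the fix in the bio-build failure path (field and
helper names as in 3.7-era XFS; not the literal patch):

    /* drop the IO reference taken before we tried to build the bio */
    atomic_dec(&bp->b_io_remaining);

    /* don't clobber an error already recorded by an earlier bio */
    if (!bp->b_error)
            xfs_buf_ioerror(bp, EIO);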

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-17 09:36:57 -06:00
Dave Chinner
3daed8bc3e xfs: fix broken error handling in xfs_vm_writepage
When we shut down the filesystem, it might first be detected in
writeback when we are allocating an inode size transaction.  This
happens after we have moved all the pages into the writeback state
and unlocked them.  Unfortunately, if we fail to set up the
transaction we then abort writeback and try to invalidate the
current page.  This then triggers a BUG() in block_invalidatepage()
because we are trying to invalidate an unlocked page.

Fixing this is a bit of a chicken and egg problem - we can't
allocate the transaction until we've clustered all the pages into
the IO and we know the size of it (i.e. whether the last block of
the IO is beyond the current EOF or not). However, we don't want to
hold pages locked for long periods of time, especially while we lock
other pages to cluster them into the write.

To fix this, we need to make a clear delineation in writeback where
errors can only be handled by IO completion processing. That is,
once we have marked a page for writeback and unlocked it, we have to
report errors via IO completion because we've already started the
IO. We may not have submitted any IO, but we've changed the page
state to indicate that it is under IO so we must now use the IO
completion path to report errors.

To do this, add an error field to xfs_submit_ioend() to pass it the
error that occurred during the building of the ioend chain. When
this is non-zero, mark each ioend with the error and call
xfs_finish_ioend() directly rather than building bios. This will
immediately push the ioends through completion processing with the
error that has occurred.
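
A sketch of that shape (not the literal patch):

    /* in xfs_submit_ioend(), with 'fail' being the error from building
     * the ioend chain */
    if (fail) {
            ioend->io_error = fail;
            xfs_finish_ioend(ioend);        /* straight to completion, no bios */
            continue;
    }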

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-17 09:35:42 -06:00
Dave Chinner
42e2976f13 xfs: fix attr tree double split corruption
In certain circumstances, a double split of an attribute tree is
needed to insert or replace an attribute. In rare situations, this
can go wrong, leaving the attribute tree corrupted. In this case,
the attr being replaced is the last attr in a leaf node, and the
replacement is larger so it doesn't fit in the same leaf node.

We start with the initial condition of a node format attribute
btree with two leaves at index 1 and 2. Call them L1 and L2.  The
leaf L1 is completely full, there is not a single byte of free space
in it. L2 is mostly empty.  The attribute being replaced - call it X
- is the last attribute in L1.

The way an attribute replace is executed is that the replacement
attribute - call it Y - is first inserted into the tree, but has an
INCOMPLETE flag set on it so that list traversals ignore it. Once
this transaction is committed, a second transaction is run to
atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
traversal will now find Y and skip X. Once that transaction is
committed, attribute X is then removed.

So, the initial condition is:

     +--------+     +--------+
     |   L1   |     |   L2   |
     | fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |
     | fsp: 0 |     | fsp: N |
     |--------|     |--------|
     | attr A |     | attr 1 |
     |--------|     |--------|
     | attr B |     | attr 2 |
     |--------|     |--------|
     ..........     ..........
     |--------|     |--------|
     | attr X |     | attr n |
     +--------+     +--------+

So now we go to replace X, and see that L1:fsp = 0 - it is full so
we can't insert Y in the same leaf. So we record the location of
attribute X so we can track it for later use, then we split L1 into
L1 and L3 and rebalance across the two leaves. We end with:

     +--------+     +--------+     +--------+
     |   L1   |     |   L3   |     |   L2   |
     | fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|
     | attr A |     | attr X |     | attr 1 |
     |--------|     +--------+     |--------|
     | attr B |                    | attr 2 |
     |--------|                    |--------|
     ..........                    ..........
     |--------|                    |--------|
     | attr W |                    | attr n |
     +--------+                    +--------+

And we track that the original attribute is now at L3:0.

We then try to insert Y into L1 again, and find that there isn't
enough room because the new attribute is larger than the old one.
Hence we have to split again to make room for Y. We end up with
this:

     +--------+     +--------+     +--------+     +--------+
     |   L1   |     |   L4   |     |   L3   |     |   L2   |
     | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|     |--------|
     | attr A |     | attr Y |     | attr X |     | attr 1 |
     |--------|     + INCOMP +     +--------+     |--------|
     | attr B |     +--------+                    | attr 2 |
     |--------|                                   |--------|
     ..........                                   ..........
     |--------|                                   |--------|
     | attr W |                                   | attr n |
     +--------+                                   +--------+

And now we have the new (incomplete) attribute @ L4:0, and the
original attribute at L3:0. At this point, the first transaction is
committed, and we move to the flipping of the flags.

This is where we are supposed to end up with this:

     +--------+     +--------+     +--------+     +--------+
     |   L1   |     |   L4   |     |   L3   |     |   L2   |
     | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|     |--------|
     | attr A |     | attr Y |     | attr X |     | attr 1 |
     |--------|     +--------+     + INCOMP +     |--------|
     | attr B |                    +--------+     | attr 2 |
     |--------|                                   |--------|
     ..........                                   ..........
     |--------|                                   |--------|
     | attr W |                                   | attr n |
     +--------+                                   +--------+

But that doesn't happen properly - the attribute tracking indexes
are not pointing to the right locations. What we end up with is both
the old attribute to be removed pointing at L4:0 and the new
attribute at L4:1.  On a debug kernel, this assert fails like so:

XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725

because the new attribute location does not exist. On a production
kernel, this goes unnoticed and the code proceeds ahead merrily and
removes L4 because it thinks that is the block that is no longer
needed. This leaves the hash index node pointing to entries
L1, L4 and L2, but only blocks L1, L3 and L2 to exist. Further, the
leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free
space, and so everything is busted. This corruption is caused by the
removal of the old attribute triggering a join - it joins everything
correctly but then frees the wrong block.

xfs_repair will report something like:

bad sibling back pointer for block 4 in attribute fork for inode 131
problem with attribute contents in inode 131
would clear attr fork
bad nblocks 8 for inode 131, would reset to 3
bad anextents 4 for inode 131, would reset to 0

The problem lies in the assignment of the old/new blocks for
tracking purposes when the double leaf split occurs. The first split
tries to place the new attribute inside the current leaf (i.e.
"inleaf == true") and moves the old attribute (X) to the new block.
This sets up the old block/index to L1:X, and newly allocated
block to L3:0. It then moves attr X to the new block and tries to
insert attr Y at the old index. That fails, so it splits again.

With the second split, the rebalance ends up placing the new attr in
the second new block - L4:0 - and this is where the code goes wrong.
What it does is set both the new and old block indexes to the
second new block. Hence it inserts attr Y at the right place (L4:0)
but overwrites the current location of the attr to be replaced that is
held in the new block index (currently L3:0). It overwrites it with
L4:1 - the index we later assert-fail on.

Hopefully this table will show this in a format that is a bit easier
to understand:

Split		old attr index		new attr index
		vanilla	patched		vanilla	patched
before 1st	L1:26	L1:26		N/A	N/A
after 1st	L3:0	L3:0		L1:26	L1:26
after 2nd	L4:0	L3:0		L4:1	L4:0
                ^^^^			^^^^
		wrong			wrong

The fix is surprisingly simple, for all this analysis - just stop
the rebalance on the out-of-leaf case from overwriting the new attr
index - it's already correct for the double split case.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-17 09:34:13 -06:00
David Rientjes
fa0cbbf145 mm, oom: reintroduce /proc/pid/oom_adj
This is mostly a revert of 01dc52ebdf ("oom: remove deprecated oom_adj")
from Davidlohr Bueso.

It reintroduces /proc/pid/oom_adj for backwards compatibility with earlier
kernels.  It simply scales the value linearly when /proc/pid/oom_score_adj
is written.

The major difference is that its scheduled removal is no longer included
in Documentation/feature-removal-schedule.txt.  We do warn users with a
single printk, though, to suggest the more powerful and supported
/proc/pid/oom_score_adj interface.
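
The linear scaling is roughly the following sketch, using the historical
constants OOM_ADJUST_MAX = 15, OOM_DISABLE = -17 and OOM_SCORE_ADJ_MAX = 1000:

    /* map the legacy [-17, 15] oom_adj range onto [-1000, 1000] */
    if (oom_adj == OOM_ADJUST_MAX)
            oom_score_adj = OOM_SCORE_ADJ_MAX;              /* +15 -> +1000 */
    else
            oom_score_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;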

Reported-by: Artem S. Tashkinov <t.artem@lycos.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16 10:15:35 -08:00
Linus Torvalds
ce95a36bb9 Merge tag 'upstream-3.7-rc6' of git://git.infradead.org/linux-ubifs
Pull UBIFS fixes from Artem Bityutskiy:
 "Two patches which fix a problem reported by several people in the
  past, but only fixed now because no one gave enough material for
  debugging.

  Anyway, these fix the problem that sometimes after a power cut the
  file-system is not mountable with the following symptom:

	grab_empty_leb: could not find an empty LEB

  The fixes make the file-system mountable again."

* tag 'upstream-3.7-rc6' of git://git.infradead.org/linux-ubifs:
  UBIFS: fix mounting problems after power cuts
  UBIFS: introduce categorized lprops counter
2012-11-15 11:28:43 -08:00
Colin Ian King
70a6f46d7b pstore: Fix NULL pointer dereference in console writes
Passing a NULL id causes a NULL pointer dereference in writers such as
erst_writer and efi_pstore_write because they expect to update this id.
Pass a dummy id instead.

This avoids a cascade of oopses caused when the initial
pstore_console_write passes a NULL, which in turn causes writes to the
console, causing further oopses in subsequent pstore_console_write calls.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
2012-11-14 18:30:21 -08:00
Al Viro
5a8477660d kill bogus BUG_ON() in do_close_on_exec()
It can be legitimately triggered via procfs access.  Now, at least
2 of 3 of get_files_struct() callers in procfs are useless, but
when and if we get rid of those we can always add WARN_ON() here.
BUG_ON() at that spot is simply wrong.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-12 01:19:02 -05:00
Linus Torvalds
affd9a8dbc Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Jeff Layton.

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Do not lookup hashed negative dentry in cifs_atomic_open
  cifs: fix potential buffer overrun in cifs.idmap handling code
2012-11-10 06:59:35 +01:00
Thomas Betker
5ffd3412ae jffs2: Fix lock acquisition order bug in jffs2_write_begin
jffs2_write_begin() first acquires the page lock, then f->sem. This
causes an AB-BA deadlock with jffs2_garbage_collect_live(), which first
acquires f->sem, then the page lock:

jffs2_garbage_collect_live
    mutex_lock(&f->sem)                         (A)
    jffs2_garbage_collect_dnode
        jffs2_gc_fetch_page
            read_cache_page_async
                do_read_cache_page
                    lock_page(page)             (B)

jffs2_write_begin
    grab_cache_page_write_begin
        find_lock_page
            lock_page(page)                     (B)
    mutex_lock(&f->sem)                         (A)

We fix this by restructuring jffs2_write_begin() to take f->sem before
the page lock. However, we make sure that f->sem is not held when
calling jffs2_reserve_space(), as this is not permitted by the locking
rules.
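
A sketch of the resulting order in jffs2_write_begin() (simplified; the real
restructuring also keeps jffs2_reserve_space() outside f->sem):

    /* take f->sem first, then the page lock - the same order the GC path
     * uses, so the AB-BA cycle is gone */
    mutex_lock(&f->sem);
    pg = grab_cache_page_write_begin(mapping, index, flags);
    if (!pg) {
            mutex_unlock(&f->sem);
            return -ENOMEM;
    }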

The deadlock above was observed multiple times on an SoC with a dual
ARMv7 (Cortex-A9), running the long-term 3.4.11 kernel; it occurred
when using scp to copy files from a host system to the ARM target
system. The fix was heavily tested on the same target system.

Cc: stable@vger.kernel.org
Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Acked-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-11-09 17:02:50 +02:00
Linus Torvalds
ce6d841e9c Merge branch 'akpm' (Fixes from Andrew)
Merge misc fixes from Andrew Morton:
 "Five fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (5 patches)
  h8300: add missing L1_CACHE_SHIFT
  mm: bugfix: set current->reclaim_state to NULL while returning from kswapd()
  fanotify: fix missing break
  revert "epoll: support for disabling items, and a self-test app"
  checkpatch: improve network block comment style checking
2012-11-09 06:53:02 +01:00
Linus Torvalds
a601e63717 Merge tag 'for-linus-v3.7-rc5' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:

 - fix for large transactions spanning multiple iclog buffers

 - zero the allocation_args structure on the stack before using it to
   determine whether to use a worker for allocation

 - move allocation stack switch to xfs_bmapi_allocate in order to
   prevent deadlock on AGF buffers

 - growfs no longer reads in garbage for new secondary superblocks

 - silence a build warning

 - ensure that invalid buffers never get written to disk while on free
   list

 - don't vmap inode cluster buffers during free

 - fix buffer shutdown reference count mismatch

 - fix reading of wrapped log data

* tag 'for-linus-v3.7-rc5' of git://oss.sgi.com/xfs/xfs:
  xfs: fix reading of wrapped log data
  xfs: fix buffer shudown reference count mismatch
  xfs: don't vmap inode cluster buffers during free
  xfs: invalidate allocbt blocks moved to the free list
  xfs: silence uninitialised f.file warning.
  xfs: growfs: don't read garbage for new secondary superblocks
  xfs: move allocation stack switch up to xfs_bmapi_allocate
  xfs: introduce XFS_BMAPI_STACK_SWITCH
  xfs: zero allocation_args on the kernel stack
  xfs: only update the last_sync_lsn when a transaction completes
2012-11-09 06:42:51 +01:00
Eric Paris
848561d368 fanotify: fix missing break
Anders Blomdell noted in 2010 that Fanotify lost events and provided a
test case.  Eric Paris confirmed it was a bug and posted a fix to the
list

  https://groups.google.com/forum/?fromgroups=#!topic/linux.kernel/RrJfTfyW2BE

but never applied it.  Repeated attempts over time to actually get him
to apply it have never had a reply from anyone who has raised it.

So apply it anyway.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Reported-by: Anders Blomdell <anders.blomdell@control.lth.se>
Cc: Eric Paris <eparis@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-09 06:41:47 +01:00
Andrew Morton
a80a6b85b4 revert "epoll: support for disabling items, and a self-test app"
Revert commit 03a7beb55b ("epoll: support for disabling items, and a
self-test app") pending resolution of the issues identified by Michael
Kerrisk, copied below.

We'll revisit this for 3.8.

: I've taken a look at this patch as it currently stands in 3.7-rc1, and
: done a bit of testing. (By the way, the test program
: tools/testing/selftests/epoll/test_epoll.c does not compile...)
:
: There are one or two places where the behavior seems a little strange,
: so I have a question or two at the end of this mail. But other than
: that, I want to check my understanding so that the interface can be
: correctly documented.
:
: Just to go though my understanding, the problem is the following
: scenario in a multithreaded application:
:
: 1. Multiple threads are performing epoll_wait() operations,
:    and maintaining a user-space cache that contains information
:    corresponding to each file descriptor being monitored by
:    epoll_wait().
:
: 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
:    a file descriptor from the epoll interest list, and
:    delete the corresponding record from the user-space cache.
:
: 3. The problem with (2) is that some other thread may have
:    previously done an epoll_wait() that retrieved information
:    about the fd in question, and may be in the middle of using
:    information in the cache that relates to that fd. Thus,
:    there is a potential race.
:
: 4. The race can't solved purely in user space, because doing
:    so would require applying a mutex across the epoll_wait()
:    call, which would of course blow thread concurrency.
:
: Right?
:
: Your solution is the EPOLL_CTL_DISABLE operation. I want to
: confirm my understanding about how to use this flag, since
: the description that has accompanied the patches so far
: has been a bit sparse
:
: 0. In the scenario you're concerned about, deleting a file
:    descriptor means (safely) doing the following:
:    (a) Deleting the file descriptor from the epoll interest list
:        using EPOLL_CTL_DEL
:    (b) Deleting the corresponding record in the user-space cache
:
: 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
:    conjunction with EPOLLONESHOT.
:
: 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
:    conjunction is a logical error.
:
: 3. The correct way to code multithreaded applications using
:    EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
:
:    a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
:       specify EPOLLONESHOT.
:
:    b. When a thread wants to delete a file descriptor, it
:       should do the following:
:
:       [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
:       [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
:           was zero, then the file descriptor can be safely
:           deleted by the thread that made this call.
:       [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
:           then the descriptor is in use. In this case, the calling
:           thread should set a flag in the user-space cache to
:           indicate that the thread that is using the descriptor
:           should perform the deletion operation.
:
: Is all of the above correct?
:
: The implementation depends on checking on whether
: (events & ~EP_PRIVATE_BITS) == 0
: This relies on the fact that EPOLL_CTL_ADD and EPOLL_CTL_MOD always
: set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
: causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
: cleared.
:
: A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
: is only useful in conjunction with EPOLLONESHOT. However, as things
: stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
: not have EPOLLONESHOT set in 'events'. This results in the following
: (slightly surprising) behavior:
:
: (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
:     (the indicator that the file descriptor can be safely deleted).
: (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
:
: This doesn't seem particularly useful, and in fact is probably an
: indication that the user made a logic error: they should only be using
: epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
: EPOLLONESHOT was set in 'events'. If that is correct, then would it
: not make sense to return an error to user space for this case?

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paton J. Lewis" <palewis@adobe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-09 06:41:46 +01:00