Commit Graph

27900 Commits

Author SHA1 Message Date
Pavel Shilovsky
093b2bdad3 CIFS: Make demultiplex_thread work with SMB2 code
Now we can process SMB2 messages: check message, get message id
and wakeup awaiting routines.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 21:54:52 +04:00
Pavel Shilovsky
4b1241006c CIFS: Fix a wrong pointer in atomic_open
Commit 30d9049474 caused a regression
in cifs open codepath.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 21:54:26 +04:00
Pavel Shilovsky
28ea5290d7 CIFS: Add SMB2 credits support
For SMB2 protocol we can add more than one credit for one received
request: it depends on CreditRequest field in SMB2 response header.
Also we divide all requests by type: echoes, oplocks and others.
Each type uses its own slot pull.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 10:25:23 -05:00
Pavel Shilovsky
2dc7e1c033 CIFS: Make transport routines work with SMB2
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 10:25:20 -05:00
Steve French
ddfbefbd39 CIFS: Map SMB2 status codes to POSIX errors
Add mapping table for 32 bit SMB2 status codes to linux errors.
Note that SMB2 does not use DOS/OS2 errors (ever) so mapping to
DOS/OS2 errors as a common network subset (as we do for cifs)
doesn't help. And note that the set of status codes is much more
complete here.

Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 10:25:15 -05:00
Pavel Shilovsky
b8030603d9 CIFS: Add SMB2 status codes
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 10:25:13 -05:00
Pavel Shilovsky
f7ec0d0bbc CIFS: Rename 7 error codes to NT_ style
and consider such codes as CIFS errors.

Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 10:25:10 -05:00
Pavel Shilovsky
6d5786a34d CIFS: Rename Get/FreeXid and make them work with unsigned int
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 10:25:08 -05:00
Pavel Shilovsky
2e6e02ab6d CIFS: Move protocol specific tcon/tdis code to ops struct
and rename variables around the code changes.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 10:25:06 -05:00
Pavel Shilovsky
58c45c58a1 CIFS: Move protocol specific session setup/logoff code to ops struct
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 10:25:03 -05:00
Cong Wang
2164d33446 pipe: remove KM_USER0 from comments
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-07-24 15:27:34 +08:00
Pavel Shilovsky
286170aa24 CIFS: Move protocol specific negotiate code to ops struct
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 00:33:26 -05:00
Pavel Shilovsky
a891f0f895 CIFS: Extend credit mechanism to process request type
Split all requests to echos, oplocks and others - each group uses
its own credit slot. This is indicated by new flags

CIFS_ECHO_OP and CIFS_OBREAK_OP

that are not used now for CIFS. This change is required to support
SMB2 protocol because of different processing of these commands.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 00:32:48 -05:00
Pavel Shilovsky
316cf94a91 CIFS: Move trans2 processing to ops struct
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-24 00:32:25 -05:00
Linus Torvalds
83c7f72259 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
 "Notable highlights:

   - iommu improvements from Anton removing the per-iommu global lock in
     favor of dividing the DMA space into pools, each with its own lock,
     and hashed on the CPU number.  Along with making the locking more
     fine grained, this gives significant improvements in multiqueue
     networking scalability.

   - Still from Anton, we know provide a vdso based variant of getcpu
     which makes sched_getcpu with the appropriate glibc patch something
     like 18 times faster.

   - More anton goodness (he's been busy !) in other areas such as a
     faster __clear_user and copy_page on P7, various perf fixes to
     improve sampling quality, etc...

   - One more step toward removing legacy i2c interfaces by using new
     device-tree based probing of platform devices for the AOA audio
     drivers

   - A nice series of patches from Michael Neuling that helps avoiding
     confusion between register numbers and litterals in assembly code,
     trying to enforce the use of "%rN" register names in gas rather
     than plain numbers.

   - A pile of FSL updates

   - The usual bunch of small fixes, cleanups etc...

  You may spot a change to drivers/char/mem.  The patch got no comment
  or ack from outside, it's a trivial patch to allow the architecture to
  skip creating /dev/port, which we use to disable it on ppc64 that
  don't have a legacy brige.  On those, IO ports 0...64K are not mapped
  in kernel space at all, so accesses to /dev/port cause oopses (and
  yes, distros -still- ship userspace that bangs hard coded ports such
  as kbdrate)."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
  powerpc/mpic: Create a revmap with enough entries for IPIs and timers
  Remove stale .rej file
  powerpc/iommu: Fix iommu pool initialization
  powerpc/eeh: Check handle_eeh_events() return value
  powerpc/85xx: Add phy nodes in SGMII mode for MPC8536/44/72DS & P2020DS
  powerpc/e500: add paravirt QEMU platform
  powerpc/mpc85xx_ds: convert to unified PCI init
  powerpc/fsl-pci: get PCI init out of board files
  powerpc/85xx: Update corenet64_smp_defconfig
  powerpc/85xx: Update corenet32_smp_defconfig
  powerpc/85xx: Rename P1021RDB-PC device trees to be consistent
  powerpc/watchdog: move booke watchdog param related code to setup-common.c
  sound/aoa: Adapt to new i2c probing scheme
  i2c/powermac: Improve detection of devices from device-tree
  powerpc: Disable /dev/port interface on systems without an ISA bridge
  of: Improve prom_update_property() function
  powerpc: Add "memory" attribute for mfmsr()
  powerpc/ftrace: Fix assembly trampoline register usage
  powerpc/hw_breakpoints: Fix incorrect pointer access
  powerpc: Put the gpr save/restore functions in their own section
  ...
2012-07-23 18:54:23 -07:00
Jeff Layton
7659624ffb cifs: reinstate sec=ntlmv2 mount option
sec=ntlmv2 as a mount option got dropped in the mount option overhaul.

Cc: Sachin Prabhu <sprabhu@redhat.com>
Cc: <stable@vger.kernel.org> # 3.4+
Reported-by: Günter Kukkukk <linux@kukkukk.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-23 20:48:24 -05:00
Linus Torvalds
93912fe69d * Added another debugfs knob for forcing UBIFS R/O mode without flushing caches
or finishing commit or any other I/O operation. I've originally added this
   knob in order to reproduce the free space fixup bug (see c672793) on nandsim.
   Without this knob I would have to do real power-cuts, which would make
   debugging much harder. Then I've decided to keep this knob because it is also
   useful for UBIFS power-cut recovery end error-paths testing.
 * Well-spotted fix from Julia. This bug did not cause real troubles for
   UBIFS, but nevertheless it could cause issues for someone trying to modify
   the orphans handling code. Kudos to coccinelle!
 * Minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQDQxSAAoJECmIfjd9wqK0W3QQAKCbDBNMJaj/idXcupatoOI2
 4M3IBJ5mwfjrBeXy1kcsxxSadW7fJTy6FK7jfPpYq2iCLmFkh0Ba6QSgHHgCmb1v
 3ExIR2+2hLl1YhliDU1q+k9H4fGDklq/CNgiTGDti0pdPczZfBKrksXQy8leqT8d
 MxzQDzQsCxVKIizHHMEr373s1y8QHhabFK1LU1wkXJK/p/bYS0mgD0BOjIrZKuCd
 W5L5XzIIbcXs1h91Lgdnr8fj5TmMtM1tQ890U2OaNPt+44Eh6Gjjku3N8RaWvaIe
 vXMpDNb00NwtJp0V2SIsez3VtaA1xJaW7bd4UcyNNvBy5aECNx8K0IbZKTUkiMSj
 3xOiqZr/LWpPkAq3NL6kAZkgiam9Aez5Yc+XTna3HP79L/LYbXsJiOewlDwvGbrh
 89+EatjLUVFh00BvzmmxuZkm2bWm3sPQiAKWqwc7ijMGwnPt+c6Eiepgw3XXVHgI
 qmw1Ddkh4VxdGXi8r3RMxg21MqFttM0astT5cmQeF//H6Wq3S2u8RFgZqaLpGc6s
 vcgx92V6BBrWzCZKxBQs3Bd21uSrVsOa/js5wORpuWZAGk1Yy1BFy1vFC0PuqmcI
 kvZa/bQPP3/PiHAI8qq2Min5baoG9jUl6ui1eM8KXECkzQSQJjxN1bth8kzZHuPY
 HMNoK0RVUt3V5xtYI7i7
 =ecyI
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.6-rc1' of git://git.infradead.org/linux-ubifs

Pull UBIFS updates from Artem Bityutskiy:

 - Added another debugfs knob for forcing UBIFS R/O mode without
   flushing caches or finishing commit or any other I/O operation.  I've
   originally added this knob in order to reproduce the free space fixup
   bug (see commit c6727932cfdb: "UBIFS: fix a bug in empty space
   fix-up") on nandsim.

   Without this knob I would have to do real power-cuts, which would
   make debugging much harder.  Then I've decided to keep this knob
   because it is also useful for UBIFS power-cut recovery end
   error-paths testing.

 - Well-spotted fix from Julia.  This bug did not cause real troubles
   for UBIFS, but nevertheless it could cause issues for someone trying
   to modify the orphans handling code.  Kudos to coccinelle!

 - Minor cleanups.

* tag 'upstream-3.6-rc1' of git://git.infradead.org/linux-ubifs:
  UBIFS: remove invalid reference to list iterator variable
  UBIFS: simplify reply code a bit
  UBIFS: add debugfs knob to switch to R/O mode
  UBIFS: fix compilation warning
2012-07-23 15:50:52 -07:00
Jeff Layton
762a4206a3 cifs: rename cifs_sign_smb2 to cifs_sign_smbv
"smb2" makes me think of the SMB2.x protocol, which isn't at all what
this function is for...

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-23 16:36:34 -05:00
Jeff Layton
d971e0656b cifs: remove bogus reset of smb_buf_length in smb_send routines
There's a comment here about how we don't want to modify this length,
but nothing in this function actually does.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-23 16:36:31 -05:00
Jeff Layton
c5fd363d77 cifs: move file_lock off stack in cifs_push_posix_locks
struct file_lock is pretty large, so we really don't want that on the
stack in a potentially long call chain. Reorganize the arguments to
CIFSSMBPosixLock to eliminate the need for that.

Eliminate the get_flag and simply use a non-NULL pLockInfo to indicate
that this is a "get" operation. In order to do that, need to add a new
loff_t argument for the start_offset.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-23 16:36:29 -05:00
Jeff Layton
ac3aa2f8ae cifs: remove extraneous newlines from cERROR and cFYI calls
Those macros add a newline on their own, so there's not any need to
embed one in the message itself.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-23 16:36:26 -05:00
Jeff Layton
00401ff780 cifs: after upcalling for krb5 creds, invalidate key rather than revoking it
Calling key_revoke here isn't ideal as further requests for the key will
end up returning -EKEYREVOKED until it gets purged from the cache. What we
really intend here is to force a new upcall on the next request_key.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-23 16:36:24 -05:00
Liu Bo
67c9684f48 Btrfs: improve multi-thread buffer read
While testing with my buffer read fio jobs[1], I find that btrfs does not
perform well enough.

Here is a scenario in fio jobs:

We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
and all of them will race on add_to_page_cache_lru(), and if one thread
successfully puts its page into the page cache, it takes the responsibility
to read the page's data.

And what's more, reading a page needs a period of time to finish, in which
other threads can slide in and process rest pages:

     t1          t2          t3          t4
   add Page1
   read Page1  add Page2
     |         read Page2  add Page3
     |            |        read Page3  add Page4
     |            |           |        read Page4
-----|------------|-----------|-----------|--------
     v            v           v           v
    bio          bio         bio         bio

Now we have four bios, each of which holds only one page since we need to
maintain consecutive pages in bio.  Thus, we can end up with far more bios
than we need.

Here we're going to
a) delay the real read-page section and
b) try to put more pages into page cache.

With that said, we can make each bio hold more pages and reduce the number
of bios we need.

Here is some numbers taken from fio results:
         w/o patch                 w patch
       -------------  --------  ---------------
READ:    745MB/s        +25%       934MB/s

[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/

[READ]
filename=foobar
size=2000M
invalidate=1

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:10 -04:00
Liu Bo
df57dbe6bf Btrfs: make btrfs's allocation smoothly with preallocation
For backref walking, we've introduce delayed ref's sequence.  However,
it changes our preallocation behavior.

The story is that when we preallocate an extent and then mark it written
piece by piece, the ideal case should be that we don't need to COW the
extent, which is why we use 'preallocate'.

But we may not make use of preallocation, since when we check for cross refs on
the extent, we may have two ref entries which have the same content except
the sequence value, and we recognize them as cross refs and do COW to allocate
another extent.

So we end up with several pieces of space instead of an whole extent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:10 -04:00
Josef Bacik
51561ffec9 Btrfs: lock the transition from dirty to writeback for an eb
There is a small window where an eb can have no IO bits set on it, which
could potentially result in extent_buffer_under_io() returning false when we
want it to return true, which could result in not fun things happening.  So
in order to protect this case we need to hold the refs_lock when we make
this transition to make sure we get reliable results out of
extent_buffer_udner_io().  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:09 -04:00
Josef Bacik
594831c4b2 Btrfs: fix potential race in extent buffer freeing
This sounds sort of impossible but it is the only thing I can think of and
at the very least it is theoretically possible so here it goes.

If we are in try_release_extent_buffer we will check that the ref count on
the extent buffer is 1 and not under IO, and then go down and clear the tree
ref.  If between this check and clearing the tree ref somebody else comes in
and grabs a ref on the eb and the marks it dirty before
try_release_extent_buffer() does it's tree ref clear we can end up with a
dirty eb that will be freed while it is still dirty which will result in a
panic.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:09 -04:00
Josef Bacik
e64860aa05 Btrfs: don't return true in releasepage unless we actually freed the eb
I noticed while looking at an extent_buffer race that we will
unconditionally return 1 if we get down to release_extent_buffer after
clearing the tree ref.  However we can easily race in here and get a ref on
the eb and not actually free the eb.  So make release_extent_buffer return 1
if it free'd the eb and 0 if not so we can be a little kinder to the vm.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:08 -04:00
Stefan Behrens
a98cdb85b9 Btrfs: suppress printk() if all device I/O stats are zero
Code is added to suppress the I/O stats printing at mount time if all
statistic values are zero.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-07-23 16:28:07 -04:00
Stefan Behrens
5021976d8d Btrfs: remove unwanted printk() for btrfs device I/O stats
People complained about the annoying kernel log message
"btrfs: no dev_stats entry found ... (OK on first mount after mkfs)"
everytime a filesystem is mounted for the first time after running
mkfs. Since the distribution of the btrfs-progs is not synchronized
to the kernel version, mkfs like it is now will be used also in the
future. Then this message is not useful to find errors, it is just
annoying. This commit removes the printk().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-07-23 16:28:07 -04:00
Li Zefan
18077bb413 Btrfs: rewrite BTRFS_SETGET_FUNCS
BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and
btrfs_foo() functions, which read and write specific fields in the
extent buffer.

The total number of set/get functions is ~200, but in fact we only
need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64.

It results in redunction of ~37K bytes.

   text    data     bss     dec     hex filename
 629661   12489     216  642366   9cd3e fs/btrfs/btrfs.o.orig
 592637   12489     216  605342   93c9e fs/btrfs/btrfs.o

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:06 -04:00
Li Zefan
293f7e0740 Btrfs: zero unused bytes in inode item
The otime field is not zeroed, so users will see random otime in an old
filesystem with a new kernel which has otime support in the future.

The reserved bytes are also not zeroed, and we'll have compatibility
issue if we make use of those bytes.

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:05 -04:00
Li Zefan
b4d7c3c945 Btrfs: kill free_space pointer from inode structure
Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
which means every inode has the same BTRFS_I(inode)->free_space pointer.

This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:05 -04:00
Anand Jain
d5b025d510 btrfs read error corrected message floods the console during recovery
Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:04 -04:00
Jan Schmidt
e6466e354a Btrfs: fix buffer leak in btrfs_next_old_leaf
When calling btrfs_next_old_leaf, we were leaking an extent buffer in the
rare case of using the deadlock avoidance code needed for the tree mod log.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:03 -04:00
Liu Bo
f6175efab1 Btrfs: do not count in readonly bytes
If a block group is ro, do not count its entries in when we dump space info.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:03 -04:00
Liu Bo
799ffc3c31 Btrfs: add ro notification to dump_space_info
Block group has ro attributes, make dump_space_info show it.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:02 -04:00
Liu Bo
cf7c1ef6e1 Btrfs: fix a bug of writting free space cache during balance
Here is the whole story:
1)
A free space cache consists of two parts:
o  free space cache inode, which is special becase it's stored in root tree.
o  free space info, which is stored as the above inode's file data.

But we only build up another new inode and does not flush its free space info
onto disk when we _clear and setup_ free space cache, and this ends up with
that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN.

And holding DC_SETUP means that we will not truncate this free space cache inode,
which means the disk offset of its file extent will remain _unchanged_ at least
until next transaction finishes committing itself.

2)
We can set a block group readonly when we relocate the block group.

However,
if the readonly block group covers the disk offset where our free space cache
inode is going to write, it will force the free space cache inode into
cow_file_range() and it'll end up hitting a BUG_ON.

3)
Due to the above analysis, we fix this bug by adding the missing dirty flag.

4)
However, it's not over, there is still another case, nospace_cache.

With nospace_cache, we do not want to set dirty flag, instead we just truncate
free space cache inode and bail out with setting cache state DC_WRITTEN.

We can benifit from it since it saves us another 'pre-allocation' part which
usually costs a lot.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:02 -04:00
Liu Bo
0678938423 Btrfs: do not abort transaction in prealloc case
During disk balance, we prealloc new file extent for file data relocation,
but we may fail in 'no available space' case, and it leads to flipping btrfs
into readonly.

It is not necessary to bail out and abort transaction since we do have several
ways to rescue ourselves from ENOSPC case.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:01 -04:00
Liu Bo
83eea1f1ba Btrfs: kill root from btrfs_is_free_space_inode
Since root can be fetched via BTRFS_I macro directly, we can save an args
for btrfs_is_free_space_inode().

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:00 -04:00
Liu Bo
51a8cf9d2d Btrfs: fix btrfs_is_free_space_inode to recognize btree inode
For btree inode, its root is also 'tree root', so btree inode can be
misunderstood as a free space inode.

We should add one more check for btree inode.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:00 -04:00
Stefan Behrens
c0901581ad Btrfs: avoid I/O repair BUG() from btree_read_extent_buffer_pages()
From btree_read_extent_buffer_pages(), currently repair_io_failure()
can be called with mirror_num being zero when submit_one_bio() returned
an error before. This used to cause a BUG_ON(!mirror_num) in
repair_io_failure() and indeed this is not a case that needs the I/O
repair code to rewrite disk blocks.
This commit prevents calling repair_io_failure() in this case and thus
avoids the BUG_ON() and malfunction.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:59 -04:00
Josef Bacik
f4c738c2e7 Btrfs: rework shrink_delalloc
So shrink_delalloc has grown all sorts of cruft over the years thanks to
many reworkings of how we track enospc.  What happens now as we fill up the
disk is we will loop for freaking ever hoping to reclaim a arbitrary amount
of space of metadata, this was from when everybody flushed at the same time.
Now we only have people flushing one at a time.  So instead of trying to
reclaim a huge amount of space, just try to flush a decent chunk of space,
and stop looping as soon as we have enough free space to satisfy our
reservation.  This makes xfstests 224 go much faster.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:58 -04:00
Liu Bo
b9ca0664dc Btrfs: do not set subvolume flags in readonly mode
$ mkfs.btrfs /dev/sdb7
$ btrfstune -S1 /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
mount: block device /dev/sdb7 is write-protected, mounting read-only
$ btrfs dev add /dev/sdb8 /mnt/btrfs/

Now we get a btrfs in which mnt flags has readonly but sb flags does
not.  So for those ioctls that only check sb flags with MS_RDONLY, it
is going to be a problem.
Setting subvolume flags is such an ioctl, we should use mnt_want_write_file()
to check RO flags.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:58 -04:00
Liu Bo
e54bfa3104 Btrfs: use mnt_want_write_file instead of mnt_want_write
mnt_want_write_file is faster when file has been opened for write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:57 -04:00
Liu Bo
768e9dfe82 Btrfs: remove redundant r/o check for superblock
mnt_want_write() and mnt_want_write_file() will check sb->s_flags with
MS_RDONLY, and we don't need to do it ourselves.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
Liu Bo
a874a63e13 Btrfs: check write access to mount earlier while creating snapshots
Move check of write access to mount into upper functions so that we can
use mnt_want_write_file instead, which is faster than mnt_want_write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
Liu Bo
287082b0bd Btrfs: fix typo in cow_file_range_async and async_cow_submit
It should be 10 * 1024 * 1024.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-23 16:27:55 -04:00
Josef Bacik
0e72110692 Btrfs: change how we indicate we're adding csums
There is weird logic I had to put in place to make sure that when we were
adding csums that we'd used the delalloc block rsv instead of the global
block rsv.  Part of this meant that we had to free up our transaction
reservation before we ran the delayed refs since csum deletion happens
during the delayed ref work.  The problem with this is that when we release
a reservation we will add it to the global reserve if it is not full in
order to keep us going along longer before we have to force a transaction
commit.  By releasing our reservation before we run delayed refs we don't
get the opportunity to drain down the global reserve for the work we did, so
we won't refill it as often.  This isn't a problem per-se, it just results
in us possibly committing transactions more and more often, and in rare
cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
out of space in our block rsv.

This also helps us by holding onto space while the delayed refs run so we
don't end up with as many people trying to do things at the same time, which
again will help us not force commits or hit the use_block_rsv warnings.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:55 -04:00
Tsutomu Itoh
b995929515 Btrfs: return error of btrfs_update_inode() to caller
We didn't check error of btrfs_update_inode(), but that error looks
easy to bubble back up.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:54 -04:00
Dan Carpenter
23291a044c Btrfs: fix error handling in __add_reloc_root()
We dereferenced "node" in the error message after freeing it.  Also
btrfs_panic() can return so we should return an error code instead of
continuing.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-23 16:27:53 -04:00
Ilya Dryomov
44c44af2f4 Btrfs: do not ignore errors from btrfs_cleanup_fs_roots() when mounting
There used to be a BUG_ON(ret) there before EH patch (79787eaa) went in.
Bail out with EINVAL.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-23 16:27:53 -04:00
Ilya Dryomov
fed425c742 Btrfs: do not return EINVAL instead of ENOMEM from open_ctree()
When bailing from open_ctree() err is returned, not ret.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-23 16:27:52 -04:00
Josef Bacik
02db0844be Btrfs: add DEVICE_READY ioctl
This will be used in conjunction with btrfs device ready <dev>.  This is
needed for initrd's to have a nice and lightweight way to tell if all of the
devices needed for a file system are in the cache currently.  This keeps
them from having to do mount+sleep loops waiting for devices to show up.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:42 -04:00
J. Bruce Fields
0ec4f431eb locks: fix checking of fcntl_setlease argument
The only checks of the long argument passed to fcntl(fd,F_SETLEASE,.)
are done after converting the long to an int.  Thus some illegal values
may be let through and cause problems in later code.

[ They actually *don't* cause problems in mainline, as of Dave Jones's
  commit 8d657eb3b4 "Remove easily user-triggerable BUG from
  generic_setlease", but we should fix this anyway.  And this patch will
  be necessary to fix real bugs on earlier kernels. ]

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-23 12:46:01 -07:00
Josef Bacik
96c3f4331a Btrfs: flush delayed inodes if we're short on space
Those crazy gentoo guys have been complaining about ENOSPC errors on their
portage volumes.  This is because doing things like untar tends to create
lots of new files which will soak up all the reservation space in the
delayed inodes.  Usually this gets papered over by the fact that we will try
and commit the transaction, however if this happens in the wrong spot or we
choose not to commit the transaction you will be screwed.  So add the
ability to expclitly flush delayed inodes to free up space.  Please test
this out guys to make sure it works since as usual I cannot reproduce.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 15:41:40 -04:00
David Sterba
b27f7c0c15 btrfs: join DEV_STATS ioctls to one
Commit c11d2c236c (Btrfs: add ioctl to get and reset the device
stats) introduced two ioctls doing almost the same thing distinguished
by just the ioctl number which encodes "do reset after read". I have
suggested

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html

to implement it via the ioctl args. This hasn't happen, and I think we
should use a more clean way to pass flags and should not waste ioctl
numbers.

CC: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.cz>
2012-07-23 15:41:40 -04:00
Andrew Mahone
a43a211133 btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased
Rebased on btrfs-next and retested.

Inform should_defrag_range if BTRFS_DEFRAG_RANGE_COMPRESS is set. If so, skip
checks for adjacent extents and extent size when deciding whether to defrag,
as these can prevent an uncompressed and unfragmented file from being
compressed as requested.

Signed-off-by: Andrew Mahone <andrew.mahone@gmail.com>
2012-07-23 15:41:39 -04:00
Dan Carpenter
e4b50e14c8 Btrfs: small naming cleanup in join_transaction()
"root->fs_info" and "fs_info" are the same, but "fs_info" is prefered
because it is shorter and that's what is used in the rest of the
function.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-23 15:41:39 -04:00
Alexander Block
2bc5565286 Btrfs: don't update atime on RO subvolumes
Before the update_time inode operation was indroduced, it was
not possible to prevent updates of atime on RO subvolumes. VFS
was only able to check for RO on the mount, but did not know
anything about btrfs subvolumes.

btrfs_update_time does now check if the root is RO and skip
updating of times.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-07-23 15:41:38 -04:00
Arnd Hannemann
063849eafd Btrfs: allow mount -o remount,compress=no
Btrfs allows to turn on compression on a mounted and used filesystem
by issuing mount -o remount,compress=lzo.
This patch allows to turn compression off again
while the filesystem is mounted. As suggested by David Sterba
if the compress-force option was set, it is implicitly cleared
if compression is turned off.

Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
2012-07-23 15:41:38 -04:00
Josef Bacik
c5c3c5f31e Btrfs: remove ->dirty_inode
We do all of our inode updating when we change it, and now that we do
->update_time we don't need ->dirty_inode for atime updates anymore, so just
remove it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-07-23 15:41:38 -04:00
Chris Mason
cbea5ac1ee Btrfs: reduce calls to wake_up on uncontended locks
The btrfs locks were unconditionally calling wake_up as the
locks were released.  This lead to extra thrashing on the waitqueue,
especially for locks that were dominated by readers.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:18 -04:00
Chris Mason
e39e64ac0c Btrfs: don't wait around for new log writers on an SSD
Waiting on spindles improves performance, but ssds want all the
IO as quickly as we can push it down.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:17 -04:00
Linus Torvalds
a66d2c8f7e Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull the big VFS changes from Al Viro:
 "This one is *big* and changes quite a few things around VFS.  What's in there:

   - the first of two really major architecture changes - death to open
     intents.

     The former is finally there; it was very long in making, but with
     Miklos getting through really hard and messy final push in
     fs/namei.c, we finally have it.  Unlike his variant, this one
     doesn't introduce struct opendata; what we have instead is
     ->atomic_open() taking preallocated struct file * and passing
     everything via its fields.

     Instead of returning struct file *, it returns -E...  on error, 0
     on success and 1 in "deal with it yourself" case (e.g.  symlink
     found on server, etc.).

     See comments before fs/namei.c:atomic_open().  That made a lot of
     goodies finally possible and quite a few are in that pile:
     ->lookup(), ->d_revalidate() and ->create() do not get struct
     nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
     flags instead, ->create() gets "do we want it exclusive" flag.

     With the introduction of new helper (kern_path_locked()) we are rid
     of all struct nameidata instances outside of fs/namei.c; it's still
     visible in namei.h, but not for long.  Come the next cycle,
     declaration will move either to fs/internal.h or to fs/namei.c
     itself.  [me, miklos, hch]

   - The second major change: behaviour of final fput().  Now we have
     __fput() done without any locks held by caller *and* not from deep
     in call stack.

     That obviously lifts a lot of constraints on the locking in there.
     Moreover, it's legal now to call fput() from atomic contexts (which
     has immediately simplified life for aio.c).  We also don't need
     anti-recursion logics in __scm_destroy() anymore.

     There is a price, though - the damn thing has become partially
     asynchronous.  For fput() from normal process we are guaranteed
     that pending __fput() will be done before the caller returns to
     userland, exits or gets stopped for ptrace.

     For kernel threads and atomic contexts it's done via
     schedule_work(), so theoretically we might need a way to make sure
     it's finished; so far only one such place had been found, but there
     might be more.

     There's flush_delayed_fput() (do all pending __fput()) and there's
     __fput_sync() (fput() analog doing __fput() immediately).  I hope
     we won't need them often; see warnings in fs/file_table.c for
     details.  [me, based on task_work series from Oleg merged last
     cycle]

   - sync series from Jan

   - large part of "death to sync_supers()" work from Artem; the only
     bits missing here are exofs and ext4 ones.  As far as I understand,
     those are going via the exofs and ext4 trees resp.; once they are
     in, we can put ->write_super() to the rest, along with the thread
     calling it.

   - preparatory bits from unionmount series (from dhowells).

   - assorted cleanups and fixes all over the place, as usual.

  This is not the last pile for this cycle; there's at least jlayton's
  ESTALE work and fsfreeze series (the latter - in dire need of fixes,
  so I'm not sure it'll make the cut this cycle).  I'll probably throw
  symlink/hardlink restrictions stuff from Kees into the next pile, too.
  Plus there's a lot of misc patches I hadn't thrown into that one -
  it's large enough as it is..."

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
  ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
  btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
  switch dentry_open() to struct path, make it grab references itself
  spufs: shift dget/mntget towards dentry_open()
  zoran: don't bother with struct file * in zoran_map
  ecryptfs: don't reinvent the wheels, please - use struct completion
  don't expose I_NEW inodes via dentry->d_inode
  tidy up namei.c a bit
  unobfuscate follow_up() a bit
  ext3: pass custom EOF to generic_file_llseek_size()
  ext4: use core vfs llseek code for dir seeks
  vfs: allow custom EOF in generic_file_llseek code
  vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
  vfs: Remove unnecessary flushing of block devices
  vfs: Make sys_sync writeout also block device inodes
  vfs: Create function for iterating over block devices
  vfs: Reorder operations during sys_sync
  quota: Move quota syncing to ->sync_fs method
  quota: Split dquot_quota_sync() to writeback and cache flushing part
  vfs: Move noop_backing_dev_info check from sync into writeback
  ...
2012-07-23 12:27:27 -07:00
Cong Wang
906adea153 jbd2: remove the second argument of kmap_atomic
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-07-23 14:11:22 +08:00
Theodore Ts'o
03179fe923 ext4: undo ext4_calc_metadata_amount if we fail to claim space
The function ext4_calc_metadata_amount() has side effects, although
it's not obvious from its function name.  So if we fail to claim
space, regardless of whether we retry to claim the space again, or
return an error, we need to undo these side effects.

Otherwise we can end up incorrectly calculating the number of metadata
blocks needed for the operation, which was responsible for an xfstests
failure for test #271 when using an ext2 file system with delalloc
enabled.

Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-07-23 00:00:20 -04:00
Brian Foster
97795d2a5b ext4: don't let i_reserved_meta_blocks go negative
If we hit a condition where we have allocated metadata blocks that
were not appropriately reserved, we risk underflow of
ei->i_reserved_meta_blocks.  In turn, this can throw
sbi->s_dirtyclusters_counter significantly out of whack and undermine
the nondelalloc fallback logic in ext4_nonda_switch().  Warn if this
occurs and set i_allocated_meta_blocks to avoid this problem.

This condition is reproduced by xfstests 270 against ext2 with
delalloc enabled:

Mar 28 08:58:02 localhost kernel: [  171.526344] EXT4-fs (loop1): delayed block allocation failed for inode 14 at logical offset 64486 with max blocks 64 with error -28
Mar 28 08:58:02 localhost kernel: [  171.526346] EXT4-fs (loop1): This should not happen!! Data will be lost

270 ultimately fails with an inconsistent filesystem and requires an
fsck to repair.  The cause of the error is an underflow in
ext4_da_update_reserve_space() due to an unreserved meta block
allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-07-22 23:59:40 -04:00
Ashish Sangwan
968dee7722 ext4: fix hole punch failure when depth is greater than 0
Whether to continue removing extents or not is decided by the return
value of function ext4_ext_more_to_rm() which checks 2 conditions:
a) if there are no more indexes to process.
b) if the number of entries are decreased in the header of "depth -1".

In case of hole punch, if the last block to be removed is not part of
the last extent index than this index will not be deleted, hence the
number of valid entries in the extent header of "depth - 1" will
remain as it is and ext4_ext_more_to_rm will return 0 although the
required blocks are not yet removed.

This patch fixes the above mentioned problem as instead of removing
the extents from the end of file, it starts removing the blocks from
the particular extent from which removing blocks is actually required
and continue backward until done.

Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2012-07-22 22:49:08 -04:00
Artem Bityutskiy
b50924c2c6 ext4: remove unnecessary argument from __ext4_handle_dirty_metadata()
The '__ext4_handle_dirty_metadata()' does not need the 'now' argument
anymore and we can kill it.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:37:31 -04:00
Artem Bityutskiy
4d47603d97 ext4: weed out ext4_write_super
We do not depend on VFS's '->write_super()' anymore and do not need
the 's_dirt' flag anymore, so weed out 'ext4_write_super()' and
's_dirt'.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:35:31 -04:00
Artem Bityutskiy
58c5873a76 ext4: remove unnecessary superblock dirtying
This patch changes the 'ext4_handle_dirty_super()' function which
submits the superblock for I/O in the following cases:

1. When creating the first large file on a file system without
   EXT4_FEATURE_RO_COMPAT_LARGE_FILE feature.
2. When re-sizing the file-system.
3. When creating an xattr on a file-system without the
   EXT4_FEATURE_COMPAT_EXT_ATTR feature.

If the file-system has journal enabled, the superblock is written via
the journal. We do not modify this path.

If the file-system has no journal, this function, falls back to just
marking the superblock as dirty using the 's_dirt' superblock
flag. This means that it delays the actual superblock I/O submission
by 5 seconds (default setting).  Namely, the 'sync_supers()' kernel
thread will call 'ext4_write_super()' later and will actually submit
the superblock for I/O.

And this is the behavior this patch modifies: we stop using 's_dirt'
and just mark the superblock buffer as dirty right away. Indeed, all 3
cases above are extremely rare and it does not add any value to delay
the I/O submission for them.

Note: 'ext4_handle_dirty_super()' executes
'__ext4_handle_dirty_super()' with 'now = 0'. This patch basically
makes the 'now' argument unneeded and it will be deleted in one of the
next patches.

This patch also removes 's_dirt' condition on the unmount path because
we never set it anymore, so we should not test it.

Tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:33:31 -04:00
Jan Kara
044ce47fec ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super()
The last user of ext4_mark_super_dirty() in ext4_file_open() is so
rare it can well be modifying the superblock properly by journalling
the change.  Change it and get rid of ext4_mark_super_dirty() as it's
not needed anymore.

Artem: small amendments.
Artem: tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-07-22 20:31:31 -04:00
Jan Kara
97a7406880 ext4: remove useless marking of superblock dirty
Commit a0375156 properly notes that superblock doesn't need to be marked
as dirty when only number of free inodes / blocks / number of directories
changes since that is recomputed on each mount anyway. However that comment
leaves some unnecessary markings as dirty in place. Remove these.

Artem: tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-07-22 20:29:31 -04:00
Al Viro
254706056b ext4: fix ext4 mismerge back in January
Duplicate caused, AFAICS, by mismerge in
ff9cb1c4eead5e4c292e75cd3170a82d66944101>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-07-22 20:27:31 -04:00
Theodore Ts'o
3108b54bce ext4: remove dynamic array size in ext4_chksum()
The ext4_checksum() inline function was using a dynamic array size,
which is not legal C.  (It is a gcc extension).

Remove it.

Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22 20:25:31 -04:00
Theodore Ts'o
8a9918497b ext4: remove unused variable in ext4_update_super()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22 20:23:31 -04:00
Aditya Kali
7c319d3285 ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.

It is based on the proposal at:                                                                                                           
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4

This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
iteself. Also, the quota inodes are stored in two additional superblock
fields.  Some changes introduced by this patch that should be pointed
out are:

1) Two new ext4-superblock fields - s_usr_quota_inum and
   s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
   for tracking group quota. The superblock fields can be set to use
   other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
   superblock, the quota usage tracking is turned on at mount time. On
   'quotaon' ioctl, the quota limits enforcement is turned
   on. 'quotaoff' ioctl turns off only the limits enforcement in this
   case.
4) When QUOTA feature is in use, the quota mount options 'quota',
   'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
   quota inodes. The default reserved inodes will not be visible to user
   as regular files.
6) The quota-tools will need to be modified to support hidden quota
   files on ext4. E2fsprogs will also include support for creating and
   fixing quota files.
7) Support is only for the new V2 quota file format.

Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22 20:21:31 -04:00
Zheng Liu
4bd809dbbf ext4: don't take the i_mutex lock when doing DIO overwrites
Aligned and overwrite direct I/O can be parallelized.  In
ext4_file_dio_write, we first check whether these conditions are
satisfied or not.  If so, we take i_data_sem and release i_mutex lock
directly.  Meanwhile iocb->private is set to indicate that this is a
dio overwrite, and it will be handled in ext4_ext_direct_IO.

[ Added fix from Dan Carpenter to fix locking bug on the error path. ]

CC: Tao Ma <tm@tao.ma>
CC: Eric Sandeen <sandeen@redhat.com>
CC: Robin Dong <hao.bigrat@gmail.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-22 20:19:31 -04:00
Al Viro
8cae6f7158 ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:55 +04:00
Al Viro
11e62a8fab btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:43 +04:00
Al Viro
765927b2d5 switch dentry_open() to struct path, make it grab references itself
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:29 +04:00
Al Viro
3b8b487114 ecryptfs: don't reinvent the wheels, please - use struct completion
... and keep the sodding requests on stack - they are small enough.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:02 +04:00
Al Viro
8fc37ec54c don't expose I_NEW inodes via dentry->d_inode
d_instantiate(dentry, inode);
	unlock_new_inode(inode);

is a bad idea; do it the other way round...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:58 +04:00
Al Viro
32a7991b6a tidy up namei.c a bit
locking/unlocking for rcu walk taken to a couple of inline helpers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:55 +04:00
Al Viro
3c0a616368 unobfuscate follow_up() a bit
really convoluted test in there has grown up during struct mount
introduction; what it checks is that we'd reached the root of
mount tree.
2012-07-23 00:00:45 +04:00
Eric Sandeen
de9b942202 ext3: pass custom EOF to generic_file_llseek_size()
Use the new custom EOF argument to generic_file_llseek_size so
that SEEK_END will go to the max hash value for htree dirs
in ext3 rather than to i_size_read()

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:30 +04:00
Eric Sandeen
ec7268ce21 ext4: use core vfs llseek code for dir seeks
Use the new functionality in generic_file_llseek_size() to
accept a custom EOF position, and un-cut-and-paste all the
vfs llseek code from ext4.

Also fix up comments on ext4_llseek() to reflect reality.

Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:28 +04:00
Eric Sandeen
e8b96eb503 vfs: allow custom EOF in generic_file_llseek code
For ext3/4 htree directories, using the vfs llseek function with
SEEK_END goes to i_size like for any other file, but in reality
we want the maximum possible hash value.  Recent changes
in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
but replicating this core code seems like a bad idea, especially
since the copy has already diverged from the vfs.

This patch updates generic_file_llseek_size to accept
both a custom maximum offset, and a custom EOF position.  With this
in place, ext4_dir_llseek can pass in the appropriate maximum hash
position for both maxsize and eof, and get what it wants.

As far as I know, this does not fix any bugs - nfs in the kernel
doesn't use SEEK_END, and I don't know of any user who does.  But
some ext4 folks seem keen on doing the right thing here, and I can't
really argue.

(Patch also fixes up some comments slightly)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:15 +04:00
Jan Kara
4ea425b63a vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
wakeup_flusher_threads(0) will queue work doing complete writeback for each
flusher thread. Thus there is not much point in submitting another work doing
full inode WB_SYNC_NONE writeback by writeback_inodes_sb().

After this change it does not make sense to call nonblocking ->sync_fs and
block device flush before calling sync_inodes_sb() because
wakeup_flusher_threads() is completely asynchronous and thus these functions
would be called in parallel with inode writeback running which will effectively
void any work they do. So we move sync_inodes_sb() call before these two
functions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:59:01 +04:00
Jan Kara
d0e91b13eb vfs: Remove unnecessary flushing of block devices
It is not necessary to write block devices twice. The reason why we first did
flush and then proper sync is that
  for_each_bdev() {
    write_bdev()
    wait_for_completion()
  }
is much slower than
  for_each_bdev()
    write_bdev()
  for_each_bdev()
    wait_for_completion()
when there is bigger amount of data. But as is seen in the above, there's no real
need to scan pages and submit them twice. We just need to separate the submission
and waiting part. This patch does that.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:53 +04:00
Jan Kara
a8c7176b6d vfs: Make sys_sync writeout also block device inodes
In case block device does not have filesystem mounted on it, sys_sync will just
ignore it and doesn't writeout its dirty pages. This is because writeback code
avoids writing inodes from superblock without backing device and
blockdev_superblock is such a superblock.  Since it's unexpected that sync
doesn't writeout dirty data for block devices be nice to users and change the
behavior to do so. So now we iterate over all block devices on blockdev_super
instead of iterating over all superblocks when syncing block devices.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:49 +04:00
Jan Kara
5c0d6b60a0 vfs: Create function for iterating over block devices
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:45 +04:00
Jan Kara
b3de653105 vfs: Reorder operations during sys_sync
Change the order of operations during sync from

for_each_sb {
        writeback_inodes_sb();
        sync_fs(nowait);
        __sync_blockdev(nowait);
}
for_each_sb {
        sync_inodes_sb();
        sync_fs(wait);
        __sync_blockdev(wait);
}

to

for_each_sb
        writeback_inodes_sb();
for_each_sb
        sync_fs(nowait);
for_each_sb
        __sync_blockdev(nowait);
for_each_sb
        sync_inodes_sb();
for_each_sb
        sync_fs(wait);
for_each_sb
        __sync_blockdev(wait);

This is a preparation for the following patches in this series.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:41 +04:00
Jan Kara
a117782571 quota: Move quota syncing to ->sync_fs method
Since the moment writes to quota files are using block device page cache and
space for quota structures is reserved at the moment they are first accessed we
have no reason to sync quota before inode writeback. In fact this order is now
only harmful since quota information can easily change during inode writeback
(either because conversion of delayed-allocated extents or simply because of
allocation of new blocks for simple filesystems not using page_mkwrite).

So move syncing of quota information after writeback of inodes into ->sync_fs
method. This way we do not have to use ->quota_sync callback which is primarily
intended for use by quotactl syscall anyway and we get rid of calling
->sync_fs() twice unnecessarily. We skip quota syncing for OCFS2 since it does
proper quota journalling in all cases (unlike ext3, ext4, and reiserfs which
also support legacy non-journalled quotas) and thus there are no dirty quota
structures.

CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Joel Becker <jlbec@evilplan.org>
CC: reiserfs-devel@vger.kernel.org
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Dave Kleikamp <shaggy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:34 +04:00
Jan Kara
ceed17236a quota: Split dquot_quota_sync() to writeback and cache flushing part
Split off part of dquot_quota_sync() which writes dquots into a quota file
to a separate function. In the next patch we will use the function from
filesystems and we do not want to abuse ->quota_sync quotactl callback more
than necessary.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:19 +04:00
Jan Kara
6eedc70150 vfs: Move noop_backing_dev_info check from sync into writeback
In principle, a filesystem may want to have ->sync_fs() called during sync(1)
although it does not have a bdi (i.e. s_bdi is set to noop_backing_dev_info).
Only writeback code really needs bdi set to something reasonable. So move the
checks where they are more logical.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:18 +04:00
Artem Bityutskiy
9e9ad5f408 fs/ufs: get rid of write_super
This patch makes UFS stop using the VFS '->write_super()' method along with
the 's_dirt' superblock flag, because they are on their way out.

The way we implement this is that we schedule a delay job instead relying on
's_dirt' and '->write_super()'.

The whole "superblock write-out" VFS infrastructure is served by the
'sync_supers()' kernel thread, which wakes up every 5 (by default) seconds and
writes out all dirty superblocks using the '->write_super()' call-back.  But the
problem with this thread is that it wastes power by waking up the system every
5 seconds, even if there are no diry superblocks, or there are no client
file-systems which would need this (e.g., btrfs does not use
'->write_super()'). So we want to kill it completely and thus, we need to make
file-systems to stop using the '->write_super()' VFS service, and then remove
it together with the kernel thread.

Tested using fsstress from the LTP project.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:16 +04:00
Artem Bityutskiy
7bd54ef722 fs/ufs: re-arrange the code a bit
This patch does not do any functional changes. It only moves 3 functions
in fs/ufs/super.c a little bit up in order to prepare for further changes
where I'll need this new arrangement to avoid forward declarations.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:14 +04:00
Artem Bityutskiy
65e5e83f7d fs/ufs: remove extra superblock write on unmount
UFS calls 'ufs_write_super()' from 'ufs_put_super()' in order to write the
superblocks to the media. However, it is not needed because VFS calls
'->sync_fs()' before calling '->put_super()' - so by the time we are in
'ufs_write_super()', the superblocks are already synchronized.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:14 +04:00
Artem Bityutskiy
9d46be294d fs/sysv: stop using write_super and s_dirt
It does not look like sysv FS needs 'write_super()' at all, because all it
does is a timestamp update. I cannot test this patch, because this
file-system is so old and probably has not been used by anyone for years,
so there are no tools to create it in Linux. But from the code I see that
marking the superblock as dirty is basically marking the superblock buffers as
drity and then setting the s_dirt flag. And when 'write_super()' is executed to
handle the s_dirt flag, we just update the timestamp and again mark the
superblock buffer as dirty. Seems pointless.

It looks like we can update the timestamp more opprtunistically - on unmount
or remount of sync, and nothing should change.

Thus, this patch removes 'sysv_write_super()' and 's_dirt'.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:12 +04:00