Commit Graph

40163 Commits

Author SHA1 Message Date
Richard Kennedy
d8c0fca68d fsnotify: remove alignment padding from fsnotify_mark on 64 bit builds
Reorder struct fsnotfiy_mark to remove 8 bytes of alignment padding on 64
bit builds.  Shrinks fsnotfiy_mark to 128 bytes allowing more objects per
slab in its kmem_cache and reduces the number of cachelines needed for
each structure.

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:16 -04:00
Stefan Hajnoczi
50e4a98914 fanotify: Fix FAN_CLOSE comments
The comments for FAN_CLOSE_WRITE and FAN_CLOSE_NOWRITE do not match
FS_CLOSE_WRITE and FS_CLOSE_NOWRITE, respectively.  WRITE is for
writable files while NOWRITE is for non-writable files.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:16 -04:00
Eric Paris
8fcd65280a fanotify: ignore events on directories unless specifically requested
fanotify has a very limited number of events it sends on directories.  The
usefulness of these events is yet to be seen and still we send them.  This
is particularly painful for mount marks where one might receive many of
these useless events.  As such this patch will drop events on IS_DIR()
inodes unless they were explictly requested with FAN_ON_DIR.

This means that a mark on a directory without FAN_EVENT_ON_CHILD or
FAN_ON_DIR is meaningless and will result in no events ever (although it
will still be allowed since detecting it is hard)

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:16 -04:00
Eric Paris
b29866aab8 fsnotify: rename FS_IN_ISDIR to FS_ISDIR
The _IN_ in the naming is reserved for flags only used by inotify.  Since I
am about to use this flag for fanotify rename it to be generic like the
rest.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:15 -04:00
Eric Paris
4afeff8505 fanotify: limit number of listeners per user
fanotify currently has no limit on the number of listeners a given user can
have open.  This patch limits the total number of listeners per user to
128.  This is the same as the inotify default limit.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:15 -04:00
Eric Paris
ac7e22dcfa fanotify: allow userspace to override max marks
Some fanotify groups, especially those like AV scanners, will need to place
lots of marks, particularly ignore marks.  Since ignore marks do not pin
inodes in cache and are cleared if the inode is removed from core (usually
under memory pressure) we expose an interface for listeners, with
CAP_SYS_ADMIN, to override the maximum number of marks and be allowed to
set and 'unlimited' number of marks.  Programs which make use of this
feature will be able to OOM a machine.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:15 -04:00
Eric Paris
e7099d8a5a fanotify: limit the number of marks in a single fanotify group
There is currently no limit on the number of marks a given fanotify group
can have.  Since fanotify is gated on CAP_SYS_ADMIN this was not seen as
a serious DoS threat.  This patch implements a default of 8192, the same as
inotify to work towards removing the CAP_SYS_ADMIN gating and eliminating
the default DoS'able status.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:14 -04:00
Eric Paris
5dd03f55fd fanotify: allow userspace to override max queue depth
fanotify has a defualt max queue depth.  This patch allows processes which
explicitly request it to have an 'unlimited' queue depth.  These processes
need to be very careful to make sure they cannot fall far enough behind
that they OOM the box.  Thus this flag is gated on CAP_SYS_ADMIN.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:14 -04:00
Eric Paris
2529a0df0f fsnotify: implement a default maximum queue depth
Currently fanotify has no maximum queue depth.  Since fanotify is
CAP_SYS_ADMIN only this does not pose a normal user DoS issue, but it
certianly is possible that an fanotify listener which can't keep up could
OOM the box.  This patch implements a default 16k depth.  This is the same
default depth used by inotify, but given fanotify's better queue merging in
many situations this queue will contain many additional useful events by
comparison.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:14 -04:00
Eric Paris
bbf2aba50f fanotify: allow userspace to flush all marks
fanotify is supposed to be able to flush all marks.  This is mostly useful
for the AV community to flush all cached decisions on a security policy
change.  This functionality has existed in the kernel but wasn't correctly
exposed to userspace.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:14 -04:00
Eric Paris
52420392c8 fsnotify: call fsnotify_parent in perm events
fsnotify perm events do not call fsnotify parent.  That means you cannot
register a perm event on a directory and enforce permissions on all inodes in
that directory.  This patch fixes that situation.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Eric Paris
ff8bcbd03d fsnotify: correctly handle return codes from listeners
When fsnotify groups return errors they are ignored.  For permissions
events these should be passed back up the stack, but for most events these
should continue to be ignored.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Eric Paris
2868201965 fanotify: use __aligned_u64 in fanotify userspace metadata
Currently the userspace struct exposed by fanotify uses
__attribute__((packed)) to make sure that alignment works on multiarch
platforms.  Since this causes a severe performance penalty on some
platforms we are going to switch to using explicit alignment notation on
the 64bit values so we don't have to use 'packed'

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Eric Paris
4231a23530 fanotify: implement fanotify listener ordering
The fanotify listeners needs to be able to specify what types of operations
they are going to perform so they can be ordered appropriately between other
listeners doing other types of operations.  They need this to be able to make
sure that things like hierarchichal storage managers will get access to inodes
before processes which need the data.  This patch defines 3 possible uses
which groups must indicate in the fanotify_init() flags.

FAN_CLASS_PRE_CONTENT
FAN_CLASS_CONTENT
FAN_CLASS_NOTIF

Groups will receive notification in that order.  The order between 2 groups in
the same class is undeterministic.

FAN_CLASS_PRE_CONTENT is intended to be used by listeners which need access to
the inode before they are certain that the inode contains it's final data.  A
hierarchical storage manager should choose to use this class.

FAN_CLASS_CONTENT is intended to be used by listeners which need access to the
inode after it contains its intended contents.  This would be the appropriate
level for an AV solution or document control system.

FAN_CLASS_NOTIF is intended for normal async notification about access, much the
same as inotify and dnotify.  Syncronous permissions events are not permitted
at this class.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Eric Paris
6ad2d4e3e9 fsnotify: implement ordering between notifiers
fanotify needs to be able to specify that some groups get events before
others.  They use this idea to make sure that a hierarchical storage
manager gets access to files before programs which actually use them.  This
is purely infrastructure.  Everything will have a priority of 0, but the
infrastructure will exist for it to be non-zero.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Eric Paris
9343919c14 fanotify: allow fanotify to be built
We disabled the ability to build fanotify in commit 7c5347733d.
This reverts that commit and allows people to build fanotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-10-28 17:22:13 -04:00
Linus Torvalds
f063a0c0c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (841 commits)
  Staging: brcm80211: fix usage of roundup in structures
  Staging: bcm: fix up network device reference counting
  Staging: keucr: fix up US_ macro change
  staging: brcm80211: brcmfmac: Removed codeversion from firmware filenames.
  staging: brcm80211: Remove unnecessary header files.
  staging: brcm80211: Remove unnecessary includes from bcmutils.c
  staging: brcm80211: Removed unnecessary pktsetprio() function.
  Staging: brcm80211: remove typedefs.h
  Staging: brcm80211: remove uintptr typedef usage
  Staging: hv: remove struct vmbus_channel_interface
  Staging: hv: remove Open from struct vmbus_channel_interface
  Staging: hv: storvsc: call vmbus_open directly
  Staging: hv: netvsc: call vmbus_open directly
  Staging: hv: channel: export vmbus_open to modules
  Staging: hv: remove Close from struct vmbus_channel_interface
  Staging: hv: netvsc: call vmbus_close directly
  Staging: hv: storvsc: call vmbus_close directly
  Staging: hv: channel: export vmbus_close to modules
  Staging: hv: remove SendPacket from struct vmbus_channel_interface
  Staging: hv: storvsc: call vmbus_sendpacket directly
  ...

Fix up conflicts in
	drivers/staging/cx25821/cx25821-audio-upstream.c
	drivers/staging/cx25821/cx25821-audio.h
due to warring whitespace cleanups (neither of which were all that great)
2010-10-28 12:13:00 -07:00
Linus Torvalds
3c37629578 Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (32 commits)
  sh: intc: switch irq_desc iteration to new active IRQ iterator.
  sh: fix up cpu hotplug IRQ migration for irq_data changes.
  sh: oprofile: Make sure the backtrace op is available for timer-fallback.
  sh64: oprofile: Fix up kernel stack pointer size mismatch.
  sh: oprofile: Fix up and extend op_name_from_perf_id().
  sh: lockless get_user_pages_fast()
  sh64: _PAGE_SPECIAL support.
  sound: sh: ctrl_in/outX to __raw_read/writeX conversion.
  sh: disable deprecated genirq support.
  sh: update show_interrupts() for irq_data chip lookup.
  sh: intc: irq_data conversion.
  sh64: irq_data conversion.
  sh64: update for IRQ flag handling naming changes.
  rtc: rtc-rs5c313: ctrl_in/outX to __raw_read/writeX conversion.
  sh: mach-se: irq_data conversion.
  input: hp680_ts_input: ctrl_in/outX to __raw_read/writeX conversion.
  input: jornada680_kbd: ctrl_in/outX to __raw_read/writeX conversion.
  sh: hd64461: irq_data conversion.
  sh: mach-x3proto: irq_data conversion.
  sh: mach-systemh: irq_data conversion.
  ...
2010-10-28 12:06:51 -07:00
Linus Torvalds
e9f29c9a56 Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (27 commits)
  x86: allocate space within a region top-down
  x86: update iomem_resource end based on CPU physical address capabilities
  x86/PCI: allocate space from the end of a region, not the beginning
  PCI: allocate bus resources from the top down
  resources: support allocating space within a region from the top down
  resources: handle overflow when aligning start of available area
  resources: ensure callback doesn't allocate outside available space
  resources: factor out resource_clip() to simplify find_resource()
  resources: add a default alignf to simplify find_resource()
  x86/PCI: MMCONFIG: fix region end calculation
  PCI: Add support for polling PME state on suspended legacy PCI devices
  PCI: Export some PCI PM functionality
  PCI: fix message typo
  PCI: log vendor/device ID always
  PCI: update Intel chipset names and defines
  PCI: use new ccflags variable in Makefile
  PCI: add PCI_MSIX_TABLE/PBA defines
  PCI: add PCI vendor id for STmicroelectronics
  x86/PCI: irq and pci_ids patch for Intel Patsburg DeviceIDs
  PCI: OLPC: Only enable PCI configuration type override on XO-1
  ...
2010-10-28 11:59:52 -07:00
Greg Kroah-Hartman
e4c5bf8e3d Merge 'staging-next' to Linus's tree
This merges the staging-next tree to Linus's tree and resolves
some conflicts that were present due to changes in other trees that were
affected by files here.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-10-28 09:44:56 -07:00
Linus Torvalds
0851668fdd Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (505 commits)
  [media] af9015: Fix max I2C message size when used with tda18271
  [media] IR: initialize ir_raw_event in few more drivers
  [media] Guard a divide in v4l1 compat layer
  [media] imon: fix nomouse modprobe option
  [media] imon: remove redundant change_protocol call
  [media] imon: fix my egregious brown paper bag w/rdev/idev split
  [media] cafe_ccic: Configure ov7670 correctly
  [media] ov7670: allow configuration of image size, clock speed, and I/O method
  [media] af9015: support for DigitalNow TinyTwin v3 [1f4d:9016]
  [media] af9015: map DigitalNow TinyTwin v2 remote
  [media] DigitalNow TinyTwin remote controller
  [media] af9015: RC fixes and improvements
  videodev2.h.xml: Update to reflect the latest changes at videodev2.h
  [media] v4l: document new Bayer and monochrome pixel formats
  [media] DocBook/v4l: Add missing formats used on gspca cpia1 and sn9c2028
  [media] firedtv: add parameter to fake ca_system_ids in CA_INFO
  [media] tm6000: fix a macro coding style issue
  tm6000: Remove some ugly debug code
  [media] Nova-S-Plus audio line input
  [media] [RFC,1/1] V4L2: Use new CAP bits in existing RDS capable drivers
  ...
2010-10-28 09:35:11 -07:00
Linus Torvalds
00ebb6382b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (66 commits)
  mmc: add new sdhci-pxa driver for Marvell SoCs
  mmc: make number of mmcblk minors configurable
  mmc_spi: Recover from CRC errors for r/w operation over SPI.
  mmc: sdhci-pltfm: add -pltfm driver for imx35/51
  mmc: sdhci-of-esdhc: factor out common stuff
  mmc: sdhci_pltfm: pass more data on custom init call
  mmc: sdhci: introduce get_ro private write-protect hook
  mmc: sdhci-pltfm: move .h file into appropriate subdir
  mmc: sdhci-pltfm: Add structure for host-specific data
  mmc: fix cb710 kconfig dependency warning
  mmc: cb710: remove debugging printk (info duplicated from mmc-core)
  mmc: cb710: clear irq handler on init() error path
  mmc: cb710: remove unnecessary msleep()
  mmc: cb710: implement get_cd() callback
  mmc: cb710: partially demystify clock selection
  mmc: add a file to debugfs for changing host clock at runtime
  mmc: sdhci: allow for eMMC 74 clock generation by controller
  mmc: sdhci: highspeed: check for mmc as well as sd cards
  mmc: sdhci: Add Moorestown device support
  mmc: sdhci: Intel Medfield support
  ...
2010-10-28 09:33:42 -07:00
Linus Torvalds
90a2b69c14 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: (28 commits)
  net/9p: Return error on read with NULL buffer
  9p: Add datasync to client side TFSYNC/RFSYNC for dotl
  net/9p: Return error if we fail to encode protocol data
  fs/9p: Use generic_file_open with lookup_instantiate_filp
  fs/9p: Add missing iput in v9fs_vfs_lookup
  fs/9p: Use mknod 9p operation on create without open request
  net/9p: Add waitq to VirtIO transport.
  [net/9p]Serialize virtqueue operations to make VirtIO transport SMP safe.
  9p: Implement TREADLINK operation for 9p2000.L
  9p: Use V9FS_MAGIC in statfs
  9p: Implement TGETLOCK
  9p: Implement TLOCK
  [9p] Introduce client side TFSYNC/RFSYNC for dotl.
  [fs/9p] Add file_operations for cached mode in dotl protocol.
  fs/9p: Add access = client option to opt in acl evaluation.
  fs/9p: Implement create time inheritance
  fs/9p: Update ACL on chmod
  fs/9p: Implement setting posix acl
  fs/9p: Add xattr callbacks for POSIX ACL
  fs/9p: Implement POSIX ACL permission checking function
  ...
2010-10-28 09:25:11 -07:00
Figo.zhang
e732ff7077 mmu_notifier.h: fix comment spelling
Signed-off-by: Figo.zhang <figo1802@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28 09:02:15 -07:00
Ingo Molnar
b31d42a5af Fix compile brekage with !CONFIG_BLOCK
Today's git tree fails to build on !CONFIG_BLOCK, due to upstream commit
367a51a339 ("fs: Add FITRIM ioctl"):

 include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’
 include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’
 include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’

The commit adds uint64_t type usage to fs.h, but linux/types.h is not included
explicitly - it's only included implicitly via linux/blk_types.h, and there only if
CONFIG_BLOCK is enabled.

Add the explicit #include to fix this.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28 09:02:15 -07:00
Venkateswararao Jujjuri (JV)
b165d60145 9p: Add datasync to client side TFSYNC/RFSYNC for dotl
SYNOPSIS
    size[4] Tfsync tag[2] fid[4] datasync[4]

    size[4] Rfsync tag[2]

DESCRIPTION

    The Tfsync transaction transfers ("flushes") all modified in-core data of
    file identified by fid to the disk device (or other  permanent  storage
    device)  where that  file  resides.

    If datasync flag is specified data will be fleshed but does not flush
    modified metadata unless  that  metadata  is  needed  in order to allow a
    subsequent data retrieval to be correctly handled.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:49 -05:00
M. Mohan Kumar
329176cc2c 9p: Implement TREADLINK operation for 9p2000.L
Synopsis

	size[4] TReadlink tag[2] fid[4]
	size[4] RReadlink tag[2] target[s]

Description
	Readlink is used to return the contents of the symoblic link
        referred by fid. Contents of symboic link is returned as a
        response.

	target[s] - Contents of the symbolic link referred by fid.

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:48 -05:00
M. Mohan Kumar
368c09d2a3 9p: Use V9FS_MAGIC in statfs
Use V9FS_MAGIC as the file system type while filling kernel statfs
strucutre instead of using host file system magic number. Also move
the definition of V9FS_MAGIC from v9fs.h to standard magic.h file.

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:47 -05:00
M. Mohan Kumar
1d769cd192 9p: Implement TGETLOCK
Synopsis

    size[4] TGetlock tag[2] fid[4] getlock[n]
    size[4] RGetlock tag[2] getlock[n]

Description

TGetlock is used to test for the existence of byte range posix locks on a file
identified by given fid. The reply contains getlock structure. If the lock could
be placed it returns F_UNLCK in type field of getlock structure.  Otherwise it
returns the details of the conflicting locks in the getlock structure

    getlock structure:
      type[1] - Type of lock: F_RDLCK, F_WRLCK
      start[8] - Starting offset for lock
      length[8] - Number of bytes to check for the lock
             If length is 0, check for lock in all bytes starting at the location
            'start' through to the end of file
      pid[4] - PID of the process that wants to take lock/owns the task
               in case of reply
      client[4] - Client id of the system that owns the process which
                  has the conflicting lock

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:47 -05:00
M. Mohan Kumar
a099027c77 9p: Implement TLOCK
Synopsis

    size[4] TLock tag[2] fid[4] flock[n]
    size[4] RLock tag[2] status[1]

Description

Tlock is used to acquire/release byte range posix locks on a file
identified by given fid. The reply contains status of the lock request

    flock structure:
        type[1] - Type of lock: F_RDLCK, F_WRLCK, F_UNLCK
        flags[4] - Flags could be either of
          P9_LOCK_FLAGS_BLOCK - Blocked lock request, if there is a
            conflicting lock exists, wait for that lock to be released.
          P9_LOCK_FLAGS_RECLAIM - Reclaim lock request, used when client is
            trying to reclaim a lock after a server restrart (due to crash)
        start[8] - Starting offset for lock
        length[8] - Number of bytes to lock
          If length is 0, lock all bytes starting at the location 'start'
          through to the end of file
        pid[4] - PID of the process that wants to take lock
        client_id[4] - Unique client id

        status[1] - Status of the lock request, can be
          P9_LOCK_SUCCESS(0), P9_LOCK_BLOCKED(1), P9_LOCK_ERROR(2) or
          P9_LOCK_GRACE(3)
          P9_LOCK_SUCCESS - Request was successful
          P9_LOCK_BLOCKED - A conflicting lock is held by another process
          P9_LOCK_ERROR - Error while processing the lock request
          P9_LOCK_GRACE - Server is in grace period, it can't accept new lock
            requests in this period (except locks with
            P9_LOCK_FLAGS_RECLAIM flag set)

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:47 -05:00
Venkateswararao Jujjuri (JV)
920e65dc69 [9p] Introduce client side TFSYNC/RFSYNC for dotl.
SYNOPSIS
    size[4] Tfsync tag[2] fid[4]

    size[4] Rfsync tag[2]

DESCRIPTION

The Tfsync transaction transfers ("flushes") all modified in-core data of
file identified by fid to the disk device (or other  permanent  storage
device)  where that  file  resides.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:47 -05:00
Arun R Bharadwaj
4f7ebe8072 net/9p: This patch implements TLERROR/RLERROR on the 9P client.
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-10-28 09:08:45 -05:00
Theodore Ts'o
a107e5a3a4 Merge branch 'next' into upstream-merge
Conflicts:
	fs/ext4/inode.c
	fs/ext4/mballoc.c
	include/trace/events/ext4.h
2010-10-27 23:44:47 -04:00
Theodore Ts'o
a269029d0e ext4,jbd2: convert tracepoints to use major/minor numbers
Unfortunately perf can't deal with anything other than direct structure
accesses in the TP_printk() section.  It will drop dead when it sees
jbd2_dev_to_name() in the "print fmt" section of the tracepoint.

Addresses-Google-Bug: 3138508

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 22:08:50 -04:00
Linus Torvalds
e3e1288e86 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (48 commits)
  DMAENGINE: move COH901318 to arch_initcall
  dma: imx-dma: fix signedness bug
  dma/timberdale: simplify conditional
  ste_dma40: remove channel_type
  ste_dma40: remove enum for endianess
  ste_dma40: remove TIM_FOR_LINK option
  ste_dma40: move mode_opt to separate config
  ste_dma40: move channel mode to a separate field
  ste_dma40: move priority to separate field
  ste_dma40: add variable to indicate valid dma_cfg
  async_tx: make async_tx channel switching opt-in
  move async raid6 test to lib/Kconfig.debug
  dmaengine: Add Freescale i.MX1/21/27 DMA driver
  intel_mid_dma: change the slave interface
  intel_mid_dma: fix the WARN_ONs
  intel_mid_dma: Add sg list support to DMA driver
  intel_mid_dma: Allow DMAC2 to share interrupt
  intel_mid_dma: Allow IRQ sharing
  intel_mid_dma: Add runtime PM support
  DMAENGINE: define a dummy filter function for ste_dma40
  ...
2010-10-27 19:04:36 -07:00
Linus Torvalds
bdab225015 Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300: (44 commits)
  MN10300: Save frame pointer in thread_info struct rather than global var
  MN10300: Change "Matsushita" to "Panasonic".
  MN10300: Create a defconfig for the ASB2364 board
  MN10300: Update the ASB2303 defconfig
  MN10300: ASB2364: Add support for SMSC911X and SMC911X
  MN10300: ASB2364: Handle the IRQ multiplexer in the FPGA
  MN10300: Generic time support
  MN10300: Specify an ELF HWCAP flag for MN10300 Atomic Operations Unit support
  MN10300: Map userspace atomic op regs as a vmalloc page
  MN10300: And Panasonic AM34 subarch and implement SMP
  MN10300: Delete idle_timestamp from irq_cpustat_t
  MN10300: Make various interrupt priority settings configurable
  MN10300: Optimise do_csum()
  MN10300: Implement atomic ops using atomic ops unit
  MN10300: Make the FPU operate in non-lazy mode under SMP
  MN10300: SMP TLB flushing
  MN10300: Use the [ID]PTEL2 registers rather than [ID]PTEL for TLB control
  MN10300: Make the use of PIDR to mark TLB entries controllable
  MN10300: Rename __flush_tlb*() to local_flush_tlb*()
  MN10300: AM34 erratum requires MMUCTR read and write on exception entry
  ...
2010-10-27 18:53:26 -07:00
Linus Torvalds
a042e26137 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (50 commits)
  perf python scripting: Add futex-contention script
  perf python scripting: Fixup cut'n'paste error in sctop script
  perf scripting: Shut up 'perf record' final status
  perf record: Remove newline character from perror() argument
  perf python scripting: Support fedora 11 (audit 1.7.17)
  perf python scripting: Improve the syscalls-by-pid script
  perf python scripting: print the syscall name on sctop
  perf python scripting: Improve the syscalls-counts script
  perf python scripting: Improve the failed-syscalls-by-pid script
  kprobes: Remove redundant text_mutex lock in optimize
  x86/oprofile: Fix uninitialized variable use in debug printk
  tracing: Fix 'faild' -> 'failed' typo
  perf probe: Fix format specified for Dwarf_Off parameter
  perf trace: Fix detection of script extension
  perf trace: Use $PERF_EXEC_PATH in canned report scripts
  perf tools: Document event modifiers
  perf tools: Remove direct slang.h include
  perf_events: Fix for transaction recovery in group_sched_in()
  perf_events: Revert: Fix transaction recovery in group_sched_in()
  perf, x86: Use NUMA aware allocations for PEBS/BTS/DS allocations
  ...
2010-10-27 18:48:00 -07:00
Linus Torvalds
17bb51d56c Merge branch 'akpm-incoming-2'
* akpm-incoming-2: (139 commits)
  epoll: make epoll_wait() use the hrtimer range feature
  select: rename estimate_accuracy() to select_estimate_accuracy()
  Remove duplicate includes from many files
  ramoops: use the platform data structure instead of module params
  kernel/resource.c: handle reinsertion of an already-inserted resource
  kfifo: fix kfifo_alloc() to return a signed int value
  w1: don't allow arbitrary users to remove w1 devices
  alpha: remove dma64_addr_t usage
  mips: remove dma64_addr_t usage
  sparc: remove dma64_addr_t usage
  fuse: use release_pages()
  taskstats: use real microsecond granularity for CPU times
  taskstats: split fill_pid function
  taskstats: separate taskstats commands
  delayacct: align to 8 byte boundary on 64-bit systems
  delay-accounting: reimplement -c for getdelays.c to report information on a target command
  namespaces Kconfig: move namespace menu location after the cgroup
  namespaces Kconfig: remove the cgroup device whitelist experimental tag
  namespaces Kconfig: remove pointless cgroup dependency
  namespaces Kconfig: make namespace a submenu
  ...
2010-10-27 18:42:52 -07:00
Linus Torvalds
0671b7674f Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  percpu: Remove the multi-page alignment facility
  x86-32: Allocate irq stacks seperate from percpu area
  x86-32, mm: Remove duplicated #include
  x86, printk: Get rid of <0> from stack output
  x86, kexec: Make sure to stop all CPUs before exiting the kernel
  x86/vsmp: Eliminate kconfig dependency warning
2010-10-27 18:38:55 -07:00
Theodore Ts'o
7f93cff90f ext4: fix kernel oops if the journal superblock has a non-zero j_errno
Commit 84061e0 fixed an accounting bug only to introduce the
possibility of a kernel OOPS if the journal has a non-zero j_errno
field indicating that the file system had detected a fs inconsistency.
After the journal replay, if the journal superblock indicates that the
file system has an error, this indication is transfered to the file
system and then ext4_commit_super() is called to write this to the
disk.

But since the percpu counters are now initialized after the journal
replay, the call to ext4_commit_super() will cause a kernel oops since
it needs to use the percpu counters the ext4 superblock structure.

The fix is to skip setting the ext4 free block and free inode fields
if the percpu counter has not been set.

Thanks to Ken Sumrall for reporting and analyzing the root causes of
this bug.

Addresses-Google-Bug: #3054080

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Eric Sandeen
5b41d92437 ext4: implement writeback livelock avoidance using page tagging
This is analogous to Jan Kara's commit,
f446daaea9
mm: implement writeback livelock avoidance using page tagging

but since we forked write_cache_pages, we need to reimplement
it there (and in ext4_da_writepages, since range_cyclic handling
was moved to there)

If you start a large buffered IO to a file, and then set
fsync after it, you'll find that fsync does not complete
until the other IO stops.

If you continue re-dirtying the file (say, putting dd
with conv=notrunc in a loop), when fsync finally completes
(after all IO is done), it reports via tracing that
it has written many more pages than the file contains;
in other words it has synced and re-synced pages in
the file multiple times.

This then leads to problems with our writeback_index
update, since it advances it by pages written, and
essentially sets writeback_index off the end of the
file...

With the following patch, we only sync as much as was
dirty at the time of the sync.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Lukas Czerner
367a51a339 fs: Add FITRIM ioctl
Adds an filesystem independent ioctl to allow implementation of file
system batched discard support. I takes fstrim_range structure as an
argument. fstrim_range is definec in the include/fs.h and its
definition is as follows.

struct fstrim_range {
	start;
	len;
	minlen;
}

start	- first Byte to trim
len	- number of Bytes to trim from start
minlen	- minimum extent length to trim, free extents shorter than this
	  number of Bytes will be ignored. This will be rounded up to fs
	  block size.

It is also possible to specify NULL as an argument. In this case the
arguments will set itself as follows:

start = 0;
len = ULLONG_MAX;
minlen = 0;

So it will trim the whole file system at one run.

After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:11 -04:00
Eric Sandeen
3e1e5f5016 ext4: don't use ext4_allocation_contexts for tracing
Many tracepoints were populating an ext4_allocation_context
to pass in, but this requires a slab allocation even when
tracepoints are off.  In fact, 4 of 5 of these allocations
were only for tracing.  In addition, we were only using a
small fraction of the 144 bytes of this structure for this
purpose.

We can do away with all these alloc/frees of the ac and
simply pass in the bits we care about, instead.

I tested this by turning on tracing and running through
xfstests on x86_64.  I did not actually do anything with
the trace output, however.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:07 -04:00
Eric Sandeen
4d5476164a ext4: fix oops in trace_ext4_mb_release_group_pa
Our QA reported an oops in the ext4_mb_release_group_pa tracing,
and Josef Bacik pointed out that it was because we may have a
non-null but uninitialized ac_inode in the allocation context.

I can reproduce it when running xfstests with ext4 tracepoints on, 
on a CONFIG_SLAB_DEBUG kernel.

We call trace_ext4_mb_release_group_pa from 2 places, 
ext4_mb_discard_group_preallocations and 
ext4_mb_discard_lg_preallocations

In both cases we allocate an ac as a container just for tracing (!)
and never fill in the ac_inode.  There's no reason to be assigning,
testing, or printing it as far as I can see, so just remove it from
the tracepoint.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:07 -04:00
Lukas Czerner
e6fa0be699 Add helper function for blkdev_issue_zeroout (sb_issue_discard)
This is done the same way as helper sb_issue_discard for
blkdev_issue_discard.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:04 -04:00
Linus Torvalds
22cdbd1d57 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (108 commits)
  ehea: Fixing statistics
  bonding: Fix lockdep warning after bond_vlan_rx_register()
  tunnels: Fix tunnels change rcu protection
  caif-u5500: Build config for CAIF shared mem driver
  caif-u5500: CAIF shared memory mailbox interface
  caif-u5500: CAIF shared memory transport protocol
  caif-u5500: Adding shared memory include
  drivers/isdn: delete double assignment
  drivers/net/typhoon.c: delete double assignment
  drivers/net/sb1000.c: delete double assignment
  qlcnic: define valid vlan id range
  qlcnic: reduce rx ring size
  qlcnic: fix mac learning
  ehea: fix use after free
  inetpeer: __rcu annotations
  fib_rules: __rcu annotates ctarget
  tunnels: add __rcu annotations
  net: add __rcu annotations to protocol
  ipv4: add __rcu annotations to routes.c
  qlge: bugfix: Restoring the vlan setting.
  ...
2010-10-27 18:28:00 -07:00
Wen Congyang
b853fd3648 ext4: avoid null dereference in trace_ext4_mballoc_discard
ac->inode is set to null in function ext4_mb_release_group_pa(),
and then trace_ext4_mballoc_discard(ac) is called, the kernel
will panic.

BUG: unable to handle kernel NULL pointer dereference at 000000a4
IP: [<f87e1714>] ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4]
*pdpt = 0000000000abd001 *pde = 0000000000000000
Oops: 0000 [#1] SMP

Pid: 550, comm: flush-8:16 Not tainted 2.6.36-rc1 #1 SE7320EP2/Altos G530
EIP: 0060:[<f87e1714>] EFLAGS: 00010206 CPU: 1
EIP is at ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4]
EAX: f32ac840 EBX: f3f1cf88 ECX: f32ac840 EDX: 00000000
ESI: f32ac83c EDI: f880b9d8 EBP: 00000000 ESP: f4b77ae4
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process flush-8:16 (pid: 550, ti=f4b76000 task=f613e540 task.ti=f4b76000)
Call Trace:
 [<f87f5ac1>] ? ext4_mb_release_group_pa+0x121/0x150 [ext4]
 [<f87f8356>] ? ext4_mb_discard_group_preallocations+0x336/0x400 [ext4]
 [<f87fb7f1>] ? ext4_mb_new_blocks+0x3d1/0x4f0 [ext4]
 [<c05a6c5b>] ? __make_request+0x10b/0x440
 [<f87f1fb4>] ? ext4_ext_map_blocks+0x1334/0x1980 [ext4]
 [<c04ac78a>] ? rb_reserve_next_event+0xaa/0x3b0
 [<f87d18d6>] ? ext4_map_blocks+0xd6/0x1d0 [ext4]
 [<f87d2da7>] ? mpage_da_map_blocks+0xc7/0x8a0 [ext4]
 [<c04c8a68>] ? find_get_pages_tag+0x38/0x110
 [<c04d23a5>] ? __pagevec_release+0x15/0x20
 [<f87d3ca5>] ? ext4_da_writepages+0x2b5/0x5d0 [ext4]
 [<c04cfbe0>] ? __writepage+0x0/0x30
 [<c04d0e34>] ? do_writepages+0x14/0x30
 [<c0526600>] ? writeback_single_inode+0xa0/0x240
 [<c0526971>] ? writeback_sb_inodes+0xc1/0x180
 [<c0526ab8>] ? writeback_inodes_wb+0x88/0x140
 [<c0526d7b>] ? wb_writeback+0x20b/0x320
 [<c045aca7>] ? lock_timer_base+0x27/0x50
 [<c0526fe0>] ? wb_do_writeback+0x150/0x190
 [<c05270a8>] ? bdi_writeback_thread+0x88/0x1f0
 [<c043b680>] ? complete+0x40/0x60
 [<c0527020>] ? bdi_writeback_thread+0x0/0x1f0
 [<c0469474>] ? kthread+0x74/0x80
 [<c0469400>] ? kthread+0x0/0x80
 [<c040a23e>] ? kernel_thread_helper+0x6/0x10

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:27:12 -04:00
Brian King
39e3ac2599 jbd2: Fix I/O hang in jbd2_journal_release_jbd_inode
This fixes a hang seen in jbd2_journal_release_jbd_inode
on a lot of Power 6 systems running with ext4. When we get
in the hung state, all I/O to the disk in question gets blocked
where we stay indefinitely. Looking at the task list, I can see
we are stuck in jbd2_journal_release_jbd_inode waiting on a
wake up. I added some debug code to detect this scenario and
dump additional data if we were stuck in jbd2_journal_release_jbd_inode
for longer than 30 minutes. When it hit, I was able to see that
i_flags was 0, suggesting we missed the wake up.

This patch changes i_flags to be an unsigned long, uses bit operators
to access it, and adds barriers around the accesses. Prior to applying
this patch, we were regularly hitting this hang on numerous systems
in our test environment. After applying the patch, the hangs no longer
occur.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:25:12 -04:00
Linus Torvalds
7420a8c0de Merge branch 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
  locks: turn lock_flocks into a spinlock
  fasync: re-organize fasync entry insertion to allow it under a spinlock
  locks/nfsd: allocate file lock outside of spinlock
  lockd: fix nlmsvc_notify_blocked locking
  lockd: push lock_flocks down
2010-10-27 18:13:34 -07:00
Shawn Bohrer
95aac7b1cd epoll: make epoll_wait() use the hrtimer range feature
This make epoll use hrtimers for the timeout value which prevents
epoll_wait() from timing out up to a millisecond early.

This mirrors the behavior of select() and poll().

Signed-off-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:18 -07:00