When we are short on memory, we want to expedite the cleaning of
dirty objects. Hence when we run short on memory, we need to kick
the AIL flushing into action to clean as many dirty objects as
quickly as possible. To implement this, sample the lsn of the log
item at the head of the AIL and use that as the push target for the
AIL flush.
Further, we keep items in the AIL that are dirty that are not
tracked any other way, so we can get objects sitting in the AIL that
don't get written back until the AIL is pushed. Hence to get the
filesystem to the idle state, we might need to push the AIL to flush
out any remaining dirty objects sitting in the AIL. This requires
the same push mechanism as the reclaim push.
This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to
match the new xfs_ail_max_lsn() function introduced in this patch.
Similarly for xfs_trans_ail_push -> xfs_ail_push.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
This patch rearranges the location of functions in xfs_trans_ail.c
to remove the need for forward declarations of those functions in
preparation for adding new functions without the need for forward
declarations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Similar to the xfssyncd, the per-filesystem xfsaild threads can be
converted to a global workqueue and run periodically by delayed
works. This makes sense for the AIL pushing because it uses
variable timeouts depending on the work that needs to be done.
By removing the xfsaild, we simplify the AIL pushing code and
remove the need to spread the code to implement the threading
and pushing across multiple files.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Background inode reclaim needs to run more frequently that the XFS
syncd work is run as 30s is too long between optimal reclaim runs.
Add a new periodic work item to the xfs syncd workqueue to run a
fast, non-blocking inode reclaim scan.
Background inode reclaim is kicked by the act of marking inodes for
reclaim. When an AG is first marked as having reclaimable inodes,
the background reclaim work is kicked. It will continue to run
periodically untill it detects that there are no more reclaimable
inodes. It will be kicked again when the first inode is queued for
reclaim.
To ensure shrinker based inode reclaim throttles to the inode
cleaning and reclaim rate but still reclaim inodes efficiently, make it kick the
background inode reclaim so that when we are low on memory we are
trying to reclaim inodes as efficiently as possible. This kick shoul
d not be necessary, but it will protect against failures to kick the
background reclaim when inodes are first dirtied.
To provide the rate throttling, make the shrinker pass do
synchronous inode reclaim so that it blocks on inodes under IO. This
means that the shrinker will reclaim inodes rather than just
skipping over them, but it does not adversely affect the rate of
reclaim because most dirty inodes are already under IO due to the
background reclaim work the shrinker kicked.
These two modifications solve one of the two OOM killer invocations
Chris Mason reported recently when running a stress testing script.
The particular workload trigger for the OOM killer invocation is
where there are more threads than CPUs all unlinking files in an
extremely memory constrained environment. Unlike other solutions,
this one does not have a performance impact on performance when
memory is not constrained or the number of concurrent threads
operating is <= to the number of CPUs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
On of the problems with the current inode flush at ENOSPC is that we
queue a flush per ENOSPC event, regardless of how many are already
queued. Thi can result in hundreds of queued flushes, most of
which simply burn CPU scanned and do no real work. This simply slows
down allocation at ENOSPC.
We really only need one active flush at a time, and we can easily
implement that via the new xfs_syncd_wq. All we need to do is queue
a flush if one is not already active, then block waiting for the
currently active flush to complete. The result is that we only ever
have a single ENOSPC inode flush active at a time and this greatly
reduces the overhead of ENOSPC processing.
On my 2p test machine, this results in tests exercising ENOSPC
conditions running significantly faster - 042 halves execution time,
083 drops from 60s to 5s, etc - while not introducing test
regressions.
This allows us to remove the old xfssyncd threads and infrastructure
as they are no longer used.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
All of the work xfssyncd does is background functionality. There is
no need for a thread per filesystem to do this work - it can al be
managed by a global workqueue now they manage concurrency
effectively.
Introduce a new gglobal xfssyncd workqueue, and convert the periodic
work to use this new functionality. To do this, use a delayed work
construct to schedule the next running of the periodic sync work
for the filesystem. When the sync work is complete, queue a new
delayed work for the next running of the sync work.
For laptop mode, we wait on completion for the sync works, so ensure
that the sync work queuing interface can flush and wait for work to
complete to enable the work queue infrastructure to replace the
current sequence number and wakeup that is used.
Because the sync work does non-trivial amounts of work, mark the
new work queue as CPU intensive.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
When formatting an inode item, we have to allocate a separate buffer
to hold extents when there are delayed allocation extents on the
inode and it is in extent format. The allocation size is derived
from the in-core data fork representation, which accounts for
delayed allocation extents, while the on-disk representation does
not contain any delalloc extents.
As a result of this mismatch, the allocated buffer can be far larger
than needed to hold the real extent list which, due to the fact the
inode is in extent format, is limited to the size of the literal
area of the inode. However, we can have thousands of delalloc
extents, resulting in an allocation size orders of magnitude larger
than is needed to hold all the real extents.
Fix this by limiting the size of the buffer being allocated to the
size of the literal area of the inodes in the filesystem (i.e. the
maximum size an inode fork can grow to).
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (26 commits)
mmc: SDHI should depend on SUPERH || ARCH_SHMOBILE
mmc: tmio_mmc: Move some defines into a shared header
mmc: tmio: support aggressive clock gating
mmc: tmio: fix power-mode interpretation
mmc: tmio: remove work-around for unmasked SDIO interrupts
sh: fix SDHI IO address-range
ARM: mach-shmobile: fix SDHI IO address-range
mmc: tmio: only access registers above 0xff, if available
mfd: remove now redundant sh_mobile_sdhi.h header
sh: convert boards to use linux/mmc/sh_mobile_sdhi.h
ARM: mach-shmobile: convert boards to use linux/mmc/sh_mobile_sdhi.h
mmc: tmio: convert the SDHI MMC driver from MFD to a platform driver
sh: ecovec: use the CONFIG_MMC_TMIO symbols instead of MFD
mmc: tmio: split core functionality, DMA and MFD glue
mmc: tmio: use PIO for short transfers
mmc: tmio-mmc: Improve DMA stability on sh-mobile
mmc: fix mmc_app_send_scr() for dma transfer
mmc: sdhci-esdhc: enable esdhc on imx53
mmc: sdhci-esdhc: use writel/readl as general APIs
mmc: sdhci: add the abort CMDTYPE bits definition
...
* 'frv' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-frv:
FRV: Use generic show_interrupts()
FRV: Convert genirq namespace
frv: Select GENERIC_HARDIRQS_NO_DEPRECATED
frv: Convert cpu irq_chip to new functions
frv: Convert mb93493 irq_chip to new functions
frv: Convert mb93093 irq_chip to new function
frv: Convert mb93091 irq_chip to new functions
frv: Fix typo from __do_IRQ overhaul
frv: Remove stale irq_chip.end
FRV: Do some cleanups
FRV: Missing node arg in alloc_thread_info_node() macro
NOMMU: implement access_remote_vm
NOMMU: support SMP dynamic percpu_alloc
NOMMU: percpu should use is_vmalloc_addr().
* git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-2.6-watchdog:
watchdog: softdog.c: enhancement to optionally invoke panic instead of reboot on timer expiry
watchdog: fix nv_tco section mismatch
watchdog: sp5100_tco.c: Check if firmware has set correct value in tcobase.
watchdog: Convert release_resource to release_region/release_mem_region
watchdog: s3c2410_wdt.c: Convert release_resource to release_region/release_mem_region
* 'irq-final-for-linus-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (111 commits)
gpio: ab8500: Mark broken
genirq: Remove move_*irq leftovers
genirq: Remove compat code
drivers: Final irq namespace conversion
mn10300: Use generic show_interrupts()
mn10300: Cleanup irq_desc access
mn10300: Convert genirq namespace
frv: Use generic show_interrupts()
frv: Convert genirq namespace
frv: Select GENERIC_HARDIRQS_NO_DEPRECATED
frv: Convert cpu irq_chip to new functions
frv: Convert mb93493 irq_chip to new functions
frv: Convert mb93093 irq_chip to new function
frv: Convert mb93091 irq_chip to new functions
frv: Fix typo from __do_IRQ overhaul
frv: Remove stale irq_chip.end
m68k: Convert irq function namespace
xen: Use new irq_move functions
xen: Cleanup genirq namespace
unicore32: Use generic show_interrupts()
...
This patch fixes information leakage to the userspace by initializing
the data buffer to zero.
Reported-by: Peter Huewe <huewe.external@infineon.com>
Signed-off-by: Peter Huewe <huewe.external@infineon.com>
Signed-off-by: Marcel Selhorst <m.selhorst@sirrix.com>
[ Also removed the silly "* sizeof(u8)". If that isn't 1, we have way
deeper problems than a simple multiplication can fix. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We check the pointers together but at least one of them could be invalid
due to failed allocation. Since we cannot continue if either of the two
allocations has failed, exit early by freeing them both.
Cc: <stable@kernel.org> # 38.x
Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Fix the incorrect use of igrab() inside the i_lock in NFS and Ceph‥
If we are already holding the i_lock, we have a reference to the
inode so we can safely use ihold() to gain an extra reference. This
avoids hangs due to lock recursion on the i_lock now that the
inode_lock is gone and igrab() uses the i_lock itself.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Ryan Mallon <ryan@bluewatersys.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits)
xfrm: Restrict extended sequence numbers to esp
xfrm: Check for esn buffer len in xfrm_new_ae
xfrm: Assign esn pointers when cloning a state
xfrm: Move the test on replay window size into the replay check functions
netdev: bfin_mac: document TE setting in RMII modes
drivers net: Fix declaration ordering in inline functions.
cxgb3: Apply interrupt coalescing settings to all queues
net: Always allocate at least 16 skb frags regardless of page size
ipv4: Don't ip_rt_put() an error pointer in RAW sockets.
net: fix ethtool->set_flags not intended -EINVAL return value
mlx4_en: Fix loss of promiscuity
tg3: Fix inline keyword usage
tg3: use <linux/io.h> and <linux/uaccess.h> instead <asm/io.h> and <asm/uaccess.h>
net: use CHECKSUM_NONE instead of magic number
Net / jme: Do not use legacy PCI power management
myri10ge: small rx_done refactoring
bridge: notify applications if address of bridge device changes
ipv4: Fix IP timestamp option (IPOPT_TS_PRESPEC) handling in ip_options_echo()
can: c_can: Fix tx_bytes accounting
can: c_can_platform: fix irq check in probe
...
These functions take irq_data as an argument and avoid a redundant
lookup in the sparse irq case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Converted with coccinelle.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Fix section mismatch warnings:
set_phys_range_identity() is called by __init xen_set_identity(),
so also mark set_phys_range_identity() as __init.
then:
__early_alloc_p2m() is called set_phys_range_identity(), so also mark
__early_alloc_p2m() as __init.
WARNING: arch/x86/built-in.o(.text+0x7856): Section mismatch in reference from the function __early_alloc_p2m() to the function .init.text:extend_brk()
The function __early_alloc_p2m() references
the function __init extend_brk().
This is often because __early_alloc_p2m lacks a __init
annotation or the annotation of extend_brk is wrong.
WARNING: arch/x86/built-in.o(.text+0x7967): Section mismatch in reference from the function set_phys_range_identity() to the function .init.text:extend_brk()
The function set_phys_range_identity() references
the function __init extend_brk().
This is often because set_phys_range_identity lacks a __init
annotation or the annotation of extend_brk is wrong.
[v2: Per Stephen Hemming recommonedation made __early_alloc_p2m static]
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Convert to new function names. Converted with coccinelle.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Howells <dhowells@redhat.com>
irq_chip.end got obsolete with the removal of __do_IRQ().
irq-mb93093.c even lacks an implementation, but nobody noticed that
it's broken since commit 88d6e1 in 2006.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Howells <dhowells@redhat.com>
1. frv doesn't support SMP, remove the useless SMP bits.
2. frv has its own alloc_task_struct, so define __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
(I am not sure if frv should use generic alloc_task_struct().)
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
There are two alloc_thread_info_node() macros defined (one for debugging and
one for normal). The commit that changed them most recently:
commit b6a84016bd
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue Mar 22 16:30:42 2011 -0700
Subject: mm: NUMA aware alloc_thread_info_node()
didn't add the node argument into the macro argument list for the normal macro.
This results in the following error:
kernel/fork.c:267:39: error: macro "alloc_thread_info_node" passed 2 arguments, but takes just 1
kernel/fork.c: In function 'dup_task_struct':
kernel/fork.c:267: error: 'alloc_thread_info_node' undeclared (first use in this function)
kernel/fork.c:267: error: (Each undeclared identifier is reported only once
kernel/fork.c:267: error: for each function it appears in.)
Signed-off-by: David Howells <dhowells@redhat.com>
Recent vm changes brought in a new function which the core procfs code
utilizes. So implement it for nommu systems too to avoid link failures.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Simon Horman <horms@verge.net.au>
Tested-by: Ithamar Adema <ithamar.adema@team-embedded.nl>
Acked-by: Greg Ungerer <gerg@uclinux.org>
This driver is broken in several aspects.
1) old style irq_chip functions. Sigh
2) Abuse of the unlock callback. That's not supposed to be a state
machine for evrything and some more.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
All chips converted
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
LKML-Reference: <20110206192106.601290592@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
LKML-Reference: <20110206192106.501651128@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
LKML-Reference: <20110206192106.401266547@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
LKML-Reference: <20110206192106.300303769@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
LKML-Reference: <20110206192106.203431646@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Compiles way better.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
LKML-Reference: <20110206192106.109992056@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
irq_chip.end got obsolete with the removal of __do_IRQ().
irq-mb93093.c even lacks an implementation, but nobody noticed that
it's broken since commit 88d6e1 in 2006.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
LKML-Reference: <20110206192106.011224503@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>