* git://git.linux-nfs.org/pub/linux/nfs-2.6: (51 commits)
nfs: remove nfs_put_link()
nfs-build-fix-99
git-nfs-build-fixes
Merge branch 'odirect'
NFS: alloc nfs_read/write_data as direct I/O is scheduled
NFS: Eliminate nfs_get_user_pages()
NFS: refactor nfs_direct_free_user_pages
NFS: remove user_addr, user_count, and pos from nfs_direct_req
NFS: "open code" the NFS direct write rescheduler
NFS: Separate functions for counting outstanding NFS direct I/Os
NLM: Fix reclaim races
NLM: sem to mutex conversion
locks.c: add the fl_owner to nlm_compare_locks
NFS: Display the chosen RPCSEC_GSS security flavour in /proc/mounts
NFS: Split fs/nfs/inode.c
NFS: Fix typo in nfs_do_clone_mount()
NFS: Fix compile errors introduced by referrals patches
NFSv4: Ensure that referral mounts bind to a reserved port
NFSv4: A root pathname is sent as a zero component4
NFSv4: Follow a referral
...
Backoff readahead size exponentially on I/O error.
Michael Tokarev <mjt@tls.msk.ru> described the problem as:
[QUOTE]
Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
In order to "fix" it, one have to read it and write to another CD-rom,
or something.. or just ignore the error (if it's just a skip in a video
stream). Let's assume the unreadable block is number U.
But current behavior is just insane. An application requests block
number N, which is before U. Kernel tries to read-ahead blocks N..U.
Cdrom drive tries to read it, re-read it.. for some time. Finally,
when all the N..U-1 blocks are read, kernel returns block number N
(as requested) to an application, successefully.
Now an app requests block number N+1, and kernel tries to read
blocks N+1..U+1. Retrying again as in previous step.
And so on, up to when an app requests block number U-1. And when,
finally, it requests block U, it receives read error.
So, kernel currentry tries to re-read the same failing block as
many times as the current readahead value (256 (times?) by default).
This whole process already killed my cdrom drive (I posted about it
to LKML several months ago) - literally, the drive has fried, and
does not work anymore. Ofcourse that problem was a bug in firmware
(or whatever) of the drive *too*, but.. main problem with that is
current readahead logic as described above.
[/QUOTE]
Which was confirmed by Jens Axboe <axboe@suse.de>:
[QUOTE]
For ide-cd, it tends do only end the first part of the request on a
medium error. So you may see a lot of repeats :/
[/QUOTE]
With this patch, retries are expected to be reduced from, say, 256, to 5.
[akpm@osdl.org: cleanups]
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Put short function description for read_cache_pages() on one line as needed
by kernel-doc.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The problem is that when we write to a file, the copy from userspace to
pagecache is first done with preemption disabled, so if the source address is
not immediately available the copy fails *and* *zeros* *the* *destination*.
This is a problem because a concurrent read (which admittedly is an odd thing
to do) might see zeros rather that was there before the write, or what was
there after, or some mixture of the two (any of these being a reasonable thing
to see).
If the copy did fail, it will immediately be retried with preemption
re-enabled so any transient problem with accessing the source won't cause an
error.
The first copying does not need to zero any uncopied bytes, and doing so
causes the problem. It uses copy_from_user_atomic rather than copy_from_user
so the simple expedient is to change copy_from_user_atomic to *not* zero out
bytes on failure.
The first of these two patches prepares for the change by fixing two places
which assume copy_from_user_atomic does zero the tail. The two usages are
very similar pieces of code which copy from a userspace iovec into one or more
page-cache pages. These are changed to remove the assumption.
The second patch changes __copy_from_user_inatomic* to not zero the tail.
Once these are accepted, I will look at similar patches of other architectures
where this is important (ppc, mips and sparc being the ones I can find).
This patch:
There is a problem with __copy_from_user_inatomic zeroing the tail of the
buffer in the case of an error. As it is called in atomic context, the error
may be transient, so it results in zeros being written where maybe they
shouldn't be.
In the usage in filemap, this opens a window for a well timed read to see data
(zeros) which is not consistent with any ordering of reads and writes.
Most cases where __copy_from_user_inatomic is called, a failure results in
__copy_from_user being called immediately. As long as the latter zeros the
tail, the former doesn't need to. However in *copy_from_user_iovec
implementations (in both filemap and ntfs/file), it is assumed that
copy_from_user_inatomic will zero the tail.
This patch removes that assumption, so that after this patch it will
be safe for copy_from_user_inatomic to not zero the tail.
This patch also adds some commentary to filemap.h and asm-i386/uaccess.h.
After this patch, all architectures that might disable preempt when
kmap_atomic is called need to have their __copy_from_user_inatomic* "fixed".
This includes
- powerpc
- i386
- mips
- sparc
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is redundant with check in wakeup_kswapd.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
pdflush is carefully designed to ensure that all wakeups have some
corresponding work to do - if a woken-up pdflush thread discovers that it
hasn't been given any work to do then this is considered an error.
That all broke when swsusp came along - because a timer-delivered wakeup to a
frozen pdflush thread will just get lost. This causes the pdflush thread to
get lost as well: the writeback timer is supposed to be re-armed by pdflush in
process context, but pdflush doesn't execute the callout which does this.
Fix that up by ignoring the return value from try_to_freeze(): jsut proceed,
see if we have any work pending and only go back to sleep if that is not the
case.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh clarified the role of VM_LOCKED. So we can now implement page
migration for mlocked pages.
Allow the migration of mlocked pages. This means that try_to_unmap must
unmap mlocked pages in the migration case.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hooks for calling vma specific migration functions
With this patch a vma may define a vma->vm_ops->migrate function. That
function may perform page migration on its own (some vmas may not contain page
structs and therefore cannot be handled by regular page migration. Pages in a
vma may require special preparatory treatment before migration is possible
etc) . Only mmap_sem is held when the migration function is called. The
migrate() function gets passed two sets of nodemasks describing the source and
the target of the migration. The flags parameter either contains
MPOL_MF_MOVE which means that only pages used exclusively by
the specified mm should be moved
or
MPOL_MF_MOVE_ALL which means that pages shared with other processes
should also be moved.
The migration function returns 0 on success or an error condition. An error
condition will prevent regular page migration from occurring.
On its own this patch cannot be included since there are no users for this
functionality. But it seems that the uncached allocator will need this
functionality at some point.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
AOP_TRUNCATED_PAGE victims in read_pages() belong in the LRU
Nick Piggin rightly pointed out that the introduction of AOP_TRUNCATED_PAGE
to read_pages() was wrong to leave A_T_P victim pages in the page cache but
not put them in the LRU. Failing to do so hid them from the VM.
A_T_P just means that the aop method unlocked the page rather than
performing IO. It would be very rare that the page was truncated between
the unlock and testing A_T_P. So we leave the pages in the LRU for likely
reuse soon rather than backing them back out of the page cache. We do this
by matching the behaviour before the A_T_P introduction which added pages
to the LRU regardless of what ->readpage() did.
This doesn't include the unrelated cleanup in Nick's initial fix which
changed read_pages() to return void to match its only caller's behaviour of
ignoring errors.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A process flag to indicate whether we are doing sync io is incredibly
ugly. It also causes performance problems when one does a lot of async
io and then proceeds to sync it. Part of the io will go out as async,
and the other part as sync. This causes a disconnect between the
previously submitted io and the synced io. For io schedulers such as CFQ,
this will cause us lost merges and suboptimal behaviour in scheduling.
Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
the O_DIRECT path just directly indicate that the writes are sync
by using WRITE_SYNC instead.
Signed-off-by: Jens Axboe <axboe@suse.de>
If invalidate_mapping_pages is called to invalidate a very large mapping
(e.g. a very large block device) and if the only active page in that
device is near the end (or at least, at a very large index), such as, say,
the superblock of an md array, and if that page happens to be locked when
invalidate_mapping_pages is called, then
pagevec_lookup will return this page and
as it is locked, 'next' will be incremented and pagevec_lookup
will be called again. and again. and again.
while we count from 0 upto a very large number.
We should really always set 'next' to 'page->index+1' before going around
the loop again, not just if the page isn't locked.
Cc: "Steinar H. Gunderson" <sgunderson@bigfoot.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add read_mapping_page() which is used for callers that pass
mapping->a_ops->readpage as the filler for read_cache_page. This removes
some duplication from filesystem code.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the x86 cache-bypassing copy instructions for copy_from_user().
Some performance data are
Total of GLOBAL_POWER_EVENTS (CPU cycle samples)
2.6.12.4.orig 1921587
2.6.12.4.nt 1599424
1599424/1921587=83.23% (16.77% reduction)
BSQ_CACHE_REFERENCE (L3 cache miss)
2.6.12.4.orig 57427
2.6.12.4.nt 20858
20858/57427=36.32% (63.7% reduction)
L3 cache miss reduction of __copy_from_user_ll
samples %
37408 65.1412 vmlinux __copy_from_user_ll
23 0.1103 vmlinux __copy_user_zeroing_intel_nocache
23/37408=0.061% (99.94% reduction)
Top 5 of 2.6.12.4.nt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % app name symbol name
128392 8.0274 vmlinux __copy_user_zeroing_intel_nocache
64206 4.0143 vmlinux journal_add_journal_head
59746 3.7355 vmlinux do_get_write_access
47674 2.9807 vmlinux journal_put_journal_head
46021 2.8774 vmlinux journal_dirty_metadata
pattern9-0-cpu4-0-09011728/summary.out
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples % app name symbol name
69755 4.2861 vmlinux __copy_user_zeroing_intel_nocache
55685 3.4215 vmlinux journal_add_journal_head
52371 3.2179 vmlinux __find_get_block
45504 2.7960 vmlinux journal_put_journal_head
36005 2.2123 vmlinux journal_stop
pattern9-0-cpu4-0-09011744/summary.out
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples % app name symbol name
1147 5.4994 vmlinux journal_add_journal_head
881 4.2240 vmlinux journal_dirty_data
872 4.1809 vmlinux blk_rq_map_sg
734 3.5192 vmlinux journal_commit_transaction
617 2.9582 vmlinux radix_tree_delete
pattern9-0-cpu4-0-09011731/summary.out
iozone results are
original 2.6.12.4 CPU time = 207.768 sec
cache aware CPU time = 184.783 sec
(three times run)
184.783/207.768=88.94% (11.06% reduction)
original:
pattern9-0-cpu4-0-08191720/iozone.out: CPU Utilization: Wall time 45.997 CPU time 64.527 CPU utilization 140.28 %
pattern9-0-cpu4-0-08191741/iozone.out: CPU Utilization: Wall time 46.878 CPU time 71.933 CPU utilization 153.45 %
pattern9-0-cpu4-0-08191743/iozone.out: CPU Utilization: Wall time 45.152 CPU time 71.308 CPU utilization 157.93 %
cache awre:
pattern9-0-cpu4-0-09011728/iozone.out: CPU Utilization: Wall time 44.842 CPU time 62.465 CPU utilization 139.30 %
pattern9-0-cpu4-0-09011731/iozone.out: CPU Utilization: Wall time 44.718 CPU time 59.273 CPU utilization 132.55 %
pattern9-0-cpu4-0-09011744/iozone.out: CPU Utilization: Wall time 44.367 CPU time 63.045 CPU utilization 142.10 %
Signed-off-by: Hiro Yoshioka <hyoshiok@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch inserts security_task_movememory hook calls into memory management
code to enable security modules to mediate this operation between tasks.
Since the last posting, the hook has been renamed following feedback from
Christoph Lameter.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Andi Kleen <ak@muc.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.
long move_pages(pid, number_of_pages_to_move,
addresses_of_pages[],
nodes[] or NULL,
status[],
flags);
The addresses of pages is an array of void * pointing to the
pages to be moved.
The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed instead of an array then no pages are moved but
the status array is updated. The status request may be used to determine
the page state before issuing another move_pages() to move pages.
The status array will contain the state of all individual page migration
attempts when the function terminates. The status array is only valid if
move_pages() completed successfullly.
Possible page states in status[]:
0..MAX_NUMNODES The page is now on the indicated node.
-ENOENT Page is not present
-EACCES Page is mapped by multiple processes and can only
be moved if MPOL_MF_MOVE_ALL is specified.
-EPERM The page has been mlocked by a process/driver and
cannot be moved.
-EBUSY Page is busy and cannot be moved. Try again later.
-EFAULT Invalid address (no VMA or zero page).
-ENOMEM Unable to allocate memory on target node.
-EIO Unable to write back page. The page must be written
back in order to move it since the page is dirty and the
filesystem does not provide a migration function that
would allow the moving of dirty pages.
-EINVAL A dirty page cannot be moved. The filesystem does not provide
a migration function and has no ability to write back pages.
The flags parameter indicates what types of pages to move:
MPOL_MF_MOVE Move pages that are only mapped by the process.
MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
Requires sufficient capabilities.
Possible return codes from move_pages()
-ENOENT No pages found that would require moving. All pages
are either already on the target node, not present, had an
invalid address or could not be moved because they were
mapped by multiple processes.
-EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
to migrate pages in a kernel thread.
-EPERM MPOL_MF_MOVE_ALL specified without sufficient priviledges.
or an attempt to move a process belonging to another user.
-EACCES One of the target nodes is not allowed by the current cpuset.
-ENODEV One of the target nodes is not online.
-ESRCH Process does not exist.
-E2BIG Too many pages to move.
-ENOMEM Not enough memory to allocate control array.
-EFAULT Parameters could not be accessed.
A test program for move_pages() may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
From: Christoph Lameter <clameter@sgi.com>
Detailed results for sys_move_pages()
Pass a pointer to an integer to get_new_page() that may be used to
indicate where the completion status of a migration operation should be
placed. This allows sys_move_pags() to report back exactly what happened to
each page.
Wish there would be a better way to do this. Looks a bit hacky.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of passing a list of new pages, pass a function to allocate a new
page. This allows the correct placement of MPOL_INTERLEAVE pages during page
migration. It also further simplifies the callers of migrate pages.
migrate_pages() becomes similar to migrate_pages_to() so drop
migrate_pages_to(). The batching of new page allocations becomes unnecessary.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do not leave pages on the lists passed to migrate_pages(). Seems that we will
not need any postprocessing of pages. This will simplify the handling of
pages by the callers of migrate_pages().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently migrate_pages() is mess with lots of goto. Extract two functions
from migrate_pages() and get rid of the gotos.
Plus we can just unconditionally set the locked bit on the new page since we
are the only one holding a reference. Locking is to stop others from
accessing the page once we establish references to the new page.
Remove the list_del from move_to_lru in order to have finer control over list
processing.
[akpm@osdl.org: add debug check]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes printk() under zone->lock in show_free_areas(). It can be
unsafe to call printk() under this lock, since caller can try to
allocate/free some memory and selfdeadlock on this lock. I found
allocations/freeing mem both in netconsole and serial console.
This issue was faced in reallity when meminfo was periodically printed for
debug purposes and netconsole was used.
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mm/filemap.c:
- add lots of kernel-doc;
- fix some typos and kernel-doc errors;
- drop some blank lines between function close and EXPORT_SYMBOL();
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Move comments for kmalloc to right place, currently it near __do_kmalloc
- Comments for kzalloc
- More detailed comments for kmalloc
- Appearance of "kmalloc" and "kzalloc" man pages after "make mandocs"
[rdunlap@xenotime.net: simplification]
Signed-off-by: Paul Drynoff <pauldrynoff@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initialise total_memory earlier in boot. Because if for some reason we run
page reclaim early in boot, we don't want total_memory to be zero when we use
it as a divisor.
And rename total_memory to vm_total_pages to avoid naming clashes with
architectures.
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The SLAB bootstrap code assumes that the first two kmalloc caches created
(the INDEX_AC and INDEX_L3 kmalloc caches) wont be off-slab. But due to AC
and L3 structure size increase in lockdep, one of them ended up being
off-slab, and subsequently crashing with:
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[<ffffffff80267478>] kmem_cache_alloc+0x26/0x7d
The fix is to introduce a bootstrap flag and to use it to prevent off-slab
caches being created so early during bootup.
(The calculation for off-slab caches is quite complex so i didnt want to
complicate things with introducing yet another INDEX_ calculation, the flag
approach is simpler and smaller.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are two calls to update_mmu_cache in fremap.c, both defective.
The one in install_page needs to be accompanied by lazy_mmu_prot_update
(some other cleanup time, move that into ia64 update_mmu_cache itself); and
the one in install_file_pte should be removed since the pte is not present.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that we have atomic_inc_not_zero, it's more elegant for try_to_unuse to
use that on mm_users: doesn't actually matter at present, but safer to be
sure that once mm_users has gone to 0, nothing raises it for an instant.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new VMA operation to notify a filesystem or other driver about the
MMU generating a fault because userspace attempted to write to a page
mapped through a read-only PTE.
This facility permits the filesystem or driver to:
(*) Implement storage allocation/reservation on attempted write, and so to
deal with problems such as ENOSPC more gracefully (perhaps by generating
SIGBUS).
(*) Delay making the page writable until the contents have been written to a
backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
It permits the filesystem to have some guarantee about the state of the
cache.
(*) Account and limit number of dirty pages. This is one piece of the puzzle
needed to make shared writable mapping work safely in FUSE.
Needed by cachefs (Or is it cachefiles? Or fscache? <head spins>).
At least four other groups have stated an interest in it or a desire to use
the functionality it provides: FUSE, OCFS2, NTFS and JFFS2. Also, things like
EXT3 really ought to use it to deal with the case of shared-writable mmap
encountering ENOSPC before we permit the page to be dirtied.
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
get_user_pages(.write=1, .force=1) can generate COW hits on read-only
shared mappings, this patch traps those as mkpage_write candidates and fails
to handle them the old way.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Joel Becker <Joel.Becker@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Record the node id as we mark sections for instantiation. Use this nid
during instantiation to direct allocations.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Mike Kravetz <kravetz@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Passing an invalid pointer to kfree() and kmem_cache_free() is likely to
cause bad memory corruption or even take down the whole system because the
bad pointer is likely reused immediately due to the per-CPU caches. Until
now, we don't do any verification for this if CONFIG_DEBUG_SLAB is
disabled.
As suggested by Linus, add PageSlab check to page_to_cache() and
page_to_slab() to verify pointers passed to kfree(). Also, move the
stronger check from cache_free_debugcheck() to kmem_cache_free() to ensure
the passed pointer actually belongs to the cache we're about to free the
object.
For page_to_cache() and page_to_slab(), the assertions should have
virtually no extra cost (two instructions, no data cache pressure) and for
kmem_cache_free() the overhead should be minimal.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This implements the use of migration entries to preserve ptes of file backed
pages during migration. Processes can therefore be migrated back and forth
without loosing their connection to pagecache pages.
Note that we implement the migration entries only for linear mappings.
Nonlinear mappings still require the unmapping of the ptes for migration.
And another writepage() ugliness shows up. writepage() can drop the page
lock. Therefore we have to remove migration ptes before calling writepages()
in order to avoid having migration entries point to unlocked pages.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If we install a migration entry then the rss not really decreases since the
page is just moved somewhere else. We can save ourselves the work of
decrementing and later incrementing which will just eventually cause cacheline
bouncing.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the migration entries for page migration
This modifies the migration code to use the new migration entries. It now
becomes possible to migrate anonymous pages without having to add a swap
entry.
We add a couple of new functions to replace migration entries with the proper
ptes.
We cannot take the tree_lock for migrating anonymous pages anymore. However,
we know that we hold the only remaining reference to the page when the page
count reaches 1.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rip the page migration logic out.
Remove all code that has to do with swapping during page migration.
This also guts the ability to migrate pages to swap. No one used that so lets
let it go for good.
Page migration should be a bit broken after this patch.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the fallback code into a new fallback function and make the function
behave like any other migration function. This requires retaking the lock if
pageout() drops it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change handling of address spaces.
Pass a pointer to the address space in which the page is migrated to all
migration function. This avoids repeatedly having to retrieve the address
space pointer from the page and checking it for validity. The old page
mapping will change once migration has gone to a certain step, so it is less
confusing to have the pointer always available.
Move the setting of the mapping and index for the new page into
migrate_pages().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extract try_to_unmap and rename remove_references -> move_mapping
try_to_unmap() may significantly change the page state by for example setting
the dirty bit. It is therefore best to unmap in migrate_pages() before
calling any migration functions.
migrate_page_remove_references() will then only move the new page in place of
the old page in the mapping. Rename the function to
migrate_page_move_mapping().
This allows us to get rid of the special unmapping for the fallback path.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Drop nr_refs parameter from migrate_page_remove_references()
The nr_refs parameter is not really useful since the number of remaining
references is always
1 for anonymous pages without a mapping
2 for pages with a mapping
3 for pages with a mapping and PagePrivate set.
Remove the early check for the number of references since we are checking
page_mapcount() earlier. Ultimately only the refcount matters after the
tree_lock has been obtained.
Signed-off-by: Christoph Lameter <clameter@sgi.coim>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the export for migrate_page_remove_references() and migrate_page_copy()
that are unlikely to be used directly by filesystems implementing migration.
The export was useful when buffer_migrate_page() lived in fs/buffer.c but it
has now been moved to migrate.c in the migration reorg.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reorder functions in migrate.c. Group all migration functions for struct
address_space_operations together.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
migrate is a better name since it is only used by page migration.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does,
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,
range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"
or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.
- All callers of ->writepages() sets range_start/end or range_cyclic.
- Fix updates of ->writeback_index. It seems already bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At present our slab debugging tells us that it detected a double-free or
corruption - it does not distinguish between them. Sometimes it's useful
to be able to differentiate between these two types of information.
Add double-free detection to redzone verification when freeing an object.
As explained by Manfred, when we are freeing an object, both redzones
should be RED_ACTIVE. However, if both are RED_INACTIVE, we are trying to
free an object that was already free'd.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With likely/unlikely profiling on my not-so-busy-typical-developmentsystem
there are 5k misses vs 2k hits. So I guess we should remove the unlikely.
Signed-off-by: Hua Zhong <hzhong@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add remap_vmalloc_range, vmalloc_user, and vmalloc_32_user so that drivers
can have a nice interface for remapping vmalloc memory.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rework the swsusp's memory shrinker in the following way:
- Simplify balance_pgdat() by removing all of the swsusp-related code
from it.
- Make shrink_all_memory() use shrink_slab() and a new function
shrink_all_zones() which calls shrink_active_list() and
shrink_inactive_list() directly for each zone in a way that's optimized
for suspend.
In shrink_all_memory() we try to free exactly as many pages as the caller
asks for, preferably in one shot, starting from easier targets. If slab
caches are huge, they are most likely to have enough pages to reclaim.
The inactive lists are next (the zones with more inactive pages go first)
etc.
Each time shrink_all_memory() attempts to shrink the active and inactive
lists for each zone in 5 passes. In the first pass, only the inactive
lists are taken into consideration. In the next two passes the active
lists are also shrunk, but mapped pages are not reclaimed. In the last
two passes the active and inactive lists are shrunk and mapped pages are
reclaimed as well. The aim of this is to alter the reclaim logic to choose
the best pages to keep on resume and improve the responsiveness of the
resumed system.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the _entry variant everywhere to clean the code up a tiny bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>