Move the include of io.h down into the #ifdef __KERNEL__ protected
region.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Now that the PCI core manages the 'name' for each individual
hotplug driver, and all drivers (except rpaphp) have been converted
to use hotplug_slot_name(), there is no need for the PCI hotplug
core to drag around its own copy of name either.
Cc: kristen.c.accardi@intel.com
Cc: matthew@wil.cx
Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
In preparation for cleaning up the various hotplug drivers
such that they don't have to manage their own 'name' parameters
anymore, we provide the following convenience functions:
pci_slot_name()
hotplug_slot_name()
These helpers will be used by individual hotplug drivers.
Cc: kristen.c.accardi@intel.com
Cc: matthew@wil.cx
Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Slot detection drivers can co-exist with hotplug drivers. The names
of the detected/claimed slots may be different depending on module
load order.
For legacy reasons, we need to allow hotplug drivers to override
the slot name if a detection driver is loaded first (and they find
the same slots).
Creating and overriding slot names should be an atomic operation,
otherwise you get a locking nightmare as various drivers race to
call pci_create_slot().
pci_create_slot() is already serialized by grabbing the pci_bus_sem.
We update the API and add a 'hotplug' param, which is:
set if the caller is a hotplug driver
NULL if the caller is a detection driver
pci_create_slot() does not actually use the 'hotplug' parameter in this
patch. A later patch will add the logic that uses it.
Cc: kristen.c.accardi@intel.com
Cc: matthew@wil.cx
Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The GPL exported symbol pci_update_slot_number has been renamed to
pci_renumber_slot. Some of the safety checks were unnecessary and
were removed.
Cc: kristen.c.accardi@intel.com
Cc: matthew@wil.cx
Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Update pci_hp_register() to take a const char *name parameter.
The motivation for this is to clean up the individual hotplug
drivers so that each one does not have to manage its own name.
The PCI core should be the place where we manage the name.
We update the interface and all callsites first, in a
"no functional change" manner, and clean up the drivers later.
Cc: kristen.c.accardi@intel.com
Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Ingo pointed out that the m32r build was broken by pci_ioremap. It looks like
some files include pci.h w/o including io.h. The latter defines ioremap_* if
present, so it makes sense to include it in pci.h now that we have pci_ioremap
there.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Sometimes, it's necessary to enable software's ability to quiesce and
reset endpoint hardware with function-level granularity, so provide
support for it.
The patch implement Function Level Reset(FLR) feature following PCI-e
spec. And this is the first step. We would add more generic method, like
D0/D3, to allow more devices support this function.
The patch contains two functions. pcie_reset_function() is the new
driver API, and, contains some action to quiesce a device. The other
function is a helper: pcie_execute_reset_function() just executes the
reset for a particular device function.
Current the usage model is in KVM. Function reset is necessary for
assigning device to a guest, or moving it between partitions.
For Function Level Reset(FLR), please refer to PCI Express spec chapter
6.6.2.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Multi-protocol adapters support different port types. Each consumer
of mlx4_core queries for supported port types; in particular mlx4_ib
can no longer assume that all physical ports belong to it. Port type
is configured through a sysfs interface. When the type of a port is
changed, all mlx4 interfaces are unregistered, and then registered
again with the new port types.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Maybe the incorrect power state is returned on the bogus bios, which
is different with the real power state. For example: the bios returns D0
state and the real power state is D3. OS expects to set the device to D0
state. In such case if OS uses the power state returned by the BIOS and
checks the device power state very strictly in power transition, the device
can't be transited to the correct power state.
So the boot option of "acpi.power_nocheck=1" is added to avoid checking
the device power in the course of device power transition.
http://bugzilla.kernel.org/show_bug.cgi?id=8049http://bugzilla.kernel.org/show_bug.cgi?id=11000
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Add support for managing MAC and VLAN filters for each port.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If present the info->archdata is copied into the dev->archdata.
Some (OpenFirmware) platforms need it.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Fix most checkpatch.pl errors and warnings. This includes replacing
spaces with tabs in many places, adding and removing spaces, and
folding long lines.
Also complete a couple prototypes to make it clearer what the
parameters represent.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
We have no users and no implementers for these transfer types so it
makes little sense to define functionality bits for them.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
i2c_get_clientdata doesn't change the i2c_client it is passed as a
parameter, so it can be constified. Same for i2c_get_adapdata.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Make clear what the class field of i2c_adapter is good for.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Some I2C bus implementations need to synchronize with external
entities, such as system firmware, which might also be programming the
same I2C bus.
In order to facilitate this add ->xfer_begin() and ->xfer_end() hooks
which are invoked around pcf_xfer().
[JD: Make these hooks optional.]
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Get maximum ethernet MTU and default MAC address from the firmware
QUERY_DEV_CAP command.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For ethernet support, we need to reserve QPs for the ethernet and
fibre channel driver. The QPs are reserved at the end of the QP
table. (This way we assure that they are aligned to their size)
We need to consider these reserved ranges in bitmap creation, so we
extend the mlx4 bitmap utility functions to allow reserved ranges at
both the bottom and the top of the range.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add type to the hid structure to distinguish to which device type
(now only mouse) we are talking to. Needed for per device type ignore
list support.
Note: this patch leaves the type as unknown for bluetooth devices,
there is not support for this in the hidp code.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Platforms like ARM Ltd's RealView require the IRQ polarity bit to be set
for the SMC9118 chip. This patch allows the dynamic configuration via
the smc911x_platdata structure.
This patch also changes the smc91x_platdata structure name to the
correct smc911x_platdata in the smc911x_drv_probe() function.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
The Intel 7300 Memory Controller supports dynamic throttling of memory which can
be used to save power when system is idle. This driver does the memory
throttling when all CPUs are idle on such a system.
Refer to "Intel 7300 Memory Controller Hub (MCH)" datasheet
for the config space description.
Signed-off-by: Andy Henroid <andrew.d.henroid@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
- Move it into a separate file; clean and streamline it
- Restructure the init code for reuse during secondary dispatch
- Support both levels (primary, secondary) of IRQ dispatch
- Use a workqueue for irq mask/unmask and trigger configuration
Code for two subchips currently share that secondary handler code.
One is the power subchip; its IRQs are now handled by this core,
courtesy of this patch. The other is the GPIO module, which will
be supported through a later patch.
There are also minor changes to the header file, mostly related
to GPIO support; nothing yet in mainline cares about those. A
few references to OMAP-specific symbols are disabled; when they
can all be removed, the TWL4030 support ceases being OMAP-specific.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Samuel Ortiz <sameo@openedhand.com>
create_rt_workqueue will create a real time prioritized workqueue.
This is needed for the conversion of stop_machine to a workqueue based
implementation.
This patch adds yet another parameter to __create_workqueue_key to tell
it that we want an rt workqueue.
However it looks like we rather should have something like "int type"
instead of singlethread, freezable and rt.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
There are a lot of one-liner uses of __setup() in the kernel: they're
cumbersome and not queryable (definitely not settable) via /sys. Yet
it's ugly to simplify them to module_param(), because by default that
inserts a prefix of the module name (usually filename).
So, introduce a "core_param". The parameter gets no prefix, but
appears in /sys/module/kernel/parameters/ (if non-zero perms arg). I
thought about using the name "core", but that's more common than
"kernel". And if you create a module called "kernel", you will die
a horrible death.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Instead of insisting each new module_param sysfs entry is unique,
handle the case where it already exists (for builtin modules).
The current code assumes that all identical prefixes are together in
the section: true for normal uses, but not necessarily so if someone
overrides MODULE_PARAM_PREFIX. More importantly, it's not true with
the new "core_param()" code which uses "kernel" as a prefix.
This simplifies the caller for the builtin case, at a slight loss of
efficiency (we do the lookup every time to see if the directory
exists).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
The kparam code tries to handle over-length parameter prefixes at
runtime. Not only would I bet this has never been tested, it's not
clear that truncating names is a good idea either.
So let's check at compile time. We need to move the #define to
moduleparam.h to do this, though.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Linus' recent catch of stack overflow in load_module lead me to look
at the code. A couple of helpers to get a section address and get
objects from a section can help clean things up a little.
(And in case you're wondering, the stack size also dropped from 328 to
284 bytes).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Separate the region hash code from raid1 so it can be shared by forthcoming
targets. Use BUG_ON() for failed async dm_io() calls.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move array_too_big to include/linux/device-mapper.h because it is
used by targets.
Remove the test from dm-raid1 as the number of mirror legs is limited
such that it can never fail. (Even for stripes it seems rather
unlikely.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This creates a self contained frontend de-allocator
for the instances where an adapter has not been
registered yet frontend de-allocation may
be required.
Signed-off-by: Darron Broad <darron@kewl.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
inode is never used on video_ioctl2. Remove it and rename the function to
__video_ioctl2. This allows its usage directly as a callback at
fops.unlocked_ioctl.
Since we still need a callback with inode to be used with fops.ioctl,
this patch adds video_ioctl2() that is just a call to __video_ioctl2().
Also, this patch adds some comments about video_ioctl2 and __video_ioctl2
usage at v4l2-ioctl.h.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
The inode parameter at v4l_compat_translate_ioctl() were just passed over several
places just to keep compatible with fops.ioctl. However, it weren't used anywere.
This patch gets hid of this unused parameter.
Cc: Laurent Pinchart <laurent.pinchart@skynet.be>
Cc: Mike Isely <isely@pobox.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Based on an older patch from Sakari Ailus.
Cc: Sakari Ailus <sakari.ailus@nokia.com>
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Export v4l2_int_device_try_attach_all. This allows initiating the
initialisation of int if device after the drivers have been registered.
Also allow drivers to call ioctls if v4l2-int-if was compiled as
module.
Signed-off-by: Sakari Ailus <sakari.ailus@nokia.com>
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Use enum v4l2_power instead of int as second argument to
vidioc_int_s_power. The new functionality is that standby state is also
recognised.
Signed-off-by: Sakari Ailus <sakari.ailus@nokia.com>
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Add 10-bit raw bayer format expanded to 16 bits. Adds also definition
for 10-bit raw bayer format dpcm-compressed to 8 bits.
Signed-off-by: Sergio Aguirre <saaguirre@ti.com>
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
vidioc_int_g_priv is used to get master's slave-related private data
structure. The structure can contain for example master's configuration
specific to slave.
Signed-off-by: Sakari Ailus <sakari.ailus@nokia.com>
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
The uncached area starts at e000.0000 and spans 1GB. Also add an IOADDR
macro to determine the bypass region to access the IO space.
Signed-off-by: Chris Zankel <chris@zankel.net>
Fix off-by-one in for_each_irq_desc_reverse().
Impact is near zero in practice, because nothing substantial wants to
iterate down to IRQ#0 - but fix it nevertheless.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* get rid of fake struct file/struct dentry in __blkdev_get()
* merge __blkdev_get() and do_open()
* get rid of flags argument of blkdev_get()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
replace open_bdev_excl/close_bdev_excl with variants taking fmode_t.
superblock gets the value used to mount it stored in sb->s_mode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remain. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Analog of blkdev_driver_ioctl() with sane arguments. For
now uses fake struct file, by the end of the series it won't
and blkdev_driver_ioctl() will become a wrapper around it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The issue is the SPU code is not holding the kernel mutex lock while
adding samples to the kernel buffer.
This patch creates per SPU buffers to hold the data. Data
is added to the buffers from in interrupt context. The data
is periodically pushed to the kernel buffer via a new Oprofile
function oprofile_put_buff(). The oprofile_put_buff() function
is called via a work queue enabling the funtion to acquire the
mutex lock.
The existing user controls for adjusting the per CPU buffer
size is used to control the size of the per SPU buffers.
Similarly, overflows of the SPU buffers are reported by
incrementing the per CPU buffer stats. This eliminates the
need to have architecture specific controls for the per SPU
buffers which is not acceptable to the OProfile user tool
maintainer.
The export of the oprofile add_event_entry() is removed as it
is no longer needed given this patch.
Note, this patch has not addressed the issue of indexing arrays
by the spu number. This still needs to be fixed as the spu
numbering is not guarenteed to be 0 to max_num_spus-1.
Signed-off-by: Carl Love <carll@us.ibm.com>
Signed-off-by: Maynard Johnson <maynardj@us.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Acked-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The 'state' file for a device reports, for example, when the device
has failed. Changes should be reported to userspace ASAP without
the possibility of blocking on low-memory. sysfs_notify does
have that possibility (as it takes a mutex which can be held
across a kmalloc) so use sysfs_notify_dirent instead.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that we have sysfs_notify_dirent, use it to notify changes
to md/array_state.
As sysfs_notify_dirent can be called in atomic context, we can
remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag.
Signed-off-by: NeilBrown <neilb@suse.de>
powerpc doesn't use the generic WARN_ON infrastructure. The newly
introduced WARN() as a result didn't print the message, this patch adds
the printk for this specific case.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The default_trigger fields of struct gpio_led and thus struct
led_classdev are pretty much always assigned from a string literal,
which means the string can't be modified. Which is fine, since there is
no reason to modify the string and in fact it never is.
But they should be marked const to prevent such code from being added,
to prevent warnings if -Wwrite-strings is used, when assigned from a
constant string other than a string literal (which produces a warning
under current kernel compiler flags), and for general good coding
practices.
Signed-off-by: Trent Piepho <tpiepho@freescale.com>
Signed-off-by: Richard Purdie <rpurdie@linux.intel.com>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (41 commits)
PCI: fix pci_ioremap_bar() on s390
PCI: fix AER capability check
PCI: use pci_find_ext_capability everywhere
PCI: remove #ifdef DEBUG around dev_dbg call
PCI hotplug: fix get_##name return value problem
PCI: document the pcie_aspm kernel parameter
PCI: introduce an pci_ioremap(pdev, barnr) function
powerpc/PCI: Add legacy PCI access via sysfs
PCI: Add ability to mmap legacy_io on some platforms
PCI: probing debug message uniformization
PCI: support PCIe ARI capability
PCI: centralize the capabilities code in probe.c
PCI: centralize the capabilities code in pci-sysfs.c
PCI: fix 64-vbit prefetchable memory resource BARs
PCI: replace cfg space size (256/4096) by macros.
PCI: use resource_size() everywhere.
PCI: use same arg names in PCI_VDEVICE comment
PCI hotplug: rpaphp: make debug var unique
PCI: use %pF instead of print_fn_descriptor_symbol() in quirks.c
PCI: fix hotplug get_##name return value problem
...
This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu
and x86/uv.
The sparseirq branch is just preliminary groundwork: no sparse IRQs are
actually implemented by this tree anymore - just the new APIs are added
while keeping the old way intact as well (the new APIs map 1:1 to
irq_desc[]). The 'real' sparse IRQ support will then be a relatively
small patch ontop of this - with a v2.6.29 merge target.
* 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits)
genirq: improve include files
intr_remapping: fix typo
io_apic: make irq_mis_count available on 64-bit too
genirq: fix name space collisions of nr_irqs in arch/*
genirq: fix name space collision of nr_irqs in autoprobe.c
genirq: use iterators for irq_desc loops
proc: fixup irq iterator
genirq: add reverse iterator for irq_desc
x86: move ack_bad_irq() to irq.c
x86: unify show_interrupts() and proc helpers
x86: cleanup show_interrupts
genirq: cleanup the sparseirq modifications
genirq: remove artifacts from sparseirq removal
genirq: revert dynarray
genirq: remove irq_to_desc_alloc
genirq: remove sparse irq code
genirq: use inline function for irq_to_desc
genirq: consolidate nr_irqs and for_each_irq_desc()
x86: remove sparse irq from Kconfig
genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n
...
* 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
fix documentation of sysrq-q really
Fix documentation of sysrq-q
timer_list: add base address to clock base
timer_list: print cpu number of clockevents device
timer_list: print real timer address
NOHZ: restart tick device from irq_enter()
NOHZ: split tick_nohz_restart_sched_tick()
NOHZ: unify the nohz function calls in irq_enter()
timers: fix itimer/many thread hang, fix
timers: fix itimer/many thread hang, v3
ntp: improve adjtimex frequency rounding
timekeeping: fix rounding problem during clock update
ntp: let update_persistent_clock() sleep
hrtimer: reorder struct hrtimer to save 8 bytes on 64bit builds
posix-timers: lock_timer: make it readable
posix-timers: lock_timer: kill the bogus ->it_id check
posix-timers: kill ->it_sigev_signo and ->it_sigev_value
posix-timers: sys_timer_create: cleanup the error handling
posix-timers: move the initialization of timer->sigq from send to create path
posix-timers: sys_timer_create: simplify and s/tasklist/rcu/
...
Fix trivial conflicts due to sysrq-q description clahes in
Documentation/sysrq.txt and drivers/char/sysrq.c
s390 doesn't have ioremap_*, so protect the definition of the new
pci_ioremap_bar function with CONFIG_HAS_IOMEM to avoid build breakage.
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The 'use pci_find_ext_capability everywhere' cleanup brought a new bug,
which makes the AER stop working. Fix it by actually using find_ext_cap
instead of just find_cap. Drop the unused config space size define while
we're at it.
Signed-off-by: Yu Zhao <yu.zhao@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Remove some open coded (and buggy) versions of pci_find_ext_capability
in favor of the real routine in the PCI core.
Tested-by: Tomasz Czernecki <czernecki@gmail.com>
Acked-by: Andrew Vasquez <andrew.vasquez@qlogic.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
A common thing in many PCI drivers is to ioremap() an entire bar. This
is a slightly fragile thing right now, needing both an address and a
size, and many driver writers do.. various things there.
This patch introduces an pci_ioremap() function taking just a PCI device
struct and the bar number as arguments, and figures this all out itself,
in one place. In addition, we can add various sanity checks to this
function (the patch already checks to make sure that the bar in question
really is a MEM bar; few to no drivers do that sort of thing).
Hopefully with this type of API we get less chance of mistakes in
drivers with ioremap() operations.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This patch adds support for PCI Express Alternative Routing-ID
Interpretation (ARI) capability.
The ARI capability extends the Function Number field of the PCI Express
Endpoint by reusing the Device Number which is otherwise hardwired to 0.
With ARI, an Endpoint can have up to 256 functions.
Signed-off-by: Yu Zhao <yu.zhao@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This cleanup makes the argument names in PCI_VDEVICE comment consistent
with those used in its definition.
Signed-off-by: Yu Zhao <yu.zhao@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This patch updates the Intel Ibex Peak (PCH) LPC and SMBus Controller
DeviceIDs.
The LPC Controller ID is set by Firmware within the range of
0x3b00-3b1f. This range is included in pci_ids.h using min and max
values, and irq.c now has code to handle the range (in lieu of 32
additions to a SWITCH statement).
The SMBus Controller ID is a fixed-value and will not change.
Signed-off-by: Seth Heasley <seth.heasley@intel.com>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
We are using 28bit pci (bus/dev/fn + 12 bits) as irq number, so the
cache for irq number should be 32 bit too.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Andrew Vasquez <andrew.vasquez@qlogic.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Many device drivers use the following sequence of statements to enable
the device to wake up the system while being in the D3_hot or D3_cold
low power state:
pci_enable_wake(pdev, PCI_D3hot, 1);
pci_enable_wake(pdev, PCI_D3cold, 1);
However, the second call is not necessary if the first one succeeds (the
ordering of the statements above doesn't matter here) and it may even be
harmful, because we are not supposed to enable PME# after the wake-up
power has been enabled for the device.
To allow drivers to overcome this problem, introduce function
pci_wake_from_d3() that will enable the device to wake up the system
from any of D3_hot and D3_cold as long as the wake-up from at least one
of them is supported.
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The driver flag dynids.use_driver_data is almost consistently not set,
and causes more problems than it solves. It was initially intended as a
flag to indicate whether a driver's usage of driver_data had been
carefully inspected and was ready for values from userspace. That audit
was never done, so most drivers just get a 0 for driver_data when new
IDs are added from userspace via sysfs. So remove the flag, allowing
drivers to see the data directly (a followon patch validates the passed
driver_data value against what the drivers expect).
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
* git://git.infradead.org/battery-2.6:
bq27x00_battery: use unaligned access helper
power_supply: fix dependency of tosa_battery
power_supply: Support for Texas Instruments BQ27200 battery managers
power_supply: Add function to return system-wide power state
pda_power: Check and handle return value of set_irq_wake
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: (26 commits)
9p: add more conservative locking
9p: fix oops in protocol stat parsing error path.
9p: fix device file handling
9p: Improve debug support
9p: eliminate depricated conv functions
9p: rework client code to use new protocol support functions
9p: remove unnecessary tag field from p9_req_t structure
9p: remove 9p fcall debug prints
9p: add new protocol support code
9p: encapsulate version function
9p: move dirread to fs layer
9p: adjust 9p vfs write operation
9p: move readn meta-function from client to fs layer
9p: consolidate read/write functions
9p: drop broken unused error path from p9_conn_create()
9p: make rpc code common and rework flush code
9p: use the rcall structure passed in the request in trans_fd read_work
9p: apply common request code to trans_fd
9p: apply common tagpool handling to trans_fd
9p: move request management to client code
...
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: use correct fs type for v4 submounts and referrals
Make nfs_file_cred more robust.
NFS: Enable NFSv4 callback server to listen on AF_INET6 sockets
* 'for-next' of git://git.o-hand.com/linux-mfd:
mfd: further unbork the ucb1400 ac97_bus dependencies
mfd: ucb1400 needs GPIO
mfd: ucb1400 sound driver uses/depends on AC97_BUS:
mfd: Don't use NO_IRQ in WM8350
mfd: update TMIO drivers to use the clock API
mfd: twl4030-core irq simplification
mfd: add base support for Dialog DA9030/DA9034 PMICs
mfd: TWL4030 core driver
mfd: support tmiofb cell on tc6393xb
mfd: add OHCI cell to tc6393xb
mfd: Fix htc-egpio compile warning
mfd: do tcb6393xb state restore on resume only if requested
mfd: provide and use setup hook for tc6393xb
mfd: update sm501 debugging/low information messages
mfd: reduce stack usage in mfd-core.c
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (112 commits)
sh: Move SH-4 CPU headers down one more level.
sh: Only build in gpio.o when CONFIG_GENERIC_GPIO is selected.
sh: Migrate common board headers to mach-common/.
sh: Move the CPU definition headers from asm/ to cpu/.
serial: sh-sci: Add support SCIF of SH7723
video: add sh_mobile_lcdc platform flags
video: remove unused sh_mobile_lcdc platform data
sh: remove consistent alloc cruft
sh: add dynamic crash base address support
sh: reduce Migo-R smc91x overruns
sh: Fix up some merge damage.
Fix debugfs_create_file's error checking method for arch/sh/mm/
Fix debugfs_create_dir's error checking method for arch/sh/kernel/
sh: ap325rxa: Add support RTC RX-8564LC in AP325RXA board
sh: Use sh7720 GPIO on magicpanelr2 board
sh: Add sh7720 pinmux code
sh: Use sh7203 GPIO on rsk7203 board
sh: Add sh7203 pinmux code
sh: Use sh7723 GPIO on AP325RXA board
sh: Add sh7723 pinmux code
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
netfilter: replace old NF_ARP calls with NFPROTO_ARP
netfilter: fix compilation error with NAT=n
netfilter: xt_recent: use proc_create_data()
netfilter: snmp nat leaks memory in case of failure
netfilter: xt_iprange: fix range inversion match
netfilter: netns: use NFPROTO_NUMPROTO instead of NUMPROTO for tables array
netfilter: ctnetlink: remove obsolete NAT dependency from Kconfig
pkt_sched: sch_generic: Fix oops in sch_teql
dccp: Port redirection support for DCCP
tcp: Fix IPv6 fallout from 'Port redirection support for TCP'
netdev: change name dropping error codes
ipvs: Update CONFIG_IP_VS_IPV6 description and help text
* git://git.infradead.org/mtd-2.6: (69 commits)
Revert "[MTD] m25p80.c code cleanup"
[MTD] [NAND] GPIO driver depends on ARM... for now.
[MTD] [NAND] sh_flctl: fix compile error
[MTD] [NOR] AT49BV6416 has swapped erase regions
[MTD] [NAND] GPIO NAND flash driver
[MTD] cmdlineparts documentation change - explain where mtd-id comes from
[MTD] cfi_cmdset_0002.c: Add Macronix CFI V1.0 TopBottom detection
[MTD] [NAND] Fix compilation warnings in drivers/mtd/nand/cs553x_nand.c
[JFFS2] Write buffer offset adjustment for NOR-ECC (Sibley) flash
[MTD] mtdoops: Fix a bug where block may not be erased
[MTD] mtdoops: Add a magic number to logged kernel oops
[MTD] mtdoops: Fix an off by one error
[JFFS2] Correct parameter names of jffs2_compress() in comments
[MTD] [NAND] sh_flctl: add support for Renesas SuperH FLCTL
[MTD] [NAND] Bug on atmel_nand HW ECC : OOB info not correctly written
[MTD] [MAPS] Remove unused variable after ROM API cleanup.
[MTD] m25p80.c extended jedec support (v2)
[MTD] remove unused mtd parameter in of_mtd_parse_partitions()
[MTD] [NAND] remove dead Kconfig associated with !CONFIG_PPC_MERGE
[MTD] [NAND] driver extension to support NAND on TQM85xx modules
...
Tejun's commit 7b595756ec made sysfs
attribute->owner unnecessary. But the field was left in the structure to
ease the merge. It's been over a year since that change and it is now
time to start killing attribute->owner along with its users - one arch at
a time!
This patch is attempt #1 to get rid of attribute->owner only for
CONFIG_X86_64 or CONFIG_X86_32 . We will deal with other arches later on
as and when possible - avr32 will be the next since that is something I
can test. Compile (make allyesconfig / make allmodconfig / custom config)
and boot tested.
akpm: the idea is that we put the declaration of sttribute.owner inside
`#ifndef CONFIG_X86'. But that proved to be too ambitious for now because
new usages kept on turning up in subsystem trees.
[akpm: remove the ifdef for now]
Signed-off-by: Parag Warudkar <parag.lkml@gmail.com>
Cc: Greg KH <greg@kroah.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: David Brownell <david-b@pacbell.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- the macros are gone
- there's no more code in this file,
LGPL + GPL = GPL,
and the code that was moved to lib/bcd.c is anyway trivial
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change various rtc related code to use the new bcd2bin/bin2bcd functions
instead of the obsolete BCD_TO_BIN/BIN_TO_BCD/BCD2BIN/BIN2BCD macros.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes the needlessly global anon_vma_cachep static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is needed during the transition to the new byteorder headers as the
swabb.h functionality will be provided from asm/byteorder.h in the new
version. To avoid breakage on arches still using the old implementation,
provide swabb.h from asm/byteorder.h as well.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes the new implementation of the byteorder helpers match the old
in how it degraded when an arch-defined version was not available:
1) swab()
- look for arch defined
- if not, use generic c version
2) swabp()
- look for arch-defined
- if not, deref pointer and use swab()
3) swabs()
- look for arch defined
- if not, use swabp
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The usage of elfcorehdr_addr has changed recently such that being set to
ELFCORE_ADDR_MAX is used by is_kdump_kernel() to indicate if the code is
executing in a kernel executed as a crash kernel.
However, arch/ia64/kernel/setup.c:reserve_elfcorehdr will rest
elfcorehdr_addr to ELFCORE_ADDR_MAX on error, which means any subsequent
calls to is_kdump_kernel() will return 0, even though they should return
1.
Ok, at this point in time there are no subsequent calls, but I think its
fair to say that there is ample scope for error or at the very least
confusion.
This patch add an extra state, ELFCORE_ADDR_ERR, which indicates that
elfcorehdr_addr was passed on the command line, and thus execution is
taking place in a crashdump kernel, but vmcore can't be used for some
reason. This is tested for using is_vmcore_usable() and set using
vmcore_unusable(). A subsequent patch makes use of this new code.
To summarise, the states that elfcorehdr_addr can now be in are as follows:
ELFCORE_ADDR_MAX: not a crashdump kernel
ELFCORE_ADDR_ERR: crashdump kernel but vmcore is unusable
any other value: crash dump kernel and vmcore is usable
Signed-off-by: Simon Horman <horms@verge.net.au>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
o elfcorehdr_addr is used by not only the code under CONFIG_PROC_VMCORE
but also by the code which is not inside CONFIG_PROC_VMCORE. For
example, is_kdump_kernel() is used by powerpc code to determine if
kernel is booting after a panic then use previous kernel's TCE table.
So even if CONFIG_PROC_VMCORE is not set in second kernel, one should be
able to correctly determine that we are booting after a panic and setup
calgary iommu accordingly.
o So remove the assumption that elfcorehdr_addr is under
CONFIG_PROC_VMCORE.
o Move definition of elfcorehdr_addr to arch dependent crash files.
(Unfortunately crash dump does not have an arch independent file
otherwise that would have been the best place).
o kexec.c is not the right place as one can Have CRASH_DUMP enabled in
second kernel without KEXEC being enabled.
o I don't see sh setup code parsing the command line for
elfcorehdr_addr. I am wondering how does vmcore interface work on sh.
Anyway, I am atleast defining elfcoredhr_addr so that compilation is not
broken on sh.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Simon Horman <horms@verge.net.au>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a kconfig option to change the /proc/PID/coredump_filter default.
Fedora has been carrying a trivial patch to change the hard-wired value for
this default, since Fedora 8. The default default can't change safely
because there are old GDB versions out there (all before 6.7) that are
confused by the core dump files created by the MMF_DUMP_ELF_HEADERS setting.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ptrace_untrace() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bitmap_scnprintf_len() is not used now, so we remove it.
Otherwise we have to maintain it and make its return
value always equal to bitmap_scnprintf()'s return value.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
seq_cpumask_list(), seq_nodemask_list() are very like seq_cpumask(),
seq_nodemask(), but they print human readable string.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocate all page_cgroup at boot and remove page_cgroup poitner from
struct page. This patch adds an interface as
struct page_cgroup *lookup_page_cgroup(struct page*)
All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.
Remove page_cgroup pointer reduces the amount of memory by
- 4 bytes per PAGE_SIZE.
- 8 bytes per PAGE_SIZE
if memory controller is disabled. (even if configured.)
On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
I think this reduction makes sense.
By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
This means
- we're not necessary to be afraid of kmalloc faiulre.
(this can happen because of gfp_mask type.)
- we can avoid calling kmalloc/kfree.
- we can avoid allocating tons of small objects which can be fragmented.
- we can know what amount of memory will be used for this extra-lru handling.
I added printk message as
"allocated %ld bytes of page_cgroup"
"please try cgroup_disable=memory option if you don't want"
maybe enough informative for users.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The choice of real/dummy declaration for cgroup_mm_owner_callbacks()
shouldn't be based on CONFIG_MM_OWNER, but on CONFIG_CGROUPS. Otherwise
kernel/exit.c fails to compile when something other than a cgroups
controller selects CONFIG_MM_OWNER
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than pre-generating the entire text for the "tasks" file each
time the file is opened, we instead just generate/update the array of
process ids and use a seq_file to report these to userspace. All open
file handles on the same "tasks" file can share a pid array, which may
be updated any time that no thread is actively reading the array. By
sharing the array, the potential for userspace to DoS the system by
opening many handles on the same "tasks" file is removed.
[Based on a patch by Lai Jiangshan, extended to use seq_file]
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
put_css_set_taskexit may be called when find_css_set is called on other
cpu. And the race will occur:
put_css_set_taskexit side find_css_set side
|
atomic_dec_and_test(&kref->refcount) |
/* kref->refcount = 0 */ |
....................................................................
| read_lock(&css_set_lock)
| find_existing_css_set
| get_css_set
| read_unlock(&css_set_lock);
....................................................................
__release_css_set |
....................................................................
| /* use a released css_set */
|
[put_css_set is the same. But in the current code, all put_css_set are
put into cgroup mutex critical region as the same as find_css_set.]
[akpm@linux-foundation.org: repair comments]
[menage@google.com: eliminate race in css_set refcounting]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the journal doesn't abort when it gets an IO error in file data blocks,
the file data corruption will spread silently. Because most of
applications and commands do buffered writes without fsync(), they don't
notice the IO error. It's scary for mission critical systems. On the
other hand, if the journal aborts whenever it gets an IO error in file
data blocks, the system will easily become inoperable. So this patch
introduces a filesystem option to determine whether it aborts the journal
or just call printk() when it gets an IO error in file data.
If you mount a ext3 fs with data_err=abort option, it aborts on file data
write error. If you mount it with data_err=ignore, it doesn't abort, just
call printk(). data_err=ignore is the default.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change lock_kernel()/unlock_kernel() to local fb mutex. Each frame buffer
instance has its own mutex.
The one line try_to_load() function is unrolled to request_module() in two
places for readability.
[righi.andrea@gmail.com: fb: fix NULL pointer BUG dereference in fb_open()]
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements a new freezer subsystem in the control groups
framework. It provides a way to stop and resume execution of all tasks in
a cgroup by writing in the cgroup filesystem.
The freezer subsystem in the container filesystem defines a file named
freezer.state. Writing "FROZEN" to the state file will freeze all tasks
in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
the cgroup. Reading will return the current state.
* Examples of usage :
# mkdir /containers/freezer
# mount -t cgroup -ofreezer freezer /containers
# mkdir /containers/0
# echo $some_pid > /containers/0/tasks
to get status of the freezer subsystem :
# cat /containers/0/freezer.state
RUNNING
to freeze all tasks in the container :
# echo FROZEN > /containers/0/freezer.state
# cat /containers/0/freezer.state
FREEZING
# cat /containers/0/freezer.state
FROZEN
to unfreeze all tasks in the container :
# echo RUNNING > /containers/0/freezer.state
# cat /containers/0/freezer.state
RUNNING
This is the basic mechanism which should do the right thing for user space
task in a simple scenario.
It's important to note that freezing can be incomplete. In that case we
return EBUSY. This means that some tasks in the cgroup are busy doing
something that prevents us from completely freezing the cgroup at this
time. After EBUSY, the cgroup will remain partially frozen -- reflected
by freezer.state reporting "FREEZING" when read. The state will remain
"FREEZING" until one of these things happens:
1) Userspace cancels the freezing operation by writing "RUNNING" to
the freezer.state file
2) Userspace retries the freezing operation by writing "FROZEN" to
the freezer.state file (writing "FREEZING" is not legal
and returns EIO)
3) The tasks that blocked the cgroup from entering the "FROZEN"
state disappear from the cgroup's set of tasks.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: export thaw_process]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the TIF_FREEZE flag is available in all architectures, extract
the refrigerator() and freeze_task() from kernel/power/process.c and make
it available to all.
The refrigerator() can now be used in a control group subsystem
implementing a control group freezer.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Matt Helsley <matthltc@us.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch series introduces a cgroup subsystem that utilizes the swsusp
freezer to freeze a group of tasks. It's immediately useful for batch job
management scripts. It should also be useful in the future for
implementing container checkpoint/restart.
The freezer subsystem in the container filesystem defines a cgroup file
named freezer.state. Reading freezer.state will return the current state
of the cgroup. Writing "FROZEN" to the state file will freeze all tasks
in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
the cgroup.
* Examples of usage :
# mkdir /containers/freezer
# mount -t cgroup -ofreezer freezer /containers
# mkdir /containers/0
# echo $some_pid > /containers/0/tasks
to get status of the freezer subsystem :
# cat /containers/0/freezer.state
RUNNING
to freeze all tasks in the container :
# echo FROZEN > /containers/0/freezer.state
# cat /containers/0/freezer.state
FREEZING
# cat /containers/0/freezer.state
FROZEN
to unfreeze all tasks in the container :
# echo RUNNING > /containers/0/freezer.state
# cat /containers/0/freezer.state
RUNNING
This patch:
The first step in making the refrigerator() available to all
architectures, even for those without power management.
The purpose of such a change is to be able to use the refrigerator() in a
new control group subsystem which will implement a control group freezer.
[akpm@linux-foundation.org: fix sparc]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Pavel Machek <pavel@suse.cz>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Tested-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently hugepage's vma has a VM_RESERVED flag in order not to be
swapped. But a VM_RESERVED vma isn't core dumped because this flag is
often used for some kernel vmas (e.g. vmalloc, sound related).
Thus hugepages are never dumped and it can't be debugged easily. Many
developers want hugepages to be included into core-dump.
However, We can't read generic VM_RESERVED area because this area is often
IO mapping area. then these area reading may change device state. it is
definitly undesiable side-effect.
So adding a hugepage specific bit to the coredump filter is better. It
will be able to hugepage core dumping and doesn't cause any side-effect to
any i/o devices.
In additional, libhugetlb use hugetlb private mapping pages as anonymous
page. Then, hugepage private mapping pages should be core dumped by
default.
Then, /proc/[pid]/core_dump_filter has two new bits.
- bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes)
- bit 6 mean hugetlb shared mapping pages are dumped or not. (default: no)
I tested by following method.
% ulimit -c unlimited
% ./crash_hugepage 50
% ./crash_hugepage 50 -p
% ls -lh
% gdb ./crash_hugepage core
%
% echo 0x43 > /proc/self/coredump_filter
% ./crash_hugepage 50
% ./crash_hugepage 50 -p
% ls -lh
% gdb ./crash_hugepage core
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include "hugetlbfs.h"
int main(int argc, char** argv){
char* p;
int ch;
int mmap_flags = MAP_SHARED;
int fd;
int nr_pages;
while((ch = getopt(argc, argv, "p")) != -1) {
switch (ch) {
case 'p':
mmap_flags &= ~MAP_SHARED;
mmap_flags |= MAP_PRIVATE;
break;
default:
/* nothing*/
break;
}
}
argc -= optind;
argv += optind;
if (argc == 0){
printf("need # of pages\n");
exit(1);
}
nr_pages = atoi(argv[0]);
if (nr_pages < 2) {
printf("nr_pages must >2\n");
exit(1);
}
fd = hugetlbfs_unlinked_fd();
p = mmap(NULL, nr_pages * gethugepagesize(),
PROT_READ|PROT_WRITE, mmap_flags, fd, 0);
sleep(2);
*(p + gethugepagesize()) = 1; /* COW */
sleep(2);
/* crash! */
*(int*)0 = 1;
return 0;
}
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron. Results are
in nanoseconds per map+touch+unmap.
threads vanilla vmap rewrite
1 14700 2900
2 33600 3000
4 49500 2800
8 70631 2900
So with a 8 cores, the rewritten version is already 25x faster.
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229
After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
trylock_buffer and unlock_buffer open and close a critical section.
Hence, we can use the lock bitops to get the desired memory ordering.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
trylock_page, unlock_page open and close a critical section. Hence,
we can use the lock bitops to get the desired memory ordering.
Also, mark trylock as likely to succeed (and remove the annotation from
callers).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Setting and clearing the page locked when inserting it into swapcache /
pagecache when it has no other references can use non-atomic page flags
operations because no other CPU may be operating on it at this time.
This saves one atomic operation when inserting a page into pagecache.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several LRU manupuration function are not used now. So they can be
removed.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow free of mlock()ed pages. This shouldn't happen, but during
developement, it occasionally did.
This patch allows us to survive that condition, while keeping the
statistics and events correct for debug.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a function to scan individual or all zones' unevictable
lists and move any pages that have become evictable onto the respective
zone's inactive list, where shrink_inactive_list() will deal with them.
Adds sysctl to scan all nodes, and per node attributes to individual
nodes' zones.
Kosaki: If evictable page found in unevictable lru when write
/proc/sys/vm/scan_unevictable_pages, print filename and file offset of
these pages.
[akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
[kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the fault paths that install new anonymous pages, check whether the
page is evictable or not using lru_cache_add_active_or_unevictable(). If
the page is evictable, just add it to the active lru list [via the pagevec
cache], else add it to the unevictable list.
This "proactive" culling in the fault path mimics the handling of mlocked
pages in Nick Piggin's series to keep mlocked pages off the lru lists.
Notes:
1) This patch is optional--e.g., if one is concerned about the
additional test in the fault path. We can defer the moving of
nonreclaimable pages until when vmscan [shrink_*_list()]
encounters them. Vmscan will only need to handle such pages
once, but if there are a lot of them it could impact system
performance.
2) The 'vma' argument to page_evictable() is require to notice that
we're faulting a page into an mlock()ed vma w/o having to scan the
page's rmap in the fault path. Culling mlock()ed anon pages is
currently the only reason for this patch.
3) We can't cull swap pages in read_swap_cache_async() because the
vma argument doesn't necessarily correspond to the swap cache
offset passed in by swapin_readahead(). This could [did!] result
in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
cull in this path.
4) Move set_pte_at() to after where we add page to lru to keep it
hidden from other tasks that might walk the page table.
We already do it in this order in do_anonymous() page. And,
these are COW'd anon pages. Is this safe?
[riel@redhat.com: undo an overzealous code cleanup]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).
Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.
[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism let's pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
because current get_user_pages() can't grab PROT_NONE pages theresore it
cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
kept on the normal LRU, since scanning them is a waste of time and might
throw off kswapd's balancing algorithms. Place them on the unevictable
LRU list instead.
Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
memory regions as unevictable. Then these pages will be culled off the
normal LRU lists during vmscan.
Add new wrapper function to clear the mapping's unevictable state when/if
shared memory segment is munlocked.
Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
the shmem segment's mapping [struct address_space] for evictability now
that they're no longer locked. If so, move them to the appropriate zone
lru list.
Changes depend on [CONFIG_]UNEVICTABLE_LRU.
[kosaki.motohiro@jp.fujitsu.com: revert shm change]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Lameter pointed out that ram disk pages also clutter the LRU
lists. When vmscan finds them dirty and tries to clean them, the ram disk
writeback function just redirties the page so that it goes back onto the
active list. Round and round she goes...
With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
longer the case, as ram disk pages are no longer maintained on the lru.
[This makes them unmigratable for defrag or memory hot remove, but that
can be addressed by a separate patch series.] However, the ramfs pages
behave like ram disk pages used to, so:
Define new address_space flag [shares address_space flags member with
mapping's gfp mask] to indicate that the address space contains all
unevictable pages. This will provide for efficient testing of ramfs pages
in page_evictable().
Also provide wrapper functions to set/test the unevictable state to
minimize #ifdefs in ramfs driver and any other users of this facility.
Set the unevictable state on address_space structures for new ramfs
inodes. Test the unevictable state in page_evictable() to cull
unevictable pages.
These changes depend on [CONFIG_]UNEVICTABLE_LRU.
[riel@redhat.com: undo the brd.c part]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix to unevictable-lru-page-statistics.patch
Add unevictable lru infrastructure vm events to the statistics patch.
Rename the "NORECL_" and "noreclaim_" symbols and text strings to
"UNEVICTABLE_" and "unevictable_", respectively.
Currently, both the infrastructure and the mlocked pages event are
added by a single patch later in the series. This makes it difficult
to add or rework the incremental patches. The events actually "belong"
with the stats, so pull them up to here.
Also, restore the event counting to putback_lru_page(). This was removed
from previous patch in series where it was "misplaced". The actual events
weren't defined that early.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages. Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable. Subsequent patches will add the various
!evictable tests. We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference. If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list. This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We avoid evicting and scanning anonymous pages for the most part, but
under some workloads we can end up with most of memory filled with
anonymous pages. At that point, we suddenly need to clear the referenced
bits on all of memory, which can take ages on very large memory systems.
We can reduce the maximum number of pages that need to be scanned by not
taking the referenced state into account when deactivating an anonymous
page. After all, every anonymous page starts out referenced, so why
check?
If an anonymous page gets referenced again before it reaches the end of
the inactive list, we move it back to the active list.
To keep the maximum amount of necessary work reasonable, we scale the
active to inactive ratio with the size of memory, using the formula
active:inactive ratio = sqrt(memory in GB * 10).
Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
instead of by the amount of memory present in the system.
[kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
[kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon"). The latter includes tmpfs.
The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.
This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.
[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Define page_file_cache() function to answer the question:
is page backed by a file?
Originally part of Rik van Riel's split-lru patch. Extracted to make
available for other, independent reclaim patches.
Moved inline function to linux/mm_inline.h where it will be needed by
subsequent "split LRU" and "noreclaim" patches.
Unfortunately this needs to use a page flag, since the PG_swapbacked state
needs to be preserved all the way to the point where the page is last
removed from the LRU. Trying to derive the status from other info in the
page resulted in wrong VM statistics in earlier split VM patchsets.
The total number of page flags in use on a 32 bit machine after this patch
is 19.
[akpm@linux-foundation.org: fix up out-of-order merge fallout]
[hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner[
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If vm_swap_full() (swap space more than 50% full), the system will free
swap space at swapin time. With this patch, the system will also free the
swap space in the pageout code, when we decide that the page is not a
candidate for swapout (and just wasting swap space).
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Turn the pagevecs into an array just like the LRUs. This significantly
cleans up the source code and reduces the size of the kernel by about 13kB
after all the LRU lists have been created further down in the split VM
patch series.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we are defining explicit variables for the inactive and active
list. An indexed array can be more generic and avoid repeating similar
code in several places in the reclaim code.
We are saving a few bytes in terms of code size:
Before:
text data bss dec hex filename
4097753 573120 4092484 8763357 85b7dd vmlinux
After:
text data bss dec hex filename
4097729 573120 4092484 8763333 85b7c5 vmlinux
Having an easy way to add new lru lists may ease future work on the
reclaim code.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
so the number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
VM does not waste CPU time scanning them. ramfs, ramdisk,
SHM_LOCKED shared memory segments and mlock()ed VMA pages
are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus
in the sched_domain. This hurts.
We need the rq-locks whenever we change the weight of the per-cpu group sched
entities. To allevate this a little, only change the weight when the new
weight is at least shares_thresh away from the old value.
This avoids the rq-lock for the top level entries, since those will never
be re-weighted, and fuzzes the lower level entries a little to gain performance
in semi-stable situations.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The netfilter families have been decoupled from regular protocol families.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add platform data flags for detailed lcd display configuration.
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Remove lddckr from the platform data, these days we calculate the
register value from clock source and clock dividers anyway.
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>