Commit Graph

16541 Commits

Author SHA1 Message Date
Kirill Tkhai
5d5a08003d rcu: Fix CONFIG_RCU_NOCB_CPU_ALL panic on machines with sparse CPU mask
Some architectures have sparse cpu mask. UltraSparc's cpuinfo for example:

CPU0: online
CPU2: online

So, set only possible CPUs when CONFIG_RCU_NOCB_CPU_ALL is enabled.

Also, check that user passes right 'rcu_nocbs=' option.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
CC: Dipankar Sarma <dipankar@in.ibm.com>
[ paulmck: Fix pr_info() issue noted by scripts/checkpatch.pl. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 14:11:11 -07:00
Paul E. McKenney
0d75292467 rcu: Have rcutiny tracepoints use tracepoint_string()
This commit extends the work done in f7f7bac9 (rcu: Have the RCU
tracepoints use the tracepoint_string infrastructure) to cover rcutiny.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
2013-09-23 09:15:31 -07:00
Paul E. McKenney
26cdfedf6a rcu: Reject memory-order-induced stall-warning false positives
If a system is idle from an RCU perspective for longer than specified
by CONFIG_RCU_CPU_STALL_TIMEOUT, and if one CPU starts a grace period
just as a second checks for CPU stalls, and if this second CPU happens
to see the old value of rsp->jiffies_stall, it will incorrectly report a
CPU stall.  This is quite rare, but apparently occurs deterministically
on systems with about 6TB of memory.

This commit therefore orders accesses to the data used to determine
whether or not a CPU stall is in progress.  Grace-period initialization
and cleanup first increments rsp->completed to mark the end of the
previous grace period, then records the current jiffies in rsp->gp_start,
then records the jiffies at which a stall can be expected to occur in
rsp->jiffies_stall, and finally increments rsp->gpnum to mark the start
of the new grace period.  Now, this ordering by itself does not prevent
false positives.  For example, if grace-period initialization was delayed
between recording rsp->gp_start and rsp->jiffies_stall, the CPU stall
warning code might still see an old value of rsp->jiffies_stall.

Therefore, this commit also orders the CPU stall warning accesses as
well, loading rsp->gpnum and jiffies, then rsp->jiffies_stall, then
rsp->gp_start, and finally rsp->completed.  This ordering means that
the false-positive scenario in the previous paragraph would result
in rsp->completed being greater than or equal to rsp->gpnum, which is
never valid for a CPU stall, allowing the false positive to be rejected.
Furthermore, any fetch that gets an old value of rsp->jiffies_stall
must also get an old value of rsp->gpnum, which will again be rejected
by the comparison of rsp->gpnum and rsp->completed.  Situations where
rsp->gp_start is later than rsp->jiffies_stall are also rejected, as
are situations where jiffies is less than rsp->jiffies_stall.

Although use of unsynchronized accesses means that there are likely
still some false-positive scenarios (synchronization has proven to be
a very bad idea on large systems), this should get rid of a large class
of these scenarios.

Reported-by: Fabian Herschel <fabian.herschel@suse.com>
Reported-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Jochen Striepe <jochen@tolot.escape.de>
2013-09-23 09:15:30 -07:00
Paul E. McKenney
69c8d28c96 rcu: Micro-optimize rcu_cpu_has_callbacks()
The for_each_rcu_flavor() loop unconditionally scans all flavors, even
when the first flavor might have some non-lazy callbacks.  Once the
loop has seen a non-lazy callback, further passes through the loop
cannot change the state.  This is not a huge problem, given that there
can be at most three RCU flavors (RCU-bh, RCU-preempt, and RCU-sched),
but this code is on the path to idle, so speeding it up even a small
amount would have some benefit.

This commit therefore does two things:

1.	Rearranges the order of the list of RCU flavors in order to
	place the most active flavor first in the list.  The most active
	RCU flavor is RCU-preempt, or, if there is no RCU-preempt,
	RCU-sched.

2.	Reworks the for_each_rcu_flavor() to exit early when the first
	non-lazy callback is seen, or, in the case where the caller
	does not care about non-lazy callbacks (RCU_FAST_NO_HZ=n),
	when the first callback is seen.

Reported-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:15:28 -07:00
Paul E. McKenney
289828e62d rcu: Silence unused-variable warnings
The "idle" variable in both rcu_eqs_enter_common() and
rcu_eqs_exit_common() is only used in a WARN_ON_ONCE().  If the kernel
is built disabling WARN_ON_ONCE(), the compiler will complain (rightly)
that "idle" is unused.  This commit therefore adds a __maybe_unused to
the declaration of both variables.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:15:27 -07:00
Christoph Lameter
c9d4b0af9e rcu: Replace __get_cpu_var() uses
__get_cpu_var() is used for multiple purposes in the kernel source. One
of them is address calculation via the form &__get_cpu_var(x). This
calculates the address for the instance of the percpu variable of the
current processor based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

__get_cpu_var() always only does an address determination. However,
store and retrieve operations could use a segment prefix (or global
register on other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into
a percpu area and use optimized assembly code to read and write per
cpu variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations
that use the offset. Thereby address calcualtions are avoided and less
registers are used when code is generated.

At the end of the patchset all uses of __get_cpu_var have been removed
so the macro is removed too.

The patchset includes passes over all arches as well. Once these
operations are used throughout then specialized macros can be defined in
non -x86 arches as well in order to optimize per cpu access by f.e. using
a global register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
   variable.

	DEFINE_PER_CPU(int, u);
	int x = __get_cpu_var(y)

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(this_cpu_ptr(&x), y, sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y)
	__get_cpu_var(y) = x;

   Converts to

	this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++

   Converts to

	this_cpu_inc(y)

Signed-off-by: Christoph Lameter <cl@linux.com>
[ paulmck: Address conflicts. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:15:22 -07:00
Paul E. McKenney
829511d8aa rcu: Fix dubious "if" condition in __call_rcu_nocb_enqueue()
This commit replaces an incorrect (but fortunately functional)
bitwise OR ("|") operator with the correct logical OR ("||").

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:13:44 -07:00
Paul E. McKenney
01896f7e0a rcu: Convert local functions to static
The rcu_cpu_stall_timeout kernel parameter, the rcu_dynticks per-CPU
variable, and the rcu_gp_fqs() function are used only locally.  This
commit therefore marks them as static.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:12:31 -07:00
Paul E. McKenney
b3f2d02598 rcu: Use proper cpp macro for ->gp_flags
One of the ->gp_flags assignments used a raw number rather than the
cpp macro that was intended for this purpose, which this commit fixes.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-20 09:43:06 -07:00
Linus Torvalds
9bf12df31f Merge git://git.kvack.org/~bcrl/aio-next
Pull aio changes from Ben LaHaise:
 "First off, sorry for this pull request being late in the merge window.
  Al had raised a couple of concerns about 2 items in the series below.
  I addressed the first issue (the race introduced by Gu's use of
  mm_populate()), but he has not provided any further details on how he
  wants to rework the anon_inode.c changes (which were sent out months
  ago but have yet to be commented on).

  The bulk of the changes have been sitting in the -next tree for a few
  months, with all the issues raised being addressed"

* git://git.kvack.org/~bcrl/aio-next: (22 commits)
  aio: rcu_read_lock protection for new rcu_dereference calls
  aio: fix race in ring buffer page lookup introduced by page migration support
  aio: fix rcu sparse warnings introduced by ioctx table lookup patch
  aio: remove unnecessary debugging from aio_free_ring()
  aio: table lookup: verify ctx pointer
  staging/lustre: kiocb->ki_left is removed
  aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
  aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
  aio: convert the ioctx list to table lookup v3
  aio: double aio_max_nr in calculations
  aio: Kill ki_dtor
  aio: Kill ki_users
  aio: Kill unneeded kiocb members
  aio: Kill aio_rw_vect_retry()
  aio: Don't use ctx->tail unnecessarily
  aio: io_cancel() no longer returns the io_event
  aio: percpu ioctx refcount
  aio: percpu reqs_available
  aio: reqs_active -> reqs_available
  aio: fix build when migration is disabled
  ...
2013-09-13 10:55:58 -07:00
Martin Schwidefsky
0244ad004a Remove GENERIC_HARDIRQ config option
After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-13 15:09:52 +02:00
Linus Torvalds
ac4de9543a Merge branch 'akpm' (patches from Andrew Morton)
Merge more patches from Andrew Morton:
 "The rest of MM.  Plus one misc cleanup"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  mm/Kconfig: add MMU dependency for MIGRATION.
  kernel: replace strict_strto*() with kstrto*()
  mm, thp: count thp_fault_fallback anytime thp fault fails
  thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  thp: do_huge_pmd_anonymous_page() cleanup
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  mm: cleanup add_to_page_cache_locked()
  thp: account anon transparent huge pages into NR_ANON_PAGES
  truncate: drop 'oldsize' truncate_pagecache() parameter
  mm: make lru_add_drain_all() selective
  memcg: document cgroup dirty/writeback memory statistics
  memcg: add per cgroup writeback pages accounting
  memcg: check for proper lock held in mem_cgroup_update_page_stat
  memcg: remove MEMCG_NR_FILE_MAPPED
  memcg: reduce function dereference
  memcg: avoid overflow caused by PAGE_ALIGN
  memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
  memcg: correct RESOURCE_MAX to ULLONG_MAX
  mm: memcg: do not trap chargers with full callstack on OOM
  mm: memcg: rework and document OOM waiting and wakeup
  ...
2013-09-12 15:44:27 -07:00
Jingoo Han
6072ddc852 kernel: replace strict_strto*() with kstrto*()
The usage of strict_strto*() is not preferred, because strict_strto*() is
obsolete.  Thus, kstrto*() should be used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Sha Zhengju
1a36e59d48 memcg: reduce function dereference
This function dereferences res far too often, so optimize it.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
3af3351676 memcg: avoid overflow caused by PAGE_ALIGN
Since PAGE_ALIGN is aligning up(the next page boundary), so after
PAGE_ALIGN, the value might be overflow, such as write the MAX value to
*.limit_in_bytes.

  $ cat /cgroup/memory/memory.limit_in_bytes
  18446744073709551615

  # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes
  bash: echo: write error: Invalid argument

Some user programs might depend on such behaviours(like libcg, we read
the value in snapshot, then use the value to reset cgroup later), and
that will cause confusion.  So we need to fix it.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju
6de5a8bfca memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Linus Torvalds
26935fb06e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 4 from Al Viro:
 "list_lru pile, mostly"

This came out of Andrew's pile, Al ended up doing the merge work so that
Andrew didn't have to.

Additionally, a few fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
  super: fix for destroy lrus
  list_lru: dynamically adjust node arrays
  shrinker: Kill old ->shrink API.
  shrinker: convert remaining shrinkers to count/scan API
  staging/lustre/libcfs: cleanup linux-mem.h
  staging/lustre/ptlrpc: convert to new shrinker API
  staging/lustre/obdclass: convert lu_object shrinker to count/scan API
  staging/lustre/ldlm: convert to shrinkers to count/scan API
  hugepage: convert huge zero page shrinker to new shrinker API
  i915: bail out earlier when shrinker cannot acquire mutex
  drivers: convert shrinkers to new count/scan API
  fs: convert fs shrinkers to new scan/count API
  xfs: fix dquot isolation hang
  xfs-convert-dquot-cache-lru-to-list_lru-fix
  xfs: convert dquot cache lru to list_lru
  xfs: rework buffer dispose list tracking
  xfs-convert-buftarg-lru-to-generic-code-fix
  xfs: convert buftarg LRU to generic code
  fs: convert inode and dentry shrinking to be node aware
  vmscan: per-node deferred work
  ...
2013-09-12 15:01:38 -07:00
Linus Torvalds
02b9735c12 ACPI and power management fixes for 3.12-rc1
1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events
 
   After the recent ACPIPHP changes we've seen some interesting breakage
   on a system that triggers device check notifications during boot for
   non-existing devices.  Although those notifications are really
   spurious, we should be able to deal with them nevertheless and that
   shouldn't introduce too much overhead.  Four commits to make that
   work properly.
 
  2) Memory hotplug and hibernation mutual exclusion rework
 
   This was maent to be a cleanup, but it happens to fix a classical
   ABBA deadlock between system suspend/hibernation and ACPI memory
   hotplug which is possible if they are started roughly at the same
   time.  Three commits rework memory hotplug so that it doesn't
   acquire pm_mutex and make hibernation use device_hotplug_lock
   which prevents it from racing with memory hotplug.
 
  3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix
 
   The ACPI LPSS driver crashes during boot on Apple Macbook Air with
   Haswell that has slightly unusual BIOS configuration in which one
   of the LPSS device's _CRS method doesn't return all of the information
   expected by the driver.  Fix from Mika Westerberg, for stable.
 
  4) ACPICA fix related to Store->ArgX operation
 
   AML interpreter fix for obscure breakage that causes AML to be
   executed incorrectly on some machines (observed in practice).  From
   Bob Moore.
 
  5) ACPI core fix for PCI ACPI device objects lookup
 
   There still are cases in which there is more than one ACPI device
   object matching a given PCI device and we don't choose the one that
   the BIOS expects us to choose, so this makes the lookup take more
   criteria into account in those cases.
 
  6) Fix to prevent cpuidle from crashing in some rare cases
 
   If the result of cpuidle_get_driver() is NULL, which can happen on
   some systems, cpuidle_driver_ref() will crash trying to use that
   pointer and the Daniel Fu's fix prevents that from happening.
 
  7) cpufreq fixes related to CPU hotplug
 
   Stephen Boyd reported a number of concurrency problems with cpufreq
   related to CPU hotplug which are addressed by a series of fixes
   from Srivatsa S Bhat and Viresh Kumar.
 
  8) cpufreq fix for time conversion in time_in_state attribute
 
   Time conversion carried out by cpufreq when user space attempts to
   read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state won't
   work correcty if cputime_t doesn't map directly to jiffies.  Fix
   from Andreas Schwab.
 
  9) Revert of a troublesome cpufreq commit
 
   Commit 7c30ed5 (cpufreq: make sure frequency transitions are
   serialized) was intended to address some known concurrency problems
   in cpufreq related to the ordering of transitions, but unfortunately
   it introduced several problems of its own, so I decided to revert it
   now and address the original problems later in a more robust way.
 
 10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.
 
 11) cpufreq fixes related to system suspend/resume
 
   The recent cpufreq changes that made it preserve CPU sysfs attributes
   over suspend/resume cycles introduced a possible NULL pointer
   dereference that caused it to crash during the second attempt to
   suspend.  Three commits from Srivatsa S Bhat fix that problem and a
   couple of related issues.
 
 12) cpufreq locking fix
 
   cpufreq_policy_restore() should acquire the lock for reading, but
   it acquires it for writing.  Fix from Lan Tianyu.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABCAAGBQJSMbdRAAoJEKhOf7ml8uNsiFkQAKSh1iBXuiUCxBApEGZgoQio
 8lmnuyWdhNQWdjZTnh7ptjpDxdrWhxcoxvoaGABU++reDObjef1QnyrQtdO3r8dl
 oy0C/YGh5kq5SIffIDEwPIb/ipDe/47cgRMW8iBlnViDa1MJBqICuLyefcTRIrKp
 QGvv0owUM2o7TXpA10+qm8zXjv6m5mu1DTtxYI+2Eodhwi54neAqb+aKMspa2thy
 V9KFcVv3Td4rJrNvw6BhXNM81QbaYpRxaK3DRr1T6SM++EKvbqYFA1jgW24YvqTL
 nrCZlDMb6KRww5DCxA/ns9Kx5H+ZyicoRwdtAM3PBYA6MGqsLqPozC/8VKV1fSvZ
 sgUdbUSuLqKRAkOqM1bjKAhi9PdCGBvkQAg2AqbRK6IBl4HJC8xhdb5E6eZ/J42G
 GyNBpKef7wVJwYKXE2hSChZ5dYjqMizNHWxFHf8Xy1dveExbQ2nmSJmaWMy2A3kx
 YOXFkcTV5F6GOIZB8WCRruzUalff9xal4G+iVhGF+AZIOCm7bC+FDXfwIS82uVor
 ej2l+uQLLZCB499IRmM6942ZIAXshmtN7eRfGtKBc6jsbSCEdQDqf1Z7oRwqAD6h
 WkD/k/zz30CyM8y4snOkAXkZgqAQsZodtqfowE3e9OHd51tfcNiqdht+obwCx+eD
 MWXc2xATMAX6NcZTXSZS
 =U/Jw
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management fixes from Rafael Wysocki:
 "All of these commits are fixes that have emerged recently and some of
  them fix bugs introduced during this merge window.

  Specifics:

   1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events

      After the recent ACPIPHP changes we've seen some interesting
      breakage on a system that triggers device check notifications
      during boot for non-existing devices.  Although those
      notifications are really spurious, we should be able to deal with
      them nevertheless and that shouldn't introduce too much overhead.
      Four commits to make that work properly.

   2) Memory hotplug and hibernation mutual exclusion rework

      This was maent to be a cleanup, but it happens to fix a classical
      ABBA deadlock between system suspend/hibernation and ACPI memory
      hotplug which is possible if they are started roughly at the same
      time.  Three commits rework memory hotplug so that it doesn't
      acquire pm_mutex and make hibernation use device_hotplug_lock
      which prevents it from racing with memory hotplug.

   3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix

      The ACPI LPSS driver crashes during boot on Apple Macbook Air with
      Haswell that has slightly unusual BIOS configuration in which one
      of the LPSS device's _CRS method doesn't return all of the
      information expected by the driver.  Fix from Mika Westerberg, for
      stable.

   4) ACPICA fix related to Store->ArgX operation

      AML interpreter fix for obscure breakage that causes AML to be
      executed incorrectly on some machines (observed in practice).
      From Bob Moore.

   5) ACPI core fix for PCI ACPI device objects lookup

      There still are cases in which there is more than one ACPI device
      object matching a given PCI device and we don't choose the one
      that the BIOS expects us to choose, so this makes the lookup take
      more criteria into account in those cases.

   6) Fix to prevent cpuidle from crashing in some rare cases

      If the result of cpuidle_get_driver() is NULL, which can happen on
      some systems, cpuidle_driver_ref() will crash trying to use that
      pointer and the Daniel Fu's fix prevents that from happening.

   7) cpufreq fixes related to CPU hotplug

      Stephen Boyd reported a number of concurrency problems with
      cpufreq related to CPU hotplug which are addressed by a series of
      fixes from Srivatsa S Bhat and Viresh Kumar.

   8) cpufreq fix for time conversion in time_in_state attribute

      Time conversion carried out by cpufreq when user space attempts to
      read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
      won't work correcty if cputime_t doesn't map directly to jiffies.
      Fix from Andreas Schwab.

   9) Revert of a troublesome cpufreq commit

      Commit 7c30ed5 (cpufreq: make sure frequency transitions are
      serialized) was intended to address some known concurrency
      problems in cpufreq related to the ordering of transitions, but
      unfortunately it introduced several problems of its own, so I
      decided to revert it now and address the original problems later
      in a more robust way.

  10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.

  11) cpufreq fixes related to system suspend/resume

      The recent cpufreq changes that made it preserve CPU sysfs
      attributes over suspend/resume cycles introduced a possible NULL
      pointer dereference that caused it to crash during the second
      attempt to suspend.  Three commits from Srivatsa S Bhat fix that
      problem and a couple of related issues.

  12) cpufreq locking fix

      cpufreq_policy_restore() should acquire the lock for reading, but
      it acquires it for writing.  Fix from Lan Tianyu"

* tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
  cpufreq: Acquire the lock in cpufreq_policy_restore() for reading
  cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu
  cpufreq: Restructure if/else block to avoid unintended behavior
  cpufreq: Fix crash in cpufreq-stats during suspend/resume
  intel_pstate: Add Haswell CPU models
  Revert "cpufreq: make sure frequency transitions are serialized"
  cpufreq: Use signed type for 'ret' variable, to store negative error values
  cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes
  cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug
  cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock
  cpufreq: Split __cpufreq_remove_dev() into two parts
  cpufreq: Fix wrong time unit conversion
  cpufreq: serialize calls to __cpufreq_governor()
  cpufreq: don't allow governor limits to be changed when it is disabled
  ACPI / bind: Prefer device objects with _STA to those without it
  ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks
  ACPI / hotplug / PCI: Use _OST to notify firmware about notify status
  ACPI / hotplug / PCI: Avoid doing too much for spurious notifies
  ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field.
  ACPI / hotplug / PCI: Don't trim devices before scanning the namespace
  ...
2013-09-12 11:22:45 -07:00
Linus Torvalds
75acebf242 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Various fixes.

  The -g perf report lockup you reported is only partially addressed,
  patches that fix the excessive runtime are still being worked on"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix uncore PCI fixed counter handling
  uprobes: Fix utask->depth accounting in handle_trampoline()
  perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING
  perf: Fix up MMAP2 buffer space reservation
  perf tools: Add attr->mmap2 support
  perf kvm: Fix sample_type manipulation
  perf evlist: Fix id pos in perf_evlist__open()
  perf trace: Handle perf.data files with no tracepoints
  perf session: Separate progress bar update when processing events
  perf trace: Check if MAP_32BIT is defined
  perf hists: Fix formatting of long symbol names
  perf evlist: Fix parsing with no sample_id_all bit set
  perf tools: Add test for parsing with no sample_id_all bit
  perf trace: Check control+C more often
2013-09-12 10:44:54 -07:00
Linus Torvalds
b55ee2816e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Performance regression fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix load balancing performance regression in should_we_balance()
2013-09-12 10:44:13 -07:00
Oleg Nesterov
878b5a6efd uprobes: Fix utask->depth accounting in handle_trampoline()
Currently utask->depth is simply the number of allocated/pending
return_instance's in uprobe_task->return_instances list.

handle_trampoline() should decrement this counter every time we
handle/free an instance, but due to typo it does this only if
->chained == T. This means that in the likely case this counter
is never decremented and the probed task can't report more than
MAX_URETPROBE_DEPTH events.

Reported-by: Mikhail Kulemin <Mikhail.Kulemin@ru.ibm.com>
Reported-by: Hemant Kumar Shaw <hkshaw@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Cc: masami.hiramatsu.pt@hitachi.com
Cc: srikar@linux.vnet.ibm.com
Cc: systemtap@sourceware.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12 08:00:55 +02:00
Kees Cook
6723734cdf panic: call panic handlers before kmsg_dump
Since the panic handlers may produce additional information (via printk)
for the kernel log, it should be reported as part of the panic output
saved by kmsg_dump().  Without this re-ordering, nothing that adds
information to a panic will show up in pstore's view when kmsg_dump runs,
and is therefore not visible to crash reporting tools that examine pstore
output.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:30 -07:00
Xishi Qiu
80c74f6a40 kexec: remove unnecessary return
Code can not run here forever, so remove the unnecessary return.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:10 -07:00
Mark Grondona
73af963f9f __ptrace_may_access() should not deny sub-threads
__ptrace_may_access() checks get_dumpable/ptrace_has_cap/etc if task !=
current, this can can lead to surprising results.

For example, a sub-thread can't readlink("/proc/self/exe") if the
executable is not readable.  setup_new_exec()->would_dump() notices that
inode_permission(MAY_READ) fails and then it does
set_dumpable(suid_dumpable).  After that get_dumpable() fails.

(It is not clear why proc_pid_readlink() checks get_dumpable(), perhaps we
could add PTRACE_MODE_NODUMPABLE)

Change __ptrace_may_access() to use same_thread_group() instead of "task
== current".  Any security check is pointless when the tasks share the
same ->mm.

Signed-off-by: Mark Grondona <mgrondona@llnl.gov>
Signed-off-by: Ben Woodard <woodard@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:01 -07:00
Heiko Carstens
af96397de8 kprobes: allow to specify custom allocator for insn caches
The current two insn slot caches both use module_alloc/module_free to
allocate and free insn slot cache pages.

For s390 this is not sufficient since there is the need to allocate insn
slots that are either within the vmalloc module area or within dma memory.

Therefore add a mechanism which allows to specify an own allocator for an
own insn slot cache.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:52 -07:00
Heiko Carstens
c802d64a35 kprobes: unify insn caches
The current kpropes insn caches allocate memory areas for insn slots
with module_alloc().  The assumption is that the kernel image and module
area are both within the same +/- 2GB memory area.

This however is not true for s390 where the kernel image resides within
the first 2GB (DMA memory area), but the module area is far away in the
vmalloc area, usually somewhere close below the 4TB area.

For new pc relative instructions s390 needs insn slots that are within
+/- 2GB of each area.  That way we can patch displacements of
pc-relative instructions within the insn slots just like x86 and
powerpc.

The module area works already with the normal insn slot allocator,
however there is currently no way to get insn slots that are within the
first 2GB on s390 (aka DMA area).

Therefore this patch set modifies the kprobes insn slot cache code in
order to allow to specify a custom allocator for the insn slot cache
pages.  In addition architecure can now have private insn slot caches
withhout the need to modify common code.

Patch 1 unifies and simplifies the current insn and optinsn caches
        implementation. This is a preparation which allows to add more
        insn caches in a simple way.

Patch 2 adds the possibility to specify a custom allocator.

Patch 3 makes s390 use the new insn slot mechanisms and adds support for
        pc-relative instructions with long displacements.

This patch (of 3):

The two insn caches (insn, and optinsn) each have an own mutex and
alloc/free functions (get_[opt]insn_slot() / free_[opt]insn_slot()).

Since there is the need for yet another insn cache which satifies dma
allocations on s390, unify and simplify the current implementation:

- Move the per insn cache mutex into struct kprobe_insn_cache.
- Move the alloc/free functions to kprobe.h so they are simply
  wrappers for the generic __get_insn_slot/__free_insn_slot functions.
  The implementation is done with a DEFINE_INSN_CACHE_OPS() macro
  which provides the alloc/free functions for each cache if needed.
- move the struct kprobe_insn_cache to kprobe.h which allows to generate
  architecture specific insn slot caches outside of the core kprobes
  code.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:52 -07:00
Oleg Nesterov
892f6668f3 task_work: documentation
No functional changes, just comments.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:27 -07:00
Oleg Nesterov
205e550a0f task_work: minor cleanups
Trivial.  Remove the unnecessary "work = NULL" initialization and turn
read_barrier_depends() into smp_read_barrier_depends() in
task_work_cancel().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:26 -07:00
David Daney
202da40057 kernel/smp.c: quit unconditionally enabling irqs in on_each_cpu_mask().
As in commit f21afc25f9 ("smp.h: Use local_irq_{save,restore}() in
!SMP version of on_each_cpu()"), we don't want to enable irqs if they
are not already enabled.

I don't know of any bugs currently caused by this unconditional
local_irq_enable(), but I want to use this function in MIPS/OCTEON early
boot (when we have early_boot_irqs_disabled).  This also makes this
function have similar semantics to on_each_cpu() which is good in
itself.

Signed-off-by: David Daney <david.daney@cavium.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:25 -07:00
Uwe Kleine-König
e656a63411 extable: skip sorting if the table is empty
At least on ARM no-MMU the extable is empty and so there is nothing to
sort. So add a check for the table to be empty which effectively only
changes that the misleading pr_notice is suppressed.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David Daney <david.daney@cavium.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:25 -07:00
David Daney
bff2dc42bc smp.h: move !SMP version of on_each_cpu() out-of-line
All of the other non-trivial !SMP versions of functions in smp.h are
out-of-line in up.c.  Move on_each_cpu() there as well.

This allows us to get rid of the #include <linux/irqflags.h>.  The
drawback is that this makes both the x86_64 and i386 defconfig !SMP
kernels about 200 bytes larger each.

Signed-off-by: David Daney <david.daney@cavium.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:25 -07:00
David Daney
081192b25c up.c: use local_irq_{save,restore}() in smp_call_function_single.
The SMP version of this function doesn't unconditionally enable irqs, so
neither should this !SMP version.  There are no know problems caused by
this, but we make the change for consistency's sake.

Signed-off-by: David Daney <david.daney@cavium.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:25 -07:00
David Daney
fa688207c9 smp: quit unconditionally enabling irq in on_each_cpu_mask and on_each_cpu_cond
As in commit f21afc25f9 ("smp.h: Use local_irq_{save,restore}() in
!SMP version of on_each_cpu()"), we don't want to enable irqs if they
are not already enabled.  There are currently no known problematical
callers of these functions, but since it is a known failure pattern, we
preemptively fix them.

Since they are not trivial functions, make them non-inline by moving
them to up.c.  This also makes it so we don't have to fix #include
dependancies for preempt_{disable,enable}.

Signed-off-by: David Daney <david.daney@cavium.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:23 -07:00
Will Deacon
c14c338cb0 kernel/spinlock.c: add default arch_*_relax definitions for GENERIC_LOCKBREAK
When running with GENERIC_LOCKBREAK=y, the locking implementations emit
calls to arch_{read,write,spin}_relax when spinning on a contended lock
in order to allow architectures to favour the CPU owning the lock if
possible.

In reality, everybody apart from PowerPC and S390 just does cpu_relax()
here, so make that the default behaviour and allow it to be overridden
if required.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:21 -07:00
Chen Gang
60c323699b kernel/smp.c: free related resources when failure occurs in hotplug_cfd()
When failure occurs in hotplug_cfd(), need release related resources, or
will cause memory leak.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Acked-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:21 -07:00
Andi Kleen
54a33b1b14 kernel/modsign_pubkey.c: fix init const for module signing code
const has to use __initconst, not __initdata

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:21 -07:00
Mathieu Desnoyers
3ddc5b46a8 kernel-wide: fix missing validations on __get/__put/__copy_to/__copy_from_user()
I found the following pattern that leads in to interesting findings:

  grep -r "ret.*|=.*__put_user" *
  grep -r "ret.*|=.*__get_user" *
  grep -r "ret.*|=.*__copy" *

The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat,
since those appear in compat code, we could probably expect the kernel
addresses not to be reachable in the lower 32-bit range, so I think they
might not be exploitable.

For the "__get_user" cases, I don't think those are exploitable: the worse
that can happen is that the kernel will copy kernel memory into in-kernel
buffers, and will fail immediately afterward.

The alpha csum_partial_copy_from_user() seems to be missing the
access_ok() check entirely.  The fix is inspired from x86.  This could
lead to information leak on alpha.  I also noticed that many architectures
map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
wonder if the latter is performing the access checks on every
architectures.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:18 -07:00
Naoya Horiguchi
86cdb465cf mm: prepare to remove /proc/sys/vm/hugepages_treat_as_movable
Now hugepage migration is enabled, although restricted on pmd-based
hugepages for now (due to lack of testing.) So we should allocate
migratable hugepages from ZONE_MOVABLE if possible.

This patch makes GFP flags in hugepage allocation dependent on migration
support, not only the value of hugepages_treat_as_movable.  It provides no
change on the behavior for architectures which do not support hugepage
migration,

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:49 -07:00
Xishi Qiu
c33bc315fd mm: use zone_end_pfn() instead of zone_start_pfn+spanned_pages
Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages".
Simplify the code, no functional change.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:36 -07:00
Oleg Nesterov
ef0855d334 mm: mempolicy: turn vma_set_policy() into vma_dup_policy()
Simple cleanup.  Every user of vma_set_policy() does the same work, this
looks a bit annoying imho.  And the new trivial helper which does
mpol_dup() + vma_set_policy() to simplify the callers.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:00 -07:00
Oleg Nesterov
40a0d32d1e fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks
do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID.

Then later copy_process() denies CLONE_SIGHAND if the new process will
be in a different pid namespace (task_active_pid_ns() doesn't match
current->nsproxy->pid_ns).

This looks confusing and inconsistent.  CLONE_NEWPID is very similar to
the case when ->pid_ns was already unshared, we want the same
restrictions so copy_process() should also nack CLONE_PARENT.

And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well
just for consistency.

Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change
copy_process() to do the same check along with ->pid_ns check we already
have.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Colin Walters <walters@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:20 -07:00
Oleg Nesterov
5167246a8a pidns: kill the unnecessary CLONE_NEWPID in copy_process()
Commit 8382fcac1b ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process
unshared pid_ns.  This is correct but unnecessary, copy_pid_ns() does
the same check.

Remove the CLONE_NEWPID check to cleanup the code and prepare for the
next change.

Test-case:

	static int child(void *arg)
	{
		return 0;
	}

	static char stack[16 * 1024];

	int main(void)
	{
		pid_t pid;

		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);

		pid = clone(child, stack + sizeof(stack) / 2,
				CLONE_NEWPID | SIGCHLD, NULL);
		assert(pid < 0 && errno == EINVAL);

		return 0;
	}

clone(CLONE_NEWPID) correctly fails with or without this change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Colin Walters <walters@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:19 -07:00
Oleg Nesterov
e79f525e99 pidns: fix vfork() after unshare(CLONE_NEWPID)
Commit 8382fcac1b ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
pid_ns, this obviously breaks vfork:

	int main(void)
	{
		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
		assert(vfork() >= 0);
		_exit(0);
		return 0;
	}

fails without this patch.

Change this check to use CLONE_SIGHAND instead.  This also forbids
CLONE_THREAD automatically, and this is what the comment implies.

We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
would be safer to not do this.  The current check denies CLONE_SIGHAND
implicitely and there is no reason to change this.

Eric said "CLONE_SIGHAND is fine.  CLONE_THREAD would be even better.
Having shared signal handling between two different pid namespaces is
the case that we are fundamentally guarding against."

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Colin Walters <walters@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:19 -07:00
Arnaldo Carvalho de Melo
d008d5258e perf: Fix up MMAP2 buffer space reservation
The ino_generation field was added in the PERF_RECORD_MMAP2 record in
the 13d7a24 cset but no space for it was allocated, corrupting the
PERF_FORMAT_{TIME,CPU,TID,etc} area (sample_type/sample_id_all), fix it.

Detected with one of the regression tests done by 'perf test':

  [root@sandy ~]# perf test -v 7
   7: Validate PERF_RECORD_* events & perf_sample fields     :
  --- start ---
  61315294449606 0 PERF_RECORD_SAMPLE
  61315294453161 0 PERF_RECORD_SAMPLE
  61315294454441 0 PERF_RECORD_SAMPLE
  61315294455709 0 PERF_RECORD_SAMPLE
  61315295600899 0 PERF_RECORD_COMM: sleep:6500
  27917287430500 342521613 PERF_RECORD_MMAP2 6500/6500: [0x400000(0x7000) @ 0 00:1d 311442 9016]: /usr/bin/sleep
  MMAP2 going backwards in time, prev=61315295600899, curr=27917287430500
  MMAP2 with unexpected cpu, expected 0, got 342521613
  MMAP2 with unexpected pid, expected 6500, got 1701606191
  MMAP2 with unexpected tid, expected 6500, got 28773
  27917287430500 342561333 PERF_RECORD_MMAP2 6500/6500: [0x3b7e000000(0x223000) @ 0 00:1d 309186 9016]: /usr/lib64/ld-2.16.so
  MMAP2 with unexpected cpu, expected 0, got 342561333
  MMAP2 with unexpected pid, expected 6500, got 1932408369
  MMAP2 with unexpected tid, expected 6500, got 111
  27917287430500 342600095 PERF_RECORD_MMAP2 6500/6500: [0x7fffbd7dc000(0x1000) @ 0x7fffbd7dc000 00:00 0 0]: [vdso]
  MMAP2 with unexpected cpu, expected 0, got 342600095
  MMAP2 with unexpected pid, expected 6500, got 1935963739
  MMAP2 with unexpected tid, expected 6500, got 23919
  27917287430500 342882834 PERF_RECORD_MMAP2 6500/6500: [0x3b7e400000(0x3b8000) @ 0 00:1d 309187 9016]: /usr/lib64/libc-2.16.so
  MMAP2 with unexpected cpu, expected 0, got 342882834
  MMAP2 with unexpected pid, expected 6500, got 909192754
  MMAP2 with unexpected tid, expected 6500, got 7303982
  61316297195411 0 PERF_RECORD_EXIT(6500:6500):(6500:6500)
  ---- end ----
  Validate PERF_RECORD_* events & perf_sample fields: FAILED!
  [root@sandy ~]#

After this patch:

  [root@sandy ~]# perf test 7
   7: Validate PERF_RECORD_* events & perf_sample fields     : Ok
  [root@sandy ~]#

Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Stephane Eranian <eranian@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-heeuv986b8ha7whqg4o3he7c@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-09-11 10:11:46 -03:00
Glauber Costa
3942c07ccf fs: bump inode and dentry counters to long
This series reworks our current object cache shrinking infrastructure in
two main ways:

 * Noticing that a lot of users copy and paste their own version of LRU
   lists for objects, we put some effort in providing a generic version.
   It is modeled after the filesystem users: dentries, inodes, and xfs
   (for various tasks), but we expect that other users could benefit in
   the near future with little or no modification.  Let us know if you
   have any issues.

 * The underlying list_lru being proposed automatically and
   transparently keeps the elements in per-node lists, and is able to
   manipulate the node lists individually.  Given this infrastructure, we
   are able to modify the up-to-now hammer called shrink_slab to proceed
   with node-reclaim instead of always searching memory from all over like
   it has been doing.

Per-node lru lists are also expected to lead to less contention in the lru
locks on multi-node scans, since we are now no longer fighting for a
global lock.  The locks usually disappear from the profilers with this
change.

Although we have no official benchmarks for this version - be our guest to
independently evaluate this - earlier versions of this series were
performance tested (details at
http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
visible performance regressions while yielding a better qualitative
behavior in NUMA machines.

With this infrastructure in place, we can use the list_lru entry point to
provide memcg isolation and per-memcg targeted reclaim.  Historically,
those two pieces of work have been posted together.  This version presents
only the infrastructure work, deferring the memcg work for a later time,
so we can focus on getting this part tested.  You can see more about the
history of such work at http://lwn.net/Articles/552769/

Dave Chinner (18):
  dcache: convert dentry_stat.nr_unused to per-cpu counters
  dentry: move to per-sb LRU locks
  dcache: remove dentries from LRU before putting on dispose list
  mm: new shrinker API
  shrinker: convert superblock shrinkers to new API
  list: add a new LRU list type
  inode: convert inode lru list to generic lru list code.
  dcache: convert to use new lru list infrastructure
  list_lru: per-node list infrastructure
  shrinker: add node awareness
  fs: convert inode and dentry shrinking to be node aware
  xfs: convert buftarg LRU to generic code
  xfs: rework buffer dispose list tracking
  xfs: convert dquot cache lru to list_lru
  fs: convert fs shrinkers to new scan/count API
  drivers: convert shrinkers to new count/scan API
  shrinker: convert remaining shrinkers to count/scan API
  shrinker: Kill old ->shrink API.

Glauber Costa (7):
  fs: bump inode and dentry counters to long
  super: fix calculation of shrinkable objects for small numbers
  list_lru: per-node API
  vmscan: per-node deferred work
  i915: bail out earlier when shrinker cannot acquire mutex
  hugepage: convert huge zero page shrinker to new shrinker API
  list_lru: dynamically adjust node arrays

This patch:

There are situations in very large machines in which we can have a large
quantity of dirty inodes, unused dentries, etc.  This is particularly true
when umounting a filesystem, where eventually since every live object will
eventually be discarded.

Dave Chinner reported a problem with this while experimenting with the
shrinker revamp patchset.  So we believe it is time for a change.  This
patch just moves int to longs.  Machines where it matters should have a
big long anyway.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:29 -04:00
Rafael J. Wysocki
85fb0a1c35 Merge branch 'acpi-hotplug'
* acpi-hotplug:
  PM / hibernate / memory hotplug: Rework mutual exclusion
  PM / hibernate: Create memory bitmaps after freezing user space
  ACPI / scan: Change ordering of locks for device hotplug
2013-09-10 23:14:53 +02:00
Linus Torvalds
b05430fc93 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 3 (of many) from Al Viro:
 "Waiman's conversion of d_path() and bits related to it,
  kern_path_mountpoint(), several cleanups and fixes (exportfs
  one is -stable fodder, IMO).

  There definitely will be more...  ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  split read_seqretry_or_unlock(), convert d_walk() to resulting primitives
  dcache: Translating dentry into pathname without taking rename_lock
  autofs4 - fix device ioctl mount lookup
  introduce kern_path_mountpoint()
  rename user_path_umountat() to user_path_mountpoint_at()
  take unlazy_walk() into umount_lookup_last()
  Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c
  prune_super(): sb->s_op is never NULL
  exportfs: don't assume that ->iterate() won't feed us too long entries
  afs: get rid of redundant ->d_name.len checks
2013-09-10 12:44:24 -07:00
Joonsoo Kim
b0cff9d88c sched: Fix load balancing performance regression in should_we_balance()
Commit 23f0d20 ("sched: Factor out code to should_we_balance()")
introduces the should_we_balance() function.  This function should
return 1 if this cpu is appropriate for balancing. But the newly
introduced code doesn't do so, it returns 0 instead of 1.

This introduces performance regression, reported by Dave Chinner:

                        v4 filesystem           v5 filesystem
3.11+xfsdev:            220k files/s            225k files/s
3.12-git                180k files/s            185k files/s
3.12-git-revert         245k files/s            247k files/s

You can find more detailed information at:

  https://lkml.org/lkml/2013/9/10/1

This patch corrects the return value of should_we_balance()
function as orignally intended.

With this patch, Dave Chinner reports that the regression is gone:

                        v4 filesystem           v5 filesystem
3.11+xfsdev:            220k files/s            225k files/s
3.12-git                180k files/s            185k files/s
3.12-git-revert         245k files/s            247k files/s
3.12-git-fix            249k files/s            248k files/s

Reported-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Link: http://lkml.kernel.org/r/20130910065448.GA20368@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-10 09:20:42 +02:00
Linus Torvalds
7eb69529cb Not much changes for the 3.12 merge window. The major tracing changes
are still in flux, and will have to wait for 3.13.
 
 The changes for 3.12 are mostly clean ups and minor fixes.
 
 H. Peter Anvin added a check to x86_32 static function tracing that
 helps a small segment of the kernel community.
 
 Oleg Nesterov had a few changes from 3.11, but were mostly clean ups
 and not worth pushing in the -rc time frame.
 
 Li Zefan had small clean up with annotating a raw_init with __init.
 
 I fixed a slight race in updating function callbacks, but the race
 is so small and the bug that happens when it occurs is so minor it's
 not even worth pushing to stable.
 
 The only real enhancement is from Alexander Z Lam that made the
 tracing_cpumask work for trace buffer instances, instead of them all
 sharing a global cpumask.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSLJm1AAoJEOdOSU1xswtMSu0H/0/Uuh0D5VhANZRcTATY4gUO
 n3WH6sm3atOxH+cbeYQcFXxOcvRcR2n90tvCMpiFlPiC0NiNR1yjro3VLS4zWb77
 twq7gABdJf+Tdq7sOBmSzmY5vRKQVHIXvAfC27mBez38nCWZz0BjJGEsPBwoly25
 ZaiCbKlusw/QKIEy40tuKUL/rXF6yEWnQrMujhBbyNm0w7sJVdfnd+HHmCvy15H2
 IQE1g83d/dAMBjFY2BYg77J+oV6qmJxql2itvDivQWXHqFb52Jw3ZTwHwWLZlPYU
 AZcHtYGs2lSUscQLF56LejB7zZyE8taUufExFEVexXxZS5u7nNPXsPrA2LOOK70=
 =JWO6
 -----END PGP SIGNATURE-----

Merge tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "Not much changes for the 3.12 merge window.  The major tracing changes
  are still in flux, and will have to wait for 3.13.

  The changes for 3.12 are mostly clean ups and minor fixes.

  H Peter Anvin added a check to x86_32 static function tracing that
  helps a small segment of the kernel community.

  Oleg Nesterov had a few changes from 3.11, but were mostly clean ups
  and not worth pushing in the -rc time frame.

  Li Zefan had small clean up with annotating a raw_init with __init.

  I fixed a slight race in updating function callbacks, but the race is
  so small and the bug that happens when it occurs is so minor it's not
  even worth pushing to stable.

  The only real enhancement is from Alexander Z Lam that made the
  tracing_cpumask work for trace buffer instances, instead of them all
  sharing a global cpumask"

* tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace/rcu: Do not trace debug_lockdep_rcu_enabled()
  x86-32, ftrace: Fix static ftrace when early microcode is enabled
  ftrace: Fix a slight race in modifying what function callback gets traced
  tracing: Make tracing_cpumask available for all instances
  tracing: Kill the !CONFIG_MODULES code in trace_events.c
  tracing: Don't pass file_operations array to event_create_dir()
  tracing: Kill trace_create_file_ops() and friends
  tracing/syscalls: Annotate raw_init function with __init
2013-09-09 14:42:15 -07:00
Linus Torvalds
300893b08f xfs: update for v3.12-rc1
For 3.12-rc1 there are a number of bugfixes in addition to work to ease usage
 of shared code between libxfs and the kernel, the rest of the work to enable
 project and group quotas to be used simultaneously, performance optimisations
 in the log and the CIL, directory entry file type support, fixes for log space
 reservations, some spelling/grammar cleanups, and the addition of user
 namespace support.
 
 - introduce readahead to log recovery
 - add directory entry file type support
 - fix a number of spelling errors in comments
 - introduce new Q_XGETQSTATV quotactl for project quotas
 - add USER_NS support
 - log space reservation rework
 - CIL optimisations
 - kernel/userspace libxfs rework
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJSLeikAAoJENaLyazVq6ZOciEP/3tc850sQsPlNwP9aqd1l2Wk
 S1RJ8i+MUQ2W/PlbswCXvdUCT8DIwXWxL31tGvi8vtaLhh6t8ICSZwqNil+/GCIJ
 BErVvY4oXhEMHhlbIRRvpxblTfJGiYy3puUEz9VI0yDdUVnC33+DuEeLTQ/0mibo
 /UUqKFmM3KYpOc8vIQvH5K5i8PkjtMt9yge0k4l9COD30gtY2okkaD4b1voOsKc+
 5YFqulq7zcXBUYti+EFCQeV8aUBTGEPN4PJRdcS12/ylzsTzZivAOO+QREu7qBW8
 x+Gj8fOC+yYWCttmJlfa1n8taxge3ndEuzKN97nvvfQgjvvunMvwJ499skryYVdB
 EcPnBnpDUQuz/y7exKBT9uROK817vZBtfHzSova29ayQSWC+qDpNE4xXeDIqeCtT
 CPxdHuWMOvIdZg41E4x7je0elaZl8EAZ8hycc2WuRhtukEkIdE1O8aD7IVrMYee8
 kg+aVHG5nmYRInO1WuMinbtiCzwvVoBJToWM3y4cbfgW0dILASRyL53HDd+eCr1j
 kOpPIVgXlBZgiPMmdYahWxyVVWcE7zyex0w4frzWVlJMZ4lP5brppD6qfQg1JwOB
 z21Y95F5C2GxSyN/Lwps0G6jujHrpe6GVeYK7uKCtnqTD83nSShv5Naln7pQ3AUs
 qUMsqmJob4+bwt94Xgbx
 =V4s4
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs updates from Ben Myers:
 "For 3.12-rc1 there are a number of bugfixes in addition to work to
  ease usage of shared code between libxfs and the kernel, the rest of
  the work to enable project and group quotas to be used simultaneously,
  performance optimisations in the log and the CIL, directory entry file
  type support, fixes for log space reservations, some spelling/grammar
  cleanups, and the addition of user namespace support.

   - introduce readahead to log recovery
   - add directory entry file type support
   - fix a number of spelling errors in comments
   - introduce new Q_XGETQSTATV quotactl for project quotas
   - add USER_NS support
   - log space reservation rework
   - CIL optimisations
  - kernel/userspace libxfs rework"

* tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs: (112 commits)
  xfs: XFS_MOUNT_QUOTA_ALL needed by userspace
  xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino
  Fix wrong flag ASSERT in xfs_attr_shortform_getvalue
  xfs: finish removing IOP_* macros.
  xfs: inode log reservations are too small
  xfs: check correct status variable for xfs_inobt_get_rec() call
  xfs: inode buffers may not be valid during recovery readahead
  xfs: check LSN ordering for v5 superblocks during recovery
  xfs: btree block LSN escaping to disk uninitialised
  XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
  xfs: fix bad dquot buffer size in log recovery readahead
  xfs: don't account buffer cancellation during log recovery readahead
  xfs: check for underflow in xfs_iformat_fork()
  xfs: xfs_dir3_sfe_put_ino can be static
  xfs: introduce object readahead to log recovery
  xfs: Simplify xfs_ail_min() with list_first_entry_or_null()
  xfs: Register hotcpu notifier after initialization
  xfs: add xfs sb v4 support for dirent filetype field
  xfs: Add write support for dirent filetype field
  xfs: Add read-only support for dirent filetype field
  ...
2013-09-09 11:19:09 -07:00