Commit Graph

14268 Commits

Author SHA1 Message Date
Linus Torvalds
79071638ce Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The biggest change is a performance improvement on SMP systems:

  | 4 socket 40 core + SMT Westmere box, single 30 sec tbench
  | runs, higher is better:
  |
  | clients     1       2       4        8       16       32       64      128
  |..........................................................................
  | pre        30      41     118      645     3769     6214    12233    14312
  | post      299     603    1211     2418     4697     6847    11606    14557
  |
  | A nice increase in performance.

  which speedup is particularly noticeable on heavily interacting
  few-tasks workloads, so the changes should help desktop-style Xorg
  workloads and interactivity as well, on multi-core CPUs.

  There are also cpuset suspend behavior fixes/restructuring and various
  smaller tweaks."

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix race in task_group()
  sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
  sched: Reset loop counters if all tasks are pinned and we need to redo load balance
  sched: Reorder 'struct lb_env' members to reduce its size
  sched: Improve scalability via 'CPU buddies', which withstand random perturbations
  cpusets: Remove/update outdated comments
  cpusets, hotplug: Restructure functions that are invoked during hotplug
  cpusets, hotplug: Implement cpuset tree traversal in a helper function
  CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
  sched/x86: Remove broken power estimation
2012-07-26 13:08:01 -07:00
Linus Torvalds
fa93669a19 Driver core merge for 3.6-rc1
Here's the big driver core pull request for 3.6-rc1.
 
 Unlike 3.5, this kernel should be a lot tamer, with the printk changes now
 settled down.  All we have here is some extcon driver updates, w1 driver
 updates, a few printk cleanups that weren't needed for 3.5, but are good to
 have now, and some other minor fixes/changes in the driver core.
 
 All of these have been in the linux-next releases for a while now.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAlARgIUACgkQMUfUDdst+ynDHgCfRNwIB9L+zZvjcKE5e1BhDbUl
 wVUAn398DFgbJ1+PjGkd1EMR2uVTh7Ou
 =MIFu
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core changes from Greg Kroah-Hartman:
 "Here's the big driver core pull request for 3.6-rc1.

  Unlike 3.5, this kernel should be a lot tamer, with the printk changes
  now settled down.  All we have here is some extcon driver updates, w1
  driver updates, a few printk cleanups that weren't needed for 3.5, but
  are good to have now, and some other minor fixes/changes in the driver
  core.

  All of these have been in the linux-next releases for a while now.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits)
  printk: Export struct log size and member offsets through vmcoreinfo
  Drivers: hv: Change the hex constant to a decimal constant
  driver core: don't trigger uevent after failure
  extcon: MAX77693: Add extcon-max77693 driver to support Maxim MAX77693 MUIC device
  sysfs: fail dentry revalidation after namespace change fix
  sysfs: fail dentry revalidation after namespace change
  extcon: spelling of detach in function doc
  extcon: arizona: Stop microphone detection if we give up on it
  extcon: arizona: Update cable reporting calls and split headset
  PM / Runtime: Do not increment device usage counts before probing
  kmsg - do not flush partial lines when the console is busy
  kmsg - export "continuation record" flag to /dev/kmsg
  kmsg - avoid warning for CONFIG_PRINTK=n compilations
  kmsg - properly print over-long continuation lines
  driver-core: Use kobj_to_dev instead of re-implementing it
  driver-core: Move kobj_to_dev from genhd.h to device.h
  driver core: Move deferred devices to the end of dpm_list before probing
  driver core: move uevent call to driver_register
  driver core: fix shutdown races with probe/remove(v3)
  Extcon: Arizona: Add driver for Wolfson Arizona class devices
  ...
2012-07-26 11:25:33 -07:00
Linus Torvalds
b13bc8dda8 Staging tree patches for 3.6-rc1
Here's the big staging tree merge for the 3.6-rc1 merge window.
 
 There are some patches in here outside of drivers/staging/, notibly the iio
 code (which is still stradeling the staging / not staging boundry), the pstore
 code, and the tracing code.  All of these have gotten ackes from the various
 subsystem maintainers to be included in this tree.  The pstore and tracing
 patches are related, and are coming here as they replace one of the android
 staging drivers.
 
 Otherwise, the normal staging mess.  Lots of cleanups and a few new drivers
 (some iio drivers, and the large csr wireless driver abomination.)
 
 Note, you will get a merge issue with the following files:
 	drivers/staging/comedi/drivers/s626.h
 	drivers/staging/gdm72xx/netlink_k.c
 both of which should be trivial for you to handle.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAlAQiD8ACgkQMUfUDdst+ykxhgCeMUjvc+1RTtSprzvkzpejgoUU
 6A4AnAleWMnkaCD8vruGnRdGl/Qtz51+
 =mN6M
 -----END PGP SIGNATURE-----

Merge tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging tree patches from Greg Kroah-Hartman:
 "Here's the big staging tree merge for the 3.6-rc1 merge window.

  There are some patches in here outside of drivers/staging/, notibly
  the iio code (which is still stradeling the staging / not staging
  boundry), the pstore code, and the tracing code.  All of these have
  gotten acks from the various subsystem maintainers to be included in
  this tree.  The pstore and tracing patches are related, and are coming
  here as they replace one of the android staging drivers.

  Otherwise, the normal staging mess.  Lots of cleanups and a few new
  drivers (some iio drivers, and the large csr wireless driver
  abomination.)

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fixed up trivial conflicts in drivers/staging/comedi/drivers/s626.h and
drivers/staging/gdm72xx/netlink_k.c

* tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1108 commits)
  staging: csr: delete a bunch of unused library functions
  staging: csr: remove csr_utf16.c
  staging: csr: remove csr_pmem.h
  staging: csr: remove CsrPmemAlloc
  staging: csr: remove CsrPmemFree()
  staging: csr: remove CsrMemAllocDma()
  staging: csr: remove CsrMemCalloc()
  staging: csr: remove CsrMemAlloc()
  staging: csr: remove CsrMemFree() and CsrMemFreeDma()
  staging: csr: remove csr_util.h
  staging: csr: remove CsrOffSetOf()
  stating: csr: remove unneeded #includes in csr_util.c
  staging: csr: make CsrUInt16ToHex static
  staging: csr: remove CsrMemCpy()
  staging: csr: remove CsrStrLen()
  staging: csr: remove CsrVsnprintf()
  staging: csr: remove CsrStrDup
  staging: csr: remove CsrStrChr()
  staging: csr: remove CsrStrNCmp
  staging: csr: remove CsrStrCmp
  ...
2012-07-26 11:14:49 -07:00
Andrew Vagin
895dd92c03 sched: Deliver sched_switch events to the current task
Otherwise they can't be filtered for a defined task:

  perf record -e sched:sched_switch ./foo

This command doesn't report any events without this patch.

I think it isn't a security concern if someone knows who will
be executed next - this can already be observed by polling /proc
state. By default perf is disabled for non-root users in any case.

I need these events for profiling sleep times.  sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/r/1342088069-1005148-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-26 12:23:10 +02:00
Ying Xue
014acbf0d5 sched: Fix minor code style issues
Delete redudant spaces between type name and data name or operators.

Signed-off-by: Ying Xue <ying.xue0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342076622-6606-1-git-send-email-ying.xue0@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-26 11:47:00 +02:00
Namhyung Kim
45afb1734f sched: Use task_rq_unlock() in __sched_setscheduler()
It seems there's no specific reason to open-code it.  I guess
commit 0122ec5b02 ("sched: Add p->pi_lock to task_rq_lock()")
simply missed it.  Let's be consistent with others.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341647342-6742-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-26 11:46:59 +02:00
Thomas Gleixner
dc9b229a58 genirq: Allow irq chips to mark themself oneshot safe
Some interrupt chips like MSI are oneshot safe by implementation. For
those interrupts we can avoid the mask/unmask sequence for threaded
interrupt handlers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1207132056540.32033@ionos
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Jan Kiszka <jan.kiszka@web.de>
2012-07-25 12:46:38 +02:00
Mark Brown
f5a1ad057e irqdomain: Improve diagnostics when a domain mapping fails
When the map operation fails log the error code we get and add a WARN_ON()
so we get a backtrace (which should help work out which interrupt is the
source of the issue).

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-07-24 22:37:30 -06:00
Grant Likely
4c0946c474 irqdomain: eliminate slow-path revmap lookups
With the current state of irq_domain, the reverse map is always updated
when new IRQs get mapped.  This means that the irq_find_mapping() function
can be simplified to execute the revmap lookup functions unconditionally

This patch adds lookup functions for the revmaps that don't yet have one
and removes the slow path lookup code path.

v8: Broke out unrelated changes into separate patches.  Rebased on Paul's irq
    association patches.
v7: Rebased to irqdomain/next for v3.4 and applied before the removal of 'hint'
v6: Remove the slow path entirely.  The only place where the slow path
    could get called is for a linear mapping if the hwirq number is larger
    than the linear revmap size.  There shouldn't be any interrupt
    controllers that do that.
v5: rewrite to not use a ->revmap() callback.  It is simpler, smaller,
    safer and faster to open code each of the revmap lookups directly into
    irq_find_mapping() via a switch statement.
v4: Fix build failure on incorrect variable reference.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Rob Herring <rob.herring@calxeda.com>
2012-07-24 22:37:23 -06:00
Grant Likely
6aeea3ecc3 Merge remote-tracking branch 'origin' into irqdomain/next 2012-07-24 22:34:40 -06:00
Linus Torvalds
bdc0077af5 SCSI misc on 20120724
The most important feature of this patch set is the new async infrastructure
 that makes sure async_synchronize_full() synchronizes all domains and allows
 us to remove all the hacks (like having scsi_complete_async_scans() in the
 device base code) and means that the async infrastructure will "just work" in
 future. The rest is assorted driver updates (aacraid, bnx2fc, virto-scsi,
 megaraid, bfa, lpfc, qla2xxx, qla4xxx) plus a lot of infrastructure work in
 sas and FC.
 
 Signed-off-by: James Bottomley <JBottomley@Parallels.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJQDjDCAAoJEDeqqVYsXL0M/sMH/jVgBfF1mjR+DQuTscKyD21w
 0BQLn5OmvDZDqo44iqQzNRObw7CxkBkUtHoozsknLijw+KggER653ZOAtUdIHfI/
 /uo7iJQ3J3D/Ezm99HYSpZiF2juZwsBRtFBoKkGqOpMlzFUx5o4hUbH5OcINxnHR
 VmvJU5K1kg8D77Q6zK+Atl14/Rfibc2IoufFmbYdplUAM/tV0BpBSSHJAJvqua76
 NGMl4KJcPZnXe/4LXcxZia5A2efdFFEzaQ2mM9rUVEAgHDAxc0Zg9IoDhGd08FX4
 G55NK+6+bKb9s7bgyva0T/iy817TRCzjteeYNFrb8nBRe7aQbAivaBHQFXIyvdQ=
 =y2sh
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull first round of SCSI updates from James Bottomley:
 "The most important feature of this patch set is the new async
  infrastructure that makes sure async_synchronize_full() synchronizes
  all domains and allows us to remove all the hacks (like having
  scsi_complete_async_scans() in the device base code) and means that
  the async infrastructure will "just work" in future.

  The rest is assorted driver updates (aacraid, bnx2fc, virto-scsi,
  megaraid, bfa, lpfc, qla2xxx, qla4xxx) plus a lot of infrastructure
  work in sas and FC.

  Signed-off-by: James Bottomley <JBottomley@Parallels.com>"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (97 commits)
  [SCSI] Revert "[SCSI] fix async probe regression"
  [SCSI] cleanup usages of scsi_complete_async_scans
  [SCSI] queue async scan work to an async_schedule domain
  [SCSI] async: make async_synchronize_full() flush all work regardless of domain
  [SCSI] async: introduce 'async_domain' type
  [SCSI] bfa: Fix to set correct return error codes and misc cleanup.
  [SCSI] aacraid: Series 7 Async. (performance) mode support
  [SCSI] aha152x: Allow use on 64bit systems
  [SCSI] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning
  [SCSI] bfa: squelch lockdep complaint with a spin_lock_init
  [SCSI] qla2xxx: remove unnecessary reads of PCI_CAP_ID_EXP
  [SCSI] qla4xxx: remove unnecessary read of PCI_CAP_ID_EXP
  [SCSI] ufs: fix incorrect return value about SUCCESS and FAILED
  [SCSI] ufs: reverse the ufshcd_is_device_present logic
  [SCSI] ufs: use module_pci_driver
  [SCSI] usb-storage: update usb devices for write cache quirk in quirk list.
  [SCSI] usb-storage: add support for write cache quirk
  [SCSI] set to WCE if usb cache quirk is present.
  [SCSI] virtio-scsi: hotplug support for virtio-scsi
  [SCSI] virtio-scsi: split scatterlist per target
  ...
2012-07-24 18:11:22 -07:00
Linus Torvalds
614a6d4341 Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "Nothing too interesting.  A minor bug fix and some cleanups."

* 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Update remount documentation
  cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode
  cgroup: Remove populate() documentation
  cgroup: remove hierarchy_mutex
2012-07-24 17:47:44 -07:00
Linus Torvalds
a08489c569 Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "There are three major changes.

   - WQ_HIGHPRI has been reimplemented so that high priority work items
     are served by worker threads with -20 nice value from dedicated
     highpri worker pools.

   - CPU hotplug support has been reimplemented such that idle workers
     are kept across CPU hotplug events.  This makes CPU hotplug cheaper
     (for PM) and makes the code simpler.

   - flush_kthread_work() has been reimplemented so that a work item can
     be freed while executing.  This removes an annoying behavior
     difference between kthread_worker and workqueue."

* 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix spurious CPU locality WARN from process_one_work()
  kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed
  kthread_worker: reorganize to prepare for flush_kthread_work() reimplementation
  workqueue: simplify CPU hotplug code
  workqueue: remove CPU offline trustee
  workqueue: don't butcher idle workers on an offline CPU
  workqueue: reimplement CPU online rebinding to handle idle workers
  workqueue: drop @bind from create_worker()
  workqueue: use mutex for global_cwq manager exclusion
  workqueue: ROGUE workers are UNBOUND workers
  workqueue: drop CPU_DYING notifier operation
  workqueue: perform cpu down operations from low priority cpu_notifier()
  workqueue: reimplement WQ_HIGHPRI using a separate worker_pool
  workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()
  workqueue: separate out worker_pool flags
  workqueue: use @pool instead of @gcwq or @cpu where applicable
  workqueue: factor out worker_pool from global_cwq
  workqueue: don't use WQ_HIGHPRI for unbound workqueues
2012-07-24 17:46:16 -07:00
Linus Torvalds
6dd53aa456 PCI changes for the 3.6 merge window:
Host bridge hotplug
     - Add MMCONFIG support for hot-added host bridges (Jiang Liu)
   Device hotplug
     - Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
     - Call FINAL fixups for hot-added devices, too (Myron Stowe)
     - Factor out generic code for P2P bridge hot-add (Yinghai Lu)
     - Remove all functions in a slot, not just those with _EJx (Amos Kong)
   Dynamic resource management
     - Track bus number allocation (struct resource tree per domain) (Yinghai Lu)
     - Make P2P bridge 1K I/O windows work with resource reassignment (Bjorn Helgaas, Yinghai Lu)
     - Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
   Power management
     - Add PCIe runtime D3cold support (Huang Ying)
   Virtualization
     - Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex Williamson)
     - Add quirks for devices with broken INTx masking (Jan Kiszka)
   Miscellaneous
     - Fix some PCI Express capability version issues (Myron Stowe)
     - Factor out some arch code with a weak, generic, pcibios_setup() (Myron Stowe)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQBy+9AAoJEPGMOI97Hn6zOpQP+wVFvA7pcteFj6HPs5nTq2Hc
 55oeRqCO0wBHoFMCKB0AjeTATjqxi9OhcjaiVrZejxNyWKC9MnrXuunpQ0l/hCbR
 M/TK+BCelfX2FU4eXNf+TBCCcOhOVWqQft9Gm6nYKwX8Y0msRVCceI4WwhZgSwtI
 vdtmnqlwolscdnq+8ThsnvUMtwkN0gExmn2FJRl6EoEgG0DTqhMkZ83uA+NPBhvv
 I+g0XbA6haaZph2nnSYR0hIW4Q7JkT/LgA6uVAQxamctwxLol7xxsjCRnfqrulkf
 kaRr2fAgBXfmaOIltro4UkXrCM52ZSyggCDfExHp6mWGPKMjE5ZcyK1YbGfmmumk
 DS3t1S0eBdDJXrnf9l/Yb8e95dQxRCYKelKzr1rTD9QAXsInE8rC40hvhfFaTa4s
 nZYRTz0SKv6coQihqaOR7shx1DNomLFk7jndaWEElfl9/cT/nQnZ8XLfVMzkJNNB
 Y4SM6zkiIaCL0aiSEE16MqVjmODYRjbURLYzQIrqr2KJQg8X6XjIRojQLjL6xEgA
 22ry2ZRPhqO68g7aLqvixiSDaTp0Z0Vw+JmgjtBqvkokwZcGQtm4umkpAdOi+Es8
 3bJaMY7ZUpDX53FE8iyP6AnmR/1k19rC1gNnNq/syWyjtYOYJ9i3QCTafFgvE1VC
 5coQ1L5tByHvpzK5PHwf
 =oo/A
 -----END PGP SIGNATURE-----

Merge tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI changes from Bjorn Helgaas:
 "Host bridge hotplug:
    - Add MMCONFIG support for hot-added host bridges (Jiang Liu)
  Device hotplug:
    - Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
    - Call FINAL fixups for hot-added devices, too (Myron Stowe)
    - Factor out generic code for P2P bridge hot-add (Yinghai Lu)
    - Remove all functions in a slot, not just those with _EJx (Amos
      Kong)
  Dynamic resource management:
    - Track bus number allocation (struct resource tree per domain)
      (Yinghai Lu)
    - Make P2P bridge 1K I/O windows work with resource reassignment
      (Bjorn Helgaas, Yinghai Lu)
    - Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
  Power management:
    - Add PCIe runtime D3cold support (Huang Ying)
  Virtualization:
    - Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex
      Williamson)
    - Add quirks for devices with broken INTx masking (Jan Kiszka)
  Miscellaneous:
    - Fix some PCI Express capability version issues (Myron Stowe)
    - Factor out some arch code with a weak, generic, pcibios_setup()
      (Myron Stowe)"

* tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (122 commits)
  PCI: hotplug: ensure a consistent return value in error case
  PCI: fix undefined reference to 'pci_fixup_final_inited'
  PCI: build resource code for M68K architecture
  PCI: pciehp: remove unused pciehp_get_max_lnk_width(), pciehp_get_cur_lnk_width()
  PCI: reorder __pci_assign_resource() (no change)
  PCI: fix truncation of resource size to 32 bits
  PCI: acpiphp: merge acpiphp_debug and debug
  PCI: acpiphp: remove unused res_lock
  sparc/PCI: replace pci_cfg_fake_ranges() with pci_read_bridge_bases()
  PCI: call final fixups hot-added devices
  PCI: move final fixups from __init to __devinit
  x86/PCI: move final fixups from __init to __devinit
  MIPS/PCI: move final fixups from __init to __devinit
  PCI: support sizing P2P bridge I/O windows with 1K granularity
  PCI: reimplement P2P bridge 1K I/O windows (Intel P64H2)
  PCI: disable MEM decoding while updating 64-bit MEM BARs
  PCI: leave MEM and IO decoding disabled during 64-bit BAR sizing, too
  PCI: never discard enable/suspend/resume_early/resume fixups
  PCI: release temporary reference in __nv_msi_ht_cap_quirk()
  PCI: restructure 'pci_do_fixups()'
  ...
2012-07-24 16:17:07 -07:00
Linus Torvalds
f14121ab35 Devicetree updates for 3.6
A small set of changes for devicetree:
 - Couple of Documentation fixes
 - Addition of new helper function of_node_full_name
 - Improve of_parse_phandle_with_args return values
 - Some NULL related sparse fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJQDwsgAAoJEMhvYp4jgsXiuwUH/Ri6ZSnqHcz4Wa/X4FxvNc3I
 3Xelo/Vt3WLYue3s/+OYiM5FK9+KH8T6x+U79Q4p7vePcfUh6GJII0AUbMeRghkS
 m3FjNd5syzYNJlnDnqdngQYRDpaz8U/SyftjXyMPjJ1VWiyLx/EJQUkj1EEwDLe/
 ZVabppnco3Y6OJpFuETONNvXx5mE7xq86isW5+aYmviMkWSMMwJPf8qofLJ78Dh5
 OAhWuCPRDooz548+Wkabt90qHjF6FU43w5fU7zZW26NT39ptppcbZ2bAXcTYqIIq
 sATp5YSitvwFqO2c1mA/drZ9nrgxDPCaw3qCDyiMdcbWgXqDirz2x7q1iauVHF4=
 =5TZ/
 -----END PGP SIGNATURE-----

Merge tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux

Pull devicetree updates from Rob Herring:
 "A small set of changes for devicetree:
   - Couple of Documentation fixes
   - Addition of new helper function of_node_full_name
   - Improve of_parse_phandle_with_args return values
   - Some NULL related sparse fixes"

Grant's busy packing.

* tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux:
  of: mtd: nuke useless const qualifier
  devicetree: add helper inline for retrieving a node's full name
  of: return -ENOENT when no property
  usage-model.txt: fix typo machine_init->init_machine
  of: Fix null pointer related warnings in base.c file
  LED: Fix missing semicolon in OF documentation
  of: fix a few typos in the binding documentation
2012-07-24 14:07:22 -07:00
Linus Torvalds
3c4cfadef6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David S Miller:

 1) Remove the ipv4 routing cache.  Now lookups go directly into the FIB
    trie and use prebuilt routes cached there.

    No more garbage collection, no more rDOS attacks on the routing
    cache.  Instead we now get predictable and consistent performance,
    no matter what the pattern of traffic we service.

    This has been almost 2 years in the making.  Special thanks to
    Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
    have helped along the way.

    I'm sure that with a change of this magnitude there will be some
    kind of fallout, but such things ought the be simple to fix at this
    point.  Luckily I'm not European so I'll be around all of August to
    fix things :-)

    The major stages of this work here are each fronted by a forced
    merge commit whose commit message contains a top-level description
    of the motivations and implementation issues.

 2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
    input.

 3) TCP SYN/ACK performance tweaks from Eric Dumazet.

 4) Add namespace support for netfilter L4 conntrack helpers, from Gao
    Feng.

 5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
    Yuval Mintz.

 6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.

 7) Support for connection tracker helpers in userspace, from Pablo
    Neira Ayuso.

 8) Allow userspace driven TX load balancing functions in TEAM driver,
    from Jiri Pirko.

 9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
    embedded gotos.

10) TCP Small Queues, essentially minimize the amount of TCP data queued
    up in the packet scheduler layer.  Whereas the existing BQL (Byte
    Queue Limits) limits the pkt_sched --> netdevice queuing levels,
    this controls the TCP --> pkt_sched queueing levels.

    From Eric Dumazet.

11) Reduce the number of get_page/put_page ops done on SKB fragments,
    from Alexander Duyck.

12) Implement protection against blind resets in TCP (RFC 5961), from
    Eric Dumazet.

13) Support the client side of TCP Fast Open, basically the ability to
    send data in the SYN exchange, from Yuchung Cheng.

    Basically, the sender queues up data with a sendmsg() call using
    MSG_FASTOPEN, then they do the connect() which emits the queued up
    fastopen data.

14) Avoid all the problems we get into in TCP when timers or PMTU events
    hit a locked socket.  The TCP Small Queues changes added a
    tcp_release_cb() that allows us to queue work up to the
    release_sock() caller, and that's what we use here too.  From Eric
    Dumazet.

15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
  genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
  r8169: revert "add byte queue limit support".
  ipv4: Change rt->rt_iif encoding.
  net: Make skb->skb_iif always track skb->dev
  ipv4: Prepare for change of rt->rt_iif encoding.
  ipv4: Remove all RTCF_DIRECTSRC handliing.
  ipv4: Really ignore ICMP address requests/replies.
  decnet: Don't set RTCF_DIRECTSRC.
  net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
  ipv4: Remove redundant assignment
  rds: set correct msg_namelen
  openvswitch: potential NULL deref in sample()
  tcp: dont drop MTU reduction indications
  bnx2x: Add new 57840 device IDs
  tcp: avoid oops in tcp_metrics and reset tcpm_stamp
  niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
  niu: Fix to check for dma mapping errors.
  net: Fix references to out-of-scope variables in put_cmsg_compat()
  net: ethernet: davinci_emac: add pm_runtime support
  net: ethernet: davinci_emac: Remove unnecessary #include
  ...
2012-07-24 10:01:50 -07:00
John Stultz
b44d50dcac time: Fix casting issue in tk_set_xtime and tk_xtime_add
commit 1e75fa8b (time: Condense timekeeper.xtime into xtime_sec)
introduced helper functions which apply a timespec to the core
internal timekeeper data. The internal storage type is u64. The
timespec tv_nsec value must be shifted before set or added to the
internal value. tv_nsec is a long, which is 32bit on a 32bit system,
so without casting tv_nsec to u64 we lose the bits which are shifted
over the 32bit boundary.

Add the proper typecasts.

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343074957-16541-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-24 16:48:45 +02:00
Darren Hart
6f7b0a2a5c futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi()
If uaddr == uaddr2, then we have broken the rule of only requeueing
from a non-pi futex to a pi futex with this call. If we attempt this,
as the trinity test suite manages to do, we miss early wakeups as
q.key is equal to key2 (because they are the same uaddr). We will then
attempt to dereference the pi_mutex (which would exist had the futex_q
been properly requeued to a pi futex) and trigger a NULL pointer
dereference.

Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/ad82bfe7f7d130247fbe2b5b4275654807774227.1342809673.git.dvhart@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-24 16:02:57 +02:00
Darren Hart
f27071cb7f futex: Fix bug in WARN_ON for NULL q.pi_state
The WARN_ON in futex_wait_requeue_pi() for a NULL q.pi_state was testing
the address (&q.pi_state) of the pointer instead of the value
(q.pi_state) of the pointer. Correct it accordingly.

Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1c85d97f6e5f79ec389a4ead3e367363c74bd09a.1342809673.git.dvhart@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-24 16:02:57 +02:00
Darren Hart
b6070a8d98 futex: Test for pi_mutex on fault in futex_wait_requeue_pi()
If fixup_pi_state_owner() faults, pi_mutex may be NULL. Test
for pi_mutex != NULL before testing the owner against current
and possibly unlocking it.

Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/dc59890338fc413606f04e5c5b131530734dae3d.1342809673.git.dvhart@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-24 16:02:56 +02:00
Peter Zijlstra
8323f26ce3 sched: Fix race in task_group()
Stefan reported a crash on a kernel before a3e5d1091c ("sched:
Don't call task_group() too many times in set_task_rq()"), he
found the reason to be that the multiple task_group()
invocations in set_task_rq() returned different values.

Looking at all that I found a lack of serialization and plain
wrong comments.

The below tries to fix it using an extra pointer which is
updated under the appropriate scheduler locks. Its not pretty,
but I can't really see another way given how all the cgroup
stuff works.

Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:58:20 +02:00
Srivatsa Vaddagiri
88b8dac0a1 sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
Current load balance scheme requires only one cpu in a
sched_group (balance_cpu) to look at other peer sched_groups for
imbalance and pull tasks towards itself from a busy cpu. Tasks
thus pulled by balance_cpu could later get picked up by cpus
that are in the same sched_group as that of balance_cpu.

This scheme however fails to pull tasks that are not allowed to
run on balance_cpu (but are allowed to run on other cpus in its
sched_group). That can affect fairness and in some worst case
scenarios cause starvation.

Consider a two core (2 threads/core) system running tasks as
below:

          Core0            Core1
         /     \          /     \
	C0     C1	 C2     C3
        |      |         |      |
        v      v         v      v
	F0     T1        F1     [idle]
			 T2

 F0 = SCHED_FIFO task (pinned to C0)
 F1 = SCHED_FIFO task (pinned to C2)
 T1 = SCHED_OTHER task (pinned to C1)
 T2 = SCHED_OTHER task (pinned to C1 and C2)

F1 could become a cpu hog, which will starve T2 unless C1 pulls
it. Between C0 and C1 however, C0 is required to look for
imbalance between cores, which will fail to pull T2 towards
Core0. T2 will starve eternally in this case. The same scenario
can arise in presence of non-rt tasks as well (say we replace F1
with high irq load).

We tackle this problem by having balance_cpu move pinned tasks
to one of its sibling cpus (where they can run). We first check
if load balance goal can be met by ignoring pinned tasks,
failing which we retry move_tasks() with a new env->dst_cpu.

This patch modifies load balance semantics on who can move load
towards a given cpu in a given sched_domain.

Before this patch, a given_cpu or a ilb_cpu acting on behalf of
an idle given_cpu is responsible for moving load to given_cpu.

With this patch applied, balance_cpu can in addition decide on
moving some load to a given_cpu.

There is a remote possibility that excess load could get moved
as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
*independently* and at *same* time to move some load to a
given_cpu). However we should see less of such conflicting
decisions in practice and moreover subsequent load balance
cycles should correct the excess load moved to given_cpu.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:58:06 +02:00
Prashanth Nageshappa
bbf18b1949 sched: Reset loop counters if all tasks are pinned and we need to redo load balance
While load balancing, if all tasks on the source runqueue are pinned,
we retry after excluding the corresponding source cpu. However, loop counters
env.loop and env.loop_break are not reset before retrying, which can lead
to failure in moving the tasks. In this patch we reset env.loop and
env.loop_break to their inital values before we retry.

Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06EEF.2090709@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:55:37 +02:00
Prashanth Nageshappa
85c1e7dae1 sched: Reorder 'struct lb_env' members to reduce its size
Members of 'struct lb_env' are not in appropriate order to reuse compiler
added padding on 64bit architectures. In this patch we reorder those struct
members and help reduce the size of the structure from 96 bytes to 80
bytes on 64 bit architectures.

Suggested-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FE06DDE.7000403@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:55:20 +02:00
Mike Galbraith
970e178985 sched: Improve scalability via 'CPU buddies', which withstand random perturbations
Traversing an entire package is not only expensive, it also leads to tasks
bouncing all over a partially idle and possible quite large package.  Fix
that up by assigning a 'buddy' CPU to try to motivate.  Each buddy may try
to motivate that one other CPU, if it's busy, tough, it may then try its
SMT sibling, but that's all this optimization is allowed to cost.

Sibling cache buddies are cross-wired to prevent bouncing.

4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:

 clients     1       2       4        8       16       32       64      128
 ..........................................................................
 pre        30      41     118      645     3769     6214    12233    14312
 post      299     603    1211     2418     4697     6847    11606    14557

A nice increase in performance.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:34 +02:00
Srivatsa S. Bhat
a1cd2b13f7 cpusets: Remove/update outdated comments
cpuset_track_online_cpus() is no longer present. So remove the
outdated comment and replace it with reference to cpuset_update_active_cpus()
which is its equivalent.

Also, we don't lack memory hot-unplug anymore. And David Rientjes pointed
out how it is dealt with. So update that comment as well.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141700.3692.98192.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:28 +02:00
Srivatsa S. Bhat
7ddf96b02f cpusets, hotplug: Restructure functions that are invoked during hotplug
Separate out the cpuset related handling for CPU/Memory online/offline.
This also helps us exploit the most obvious and basic level of optimization
that any notification mechanism (CPU/Mem online/offline) has to offer us:
"We *know* why we have been invoked. So stop pretending that we are lost,
and do only the necessary amount of processing!".

And while at it, rename scan_for_empty_cpusets() to
scan_cpusets_upon_hotplug(), which is more appropriate considering how
it is restructured.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:22 +02:00
Srivatsa S. Bhat
80d1fa6463 cpusets, hotplug: Implement cpuset tree traversal in a helper function
At present, the functions that deal with cpusets during CPU/Mem hotplug
are quite messy, since a lot of the functionality is mixed up without clear
separation. And this takes a toll on optimization as well. For example,
the function cpuset_update_active_cpus() is called on both CPU offline and CPU
online events; and it invokes scan_for_empty_cpusets(), which makes sense
only for CPU offline events. And hence, the current code ends up unnecessarily
traversing the cpuset tree during CPU online also.

As a first step towards cleaning up those functions, encapsulate the cpuset
tree traversal in a helper function, so as to facilitate upcoming changes.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141635.3692.893.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:18 +02:00
Srivatsa S. Bhat
d35be8bab9 CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to suspend
resume (ie., restores the cpusets to exactly what it was before suspend, by
not touching it at all) but also speeds up suspend/resume because we avoid
running cpuset update code for every CPU being offlined/onlined.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-24 13:53:14 +02:00
Linus Torvalds
a66d2c8f7e Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull the big VFS changes from Al Viro:
 "This one is *big* and changes quite a few things around VFS.  What's in there:

   - the first of two really major architecture changes - death to open
     intents.

     The former is finally there; it was very long in making, but with
     Miklos getting through really hard and messy final push in
     fs/namei.c, we finally have it.  Unlike his variant, this one
     doesn't introduce struct opendata; what we have instead is
     ->atomic_open() taking preallocated struct file * and passing
     everything via its fields.

     Instead of returning struct file *, it returns -E...  on error, 0
     on success and 1 in "deal with it yourself" case (e.g.  symlink
     found on server, etc.).

     See comments before fs/namei.c:atomic_open().  That made a lot of
     goodies finally possible and quite a few are in that pile:
     ->lookup(), ->d_revalidate() and ->create() do not get struct
     nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
     flags instead, ->create() gets "do we want it exclusive" flag.

     With the introduction of new helper (kern_path_locked()) we are rid
     of all struct nameidata instances outside of fs/namei.c; it's still
     visible in namei.h, but not for long.  Come the next cycle,
     declaration will move either to fs/internal.h or to fs/namei.c
     itself.  [me, miklos, hch]

   - The second major change: behaviour of final fput().  Now we have
     __fput() done without any locks held by caller *and* not from deep
     in call stack.

     That obviously lifts a lot of constraints on the locking in there.
     Moreover, it's legal now to call fput() from atomic contexts (which
     has immediately simplified life for aio.c).  We also don't need
     anti-recursion logics in __scm_destroy() anymore.

     There is a price, though - the damn thing has become partially
     asynchronous.  For fput() from normal process we are guaranteed
     that pending __fput() will be done before the caller returns to
     userland, exits or gets stopped for ptrace.

     For kernel threads and atomic contexts it's done via
     schedule_work(), so theoretically we might need a way to make sure
     it's finished; so far only one such place had been found, but there
     might be more.

     There's flush_delayed_fput() (do all pending __fput()) and there's
     __fput_sync() (fput() analog doing __fput() immediately).  I hope
     we won't need them often; see warnings in fs/file_table.c for
     details.  [me, based on task_work series from Oleg merged last
     cycle]

   - sync series from Jan

   - large part of "death to sync_supers()" work from Artem; the only
     bits missing here are exofs and ext4 ones.  As far as I understand,
     those are going via the exofs and ext4 trees resp.; once they are
     in, we can put ->write_super() to the rest, along with the thread
     calling it.

   - preparatory bits from unionmount series (from dhowells).

   - assorted cleanups and fixes all over the place, as usual.

  This is not the last pile for this cycle; there's at least jlayton's
  ESTALE work and fsfreeze series (the latter - in dire need of fixes,
  so I'm not sure it'll make the cut this cycle).  I'll probably throw
  symlink/hardlink restrictions stuff from Kees into the next pile, too.
  Plus there's a lot of misc patches I hadn't thrown into that one -
  it's large enough as it is..."

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
  ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
  btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
  switch dentry_open() to struct path, make it grab references itself
  spufs: shift dget/mntget towards dentry_open()
  zoran: don't bother with struct file * in zoran_map
  ecryptfs: don't reinvent the wheels, please - use struct completion
  don't expose I_NEW inodes via dentry->d_inode
  tidy up namei.c a bit
  unobfuscate follow_up() a bit
  ext3: pass custom EOF to generic_file_llseek_size()
  ext4: use core vfs llseek code for dir seeks
  vfs: allow custom EOF in generic_file_llseek code
  vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
  vfs: Remove unnecessary flushing of block devices
  vfs: Make sys_sync writeout also block device inodes
  vfs: Create function for iterating over block devices
  vfs: Reorder operations during sys_sync
  quota: Move quota syncing to ->sync_fs method
  quota: Split dquot_quota_sync() to writeback and cache flushing part
  vfs: Move noop_backing_dev_info check from sync into writeback
  ...
2012-07-23 12:27:27 -07:00
Linus Torvalds
7100e505b7 Power management updates for 3.6
* ACPI conversion to PM handling based on struct dev_pm_ops.
 * Conversion of a number of platform drivers to PM handling based on struct
   dev_pm_ops and removal of empty legacy PM callbacks from a couple of PCI
   drivers.
 * Suspend-to-both for in-kernel hibernation from Bojan Smojver.
 * cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti U Murthy.
 * cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
 * Suspend and hibernate fixes from Srivatsa S. Bhat and Colin Cross.
 * Generic PM domains framework updates.
 * RTC CMOS wakeup signaling update from Paul Fox.
 * sparse warnings fixes from Sachin Kamat.
 * Build warnings fixes for the generic PM domains framework and PM sysfs code.
 * sysfs switch for printing device suspend times from Sameer Nanda.
 * Documentation fix from Oskar Schirmer.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJQDF5eAAoJEKhOf7ml8uNsEaAP/2wg4faoOGob5A0/7tLqG3Cw
 xnTmGsfL7wG07Q8ykCL1BSlBb1VeJz8L6LTmUpaABI4M//oIBlcYQKyCE0Tat1AO
 9bJXFzK7qcHMhkTz6d6LDqtVzR3NGM3ypjZqj8aEXBov07LMR1AXvgNwXXhv25zM
 0unwrh1XNinBN3n+oaktpWk1YHUjsa5IMU+2tQJrocuHXcgK30vGXZVrZ4g9w1c2
 eS+ED1oKUqOYtFzIUX+aCtaDDheGaPlugk/GOtIB7Sae0s0vMlxH/T5ncB4SxRC+
 v3s4OykqQc5Dc8+0bNlBH7ykSVNB0PoQiyKDY67CxtH+q1xQSc9/f3XJqnGMaVDE
 17eZUZsL4qSyzRuCbPCGAgwBHmx3qNCMu1i1BcmnSxU+ikPUeCR7mYOP0mRThwPH
 OSfs+c/vZ+Ow6CwVE4UFrbm9Jve7ADnCrlZzT2m6XjhHGyjKP7SJlzP9TPsZ0LRk
 oxgQDYHmxbo50t9tBCz5L4ZTMKkDp28e78x84/CteP85srcW3GqDxrPyp2uzJu5O
 tvIEBvVlc4ucq8sG83RkugQwrG/2cQwG2HO9ERAwq01HHA1BYsuU3A961Jqf5CZo
 nFRSnByvVj/imPf47OWpDPAbVEs7jxufJuLEbPwGj1MkttTGDBIRu3zldXt2S6kP
 Q4qYU6fDaQQHFc90pqxQ
 =vC4/
 -----END PGP SIGNATURE-----

Merge tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:

 - ACPI conversion to PM handling based on struct dev_pm_ops.
 - Conversion of a number of platform drivers to PM handling based on
   struct dev_pm_ops and removal of empty legacy PM callbacks from a
   couple of PCI drivers.
 - Suspend-to-both for in-kernel hibernation from Bojan Smojver.
 - cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti
   Murthy.
 - cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
 - Suspend and hibernate fixes from Srivatsa Bhat and Colin Cross.
 - Generic PM domains framework updates.
 - RTC CMOS wakeup signaling update from Paul Fox.
 - sparse warnings fixes from Sachin Kamat.
 - Build warnings fixes for the generic PM domains framework and PM
   sysfs code.
 - sysfs switch for printing device suspend times from Sameer Nanda.
 - Documentation fix from Oskar Schirmer.

* tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (70 commits)
  cpufreq: Fix sysfs deadlock with concurrent hotplug/frequency switch
  EXYNOS: bugfix on retrieving old_index from freqs.old
  PM / Sleep: call early resume handlers when suspend_noirq fails
  PM / QoS: Use NULL pointer instead of plain integer in qos.c
  PM / QoS: Use NULL pointer instead of plain integer in pm_qos.h
  PM / Sleep: Require CAP_BLOCK_SUSPEND to use wake_lock/wake_unlock
  PM / Sleep: Add missing static storage class specifiers in main.c
  cpuilde / ACPI: remove time from acpi_processor_cx structure
  cpuidle / ACPI: remove usage from acpi_processor_cx structure
  cpuidle / ACPI : remove latency_ticks from acpi_processor_cx structure
  rtc-cmos: report wakeups from interrupt handler
  PM / Sleep: Fix build warning in sysfs.c for CONFIG_PM_SLEEP unset
  PM / Domains: Fix build warning for CONFIG_PM_RUNTIME unset
  olpc-xo15-sci: Use struct dev_pm_ops for power management
  PM / Domains: Replace plain integer with NULL pointer in domain.c file
  PM / Domains: Add missing static storage class specifier in domain.c file
  PM / crypto / ux500: Use struct dev_pm_ops for power management
  PM / IPMI: Remove empty legacy PCI PM callbacks
  tpm_nsc: Use struct dev_pm_ops for power management
  tpm_tis: Use struct dev_pm_ops for power management
  ...
2012-07-22 13:36:52 -07:00
Al Viro
a2d4c71d15 deal with task_work callbacks adding more work
It doesn't matter on normal return to userland path (we'll recheck the
NOTIFY_RESUME flag anyway), but in case of exit_task_work() we'll
need that as soon as we get callbacks capable of triggering more
task_work_add().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:57:57 +04:00
Al Viro
ed3e694d78 move exit_task_work() past exit_files() et.al.
... and get rid of PF_EXITING check in task_work_add().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:57:57 +04:00
Al Viro
67d1214551 merge task_work and rcu_head, get rid of separate allocation for keyring case
task_work and rcu_head are identical now; merge them (calling the result
struct callback_head, rcu_head #define'd to it), kill separate allocation
in security/keys since we can just use cred->rcu now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:57:56 +04:00
Al Viro
158e1645e0 trim task_work: get rid of hlist
layout based on Oleg's suggestion; single-linked list,
task->task_works points to the last element, forward pointer
from said last element points to head.  I'd still prefer
much more regular scheme with two pointers in task_work,
but...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:57:55 +04:00
Al Viro
41f9d29f09 trimming task_work: kill ->data
get rid of the only user of ->data; this is _not_ the final variant - in the
end we'll have task_work and rcu_head identical and just use cred->rcu,
at which point the separate allocation will be gone completely.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:57:54 +04:00
Al Viro
7266702805 signal: make sure we don't get stopped with pending task_work
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:57:54 +04:00
Linus Torvalds
3992c03212 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core changes from Ingo Molnar:
 "Continued cleanups of the core time and NTP code, plus more nohz work
  preparing for tick-less userspace execution."

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Rework timekeeping functions to take timekeeper ptr as argument
  time: Move xtime_nsec adjustment underflow handling timekeeping_adjust
  time: Move arch_gettimeoffset() usage into timekeeping_get_ns()
  time: Refactor accumulation of nsecs to secs
  time: Condense timekeeper.xtime into xtime_sec
  time: Explicitly use u32 instead of int for shift values
  time: Whitespace cleanups per Ingo%27s requests
  nohz: Move next idle expiry time record into idle logic area
  nohz: Move ts->idle_calls incrementation into strict idle logic
  nohz: Rename ts->idle_tick to ts->last_tick
  nohz: Make nohz API agnostic against idle ticks cputime accounting
  nohz: Separate idle sleeping time accounting from nohz logic
  timers: Improve get_next_timer_interrupt()
  timers: Add accounting of non deferrable timers
  timers: Consolidate base->next_timer update
  timers: Create detach_if_pending() and use it
2012-07-22 11:35:46 -07:00
Linus Torvalds
55acdddbac Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp/hotplug changes from Ingo Molnar:
 "Various cleanups to the SMP hotplug code - a continuing effort of
  Thomas et al"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smpboot: Remove leftover declaration
  smp: Remove num_booting_cpus()
  smp: Remove ipi_call_lock[_irq]()/ipi_call_unlock[_irq]()
  POWERPC: Smp: remove call to ipi_call_lock()/ipi_call_unlock()
  SPARC: SMP: Remove call to ipi_call_lock_irq()/ipi_call_unlock_irq()
  ia64: SMP: Remove call to ipi_call_lock_irq()/ipi_call_unlock_irq()
  x86-smp-remove-call-to-ipi_call_lock-ipi_call_unlock
  tile: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
  S390: Smp: remove call to ipi_call_lock()/ipi_call_unlock()
  parisc: Smp: remove call to ipi_call_lock()/ipi_call_unlock()
  mn10300: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
  hexagon: SMP: Remove call to ipi_call_lock()/ipi_call_unlock()
2012-07-22 11:22:15 -07:00
Linus Torvalds
2eafeb6a41 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events changes from Ingo Molnar:

 "- kernel side:

   - Intel uncore PMU support for Nehalem and Sandy Bridge CPUs, we
     support both the events available via the MSR and via the PCI
     access space.

   - various uprobes cleanups and restructurings

   - PMU driver quirks by microcode version and required x86 microcode
     loader cleanups/robustization

   - various tracing robustness updates

   - static keys: remove obsolete static_branch()

  - tooling side:

   - GTK browser improvements

   - perf report browser: support screenshots to file

   - more automated tests

   - perf kvm improvements

   - perf bench refinements

   - build environment improvements

   - pipe mode improvements

   - libtraceevent updates, we have now hopefully merged most bits with
     the out of tree forked code base

  ... and many other goodies."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (138 commits)
  tracing: Check for allocation failure in __tracing_open()
  perf/x86: Fix intel_perfmon_event_mapformatting
  jump label: Remove static_branch()
  tracepoint: Use static_key_false(), since static_branch() is deprecated
  perf/x86: Uncore filter support for SandyBridge-EP
  perf/x86: Detect number of instances of uncore CBox
  perf/x86: Fix event constraint for SandyBridge-EP C-Box
  perf/x86: Use 0xff as pseudo code for fixed uncore event
  perf/x86: Save a few bytes in 'struct x86_pmu'
  perf/x86: Add a microcode revision check for SNB-PEBS
  perf/x86: Improve debug output in check_hw_exists()
  perf/x86/amd: Unify AMD's generic and family 15h pmus
  perf/x86: Move Intel specific code to intel_pmu_init()
  perf/x86: Rename Intel specific macros
  perf/x86: Fix USER/KERNEL tagging of samples
  perf tools: Split event symbols arrays to hw and sw parts
  perf tools: Split out PE_VALUE_SYM parsing token to SW and HW tokens
  perf tools: Add empty rule for new line in event syntax parsing
  perf test: Use ARRAY_SIZE in parse events tests
  tools lib traceevent: Cleanup realloc use
  ...
2012-07-22 11:10:36 -07:00
Linus Torvalds
16d286e656 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
 "Quoting from Paul, the major features of this series are:

  1. Preventing latency spikes of more than 200 microseconds for
     kernels built with NR_CPUS=4096, which is reportedly becoming the
     default for some distros.  This is a first step, as it does not
     help with systems that actually -have- 4096 CPUs (work on this case
     is in progress, but is not yet ready for mainline).

     This category also includes improving concurrency of rcu_barrier(),
     placed here due to conflicts.  Posted to LKML at:

      https://lkml.org/lkml/2012/6/22/381

     Note that patches 18-22 of that series have been defered to 3.7, as
     they have not yet proven themselves to be mainline-ready (and yes,
     these are the ones intended to get rid of RCU's latency spikes for
     systems that actually have 4096 CPUs).

  2. Updates to documentation and rcutorture fixes, the latter category
     including improvements to rcu_barrier() testing.  Posted to LKML at

      http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/04094.html.

  3. Miscellaneous fixes posted to LKML at:

      https://lkml.org/lkml/2012/6/22/500

     with the exception of the last commit, which was posted here:

      http://www.gossamer-threads.com/lists/linux/kernel/1561830

  4. RCU_FAST_NO_HZ fixes and improvements.  Posted to LKML at:

      http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/00006.html
      http://www.gossamer-threads.com/lists/linux/kernel/1561833

     The first four patches of the first series went into 3.5 to fix a
     regression.

  5. Code-style fixes.  These were posted to LKML at

      http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01180.html
      http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01181.html"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  rcu: Fix broken strings in RCU's source code.
  rcu: Fix code-style issues involving "else"
  rcu: Introduce check for callback list/count mismatch
  rcu: Make RCU_FAST_NO_HZ respect nohz= boot parameter
  rcu: Fix qlen_lazy breakage
  rcu: Round FAST_NO_HZ lazy timeout to nearest second
  rcu: The rcu_needs_cpu() function is not a quiescent state
  rcu: Dump only the current CPU's buffers for idle-entry/exit warnings
  rcu: Add check for CPUs going offline with callbacks queued
  rcu: Disable preemption in rcu_blocking_is_gp()
  rcu: Prevent uninitialized string in RCU CPU stall info
  rcu: Fix rcu_is_cpu_idle() #ifdef in TINY_RCU
  rcu: Split RCU core processing out of __call_rcu()
  rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
  rcu: Make __call_rcu() handle invocation from idle
  rcu: Remove function versions of __kfree_rcu and __is_kfree_rcu_offset
  rcu: Consolidate tree/tiny __rcu_read_{,un}lock() implementations
  rcu: Remove return value from rcu_assign_pointer()
  key: Remove extraneous parentheses from rcu_assign_keypointer()
  rcu: Remove return value from RCU_INIT_POINTER()
  ...
2012-07-22 10:45:05 -07:00
Tejun Heo
6fec10a1a5 workqueue: fix spurious CPU locality WARN from process_one_work()
25511a4776 "workqueue: reimplement CPU online rebinding to handle idle
workers" added CPU locality sanity check in process_one_work().  It
triggers if a worker is executing on a different CPU without UNBOUND
or REBIND set.

This works for all normal workers but rescuers can trigger this
spuriously when they're serving the unbound or a disassociated
global_cwq - rescuers don't have either flag set and thus its
gcwq->cpu can be a different value including %WORK_CPU_UNBOUND.

Fix it by additionally testing %GCWQ_DISASSOCIATED.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
LKML-Refence: <20120721213656.GA7783@linux.vnet.ibm.com>
2012-07-22 10:16:34 -07:00
Tejun Heo
46f3d97621 kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed
kthread_worker provides minimalistic workqueue-like interface for
users which need a dedicated worker thread (e.g. for realtime
priority).  It has basic queue, flush_work, flush_worker operations
which mostly match the workqueue counterparts; however, due to the way
flush_work() is implemented, it has a noticeable difference of not
allowing work items to be freed while being executed.

While the current users of kthread_worker are okay with the current
behavior, the restriction does impede some valid use cases.  Also,
removing this difference isn't difficult and actually makes the code
easier to understand.

This patch reimplements flush_kthread_work() such that it uses a
flush_work item instead of queue/done sequence numbers.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-22 10:15:28 -07:00
Tejun Heo
9a2e03d8ed kthread_worker: reorganize to prepare for flush_kthread_work() reimplementation
Make the following two non-functional changes.

* Separate out insert_kthread_work() from queue_kthread_work().

* Relocate struct kthread_flush_work and kthread_flush_work_fn()
  definitions above flush_kthread_work().

v2: Added lockdep_assert_held() in insert_kthread_work() as suggested
    by Andy Walls.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andy Walls <awalls@md.metrocast.net>
2012-07-22 10:11:01 -07:00
Linus Torvalds
9a2bc8603e Merge branch 'anton-kgdb' (kgdb dmesg fixups)
Merge emailed kgdb dmesg fixups patches from Anton Vorontsov:
 "The dmesg command appears to be broken after the printk rework.  The
  old logic in the kdb code makes no sense in terms of current
  printk/logging storage format, and KDB simply hangs forever upon
  entering 'dmesg' command.

  The first patch revives the command by switching to kmsg_dumper
  iterator.  As a side-effect, the code is now much more simpler.

  A few changes were needed in the printk.c: we needed unlocked variant
  of the kmsg_dumper iterator, but these can surely wait for 3.6.

  It's probably too late even for the first patch to go to 3.5, but I'll
  try to convince otherwise.  :-) Here we go:

   - The current code is broken for sure, and has no hope to work at
     all.  It is a regression
   - The new code works for me, and probably works for everyone else;
   - If it compiles (and I urge everyone to compile-test it on your
     setup), it hardly can make things worse."

* Merge emailed patches from Anton Vorontsov: (4 commits)
  kdb: Switch to nolock variants of kmsg_dump functions
  printk: Implement some unlocked kmsg_dump functions
  printk: Remove kdb_syslog_data
  kdb: Revive dmesg command
2012-07-21 10:34:13 -07:00
Anton Vorontsov
c064da4714 kdb: Switch to nolock variants of kmsg_dump functions
The locked variants are prone to deadlocks (suppose we got to the
debugger w/ the logbuf lock held), so let's switch to nolock variants.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-21 10:34:00 -07:00
Anton Vorontsov
533827c921 printk: Implement some unlocked kmsg_dump functions
If used from KDB, the locked variants are prone to deadlocks (suppose we
got to the debugger w/ the logbuf lock held).

So, we have to implement a few routines that grab no logbuf lock.

Yet we don't need these functions in modules, so we don't export them.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-21 10:34:00 -07:00
Anton Vorontsov
1b499d05ee printk: Remove kdb_syslog_data
The function is no longer needed, so remove it.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-21 10:34:00 -07:00
Anton Vorontsov
bc792e612e kdb: Revive dmesg command
The kgdb dmesg command is broken after the printk rework.  The old logic
in kdb code makes no sense in terms of current printk/logging storage
format, and KDB simply hangs forever.

This patch revives the command by switching to kmsg_dumper iterator.

The code is now much more simpler and shorter.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-21 10:34:00 -07:00
Dan Williams
a4683487f9 [SCSI] async: make async_synchronize_full() flush all work regardless of domain
In response to an async related regression James noted:

  "My theory is that this is an init problem: The assumption in a lot of
   our code is that async_synchronize_full() waits for everything ... even
   the domain specific async schedules, which isn't true."

...so make this assumption true.

Each domain, including the default one, registers itself on a global domain
list when work is scheduled.  Once all entries complete it exits that
list.  Waiting for the list to be empty syncs all in-flight work across
all domains.

Domains can opt-out of global syncing if they are declared as exclusive
ASYNC_DOMAIN_EXCLUSIVE().  All stack-based domains have been declared
exclusive since the domain may go out of scope as soon as the last work
item completes.

Statically declared domains are mostly ok, but async_unregister_domain()
is there to close any theoretical races with pending
async_synchronize_full waiters at module removal time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eldad Zack <eldadzack@gmail.com>
Tested-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-07-20 09:07:37 +01:00
Dan Williams
2955b47d2c [SCSI] async: introduce 'async_domain' type
This is in preparation for teaching async_synchronize_full() to sync all
pending async work, and not just on the async_running domain.  This
conversion is functionally equivalent, just embedding the existing list
in a new async_domain type.

The .registered attribute is used in a later patch to distinguish
between domains that want to be flushed by async_synchronize_full()
versus those that only expect async_synchronize_{full|cookie}_domain to
be used for flushing.

[jejb: add async.h to scsi_priv.h for struct async_domain]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Tested-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-07-20 09:05:54 +01:00
Vivek Goyal
6791457a09 printk: Export struct log size and member offsets through vmcoreinfo
There are tools like makedumpfile and vmcore-dmesg which can extract
kernel log buffer from vmcore. Since we introduced structured logging,
that functionality is broken. Now user space tools need to know about
"struct log" and offsets of various fields to be able to parse struct
log data and extract text message or dictonary.

This patch exports some of the fields.

Currently I am not exporting log "level" info as that is a bitfield and
offsetof() bitfields can't be calculated. But if people start asking for
log level info in the output then we probably either need to seprate
out "level" or use bit shift operations for flags and level.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-19 17:14:18 -07:00
David S. Miller
abaa72d7fd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
2012-07-19 11:17:30 -07:00
Steven Rostedt
08f6fba503 ftrace/x86: Add separate function to save regs
Add a way to have different functions calling different trampolines.
If a ftrace_ops wants regs saved on the return, then have only the
functions with ops registered to save regs. Functions registered by
other ops would not be affected, unless the functions overlap.

If one ftrace_ops registered functions A, B and C and another ops
registered fucntions to save regs on A, and D, then only functions
A and D would be saving regs. Function B and C would work as normal.
Although A is registered by both ops: normal and saves regs; this is fine
as saving the regs is needed to satisfy one of the ops that calls it
but the regs are ignored by the other ops function.

x86_64 implements the full regs saving, and i386 just passes a NULL
for regs to satisfy the ftrace_ops passing. Where an arch must supply
both regs and ftrace_ops parameters, even if regs is just NULL.

It is OK for an arch to pass NULL regs. All function trace users that
require regs passing must add the flag FTRACE_OPS_FL_SAVE_REGS when
registering the ftrace_ops. If the arch does not support saving regs
then the ftrace_ops will fail to register. The flag
FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED may be set that will prevent the
ftrace_ops from failing to register. In this case, the handler may
either check if regs is not NULL or check if ARCH_SUPPORTS_FTRACE_SAVE_REGS.
If the arch supports passing regs it will set this macro and pass regs
for ops that request them. All other archs will just pass NULL.

Link: Link: http://lkml.kernel.org/r/20120711195745.107705970@goodmis.org

Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-19 13:20:03 -04:00
Steven Rostedt
a1e2e31d17 ftrace: Return pt_regs to function trace callback
Return as the 4th paramater to the function tracer callback the pt_regs.

Later patches that implement regs passing for the architectures will require
having the ftrace_ops set the SAVE_REGS flag, which will tell the arch
to take the time to pass a full set of pt_regs to the ftrace_ops callback
function. If the arch does not support it then it should pass NULL.

If an arch can pass full regs, then it should define:
 ARCH_SUPPORTS_FTRACE_SAVE_REGS to 1

Link: http://lkml.kernel.org/r/20120702201821.019966811@goodmis.org

Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-19 13:18:49 -04:00
Steven Rostedt
ccf3672d53 ftrace: Consolidate arch dependent functions with 'list' function
As the function tracer starts to get more features, the support for
theses features will spread out throughout the different architectures
over time. These features boil down to what each arch does in the
mcount trampoline (the ftrace_caller).

Currently there's two features that are not the same throughout the
archs.

 1) Support to stop function tracing before the callback
 2) passing of the ftrace ops

Both of these require placing an indirect function to support the
features if the mcount trampoline does not.

On a side note, for all architectures, when more than one callback
is registered to the function tracer, an intermediate 'list' function
is called by the mcount trampoline to iterate through the callbacks
that are registered.

Instead of making a separate function for each of these features,
and requiring several indirect calls, just use the single 'list' function
as the intermediate, to handle all cases. If an arch does not support
the 'stop function tracing' or the passing of ftrace ops, just force
it to use the list function that will handle the features required.

This makes the code cleaner and simpler and removes a lot of
 #ifdefs in the code.

Link: http://lkml.kernel.org/r/20120612225424.495625483@goodmis.org

Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-19 13:18:22 -04:00
Steven Rostedt
2f5f6ad939 ftrace: Pass ftrace_ops as third parameter to function trace callback
Currently the function trace callback receives only the ip and parent_ip
of the function that it traced. It would be more powerful to also return
the ops that registered the function as well. This allows the same function
to act differently depending on what ftrace_ops registered it.

Link: http://lkml.kernel.org/r/20120612225424.267254552@goodmis.org

Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-19 13:17:35 -04:00
Theodore Ts'o
c5857ccf29 random: remove rand_initialize_irq()
With the new interrupt sampling system, we are no longer using the
timer_rand_state structure in the irq descriptor, so we can stop
initializing it now.

[ Merged in fixes from Sedat to find some last missing references to
  rand_initialize_irq() ]

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
2012-07-19 10:38:32 -04:00
Linus Torvalds
eea03c20ae Make wait_for_device_probe() also do scsi_complete_async_scans()
Commit a7a20d1039 ("sd: limit the scope of the async probe domain")
make the SCSI device probing run device discovery in it's own async
domain.

However, as a result, the partition detection was no longer synchronized
by async_synchronize_full() (which, despite the name, only synchronizes
the global async space, not all of them).  Which in turn meant that
"wait_for_device_probe()" would not wait for the SCSI partitions to be
parsed.

And "wait_for_device_probe()" was what the boot time init code relied on
for mounting the root filesystem.

Now, most people never noticed this, because not only is it
timing-dependent, but modern distributions all use initrd.  So the root
filesystem isn't actually on a disk at all.  And then before they
actually mount the final disk filesystem, they will have loaded the
scsi-wait-scan module, which not only does the expected
wait_for_device_probe(), but also does scsi_complete_async_scans().

[ Side note: scsi_complete_async_scans() had also been partially broken,
  but that was fixed in commit 43a8d39d01 ("fix async probe
  regression"), so that same commit a7a20d1039 had actually broken
  setups even if you used scsi-wait-scan explicitly ]

Solve this problem by just moving the scsi_complete_async_scans() call
into wait_for_device_probe().  Everybody who wants to wait for device
probing to finish really wants the SCSI probing to complete, so there's
no reason not to do this.

So now "wait_for_device_probe()" really does what the name implies, and
properly waits for device probing to finish.  This also removes the now
unnecessary extra calls to scsi_complete_async_scans().

Reported-and-tested-by: Artem S. Tashkinov <t.artem@mailcity.com>
Cc: Dan Williams <dan.j.williams@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: linux-scsi <linux-scsi@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-18 18:15:46 -07:00
Rafael J. Wysocki
11388c87d2 PM / Sleep: Require CAP_BLOCK_SUSPEND to use wake_lock/wake_unlock
Require processes wanting to use the wake_lock/wake_unlock sysfs
files to have the CAP_BLOCK_SUSPEND capability, which also is
required for the eventpoll EPOLLWAKEUP flag to be effective, so that
all interfaces related to blocking autosleep depend on the same
capability.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
Acked-by: Michael Kerrisk <mtk.man-pages@gmail.com>
2012-07-19 00:00:58 +02:00
Rafael J. Wysocki
823d936409 Merge branch 'fixes' into pm-sleep
The 'fixes' branch contains material the next commit depends on.
2012-07-18 23:58:24 +02:00
Linus Torvalds
6f70242858 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
One more time/ntp fix pulled from Ingo Molnar.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ntp: Fix STA_INS/DEL clearing bug
2012-07-18 10:36:02 -07:00
Ingo Molnar
eec19d1a0d Merge branch 'linus' into timers/core
Resolve semantic conflict in kernel/time/timekeeping.c.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-18 11:25:55 +02:00
Ingo Molnar
6e0f17be03 Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
Pull tracing fix from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-18 11:18:00 +02:00
Ingo Molnar
a2fe194723 Merge branch 'linus' into perf/core
Pick up the latest ring-buffer fixes, before applying a new fix.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-18 11:17:17 +02:00
Tejun Heo
8db25e7891 workqueue: simplify CPU hotplug code
With trustee gone, CPU hotplug code can be simplified.

* gcwq_claim/release_management() now grab and release gcwq lock too
  respectively and gained _and_lock and _and_unlock postfixes.

* All CPU hotplug logic was implemented in workqueue_cpu_callback()
  which was called by workqueue_cpu_up/down_callback() for the correct
  priority.  This was because up and down paths shared a lot of logic,
  which is no longer true.  Remove workqueue_cpu_callback() and move
  all hotplug logic into the two actual callbacks.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17 12:39:28 -07:00
Tejun Heo
628c78e7ea workqueue: remove CPU offline trustee
With the previous changes, a disassociated global_cwq now can run as
an unbound one on its own - it can create workers as necessary to
drain remaining works after the CPU has been brought down and manage
the number of workers using the usual idle timer mechanism making
trustee completely redundant except for the actual unbinding
operation.

This patch removes the trustee and let a disassociated global_cwq
manage itself.  Unbinding is moved to a work item (for CPU affinity)
which is scheduled and flushed from CPU_DONW_PREPARE.

This patch moves nr_running clearing outside gcwq and manager locks to
simplify the code.  As nr_running is unused at the point, this is
safe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17 12:39:27 -07:00
Tejun Heo
3ce6337730 workqueue: don't butcher idle workers on an offline CPU
Currently, during CPU offlining, after all pending work items are
drained, the trustee butchers all workers.  Also, on CPU onlining
failure, workqueue_cpu_callback() ensures that the first idle worker
is destroyed.  Combined, these guarantee that an offline CPU doesn't
have any worker for it once all the lingering work items are finished.

This guarantee isn't really necessary and makes CPU on/offlining more
expensive than needs to be, especially for platforms which use CPU
hotplug for powersaving.

This patch lets offline CPUs removes idle worker butchering from the
trustee and let a CPU which failed onlining keep the created first
worker.  The first worker is created if the CPU doesn't have any
during CPU_DOWN_PREPARE and started right away.  If onlining succeeds,
the rebind_workers() call in CPU_ONLINE will rebind it like any other
workers.  If onlining fails, the worker is left alone till the next
try.

This makes CPU hotplugs cheaper by allowing global_cwqs to keep
workers across them and simplifies code.

Note that trustee doesn't re-arm idle timer when it's done and thus
the disassociated global_cwq will keep all workers until it comes back
online.  This will be improved by further patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17 12:39:27 -07:00
Tejun Heo
25511a4776 workqueue: reimplement CPU online rebinding to handle idle workers
Currently, if there are left workers when a CPU is being brough back
online, the trustee kills all idle workers and scheduled rebind_work
so that they re-bind to the CPU after the currently executing work is
finished.  This works for busy workers because concurrency management
doesn't try to wake up them from scheduler callbacks, which require
the target task to be on the local run queue.  The busy worker bumps
concurrency counter appropriately as it clears WORKER_UNBOUND from the
rebind work item and it's bound to the CPU before returning to the
idle state.

To reduce CPU on/offlining overhead (as many embedded systems use it
for powersaving) and simplify the code path, workqueue is planned to
be modified to retain idle workers across CPU on/offlining.  This
patch reimplements CPU online rebinding such that it can also handle
idle workers.

As noted earlier, due to the local wakeup requirement, rebinding idle
workers is tricky.  All idle workers must be re-bound before scheduler
callbacks are enabled.  This is achieved by interlocking idle
re-binding.  Idle workers are requested to re-bind and then hold until
all idle re-binding is complete so that no bound worker starts
executing work item.  Only after all idle workers are re-bound and
parked, CPU_ONLINE proceeds to release them and queue rebind work item
to busy workers thus guaranteeing scheduler callbacks aren't invoked
until all idle workers are ready.

worker_rebind_fn() is renamed to busy_worker_rebind_fn() and
idle_worker_rebind() for idle workers is added.  Rebinding logic is
moved to rebind_workers() and now called from CPU_ONLINE after
flushing trustee.  While at it, add CPU sanity check in
worker_thread().

Note that now a worker may become idle or the manager between trustee
release and rebinding during CPU_ONLINE.  As the previous patch
updated create_worker() so that it can be used by regular manager
while unbound and this patch implements idle re-binding, this is safe.

This prepares for removal of trustee and keeping idle workers across
CPU hotplugs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17 12:39:27 -07:00
Tejun Heo
bc2ae0f5bb workqueue: drop @bind from create_worker()
Currently, create_worker()'s callers are responsible for deciding
whether the newly created worker should be bound to the associated CPU
and create_worker() sets WORKER_UNBOUND only for the workers for the
unbound global_cwq.  Creation during normal operation is always via
maybe_create_worker() and @bind is true.  For workers created during
hotplug, @bind is false.

Normal operation path is planned to be used even while the CPU is
going through hotplug operations or offline and this static decision
won't work.

Drop @bind from create_worker() and decide whether to bind by looking
at GCWQ_DISASSOCIATED.  create_worker() will also set WORKER_UNBOUND
autmatically if disassociated.  To avoid flipping GCWQ_DISASSOCIATED
while create_worker() is in progress, the flag is now allowed to be
changed only while holding all manager_mutexes on the global_cwq.

This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's
back.  CPU_ONLINE no longer clears DISASSOCIATED before flushing
trustee, which clears DISASSOCIATED before rebinding remaining workers
if asked to release.  For cases where trustee isn't around, CPU_ONLINE
clears DISASSOCIATED after flushing trustee.  Also, now, first_idle
has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE
while binding it.  These convolutions will soon be removed by further
simplification of CPU hotplug path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17 12:39:27 -07:00
Tejun Heo
6037315269 workqueue: use mutex for global_cwq manager exclusion
POOL_MANAGING_WORKERS is used to ensure that at most one worker takes
the manager role at any given time on a given global_cwq.  Trustee
later hitched on it to assume manager adding blocking wait for the
bit.  As trustee already needed a custom wait mechanism, waiting for
MANAGING_WORKERS was rolled into the same mechanism.

Trustee is scheduled to be removed.  This patch separates out
MANAGING_WORKERS wait into per-pool mutex.  Workers use
mutex_trylock() to test for manager role and trustee uses mutex_lock()
to claim manager roles.

gcwq_claim/release_management() helpers are added to grab and release
manager roles of all pools on a global_cwq.  gcwq_claim_management()
always grabs pool manager mutexes in ascending pool index order and
uses pool index as lockdep subclass.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17 12:39:27 -07:00
Tejun Heo
403c821d45 workqueue: ROGUE workers are UNBOUND workers
Currently, WORKER_UNBOUND is used to mark workers for the unbound
global_cwq and WORKER_ROGUE is used to mark workers for disassociated
per-cpu global_cwqs.  Both are used to make the marked worker skip
concurrency management and the only place they make any difference is
in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling
idle timer, which can easily be replaced with trustee state testing.

This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops
WORKER_ROGUE.  This is to prepare for removing trustee and handling
disassociated global_cwqs as unbound.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17 12:39:27 -07:00
Tejun Heo
f2d5a0ee06 workqueue: drop CPU_DYING notifier operation
Workqueue used CPU_DYING notification to mark GCWQ_DISASSOCIATED.
This was necessary because workqueue's CPU_DOWN_PREPARE happened
before other DOWN_PREPARE notifiers and workqueue needed to stay
associated across the rest of DOWN_PREPARE.

After the previous patch, workqueue's DOWN_PREPARE happens after
others and can set GCWQ_DISASSOCIATED directly.  Drop CPU_DYING and
let the trustee set GCWQ_DISASSOCIATED after disabling concurrency
management.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17 12:39:26 -07:00
Tejun Heo
6575820221 workqueue: perform cpu down operations from low priority cpu_notifier()
Currently, all workqueue cpu hotplug operations run off
CPU_PRI_WORKQUEUE which is higher than normal notifiers.  This is to
ensure that workqueue is up and running while bringing up a CPU before
other notifiers try to use workqueue on the CPU.

Per-cpu workqueues are supposed to remain working and bound to the CPU
for normal CPU_DOWN_PREPARE notifiers.  This holds mostly true even
with workqueue offlining running with higher priority because
workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which
runs the per-cpu workqueue without concurrency management without
explicitly detaching the existing workers.

However, if the trustee needs to create new workers, it creates
unbound workers which may wander off to other CPUs while
CPU_DOWN_PREPARE notifiers are in progress.  Furthermore, if the CPU
down is cancelled, the per-CPU workqueue may end up with workers which
aren't bound to the CPU.

While reliably reproducible with a convoluted artificial test-case
involving scheduling and flushing CPU burning work items from CPU down
notifiers, this isn't very likely to happen in the wild, and, even
when it happens, the effects are likely to be hidden by the following
successful CPU down.

Fix it by using different priorities for up and down notifiers - high
priority for up operations and low priority for down operations.

Workqueue cpu hotplug operations will soon go through further cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-17 12:39:26 -07:00
Anton Vorontsov
f555f1231a tracing/function: Convert func_set_flag() to a switch statement
Since the function accepts just one bit, we can use the switch
construction instead of if/else if/...

Just a cosmetic change, there should be no functional changes.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-17 10:15:04 -07:00
Anton Vorontsov
21f679404a tracing/function: Introduce persistent trace option
This patch introduces 'func_ptrace' option, now available in
/sys/kernel/debug/tracing/options when function tracer
is selected.

The patch also adds some tiny code that calls back to pstore
to record the trace. The callback is no-op when PSTORE=n.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-17 10:07:00 -07:00
Anton Vorontsov
b2ad368beb tracing: Fix initialization failure path in tracing_set_tracer()
If tracer->init() fails, current code will leave current_tracer pointing
to an unusable tracer, which at best makes 'current_tracer' report
inaccurate value.

Fix the issue by pointing current_tracer to nop tracer, and only update
current_tracer with the new one after all the initialization succeeds.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-17 09:50:53 -07:00
Kay Sievers
eab072609e kmsg - do not flush partial lines when the console is busy
Fragments of continuation lines are flushed to the console immediately. In
case the console is locked, the fragment must be queued up in the cont
buffer.

If the the console is busy and the continuation line is complete, but no part
of it was written to the console up to this point, we can just store the
entire line as a regular record and free the buffer earlier.

If the console is busy and earlier messages are already queued up, we
should not flush the fragments of continuation lines, but store them after
the queued up messages, to ensure the proper ordering.

This keeps the console output better readable in case printk()s race against
each other, or we receive over-long continuation lines we need to flush.

Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 18:35:30 -07:00
Kay Sievers
d39f3d77c9 kmsg - export "continuation record" flag to /dev/kmsg
In some cases we are forced to store individual records for a continuation
line print.

Export a flag to allow the external re-construction of the line. The flag
allows us to apply a similar logic externally which is used internally when
the console, /proc/kmsg or the syslog() output is printed.

  $ cat /dev/kmsg
  4,165,0,-;Free swap  = 0kB
  4,166,0,-;Total swap = 0kB
  6,167,0,c;[
  4,168,0,+;0
  4,169,0,+;1
  4,170,0,+;2
  4,171,0,+;3
  4,172,0,+;]
  6,173,0,-;[0 1 2 3 ]
  6,174,0,-;Console: colour VGA+ 80x25
  6,175,0,-;console [tty0] enabled

Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 18:35:30 -07:00
Kay Sievers
96efedf149 kmsg - avoid warning for CONFIG_PRINTK=n compilations
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 18:35:29 -07:00
Kay Sievers
7049825318 kmsg - properly print over-long continuation lines
Reserve PREFIX_MAX bytes in the LOG_LINE_MAX line when buffering a
continuation line, to be able to properly prefix the LOG_LINE_MAX
line with the syslog prefix and timestamp when printing it.

Reported-By: Dave Jones <davej@redhat.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 18:35:29 -07:00
Thomas Gleixner
3e997130bd timekeeping: Add missing update call in timekeeping_resume()
The leap second rework unearthed another issue of inconsistent data.

On timekeeping_resume() the timekeeper data is updated, but nothing
calls timekeeping_update(), so now the update code in the timer
interrupt sees stale values.

This has been the case before those changes, but then the timer
interrupt was using stale data as well so this went unnoticed for quite
some time.

Add the missing update call, so all the data is consistent everywhere.

Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Reported-and-tested-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Reported-and-tested-by: Martin Steigerwald <Martin@lichtvoll.de>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Linux PM list <linux-pm@vger.kernel.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-16 10:02:17 -07:00
John Stultz
f726a697d0 time: Rework timekeeping functions to take timekeeper ptr as argument
As part of cleaning up the timekeeping code, this patch converts
a number of internal functions to takei a timekeeper ptr as an
argument, so that the internal functions don't access the global
timekeeper structure directly. This allows for further optimizations
to reduce lock hold time later.

This patch has been updated to include more consistent usage of the
timekeeper value, by making sure it is always passed as a argument
to non top-level functions.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-9-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15 10:39:07 +02:00
John Stultz
2a8c0883c3 time: Move xtime_nsec adjustment underflow handling timekeeping_adjust
When we make adjustments speeding up the clock, its possible
for xtime_nsec to underflow. We already handle this properly,
but we do so from update_wall_time() instead of the more logical
timekeeping_adjust(), where the possible underflow actually
occurs.

Thus, move the correction logic to the timekeeping_adjust, which
is the function that causes the issue. Making update_wall_time()
more readable.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-8-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15 10:39:07 +02:00
John Stultz
f2a5a0854e time: Move arch_gettimeoffset() usage into timekeeping_get_ns()
Since we call arch_gettimeoffset() in all the accessor
functions, move arch_gettimeoffset() calls into
timekeeping_get_ns() and timekeeping_get_ns_raw() to simplify
the code.

This also makes the code easier to maintain as we don't have to
worry about forgetting the arch_gettimeoffset() as has happened
in the past.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-7-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15 10:39:06 +02:00
John Stultz
1f4f948706 time: Refactor accumulation of nsecs to secs
We do the exact same logic moving nsecs to secs in the
timekeeper in multiple places, so condense this into a
single function.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15 10:39:06 +02:00
John Stultz
1e75fa8be9 time: Condense timekeeper.xtime into xtime_sec
The timekeeper struct has a xtime_nsec, which keeps the
sub-nanosecond remainder.  This ends up being somewhat
duplicative of the timekeeper.xtime.tv_nsec value, and we
have to do extra work to keep them apart, copying the full
nsec portion out and back in over and over.

This patch simplifies some of the logic by taking the timekeeper
xtime value and splitting it into timekeeper.xtime_sec and
reuses the timekeeper.xtime_nsec for the sub-second portion
(stored in higher res shifted nanoseconds).

This simplifies some of the accumulation logic. And will
allow for more accurate timekeeping once the vsyscall code
is updated to use the shifted nanosecond remainder.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15 10:39:06 +02:00
John Stultz
fee84c43e6 time: Explicitly use u32 instead of int for shift values
Ingo noted that using a u32 instead of int for shift values
would be better to make sure the compiler doesn't unnecessarily
use complex signed arithmetic.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15 10:39:05 +02:00
John Stultz
42e71e81f5 time: Whitespace cleanups per Ingo%27s requests
Ingo noted a number of places where there is inconsistent
use of whitespace. This patch tries to address the main
culprits.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1342156917-25092-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15 10:39:05 +02:00
Thomas Gleixner
e8b9dd7e24 Merge branch 'timers/urgent' into timers/core
Reason: Update to upstream changes to avoid further conflicts.
Fixup a trivial merge conflict in kernel/time/tick-sched.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15 10:24:53 +02:00
John Stultz
6b1859dba0 ntp: Fix STA_INS/DEL clearing bug
In commit 6b43ae8a61, I
introduced a bug that kept the STA_INS or STA_DEL bit
from being cleared from time_status via adjtimex()
without forcing STA_PLL first.

Usually once the STA_INS is set, it isn't cleared
until the leap second is applied, so its unlikely this
affected anyone. However during testing I noticed it
took some effort to cancel a leap second once STA_INS
was set.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
CC: stable@vger.kernel.org # 3.4
Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-15 09:48:49 +02:00
Theodore Ts'o
775f4b297b random: make 'add_interrupt_randomness()' do something sane
We've been moving away from add_interrupt_randomness() for various
reasons: it's too expensive to do on every interrupt, and flooding the
CPU with interrupts could theoretically cause bogus floods of entropy
from a somewhat externally controllable source.

This solves both problems by limiting the actual randomness addition
to just once a second or after 64 interrupts, whicever comes first.
During that time, the interrupt cycle data is buffered up in a per-cpu
pool.  Also, we make sure the the nonblocking pool used by urandom is
initialized before we start feeding the normal input pool.  This
assures that /dev/urandom is returning unpredictable data as soon as
possible.

(Based on an original patch by Linus, but significantly modified by
tytso.)

Tested-by: Eric Wustrow <ewust@umich.edu>
Reported-by: Eric Wustrow <ewust@umich.edu>
Reported-by: Nadia Heninger <nadiah@cs.ucsd.edu>
Reported-by: Zakir Durumeric <zakir@umich.edu>
Reported-by: J. Alex Halderman <jhalderm@umich.edu>.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-07-14 20:17:28 -04:00
Linus Torvalds
ab93eb8216 Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU, perf, and scheduler fixes from Ingo Molnar.

The RCU fix is a revert for an optimization that could cause deadlocks.

One of the scheduler commits (164c33c6ad "sched: Fix fork() error path
to not crash") is correct but not complete (some architectures like Tile
are not covered yet) - the resulting additional fixes are still WIP and
Ingo did not want to delay these pending fixes.  See this thread on
lkml:

  [PATCH] fork: fix error handling in dup_task()

The perf fixes are just trivial oneliners.

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf kvm: Fix segfault with report and mixed guestmount use
  perf kvm: Fix regression with guest machine creation
  perf script: Fix format regression due to libtraceevent merge
  ring-buffer: Fix accounting of entries when removing pages
  ring-buffer: Fix crash due to uninitialized new_pages list head

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS/sched: Update scheduler file pattern
  sched/nohz: Rewrite and fix load-avg computation -- again
  sched: Fix fork() error path to not crash
2012-07-14 11:16:24 -07:00
David Howells
9249e17fe0 VFS: Pass mount flags to sget()
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called.  They could also be passed to the
compare function.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:38:34 +04:00
David Howells
be34d1a3bc VFS: Make clone_mnt()/copy_tree()/collect_mounts() return errors
copy_tree() can theoretically fail in a case other than ENOMEM, but always
returns NULL which is interpreted by callers as -ENOMEM.  Change it to return
an explicit error.

Also change clone_mnt() for consistency and because union mounts will add new
error cases.

Thanks to Andreas Gruenbacher <agruen@suse.de> for a bug fix.
[AV: folded braino fix by Dan Carpenter]

Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Valerie Aurora <valerie.aurora@gmail.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:37:27 +04:00
Al Viro
79714f72d3 get rid of kern_path_parent()
all callers want the same thing, actually - a kinda-sorta analog of
kern_path_create().  I.e. they want parent vfsmount/dentry (with
->i_mutex held, to make sure the child dentry is still their child)
+ the child dentry.

Signed-off-by Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:35:02 +04:00
Al Viro
00cd8dd3bf stop passing nameidata to ->lookup()
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:32 +04:00
Tejun Heo
3270476a6c workqueue: reimplement WQ_HIGHPRI using a separate worker_pool
WQ_HIGHPRI was implemented by queueing highpri work items at the head
of the global worklist.  Other than queueing at the head, they weren't
handled differently; unfortunately, this could lead to execution
latency of a few seconds on heavily loaded systems.

Now that workqueue code has been updated to deal with multiple
worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using
a separate worker_pool.  NR_WORKER_POOLS is bumped to two and
gcwq->pools[0] is used for normal pri work items and ->pools[1] for
highpri.  Highpri workers get -20 nice level and has 'H' suffix in
their names.  Note that this change increases the number of kworkers
per cpu.

POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain
wakeup code in process_one_work() are no longer used and removed.

This allows proper prioritization of highpri work items and removes
high execution latency of highpri work items.

v2: nr_running indexing bug in get_pool_nr_running() fixed.

v3: Refreshed for the get_pool_nr_running() update in the previous
    patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josh Hunt <joshhunt00@gmail.com>
LKML-Reference: <CAKA=qzaHqwZ8eqpLNFjxnO2fX-tgAOjmpvxgBFjv6dJeQaOW1w@mail.gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
2012-07-13 22:24:45 -07:00
Tejun Heo
4ce62e9e30 workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()
Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code
paths which need to manipulate all pools in a gcwq to use them.
NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates
over only @gcwq->pool.

Note that nr_running is per-pool property and converted to an array
with NR_WORKER_POOLS elements and renamed to pool_nr_running.  Note
that get_pool_nr_running() currently assumes 0 index.  The next patch
will make use of non-zero index.

The changes in this patch are mechanical and don't caues any
functional difference.  This is to prepare for multiple pools per
gcwq.

v2: nr_running indexing bug in get_pool_nr_running() fixed.

v3: Pointer to array is stupid.  Don't use it in get_pool_nr_running()
    as suggested by Linus.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-13 22:16:44 -07:00
Linus Torvalds
d55e5bd020 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull the leap second fixes from Thomas Gleixner:
 "It's a rather large series, but well discussed, refined and reviewed.
  It got a massive testing by John, Prarit and tip.

  In theory we could split it into two parts.  The first two patches

    f55a6faa38: hrtimer: Provide clock_was_set_delayed()
    4873fa070a: timekeeping: Fix leapsecond triggered load spike issue

  are merely preventing the stuff loops forever issues, which people
  have observed.

  But there is no point in delaying the other 4 commits which achieve
  full correctness into 3.6 as they are tagged for stable anyway.  And I
  rather prefer to have the full fixes merged in bulk than a "prevent
  the observable wreckage and deal with the hidden fallout later"
  approach."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Update hrtimer base offsets each hrtimer_interrupt
  timekeeping: Provide hrtimer update function
  hrtimers: Move lock held region in hrtimer_interrupt()
  timekeeping: Maintain ktime_t based offsets for hrtimers
  timekeeping: Fix leapsecond triggered load spike issue
  hrtimer: Provide clock_was_set_delayed()
2012-07-13 15:31:21 -07:00
Tejun Heo
11ebea50db workqueue: separate out worker_pool flags
GCWQ_MANAGE_WORKERS, GCWQ_MANAGING_WORKERS and GCWQ_HIGHPRI_PENDING
are per-pool properties.  Add worker_pool->flags and make the above
three flags per-pool flags.

The changes in this patch are mechanical and don't caues any
functional difference.  This is to prepare for multiple pools per
gcwq.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-12 14:46:37 -07:00
Tejun Heo
63d95a9150 workqueue: use @pool instead of @gcwq or @cpu where applicable
Modify all functions which deal with per-pool properties to pass
around @pool instead of @gcwq or @cpu.

The changes in this patch are mechanical and don't caues any
functional difference.  This is to prepare for multiple pools per
gcwq.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-12 14:46:37 -07:00
Tejun Heo
bd7bdd43dc workqueue: factor out worker_pool from global_cwq
Move worklist and all worker management fields from global_cwq into
the new struct worker_pool.  worker_pool points back to the containing
gcwq.  worker and cpu_workqueue_struct are updated to point to
worker_pool instead of gcwq too.

This change is mechanical and doesn't introduce any functional
difference other than rearranging of fields and an added level of
indirection in some places.  This is to prepare for multiple pools per
gcwq.

v2: Comment typo fixes as suggested by Namhyung.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2012-07-12 14:46:37 -07:00
Tejun Heo
974271c485 workqueue: don't use WQ_HIGHPRI for unbound workqueues
Unbound wqs aren't concurrency-managed and try to execute work items
as soon as possible.  This is currently achieved by implicitly setting
%WQ_HIGHPRI on all unbound workqueues; however, WQ_HIGHPRI
implementation is about to be restructured and this usage won't be
valid anymore.

Add an explicit chain-wakeup path for unbound workqueues in
process_one_work() instead of piggy backing on %WQ_HIGHPRI.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-12 14:46:37 -07:00
Dan Carpenter
93574fcc5b tracing: Check for allocation failure in __tracing_open()
Clean up and return -ENOMEM on if the kzalloc() fails.

This also prevents a potential crash, as the pointer that failed to
allocate would be later used.

Link: http://lkml.kernel.org/r/20120711063507.GF11812@elgon.mountain

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-11 19:56:26 -04:00
Linus Torvalds
00c3e276c5 Merge branch 'akpm' (Andrew's patch-bomb)
Merge random patches from Andrew Morton.

* Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (32 commits)
  memblock: free allocated memblock_reserved_regions later
  mm: sparse: fix usemap allocation above node descriptor section
  mm: sparse: fix section usemap placement calculation
  xtensa: fix incorrect memset
  shmem: cleanup shmem_add_to_page_cache
  shmem: fix negative rss in memcg memory.stat
  tmpfs: revert SEEK_DATA and SEEK_HOLE
  drivers/rtc/rtc-twl.c: fix threaded IRQ to use IRQF_ONESHOT
  fat: fix non-atomic NFS i_pos read
  MAINTAINERS: add OMAP CPUfreq driver to OMAP Power Management section
  sgi-xp: nested calls to spin_lock_irqsave()
  fs: ramfs: file-nommu: add SetPageUptodate()
  drivers/rtc/rtc-mxc.c: fix irq enabled interrupts warning
  mm/memory_hotplug.c: release memory resources if hotadd_new_pgdat() fails
  h8300/uaccess: add mising __clear_user()
  h8300/uaccess: remove assignment to __gu_val in unhandled case of get_user()
  h8300/time: add missing #include <asm/irq_regs.h>
  h8300/signal: fix typo "statis"
  h8300/pgtable: add missing #include <asm-generic/pgtable.h>
  drivers/rtc/rtc-ab8500.c: ensure correct probing of the AB8500 RTC when Device Tree is enabled
  ...
2012-07-11 16:06:54 -07:00
Konstantin Khlebnikov
4229fb1dc6 c/r: prctl: less paranoid prctl_set_mm_exe_file()
"no other files mapped" requirement from my previous patch (c/r: prctl:
update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
paranoid, it forbids operation even if there mapped one shared-anon vma.

Let's check that current mm->exe_file already unmapped, in this case
exe_file symlink already outdated and its changing is reasonable.

Plus, this patch fixes exit code in case operation success.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11 16:04:43 -07:00
John Stultz
5baefd6d84 hrtimer: Update hrtimer base offsets each hrtimer_interrupt
The update of the hrtimer base offsets on all cpus cannot be made
atomically from the timekeeper.lock held and interrupt disabled region
as smp function calls are not allowed there.

clock_was_set(), which enforces the update on all cpus, is called
either from preemptible process context in case of do_settimeofday()
or from the softirq context when the offset modification happened in
the timer interrupt itself due to a leap second.

In both cases there is a race window for an hrtimer interrupt between
dropping timekeeper lock, enabling interrupts and clock_was_set()
issuing the updates. Any interrupt which arrives in that window will
see the new time but operate on stale offsets.

So we need to make sure that an hrtimer interrupt always sees a
consistent state of time and offsets.

ktime_get_update_offsets() allows us to get the current monotonic time
and update the per cpu hrtimer base offsets from hrtimer_interrupt()
to capture a consistent state of monotonic time and the offsets. The
function replaces the existing ktime_get() calls in hrtimer_interrupt().

The overhead of the new function vs. ktime_get() is minimal as it just
adds two store operations.

This ensures that any changes to realtime or boottime offsets are
noticed and stored into the per-cpu hrtimer base structures, prior to
any hrtimer expiration and guarantees that timers are not expired early.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11 23:34:39 +02:00
Thomas Gleixner
f6c06abfb3 timekeeping: Provide hrtimer update function
To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11 23:34:39 +02:00
Thomas Gleixner
196951e912 hrtimers: Move lock held region in hrtimer_interrupt()
We need to update the base offsets from this code and we need to do
that under base->lock. Move the lock held region around the
ktime_get() calls. The ktime_get() calls are going to be replaced with
a function which gets the time and the offsets atomically.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11 23:34:38 +02:00
Thomas Gleixner
5b9fe759a6 timekeeping: Maintain ktime_t based offsets for hrtimers
We need to update the hrtimer clock offsets from the hrtimer interrupt
context. To avoid conversions from timespec to ktime_t maintain a
ktime_t based representation of those offsets in the timekeeper. This
puts the conversion overhead into the code which updates the
underlying offsets and provides fast accessible values in the hrtimer
interrupt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11 23:34:38 +02:00
John Stultz
4873fa070a timekeeping: Fix leapsecond triggered load spike issue
The timekeeping code misses an update of the hrtimer subsystem after a
leap second happened. Due to that timers based on CLOCK_REALTIME are
either expiring a second early or late depending on whether a leap
second has been inserted or deleted until an operation is initiated
which causes that update. Unless the update happens by some other
means this discrepancy between the timekeeping and the hrtimer data
stays forever and timers are expired either early or late.

The reported immediate workaround - $ data -s "`date`" - is causing a
call to clock_was_set() which updates the hrtimer data structures.
See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix

Add the missing clock_was_set() call to update_wall_time() in case of
a leap second event. The actual update is deferred to softirq context
as the necessary smp function call cannot be invoked from hard
interrupt context.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11 23:34:37 +02:00
John Stultz
f55a6faa38 hrtimer: Provide clock_was_set_delayed()
clock_was_set() cannot be called from hard interrupt context because
it calls on_each_cpu().

For fixing the widely reported leap seconds issue it is necessary to
call it from hard interrupt context, i.e. the timer tick code, which
does the timekeeping updates.

Provide a new function which denotes it in the hrtimer cpu base
structure of the cpu on which it is called and raise the hrtimer
softirq. We then execute the clock_was_set() notificiation from
softirq context in run_hrtimer_softirq(). The hrtimer softirq is
rarely used, so polling the flag there is not a performance issue.

[ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get
  rid of all this ifdeffery ASAP ]

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-11 23:34:37 +02:00
Linus Torvalds
3dc352c02f printk fixes for 3.5-rc6
Here are some more printk fixes for 3.5-rc6.  They resolve all known
 outstanding issues with the printk changes that have been happening.  They have
 been tested by the people reporting the problems.
 
 This hopefully should be it for the printk stuff for 3.5-final.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk/9g5IACgkQMUfUDdst+ykGRgCgsLQ+ltx2CExSNZ29Z9OVi1cW
 KFAAoMmZCJkrj7gyCX4Y/UZ7qa+iYm7T
 =nVVT
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull printk fixes from Greg Kroah-Hartman:
 "Here are some more printk fixes for 3.5-rc6.  They resolve all known
  outstanding issues with the printk changes that have been happening.
  They have been tested by the people reporting the problems.

  This hopefully should be it for the printk stuff for 3.5-final.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  kmsg: merge continuation records while printing
  kmsg: /proc/kmsg - support reading of partial log records
  kmsg: make sure all messages reach a newly registered boot console
  kmsg: properly handle concurrent non-blocking read() from /proc/kmsg
  kmsg: add the facility number to the syslog prefix
  kmsg: escape the backslash character while exporting data
  printk: replacing the raw_spin_lock/unlock with raw_spin_lock/unlock_irq
2012-07-11 12:15:15 -07:00
Grant Likely
9844a5524e irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.
irq_create_direct_mapping can only be used with the NOMAP type.  Make
the function test to ensure it is passed the correct type of
irq_domain.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
2012-07-11 16:16:13 +01:00
Grant Likely
d6b0d1f705 irqdomain: Eliminate dedicated radix lookup functions
In preparation to remove the slow revmap path, eliminate the public
radix revmap lookup functions.  This simplifies the code and makes the
slowpath removal patch a lot simpler.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
2012-07-11 16:16:00 +01:00
Grant Likely
98aa468e04 irqdomain: Support for static IRQ mapping and association.
This adds a new strict mapping API for supporting creation of linux IRQs
at existing positions within the domain. The new routines are as follows:

For dynamic allocation and insertion to specified ranges:

	- irq_create_identity_mapping()
	- irq_create_strict_mappings()

These will allocate and associate a range of linux IRQs at the specified
location. This can be used by controllers that have their own static linux IRQ
definitions to map a hwirq range to, as well as for platforms that wish to
establish 1:1 identity mapping between linux and hwirq space.

For insertion to specified ranges by platforms that do their own irq_desc
management:

	- irq_domain_associate()
	- irq_domain_associate_many()

These in turn call back in to the domain's ->map() routine, for further
processing by the platform. Disassociation of IRQs get handled through
irq_dispose_mapping() as normal.

With these in place it should be possible to begin migration of legacy IRQ
domains to linear ones, without requiring special handling for static vs
dynamic IRQ definitions in DT vs non-DT paths. This also makes it possible
for domains with static mappings to adopt whichever tree model best fits
their needs, rather than simply restricting them to linear revmaps.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
[grant.likely: Reorganized irq_domain_associate{,_many} to have all logic in one place]
[grant.likely: Add error checking for unallocated irq_descs at associate time]
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
2012-07-11 16:15:37 +01:00
Grant Likely
2a71a1a9da irqdomain: Always update revmap when setting up a virq
At irq_setup_virq() time all of the data needed to update the reverse
map is available, but the current code ignores it and relies upon the
slow path to insert revmap records.  This patch adds revmap updating
to the setup path so the slow path will no longer be necessary.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
2012-07-11 16:15:34 +01:00
Grant Likely
913af20707 irqdomain: Split disassociating code into separate function
This patch moves the irq disassociation code out into a separate
function in preparation to extend irq_setup_virq to handle multiple
irqs and rename it for use by interrupt controller drivers.  The new
function will be used by irq_setup_virq() in its error path.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
2012-07-11 16:15:34 +01:00
Grant Likely
80c1834fc8 Linux 3.5-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJP+NMmAAoJEHm+PkMAQRiGPxEH/18YQN8FAzEIjcC10ytA3RC3
 KzPv31jXgJGZDy1UqmpKtJ7GDwb92AhqZxVnJimMa+6d1uA8NsZQq5EMOPPiX8Qi
 8P4AEaw5kSMmR/6zxsxguCGdbDLU3xZ1nJZkHyMgjo2UJbMU0jBPneb/79heWPhe
 0HOkLzN5VA6Yx3Nt70sWQ1zsuj0Ji5jCGO0iNTCBmTiv4J9ZlOx3xJQn4aK6JscO
 /3QRTM43GG0j6zToEOCTHrn8ajOq6rHQQkG0bPVR723nFrSGLoaCT6QVBXYug+AZ
 9Xay7zVNvrq2oH5x5jADG2t2vyaG+nEJpSrVjXznzxgDnK7tWjYqiuG5zqKhAq8=
 =IMfr
 -----END PGP SIGNATURE-----

Merge tag 'v3.5-rc6' into irqdomain/next

Linux 3.5-rc6
2012-07-11 16:08:35 +01:00
Dong Aisheng
22076c7712 irq_domain: correct a minor wrong comment for linear revmap
The revmap type should be linear for irq_domain_add_linear function.

Signed-off-by: Dong Aisheng <dong.aisheng@linaro.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-07-11 15:07:27 +01:00
Mark Brown
781d0f46d8 irq_domain: Standardise legacy/linear domain selection
A large proportion of interrupt controllers that support legacy mappings
do so because non-DT systems need to use fixed IRQ numbers when registering
devices via buses but can otherwise use a linear mapping. The interrupt
controller itself typically is not affected by the mapping used and best
practice is to use a linear mapping where possible so drivers frequently
select at runtime depending on if a legacy range has been allocated to
them.

Standardise this behaviour by providing irq_domain_register_simple() which
will allocate a linear mapping unless a positive first_irq is provided in
which case it will fall back to a legacy mapping. This helps make best
practice for irq_domain adoption clearer.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-07-11 14:59:17 +01:00
Kay Sievers
5becfb1df5 kmsg: merge continuation records while printing
In (the unlikely) case our continuation merge buffer is busy, we unfortunately
can not merge further continuation printk()s into a single record and have to
store them separately, which leads to split-up output of these lines when they
are printed.

Add some flags about newlines and prefix existence to these records and try to
reconstruct the full line again, when the separated records are printed.

Reported-By: Michael Neuling <mikey@neuling.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Tested-By: Michael Neuling <mikey@neuling.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-09 12:15:42 -07:00
Tejun Heo
ce27e317ba cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode
While refactoring cgroup file removal path, 05ef1d7c4a "cgroup:
introduce struct cfent" incorrectly changed the @dir argument of
simple_unlink() to the inode of the file being deleted instead of that
of the containing directory.

The effect of this bug is minor - ctime and mtime of the parent
weren't properly updated on file deletion.

Fix it by using @cgrp->dentry->d_inode instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
2012-07-09 10:11:14 -07:00
Kay Sievers
eb02dac937 kmsg: /proc/kmsg - support reading of partial log records
Restore support for partial reads of any size on /proc/kmsg, in case the
supplied read buffer is smaller than the record size.

Some people seem to think is is ia good idea to run:
  $ dd if=/proc/kmsg bs=1 of=...
as a klog bridge.

Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=44211
Reported-by: Jukka Ollila <jiiksteri@gmail.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-09 10:05:10 -07:00
Tejun Heo
5db9a4d99b cgroup: fix cgroup hierarchy umount race
48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed.  As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.

Destroying a superblock which has dentries with positive refcnts is a
critical bug and triggers BUG() in vfs code.  As each cgroup dentry
holds an s_active reference, any lingering cgroup has both its dentry
and the superblock pinned and thus preventing premature release of
superblock.

Unfortunately, after 48ddbe1946, there's a small window while
releasing a cgroup which is directly under the root of the hierarchy.
When a cgroup directory is released, vfs layer first deletes the
corresponding dentry and then invokes dput() on the parent, which may
recurse further, so when a cgroup directly below root cgroup is
released, the cgroup is first destroyed - which releases the s_active
it was holding - and then the dentry for the root cgroup is dput().

This creates a window where the root dentry's refcnt isn't zero but
superblock's s_active is.  If umount happens before or during this
window, vfs will see the root dentry with non-zero refcnt and trigger
BUG().

Before 48ddbe1946, this problem didn't exist because the last dentry
reference was guaranteed to be put synchronously from rmdir(2)
invocation which holds s_active around the whole process.

Fix it by holding an extra superblock->s_active reference across
dput() from css release, which is the dput() path added by 48ddbe1946
and the only one which doesn't hold an extra s_active ref across the
final cgroup dput().

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: shyju pv <shyju.pv@huawei.com>
Tested-by: shyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-07-07 16:08:18 -07:00
Tejun Heo
7db5b3ca0e Revert "cgroup: superblock can't be released with active dentries"
This reverts commit fa980ca87d.  The
commit was an attempt to fix a race condition where a cgroup hierarchy
may be unmounted with positive dentry reference on root cgroup.  While
the commit made the race condition slightly more difficult to trigger,
the race was still there and could be reliably triggered using a
different test case.

Revert the incorrect fix.  The next commit will describe the race and
fix it correctly.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: shyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-07-07 15:55:47 -07:00
Kay Sievers
68b6507dc5 kmsg: make sure all messages reach a newly registered boot console
We suppress printing kmsg records to the console, which are already printed
immediately while we have received their fragments.

Newly registered boot consoles print the entire kmsg buffer during
registration. Clear the console-suppress flag after we skipped the record
during its first storage, so any later print will see these records as usual.

Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-06 09:50:09 -07:00
Kay Sievers
cb424ffe9f kmsg: properly handle concurrent non-blocking read() from /proc/kmsg
The /proc/kmsg read() interface is internally simply wired up to a sequence
of syslog() syscalls, which might are racy between their checks and actions,
regarding concurrency.

In the (very uncommon) case of concurrent readers of /dev/kmsg, relying on
usual O_NONBLOCK behavior, the recently introduced mutex might block an
O_NONBLOCK reader in read(), when poll() returns for it, but another process
has already read the data in the meantime. We've seen that while running
artificial test setups and tools that "fight" about /proc/kmsg data.

This restores the original /proc/kmsg behavior, where in case of concurrent
read()s, poll() might wake up but the read() syscall will just return 0 to
the caller, while another process has "stolen" the data.

This is in the general case not the expected behavior, but it is the exact
same one, that can easily be triggered with a 3.4 kernel, and some tools
might just rely on it.

The mutex is not needed, the original integrity issue which introduced it,
is in the meantime covered by:
  "fill buffer with more than a single message for SYSLOG_ACTION_READ"
  116e90b23f

Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-06 09:50:09 -07:00
Kay Sievers
43a73a50b3 kmsg: add the facility number to the syslog prefix
After the recent split of facility and level into separate variables,
we miss the facility value (always 0 for kernel-originated messages)
in the syslog prefix.

On Tue, Jul 3, 2012 at 12:45 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> Static checkers complain about the impossible condition here.
>
> In 084681d14e ('printk: flush continuation lines immediately to
> console'), we changed msg->level from being a u16 to being an unsigned
> 3 bit bitfield.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-06 09:50:09 -07:00
Kay Sievers
e3f5a5f271 kmsg: escape the backslash character while exporting data
Non-printable characters in the log data are hex-escaped to ensure safe
post processing. We need to escape a backslash we find in the data, to be
able to distinguish it from a backslash we add for the escaping.

Also escape the non-printable character 127.

Thanks to Miloslav Trmac for the heads up.

Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-06 09:50:09 -07:00
liu chuansheng
5c53d819c7 printk: replacing the raw_spin_lock/unlock with raw_spin_lock/unlock_irq
In function devkmsg_read/writev/llseek/poll/open()..., the function
raw_spin_lock/unlock is used, there is potential deadlock case happening.
CPU1: thread1 doing the cat /dev/kmsg:
        raw_spin_lock(&logbuf_lock);
        while (user->seq == log_next_seq) {
when thread1 run here, at this time one interrupt is coming on CPU1 and running
based on this thread,if the interrupt handle called the printk which need the
logbuf_lock spin also, it will cause deadlock.

So we should use raw_spin_lock/unlock_irq here.

Acked-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-06 09:50:08 -07:00
Ingo Molnar
5c09d127a1 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the RCU tree from Paul E. McKenney:

"The major features of this series are:

1.	Preventing latency spikes of more than 200 microseconds for
	kernels built with NR_CPUS=4096, which is reportedly becoming
	the default for some distros.  This is a first step, as it does
	not help with systems that actually -have- 4096 CPUs (work on
	this case is in progress, but is not yet ready for mainline).
	This category also includes improving concurrency of rcu_barrier(),
	placed here due to conflicts.  Posted to LKML at:
	https://lkml.org/lkml/2012/6/22/381.  Note that patches 18-22
	of that series have been defered to 3.7, as they have not yet
	proven themselves to be mainline-ready (and yes, these are the
	ones intended to get rid of RCU's latency spikes for systems
	that actually have 4096 CPUs).
2.	Updates to documentation and rcutorture fixes, the latter category
	including improvements to rcu_barrier() testing.  Posted to LKML at:
	http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/04094.html.
3.	Miscellaneous fixes posted to LKML at:
	https://lkml.org/lkml/2012/6/22/500, with the exception of the
	last commit, which was posted here:
	http://www.gossamer-threads.com/lists/linux/kernel/1561830
4.	RCU_FAST_NO_HZ fixes and improvements.  Posted to LKML at:
	http://lkml.indiana.edu/hypermail/linux/kernel/1206.1/00006.html
	and http://www.gossamer-threads.com/lists/linux/kernel/1561833.
	The first four patches of the first series went into 3.5 to fix
	a regression.
5.	Code-style fixes.  These were posted to LKML at
	http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01180.html and
	http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/01181.html.
"

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-06 16:13:58 +02:00
Paul E. McKenney
5cf05ad758 rcu: Fix broken strings in RCU's source code.
Although the C language allows you to break strings across lines, doing
this makes it hard for people to find the Linux kernel code corresponding
to a given console message.  This commit therefore fixes broken strings
throughout RCU's source code.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-06 06:01:49 -07:00
Paul E. McKenney
c701d5d9b3 rcu: Fix code-style issues involving "else"
The Linux kernel coding style says that single-statement blocks should
omit curly braces unless the other leg of the "if" statement has
multiple statements, in which case the curly braces should be included.
This commit fixes RCU's violations of this rule.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-06 06:01:48 -07:00
Paul E. McKenney
02a0677b0b Merge branches 'bigrtm.2012.07.04a', 'doctorture.2012.07.02a', 'fixes.2012.07.06a' and 'fnh.2012.07.02a' into HEAD
bigrtm: First steps towards getting RCU out of the way of
	tens-of-microseconds real-time response on systems compiled
	with NR_CPUS=4096.  Also cleanups for and increased concurrency
	of rcu_barrier() family of primitives.
doctorture: rcutorture and documentation improvements.
fixes:  Miscellaneous fixes.
fnh: RCU_FAST_NO_HZ fixes and improvements.
2012-07-06 05:59:30 -07:00
Paul E. McKenney
cfca927972 rcu: Introduce check for callback list/count mismatch
The recent bug that introduced the RCU callback list/count mismatch
showed the need for a diagnostic to check for this, which this commit
adds.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-06 05:55:16 -07:00
Grant Likely
74a7f08448 devicetree: add helper inline for retrieving a node's full name
The pattern (np ? np->full_name : "<none>") is rather common in the
kernel, but can also make for quite long lines.  This patch adds a new
inline function, of_node_full_name() so that the test for a valid node
pointer doesn't need to be open coded at all call sites.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
2012-07-06 07:16:34 -05:00
Ingo Molnar
40b3c43f04 Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Pull low probability CONFIG_RCU_BOOST=y deadlock fix from Paul E. McKenney.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-06 11:18:50 +02:00
Ingo Molnar
35c2f48c66 Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
Pull tracing updates from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-06 11:12:17 +02:00
Ingo Molnar
90574ebb7e Merge branch 'perf/urgent' into perf/core
Merge this branch to pick up a fixlet and to update to a more recent base.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:10:23 +02:00
Peter Zijlstra
5167e8d541 sched/nohz: Rewrite and fix load-avg computation -- again
Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
Reported-and-tested-by: Charles Wang <muming.wq@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 20:58:13 +02:00
Salman Qazi
164c33c6ad sched: Fix fork() error path to not crash
In dup_task_struct(), if arch_dup_task_struct() fails, the clean up
code fails to clean up correctly.  That's because the clean up
code depends on unininitalized ti->task pointer.  We fix this
by making sure that the task and thread_info know about each other
before we attempt to take the error path.

Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120626011815.11323.5533.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 20:57:32 +02:00
David S. Miller
c90a9bb907 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-07-05 03:44:25 -07:00
Linus Torvalds
a3da2c6913 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block bits from Jens Axboe:
 "As vacation is coming up, thought I'd better get rid of my pending
  changes in my for-linus branch for this iteration.  It contains:

   - Two patches for mtip32xx.  Killing a non-compliant sysfs interface
     and moving it to debugfs, where it belongs.

   - A few patches from Asias.  Two legit bug fixes, and one killing an
     interface that is no longer in use.

   - A patch from Jan, making the annoying partition ioctl warning a bit
     less annoying, by restricting it to !CAP_SYS_RAWIO only.

   - Three bug fixes for drbd from Lars Ellenberg.

   - A fix for an old regression for umem, it hasn't really worked since
     the plugging scheme was changed in 3.0.

   - A few fixes from Tejun.

   - A splice fix from Eric Dumazet, fixing an issue with pipe
     resizing."

* 'for-linus' of git://git.kernel.dk/linux-block:
  scsi: Silence unnecessary warnings about ioctl to partition
  block: Drop dead function blk_abort_queue()
  block: Mitigate lock unbalance caused by lock switching
  block: Avoid missed wakeup in request waitqueue
  umem: fix up unplugging
  splice: fix racy pipe->buffers uses
  drbd: fix null pointer dereference with on-congestion policy when diskless
  drbd: fix list corruption by failing but already aborted reads
  drbd: fix access of unallocated pages and kernel panic
  xen/blkfront: Add WARN to deal with misbehaving backends.
  blkcg: drop local variable @q from blkg_destroy()
  mtip32xx: Create debugfs entries for troubleshooting
  mtip32xx: Remove 'registers' and 'flags' from sysfs
  blkcg: fix blkg_alloc() failure path
  block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
  block: fix return value on cfq_init() failure
  mtip32xx: Remove version.h header file inclusion
  xen/blkback: Copy id field when doing BLKIF_DISCARD.
2012-07-03 15:45:10 -07:00
Paul E. McKenney
9d2ad24306 rcu: Make RCU_FAST_NO_HZ respect nohz= boot parameter
If the nohz= boot parameter disables nohz, then RCU_FAST_NO_HZ needs to
also disable itself.  This commit therefore checks for tick_nohz_enabled
being zero, disabling rcu_prepare_for_idle() if so.  This commit assumes
that tick_nohz_enabled can change at runtime: If this is not the case,
then a simpler approach suffices.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:43 -07:00
Paul E. McKenney
e84c48ae30 rcu: Round FAST_NO_HZ lazy timeout to nearest second
Currently, if several CPUs in the same package have all lazy RCU
callbacks, their wakeups will be uncorrelated.  If all the CPUs are in the
same power domain (as is often the case), this will result in unnecessary
power-ups of the package.  This commit therefore uses round_jiffies()
to round the timeouts to a second boundary, increasing the odds that
they can be coalesced with each other or with other timeouts.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:42 -07:00
Paul E. McKenney
28f8555364 rcu: The rcu_needs_cpu() function is not a quiescent state
The TINY_PREEMPT_RCU() function rcu_preempt_needs_cpu(), which is called
from rcu_needs_cpu(), assumes that it is in a quiescent state with respect
to the CPU.  This is no longer the case.  This commit therefore updates
rcu_preempt_needs_cpu() to make it aware that it is not running in a
quiescent state.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-07-02 12:34:42 -07:00
Paul E. McKenney
bf1304e9cd rcu: Dump only the current CPU's buffers for idle-entry/exit warnings
Problems in RCU idle entry and exit are almost always confined to the
offending CPU.  This commit therefore switches ftrace_dump() from
DUMP_ALL to DUMP_ORIG.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-07-02 12:34:42 -07:00
Paul E. McKenney
cf01537ecf rcu: Add check for CPUs going offline with callbacks queued
If a CPU goes offline with callbacks queued, those callbacks might be
indefinitely postponed, which can result in a system hang.  This commit
therefore inserts warnings for this condition.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:25 -07:00
Paul E. McKenney
95f0c1de3e rcu: Disable preemption in rcu_blocking_is_gp()
It is time to optimize CONFIG_TREE_PREEMPT_RCU's synchronize_rcu()
for uniprocessor optimization, which means that rcu_blocking_is_gp()
can no longer rely on RCU read-side critical sections having disabled
preemption.  This commit therefore disables preemption across
rcu_blocking_is_gp()'s scan of the cpu_online_mask.

(Updated from previous version to fix embarrassing bug spotted by
Wu Fengguang.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:25 -07:00
Carsten Emde
1c17e4d443 rcu: Prevent uninitialized string in RCU CPU stall info
An uninitialized string may be displayed at the end of the rcu_preempt
detected stall info such as

0: (1 GPs behind) idle=075/140000000000000/0 =8?^D=8?^D
                                             ^^^^^^^^^^
if CONFIG_RCU_FAST_NO_HZ is not defined.

This trivial patch clears the string in this case.

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:25 -07:00
Paul E. McKenney
d7118175cc rcu: Fix rcu_is_cpu_idle() #ifdef in TINY_RCU
The rcu_is_cpu_idle() function is used if CONFIG_DEBUG_LOCK_ALLOC,
but TINY_RCU defines it only when CONFIG_PROVE_RCU.  This causes
build failures when CONFIG_DEBUG_LOCK_ALLOC=y but CONFIG_PROVE_RCU=n.
This commit therefore adjusts the #ifdefs for rcu_is_cpu_idle() so
that it is defined when CONFIG_DEBUG_LOCK_ALLOC=y.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:25 -07:00
Paul E. McKenney
29154c57e3 rcu: Split RCU core processing out of __call_rcu()
The __call_rcu() function is a bit overweight, so this commit splits
it into actual enqueuing of and accounting for the callback (__call_rcu())
and associated RCU-core processing (__call_rcu_core()).

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:24 -07:00
Paul E. McKenney
a16b7a6934 rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
The __call_rcu() function will invoke the RCU core, for example, if
it detects that the current CPU has too many callbacks.  However, this
can happen on an offline CPU that is on its way to the idle loop, in
which case it is an error to invoke the RCU core, and the excess callbacks
will be adopted in any case.  This commit therefore adds checks to
__call_rcu() for running on an offline CPU, refraining from invoking
the RCU core in this case.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:24 -07:00
Paul E. McKenney
62fde6edf1 rcu: Make __call_rcu() handle invocation from idle
Although __call_rcu() is handled correctly when called from a momentary
non-idle period, if it is called on a CPU that RCU believes to be idle
on RCU_FAST_NO_HZ kernels, the callback might be indefinitely postponed.
This commit therefore ensures that RCU is aware of the new callback and
has a chance to force the CPU out of dyntick-idle mode when a new callback
is posted.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:24 -07:00
Paul E. McKenney
2a3fa843b5 rcu: Consolidate tree/tiny __rcu_read_{,un}lock() implementations
The CONFIG_TREE_PREEMPT_RCU and CONFIG_TINY_PREEMPT_RCU versions of
__rcu_read_lock() and __rcu_read_unlock() are identical, so this commit
consolidates them into kernel/rcupdate.h.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:23 -07:00
Paul E. McKenney
1d1fb395f6 rcu: Add ACCESS_ONCE() to ->qlen accesses
The _rcu_barrier() function accesses other CPUs' rcu_data structure's
->qlen field without benefit of locking.  This commit therefore adds
the required ACCESS_ONCE() wrappers around accesses and updates that
need it.

ACCESS_ONCE() is not needed when a CPU accesses its own ->qlen, or
in code that cannot run while _rcu_barrier() is sampling ->qlen fields.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:22 -07:00
Paul E. McKenney
3f5d3ea64f rcu: Consolidate duplicate callback-list initialization
There are a couple of open-coded initializations of the rcu_data
structure's RCU callback list.  This commit therefore consolidates
them into a new init_callback_list() function.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:21 -07:00
Paul E. McKenney
285fe29481 rcu: Fix detection of abruptly-ending stall
The code that attempts to identify stalls that end just as we detect
them is broken by both flavors of initialization failure.  This commit
therefore properly initializes and computes the count of the number
of reasons why the RCU grace period is stalled.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:21 -07:00
Paul E. McKenney
72472a02a9 rcu: Make rcutorture fakewriters invoke rcu_barrier()
The current rcutorture rcu_barrier() testing never intentionally runs
more than one instance of rcu_barrier() at a given time.  This fails
to test the the shiny new concurrency features of rcu_barrier().  This
commit therefore modifies the rcutorture fakewriter kthread to randomly
invoke rcu_barrier() rather than the usual synchronize_rcu().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:04 -07:00
Paul E. McKenney
143aa672f4 rcu: Fix diagnostic-printk typo in rcutorture
The rcu_torture_barrier() function has a copy-and-paste typo in the
string passed to rcutorture_shutdown_absorb(), which this commit fixes.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:03 -07:00
Paul E. McKenney
c6ebcbb60c rcu: Fix bug in rcu_barrier() torture test
The child threads in the rcu_torture_barrier_cbs() are improperly
synchronized, which can cause the rcu_barrier() tests to hang.  The
failure mode is as follows:

1.	CPU 0 running in rcu_torture_barrier() sets barrier_cbs_count
    	to n_barrier_cbs.

2.	CPU 1 running in rcu_torture_barrier_cbs() wakes up, posts
    	its RCU callback, and atomically decrements barrier_cbs_count.
    	Because barrier_cbs_count is not zero, it does not do the wake_up().

3.	CPU 2 running in rcu_torture_barrier_cbs() wakes up, but
    	finds that barrier_cbs_count is not equal to n_barrier_cbs,
    	and so returns to sleep.

4.	The value of barrier_cbs_count therefore never reaches zero,
    	which causes the test to hang.

This commit therefore uses a phase variable to coordinate the test,
preventing this scenario from occurring.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:03 -07:00
Paul E. McKenney
e3f8d3788e rcu: Test srcu_barrier() from rcutorture test suite
SRCU now has a call_srcu() and an srcu_barrier(), but rcutorture does not
test them.  This commit adds the machinery to allow rcutorture's existing
tests for call_rcu() and rcu_barrier() to apply to the SRCU equivalents.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:03 -07:00
Paul E. McKenney
751a68b2e4 rcu: Rationalize ordering of torture_ops list
Move the raw SRCU interfaces out of the middle of the normal SRCU
interfaces.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:03 -07:00
Paul E. McKenney
ff015030c9 rcu: RCU_SAVE_DYNTICK code no longer ever dead
Before RCU had unified idle, the RCU_SAVE_DYNTICK leg of the switch
statement in force_quiescent_state() was dead code for CONFIG_NO_HZ=n
kernel builds.  With unified idle, the code is never dead.  This commit
therefore removes the "if" statement designed to make gcc aware of when
the code was and was not dead.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:24 -07:00
Paul E. McKenney
c0cc962da3 rcu: Use for_each_rcu_flavor() in TREE_RCU tracing
This commit applies the new for_each_rcu_flavor() macro to the
kernel/rcutree_trace.c file.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:24 -07:00
Paul E. McKenney
6ce75a2326 rcu: Introduce for_each_rcu_flavor() and use it
The arrival of TREE_PREEMPT_RCU some years back included some ugly
code involving either #ifdef or #ifdef'ed wrapper functions to iterate
over all non-SRCU flavors of RCU.  This commit therefore introduces
a for_each_rcu_flavor() iterator over the rcu_state structures for each
flavor of RCU to clean up a bit of the ugliness.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:24 -07:00
Paul E. McKenney
1bca8cf1a2 rcu: Remove unneeded __rcu_process_callbacks() argument
With the advent of __this_cpu_ptr(), it is no longer necessary to pass
both the rcu_state and rcu_data structures into __rcu_process_callbacks().
This commit therefore computes the rcu_data pointer from the rcu_state
pointer within __rcu_process_callbacks() so that callers can pass in
only the pointer to the rcu_state structure.  This paves the way for
linking the rcu_state structures together and iterating over them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:23 -07:00
Paul E. McKenney
d7e187c8e9 rcu: Add rcu_barrier() statistics to debugfs tracing
This commit adds an rcubarrier file to RCU's debugfs statistical tracing
directory, providing diagnostic information on rcu_barrier().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:23 -07:00
Paul E. McKenney
a83eff0a82 rcu: Add tracing for _rcu_barrier()
This commit adds event tracing for _rcu_barrier() execution.  This
is defined only if RCU_TRACE=y.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:23 -07:00
Paul E. McKenney
cf3a9c4842 rcu: Increase rcu_barrier() concurrency
The traditional rcu_barrier() implementation has serialized all requests,
regardless of RCU flavor, and also does not coalesce concurrent requests.
In the past, this has been good and sufficient.

However, systems are getting larger and use of rcu_barrier() has been
increasing.  This commit therefore introduces a counter-based scheme
that allows _rcu_barrier() calls for the same flavor of RCU to take
advantage of each others' work.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:23 -07:00
Paul E. McKenney
cfed0a85da rcu: Remove needless initialization
For global variables, C defaults all fields to zero.  The initialization
of the rcu_state structure's ->n_force_qs and ->n_force_qs_ngp fields
is therefore redundant, so this commit removes these initializations.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:22 -07:00
Paul E. McKenney
7be7f0be90 rcu: Move rcu_barrier_mutex to rcu_state structure
In order to allow each RCU flavor to concurrently execute its
rcu_barrier() function, it is necessary to move the relevant
state to the rcu_state structure.  This commit therefore moves the
rcu_barrier_mutex global variable to a new ->barrier_mutex field
in the rcu_state structure.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:22 -07:00
Paul E. McKenney
7db74df88b rcu: Move rcu_barrier_completion to rcu_state structure
In order to allow each RCU flavor to concurrently execute its
rcu_barrier() function, it is necessary to move the relevant
state to the rcu_state structure.  This commit therefore moves the
rcu_barrier_completion global variable to a new ->barrier_completion
field in the rcu_state structure.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:22 -07:00
Paul E. McKenney
24ebbca8ec rcu: Move rcu_barrier_cpu_count to rcu_state structure
In order to allow each RCU flavor to concurrently execute its rcu_barrier()
function, it is necessary to move the relevant state to the rcu_state
structure.  This commit therefore moves the rcu_barrier_cpu_count global
variable to a new ->barrier_cpu_count field in the rcu_state structure.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:22 -07:00
Paul E. McKenney
06668efa91 rcu: Move _rcu_barrier()'s rcu_head structures to rcu_data structures
In order for multiple flavors of RCU to each concurrently run one
rcu_barrier(), each flavor needs its own per-CPU set of rcu_head
structures.  This commit therefore moves _rcu_barrier()'s set of
per-CPU rcu_head structures from per-CPU variables to the existing
per-CPU and per-RCU-flavor rcu_data structures.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:21 -07:00
Paul E. McKenney
037b64ed0b rcu: Place pointer to call_rcu() in rcu_data structure
This is a preparatory commit for increasing rcu_barrier()'s concurrency.
It adds a pointer in the rcu_data structure to the corresponding call_rcu()
function.  This allows a pointer to the rcu_data structure to imply the
function pointer, which allows _rcu_barrier() state to be placed in the
rcu_state structure.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:21 -07:00
Paul E. McKenney
6c90cc7bf0 rcu: Prevent excessive line length in RCU_STATE_INITIALIZER()
Upcoming rcu_barrier() concurrency commits will result in line lengths
greater than 80 characters in the RCU_STATE_INITIALIZER(), so this commit
shortens the name of the macro's argument to prevent this.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:21 -07:00
Paul E. McKenney
cca6f39319 rcu: Size rcu_node tree from nr_cpu_ids rather than NR_CPUS
The rcu_node tree array is sized based on compile-time constants,
including NR_CPUS.  Although this approach has worked well in the past,
the recent trend by many distros to define NR_CPUS=4096 results in
excessive grace-period-initialization latencies.

This commit therefore substitutes the run-time computed nr_cpu_ids for
the compile-time NR_CPUS when building the tree.  This can result in
much of the compile-time-allocated rcu_node array being unused.  If
this is a major problem, you are in a specialized situation anyway,
so you can manually adjust the NR_CPUS, RCU_FANOUT, and RCU_FANOUT_LEAF
kernel config parameters.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:21 -07:00
Paul E. McKenney
cc5df65b03 rcu: Four-level hierarchy is no longer experimental
Time to make the four-level-hierarchy setting less scary, so this
commit removes "Experimental" from the boot-time message.  Leave the
message in order to get a heads-up on any possible need to expand to
a five-level hierarchy.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:20 -07:00
Paul E. McKenney
f885b7f2b2 rcu: Control RCU_FANOUT_LEAF from boot-time parameter
Although making RCU_FANOUT_LEAF a kernel configuration parameter rather
than a fixed constant makes it easier for people to decrease cache-miss
overhead for large systems, it is of little help for people who must
run a single pre-built kernel binary.

This commit therefore allows the value of RCU_FANOUT_LEAF to be
increased (but not decreased!) via a boot-time parameter named
rcutree.rcu_fanout_leaf.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:20 -07:00
Paul E. McKenney
cba6d0d64e Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"
This reverts commit 616c310e83.
(Move PREEMPT_RCU preemption to switch_to() invocation).
Testing by Sasha Levin <levinsasha928@gmail.com> showed that this
can result in deadlock due to invoking the scheduler when one of
the runqueue locks is held.  Because this commit was simply a
performance optimization, revert it.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
2012-07-02 11:39:19 -07:00
Bojan Smojver
d8150d3504 PM / Hibernate: Print hibernation/thaw progress indicator one line at a time.
With the introduction of suspend to both into in-kernel hibernation
code, dmesg was getting polluted with backspace characters printed as
part of image saving progress indicator. This patch introduces printing
of progress indicator on image save/load every 10% and one line at a
time. As an additional benefit, all other messages emitted by the kernel
during hibernation/thaw should now print cleanly as well.

Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-07-01 13:31:23 +02:00
Rafael J. Wysocki
b2df1d4f8b PM / Sleep: Separate printing suspend times from initcall_debug
Change the behavior of the newly introduced
/sys/power/pm_print_times attribute so that its initial value
depends on initcall_debug, but setting it to 0 will cause device
suspend/resume times not to be printed, even if initcall_debug has
been set.  This way, the people who use initcall_debug for reasons
other than PM debugging will be able to switch the suspend/resume
times printing off, if need be.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-01 13:31:23 +02:00
Sameer Nanda
4b7760ba0d PM / Sleep: add knob for printing device resume times
Added a new knob called /sys/power/pm_print_times. Setting it to 1
enables printing of time taken by devices to suspend and resume.
Setting it to 0 disables this printing (unless overridden by
initcall_debug kernel command line option).

Signed-off-by: Sameer Nanda <snanda@chromium.org>
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-07-01 13:31:22 +02:00
Srivatsa S. Bhat
443772d408 ftrace: Disable function tracing during suspend/resume and hibernation, again
If function tracing is enabled for some of the low-level suspend/resume
functions, it leads to triple fault during resume from suspend, ultimately
ending up in a reboot instead of a resume (or a total refusal to come out
of suspended state, on some machines).

This issue was explained in more detail in commit f42ac38c59 (ftrace:
disable tracing for suspend to ram). However, the changes made by that commit
got reverted by commit cbe2f5a6e8 (tracing: allow tracing of
suspend/resume & hibernation code again). So, unfortunately since things are
not yet robust enough to allow tracing of low-level suspend/resume functions,
suspend/resume is still broken when ftrace is enabled.

So fix this by disabling function tracing during suspend/resume & hibernation.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-07-01 13:31:22 +02:00
Bojan Smojver
62c552ccc3 PM / Hibernate: Enable suspend to both for in-kernel hibernation.
It is often useful to suspend to memory after hibernation image has been
written to disk. If the battery runs out or power is otherwise lost, the
computer will resume from the hibernated image. If not, it will resume
from memory and hibernation image will be discarded.

Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-07-01 13:31:22 +02:00
Randy Dunlap
4f0f4af59c printk.c: fix kernel-doc warnings
Fix kernel-doc warnings in printk.c: use correct parameter name.

  Warning(kernel/printk.c:2429): No description found for parameter 'buf'
  Warning(kernel/printk.c:2429): Excess function parameter 'line' description in 'kmsg_dump_get_buffer'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-30 15:56:40 -07:00
Linus Torvalds
21f27291f5 Driver Core fixes for 3.5-rc5
Here is a number of printk() fixes, specifically a few reported by the
 crazy blog program that ships in SUSE releases (that's "boot log" and
 not "web log", it predates the general "blog" terminology by many
 years), and the restoration of the continuation line functionality
 reported by Stephen and others.  Yes, the changes seem a bit big this
 late in the cycle, but I've been beating on them for a while now, and
 Stephen has even optimized it a bit, so all looks good to me.
 
 The other change in here is a Documentation update for the stable kernel
 rules describing how some distro patches should be backported, to
 hopefully drive a bit more response from the distros to the stable
 kernel releases.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk/uhJEACgkQMUfUDdst+ymL0QCfTjWJrdf+ooJ6Bx/NNgOGxYip
 Ss0AnRrCNkfgmMcdNMn/7CIbHlaTj+S+
 =M5Rg
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver Core fixes from Greg Kroah-Hartman:
 "Here is a number of printk() fixes, specifically a few reported by the
  crazy blog program that ships in SUSE releases (that's "boot log" and
  not "web log", it predates the general "blog" terminology by many
  years), and the restoration of the continuation line functionality
  reported by Stephen and others.  Yes, the changes seem a bit big this
  late in the cycle, but I've been beating on them for a while now, and
  Stephen has even optimized it a bit, so all looks good to me.

  The other change in here is a Documentation update for the stable
  kernel rules describing how some distro patches should be backported,
  to hopefully drive a bit more response from the distros to the stable
  kernel releases.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  printk: Optimize if statement logic where newline exists
  printk: flush continuation lines immediately to console
  syslog: fill buffer with more than a single message for SYSLOG_ACTION_READ
  Revert "printk: return -EINVAL if the message len is bigger than the buf size"
  printk: fix regression in SYSLOG_ACTION_CLEAR
  stable: Allow merging of backports for serious user-visible performance issues
2012-06-30 10:11:24 -07:00
Pablo Neira Ayuso
a31f2d17b3 netlink: add netlink_kernel_cfg parameter to netlink_kernel_create
This patch adds the following structure:

struct netlink_kernel_cfg {
        unsigned int    groups;
        void            (*input)(struct sk_buff *skb);
        struct mutex    *cb_mutex;
};

That can be passed to netlink_kernel_create to set optional configurations
for netlink kernel sockets.

I've populated this structure by looking for NULL and zero parameters at the
existing code. The remaining parameters that always need to be set are still
left in the original interface.

That includes optional parameters for the netlink socket creation. This allows
easy extensibility of this interface in the future.

This patch also adapts all callers to use this new interface.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-29 16:46:02 -07:00
Steven Rostedt
d36208227d printk: Optimize if statement logic where newline exists
In reviewing Kay's fix up patch: "printk: Have printk() never buffer its
data", I found two if statements that could be combined and optimized.

Put together the two 'cont.len && cont.owner == current' if statements
into a single one, and check if we need to call cont_add(). This also
removes the unneeded double cont_flush() calls.

Link: http://lkml.kernel.org/r/1340869133.876.10.camel@mop

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-29 16:55:35 -04:00
Vaibhav Nagarnaik
48fdc72f23 ring-buffer: Fix accounting of entries when removing pages
When removing pages from the ring buffer, its state is not reset. This
means that the counters need to be correctly updated to account for the
pages removed.

Update the overrun counter to reflect the removed events from the pages.

Link: http://lkml.kernel.org/r/1340998301-1715-1-git-send-email-vnagarnaik@google.com

Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-29 16:17:17 -04:00
Vaibhav Nagarnaik
44b99462d9 ring-buffer: Fix crash due to uninitialized new_pages list head
The new_pages list head in the cpu_buffer is not initialized. When
adding pages to the ring buffer, if the memory allocation fails in
ring_buffer_resize, the clean up handler tries to free up the allocated
pages from all the cpu buffers. The panic is caused by referencing the
uninitialized new_pages list head.

Initializing the new_pages list head in rb_allocate_cpu_buffer fixes
this.

Link: http://lkml.kernel.org/r/1340391005-10880-1-git-send-email-vnagarnaik@google.com

Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-29 16:16:35 -04:00
Kay Sievers
084681d14e printk: flush continuation lines immediately to console
Continuation lines are buffered internally, intended to merge the
chunked printk()s into a single record, and to isolate potentially
racy continuation users from usual terminated line users.

This though, has the effect that partial lines are not printed to
the console in the moment they are emitted. In case the kernel
crashes in the meantime, the potentially interesting printed
information would never reach the consoles.

Here we share the continuation buffer with the console copy logic,
and partial lines are always immediately flushed to the available
consoles. They are still buffered internally to improve the
readability and integrity of the messages and minimize the amount
of needed record headers to store.

Signed-off-by: Kay Sievers <kay@vrfy.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-29 11:39:42 -04:00
David S. Miller
b26d344c6b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/caif/caif_hsi.c
	drivers/net/usb/qmi_wwan.c

The qmi_wwan merge was trivial.

The caif_hsi.c, on the other hand, was not.  It's a conflict between
1c385f1fdf ("caif-hsi: Replace platform
device with ops structure.") in the net-next tree and commit
39abbaef19 ("caif-hsi: Postpone init of
HIS until open()") in the net tree.

I did my best with that one and will ask Sjur to check it out.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28 17:37:00 -07:00
Steven Rostedt
a5fb833172 ring-buffer: Fix uninitialized read_stamp
The ring buffer reader page is used to swap a page from the writable
ring buffer. If the writer happens to be on that page, it ends up on the
reader page, but will simply move off of it, back into the writable ring
buffer as writes are added.

The time stamp passed back to the readers is stored in the cpu_buffer per
CPU descriptor. This stamp is updated when a swap of the reader page takes
place, and it reads the current stamp from the page taken from the writable
ring buffer. Everytime a writer goes to a new page, it updates the time stamp
of that page.

The problem happens if a reader reads a page from an empty per CPU ring buffer.
If the buffer is empty, the swap still takes place, placing the writer at the
start of the reader page. If at a later time, a write happens, it updates the
page's time stamp and continues. But the problem is that the read_stamp does
not get updated, because the page was already swapped.

The solution to this was to not swap the page if the ring buffer happens to
be empty. This also removes the side effect that the writes on the reader
page will not get updated because the writer never gets back on the reader
page without a swap. That is, if a read happens on an empty buffer, but then
no reads happen for a while. If a swap took place, and the writer were to start
writing a lot of data (function tracer), it will start overflowing the ring buffer
and overwrite the older data. But because the writer never goes back onto the
reader page, the data left on the reader page never gets overwritten. This
causes the reader to see really old data, followed by a jump to newer data.

Link: http://lkml.kernel.org/r/1340060577-9112-1-git-send-email-dhsharp@google.com
Google-Bug-Id: 6410455
Reported-by: David Sharp <dhsharp@google.com>
tested-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-28 13:52:15 -04:00
Steven Rostedt
6d158a813e tracing: Remove NR_CPUS array from trace_iterator
Replace the NR_CPUS array of buffer_iter from the trace_iterator
with an allocated array. This will just create an array of
possible CPUS instead of the max number specified.

The use of NR_CPUS in that array caused allocation failures for
machines that were tight on memory. This did not cause any failures
to the system itself (no crashes), but caused unnecessary failures
for reading the trace files.

Added a helper function called 'trace_buffer_iter()' that returns
the buffer_iter item or NULL if it is not defined or the array was
not allocated. Some routines do not require the array
(tracing_open_pipe() for one).

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-28 13:52:15 -04:00
Steven Rostedt
0be61ebc18 tracing/selftest: Add a WARN_ON() if a tracer test fails
Add a WARN_ON() output on test failures so that they are easier to detect
in automated tests. Although, the WARN_ON() will not print if the test
causes the system to crash, obviously.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-28 13:52:14 -04:00
David S. Miller
c64e66c67b audit: netlink: Move away from NLMSG_NEW().
And use nlmsg_data() while we're here too.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-26 21:54:14 -07:00
Jan Beulich
116e90b23f syslog: fill buffer with more than a single message for SYSLOG_ACTION_READ
The recent changes to the printk buffer management resulted in
SYSLOG_ACTION_READ to only return a single message, whereas previously
the buffer would get filled as much as possible. As, when too small to
fit everything, filling it to the last byte would be pretty ugly with
the new code, the patch arranges for as many messages as possible to
get returned in a single invocation. User space tools in at least all
SLES versions depend on the old behavior.

This at once addresses the issue attempted to get fixed with commit
b56a39ac26 ("printk: return -EINVAL if
the message len is bigger than the buf size"), and since that commit
widened the possibility for losing a message altogether, the patch
here assumes that this other commit would get reverted first
(otherwise the patch here won't apply).

Furthermore, this patch also addresses the problem dealt with in
commit 4a77a5a06e ("printk: use mutex
lock to stop syslog_seq from going wild"), so I'd recommend reverting
that one too (albeit there's no direct collision between the two).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kay Sievers <kay@vrfy.org>
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-26 12:37:36 -07:00
Greg Kroah-Hartman
6fda135c90 Revert "printk: return -EINVAL if the message len is bigger than the buf size"
This reverts commit b56a39ac26.

A better patch from Jan will follow this to resolve the issue.

Acked-by: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <wfg@linux.intel.com>
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-26 12:35:24 -07:00
Paul E. McKenney
b41772abeb rcu: Stop rcu_do_batch() from multiplexing the "count" variable
Commit b1420f1c (Make rcu_barrier() less disruptive) rearranged the
code in rcu_do_batch(), moving the ->qlen manipulation to follow
the requeueing of the callbacks.  Unfortunately, this rearrangement
clobbered the value of the "count" local variable before the value
of rdp->qlen was adjusted, resulting in the value of rdp->qlen being
inaccurate.  This commit therefore introduces an index variable "i",
avoiding the inadvertent multiplexing.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-06-25 12:35:25 -07:00
Alan Stern
4661e3568a printk: fix regression in SYSLOG_ACTION_CLEAR
Commit 7ff9554bb5 (printk: convert
byte-buffer to variable-length record buffer) introduced a regression
by accidentally removing a "break" statement from inside the big
switch in printk's do_syslog().  The symptom of this bug is that the
"dmesg -C" command doesn't only clear the kernel's log buffer; it also
disables console logging.

This patch (as1561) fixes the regression by adding the missing
"break".

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-25 12:11:58 -07:00
David S. Miller
dfbce08c19 ipv4: Don't add deprecated new binary sysctl value.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-22 23:02:22 -07:00
Alexander Duyck
6648bd7e0e ipv4: Add sysctl knob to control early socket demux
This change is meant to add a control for disabling early socket demux.
The main motivation behind this patch is to provide an option to disable
the feature as it adds an additional cost to routing that reduces overall
throughput by up to 5%.  For example one of my systems went from 12.1Mpps
to 11.6 after the early socket demux was added.  It looks like the reason
for the regression is that we are now having to perform two lookups, first
the one for an established socket, and then the one for the routing table.

By adding this patch and toggling the value for ip_early_demux to 0 I am
able to get back to the 12.1Mpps I was previously seeing.

[ Move local variables in ip_rcv_finish() down into the basic
  block in which they are actually used.  -DaveM ]

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-22 17:11:13 -07:00
Linus Torvalds
a11637194a Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ftrace: Make all inline tags also include notrace
  perf: Use css_tryget() to avoid propping up css refcount
  perf tools: Fix synthesizing tracepoint names from the perf.data headers
  perf stat: Fix default output file
  perf tools: Fix endianity swapping for adds_features bitmask
2012-06-22 10:58:57 -07:00
Linus Torvalds
2ce5682947 Merge branch 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull two cgroup fixes from Tejun Heo:
 "This containes two patches fixing a refcnt race bug during css_put().
  Decrementing and checking the value weren't atomic and two tasks could
  think that they both pushed the counter to zero."

* 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroups: Account for CSS_DEACT_BIAS in __css_put
  cgroup: make sure that decisions in __css_put are atomic
2012-06-20 22:11:04 -07:00
Linus Torvalds
fe80352460 Driver core and printk fixes for 3.5-rc4
Here are some fixes for 3.5-rc4 that resolve the kmsg problems that
 people have reported showing up after the printk and kmsg changes went
 into 3.5-rc1.  There are also a smattering of other tiny fixes for the
 extcon and hyper-v drivers that people have reported.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk/iNQcACgkQMUfUDdst+yklTQCfZCXFlhA43bZo/8Joqd2pLIIW
 2uoAoMze0SlfJeN6Qu7yY0P+qV/f/pc3
 =UNFY
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core and printk fixes from Greg Kroah-Hartman:
 "Here are some fixes for 3.5-rc4 that resolve the kmsg problems that
  people have reported showing up after the printk and kmsg changes went
  into 3.5-rc1.  There are also a smattering of other tiny fixes for the
  extcon and hyper-v drivers that people have reported.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  extcon: max8997: Add missing kfree for info->edev in max8997_muic_remove()
  extcon: Set platform drvdata in gpio_extcon_probe() and fix irq leak
  extcon: Fix wrong index in max8997_extcon_cable[]
  kmsg - kmsg_dump() fix CONFIG_PRINTK=n compilation
  printk: return -EINVAL if the message len is bigger than the buf size
  printk: use mutex lock to stop syslog_seq from going wild
  kmsg - kmsg_dump() use iterator to receive log buffer content
  vme: change maintainer e-mail address
  Extcon: Don't try to create duplicate link names
  driver core: fixup reversed deferred probe order
  printk: Fix alignment of buf causing crash on ARM EABI
  Tools: hv: verify origin of netlink connector message
2012-06-20 15:14:28 -07:00
Cyrill Gorcunov
5702c5eeab c/r: prctl: Move PR_GET_TID_ADDRESS to a proper place
During merging of PR_GET_TID_ADDRESS patch the code has been misplaced (it
happened to appear under PR_MCE_KILL) in result noone can use this option.

Fix it by moving code snippet to a proper place.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20 14:39:36 -07:00
Oleg Nesterov
50d75f8dae pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper
find_new_reaper() changes pid_ns->child_reaper, see add0d4df ("pid_ns:
zap_pid_ns_processes: fix the ->child_reaper changing").

The original reason has gone away after the previous patch, ->children
list must be empty after zap_pid_ns_processes().

However now we can not switch to init_pid_ns.child_reaper.
__unhash_process() relies on the "->child_reaper == parent" check, but
this check does not work if the last exiting task is also the child
reaper.

As Eric sugested, we can change __unhash_process() to use the parent's
pid_ns and remove this code.

Also, with this change we can move detach_pid(PIDTYPE_PID) back, where it
was before the previous fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Tested-by: Andrew Wagin <avagin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20 14:39:36 -07:00
Eric W. Biederman
6347e90091 pidns: guarantee that the pidns init will be the last pidns process reaped
Today we have a twofold bug.  Sometimes release_task on pid == 1 in a pid
namespace can run before other processes in a pid namespace have had
release task called.  With the result that pid_ns_release_proc can be
called before the last proc_flus_task() is done using upid->ns->proc_mnt,
resulting in the use of a stale pointer.  This same set of circumstances
can lead to waitpid(...) returning for a processes started with
clone(CLONE_NEWPID) before the every process in the pid namespace has
actually exited.

To fix this modify zap_pid_ns_processess wait until all other processes in
the pid namespace have exited, even EXIT_DEAD zombies.

The delay_group_leader and related tests ensure that the thread gruop
leader will be the last thread of a process group to be reaped, or to
become EXIT_DEAD and self reap.  With the change to zap_pid_ns_processes
we get the guarantee that pid == 1 in a pid namespace will be the last
task that release_task is called on.

With pid == 1 being the last task to pass through release_task
pid_ns_release_proc can no longer be called too early nor can wait return
before all of the EXIT_DEAD tasks in a pid namespace have exited.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Tested-by: Andrew Wagin <avagin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20 14:39:36 -07:00
Konstantin Khlebnikov
4fe7efdbdf mm: correctly synchronize rss-counters at exit/exec
do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does
put_user(clear_child_tid) which can update task->rss_stat and thus make
mm->rss_stat inconsistent.  This triggers the "BUG:" printk in check_mm().

Let's fix this bug in the safest way, and optimize/cleanup this later.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20 14:39:36 -07:00
Salman Qazi
8e3bbf42c6 cgroups: Account for CSS_DEACT_BIAS in __css_put
When we fixed the race between atomic_dec and css_refcnt, we missed
the fact that css_refcnt internally subtracts CSS_DEACT_BIAS to get
the actual reference count.  This can potentially cause a refcount leak
if __css_put races with cgroup_clear_css_refs.

Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-06-18 15:38:02 -07:00
Yan, Zheng
0cda4c0231 perf: Introduce perf_pmu_migrate_context()
Originally from Peter Zijlstra. The helper migrates perf events
from one cpu to another cpu.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-5-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-18 12:13:21 +02:00
Yan, Zheng
e2d37cd213 perf: Allow the PMU driver to choose the CPU on which to install events
Allow the pmu->event_init callback to change event->cpu, so the PMU driver
can choose the CPU on which to install events.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-4-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-18 12:13:21 +02:00
Yan, Zheng
fbfc623f82 perf: Avoid race between cpu hotplug and installing event
perf_event_open() requires the cpu on which to install event is online,
but the cpu can go offline after perf_event_open checks that. Add a
get_online_cpus()/put_online_cpus() pair to avoid the race.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-3-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-18 12:13:20 +02:00
Ingo Molnar
d1ece0998e Merge branch 'perf/urgent' into perf/core
Merge in all fixes before applying more changes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-18 11:47:58 +02:00
Salman Qazi
9c5da09d26 perf: Use css_tryget() to avoid propping up css refcount
An rmdir pushes css's ref count to zero.  However, if the associated
directory is open at the time, the dentry ref count is non-zero.  If
the fd for this directory is then passed into perf_event_open, it
does a css_get().  This bounces the ref count back up from zero.  This
is a problem by itself.  But what makes it turn into a crash is the
fact that we end up doing an extra dput, since we perform a dput
when css_put sees the ref count go down to zero.

css_tryget() does not fall into that trap. So, we use that instead.

Reproduction test-case for the bug:

 #include <unistd.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <linux/unistd.h>
 #include <linux/perf_event.h>
 #include <string.h>
 #include <errno.h>
 #include <stdio.h>

 #define PERF_FLAG_PID_CGROUP    (1U << 2)

 int perf_event_open(struct perf_event_attr *hw_event_uptr,
                     pid_t pid, int cpu, int group_fd, unsigned long flags) {
         return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu,
                 group_fd, flags);
 }

 /*
  * Directly poke at the perf_event bug, since it's proving hard to repro
  * depending on where in the kernel tree.  what moved?
  */
 int main(int argc, char **argv)
 {
        int fd;
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.exclude_kernel = 1;
        attr.size = sizeof(attr);
        mkdir("/dev/cgroup/perf_event/blah", 0777);
        fd = open("/dev/cgroup/perf_event/blah", O_RDONLY);
        perror("open");
        rmdir("/dev/cgroup/perf_event/blah");
        sleep(2);
        perf_event_open(&attr, fd, 0, -1,  PERF_FLAG_PID_CGROUP);
        perror("perf_event_open");
        close(fd);
        return 0;
 }

Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20120614223108.1025.2503.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-18 11:45:57 +02:00
Ingo Molnar
4983955c04 Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
Pull ftrace robustization fixes from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-18 10:57:51 +02:00
Grant Likely
aed98048bd irqdomain: Make ops->map hook optional
There isn't a really compelling reason to force ->map to be populated,
so allow it to be left unset.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
2012-06-17 15:41:57 -06:00
Yuanhan Liu
b56a39ac26 printk: return -EINVAL if the message len is bigger than the buf size
Just like what devkmsg_read() does, return -EINVAL if the message len is
bigger than the buf size, or it will trigger a segfault error.

Acked-by: Kay Sievers <kay@vrfy.org>
Acked-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-16 08:36:03 -07:00
Yuanhan Liu
4a77a5a06e printk: use mutex lock to stop syslog_seq from going wild
Although syslog_seq and log_next_seq stuff are protected by logbuf_lock
spin log, it's not enough. Say we have two processes A and B, and let
syslog_seq = N, while log_next_seq = N + 1, and the two processes both
come to syslog_print at almost the same time. And No matter which
process get the spin lock first, it will increase syslog_seq by one,
then release spin lock; thus later, another process increase syslog_seq
by one again. In this case, syslog_seq is bigger than syslog_next_seq.
And latter, it would make:
   wait_event_interruptiable(log_wait, syslog != log_next_seq)
don't wait any more even there is no new write comes. Thus it introduce
a infinite loop reading.

I can easily see this kind of issue by the following steps:
  # cat /proc/kmsg # at meantime, I don't kill rsyslog
                   # So they are the two processes.
  # xinit          # I added drm.debug=6 in the kernel parameter line,
                   # so that it will produce lots of message and let that
                   # issue happen

It's 100% reproducable on my side. And my disk will be filled up by
/var/log/messages in a quite short time.

So, introduce a mutex_lock to stop syslog_seq from going wild just like
what devkmsg_read() does. It does fix this issue as expected.

v2: use mutex_lock_interruptiable() instead (comments from Kay)

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-By: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-16 08:36:02 -07:00
Oleg Nesterov
e227051b13 uprobes: Remove the unnecessary initialization in add_utask()
Trivial cleanup. No need to nullify ->active_uprobe after
kzalloc().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120615154401.GA9633@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:52 +02:00
Oleg Nesterov
593609a596 uprobes: __copy_insn() needs "loff_t offset"
1. __copy_insn() needs "loff_t offset", not "unsigned long",
   to read the file.

2. use pgoff_t for "idx" and remove the unnecessary typecast.

3. fix the typo, "&=" is not what we want

4. can't resist, rename off1 to off.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154359.GA9625@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:49 +02:00
Oleg Nesterov
816c03fbab uprobes: Don't use loff_t for the valid virtual address
loff_t looks confusing when it is used for the virtual address.
Change map_info and install_breakpoint/remove_breakpoint paths
to use "unsigned long".

The patch doesn't change vma_address(), it can't return "long"
because it is used to verify the mapping. But probably this
needs some cleanups too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154355.GA9622@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:48 +02:00
Oleg Nesterov
449d0d7c9f uprobes: Simplify the usage of uprobe->pending_list
uprobe->pending_list is only used to create the temporary list,
it has no meaning after we drop uprobes_mmap_hash(inode).

No need to initialize this node or remove it from tmp_list, and
we can use list_for_each_entry().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120615154353.GA9614@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:48 +02:00
Oleg Nesterov
d9c4a30e82 uprobes: Move BUG_ON(UPROBE_SWBP_INSN_SIZE) from write_opcode() to install_breakpoint()
write_opcode() ensures that UPROBE_SWBP_INSN doesn't cross the
page boundary. This looks a bit confusing, the check does not
depend on vaddr and it is enough to do it only once right after
install_breakpoint()->arch_uprobe_analyze_insn().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154350.GA9611@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:47 +02:00
Oleg Nesterov
eb2bf57bee uprobes: No need to re-check vma_address() in write_opcode()
write_opcode() is called by register_for_each_vma() and
uprobe_mmap() paths. In both cases the caller has already
verified this vaddr under mmap_sem, no need to re-check.

Note also that this check is wrong anyway, we should not
truncate loff_t returned by vma_address() if we do not trust
this mapping.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154347.GA9604@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:47 +02:00
Oleg Nesterov
fc36f59565 uprobes: Copy_insn() should not return -ENOMEM if __copy_insn() fails
copy_insn() returns -ENOMEM if the first __copy_insn() fails,
it should return the correct error code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154344.GA9601@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:46 +02:00
Oleg Nesterov
d436615e60 uprobes: Copy_insn() shouldn't depend on mm/vma/vaddr
1. copy_insn() doesn't need "addr", it can use uprobe->offset.
   Remove this argument.

2. Change copy_insn/__copy_insn to accept "struct file*" instead
   of vma.

copy_insn() is called only once and mm/vma/vaddr are random, it
shouldn't depend on them.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154342.GA9598@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:45 +02:00
Peter Zijlstra
c5784de2b3 uprobes: Document uprobe_register() vs uprobe_mmap() race
Because the mind is treacherous and makes us forget we need to
write stuff down.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120615154339.GA9591@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:45 +02:00
Oleg Nesterov
7a5bfb66b0 uprobes: Change build_map_info() to try kmalloc(GFP_NOWAIT) first
build_map_info() doesn't allocate the memory under i_mmap_mutex
to avoid the deadlock with page reclaim. But it can try
GFP_NOWAIT first, it should work in the likely case and thus we
almost never need the pre-alloc-and-retry path.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Link: http://lkml.kernel.org/r/20120615154336.GA9588@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:44 +02:00
Oleg Nesterov
268720903f uprobes: Rework register_for_each_vma() to make it O(n)
Currently register_for_each_vma() is O(n ** 2) + O(n ** 3),
every time find_next_vma_info() "restarts" the
vma_prio_tree_foreach() loop and each iteration rechecks the
whole try_list. This also means that try_list can grow
"indefinitely" if register/unregister races with munmap/mmap
activity even if the number of mapping is bounded at any time.

With this patch register_for_each_vma() builds the list of
mm/vaddr structures only once and does install_breakpoint() for
each entry.

We do not care about the new mappings which can be created after
build_map_info() drops mapping->i_mmap_mutex, uprobe_mmap()
should do its work.

Note that we do not allocate map_info under i_mmap_mutex, this
can deadlock with page reclaim (but see the next patch). So we
use 2 lists, "curr" which we are going to return, and "prev"
which holds the already allocated memory. The main loop deques
the entry from "prev" (initially it is empty), and if "prev"
becomes empty again it counts the number of entries we need to
pre-allocate outside of i_mmap_mutex.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Link: http://lkml.kernel.org/r/20120615154333.GA9581@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:43 +02:00
Oleg Nesterov
c1914a0936 uprobes: Install_breakpoint() should fail if is_swbp_insn() == T
install_breakpoint() returns -EEXIST if is_swbp_insn(orig_insn)
== T, the caller treats this code as success.

This is doubly wrong. The successful return should set
UPROBE_COPY_INSN, but the real problem is that it shouldn't
succeed. If the probed insn is int3 the application should get
SIGTRAP, this won't happen with uprobe.

Probably we can fix this, we can add the UPROBE_SHARED_BP flag
and teach handle_swbp/set_orig_insn to handle this case
correctly. But this needs some complications and we have other
insns which can't be probed, lets make a simple fix for now.

I think this needs a cleanup. UPROBE_COPY_INSN should die,
copy_insn() should be called by alloc_uprobe().
arch_uprobe_analyze_insn() depends on ->mm (ia32_compat) but it
is called only once.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154331.GA9578@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:43 +02:00
Oleg Nesterov
5323ce71e4 uprobes: Write_opcode()->__replace_page() can race with try_to_unmap()
write_opcode() gets old_page via get_user_pages() and then calls
__replace_page() which assumes that this old_page is still
mapped after pte_offset_map_lock().

This is not true if this old_page was already try_to_unmap()'ed,
and in this case everything __replace_page() does with old_page
is wrong. Just for example, put_page() is not balanced.

I think it is possible to teach __replace_page() to handle this
unlikely case correctly, but this patch simply changes it to use
page_check_address() and return -EAGAIN if it fails. The caller
should notice this error code and retry.

Note: write_opcode() asks for the cleanups, I'll try to do this
in a separate patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154328.GA9571@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:42 +02:00
Oleg Nesterov
cc359d180f uprobes: __copy_insn() should ensure a_ops->readpage != NULL
__copy_insn() blindly calls read_mapping_page(), this will crash
the kernel if ->readpage == NULL, add the necessary check. For
example, hugetlbfs_aops->readpage is NULL. Perhaps we should
change read_mapping_page() instead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154325.GA9568@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:42 +02:00
Oleg Nesterov
ea13137714 uprobes: Valid_vma() should reject VM_HUGETLB
__replace_page() obviously can't work with the hugetlbfs
mappings, uprobe_register() will likely crash the kernel. Change
valid_vma() to check VM_HUGETLB as well.

As for PageTransHuge() no need to worry, vma->vm_file != NULL.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120615154322.GA9561@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-16 09:10:41 +02:00
Linus Torvalds
ed21a66c18 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  watchdog: Quiet down the boot messages
  perf/x86: Fix broken LBR fixup code
  tracing: Have tracing_off() actually turn tracing off
2012-06-15 16:58:10 -07:00
Linus Torvalds
a95f9b6e09 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core updates (RCU and locking) from Ingo Molnar:
 "Most of the diffstat comes from the RCU slow boot regression fixes,
  but there's also a debuggability improvements/fixes."

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  memblock: Document memblock_is_region_{memory,reserved}()
  rcu: Precompute RCU_FAST_NO_HZ timer offsets
  rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure
  rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks
  rcu: RCU_FAST_NO_HZ detection of callback adoption
  spinlock: Indicate that a lockup is only suspected
  kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop()
  panic: Make panic_on_oops configurable
2012-06-15 16:52:35 -07:00
Kay Sievers
e2ae715d66 kmsg - kmsg_dump() use iterator to receive log buffer content
Provide an iterator to receive the log buffer content, and convert all
kmsg_dump() users to it.

The structured data in the kmsg buffer now contains binary data, which
should no longer be copied verbatim to the kmsg_dump() users.

The iterator should provide reliable access to the buffer data, and also
supports proper log line-aware chunking of data while iterating.

Signed-off-by: Kay Sievers <kay@vrfy.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Reported-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Tested-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-15 14:53:59 -07:00
Grant Likely
7325570471 irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY
Where irq_domain_associate() is called in irq_create_mapping, there is
no need to test for IRQ_DOMAIN_MAP_LEGACY because it is already tested
for earlier in the routine.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
2012-06-15 12:08:09 -06:00
Paul Mundt
5ca4db61e8 irqdomain: Simple NUMA awareness.
While common irqdesc allocation is node aware, the irqdomain code is not.

Presently we observe a number of regressions/inconsistencies on
NUMA-capable platforms:

- Platforms using irqdomains with legacy mappings, where the
  irq_descs are allocated node-local and the irqdomain data
  structure is not.

- Drivers implementing irqdomains will lose node locality
  regardless of the underlying struct device's node id.

This plugs in NUMA node id proliferation across the various allocation
callsites by way of_node_to_nid() node lookup. While of_node_to_nid()
does the right thing for OF-capable platforms it doesn't presently handle
the non-DT case. This is trivially dealt with by simply wraping in to
numa_node_id() unconditionally.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-06-15 12:08:00 -06:00
Grant Likely
efd68e7254 devicetree: add helper inline for retrieving a node's full name
The pattern (np ? np->full_name : "<none>") is rather common in the
kernel, but can also make for quite long lines.  This patch adds a new
inline function, of_node_full_name() so that the test for a valid node
pointer doesn't need to be open coded at all call sites.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2012-06-15 11:44:03 -06:00
Steven Rostedt
7374e82771 tracing: Register the ftrace internal events during early boot
All trace events including ftrace internel events (like trace_printk
and function tracing), register functions that describe how to print
their output. The events may be recorded as soon as the ring buffer
is allocated, but they are just raw binary in the buffer. The mapping
of event ids to how to print them are held within a structure that
is registered on system boot.

If a crash happens in boot up before these functions are registered
then their output (via ftrace_dump_on_oops) will be useless:

Dumping ftrace buffer:
---------------------------------
   <...>-1       0.... 319705us : Unknown type 6
---------------------------------

This can be quite frustrating for a kernel developer trying to see
what is going wrong.

There's no reason to register them so late in the boot up process.
They can be registered by early_initcall().

Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-14 15:22:14 -04:00
Borislav Petkov
8d240dd88c ftrace: Remove a superfluous check
register_ftrace_function() checks ftrace_disabled and calls
__register_ftrace_function which does it again.

Drop the first check and add the unlikely hint to the second one. Also,
drop the label as John correctly notices.

No functional change.

Link: http://lkml.kernel.org/r/20120329171140.GE6409@aftab

Cc: Borislav Petkov <bp@amd64.org>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-14 15:22:12 -04:00
Don Zickus
a702704682 watchdog: Quiet down the boot messages
A bunch of bugzillas have complained how noisy the nmi_watchdog
is during boot-up especially with its expected failure cases
(like virt and bios resource contention).

This is my attempt to quiet them down and keep it less confusing
for the end user.  What I did is print the message for cpu0 and
save it for future comparisons.  If future cpus have an
identical message as cpu0, then don't print the redundant info.
However, if a future cpu has a different message, happily print
that loudly.

Before the change, you would see something like:

    ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
    CPU0: Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz stepping 0a
    Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
    ... version:                2
    ... bit width:              40
    ... generic registers:      2
    ... value mask:             000000ffffffffff
    ... max period:             000000007fffffff
    ... fixed-purpose events:   3
    ... event mask:             0000000700000003
    NMI watchdog enabled, takes one hw-pmu counter.
    Booting Node   0, Processors  #1
    NMI watchdog enabled, takes one hw-pmu counter.
     #2
    NMI watchdog enabled, takes one hw-pmu counter.
     #3 Ok.
    NMI watchdog enabled, takes one hw-pmu counter.
    Brought up 4 CPUs
    Total of 4 processors activated (22607.24 BogoMIPS).

After the change, it is simplified to:

    ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
    CPU0: Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz stepping 0a
    Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
    ... version:                2
    ... bit width:              40
    ... generic registers:      2
    ... value mask:             000000ffffffffff
    ... max period:             000000007fffffff
    ... fixed-purpose events:   3
    ... event mask:             0000000700000003
    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    Booting Node   0, Processors  #1 #2 #3 Ok.
    Brought up 4 CPUs

V2: little changes based on Joe Perches' feedback
V3: printk cleanup based on Ingo's feedback; checkpatch fix
V4: keep printk as one long line
V5: Ingo fix ups

Reported-and-tested-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: nzimmer@sgi.com
Cc: joe@perches.com
Link: http://lkml.kernel.org/r/1339594548-17227-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-14 12:20:50 +02:00
Yinghai Lu
82ec90eac3 resources: allow adjust_resource() for resources with no parent
If a resource has no parent, allow its start/end to be set arbitrarily
as long as any children are still contained within the new range.

[bhelgaas: changelog]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-06-13 15:42:22 -06:00
Eric Dumazet
047fe36052 splice: fix racy pipe->buffers uses
Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
by splice_shrink_spd() called from vmsplice_to_pipe()

commit 35f3d14dbb (pipe: add support for shrinking and growing pipes)
added capability to adjust pipe->buffers.

Problem is some paths don't hold pipe mutex and assume pipe->buffers
doesn't change for their duration.

Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
use it in place of pipe->buffers where appropriate.

splice_shrink_spd() loses its struct pipe_inode_info argument.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tom Herbert <therbert@google.com>
Cc: stable <stable@vger.kernel.org> # 2.6.35
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-13 21:16:42 +02:00
Andrew Lunn
6ebb017de9 printk: Fix alignment of buf causing crash on ARM EABI
Commit 7ff9554bb5, printk: convert
byte-buffer to variable-length record buffer, causes systems using
EABI to crash very early in the boot cycle. The first entry in struct
log is a u64, which for EABI must be 8 byte aligned.

Make use of __alignof__() so the compiler to decide the alignment, but
allow it to be overridden using CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS,
for systems which can perform unaligned access and want to save
a few bytes of space.

Tested on Orion5x and Kirkwood.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-12 16:20:17 -07:00
Thomas Gleixner
924412f66f Merge branch 'nohz-for-tip-2' of git://github.com/fweisbec/linux-dynticks into timers/core 2012-06-11 20:11:29 +02:00
Frederic Weisbecker
84bf1bccc6 nohz: Move next idle expiry time record into idle logic area
The next idle expiry time record and idle sleeps tracking are
statistics that only concern idle.

Since we want the nohz APIs to become usable further idle
context, let's pull up the handling of these statistics to the
callers in idle.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2012-06-11 20:07:18 +02:00
Frederic Weisbecker
5b39939a40 nohz: Move ts->idle_calls incrementation into strict idle logic
Since we want to prepare for making the nohz API to work further
the idle case, we need to pull ts->idle_calls incrementation up to
the callers in idle.

To perform this, we split tick_nohz_stop_sched_tick() in two parts:
a first one that checks if we can really stop the tick for idle,
and another that actually stops it. Then from the callers in idle,
we check if we can stop the tick and only then we increment idle_calls
and finally relay to the nohz API that won't care about these details
anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2012-06-11 20:07:17 +02:00
Frederic Weisbecker
f5d411c91e nohz: Rename ts->idle_tick to ts->last_tick
Now that idle and nohz logics are going to be independant each others,
ts->idle_tick becomes too much a biased name to describe the field that
saves the last scheduled tick on top of which we re-calculate the next
tick to schedule when the timer is restarted.

We want to reuse this even to stop the tick outside idle cases. So let's
rename it to some more generic name: ts->last_tick.

This changes a bit the timer list stat export so we need to increase its
version.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2012-06-11 20:07:17 +02:00
Frederic Weisbecker
2ac0d98fd6 nohz: Make nohz API agnostic against idle ticks cputime accounting
When the timer tick fires, it accounts the new jiffy as either part
of system, user or idle time. This is how we record the cputime
statistics.

But when the tick is stopped from the idle task, we still need
to record the number of jiffies spent tickless until we restart
the tick and fall back to traditional tick-based cputime accounting.

To do this, we take a snapshot of jiffies when the tick is stopped
and compute the difference against the new value of jiffies when
the tick is restarted. Then we account this whole difference to
the idle cputime.

However we are preparing to be able to stop the tick from other places
than idle. So this idle time accounting needs to be performed from
the callers of nohz APIs, not from the nohz APIs themselves because
we now want them to be agnostic against places that stop/restart tick.

Therefore, we pull the tickless idle time accounting out of generic
nohz helpers up to idle entry/exit callers.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2012-06-11 20:07:16 +02:00
Frederic Weisbecker
19f5f7364a nohz: Separate idle sleeping time accounting from nohz logic
As we plan to be able to stop the tick outside the idle task, we
need to prepare for separating nohz logic from idle. As a start,
this pulls the idle sleeping time accounting out of the tick
stop/restart API to the callers on idle entry/exit.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2012-06-11 20:06:36 +02:00
Thomas Gleixner
b871a42b60 smpboot: Remove leftover declaration
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-06-11 15:07:52 +02:00
Ingo Molnar
c3e228d59b Linux 3.5-rc2
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJP0qm4AAoJEHm+PkMAQRiG62QIAJRNJFyVB0ZrsMPgdwLnlX4O
 5I86H7GaYXoOK/KMb2s5h4KiFggIODnyEkZi+/39tJOgGo0KrMcDlsh0owB1Iggw
 LE6iyze9I1z9wQze0+SXe7VAcvUYvsx2vgpOKvoNi97Qgn3B6onL+SAi5U+NAqJl
 0NdKmveEd42UIm7JfChHlxl8bm8YB+WcU38OkMGpRpJ/Moz9EbSjYVQg3oHrzJjy
 duiX6SD/OV4m5yCcXXmu+f41pN+SG7xENJ5r4enyi2ZF8mAyVz2goIyL2bA0AJX2
 +GbpD1sxUHkZ6yPg4tf2bmJOj0PkfZNAi8YpFxZDlP4y1pKuCTEDTBp8O2id43w=
 =Jyn8
 -----END PGP SIGNATURE-----

Merge tag 'v3.5-rc2' into perf/core

Merge in Linux 3.5-rc2 - to pick up fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-11 10:51:35 +02:00
Ingo Molnar
4a1e001d2b Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Merge RCU fixes from Paul E. McKenney:

 " This series has four patches, the major point of which is to eliminate
   some slowdowns (including boot-time slowdowns) resulting from some
   RCU_FAST_NO_HZ changes.  The issue with the changes is that posting timers
   from the idle loop has no effect if the CPU has entered dyntick-idle
   mode because the CPU has already computed its wakeup time, and posting
   a timer does not cause it to be recomputed.  The short-term fix is for
   RCU to precompute the timeout value so that the CPU's calculation is
   correct. "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-11 10:30:23 +02:00
Linus Torvalds
7249450449 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix the relax_domain_level boot parameter
  sched: Validate assumptions in sched_init_numa()
  sched: Always initialize cpu-power
  sched: Fix domain iteration
  sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
  sched/numa: Load balance between remote nodes
  sched/x86: Calculate booted cores after construction of sibling_mask
2012-06-08 14:59:29 -07:00
Randy Dunlap
cd96891d48 sched/fair: fix lots of kernel-doc warnings
Fix lots of new kernel-doc warnings in kernel/sched/fair.c:

  Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
  Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
  Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
  Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
  Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
  .. more warnings

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-08 14:59:10 -07:00
Linus Torvalds
106544d81d Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A bit larger than what I'd wish for - half of it is due to hw driver
  updates to Intel Ivy-Bridge which info got recently released,
  cycles:pp should work there now too, amongst other things.  (but we
  are generally making exceptions for hardware enablement of this type.)

  There are also callchain fixes in it - responding to mostly
  theoretical (but valid) concerns.  The tooling side sports perf.data
  endianness/portability fixes which did not make it for the merge
  window - and various other fixes as well."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  perf/x86: Check user address explicitly in copy_from_user_nmi()
  perf/x86: Check if user fp is valid
  perf: Limit callchains to 127
  perf/x86: Allow multiple stacks
  perf/x86: Update SNB PEBS constraints
  perf/x86: Enable/Add IvyBridge hardware support
  perf/x86: Implement cycles:p for SNB/IVB
  perf/x86: Fix Intel shared extra MSR allocation
  x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
  perf: Remove duplicate invocation on perf_event_for_each
  perf uprobes: Remove unnecessary check before strlist__delete
  perf symbols: Check for valid dso before creating map
  perf evsel: Fix 32 bit values endianity swap for sample_id_all header
  perf session: Handle endianity swap on sample_id_all header data
  perf symbols: Handle different endians properly during symbol load
  perf evlist: Pass third argument to ioctl explicitly
  perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
  perf tools: Make --version show kernel version instead of pull req tag
  perf tools: Check if callchain is corrupted
  perf callchain: Make callchain cursors TLS
  ...
2012-06-08 09:14:46 -07:00
Linus Torvalds
b1e25f4109 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull leap second timer fix from Thomas Gleixner.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
2012-06-08 09:11:33 -07:00
Ingo Molnar
9ee6ddc9da Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull brown paper bag fix from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-08 12:23:22 +02:00
Ananth N Mavinakayanahalli
7eb9ba5ed3 uprobes: Pass probed vaddr to arch_uprobe_analyze_insn()
On RISC architectures like powerpc, instructions are fixed size.
Instruction analysis on such platforms is just a matter of
(insn % 4). Pass the vaddr at which the uprobe is to be inserted so
that arch_uprobe_analyze_insn() can flag misaligned registration
requests.

Signed-off-by: Ananth N Mavinakaynahalli <ananth@in.ibm.com>
Cc: michael@ellerman.id.au
Cc: antonb@thinktux.localdomain
Cc: Paul Mackerras <paulus@samba.org>
Cc: benh@kernel.crashing.org
Cc: peterz@infradead.org
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: oleg@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20120608093257.GG13409@in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-08 12:22:27 +02:00
Linus Torvalds
48d212a2ee Revert "mm: correctly synchronize rss-counters at exit/exec"
This reverts commit 40af1bbdca.

It's horribly and utterly broken for at least the following reasons:

 - calling sync_mm_rss() from mmput() is fundamentally wrong, because
   there's absolutely no reason to believe that the task that does the
   mmput() always does it on its own VM.  Example: fork, ptrace, /proc -
   you name it.

 - calling it *after* having done mmdrop() on it is doubly insane, since
   the mm struct may well be gone now.

 - testing mm against NULL before you call it is insane too, since a
NULL mm there would have caused oopses long before.

.. and those are just the three bugs I found before I decided to give up
looking for me and revert it asap.  I should have caught it before I
even took it, but I trusted Andrew too much.

Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 17:54:07 -07:00
Konstantin Khlebnikov
40af1bbdca mm: correctly synchronize rss-counters at exit/exec
mm->rss_stat counters have per-task delta: task->rss_stat.  Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().

do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats,
audit and other stuff.  Unfortunately the kernel does this before
calling mm_release(), which can call put_user() for processing
task->clear_child_tid.  So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again.  As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:

| BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1

This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier.  After mm_release() there should
be no pagefaults.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>		[3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 14:43:55 -07:00
Cyrill Gorcunov
736f24d5e5 c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignment
In commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") the stack allocated via clone() is marked in
/proc/<pid>/maps as [stack:%d] thus it might be out of the former
mm->start_stack/end_stack values (and even has some custom VMA flags
set).

So to be able to restore mm->start_stack/end_stack drop vma flags test,
but still require the underlying VMA to exist.

As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
requires CAP_SYS_RESOURCE to be granted.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 14:43:55 -07:00
Cyrill Gorcunov
300f786b26 c/r: prctl: add ability to get clear_tid_address
Zero is written at clear_tid_address when the process exits.  This
functionality is used by pthread_join().

We already have sys_set_tid_address() to change this address for the
current task but there is no way to obtain it from user space.

Without the ability to find this address and dump it we can't restore
pthread'ed apps which call pthread_join() once they have been restored.

This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
the current process to obtain own clear_tid_address.

This feature is available iif CONFIG_CHECKPOINT_RESTORE is set.

[akpm@linux-foundation.org: fix prctl numbering]
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 14:43:55 -07:00
Cyrill Gorcunov
1ad75b9e16 c/r: prctl: add minimal address test to PR_SET_MM
Make sure the address being set is greater than mmap_min_addr (as
suggested by Kees Cook).

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 14:43:55 -07:00
Konstantin Khlebnikov
bafb282df2 c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal
A fix for commit b32dfe3771 ("c/r: prctl: add ability to set new
mm_struct::exe_file").

After removing mm->num_exe_file_vmas kernel keeps mm->exe_file until
final mmput(), it never becomes NULL while task is alive.

We can check for other mapped files in mm instead of checking
mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in
order to forbid second changing of mm->exe_file.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 14:43:55 -07:00
Paul E. McKenney
aa9b16306e rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels.  Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked.  In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed.  This means that rcu_prepare_for_idle()'s have
no effect.

This is not a problem on a busy system because something else will wake
up the CPU soon enough.  However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time.  If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.

This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time.  An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.

Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-06-06 20:43:28 -07:00
Paul E. McKenney
5955f7eecd rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure
The RCU_FAST_NO_HZ code relies on a number of per-CPU variables.
This works, but is hidden from someone scanning the data structures
in rcutree.h.  This commit therefore converts these per-CPU variables
to fields in the per-CPU rcu_dynticks structures.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-06-06 20:43:28 -07:00
Paul E. McKenney
fd4b352687 rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks
In the current code, a short dyntick-idle interval (where there is
at least one non-lazy callback on the CPU) and a long dyntick-idle
interval (where there are only lazy callbacks on the CPU) are traced
identically, which can be less than helpful.  This commit therefore
emits different event traces in these two cases.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-06-06 20:43:27 -07:00
Paul E. McKenney
8f5af6f1f2 rcu: RCU_FAST_NO_HZ detection of callback adoption
In the present implementations of CPU hotplug, the outgoing CPU is
guaranteed to run its stop-machine process on the way out, which
will guarantee that RCU_FAST_NO_HZ forces the CPU out of dyntick-idle
mode.

However, new versions of CPU hotplug might not work this way.  This
commit therefore removes this design constraint by explicitly notifying
CPUs when they adopt non-lazy RCU callbacks.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-06-06 20:43:27 -07:00
Steven Rostedt
f2bf1f6f5f tracing: Have tracing_off() actually turn tracing off
A recent update to have tracing_on/off() only affect the ftrace ring
buffers instead of all ring buffers had a cut and paste error.
The tracing_off() did the exact same thing as tracing_on() and
would not actually turn off tracing. Unfortunately, tracing_off()
is more important to be working than tracing_on() as this is a key
development tool, as it lets the developer turn off tracing as soon
as a problem is discovered. It is also used by panic and oops code.

This bug also breaks the 'echo func:traceoff > set_ftrace_filter'

Cc: <stable@vger.kernel.org> # 3.4
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-06 22:15:14 -04:00
Li Zefan
6be96a5c90 cgroup: remove hierarchy_mutex
It was introduced for memcg to iterate cgroup hierarchy without
holding cgroup_mutex, but soon after that it was replaced with
a lockless way in memcg.

No one used hierarchy_mutex since that, so remove it.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-06-06 19:12:30 -07:00
Salman Qazi
967db0ea65 cgroup: make sure that decisions in __css_put are atomic
__css_put is using atomic_dec on the ref count, and then
looking at the ref count to make decisions.  This is prone
to races, as someone else may decrement ref count between
our decrement and our decision.  Instead, we should base our
decisions on the value that we decremented the ref count to.

(This results in an actual race on Google's kernel which I
haven't been able to reproduce on the upstream kernel.  Having
said that, it's still incorrect by inspection).

Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2012-06-06 18:51:35 -07:00
Oleg Nesterov
778b032d96 uprobes: Kill uprobes_srcu/uprobe_srcu_id
Kill the no longer needed uprobes_srcu/uprobe_srcu_id code.

It doesn't really work anyway. synchronize_srcu() can only
synchronize with the code "inside" the
srcu_read_lock/srcu_read_unlock section, while
uprobe_pre_sstep_notifier() does srcu_read_lock() _after_ we
already hit the breakpoint.

I guess this probably works "in practice". synchronize_srcu() is
slow and it implies synchronize_sched(), and the probed task
enters the non- preemptible section at the start of exception
handler. Still this is not right at least in theory, and
task->uprobe_srcu_id blows task_struct.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529193008.GG8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 17:22:22 +02:00
Oleg Nesterov
56bb4cf647 uprobes: Teach handle_swbp() to rely on "is_swbp" rather than uprobes_srcu
Currently handle_swbp() assumes that it can't race with
unregister, so it roughly does:

	if (find_uprobe(vaddr))
		process_uprobe();
	else
		send_sig(SIGTRAP);

This relies on the not-really-working uprobes_srcu code we are
going to remove, see the next patch.

With this patch we rely on the result of
is_swbp_at_addr(bp_vaddr) if find_uprobe() fails.

If is_swbp == 1, then we hit the normal int3, we should send
SIGTRAP.

If is_swbp == 0, we raced with uprobe_unregister(), we simply
restart this insn again.

The "difficult" case is is_swbp == -EFAULT, when we can't read
this memory. In this case I think we should restart too, and
this is more correct compared to the current code which sends
SIGTRAP.

Ignoring ENOMEM/etc from get_user_pages(), this can only happen
if another thread unmaps this memory before find_active_uprobe()
takes mmap_sem. It would be better to pretend it was unmapped
before this insn was executed, restart, and get SIGSEGV.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192947.GF8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 17:22:03 +02:00
Oleg Nesterov
77fc4af1b5 uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing
Change register_for_each_vma() to take mm->mmap_sem for writing.
This is a bit unfortunate but hopefully not too bad, this is the
slow path anyway.

This is needed to ensure that find_active_uprobe() can not race
with uprobe_register() which adds the new bp at the same
bp_vaddr, after find_uprobe() fails and before
is_swbp_at_addr_fast() checks the memory.

IOW, this is needed to ensure that if find_active_uprobe()
returns NULL but is_swbp == true, we can safely assume that it
was the "normal" int3 and we should send SIGTRAP.

There is another reason for this change. We are going to replace
uprobes_state->count with MMF_ flags set by register/unregister
and cleared by find_active_uprobe(), and set/clear shouldn't
race with each other.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192928.GE8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 17:21:48 +02:00
Oleg Nesterov
d790d34653 uprobes: Teach find_active_uprobe() to provide the "is_swbp" info
A separate patch to simplify the review, and for the
documentation.

The patch adds another "int *is_swbp" argument to
find_active_uprobe(), so far its only caller doesn't use this
info.

With this patch find_active_uprobe() additionally does:

	- if find_vma() + ->vm_start check fails, *is_swbp = -EFAULT

	- otherwise, if valid_vma() + find_uprobe() fails, it holds
	  the result of is_swbp_at_addr(), can be negative too. The
	  latter is only possible if we raced with another thread
	  which did munmap/etc after we hit this bp.

IOW. If find_active_uprobe(&is_swbp) returns NULL, the caller
can look at is_swbp to figure out whether the current insn is bp
or not, or detect the race with another thread if it is
negative.

Note: I think that performance-wise this change is fine. This
adds is_swbp_at_addr(), but only if we raced with
uprobe_unregister() or if we hit the "normal" int3 but this mm
has uprobes as well. And even in this case the slow
read_opcode() path is very unlikely, this insn recently
triggered do_int3(), __copy_from_user_inatomic() shouldn't fail
in the likely case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192914.GD8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 17:15:24 +02:00
Oleg Nesterov
3a9ea0520f uprobes: Introduce find_active_uprobe() helper
No functional changes. Move the "find uprobe" code from
handle_swbp() to the new helper, find_active_uprobe().

Note: with or without this change, the find-active-uprobe logic
is not exactly right. We can race with another thread which
unmaps the memory with the valid uprobe before we take
mm->mmap_sem. We can't find this uprobe simply because
find_vma() fails. In this case we wrongly assume that this trap
was not caused by uprobe and send the erroneous SIGTRAP. See the
next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192857.GC8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 17:15:17 +02:00
Oleg Nesterov
a3d7bb4793 uprobes: Change read_opcode() to use FOLL_FORCE
set_orig_insn()->read_opcode() should not fail if the probed
task did mprotect() after uprobe_register(), change it to use
FOLL_FORCE. Without FOLL_WRITE this doesn't have any "side"
effect but allows to read the !VM_READ memory.

There is another reason for this change, we are going to use
is_swbp_at_addr() from handle_swbp() which can race with another
thread doing mprotect().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192759.GB8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 17:14:49 +02:00
Oleg Nesterov
c00b275043 uprobes: Optimize is_swbp_at_addr() for current->mm
Change is_swbp_at_addr() to try to avoid the costly
read_opcode() if mm == current->mm, __copy_from_user_inatomic()
should succeed in the likely case.

Currently this optimization is not important, but we are going
to add more is_swbp_at_addr(current->mm) callers.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120529192744.GA8057@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 17:13:59 +02:00
Dimitri Sivanich
a841f8cef4 sched: Fix the relax_domain_level boot parameter
It does not get processed because sched_domain_level_max is 0 at the
time that setup_relax_domain_level() is run.

Simply accept the value as it is, as we don't know the value of
sched_domain_level_max until sched domain construction is completed.

Fix sched_relax_domain_level in cpuset.  The build_sched_domain() routine calls
the set_domain_attribute() routine prior to setting the sd->level, however,
the set_domain_attribute() routine relies on the sd->level to decide whether
idle load balancing will be off/on.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 17:07:41 +02:00
Peter Zijlstra
d039ac6080 sched: Validate assumptions in sched_init_numa()
Add some code to validate assumptions we're making and output
warnings if they are not.

If this trigger we want to know about it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alex Shi <lkml.alex@gmail.com>
Link: http://lkml.kernel.org/n/tip-6uc3wk5s9udxtdl9cnku0vtt@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 16:52:30 +02:00
Peter Zijlstra
c3decf0dfb sched: Always initialize cpu-power
Often when we run into mis-shapen topologies the balance iteration
fails to update the cpu power properly and we'll end up in /0 traps.

Always initialize the cpu-power to a semi-sane value so that we can
at least boot the machine, even if the load-balancer might not
function correctly.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 16:52:27 +02:00
Peter Zijlstra
c117487687 sched: Fix domain iteration
Weird topologies can lead to asymmetric domain setups. This needs
further consideration since these setups are typically non-minimal
too.

For now, make it work by adding an extra mask selecting which CPUs
are allowed to iterate up.

The topology that triggered it is the one from David Rientjes:

	10 20 20 30
	20 10 20 20
	20 20 10 20
	30 20 20 10

resulting in boxes that wouldn't even boot.

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 16:52:26 +02:00
Peter Zijlstra
7f1b43936f sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
Roland Dreier reported spurious, hard to trigger lockdep warnings
within the scheduler - without any real lockup.

This bit gives us the right clue:

> [89945.640512]  [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90
> [89945.640568]  [<ffffffff8104c546>] push_rt_task+0xc6/0x290

if you look at that code you'll find the double_lock_balance() in
question is the one in find_lock_lowest_rq() [yay for inlining].

Now find_lock_lowest_rq() has a bug.. it fails to use
double_unlock_balance() in one exit path, if this results in a retry in
push_rt_task() we'll call double_lock_balance() again, at which point
we'll run into said lockdep confusion.

Reported-by: Roland Dreier <roland@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 16:52:26 +02:00
Alex Shi
10717dcde1 sched/numa: Load balance between remote nodes
Commit cb83b629b ("sched/numa: Rewrite the CONFIG_NUMA sched
domain support") removed the NODE sched domain and started checking
if the node distance in SLIT table is farther than REMOTE_DISTANCE,
if so, it will lose the load balance chance at exec/fork/wake_affine
points.

But actually, even the node distance is farther than REMOTE_DISTANCE.

Modern CPUs also has QPI like connections, which ensures that memory
access is not too slow between nodes. So the above change in behavior
on NUMA machine causes a performance regression on various benchmarks:
hackbench, tbench, netperf, oltp, etc.

This patch will recover the scheduler behavior to old mode on all my
Intel platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the
perfromance regressions. (all of them just have 2 kinds distance, 10, 21)

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 16:52:25 +02:00
Thomas Gleixner
e40468a548 timers: Improve get_next_timer_interrupt()
Gilad reported at

 http://lkml.kernel.org/r/1336056962-10465-2-git-send-email-gilad@benyossef.com

"Current timer code fails to correctly return a value meaning that
 there is no future timer event, with the result that the timer keeps
 getting re-armed in HZ one shot mode even when we could turn it off,
 generating unneeded interrupts.

 What is happening is that when __next_timer_interrupt() wishes
 to return a value that signifies "there is no future timer
 event", it returns (base->timer_jiffies + NEXT_TIMER_MAX_DELTA).

 However, the code in tick_nohz_stop_sched_tick(), which called
 __next_timer_interrupt() via get_next_timer_interrupt(),
 compares the return value to (last_jiffies + NEXT_TIMER_MAX_DELTA)
 to see if the timer needs to be re-armed.

 base->timer_jiffies != last_jiffies and so tick_nohz_stop_sched_tick()
 interperts the return value as indication that there is a distant
 future event 12 days from now and programs the timer to fire next
 after KTIME_MAX nsecs instead of avoiding to arm it. This ends up
 causing a needless interrupt once every KTIME_MAX nsecs."

Fix this by using the new active timer accounting. This avoids scans
when no active timer is enqueued completely, so we don't have to rely
on base->timer_next and base->timer_jiffies anymore.

Reported-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.317535385@linutronix.de
2012-06-06 13:49:02 +02:00
Thomas Gleixner
99d5f3aac6 timers: Add accounting of non deferrable timers
The code in get_next_timer_interrupt() is suboptimal as it has to run
through the cascade to find the next expiring timer. On a completely
idle core we should only do that when there is an active timer
enqueued and base->next_timer does not give us a fast answer.

Add accounting of the active timers to the now consolidated
attach/detach code. I deliberately avoided sanity checks because the
code is fully symetric and any fiddling with timers w/o using the API
functions will lead to cute explosions anyway. ulong is big enough
even on 32bit and if we really run into the situation to have more
than 1<<32 timers enqueued there, then we are definitely not in a
state to go idle and run through that code.

This allows us to fix another shortcoming of get_next_timer_interrupt().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.236377028@linutronix.de
2012-06-06 13:49:01 +02:00
Thomas Gleixner
facbb4a7ef timers: Consolidate base->next_timer update
Another bunch of mindlessly copied code. All callers of
internal_add_timer() except the recascading code updates
base->next_timer.

Move this into internal_add_timer() and let the cascading code call
__internal_add_timer().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.189946224@linutronix.de
2012-06-06 13:49:01 +02:00
Thomas Gleixner
ec44bc7acc timers: Create detach_if_pending() and use it
Most callers of detach_timer() have the same pattern around
them. Check whether the timer is pending and eventually updating
base->next_timer.

Create detach_if_pending() and replace the duplicated code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.131246037@linutronix.de
2012-06-06 13:49:01 +02:00
Ingo Molnar
c7f5f4ab10 Merge branch 'core/debug' into core/urgent
Merge two debugging patchlets that were waiting for
preparatory commits to hit upstream.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 11:12:55 +02:00
Ingo Molnar
02e03040a3 Merge tag 'perf-urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf fixes from Arnaldo Carvalho de Melo:

 * Endianness fixes from Jiri Olsa

 * Fixes for make perf tarball

 * Fix for DSO name in perf script callchains, from David Ahern

 * Segfault fixes for perf top --callchain, from Namhyung Kim

 * Minor function result fixes from Srikar Dronamraju

 * Add missing 3rd ioctl parameter, from Namhyung Kim

 * Fix pager usage in minimal embedded systems, from Avik Sil

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06 08:46:33 +02:00
Linus Torvalds
365f0e173f Merge branch 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "This fixes the possible premature superblock release on umount bug
  mentioned during v3.5-rc1 pull request.

  Originally, cgroup dentry destruction path assumed that cgroup dentry
  didn't have any reference left after cgroup removal thus put super
  during dentry removal.  Now that there can be lingering dentry
  references, this led to super being put with live dentries.  This
  patch fixes the problem by putting super ref on dentry release instead
  of removal."

* 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: superblock can't be released with active dentries
2012-06-05 11:54:12 -07:00
Linus Torvalds
0b3e9f3f21 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Remove NULL assignment of dattr_cur
  sched: Remove the last NULL entry from sched_feat_names
  sched: Make sched_feat_names const
  sched/rt: Fix SCHED_RR across cgroups
  sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'
  sched: Make sure to not re-read variables after validation
  sched: Fix SD_OVERLAP
  sched: Don't try allocating memory from offline nodes
  sched/nohz: Fix rq->cpu_load calculations some more
  sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask
2012-06-05 09:47:15 -07:00
Yong Zhang
8feb8e896d smp: Remove ipi_call_lock[_irq]()/ipi_call_unlock[_irq]()
There is no user of those APIs anymore, just remove it.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: ralf@linux-mips.org
Cc: sshtylyov@mvista.com
Cc: david.daney@cavium.com
Cc: nikunj@linux.vnet.ibm.com
Cc: paulmck@linux.vnet.ibm.com
Cc: axboe@kernel.dk
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1338275765-3217-11-git-send-email-yong.zhang0@gmail.com
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-06-05 17:27:14 +02:00
John Stultz
fad0c66c4b timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
Commit 6b43ae8a61 (ntp: Fix leap-second hrtimer livelock) broke the
leapsecond update of CLOCK_MONOTONIC. The missing leapsecond update to
wall_to_monotonic causes discontinuities in CLOCK_MONOTONIC.

Adjust wall_to_monotonic when NTP inserted a leapsecond.

Reported-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Tested-by: Richard Cochran <richardcochran@gmail.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1338400497-12420-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-06-04 21:46:29 +02:00
Linus Torvalds
9171c670b4 Merge branches 'irq-urgent-for-linus' and 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq and smpboot updates from Thomas Gleixner:
 "Just cleanup patches with no functional change and a fix for suspend
  issues."

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Introduce irq_do_set_affinity() to reduce duplicated code
  genirq: Add IRQS_PENDING for nested and simple irq

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smpboot, idle: Fix comment mismatch over idle_threads_init()
  smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init()
2012-06-04 11:36:51 -07:00
Linus Torvalds
c22072bdf0 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The clocksource driver is pure hardware enablement and the skew option
  is default off, well tested and non dangerous."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: Move skew_tick option into the HIGH_RES_TIMER section
  clocksource: em_sti: Add DT support
  clocksource: em_sti: Emma Mobile STI driver
  clockevents: Make clockevents_config() a global symbol
  tick: Add tick skew boot option
2012-06-04 11:25:31 -07:00
Linus Torvalds
86c47b70f6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull third pile of signal handling patches from Al Viro:
 "This time it's mostly helpers and conversions to them; there's a lot
  of stuff remaining in the tree, but that'll either go in -rc2
  (isolated bug fixes, ideally via arch maintainers' trees) or will sit
  there until the next cycle."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  x86: get rid of calling do_notify_resume() when returning to kernel mode
  blackfin: check __get_user() return value
  whack-a-mole with TIF_FREEZE
  FRV: Optimise the system call exit path in entry.S [ver #2]
  FRV: Shrink TIF_WORK_MASK [ver #2]
  FRV: Prevent syscall exit tracing and notify_resume at end of kernel exceptions
  new helper: signal_delivered()
  powerpc: get rid of restore_sigmask()
  most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set
  set_restore_sigmask() is never called without SIGPENDING (and never should be)
  TIF_RESTORE_SIGMASK can be set only when TIF_SIGPENDING is set
  don't call try_to_freeze() from do_signal()
  pull clearing RESTORE_SIGMASK into block_sigmask()
  sh64: failure to build sigframe != signal without handler
  openrisc: tracehook_signal_handler() is supposed to be called on success
  new helper: sigmask_to_save()
  new helper: restore_saved_sigmask()
  new helpers: {clear,test,test_and_clear}_restore_sigmask()
  HAVE_RESTORE_SIGMASK is defined on all architectures now
2012-06-01 11:53:44 -07:00
Linus Torvalds
1193755ac6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs changes from Al Viro.
 "A lot of misc stuff.  The obvious groups:
   * Miklos' atomic_open series; kills the damn abuse of
     ->d_revalidate() by NFS, which was the major stumbling block for
     all work in that area.
   * ripping security_file_mmap() and dealing with deadlocks in the
     area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
     general.
   * ->encode_fh() switched to saner API; insane fake dentry in
     mm/cleancache.c gone.
   * assorted annotations in fs (endianness, __user)
   * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
   * ->update_time() work from Josef.
   * other bits and pieces all over the place.

  Normally it would've been in two or three pull requests, but
  signal.git stuff had eaten a lot of time during this cycle ;-/"

Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
  nfs: don't open in ->d_revalidate
  vfs: retry last component if opening stale dentry
  vfs: nameidata_to_filp(): don't throw away file on error
  vfs: nameidata_to_filp(): inline __dentry_open()
  vfs: do_dentry_open(): don't put filp
  vfs: split __dentry_open()
  vfs: do_last() common post lookup
  vfs: do_last(): add audit_inode before open
  vfs: do_last(): only return EISDIR for O_CREAT
  vfs: do_last(): check LOOKUP_DIRECTORY
  vfs: do_last(): make ENOENT exit RCU safe
  vfs: make follow_link check RCU safe
  vfs: do_last(): use inode variable
  vfs: do_last(): inline walk_component()
  vfs: do_last(): make exit RCU safe
  vfs: split do_lookup()
  Btrfs: move over to use ->update_time
  fs: introduce inode operation ->update_time
  reiserfs: get rid of resierfs_sync_super
  reiserfs: mark the superblock as dirty a bit later
  ...
2012-06-01 10:34:35 -07:00
Al Viro
efee984c27 new helper: signal_delivered()
Does block_sigmask() + tracehook_signal_handler();  called when
sigframe has been successfully built.  All architectures converted
to it; block_sigmask() itself is gone now (merged into this one).

I'm still not too happy with the signature, but that's a separate
story (IMO we need a structure that would contain signal number +
siginfo + k_sigaction, so that get_signal_to_deliver() would fill one,
signal_delivered(), handle_signal() and probably setup...frame() -
take one).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 12:58:52 -04:00
Al Viro
77097ae503 most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set
Only 3 out of 63 do not.  Renamed the current variant to __set_current_blocked(),
added set_current_blocked() that will exclude unblockable signals, switched
open-coded instances to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 12:58:51 -04:00
Al Viro
a610d6e672 pull clearing RESTORE_SIGMASK into block_sigmask()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 12:58:49 -04:00
Al Viro
754421c8ca HAVE_RESTORE_SIGMASK is defined on all architectures now
Everyone either defines it in arch thread_info.h or has TIF_RESTORE_SIGMASK
and picks default set_restore_sigmask() in linux/thread_info.h.  Kill the
ifdefs, slap #error in linux/thread_info.h to catch breakage when new ones
get merged.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 12:58:46 -04:00
Linus Torvalds
fb21affa49 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull second pile of signal handling patches from Al Viro:
 "This one is just task_work_add() series + remaining prereqs for it.

  There probably will be another pull request from that tree this
  cycle - at least for helpers, to get them out of the way for per-arch
  fixes remaining in the tree."

Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
had brought in commit 97fd75b7b8 ("kernel/irq/manage.c: use the
pr_foo() infrastructure to prefix printks") which changed one of the
pr_err() calls that this merge moves around.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  keys: kill task_struct->replacement_session_keyring
  keys: kill the dummy key_replace_session_keyring()
  keys: change keyctl_session_to_parent() to use task_work_add()
  genirq: reimplement exit_irq_thread() hook via task_work_add()
  task_work_add: generic process-context callbacks
  avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
  parisc: need to check NOTIFY_RESUME when exiting from syscall
  move key_repace_session_keyring() into tracehook_notify_resume()
  TIF_NOTIFY_RESUME is defined on all targets now
2012-05-31 18:47:30 -07:00
Linus Torvalds
08615d7d85 Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc patches from Andrew Morton:

 - the "misc" tree - stuff from all over the map

 - checkpatch updates

 - fatfs

 - kmod changes

 - procfs

 - cpumask

 - UML

 - kexec

 - mqueue

 - rapidio

 - pidns

 - some checkpoint-restore feature work.  Reluctantly.  Most of it
   delayed a release.  I'm still rather worried that we don't have a
   clear roadmap to completion for this work.

* emailed from Andrew Morton <akpm@linux-foundation.org>: (78 patches)
  kconfig: update compression algorithm info
  c/r: prctl: add ability to set new mm_struct::exe_file
  c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
  c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  syscalls, x86: add __NR_kcmp syscall
  fs, proc: introduce /proc/<pid>/task/<tid>/children entry
  sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
  aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
  eventfd: change int to __u64 in eventfd_signal()
  fs/nls: add Apple NLS
  pidns: make killed children autoreap
  pidns: use task_active_pid_ns in do_notify_parent
  rapidio/tsi721: add DMA engine support
  rapidio: add DMA engine support for RIO data transfers
  ipc/mqueue: add rbtree node caching support
  tools/selftests: add mq_perf_tests
  ipc/mqueue: strengthen checks on mqueue creation
  ipc/mqueue: correct mq_attr_ok test
  ipc/mqueue: improve performance of send/recv
  selftests: add mq_open_tests
  ...
2012-05-31 18:10:18 -07:00
Cyrill Gorcunov
b32dfe3771 c/r: prctl: add ability to set new mm_struct::exe_file
When we do restore we would like to have a way to setup a former
mm_struct::exe_file so that /proc/pid/exe would point to the original
executable file a process had at checkpoint time.

For this the PR_SET_MM_EXE_FILE code is introduced.  This option takes a
file descriptor which will be set as a source for new /proc/$pid/exe
symlink.

Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE
vmas present for current process, simply because this feature is a special
to C/R and mm::num_exe_file_vmas become meaningless after that.

To minimize the amount of transition the /proc/pid/exe symlink might have,
this feature is implemented in one-shot manner.  Thus once changed the
symlink can't be changed again.  This should help sysadmins to monitor the
symlinks over all process running in a system.

In particular one could make a snapshot of processes and ring alarm if
there unexpected changes of /proc/pid/exe's in a system.

Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and
the caller must have CAP_SYS_RESOURCE capability granted, otherwise the
request to change symlink will be rejected.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:32 -07:00
Cyrill Gorcunov
fe8c7f5cbf c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
During checkpoint we dump whole process memory to a file and the dump
includes process stack memory.  But among stack data itself, the stack
carries additional parameters such as command line arguments, environment
data and auxiliary vector.

So when we do restore procedure and once we've restored stack data itself
we need to setup mm_struct::arg_start/end, env_start/end, so restored
process would be able to find command line arguments and environment data
it had at checkpoint time.  The same applies to auxiliary vector.

For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
ENV_END | AUXV) codes are introduced.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:32 -07:00
Cyrill Gorcunov
d97b46a646 syscalls, x86: add __NR_kcmp syscall
While doing the checkpoint-restore in the user space one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are
shared between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g.  mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparison' syscall might be the best candidate.  So here is it --
__NR_kcmp.

It takes up to 5 arguments - the pids of the two tasks (which
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At moment only x86 is supported and tested.

[akpm@linux-foundation.org: fix up selftests, warnings]
[akpm@linux-foundation.org: include errno.h]
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: Michal Marek <mmarek@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:32 -07:00
Cyrill Gorcunov
98ed57eef9 sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
For those who doesn't need C/R functionality there is no need to control
last pid, ie the pid for the next fork() call.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:32 -07:00
Eric W. Biederman
00c10bc13c pidns: make killed children autoreap
Force SIGCHLD handling to SIG_IGN so that signals are not generated and so
that the children autoreap.  This increases the parallelize and in general
the speed of network namespace shutdown.

Note self reaping childrean can exist past zap_pid_ns_processess but they
will all be reaped before we allow the pid namespace init task with pid ==
1 to be reaped.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:32 -07:00
Eric W. Biederman
3208450488 pidns: use task_active_pid_ns in do_notify_parent
Using task_active_pid_ns is more robust because it works even after we
have called exit_namespaces.  This change allows us to have parent
processes that are zombies.  Normally a zombie parent processes is crazy
and the last thing you would want to have but in the case of not letting
the init process of a pid namespace be reaped until all of it's children
are dead and reaped a zombie parent process is exactly what we want.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:31 -07:00
Anton Vorontsov
e4cc2f873a kernel/cpu.c: document clear_tasks_mm_cpumask()
Add more comments on clear_tasks_mm_cpumask, plus adds a runtime check:
the function is only suitable for offlined CPUs, and if called
inappropriately, the kernel should scream aloud.

[akpm@linux-foundation.org: tweak comment: s/walks up/walks/, use 80 cols]
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:30 -07:00
Anton Vorontsov
cb79295e20 cpu: introduce clear_tasks_mm_cpumask() helper
Many architectures clear tasks' mm_cpumask like this:

	read_lock(&tasklist_lock);
	for_each_process(p) {
		if (p->mm)
			cpumask_clear_cpu(cpu, mm_cpumask(p->mm));
	}
	read_unlock(&tasklist_lock);

Depending on the context, the code above may have several problems,
such as:

1. Working with task->mm w/o getting mm or grabing the task lock is
   dangerous as ->mm might disappear (exit_mm() assigns NULL under
   task_lock(), so tasklist lock is not enough).

2. Checking for process->mm is not enough because process' main
   thread may exit or detach its mm via use_mm(), but other threads
   may still have a valid mm.

This patch implements a small helper function that does things
correctly, i.e.:

1. We take the task's lock while whe handle its mm (we can't use
   get_task_mm()/mmput() pair as mmput() might sleep);

2. To catch exited main thread case, we use find_lock_task_mm(),
   which walks up all threads and returns an appropriate task
   (with task lock held).

Also, Per Peter Zijlstra's idea, now we don't grab tasklist_lock in
the new helper, instead we take the rcu read lock. We can do this
because the function is called after the cpu is taken down and marked
offline, so no new tasks will get this cpu set in their mm mask.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:29 -07:00
Konstantin Khlebnikov
f7505d64f2 fork: call complete_vfork_done() after clearing child_tid and flushing rss-counters
Child should wake up the parent from vfork() only after finishing all
operations with shared mm.  There is no sense in using
CLONE_CHILD_CLEARTID together with CLONE_VFORK, but it looks more accurate
now.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:29 -07:00
Tim Bird
168eeccbc9 stack usage: add pid to warning printk in check_stack_usage
In embedded systems, sometimes the same program (busybox) is the cause of
multiple warnings.  Outputting the pid with the program name in the
warning printk helps distinguish which instances of a program are using
the stack most.

This is a small patch, but useful.

Signed-off-by: Tim Bird <tim.bird@am.sony.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:28 -07:00
Oleg Nesterov
43e13cc107 cred: remove task_is_dead() from __task_cred() validation
Commit 8f92054e7c ("CRED: Fix __task_cred()'s lockdep check and banner
comment"):

    add the following validation condition:

        task->exit_state >= 0

    to permit the access if the target task is dead and therefore
    unable to change its own credentials.

OK, but afaics currently this can only help wait_task_zombie() which calls
__task_cred() without rcu lock.

Remove this validation and change wait_task_zombie() to use task_uid()
instead.  This means we do rcu_read_lock() only to shut up the lockdep,
but we already do the same in, say, wait_task_stopped().

task_is_dead() should die, task->exit_state != 0 means that this task has
passed exit_notify(), only do_wait-like code paths should use this.

Unfortunately, we can't kill task_is_dead() right now, it has already
acquired buggy users in drivers/staging.  The fix already exists.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:28 -07:00
Randy Dunlap
9b3c98cd66 kmod.c: fix kernel-doc warning
Warning(kernel/kmod.c:419): No description found for parameter 'depth'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:28 -07:00
Boaz Harrosh
785042f2e2 kmod: move call_usermodehelper_fns() to .c file and unexport all it's helpers
If we move call_usermodehelper_fns() to kmod.c file and EXPORT_SYMBOL it
we can avoid exporting all it's helper functions:
	call_usermodehelper_setup
	call_usermodehelper_setfns
	call_usermodehelper_exec
And make all of them static to kmod.c

Since the optimizer will see all these as a single call site it will
inline them inside call_usermodehelper_fns().  So we loose the call to
_fns but gain 3 calls to the helpers.  (Not that it matters)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:28 -07:00
Boaz Harrosh
81ab6e7b26 kmod: convert two call sites to call_usermodehelper_fns()
Both kernel/sys.c && security/keys/request_key.c where inlining the exact
same code as call_usermodehelper_fns(); So simply convert these sites to
directly use call_usermodehelper_fns().

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:28 -07:00
Boaz Harrosh
ae3cef7300 kmod: unexport call_usermodehelper_freeinfo()
call_usermodehelper_freeinfo() is not used outside of kmod.c.  So unexport
it, and make it static to kmod.c

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:28 -07:00
Nicolas Pitre
d84970bbaf kernel/cpu_pm.c: fix various typos
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Colin Cross <ccross@android.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:27 -07:00
Andrew Morton
97fd75b7b8 kernel/irq/manage.c: use the pr_foo() infrastructure to prefix printks
Use the module-wide pr_fmt() mechanism rather than open-coding "genirq: "
everywhere.

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:26 -07:00
Sasikantha babu
499eea6bf9 sethostname/setdomainname: notify userspace when there is a change in uts_kern_table
sethostname() and setdomainname() notify userspace on failure (without
modifying uts_kern_table).  Change things so that we only notify userspace
on success, when uts_kern_table was actually modified.

Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: WANG Cong <amwang@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:26 -07:00
Wei Yang
ee5e5683d8 kernel/resource.c: correct the comment of allocate_resource()
In the comment of allocate_resource(), the explanation of parameter max
and min is not correct.

Actually, these two parameters are used to specify the range of the
resource that will be allocated, not the min/max size that will be
allocated.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:26 -07:00
Namhyung Kim
cb7225feec perf: Remove duplicate invocation on perf_event_for_each
The @func callback was invoked twice for group leader when
perf_event_for_each() called. It seems the commit 75f937f24b
("perf_counter: Fix ctx->mutex vs counter ->mutex inversion") made the
mistake during the change.

Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338443506-25009-1-git-send-email-namhyung.kim@lge.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2012-05-31 14:01:00 -03:00
Linus Torvalds
65a50c951a Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  perf ui browser: Stop using 'self'
  perf annotate browser: Read perf config file for settings
  perf config: Allow '_' in config file variable names
  perf annotate browser: Make feature toggles global
  perf annotate browser: The idx_asm field should be used in asm only view
  perf tools: Convert critical messages to ui__error()
  perf ui: Make --stdio default when TUI is not supported
  tools lib traceevent: Silence compiler warning on 32bit build
  perf record: Fix branch_stack type in perf_record_opts
  perf tools: Reconstruct event with modifiers from perf_event_attr
  perf top: Fix counter name fixup when fallbacking to cpu-clock
  perf tools: fix thread_map__new_by_pid_str() memory leak in error path
  perf tools: Do not use _FORTIFY_SOURCE when DEBUG=1 is specified
  tools lib traceevent: Fix signature of create_arg_item()
  tools lib traceevent: Use proper function parameter type
  tools lib traceevent: Fix freeing arg on process_dynamic_array()
  tools lib traceevent: Fix a possibly wrong memory dereference
  tools lib traceevent: Fix a possible memory leak
  tools lib traceevent: Allow expressions in __print_symbolic() fields
  perf evlist: Explicititely initialize input_name
  ...
2012-05-30 11:12:00 -07:00
Linus Torvalds
0d167518e0 Merge branch 'for-3.5/core' of git://git.kernel.dk/linux-block
Merge block/IO core bits from Jens Axboe:
 "This is a bit bigger on the core side than usual, but that is purely
  because we decided to hold off on parts of Tejun's submission on 3.4
  to give it a bit more time to simmer.  As a consequence, it's seen a
  long cycle in for-next.

  It contains:

   - Bug fix from Dan, wrong locking type.
   - Relax splice gifting restriction from Eric.
   - A ton of updates from Tejun, primarily for blkcg.  This improves
     the code a lot, making the API nicer and cleaner, and also includes
     fixes for how we handle and tie policies and re-activate on
     switches.  The changes also include generic bug fixes.
   - A simple fix from Vivek, along with a fix for doing proper delayed
     allocation of the blkcg stats."

Fix up annoying conflict just due to different merge resolution in
Documentation/feature-removal-schedule.txt

* 'for-3.5/core' of git://git.kernel.dk/linux-block: (92 commits)
  blkcg: tg_stats_alloc_lock is an irq lock
  vmsplice: relax alignement requirements for SPLICE_F_GIFT
  blkcg: use radix tree to index blkgs from blkcg
  blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
  block: fix elvpriv allocation failure handling
  block: collapse blk_alloc_request() into get_request()
  blkcg: collapse blkcg_policy_ops into blkcg_policy
  blkcg: embed struct blkg_policy_data in policy specific data
  blkcg: mass rename of blkcg API
  blkcg: style cleanups for blk-cgroup.h
  blkcg: remove blkio_group->path[]
  blkcg: blkg_rwstat_read() was missing inline
  blkcg: shoot down blkgs if all policies are deactivated
  blkcg: drop stuff unused after per-queue policy activation update
  blkcg: implement per-queue policy activation
  blkcg: add request_queue->root_blkg
  blkcg: make request_queue bypassing on allocation
  blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing
  blkcg: make blkg_conf_prep() take @pol and return with queue lock held
  blkcg: remove static policy ID enums
  ...
2012-05-30 08:52:42 -07:00
Kamalesh Babulal
6a4c96eef4 sched: Remove NULL assignment of dattr_cur
Remove explicit NULL assignment of static pointer
dattr_cur from init_sched_domains().

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120523091411.GG5005@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 14:02:27 +02:00
Hiroshi Shimamoto
7997a456ef sched: Remove the last NULL entry from sched_feat_names
No need to have the last NULL entry.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FBF29E7.5020805@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 14:02:27 +02:00
Hiroshi Shimamoto
1292531f6f sched: Make sched_feat_names const
The strings sched_feat_names are never changed.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FBF29B2.9030904@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 14:02:26 +02:00
Colin Cross
454c79999f sched/rt: Fix SCHED_RR across cgroups
task_tick_rt() has an optimization to only reschedule SCHED_RR tasks
if they were the only element on their rq.  However, with cgroups
a SCHED_RR task could be the only element on its per-cgroup rq but
still be competing with other SCHED_RR tasks in its parent's
cgroup.  In this case, the SCHED_RR task in the child cgroup would
never yield at the end of its timeslice.  If the child cgroup
rt_runtime_us was the same as the parent cgroup rt_runtime_us,
the task in the parent cgroup would starve completely.

Modify task_tick_rt() to check that the task is the only task on its
rq, and that the each of the scheduling entities of its ancestors
is also the only entity on its rq.

Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 14:02:25 +02:00
Peter Zijlstra
29baa7478b sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'
Since nr_cpus_allowed is used outside of sched/rt.c and wants to be
used outside of there more, move it to a more natural site.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 14:02:25 +02:00
Peter Zijlstra
b654f7de41 sched: Make sure to not re-read variables after validation
We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 14:02:24 +02:00
Peter Zijlstra
74a5ce20e6 sched: Fix SD_OVERLAP
SD_OVERLAP exists to allow overlapping groups, overlapping groups
appear in NUMA topologies that aren't fully connected.

The typical result of not fully connected NUMA is that each cpu (or
rather node) will have different spans for a particular distance.
However due to how sched domains are traversed -- only the first cpu
in the mask goes one level up -- the next level only cares about the
spans of the cpus that went up.

Due to this two things were observed to be broken:

 - build_overlap_sched_groups() -- since its possible the cpu we're
   building the groups for exists in multiple (or all) groups, the
   selection criteria of the first group didn't ensure there was a cpu
   for which is was true that cpumask_first(span) == cpu. Thus load-
   balancing would terminate.

 - update_group_power() -- assumed that the cpu span of the first
   group of the domain was covered by all groups of the child domain.
   The above explains why this isn't true, so deal with it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1337788843.9783.14.camel@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 14:02:24 +02:00
Peter Zijlstra
2ea45800d8 sched: Don't try allocating memory from offline nodes
Allocators don't appreciate it when you try and allocate memory from
offline nodes.

Reported-and-tested-by: Tony Luck <tony.luck@intel.com>
Reported-and-tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-epfc1io9whb7o22bcujf31vn@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 14:02:23 +02:00
Peter Zijlstra
5aaa0b7a2e sched/nohz: Fix rq->cpu_load calculations some more
Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 14:02:16 +02:00
Ingo Molnar
063e047761 Merge branch 'linus' into perf/urgent
Merge back Linus's latest branch so that we pick up the uprobes changes.

( I tested this branch locally and while it's one from the middle of the
  merge window it's a good one to base further work off. )

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-30 10:59:04 +02:00
Andi Kleen
eea62f831b brlocks/lglocks: turn into functions
lglocks and brlocks are currently generated with some complicated macros
in lglock.h.  But there's no reason to not just use common utility
functions and put all the data into a common data structure.

Since there are at least two users it makes sense to share this code in a
library.  This is also easier maintainable than a macro forest.

This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-29 23:28:41 -04:00
Stephen Boyd
4796dd200d vsprintf: fix %ps on non symbols when using kallsyms
Using %ps in a printk format will sometimes fail silently and print the
empty string if the address passed in does not match a symbol that
kallsyms knows about.  But using %pS will fall back to printing the full
address if kallsyms can't find the symbol.  Make %ps act the same as %pS
by falling back to printing the address.

While we're here also make %ps print the module that a symbol comes from
so that it matches what %pS already does.  Take this simple function for
example (in a module):

	static void test_printk(void)
	{
		int test;
		pr_info("with pS: %pS\n", &test);
		pr_info("with ps: %ps\n", &test);
	}

Before this patch:

 with pS: 0xdff7df44
 with ps:

After this patch:

 with pS: 0xdff7df44
 with ps: 0xdff7df44

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:32 -07:00
Frederic Weisbecker
2bb2ba9d51 rescounters: add res_counter_uncharge_until()
When killing a res_counter which is a child of other counter, we need to
do

	res_counter_uncharge(child, xxx)
	res_counter_charge(parent, xxx)

This is not atomic and wastes CPU.  This patch adds
res_counter_uncharge_until().  This function's uncharge propagates to
ancestors until specified res_counter.

	res_counter_uncharge_until(child, parent, xxx)

Now the operation is atomic and efficient.

Signed-off-by: Frederic Weisbecker <fweisbec@redhat.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:27 -07:00
Johannes Weiner
91c63734f6 kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite
Library functions should not grab locks when the callsites can do it,
even if the lock nests like the rcu read-side lock does.

Push the rcu_read_lock() from css_is_ancestor() to its single user,
mem_cgroup_same_or_subtree() in preparation for another user that may
already hold the rcu read-side lock.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:20 -07:00
Siddhesh Poyarekar
7edc8b0ac1 mm/fork: fix overflow in vma length when copying mmap on clone
The vma length in dup_mmap is calculated and stored in a unsigned int,
which is insufficient and hence overflows for very large maps (beyond
16TB). The following program demonstrates this:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#define GIG 1024 * 1024 * 1024L
#define EXTENT 16393

int main(void)
{
        int i, r;
        void *m;
        char buf[1024];

        for (i = 0; i < EXTENT; i++) {
                m = mmap(NULL, (size_t) 1 * 1024 * 1024 * 1024L,
                         PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

                if (m == (void *)-1)
                        printf("MMAP Failed: %d\n", m);
                else
                        printf("%d : MMAP returned %p\n", i, m);

                r = fork();

                if (r == 0) {
                        printf("%d: successed\n", i);
                        return 0;
                } else if (r < 0)
                        printf("FORK Failed: %d\n", r);
                else if (r > 0)
                        wait(NULL);
        }
        return 0;
}

Increase the storage size of the result to unsigned long, which is
sufficient for storing the difference between addresses.

Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:19 -07:00
Rik van Riel
e709ffd616 mm: remove swap token code
The swap token code no longer fits in with the current VM model.  It
does not play well with cgroups or the better NUMA placement code in
development, since we have only one swap token globally.

It also has the potential to mess with scalability of the system, by
increasing the number of non-reclaimable pages on the active and
inactive anon LRU lists.

Last but not least, the swap token code has been broken for a year
without complaints, as reported by Konstantin Khlebnikov.  This suggests
we no longer have much use for it.

The days of sub-1G memory systems with heavy use of swap are over.  If
we ever need thrashing reducing code in the future, we will have to
implement something that does scale.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Bob Picco <bpicco@meloft.net>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:19 -07:00
Tejun Heo
fa980ca87d cgroup: superblock can't be released with active dentries
48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed.  As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.

cgroup_create() does grab an active reference on the superblock to
prevent it from going away while there are !root cgroups; however, the
reference is put from cgroup_diput() which is invoked on cgroup
removal, so cgroup dentries which are removed but persisting due to
lingering csses already have released their superblock active refs
allowing superblock to be killed while those dentries are around.

Given the right condition, this makes cgroup_kill_sb() call
kill_litter_super() with dentries with non-zero d_count leading to
BUG() in shrink_dcache_for_umount_subtree().

Fix it by adding cgroup_dops->d_release() operation and moving
deactivate_super() to it.  cgroup_diput() now marks dentry->d_fsdata
with itself if superblock should be deactivated and cgroup_d_release()
deactivates the superblock on dentry release.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
LKML-Reference: <CA+1xoqe5hMuxzCRhMy7J0XchDk2ZnuxOHJKikROk1-ReAzcT6g@mail.gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-05-27 17:22:56 -07:00
Thomas Gleixner
62cf20b32a tick: Move skew_tick option into the HIGH_RES_TIMER section
commit 5307c95 (tick: Add tick skew boot option) broke the
!CONFIG_HIGH_RES_TIMERS build.

Move the boot option parsing into the CONFIG_HIGH_RES_TIMERS section.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <mgalbraith@suse.de>
2012-05-25 14:08:57 +02:00
Magnus Damm
e5400321a6 clockevents: Make clockevents_config() a global symbol
Make clockevents_config() into a global symbol to allow it to be used
by compiled-in clockevent drivers. This is needed by drivers that want
to update the timer frequency after registration time.

Signed-off-by: Magnus Damm <damm@opensource.se>
Tested-by: Simon Horman <horms@verge.net.au>
Cc: arnd@arndb.de
Cc: johnstul@us.ibm.com
Cc: rjw@sisk.pl
Cc: lethal@linux-sh.org
Cc: gregkh@linuxfoundation.org
Cc: olof@lixom.net
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20120509143934.27521.46553.sendpatchset@w520
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-25 01:44:51 +02:00
Mike Galbraith
5307c9556b tick: Add tick skew boot option
Let the user decide whether power consumption or jitter is the
more important consideration for their machines.

Quoting removal commit af5ab277de:

"Historically, Linux has tried to make the regular timer tick on the
 various CPUs not happen at the same time, to avoid contention on
 xtime_lock.
    
 Nowadays, with the tickless kernel, this contention no longer happens
 since time keeping and updating are done differently. In addition,
 this skew is actually hurting power consumption in a measurable way on
 many-core systems."

Problems:

- Contrary to the above, systems do encounter contention on both
  xtime_lock and RCU structure locks when the tick is synchronized.
  
- Moderate sized RT systems suffer intolerable jitter due to the tick
  being synchronized.

- SGI reports the same for their large systems.

- Fully utilized systems reap no power saving benefit from skew removal,
  but do suffer from resulting induced lock contention.

- 0209f649 rcu: limit rcu_node leaf-level fanout
  This patch was born to combat lock contention which testing showed
  to have been _induced by_ skew removal.  Skew the tick, contention
  disappeared virtually completely.

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Link: http://lkml.kernel.org/r/1336472458.21924.78.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-25 01:44:50 +02:00
Linus Torvalds
07acfc2a93 Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Avi Kivity:
 "Changes include additional instruction emulation, page-crossing MMIO,
  faster dirty logging, preventing the watchdog from killing a stopped
  guest, module autoload, a new MSI ABI, and some minor optimizations
  and fixes.  Outside x86 we have a small s390 and a very large ppc
  update.

  Regarding the new (for kvm) rebaseless workflow, some of the patches
  that were merged before we switch trees had to be rebased, while
  others are true pulls.  In either case the signoffs should be correct
  now."

Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h.

I suspect the kvm_para.h resolution ends up doing the "do I have cpuid"
check effectively twice (it was done differently in two different
commits), but better safe than sorry ;)

* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits)
  KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block
  KVM: s390: onereg for timer related registers
  KVM: s390: epoch difference and TOD programmable field
  KVM: s390: KVM_GET/SET_ONEREG for s390
  KVM: s390: add capability indicating COW support
  KVM: Fix mmu_reload() clash with nested vmx event injection
  KVM: MMU: Don't use RCU for lockless shadow walking
  KVM: VMX: Optimize %ds, %es reload
  KVM: VMX: Fix %ds/%es clobber
  KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()
  KVM: VMX: unlike vmcs on fail path
  KVM: PPC: Emulator: clean up SPR reads and writes
  KVM: PPC: Emulator: clean up instruction parsing
  kvm/powerpc: Add new ioctl to retreive server MMU infos
  kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM
  KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler
  KVM: PPC: Book3S: Enable IRQs during exit handling
  KVM: PPC: Fix PR KVM on POWER7 bare metal
  KVM: PPC: Fix stbux emulation
  KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields
  ...
2012-05-24 16:17:30 -07:00
Srivatsa S. Bhat
4a70d2d990 smpboot, idle: Fix comment mismatch over idle_threads_init()
The comment over idle_threads_init() really talks about the functionality
of idle_init(). Move that comment to idle_init(), and add a suitable
comment over idle_threads_init().

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: suresh.b.siddha@intel.com
Cc: venki@google.com
Cc: nikunj@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20120524151100.2549.66501.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-24 22:58:08 +02:00
Srivatsa S. Bhat
ee74d13229 smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init()
While trying to initialize idle threads for all cpus, idle_threads_init()
calls smp_processor_id() in a loop, which is unnecessary. The intent
is to initialize idle threads for all non-boot cpus. So just use a variable
to note the boot cpu and use it in the loop.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: suresh.b.siddha@intel.com
Cc: venki@google.com
Cc: nikunj@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20120524151055.2549.64309.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-24 22:58:08 +02:00
Linus Torvalds
b343c8beec irqdomain changes for v3.5 merge window
Minor changes and fixups for irqdomain infrastructure.  Most
 important change adds ability to remove registered irqdomain.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPvpKWAAoJEEFnBt12D9kB6OIP/RvT1nU223w+hQMr8RKqhjZQ
 GbhuZ1o2lPLAxoGYuCQqapeH542NuuIQvviM2MKLn6u9ev3fKeFZFdF4YTpE8fwo
 ljIXJnjJ9reu2yCrAsIVZtIHJ+hXs07h1QjRwqFWGN/y57BRhxXI6xm1+SEAhay9
 gtRnfQ7eCpi665zYoNBIoNxKeESrdiwgHKUsbkNELmbTwvx+Sc9AWsYMqtO0qRAG
 JrOFCIOu3bqEcshDhM4MLZGVEmlzVZR4zUbQrY0chj5Y2c383YUyg+l+tN0NjPsF
 L3MfgIu8WFim/edQM294dLTrZjqicg4xF5uRxjx5hY2EESoLKdf1pUady673M7ux
 6cBcczMKJVQI7P2do7i8F0VwATkokytcP289hqYzJrxMHeXa48ccEZfiQt6xuLwc
 JwWAZu3BxeBMzZNxQRNX39ImSsP5wnsfZdzUBTAFAcV1ZEYgSrGJYAw+pOz18UXD
 YwwKcnNKwQHgNIkSLjgputT9VSuJsS09xErGeZAqkj7f6oxGlql9ElhUvgrBT8qg
 eiTjgIkArRY6RG8+c2mMeKE10fN822jWK9kWQdttIPa++cSBHo/Yxt8PlClvEvH8
 qjyD4nIG2dhwG8RtMc74IPDyHSHRW5JGXHPg37IoTPzurcsnzuNMSzlXVw2hb39d
 pxhCVNxe1r4GH6NQFOMg
 =K3Pg
 -----END PGP SIGNATURE-----

Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6

Pull irqdomain changes from Grant Likely:
 "Minor changes and fixups for irqdomain infrastructure.  The most
  important change adds the ability to remove a registered irqdomain."

* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
  irqdomain: Document size parameter of irq_domain_add_linear()
  irqdomain: trivial pr_fmt conversion.
  irqdomain: Kill off duplicate definitions.
  irqdomain: Make irq_domain_simple_map() static.
  irqdomain: Export remaining public API symbols.
  irqdomain: Support removal of IRQ domains.
2012-05-24 13:55:24 -07:00
Jiang Liu
818b0f3bfb genirq: Introduce irq_do_set_affinity() to reduce duplicated code
All invocations of chip->irq_set_affinity() are doing the same return
value checks. Let them all use a common function.

[ tglx: removed the silly likely while at it ]

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Keping Chen <chenkeping@huawei.com>
Link: http://lkml.kernel.org/r/1333120296-13563-3-git-send-email-jiang.liu@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-24 22:36:40 +02:00
Linus Torvalds
c7523a7c88 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner.

Various trivial conflict fixups in arch Kconfig due to addition of
unrelated entries nearby.  And one slightly more subtle one for sparc32
(new user of GENERIC_CLOCKEVENTS), fixed up as per Thomas.

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  timekeeping: Fix a few minor newline issues.
  time: remove obsolete declaration
  ntp: Fix a stale comment and a few stray newlines.
  ntp: Correct TAI offset during leap second
  timers: Fixup the Kconfig consolidation fallout
  x86: Use generic time config
  unicore32: Use generic time config
  um: Use generic time config
  tile: Use generic time config
  sparc: Use: generic time config
  sh: Use generic time config
  score: Use generic time config
  s390: Use generic time config
  openrisc: Use generic time config
  powerpc: Use generic time config
  mn10300: Use generic time config
  mips: Use generic time config
  microblaze: Use generic time config
  m68k: Use generic time config
  m32r: Use generic time config
  ...
2012-05-24 13:29:46 -07:00
Ning Jiang
23812b9d9e genirq: Add IRQS_PENDING for nested and simple irq
Every interrupt which is an active wakeup source needs the ability to
abort suspend if there is a pending irq. Right now only edge and level
irqs can do that.

            |
       +---------+
       |   INTC  |
       +---------+
               | GPIO_IRQ
            +------------+
            |  gpio-exp  |
            +------------+
              |        |
         GPIO0_IRQ  GPIO1_IRQ

In the above diagram, gpio expander has irq number GPIO_IRQ, it is
connected with two sub GPIO pins, GPIO0 and GPIO1.

During suspend, we set IRQF_NO_SUSPEND for GPIO_IRQ so that gpio
expander driver can handle the sub irq GPIO0_IRQ and GPIO1_IRQ, and
these two irqs themselves can further be handled by simple or nested
irq in some drivers(typically gpio and mfd driver). If they are used
as wakeup sources during suspend, we want them to be able to abort
suspend too.

Setting IRQS_PENDING flag in handle_nested_irq() and handle_simple_irq()
when the irq is disabled allows check_wakeup_irqs() to identify such
irqs as source for aborting suspend.

Signed-off-by: Ning Jiang <ning.n.jiang@gmail.com>
Cc: rjw@sisk.pl
Link: http://lkml.kernel.org/r/CAH3Oq6T905%2B3fkF43NAMMFvJvq7dsk_so6T2vQ8ZJrA5xiU3YA@mail.gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-24 22:27:45 +02:00
Linus Torvalds
28f3d71761 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull more networking updates from David Miller:
 "Ok, everything from here on out will be bug fixes."

1) One final sync of wireless and bluetooth stuff from John Linville.
   These changes have all been in his tree for more than a week, and
   therefore have had the necessary -next exposure.  John was just away
   on a trip and didn't have a change to send the pull request until a
   day or two ago.

2) Put back some defines in user exposed header file areas that were
   removed during the tokenring purge.  From Stephen Hemminger and Paul
   Gortmaker.

3) A bug fix for UDP hash table allocation got lost in the pile due to
   one of those "you got it..  no I've got it.." situations.  :-)

   From Tim Bird.

4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
   try to coalesce overlapping frags and crash.  Fix from Eric Dumazet.

5) RCU routing table lookups can race with free_fib_info(), causing
   crashes when we deref the device pointers in the route.  Fix by
   releasing the net device in the RCU callback.  From Yanmin Zhang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
  tcp: take care of overlaps in tcp_try_coalesce()
  ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  mm: add a low limit to alloc_large_system_hash
  ipx: restore token ring define to include/linux/ipx.h
  if: restore token ring ARP type to header
  xen: do not disable netfront in dom0
  phy/micrel: Fix ID of KSZ9021
  mISDN: Add X-Tensions USB ISDN TA XC-525
  gianfar:don't add FCB length to hard_header_len
  Bluetooth: Report proper error number in disconnection
  Bluetooth: Create flags for bt_sk()
  Bluetooth: report the right security level in getsockopt
  Bluetooth: Lock the L2CAP channel when sending
  Bluetooth: Restore locking semantics when looking up L2CAP channels
  Bluetooth: Fix a redundant and problematic incoming MTU check
  Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
  Bluetooth: Fix EIR data generation for mgmt_device_found
  Bluetooth: Fix Inquiry with RSSI event mask
  Bluetooth: improve readability of l2cap_seq_list code
  Bluetooth: Fix skb length calculation
  ...
2012-05-24 11:54:29 -07:00
Linus Torvalds
654443e20d Merge branch 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull user-space probe instrumentation from Ingo Molnar:
 "The uprobes code originates from SystemTap and has been used for years
  in Fedora and RHEL kernels.  This version is much rewritten, reviews
  from PeterZ, Oleg and myself shaped the end result.

  This tree includes uprobes support in 'perf probe' - but SystemTap
  (and other tools) can take advantage of user probe points as well.

  Sample usage of uprobes via perf, for example to profile malloc()
  calls without modifying user-space binaries.

  First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.

  If you don't know which function you want to probe you can pick one
  from 'perf top' or can get a list all functions that can be probed
  within libc (binaries can be specified as well):

	$ perf probe -F -x /lib/libc.so.6

  To probe libc's malloc():

	$ perf probe -x /lib64/libc.so.6 malloc
	Added new event:
	probe_libc:malloc    (on 0x7eac0)

  You can now use it in all perf tools, such as:

	perf record -e probe_libc:malloc -aR sleep 1

  Make use of it to create a call graph (as the flat profile is going to
  look very boring):

	$ perf record -e probe_libc:malloc -gR make
	[ perf record: Woken up 173 times to write data ]
	[ perf record: Captured and wrote 44.190 MB perf.data (~1930712

	$ perf report | less

	  32.03%            git  libc-2.15.so   [.] malloc
	                    |
	                    --- malloc

	  29.49%            cc1  libc-2.15.so   [.] malloc
	                    |
	                    --- malloc
	                       |
	                       |--0.95%-- 0x208eb1000000000
	                       |
	                       |--0.63%-- htab_traverse_noresize

	  11.04%             as  libc-2.15.so   [.] malloc
	                     |
	                     --- malloc
	                        |

	   7.15%             ld  libc-2.15.so   [.] malloc
	                     |
	                     --- malloc
	                        |

	   5.07%             sh  libc-2.15.so   [.] malloc
	                     |
	                     --- malloc
	                        |
	   4.99%  python-config  libc-2.15.so   [.] malloc
	          |
	          --- malloc
	             |
	   4.54%           make  libc-2.15.so   [.] malloc
	                   |
	                   --- malloc
	                      |
	                      |--7.34%-- glob
	                      |          |
	                      |          |--93.18%-- 0x41588f
	                      |          |
	                      |           --6.82%-- glob
	                      |                     0x41588f

	   ...

  Or:

	$ perf report -g flat | less

	# Overhead        Command  Shared Object      Symbol
	# ........  .............  .............  ..........
	#
	  32.03%            git  libc-2.15.so   [.] malloc
	          27.19%
	              malloc

	  29.49%            cc1  libc-2.15.so   [.] malloc
	          24.77%
	              malloc

	  11.04%             as  libc-2.15.so   [.] malloc
	          11.02%
	              malloc

	   7.15%             ld  libc-2.15.so   [.] malloc
	           6.57%
	              malloc

	 ...

  The core uprobes design is fairly straightforward: uprobes probe
  points register themselves at (inode:offset) addresses of
  libraries/binaries, after which all existing (or new) vmas that map
  that address will have a software breakpoint injected at that address.
  vmas are COW-ed to preserve original content.  The probe points are
  kept in an rbtree.

  If user-space executes the probed inode:offset instruction address
  then an event is generated which can be recovered from the regular
  perf event channels and mmap-ed ring-buffer.

  Multiple probes at the same address are supported, they create a
  dynamic callback list of event consumers.

  The basic model is further complicated by the XOL speedup: the
  original instruction that is probed is copied (in an architecture
  specific fashion) and executed out of line when the probe triggers.
  The XOL area is a single vma per process, with a fixed number of
  entries (which limits probe execution parallelism).

  The API: uprobes are installed/removed via
  /sys/kernel/debug/tracing/uprobe_events, the API is integrated to
  align with the kprobes interface as much as possible, but is separate
  to it.

  Injecting a probe point is privileged operation, which can be relaxed
  by setting perf_paranoid to -1.

  You can use multiple probes as well and mix them with kprobes and
  regular PMU events or tracepoints, when instrumenting a task."

Fix up trivial conflicts in mm/memory.c due to previous cleanup of
unmap_single_vma().

* 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  perf probe: Detect probe target when m/x options are absent
  perf probe: Provide perf interface for uprobes
  tracing: Fix kconfig warning due to a typo
  tracing: Provide trace events interface for uprobes
  tracing: Extract out common code for kprobes/uprobes trace events
  tracing: Modify is_delete, is_return from int to bool
  uprobes/core: Decrement uprobe count before the pages are unmapped
  uprobes/core: Make background page replacement logic account for rss_stat counters
  uprobes/core: Optimize probe hits with the help of a counter
  uprobes/core: Allocate XOL slots for uprobes use
  uprobes/core: Handle breakpoint and singlestep exceptions
  uprobes/core: Rename bkpt to swbp
  uprobes/core: Make order of function parameters consistent across functions
  uprobes/core: Make macro names consistent
  uprobes: Update copyright notices
  uprobes/core: Move insn to arch specific structure
  uprobes/core: Remove uprobe_opcode_sz
  uprobes/core: Make instruction tables volatile
  uprobes: Move to kernel/events/
  uprobes/core: Clean up, refactor and improve the code
  ...
2012-05-24 11:39:34 -07:00
Linus Torvalds
ab11ca34ee Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
 - some V4L2 API updates needed by embedded devices
 - DVB API extensions for ATSC-MH delivery system, used in US for mobile
   TV
 - new tuners for fc0011/0012/0013 and tua9001
 - a new dvb driver for af9033/9035
 - a new ATSC-MH frontend (lg2160)
 - new remote controller keymaps
 - Removal of a few legacy webcam driver that got replaced by gspca on
   several kernel versions ago
 - a new driver for Exynos 4/5 webcams(s5pp fimc-lite)
 - a new webcam sensor driver (smiapp)
 - a new video input driver for embedded (sta2x1xx)
 - several improvements, fixes, cleanups, etc inside the drivers.

Manually fix up conflicts due to err() -> dev_err() conversion in
drivers/staging/media/easycap/easycap_main.c

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (484 commits)
  [media] saa7134-cards: Remove a PCI entry added by mistake
  [media] radio-sf16fmi: add support for SF16-FMD
  [media] rc-loopback: remove duplicate line
  [media] patch for Asus My Cinema PS3-100 (1043:48cd)
  [media] au0828: Move the Kconfig knob under V4L_USB_DRIVERS
  [media] em28xx: simple comment fix
  [media] [resend] radio-sf16fmr2: add PnP support for SF16-FMD2
  [media] smiapp: Use v4l2_ctrl_new_int_menu() instead of v4l2_ctrl_new_custom()
  [media] smiapp: Add support for 8-bit uncompressed formats
  [media] smiapp: Allow generic quirk registers
  [media] smiapp: Use non-binning limits if the binning limit is zero
  [media] smiapp: Initialise rval in smiapp_read_nvm()
  [media] smiapp: Round minimum pre_pll up rather than down in ip_clk_freq check
  [media] smiapp: Use 8-bit reads only before identifying the sensor
  [media] smiapp: Quirk for sensors that only do 8-bit reads
  [media] smiapp: Pass struct sensor to register writing commands instead of i2c_client
  [media] smiapp: Allow using external clock from the clock framework
  [media] zl10353: change .read_snr() to report SNR as a 0.1 dB
  [media] media: add support to gspca/pac7302.c for 093a:2627 (Genius FaceCam 300)
  [media] m88rs2000 - only flip bit 2 on reg 0x70 on 16th try
  ...
2012-05-24 10:21:51 -07:00
Ingo Molnar
9ba0541453 Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull an ftrace ring-buffer fix from Steve Rostedt:

 * fix kernel crash when changing the size of the ring-buffer on
   boxes where possible_cpus != online_cpus.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-24 09:06:24 +02:00
Tim Bird
31fe62b958 mm: add a low limit to alloc_large_system_hash
UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.

Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-24 00:28:21 -04:00
Oleg Nesterov
f23ca33546 keys: kill task_struct->replacement_session_keyring
Kill the no longer used task_struct->replacement_session_keyring, update
copy_creds() and exit_creds().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Smith <dsmith@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-23 22:11:41 -04:00
Oleg Nesterov
4d1d61a6b2 genirq: reimplement exit_irq_thread() hook via task_work_add()
exit_irq_thread() and task->irq_thread are needed to handle the unexpected
(and unlikely) exit of irq-thread.

We can use task_work instead and make this all private to
kernel/irq/manage.c, cleanup plus micro-optimization.

1. rename exit_irq_thread() to irq_thread_dtor(), make it
   static, and move it up before irq_thread().

2. change irq_thread() to do task_work_add(irq_thread_dtor)
   at the start and task_work_cancel() before return.

   tracehook_notify_resume() can never play with kthreads,
   only do_exit()->exit_task_work() can call the callback
   and this is what we want.

3. remove task_struct->irq_thread and the special hook
   in do_exit().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Smith <dsmith@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-23 22:11:12 -04:00
Oleg Nesterov
e73f8959af task_work_add: generic process-context callbacks
Provide a simple mechanism that allows running code in the (nonatomic)
context of the arbitrary task.

The caller does task_work_add(task, task_work) and this task executes
task_work->func() either from do_notify_resume() or from do_exit().  The
callback can rely on PF_EXITING to detect the latter case.

"struct task_work" can be embedded in another struct, still it has "void
*data" to handle the most common/simple case.

This allows us to kill the ->replacement_session_keyring hack, and
potentially this can have more users.

Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into
tracehook_notify_resume() and do_exit().  But at the same time we can
remove the "replacement_session_keyring != NULL" checks from
arch/*/signal.c and exit_creds().

Note: task_work_add/task_work_run abuses ->pi_lock.  This is only because
this lock is already used by lookup_pi_state() to synchronize with
do_exit() setting PF_EXITING.  Fortunately the scope of this lock in
task_work.c is really tiny, and the code is unlikely anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Smith <dsmith@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-23 22:09:21 -04:00
Linus Torvalds
f9369910a6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull first series of signal handling cleanups from Al Viro:
 "This is just the first part of the queue (about a half of it);
  assorted fixes all over the place in signal handling.

  This one ends with all sigsuspend() implementations switched to
  generic one (->saved_sigmask-based).

  With this, a bunch of assorted old buglets are fixed and most of the
  missing bits of NOTIFY_RESUME hookup are in place.  Two more fixes sit
  in arm and um trees respectively, and there's a couple of broken ones
  that need obvious fixes - parisc and avr32 check TIF_NOTIFY_RESUME
  only on one of two codepaths; fixes for that will happen in the next
  series"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (55 commits)
  unicore32: if there's no handler we need to restore sigmask, syscall or no syscall
  xtensa: add handling of TIF_NOTIFY_RESUME
  microblaze: drop 'oldset' argument of do_notify_resume()
  microblaze: handle TIF_NOTIFY_RESUME
  score: add handling of NOTIFY_RESUME to do_notify_resume()
  m68k: add TIF_NOTIFY_RESUME and handle it.
  sparc: kill ancient comment in sparc_sigaction()
  h8300: missing checks of __get_user()/__put_user() return values
  frv: missing checks of __get_user()/__put_user() return values
  cris: missing checks of __get_user()/__put_user() return values
  powerpc: missing checks of __get_user()/__put_user() return values
  sh: missing checks of __get_user()/__put_user() return values
  sparc: missing checks of __get_user()/__put_user() return values
  avr32: struct old_sigaction is never used
  m32r: struct old_sigaction is never used
  xtensa: xtensa_sigaction doesn't exist
  alpha: tidy signal delivery up
  score: don't open-code force_sigsegv()
  cris: don't open-code force_sigsegv()
  blackfin: don't open-code force_sigsegv()
  ...
2012-05-23 18:11:45 -07:00
Linus Torvalds
644473e9c6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace enhancements from Eric Biederman:
 "This is a course correction for the user namespace, so that we can
  reach an inexpensive, maintainable, and reasonably complete
  implementation.

  Highlights:
   - Config guards make it impossible to enable the user namespace and
     code that has not been converted to be user namespace safe.

   - Use of the new kuid_t type ensures the if you somehow get past the
     config guards the kernel will encounter type errors if you enable
     user namespaces and attempt to compile in code whose permission
     checks have not been updated to be user namespace safe.

   - All uids from child user namespaces are mapped into the initial
     user namespace before they are processed.  Removing the need to add
     an additional check to see if the user namespace of the compared
     uids remains the same.

   - With the user namespaces compiled out the performance is as good or
     better than it is today.

   - For most operations absolutely nothing changes performance or
     operationally with the user namespace enabled.

   - The worst case performance I could come up with was timing 1
     billion cache cold stat operations with the user namespace code
     enabled.  This went from 156s to 164s on my laptop (or 156ns to
     164ns per stat operation).

   - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
     Most uid/gid setting system calls treat these value specially
     anyway so attempting to use -1 as a uid would likely cause
     entertaining failures in userspace.

   - If setuid is called with a uid that can not be mapped setuid fails.
     I have looked at sendmail, login, ssh and every other program I
     could think of that would call setuid and they all check for and
     handle the case where setuid fails.

   - If stat or a similar system call is called from a context in which
     we can not map a uid we lie and return overflowuid.  The LFS
     experience suggests not lying and returning an error code might be
     better, but the historical precedent with uids is different and I
     can not think of anything that would break by lying about a uid we
     can't map.

   - Capabilities are localized to the current user namespace making it
     safe to give the initial user in a user namespace all capabilities.

  My git tree covers all of the modifications needed to convert the core
  kernel and enough changes to make a system bootable to runlevel 1."

Fix up trivial conflicts due to nearby independent changes in fs/stat.c

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
  userns:  Silence silly gcc warning.
  cred: use correct cred accessor with regards to rcu read lock
  userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
  userns: Convert cgroup permission checks to use uid_eq
  userns: Convert tmpfs to use kuid and kgid where appropriate
  userns: Convert sysfs to use kgid/kuid where appropriate
  userns: Convert sysctl permission checks to use kuid and kgids.
  userns: Convert proc to use kuid/kgid where appropriate
  userns: Convert ext4 to user kuid/kgid where appropriate
  userns: Convert ext3 to use kuid/kgid where appropriate
  userns: Convert ext2 to use kuid/kgid where appropriate.
  userns: Convert devpts to use kuid/kgid where appropriate
  userns: Convert binary formats to use kuid/kgid where appropriate
  userns: Add negative depends on entries to avoid building code that is userns unsafe
  userns: signal remove unnecessary map_cred_ns
  userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
  userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
  userns: Convert stat to return values mapped from kuids and kgids
  userns: Convert user specfied uids and gids in chown into kuids and kgid
  userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
  ...
2012-05-23 17:42:39 -07:00
Linus Torvalds
fb827ec684 Three trivial patches of no real utility. Modules are boring.
Fortunately David Howells is looking to change this, with his module signing
 patchset.  But that's for next merge window...
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPvN++AAoJENkgDmzRrbjxnB8QAJHnsOjx3M+2IwouCMqatNJf
 GrVMsy7I8UPJ1JSAR/2sCoWUUpg1xhUm+koO8rPJuJZ7kDtiRKEa5cJ1JsPiYzcc
 RA7hWOrN/hzAFSjvdOA4ezXqn3OYaW6S1W64DxN2e0bo73n1srtAZ2lxMsQ/2SOH
 xYQDbTK+/6ERTL0lCghxAZYCIrKeO2oWa46EqW6FdEU2bJisxYr5Kthhig7GaKYU
 xluQEvjoU7hbRm9wcvrCYR0BIxnohrhQ/m9DRTxqeRHzAShYx0tiilKlS3RfPda6
 mlMY7sqOH6MPsUKq8IQIn3Mz4ut8fa9E8Ukzh0rMdGnVz3GwYTnWkWp8oinUs042
 BJUMn0ke6OcCdfNwLM0MPUUHXEpzMRrK1Jt2L/S1S7xewoRmJ2UhWgsUHXwL39vu
 4HR4k7xS/V5GjCUec0YBKcAFg/ccH1ktWzg6mQ1nNTX73aniAJ0by2NR+n1fZOi2
 m/iBYgWXLMJ9nxGbHd7UXFIDDTXS0RRNvGVyRuI82LnOhE3X3GE7wbbRgHQAnPGy
 JlnjQUI5sPqbQE2W/+QSGW1e/HgVWmJKwkGONRLVdgkrHdF79gaUVHjp5JOI6JvT
 XCm3JLMxRC93ZNJnl3qwMX/2zsTh7SfWbLiB4fzTfr82sCWLhCrnD+PWxx1OwYvZ
 Vv3WTJQqPKXWKnkIqKIh
 =gI7A
 -----END PGP SIGNATURE-----

Merge tag 'module-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus

Pull module patches from Rusty Russell, who really sells them:
 "Three trivial patches of no real utility.  Modules are boring."

But to make things slightly more exciting, he adds:
 "Fortunately David Howells is looking to change this, with his module
  signing patchset.  But that's for next merge window...

  Cheers,
  Rusty."

* tag 'module-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  Guard check in module loader against integer overflow
  modpost: use proper kernel style for autogenerated files
  modpost: Stop grab_file() from leaking filedescriptors if fstat() fails
2012-05-23 17:34:09 -07:00
Linus Torvalds
468f4d1a85 Power management updates for 3.5
* Implementation of opportunistic suspend (autosleep) and user space interface
   for manipulating wakeup sources.
 
 * Hibernate updates from Bojan Smojver and Minho Ban.
 
 * Updates of the runtime PM core and generic PM domains framework related to
   PM QoS.
 
 * Assorted fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJPu+jwAAoJEKhOf7ml8uNsOw0P/0w1FqXD64a1laE43JIlBe9w
 yHEcLHc9MXN+8lS0XQ6jFiL/VC3U5Sj7Ro+DFKcL2MWX//dfDcZcwA9ep/qh4tHV
 tJ987IijdWqJV14pde3xQafhp/9i12rArLxns7S5fzkdfVk0iDjhZZaZy4afFJYM
 SuCsDhCwWefZh89+oLikByiFPnhW+f2ZC9YQeokBM/XvZLtxmOiVfL6duloT/Cr+
 58jkrJ8xz/5kmmN4bXM4Wlpf9ZIYFXbvtbKrq3GZOXc+LpNKlWQyFgg/pIuxBewC
 uSgsNXXV0LFDi5JfER/8l9MMLtJwwc4VHzpLvMnRv+GtwO2/FKIIr9Fcv000IL2N
 0/Ppr52M7XpRruM/k+YroUQ4F1oBX6HB4e3rwqC+XG6n5bwn/Jc7kdy7aUojqNLG
 Nlr5f0vBjLTSF66Jnel71Bn+gbA1ogER7E+esSTMpyX+RgGJAUVt5oX9IjbXl3PI
 bk8xW1csSRxBI2NkFOd9EM3vMzdGc5uu+iOoy7iBvcAK0AEfo2Ml9YuSVFQeqAu0
 A96MUW155A+GKMC7I/LK8pTgMvYDedWhVW9uyXpMRjwdFC5/ywZU1aM00tL9HMpG
 pzHOFJgsYrf/6VCV8BwqgudRYd0K5EPSGeITCg973os/XzJIOCfJuy+Pn5V/F0ew
 lTbi8ipQD0Hh8A/Xt0QB
 =Q2vo
 -----END PGP SIGNATURE-----

Merge tag 'pm-for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:

 - Implementation of opportunistic suspend (autosleep) and user space
   interface for manipulating wakeup sources.

 - Hibernate updates from Bojan Smojver and Minho Ban.

 - Updates of the runtime PM core and generic PM domains framework
   related to PM QoS.

 - Assorted fixes.

* tag 'pm-for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
  epoll: Fix user space breakage related to EPOLLWAKEUP
  PM / Domains: Make it possible to add devices to inactive domains
  PM / Hibernate: Use get_gendisk to verify partition if resume_file is integer format
  PM / Domains: Fix computation of maximum domain off time
  PM / Domains: Fix link checking when add subdomain
  PM / Sleep: User space wakeup sources garbage collector Kconfig option
  PM / Sleep: Make the limit of user space wakeup sources configurable
  PM / Documentation: suspend-and-cpuhotplug.txt: Fix typo
  PM / Domains: Cache device stop and domain power off governor results, v3
  PM / Domains: Make device removal more straightforward
  PM / Sleep: Fix a mistake in a conditional in autosleep_store()
  epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready
  PM / QoS: Create device constraints objects on notifier registration
  PM / Runtime: Remove device fields related to suspend time, v2
  PM / Domains: Rework default domain power off governor function, v2
  PM / Domains: Rework default device stop governor function, v2
  PM / Sleep: Add user space interface for manipulating wakeup sources, v3
  PM / Sleep: Add "prevent autosleep time" statistics to wakeup sources
  PM / Sleep: Implement opportunistic sleep, v2
  PM / Sleep: Add wakeup_source_activate and wakeup_source_deactivate tracepoints
  ...
2012-05-23 14:07:06 -07:00
Steven Rostedt
6a31e1f135 ring-buffer: Check for valid buffer before changing size
On some machines the number of possible CPUS is not the same as the
number of CPUs that is on the machine. Ftrace uses possible_cpus to
update the tracing structures but the ring buffer only allocates
per cpu buffers for online CPUs when they come up.

When the wakeup tracer was enabled in such a case, the ftrace code
enabled all possible cpu buffers, but the code in ring_buffer_resize()
did not check to see if the buffer in question was allocated. Since
boot up CPUs did not match possible CPUs it caused the following
crash:

BUG: unable to handle kernel NULL pointer dereference at 00000020
IP: [<c1097851>] ring_buffer_resize+0x16a/0x28d
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in: [last unloaded: scsi_wait_scan]

Pid: 1387, comm: bash Not tainted 3.4.0-test+ #13                  /DG965MQ
EIP: 0060:[<c1097851>] EFLAGS: 00010217 CPU: 0
EIP is at ring_buffer_resize+0x16a/0x28d
EAX: f5a14340 EBX: f6026b80 ECX: 00000ff4 EDX: 00000ff3
ESI: 00000000 EDI: 00000002 EBP: f4275ecc ESP: f4275eb0
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 80050033 CR2: 00000020 CR3: 34396000 CR4: 000007d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process bash (pid: 1387, ti=f4274000 task=f4380cb0 task.ti=f4274000)
Stack:
 c109cf9a f6026b98 00000162 00160f68 00000006 00160f68 00000002 f4275ef0
 c109d013 f4275ee8 c123b72a c1c0bf00 c1cc81dc 00000005 f4275f98 00000007
 f4275f70 c109d0c7 7700000e 75656b61 00000070 f5e90900 f5c4e198 00000301
Call Trace:
 [<c109cf9a>] ? tracing_set_tracer+0x115/0x1e9
 [<c109d013>] tracing_set_tracer+0x18e/0x1e9
 [<c123b72a>] ? _copy_from_user+0x30/0x46
 [<c109d0c7>] tracing_set_trace_write+0x59/0x7f
 [<c10ec01e>] ? fput+0x18/0x1c6
 [<c11f8732>] ? security_file_permission+0x27/0x2b
 [<c10eaacd>] ? rw_verify_area+0xcf/0xf2
 [<c10ec01e>] ? fput+0x18/0x1c6
 [<c109d06e>] ? tracing_set_tracer+0x1e9/0x1e9
 [<c10ead77>] vfs_write+0x8b/0xe3
 [<c10ebead>] ? fget_light+0x30/0x81
 [<c10eaf54>] sys_write+0x42/0x63
 [<c1834fbf>] sysenter_do_call+0x12/0x28

This happens with the latency tracer as the ftrace code updates the
saved max buffer via its cpumask and not with a global setting.

Adding a check in ring_buffer_resize() to make sure the buffer being resized
exists, fixes the problem.

Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-23 15:35:17 -04:00
Linus Torvalds
56edab3159 Merge branches 'perf-urgent-for-linus' and 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:

 - Leftover AMD PMU driver fix fix from the end of the v3.4
   stabilization cycle.

 - Late tools/perf/ changes that missed the first round:
    * endianness fixes
    * event parsing improvements
    * libtraceevent fixes factored out from trace-cmd
    * perl scripting engine fixes related to libtraceevent,
    * testcase improvements
    * perf inject / pipe mode fixes
    * plus a kernel side fix

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Update event scheduling constraints for AMD family 15h models

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "sched, perf: Use a single callback into the scheduler"
  perf evlist: Show event attribute details
  perf tools: Bump default sample freq to 4 kHz
  perf buildid-list: Work better with pipe mode
  perf tools: Fix piped mode read code
  perf inject: Fix broken perf inject -b
  perf tools: rename HEADER_TRACE_INFO to HEADER_TRACING_DATA
  perf tools: Add union u64_swap type for swapping u64 data
  perf tools: Carry perf_event_attr bitfield throught different endians
  perf record: Fix documentation for branch stack sampling
  perf target: Add cpu flag to sample_type if target has cpu
  perf tools: Always try to build libtraceevent
  perf tools: Rename libparsevent to libtraceevent in Makefile
  perf script: Rename struct event to struct event_format in perl engine
  perf script: Explicitly handle known default print arg type
  perf tools: Add hardcoded name term for pmu events
  perf tools: Separate 'mem:' event scanner bits
  perf tools: Use allocated list for each parsed event
  perf tools: Add support for displaying event parser debug info
  perf test: Move parse event automated tests to separated object
2012-05-23 12:12:49 -07:00
Linus Torvalds
ec0d7f18ab Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull fpu state cleanups from Ingo Molnar:
 "This tree streamlines further aspects of FPU handling by eliminating
  the prepare_to_copy() complication and moving that logic to
  arch_dup_task_struct().

  It also fixes the FPU dumps in threaded core dumps, removes and old
  (and now invalid) assumption plus micro-optimizes the exit path by
  avoiding an FPU save for dead tasks."

Fixed up trivial add-add conflict in arch/sh/kernel/process.c that came
in because we now do the FPU handling in arch_dup_task_struct() rather
than the legacy (and now gone) prepare_to_copy().

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, fpu: drop the fpu state during thread exit
  x86, xsave: remove thread_has_fpu() bug check in __sanitize_i387_state()
  coredump: ensure the fpu state is flushed for proper multi-threaded core dump
  fork: move the real prepare_to_copy() users to arch_dup_task_struct()
2012-05-23 10:59:07 -07:00
Linus Torvalds
269af9a1a0 Merge branch 'x86-extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull exception table generation updates from Ingo Molnar:
 "The biggest change here is to allow the build-time sorting of the
  exception table, to speed up booting.  This is achieved by the
  architecture enabling BUILDTIME_EXTABLE_SORT.  This option is enabled
  for x86 and MIPS currently.

  On x86 a number of fixes and changes were needed to allow build-time
  sorting of the exception table, in particular a relocation invariant
  exception table format was needed.  This required the abstracting out
  of exception table protocol and the removal of 20 years of accumulated
  assumptions about the x86 exception table format.

  While at it, this tree also cleans up various other aspects of
  exception handling, such as early(er) exception handling for
  rdmsr_safe() et al.

  All in one, as the result of these changes the x86 exception code is
  now pretty nice and modern.  As an added bonus any regressions in this
  code will be early and violent crashes, so if you see any of those,
  you'll know whom to blame!"

Fix up trivial conflicts in arch/{mips,x86}/Kconfig files due to nearby
modifications of other core architecture options.

* 'x86-extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  Revert "x86, extable: Disable presorted exception table for now"
  scripts/sortextable: Handle relative entries, and other cleanups
  x86, extable: Switch to relative exception table entries
  x86, extable: Disable presorted exception table for now
  x86, extable: Add _ASM_EXTABLE_EX() macro
  x86, extable: Remove open-coded exception table entries in arch/x86/ia32/ia32entry.S
  x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/xsave.h
  x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/kvm_host.h
  x86, extable: Remove the now-unused __ASM_EX_SEC macros
  x86, extable: Remove open-coded exception table entries in arch/x86/xen/xen-asm_32.S
  x86, extable: Remove open-coded exception table entries in arch/x86/um/checksum_32.S
  x86, extable: Remove open-coded exception table entries in arch/x86/lib/usercopy_32.c
  x86, extable: Remove open-coded exception table entries in arch/x86/lib/putuser.S
  x86, extable: Remove open-coded exception table entries in arch/x86/lib/getuser.S
  x86, extable: Remove open-coded exception table entries in arch/x86/lib/csum-copy_64.S
  x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_nocache_64.S
  x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_64.S
  x86, extable: Remove open-coded exception table entries in arch/x86/lib/checksum_32.S
  x86, extable: Remove open-coded exception table entries in arch/x86/kernel/test_rodata.c
  x86, extable: Remove open-coded exception table entries in arch/x86/kernel/entry_64.S
  ...
2012-05-23 10:44:35 -07:00
Linus Torvalds
3a8580f820 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Richard Weinberger:
 "Most changes are bug fixes and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: missing checks of __put_user()/__get_user() return values
  um: stub_rt_sigsuspend isn't needed these days anymore
  um/x86: merge (and trim) 32- and 64-bit variants of ptrace.h
  irq: Remove irq_chip->release()
  um: Remove CONFIG_IRQ_RELEASE_METHOD
  um: Remove usage of irq_chip->release()
  um: Implement um_free_irq()
  um: Fix __swp_type()
  um: Implement a custom pte_same() function
  um: Add BUG() to do_ops()'s error path
  um: Remove unused variables
  um: bury unused _TIF_RESTORE_SIGMASK
  um: wrong sigmask saved in case of multiple sigframes
  um: add TIF_NOTIFY_RESUME
  um: ->restart_block.fn needs to be reset on sigreturn
2012-05-23 09:01:41 -07:00
Jiri Olsa
ab0cce560e Revert "sched, perf: Use a single callback into the scheduler"
This reverts commit cb04ff9ac4 ("sched, perf: Use a single
callback into the scheduler").

Before this change was introduced, the process switch worked
like this (wrt. to perf event schedule):

     schedule (prev, next)
       - schedule out all perf events for prev
       - switch to next
       - schedule in all perf events for current (next)

After the commit, the process switch looks like:

     schedule (prev, next)
       - schedule out all perf events for prev
       - schedule in all perf events for (next)
       - switch to next

The problem is, that after we schedule perf events in, the pmu
is enabled and we can receive events even before we make the
switch to next - so "current" still being prev process (event
SAMPLE data are filled based on the value of the "current"
process).

Thats exactly what we see for test__PERF_RECORD test. We receive
SAMPLES with PID of the process that our tracee is scheduled
from.

Discussed with Peter Zijlstra:

 > Bah!, yeah I guess reverting is the right thing for now. Sad
 > though.
 >
 > So by having the two hooks we have a black-spot between them
 > where we receive no events at all, this black-spot covers the
 > hand-over of current and we thus don't receive the 'wrong'
 > events.
 >
 > I rather liked we could do away with both that black-spot and
 > clean up the code a little, but apparently people rely on it.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: acme@redhat.com
Cc: paulus@samba.org
Cc: cjashfor@linux.vnet.ibm.com
Cc: fweisbec@gmail.com
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-23 17:40:51 +02:00
David Howells
ef26a5a6ea Guard check in module loader against integer overflow
The check:

	if (len < hdr->e_shoff + hdr->e_shnum * sizeof(Elf_Shdr))

may not work if there's an overflow in the right-hand side of the condition.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-05-23 22:28:53 +09:30
Linus Torvalds
e8650a0823 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
 "As usual, it's mostly typo fixes, redundant code elimination and some
  documentation updates."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits)
  edac, mips: don't change code that has been removed in edac/mips tree
  xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer
  lib: Change mail address of Oskar Schirmer
  net: Change mail address of Oskar Schirmer
  arm/m68k: Change mail address of Sebastian Hess
  i2c: Change mail address of Oskar Schirmer
  net: Fix tcp_build_and_update_options comment in struct tcp_sock
  atomic64_32.h: fix parameter naming mismatch
  Kconfig: replace "--- help ---" with "---help---"
  c2port: fix bogus Kconfig "default no"
  edac: Fix spelling errors.
  qla1280: Remove redundant NULL check before release_firmware() call
  remoteproc: remove redundant NULL check before release_firmware()
  qla2xxx: Remove redundant NULL check before release_firmware() call.
  aic94xx: Get rid of redundant NULL check before release_firmware() call
  tehuti: delete redundant NULL check before release_firmware()
  qlogic: get rid of a redundant test for NULL before call to release_firmware()
  bna: remove redundant NULL test before release_firmware()
  tg3: remove redundant NULL test before release_firmware() call
  typhoon: get rid of redundant conditional before all to release_firmware()
  ...
2012-05-22 19:22:50 -07:00
Linus Torvalds
d79ee93de9 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The biggest change is the cleanup/simplification of the load-balancer:
  instead of the current practice of architectures twiddling scheduler
  internal data structures and providing the scheduler domains in
  colorfully inconsistent ways, we now have generic scheduler code in
  kernel/sched/core.c:sched_init_numa() that looks at the architecture's
  node_distance() parameters and (while not fully trusting it) deducts a
  NUMA topology from it.

  This inevitably changes balancing behavior - hopefully for the better.

  There are various smaller optimizations, cleanups and fixlets as well"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
  sched: Remove stale power aware scheduling remnants and dysfunctional knobs
  sched/debug: Fix printing large integers on 32-bit platforms
  sched/fair: Improve the ->group_imb logic
  sched/nohz: Fix rq->cpu_load[] calculations
  sched/numa: Don't scale the imbalance
  sched/fair: Revert sched-domain iteration breakage
  sched/x86: Rewrite set_cpu_sibling_map()
  sched/numa: Fix the new NUMA topology bits
  sched/numa: Rewrite the CONFIG_NUMA sched domain support
  sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
  sched/fair: Add some serialization to the sched_domain load-balance walk
  sched/fair: Let minimally loaded cpu balance the group
  sched: Change rq->nr_running to unsigned int
  x86/numa: Check for nonsensical topologies on real hw as well
  x86/numa: Hard partition cpu topology masks on node boundaries
  x86/numa: Allow specifying node_distance() for numa=fake
  x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
  sched: Update documentation and comments
  sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()
2012-05-22 18:27:32 -07:00
Linus Torvalds
2ff2b289a6 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf changes from Ingo Molnar:
 "Lots of changes:

   - (much) improved assembly annotation support in perf report, with
     jump visualization, searching, navigation, visual output
     improvements and more.

    - kernel support for AMD IBS PMU hardware features.  Notably 'perf
      record -e cycles:p' and 'perf top -e cycles:p' should work without
      skid now, like PEBS does on the Intel side, because it takes
      advantage of IBS transparently.

    - the libtracevents library: it is the first step towards unifying
      tracing tooling and perf, and it also gives a tracing library for
      external tools like powertop to rely on.

    - infrastructure: various improvements and refactoring of the UI
      modules and related code

    - infrastructure: cleanup and simplification of the profiling
      targets code (--uid, --pid, --tid, --cpu, --all-cpus, etc.)

    - tons of robustness fixes all around

    - various ftrace updates: speedups, cleanups, robustness
      improvements.

    - typing 'make' in tools/ will now give you a menu of projects to
      build and a short help text to explain what each does.

    - ... and lots of other changes I forgot to list.

  The perf record make bzImage + perf report regression you reported
  should be fixed."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (166 commits)
  tracing: Remove kernel_lock annotations
  tracing: Fix initial buffer_size_kb state
  ring-buffer: Merge separate resize loops
  perf evsel: Create events initially disabled -- again
  perf tools: Split term type into value type and term type
  perf hists: Fix callchain ip printf format
  perf target: Add uses_mmap field
  ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER
  ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code()
  ftrace: Make ftrace_modify_all_code() global for archs to use
  ftrace: Return record ip addr for ftrace_location()
  ftrace: Consolidate ftrace_location() and ftrace_text_reserved()
  ftrace: Speed up search by skipping pages by address
  ftrace: Remove extra helper functions
  ftrace: Sort all function addresses, not just per page
  tracing: change CPU ring buffer state from tracing_cpumask
  tracing: Check return value of tracing_dentry_percpu()
  ring-buffer: Reset head page before running self test
  ring-buffer: Add integrity check at end of iter read
  ring-buffer: Make addition of pages in ring buffer atomic
  ...
2012-05-22 18:18:55 -07:00
Linus Torvalds
88d6ae8dc3 Merge branch 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "cgroup file type addition / removal is updated so that file types are
  added and removed instead of individual files so that dynamic file
  type addition / removal can be implemented by cgroup and used by
  controllers.  blkio controller changes which will come through block
  tree are dependent on this.  Other changes include res_counter cleanup
  and disallowing kthread / PF_THREAD_BOUND threads to be attached to
  non-root cgroups.

  There's a reported bug with the file type addition / removal handling
  which can lead to oops on cgroup umount.  The issue is being looked
  into.  It shouldn't cause problems for most setups and isn't a
  security concern."

Fix up trivial conflict in Documentation/feature-removal-schedule.txt

* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  res_counter: Account max_usage when calling res_counter_charge_nofail()
  res_counter: Merge res_counter_charge and res_counter_charge_nofail
  cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
  cgroup: remove cgroup_subsys->populate()
  cgroup: get rid of populate for memcg
  cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
  cgroup: make css->refcnt clearing on cgroup removal optional
  cgroup: use negative bias on css->refcnt to block css_tryget()
  cgroup: implement cgroup_rm_cftypes()
  cgroup: introduce struct cfent
  cgroup: relocate __d_cgrp() and __d_cft()
  cgroup: remove cgroup_add_file[s]()
  cgroup: convert memcg controller to the new cftype interface
  memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
  cgroup: convert all non-memcg controllers to the new cftype interface
  cgroup: relocate cftype and cgroup_subsys definitions in controllers
  cgroup: merge cft_release_agent cftype array into the base files array
  cgroup: implement cgroup_add_cftypes() and friends
  cgroup: build list of all cgroups under a given cgroupfs_root
  cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
  ...
2012-05-22 17:40:19 -07:00
Linus Torvalds
c54894cd46 Merge branch 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "Nothing exciting.  Most are updates to debug stuff and related fixes.
  Two not-too-critical bugs are fixed - WARN_ON() triggering spurious
  during cpu offlining and unlikely lockdep related oops."

* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  lockdep: fix oops in processing workqueue
  workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is active
  workqueue: Catch more locking problems with flush_work()
  workqueue: change BUG_ON() to WARN_ON()
  trace: Remove unused workqueue tracer
2012-05-22 17:36:56 -07:00
Linus Torvalds
fb09bafda6 Staging tree pull request for 3.5-rc1
Here is the big staging tree pull request for the 3.5-rc1 merge window.
 
 Loads of changes here, and we just narrowly added more lines than we
 added:
  622 files changed, 28356 insertions(+), 26059 deletions(-)
 
 But, good news is that there is a number of subsystems that moved out of
 the staging tree, to their respective "real" portions of the kernel.
 
 Code that moved out was:
 	- iio core code
 	- mei driver
 	- vme core and bridge drivers
 
 There was one broken network driver that moved into staging as a step
 before it is removed from the tree (pc300), and there was a few new
 drivers added to the tree:
 	- new iio drivers
 	- gdm72xx wimax USB driver
 	- ipack subsystem and 2 drivers
 
 All of the movements around have acks from the various subsystem
 maintainers, and all of this has been in the linux-next tree for a
 while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk+7q8MACgkQMUfUDdst+ymjogCguo8fANFVlPWeZGeoBTL+aQfQ
 yTkAoLE0codmh+2SvhulYgyU1Wh6ZDK2
 =nJ2F
 -----END PGP SIGNATURE-----

Merge tag 'staging-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging tree changes from Greg Kroah-Hartman:
 "Here is the big staging tree pull request for the 3.5-rc1 merge
  window.

  Loads of changes here, and we just narrowly added more lines than we
  added:
   622 files changed, 28356 insertions(+), 26059 deletions(-)

  But, good news is that there is a number of subsystems that moved out
  of the staging tree, to their respective "real" portions of the
  kernel.

  Code that moved out was:
	- iio core code
	- mei driver
	- vme core and bridge drivers

  There was one broken network driver that moved into staging as a step
  before it is removed from the tree (pc300), and there was a few new
  drivers added to the tree:
	- new iio drivers
	- gdm72xx wimax USB driver
	- ipack subsystem and 2 drivers

  All of the movements around have acks from the various subsystem
  maintainers, and all of this has been in the linux-next tree for a
  while.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fixed up various trivial conflicts, along with a non-trivial one found
in -next and pointed out by Olof Johanssen: a clean - but incorrect -
merge of the arch/arm/boot/dts/at91sam9g20.dtsi file.  Fix up manually
as per Stephen Rothwell.

* tag 'staging-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (536 commits)
  Staging: bcm: Remove two unused variables from Adapter.h
  Staging: bcm: Removes the volatile type definition from Adapter.h
  Staging: bcm: Rename all "INT" to "int" in Adapter.h
  Staging: bcm: Fix warning: __packed vs. __attribute__((packed)) in Adapter.h
  Staging: bcm: Correctly format all comments in Adapter.h
  Staging: bcm: Fix all whitespace issues in Adapter.h
  Staging: bcm: Properly format braces in Adapter.h
  Staging: ipack/bridges/tpci200: remove unneeded casts
  Staging: ipack/bridges/tpci200: remove TPCI200_SHORTNAME constant
  Staging: ipack: remove board_name and bus_name fields from struct ipack_device
  Staging: ipack: improve the register of a bus and a device in the bus.
  staging: comedi: cleanup all the comedi_driver 'detach' functions
  staging: comedi: remove all 'default N' in Kconfig
  staging: line6/config.h: Delete unused header
  staging: gdm72xx depends on NET
  staging: gdm72xx: Set up parent link in sysfs for gdm72xx devices
  staging: drm/omap: initial dmabuf/prime import support
  staging: drm/omap: dmabuf/prime mmap support
  pstore/ram: Add ECC support
  pstore/ram: Switch to persistent_ram routines
  ...
2012-05-22 16:34:21 -07:00
Linus Torvalds
5d4e2d08e7 Driver core pull for 3.5-rc1
Here's the driver core, and other driver subsystems, pull request for
 the 3.5-rc1 merge window.
 
 Outside of a few minor driver core changes, we ended up with the
 following different subsystem and core changes as well, due to
 interdependancies on the driver core:
  - hyperv driver updates
  - drivers/memory being created and some drivers moved into it
  - extcon driver subsystem created out of the old Android staging switch
    driver code
  - dynamic debug updates
  - printk rework, and /dev/kmsg changes
 
 All of this has been tested in the linux-next releases for a few weeks
 with no reported problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk+7q28ACgkQMUfUDdst+ykXmwCfcPASzC+/bDkuqdWsqzxlWZ7+
 VOQAnAriySv397St36J6Hz5bMQZwB1Yq
 =SQc+
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg Kroah-Hartman:
 "Here's the driver core, and other driver subsystems, pull request for
  the 3.5-rc1 merge window.

  Outside of a few minor driver core changes, we ended up with the
  following different subsystem and core changes as well, due to
  interdependancies on the driver core:
   - hyperv driver updates
   - drivers/memory being created and some drivers moved into it
   - extcon driver subsystem created out of the old Android staging
     switch driver code
   - dynamic debug updates
   - printk rework, and /dev/kmsg changes

  All of this has been tested in the linux-next releases for a few weeks
  with no reported problems.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fix up conflicts in drivers/extcon/extcon-max8997.c where git noticed
that a patch to the deleted drivers/misc/max8997-muic.c driver needs to
be applied to this one.

* tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (90 commits)
  uio_pdrv_genirq: get irq through platform resource if not set otherwise
  memory: tegra{20,30}-mc: Remove empty *_remove()
  printk() - isolate KERN_CONT users from ordinary complete lines
  sysfs: get rid of some lockdep false positives
  Drivers: hv: util: Properly handle version negotiations.
  Drivers: hv: Get rid of an unnecessary check in vmbus_prep_negotiate_resp()
  memory: tegra{20,30}-mc: Use dev_err_ratelimited()
  driver core: Add dev_*_ratelimited() family
  Driver Core: don't oops with unregistered driver in driver_find_device()
  printk() - restore prefix/timestamp printing for multi-newline strings
  printk: add stub for prepend_timestamp()
  ARM: tegra30: Make MC optional in Kconfig
  ARM: tegra20: Make MC optional in Kconfig
  ARM: tegra30: MC: Remove unnecessary BUG*()
  ARM: tegra20: MC: Remove unnecessary BUG*()
  printk: correctly align __log_buf
  ARM: tegra30: Add Tegra Memory Controller(MC) driver
  ARM: tegra20: Add Tegra Memory Controller(MC) driver
  printk() - restore timestamp printing at console output
  printk() - do not merge continuation lines of different threads
  ...
2012-05-22 16:02:13 -07:00
Thomas Gleixner
b80fe1015b Merge branch 'fortglx/3.5/time' of git://git.linaro.org/people/jstultz/linux into timers/core 2012-05-22 10:30:39 +02:00
Al Viro
68f3f16d9a new helper: sigsuspend()
guts of saved_sigmask-based sigsuspend/rt_sigsuspend.  Takes
kernel sigset_t *.

Open-coded instances replaced with calling it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-21 23:52:30 -04:00
Linus Torvalds
471368557a Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core irq changes from Ingo Molnar:
 "A collection of small fixes."

By Thomas Gleixner
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hexagon: Remove select of not longer existing Kconfig switches
  arm: Select core options instead of redefining them
  genirq: Do not consider disabled wakeup irqs
  genirq: Allow check_wakeup_irqs to notice level-triggered interrupts
  genirq: Be more informative on irq type mismatch
  genirq: Reject bogus threaded irq requests
  genirq: Streamline irq_action
2012-05-21 20:33:19 -07:00
Linus Torvalds
cb60e3e65c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "New notable features:
   - The seccomp work from Will Drewry
   - PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
   - Longer security labels for Smack from Casey Schaufler
   - Additional ptrace restriction modes for Yama by Kees Cook"

Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
  apparmor: fix long path failure due to disconnected path
  apparmor: fix profile lookup for unconfined
  ima: fix filename hint to reflect script interpreter name
  KEYS: Don't check for NULL key pointer in key_validate()
  Smack: allow for significantly longer Smack labels v4
  gfp flags for security_inode_alloc()?
  Smack: recursive tramsmute
  Yama: replace capable() with ns_capable()
  TOMOYO: Accept manager programs which do not start with / .
  KEYS: Add invalidation support
  KEYS: Do LRU discard in full keyrings
  KEYS: Permit in-place link replacement in keyring list
  KEYS: Perform RCU synchronisation on keys prior to key destruction
  KEYS: Announce key type (un)registration
  KEYS: Reorganise keys Makefile
  KEYS: Move the key config into security/keys/Kconfig
  KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
  Yama: remove an unused variable
  samples/seccomp: fix dependencies on arch macros
  Yama: add additional ptrace scopes
  ...
2012-05-21 20:27:36 -07:00
Linus Torvalds
bf67f3a5c4 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug cleanups from Thomas Gleixner:
 "This series is merily a cleanup of code copied around in arch/* and
  not changing any of the real cpu hotplug horrors yet.  I wish I'd had
  something more substantial for 3.5, but I underestimated the lurking
  horror..."

Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
arch/sparc/include/asm/thread_info_32.h

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
  um: Remove leftover declaration of alloc_task_struct_node()
  task_allocator: Use config switches instead of magic defines
  sparc: Use common threadinfo allocator
  score: Use common threadinfo allocator
  sh-use-common-threadinfo-allocator
  mn10300: Use common threadinfo allocator
  powerpc: Use common threadinfo allocator
  mips: Use common threadinfo allocator
  hexagon: Use common threadinfo allocator
  m32r: Use common threadinfo allocator
  frv: Use common threadinfo allocator
  cris: Use common threadinfo allocator
  x86: Use common threadinfo allocator
  c6x: Use common threadinfo allocator
  fork: Provide kmemcache based thread_info allocator
  tile: Use common threadinfo allocator
  fork: Provide weak arch_release_[task_struct|thread_info] functions
  fork: Move thread info gfp flags to header
  fork: Remove the weak insanity
  sh: Remove cpu_idle_wait()
  ...
2012-05-21 19:43:57 -07:00
Linus Torvalds
226da0dbc8 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
 "This is the v3.5 RCU tree from Paul E.  McKenney:

 1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature (with
    more on the way for 3.6).  Posted to LKML:
       https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
       https://lkml.org/lkml/2012/4/16/611 (commit 4),
       https://lkml.org/lkml/2012/4/30/390 (commit 6), and
       https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
       the other commits for the convenience of the tester).

 2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
    that have no RCU callbacks.  Posted to LKML:
       https://lkml.org/lkml/2012/4/23/322.

 3) A couple of commits that improve the efficiency of the interaction
    between preemptible RCU and the scheduler, these two being all that
    survived an abortive attempt to allow preemptible RCU's
    __rcu_read_lock() to be inlined.  The full set was posted to LKML at
    https://lkml.org/lkml/2012/4/14/143, and the first and third patches
    of that set remain.

 4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
    call_srcu() and srcu_barrier().  A major feature of this new
    implementation is that synchronize_srcu() no longer disturbs the
    execution of other CPUs.  This work is based on earlier
    implementations by Peter Zijlstra and Paul E.  McKenney.  Posted to
    LKML: https://lkml.org/lkml/2012/2/22/82.

 5) A number of miscellaneous bug fixes and improvements which were
    posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
    subsequent updates posted to LKML."

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  rcu: Make rcu_barrier() less disruptive
  rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables
  rcu: Make RCU_FAST_NO_HZ handle timer migration
  rcu: Update RCU maintainership
  rcu: Make exit_rcu() more precise and consolidate
  rcu: Move PREEMPT_RCU preemption to switch_to() invocation
  rcu: Ensure that RCU_FAST_NO_HZ timers expire on correct CPU
  rcu: Add rcutorture test for call_srcu()
  rcu: Implement per-domain single-threaded call_srcu() state machine
  rcu: Use single value to handle expedited SRCU grace periods
  rcu: Improve srcu_readers_active_idx()'s cache locality
  rcu: Remove unused srcu_barrier()
  rcu: Implement a variant of Peter's SRCU algorithm
  rcu: Improve SRCU's wait_idx() comments
  rcu: Flip ->completed only once per SRCU grace period
  rcu: Increment upper bit only for srcu_read_lock()
  rcu: Remove fast check path from __synchronize_srcu()
  rcu: Direct algorithmic SRCU implementation
  rcu: Introduce rcutorture testing for rcu_barrier()
  timer: Fix mod_timer_pinned() header comment
  ...
2012-05-21 19:26:51 -07:00
Linus Torvalds
5ec29e3149 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking updates from Ingo Molnar:
 "This update:

   - extends and simplifies x86 NMI callback handling code to enhance
     and fix the HP hw-watchdog driver

   - simplifies the x86 NMI callback handling code to fix a kmemcheck
     bug.

   - enhances the hung-task debugger"

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Fix the type of the nmiaction.flags field
  x86/nmi: Fix page faults by nmiaction if kmemcheck is enabled
  x86/nmi: Add new NMI queues to deal with IO_CHK and SERR
  watchdog, hpwdt: Remove priority option for NMI callback
  hung task debugging: Inject NMI when hung and going to panic
2012-05-21 19:25:14 -07:00
Richard Cochran
d239f49d77 timekeeping: Fix a few minor newline issues.
Fix a few minor newline issues.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-05-21 16:20:32 -07:00
Richard Cochran
cd5398bed9 ntp: Fix a stale comment and a few stray newlines.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-05-21 16:16:56 -07:00
Richard Cochran
dd48d708ff ntp: Correct TAI offset during leap second
When repeating a UTC time value during a leap second (when the UTC
time should be 23:59:60), the TAI timescale should not stop. The kernel
NTP code increments the TAI offset one second too late. This patch fixes
the issue by incrementing the offset during the leap second itself.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-05-21 16:16:54 -07:00
Thomas Gleixner
764e0da14f timers: Fixup the Kconfig consolidation fallout
Sigh, I missed to check which architecture Kconfig files actually
include the core Kconfig file. There are a few which did not. So we
broke them.

Instead of adding the includes to those, we are better off to move the
include to init/Kconfig like we did already with irqs and others.

This does not change anything for the architectures using the old
style periodic timer mode. It just solves the build wreckage there.

For those architectures which use the clock events infrastructure it
moves the include of the core Kconfig file to "General setup" which is
a way more logical place than having it at random locations specified
by the architecture specific Kconfigs.

Reported-by: Ingo Molnar <mingo@kernel.org>
Cc: Anna-Maria Gleixner <anna-maria@glx-um.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-21 23:43:46 +02:00
Richard Weinberger
875682648b irq: Remove irq_chip->release()
As it's only user (UML) does no longer need it we can get
rid of it.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-21 21:09:38 +02:00
Thomas Gleixner
b5e498ad67 timers: Provide generic Kconfig switches
We really don't want all the arch code defining stuff
over and over.

[ anna-maria: Added missing GENERIC_CMOS_UPDATE switch ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@glx-um.de>
Cc: Paul Mundt <lethal@linux-sh.org>
Link: http://lkml.kernel.org/r/1337529587.3208.2.camel@dionysos
Acked-by: Sam Ravnborg <sam@ravnborg.org>
2012-05-21 11:01:40 +02:00
Ingo Molnar
6f5e3577d4 Merge branch 'tip/perf/core-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core 2012-05-21 09:44:36 +02:00
Ingo Molnar
bb27f55eb9 Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Fixes for perf/core:

 - Rename some perf_target methods to avoid double negation, from Namhyung Kim.
 - Revert change to use per task events with inheritance, from Namhyung Kim.
 - Events should start disabled till children starts running, from David Ahern.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-21 09:17:50 +02:00
Eric W. Biederman
4b06a81f1d userns: Silence silly gcc warning.
On 32bit builds gcc says:
kernel/user.c:30:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]
kernel/user.c:38:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]

Silence gcc by changing the constant 4294967295 to 4294967295U.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-19 15:44:40 -06:00
Mark Brown
a87487e687 irqdomain: Document size parameter of irq_domain_add_linear()
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-05-19 13:07:51 -06:00
Paul Mundt
54a9058860 irqdomain: trivial pr_fmt conversion.
Convert to pr_fmt before things start to get out of hand and some
janitors start getting overly excited.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-05-19 12:59:15 -06:00
Paul Mundt
5c5806e50b irqdomain: Make irq_domain_simple_map() static.
Presently irq_domain_simple_map() isn't labelled as static, but there's
no definition for it in the public irqdomain header either. At present
all in-tree ->map users have meaningful work to do, and all others are
using irq_domain_simple_ops directly. Make it static for now, as it can
always be exported and added to the public API later.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-05-19 12:35:33 -06:00
Paul Mundt
ecd84eb20a irqdomain: Export remaining public API symbols.
modules making use of irq domains at the very least need access to the
add/remove/lookup routines, though there's nothing preventing them from
using the remainder of the public API, either.

The current set of exports seem primarily geared at DT-enabled platforms
using DT-backed IRQ domains, where many of the API accesses are hidden
away in OF code. The non-DT cases need to do most of this on their own.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-05-19 12:34:38 -06:00
Paul Mundt
58ee99ada2 irqdomain: Support removal of IRQ domains.
Now that IRQ domains are being used by modules it's necessary to support
removing them, too. This adds a new irq_domain_remove() routine for doing
the bulk of the heavy lifting. It's left as an exercise to the caller to
ensure all mappings have been appropriatey disposed of before attempting
to remove the domain.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-05-19 12:32:35 -06:00
Richard Weinberger
895b67fd58 tracing: Remove kernel_lock annotations
The BKL is gone, these annotations are useless.

Link: http://lkml.kernel.org/r/1320654202-4433-1-git-send-email-richard@nod.at

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-19 08:28:51 -04:00
Vaibhav Nagarnaik
a591c73f12 tracing: Fix initial buffer_size_kb state
Make sure that the state of buffer_size_kb is initialized correctly and
returns actual size of the ring buffer.

Link: http://lkml.kernel.org/r/1336066834-1673-1-git-send-email-vnagarnaik@google.com

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-19 08:28:50 -04:00
Vaibhav Nagarnaik
05fdd70d2f ring-buffer: Merge separate resize loops
There are 2 separate loops to resize cpu buffers that are online and
offline. Merge them to make the code look better.

Also change the name from update_completion to update_done to allow
shorter lines.

Link: http://lkml.kernel.org/r/1337372991-14783-1-git-send-email-vnagarnaik@google.com

Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-19 08:28:50 -04:00
Minho Ban
2df83fa4bc PM / Hibernate: Use get_gendisk to verify partition if resume_file is integer format
Sometimes resume= parameter comes in integer style (e.g. major:minor)
and then name_to_dev_t can not detect partition properly. (especially
async device like usb, mmc).

This patch calls get_gendisk() if resumewait is true and resume_file
is in integer format to work around this problem.

Signed-off-by: Minho Ban <mhban@samsung.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-05-18 20:44:59 +02:00
Arnaldo Carvalho de Melo
16ee6576e2 Merge remote-tracking branch 'tip/perf/urgent' into perf/core
Merge reason: We are going to queue up a dependent patch:

"perf tools: Move parse event automated tests to separated object"

That depends on:

commit e7c72d8
perf tools: Add 'G' and 'H' modifiers to event parsing

Conflicts:
	tools/perf/builtin-stat.c

Conflicted with the recent 'perf_target' patches when checking the
result of perf_evsel open routines to see if a retry is needed to cope
with older kernels where the exclude guest/host perf_event_attr bits
were not used.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2012-05-18 13:13:33 -03:00
Seiji Aguchi
62be73eafa kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop()
This patch moves kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop(),
to serialize the crash-logging process via smp_send_stop() and to
thus retrieve a more stable crash image of all CPUs stopped.

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: dle-develop@lists.sourceforge.net <dle-develop@lists.sourceforge.net>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: a.p.zijlstra@chello.nl <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/5C4C569E8A4B9B42A84A977CF070A35B2E4D7A5CE2@USINDEVS01.corp.hds.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-18 14:02:10 +02:00
Konstantin Khlebnikov
1c2927f185 sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
Usually sleep-in-atomic bugs are followed by dozens other warnings.
This patch should help to figure out original source of problem.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120510122004.4873.12726.stgit@zurg
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-18 13:07:40 +02:00
Sasha Levin
8ca937a668 cred: use correct cred accessor with regards to rcu read lock
Commit "userns: Convert setting and getting uid and gid system calls to use
kuid and kgid has modified the accessors in wait_task_continued() and
wait_task_stopped() to use __task_cred() instead of task_uid().

__task_cred() assumes that we're inside a rcu read lock, which is untrue
for these two functions.

Modify it to use task_uid() instead.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-17 16:50:06 -06:00
Linus Torvalds
31ae98359d Merge branches 'perf-urgent-for-linus', 'x86-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf, x86 and scheduler updates from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracing: Do not enable function event with enable
  perf stat: handle ENXIO error for perf_event_open
  perf: Turn off compiler warnings for flex and bison generated files
  perf stat: Fix case where guest/host monitoring is not supported by kernel
  perf build-id: Fix filename size calculation

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, kvm: KVM paravirt kernels don't check for CPUID being unavailable
  x86: Fix section annotation of acpi_map_cpu2node()
  x86/microcode: Ensure that module is only loaded on supported Intel CPUs

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption
2012-05-17 09:35:17 -07:00
Peter Zijlstra
8e7fbcbc22 sched: Remove stale power aware scheduling remnants and dysfunctional knobs
It's been broken forever (i.e. it's not scheduling in a power
aware fashion), as reported by Suresh and others sending
patches, and nobody cares enough to fix it properly ...
so remove it to make space free for something better.

There's various problems with the code as it stands today, first
and foremost the user interface which is bound to topology
levels and has multiple values per level. This results in a
state explosion which the administrator or distro needs to
master and almost nobody does.

Furthermore large configuration state spaces aren't good, it
means the thing doesn't just work right because it's either
under so many impossibe to meet constraints, or even if
there's an achievable state workloads have to be aware of
it precisely and can never meet it for dynamic workloads.

So pushing this kind of decision to user-space was a bad idea
even with a single knob - it's exponentially worse with knobs
on every node of the topology.

There is a proposal to replace the user interface with a single
3 state knob:

 sched_balance_policy := { performance, power, auto }

where 'auto' would be the preferred default which looks at things
like Battery/AC mode and possible cpufreq state or whatever the hw
exposes to show us power use expectations - but there's been no
progress on it in the past many months.

Aside from that, the actual implementation of the various knobs
is known to be broken. There have been sporadic attempts at
fixing things but these always stop short of reaching a mergable
state.

Therefore this wholesale removal with the hopes of spurring
people who care to come forward once again and work on a
coherent replacement.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-17 13:48:56 +02:00
Ingo Molnar
fac536f7e4 Merge branch 'sched/urgent' into sched/core
Merge reason: bring together all the pending scheduler bits,
              for the sched/numa changes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-17 12:17:12 +02:00
Steven Rostedt
b732d439cb ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER
The function tracer will enable the -pg option with gcc, which requires
that frame pointers. When FRAME_POINTER is defined in the kernel config
it adds the gcc option -fno-omit-frame-pointer which causes some problems
on some architectures. For those architectures, the FRAME_POINTER select
was not set.

When FUNCTION_TRACER was selected on these architectures that can not have
-fno-omit-frame-pointer, the -pg option is still set. But when
FRAME_POINTER is not selected, the kernel config would add the gcc option
-fomit-frame-pointer. Adding this option is incompatible with -pg
even on archs that do not need frame pointers with -pg.

The answer to this was to just not add either -fno-omit-frame-pointer
or -fomit-frame-pointer on these archs that want function tracing
but do not set FRAME_POINTER.

As it turns out, for archs that require frame pointers for function
tracing, the same can be used. If gcc requires frame pointers with
-pg, it will simply add it. The best thing to do is not select FRAME_POINTER
when function tracing is selected, and let gcc add it if needed.

Only add the -fno-omit-frame-pointer when something else selects
FRAME_POINTER, but do not add -fomit-frame-pointer if function tracing
is selected.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 20:00:29 -04:00
Steven Rostedt
e4f5d5440b ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code()
To remove duplicate code, have the ftrace arch_ftrace_update_code()
use the generic ftrace_modify_all_code(). This requires that the
default ftrace_replace_code() becomes a weak function so that an
arch may override it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 20:00:27 -04:00
Steven Rostedt
8ed3e2cfe4 ftrace: Make ftrace_modify_all_code() global for archs to use
Rename __ftrace_modify_code() to ftrace_modify_all_code() and make
it global for all archs to use. This will remove the duplication
of code, as archs that can modify code without stop_machine()
can use it directly outside of the stop_machine() call.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 20:00:26 -04:00
Steven Rostedt
f0cf973a22 ftrace: Return record ip addr for ftrace_location()
ftrace_location() is passed an addr, and returns 1 if the addr is
on a ftrace nop (or caller to ftrace_caller), and 0 otherwise.

To let kprobes know if it should move a breakpoint or not, it
must return the actual addr that is the start of the ftrace nop.
This way a kprobe placed on the location of a ftrace nop, can
instead be placed on the instruction after the nop. Even if the
probe addr is on the second or later byte of the nop, it can
simply be moved forward.

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 19:58:49 -04:00
Steven Rostedt
a650e02a52 ftrace: Consolidate ftrace_location() and ftrace_text_reserved()
Both ftrace_location() and ftrace_text_reserved() do basically the same thing.
They search to see if an address is in the ftace table (contains an address
that may change from nop to call ftrace_caller). The difference is
that ftrace_location() searches a single address, but ftrace_text_reserved()
searches a range.

This also makes the ftrace_text_reserved() faster as it now uses a bsearch()
instead of linearly searching all the addresses within a page.

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 19:58:48 -04:00
Steven Rostedt
9644302e33 ftrace: Speed up search by skipping pages by address
As all records in a page of the ftrace table are sorted, we can
speed up the search algorithm by checking if the address to look for
falls in between the first and last record ip on the page.

This speeds up both the ftrace_location() and ftrace_text_reserved()
algorithms, as it can skip full pages when the search address is
not in them.

Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 19:58:46 -04:00
Steven Rostedt
706c81f87f ftrace: Remove extra helper functions
The ftrace_record_ip() and ftrace_alloc_dyn_node() were from the
time of the ftrace daemon. Although they were still used, they
still make things a bit more complex than necessary.

Move the code into the one function that uses it, and remove the
helper functions.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 19:58:45 -04:00
Steven Rostedt
9fd49328fc ftrace: Sort all function addresses, not just per page
Instead of just sorting the ip's of the functions per ftrace page,
sort the entire list before adding them to the ftrace pages.

This will allow the bsearch algorithm to be sped up as it can
also sort by pages, not just records within a page.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 19:58:44 -04:00
Vaibhav Nagarnaik
71babb2705 tracing: change CPU ring buffer state from tracing_cpumask
According to Documentation/trace/ftrace.txt:

tracing_cpumask:

        This is a mask that lets the user only trace
        on specified CPUS. The format is a hex string
        representing the CPUS.

The tracing_cpumask currently doesn't affect the tracing state of
per-CPU ring buffers.

This patch enables/disables CPU recording as its corresponding bit in
tracing_cpumask is set/unset.

Link: http://lkml.kernel.org/r/1336096792-25373-3-git-send-email-vnagarnaik@google.com

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 19:50:38 -04:00
Namhyung Kim
0a3d7ce7e6 tracing: Check return value of tracing_dentry_percpu()
If tracing_dentry_percpu() failed, tracing_init_debugfs_percpu()
will try to create each cpu directories on debugfs' root directory
as d_percpu is NULL.

Link: http://lkml.kernel.org/r/1335143517-2285-1-git-send-email-namhyung.kim@lge.com

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 19:50:37 -04:00
Steven Rostedt
308f7eeb78 ring-buffer: Reset head page before running self test
When the ring buffer does its consistency test on itself, it
removes the head page, runs the tests, and then adds it back
to what the "head_page" pointer was. But because the head_page
pointer may lack behind the real head page (held by the link
list pointer). The reset may be incorrect.

Instead, if the head_page exists (it does not on first allocation)
reset it back to the real head page before running the consistency
tests. Then it will be put back to its original location after
the tests are complete.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 19:50:36 -04:00
Steven Rostedt
659f451ff2 ring-buffer: Add integrity check at end of iter read
There use to be ring buffer integrity checks after updating the
size of the ring buffer. But now that the ring buffer can modify
the size while the system is running, the integrity checks were
removed, as they require the ring buffer to be disabed to perform
the check.

Move the integrity check to the reading of the ring buffer via the
iterator reads (the "trace" file). As reading via an iterator requires
disabling the ring buffer, it is a perfect place to have it.

If the ring buffer happens to be disabled when updating the size,
we still perform the integrity check.

Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 19:50:23 -04:00
Suresh Siddha
55ccf3fe3f fork: move the real prepare_to_copy() users to arch_dup_task_struct()
Historical prepare_to_copy() is mostly a no-op, duplicated for majority of
the architectures and the rest following the x86 model of flushing the extended
register state like fpu there.

Remove it and use the arch_dup_task_struct() instead.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1336692811-30576-1-git-send-email-suresh.b.siddha@intel.com
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-05-16 15:16:26 -07:00
Vaibhav Nagarnaik
5040b4b7bc ring-buffer: Make addition of pages in ring buffer atomic
This patch adds the capability to add new pages to a ring buffer
atomically while write operations are going on. This makes it possible
to expand the ring buffer size without reinitializing the ring buffer.

The new pages are attached between the head page and its previous page.

Link: http://lkml.kernel.org/r/1336096792-25373-2-git-send-email-vnagarnaik@google.com

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 16:25:51 -04:00
Vaibhav Nagarnaik
83f40318da ring-buffer: Make removal of ring buffer pages atomic
This patch adds the capability to remove pages from a ring buffer
without destroying any existing data in it.

This is done by removing the pages after the tail page. This makes sure
that first all the empty pages in the ring buffer are removed. If the
head page is one in the list of pages to be removed, then the page after
the removed ones is made the head page. This removes the oldest data
from the ring buffer and keeps the latest data around to be read.

To do this in a non-racey manner, tracing is stopped for a very short
time while the pages to be removed are identified and unlinked from the
ring buffer. The pages are freed after the tracing is restarted to
minimize the time needed to stop tracing.

The context in which the pages from the per-cpu ring buffer are removed
runs on the respective CPU. This minimizes the events not traced to only
NMI trace contexts.

Link: http://lkml.kernel.org/r/1336096792-25373-1-git-send-email-vnagarnaik@google.com

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 16:18:57 -04:00
Steven Rostedt
6edb2a8a38 tracing: Clean up tracing_mark_write()
On gcc 4.5 the function tracing_mark_write() would give a warning
of page2 being uninitialized. This is due to a bug in gcc because
the logic prevents page2 from being used uninitialized, and
gcc 4.6+ does not complain (correctly).

Instead of adding a "unitialized" around page2, which could show
a bug later on, I combined page1 and page2 into an array map_pages[].
This binds the two and the two are modified according to nr_pages
(what gcc 4.5 seems to ignore). This no longer gives a warning with
gcc 4.5 nor with gcc 4.6.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-16 16:18:57 -04:00
Eric W. Biederman
14a590c3f9 userns: Convert cgroup permission checks to use uid_eq
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15 14:59:30 -07:00
Eric W. Biederman
54ba47edac userns: signal remove unnecessary map_cred_ns
map_cred_ns is a light wrapper around from_kuid with the order of the arguments
reversed.  Replace map_cred_ns with from_kuid and remove map_cred_ns.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15 14:59:24 -07:00
Eric W. Biederman
65cc5a17ad userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15 14:59:24 -07:00
Jiri Kosina
3911ff30f5 genirq: export handle_edge_irq() and irq_to_desc()
Export handle_edge_irq() and irq_to_desc() to modules to allow them to
do things such as

	__irq_set_handler_locked(...., handle_edge_irq);

This fixes

	ERROR: "handle_edge_irq" [drivers/gpio/gpio-pch.ko] undefined!
	ERROR: "irq_to_desc" [drivers/gpio/gpio-pch.ko] undefined!

when gpio-pch is being built as a module.

This was introduced by commit df9541a60a ("gpio: pch9: Use proper flow
type handlers") that added

	__irq_set_handler_locked(d->irq, handle_edge_irq);

but handle_edge_irq() was not exported for modules (and inlined
__irq_set_handler_locked() requires irq_to_desc() exported as well)

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-15 08:10:07 -07:00
Peter Zijlstra
4d82a1debb lockdep: fix oops in processing workqueue
Under memory load, on x86_64, with lockdep enabled, the workqueue's
process_one_work() has been seen to oops in __lock_acquire(), barfing
on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].

Because it's permissible to free a work_struct from its callout function,
the map used is an onstack copy of the map given in the work_struct: and
that copy is made without any locking.

Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
"rep movsq" for that structure copy: which might race with a workqueue
user's wait_on_work() doing lock_map_acquire() on the source of the
copy, putting a pointer into the class_cache[], but only in time for
the top half of that pointer to be copied to the destination map.

Boom when process_one_work() subsequently does lock_map_acquire()
on its onstack copy of the lockdep_map.

Fix this, and a similar instance in call_timer_fn(), with a
lockdep_copy_map() function which additionally NULLs the class_cache[].

Note: this oops was actually seen on 3.4-next, where flush_work() newly
does the racing lock_map_acquire(); but Tejun points out that 3.4 and
earlier are already vulnerable to the same through wait_on_work().

* Patch orginally from Peter.  Hugh modified it a bit and wrote the
  description.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Hugh Dickins <hughd@google.com>
LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-15 08:08:31 -07:00
Mauro Carvalho Chehab
69ecdbac14 Merge remote-tracking branch 'linus/master' into staging/for_v3.5
* linus/master: (805 commits)
  tty: Fix LED error return
  openvswitch: checking wrong variable in queue_userspace_packet()
  bonding: Fix LACPDU rx_dropped commit.
  Linux 3.4-rc7
  ARM: EXYNOS: fix ctrlbit for exynos5_clk_pdma1
  ARM: EXYNOS: use s5p-timer for UniversalC210 board
  ARM / mach-shmobile: Invalidate caches when booting secondary cores
  ARM / mach-shmobile: sh73a0 SMP TWD boot regression fix
  ARM / mach-shmobile: r8a7779 SMP TWD boot regression fix
  ARM: mach-shmobile: convert ag5evm to use the generic MMC GPIO hotplug helper
  ARM: mach-shmobile: convert mackerel to use the generic MMC GPIO hotplug helper
  MAINTAINERS: Add myself as the cpufreq maintainer
  dm mpath: check if scsi_dh module already loaded before trying to load
  dm thin: correct module description
  dm thin: fix unprotected use of prepared_discards list
  dm thin: reinstate missing mempool_free in cell_release_singleton
  gpio/exynos: Fix compiler warnings when non-exynos machines are selected
  gpio: pch9: Use proper flow type handlers
  powerpc/irq: Fix another case of lazy IRQ state getting out of sync
  ks8851: Update link status during link change interrupt
  ...

Conflicts:
	drivers/media/common/tuners/xc5000.c
	drivers/media/common/tuners/xc5000.h
	drivers/usb/gadget/uvc_queue.c
2012-05-15 08:39:25 -03:00
Tejun Heo
544ecf310f workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is active
worker_enter_idle() has WARN_ON_ONCE() which triggers if nr_running
isn't zero when every worker is idle.  This can trigger spuriously
while a cpu is going down due to the way trustee sets %WORKER_ROGUE
and zaps nr_running.

It first sets %WORKER_ROGUE on all workers without updating
nr_running, releases gcwq->lock, schedules, regrabs gcwq->lock and
then zaps nr_running.  If the last running worker enters idle
inbetween, it would see stale nr_running which hasn't been zapped yet
and trigger the WARN_ON_ONCE().

Fix it by performing the sanity check iff the trustee is idle.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
2012-05-14 15:04:50 -07:00
Kay Sievers
c313af145b printk() - isolate KERN_CONT users from ordinary complete lines
Arrange the continuation printk() buffering to be fully separated from the
ordinary full line users.

Limit the exposure to races and wrong printk() line merges to users of
continuation only. Ordinary full line users racing against continuation
users will no longer affect each other.

Multiple continuation users from different threads, racing against each
other will not wrongly be merged into a single line, but printed as
separate lines.

Test output of a kernel module which starts two separate threads which
race against each other, one of them printing a single full terminated
line:
  printk("(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)\n");

The other one printing the line, every character separate in a
continuation loop:
  printk("(C");
  for (i = 0; i < 58; i++)
          printk(KERN_CONT "C");
  printk(KERN_CONT "C)\n");

Behavior of single and non-thread-aware printk() buffer:
  # modprobe printk-race
  printk test init
  (CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
  (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)

New behavior with separate and thread-aware continuation buffer:
  # modprobe printk-race
  printk test init
  (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
  (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
  (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
  (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Ingo Molnar  <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-14 12:36:45 -07:00
Kay Sievers
3ce9a7c0ac printk() - restore prefix/timestamp printing for multi-newline strings
Calls like:
  printk("\n *** DEADLOCK ***\n\n");
will print 3 properly indented, separated, syslog + timestamp prefixed lines in
the log output.

Reported-By: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-14 08:42:22 -07:00
Peter Zijlstra
13e099d2f7 sched/debug: Fix printing large integers on 32-bit platforms
Some numbers like nr_running and nr_uninterruptible are fundamentally
unsigned since its impossible to have a negative amount of tasks, yet
we still print them as signed to easily recognise the underflow
condition.

rq->nr_uninterruptible has 'special' accounting and can in fact very
easily become negative on a per-cpu basis.

It was noted that since the P() macro assumes things are long long and
the promotion of unsigned 'int/long' to long long on 32bit doesn't
sign extend we print silly large numbers instead of the easier to read
signed numbers.

Therefore extend the P() macro to not require the sign extention.

Reported-by: Diwakar Tundlam <dtundlam@nvidia.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-gk5tm8t2n4ix2vkpns42uqqp@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14 15:05:28 +02:00
Peter Zijlstra
e44bc5c5d0 sched/fair: Improve the ->group_imb logic
Group imbalance is meant to deal with situations where affinity masks
and sched domains don't align well, such as 3 cpus from one group and
6 from another. In this case the domain based balancer will want to
put an equal amount of tasks on each side even though they don't have
equal cpus.

Currently group_imb is set whenever two cpus of a group have a weight
difference of at least one avg task and the heaviest cpu has at least
two tasks. A group with imbalance set will always be picked as busiest
and a balance pass will be forced.

The problem is that even if there are no affinity masks this stuff can
trigger and cause weird balancing decisions, eg. the observed
behaviour was that of 6 cpus, 5 had 2 and 1 had 3 tasks, due to the
difference of 1 avg load (they all had the same weight) and nr_running
being >1 the group_imbalance logic triggered and did the weird thing
of pulling more load instead of trying to move the 1 excess task to
the other domain of 6 cpus that had 5 cpu with 2 tasks and 1 cpu with
1 task.

Curb the group_imbalance stuff by making the nr_running condition
weaker by also tracking the min_nr_running and using the difference in
nr_running over the set instead of the absolute max nr_running.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14 15:05:28 +02:00
Peter Zijlstra
556061b00c sched/nohz: Fix rq->cpu_load[] calculations
While investigating why the load-balancer did funny I found that the
rq->cpu_load[] tables were completely screwy.. a bit more digging
revealed that the updates that got through were missing ticks followed
by a catchup of 2 ticks.

The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc..) this means that
esp. the higher indices were significantly lower than they ought to
be.

The reason for this is that its not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.

This patch cludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14 15:05:27 +02:00
Peter Zijlstra
870a0bb5d6 sched/numa: Don't scale the imbalance
It's far too easy to get ridiculously large imbalance pct when you
scale it like that. Use a fixed 125% for now.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-zsriaft1dv7hhboyrpvqjy6s@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14 15:05:26 +02:00
Peter Zijlstra
04f733b4af sched/fair: Revert sched-domain iteration breakage
Patches c22402a2f ("sched/fair: Let minimally loaded cpu balance the
group") and 0ce90475 ("sched/fair: Add some serialization to the
sched_domain load-balance walk") are horribly broken so revert them.

The problem is that while it sounds good to have the minimally loaded
cpu do the pulling of more load, the way we walk the domains there is
absolutely no guarantee this cpu will actually get to the domain. In
fact its very likely it wont. Therefore the higher up the tree we get,
the less likely it is we'll balance at all.

The first of mask always walks up, while sucky in that it accumulates
load on the first cpu and needs extra passes to spread it out at least
guarantees a cpu gets up that far and load-balancing happens at all.

Since its now always the first and idle cpus should always be able to
balance so they get a task as fast as possible we can also do away
with the added serialization.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14 15:05:26 +02:00
Peter Zijlstra
dd7d8634e6 sched/numa: Fix the new NUMA topology bits
There's no need to convert a node number to a node number by
pretending its a cpu number..

Reported-by: Yinghai Lu <yinghai@kernel.org>
Reported-and-Tested-by: Greg Pearson <greg.pearson@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-0sqhrht34phowgclj12dgk8h@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14 15:05:25 +02:00
Ingo Molnar
9cba26e66d Merge branch 'perf/uprobes' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/uprobes 2012-05-14 14:43:40 +02:00
Ingo Molnar
2d84e023cb Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v3.5 RCU tree from Paul E. McKenney:

 1)	A set of improvements and fixes to the RCU_FAST_NO_HZ feature
	(with more on the way for 3.6).  Posted to LKML:
	https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
	https://lkml.org/lkml/2012/4/16/611 (commit 4),
	https://lkml.org/lkml/2012/4/30/390 (commit 6), and
	https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
	the other commits for the convenience of the tester).

 2)	Changes to make rcu_barrier() avoid disrupting execution of CPUs
	that have no RCU callbacks.  Posted to LKML:
	https://lkml.org/lkml/2012/4/23/322.

 3)	A couple of commits that improve the efficiency of the interaction
	between preemptible RCU and the scheduler, these two being all
	that survived an abortive attempt to allow preemptible RCU's
	__rcu_read_lock() to be inlined.  The full set was posted to
	LKML at https://lkml.org/lkml/2012/4/14/143, and the first and
	third patches of that set remain.

 4)	Lai Jiangshan's algorithmic implementation of SRCU, which includes
	call_srcu() and srcu_barrier().  A major feature of this new
	implementation is that synchronize_srcu() no longer disturbs
	the execution of other CPUs.  This work is based on earlier
	implementations by Peter Zijlstra and Paul E. McKenney.  Posted to
	LKML: https://lkml.org/lkml/2012/2/22/82.

 5)	A number of miscellaneous bug fixes and improvements which were
	posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
	subsequent updates posted to LKML.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14 08:41:46 +02:00
Randy Dunlap
1fce677971 printk: add stub for prepend_timestamp()
Add a stub for prepend_timestamp() when CONFIG_PRINTK is not
enabled.  Fixes this build error:

kernel/printk.c:1770:3: error: implicit declaration of function 'prepend_timestamp'

Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-11 16:44:54 -07:00
Rafael J. Wysocki
4e585d25e1 PM / Sleep: User space wakeup sources garbage collector Kconfig option
Make it possible to configure out the user space wakeup sources
garbage collector for debugging and default Android builds.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Arve Hjønnevåg <arve@android.com>
2012-05-11 21:11:16 +02:00
Rafael J. Wysocki
c73893e2ca PM / Sleep: Make the limit of user space wakeup sources configurable
Make it possible to configure out the check against the limit of
user space wakeup sources for debugging and default Android builds.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Arve Hjønnevåg <arve@android.com>
2012-05-11 21:11:02 +02:00
Paul E. McKenney
dc36be4419 Merge branches 'barrier.2012.05.09a', 'fixes.2012.04.26a', 'inline.2012.05.02b' and 'srcu.2012.05.07b' into HEAD
barrier:  Reduce the amount of disturbance by rcu_barrier() to the rest of
    	the system.  This branch also includes improvements to
    	RCU_FAST_NO_HZ, which are included here due to conflicts.
fixes:  Miscellaneous fixes.
inline:  Remaining changes from an abortive attempt to inline
    	preemptible RCU's __rcu_read_lock().  These are (1) making
    	exit_rcu() avoid unnecessary work and (2) avoiding having
    	preemptible RCU record a blocked thread when the scheduler
    	declines to do a context switch.
srcu:	Lai Jiangshan's algorithmic implementation of SRCU, including
    	call_srcu().
2012-05-11 10:14:21 -07:00
Stephen Warren
f8450fca6e printk: correctly align __log_buf
__log_buf must be aligned, because a 64-bit value is written directly
to it as part of struct log. Alignment of the log entries is typically
handled by log_store(), but this only triggers for subsequent entries,
not the very first (or wrapped) entries.

Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-10 15:36:59 -07:00
Mike Galbraith
5e2bf01422 namespaces, pid_ns: fix leakage on fork() failure
Fork() failure post namespace creation for a child cloned with
CLONE_NEWPID leaks pid_namespace/mnt_cache due to proc being mounted
during creation, but not unmounted during cleanup.  Call
pid_ns_release_proc() during cleanup.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 15:06:44 -07:00
Steven Rostedt
9b63776fa3 tracing: Do not enable function event with enable
With the adding of function tracing event to perf, it caused a
side effect that produces the following warning when enabling all
events in ftrace:

 # echo 1 > /sys/kernel/debug/tracing/events/enable

[console]
event trace: Could not enable event function

This is because when enabling all events via the debugfs system
it ignores events that do not have a ->reg() function assigned.
This was to skip over the ftrace internal events (as they are
not TRACE_EVENTs). But as the ftrace function event now has
a ->reg() function attached to it for use with perf, it is no
longer ignored.

Worse yet, this ->reg() function is being called when it should
not be. It returns an error and causes the above warning to
be printed.

By adding a new event_call flag (TRACE_EVENT_FL_IGNORE_ENABLE)
and have all ftrace internel event structures have it set,
setting the events/enable will no longe try to incorrectly enable
the function event and does not warn.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-10 15:55:43 -04:00
Jan Kiszka
b7dafa0ef3 compat: Fix RT signal mask corruption via sigprocmask
compat_sys_sigprocmask reads a smaller signal mask from userspace than
sigprogmask accepts for setting.  So the high word of blocked.sig[0]
will be cleared, releasing any potentially blocked RT signal.

This was discovered via userspace code that relies on get/setcontext.
glibc's i386 versions of those functions use sigprogmask instead of
rt_sigprogmask to save/restore signal mask and caused RT signal
unblocking this way.

As suggested by Linus, this replaces the sys_sigprocmask based compat
version with one that open-codes the required logic, including the merge
of the existing blocked set with the new one provided on SIG_SETMASK.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 08:58:33 -07:00
Kay Sievers
649e6ee33f printk() - restore timestamp printing at console output
The output of the timestamps got lost with the conversion of the
kmsg buffer to records; restore the old behavior.

Document, that CONFIG_PRINTK_TIME now only controls the output of
the timestamps in the syslog() system call and on the console, and
not the recording of the timestamps.

Cc: Joe Perches <joe@perches.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-09 20:35:06 -07:00
Kay Sievers
5c5d5ca51a printk() - do not merge continuation lines of different threads
This prevents the merging of printk() continuation lines of different
threads, in the case they race against each other.

It should properly isolate "atomic" single-line printk() users from
continuation users, to make sure the single-line users will never be
merged with the racy continuation ones.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-09 20:29:59 -07:00
Kay Sievers
7f3a781d6f printk - fix compilation for CONFIG_PRINTK=n
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-09 15:51:09 -07:00
Paul E. McKenney
b1420f1c8b rcu: Make rcu_barrier() less disruptive
The rcu_barrier() primitive interrupts each and every CPU, registering
a callback on every CPU.  Once all of these callbacks have been invoked,
rcu_barrier() knows that every callback that was registered before
the call to rcu_barrier() has also been invoked.

However, there is no point in registering a callback on a CPU that
currently has no callbacks, most especially if that CPU is in a
deep idle state.  This commit therefore makes rcu_barrier() avoid
interrupting CPUs that have no callbacks.  Doing this requires reworking
the handling of orphaned callbacks, otherwise callbacks could slip through
rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had
not yet interrupted to a CPU that rcu_barrier() had already interrupted.
This reworking was needed anyway to take a first step towards weaning
RCU from the CPU_DYING notifier's use of stop_cpu().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-09 14:27:54 -07:00
Paul E. McKenney
98248a0e24 rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables
The current initialization of the RCU_FAST_NO_HZ per-CPU variables makes
needless and fragile assumptions about the initial value of things like
the jiffies counter.  This commit therefore explicitly initializes all of
them that are better started with a non-zero value.  It also adds some
comments describing the per-CPU state variables.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-09 14:26:57 -07:00
Paul E. McKenney
21e52e1566 rcu: Make RCU_FAST_NO_HZ handle timer migration
The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a
CPU goes offline, in which case it assumes that the CPU will have to come
out of dyntick-idle mode (cancelling the timer) in order to go offline.
This is important because when RCU_FAST_NO_HZ permits a CPU to enter
dyntick-idle mode despite having RCU callbacks pending, it posts a timer
on that CPU to force a wakeup on that CPU.  This wakeup ensures that the
CPU will eventually handle the end of the grace period, including invoking
its RCU callbacks.

However, Pascal Chapperon's test setup shows that the timer handler
rcu_idle_gp_timer_func() really does get invoked in some cases.  This is
problematic because this can cause the CPU that entered dyntick-idle
mode despite still having RCU callbacks pending to remain in
dyntick-idle mode indefinitely, which means that its RCU callbacks might
never be invoked.  This situation can result in grace-period delays or
even system hangs, which matches Pascal's observations of slow boot-up
and shutdown (https://lkml.org/lkml/2012/4/5/142).  See also the bugzilla:

	https://bugzilla.redhat.com/show_bug.cgi?id=806548

This commit therefore causes the "should never be invoked" timer handler
rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up
the CPU for which the timer was intended, allowing that CPU to invoke
its RCU callbacks in a timely manner.

Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-09 14:26:56 -07:00
Ingo Molnar
c4f400e837 Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core 2012-05-09 19:34:30 +02:00
Peter Zijlstra
cb04ff9ac4 sched, perf: Use a single callback into the scheduler
We can easily use a single callback for both sched-in and sched-out. This
reduces the code footprint in the scheduler path as well as removes
the PMU black spot otherwise present between the out and in callback.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-o56ajxp1edwqg6x9d31wb805@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 15:23:17 +02:00
Robert Richter
fd0d000b2c perf: Pass last sampling period to perf_sample_data_init()
We always need to pass the last sample period to
perf_sample_data_init(), otherwise the event distribution will be
wrong. Thus, modifiyng the function interface with the required period
as argument. So basically a pattern like this:

        perf_sample_data_init(&data, ~0ULL);
        data.period = event->hw.last_period;

will now be like that:

        perf_sample_data_init(&data, ~0ULL, event->hw.last_period);

Avoids unininitialized data.period and simplifies code.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 15:23:12 +02:00
Peter Zijlstra
cb83b629ba sched/numa: Rewrite the CONFIG_NUMA sched domain support
The current code groups up to 16 nodes in a level and then puts an
ALLNODES domain spanning the entire tree on top of that. This doesn't
reflect the numa topology and esp for the smaller not-fully-connected
machines out there today this might make a difference.

Therefore, build a proper numa topology based on node_distance().

Since there's no fixed numa layers anymore, the static SD_NODE_INIT
and SD_ALLNODES_INIT aren't usable anymore, the new code tries to
construct something similar and scales some values either on the
number of cpus in the domain and/or the node_distance() ratio.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: linux-alpha@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: sparclinux@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Greg Pearson <greg.pearson@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: bob.picco@oracle.com
Cc: chris.mason@oracle.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 15:00:55 +02:00
Peter Zijlstra
bd939f45da sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
More function argument passing reduction.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-v66ivjfqdiqdso01lqgqx6qf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 15:00:54 +02:00
Peter Zijlstra
0ce90475dc sched/fair: Add some serialization to the sched_domain load-balance walk
Since the sched_domain walk is completely unserialized (!SD_SERIALIZE)
it is possible that multiple cpus in the group get elected to do the
next level. Avoid this by adding some serialization.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-vqh9ai6s0ewmeakjz80w4qz6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 15:00:53 +02:00
Peter Zijlstra
c22402a2f7 sched/fair: Let minimally loaded cpu balance the group
Currently we let the leftmost (or first idle) cpu ascend the
sched_domain tree and perform load-balancing. The result is that the
busiest cpu in the group might be performing this function and pull
more load to itself. The next load balance pass will then try to
equalize this again.

Change this to pick the least loaded cpu to perform higher domain
balancing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-v8zlrmgmkne3bkcy9dej1fvm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 15:00:51 +02:00
Peter Zijlstra
c82513e513 sched: Change rq->nr_running to unsigned int
Since there's a PID space limit of 30bits (see
futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a
lower bound of 2 pages per task) would still take 8T of memory it
seems reasonable to say that unsigned int is sufficient for
rq->nr_running.

When we do get anywhere near that amount of tasks I suspect other
things would go funny, load-balancer load computations would really
need to be hoisted to 128bit etc.

So save a few bytes and convert rq->nr_running and friends to
unsigned int.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 15:00:49 +02:00
Igor Mammedov
30b4e9eb78 sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption
If we have one cpu that failed to boot and boot cpu gave up on
waiting for it and then another cpu is being booted, kernel
might crash with following OOPS:

   BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
   IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
   Call Trace:
       [<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50

The crash happens in init_sched_groups_power() that expects
sched_groups to be circular linked list. However it is not
always true, since sched_groups preallocated in __sdt_alloc are
initialized in build_sched_groups and it may exit early

        if (cpu != cpumask_first(sched_domain_span(sd)))
                return 0;

without initializing sd->groups->next field.

Fix bug by initializing next field right after sched_group was
allocated.

Also-Reported-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: a.p.zijlstra@chello.nl
Cc: pjt@google.com
Cc: seto.hidetoshi@jp.fujitsu.com
Link: http://lkml.kernel.org/r/1336559908-32533-1-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 12:27:35 +02:00
Steven Rostedt
68179686ac tracing: Remove ftrace_disable/enable_cpu()
The ftrace_disable_cpu() and ftrace_enable_cpu() functions were
needed back before the ring buffer was lockless. Now that the
ring buffer is lockless (and has been for some time), these functions
serve no purpose, and unnecessarily slow down operations of the tracer.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-08 21:06:26 -04:00
Jiri Olsa
50e18b94c6 tracing: Use seq_*_private interface for some seq files
It's appropriate to use __seq_open_private interface to open
some of trace seq files, because it covers all steps we are
duplicating in tracing code - zallocating the iterator and
setting it as seq_file's private.

Using this for following files:
  trace
  available_filter_functions
  enabled_functions

Link: http://lkml.kernel.org/r/1335342219-2782-5-git-send-email-jolsa@redhat.com

Signed-off-by: Jiri Olsa <jolsa@redhat.com>

[
 Fixed warnings for:
   kernel/trace/trace.c: In function '__tracing_open':
   kernel/trace/trace.c:2418:11: warning: unused variable 'ret' [-Wunused-variable]
   kernel/trace/trace.c:2417:19: warning: unused variable 'm' [-Wunused-variable]
]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-08 21:04:12 -04:00
Kay Sievers
5fc3249068 kmsg: use do_div() to divide 64bit integer
On Tue, May 8, 2012 at 10:02 AM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> kernel/built-in.o: In function `devkmsg_read':
> printk.c:(.text+0x27e8): undefined reference to `__udivdi3'
> Most probably the "msg->ts_nsec / 1000" since
> ts_nsec is a u64 and this is a 32 bit build ...

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-08 08:55:26 -07:00
Thomas Gleixner
f5e1028736 task_allocator: Use config switches instead of magic defines
Replace __HAVE_ARCH_TASK_ALLOCATOR and __HAVE_ARCH_THREAD_ALLOCATOR
with proper config switches.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20120505150142.371309416@linutronix.de
2012-05-08 14:08:46 +02:00
Thomas Gleixner
0d15d74a1e fork: Provide kmemcache based thread_info allocator
Several architectures have their own kmemcache based thread allocator
because THREAD_SIZE is smaller than PAGE_SIZE. Add it to the core code
conditionally on THREAD_SIZE < PAGE_SIZE so the private copies can go.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.491002124@linutronix.de
2012-05-08 14:08:44 +02:00
Thomas Gleixner
67ba5293f7 Merge branch 'smp/threadalloc' into smp/hotplug
Reason: Pull in the separate branch which was created so arch/tile can
base further work on it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-08 14:07:48 +02:00
Thomas Gleixner
41101809a8 fork: Provide weak arch_release_[task_struct|thread_info] functions
These functions allow us to move most of the duplicated thread_info
allocators to the core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.366461660@linutronix.de
2012-05-08 13:55:20 +02:00
Thomas Gleixner
2889f60814 fork: Move thread info gfp flags to header
These flags can be useful for extra allocations outside of the core
code.

Add __GFP_NOTRACK to them, so the archs which have kmemcheck do
not have to provide extra allocators just for that reason.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.428211694@linutronix.de
2012-05-08 13:55:20 +02:00
Thomas Gleixner
6c0a9fa62f fork: Remove the weak insanity
We error out when compiling with gcc4.1.[01] as it miscompiles
__weak. The workaround with magic defines is not longer
necessary. Make it __weak again.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.306358267@linutronix.de
2012-05-08 13:55:20 +02:00
Thomas Gleixner
f37f435f33 smp: Implement kick_all_cpus_sync()
Will replace the misnomed cpu_idle_wait() function which is copied a
gazillion times all over arch/*

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120507175652.049316594@linutronix.de
2012-05-08 12:35:06 +02:00
Kay Sievers
e11fea92e1 kmsg: export printk records to the /dev/kmsg interface
Support for multiple concurrent readers of /dev/kmsg, with read(),
seek(), poll() support. Output of message sequence numbers, to allow
userspace log consumers to reliably reconnect and reconstruct their
state at any given time. After open("/dev/kmsg"), read() always
returns *all* buffered records. If only future messages should be
read, SEEK_END can be used. In case records get overwritten while
/dev/kmsg is held open, or records get faster overwritten than they
are read, the next read() will return -EPIPE and the current reading
position gets updated to the next available record. The passed
sequence numbers allow the log consumer to calculate the amount of
lost messages.

  [root@mop ~]# cat /dev/kmsg
  5,0,0;Linux version 3.4.0-rc1+ (kay@mop) (gcc version 4.7.0 20120315 ...
  6,159,423091;ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
  7,160,424069;pci_root PNP0A03:00: host bridge window [io  0x0000-0x0cf7] (ignored)
   SUBSYSTEM=acpi
   DEVICE=+acpi:PNP0A03:00
  6,339,5140900;NET: Registered protocol family 10
  30,340,5690716;udevd[80]: starting version 181
  6,341,6081421;FDC 0 is a S82078B
  6,345,6154686;microcode: CPU0 sig=0x623, pf=0x0, revision=0x0
  7,346,6156968;sr 1:0:0:0: Attached scsi CD-ROM sr0
   SUBSYSTEM=scsi
   DEVICE=+scsi:1:0:0:0
  6,347,6289375;microcode: CPU1 sig=0x623, pf=0x0, revision=0x0

Cc: Karel Zak <kzak@redhat.com>
Tested-by: William Douglas <william.douglas@intel.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07 17:03:27 -07:00
Kay Sievers
7ff9554bb5 printk: convert byte-buffer to variable-length record buffer
- Record-based stream instead of the traditional byte stream
  buffer. All records carry a 64 bit timestamp, the syslog facility
  and priority in the record header.

- Records consume almost the same amount, sometimes less memory than
  the traditional byte stream buffer (if printk_time is enabled). The record
  header is 16 bytes long, plus some padding bytes at the end if needed.
  The byte-stream buffer needed 3 chars for the syslog prefix, 15 char for
  the timestamp and a newline.

- Buffer management is based on message sequence numbers. When records
  need to be discarded, the reading heads move on to the next full
  record. Unlike the byte-stream buffer, no old logged lines get
  truncated or partly overwritten by new ones. Sequence numbers also
  allow consumers of the log stream to get notified if any message in
  the stream they are about to read gets discarded during the time
  of reading.

- Better buffered IO support for KERN_CONT continuation lines, when printk()
  is called multiple times for a single line. The use of KERN_CONT is now
  mandatory to use continuation; a few places in the kernel need trivial fixes
  here. The buffering could possibly be extended to per-cpu variables to allow
  better thread-safety for multiple printk() invocations for a single line.

- Full-featured syslog facility value support. Different facilities
  can tag their messages. All userspace-injected messages enforce a
  facility value > 0 now, to be able to reliably distinguish them from
  the kernel-generated messages. Independent subsystems like a
  baseband processor running its own firmware, or a kernel-related
  userspace process can use their own unique facility values. Multiple
  independent log streams can co-exist that way in the same
  buffer. All share the same global sequence number counter to ensure
  proper ordering (and interleaving) and to allow the consumers of the
  log to reliably correlate the events from different facilities.

Tested-by: William Douglas <william.douglas@intel.com>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07 16:53:02 -07:00
Hiroshi Shimamoto
489a71b029 sched: Update documentation and comments
Change sched_*.c to sched/*.c in documentation and comments.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F795CAC.9080206@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-07 15:04:18 +02:00
Ingo Molnar
436281c9a1 Merge branch 'linus' into sched/core
Merge reason: We were on a pretty old base, refresh before moving on.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-07 15:03:42 +02:00
Kyle McMartin
2a01bb3885 panic: Make panic_on_oops configurable
Several distros set this by default by patching panic_on_oops.
It seems to fit with the BOOTPARAM_{HARD,SOFT}_PANIC options
though, so let's add a Kconfig entry and reduce some more
upstream delta.

Signed-off-by: Kyle McMartin <kyle@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120411121529.GH26688@redacted.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-07 14:45:29 +02:00
Srikar Dronamraju
f3f096cfed tracing: Provide trace events interface for uprobes
Implements trace_event support for uprobes. In its current form
it can be used to put probes at a specified offset in a file and
dump the required registers when the code flow reaches the
probed address.

The following example shows how to dump the instruction pointer
and %ax a register at the probed text address.  Here we are
trying to probe zfree in /bin/zsh:

 # cd /sys/kernel/debug/tracing/
 # cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
 # objdump -T /bin/zsh | grep -w zfree
 0000000000446420 g    DF .text  0000000000000012  Base
 zfree # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
 # cat uprobe_events
 p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
 # echo 1 > events/uprobes/enable
 # sleep 20
 # echo 0 > events/uprobes/enable
 # cat trace
 # tracer: nop
 #
 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
 #              | |       |          |         |
              zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
              zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
              zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
              zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120411103043.GB29437@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-07 14:30:17 +02:00
Srikar Dronamraju
8ab83f5647 tracing: Extract out common code for kprobes/uprobes trace events
Move parts of trace_kprobe.c that can be shared with upcoming
trace_uprobe.c. Common code to kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c. There are no functional changes.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120409091144.8343.76218.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-07 14:29:57 +02:00
Srikar Dronamraju
3a6b76661d tracing: Modify is_delete, is_return from int to bool
is_delete and is_return can take utmost 2 values and are better
of being a boolean than a int. There are no functional changes.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120409091133.8343.65289.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-07 14:29:35 +02:00
Ingo Molnar
19631cb3d6 Merge branch 'tip/perf/core-4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core 2012-05-07 11:03:52 +02:00
Arve Hjønnevåg
040e5bf65e PM / Sleep: Fix a mistake in a conditional in autosleep_store()
The condition check in autosleep_store() is incorrect and prevents
/sys/power/autosleep from working as advertised.  Fix that.

[rjw: Added the changelog.]

Signed-off-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-05-05 21:50:58 +02:00
Thomas Gleixner
a4a2eb490e init_task: Create generic init_task instance
All archs define init_task in the same way (except ia64, but there is
no particular reason why ia64 cannot use the common version). Create a
generic instance so all archs can be converted over.

The config switch is temporary and will be removed when all archs are
converted over.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20120503085034.092585287@linutronix.de
2012-05-05 13:00:21 +02:00
Jim Cromie
b5f3abf950 params: replace printk(KERN_<LVL>...) with pr_<lvl>(...)
I left 1 printk which uses __FILE__, __LINE__ explicitly, which should
not be subject to generic preferences expressed via pr_fmt().

+ tweaks suggested by Joe Perches:
- add doing to irq-enabled warning, like others.  It wont happen often..
- change sysfs failure crit, not just err, make it 1 line in logs.
- coalese 2 format fragments into 1 >80 char line

cc: Joe Perches <joe@perches.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-04 17:28:18 -07:00
Jim Cromie
1ef9eaf2bf params.c: fix Smack complaint about parse_args
In commit 9fb48c744: "params: add 3rd arg to option handler callback
signature", the if-guard added to the pr_debug was overzealous; no
callers pass NULL, and existing code above and below the guard assumes
as much.  Change the if-guard to match, and silence the Smack
complaint.

CC: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-04 17:24:22 -07:00
Thomas Gleixner
9c6079aa1b genirq: Do not consider disabled wakeup irqs
If an wakeup interrupt has been disabled before the suspend code
disables all interrupts then we have to ignore the pending flag.

Otherwise we would abort suspend over and over as nothing clears the
pending flag because the interrupt is disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: NeilBrown <neilb@suse.de>
2012-05-04 23:38:50 +02:00
Thomas Gleixner
d4dc0f90d2 genirq: Allow check_wakeup_irqs to notice level-triggered interrupts
Level triggered interrupts do not cause IRQS_PENDING to be set when
they fire while "disabled" as the 'pending' state is always present in
the level - they automatically refire where re-enabled.

However the IRQS_PENDING flag is also used to abort a suspend cycle -
if any 'is_wakeup_set' interrupt is PENDING, check_wakeup_irqs() will
cause suspend to abort. Without IRQS_PENDING, suspend won't abort.

Consequently, level-triggered interrupts that fire during the 'noirq'
phase of suspend do not currently abort suspend.

So set IRQS_PENDING even for level triggered interrupts, and make sure
to clear the flag in check_irq_resend.

[ Changelog by courtesy of Neil ]

Tested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-04 23:38:50 +02:00
Thomas Gleixner
43a18b1e58 smp: Fix idle_thread_init() inline stub
idle_thread_init() does not have arguments.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-04 12:52:25 +02:00