Commit Graph

284 Commits

Author SHA1 Message Date
Linus Torvalds
eb64c3c6cd xen: additional features for 3.19-rc0
- Linear p2m for x86 PV guests which simplifies the p2m code, improves
   performance and will allow for > 512 GB PV guests in the future.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJUjx7OAAoJEFxbo/MsZsTRXLIH/ishF/xDCL6F5r0I0SKDuaz5
 C/BediDcFzbzh4/t3x2PrPooHk4gPmeyIg688ZGgBAxHRXC5OJ2U5tdtZ/qUCnwf
 0J1pdp/yoAOVRJT+Sax10lN4+G8YV7+6Ptikz0C7glXBAg8SgFL3Y6tfBS0jNwYR
 wQph09S9n7gMZTodSBLbb0ymtJMhl16DrETJsYV73sU7bAL5sFDVkMQvY3SxkusX
 GNFeALfqM0cSK9mDI6O9avGJKoIdKlzt7VWHdlc+yKTlQsoyg/cSH3AaihhG6af9
 IElRxwH9Z40VFLKip0gNMOIrUwAjFGSw6N+Uhik27tlmvfI3Dll/+gsMz/5sHc8=
 =OyoK
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.19-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull additional xen update from David Vrabel:
 "Xen: additional features for 3.19-rc0

   - Linear p2m for x86 PV guests which simplifies the p2m code,
     improves performance and will allow for > 512 GB PV guests in the
     future.

  A last-minute, configuration specific issue was discovered with this
  change which is why it was not included in my previous pull request.
  This is now been fixed and tested"

* tag 'stable/for-linus-3.19-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: switch to post-init routines in xen mmu.c earlier
  Revert "swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single"
  xen: annotate xen_set_identity_and_remap_chunk() with __init
  xen: introduce helper functions to do safe read and write accesses
  xen: Speed up set_phys_to_machine() by using read-only mappings
  xen: switch to linear virtual mapped sparse p2m list
  xen: Hide get_phys_to_machine() to be able to tune common path
  x86: Introduce function to get pmd entry pointer
  xen: Delay invalidating extra memory
  xen: Delay m2p_override initialization
  xen: Delay remapping memory of pv-domain
  xen: use common page allocation function in p2m.c
  xen: Make functions static
  xen: fix some style issues in p2m.c
2014-12-16 13:23:03 -08:00
Juergen Gross
cdfa0badfc xen: switch to post-init routines in xen mmu.c earlier
With the virtual mapped linear p2m list the post-init mmu operations
must be used for setting up the p2m mappings, as in case of
CONFIG_FLATMEM the init routines may trigger BUGs.

paging_init() sets up all infrastructure needed to switch to the
post-init mmu ops done by xen_post_allocator_init(). With the virtual
mapped linear p2m list we need some mmu ops during setup of this list,
so we have to switch to the correct mmu ops as soon as possible.

The p2m list is usable from the beginning, just expansion requires to
have established the new linear mapping. So the call of
xen_remap_memory() had to be introduced, but this is not due to the
mmu ops requiring this.

Summing it up: calling xen_post_allocator_init() not directly after
paging_init() was conceptually wrong in the beginning, it just didn't
matter up to now as no functions used between the two calls needed
some critical mmu ops (e.g. alloc_pte). This has changed now, so I
corrected it.

Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-11 12:05:53 +00:00
Linus Torvalds
3100e448e7 Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso updates from Ingo Molnar:
 "Various vDSO updates from Andy Lutomirski, mostly cleanups and
  reorganization to improve maintainability, but also some
  micro-optimizations and robustization changes"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86_64/vsyscall: Restore orig_ax after vsyscall seccomp
  x86_64: Add a comment explaining the TASK_SIZE_MAX guard page
  x86_64,vsyscall: Make vsyscall emulation configurable
  x86_64, vsyscall: Rewrite comment and clean up headers in vsyscall code
  x86_64, vsyscall: Turn vsyscalls all the way off when vsyscall==none
  x86,vdso: Use LSL unconditionally for vgetcpu
  x86: vdso: Fix build with older gcc
  x86_64/vdso: Clean up vgetcpu init and merge the vdso initcalls
  x86_64/vdso: Remove jiffies from the vvar page
  x86/vdso: Make the PER_CPU segment 32 bits
  x86/vdso: Make the PER_CPU segment start out accessed
  x86/vdso: Change the PER_CPU segment to use struct desc_struct
  x86_64/vdso: Move getcpu code from vsyscall_64.c to vdso/vma.c
  x86_64/vsyscall: Move all of the gate_area code to vsyscall_64.c
2014-12-10 14:24:20 -08:00
Juergen Gross
054954eb05 xen: switch to linear virtual mapped sparse p2m list
At start of the day the Xen hypervisor presents a contiguous mfn list
to a pv-domain. In order to support sparse memory this mfn list is
accessed via a three level p2m tree built early in the boot process.
Whenever the system needs the mfn associated with a pfn this tree is
used to find the mfn.

Instead of using a software walked tree for accessing a specific mfn
list entry this patch is creating a virtual address area for the
entire possible mfn list including memory holes. The holes are
covered by mapping a pre-defined  page consisting only of "invalid
mfn" entries. Access to a mfn entry is possible by just using the
virtual base address of the mfn list and the pfn as index into that
list. This speeds up the (hot) path of determining the mfn of a
pfn.

Kernel build on a Dell Latitude E6440 (2 cores, HT) in 64 bit Dom0
showed following improvements:

Elapsed time: 32:50 ->  32:35
System:       18:07 ->  17:47
User:        104:00 -> 103:30

Tested with following configurations:
- 64 bit dom0, 8GB RAM
- 64 bit dom0, 128 GB RAM, PCI-area above 4 GB
- 32 bit domU, 512 MB, 8 GB, 43 GB (more wouldn't work even without
                                    the patch)
- 32 bit domU, ballooning up and down
- 32 bit domU, save and restore
- 32 bit domU with PCI passthrough
- 64 bit domU, 8 GB, 2049 MB, 5000 MB
- 64 bit domU, ballooning up and down
- 64 bit domU, save and restore
- 64 bit domU with PCI passthrough

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:09:15 +00:00
Juergen Gross
0aad568983 xen: Hide get_phys_to_machine() to be able to tune common path
Today get_phys_to_machine() is always called when the mfn for a pfn
is to be obtained. Add a wrapper __pfn_to_mfn() as inline function
to be able to avoid calling get_phys_to_machine() when possible as
soon as the switch to a linear mapped p2m list has been done.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:09:09 +00:00
Juergen Gross
1f3ac86b4c xen: Delay remapping memory of pv-domain
Early in the boot process the memory layout of a pv-domain is changed
to match the E820 map (either the host one for Dom0 or the Xen one)
regarding placement of RAM and PCI holes. This requires removing memory
pages initially located at positions not suitable for RAM and adding
them later at higher addresses where no restrictions apply.

To be able to operate on the hypervisor supported p2m list until a
virtual mapped linear p2m list can be constructed, remapping must
be delayed until virtual memory management is initialized, as the
initial p2m list can't be extended unlimited at physical memory
initialization time due to it's fixed structure.

A further advantage is the reduction in complexity and code volume as
we don't have to be careful regarding memory restrictions during p2m
updates.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:08:48 +00:00
Juergen Gross
7108c9ce8f xen: use common page allocation function in p2m.c
In arch/x86/xen/p2m.c three different allocation functions for
obtaining a memory page are used: extend_brk(), alloc_bootmem_align()
or __get_free_page().  Which of those functions is used depends on the
progress of the boot process of the system.

Introduce a common allocation routine selecting the to be called
allocation routine dynamically based on the boot progress. This allows
moving initialization steps without having to care about changing
allocation calls.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:08:42 +00:00
Juergen Gross
47591df505 xen: Support Xen pv-domains using PAT
With the dynamical mapping between cache modes and pgprot values it is
now possible to use all cache modes via the Xen hypervisor PAT settings
in a pv domain.

All to be done is to read the PAT configuration MSR and set up the
translation tables accordingly.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stefan.bader@canonical.com
Cc: xen-devel@lists.xensource.com
Cc: ville.syrjala@linux.intel.com
Cc: jbeulich@suse.com
Cc: toshi.kani@hp.com
Cc: plagnioj@jcrosoft.com
Cc: tomi.valkeinen@ti.com
Cc: bhelgaas@google.com
Link: http://lkml.kernel.org/r/1415019724-4317-19-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-16 11:04:26 +01:00
Andy Lutomirski
1ad83c858c x86_64,vsyscall: Make vsyscall emulation configurable
This adds CONFIG_X86_VSYSCALL_EMULATION, guarded by CONFIG_EXPERT.
Turning it off completely disables vsyscall emulation, saving ~3.5k
for vsyscall_64.c, 4k for vsyscall_emu_64.S (the fake vsyscall
page), some tiny amount of core mm code that supports a gate area,
and possibly 4k for a wasted pagetable.  The latter is because the
vsyscall addresses are misaligned and fit poorly in the fixmap.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/406db88b8dd5f0cbbf38216d11be34bbb43c7eae.1414618407.git.luto@amacapital.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03 21:44:57 +01:00
Juergen Gross
2c185687ab x86/xen: delay construction of mfn_list_list
The 3 level p2m tree for the Xen tools is constructed very early at
boot by calling xen_build_mfn_list_list(). Memory needed for this tree
is allocated via extend_brk().

As this tree (other than the kernel internal p2m tree) is only needed
for domain save/restore, live migration and crash dump analysis it
doesn't matter whether it is constructed very early or just some
milliseconds later when memory allocation is possible by other means.

This patch moves the call of xen_build_mfn_list_list() just after
calling xen_pagetable_p2m_copy() simplifying this function, too, as it
doesn't have to bother with two parallel trees now. The same applies
for some other internal functions.

While simplifying code, make early_can_reuse_p2m_middle() static and
drop the unused second parameter. p2m_mid_identity_mfn can be removed
as well, it isn't used either.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-23 16:24:02 +01:00
David Vrabel
7f2f882245 x86/xen: do not use _PAGE_IOMAP PTE flag for I/O mappings
Since mfn_to_pfn() returns the correct PFN for identity mappings (as
used for MMIO regions), the use of _PAGE_IOMAP is not required in
pte_mfn_to_pfn().

Do not set the _PAGE_IOMAP flag in pte_pfn_to_mfn() and do not use it
in pte_mfn_to_pfn().

This will allow _PAGE_IOMAP to be removed, making it available for
future use.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-09-23 13:36:20 +00:00
Stefan Bader
0b5a50635f x86/xen: don't copy bogus duplicate entries into kernel page tables
When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
modules exceeds 512 MiB, then loading modules fails with a warning
(and hence a vmalloc allocation failure) because the PTEs for the
newly-allocated vmalloc address space are not zero.

  WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
           vmap_page_range_noflush+0x2a1/0x360()

This is caused by xen_setup_kernel_pagetables() copying
level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present
entries.

Without KASLR, the normal kernel image size only covers the first half
of level2_kernel_pgt and module space starts after that.

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                  [256..511]->module
                          [511]->level2_fixmap_pgt[  0..505]->module

This allows 512 MiB of of module vmalloc space to be used before
having to use the corrupted level2_fixmap_pgt entries.

With KASLR enabled, the kernel image uses the full PUD range of 1G and
module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                          [511]->level2_fixmap_pgt[0..505]->module

And now no module vmalloc space can be used without using the corrupt
level2_fixmap_pgt entries.

Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
and setting level1_fixmap_pgt as read-only.

A number of comments were also using the the wrong L3 offset for
level2_kernel_pgt.  These have been corrected.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
2014-09-10 15:23:42 +01:00
Linus Torvalds
a0abcf2e8f Merge branch 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull x86 cdso updates from Peter Anvin:
 "Vdso cleanups and improvements largely from Andy Lutomirski.  This
  makes the vdso a lot less ''special''"

* 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso, build: Make LE access macros clearer, host-safe
  x86/vdso, build: Fix cross-compilation from big-endian architectures
  x86/vdso, build: When vdso2c fails, unlink the output
  x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
  x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
  x86, mm: Improve _install_special_mapping and fix x86 vdso naming
  mm, fs: Add vm_ops->name as an alternative to arch_vma_name
  x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
  x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
  x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
  x86, vdso: Move the 32-bit vdso special pages after the text
  x86, vdso: Reimplement vdso.so preparation in build-time C
  x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
  x86, vdso: Clean up 32-bit vs 64-bit vdso params
  x86, mm: Ensure correct alignment of the fixmap
2014-06-05 08:05:29 -07:00
Mukesh Rathor
77945ca73e x86/xen: map foreign pfns for autotranslated guests
When running as a dom0 in PVH mode, foreign pfns that are accessed
must be added to our p2m which is managed by xen. This is done via
XENMEM_add_to_physmap_range hypercall. This is needed for toolstack
building guests and mapping guest memory, xentrace mapping xen pages,
etc.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-05-27 11:47:04 +01:00
David Vrabel
f59c5145dc x86/xen: do not use _PAGE_IOMAP in xen_remap_domain_mfn_range()
_PAGE_IOMAP is used in xen_remap_domain_mfn_range() to prevent the
pfn_pte() call in remap_area_mfn_pte_fn() from using the p2m to translate
the MFN.  If mfn_pte() is used instead, the p2m look up is avoided and
the use of _PAGE_IOMAP is no longer needed.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-15 16:16:40 +01:00
Andy Lutomirski
f40c330091 x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
This makes the 64-bit and x32 vdsos use the same mechanism as the
32-bit vdso.  Most of the churn is deleting all the old fixmap code.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/8af87023f57f6bb96ec8d17fce3f88018195b49b.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 13:19:01 -07:00
Linus Torvalds
c6f21243ce Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso changes from Peter Anvin:
 "This is the revamp of the 32-bit vdso and the associated cleanups.

  This adds timekeeping support to the 32-bit vdso that we already have
  in the 64-bit vdso.  Although 32-bit x86 is legacy, it is likely to
  remain in the embedded space for a very long time to come.

  This removes the traditional COMPAT_VDSO support; the configuration
  variable is reused for simply removing the 32-bit vdso, which will
  produce correct results but obviously suffer a performance penalty.
  Only one beta version of glibc was affected, but that version was
  unfortunately included in one OpenSUSE release.

  This is not the end of the vdso cleanups.  Stefani and Andy have
  agreed to continue work for the next kernel cycle; in fact Andy has
  already produced another set of cleanups that came too late for this
  cycle.

  An incidental, but arguably important, change is that this ensures
  that unused space in the VVAR page is properly zeroed.  It wasn't
  before, and would contain whatever garbage was left in memory by BIOS
  or the bootloader.  Since the VVAR page is accessible to user space
  this had the potential of information leaks"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86, vdso: Fix the symbol versions on the 32-bit vDSO
  x86, vdso, build: Don't rebuild 32-bit vdsos on every make
  x86, vdso: Actually discard the .discard sections
  x86, vdso: Fix size of get_unmapped_area()
  x86, vdso: Finish removing VDSO32_PRELINK
  x86, vdso: Move more vdso definitions into vdso.h
  x86: Load the 32-bit vdso in place, just like the 64-bit vdsos
  x86, vdso32: handle 32 bit vDSO larger one page
  x86, vdso32: Disable stack protector, adjust optimizations
  x86, vdso: Zero-pad the VVAR page
  x86, vdso: Add 32 bit VDSO time support for 64 bit kernel
  x86, vdso: Add 32 bit VDSO time support for 32 bit kernel
  x86, vdso: Patch alternatives in the 32-bit VDSO
  x86, vdso: Introduce VVAR marco for vdso32
  x86, vdso: Cleanup __vdso_gettimeofday()
  x86, vdso: Replace VVAR(vsyscall_gtod_data) by gtod macro
  x86, vdso: __vdso_clock_gettime() cleanup
  x86, vdso: Revamp vclock_gettime.c
  mm: Add new func _install_special_mapping() to mmap.c
  x86, vdso: Make vsyscall_gtod_data handling x86 generic
  ...
2014-04-02 12:26:43 -07:00
David Vrabel
5926f87fda Revert "xen: properly account for _PAGE_NUMA during xen pte translations"
This reverts commit a9c8e4beee.

PTEs in Xen PV guests must contain machine addresses if _PAGE_PRESENT
is set and pseudo-physical addresses is _PAGE_PRESENT is clear.

This is because during a domain save/restore (migration) the page
table entries are "canonicalised" and uncanonicalised". i.e., MFNs are
converted to PFNs during domain save so that on a restore the page
table entries may be rewritten with the new MFNs on the destination.
This canonicalisation is only done for PTEs that are present.

This change resulted in writing PTEs with MFNs if _PAGE_PROTNONE (or
_PAGE_NUMA) was set but _PAGE_PRESENT was clear.  These PTEs would be
migrated as-is which would result in unexpected behaviour in the
destination domain.  Either a) the MFN would be translated to the
wrong PFN/page; b) setting the _PAGE_PRESENT bit would clear the PTE
because the MFN is no longer owned by the domain; or c) the present
bit would not get set.

Symptoms include "Bad page" reports when munmapping after migrating a
domain.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>        [3.12+]
2014-03-25 11:11:42 +00:00
H. Peter Anvin
1f2cbcf648 x86, vdso, xen: Remove stray reference to FIX_VDSO
Checkin

    b0b49f2673 x86, vdso: Remove compat vdso support

... removed the VDSO from the fixmap, and thus FIX_VDSO; remove a
stray reference in Xen.

Found by Fengguang Wu's test robot.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Link: http://lkml.kernel.org/r/4bb4690899106eb11430b1186d5cc66ca9d1660c.1394751608.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-03-13 19:44:47 -07:00
Mel Gorman
a9c8e4beee xen: properly account for _PAGE_NUMA during xen pte translations
Steven Noonan forwarded a users report where they had a problem starting
vsftpd on a Xen paravirtualized guest, with this in dmesg:

  BUG: Bad page map in process vsftpd  pte:8000000493b88165 pmd:e9cc01067
  page:ffffea00124ee200 count:0 mapcount:-1 mapping:     (null) index:0x0
  page flags: 0x2ffc0000000014(referenced|dirty)
  addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping:          (null) index:7f97eea74
  CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
  Call Trace:
    dump_stack+0x45/0x56
    print_bad_pte+0x22e/0x250
    unmap_single_vma+0x583/0x890
    unmap_vmas+0x65/0x90
    exit_mmap+0xc5/0x170
    mmput+0x65/0x100
    do_exit+0x393/0x9e0
    do_group_exit+0xcc/0x140
    SyS_exit_group+0x14/0x20
    system_call_fastpath+0x1a/0x1f
  Disabling lock debugging due to kernel taint
  BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
  BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1

The issue could not be reproduced under an HVM instance with the same
kernel, so it appears to be exclusive to paravirtual Xen guests.  He
bisected the problem to commit 1667918b64 ("mm: numa: clear numa
hinting information on mprotect") that was also included in 3.12-stable.

The problem was related to how xen translates ptes because it was not
accounting for the _PAGE_NUMA bit.  This patch splits pte_present to add
a pteval_present helper for use by xen so both bare metal and xen use
the same code when checking if a PTE is present.

[mgorman@suse.de: wrote changelog, proposed minor modifications]
[akpm@linux-foundation.org: fix typo in comment]
Reported-by: Steven Noonan <steven@uplinklabs.net>
Tested-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10 16:01:41 -08:00
Linus Torvalds
12f2bbd609 Merge branch 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asmlinkage (LTO) changes from Peter Anvin:
 "This patchset adds more infrastructure for link time optimization
  (LTO).

  This patchset was pulled into my tree late because of a
  miscommunication (part of the patchset was picked up by other
  maintainers).  However, the patchset is strictly build-related and
  seems to be okay in testing"

* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, asmlinkage, xen: Fix type of NMI
  x86, asmlinkage, xen, kvm: Make {xen,kvm}_lock_spinning global and visible
  x86: Use inline assembler instead of global register variable to get sp
  x86, asmlinkage, paravirt: Make paravirt thunks global
  x86, asmlinkage, paravirt: Don't rely on local assembler labels
  x86, asmlinkage, lguest: Fix C functions used by inline assembler
2014-01-30 18:15:32 -08:00
Andi Kleen
a2e7f0e3a4 x86, asmlinkage, paravirt: Make paravirt thunks global
The paravirt thunks use a hack of using a static reference to a static
function to reference that function from the top level statement.

This assumes that gcc always generates static function names in a specific
format, which is not necessarily true.

Simply make these functions global and asmlinkage or __visible. This way the
static __used variables are not needed and everything works.

Functions with arguments are __visible to keep the register calling
convention on 32bit.

Changed in paravirt and in all users (Xen and vsmp)

v2: Use __visible for functions with arguments

Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Ido Yariv <ido@wizery.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1382458079-24450-5-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-01-29 22:17:17 -08:00
Mukesh Rathor
76bcceff0b xen/pvh/mmu: Use PV TLB instead of native.
We also optimize one - the TLB flush. The native operation would
needlessly IPI offline VCPUs causing extra wakeups. Using the
Xen one avoids that and lets the hypervisor determine which
VCPU needs the TLB flush.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-06 10:44:07 -05:00
Mukesh Rathor
4e44e44b0b xen/pvh: MMU changes for PVH (v2)
.. which are surprisingly small compared to the amount for PV code.

PVH uses mostly native mmu ops, we leave the generic (native_*) for
the majority and just overwrite the baremetal with the ones we need.

At startup, we are running with pre-allocated page-tables
courtesy of the tool-stack. But we still need to graft them
in the Linux initial pagetables. However there is no need to
unpin/pin and change them to R/O or R/W.

Note that the xen_pagetable_init due to 7836fec9d0994cc9c9150c5a33f0eb0eb08a335a
"xen/mmu/p2m: Refactor the xen_pagetable_init code." does not
need any changes - we just need to make sure that xen_post_allocator_init
does not alter the pvops from the default native one.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:44:05 -05:00
Konrad Rzeszutek Wilk
b621e157ba xen/mmu: Cleanup xen_pagetable_p2m_copy a bit.
Stefano noticed that the code runs only under 64-bit so
the comments about 32-bit are pointless.

Also we change the condition for xen_revector_p2m_tree
returning the same value (because it could not allocate
a swath of space to put the new P2M in) or it had been
called once already. In such we return early from the
function.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:44:04 -05:00
Konrad Rzeszutek Wilk
32df75cd14 xen/mmu/p2m: Refactor the xen_pagetable_init code (v2).
The revectoring and copying of the P2M only happens when
!auto-xlat and on 64-bit builds. It is not obvious from
the code, so lets have seperate 32 and 64-bit functions.

We also invert the check for auto-xlat to make the code
flow simpler.

Suggested-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-06 10:44:02 -05:00
Linus Torvalds
eda670c626 Features:
- SWIOTLB has tracing added when doing bounce buffer.
  - Xen ARM/ARM64 can use Xen-SWIOTLB. This work allows Linux to
    safely program real devices for DMA operations when running as
    a guest on Xen on ARM, without IOMMU support.*1
  - xen_raw_printk works with PVHVM guests if needed.
 Bug-fixes:
  - Make memory ballooning work under HVM with large MMIO region.
  - Inform hypervisor of MCFG regions found in ACPI DSDT.
  - Remove deprecated IRQF_DISABLED.
  - Remove deprecated __cpuinit.
 
 [*1]:
 "On arm and arm64 all Xen guests, including dom0, run with second stage
 translation enabled. As a consequence when dom0 programs a device for a
 DMA operation is going to use (pseudo) physical addresses instead
 machine addresses. This work introduces two trees to track physical to
 machine and machine to physical mappings of foreign pages. Local pages
 are assumed mapped 1:1 (physical address == machine address).  It
 enables the SWIOTLB-Xen driver on ARM and ARM64, so that Linux can
 translate physical addresses to machine addresses for dma operations
 when necessary. " (Stefano).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQEcBAABAgAGBQJSgS86AAoJEFjIrFwIi8fJpY4H/R2gke1A1p9UvTwbkaDhgPs/
 u/mkI6aH+ktgvu5QZNprki660uydtc4Ck7y8leeLGYw+ed1Ys559SJhRc/x8jBYZ
 Hh2chnplld0LAjSpdIDTTePArE1xBo4Gz+fT0zc5cVh0leJwOXn92Kx8N5AWD/T3
 gwH4Ok4K1dzZBIls7imM2AM/L1xcApcx3Dl/QpNcoePQtR4yLuPWMUbb3LM8pbUY
 0B6ZVN4GOhtJ84z8HRKnh4uMnBYmhmky6laTlHVa6L+j1fv7aAPCdNbePjIt/Pvj
 HVYB1O/ht73yHw0zGfK6lhoGG8zlu+Q7sgiut9UsGZZfh34+BRKzNTypqJ3ezQo=
 =xc43
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.13-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen updates from Konrad Rzeszutek Wilk:
 "This has tons of fixes and two major features which are concentrated
  around the Xen SWIOTLB library.

  The short <blurb> is that the tracing facility (just one function) has
  been added to SWIOTLB to make it easier to track I/O progress.
  Additionally under Xen and ARM (32 & 64) the Xen-SWIOTLB driver
  "is used to translate physical to machine and machine to physical
  addresses of foreign[guest] pages for DMA operations" (Stefano) when
  booting under hardware without proper IOMMU.

  There are also bug-fixes, cleanups, compile warning fixes, etc.

  The commit times for some of the commits is a bit fresh - that is b/c
  we wanted to make sure we have the Ack's from the ARM folks - which
  with the string of back-to-back conferences took a bit of time.  Rest
  assured - the code has been stewing in #linux-next for some time.

  Features:
   - SWIOTLB has tracing added when doing bounce buffer.
   - Xen ARM/ARM64 can use Xen-SWIOTLB.  This work allows Linux to
     safely program real devices for DMA operations when running as a
     guest on Xen on ARM, without IOMMU support. [*1]
   - xen_raw_printk works with PVHVM guests if needed.

  Bug-fixes:
   - Make memory ballooning work under HVM with large MMIO region.
   - Inform hypervisor of MCFG regions found in ACPI DSDT.
   - Remove deprecated IRQF_DISABLED.
   - Remove deprecated __cpuinit.

  [*1]:
  "On arm and arm64 all Xen guests, including dom0, run with second
   stage translation enabled.  As a consequence when dom0 programs a
   device for a DMA operation is going to use (pseudo) physical
   addresses instead machine addresses.  This work introduces two trees
   to track physical to machine and machine to physical mappings of
   foreign pages.  Local pages are assumed mapped 1:1 (physical address
   == machine address).  It enables the SWIOTLB-Xen driver on ARM and
   ARM64, so that Linux can translate physical addresses to machine
   addresses for dma operations when necessary.  " (Stefano)"

* tag 'stable/for-linus-3.13-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (32 commits)
  xen/arm: pfn_to_mfn and mfn_to_pfn return the argument if nothing is in the p2m
  arm,arm64/include/asm/io.h: define struct bio_vec
  swiotlb-xen: missing include dma-direction.h
  pci-swiotlb-xen: call pci_request_acs only ifdef CONFIG_PCI
  arm: make SWIOTLB available
  xen: delete new instances of added __cpuinit
  xen/balloon: Set balloon's initial state to number of existing RAM pages
  xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved for MCFG areas.
  xen: remove deprecated IRQF_DISABLED
  x86/xen: remove deprecated IRQF_DISABLED
  swiotlb-xen: fix error code returned by xen_swiotlb_map_sg_attrs
  swiotlb-xen: static inline xen_phys_to_bus, xen_bus_to_phys, xen_virt_to_bus and range_straddles_page_boundary
  grant-table: call set_phys_to_machine after mapping grant refs
  arm,arm64: do not always merge biovec if we are running on Xen
  swiotlb: print a warning when the swiotlb is full
  swiotlb-xen: use xen_dma_map/unmap_page, xen_dma_sync_single_for_cpu/device
  xen: introduce xen_dma_map/unmap_page and xen_dma_sync_single_for_cpu/device
  tracing/events: Fix swiotlb tracepoint creation
  swiotlb-xen: use xen_alloc/free_coherent_pages
  xen: introduce xen_alloc/free_coherent_pages
  ...
2013-11-15 13:34:37 +09:00
Kirill A. Shutemov
49076ec2cc mm: dynamically allocate page->ptl if it cannot be embedded to struct page
If split page table lock is in use, we embed the lock into struct page
of table's page.  We have to disable split lock, if spinlock_t is too
big be to be embedded, like when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC
enabled.

This patch add support for dynamic allocation of split page table lock
if we can't embed it to struct page.

page->ptl is unsigned long now and we use it as spinlock_t if
sizeof(spinlock_t) <= sizeof(long), otherwise it's pointer to spinlock_t.

The spinlock_t allocated in pgtable_page_ctor() for PTE table and in
pgtable_pmd_page_ctor() for PMD table.  All other helpers converted to
support dynamically allocated page->ptl.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Kirill A. Shutemov
57c1ffcefb mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS
We're going to introduce split page table lock for PMD level.  Let's
rename existing split ptlock for PTE level to avoid confusion.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Konrad Rzeszutek Wilk
e1d8f62ad4 Merge remote-tracking branch 'stefano/swiotlb-xen-9.1' into stable/for-linus-3.13
* stefano/swiotlb-xen-9.1:
  swiotlb-xen: fix error code returned by xen_swiotlb_map_sg_attrs
  swiotlb-xen: static inline xen_phys_to_bus, xen_bus_to_phys, xen_virt_to_bus and range_straddles_page_boundary
  grant-table: call set_phys_to_machine after mapping grant refs
  arm,arm64: do not always merge biovec if we are running on Xen
  swiotlb: print a warning when the swiotlb is full
  swiotlb-xen: use xen_dma_map/unmap_page, xen_dma_sync_single_for_cpu/device
  xen: introduce xen_dma_map/unmap_page and xen_dma_sync_single_for_cpu/device
  swiotlb-xen: use xen_alloc/free_coherent_pages
  xen: introduce xen_alloc/free_coherent_pages
  arm64/xen: get_dma_ops: return xen_dma_ops if we are running as xen_initial_domain
  arm/xen: get_dma_ops: return xen_dma_ops if we are running as xen_initial_domain
  swiotlb-xen: introduce xen_swiotlb_set_dma_mask
  xen/arm,arm64: enable SWIOTLB_XEN
  xen: make xen_create_contiguous_region return the dma address
  xen/x86: allow __set_phys_to_machine for autotranslate guests
  arm/xen,arm64/xen: introduce p2m
  arm64: define DMA_ERROR_CODE
  arm: make SWIOTLB available

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Conflicts:
	arch/arm/include/asm/dma-mapping.h
	drivers/xen/swiotlb-xen.c

[Conflicts arose b/c "arm: make SWIOTLB available" v8 was in Stefano's
branch, while I had v9 + Ack from Russel. I also fixed up white-space
issues]
2013-11-08 16:10:48 -05:00
Stefano Stabellini
1b65c4e5a9 swiotlb-xen: use xen_alloc/free_coherent_pages
Use xen_alloc_coherent_pages and xen_free_coherent_pages to allocate or
free coherent pages.

We need to be careful handling the pointer returned by
xen_alloc_coherent_pages, because on ARM the pointer is not equal to
phys_to_virt(*dma_handle). In fact virt_to_phys only works for kernel
direct mapped RAM memory.
In ARM case the pointer could be an ioremap address, therefore passing
it to virt_to_phys would give you another physical address that doesn't
correspond to it.

Make xen_create_contiguous_region take a phys_addr_t as start parameter to
avoid the virt_to_phys calls which would be incorrect.

Changes in v6:
- remove extra spaces.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-10-10 13:41:10 +00:00
Stefano Stabellini
69908907b0 xen: make xen_create_contiguous_region return the dma address
Modify xen_create_contiguous_region to return the dma address of the
newly contiguous buffer.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>


Changes in v4:
- use virt_to_machine instead of virt_to_bus.
2013-10-09 16:56:32 +00:00
Konrad Rzeszutek Wilk
b1922a519e xen/mmu: Correct PAT MST setting.
Jan Beulich spotted that the PAT MSR settings in the Xen public
document that "the first (PAT6) column was wrong across the
board, and the column for PAT7 was missing altogether."

This updates it to be in sync.

CC: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2013-09-27 09:04:45 -04:00
Linus Torvalds
01c7cd0ef5 Merge branch 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perparatory x86 kasrl changes from Ingo Molnar:
 "This contains changes from the ongoing KASLR work, by Kees Cook.

  The main changes are the use of a read-only IDT on x86 (which
  decouples the userspace visible virtual IDT address from the physical
  address), and a rework of ELF relocation support, in preparation of
  random, boot-time kernel image relocation."

* 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, relocs: Refactor the relocs tool to merge 32- and 64-bit ELF
  x86, relocs: Build separate 32/64-bit tools
  x86, relocs: Add 64-bit ELF support to relocs tool
  x86, relocs: Consolidate processing logic
  x86, relocs: Generalize ELF structure names
  x86: Use a read-only IDT alias on all CPUs
2013-04-30 08:37:24 -07:00
Linus Torvalds
6c4c4d4bda Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Flush lazy MMU when DEBUG_PAGEALLOC is set
  x86/mm/cpa/selftest: Fix false positive in CPA self test
  x86/mm/cpa: Convert noop to functional fix
  x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
  x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updates
2013-04-14 11:13:24 -07:00
Kees Cook
4eefbe792b x86: Use a read-only IDT alias on all CPUs
Make a copy of the IDT (as seen via the "sidt" instruction) read-only.
This primarily removes the IDT from being a target for arbitrary memory
write attacks, and has the added benefit of also not leaking the kernel
base offset, if it has been relocated.

We already did this on vendor == Intel and family == 5 because of the
F0 0F bug -- regardless of if a particular CPU had the F0 0F bug or
not.  Since the workaround was so cheap, there simply was no reason to
be very specific.  This patch extends the readonly alias to all CPUs,
but does not activate the #PF to #UD conversion code needed to deliver
the proper exception in the F0 0F case except on Intel family 5
processors.

Signed-off-by: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/20130410192422.GA17344@www.outflux.net
Cc: Eric Northup <digitaleric@google.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-11 13:53:19 -07:00
Boris Ostrovsky
511ba86e1d x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
Invoking arch_flush_lazy_mmu_mode() results in calls to
preempt_enable()/disable() which may have performance impact.

Since lazy MMU is not used on bare metal we can patch away
arch_flush_lazy_mmu_mode() so that it is never called in such
environment.

[ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU
  updates" may cause a minor performance regression on
  bare metal.  This patch resolves that performance regression.  It is
  somewhat unclear to me if this is a good -stable candidate. ]

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: http://lkml.kernel.org/r/1364045796-10720-2-git-send-email-konrad.wilk@oracle.com
Tested-by: Josh Boyer <jwboyer@redhat.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> SEE NOTE ABOVE
2013-04-10 11:25:10 -07:00
Konrad Rzeszutek Wilk
b22227944b xen/mmu: On early bootup, flush the TLB when changing RO->RW bits Xen provided pagetables.
Occassionaly on a DL380 G4 the guest would crash quite early with this:

(XEN) d244:v0: unhandled page fault (ec=0003)
(XEN) Pagetable walk from ffffffff84dc7000:
(XEN)  L4[0x1ff] = 00000000c3f18067 0000000000001789
(XEN)  L3[0x1fe] = 00000000c3f14067 000000000000178d
(XEN)  L2[0x026] = 00000000dc8b2067 0000000000004def
(XEN)  L1[0x1c7] = 00100000dc8da067 0000000000004dc7
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 244 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-4.1.3OVM  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    3
(XEN) RIP:    e033:[<ffffffff81263f22>]
(XEN) RFLAGS: 0000000000000216   EM: 1   CONTEXT: pv guest
(XEN) rax: 0000000000000000   rbx: ffffffff81785f88   rcx: 000000000000003f
(XEN) rdx: 0000000000000000   rsi: 00000000dc8da063   rdi: ffffffff84dc7000

The offending code shows it to be a loop writting the value zero
(%rax) in the %rdi (the L4 provided by Xen) register:

   0: 44 00 00             add    %r8b,(%rax)
   3: 31 c0                 xor    %eax,%eax
   5: b9 40 00 00 00       mov    $0x40,%ecx
   a: 66 0f 1f 84 00 00 00 nopw   0x0(%rax,%rax,1)
  11: 00 00
  13: ff c9                 dec    %ecx
  15:* 48 89 07             mov    %rax,(%rdi)     <-- trapping instruction
  18: 48 89 47 08           mov    %rax,0x8(%rdi)
  1c: 48 89 47 10           mov    %rax,0x10(%rdi)

which fails. xen_setup_kernel_pagetable recycles some of the Xen's
page-table entries when it has switched over to its Linux page-tables.

Right before try to clear the page, we  make a hypercall to change
it from _RO to  _RW and that works (otherwise we would hit an BUG()).
And the _RW flag is set for that page:
(XEN)  L1[0x1c7] = 001000004885f067 0000000000004dc7

The error code is 3, so PFEC_page_present and PFEC_write_access, so page is
present (correct), and we tried to write to the page, but a violation
occurred. The one theory is that the the page entries in hardware
(which are cached) are not up to date with what we just set. Especially
as we have just done an CR3 write and flushed the multicalls.

This patch does solve the problem by flusing out the TLB page
entry after changing it from _RO to _RW and we don't hit this
issue anymore.

Fixed-Oracle-Bug: 16243091 [ON OCCASIONS VM START GOES INTO
'CRASH' STATE: CLEAR_PAGE+0X12 ON HP DL380 G4]
Reported-and-Tested-by: Saar Maoz <Saar.Maoz@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-02 14:02:23 -04:00
Konrad Rzeszutek Wilk
d3eb2c89e7 xen/mmu: Move the setting of pvops.write_cr3 to later phase in bootup.
We move the setting of write_cr3 from the early bootup variant
(see git commit 0cc9129d75
"x86-64, xen, mmu: Provide an early version of write_cr3.")
to a more appropiate location.

This new location sets all of the other non-early variants
of pvops calls - and most importantly is before the
alternative_asm mechanism kicks in.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-27 12:06:03 -04:00
Konrad Rzeszutek Wilk
0cc9129d75 x86-64, xen, mmu: Provide an early version of write_cr3.
With commit 8170e6bed4 ("x86, 64bit: Use a #PF handler to materialize
early mappings on demand") we started hitting an early bootup crash
where the Xen hypervisor would inform us that:

    (XEN) d7:v0: unhandled page fault (ec=0000)
    (XEN) Pagetable walk from ffffea000005b2d0:
    (XEN)  L4[0x1d4] = 0000000000000000 ffffffffffffffff
    (XEN) domain_crash_sync called from entry.S
    (XEN) Domain 7 (vcpu#0) crashed on cpu#3:
    (XEN) ----[ Xen-4.2.0  x86_64  debug=n  Not tainted ]----

.. that Xen was unable to context switch back to dom0.

Looking at the calling stack we find:

    [<ffffffff8103feba>] xen_get_user_pgd+0x5a  <--
    [<ffffffff8103feba>] xen_get_user_pgd+0x5a
    [<ffffffff81042d27>] xen_write_cr3+0x77
    [<ffffffff81ad2d21>] init_mem_mapping+0x1f9
    [<ffffffff81ac293f>] setup_arch+0x742
    [<ffffffff81666d71>] printk+0x48

We are trying to figure out whether we need to up-date the user PGD as
well.  Please keep in mind that under 64-bit PV guests we have a limited
amount of rings: 0 for the Hypervisor, and 1 for both the Linux kernel
and user-space.  As such the Linux pvops'fied version of write_cr3
checks if it has to update the user-space cr3 as well.

That clearly is not needed during early bootup.  The recent changes (see
above git commit) streamline the x86 page table allocation to be much
simpler (And also incidentally the #PF handler ends up in spirit being
similar to how the Xen toolstack sets up the initial page-tables).

The fix is to have an early-bootup version of cr3 that just loads the
kernel %cr3.  The later version - which also handles user-page
modifications will be used after the initial page tables have been
setup.

[ hpa: removed a redundant #ifdef and made the new function __init.
  Also note that x86-32 already has such an early xen_write_cr3. ]

Tested-by: "H. Peter Anvin" <hpa@zytor.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1361579812-23709-1-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-22 17:41:22 -08:00
H. Peter Anvin
de65d816aa Merge remote-tracking branch 'origin/x86/boot' into x86/mm2
Coming patches to x86/mm2 require the changes and advanced baseline in
x86/boot.

Resolved Conflicts:
	arch/x86/kernel/setup.c
	mm/nobootmem.c

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:10:15 -08:00
Linus Torvalds
896ea17d3d Features:
- Add necessary infrastructure to make balloon driver work under ARM.
  - Add /dev/xen/privcmd interfaces to work with ARM and PVH.
  - Improve Xen PCIBack wild-card parsing.
  - Add Xen ACPI PAD (Processor Aggregator) support - so can offline/online
    sockets depending on the power consumption.
  - PVHVM + kexec = use an E820_RESV region for the shared region so we don't
    overwrite said region during kexec reboot.
  - Cleanups, compile fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQyJaAAAoJEFjIrFwIi8fJ9DoIALAjj3qaGDimykc/RPSu2MLL
 Tfchb1su0WxSu6fP17jBadq39Qna85UzZATMCyN47k8wB3KoSEW13rqwe7JSsdT/
 SEfZDrlbhNK+JAWJETx+6gq7J7dMwi/tFt4CbwPv/zAHb7C7JyzEgKctbi4Q1e89
 FFMXZru2IWDbaqlcJQjJcE/InhWy5vKW3bY5nR/Bz0RBf9lk/WHbcJwLXirsDcKk
 uMVmPy4yiApX6ZCPbYP5BZvsIFkmLKQEfpmwdzbLGDoL7N1onqq/lgYNgZqPJUkE
 XL1GVBbRGpy+NQr++vUS1NiRyR81EChRO3IrDZwzvNEPqKa9GoF5U1CdRh71R5I=
 =uZQZ
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen updates from Konrad Rzeszutek Wilk:
 - Add necessary infrastructure to make balloon driver work under ARM.
 - Add /dev/xen/privcmd interfaces to work with ARM and PVH.
 - Improve Xen PCIBack wild-card parsing.
 - Add Xen ACPI PAD (Processor Aggregator) support - so can offline/
   online sockets depending on the power consumption.
 - PVHVM + kexec = use an E820_RESV region for the shared region so we
   don't overwrite said region during kexec reboot.
 - Cleanups, compile fixes.

Fix up some trivial conflicts due to the balloon driver now working on
ARM, and there were changes next to the previous work-arounds that are
now gone.

* tag 'stable/for-linus-3.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/PVonHVM: fix compile warning in init_hvm_pv_info
  xen: arm: implement remap interfaces needed for privcmd mappings.
  xen: correctly use xen_pfn_t in remap_domain_mfn_range.
  xen: arm: enable balloon driver
  xen: balloon: allow PVMMU interfaces to be compiled out
  xen: privcmd: support autotranslated physmap guests.
  xen: add pages parameter to xen_remap_domain_mfn_range
  xen/acpi: Move the xen_running_on_version_or_later function.
  xen/xenbus: Remove duplicate inclusion of asm/xen/hypervisor.h
  xen/acpi: Fix compile error by missing decleration for xen_domain.
  xen/acpi: revert pad config check in xen_check_mwait
  xen/acpi: ACPI PAD driver
  xen-pciback: reject out of range inputs
  xen-pciback: simplify and tighten parsing of device IDs
  xen PVonHVM: use E820_Reserved area for shared_info
2012-12-13 14:29:16 -08:00
Ian Campbell
7892f6928d xen: correctly use xen_pfn_t in remap_domain_mfn_range.
For Xen on ARM a PFN is 64 bits so we need to use the appropriate
type here.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: include the necessary header,
     Reported-by: Fengguang Wu <fengguang.wu@intel.com> ]
2012-11-29 12:59:19 +00:00
Ian Campbell
9a032e393a xen: add pages parameter to xen_remap_domain_mfn_range
Also introduce xen_unmap_domain_mfn_range. These are the parts of
Mukesh's "xen/pvh: Implement MMU changes for PVH" which are also
needed as a baseline for ARM privcmd support.

The original patch was:

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

This derivative is also:

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
2012-11-29 12:57:36 +00:00
Yinghai Lu
6f80b68e9e x86, mm, Xen: Remove mapping_pagetable_reserve()
Page table area are pre-mapped now after
	x86, mm: setup page table in top-down
	x86, mm: Remove early_memremap workaround for page table accessing on 64bit

mapping_pagetable_reserve is not used anymore, so remove it.

Also remove operation in mask_rw_pte(), as modified allow_low_page
always return pages that are already mapped, moreover
xen_alloc_pte_init, xen_alloc_pmd_init, etc, will mark the page RO
before hooking it into the pagetable automatically.

-v2: add changelog about mask_rw_pte() from Stefano.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-27-git-send-email-yinghai@kernel.org
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:26 -08:00
Konrad Rzeszutek Wilk
95a7d76897 xen/mmu: Use Xen specific TLB flush instead of the generic one.
As Mukesh explained it, the MMUEXT_TLB_FLUSH_ALL allows the
hypervisor to do a TLB flush on all active vCPUs. If instead
we were using the generic one (which ends up being xen_flush_tlb)
we end up making the MMUEXT_TLB_FLUSH_LOCAL hypercall. But
before we make that hypercall the kernel will IPI all of the
vCPUs (even those that were asleep from the hypervisor
perspective). The end result is that we needlessly wake them
up and do a TLB flush when we can just let the hypervisor
do it correctly.

This patch gives around 50% speed improvement when migrating
idle guest's from one host to another.

Oracle-bug: 14630170

CC: stable@vger.kernel.org
Tested-by:  Jingjie Jiang <jingjie.jiang@oracle.com>
Suggested-by:  Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-31 12:38:31 -04:00
Linus Torvalds
ccff9b1db6 Feature:
- Register a pfn_is_ram helper to speed up reading of /proc/vmcore.
 Bug-fixes:
  - Three pvops call for Xen were undefined causing BUG_ONs.
  - Add a quirk so that the shutdown watches (used by kdump) are not used with older Xen (3.4).
  - Fix ungraceful state transition for the HVC console.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQeBKDAAoJEFjIrFwIi8fJDpwH/3nPBH82pJVxdLPBnmJhWuJR
 voSPP0m9i69w/mc7wHtiRwK4lRMAUidgS77iBZkIT2cY0/NYvOKKBlMUitkYJFlK
 dTVqr9O4iQcuG2yQk8+mXxC6NLH1VKOnSIyhqRswrePoBKzoHi/x7Y462a+tbxa9
 lGBHT9/SqeYXyItRfkdfmAXFZcqIJqLRXEwRMvbky1U3s2QGy7CdIQgra0zWF+t1
 ashNpaEBpH9Jy60VSpQtMpx8hWxd0W2NirNu+nACtTE5/MeuiBvKlPdEPC/rUbdJ
 c5j5VYLjSxPCheY0sajK6pxKgHdfiqmMRlutzMVj3Egwilb0LBxv1018gRFzBu8=
 =/qCG
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen fixes from Konrad Rzeszutek Wilk:
 "This has four bug-fixes and one tiny feature that I forgot to put
  initially in my tree due to oversight.

  The feature is for kdump kernels to speed up the /proc/vmcore reading.
  There is a ram_is_pfn helper function that the different platforms can
  register for.  We are now doing that.

  The bug-fixes cover some embarrassing struct pv_cpu_ops variables
  being set to NULL on Xen (but not baremetal).  We had a similar issue
  in the past with {write|read}_msr_safe and this fills the three
  missing ones.  The other bug-fix is to make the console output (hvc)
  be capable of dealing with misbehaving backends and not fall flat on
  its face.  Lastly, a quirk for older XenBus implementations that came
  with an ancient v3.4 hypervisor (so RHEL5 based) - reading of certain
  non-existent attributes just hangs the guest during bootup - so we
  take precaution of not doing that on such older installations.

  Feature:
   - Register a pfn_is_ram helper to speed up reading of /proc/vmcore.
  Bug-fixes:
   - Three pvops call for Xen were undefined causing BUG_ONs.
   - Add a quirk so that the shutdown watches (used by kdump) are not
     used with older Xen (3.4).
   - Fix ungraceful state transition for the HVC console."

* tag 'stable/for-linus-3.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/pv-on-hvm kexec: add quirk for Xen 3.4 and shutdown watches.
  xen/bootup: allow {read|write}_cr8 pvops call.
  xen/bootup: allow read_tscp call for Xen PV guests.
  xen pv-on-hvm: add pfn_is_ram helper for kdump
  xen/hvc: handle backend CLOSED without CLOSING
2012-10-12 22:20:28 +09:00
Konstantin Khlebnikov
314e51b985 mm: kill vma flag VM_RESERVED and mm->reserved_vm counter
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes reserved_vm counter from mm_struct.  Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.

Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:19 +09:00
Olaf Hering
34b6f01a79 xen pv-on-hvm: add pfn_is_ram helper for kdump
Register pfn_is_ram helper speed up reading /proc/vmcore in the kdump
kernel. See commit message of 997c136f51 ("fs/proc/vmcore.c: add hook
to read_from_oldmem() to check for non-ram pages") for details.

It makes use of a new hvmop HVMOP_get_mem_type which was introduced in
xen 4.2 (23298:26413986e6e0) and backported to 4.1.1.

The new function is currently only enabled for reading /proc/vmcore.
Later it will be used also for the kexec kernel. Since that requires
more changes in the generic kernel make it static for the time being.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-04 11:30:30 -04:00
Konrad Rzeszutek Wilk
98104c3480 Merge branch 'stable/128gb.v5.1' into stable/for-linus-3.7
* stable/128gb.v5.1:
  xen/mmu: If the revector fails, don't attempt to revector anything else.
  xen/p2m: When revectoring deal with holes in the P2M array.
  xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
  xen/mmu: Remove from __ka space PMD entries for pagetables.
  xen/mmu: Copy and revector the P2M tree.
  xen/p2m: Add logic to revector a P2M tree to use __va leafs.
  xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
  xen/mmu: For 64-bit do not call xen_map_identity_early
  xen/mmu: use copy_page instead of memcpy.
  xen/mmu: Provide comments describing the _ka and _va aliasing issue
  xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything.
  Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and "xen/x86: Use memblock_reserve for sensitive areas."
  xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain.
  xen/x86: Use memblock_reserve for sensitive areas.
  xen/p2m: Fix the comment describing the P2M tree.

Conflicts:
	arch/x86/xen/mmu.c

The pagetable_init is the old xen_pagetable_setup_done and xen_pagetable_setup_start
rolled in one.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-09-12 11:18:57 -04:00