Since ee7cd8981e 'virtio: expose added
descriptors immediately.', in virtio balloon virtqueue_get_buf might
now run concurrently with virtqueue_kick. I audited both and this
seems safe in practice but this is not guaranteed by the API.
Additionally, a spurious interrupt might in theory make
virtqueue_get_buf run in parallel with virtqueue_add_buf, which is
racy.
While we might try to protect against spurious callbacks it's
easier to fix the driver: balloon seems to be the only one
(mis)using the API like this, so let's just fix balloon.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed unused var)
This patch adds an option to instantiate guest virtio-mmio devices
basing on a kernel command line (or module) parameter, for example:
virtio_mmio.devices=0x100@0x100b0000:48
Signed-off-by: Pawel Moll <pawel.moll@arm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Current index allocation in virtio is based on a monotonically
increasing variable "index". This means we'll run out of numbers
after a while. E.g. someone crazy doing this in host side.
while(1) {
hot-plug a virtio device
hot-unplug the virito devcie
}
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The remove and freeze functions have a lot of shared code; put it into a
common function that gets called by both.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
restore_common() was used when there were different thaw and freeze PM
callbacks implemented. We removed thaw in commit
f38f8387cb.
restore_common() can be removed and virtballoon_restore() can itself do
the restore ops.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When the balloon module is removed, we deflate the balloon, reclaiming
all the pages that were given to the host. However, we don't update the
config values for the new balloon size, resulting in the host showing
outdated balloon values.
The size update is done after each leak and fill operation, only the
module removal case was left out.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
As reported by David Gibson, current code handles PAGE_SIZE != 4k
completely wrong which can lead to guest memory corruption errors:
- page_to_balloon_pfn is wrong: e.g. on system with 64K page size
it gives the same pfn value for 16 different pages.
- we also need to convert back to linux pfns when we free.
- for each linux page we need to tell host about multiple balloon
pages, but code only adds one pfn to the array.
This patch fixes all that, tested with a 64k ppc64 kernel.
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Although virtio config space fields are usually in guest-native endian,
the spec for the virtio balloon device explicitly states that both fields
in its config space are little-endian.
However, the current virtio_balloon driver does not have a suitable endian
swap for the 'num_pages' field, although it does have one for the 'actual'
field. This patch corrects the bug, adding sparse annotation while we're
at it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
There's no difference in supporting S3 and S4 for virtio devices: the
vqs have to be re-created as the device has to be assumed to be reset at
restore-time. Since S4 already handles this situation, we can directly
use the same code and callbacks for S3 support.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
restore_common() was shared between restore and thaw callbacks. With
thaw gone, we don't need restore_common() anymore.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
The thaw operation was used by the balloon driver, but after the last
commit there's no reason to have separate thaw and restore callbacks.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
There's no reason stats update after restore can't work. If a host
requested for stats, and before servicing the request, the guest entered
S4, upon restore, the stats request can still be processed and sent off
to the host.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:
perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
Signed-off-by: David Howells <dhowells@redhat.com>
commit e562966dba added support for S4 to
the balloon driver. The freeze function did nothing to free the pages,
since reclaiming the pages from the host to immediately give them back
(if S4 was successful) seemed wasteful. Also, if S4 wasn't successful,
the guest would have to re-fill the balloon. On restore, the pages were
supposed to be marked freed and the free page counters were incremented
to reflect the balloon was totally deflated.
However, this wasn't done right. The pages that were earlier taken away
from the guest during a balloon inflation operation were just shown as
used pages after a successful restore from S4. Just a fancy way of
leaking lots of memory.
Instead of trying that, just leak the balloon on freeze and fill it on
restore/thaw paths. This works properly now. The optimisation to not
leak can be added later on after a bit of refactoring of the code.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Use virtio_mb() to make sure the available index to be exposed before
checking the the avail event. Otherwise we may get stale value of
avail event in guest and never kick the host after.
Note: this fixes a bug introduced by ee7cd8981e.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
Note: this fixes a bug introduced recently in
7b21e34fd1.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Handling balloon hibernate / restore is tricky. If the balloon was
inflated before going into the hibernation state, upon resume, the host
will not have any memory of that. Any pages that were passed on to the
host earlier would most likely be invalid, and the host will have to
re-balloon to the previous value to get in the pre-hibernate state.
So the only sane thing for the guest to do here is to discard all the
pages that were put in the balloon. When to discard the pages is the
next question.
One solution is to deflate the balloon just before writing the image to
the disk (in the freeze() PM callback). However, asking for pages from
the host just to discard them immediately after seems wasteful of
resources. Hence, it makes sense to do this by just fudging our
counters soon after wakeup. This means we don't deflate the balloon
before sleep, and also don't put unnecessary pressure on the host.
This also helps in the thaw case: if the freeze fails for whatever
reason, the balloon should continue to remain in the inflated state.
This was tested by issuing 'swapoff -a' and trying to go into the S4
state. That fails, and the balloon stays inflated, as expected. Both
the host and the guest are happy.
Finally, in the restore() callback, we empty the list of pages that were
previously given off to the host, add the appropriate number of pages to
the totalram_pages counter, reset the num_pages counter to 0, and
all is fine.
As a last step, delete the vqs on the freeze callback to prepare for
hibernation, and re-create them in the restore and thaw callbacks to
resume normal operation.
The kthread doesn't race with any operations here, since it's frozen
before the freeze() call and is thawed after the thaw() and restore()
callbacks, so we're safe with that.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The probe and PM restore functions will share this code.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Handle thaw, restore and freeze notifications from the PM core. Expose
these to individual virtio drivers that can quiesce and resume vq
operations. For drivers not implementing the thaw() method, use the
restore method instead.
These functions also save device-specific data so that the device can be
put in pre-suspend state after resume, and disable and enable the PCI
device in the freeze and resume functions, respectively.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The older PM API doesn't have a way to get notifications on hibernate
events. Switch to the newer one that gives us those notifications.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Under the existing #ifdef DEBUG, check that they don't have more than
1/10 of a second between an add_buf() and a
virtqueue_notify()/virtqueue_kick_prepare() call.
We could get false positives on a really busy system, but good for
development.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A virtio driver does virtqueue_add_buf() multiple times before finally
calling virtqueue_kick(); previously we only exposed the added buffers
in the virtqueue_kick() call. This means we don't need a memory
barrier in virtqueue_add_buf(), but it reduces concurrency as the
device (ie. host) can't see the buffers until the kick.
In the unusual (but now possible) case where a driver does add_buf()
and get_buf() without doing a kick, we do need to insert one before
our counter wraps. Otherwise we could wrap num_added, and later on
not realize that we have passed the marker where we should have
kicked.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since we know vq->vring.num is a power of 2, modulus is lazy (it's asserted
in vring_new_virtqueue()).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Based on patch by Christoph for virtio_blk speedup:
Split virtqueue_kick to be able to do the actual notification
outside the lock protecting the virtqueue. This patch was
originally done by Stefan Hajnoczi, but I can't find the
original one anymore and had to recreated it from memory.
Pointers to the original or corrections for the commit message
are welcome.
Stefan's patch was here:
a6d06644e3http://www.spinics.net/lists/linux-virtualization/msg14616.html
Third time's the charm!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Remove wrapper functions. This makes the allocation type explicit in
all callers; I used GPF_KERNEL where it seemed obvious, left it at
GFP_ATOMIC otherwise.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The old documentation is left over from when we used a structure with
strategy pointers.
And move the documentation to the C file as per kernel practice.
Though I disagree...
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Trivial changes to remove forgotten junk, format comments, and correct names.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We were cheating with our barriers; using the smp ones rather than the
real device ones. That was fine, until rpmsg came along, which is
used to talk to a real device (a non-SMP CPU).
Unfortunately, just putting back the real barriers (reverting
d57ed95d) causes a performance regression on virtio-pci. In
particular, Amos reports netbench's TCP_RR over virtio_net CPU
utilization increased up to 35% while throughput went down by up to
14%.
By comparison, this branch is in the noise.
Reference: https://lkml.org/lkml/2011/12/11/22
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
virtio pci device reset actually just does an I/O
write, which in PCI is really posted, that is it
can complete on CPU before the device has received it.
Further, interrupts might have been pending on
another CPU, so device callback might get invoked after reset.
This conflicts with how drivers use reset, which is typically:
reset
unregister
a callback running after reset completed can race with
unregister, potentially leading to use after free bugs.
Fix by flushing out the write, and flushing pending interrupts.
This assumes that device is never reset from
its vq/config callbacks, or in parallel with being
added/removed, document this assumption.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fix this compile error on s390:
CC [M] drivers/virtio/virtio_mmio.o
drivers/virtio/virtio_mmio.c: In function 'vm_get_features':
drivers/virtio/virtio_mmio.c:107:2: error: implicit declaration of function 'writel'
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Pawel Moll <pawel.moll@arm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The forcedeth changes had a conflict with the conversion over
to atomic u64 statistics in net-next.
The libertas cfg.c code had a conflict with the bss reference
counting fix by John Linville in net-next.
Conflicts:
drivers/net/ethernet/nvidia/forcedeth.c
drivers/net/wireless/libertas/cfg.c
Add a new .bus_name to virtio_config_ops then modify virtio_net to
call through to it in an ethtool .get_drvinfo routine to report
bus_info in ethtool -i output which is consistent with other
emulated NICs and the output of lspci.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 31a3ddda16 introduced
a use after free in virtio-pci. The main issue is
that the release method signals removal of the virtio device,
while remove signals removal of the pci device.
For example, on driver removal or hot-unplug,
virtio_pci_release_dev is called before virtio_pci_remove.
We then might get a crash as virtio_pci_remove tries to use the
device freed by virtio_pci_release_dev.
We allocate/free all resources together with the
pci device, so we can leave the release method empty.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
This patch, based on virtio PCI driver, adds support for memory
mapped (platform) virtio device. This should allow environments
like qemu to use virtio-based block & network devices even on
platforms without PCI support.
One can define and register a platform device which resources
will describe memory mapped control registers and "mailbox"
interrupt. Such device can be also instantiated using the Device
Tree node with compatible property equal "virtio,mmio".
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Michael S.Tsirkin <mst@redhat.com>
Signed-off-by: Pawel Moll <pawel.moll@arm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For the MSI but non-per_vq_vector case, the config/change vq
also gets added to the list of vqs that need to process the
MSI interrupt. This is not needed as config has it's own
handler (vp_config_changed). In any case, vring_interrupt()
finds nothing needs to be done on this vq.
I tested this patch by testing the "Fallback:" and "Finally
fall back" cases in vp_find_vqs(). Please review.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Up to now, the module.h header was as hard to keep out as
sunlight. But we are cleaning that up. Fix the virtio users
who simply expect module.h to be there in every C file.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Add support for reporting ring sizes via ethtool -g to the virtio_net
driver.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio has been so far used only in the context of virtualization,
and the virtio Kconfig was sourced directly by the relevant arch
Kconfigs when VIRTUALIZATION was selected.
Now that we start using virtio for inter-processor communications,
we need to source the virtio Kconfig outside of the virtualization
scope too.
Moreover, some architectures might use virtio for both virtualization
and inter-processor communications, so directly sourcing virtio
might yield unexpected results due to conflicting selections.
The simple solution offered by this patch is to always source virtio's
Kconfig in drivers/Kconfig, and remove it from the appropriate arch
Kconfigs. Additionally, a virtio menu entry has been added so virtio
drivers don't show up in the general drivers menu.
This way anyone can use virtio, though it's arguably less accessible
(and neat!) for virtualization users now.
Note: some architectures (mips and sh) seem to have a VIRTUALIZATION
menu merely for sourcing virtio's Kconfig, so that menu is removed too.
Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Add an API that tells the other side that callbacks
should be delayed until a lot of work has been done.
Implement using the new event_idx feature.
Note: it might seem advantageous to let the drivers
ask for a callback after a specific capacity has
been reached. However, as a single head can
free many entries in the descriptor table,
we don't really have a clue about capacity
until get_buf is called. The API is the simplest
to implement at the moment, we'll see what kind of
hints drivers can pass when there's more than one
user of the feature.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Support for the new event idx feature:
1. When enabling interrupts, publish the current avail index
value to the host to get interrupts on the next update.
2. Use the new avail_event feature to reduce the number
of exits from the guest.
Simple test with the simulator:
[virtio]# time ./virtio_test
spurious wakeus: 0x7
real 0m0.169s
user 0m0.140s
sys 0m0.019s
[virtio]# time ./virtio_test --no-event-idx
spurious wakeus: 0x11
real 0m0.649s
user 0m0.295s
sys 0m0.335s
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The virtio balloon driver has a VIRTIO_BALLOON_F_MUST_TELL_HOST
feature bit. Whenever the bit is set, the guest kernel must
always tell the host before we free pages back to the allocator.
Without this feature, we might free a page (and have another
user touch it) while the hypervisor is unprepared for it.
But, if the bit is _not_ set, we are under no obligation to
reverse the order; we're under no obligation to do _anything_.
As of now, qemu-kvm defines the bit, but doesn't set it.
This patch makes the "tell host first" logic the only case. This
should make everybody happy, and reduce the amount of untested or
untestable code in the kernel.
This _also_ means that we don't have to preserve a pfn list
after the pages are freed, which should let us get rid of some
temporary storage (vb->pfns) eventually.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In the case where a virtio-console port is in use (opened by a program)
and a virtio-console device is removed, the port is kept around but all
the virtio-related state is assumed to be gone.
When the port is finally released (close() called), we call
device_destroy() on the port's device. This results in the parent
device's structures to be freed as well. This includes the PCI regions
for the virtio-console PCI device.
Once this is done, however, virtio_pci_release_dev() kicks in, as the
last ref to the virtio device is now gone, and attempts to do
pci_iounmap(pci_dev, vp_dev->ioaddr);
pci_release_regions(pci_dev);
pci_disable_device(pci_dev);
which results in a double-free warning.
Move the code that releases regions, etc., to the virtio_pci_remove()
function, and all that's now left in release_dev is the final freeing of
the vp_dev.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When detaching a buffer from a vq, the avail.idx value should be
decremented as well.
This was noticed by hot-unplugging a virtio console port and then
plugging in a new one on the same number (re-using the vqs which were
just 'disowned'). qemu reported
'Guest moved used index from 0 to 256'
when any IO was attempted on the new port.
CC: stable@kernel.org
Reported-by: juzhang <juzhang@redhat.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We sometimes need to map between the virtio device and
the given pci device. One such use is OS installer that
gets the boot pci device from BIOS and needs to
find the relevant block device. Since it can't,
installation fails.
Instead of creating a top-level devices/virtio-pci
directory, create each device under the corresponding
pci device node. Symlinks to all virtio-pci
devices can be found under the pci driver link in
bus/pci/drivers/virtio-pci/devices, and all virtio
devices under drivers/bus/virtio/devices.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Tested-by: "Daniel P. Berrange" <berrange@redhat.com>
Cc: stable@kernel.org
The sysfs files for virtio produce the wrong format and are missing
the required newline. The output for virtio bus vendor/device should
have the same format as the corresponding entries for PCI devices.
Although this technically changes the ABI for sysfs, these files were
broken to start with!
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can't rely on indirect buffers for capacity
calculations because they need a memory allocation
which might fail. In particular, virtio_net can get
into this situation under stress, and it drops packets
and performs badly.
So return the number of buffers we can guarantee users.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>