This is a new driver to enable userspace networking on VMBus.
It is based largely on the similar driver that already exists
for PCI, and earlier work done by Brocade to support DPDK.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch adds sysfs interface to dynamically bind new UUID values
to existing VMBus device. This is useful for generic UIO driver to
act similar to uio_pci_generic.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To get prepared to CPU offlining support we need co change the way how we
unbind clockevent devices. As one CPU may go online/offline multiple times
we need to bind it in hv_synic_init() and unbind it in hv_synic_cleanup().
There is an additional corner case: when we unload the module completely we
need to switch to some other clockevent mechanism before stopping VMBus or
we will hang. We can't call hv_synic_cleanup() before unloading VMBus as
we won't be able to send UNLOAD request and get a response so
hv_synic_clockevents_cleanup() has to live. Luckily, we can always call
clockevents_unbind_device(), even if it wasn't bound before and there is
no issue if we call it twice.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
"kernel BUG at drivers/hv/channel_mgmt.c:350!" is observed when hv_vmbus
module is unloaded. BUG_ON() was introduced in commit 85d9aa7051
("Drivers: hv: vmbus: add an API vmbus_hvsock_device_unregister()") as
vmbus_free_channels() codepath was apparently forgotten.
Fixes: 85d9aa7051 ("Drivers: hv: vmbus: add an API vmbus_hvsock_device_unregister()")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changed it to HV_UNKNOWN
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signal the host when we determine the host is to be signaled -
on th read path. The currrent code determines the need to signal in the
ringbuffer code and actually issues the signal elsewhere. This can result
in the host viewing this interrupt as spurious since the host may also
poll the channel. Make the necessary adjustments.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signal the host when we determine the host is to be signaled.
The currrent code determines the need to signal in the ringbuffer
code and actually issues the signal elsewhere. This can result
in the host viewing this interrupt as spurious since the host may also
poll the channel. Make the necessary adjustments.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
One of the factors that can result in the host concluding that a given
guest in mounting a DOS attack is if the guest generates interrupts
to the host when the host is not expecting it. If these "spurious"
interrupts reach a certain rate, the host can throttle the guest to
minimize the impact. The host computation of the "expected number
of interrupts" is strictly based on the ring transitions. Until
the host logic is fixed, base the guest logic to interrupt solely
on the ring state.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Balloon driver was only printing the size of the info blob and not the
actual content. This fixes it so that the info blob (max page count as
configured in Hyper-V) is printed out.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Increase the timeout of backup operations. When system is under I/O load,
it needs more time to freeze. These timeout values should also match the
host timeout values more closely.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Adding log messages to help troubleshoot error cases and transaction
handling.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Added logging to help troubleshoot common ballooning, hot add,
and versioning issues.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the guest does not support memory hotplugging, it should respond to
the host with zero pages added and successful result code. This signals
to the host that hotplugging is not supported and the host will avoid
sending future hot-add requests.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We should intentionally declare the protocols to use for every known host
and default to using the latest protocol if the host is unknown or new.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I discovered that at least WS2016TP5 host has 60 seconds timeout for the
ICMSGTYPE_NEGOTIATE message so we need to lower guest's timeout a little
bit to make sure we always respond in time. Let's make it 55 seconds.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit 9a56e5d6a0ba ("Drivers: hv: make VMBus bus ids persistent")
the name of vmbus devices in sysfs changed to be (in 4.9-rc1):
/sys/bus/vmbus/vmbus-6aebe374-9ba0-11e6-933c-00259086b36b
The prefix ("vmbus-") is redundant and differs from how PCI is
represented in sysfs. Therefore simplify to:
/sys/bus/vmbus/6aebe374-9ba0-11e6-933c-00259086b36b
Please merge this before 4.9 is released and the old format
has to live forever.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The host keeps sending heartbeat packets independent of the
guest responding to them. Even though we respond to the heartbeat messages at
interrupt level, we can have situations where there maybe multiple heartbeat
messages pending that have not been responded to. For instance this occurs when the
VM is paused and the host continues to send the heartbeat messages.
Address this issue by draining and responding to all
the heartbeat messages that maybe pending.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The auto incremented counter is not being used anymore, get rid of it.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some tools use bus ids to identify devices and they count on the fact
that these ids are persistent across reboot. This may be not true for
VMBus as we use auto incremented counter from alloc_channel() as such
id. Switch to using if_instance from channel offer, this id is supposed
to be persistent.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Under stress, we have seen allocation failure in time synch code. Avoid
this dynamic allocation.
Signed-off-by: Vivek Yadav <vyadav@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This enables support for more accurate TimeSync v4 samples when hosted
under Windows Server 2016 and newer hosts.
The new time samples include a "vmreferencetime" field that represents
the guest's TSC value when the host generated its time sample. This value
lets the guest calculate the latency in receiving the time sample. The
latency is added to the sample host time prior to updating the clock.
Signed-off-by: Alex Ng <alexng@messages.microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Only the first 50 samples after boot were being used to discipline the
clock. After the first 50 samples, any samples from the host were ignored
and the guest clock would eventually drift from the host clock.
This patch allows TimeSync-enabled guests to continuously synchronize the
clock with the host clock, even after the first 50 samples.
Signed-off-by: Alex Ng <alexng@messages.microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Different Windows host versions may reuse the same protocol version when
negotiating the TimeSync, Shutdown, and Heartbeat protocols. We should only
refer to the protocol version to avoid conflating the two concepts.
Signed-off-by: Alex Ng <alexng@messages.microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This fixes a sparse warning because hyperv_mmio resources
are only used in this one file and should be static.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hyper-V host will send a VSS_OP_HOT_BACKUP request to check if guest is
ready for a live backup/snapshot. The driver should respond to the check
only if the daemon is running and listening to requests. This allows the
host to fallback to standard snapshots in case the VSS daemon is not
running.
Signed-off-by: Alex Ng <alexng@messages.microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Multiple VSS_OP_HOT_BACKUP requests may arrive in quick succession, even
though the host only signals once. The driver wass handling the first
request while ignoring the others in the ring buffer. We should poll the
VSS channel after handling a request to continue processing other requests.
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce a mechanism to control how channels will be affinitized. We will
support two policies:
1. HV_BALANCED: All performance critical channels will be dstributed
evenly amongst all the available NUMA nodes. Once the Node is assigned,
we will assign the CPU based on a simple round robin scheme.
2. HV_LOCALIZED: Only the primary channels are distributed across all
NUMA nodes. Sub-channels will be in the same NUMA node as the primary
channel. This is the current behaviour.
The default policy will be the HV_BALANCED as it can minimize the remote
memory access on NUMA machines with applications that span NUMA nodes.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With wrap around mappings for ring buffers we can always use a single
memcpy() to do the job.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Make it possible to always use a single memcpy() or to provide a direct
link to a packet on the ring buffer by creating virtual mapping for two
copies of the ring buffer with vmap(). Utilize currently empty
hv_ringbuffer_cleanup() to do the unmap.
While on it, replace sizeof(struct hv_ring_buffer) check
in hv_ringbuffer_init() with BUILD_BUG_ON() as it is a compile time check.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In preparation for doing wrap around mappings for ring buffers cleanup
vmbus_open() function:
- check that ring sizes are PAGE_SIZE aligned (they are for all in-kernel
drivers now);
- kfree(open_info) on error only after we kzalloc() it (not an issue as it
is valid to call kfree(NULL);
- rename poorly named labels;
- use alloc_pages() instead of __get_free_pages() as we need struct page
pointer for future.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reports for available memory should use the si_mem_available() value.
The previous freeram value does not include available page cache memory.
Signed-off-by: Alex Ng <alexng@messages.microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
lockdep reports possible circular locking dependency when udev is used
for memory onlining:
systemd-udevd/3996 is trying to acquire lock:
((memory_chain).rwsem){++++.+}, at: [<ffffffff810d137e>] __blocking_notifier_call_chain+0x4e/0xc0
but task is already holding lock:
(&dm_device.ha_region_mutex){+.+.+.}, at: [<ffffffffa015382e>] hv_memory_notifier+0x5e/0xc0 [hv_balloon]
...
which is probably a false positive because we take and release
ha_region_mutex from memory notifier chain depending on the arg. No real
deadlocks were reported so far (though I'm not really sure about
preemptible kernels...) but we don't really need to hold the mutex
for so long. We use it to protect ha_region_list (and its members) and the
num_pages_onlined counter. None of these operations require us to sleep
and nothing is slow, switch to using spinlock with interrupts disabled.
While on it, replace list_for_each -> list_for_each_entry as we actually
need entries in all these cases, drop meaningless list_empty() checks.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With the recently introduced in-kernel memory onlining
(MEMORY_HOTPLUG_DEFAULT_ONLINE) these is no point in waiting for pages
to come online in the driver and we can get rid of the waiting.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I'm observing the following hot add requests from the WS2012 host:
hot_add_req: start_pfn = 0x108200 count = 330752
hot_add_req: start_pfn = 0x158e00 count = 193536
hot_add_req: start_pfn = 0x188400 count = 239616
As the host doesn't specify hot add regions we're trying to create
128Mb-aligned region covering the first request, we create the 0x108000 -
0x160000 region and we add 0x108000 - 0x158e00 memory. The second request
passes the pfn_covered() check, we enlarge the region to 0x108000 -
0x190000 and add 0x158e00 - 0x188200 memory. The problem emerges with the
third request as it starts at 0x188400 so there is a 0x200 gap which is
not covered. As the end of our region is 0x190000 now it again passes the
pfn_covered() check were we just adjust the covered_end_pfn and make it
0x188400 instead of 0x188200 which means that we'll try to online
0x188200-0x188400 pages but these pages were never assigned to us and we
crash.
We can't react to such requests by creating new hot add regions as it may
happen that the whole suggested range falls into the previously identified
128Mb-aligned area so we'll end up adding nothing or create intersecting
regions and our current logic doesn't allow that. Instead, create a list of
such 'gaps' and check for them in the page online callback.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Windows 2012 (non-R2) does not specify hot add region in hot add requests
and the logic in hot_add_req() is trying to find a 128Mb-aligned region
covering the request. It may also happen that host's requests are not 128Mb
aligned and the created ha_region will start before the first specified
PFN. We can't online these non-present pages but we don't remember the real
start of the region.
This is a regression introduced by the commit 5abbbb75d7 ("Drivers: hv:
hv_balloon: don't lose memory when onlining order is not natural"). While
the idea of keeping the 'moving window' was wrong (as there is no guarantee
that hot add requests come ordered) we should still keep track of
covered_start_pfn. This is not a revert, the logic is different.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On Hyper-V, performance critical channels use the monitor
mechanism to signal the host when the guest posts mesages
for the host. This mechanism minimizes the hypervisor intercepts
and also makes the host more efficient in that each time the
host is woken up, it processes a batch of messages as opposed to
just one. The goal here is improve the throughput and this is at
the expense of increased latency.
Implement a mechanism to let the client driver decide if latency
is important.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current delay between retries is unnecessarily high and is negatively
affecting the time it takes to boot the system.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For synthetic NIC channels, enable explicit signaling policy as netvsc wants to
explicitly control when the host is to be signaled.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a rare race when we remove an entry from the global list
hv_context.percpu_list[cpu] in hv_process_channel_removal() ->
percpu_channel_deq() -> list_del(): at this time, if vmbus_on_event() ->
process_chn_event() -> pcpu_relid2channel() is trying to query the list,
we can get the kernel fault.
Similarly, we also have the issue in the code path: vmbus_process_offer() ->
percpu_channel_enq().
We can resolve the issue by disabling the tasklet when updating the list.
The patch also moves vmbus_release_relid() to a later place where
the channel has been removed from the per-cpu and the global lists.
Reported-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Background: userspace daemons registration protocol for Hyper-V utilities
drivers has two steps:
1) daemon writes its own version to kernel
2) kernel reads it and replies with module version
at this point we consider the handshake procedure being completed and we
do hv_poll_channel() transitioning the utility device to HVUTIL_READY
state. At this point we're ready to handle messages from kernel.
When hvutil_transport is in HVUTIL_TRANSPORT_CHARDEV mode we have a
single buffer for outgoing message. hvutil_transport_send() puts to this
buffer and till the buffer is cleared with hvt_op_read() returns -EFAULT
to all consequent calls. Host<->guest protocol guarantees there is no more
than one request at a time and we will not get new requests till we reply
to the previous one so this single message buffer is enough.
Now to the race. When we finish negotiation procedure and send kernel
module version to userspace with hvutil_transport_send() it goes into the
above mentioned buffer and if the daemon is slow enough to read it from
there we can get a collision when a request from the host comes, we won't
be able to put anything to the buffer so the request will be lost. To
solve the issue we need to know when the negotiation is really done (when
the version message is read by the daemon) and transition to HVUTIL_READY
state after this happens. Implement a callback on read to support this.
Old style netlink communication is not affected by the change, we don't
really know when these messages are delivered but we don't have a single
message buffer there.
Reported-by: Barry Davis <barry_davis@stormagic.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
vmbus_teardown_gpadl() can result in infinite wait when it is called on 5
second timeout in vmbus_open(). The issue is caused by the fact that gpadl
teardown operation won't ever succeed for an opened channel and the timeout
isn't always enough. As a guest, we can always trust the host to respond to
our request (and there is nothing we can do if it doesn't).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In some cases create_gpadl_header() allocates submessages but we never
free them.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We use messagecount only once in vmbus_establish_gpadl() to check if
it is safe to iterate through the submsglist. We can just initialize
the list header in all cases in create_gpadl_header() instead.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we crash from NMI context (e.g. after NMI injection from host when
'sysctl -w kernel.unknown_nmi_panic=1' is set) we hit
kernel BUG at mm/vmalloc.c:1530!
as vfree() is denied. While the issue could be solved with in_nmi() check
instead I opted for skipping vfree on all sorts of crashes to reduce the
amount of work which can cause consequent crashes. We don't really need to
free anything on crash.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Hyper-V Linux Integration Services use the VMBus implementation for
communication with the Hypervisor. VMBus registers its own interrupt
handler that completely bypasses the common Linux interrupt handling.
This implies that the interrupt entropy collector is not triggered.
This patch adds the interrupt entropy collection callback into the VMBus
interrupt handler function.
Cc: stable@kernel.org
Signed-off-by: Stephan Mueller <stephan.mueller@atsec.com>
Signed-off-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We set host_specified_ha_region = true on certain request but this is a
global state which stays 'true' forever. We need to reset it when we
receive a request where ha_region is not specified. I did not see any
real issues, the bug was found by code inspection.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we iterate through all HA regions in handle_pg_range() we have an
assumption that all these regions are sorted in the list and the
'start_pfn >= has->end_pfn' check is enough to find the proper region.
Unfortunately it's not the case with WS2016 where host can hot-add regions
in a different order. We end up modifying the wrong HA region and crashing
later on pages online. Modify the check to make sure we found the region
we were searching for while iterating. Fix the same check in pfn_covered()
as well.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kdump keeps biting. Turns out CHANNELMSG_UNLOAD_RESPONSE is always
delivered to the CPU which was used for initial contact or to CPU0
depending on host version. vmbus_wait_for_unload() doesn't account for
the fact that in case we're crashing on some other CPU we won't get the
CHANNELMSG_UNLOAD_RESPONSE message and our wait on the current CPU will
never end.
Do the following:
1) Check for completion_done() in the loop. In case interrupt handler is
still alive we'll get the confirmation we need.
2) Read message pages for all CPUs message page as we're unsure where
CHANNELMSG_UNLOAD_RESPONSE is going to be delivered to. We can race with
still-alive interrupt handler doing the same, add cmpxchg() to
vmbus_signal_eom() to not lose CHANNELMSG_UNLOAD_RESPONSE message.
3) Cleanup message pages on all CPUs. This is required (at least for the
current CPU as we're clearing CPU0 messages now but we may want to bring
up additional CPUs on crash) as new messages won't be delivered till we
consume what's pending. On boot we'll place message pages somewhere else
and we won't be able to read stale messages.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>