mirror of
https://github.com/torvalds/linux.git
synced 2024-12-26 12:52:30 +00:00
Documentation: net: Document resilient next-hop groups
Add a document describing the principles behind resilient next-hop groups, and some notes about how to configure and offload them. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is contained in:
parent
177cb7876d
commit
87f2c6716f
@ -76,6 +76,7 @@ Contents:
|
||||
netdevices
|
||||
netfilter-sysctl
|
||||
netif-msg
|
||||
nexthop-group-resilient
|
||||
nf_conntrack-sysctl
|
||||
nf_flowtable
|
||||
openvswitch
|
||||
|
293
Documentation/networking/nexthop-group-resilient.rst
Normal file
293
Documentation/networking/nexthop-group-resilient.rst
Normal file
@ -0,0 +1,293 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=========================
|
||||
Resilient Next-hop Groups
|
||||
=========================
|
||||
|
||||
Resilient groups are a type of next-hop group that is aimed at minimizing
|
||||
disruption in flow routing across changes to the group composition and
|
||||
weights of constituent next hops.
|
||||
|
||||
The idea behind resilient hashing groups is best explained in contrast to
|
||||
the legacy multipath next-hop group, which uses the hash-threshold
|
||||
algorithm, described in RFC 2992.
|
||||
|
||||
To select a next hop, hash-threshold algorithm first assigns a range of
|
||||
hashes to each next hop in the group, and then selects the next hop by
|
||||
comparing the SKB hash with the individual ranges. When a next hop is
|
||||
removed from the group, the ranges are recomputed, which leads to
|
||||
reassignment of parts of hash space from one next hop to another. RFC 2992
|
||||
illustrates it thus::
|
||||
|
||||
+-------+-------+-------+-------+-------+
|
||||
| 1 | 2 | 3 | 4 | 5 |
|
||||
+-------+-+-----+---+---+-----+-+-------+
|
||||
| 1 | 2 | 4 | 5 |
|
||||
+---------+---------+---------+---------+
|
||||
|
||||
Before and after deletion of next hop 3
|
||||
under the hash-threshold algorithm.
|
||||
|
||||
Note how next hop 2 gave up part of the hash space in favor of next hop 1,
|
||||
and 4 in favor of 5. While there will usually be some overlap between the
|
||||
previous and the new distribution, some traffic flows change the next hop
|
||||
that they resolve to.
|
||||
|
||||
If a multipath group is used for load-balancing between multiple servers,
|
||||
this hash space reassignment causes an issue that packets from a single
|
||||
flow suddenly end up arriving at a server that does not expect them. This
|
||||
can result in TCP connections being reset.
|
||||
|
||||
If a multipath group is used for load-balancing among available paths to
|
||||
the same server, the issue is that different latencies and reordering along
|
||||
the way causes the packets to arrive in the wrong order, resulting in
|
||||
degraded application performance.
|
||||
|
||||
To mitigate the above-mentioned flow redirection, resilient next-hop groups
|
||||
insert another layer of indirection between the hash space and its
|
||||
constituent next hops: a hash table. The selection algorithm uses SKB hash
|
||||
to choose a hash table bucket, then reads the next hop that this bucket
|
||||
contains, and forwards traffic there.
|
||||
|
||||
This indirection brings an important feature. In the hash-threshold
|
||||
algorithm, the range of hashes associated with a next hop must be
|
||||
continuous. With a hash table, mapping between the hash table buckets and
|
||||
the individual next hops is arbitrary. Therefore when a next hop is deleted
|
||||
the buckets that held it are simply reassigned to other next hops::
|
||||
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
|1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
v v v v
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
|1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
|
||||
Before and after deletion of next hop 3
|
||||
under the resilient hashing algorithm.
|
||||
|
||||
When weights of next hops in a group are altered, it may be possible to
|
||||
choose a subset of buckets that are currently not used for forwarding
|
||||
traffic, and use those to satisfy the new next-hop distribution demands,
|
||||
keeping the "busy" buckets intact. This way, established flows are ideally
|
||||
kept being forwarded to the same endpoints through the same paths as before
|
||||
the next-hop group change.
|
||||
|
||||
Algorithm
|
||||
---------
|
||||
|
||||
In a nutshell, the algorithm works as follows. Each next hop deserves a
|
||||
certain number of buckets, according to its weight and the number of
|
||||
buckets in the hash table. In accordance with the source code, we will call
|
||||
this number a "wants count" of a next hop. In case of an event that might
|
||||
cause bucket allocation change, the wants counts for individual next hops
|
||||
are updated.
|
||||
|
||||
Next hops that have fewer buckets than their wants count, are called
|
||||
"underweight". Those that have more are "overweight". If there are no
|
||||
overweight (and therefore no underweight) next hops in the group, it is
|
||||
said to be "balanced".
|
||||
|
||||
Each bucket maintains a last-used timer. Every time a packet is forwarded
|
||||
through a bucket, this timer is updated to current jiffies value. One
|
||||
attribute of a resilient group is then the "idle timer", which is the
|
||||
amount of time that a bucket must not be hit by traffic in order for it to
|
||||
be considered "idle". Buckets that are not idle are busy.
|
||||
|
||||
After assigning wants counts to next hops, an "upkeep" algorithm runs. For
|
||||
buckets:
|
||||
|
||||
1) that have no assigned next hop, or
|
||||
2) whose next hop has been removed, or
|
||||
3) that are idle and their next hop is overweight,
|
||||
|
||||
upkeep changes the next hop that the bucket references to one of the
|
||||
underweight next hops. If, after considering all buckets in this manner,
|
||||
there are still underweight next hops, another upkeep run is scheduled to a
|
||||
future time.
|
||||
|
||||
There may not be enough "idle" buckets to satisfy the updated wants counts
|
||||
of all next hops. Another attribute of a resilient group is the "unbalanced
|
||||
timer". This timer can be set to 0, in which case the table will stay out
|
||||
of balance until idle buckets do appear, possibly never. If set to a
|
||||
non-zero value, the value represents the period of time that the table is
|
||||
permitted to stay out of balance.
|
||||
|
||||
With this in mind, we update the above list of conditions with one more
|
||||
item. Thus buckets:
|
||||
|
||||
4) whose next hop is overweight, and the amount of time that the table has
|
||||
been out of balance exceeds the unbalanced timer, if that is non-zero,
|
||||
|
||||
\... are migrated as well.
|
||||
|
||||
Offloading & Driver Feedback
|
||||
----------------------------
|
||||
|
||||
When offloading resilient groups, the algorithm that distributes buckets
|
||||
among next hops is still the one in SW. Drivers are notified of updates to
|
||||
next hop groups in the following three ways:
|
||||
|
||||
- Full group notification with the type
|
||||
``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
|
||||
created and buckets populated for the first time.
|
||||
|
||||
- Single-bucket notifications of the type
|
||||
``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of
|
||||
individual migrations within an already-established group.
|
||||
|
||||
- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
|
||||
is sent before the group is replaced, and is a way for the driver to veto
|
||||
the group before committing anything to the HW.
|
||||
|
||||
Some single-bucket notifications are forced, as indicated by the "force"
|
||||
flag in the notification. Those are used for the cases where e.g. the next
|
||||
hop associated with the bucket was removed, and the bucket really must be
|
||||
migrated.
|
||||
|
||||
Non-forced notifications can be overridden by the driver by returning an
|
||||
error code. The use case for this is that the driver notifies the HW that a
|
||||
bucket should be migrated, but the HW discovers that the bucket has in fact
|
||||
been hit by traffic.
|
||||
|
||||
A second way for the HW to report that a bucket is busy is through the
|
||||
``nexthop_res_grp_activity_update()`` API. The buckets identified this way
|
||||
as busy are treated as if traffic hit them.
|
||||
|
||||
Offloaded buckets should be flagged as either "offload" or "trap". This is
|
||||
done through the ``nexthop_bucket_set_hw_flags()`` API.
|
||||
|
||||
Netlink UAPI
|
||||
------------
|
||||
|
||||
Resilient Group Replacement
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
|
||||
same manner as other multipath groups. The following changes apply to the
|
||||
attributes passed in the netlink message:
|
||||
|
||||
=================== =========================================================
|
||||
``NHA_GROUP_TYPE`` Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
|
||||
``NHA_RES_GROUP`` A nest that contains attributes specific to resilient
|
||||
groups.
|
||||
=================== =========================================================
|
||||
|
||||
``NHA_RES_GROUP`` payload:
|
||||
|
||||
=================================== =========================================
|
||||
``NHA_RES_GROUP_BUCKETS`` Number of buckets in the hash table.
|
||||
``NHA_RES_GROUP_IDLE_TIMER`` Idle timer in units of clock_t.
|
||||
``NHA_RES_GROUP_UNBALANCED_TIMER`` Unbalanced timer in units of clock_t.
|
||||
=================================== =========================================
|
||||
|
||||
Next Hop Get
|
||||
^^^^^^^^^^^^
|
||||
|
||||
Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
|
||||
message in exactly the same way as other next hop get requests. The
|
||||
response attributes match the replacement attributes cited above, except
|
||||
``NHA_RES_GROUP`` payload will include the following attribute:
|
||||
|
||||
=================================== =========================================
|
||||
``NHA_RES_GROUP_UNBALANCED_TIME`` How long has the resilient group been out
|
||||
of balance, in units of clock_t.
|
||||
=================================== =========================================
|
||||
|
||||
Bucket Get
|
||||
^^^^^^^^^^
|
||||
|
||||
The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
|
||||
used to request a single bucket. The attributes recognized at get requests
|
||||
are:
|
||||
|
||||
=================== =========================================================
|
||||
``NHA_ID`` ID of the next-hop group that the bucket belongs to.
|
||||
``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket.
|
||||
=================== =========================================================
|
||||
|
||||
``NHA_RES_BUCKET`` payload:
|
||||
|
||||
======================== ====================================================
|
||||
``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
|
||||
======================== ====================================================
|
||||
|
||||
Bucket Dumps
|
||||
^^^^^^^^^^^^
|
||||
|
||||
The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
|
||||
to request a dump of matching buckets. The attributes recognized at dump
|
||||
requests are:
|
||||
|
||||
=================== =========================================================
|
||||
``NHA_ID`` If specified, limits the dump to just the next-hop group
|
||||
with this ID.
|
||||
``NHA_OIF`` If specified, limits the dump to buckets that contain
|
||||
next hops that use the device with this ifindex.
|
||||
``NHA_MASTER`` If specified, limits the dump to buckets that contain
|
||||
next hops that use a device in the VRF with this ifindex.
|
||||
``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket.
|
||||
=================== =========================================================
|
||||
|
||||
``NHA_RES_BUCKET`` payload:
|
||||
|
||||
======================== ====================================================
|
||||
``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
|
||||
that contain the next hop with this ID.
|
||||
======================== ====================================================
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
To illustrate the usage, consider the following commands::
|
||||
|
||||
# ip nexthop add id 1 via 192.0.2.2 dev eth0
|
||||
# ip nexthop add id 2 via 192.0.2.3 dev eth0
|
||||
# ip nexthop add id 10 group 1/2 type resilient \
|
||||
buckets 8 idle_timer 60 unbalanced_timer 300
|
||||
|
||||
The last command creates a resilient next-hop group. It will have 8 buckets
|
||||
(which is unusually low number, and used here for demonstration purposes
|
||||
only), each bucket will be considered idle when no traffic hits it for at
|
||||
least 60 seconds, and if the table remains out of balance for 300 seconds,
|
||||
it will be forcefully brought into balance.
|
||||
|
||||
Changing next-hop weights leads to change in bucket allocation::
|
||||
|
||||
# ip nexthop replace id 10 group 1,3/2 type resilient
|
||||
|
||||
This can be confirmed by looking at individual buckets::
|
||||
|
||||
# ip nexthop bucket show id 10
|
||||
id 10 index 0 idle_time 5.59 nhid 1
|
||||
id 10 index 1 idle_time 5.59 nhid 1
|
||||
id 10 index 2 idle_time 8.74 nhid 2
|
||||
id 10 index 3 idle_time 8.74 nhid 2
|
||||
id 10 index 4 idle_time 8.74 nhid 1
|
||||
id 10 index 5 idle_time 8.74 nhid 1
|
||||
id 10 index 6 idle_time 8.74 nhid 1
|
||||
id 10 index 7 idle_time 8.74 nhid 1
|
||||
|
||||
Note the two buckets that have a shorter idle time. Those are the ones that
|
||||
were migrated after the next-hop replace command to satisfy the new demand
|
||||
that next hop 1 be given 6 buckets instead of 4.
|
||||
|
||||
Netdevsim
|
||||
---------
|
||||
|
||||
The netdevsim driver implements a mock offload of resilient groups, and
|
||||
exposes debugfs interface that allows marking individual buckets as busy.
|
||||
For example, the following will mark bucket 23 in next-hop group 10 as
|
||||
active::
|
||||
|
||||
# echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity
|
||||
|
||||
In addition, another debugfs interface can be used to configure that the
|
||||
next attempt to migrate a bucket should fail::
|
||||
|
||||
# echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
|
||||
|
||||
Besides serving as an example, the interfaces that netdevsim exposes are
|
||||
useful in automated testing, and
|
||||
``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
|
||||
them to test the algorithm.
|
Loading…
Reference in New Issue
Block a user