linux

Author	SHA1	Message	Date
Yunsheng Lin	d44f9b631f	net: hns3: Cleanup for struct that used to send cmd to firmware The hclge_tm module has already added _cmd to the end of struct that used to send cmd to firmware. This will help us finding the endian issues. This patch adds the _cmd to the end of struct that used to send cmd to firmware in hclge_main module. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-09 09:46:54 -07:00
Yunsheng Lin	5392902d33	net: hns3: Consistently using GENMASK in hns3 driver This patch uses GENMASK to generate bit mask whenever possible in hns3 driver. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-09 09:46:54 -07:00
Yunsheng Lin	56cf68c730	net: hns3: Cleanup indentation for Kconfig in the the hisilicon folder This patch fixes a few indentation for Kconfig file in the hisilicon folder. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-09 09:46:54 -07:00
Yunsheng Lin	9780cb97af	net: hns3: Add hns3_get_handle macro in hns3 driver There are many places that will need to get the handle of netdev, so add a macro to get the handle of netdev. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-09 09:46:54 -07:00
Yunsheng Lin	5bca3b94df	net: hns3: Cleanup for shifting true in hns3 driver This patch fixes a shifting true in hclge_main module. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-09 09:46:53 -07:00
Christos Gkekas	c49c777f9c	qed: Delete redundant check on dcb_app priority dcb_app priority is unsigned thus checking whether it is less than zero is redundant. Signed-off-by: Christos Gkekas <chris.gekas@gmail.com> Acked-By: Tomer Tayar <Tomer.Tayar@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:21:02 -07:00
Christos Gkekas	c778c32118	net: ethernet: stmmac: Clean up dead code Many macros in dwmac-ipq806x are unused and should be removed. Moreover gmac->id is an unsigned variable and therefore checking whether it is less than zero is redundant. Signed-off-by: Christos Gkekas <chris.gekas@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:19:07 -07:00
David S. Miller	bf6a119eea	Merge branch 'ipv6_dev_get_saddr-rcu' Eric Dumazet says: ==================== ipv6: ipv6_dev_get_saddr() rcu works Sending IPv6 udp packets on non connected sockets is quite slow, because ipv6_dev_get_saddr() is still using an rwlock and silly references games on ifa. Tested: $ ./super_netperf 16 -H 4444::555:0786 -l 2000 -t UDP_STREAM -- -m 100 & [1] 12527 Performance is boosted from 2.02 Mpps to 4.28 Mpps Kernel profile before patches : 22.62% [kernel] [k] _raw_read_lock_bh 7.04% [kernel] [k] refcount_sub_and_test 6.56% [kernel] [k] ipv6_get_saddr_eval 5.67% [kernel] [k] _raw_read_unlock_bh 5.34% [kernel] [k] __ipv6_dev_get_saddr 4.95% [kernel] [k] refcount_inc_not_zero 4.03% [kernel] [k] __ip6addrlbl_match 3.70% [kernel] [k] _raw_spin_lock 3.44% [kernel] [k] ipv6_dev_get_saddr 3.24% [kernel] [k] ip6_pol_route 3.06% [kernel] [k] refcount_add_not_zero 2.30% [kernel] [k] __local_bh_enable_ip 1.81% [kernel] [k] mlx4_en_xmit 1.20% [kernel] [k] __ip6_append_data 1.12% [kernel] [k] __ip6_make_skb 1.11% [kernel] [k] __dev_queue_xmit 1.06% [kernel] [k] l3mdev_master_ifindex_rcu Kernel profile after patches : 11.36% [kernel] [k] ip6_pol_route 7.65% [kernel] [k] _raw_spin_lock 7.16% [kernel] [k] __ipv6_dev_get_saddr 6.49% [kernel] [k] ipv6_get_saddr_eval 6.04% [kernel] [k] refcount_add_not_zero 3.34% [kernel] [k] __ip6addrlbl_match 2.62% [kernel] [k] __dev_queue_xmit 2.37% [kernel] [k] mlx4_en_xmit 2.26% [kernel] [k] dst_release 1.89% [kernel] [k] __ip6_make_skb 1.87% [kernel] [k] __ip6_append_data 1.86% [kernel] [k] udpv6_sendmsg 1.86% [kernel] [k] ip6t_do_table 1.64% [kernel] [k] ipv6_dev_get_saddr 1.64% [kernel] [k] find_match 1.51% [kernel] [k] l3mdev_master_ifindex_rcu 1.24% [kernel] [k] ipv6_addr_label ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:16:31 -07:00
Eric Dumazet	cc429c8f6f	ipv6: avoid cache line dirtying in ipv6_dev_get_saddr() By extending the rcu section a bit, we can avoid these very expensive in6_ifa_put()/in6_ifa_hold() calls done in __ipv6_dev_get_saddr() and ipv6_dev_get_saddr() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:16:31 -07:00
Eric Dumazet	f59c031e91	ipv6: __ipv6_dev_get_saddr() rcu conversion Callers hold rcu_read_lock(), so we do not need the rcu_read_lock()/rcu_read_unlock() pair. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:16:30 -07:00
Eric Dumazet	24ba333b2c	ipv6: ipv6_chk_prefix() rcu conversion Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:16:30 -07:00
Eric Dumazet	47e26941f7	ipv6: ipv6_chk_custom_prefix() rcu conversion Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:16:30 -07:00
Eric Dumazet	d9bf82c2f6	ipv6: ipv6_count_addresses() rcu conversion Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:16:30 -07:00
Eric Dumazet	8ef802aa8e	ipv6: prepare RCU lookups for idev->addr_list inet6_ifa_finish_destroy() already uses kfree_rcu() to free inet6_ifaddr structs. We need to use proper list additions/deletions in order to allow readers to use RCU instead of idev->lock rwlock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:16:30 -07:00
David S. Miller	a42317785c	Merge branch 'bridge-neigh-msg-proxy-and-flood-suppression-support' Roopa Prabhu says: ==================== bridge: neigh msg proxy and flood suppression support This series implements arp and nd suppression in the bridge driver for ethernet vpns. It implements rfc7432, section 10 https://tools.ietf.org/html/rfc7432#section-10 for ethernet VPN deployments. It is similar to the existing BR_PROXYARP* flags but has a few semantic differences to conform to EVPN standard. Unlike the existing flags, this new flag suppresses flood of all neigh discovery packets (arp and nd) to tunnel ports. Supports both vlan filtering and non-vlan filtering bridges. In case of EVPN, it is mainly used to avoid flooding of arp and nd packets to tunnel ports like vxlan. v2 : rebase to latest + address some optimization feedback from Nikolay. v3 : fix kbuild reported build errors with CONFIG_INET off v4 : simplify port flag mask as suggested by stephen v5 : address some feedback from Toshiaki v6 : some v5 cleanups in nd suppress (keep it consistent with arp suppress) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:12:04 -07:00
Roopa Prabhu	ed842faeb2	bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports This patch avoids flooding and proxies ndisc packets for BR_NEIGH_SUPPRESS ports. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:12:04 -07:00
Roopa Prabhu	057658cb33	bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports This patch avoids flooding and proxies arp packets for BR_NEIGH_SUPPRESS ports. Moves existing br_do_proxy_arp to br_do_proxy_suppress_arp to support both proxy arp and neigh suppress. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:12:04 -07:00
Roopa Prabhu	821f1b21ca	bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to suppress arp and nd flood on bridge ports. It implements rfc7432, section 10. https://tools.ietf.org/html/rfc7432#section-10 for ethernet VPN deployments. It is similar to the existing BR_PROXYARP* flags but has a few semantic differences to conform to EVPN standard. Unlike the existing flags, this new flag suppresses flood of all neigh discovery packets (arp and nd) to tunnel ports. Supports both vlan filtering and non-vlan filtering bridges. In case of EVPN, it is mainly used to avoid flooding of arp and nd packets to tunnel ports like vxlan. This patch adds netlink and sysfs support to set this bridge port flag. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:12:04 -07:00
Eric Dumazet	951f788a80	ipv6: fix a BUG in rt6_get_pcpu_route() Ido reported following splat and provided a patch. [ 122.221814] BUG: using smp_processor_id() in preemptible [00000000] code: sshd/2672 [ 122.221845] caller is debug_smp_processor_id+0x17/0x20 [ 122.221866] CPU: 0 PID: 2672 Comm: sshd Not tainted 4.14.0-rc3-idosch-next-custom #639 [ 122.221880] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016 [ 122.221893] Call Trace: [ 122.221919] dump_stack+0xb1/0x10c [ 122.221946] ? _atomic_dec_and_lock+0x124/0x124 [ 122.221974] ? ___ratelimit+0xfe/0x240 [ 122.222020] check_preemption_disabled+0x173/0x1b0 [ 122.222060] debug_smp_processor_id+0x17/0x20 [ 122.222083] ip6_pol_route+0x1482/0x24a0 ... I believe we can simplify this code path a bit, since we no longer hold a read_lock and need to release it to avoid a dead lock. By disabling BH, we make sure we'll prevent code re-entry and rt6_get_pcpu_route()/rt6_make_pcpu_route() run on the same cpu. Fixes: `66f5d6ce53` ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Reported-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:09:00 -07:00
David S. Miller	51a0c00c6b	Merge tag 'mlx5-updates-2017-10-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== Mellanox, mlx5 updates 2017-10-06 This series includes some shared code updates for kernel 4.15 to both net-next and rdma-next trees. The series includes mlx5 low level flow steering updates and optimizations to support firmware command parallelism for flow steering requests from Maor Gottlieb and two other small fixes from Matan and Maor. One fix from Matan adds error handling for when the destination list of the flow steering rule is full. Maor introduced a patch to avoid NULL pointer dereference on steering cleanup. Then Some refactoring patches needed by the series for code sharing purposes. and split the Flow Table Entry (FTE) and Flow Group (FG) creation code to two parts: 1) Object allocation - allocate the steering node and initialize its resources. 2) The firmware command execution. This change will give us the ability to take write lock on the parent node (e.g. FG for FTE creating) only on the software data struct allocation and creation part of the procedure where the synchronization is really required, and will allow us to execute multiple firmware commands simultaneously and overcome the firmware bottleneck. Refactor the locking scheme of the mlx5 core flow steering as follows: 1) Replace the mutex lock with readers-writers semaphore and take the write lock only when necessary (e.g. allocating a new flow table entry index or adding a node to the parent's children list). When we try to find a suitable child in the parent's children list (e.g. search for flow group with the same match_criteria of the rule) then we only take the read lock. 2) Add versioning mechanism - each steering entity (FT, FG, FTE, DST) will have an incremental version. The version is increased when the entity is changed (e.g. when a new FTE was added to FG - the FG's version is increased). Versioning is used in order to determine if the last traverse of an entity's children is valid or a rescan under write lock is required. Last patch adds FGs and FTEs memory pool, It is useful because these objects are not small and could be allocated/deallocated many times. This support improves the insertion rate of steering rules from ~5k/sec to ~40k/sec. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 21:07:11 -07:00
David S. Miller	28f50eb209	Merge branch 'hv_netvsc-TCP-hash-level' Haiyang Zhang says: ==================== hv_netvsc: support changing TCP hash level The patch set simplifies the existing hash level switching code for UDP. It also adds the support for changing TCP hash level. So users can switch between L3 an L4 hash levels for TCP and UDP. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 10:11:01 -07:00
Haiyang Zhang	78005d91c1	hv_netvsc: Update netvsc Document for TCP hash level setting Update Documentation/networking/netvsc.txt for TCP hash level setting and related info. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 10:11:01 -07:00
Haiyang Zhang	0518ec4f9d	hv_netvsc: Add ethtool handler to set and get TCP hash levels The patch supports the options to switch TCP hash level between L3 and L4 by ethtool command. TCP over IPv4 and v6 can be set differently. The default hash level is L4. We currently only allow switching TX hash level from within the guests. For example, for TCP over IPv4 on eth0: To include TCP port numbers in hashing: ethtool -N eth0 rx-flow-hash tcp4 sdfn To exclude TCP port numbers in hashing: ethtool -N eth0 rx-flow-hash tcp4 sd To show TCP hash level: ethtool -n eth0 rx-flow-hash tcp4 Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 10:11:01 -07:00
Haiyang Zhang	486e398105	hv_netvsc: Change the hash level variable to bit flags This simplifies the logic and make it easier to add more options. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 10:11:01 -07:00
David S. Miller	c1b85a193a	Merge branch 'mlxsw-more-extack' Jiri Pirko says: ==================== mlxsw: Add more extack error reporting Ido says: Add error messages to VLAN and bridge enslavements to help users understand why the enslavement failed. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 10:07:21 -07:00
Ido Schimmel	9b63ef88d3	mlxsw: spectrum: Propagate extack further for bridge enslavements The code that actually takes care of bridge offload introduces a few more non-trivial constraints with regards to bridge enslavements. Propagate extack there to indicate the reason. $ ip link add link enp1s0np1 name enp1s0np1.10 type vlan id 10 $ ip link add link enp1s0np1 name enp1s0np1.20 type vlan id 20 $ ip link add name br0 type bridge $ ip link set dev enp1s0np1.10 master br0 $ ip link set dev enp1s0np1.20 master br0 Error: spectrum: Can not bridge VLAN uppers of the same port. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 10:07:21 -07:00
Ido Schimmel	c1f2c6d025	mlxsw: spectrum: Add extack for VLAN enslavements Similar to physical ports, enslavement of VLAN devices can also fail. Use extack to indicate why the enslavement failed. $ ip link add link enp1s0np1 name enp1s0np1.10 type vlan id 10 $ ip link add name bond0 type bond mode 802.3ad $ ip link set dev enp1s0np1.10 master bond0 Error: spectrum: VLAN devices only support bridge and VRF uppers. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-08 10:07:21 -07:00
David S. Miller	c9f766bc6e	Merge branch 'bpf-obj-name-misc' Martin KaFai Lau says: ==================== bpf: Misc improvements and a new usage on bpf obj name The first two patches make improvements on the bpf obj name. The last patch adds the prog name to kallsyms. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:29:40 +01:00
Martin KaFai Lau	368211fb92	bpf: Append prog->aux->name in bpf_get_prog_name() This patch makes the bpf_prog's name available in kallsyms. The new format is bpf_prog_tag[_name]. Sample kallsyms from running selftests/bpf/test_progs: [root@arch-fb-vm1 ~]# egrep ' bpf_prog_[0-9a-fA-F]{16}' /proc/kallsyms ffffffffa0048000 t bpf_prog_dabf0207d1992486_test_obj_id ffffffffa0038000 t bpf_prog_a04f5eef06a7f555__123456789ABCDE ffffffffa0050000 t bpf_prog_a04f5eef06a7f555 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:29:39 +01:00
Martin KaFai Lau	067cae4777	bpf: Use char in prog and map name Instead of u8, use char for prog and map name. It can avoid the userspace tool getting compiler's signess warning. The bpf_prog_aux, bpf_map, bpf_attr, bpf_prog_info and bpf_map_info are changed. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:29:39 +01:00
Martin KaFai Lau	473d97343f	bpf: Change bpf_obj_name_cpy() to better ensure map's name is init by 0 During get_info_by_fd, the prog/map name is memcpy-ed. It depends on the prog->aux->name and map->name to be zero initialized. bpf_prog_aux is easy to guarantee that aux->name is zero init. The name in bpf_map may be harder to be guaranteed in the future when new map type is added. Hence, this patch makes bpf_obj_name_cpy() to always zero init the prog/map name. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:29:39 +01:00
William Tu	f192970de8	ip_gre: check packet length and mtu correctly in erspan tx Similarly to early patch for erspan_xmit(), the ARPHDR_ETHER device is the length of the whole ether packet. So skb->len should subtract the dev->hard_header_len. Fixes: `1a66a836da` ("gre: add collect_md mode to ERSPAN tunnel") Fixes: `84e54fe0a5` ("gre: introduce native tunnel support for ERSPAN") Signed-off-by: William Tu <u9012063@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: David Laight <David.Laight@aculab.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:17:21 +01:00
Lin Zhang	548ec11470	net: phonet: mark phonet_protocol as const The phonet_protocol structs don't need to be written by anyone and so can be marked as const. Signed-off-by: Lin Zhang <xiaolou4617@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:15:08 +01:00
Lin Zhang	64237470dd	net: phonet: mark header_ops as const Signed-off-by: Lin Zhang <xiaolou4617@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:15:08 +01:00
David S. Miller	a1d753d290	Merge branch 'bpf-perf-time-helpers' Yonghong Song says: ==================== bpf: add two helpers to read perf event enabled/running time Hardware pmu counters are limited resources. When there are more pmu based perf events opened than available counters, kernel will multiplex these events so each event gets certain percentage (but not 100%) of the pmu time. In case that multiplexing happens, the number of samples or counter value will not reflect the case compared to no multiplexing. This makes comparison between different runs difficult. Typically, the number of samples or counter value should be normalized before comparing to other experiments. The typical normalization is done like: normalized_num_samples = num_samples * time_enabled / time_running normalized_counter_value = counter_value * time_enabled / time_running where time_enabled is the time enabled for event and time_running is the time running for event since last normalization. This patch set implements two helper functions. The helper bpf_perf_event_read_value reads counter/time_enabled/time_running for perf event array map. The helper bpf_perf_prog_read_value read counter/time_enabled/time_running for bpf prog with type BPF_PROG_TYPE_PERF_EVENT. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:05:58 +01:00
Yonghong Song	81b9cf8028	bpf: add a test case for helper bpf_perf_prog_read_value The bpf sample program trace_event is enhanced to use the new helper to print out enabled/running time. Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:05:57 +01:00
Yonghong Song	4bebdc7a85	bpf: add helper bpf_perf_prog_read_value This patch adds helper bpf_perf_prog_read_cvalue for perf event based bpf programs, to read event counter and enabled/running time. The enabled/running time is accumulated since the perf event open. The typical use case for perf event based bpf program is to attach itself to a single event. In such cases, if it is desirable to get scaling factor between two bpf invocations, users can can save the time values in a map, and use the value from the map and the current value to calculate the scaling factor. Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:05:57 +01:00
Yonghong Song	020a32d958	bpf: add a test case for helper bpf_perf_event_read_value The bpf sample program tracex6 is enhanced to use the new helper to read enabled/running time as well. Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:05:57 +01:00
Yonghong Song	908432ca84	bpf: add helper bpf_perf_event_read_value for perf event array map Hardware pmu counters are limited resources. When there are more pmu based perf events opened than available counters, kernel will multiplex these events so each event gets certain percentage (but not 100%) of the pmu time. In case that multiplexing happens, the number of samples or counter value will not reflect the case compared to no multiplexing. This makes comparison between different runs difficult. Typically, the number of samples or counter value should be normalized before comparing to other experiments. The typical normalization is done like: normalized_num_samples = num_samples * time_enabled / time_running normalized_counter_value = counter_value * time_enabled / time_running where time_enabled is the time enabled for event and time_running is the time running for event since last normalization. This patch adds helper bpf_perf_event_read_value for kprobed based perf event array map, to read perf counter and enabled/running time. The enabled/running time is accumulated since the perf event open. To achieve scaling factor between two bpf invocations, users can can use cpu_id as the key (which is typical for perf array usage model) to remember the previous value and do the calculation inside the bpf program. Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:05:57 +01:00
Yonghong Song	97562633bc	bpf: perf event change needed for subsequent bpf helpers This patch does not impact existing functionalities. It contains the changes in perf event area needed for subsequent bpf_perf_event_read_value and bpf_perf_prog_read_value helpers. Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 23:05:57 +01:00
Amine Kherbouche	bdc476413d	ip_tunnel: add mpls over gre support This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel API by simply adding ipgre_tunnel_encap_(add\|del)_mpls_ops() and the new tunnel type TUNNEL_ENCAP_MPLS. Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:38:31 +01:00
David S. Miller	2af48d430a	Merge branch 'fib6-rcu' Wei Wang says: ==================== ipv6: replace rwlock with rcu and spinlock in fib6 table Currently, fib6 table is protected by rwlock. During route lookup, reader lock is taken and during route insertion, deletion or modification, writer lock is taken. This is a very inefficient implementation because the fastpath always has to do the operation to grab the reader lock. According to my latest syn flood test on an iota ivybridage machine with 2 10G mlx nics bonded together, each with 8 rx queues on 2 NUMA nodes, and with the upstream net-next kernel: ipv4 stack can handle around 4.2Mpps ipv6 stack can handle around 1.3Mpps In order to close the gap of the performance number between ipv4 and ipv6 stack, this patch series tries to get rid of the usage of the rwlock and replace it with rcu and spinlock protection. This will greatly speed up the fastpath performance as it only needs to hold rcu which is much less expensive than grabbing the reader lock. It also makes ipv6 fib implementation more consistent with ipv4. In order to be able to replace the current rwlock with rcu and spinlock, some preparation work is needed: Patch 1-8 introduces a per-route hash table (protected by rcu and a different spinlock) to store all cached routes created by pmtu and ip redirect under its main route. This makes the main fib6 tree only contain static routes. Patch 9-14 prepares all the reader path to be ready to tolerate concurrent writer. Patch 15 finally does the rwlock to rcu and spinlock conversion. Patch 16 takes care of rt6_stats. After this patch series, in the same syn flood test, ipv6 stack can now handle around 3.5Mpps compared to previous 1.3Mpps in my test setup. After this patch series, there are still some improvements that should be done in ipv6 stack: 1. During route lookup, dst_use() is called everytime on the selected route to update dst->__use and dst->lastuse. This dirties the cacheline and causes extra cacheline miss and should be avoided. 2. when no route is found in the current table, net->ip6.ipv6_null_entry is used and refcnt is taken on it. As there is no pcpu cache for this specific route, frequent change on the refcnt for this route causes quite some cacheline misses. And to make things worse, if CONFIG_IPV6_MULTIPLE_TABLES is defined, output path route lookup always starts with local table first and guarantees to hit net->ipv6.ip6_null_entry before continuing to do lookup in the main table. These operations on net->ipv6.ip6_null_entry could potentially be avoided. 3. ipv6 input path route lookup grabs refcnt on dst. This is different from ipv4. We could potentially change this behavior to let ipv6 input path route lookup not to grab refcnt on dst. However, it does not give us much performance boost as we currently have pcpu route cache for input path as well in ipv6. But this work probably is still worth doing to unify ipv6 and ipv4 route lookup behavior. The above issues will be addressed separately after this patch series has been accepted. This is a joint work with Martin KaFai Lau and Eric Dumazet. And many many thanks to them for their inspiring ideas and big big code review efforts. ==================== Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:22:59 +01:00
Wei Wang	81eb8447da	ipv6: take care of rt6_stats Currently, most of the rt6_stats are not hooked up correctly. As the last part of this patch series, hook up all existing rt6_stats and add one new stat fib_rt_uncache to indicate the number of routes in the uncached list. For details of the stats, please refer to the comments added in include/net/ip6_fib.h. Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified under a lock. So atomic_t is used for them. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:22:58 +01:00
Wei Wang	66f5d6ce53	ipv6: replace rwlock with rcu and spinlock in fib6_table With all the preparation work before, we are now ready to replace rwlock with rcu and spinlock in fib6_table. That means now all fib6_node in fib6_table are protected by rcu. And when freeing fib6_node, call_rcu() is used to wait for the rcu grace period before releasing the memory. When accessing fib6_node, corresponding rcu APIs need to be used. And all previous sessions protected by the write lock will now be protected by the spin lock per table. All previous sessions protected by read lock will now be protected by rcu_read_lock(). A couple of things to note here: 1. As part of the work of replacing rwlock with rcu, the linked list of fn->leaf now has to be rcu protected as well. So both fn->leaf and rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are used when manipulating them. 2. For fn->rr_ptr, first of all, it also needs to be rcu protected now and is tagged with __rcu and rcu APIs are used in corresponding places. Secondly, fn->rr_ptr is changed in rt6_select() which is a reader thread. This makes the issue a bit complicated. We think a valid solution for it is to let rt6_select() grab the tb6_lock if it decides to change it. As it is not in the normal operation and only happens when there is no valid neighbor cache for the route, we think the performance impact should be low. 3. fib6_walk_continue() has to be called with tb6_lock held even in the route dumping related functions, e.g. inet6_dump_fib(), fib6_tables_dump() and ipv6_route_seq_ops. It is because fib6_walk_continue() makes modifications to the walker structure, and so are fib6_repair_tree() and fib6_del_route(). In order to do proper syncing between them, we need to let fib6_walk_continue() hold the lock. We may be able to do further improvement on the way we do the tree walk to get rid of the need for holding the spin lock. But not for now. 4. When fib6_del_route() removes a route from the tree, we no longer mark rt->dst.rt6_next to NULL to make simultaneous reader be able to further traverse the list with rcu. However, rt->dst.rt6_next is only valid within this same rcu period. No one should access it later. 5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be performed before we publish this route (either by linking it to fn->leaf or insert it in the list pointed by fn->leaf) just to be safe because as soon as we publish the route, some read thread will be able to access it. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:22:58 +01:00
Wei Wang	17ecf590b3	ipv6: add key length check into rt6_select() After rwlock is replaced with rcu and spinlock, fib6_lookup() could potentially return an intermediate node if other thread is doing fib6_del() on a route which is the only route on the node so that fib6_repair_tree() will be called on this node and potentially assigns fn->leaf to the its child's fn->leaf. In order to detect this situation in rt6_select(), we have to check if fn->fn_bit is consistent with the key length stored in the route. And depending on if the fn is in the subtree or not, the key is either rt->rt6i_dst or rt->rt6i_src. If any inconsistency is found, that means the node no longer holds valid routes in it. So net->ipv6.ip6_null_entry is returned. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:22:58 +01:00
Wei Wang	8d1040e808	ipv6: check fn->leaf before it is used If rwlock is replaced with rcu and spinlock, it is possible that the reader thread will see fn->leaf as NULL in the following scenarios: 1. fib6_add() is in progress and we have already inserted a new node but not yet inserted the route. 2. fib6_del_route() is in progress and we have already set fn->leaf to NULL but not yet freed the node because of rcu grace period. This patch makes sure all the reader threads check fn->leaf first before using it. And together with later patch to grab rcu_read_lock() and rcu_dereference() fn->leaf, it makes sure reader threads are safe when accessing fn->leaf. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:22:58 +01:00
Wei Wang	bbd63f06d1	ipv6: update fn_sernum after route is inserted to tree fib6_add() logic currently calls fib6_add_1() to figure out what node should be used for the newly added route and then call fib6_add_rt2node() to insert the route to the node. And during the call of fib6_add_1(), fn_sernum is updated for all nodes that share the same prefix as the new route. This does not have issue in the current code because reader thread will not be able to access the tree while writer thread is inserting new route to it. However, it is not the case once we transition to use RCU. Reader thread could potentially see the new fn_sernum before the new route is inserted. As a result, reader thread's route lookup will return a stale route with the new fn_sernum. In order to solve this issue, we remove all the update of fn_sernum in fib6_add_1(), and instead, introduce a new function that updates fn_sernum for all related nodes and call this functions once the route is successfully inserted to the tree. Also, smp_wmb() is used after a route is successfully inserted into the fib tree and right before the updated of fn->sernum. And smp_rmb() is used right after fn->sernum is accessed in rt6_get_cookie_safe(). This is to guarantee that when the reader thread sees the new fn->sernum, the new route is already inserted in the tree in memory. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:22:58 +01:00
Wei Wang	d3843fe5fd	ipv6: replace dst_hold() with dst_hold_safe() in routing code With rwlock, it is safe to call dst_hold() in the read thread because read thread is guaranteed to be separated from write thread. However, after we replace rwlock with rcu, it is no longer safe to use dst_hold(). A dst might already have been deleted but is waiting for the rcu grace period to pass before freeing the memory when a read thread is trying to do dst_hold(). This could potentially cause double free issue. So this commit replaces all dst_hold() with dst_hold_safe() in all read thread to avoid this double free issue. And in order to make the code more compact, a new function ip6_hold_safe() is introduced. It calls dst_hold_safe() first, and if that fails, it will either fall back to hold and return net->ipv6.ip6_null_entry or set rt to NULL according to the caller's need. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:22:58 +01:00
Wei Wang	51e398e86d	ipv6: don't release rt->rt6i_pcpu memory during rt6_release() After rwlock is replaced with rcu and spinlock, route lookup can happen simultanously with route deletion. This patch removes the call to free_percpu(rt->rt6i_pcpu) from rt6_release() to avoid the race condition between rt6_release() and rt6_get_pcpu_route(). And as free_percpu(rt->rt6i_pcpu) is already called in ip6_dst_destroy() after the rcu grace period, it is safe to do this change. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:22:58 +01:00
Wei Wang	a94b9367e0	ipv6: grab rt->rt6i_ref before allocating pcpu rt After rwlock is replaced with rcu and spinlock, ip6_pol_route() will be called with only rcu held. That means rt6 route deletion could happen simultaneously with rt6_make_pcpu_rt(). This could potentially cause memory leak if rt6_release() is called right before rt6_make_pcpu_rt() on the same route. This patch grabs rt->rt6i_ref safely before calling rt6_make_pcpu_rt() to make sure rt6_release() will not get triggered while rt6_make_pcpu_rt() is in progress. And rt6_release() is called after rt6_make_pcpu_rt() is finished. Note: As we are incrementing rt->rt6i_ref in ip6_pol_route(), there is a very slim chance that fib6_purge_rt() will be triggered unnecessarily when deleting a route if ip6_pol_route() running on another thread picks this route as well and tries to make pcpu cache for it. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-07 21:22:58 +01:00

1 2 3 4 5 ...

707573 Commits