devlink functions are in use, so include the related header file.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing calls to mutex_destroy() for two mutexes used
in devlink code.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh says:
====================
net: aquantia: RX performance optimization patches
Here is a set of patches targeting performance improvements
on various platforms and protocols.
Our main target was rx performance on iommu systems, notably
NVIDIA Jetson TX2 and NVIDIA Xavier platforms.
We introduce a page reuse strategy to better deal with iommu dma mapping costs.
With it we see 80-90% page reuse under some test configurations with UDP traffic.
This shows good improvements on other systems with IOMMU hardware, like
AMD Ryzen.
We've also improved TCP LRO configuration parameters, allowing packets to better
coalesce.
Page reuse tests were carried out using iperf3, iperf2, netperf and pktgen.
Mainly on UDP traffic, with various packet lengths.
Jetson TX2, UDP, Default MTU:
  RX Lost Datagrams
    Before: Max: 69%   Min: 68%   Avg: 68.5%
    After:  Max: 41%   Min: 38%   Avg: 39.2%
  Maximum throughput
    Before: 1.27 Gbits/sec
    After:  2.41 Gbits/sec
AMD Ryzen 5 2400G, UDP, Default MTU:
  RX Lost Datagrams
    Before: Max: 12%   Min: 4.5%  Avg: 7.17%
    After:  Max: 6.2%  Min: 2.3%  Avg: 4.26%
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver is now constantly tested in our lab on aarch64 hardware:
Jetson TX2, and Pascal and Xavier Tegra based hardware.
Many of the Tegra SMMU related HW bugs have already been fixed or worked around.
Thus, add ARM64 to Kconfig.
Also add a COMPILE_TEST dependency.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default LRO HW configuration was very conservative: a low number of
descriptors per LRO sequence, a small session timeout, and inefficient
settings in the interrupt generation logic.
Change the maximum number of LRO descriptors from 2 to 16 to
increase performance. Increase the maximum coalescing interval
in HW to 250uS. Tune the HW LRO interrupt generation settings
to prevent HW issues with long LRO sessions.
Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For multigig rates, a 1K ring size is often not enough and causes extra
packet drops in hardware.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This correlates with the default internet MTU. This also allows page
flip/reuse to be activated, since each allocated RX page now serves for
two frags/packets.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this, we refilled the ring even on a single descriptor move.
Under high packet load that caused the page allocation logic to be triggered
too often, making overall ring processing slower.
Moreover, with page buffer reuse implemented, we should give the higher
networking layers a chance to process received packets faster, release
the pages they consumed, and therefore give these pages a higher chance
of being reused.
The RX ring is now refilled only when AQ_CFG_RX_REFILL_THRES or more
descriptors have been processed (32 by default). Under regular traffic this
gives enough time for packets to be consumed and pages to be reused.
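For illustration only (the real logic lives in the driver's RX ring path;
the structure and helpers below are hypothetical, only the threshold value
comes from this patch), the check amounts to:

#define AQ_CFG_RX_REFILL_THRES 32

struct rx_ring {
	unsigned int descs_cleaned;	/* descriptors processed since last refill */
};

static void rx_refill_ring(struct rx_ring *ring)
{
	/* allocate/map fresh buffers for all consumed descriptors */
}

static void rx_maybe_refill(struct rx_ring *ring)
{
	/* Refill only after a whole batch was consumed, so the stack has
	 * time to release pages and the page-reuse path can kick in.
	 */
	if (ring->descs_cleaned < AQ_CFG_RX_REFILL_THRES)
		return;

	rx_refill_ring(ring);
	ring->descs_cleaned = 0;
}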
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We introduce an internal aq_rxpage wrapper over a regular page,
which tracks an extra field: the rxpage offset inside the allocated page.
This offset allows one page to be reused for multiple packets.
When needed (for example for large frame processing), the allocated
page order can be customized. This gives even larger page reuse
efficiency.
page_ref_count is used to track page users. If during rx refill the
underlying page has users, we increase pg_off by the rx frame size,
so the top half of the page is reused.
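A rough sketch of the reuse decision, assuming a hypothetical wrapper layout
(only pg_off and page_ref_count() mirror the description above):

#include <linux/mm.h>

struct aq_rxpage_sketch {
	struct page *page;
	unsigned int pg_off;	/* offset of the next usable chunk */
	unsigned int order;	/* page order, can be raised for large frames */
};

static bool rx_try_reuse_page(struct aq_rxpage_sketch *rxpage,
			      unsigned int frame_size)
{
	unsigned int page_size = PAGE_SIZE << rxpage->order;
	unsigned int next_off = rxpage->pg_off + frame_size;

	if (page_ref_count(rxpage->page) == 1) {
		rxpage->pg_off = 0;		/* sole owner: restart from the top */
		return true;
	}
	if (next_off + frame_size <= page_size) {
		rxpage->pg_off = next_off;	/* stack holds a ref: use the next chunk */
		return true;
	}
	return false;				/* exhausted: caller allocates a new page */
}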
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Atlantic driver used a 14-byte preallocated skb size. That made L3 protocol
processing inefficient because pskb_pull had to fetch all the L3/L4 headers
from extra fragments.
Especially on UDP flows, this caused extra packet drops because the CPU was
overloaded with pskb_pull.
This patch uses eth_get_headlen for skb preallocation.
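As an illustration (not the driver's exact code; the two-argument
eth_get_headlen() form is shown here, newer kernels also pass the
net_device), the preallocation roughly becomes:

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

static struct sk_buff *rx_build_skb_sketch(struct napi_struct *napi,
					   void *frame, unsigned int frame_len)
{
	/* size the linear part from the real header length, not a fixed 14 bytes */
	unsigned int hdr_len = eth_get_headlen(frame, frame_len);
	struct sk_buff *skb = napi_alloc_skb(napi, hdr_len);

	if (!skb)
		return NULL;

	/* pull only the L2..L4 headers into the linear area */
	skb_put_data(skb, frame, hdr_len);
	/* the remaining payload would be attached as page frags here */
	return skb;
}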
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'mlx5-updates-2019-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-03-20
This series includes updates to the mlx5 driver:
1) Compiler warnings cleanup from Saeed Mahameed
2) Parav Pandit simplifies sriov enable/disable
3) Gustavo A. R. Silva removes a redundant assignment
4) Moshe Shemesh adds Geneve tunnel stateless offload support
5) Eli Britstein adds support for the VLAN modify action and
replaces TC VLAN pop and push actions with VLAN modify
Note: This series includes two simple non-mlx5 patches,
1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
and use it in some drivers.
2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
and use it in mlx5 and nfp drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
100GbE Intel Wired LAN Driver Updates 2019-03-22
This series contains updates to ice driver only.
Akeem enables MAC anti-spoofing by default when a new VSI is being
created. Fixes an issue when reclaiming VF resources back to the pool
after reset, by freeing VF resources separately using the first VF
vector index to traverse the list, instead of starting at the last
assigned vectors list. Added support for VF & PF promiscuous mode in
the ice driver. Fixed the PF driver so that it no longer tells the VF it is
"not trusted" when it attempts to add more than its permitted additional MAC
addresses. Altered how the driver gets the VF VSI instances: instead
of using mailbox messages to retrieve VSIs, get them directly via the
VF object in the PF data structure.
Bruce fixes return values to resolve static analysis warnings. Made
whitespace changes to increase readability and reduce code wrapping.
Anirudh cleans up code by removing a function prototype that was never
implemented and removing an unused field in the ice_sched_vsi_info
structure.
Kiran fixes a potential divide by zero issue by adding a check.
Victor cleans up the transmit scheduler by adjusting the stack variable
usage and added/modified debug prints to make them more useful.
Yashaswini updates the driver in VEB mode to ensure that the LAN_EN bit
is set if all the right conditions are met.
Christopher ensures the loopback enable bit is not set for prune switch
rules, since all transmit traffic would be looped back to the internal
switch and dropped.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet says:
====================
tcp: add rx/tx cache to reduce lock contention
On hosts with many cpus we can observe very serious contention
on the spinlocks used in the mm slab layer.
The following can happen quite often:
1) TX path
sendmsg() allocates one (fclone) skb on CPU A, sends a clone.
ACK is received on CPU B, and consumes the skb that was in the retransmit
queue.
2) RX path
network driver allocates skb on CPU C
recvmsg() happens on CPU D, freeing the skb after it has been delivered
to user space.
In both cases, we are hitting the asymmetric alloc/free pattern
for which slab has to drain alien caches. At 8 Mpps,
this represents 16 M alloc/free operations per second and has a huge penalty.
In an interesting experiment, I tried to use a single kmem_cache for all the skbs
(in skb_init(): skbuff_fclone_cache = skbuff_head_cache =
kmem_cache_create("skbuff_fclone_cache", sizeof(struct sk_buff_fclones), ...);)
and most of the contention disappeared, since cpus could better use
their local slab per-cpu cache.
But we can do actually better, in the following patches.
TX : at ACK time, no longer free the skb but put it back in a tcp socket cache,
so that next sendmsg() can reuse it immediately.
RX : at recvmsg() time, do not free the skb but put it in a tcp socket cache
so that it can be freed by the cpu feeding the incoming packets in BH.
This increased the performance of small RPC benchmark by about 10 % on a host
with 112 hyperthreads.
v2 : - Solved a race condition : sk_stream_alloc_skb() to make sure the prior
clone has been freed.
- Really test rps_needed in sk_eat_skb() as claimed.
- Fixed rps_needed use in drivers/net/tun.c
v3: Added a #ifdef CONFIG_RPS, to avoid compile error (kbuild robot)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Often, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.
This means the incoming skb had to be allocated on one cpu,
but freed on another.
This incurs high spinlock contention in the slab layer for small RPCs,
but also a high number of cache line ping pongs for larger packets.
A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.
Moreover, performing the __kfree_skb() in the recvmsg() context
adds latency for user applications, and increases the probability
of trapping them in backlog processing, since the BH handler
might find the socket owned by the user.
This patch, combined with the prior one, increases the RPC
performance by about 10% on servers with a large number of cores.
(A tcp_rr workload with 10,000 flows and 112 threads reaches 9 Mpps
instead of 8 Mpps.)
This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :
- CPU handling the NIC rx interrupts, feeding the receive queue,
and (after this patch) freeing the skbs that were consumed.
- CPU in recvmsg() system call, essentially 100 % busy copying out
data to user space.
Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.
Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.
To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.
20.69% [kernel] [k] queued_spin_lock_slowpath
5.64% [kernel] [k] _raw_spin_lock
3.83% [kernel] [k] syscall_return_via_sysret
3.48% [kernel] [k] __entry_text_start
1.76% [kernel] [k] __netif_receive_skb_core
1.64% [kernel] [k] __fget
For each sendmsg(), we allocate one skb and free it when the ACK packet arrives.
In many cases, ACK packets are handled by other cpus, and this unfortunately
incurs heavy costs in the slab layer.
This patch uses an extra pointer in the socket structure, so that we can try to
reuse the same skb and avoid these expensive costs.
We cache at most one skb per socket so this should be safe as far as
memory pressure is concerned.
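A minimal sketch of the idea, using a stand-in pointer for the new per-socket
field (the actual field name, fclone handling and locking are not shown):

#include <linux/skbuff.h>

static struct sk_buff *tx_skb_alloc_cached(struct sk_buff **cached_skb,
					   unsigned int size, gfp_t gfp)
{
	struct sk_buff *skb = *cached_skb;

	if (skb) {			/* fast path: reuse the last freed skb */
		*cached_skb = NULL;
		return skb;
	}
	return alloc_skb(size, gfp);
}

static void tx_skb_free_cached(struct sk_buff **cached_skb, struct sk_buff *skb)
{
	if (!*cached_skb)		/* keep at most one skb around */
		*cached_skb = skb;
	else
		__kfree_skb(skb);
}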
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We prefer static_branch_unlikely() over static_key_false() these days.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni says:
====================
net: dev: BYPASS for lockless qdisc
This patch series is aimed at improving the xmit performance of lockless qdiscs
in the uncontended scenario.
After the lockless refactor, pfifo_fast can't leverage the BYPASS optimization.
Due to retpolines, the overhead of the avoidable enqueue and dequeue operations
has increased and we see measurable regressions.
The first patch introduces the BYPASS code path for lockless qdisc, and the
second one optimizes such path further. Overall this avoids up to 3 indirect
calls per xmit packet. Detailed performance figures are reported in the 2nd
patch.
v2 -> v3:
- qdisc_is_empty() has a const argument (Eric)
v1 -> v2:
- really use an 'empty' flag instead of 'not_empty', as
suggested by Eric
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With commit c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
pfifo_fast no longer benefits from the TCQ_F_CAN_BYPASS optimization.
Due to retpolines the cost of the enqueue()/dequeue() pair has become
relevant and we observe a measurable regression for the uncontended
scenario when the packet rate is below line rate.
After commit 46b1c18f9d ("net: sched: put back q.qlen into a
single location") we can check for an empty qdisc with a reasonably
fast operation even for nolock qdiscs.
This change extends TCQ_F_CAN_BYPASS support to nolock qdisc.
The new chunk of code mirrors closely the existing one for traditional
qdisc, leveraging a newly introduced helper to read atomically the
qdisc length.
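A simplified illustration of the new path, modeled on __dev_xmit_skb() with
stats, busylock and error handling omitted (treat it as a sketch, not the
exact merged code):

#include <linux/netdevice.h>
#include <net/sch_generic.h>
#include <net/pkt_sched.h>

static int nolock_xmit_sketch(struct sk_buff *skb, struct Qdisc *q,
			      struct net_device *dev,
			      struct netdev_queue *txq)
{
	struct sk_buff *to_free = NULL;
	int rc;

	if ((q->flags & TCQ_F_CAN_BYPASS) && qdisc_is_empty(q) &&
	    qdisc_run_begin(q)) {
		/* Empty, uncontended qdisc: transmit directly and skip the
		 * enqueue()/dequeue() indirect calls entirely.
		 */
		if (sch_direct_xmit(skb, q, dev, txq, NULL, true))
			__qdisc_run(q);
		qdisc_run_end(q);
		return NET_XMIT_SUCCESS;
	}

	/* Contended or non-empty: regular enqueue + run path. */
	rc = q->enqueue(skb, q, &to_free);
	qdisc_run(q);
	if (unlikely(to_free))
		kfree_skb_list(to_free);
	return rc;
}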
Tested with pktgen in queue xmit mode, with pfifo_fast, a MQ
device, and MQ root qdisc:
  threads    vanilla    patched
              kpps       kpps
     1         2465       2889
     2         4304       5188
     4         7898       9589
Same as above, but with a single queue device:
  threads    vanilla    patched
              kpps       kpps
     1         2556       2827
     2         2900       2900
     4         5000       5000
     8         4700       4700
No measurable change in the contended scenarios, and more than 10%
improvement in the uncontended ones.
v1 -> v2:
- rebased after flag name change
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The queue is marked not empty after acquiring the seqlock,
and it's up to the NOLOCK qdisc to clear such flag on dequeue.
Since the empty status lies on the same cache line as the
seqlock, it's always hot in cache during the updates.
This makes the empty flag update a little bit loose. Given
the lack of synchronization between enqueue and dequeue, this
is unavoidable.
v2 -> v3:
- qdisc_is_empty() has a const argument (Eric)
v1 -> v2:
- really use an 'empty' flag instead of 'not_empty', as
suggested by Eric
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add documentation to the tcp_ca_state enum, since this enum is
exposed in uapi.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Sowmini Varadhan <sowmini05@gmail.com>
Acked-by: Sowmini Varadhan <sowmini05@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_clock_ns() (aka ktime_get_ns()) is using monotonic clock,
so the checks we had in tcp_mstamp_refresh() are no longer
relevant.
This patch removes cpu stall (when the cache line is not hot)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tristate prompt should have been replaced rather than defined a few
lines below; this was a rebase mistake.
Fixes: 17cc982176 ("net: phy: Move Omega PHY entry to Cygnus PHY driver")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a small test that shows how to shape a TCP flow in tc-bpf
with EDT and ECN.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This helper is useful if a bpf tc filter sets skb->tstamp.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Willem de Bruijn says:
====================
BPF allows for dynamic tunneling, choosing the tunnel destination and
features on-demand. Extend bpf_skb_adjust_room to allow for efficient
tunneling at the TC hooks.
Most features are required for large packets with GSO, as these will
be modified after this patch.
Patch 1
is a performance optimization, avoiding an unnecessary unclone
for the TCP hot path.
Patches 2..6
introduce a regression test. These can be squashed, but the code is
arguably more readable when gradually expanding the feature set.
Patch 7
is a performance optimization: avoid copying network headers
that are going to be overwritten. This also simplifies the bpf
program.
Patch 8
reenables bpf_skb_adjust_room for UDP packets.
Patch 9
configures skb tunneling metadata analogous to tunnel devices.
Patches 10..13
expand the regression test to make use of the new features and
enable the GSO testcases.
Changes
v1->v2
- move BPF_F_ADJ_ROOM_MASK out of uapi as it can be expanded
- document new flags
- in tests replace netcat -q flag with coreutils timeout:
the -q flag is not supported in all netcat versions
v2->v3
- move BPF_F_ADJ_ROOM_ENCAP_L3_MASK out of uapi as it has no
use in userspace
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Make the tests correctly annotate skbs with tunnel metadata.
This makes the gso tests succeed. Enable them.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lower route MTU to ensure packets fit in device MTU after encap, then
skip the gso_size changes.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Avoid moving the network layer header when prefixing tunnel headers.
This avoids an explicit call to bpf_skb_store_bytes and an implicit
move of the network header bytes in bpf_skb_adjust_room.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Sync include/uapi/linux/bpf.h with tools/
Changes
v1->v2:
- BPF_F_ADJ_ROOM_MASK moved, no longer in this commit
v2->v3:
- BPF_F_ADJ_ROOM_ENCAP_L3_MASK moved, no longer in this commit
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When pushing tunnel headers, annotate skbs in the same way as tunnel
devices.
For GSO packets, the network stack requires certain fields set to
segment packets with tunnel headers. gre_gso_segment depends on the
transport and inner mac headers, for instance.
Add an option to pass this information.
Remove the restriction on len_diff to network header length, which
is too short, e.g., for GRE protocols.
Changes
v1->v2:
- document new flags
- BPF_F_ADJ_ROOM_MASK moved
v2->v3:
- BPF_F_ADJ_ROOM_ENCAP_L3_MASK moved
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_skb_adjust_room adjusts gso_size of gso packets to account for the
pushed or popped header room.
This is not allowed with UDP, where gso_size delineates datagrams. Add
an option to avoid these updates and allow this call for datagrams.
It can also be used with TCP, when MSS is known to allow headroom,
e.g., through MSS clamping or route MTU.
Changes v1->v2:
- document flag BPF_F_ADJ_ROOM_FIXED_GSO
- do not expose BPF_F_ADJ_ROOM_MASK through uapi, as it may change.
Link: https://patchwork.ozlabs.org/patch/1052497/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_skb_adjust_room allows inserting room in an skb.
Existing mode BPF_ADJ_ROOM_NET inserts room after the network header
by pulling the skb, moving the network header forward and zeroing the
new space.
Add new mode BPF_ADJ_ROOM_MAC that inserts room after the mac
header. This allows inserting tunnel headers in front of the network
header without having to recreate the network header in the original
space, avoiding two copies.
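A hedged tc-bpf example of the new mode, combined with the encap flags added
elsewhere in this series; writing the actual outer headers is omitted, and
the section/program names are illustrative:

#include <linux/bpf.h>
#include <linux/ip.h>
#include <linux/if_tunnel.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int encap_gre(struct __sk_buff *skb)
{
	__u32 room = sizeof(struct iphdr) + sizeof(struct gre_base_hdr);
	__u64 flags = BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
		      BPF_F_ADJ_ROOM_ENCAP_L4_GRE;

	/* Reserve headroom right after the MAC header; the ENCAP flags tell
	 * the stack how to fix up GSO metadata for the pushed tunnel header.
	 */
	if (bpf_skb_adjust_room(skb, room, BPF_ADJ_ROOM_MAC, flags))
		return TC_ACT_SHOT;

	/* outer iphdr and GRE header would be written here */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";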
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Segmentation offload takes a longer path. Verify that the feature
works with large packets.
The test succeeds if not setting dodgy in bpf_skb_adjust_room, as veth
TSO is permissive.
If not setting SKB_GSO_DODGY, this enables tunneled TSO offload on
supporting NICs.
The feature sets SKB_GSO_DODGY because the caller is untrusted. As a
result the packets traverse the gso stack at least up to TCP,
and fail the gso_type validation, such as the skb->encapsulation check
in gre_gso_segment and the gso_type checks introduced in commit
418e897e07 ("gso: validate gso_type on ipip style tunnel").
This will be addressed in a follow-on feature patch. In the meantime,
disable the new gso tests.
Changes v1->v2:
- not all netcat versions support flag '-q', use timeout instead
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
GRE is a commonly used protocol. Add GRE cases for both IPv4 and IPv6.
It also inserts different sized headers, which can expose some
unexpected edge cases.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The test only uses ipv4 so far; expand it to ipv6.
This is mostly a boilerplate near-copy of the ipv4 path.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The bpf tunnel test encapsulates using bpf, then decapsulates using
a standard tunnel device to verify correctness.
Once encap is verified, also test decap, by replacing the tunnel
device on decap with another bpf program.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Validate basic tunnel encapsulation using ipip.
Set up two namespaces connected by veth. Connect a client and server.
Do this with and without bpf encap.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_skb_adjust_room calls skb_cow on grow.
This expensive operation can be avoided in the fast path when the only
other clone has released the header. This is the common case for TCP,
where one headerless clone is kept on the retransmit queue.
It is safe to do so even when touching the gso fields in skb_shinfo.
Regular tunnel encap with iptunnel_handle_offloads takes the same
optimization.
The tcp stack unclones in the unlikely case that it accesses these
fields through headerless clone packets on the retransmit queue (see
__tcp_retransmit_skb).
If any other clones are present, e.g., from packet sockets,
skb_cow_head returns the same value as skb_cow().
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Changing the VLAN header may be implemented by popping the existing header
and pushing a new one. Translate those operations as a VLAN modify.
Applicable for use cases such as OVS, where the controller translates a
vlan modify meta (OF) rule into a DP pop+push actions rule.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Support VLAN modify action by emulating a rewrite action for the VLAN
fields. Currently, the only supported field is the vid. The prio in the
action must be set to 0 to indicate no change.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add VLAN ID rewrite fields as a pre-step to support this rewrite.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Added an IANA_VXLAN_UDP_PORT (4789) definition to the vxlan header file so it
can be used by drivers instead of a local definition.
Updated drivers which locally defined it as 4789 to use it.
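For illustration, the shared definition and a hypothetical driver-side check
look roughly like this (placement paraphrased from the text):

#include <linux/types.h>

/* include/net/vxlan.h (sketch of the shared definition) */
#define IANA_VXLAN_UDP_PORT	4789

/* hypothetical driver-side check using the shared constant */
static bool port_is_vxlan(__be16 dport)
{
	return dport == htons(IANA_VXLAN_UDP_PORT);
}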
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: John Hurley <john.hurley@netronome.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Cc: Peng Li <lipeng321@huawei.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently only the default geneve udp port (6081) is supported.
For the tx side, the HW is assisted by SW parsing, which sets the
header offsets to offload tunneled LSO and csum. Note that for udp
tunnels we don't use special rx offloads, as rss on the outer headers
is enough; we support checksum complete and GRO takes care of
aggregation.
Geneve TSO BW and CPU load results (tested using iperf single tcp
stream).
In this patch we add TSO support over Geneve, so the "before" result
doesn't actually get to using the TSO HW offload even when turned on.
Tested on ConnectX-5, Intel(R) Xeon(R) CPU E5-2660 v2 @2.20GHz.
__________________________________
| Before | After |
|________________|_________________|
| 12.6 Gbits/sec | 21.7 Gbits/sec |
| 100% CPU load | 61.5% CPU load |
|________________|_________________|
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Refactor the mlx5e_ipsec_set_swp() code, splitting out the part which sets the
eseg software parser (SWP) offsets and flags, so it can be used in a
downstream patch by other mlx5e functionality which needs to set eseg
SWP.
The new function mlx5e_set_eseg_swp() is useful for setting swp for both
outer and inner headers. It also handles the special ipsec case of xfrm
mode transfer.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Move the definition of the default Geneve udp port from the geneve
source to the header file, so we can re-use it from drivers.
Modify existing drivers to use it.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: John Hurley <john.hurley@netronome.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Amazingly, an mlx5e_tc function is being called from the eswitch layer,
which is by itself very terrible! The function was declared locally in
eswitch_offloads.c so it could be used there, which caused the following
compilation warning. Fix that.
drivers/.../mlx5/core/en_tc.c:3242:6: [-Werror=missing-prototypes]
error: no previous prototype for ‘mlx5e_tc_clean_fdb_peer_flows’
Fixes: 04de7dda73 ("net/mlx5e: Infrastructure for duplicated offloading of TC flows")
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This patch fixes compiler warnings:
In drivers/.../mlx5/core/en/port_buffer.c:190:
warning: Function parameter or member 'pfc_en' not described...
...
warning: Function parameter or member 'change' not described...
Fixes: 0696d60853 ("net/mlx5e: Receive buffer configuration")
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
mlx5_eq_table_get_rmap is only used when CONFIG_RFS_ACCEL is
enabled. This patch fixes the below warning when CONFIG_RFS_ACCEL is
disabled.
drivers/.../mlx5/core/eq.c:903:18: [-Werror=missing-prototypes]
error: no previous prototype for ‘mlx5_eq_table_get_rmap’
Fixes: f2f3df5501 ("net/mlx5: EQ, Privatize eq_table and friends")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
It is desired to get rid of the num_vfs stored inside mlx5_core_sriov,
to safely support more vports than VFs.
To reduce the dependency on mlx5_core_sriov's num_vfs, start using
pci_num_vf() from the pci core.
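Illustrative sketch only (the mlx5 call sites are not shown; the helper name
is hypothetical):

#include <linux/pci.h>

static int sriov_vf_count(struct pci_dev *pdev)
{
	/* ask the PCI core instead of a driver-cached num_vfs */
	return pci_num_vf(pdev);
}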
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>