linux

Author	SHA1	Message	Date
wenxu	c0d59da795	ip_gre: Make none-tun-dst gre tunnel store tunnel info as metadat_dst in recv Currently collect_md gre tunnel will store the tunnel info(metadata_dst) to skb_dst. And now the non-tun-dst gre tunnel already can add tunnel header through lwtunnel. When received a arp_request on the non-tun-dst gre tunnel. The packet of arp response will send through the non-tun-dst tunnel without tunnel info which will lead the arp response packet to be dropped. If the non-tun-dst gre tunnel also store the tunnel info as metadata_dst, The arp response packet will set the releted tunnel info in the iptunnel_metadata_reply. The following is the test script: ip netns add cl ip l add dev vethc type veth peer name eth0 netns cl ifconfig vethc 172.168.0.7/24 up ip l add dev tun1000 type gretap key 1000 ip link add user1000 type vrf table 1 ip l set user1000 up ip l set dev tun1000 master user1000 ifconfig tun1000 10.0.1.1/24 up ip netns exec cl ifconfig eth0 172.168.0.17/24 up ip netns exec cl ip l add dev tun type gretap local 172.168.0.17 remote 172.168.0.7 key 1000 ip netns exec cl ifconfig tun 10.0.1.7/24 up ip r r 10.0.1.7 encap ip id 1000 dst 172.168.0.17 key dev tun1000 table 1 With this patch ip netns exec cl ping 10.0.1.1 can success Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 22:21:30 -08:00
David S. Miller	ee5a489fd9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2019-11-20 The following pull-request contains BPF updates for your net-next tree. We've added 81 non-merge commits during the last 17 day(s) which contain a total of 120 files changed, 4958 insertions(+), 1081 deletions(-). There are 3 trivial conflicts, resolve it by always taking the chunk from `196e8ca748`: <<<<<<< HEAD ======= void bpf_map_area_mmapable_alloc(u64 size, int numa_node); >>>>>>> `196e8ca748` <<<<<<< HEAD void bpf_map_area_alloc(u64 size, int numa_node) ======= static void __bpf_map_area_alloc(u64 size, int numa_node, bool mmapable) >>>>>>> `196e8ca748` <<<<<<< HEAD if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { ======= / kmalloc()'ed memory can't be mmap()'ed */ if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { >>>>>>> `196e8ca748` The main changes are: 1) Addition of BPF trampoline which works as a bridge between kernel functions, BPF programs and other BPF programs along with two new use cases: i) fentry/fexit BPF programs for tracing with practically zero overhead to call into BPF (as opposed to k[ret]probes) and ii) attachment of the former to networking related programs to see input/output of networking programs (covering xdpdump use case), from Alexei Starovoitov. 2) BPF array map mmap support and use in libbpf for global data maps; also a big batch of libbpf improvements, among others, support for reading bitfields in a relocatable manner (via libbpf's CO-RE helper API), from Andrii Nakryiko. 3) Extend s390x JIT with usage of relative long jumps and loads in order to lift the current 64/512k size limits on JITed BPF programs there, from Ilya Leoshkevich. 4) Add BPF audit support and emit messages upon successful prog load and unload in order to have a timeline of events, from Daniel Borkmann and Jiri Olsa. 5) Extension to libbpf and xdpsock sample programs to demo the shared umem mode (XDP_SHARED_UMEM) as well as RX-only and TX-only sockets, from Magnus Karlsson. 6) Several follow-up bug fixes for libbpf's auto-pinning code and a new API call named bpf_get_link_xdp_info() for retrieving the full set of prog IDs attached to XDP, from Toke Høiland-Jørgensen. 7) Add BTF support for array of int, array of struct and multidimensional arrays and enable it for skb->cb[] access in kfree_skb test, from Martin KaFai Lau. 8) Fix AF_XDP by using the correct number of channels from ethtool, from Luigi Rizzo. 9) Two fixes for BPF selftest to get rid of a hang in test_tc_tunnel and to avoid xdping to be run as standalone, from Jiri Benc. 10) Various BPF selftest fixes when run with latest LLVM trunk, from Yonghong Song. 11) Fix a memory leak in BPF fentry test run data, from Colin Ian King. 12) Various smaller misc cleanups and improvements mostly all over BPF selftests and samples, from Daniel T. Lee, Andre Guedes, Anders Roxell, Mao Wenan, Yue Haibing. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 18:11:23 -08:00
Daniel Borkmann	196e8ca748	bpf: Switch bpf_map_{area_alloc,area_mmapable_alloc}() to u64 size Given we recently extended the original bpf_map_area_alloc() helper in commit `fc9702273e` ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"), we need to apply the same logic as in `ff1c08e1f7` ("bpf: Change size to u64 for bpf_map_{area_alloc, charge_init}()"). To avoid conflicts, extend it for bpf-next. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-11-20 23:18:58 +01:00
Daniel Borkmann	91e6015b08	bpf: Emit audit messages upon successful prog load and unload Allow for audit messages to be emitted upon BPF program load and unload for having a timeline of events. The load itself is in syscall context, so additional info about the process initiating the BPF prog creation can be logged and later directly correlated to the unload event. The only info really needed from BPF side is the globally unique prog ID where then audit user space tooling can query / dump all info needed about the specific BPF program right upon load event and enrich the record, thus these changes needed here can be kept small and non-intrusive to the core. Raw example output: # auditctl -D # auditctl -a always,exit -F arch=x86_64 -S bpf # ausearch --start recent -m 1334 [...] ---- time->Wed Nov 20 12:45:51 2019 type=PROCTITLE msg=audit(1574271951.590:8974): proctitle="./test_verifier" type=SYSCALL msg=audit(1574271951.590:8974): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=7ffe2d923e80 a2=78 a3=0 items=0 ppid=742 pid=949 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null) type=UNKNOWN[1334] msg=audit(1574271951.590:8974): auid=0 uid=0 gid=0 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=949 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" prog-id=3260 event=LOAD ---- time->Wed Nov 20 12:45:51 2019 type=UNKNOWN[1334] msg=audit(1574271951.590:8975): prog-id=3260 event=UNLOAD ---- [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191120213816.8186-1-jolsa@kernel.org	2019-11-20 13:44:51 -08:00
David S. Miller	e2193c9334	Merge branch 'r8169-smaller-improvements-to-firmware-handling' Heiner Kallweit says: ==================== r8169: smaller improvements to firmware handling This series includes few smaller improvements to firmware handling. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:50:25 -08:00
Heiner Kallweit	df0120f12f	r8169: add check for PHY_MDIO_CHG to rtl_nic_fw_data_ok Only values 0 and 1 are currently defined as parameters for PHY_MDIO_CHG. Instead of silently ignoring unknown values and misinterpreting the firmware code let's explicitly check. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:50:24 -08:00
Heiner Kallweit	cfccde80e8	r8169: use macro FIELD_SIZEOF in definition of FW_OPCODE_SIZE Using macro FIELD_SIZEOF makes this define easier understandable. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:50:24 -08:00
Heiner Kallweit	e20c43dbdf	r8169: change mdelay to msleep in rtl_fw_write_firmware We're not in atomic context here, therefore switch to msleep. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:50:24 -08:00
Thomas Bogendoerfer	e2ffe3ff6f	net: ipconfig: Wait for deferred device probes If network device drives are using deferred probing, it was possible that waiting for devices to show up in ipconfig was already over, when the device eventually showed up. By calling wait_for_device_probe() we now make sure deferred probing is done before checking for available devices. Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:39:53 -08:00
Mao Wenan	2be8ca97d0	vsock/vmci: make vmci_vsock_cb_host_called static When using make C=2 drivers/misc/vmw_vmci/vmci_driver.o to compile, below warning can be seen: drivers/misc/vmw_vmci/vmci_driver.c:33:6: warning: symbol 'vmci_vsock_cb_host_called' was not declared. Should it be static? This patch make symbol vmci_vsock_cb_host_called static. Fixes: `b1bba80a43` ("vsock/vmci: register vmci_transport only when VMCI guest/host are active") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Mao Wenan <maowenan@huawei.com> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:39:29 -08:00
David S. Miller	e07e75412b	Merge branch 'page_pool-DMA-sync' Lorenzo Bianconi says: ==================== add DMA-sync-for-device capability to page_pool API Introduce the possibility to sync DMA memory for device in the page_pool API. This feature allows to sync proper DMA size and not always full buffer (dma_sync_single_for_device can be very costly). Please note DMA-sync-for-CPU is still device driver responsibility. Relying on page_pool DMA sync mvneta driver improves XDP_DROP pps of about 170Kpps: - XDP_DROP DMA sync managed by mvneta driver: ~420Kpps - XDP_DROP DMA sync managed by page_pool API: ~585Kpps Do not change naming convention for the moment since the changes will hit other drivers as well. I will address it in another series. Changes since v4: - do not allow the driver to set max_len to 0 - convert PP_FLAG_DMA_MAP/PP_FLAG_DMA_SYNC_DEV to BIT() macro Changes since v3: - move dma_sync_for_device before putting the page in ptr_ring in __page_pool_recycle_into_ring since ptr_ring can be consumed concurrently. Simplify the code moving dma_sync_for_device before running __page_pool_recycle_direct/__page_pool_recycle_into_ring Changes since v2: - rely on PP_FLAG_DMA_SYNC_DEV flag instead of dma_sync Changes since v1: - rename sync in dma_sync - set dma_sync_size to 0xFFFFFFFF in page_pool_recycle_direct and page_pool_put_page routines - Improve documentation ==================== Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:34:37 -08:00
Lorenzo Bianconi	07e13edbb6	net: mvneta: get rid of huge dma sync in mvneta_rx_refill Get rid of costly dma_sync_single_for_device in mvneta_rx_refill since now the driver can let page_pool API to manage needed DMA sync with a proper size. - XDP_DROP DMA sync managed by mvneta driver: ~420Kpps - XDP_DROP DMA sync managed by page_pool API: ~585Kpps Tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:34:29 -08:00
Lorenzo Bianconi	e68bc75691	net: page_pool: add the possibility to sync DMA memory for device Introduce the following parameters in order to add the possibility to sync DMA memory for device before putting allocated pages in the page_pool caches: - PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that the driver gets from page_pool will be DMA-synced-for-device according to the length provided by the device driver. Please note DMA-sync-for-CPU is still device driver responsibility - offset: DMA address offset where the DMA engine starts copying rx data - max_len: maximum DMA memory size page_pool is allowed to flush. This is currently used in __page_pool_alloc_pages_slow routine when pages are allocated from page allocator These parameters are supposed to be set by device drivers. This optimization reduces the length of the DMA-sync-for-device. The optimization is valid because pages are initially DMA-synced-for-device as defined via max_len. At RX time, the driver will perform a DMA-sync-for-CPU on the memory for the packet length. What is important is the memory occupied by packet payload, because this is the area CPU is allowed to read and modify. As we don't track cache-lines written into by the CPU, simply use the packet payload length as dma_sync_size at page_pool recycle time. This also take into account any tail-extend. Tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:34:28 -08:00
Lorenzo Bianconi	f383b29500	net: mvneta: rely on page_pool_recycle_direct in mvneta_run_xdp Rely on page_pool_recycle_direct and not on xdp_return_buff in mvneta_run_xdp. This is a preliminary patch to limit the dma sync len to the one strictly necessary Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:34:17 -08:00
Gautam Ramakrishnan	cec2975f2b	net: sched: pie: enable timestamp based delay calculation RFC 8033 suggests an alternative approach to calculate the queue delay in PIE by using a timestamp on every enqueued packet. This patch adds an implementation of that approach and sets it as the default method to calculate queue delay. The previous method (based on Little's law) to calculate queue delay is set as optional. Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Acked-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:31:45 -08:00
Krzysztof Kozlowski	f01b437d12	isdn: Fix Kconfig indentation Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ /\t/' -i */Kconfig Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:30:47 -08:00
Krzysztof Kozlowski	041ccdb620	nfc: Fix Kconfig indentation Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ /\t/' -i */Kconfig Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:30:40 -08:00
David S. Miller	07def46382	Merge branch 'cxgb4-add-TC-MATCHALL-classifier-offload' Rahul Lakkireddy says: ==================== cxgb4: add TC-MATCHALL classifier offload This series of patches add support to offload TC-MATCHALL classifier to hardware to classify all outgoing and incoming traffic on the underlying port. Only 1 egress and 1 ingress rule each can be offloaded on the underlying port. Patch 1 adds support for TC-MATCHALL classifier offload on the egress side. TC-POLICE is the only action that can be offloaded on the egress side and is used to rate limit all outgoing traffic to specified max rate. Patch 2 adds logic to reject the current rule offload if its priority conflicts with existing rules in the TCAM. Patch 3 adds support for TC-MATCHALL classifier offload on the ingress side. The same set of actions supported by existing TC-FLOWER classifier offload can be applied on all the incoming traffic. v5: - Fixed commit message and comment to include comparison for equal priority in patch 2. v4: - Removed check in patch 1 to reject police offload if prio is not 1. - Moved TC_SETUP_BLOCK code to separate function in patch 1. - Added logic to ensure the prio passed by TC doesn't conflict with other rules in TCAM in patch 2. - Higher index has lower priority than lower index in TCAM. So, rework cxgb4_get_free_ftid() to search free index from end of TCAM in descending order in patch 2. - Added check to ensure the matchall rule's prio doesn't conflict with other rules in TCAM in patch 3. - Added logic to fill default mask for VIID, if none has been provided, to prevent conflict with duplicate VIID rules in patch 3. - Used existing variables in private structure to fill VIID info, instead of extracting the info manually in patch 3. v3: - Added check in patch 1 to reject police offload if prio is not 1. - Assign block_shared variable only for TC_SETUP_BLOCK in patch 1. v2: - Added check to reject flow block sharing for policers in patch 1. - Removed logic to fetch free index from end of TCAM in patch 2. Must maintain the same ordering as in kernel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:05:24 -08:00
Rahul Lakkireddy	21c4c60b76	cxgb4: add TC-MATCHALL classifier ingress offload Add TC-MATCHALL classifier ingress offload support. The same actions supported by existing TC-FLOWER offload can be applied to all incoming traffic on the underlying interface. Ensure the rule priority doesn't conflict with existing rules in the TCAM. Only 1 ingress matchall rule can be active at a time on the underlying interface. v5: - No change. v4: - Added check to ensure the matchall rule's prio doesn't conflict with other rules in TCAM. - Added logic to fill default mask for VIID, if none has been provided, to prevent conflict with duplicate VIID rules. - Used existing variables in private structure to fill VIID info, instead of extracting the info manually. v3: - No change. v2: - Removed logic to fetch free index from end of TCAM. Must maintain same ordering as in kernel. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:05:23 -08:00
Rahul Lakkireddy	41ec03e534	cxgb4: check rule prio conflicts before offload Only offload rule if it satisfies both of the following conditions: 1. The immediate previous rule has priority <= current rule's priority. 2. The immediate next rule has priority >= current rule's priority. Also rework free entry fetch logic to search from end of TCAM, instead of beginning, because higher indices have lower priority than lower indices. This is similar to how TC auto generates priority values. v5: - Fixed commit message and comment to include comparison for equal priority. v4: - Patch added in this version. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:05:23 -08:00
Rahul Lakkireddy	4ec4762d8e	cxgb4: add TC-MATCHALL classifier egress offload Add TC-MATCHALL classifier offload with TC-POLICE action applied for all outgoing traffic on the underlying interface. Split flow block offload to support both egress and ingress classification. For example, to rate limit all outgoing traffic to 1 Gbps: $ tc qdisc add dev enp2s0f4 clsact $ tc filter add dev enp2s0f4 egress matchall skip_sw \ action police rate 1Gbit burst 8Kbit Note that skip_sw is important. Otherwise, both stack and hardware will end up doing policing. Policing can't be shared across flow blocks. Only 1 egress matchall rule can be active at a time on the underlying interface. v5: - No change. v4: - Removed check to reject police offload if prio is not 1. - Moved TC_SETUP_BLOCK code to separate function. v3: - Added check to reject police offload if prio is not 1. - Assign block_shared variable only for TC_SETUP_BLOCK. v2: - Added check to reject flow block sharing for policers. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 12:05:23 -08:00
David S. Miller	77c05d2f73	Merge branch 'page_pool-API-for-numa-node-change-handling' Saeed Mahameed says: ==================== page_pool: API for numa node change handling This series extends page pool API to allow page pool consumers to update page pool numa node on the fly. This is required since on some systems, rx rings irqs can migrate between numa nodes, due to irq balancer or user defined scripts, current page pool has no way to know of such migration and will keep allocating and holding on to pages from a wrong numa node, which is bad for the consumer performance. 1) Add API to update numa node id of the page pool Consumers will call this API to update the page pool numa node id. 2) Don't recycle non-reusable pages: Page pool will check upon page return whether a page is suitable for recycling or not. 2.1) when it belongs to a different num node. 2.2) when it was allocated under memory pressure. 3) mlx5 will use the new API to update page pool numa id on demand. The series is a joint work between me and Jonathan, we tested it and it proved itself worthy to avoid page allocator bottlenecks and improve packet rate and cpu utilization significantly for the described scenarios above. Performance testing: XDP drop/tx rate and TCP single/multi stream, on mlx5 driver while migrating rx ring irq from close to far numa: mlx5 internal page cache was locally disabled to get pure page pool results. CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G) XDP Drop/TX single core: NUMA \| XDP \| Before \| After --------------------------------------- Close \| Drop \| 11 Mpps \| 10.9 Mpps Far \| Drop \| 4.4 Mpps \| 5.8 Mpps Close \| TX \| 6.5 Mpps \| 6.5 Mpps Far \| TX \| 3.5 Mpps \| 4 Mpps Improvement is about 30% drop packet rate, 15% tx packet rate for numa far test. No degradation for numa close tests. TCP single/multi cpu/stream: NUMA \| #cpu \| Before \| After -------------------------------------- Close \| 1 \| 18 Gbps \| 18 Gbps Far \| 1 \| 15 Gbps \| 18 Gbps Close \| 12 \| 80 Gbps \| 80 Gbps Far \| 12 \| 68 Gbps \| 80 Gbps In all test cases we see improvement for the far numa case, and no impact on the close numa case. ================== Performance analysis and conclusions by Jesper [1]: Impact on XDP drop x86_64 is inconclusive and shows only 0.3459ns slow-down, as this is below measurement accuracy of system. v2->v3: - Rebase on top of latest net-next and Jesper's page pool object release patchset [2] - No code changes - Performance analysis by Jesper added to the cover letter. v1->v2: - Drop last patch, as requested by Ilias and Jesper. - Fix documentation's performance numbers order. [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool04_inflight_changes.org#performance-notes [2] https://patchwork.ozlabs.org/cover/1192098/ ==================== Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:47:36 -08:00
Saeed Mahameed	6849c6d86b	net/mlx5e: Rx, Update page pool numa node when changed Once every napi poll cycle, check if numa node is different than the page pool's numa id, and update it using page_pool_update_nid(). Alternatively, we could have registered an irq affinity change handler, but page_pool_update_nid() must be called from napi context anyways, so the handler won't actually help. Performance testing: XDP drop/tx rate and TCP single/multi stream, on mlx5 driver while migrating rx ring irq from close to far numa: mlx5 internal page cache was locally disabled to get pure page pool results. CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G) XDP Drop/TX single core: NUMA \| XDP \| Before \| After --------------------------------------- Close \| Drop \| 11 Mpps \| 10.9 Mpps Far \| Drop \| 4.4 Mpps \| 5.8 Mpps Close \| TX \| 6.5 Mpps \| 6.5 Mpps Far \| TX \| 3.5 Mpps \| 4 Mpps Improvement is about 30% drop packet rate, 15% tx packet rate for numa far test. No degradation for numa close tests. TCP single/multi cpu/stream: NUMA \| #cpu \| Before \| After -------------------------------------- Close \| 1 \| 18 Gbps \| 18 Gbps Far \| 1 \| 15 Gbps \| 18 Gbps Close \| 12 \| 80 Gbps \| 80 Gbps Far \| 12 \| 68 Gbps \| 80 Gbps In all test cases we see improvement for the far numa case, and no impact on the close numa case. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:47:36 -08:00
Saeed Mahameed	d5394610b1	page_pool: Don't recycle non-reusable pages A page is NOT reusable when at least one of the following is true: 1) allocated when system was under some pressure. (page_is_pfmemalloc) 2) belongs to a different NUMA node than pool->p.nid. To update pool->p.nid users should call page_pool_update_nid(). Holding on to such pages in the pool will hurt the consumer performance when the pool migrates to a different numa node. Performance testing: XDP drop/tx rate and TCP single/multi stream, on mlx5 driver while migrating rx ring irq from close to far numa: mlx5 internal page cache was locally disabled to get pure page pool results. CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G) XDP Drop/TX single core: NUMA \| XDP \| Before \| After --------------------------------------- Close \| Drop \| 11 Mpps \| 10.9 Mpps Far \| Drop \| 4.4 Mpps \| 5.8 Mpps Close \| TX \| 6.5 Mpps \| 6.5 Mpps Far \| TX \| 3.5 Mpps \| 4 Mpps Improvement is about 30% drop packet rate, 15% tx packet rate for numa far test. No degradation for numa close tests. TCP single/multi cpu/stream: NUMA \| #cpu \| Before \| After -------------------------------------- Close \| 1 \| 18 Gbps \| 18 Gbps Far \| 1 \| 15 Gbps \| 18 Gbps Close \| 12 \| 80 Gbps \| 80 Gbps Far \| 12 \| 68 Gbps \| 80 Gbps In all test cases we see improvement for the far numa case, and no impact on the close numa case. The impact of adding a check per page is very negligible, and shows no performance degradation whatsoever, also functionality wise it seems more correct and more robust for page pool to verify when pages should be recycled, since page pool can't guarantee where pages are coming from. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:47:36 -08:00
Saeed Mahameed	bc83674870	page_pool: Add API to update numa node Add page_pool_update_nid() to be called by page pool consumers when they detect numa node changes. It will update the page pool nid value to start allocating from the new effective numa node. This is to mitigate page pool allocating pages from a wrong numa node, where the pool was originally allocated, and holding on to pages that belong to a different numa node, which causes performance degradation. For pages that are already being consumed and could be returned to the pool by the consumer, in next patch we will add a check per page to avoid recycling them back to the pool and return them to the page allocator. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:47:36 -08:00
David S. Miller	1f12177b32	Merge branch 'cpsw-switchdev' Grygorii Strashko says: ==================== net: ethernet: ti: introduce new cpsw switchdev based driver Thank you All for review of v6. There are no significant changes in this version, just fixed comments to v6. --- v6 The major change in this version is DT bindings conversation to json-schema, and fixed other comments to v5. Also added patch to clean up ALE on init and netif restart. --- v5 The major part of work done in this iteration is rebasing on top of net-next with XDP series from Ivan Khoronzhuk [3], and enable XDP support in the new CPSW switchdev driver (it was little bit painful ;(). There are mostly no functional changes in new CPSW driver, just few fixes, sync with old driver and cleanups/optimizations. So, I've kept rest of cover letter unchanged. --- This series originally based on work [1][2] done by Ilias Apalodimas <ilias.apalodimas@linaro.org>. This the RFC v5 which introduces new CPSW switchdev based driver which is operating in dual-emac mode by default, thus working as 2 individual network interfaces. The Switch mode can be enabled by configuring devlink driver parameter "switch_mode" to 1/true: devlink dev param set platform/48484000.switch \ name switch_mode value 1 cmode runtime This can be done regardless of the state of Port's netdev devices - UP/DOWN, but Port's netdev devices have to be in UP before joining the bridge to avoid overwriting of bridge configuration as CPSW switch driver completely reloads its configuration when first Port changes its state to UP. When the both interfaces joined the bridge - CPSW switch driver will start marking packets with offload_fwd_mark flag unless "ale_bypass=0". All configuration is implemented via switchdev API. The previous solution of tracking both Ports joined the bridge (from netdevice_notifier) proved to be not correct as changing CPSW switch driver mode required cleanup of ALE table and CPSW settings which happens while second Port is joined bridge and as result configuration loaded by bridge for the first Port became corrupted. The introduction of the new CPSW switchdev based driver (cpsw_new.c) is split on two parts: Part 1 - basic dual-emac driver; Part 2 switchdev support. Such approach has simplified code development and testing alot. And, I hope, it will help with better review. patches #1 - 5: preparation patches which also moves common code to cpsw_priv.c patches #6 - 9: Introduce TI CPSW switch driver based on switchdev and new DT bindings patch #10: new CPSW switchdev driver documentation patch #11: adds DT nodes for new CPSW switchdev driver added for DRA7 SoC patch #12: adds DT nodes for new cpsw switchdev driver for am571x-idk board patch #13: enables build of TI CPSW driver Most of the contents of the previous cover-letter have been added in new driver documentation, so please refer to that for configuration, testing and future work. These patches can be found at (branch contains some additional patches required for testing on top of net-next): https://github.com/grygoriyS/linux.git branch: lkml-5.4-switch-tbd-v7 changes in v7: - patch 2: added check for devm_kmalloc_array() return value - patch 6: fixed comments changes in v6: https://lkml.org/lkml/2019/11/9/108 - DT bindings converted to json-schema - netdev initialization is split on creation and registration. The netdevs registration happens now at the end of the pobe. - reworked cpsw_set_pauseparam() to use PHYlib APIs. - other comments for v5 fixed v5: https://patchwork.kernel.org/cover/11208785/ - rebase on top of net-next with XDP series from Ivan Khoronzhuk [3], and enable XDP support in the new CPSW switchdev driver cpsw driver (tested XDP_DROP only) - sync with old cpsw driver - implement comments from Ivan Khoronzhuk and Rob Herring - fixed "NETDEV WATCHDOG: .." warning after interface after interface UP/DOWN, missed TX wake in cpsw_adjust_link() v4: https://patchwork.kernel.org/cover/11010523/ - finished split of common CPSW code - added devlink support - changed CPSW mode configuration approach: from netdevice_notifier to devlink parameter - refactor and clean up ALE changes which allows to modify VLANs/MDBs entries - added missed support for port QDISC_CBS and QDISC_MQPRIO - the CPSW is split on two parts: basic dual_mac driver and switchdev support - added missed callback .ndo_get_port_parent_id() - reworked ingress frames marking in switch mode (offload_fwd_mark) - applied comments from Andrew Lunn v3: https://lwn.net/Articles/786677/ Changes in v3: - alot of work done to split properly common code between legacy and switchdev CPSW drivers and clean up code - CPSW switchdev interface updated to the current LKML switchdev interface - actually new CPSW switchdev based driver introduced - optimized dual_mac mode in new driver. Main change is that in promiscuous mode P0_UNI_FLOOD (both ports) is enabled in addition to ALLMULTI (current port) instead of ALE_BYPASS. So, port in non promiscuous mode will keep possibility of mcast and vlan filtering. - changed bridge join sequnce: now switch mode will be enabled only when both ports joined the bridge. CPSW will be switched to dual_mac mode if any port leave bridge. ALE table is completly cleared and then refiled while switching to switch mode - this simplidies code a lot, but introduces some limitation to bridge setup sequence: ip link add name br0 type bridge ip link set dev br0 type bridge ageing_time 1000 ip link set dev br0 type bridge vlan_filtering 0 <- disable echo 0 > /sys/class/net/br0/bridge/default_vlan ip link set dev sw0p1 up <- add ports ip link set dev sw0p2 up ip link set dev sw0p1 master br0 ip link set dev sw0p2 master br0 echo 1 > /sys/class/net/br0/bridge/default_vlan <- enable ip link set dev br0 type bridge vlan_filtering 1 bridge vlan add dev br0 vid 1 pvid untagged self - STP tested with vlan_filtering 1/0. To make STP work I've had to set NO_SA_UPDATE for all slave ports (see comment in code). It also required to statically register STP mcast address {0x01, 0x80, 0xc2, 0x0, 0x0, 0x0}; - allowed build both TI_CPSW and TI_CPSW_SWITCHDEV drivers - PTP can be enabled on both ports in dual_mac mode [1] https://patchwork.ozlabs.org/cover/929367/ [2] https://patches.linaro.org/cover/136709/ [3] https://patchwork.kernel.org/cover/11035813/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:24 -08:00
Grygorii Strashko	3727d259dd	arm: omap2plus_defconfig: enable new cpsw switchdev driver Add CONFIG_TI_CPSW_SWITCHDEV option to enable new cpsw switchdev driver Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:24 -08:00
Grygorii Strashko	15b991ade4	ARM: dts: am571x-idk: enable for new cpsw switch dev driver Add DT nodes for new cpsw switchdev driver for am571x-idk board for now to enable testing of the new solution. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:24 -08:00
Grygorii Strashko	39331a49c4	ARM: dts: dra7: add dt nodes for new cpsw switch dev driver Add DT nodes for new cpsw switch dev driver. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:24 -08:00
Ilias Apalodimas	14c815a9d1	Documentation: networking: add cpsw switchdev based driver documentation A new cpsw dirver based on switchdev was added. Add documentation about basic configuration and future features Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:24 -08:00
Grygorii Strashko	da84e50c8e	phy: ti: phy-gmii-sel: dependency from ti cpsw-switchdev driver Add dependency from TI_CPSW_SWITCHDEV. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:24 -08:00
Ilias Apalodimas	111cf1ab4d	net: ethernet: ti: introduce cpsw switchdev based driver part 2 - switch CPSW switchdev based driver which is operating in dual-emac mode by default, thus working as 2 individual network interfaces. The Switch mode can be enabled by configuring devlink driver parameter "switch_mode" to 1: devlink dev param set platform/48484000.switch \ name switch_mode value 1 cmode runtime This can be done regardless of the state of Port's netdevs - UP/DOWN, but Port's netdev devices have to be UP before joining the bridge to avoid overwriting of bridge configuration as CPSW switch driver completely reloads its configuration when first Port changes its state to UP. When the both interfaces joined the bridge - CPSW switch driver will start marking packets with offload_fwd_mark flag unless "ale_bypass=0". All configuration is implemented via switchdev API and notifiers. Supported: - SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS - SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS: BR_MCAST_FLOOD - SWITCHDEV_ATTR_ID_PORT_STP_STATE - SWITCHDEV_OBJ_ID_PORT_VLAN - SWITCHDEV_OBJ_ID_PORT_MDB - SWITCHDEV_OBJ_ID_HOST_MDB Hence CPSW switchdev driver supports: - FDB offloading - MDB offloading - VLAN filtering and offloading - STP Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:24 -08:00
Ilias Apalodimas	ed3525eda4	net: ethernet: ti: introduce cpsw switchdev based driver part 1 - dual-emac Part 1: Introduce basic CPSW dual_mac driver (cpsw_new.c) which is operating in dual-emac mode by default, thus working as 2 individual network interfaces. Main differences from legacy CPSW driver are: - optimized promiscuous mode: The P0_UNI_FLOOD (both ports) is enabled in addition to ALLMULTI (current port) instead of ALE_BYPASS. So, Ports in promiscuous mode will keep possibility of mcast and vlan filtering, which is provides significant benefits when ports are joined to the same bridge, but without enabling "switch" mode, or to different bridges. - learning disabled on ports as it make not too much sense for segregated ports - no forwarding in HW. - enabled basic support for devlink. devlink dev show platform/48484000.switch devlink dev param show platform/48484000.switch: name ale_bypass type driver-specific values: cmode runtime value false - "ale_bypass" devlink driver parameter allows to enable ALE_CONTROL(4).BYPASS mode for debug purposes. - updated DT bindings. Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:23 -08:00
Grygorii Strashko	ef63fe72f6	dt-bindings: net: ti: add new cpsw switch driver bindings Add bindings for the new TI CPSW switch driver. Comparing to the legacy bindings (net/cpsw.txt): - ports definition follows DSA bindings (net/dsa/dsa.txt) and ports can be marked as "disabled" if not physically wired. - all deprecated properties dropped; - all legacy propertiies dropped which represent constant HW cpapbilities (cpdma_channels, ale_entries, bd_ram_size, mac_control, slaves, active_slave) - TI CPTS DT properties are reused as is, but grouped in "cpts" sub-node - TI Davinci MDIO DT bindings are reused as is, because Davinci MDIO is reused. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:23 -08:00
Grygorii Strashko	c5013ac1dd	net: ethernet: ti: cpsw: move set of common functions in cpsw_priv As a preparatory patch to add support for a switchdev based cpsw driver, move common functions to cpsw-priv.c so that they can be used across both drivers. Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:23 -08:00
Grygorii Strashko	51a9533797	net: ethernet: ti: cpsw: resolve build deps of cpsw drivers A following patches introduce new CPSW switchdev driver which uses common code with legacy CPSW driver. This will introduce build dependency between CPSW switchdev and CPSW legacy drivers related to for_each_slave() and cpsw_slave_index() - they can be compiled both, but only one of them will be not functional depending in Kconfig settings due to duffrences in Slave Ports indexes calculation. To fix this make for_each_slave() local (it's used now only by legacy CPSW driver) and convert cpsw_slave_index() to be a function pointer which is assigned in probe. Driver to probe is defined by DT. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:23 -08:00
Ilias Apalodimas	e85c143707	net: ethernet: ti: ale: modify vlan/mdb api for switchdev A following patch introduces switchdev functionality, so modify ALE engine VLANs/MDBs API: - cpsw_ale_del_mcast(): update so it will remove only selected ports from mcast port_mask or delete whole mcast record if !port_mask - cpsw_ale_del_vlan(): update so it will remove only selected ports from all VLAN record's masks or delete whole VLAN record if !port_mask - add cpsw_ale_vlan_add_modify() to add or modify existing VLAN record's masks - add cpsw_ale_set_unreg_mcast() for enabling unreg mcast on port VLANs Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:23 -08:00
Grygorii Strashko	4b41d34367	net: ethernet: ti: cpsw: allow untagged traffic on host port Now untagged vlan traffic is not support on Host P0 port. This patch adds in ALE context bitmap of VLANs for which Host P0 port bit set in Force Untagged Packet Egress bitmask in VLANs ALE entries, and adds corresponding check in VLAN incapsulation header parsing function cpsw_rx_vlan_encap(). Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:23 -08:00
Grygorii Strashko	7fe579dfb9	net: ethernet: ti: ale: clean ale tbl on init and intf restart Clean CPSW ALE on init and intf restart (up/down) to avoid reading obsolete or garbage entries from ALE table. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:25:23 -08:00
David S. Miller	b9242da6f6	Merge branch 'nf_tables_offload-vlan-matching-support' Pablo Neira Ayuso says: ==================== nf_tables_offload: vlan matching support The following patchset contains Netfilter support for vlan matching offloads: 1) Constify nft_reg_load() as a preparation patch. 2) Restrict rule matching to ingress interface type ARPHRD_ETHER. 3) Add new vlan_tci field to flow_dissector_key_vlan structure, to allow to set up vlan_id, vlan_dei and vlan_priority in one go. 4) C-VLAN matching support. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:21:35 -08:00
Pablo Neira Ayuso	89d8fd44ab	netfilter: nft_payload: add C-VLAN offload support Match on h_vlan_encapsulated_proto and set up protocol dependency. Check for protocol dependency before accessing the tci field. Allow to match on the encapsulated ethertype too. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:21:34 -08:00
Pablo Neira Ayuso	a82055af59	netfilter: nft_payload: add VLAN offload support Match on ethertype and set up protocol dependency. Check for protocol dependency before accessing the tci field. Allow to match on the encapsulated ethertype too. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:21:34 -08:00
Pablo Neira Ayuso	8819efc943	netfilter: nf_tables_offload: allow ethernet interface type only Hardware offload support at this stage assumes an ethernet device in place. The flow dissector provides the intermediate representation to express this selector, so extend it to allow to store the interface type. Flower does not uses this, so skb_flow_dissect_meta() is not extended to match on this new field. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:21:34 -08:00
Pablo Neira Ayuso	7cd9a58d68	netfilter: nf_tables: constify nft_reg_load{8, 16, 64}() This patch constifies the pointer to source register data that is passed as an input parameter. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 11:21:34 -08:00
Xin Long	2f1d370b99	lwtunnel: add support for multiple geneve opts geneve RFC (draft-ietf-nvo3-geneve-14) allows a geneve packet to carry multiple geneve opts, so it's necessary for lwtunnel to support adding multiple geneve opts in one lwtunnel route. But vxlan and erspan opts are still only allowed to add one option. With this patch, iproute2 could make it like: # ip r a 1.1.1.0/24 encap ip id 1 geneve_opts 0:0:12121212,1:2:12121212 \ dst 10.1.0.2 dev geneve1 # ip r a 1.1.1.0/24 encap ip id 1 vxlan_opts 456 \ dst 10.1.0.2 dev erspan1 # ip r a 1.1.1.0/24 encap ip id 1 erspan_opts 1:123:0:0 \ dst 10.1.0.2 dev erspan1 Which are pretty much like cls_flower and act_tunnel_key. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-20 10:15:13 -08:00
YueHaibing	b2e2f0e6a6	bpf: Make array_map_mmap static Fix sparse warning: kernel/bpf/arraymap.c:481:5: warning: symbol 'array_map_mmap' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191119142113.15388-1-yuehaibing@huawei.com	2019-11-19 16:57:32 -08:00
Andrii Nakryiko	24f6505027	selftests/bpf: Enforce no-ALU32 for test_progs-no_alu32 With the most recent Clang, alu32 is enabled by default if -mcpu=probe or -mcpu=v3 is specified. Use a separate build rule with -mcpu=v2 to enforce no ALU32 mode. Suggested-by: Yonghong Song <yhs@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20191120002510.4130605-1-andriin@fb.com	2019-11-19 16:53:22 -08:00
Rahul Lakkireddy	272630feb4	cxgb4: remove unneeded semicolon for switch block Semicolon is not required at the end of switch block. So, remove it. Addresses coccinelle warning: drivers/net/ethernet/chelsio/cxgb4/sge.c:2260:2-3: Unneeded semicolon Fixes: `4846d5330d` ("cxgb4: add Tx and Rx path for ETHOFLD traffic") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 16:39:51 -08:00
Vladimir Oltean	b8fc7177d8	net: dsa: felix: Fix CPU port assignment when not last port On the NXP LS1028A, there are 2 Ethernet links between the Felix switch and the ENETC: - eno2 <-> swp4, at 2.5G - eno3 <-> swp5, at 1G Only one of the above Ethernet port pairs can act as a DSA link for tagging. When adding initial support for the driver, it was tested only on the 1G eno3 <-> swp5 interface, due to the necessity of using PHYLIB initially (which treats fixed-link interfaces as emulated C22 PHYs, so it doesn't support fixed-link speeds higher than 1G). After making PHYLINK work, it appears that swp4 still can't act as CPU port. So it looks like ocelot_set_cpu_port was being called for swp4, but then it was called again for swp5, overwriting the CPU port assigned in the DT. It appears that when you call dsa_upstream_port for a port that is not defined in the device tree (such as swp5 when using swp4 as CPU port), its dp->cpu_dp pointer is not initialized by dsa_tree_setup_default_cpu, and this trips up the following condition in dsa_upstream_port: if (!cpu_dp) return port; So the moral of the story is: don't call dsa_upstream_port for a port that is not defined in the device tree, and therefore its dsa_port structure is not completely initialized (ds->num_ports is still 6). Fixes: `5605194877` ("net: dsa: ocelot: add driver for Felix switch family") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-19 15:21:45 -08:00
Andrii Nakryiko	a0d7da26ce	libbpf: Fix call relocation offset calculation bug When relocating subprogram call, libbpf doesn't take into account relo->text_off, which comes from symbol's value. This generally works fine for subprograms implemented as static functions, but breaks for global functions. Taking a simplified test_pkt_access.c as an example: __attribute__ ((noinline)) static int test_pkt_access_subprog1(volatile struct __sk_buff skb) { return skb->len 2; } __attribute__ ((noinline)) static int test_pkt_access_subprog2(int val, volatile struct __sk_buff skb) { return skb->len + val; } SEC("classifier/test_pkt_access") int test_pkt_access(struct __sk_buff skb) { if (test_pkt_access_subprog1(skb) != skb->len * 2) return TC_ACT_SHOT; if (test_pkt_access_subprog2(2, skb) != skb->len + 2) return TC_ACT_SHOT; return TC_ACT_UNSPEC; } When compiled, we get two relocations, pointing to '.text' symbol. .text has st_value set to 0 (it points to the beginning of .text section): 0000000000000008 000000050000000a R_BPF_64_32 0000000000000000 .text 0000000000000040 000000050000000a R_BPF_64_32 0000000000000000 .text test_pkt_access_subprog1 and test_pkt_access_subprog2 offsets (targets of two calls) are encoded within call instruction's imm32 part as -1 and 2, respectively: 0000000000000000 test_pkt_access_subprog1: 0: 61 10 00 00 00 00 00 00 r0 = (u32 )(r1 + 0) 1: 64 00 00 00 01 00 00 00 w0 <<= 1 2: 95 00 00 00 00 00 00 00 exit 0000000000000018 test_pkt_access_subprog2: 3: 61 10 00 00 00 00 00 00 r0 = (u32 )(r1 + 0) 4: 04 00 00 00 02 00 00 00 w0 += 2 5: 95 00 00 00 00 00 00 00 exit 0000000000000000 test_pkt_access: 0: bf 16 00 00 00 00 00 00 r6 = r1 ===> 1: 85 10 00 00 ff ff ff ff call -1 2: bc 01 00 00 00 00 00 00 w1 = w0 3: b4 00 00 00 02 00 00 00 w0 = 2 4: 61 62 00 00 00 00 00 00 r2 = (u32 )(r6 + 0) 5: 64 02 00 00 01 00 00 00 w2 <<= 1 6: 5e 21 08 00 00 00 00 00 if w1 != w2 goto +8 <LBB0_3> 7: bf 61 00 00 00 00 00 00 r1 = r6 ===> 8: 85 10 00 00 02 00 00 00 call 2 9: bc 01 00 00 00 00 00 00 w1 = w0 10: 61 62 00 00 00 00 00 00 r2 = (u32 )(r6 + 0) 11: 04 02 00 00 02 00 00 00 w2 += 2 12: b4 00 00 00 ff ff ff ff w0 = -1 13: 1e 21 01 00 00 00 00 00 if w1 == w2 goto +1 <LBB0_3> 14: b4 00 00 00 02 00 00 00 w0 = 2 0000000000000078 LBB0_3: 15: 95 00 00 00 00 00 00 00 exit Now, if we compile example with global functions, the setup changes. Relocations are now against specifically test_pkt_access_subprog1 and test_pkt_access_subprog2 symbols, with test_pkt_access_subprog2 pointing 24 bytes into its respective section (.text), i.e., 3 instructions in: 0000000000000008 000000070000000a R_BPF_64_32 0000000000000000 test_pkt_access_subprog1 0000000000000048 000000080000000a R_BPF_64_32 0000000000000018 test_pkt_access_subprog2 Calls instructions now encode offsets relative to function symbols and are both set ot -1: 0000000000000000 test_pkt_access_subprog1: 0: 61 10 00 00 00 00 00 00 r0 = (u32 )(r1 + 0) 1: 64 00 00 00 01 00 00 00 w0 <<= 1 2: 95 00 00 00 00 00 00 00 exit 0000000000000018 test_pkt_access_subprog2: 3: 61 20 00 00 00 00 00 00 r0 = (u32 )(r2 + 0) 4: 0c 10 00 00 00 00 00 00 w0 += w1 5: 95 00 00 00 00 00 00 00 exit 0000000000000000 test_pkt_access: 0: bf 16 00 00 00 00 00 00 r6 = r1 ===> 1: 85 10 00 00 ff ff ff ff call -1 2: bc 01 00 00 00 00 00 00 w1 = w0 3: b4 00 00 00 02 00 00 00 w0 = 2 4: 61 62 00 00 00 00 00 00 r2 = (u32 )(r6 + 0) 5: 64 02 00 00 01 00 00 00 w2 <<= 1 6: 5e 21 09 00 00 00 00 00 if w1 != w2 goto +9 <LBB2_3> 7: b4 01 00 00 02 00 00 00 w1 = 2 8: bf 62 00 00 00 00 00 00 r2 = r6 ===> 9: 85 10 00 00 ff ff ff ff call -1 10: bc 01 00 00 00 00 00 00 w1 = w0 11: 61 62 00 00 00 00 00 00 r2 = (u32 )(r6 + 0) 12: 04 02 00 00 02 00 00 00 w2 += 2 13: b4 00 00 00 ff ff ff ff w0 = -1 14: 1e 21 01 00 00 00 00 00 if w1 == w2 goto +1 <LBB2_3> 15: b4 00 00 00 02 00 00 00 w0 = 2 0000000000000080 LBB2_3: 16: 95 00 00 00 00 00 00 00 exit Thus the right formula to calculate target call offset after relocation should take into account relocation's target symbol value (offset within section), call instruction's imm32 offset, and (subtracting, to get relative instruction offset) instruction index of call instruction itself. All that is shifted by number of instructions in main program, given all sub-programs are copied over after main program. Convert few selftests relying on bpf-to-bpf calls to use global functions instead of static ones. Fixes: `48cca7e44f` ("libbpf: add support for bpf_call") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20191119224447.3781271-1-andriin@fb.com	2019-11-19 15:00:12 -08:00

1 2 3 4 5 ...

875082 Commits