linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-12 15:11:50 +00:00

Author	SHA1	Message	Date
Martin KaFai Lau	4ed8ec521e	cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations. To update an element, the caller is expected to obtain a cgroup2 backed fd by open(cgroup2_dir) and then update the array with that fd. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:30:38 -04:00
Martin KaFai Lau	1f3fe7ebf6	cgroup: Add cgroup_get_from_fd Add a helper function to get a cgroup2 from a fd. It will be stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will be introduced in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:30:38 -04:00
David S. Miller	6bd3847bdc	Merge branch 'bpf-robustify' Daniel Borkmann says: ==================== Further robustify putting BPF progs This series addresses a potential issue reported to us by Jann Horn with regards to putting progs. First patch moves progs generally under RCU destruction and second patch refactors getting of progs to simplify code a bit. For details, please see individual patches. Note, we think that addressing this one in net-next should be sufficient. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:00:52 -04:00
Daniel Borkmann	113214be7f	bpf: refactor bpf_prog_get and type check into helper Since bpf_prog_get() and program type check is used in a couple of places, refactor this into a small helper function that we can make use of. Since the non RO prog->aux part is not used in performance critical paths and a program destruction via RCU is rather very unlikley when doing the put, we shouldn't have an issue just doing the bpf_prog_get() + prog->type != type check, but actually not taking the ref at all (due to being in fdget() / fdput() section of the bpf fd) is even cleaner and makes the diff smaller as well, so just go for that. Callsites are changed to make use of the new helper where possible. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:00:47 -04:00
Daniel Borkmann	1aacde3d22	bpf: generally move prog destruction to RCU deferral Jann Horn reported following analysis that could potentially result in a very hard to trigger (if not impossible) UAF race, to quote his event timeline: - Set up a process with threads T1, T2 and T3 - Let T1 set up a socket filter F1 that invokes another filter F2 through a BPF map [tail call] - Let T1 trigger the socket filter via a unix domain socket write, don't wait for completion - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put() - Let T3 close the file descriptor for F2, dropping the reference count of F2 to 2 - At this point, T1 should have looked up F2 from the map, but not finished executing it - Let T3 remove F2 from the BPF map, dropping the reference count of F2 to 1 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping the reference count of F2 to 0 and scheduling bpf_prog_free_deferred() via schedule_work() - At this point, the BPF program could be freed - BPF execution is still running in a freed BPF program While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf event fd we're doing the syscall on doesn't disappear from underneath us for whole syscall time, it may not be the case for the bpf fd used as an argument only after we did the put. It needs to be a valid fd pointing to a BPF program at the time of the call to make the bpf_prog_get() and while T2 gets preempted, F2 must have dropped reference to 1 on the other CPU. The fput() from the close() in T3 should also add additionally delay to the reference drop via exit_task_work() when bpf_prog_release() gets called as well as scheduling bpf_prog_free_deferred(). That said, it makes nevertheless sense to move the BPF prog destruction generally after RCU grace period to guarantee that such scenario above, but also others as recently fixed in `ceb5607035` ("bpf, perf: delay release of BPF prog after grace period") with regards to tail calls won't happen. Integrating bpf_prog_free_deferred() directly into the RCU callback is not allowed since the invocation might happen from either softirq or process context, so we're not permitted to block. Reviewing all bpf_prog_put() invocations from eBPF side (note, cBPF -> eBPF progs don't use this for their destruction) with call_rcu() look good to me. Since we don't know whether at the time of attaching the program, we're already part of a tail call map, we need to use RCU variant. However, due to this, there won't be severely more stress on the RCU callback queue: situations with above bpf_prog_get() and bpf_prog_put() combo in practice normally won't lead to releases, but even if they would, enough effort/ cycles have to be put into loading a BPF program into the kernel already. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:00:47 -04:00
Amitoj Kaur Chawla	466fc793da	atm: horizon: Use setup_timer Convert a call to init_timer and accompanying intializations of the timer's data and function fields to a call to setup_timer. The Coccinelle semantic patch that fixes this problem is as follows: @@ expression t,d,f,e1; identifier x1; statement S1; @@ ( -t.data = d; \| -t.function = f; \| -init_timer(&t); +setup_timer(&t,f,d); \| -init_timer_on_stack(&t); +setup_timer_on_stack(&t,f,d); ) <... when != S1 t.x1 = e1; ...> Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:58:26 -04:00
David S. Miller	e3cc6e37a7	Merge branch 'qed-next' Manish Chopra says: ==================== qede: Enhancements This patch series have few small fastpath features support and code refactoring. Note - regarding get/set tunable configuration via ethtool Surprisingly, there is NO ethtool application support for such configuration given that we have kernel support. Do let us know if we need to add support for that in user ethtool. Please consider applying this series to "net-next". ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:58 -04:00
Manish Chopra	831a8e6c40	qede: Bump up driver version to 8.10.1.20 Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
Manish Chopra	3d789994b0	qede: Add get/set rx copy break tunable support Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
Manish Chopra	312e06761c	qede: Utilize xmit_more This patch uses xmit_more optimization to reduce number of TX doorbells write per packet. Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
Manish Chopra	c774169d8f	qede: qede_poll refactoring This patch cleanups qede_poll() routine a bit and allows qede_poll() to do single iteration to handle TX completion [As under heavy TX load qede_poll() might run for indefinite time in the while(1) loop for TX completion processing and cause CPU stuck]. Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
Manish Chopra	c72a6125d0	qede: Add support for handling IP fragmented packets. When handling IP fragmented packets with csum in their transport header, the csum isn't changed as part of the fragmentation. As a result, the packet containing the transport headers would have the correct csum of the original packet, but one that mismatches the actual packet that passes on the wire. As a result, on receive path HW would give an indication that the packet has incorrect csum, which would cause qede to discard the incoming packet. Since HW also delivers a notification of IP fragments, change driver behavior to pass such incoming packets to stack and let it make the decision whether it needs to be dropped. Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
David S. Miller	beb528d01f	Merge branch 'tun-skb_array' Jason Wang says: ==================== switch to use tx skb array in tun This series tries to switch to use skb array in tun. This is used to eliminate the spinlock contention between producer and consumer. The conversion was straightforward: just introdce a tx skb array and use it instead of sk_receive_queue. A minor issue is to keep the tx_queue_len behaviour, since tun used to use it for the length of sk_receive_queue. This is done through: - add the ability to resize multiple rings at once to avoid handling partial resize failure for mutiple rings. - add the support for zero length ring. - introduce a notifier which was triggered when tx_queue_len was changed for a netdev. - resize all queues during the tx_queue_len changing. Tests shows about 15% improvement on guest rx pps: Before: ~1300000pps After : ~1500000pps Changes from V3: - fix kbuild warnings - call NETDEV_CHANGE_TX_QUEUE_LEN on IFLA_TXQLEN Changes from V2: - add multiple rings resizing support for ptr_ring/skb_array - add zero length ring support - introdce a NETDEV_CHANGE_TX_QUEUE_LEN - drop new flags Changes from V1: - switch to use skb array instead of a customized circular buffer - add non-blocking support - rename .peek to .peek_len - drop lockless peeking since test show very minor improvement ==================== Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-from-altitude: 34697 feet. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:28 -04:00
Jason Wang	1576d98605	tun: switch to use skb array for tx We used to queue tx packets in sk_receive_queue, this is less efficient since it requires spinlocks to synchronize between producer and consumer. This patch tries to address this by: - switch from sk_receive_queue to a skb_array, and resize it when tx_queue_len was changed. - introduce a new proto_ops peek_len which was used for peeking the skb length. - implement a tun version of peek_len for vhost_net to use and convert vhost_net to use peek_len if possible. Pktgen test shows about 15.3% improvement on guest receiving pps for small buffers: Before: ~1300000pps After : ~1500000pps Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:17 -04:00
Jason Wang	08294a26e1	net: introduce NETDEV_CHANGE_TX_QUEUE_LEN This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this will be triggered when tx_queue_len. It could be used by net device who want to do some processing at that time. An example is tun who may want to resize tx array when tx_queue_len is changed. Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:17 -04:00
Jason Wang	bf900b3dbe	skb_array: add wrappers for resizing Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:17 -04:00
Michael S. Tsirkin	59e6ae5324	ptr_ring: support resizing multiple queues Sometimes, we need support resizing multiple queues at once. This is because it was not easy to recover to recover from a partial failure of multiple queues resizing. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:17 -04:00
Jason Wang	fd68adec9d	skb_array: minor tweak Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:16 -04:00
Jason Wang	982fb490c2	ptr_ring: support zero length ring Sometimes, we need zero length ring. But current code will crash since we don't do any check before accessing the ring. This patch fixes this. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:16 -04:00
David S. Miller	8dc7243abb	Merge branch 'sch_hfsc-fixes-cleanups' Michal Soltys says: ==================== HFSC patches, part 1 It's revised version of part of the patches I submitted really, really long time ago (back then I asked Patrick to ignore them as I found some issues shortly after submitting). Anyway this is the first set with very simple fixes/changes though some of them relatively subtle (I tried to do very exhaustive commit messages explaining what and why with those). The patches are against net-next tree. The second set will be heavier - or rather with more complex explanations, among those I have: - a fix to subtle issue introduced in http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281 along with simplifying related stuff - update times to 96 bits (which allows to "just" use 32 bit shifts and improves curve definition accuracy at more extreme low/high speeds) - add curve "merging" instead of just selecting in convex case (computations mirror those from concave intersection) But these are eventually for later. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:51 -04:00
Michal Soltys	33ef84a77d	net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc() cl->cl_vt alone is relative only to the current backlog period, while the curve operates on cumulative virtual time. This patch adds missing cl->cl_vtoff. Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Michal Soltys	ab12cb4742	net/sched/sch_hfsc.c: go passive after vt update When a class is going passive, it should update its cl_vt first to be consistent with the last dequeue operation. Otherwise its cl_vt will be one packet behind and parent's cvtmax might not be updated as well. One possible side effect is if some class goes passive and subsequently goes active /without/ its parent going passive - with cl_vt lagging one packet behind - comparison made in init_vf() will be affected (same period). Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Michal Soltys	2354f056f6	net/sched/sch_hfsc.c: remove leftover dlist and droplist This is update to: commit `a09ceb0e08` ("sched: remove qdisc->drop") That commit removed qdisc->drop, but left alone dlist and droplist that no longer serve any meaningful purpose. Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Michal Soltys	d1d0fc5e4c	net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len() The condition can only succeed on wrong configurations. Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Michal Soltys	12d0ad3be9	net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline Realtime scheduling implemented in HFSC uses head of the queue to make the decision about which packet to schedule next. But in case of any head drop, the deadline calculated for the previous head is not necessarily correct for the next head (unless both packets have the same length). Thanks to peek() function used during dequeue - which internally is a dequeue operation - hfsc is almost safe from this issue, as peek() dequeues and isolates the head storing it temporarily until the real dequeue happens. But there is one exception: if after the class activation a drop happens before the first dequeue operation, there's never a chance to do the peek(). Adding peek() call in enqueue - if this is the first packet in a new backlog period AND the scheduler has realtime curve defined - fixes that one corner case. The 1st hfsc_dequeue() will use that peeked packet, similarly as every subsequent hfsc_dequeue() call uses packet peeked by the previous call. Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Eric Dumazet	19689e38ec	tcp: md5: use kmalloc() backed scratch areas Some arches have virtually mapped kernel stacks, or will soon have. tcp_md5_hash_header() uses an automatic variable to copy tcp header before mangling th->check and calling crypto function, which might be problematic on such arches. David says that using percpu storage is also problematic on non SMP builds. Just use kmalloc() to allocate scratch areas. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 04:02:55 -04:00
David S. Miller	435c556cde	Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2016-06-29 This series contains updates and fixes to e1000e, igb, ixgbe and fm10k. A true smorgasbord of changes. Jake cleans up some obscurity by not using the BIT() macro on bitshift operation and also fixed the calculated index when looping through the indir array. Fixes the issue with igb's workqueue item for overflow check from causing a surprise remove event. The ptp_flags variable is added to simplify the work of writing several complex MAC type checks in the PTP code while fixing the workqueue. Alex Duyck fixes the receive buffers alignment which should not be L1 cache aligned, but to 512 bytes instead. Denys Vlasenko prevents a division by zero which was reported under VMWare for e1000e. Amritha fixes an issue where filters in a child hash table must be cleared from the hardware before delete the filter links in ixgbe. Bhaktipriya Shridhar simply replaces the deprecated create_workqueue() with alloc_workqueue() for fm10k. Tony corrects ixgbe ethtool reporting to show x550 supports hardware timestamping of all packets. Emil fixes an issue where MAC-VLANs on the VF fail to pass traffic due to spoofed packets. Andrew Lunn increases performance on some systems where syncing a buffer for DMA is expensive. So rather than sync the whole 2K receive buffer, only synchronize the length of the frame. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 09:29:07 -04:00
David S. Miller	c435e6e048	Merge branch 'nfp-next' Jakub Kicinski says: ==================== nfp: few code improvements Three small patches for net-next. First and second patches improve the code quality by spelling things correctly and removing unused parameters. Third patch hooks-in standard kernel implementation of .get_link() in ethtool ops. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 09:12:23 -04:00
Jakub Kicinski	2370def2e4	nfp: implement ethtool .get_link() callback Point the ethtool .get_link() callback to the standard ethtool_op_get_link() implementation. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 09:12:14 -04:00
Jakub Kicinski	f642963bff	nfp: remove unused parameter from nfp_net_write_mac_addr() nfp_net_write_mac_addr() always writes to the BAR the current device address taken from netdev struct. The address given as parameter is actually ignored. Since all callers pass netdev->dev_addr simply remove the parameter. While at it improve the function's kdoc a bit. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 09:12:14 -04:00
Jakub Kicinski	796312cd00	nfp: correct name of control BAR define Spell abbreviation of control as ctrl not crtl. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 09:12:14 -04:00
Dan Carpenter	6fde0e63ec	be2net: signedness bug in be_msix_enable() "num_vec" needs to be signed for the error handling to work. Fixes: `e261768e9e` ('be2net: support asymmetric rx/tx queue counts') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Sathya Perla <sathya.perla@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:54:18 -04:00
Masanari Iida	9b9a553c90	net: netcp: Fix a typo in keystone-netcp.txt This patch fix a spelling typo in keystone-netcp.txt Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:53:44 -04:00
David S. Miller	833ba3d585	Merge branch 'mediatek-next' John Crispin says: ==================== net-next: mediatek: IRQ cleanups, fixes and grouping This series contains 2 small code cleanups that are leftovers from the MIPS support. There is also a small fix that adds proper locking to the code accessing the IRQ registers. Without this fix we saw deadlocks caused by the last patch of the series, which adds IRQ grouping. The grouping feature allows us to use different IRQs for TX and RX. By doing so we can use affinity to let the SoC handle the IRQs on different cores. This series depends on a previous series currently sitting in net.git starting with commit `562c5a7040` ("net: mediatek: only wake the queue if it is stopped") up to commit `82c6544ddd` ("net: mediatek: remove superfluous queue wake up call") ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:52:12 -04:00
John Crispin	8067302973	net-next: mediatek: add support for IRQ grouping The ethernet core has 3 IRQs. Using the IRQ grouping registers we are able to separate TX and RX IRQs, which allows us to service them on separate cores. This patch splits the IRQ handler into 2 separate functions, one for TX and another for RX. The TX housekeeping is split out into its own NAPI handler. Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:52:04 -04:00
John Crispin	7bc9ccec34	net-next: mediatek: add IRQ locking The code that enables and disables IRQs is missing proper locking. After adding the IRQ grouping patch and routing the RX and TX IRQs to different cores we experienced IRQ stalls. Fix this by adding proper locking. We use a dedicated lock to reduce the latency if the IRQ code. Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:52:04 -04:00
John Crispin	eece71e8fb	net-next: mediatek: don't use intermediate variables to store IRQ masks The code currently uses variables to store and never modify the bit masks of interrupts. This is legacy code from an early version of the driver that supported MIPS based SoCs where the IRQ bits depended on the actual SoC. As the bits are the same for all ARM based SoCs using this driver we can remove the intermediate variables. Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:52:04 -04:00
John Crispin	6e6edd8b96	net-next: mediatek: remove superfluous register reads The driver was originally written for MIPS based SoC. These required the IRQ mask register to be read after writing it to ensure that the content was actually applied. As this version only works on ARM based SoCs, we can safely remove the 2 reads as they are no longer required. Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:52:04 -04:00
Mateusz Bajorski	153380ec4b	fib_rules: Added NLM_F_EXCL support to fib_nl_newrule When adding rule with NLM_F_EXCL flag then check if the same rule exist. If yes then exit with -EEXIST. This is already implemented in iproute2: if (cmd == RTM_NEWRULE) { req.n.nlmsg_flags \|= NLM_F_CREATE\|NLM_F_EXCL; req.r.rtm_type = RTN_UNICAST; } Tested ipv4 and ipv6 with net-next linux on qemu x86 expected behavior after patch: localhost ~ # ip rule 0: from all lookup local 32766: from all lookup main 32767: from all lookup default localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005 localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005 RTNETLINK answers: File exists localhost ~ # ip rule 0: from all lookup local 1005: from 10.46.177.97 lookup 104 32766: from all lookup main 32767: from all lookup default There was already topic regarding this but I don't see any changes merged and problem still occurs. https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain Signed-off-by: Mateusz Bajorski <mateusz.bajorski@nokia.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:23:19 -04:00
Seymour, Shane M	2631b79f6c	tcp: increase size at which tcp_bound_to_half_wnd bounds to > TCP_MSS_DEFAULT In previous commit `01f83d6984` the following comments were added: "When peer uses tiny windows, there is no use in packetizing to sub-MSS pieces for the sake of SWS or making sure there are enough packets in the pipe for fast recovery." The test should be > TCP_MSS_DEFAULT not >= 512. This allows low end devices that send an MSS of 536 (TCP_MSS_DEFAULT) to see better network performance by sending it 536 bytes of data at a time instead of bounding to half window size (268). Other network stacks work this way, e.g. HP-UX. Signed-off-by: Shane Seymour <shane.seymour@hpe.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:17:21 -04:00
Andrey Vagin	b1ed4c4fa9	tcp: add an ability to dump and restore window parameters We found that sometimes a restored tcp socket doesn't work. A reason of this bug is incorrect window parameters and in this case tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The other side drops packets with this seq, because seq is less than tp->rcv_nxt ( tcp_sequence() ). Data from a send queue is sent only if there is enough space in a window, so when we restore unacked data, we need to expand a window to fit this data. This was in a first version of this patch: "tcp: extend window to fit all restored unacked data in a send queue" Then Alexey recommended me to restore window parameters instead of adjusted them according with data in a sent queue. This sounds resonable. rcv_wnd has to be restored, because it was reported to another side and the offered window is never shrunk. One of reasons why we need to restore snd_wnd was described above. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:15:31 -04:00
David S. Miller	641f7e405e	Merge branch 'bridge-igmp-stats' Nikolay Aleksandrov says: ==================== net: bridge: add support for IGMP/MLD stats This patchset adds support for the new IFLA_STATS_LINK_XSTATS_SLAVE attribute which can be used with RTM_GETSTATS in order to export per-slave statistics. It works by passing the attribute to the linkxstats callback and if the callback user supports it - it should dump that slave's stats. This is much more scalable and permits us to request only a single port's statistics instead of dumping everything every time. The second patch adds support for per-port IGMP/MLD statistics and uses the new API to export them for the bridge and its ports. The stats are made in a very lightweight manner, the normal fast-path is not affected at all and the flood paths (br_flood/br_multicast_flood) are only affected if the packet is IGMP and the IGMP stats have been enabled using cache-hot data for the check. v2: Patch 01 is new, patch 02 has been reworked to use the new API, also in addition counters for IGMP/MLD parse errors have been added and members are added for per-port multicast traffic stats. The multicast counting has been slightly optimized (moved the br_multicast_count inside the IPv4/6 IGMP functions after the checks for IGMP traffic) to avoid one conditional that was on all of the multicast traffic path (both IGMP and other). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 06:18:32 -04:00
Nikolay Aleksandrov	1080ab95e3	net: bridge: add support for IGMP/MLD stats and export them via netlink This patch adds stats support for the currently used IGMP/MLD types by the bridge. The stats are per-port (plus one stat per-bridge) and per-direction (RX/TX). The stats are exported via netlink via the new linkxstats API (RTM_GETSTATS). In order to minimize the performance impact, a new option is used to enable/disable the stats - multicast_stats_enabled, similar to the recent vlan stats. Also in order to avoid multiple IGMP/MLD type lookups and checks, we make use of the current "igmp" member of the bridge private skb->cb region to record the type on Rx (both host-generated and external packets pass by multicast_rcv()). We can do that since the igmp member was used as a boolean and all the valid IGMP/MLD types are positive values. The normal bridge fast-path is not affected at all, the only affected paths are the flooding ones and since we make use of the IGMP/MLD type, we can quickly determine if the packet should be counted using cache-hot data (cb's igmp member). We add counters for: * IGMP Queries * IGMP Leaves * IGMP v1/v2/v3 reports * MLD Queries * MLD Leaves * MLD v1/v2 reports These are invaluable when monitoring or debugging complex multicast setups with bridges. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 06:18:24 -04:00
Nikolay Aleksandrov	80e73cc563	net: rtnetlink: add support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute which allows to export per-slave statistics if the master device supports the linkxstats callback. The attribute is passed down to the linkxstats callback and it is up to the callback user to use it (an example has been added to the only current user - the bridge). This allows us to query only specific slaves of master devices like bridge ports and export only what we're interested in instead of having to dump all ports and searching only for a single one. This will be used to export per-port IGMP/MLD stats and also per-port vlan stats in the future, possibly other statistics as well. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 06:15:04 -04:00
David S. Miller	545c321ba3	Merge branch 'bpf-helper-improvements' Daniel Borkmann says: ==================== BPF helper improvements This set adds various BPF helper improvements, that is, cleaning up and adding BPF_F_CURRENT_CPU flag for tracing helper, allowing for preemption checks on bpf_get_smp_processor_id() helper, and adding two new helpers bpf_skb_change_{proto, type} for tc related programs. For further details please see individual patches. Note, this set requires -net to be merged into -net-next tree first. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:54:48 -04:00
Daniel Borkmann	d2485c4242	bpf: add bpf_skb_change_type helper This work adds a helper for changing skb->pkt_type in a controlled way. We only allow a subset of possible values and can extend that in future should other use cases come up. Doing this as a helper has the advantage that errors can be handeled gracefully and thus helper kept extensible. It's a write counterpart to pkt_type member we can already read from struct __sk_buff context. Major use case is to change incoming skbs to PACKET_HOST in a programmatic way instead of having to recirculate via redirect(..., BPF_F_INGRESS), for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:54:40 -04:00
Daniel Borkmann	6578171a7f	bpf: add bpf_skb_change_proto helper This patch adds a minimal helper for doing the groundwork of changing the skb->protocol in a controlled way. Currently supported is v4 to v6 and vice versa transitions, which allows f.e. for a minimal, static nat64 implementation where applications in containers that still require IPv4 can be transparently operated in an IPv6-only environment. For example, host facing veth of the container can transparently do the transitions in a programmatic way with the help of clsact qdisc and cls_bpf. Idea is to separate concerns for keeping complexity of the helper lower, which means that the programs utilize bpf_skb_change_proto(), bpf_skb_store_bytes() and bpf_lX_csum_replace() to get the job done, instead of doing everything in a single helper (and thus partially duplicating helper functionality). Also, bpf_skb_change_proto() shouldn't need to deal with raw packet data as this is done by other helpers. bpf_skb_proto_6_to_4() and bpf_skb_proto_4_to_6() unclone the skb to operate on a private one, push or pop additionally required header space and migrate the gso/gro meta data from the shared info. We do mark the gso type as dodgy so that headers are checked and segs recalculated by the gso/gro engine. The gso_size target is adapted as well. The flags argument added is currently reserved and can be used for future extensions. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:54:40 -04:00
Daniel Borkmann	80b48c4457	bpf: don't use raw processor id in generic helper Use smp_processor_id() for the generic helper bpf_get_smp_processor_id() instead of the raw variant. This allows for preemption checks when we have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only need to keep the raw variant for socket filters, but we can reuse the helper that is already there from cBPF side. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:54:40 -04:00
Daniel Borkmann	6816a7ffce	bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_read Follow-up commit to `1e33759c78` ("bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output") to add the same functionality into bpf_perf_event_read() helper. The split of index into flags and index component is also safe here, since such large maps are rejected during map allocation time. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:54:40 -04:00
Daniel Borkmann	d793133031	bpf, trace: fetch current cpu only once We currently have two invocations, which is unnecessary. Fetch it only once and use the smp_processor_id() variant, so we also get preemption checks along with it when DEBUG_PREEMPT is set. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:54:40 -04:00

1 2 3 4 5 ...

603877 Commits