linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-17 01:22:07 +00:00

Author	SHA1	Message	Date
David Howells	b24d2891cf	rxrpc: Preset timestamp on Tx sk_buffs Set the timestamp on sk_buffs holding packets to be transmitted before queueing them because the moment the packet is on the queue it can be seen by the retransmission algorithm - which may see a completely random timestamp. If the retransmission algorithm sees such a timestamp, it may retransmit the packet and, in future, tell the congestion management algorithm that the retransmit timer expired. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-23 13:17:52 +01:00
Douglas Caetano dos Santos	2fe664f1fc	tcp: fix wrong checksum calculation on MTU probing With TCP MTU probing enabled and offload TX checksumming disabled, tcp_mtu_probe() calculated the wrong checksum when a fragment being copied into the probe's SKB had an odd length. This was caused by the direct use of skb_copy_and_csum_bits() to calculate the checksum, as it pads the fragment being copied, if needed. When this fragment was not the last, a subsequent call used the previous checksum without considering this padding. The effect was a stale connection in one way, as even retransmissions wouldn't solve the problem, because the checksum was never recalculated for the full SKB length. Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-23 07:55:02 -04:00
Eric Dumazet	fefa569a9d	net_sched: sch_fq: account for schedule/timers drifts It looks like the following patch can make FQ very precise, even in VM or stressed hosts. It matters at high pacing rates. We take into account the difference between the time that was programmed when last packet was sent, and current time (a drift of tens of usecs is often observed) Add an EWMA of the unthrottle latency to help diagnostics. This latency is the difference between current time and oldest packet in delayed RB-tree. This accounts for the high resolution timer latency, but can be different under stress, as fq_check_throttled() can be opportunistically be called from a dequeue() called after an enqueue() for a different flow. Tested: // Start a 10Gbit flow $ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr & Before patch : $ sar -n DEV 10 5 \| grep eth0 \| grep Average Average: eth0 17106.04 756876.84 1102.75 1119049.02 0.00 0.00 0.52 After patch : $ sar -n DEV 10 5 \| grep eth0 \| grep Average Average: eth0 17867.00 800245.90 1151.77 1183172.12 0.00 0.00 0.52 A new iproute2 tc can output the 'unthrottle latency' : $ tc -s qd sh dev eth0 \| grep latency 0 gc, 0 highprio, 32490767 throttled, 2382 ns latency Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-23 07:19:06 -04:00
Marcelo Ricardo Leitner	a3007446e5	sctp: fix the handling of SACK Gap Ack blocks sctp_acked() is using 32bit arithmetics on 16bits vars, via TSN_lte() macros, which is weird and confusing. Once the offset to ctsn is calculated, all wrapping is already handled and thus to verify the Gap Ack blocks we can just use pure less/big-or-equal than checks. Also, rename gap variable to tsn_offset, so it's more meaningful, as it doesn't point to any gap at all. Even so, I don't think this discrepancy resulted in any practical bug. This patch is a preparation for the next one, which will introduce typecheck() for TSN_lte() macros and would cause a compile error here. Suggested-by: David Laight <David.Laight@ACULAB.COM> Reported-by: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-23 06:54:58 -04:00
WANG Cong	3d4357fba8	sch_sfb: keep backlog updated with qlen Fixes: `2ccccf5fb4` ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-23 06:52:31 -04:00
WANG Cong	2ed5c3f096	sch_qfq: keep backlog updated with qlen Reported-by: Stas Nichiporovich <stasn77@gmail.com> Fixes: `2ccccf5fb4` ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-23 06:52:31 -04:00
WANG Cong	21641c2e1f	net_sched: check NULL on error path in route4_change() On error path in route4_change(), 'f' could be NULL, so we should check NULL before calling tcf_exts_destroy(). Fixes: `b9a24bb76b` ("net_sched: properly handle failure case of tcf_exts_init()") Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-23 06:51:49 -04:00
David S. Miller	d6989d4bbe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2016-09-23 06:46:57 -04:00
Pablo Neira Ayuso	4004d5c374	netfilter: nft_lookup: remove superfluous element found check We already checked for !found just a bit before: if (!found) { regs->verdict.code = NFT_BREAK; return; } if (found && set->flags & NFT_SET_MAP) ^^^^^ So this redundant check can just go away. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-23 09:30:48 +02:00
Gao Feng	b9d80f83bf	netfilter: xt_helper: Use sizeof(variable) instead of literal number It's better to use sizeof(info->name)-1 as index to force set the string tail instead of literal number '29'. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-23 09:30:43 +02:00
Gao Feng	7bdc66242d	netfilter: Enhance the codes used to get random once There are some codes which are used to get one random once in netfilter. We could use net_get_random_once to simplify these codes. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-23 09:30:36 +02:00
Liping Zhang	a20877b5ed	netfilter: nf_tables: check tprot_set first when we use xt.thoff pkt->xt.thoff is not always set properly, but we use it without any check. For payload expr, it will cause wrong results. For nftrace, we may notify the wrong network or transport header to the user space, furthermore, input the following nft rules, warning message will be printed out: # nft add rule arp filter output meta nftrace set 1 WARNING: CPU: 0 PID: 13428 at net/netfilter/nf_tables_trace.c:263 nft_trace_notify+0x4a3/0x5e0 [nf_tables] Call Trace: [<ffffffff813d58ae>] dump_stack+0x63/0x85 [<ffffffff810a4c0b>] __warn+0xcb/0xf0 [<ffffffff810a4d3d>] warn_slowpath_null+0x1d/0x20 [<ffffffffa0589703>] nft_trace_notify+0x4a3/0x5e0 [nf_tables] [ ... ] [<ffffffffa05690a8>] nft_do_chain_arp+0x78/0x90 [nf_tables_arp] [<ffffffff816f4aa2>] nf_iterate+0x62/0x80 [<ffffffff816f4b33>] nf_hook_slow+0x73/0xd0 [<ffffffff81732bbf>] arp_xmit+0x8f/0xb0 [ ... ] [<ffffffff81732d36>] arp_solicit+0x106/0x2c0 So before we use pkt->xt.thoff, check the tprot_set first. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-23 09:30:26 +02:00
Liping Zhang	8dc3c2b86b	netfilter: nf_tables: improve nft payload fast eval There's an off-by-one issue in nft_payload_fast_eval, skb_tail_pointer and ptr + priv->len all point to the last valid address plus 1. So if they are equal, we can still fetch the valid data. It's unnecessary to fall back to nft_payload_eval. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-23 09:30:16 +02:00
Liping Zhang	8061bb5443	netfilter: nft_queue: add _SREG_QNUM attr to select the queue number Currently, the user can specify the queue numbers by _QUEUE_NUM and _QUEUE_TOTAL attributes, this is enough in most situations. But acctually, it is not very flexible, for example: tcp dport 80 mapped to queue0 tcp dport 81 mapped to queue1 tcp dport 82 mapped to queue2 In order to do this thing, we must add 3 nft rules, and more mapping meant more rules ... So take one register to select the queue number, then we can add one simple rule to mapping queues, maybe like this: queue num tcp dport map { 80:0, 81:1, 82:2 ... } Florian Westphal also proposed wider usage scenarios: queue num jhash ip saddr . ip daddr mod ... queue num meta cpu ... queue num meta mark ... The last point is how to load a queue number from sreg, although we can use (u16)&regs->data[reg] to load the queue number, just like nat expr to load its l4port do. But we will cooperate with hash expr, meta cpu, meta mark expr and so on. They all store the result to u32 type, so cast it to u16 pointer and dereference it will generate wrong result in the big endian system. So just keep it simple, we treat queue number as u32 type, although u16 type is already enough. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-23 09:29:50 +02:00
Laura Garcia Liebana	36b701fae1	netfilter: nf_tables: validate maximum value of u32 netlink attributes Fetch value and validate u32 netlink attribute. This validation is usually required when the u32 netlink attributes are being stored in a field whose size is smaller. This patch revisits `4da449ae1d` ("netfilter: nft_exthdr: Add size check on u8 nft_exthdr attributes"). Fixes: `96518518cc` ("netfilter: add nftables") Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-23 09:29:02 +02:00
Eric W. Biederman	7872559664	Merge branch 'nsfs-ioctls' into HEAD From: Andrey Vagin <avagin@openvz.org> Each namespace has an owning user namespace and now there is not way to discover these relationships. Pid and user namepaces are hierarchical. There is no way to discover parent-child relationships too. Why we may want to know relationships between namespaces? One use would be visualization, in order to understand the running system. Another would be to answer the question: what capability does process X have to perform operations on a resource governed by namespace Y? One more use-case (which usually called abnormal) is checkpoint/restart. In CRIU we are going to dump and restore nested namespaces. There [1] was a discussion about which interface to choose to determing relationships between namespaces. Eric suggested to add two ioctl-s [2]: > Grumble, Grumble. I think this may actually a case for creating ioctls > for these two cases. Now that random nsfs file descriptors are bind > mountable the original reason for using proc files is not as pressing. > > One ioctl for the user namespace that owns a file descriptor. > One ioctl for the parent namespace of a namespace file descriptor. Here is an implementaions of these ioctl-s. $ man man7/namespaces.7 ... Since Linux 4.X, the following ioctl(2) calls are supported for namespace file descriptors. The correct syntax is: fd = ioctl(ns_fd, ioctl_type); where ioctl_type is one of the following: NS_GET_USERNS Returns a file descriptor that refers to an owning user names‐ pace. NS_GET_PARENT Returns a file descriptor that refers to a parent namespace. This ioctl(2) can be used for pid and user namespaces. For user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same meaning. In addition to generic ioctl(2) errors, the following specific ones can occur: EINVAL NS_GET_PARENT was called for a nonhierarchical namespace. EPERM The requested namespace is outside of the current namespace scope. [1] https://lkml.org/lkml/2016/7/6/158 [2] https://lkml.org/lkml/2016/7/9/101 Changes for v2: * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing outside of the init namespace, so we can return EPERM in this case too. > The fewer special cases the easier the code is to get > correct, and the easier it is to read. // Eric Changes for v3: * rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: "W. Trevor King" <wking@tremily.us> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Serge Hallyn <serge.hallyn@canonical.com>	2016-09-22 20:00:36 -05:00
Andrey Vagin	bcac25a58b	kernel: add a helper to get an owning user namespace for a namespace Return -EPERM if an owning user namespace is outside of a process current user namespace. v2: In a first version ns_get_owner returned ENOENT for init_user_ns. This special cases was removed from this version. There is nothing outside of init_user_ns, so we can return EPERM. v3: rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2016-09-22 19:59:39 -05:00
Trond Myklebust	a6cebd41b8	SUNRPC: Fix setting of buffer length in xdr_set_next_buffer() Use xdr->nwords to tell us how much buffer remains. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 17:17:47 -04:00
Trond Myklebust	ace0e14f4f	SUNRPC: Fix corruption of xdr->nwords in xdr_copy_to_scratch When we copy the first part of the data, we need to ensure that value of xdr->nwords is updated as well. Do so by calling __xdr_inline_decode() Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 17:12:31 -04:00
Eric W. Biederman	df75e7748b	userns: When the per user per user namespace limit is reached return ENOSPC The current error codes returned when a the per user per user namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I asked for advice on linux-api and it we made clear that those were the wrong error code, but a correct effor code was not suggested. The best general error code I have found for hitting a resource limit is ENOSPC. It is not perfect but as it is unambiguous it will serve until someone comes up with a better error code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-22 13:25:56 -05:00
Linus Torvalds	f887c21e21	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "Mostly small bits scattered all over the place, which is usually how things go this late in the -rc series. 1) Proper driver init device resets in bnx2, from Baoquan He. 2) Fix accounting overflow in __tcp_retransmit_skb(), sk_forward_alloc, and ip_idents_reserve, from Eric Dumazet. 3) Fix crash in bna driver ethtool stats handling, from Ivan Vecera. 4) Missing check of skb_linearize() return value in mac80211, from Johannes Berg. 5) Endianness fix in nf_table_trace dumps, from Liping Zhang. 6) SSN comparison fix in SCTP, from Marcelo Ricardo Leitner. 7) Update DSA and b44 MAINTAINERS entries. 8) Make input path of vti6 driver work again, from Nicolas Dichtel. 9) Off-by-one in mlx4, from Sebastian Ott. 10) Fix fallback route lookup handling in ipv6, from Vincent Bernat. 11) Fix stack corruption on probe in qed driver, from Yuval Mintz. 12) PHY init fixes in r8152 from Hayes Wang. 13) Missing SKB free in irda_accept error path, from Phil Turnbull" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits) tcp: properly account Fast Open SYN-ACK retrans tcp: fix under-accounting retransmit SNMP counters MAINTAINERS: Update b44 maintainer. net: get rid of an signed integer overflow in ip_idents_reserve() net/mlx4_core: Fix to clean devlink resources net: can: ifi: Configure transmitter delay vti6: fix input path ipmr, ip6mr: return lastuse relative to now r8152: disable ALDPS and EEE before setting PHY r8152: remove r8153_enable_eee r8152: move PHY settings to hw_phy_cfg r8152: move enabling PHY r8152: move some functions cxgb4/cxgb4vf: Allocate more queues for 25G and 100G adapter qed: Fix stack corruption on probe MAINTAINERS: Add an entry for the core network DSA code net: ipv6: fallback to full lookup if table lookup is unsuitable net/mlx5: E-Switch, Handle mode change failures net/mlx5: E-Switch, Fix error flow in the SRIOV e-switch init code net/mlx5: Fix flow counter bulk command out mailbox allocation ...	2016-09-22 08:49:25 -07:00
Michał Narajowski	7dc6f16c68	Bluetooth: Fix not updating scan rsp when adv off Scan response data should not be updated unless there is an advertising instance. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-22 17:48:23 +02:00
Arek Lichwa	dd7e39bbfc	Bluetooth: Fix NULL pointer dereference in mgmt context Adds missing callback assignment to cmd_complete in pending management command context. Dump path involves security procedure performed on legacy (pre-SSP) devices with service security requirements set to HIGH (16digits PIN). It fails when shorter PIN is delivered by user. [ 1.517950] Bluetooth: PIN code is not 16 bytes long [ 1.518491] BUG: unable to handle kernel NULL pointer dereference at (null) [ 1.518584] IP: [< (null)>] (null) [ 1.518584] PGD 9e08067 PUD 9fdf067 PMD 0 [ 1.518584] Oops: 0010 [#1] SMP [ 1.518584] Modules linked in: [ 1.518584] CPU: 0 PID: 1002 Comm: kworker/u3:2 Not tainted 4.8.0-rc6-354649-gaf4168c #16 [ 1.518584] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-20160701_074356-anatol 04/01/2014 [ 1.518584] Workqueue: hci0 hci_rx_work [ 1.518584] task: ffff880009ce14c0 task.stack: ffff880009e10000 [ 1.518584] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [ 1.518584] RSP: 0018:ffff880009e13bc8 EFLAGS: 00010293 [ 1.518584] RAX: 0000000000000000 RBX: ffff880009eed100 RCX: 0000000000000006 [ 1.518584] RDX: ffff880009ddc000 RSI: 0000000000000000 RDI: ffff880009eed100 [ 1.518584] RBP: ffff880009e13be0 R08: 0000000000000000 R09: 0000000000000001 [ 1.518584] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 1.518584] R13: ffff880009e13ccd R14: ffff880009ddc000 R15: ffff880009ddc010 [ 1.518584] FS: 0000000000000000(0000) GS:ffff88000bc00000(0000) knlGS:0000000000000000 [ 1.518584] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.518584] CR2: 0000000000000000 CR3: 0000000009fdd000 CR4: 00000000000006f0 [ 1.518584] Stack: [ 1.518584] ffffffff81909808 ffff880009e13cce ffff880009e0d40b ffff880009e13c68 [ 1.518584] ffffffff818f428d 00000000024000c0 ffff880009e13c08 ffffffff810ca903 [ 1.518584] ffff880009e13c48 ffffffff811ade34 ffffffff8178c31f ffff880009ee6200 [ 1.518584] Call Trace: [ 1.518584] [<ffffffff81909808>] ? mgmt_pin_code_neg_reply_complete+0x38/0x60 [ 1.518584] [<ffffffff818f428d>] hci_cmd_complete_evt+0x69d/0x3200 [ 1.518584] [<ffffffff810ca903>] ? rcu_read_lock_sched_held+0x53/0x60 [ 1.518584] [<ffffffff811ade34>] ? kmem_cache_alloc+0x1a4/0x200 [ 1.518584] [<ffffffff8178c31f>] ? skb_clone+0x4f/0xa0 [ 1.518584] [<ffffffff818f9d81>] hci_event_packet+0x8e1/0x28e0 [ 1.518584] [<ffffffff81a421f1>] ? _raw_spin_unlock_irqrestore+0x31/0x50 [ 1.518584] [<ffffffff810aea3e>] ? trace_hardirqs_on_caller+0xee/0x1b0 [ 1.518584] [<ffffffff818e6bd1>] hci_rx_work+0x1e1/0x5b0 [ 1.518584] [<ffffffff8107e4bd>] ? process_one_work+0x1ed/0x6b0 [ 1.518584] [<ffffffff8107e538>] process_one_work+0x268/0x6b0 [ 1.518584] [<ffffffff8107e4bd>] ? process_one_work+0x1ed/0x6b0 [ 1.518584] [<ffffffff8107e9c3>] worker_thread+0x43/0x4e0 [ 1.518584] [<ffffffff8107e980>] ? process_one_work+0x6b0/0x6b0 [ 1.518584] [<ffffffff8107e980>] ? process_one_work+0x6b0/0x6b0 [ 1.518584] [<ffffffff8108505f>] kthread+0xdf/0x100 [ 1.518584] [<ffffffff81a4297f>] ret_from_fork+0x1f/0x40 [ 1.518584] [<ffffffff81084f80>] ? kthread_create_on_node+0x210/0x210 Signed-off-by: Arek Lichwa <arek.lichwa@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-22 17:37:21 +02:00
Laura Garcia Liebana	2b03bf7324	netfilter: nft_numgen: add number generation offset Add support of an offset value for incremental counter and random. With this option the sysadmin is able to start the counter to a certain value and then apply the generated number. Example: meta mark set numgen inc mod 2 offset 100 This will generate marks with the serie 100, 101, 100, 101, ... Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-22 16:33:05 +02:00
David S. Miller	60cd6e63ec	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV+OQV/Sw1s6N8H32AQK/Gw//TF7n19v+gqUenh5m6xPYkVlZl6d/TRi+ 3JoG5pdNORxTDU7UgzkeuCywDTk5XUYsJs3TOzInRAdDedwfgIiwF3ZKw3Bo30vR cVUfG7GK4o+CLWifL3BILYMTJfkOnUS4sllylSqX/EOlPDEEspSRWTxXq+DCOGNZ 1APBRD8XfA+IIC3fleMh+zSpKZ3ffc2c7djelzo2nCG3ku78U57B23TCyzp2tQNQ 6ClvhOAwL2nMXF5vebtIU7ou6LUV/TdC4qTkQuz3du/+k+LOG/c8/6s6k70MgXQU L3DW3rcnrWxkyzDb5oQoGYSWG5x4gp/TazHbJE2kuUVhQma8eDbOAGRWJoxlSzoC LqHE+6q3KnwwXpbYd3DJ+jUI7pu7pUvub1cvJr0uxPcjRb4CzhHT/1OBUb9p4CJX /n8NFNXk+5qWsvLaPuNNBPs4pc2xgz/cotjKBYUznqObiq2xgeivZpbsEBOpSIT1 2hl0EuyAi1Gwpi6qfW8oM6EGrlAzuG77cLcLnxrDz+GsHcgqUvdSuTh0P26eOh7D 1V03kkfX9dIqkOc5xA9xAckopfG5BhQDiFsMV+5McZ2x8GtUdnMw8E7dsG8xIeY5 yDzk9m6tD79PlqS7HJ7Fzj6owzqLUeJOI08y9EUSacBFKzNak1NVmcYcXd10rDFj duNM4rDi6zA= =3zfm -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160922-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Preparation for slow-start algorithm [ver #2] Here are some patches that prepare for improvements in ACK generation and for the implementation of the slow-start part of the protocol: (1) Stop storing the protocol header in the Tx socket buffers, but rather generate it on the fly. This potentially saves a little space and makes it easier to alter the header just before transmission (the flags may get altered and the serial number has to be changed). (2) Mask off the Tx buffer annotations and add a flag to record which ones have already been resent. (3) Track RTT on a per-peer basis for use in future changes. Tracepoints are added to log this. (4) Send PING ACKs in response to incoming calls to elicit a PING-RESPONSE ACK from which RTT data can be calculated. The response also carries other useful information. (5) Expedite PING-RESPONSE ACK generation from sendmsg. If we're actively using sendmsg, this allows us, under some circumstances, to avoid having to rely on the background work item to run to generate this ACK. This requires ktime_sub_ms() to be added. (6) Set the REQUEST-ACK flag on some DATA packets to elicit ACK-REQUESTED ACKs from which RTT data can be calculated. (7) Limit the use of pings and ACK requests for RTT determination. Changes: (V2) Don't use the C division operator for 64-bit division. One instance should use do_div() and the other should be using nsecs_to_jiffies(). The last two patches got transposed, leading to an undefined symbol in one of them. Reported-by: kbuild test robot <lkp@intel.com> ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 08:14:59 -04:00
David Howells	fc943f6777	rxrpc: Reduce the number of PING ACKs sent We don't want to send a PING ACK for every new incoming call as that just adds to the network traffic. Instead, we send a PING ACK to the first three that we receive and then once per second thereafter. This could probably be made adjustable in future. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-22 08:49:22 +01:00
David Howells	0d4b103c00	rxrpc: Reduce the number of ACK-Requests sent Reduce the number of ACK-Requests we set on DATA packets that we're sending to reduce network traffic. We set the flag on odd-numbered DATA packets to start off the RTT cache until we have at least three entries in it and then probe once per second thereafter to keep it topped up. This could be made tunable in future. Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA packets as we transmit them and not stored statically in the sk_buff. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-22 08:49:20 +01:00
Yuchung Cheng	7e32b44361	tcp: properly account Fast Open SYN-ACK retrans Since the TFO socket is accepted right off SYN-data, the socket owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK retransmission or timeout stats (i.e., tcpi_total_retrans, tcpi_retransmits). Currently those stats are only updated upon handshake completes. This patch fixes it. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 03:33:01 -04:00
Yuchung Cheng	de1d657816	tcp: fix under-accounting retransmit SNMP counters This patch fixes these under-accounting SNMP rtx stats LINUX_MIB_TCPFORWARDRETRANS LINUX_MIB_TCPFASTRETRANS LINUX_MIB_TCPSLOWSTARTRETRANS when retransmitting TSO packets Fixes: `10d3be5692` ("tcp-tso: do not split TSO packets at retransmit time") Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 03:33:01 -04:00
David Howells	50235c4b5a	rxrpc: Obtain RTT data by requesting ACKs on DATA packets In addition to sending a PING ACK to gain RTT data, we can set the RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK. The ACK packet contains the serial number of the packet it is in response to, so we can look through the Tx buffer for a matching DATA packet. This requires that the data packets be stamped with the time of transmission as a ktime rather than having the resend_at time in jiffies. This further requires the resend code to do the resend determination in ktimes and convert to jiffies to set the timer. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-22 08:21:24 +01:00
David Howells	7aa51da7c8	rxrpc: Expedite ping response transmission Expedite the transmission of a response to a PING ACK by sending it from sendmsg if one is pending. We're most likely to see a PING ACK during the client call Tx phase as the other side may use it to determine a number of parameters, such as the client's receive window size, the RTT and whether the client is doing slow start (similar to RFC5681). If we don't expedite it, it's left to the background processing thread to transmit. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-22 08:21:24 +01:00
David Howells	8e83134db4	rxrpc: Send pings to get RTT data Send a PING ACK packet to the peer when we get a new incoming call from a peer we don't have a record for. The PING RESPONSE ACK packet will tell us the following about the peer: (1) its receive window size (2) its MTU sizes (3) its support for jumbo DATA packets (4) if it supports slow start (similar to RFC 5681) (5) an estimate of the RTT This is necessary because the peer won't normally send us an ACK until it gets to the Rx phase and we send it a packet, but we would like to know some of this information before we start sending packets. A pair of tracepoints are added so that RTT determination can be observed. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-22 08:21:24 +01:00
Marcelo Ricardo Leitner	4a225ce395	sctp: make use of SCTP_TRUNC4 macro And avoid the usage of '&~3'. This is the last place still not using the macro. Also break the line to make it easier to read. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 03:13:26 -04:00
Marcelo Ricardo Leitner	e2f036a972	sctp: rename WORD_TRUNC/ROUND macros To something more meaningful these days, specially because this is working on packet headers or lengths and which are not tied to any CPU arch but to the protocol itself. So, WORD_TRUNC becomes SCTP_TRUNC4 and WORD_ROUND becomes SCTP_PAD4. Reported-by: David Laight <David.Laight@ACULAB.COM> Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 03:13:26 -04:00
David S. Miller	ba1ba25d31	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-09-21 1) Propagate errors on security context allocation. From Mathias Krause. 2) Fix inbound policy checks for inter address family tunnels. From Thomas Zeitlhofer. 3) Fix an old memory leak on aead algorithm usage. From Ilan Tayari. 4) A recent patch fixed a possible NULL pointer dereference but broke the vti6 input path. Fix from Nicolas Dichtel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 02:56:23 -04:00
Eric Dumazet	f9616c35a0	tcp: implement TSQ for retransmits We saw sch_fq drops caused by the per flow limit of 100 packets and TCP when dealing with large cwnd and bursts of retransmits. Even after increasing the limit to 1000, and even after commit `10d3be5692` ("tcp-tso: do not split TSO packets at retransmit time"), we can still have these drops. Under certain conditions, TCP can spend a considerable amount of time queuing thousands of skbs in a single tcp_xmit_retransmit_queue() invocation, incurring latency spikes and stalls of other softirq handlers. This patch implements TSQ for retransmits, limiting number of packets and giving more chance for scheduling packets in both ways. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 02:44:16 -04:00
Eric Dumazet	adb03115f4	net: get rid of an signed integer overflow in ip_idents_reserve() Jiri Pirko reported an UBSAN warning happening in ip_idents_reserve() [] UBSAN: Undefined behaviour in ./arch/x86/include/asm/atomic.h:156:11 [] signed integer overflow: [] -2117905507 + -695755206 cannot be represented in type 'int' Since we do not have uatomic_add_return() yet, use atomic_cmpxchg() so that the arithmetics can be done using unsigned int. Fixes: `04ca6973f7` ("ip: make IP identifiers less predictable") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 02:41:17 -04:00
Shmulik Ladkani	ecf4ee41d2	net: skbuff: Coding: Use eth_type_vlan() instead of open coding it Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing skb->protocol to ETH_P_8021Q or ETH_P_8021AD. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 01:35:57 -04:00
Shmulik Ladkani	636c262808	net: skbuff: Remove errornous length validation in skb_vlan_pop() In `93515d53b1` "net: move vlan pop/push functions into common code" skb_vlan_pop was moved from its private location in openvswitch to skbuff common code. In case skb has non hw-accel vlan tag, the original 'pop_vlan()' assured that skb->len is sufficient (if skb->len < VLAN_ETH_HLEN then pop was considered a no-op). This validation was moved as is into the new common 'skb_vlan_pop'. Alas, in its original location (openvswitch), there was a guarantee that 'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN' condition made sense. However there's no such guarantee in the generic 'skb_vlan_pop'. For short packets received in rx path going through 'skb_vlan_pop', this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in the non hw-accel case) or to fail moving next tag into hw-accel tag. Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely: It is superfluous since inner '__skb_vlan_pop' already verifies there are VLAN_ETH_HLEN writable bytes at the mac_header. Note this presents a slight change to skb_vlan_pop() users: In case total length is smaller than VLAN_ETH_HLEN, skb_vlan_pop() now returns an error, as opposed to previous "no-op" behavior. Existing callers (e.g. tc act vlan, ovs) usually drop the packet if 'skb_vlan_pop' fails. Fixes: `93515d53b1` ("net: move vlan pop/push functions into common code") Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: Pravin Shelar <pshelar@ovn.org> Reviewed-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 01:35:57 -04:00
Shmulik Ladkani	45a497f2d1	net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action TCA_VLAN_ACT_MODIFY allows one to change an existing tag. It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id, priority). If packet is vlan tagged, then the tag gets overwritten according to user specified attributes. For example, this allows user to replace a tag's vid while preserving its priority bits (as opposed to "action vlan pop pipe action vlan push"). Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 01:34:20 -04:00
Shmulik Ladkani	bfca4c520f	net: skbuff: Export __skb_vlan_pop This exports the functionality of extracting the tag from the payload, without moving next vlan tag into hw accel tag. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-22 01:34:20 -04:00
David Howells	cf1a6474f8	rxrpc: Add per-peer RTT tracker Add a function to track the average RTT for a peer. Sources of RTT data will be added in subsequent patches. The RTT data will be useful in the future for determining resend timeouts and for handling the slow-start part of the Rx protocol. Also add a pair of tracepoints, one to log transmissions to elicit a response for RTT purposes and one to log responses that contribute RTT data. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-22 01:26:25 +01:00
David Howells	f07373ead4	rxrpc: Add re-sent Tx annotation Add a Tx-phase annotation for packet buffers to indicate that a buffer has already been retransmitted. This will be used by future congestion management. Re-retransmissions of a packet don't affect the congestion window managment in the same way as initial retransmissions. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-22 01:23:50 +01:00
David Howells	5a924b8951	rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs Don't store the rxrpc protocol header in sk_buffs on the transmit queue, but rather generate it on the fly and pass it to kernel_sendmsg() as a separate iov. This reduces the amount of storage required. Note that the security header is still stored in the sk_buff as it may get encrypted along with the data (and doesn't change with each transmission). Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-22 01:23:50 +01:00
Jakub Kicinski	9798e6fe4f	net: act_mirred: allow statistic updates from offloaded actions Implement .stats_update() callback. The implementation is generic and can be reused by other simple actions if needed. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 19:50:03 -04:00
Jakub Kicinski	68d640630d	net: cls_bpf: allow offloaded filters to update stats Call into offloaded filters to update stats. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 19:50:03 -04:00
Jakub Kicinski	eadb41489f	net: cls_bpf: add support for marking filters as hardware-only Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_SW flag. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 19:50:02 -04:00
Jakub Kicinski	0d01d45f1b	net: cls_bpf: limit hardware offload by software-only flag Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag. Unlike U32 and flower cls_bpf already has some netlink flags defined. Create a new attribute to be able to use the same flag values as the above. Unlike U32 and flower reject unknown flags. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 19:50:02 -04:00
Jakub Kicinski	332ae8e2f6	net: cls_bpf: add hardware offload This patch adds hardware offload capability to cls_bpf classifier, similar to what have been done with U32 and flower. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 19:50:02 -04:00
Florian Westphal	c2f672fc94	xfrm: state lookup can be lockless This is called from the packet input path, we get lock contention if many cpus handle ipsec in parallel. After recent rcu conversion it is safe to call __xfrm_state_lookup without the spinlock. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2016-09-21 12:37:29 +02:00
Nicolas Dichtel	63c43787d3	vti6: fix input path Since commit `1625f45299`, vti6 is broken, all input packets are dropped (LINUX_MIB_XFRMINNOSTATES is incremented). XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6 is set by vti6_rcv() before calling xfrm6_rcv()/xfrm6_rcv_spi(), thus we cannot set to NULL that value in xfrm6_rcv_spi(). A new function xfrm6_rcv_tnl() that enables to pass a value to xfrm6_rcv_spi() is added, so that xfrm6_rcv() is not touched (this function is used in several handlers). CC: Alexey Kodanev <alexey.kodanev@oracle.com> Fixes: `1625f45299` ("net/xfrm_input: fix possible NULL deref of tunnel.ip6->parms.i_key") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2016-09-21 10:09:14 +02:00
Nikolay Aleksandrov	b5036cd4ed	ipmr, ip6mr: return lastuse relative to now When I introduced the lastuse member I made a subtle error because it was returned as an absolute value but that is meaningless to user-space as it doesn't allow to see how old exactly an entry is. Let's make it similar to how the bridge returns such values and make it relative to "now" (jiffies). This allows us to show the actual age of the entries and is much more useful (e.g. user-space daemons can age out entries, iproute2 can display the lastuse properly). Fixes: `43b9e12740` ("net: ipmr/ip6mr: add support for keeping an entry age") Reported-by: Satish Ashok <sashok@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:58:23 -04:00
Neal Cardwell	0f8782ea14	tcp_bbr: add BBR congestion control This commit implements a new TCP congestion control algorithm: BBR (Bottleneck Bandwidth and RTT). A detailed description of BBR will be published in ACM Queue, Vol. 14 No. 5, September-October 2016, as "BBR: Congestion-Based Congestion Control". BBR has significantly increased throughput and reduced latency for connections on Google's internal backbone networks and google.com and YouTube Web servers. BBR requires only changes on the sender side, not in the network or the receiver side. Thus it can be incrementally deployed on today's Internet, or in datacenters. The Internet has predominantly used loss-based congestion control (largely Reno or CUBIC) since the 1980s, relying on packet loss as the signal to slow down. While this worked well for many years, loss-based congestion control is unfortunately out-dated in today's networks. On today's Internet, loss-based congestion control causes the infamous bufferbloat problem, often causing seconds of needless queuing delay, since it fills the bloated buffers in many last-mile links. On today's high-speed long-haul links using commodity switches with shallow buffers, loss-based congestion control has abysmal throughput because it over-reacts to losses caused by transient traffic bursts. In 1981 Kleinrock and Gale showed that the optimal operating point for a network maximizes delivered bandwidth while minimizing delay and loss, not only for single connections but for the network as a whole. Finding that optimal operating point has been elusive, since any single network measurement is ambiguous: network measurements are the result of both bandwidth and propagation delay, and those two cannot be measured simultaneously. While it is impossible to disambiguate any single bandwidth or RTT measurement, a connection's behavior over time tells a clearer story. BBR uses a measurement strategy designed to resolve this ambiguity. It combines these measurements with a robust servo loop using recent control systems advances to implement a distributed congestion control algorithm that reacts to actual congestion, not packet loss or transient queue delay, and is designed to converge with high probability to a point near the optimal operating point. In a nutshell, BBR creates an explicit model of the network pipe by sequentially probing the bottleneck bandwidth and RTT. On the arrival of each ACK, BBR derives the current delivery rate of the last round trip, and feeds it through a windowed max-filter to estimate the bottleneck bandwidth. Conversely it uses a windowed min-filter to estimate the round trip propagation delay. The max-filtered bandwidth and min-filtered RTT estimates form BBR's model of the network pipe. Using its model, BBR sets control parameters to govern sending behavior. The primary control is the pacing rate: BBR applies a gain multiplier to transmit faster or slower than the observed bottleneck bandwidth. The conventional congestion window (cwnd) is now the secondary control; the cwnd is set to a small multiple of the estimated BDP (bandwidth-delay product) in order to allow full utilization and bandwidth probing while bounding the potential amount of queue at the bottleneck. When a BBR connection starts, it enters STARTUP mode and applies a high gain to perform an exponential search to quickly probe the bottleneck bandwidth (doubling its sending rate each round trip, like slow start). However, instead of continuing until it fills up the buffer (i.e. a loss), or until delay or ACK spacing reaches some threshold (like Hystart), it uses its model of the pipe to estimate when that pipe is full: it estimates the pipe is full when it notices the estimated bandwidth has stopped growing. At that point it exits STARTUP and enters DRAIN mode, where it reduces its pacing rate to drain the queue it estimates it has created. Then BBR enters steady state. In steady state, PROBE_BW mode cycles between first pacing faster to probe for more bandwidth, then pacing slower to drain any queue that created if no more bandwidth was available, and then cruising at the estimated bandwidth to utilize the pipe without creating excess queue. Occasionally, on an as-needed basis, it sends significantly slower to probe for RTT (PROBE_RTT mode). BBR has been fully deployed on Google's wide-area backbone networks and we're experimenting with BBR on Google.com and YouTube on a global scale. Replacing CUBIC with BBR has resulted in significant improvements in network latency and application (RPC, browser, and video) metrics. For more details please refer to our upcoming ACM Queue publication. Example performance results, to illustrate the difference between BBR and CUBIC: Resilience to random loss (e.g. from shallow buffers): Consider a netperf TCP_STREAM test lasting 30 secs on an emulated path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher). Low latency with the bloated buffers common in today's last-mile links: Consider a netperf TCP_STREAM test lasting 120 secs on an emulated path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck buffer. Both fully utilize the bottleneck bandwidth, but BBR achieves this with a median RTT 25x lower (43 ms instead of 1.09 secs). Our long-term goal is to improve the congestion control algorithms used on the Internet. We are hopeful that BBR can help advance the efforts toward this goal, and motivate the community to do further research. Test results, performance evaluations, feedback, and BBR-related discussions are very welcome in the public e-mail list for BBR: https://groups.google.com/forum/#!forum/bbr-dev NOTE: BBR must be used with the fq qdisc ("man tc-fq") with pacing enabled, since pacing is integral to the BBR design and implementation. BBR without pacing would not function properly, and may incur unnecessary high packet loss rates. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:01 -04:00
Yuchung Cheng	c0402760f5	tcp: new CC hook to set sending rate with rate_sample in any CA state This commit introduces an optional new "omnipotent" hook, cong_control(), for congestion control modules. The cong_control() function is called at the end of processing an ACK (i.e., after updating sequence numbers, the SACK scoreboard, and loss detection). At that moment we have precise delivery rate information the congestion control module can use to control the sending behavior (using cwnd, TSO skb size, and pacing rate) in any CA state. This function can also be used by a congestion control that prefers not to use the default cwnd reduction approach (i.e., the PRR algorithm) during CA_Recovery to control the cwnd and sending rate during loss recovery. We take advantage of the fact that recent changes defer the retransmission or transmission of new data (e.g. by F-RTO) in recovery until the new tcp_cong_control() function is run. With this commit, we only run tcp_update_pacing_rate() if the congestion control is not using this new API. New congestion controls which use the new API do not want the TCP stack to run the default pacing rate calculation and overwrite whatever pacing rate they have chosen at initialization time. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:01 -04:00
Yuchung Cheng	77bfc174c3	tcp: allow congestion control to expand send buffer differently Currently the TCP send buffer expands to twice cwnd, in order to allow limited transmits in the CA_Recovery state. This assumes that cwnd does not increase in the CA_Recovery. For some congestion control algorithms, like the upcoming BBR module, if the losses in recovery do not indicate congestion then we may continue to raise cwnd multiplicatively in recovery. In such cases the current multiplier will falsely limit the sending rate, much as if it were limited by the application. This commit adds an optional congestion control callback to use a different multiplier to expand the TCP send buffer. For congestion control modules that do not specificy this callback, TCP continues to use the previous default of 2. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:01 -04:00
Neal Cardwell	556c6b46d1	tcp: export tcp_mss_to_mtu() for congestion control modules Export tcp_mss_to_mtu(), so that congestion control modules can use this to help calculate a pacing rate. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:01 -04:00
Neal Cardwell	1b3878ca15	tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments To allow congestion control modules to use the default TSO auto-sizing algorithm as one of the ingredients in their own decision about TSO sizing: 1) Export tcp_tso_autosize() so that CC modules can use it. 2) Change tcp_tso_autosize() to allow callers to specify a minimum number of segments per TSO skb, in case the congestion control module has a different notion of the best floor for TSO skbs for the connection right now. For very low-rate paths or policed connections it can be appropriate to use smaller TSO skbs. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Neal Cardwell	ed6e7268b9	tcp: allow congestion control module to request TSO skb segment count Add the tso_segs_goal() function in tcp_congestion_ops to allow the congestion control module to specify the number of segments that should be in a TSO skb sent by tcp_write_xmit() and tcp_xmit_retransmit_queue(). The congestion control module can either request a particular number of segments in TSO skb that we transmit, or return 0 if it doesn't care. This allows the upcoming BBR congestion control module to select small TSO skb sizes if the module detects that the bottleneck bandwidth is very low, or that the connection is policed to a low rate. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Yuchung Cheng	eb8329e0a0	tcp: export data delivery rate This commit export two new fields in struct tcp_info: tcpi_delivery_rate: The most recent goodput, as measured by tcp_rate_gen(). If the socket is limited by the sending application (e.g., no data to send), it reports the highest measurement instead of the most recent. The unit is bytes per second (like other rate fields in tcp_info). tcpi_delivery_rate_app_limited: A boolean indicating if the goodput was measured when the socket's throughput was limited by the sending application. This delivery rate information can be useful for applications that want to know the current throughput the TCP connection is seeing, e.g. adaptive bitrate video streaming. It can also be very useful for debugging or troubleshooting. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Soheil Hassas Yeganeh	d7722e8570	tcp: track application-limited rate samples This commit adds code to track whether the delivery rate represented by each rate_sample was limited by the application. Upon each transmit, we store in the is_app_limited field in the skb a boolean bit indicating whether there is a known "bubble in the pipe": a point in the rate sample interval where the sender was application-limited, and did not transmit even though the cwnd and pacing rate allowed it. This logic marks the flow app-limited on a write if all of the following are true: 1) There is less than 1 MSS of unsent data in the write queue available to transmit. 2) There is no packet in the sender's queues (e.g. in fq or the NIC tx queue). 3) The connection is not limited by cwnd. 4) There are no lost packets to retransmit. The tcp_rate_check_app_limited() code in tcp_rate.c determines whether the connection is application-limited at the moment. If the flow is application-limited, it sets the tp->app_limited field. If the flow is application-limited then that means there is effectively a "bubble" of silence in the pipe now, and this silence will be reflected in a lower bandwidth sample for any rate samples from now until we get an ACK indicating this bubble has exited the pipe: specifically, until we get an ACK for the next packet we transmit. When we send every skb we record in scb->tx.is_app_limited whether the resulting rate sample will be application-limited. The code in tcp_rate_gen() checks to see when it is safe to mark all known application-limited bubbles of silence as having exited the pipe. It does this by checking to see when the delivered count moves past the tp->app_limited marker. At this point it zeroes the tp->app_limited marker, as all known bubbles are out of the pipe. We make room for the tx.is_app_limited bit in the skb by borrowing a bit from the in_flight field used by NV to record the number of bytes in flight. The receive window in the TCP header is 16 bits, and the max receive window scaling shift factor is 14 (RFC 1323). So the max receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we only need 30 bits for the tx.in_flight used by NV. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Yuchung Cheng	b9f64820fb	tcp: track data delivery rate for a TCP connection This patch generates data delivery rate (throughput) samples on a per-ACK basis. These rate samples can be used by congestion control modules, and specifically will be used by TCP BBR in later patches in this series. Key state: tp->delivered: Tracks the total number of data packets (original or not) delivered so far. This is an already-existing field. tp->delivered_mstamp: the last time tp->delivered was updated. Algorithm: A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis: d1: the current tp->delivered after processing the ACK t1: the current time after processing the ACK d0: the prior tp->delivered when the acked skb was transmitted t0: the prior tp->delivered_mstamp when the acked skb was transmitted When an skb is transmitted, we snapshot d0 and t0 in its control block in tcp_rate_skb_sent(). When an ACK arrives, it may SACK and ACK some skbs. For each SACKed or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct to reflect the latest (d0, t0). Finally, tcp_rate_gen() generates a rate sample by storing (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us. One caveat: if an skb was sent with no packets in flight, then tp->delivered_mstamp may be either invalid (if the connection is starting) or outdated (if the connection was idle). In that case, we'll re-stamp tp->delivered_mstamp. At first glance it seems t0 should always be the time when an skb was transmitted, but actually this could over-estimate the rate due to phase mismatch between transmit and ACK events. To track the delivery rate, we ensure that if packets are in flight then t0 and and t1 are times at which packets were marked delivered. If the initial and final RTTs are different then one may be corrupted by some sort of noise. The noise we see most often is sending gaps caused by delayed, compressed, or stretched acks. This either affects both RTTs equally or artificially reduces the final RTT. We approach this by recording the info we need to compute the initial RTT (duration of the "send phase" of the window) when we recorded the associated inflight. Then, for a filter to avoid bandwidth overestimates, we generalize the per-sample bandwidth computation from: bw = delivered / ack_phase_rtt to the following: bw = delivered / max(send_phase_rtt, ack_phase_rtt) In large-scale experiments, this filtering approach incorporating send_phase_rtt is effective at avoiding bandwidth overestimates due to ACK compression or stretched ACKs. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Neal Cardwell	0682e6902a	tcp: count packets marked lost for a TCP connection Count the number of packets that a TCP connection marks lost. Congestion control modules can use this loss rate information for more intelligent decisions about how fast to send. Specifically, this is used in TCP BBR policer detection. BBR uses a high packet loss rate as one signal in its policer detection and policer bandwidth estimation algorithm. The BBR policer detection algorithm cannot simply track retransmits, because a retransmit can be (and often is) an indicator of packets lost long, long ago. This is particularly true in a long CA_Loss period that repairs the initial massive losses when a policer kicks in. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Eric Dumazet	b2d3ea4a73	tcp: switch back to proper tcp_skb_cb size check in tcp_init() Revert to the tcp_skb_cb size check that tcp_init() had before commit `b4772ef879` ("net: use common macro for assering skb->cb[] available size in protocol families"). As related commit `744d5a3e9f` ("net: move skb->dropcount to skb->cb[]") explains, the sock_skb_cb_check_size() mechanism was added to ensure that there is space for dropcount, "for protocol families using it". But TCP is not a protocol using dropcount, so tcp_init() doesn't need to provision space for dropcount in the skb->cb[], and thus we can revert to the older form of the tcp_skb_cb size check. Doing so allows TCP to use 4 more bytes of the skb->cb[] space. Fixes: `b4772ef879` ("net: use common macro for assering skb->cb[] available size in protocol families") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Eric Dumazet	77879147a3	net_sched: sch_fq: add low_rate_threshold parameter This commit adds to the fq module a low_rate_threshold parameter to insert a delay after all packets if the socket requests a pacing rate below the threshold. This helps achieve more precise control of the sending rate with low-rate paths, especially policers. The basic issue is that if a congestion control module detects a policer at a certain rate, it may want fq to be able to shape to that policed rate. That way the sender can avoid policer drops by having the packets arrive at the policer at or just under the policed rate. The default threshold of 550Kbps was chosen analytically so that for policers or links at 500Kbps or 512Kbps fq would very likely invoke this mechanism, even if the pacing rate was briefly slightly above the available bandwidth. This value was then empirically validated with two years of production testing on YouTube video servers. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:23:00 -04:00
Neal Cardwell	6403389211	tcp: use windowed min filter library for TCP min_rtt estimation Refactor the TCP min_rtt code to reuse the new win_minmax library in lib/win_minmax.c to simplify the TCP code. This is a pure refactor: the functionality is exactly the same. We just moved the windowed min code to make TCP easier to read and maintain, and to allow other parts of the kernel to use the windowed min/max filter code. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:22:59 -04:00
Soheil Hassas Yeganeh	f78e73e27f	tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict The upcoming change "lib/win_minmax: windowed min or max estimator" introduces a struct called minmax, which is then included in include/linux/tcp.h in the upcoming change "tcp: use windowed min filter library for TCP min_rtt estimation". This would create a compilation error for tcp_cdg.c, which defines its own minmax struct. To avoid this naming conflict (and potentially others in the future), this commit renames the version used in tcp_cdg.c to cdg_minmax. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-21 00:22:59 -04:00
Jamal Hadi Salim	aecc5cefc3	net sched actions: fix GETing actions With the batch changes that translated transient actions into a temporary list lost in the translation was the fact that tcf_action_destroy() will eventually delete the action from the permanent location if the refcount is zero. Example of what broke: ...add a gact action to drop sudo $TC actions add action drop index 10 ...now retrieve it, looks good sudo $TC actions get action gact index 10 ...retrieve it again and find it is gone! sudo $TC actions get action gact index 10 Fixes: `22dc13c837` ("net_sched: convert tcf_exts from list to pointer array"), Fixes: `824a7e8863` ("net_sched: remove an unnecessary list_del()") Fixes: `f07fed82ad` ("net_sched: remove the leftover cleanup_a()") Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 23:34:55 -04:00
Daniel Borkmann	36bbef52c7	bpf: direct packet write and access for helpers for clsact progs This work implements direct packet access for helpers and direct packet write in a similar fashion as already available for XDP types via commits `4acf6c0b84` ("bpf: enable direct packet data write for xdp progs") and `6841de8b0d` ("bpf: allow helpers access the packet directly"), and as a complementary feature to the already available direct packet read for tc (cls/act) programs. For enabling this, we need to introduce two helpers, bpf_skb_pull_data() and bpf_csum_update(). The first is generally needed for both, read and write, because they would otherwise only be limited to the current linear skb head. Usually, when the data_end test fails, programs just bail out, or, in the direct read case, use bpf_skb_load_bytes() as an alternative to overcome this limitation. If such data sits in non-linear parts, we can just pull them in once with the new helper, retest and eventually access them. At the same time, this also makes sure the skb is uncloned, which is, of course, a necessary condition for direct write. As this needs to be an invariant for the write part only, the verifier detects writes and adds a prologue that is calling bpf_skb_pull_data() to effectively unclone the skb from the very beginning in case it is indeed cloned. The heuristic makes use of a similar trick that was done in `233577a220` ("net: filter: constify detection of pkt_type_offset"). This comes at zero cost for other programs that do not use the direct write feature. Should a program use this feature only sparsely and has read access for the most parts with, for example, drop return codes, then such write action can be delegated to a tail called program for mitigating this cost of potential uncloning to a late point in time where it would have been paid similarly with the bpf_skb_store_bytes() as well. Advantage of direct write is that the writes are inlined whereas the helper cannot make any length assumptions and thus needs to generate a call to memcpy() also for small sizes, as well as cost of helper call itself with sanity checks are avoided. Plus, when direct read is already used, we don't need to cache or perform rechecks on the data boundaries (due to verifier invalidating previous checks for helpers that change skb->data), so more complex programs using rewrites can benefit from switching to direct read plus write. For direct packet access to helpers, we save the otherwise needed copy into a temp struct sitting on stack memory when use-case allows. Both facilities are enabled via may_access_direct_pkt_data() in verifier. For now, we limit this to map helpers and csum_diff, and can successively enable other helpers where we find it makes sense. Helpers that definitely cannot be allowed for this are those part of bpf_helper_changes_skb_data() since they can change underlying data, and those that write into memory as this could happen for packet typed args when still cloned. bpf_csum_update() helper accommodates for the fact that we need to fixup checksum_complete when using direct write instead of bpf_skb_store_bytes(), meaning the programs can use available helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(), csum_block_add(), csum_block_sub() equivalents in eBPF together with the new helper. A usage example will be provided for iproute2's examples/bpf/ directory. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 23:32:11 -04:00
pravin shelar	2279994d07	openvswitch: avoid resetting flow key while installing new flow. since commit commit `db74a3335e` ("openvswitch: use percpu flow stats") flow alloc resets flow-key. So there is no need to reset the flow-key again if OVS is using newly allocated flow-key. Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 22:54:35 -04:00
pravin shelar	190aa3e778	openvswitch: Fix Frame-size larger than 1024 bytes warning. There is no need to declare separate key on stack, we can just use sw_flow->key to store the key directly. This commit fixes following warning: net/openvswitch/datapath.c: In function ‘ovs_flow_cmd_new’: net/openvswitch/datapath.c:1080:1: warning: the frame size of 1040 bytes is larger than 1024 bytes [-Wframe-larger-than=] Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 22:54:35 -04:00
David S. Miller	204dfe1798	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2016-09-19 Here's the main bluetooth-next pull request for the 4.9 kernel. - Added new messages for monitor sockets for better mgmt tracing - Added local name and appearance support in scan response - Added new Qualcomm WCNSS SMD based HCI driver - Minor fixes & cleanup to 802.15.4 code - New USB ID to btusb driver - Added Marvell support to HCI UART driver - Add combined LED trigger for controller power - Other minor fixes here and there Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 22:52:50 -04:00
John Crispin	092183df0f	net-next: dsa: make the set_addr() operation optional Only 1 of the 3 drivers currently has a set_addr() operation. Make the set_addr() callback optional to reduce the amount of empty stubs inside the drivers. Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 04:47:44 -04:00
John Crispin	06f8ec9041	net-next: dsa: fix duplicate invocation of set_addr() commit `83c0afaec7` ("net: dsa: Add new binding implementation") has a duplicate invocation of the set_addr() operation callback. Remove one of them. Signed-off-by: John Crispin <john@phrozen.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 04:47:44 -04:00
Herbert Xu	83e7e4ce9e	mac80211: Use rhltable instead of rhashtable mac80211 currently uses rhashtable with insecure_elasticity set to true. The latter is because of duplicate objects. What's more, mac80211 walks the rhashtable chains by hand which is broken as rhashtable may contain multiple tables due to resizing or rehashing. This patch fixes it by converting it to the newly added rhltable interface which is designed for use with duplicate objects. With rhltable a lookup returns a list of objects instead of a single one. This is then fed into the existing for_each_sta_info macro. This patch also deletes the sta_addr_hash function since rhashtable defaults to jhash. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 04:43:36 -04:00
Vincent Bernat	a435a07f91	net: ipv6: fallback to full lookup if table lookup is unsuitable Commit `8c14586fc3` ("net: ipv6: Use passed in table for nexthop lookups") introduced a regression: insertion of an IPv6 route in a table not containing the appropriate connected route for the gateway but which contained a non-connected route (like a default gateway) fails while it was previously working: $ ip link add eth0 type dummy $ ip link set up dev eth0 $ ip addr add 2001:db8::1/64 dev eth0 $ ip route add ::/0 via 2001:db8::5 dev eth0 table 20 $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20 RTNETLINK answers: No route to host $ ip -6 route show table 20 default via 2001:db8::5 dev eth0 metric 1024 pref medium After this patch, we get: $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20 $ ip -6 route show table 20 2001:db8:cafe::1 via 2001:db8::6 dev eth0 metric 1024 pref medium default via 2001:db8::5 dev eth0 metric 1024 pref medium Fixes: `8c14586fc3` ("net: ipv6: Use passed in table for nexthop lookups") Signed-off-by: Vincent Bernat <vincent@bernat.im> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-20 03:28:29 -04:00
Jamal Hadi Salim	5a7a5555a3	net sched: stylistic cleanups Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 22:04:14 -04:00
Roman Mashak	f71b109f17	net sched actions police: peg drop stats for conforming traffic setting conforming action to drop is a valid policy. When it is set we need to at least see the stats indicating it for debugging. Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 22:04:14 -04:00
Jamal Hadi Salim	408fbc22ef	net sched ife action: Introduce skb tcindex metadata encap decap Sample use case of how this is encoded: user space via tuntap (or a connected VM/Machine/container) encodes the tcindex TLV. Sample use case of decoding: IFE action decodes it and the skb->tc_index is then used to classify. So something like this for encoded ICMP packets: .. first decode then reclassify... skb->tcindex will be set sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \ u32 match u32 0 0 flowid 1:1 \ action ife decode reclassify ...next match the decode icmp packet... sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \ u32 match ip protocol 1 0xff flowid 1:1 \ action continue ... last classify it using the tcindex classifier and do someaction.. sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \ handle 0x11 tcindex classid 1:1 \ action blah.. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 21:55:28 -04:00
Jamal Hadi Salim	6a5d58b67e	net sched ife action: add 16 bit helpers encoder and checker for 16 bits metadata Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 21:55:28 -04:00
Steffen Klassert	07b26c9454	gso: Support partial splitting at the frag_list pointer Since commit `8a29111c7` ("net: gro: allow to build full sized skb") gro may build buffers with a frag_list. This can hurt forwarding because most NICs can't offload such packets, they need to be segmented in software. This patch splits buffers with a frag_list at the frag_list pointer into buffers that can be TSO offloaded. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 20:59:34 -04:00
Johannes Weiner	d979a39d72	cgroup: duplicate cgroup reference when cloning sockets When a socket is cloned, the associated sock_cgroup_data is duplicated but not its reference on the cgroup. As a result, the cgroup reference count will underflow when both sockets are destroyed later on. Fixes: `bd1060a1d6` ("sock, cgroup: add sock->sk_cgroup") Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Michał Narajowski	af4168c5a9	Bluetooth: Set appearance only for LE capable controllers Setting appearance on controllers without LE support will result in No Supported error. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 21:48:22 +03:00
Michał Narajowski	e74317f43f	Bluetooth: Fix missing ext info event when setting appearance This patch adds missing event when setting appearance, just like in the set local name command. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:33:27 +02:00
Michał Narajowski	5e9fae48f8	Bluetooth: Add supported data types to ext info changed event This patch adds EIR data to extended info changed event. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:33:27 +02:00
Szymon Janc	6a9e90bff9	Bluetooth: Add appearance to Read Ext Controller Info command If LE is enabled appearance is added to EIR data. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:33:27 +02:00
Michał Narajowski	cde7a863d3	Bluetooth: Factor appending EIR to separate helper This will also be used for Extended Information Event handling. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:33:27 +02:00
Szymon Janc	7d5c11da1f	Bluetooth: Refactor read_ext_controller_info handler There is no need to allocate heap for reply only to copy stack data to it. This also fix rp memory leak and missing hdev unlock if kmalloc failed. Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:33:27 +02:00
Szymon Janc	3310230c5d	Bluetooth: Increment management interface revision Increment the mgmt revision due to the recently added Read Extended Controller Information and Set Appearance commands. Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Szymon Janc	9c9db78dc0	Bluetooth: Fix advertising instance validity check for flags Flags are not allowed in Scan Response. Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Szymon Janc	2bb36870e8	Bluetooth: Unify advertising instance flags check This unifies max length and TLV validity checks. Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Szymon Janc	5e2c59e84b	Bluetooth: Remove unused parameter from tlv_data_is_valid function hdev parameter is not used in function. Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Michał Narajowski	c4960ecf2b	Bluetooth: Add support for appearance in scan rsp This patch enables prepending appearance value to scan response data. It also adds support for setting appearance value through mgmt command. If currently advertised instance has apperance flag set it is expired immediately. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Michał Narajowski	7c295c4801	Bluetooth: Add support for local name in scan rsp This patch enables appending local name to scan response data. If currently advertised instance has name flag set it is expired immediately. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Szymon Janc	83ebb9ec73	Bluetooth: Fix not registering BR/EDR SMP channel with force_bredr flag If force_bredr is set SMP BR/EDR channel should also be for non-SC capable controllers. Since hcidev flag is persistent wrt power toggle it can be already set when calling smp_register(). This resulted in SMP BR/EDR channel not being registered even if HCI_FORCE_BREDR_SMP flag was set. This also fix NULL pointer dereference when trying to disable force_bredr after power cycle. BUG: unable to handle kernel NULL pointer dereference at 0000000000000388 IP: [<ffffffffc0493ad8>] smp_del_chan+0x18/0x80 [bluetooth] Call Trace: [<ffffffffc04950ca>] force_bredr_smp_write+0xba/0x100 [bluetooth] [<ffffffff8133be14>] full_proxy_write+0x54/0x90 [<ffffffff81245967>] __vfs_write+0x37/0x160 [<ffffffff813617f7>] ? selinux_file_permission+0xd7/0x110 [<ffffffff81356fbd>] ? security_file_permission+0x3d/0xc0 [<ffffffff810eb5b2>] ? percpu_down_read+0x12/0x50 [<ffffffff812462a5>] vfs_write+0xb5/0x1a0 [<ffffffff812476f5>] SyS_write+0x55/0xc0 [<ffffffff817eb872>] entry_SYSCALL_64_fastpath+0x1a/0xa4 Code: 48 8b 45 f0 eb c1 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f6 05 c6 3b 02 00 04 55 48 89 e5 41 54 53 49 89 fc 75 4b <49> 8b 9c 24 88 03 00 00 48 85 db 74 31 49 c7 84 24 88 03 00 00 RIP [<ffffffffc0493ad8>] smp_del_chan+0x18/0x80 [bluetooth] RSP <ffff8802aee3bd90> CR2: 0000000000000388 Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Wei Yongjun	3e36ca483a	Bluetooth: Use kzalloc instead of kmalloc/memset Use kzalloc rather than kmalloc followed by memset with 0. Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Frédéric Dalleau	3c0975a7a1	Bluetooth: Fix reason code used for rejecting SCO connections A comment in the code states that SCO connection should be rejected with the proper error value between 0xd-0xf. The code uses HCI_ERROR_REMOTE_LOW_RESOURCES which is 0x14. This led to following error: < HCI Command: Reject Synchronous Co.. (0x01\|0x002a) plen 7 Address: 34:51:C9:EF:02:CA (Apple, Inc.) Reason: Remote Device Terminated due to Low Resources (0x14) > HCI Event: Command Status (0x0f) plen 4 Reject Synchronous Connection Request (0x01\|0x002a) ncmd 1 Status: Invalid HCI Command Parameters (0x12) Instead make use of HCI_ERROR_REJ_LIMITED_RESOURCES which is 0xd. Signed-off-by: Frédéric Dalleau <frederic.dalleau@collabora.co.uk> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	baab793225	Bluetooth: Fix wrong New Settings event when closing HCI User Channel When closing HCI User Channel, the New Settings event was send out to inform about changed settings. However such event is wrong since the exclusive HCI User Channel access is active until the Index Added event has been sent. @ USER Close: test @ MGMT Event: New Settings (0x0006) plen 4 Current settings: 0x00000ad0 Bondable Secure Simple Pairing BR/EDR Low Energy Secure Connections = Close Index: 00:14:EF:22:04:12 @ MGMT Event: Index Added (0x0004) plen 0 Calling __mgmt_power_off from hci_dev_do_close requires an extra check for an active HCI User Channel. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	aa1638dde7	Bluetooth: Send control open and close messages for HCI user channels When opening and closing HCI user channel, send monitoring messages to be able to trace its behavior. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Michał Narajowski	8a0c9f4909	Bluetooth: Append local name and CoD to Extended Controller Info This adds device class, complete local name and short local name to EIR data in Extended Controller Info as specified in docs. Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	321c6feed2	Bluetooth: Add framework for Extended Controller Information This command is used to retrieve the current state and basic information of a controller. It is typically used right after getting the response to the Read Controller Index List command or an Index Added event (or its extended counterparts). When any of the values in the EIR_Data field changes, the event Extended Controller Information Changed will be used to inform clients about the updated information. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	f4cdbb3f25	Bluetooth: Handle HCI raw socket transition from unbound to bound In case an unbound HCI raw socket is later on bound, ensure that the monitor notification messages indicate a close and re-open. None of the userspace tools use the socket this, but it is actually possible to use an ioctl on an unbound socket and then later bind it. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	f81f5b2db8	Bluetooth: Send control open and close messages for HCI raw sockets When opening and closing HCI raw sockets their main usage is for legacy userspace. To track interaction with the modern mgmt interface, send open and close monitoring messages for these action. The HCI raw sockets is special since it supports unbound ioctl operation and for that special case delay the notification message until at least one ioctl has been executed. The difference between a bound and unbound socket will be detailed by the fact the HCI index is present or not. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	d0bef1d26f	Bluetooth: Add extra channel checks for control open/close messages The control open and close monitoring events require special channel checks to ensure messages are only send when the right events happen. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	5a6d2cf5f1	Bluetooth: Assign the channel early when binding HCI sockets Assignment of the hci_pi(sk)->channel should be done early when binding the HCI socket. This avoids confusion with the RAW channel that is used for legacy access. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	0ef2c42f8c	Bluetooth: Send control open and close only when cookie is present Only when the cookie has been assigned, then send the open and close monitor messages. Also if the socket is bound to a device, then include the index into the message. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	9e8305b39b	Bluetooth: Use numbers for subsystem version string Instead of keeping a version string around, use version and revision numbers and then stringify them for use as module parameter. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	df1cb87af9	Bluetooth: Introduce helper functions for socket cookie handling Instead of manually allocating cookie information each time, use helper functions for generating and releasing cookies. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	9db5c62951	Bluetooth: Use command status event for Set IO Capability errors In case of failure, the Set IO Capability command is suppose to return command status and not command complete. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	56f787c502	Bluetooth: Fix wrong Get Clock Information return parameters The address information of the Get Clock Information return parameters is copying from a different memory location. It uses &cmd->param while it actually needs to be cmd->param. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	5504c3a310	Bluetooth: Use individual flags for certain management events Instead of hiding everything behind a general managment events flag, introduce indivdual flags that allow fine control over which events are send to a given management channel. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Johan Hedberg	37d3a1fab5	Bluetooth: mgmt: Fix sending redundant event for Advertising Instance When an Advertising Instance is removed, the Advertising Removed event shouldn't be sent to the same socket that issued the Remove Advertising command (it gets a command complete event instead). The mgmt_advertising_removed() function already has a parameter for skipping a specific socket, but there was no code to propagate the right value to this parameter. This patch fixes the issue by making sure the intermediate hci_req_clear_adv_instance() function gets the socket pointer. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	38ceaa00d0	Bluetooth: Add support for sending MGMT commands and events to monitor This adds support for tracing all management commands and events via the monitor interface. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	249fa1699f	Bluetooth: Add support for sending MGMT open and close to monitor This sends new notifications to the monitor support whenever a management channel has been opened or closed. This allows tracing of control channels really easily. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	03c979c471	Bluetooth: Introduce helper to pack mgmt version information The mgmt version information will be also needed for the control changell tracing feature. This provides a helper to pack them. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	70ecce91e3	Bluetooth: Store control socket cookie and comm information To further allow unique identification and tracking of control socket, store cookie and comm information when binding the socket. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	47b0f573f2	Bluetooth: Check SOL_HCI for raw socket options The SOL_HCI level should be enforced when using socket options on the HCI raw socket interface. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Aristeu Rozanski	bd89bb6daa	mac802154: use rate limited warnings for malformed frames Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Aristeu Rozanski	ca1de81aa2	mac802154: don't warn on unsupported frames Just because we don't support certain types of frames yet doesn't mean we have to flood the message log with warnings about "invalid" frames. Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Alexander Aring	5ddedce3b7	6lowpan: ndisc: no overreact if no short address is available This patch removes handling to remove short address for a neigbour entry if RS/RA/NS/NA doesn't contain a short address. If these messages doesn't has any short address option, the existing short address from ndisc cache will be used. The current behaviour will set that the neigbour doesn't has a short address anymore. Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Alexander Aring	abbcc341ad	mac802154: set phy net namespace for new ifaces This patch sets the net namespace when creating SoftMAC interfaces. This is important if the namespace at phy layer was switched before. Currently we losing interfaces in some namespace and it's not possible to recover that. Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	e64c97b53b	Bluetooth: Add combined LED trigger for controller power Instead of just having a LED trigger for power on a specific controller, this adds the LED trigger "bluetooth-power" that combines the power states of all controllers into a single trigger. This simplifies the trigger selection and also supports multiple controllers per host system via a single LED. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
David Vrabel	d48f9ce73c	sunrpc: fix write space race causing stalls Write space becoming available may race with putting the task to sleep in xprt_wait_for_buffer_space(). The existing mechanism to avoid the race does not work. This (edited) partial trace illustrates the problem: [1] rpc_task_run_action: task:43546@5 ... action=call_transmit [2] xs_write_space <-xs_tcp_write_space [3] xprt_write_space <-xs_write_space [4] rpc_task_sleep: task:43546@5 ... [5] xs_write_space <-xs_tcp_write_space [1] Task 43546 runs but is out of write space. [2] Space becomes available, xs_write_space() clears the SOCKWQ_ASYNC_NOSPACE bit. [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but this has not yet been queued and the wake up is lost. [4] xs_nospace() is called which calls xprt_wait_for_buffer_space() which queues task 43546. [5] The call to sk->sk_write_space() at the end of xs_nospace() (which is supposed to handle the above race) does not call xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and thus the task is not woken. Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace() so the second call to sk->sk_write_space() calls xprt_write_space(). Suggested-by: Trond Myklebust <trondmy@primarydata.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> cc: stable@vger.kernel.org # 4.4 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:21:36 -04:00
Chuck Lever	496b77a5c5	xprtrdma: Eliminate rpcrdma_receive_worker() Clean up: the extra layer of indirection doesn't add value. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	1519e9697d	xprtrdma: Rename rpcrdma_receive_wc() Clean up: When converting xprtrdma to use the new CQ API, I missed a spot. The naming convention elsewhere is: {svc_rdma,rpcrdma}_wc_{operation} Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	eeb30613e1	xprtrmda: Report address of frmr, not mw Tie frwr debugging messages together by always reporting the address of the frwr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	44829d02d2	xprtrdma: Support larger inline thresholds The Version One default inline threshold is still 1KB. But allow testing with thresholds up to 64KB. This maximum is somewhat arbitrary. There's no fundamental architectural limit I'm aware of, but it's good to keep the size of Receive buffers reasonable. Now that Send can use a s/g list, a Send buffer is only as large as each RPC requires. Receive buffers are always the size of the inline threshold, however. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	655fec6987	xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	c8b920bb49	xprtrdma: Basic support for Remote Invalidation Have frwr's ro_unmap_sync recognize an invalidated rkey that appears as part of a Receive completion. Local invalidation can be skipped for that rkey. Use an out-of-band signaling mechanism to indicate to the server that the client is prepared to receive RDMA Send With Invalidate. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	87cfb9a0c8	xprtrdma: Client-side support for rpcrdma_connect_private Send an RDMA-CM private message on connect, and look for one during a connection-established event. Both sides can communicate their various implementation limits. Implementations that don't support this sideband protocol ignore it. Once the client knows the server's inline threshold maxima, it can adjust the use of Reply chunks, and eliminate most use of Position Zero Read chunks. Moderately-sized I/O can be done using a pure inline RDMA Send instead of RDMA operations that require memory registration. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	6ea8e71150	xprtrdma: Move recv_wr to struct rpcrdma_rep Clean up: The fields in the recv_wr do not vary. There is no need to initialize them before each ib_post_recv(). This removes a large-ish data structure from the stack. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	90aab60296	xprtrdma: Move send_wr to struct rpcrdma_req Clean up: Most of the fields in each send_wr do not vary. There is no need to initialize them before each ib_post_send(). This removes a large-ish data structure from the stack. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	b157380af1	xprtrdma: Simplify rpcrdma_ep_post_recv() Clean up. Since commit `fc66448549` ("xprtrdma: Split the completion queue"), rpcrdma_ep_post_recv() no longer uses the "ep" argument. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	13650c23f1	xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf Clean up. The "ia" argument is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	54cbd6b0c6	xprtrdma: Delay DMA mapping Send and Receive buffers Currently, each regbuf is allocated and DMA mapped at the same time. This is done during transport creation. When a device driver is unloaded, every DMA-mapped buffer in use by a transport has to be unmapped, and then remapped to the new device if the driver is loaded again. Remapping will have to be done _after_ the connect worker has set up the new device. But there's an ordering problem: call_allocate, which invokes xprt_rdma_allocate which calls rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_ the connect worker can run to set up the new device. Instead, at transport creation, allocate each buffer, but leave it unmapped. Once the RPC carries these buffers into ->send_request, by which time a transport connection should have been established, check to see that the RPC's buffers have been DMA mapped. If not, map them there. When device driver unplug support is added, it will simply unmap all the transport's regbufs, but it doesn't have to deallocate the underlying memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	99ef4db329	xprtrdma: Replace DMA_BIDIRECTIONAL The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt. Fortunately, xprtrdma now knows which direction I/O is going as soon as it allocates each regbuf. The RPC Call and Reply buffers are no longer the same regbuf. They can each be labeled correctly now. The RPC Reply buffer is never part of either a Send or Receive WR, but it can be part of Reply chunk, which is mapped and registered via ->ro_map . So it is not DMA mapped when it is allocated (DMA_NONE), to avoid a double- mapping. Since Receive buffers are no longer DMA_BIDIRECTIONAL and their contents are never modified by the host CPU, DMA-API-HOWTO.txt suggests that a DMA sync before posting each buffer should be unnecessary. (See my_card_interrupt_handler). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	08cf2efd54	xprtrdma: Use smaller buffers for RPC-over-RDMA headers Commit `949317464b` ("xprtrdma: Limit number of RDMA segments in RPC-over-RDMA headers") capped the number of chunks that may appear in RPC-over-RDMA headers. The maximum header size can be estimated and fixed to avoid allocating buffer space that is never used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	9c40c49f14	xprtrdma: Initialize separate RPC call and reply buffers RPC-over-RDMA needs to separate its RPC call and reply buffers. o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA Send operation using DMA_TO_DEVICE o If the client expects a large RPC reply, it DMA maps rq_rcv_buf as part of a Reply chunk using DMA_FROM_DEVICE The two mappings are for data movement in opposite directions. DMA-API.txt suggests that if these mappings share a DMA cacheline, bad things can happen. This could occur in the final bytes of rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers happen to share a DMA cacheline. On x86_64 the cacheline size is typically 8 bytes, and RPC call messages are usually much smaller than the send buffer, so this hasn't been a noticeable problem. But the DMA cacheline size can be larger on other platforms. Also, often rq_rcv_buf starts most of the way into a page, thus an additional RDMA segment is needed to map and register the end of that buffer. Try to avoid that scenario to reduce the cost of registering and invalidating Reply chunks. Instead of carrying a single regbuf that covers both rq_snd_buf and rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for rq_snd_buf and one regbuf for rq_rcv_buf. Some incidental changes worth noting: - To clear out some spaghetti, refactor xprt_rdma_allocate. - The value stored in rg_size is the same as the value stored in the iov.length field, so eliminate rg_size Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	5a6d1db455	SUNRPC: Add a transport-specific private field in rpc_rqst Currently there's a hidden and indirect mechanism for finding the rpcrdma_req that goes with an rpc_rqst. It depends on getting from the rq_buffer pointer in struct rpc_rqst to the struct rpcrdma_regbuf that controls that buffer, and then to the struct rpcrdma_req it goes with. This was done back in the day to avoid the need to add a per-rqst pointer or to alter the buf_free API when support for RPC-over-RDMA was introduced. I'm about to change the way regbuf's work to support larger inline thresholds. Now is a good time to replace this indirect mechanism with something that is more straightforward. I guess this should be considered a clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	68778945e4	SUNRPC: Separate buffer pointers for RPC Call and Reply messages For xprtrdma, the RPC Call and Reply buffers are involved in real I/O operations. To start with, the DMA direction of the I/O for a Call is opposite that of a Reply. In the current arrangement, the Reply buffer address is on a four-byte alignment just past the call buffer. Would be friendlier on some platforms if that was at a DMA cache alignment instead. Because the current arrangement allocates a single memory region which contains both buffers, the RPC Reply buffer often contains a page boundary in it when the Call buffer is large enough (which is frequent). It would be a little nicer for setting up DMA operations (and possible registration of the Reply buffer) if the two buffers were separated, well-aligned, and contained as few page boundaries as possible. Now, I could just pad out the single memory region used for the pair of buffers. But frequently that would mean a lot of unused space to ensure the Reply buffer did not have a page boundary. Add a separate pointer to rpc_rqst that points right to the RPC Reply buffer. This makes no difference to xprtsock, but it will help xprtrdma in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	3435c74aed	SUNRPC: Generalize the RPC buffer release API xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Instead of passing just the rq_buffer into the buf_free method, pass the task structure and let buf_free take care of freeing both XDR buffers at once. There's a micro-optimization here. In the common case, both xprt_release and the transport's buf_free method were checking if rq_buffer was NULL. Now the check is done only once per RPC. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	5fe6eaa1f9	SUNRPC: Generalize the RPC buffer allocation API xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Transports that want to allocate separate Call and Reply buffers will ignore the "size" argument anyway. Don't bother passing it. The buf_alloc method can't return two pointers. Instead, make the method's return value an error code, and set the rq_buffer pointer in the method itself. This gives call_allocate an opportunity to terminate an RPC instead of looping forever when a permanent problem occurs. If a request is just bogus, or the transport is in a state where it can't allocate resources for any request, there needs to be a way to kill the RPC right there and not loop. This immediately fixes a rare problem in the backchannel send path, which loops if the server happens to send a CB request whose call+reply size is larger than a page (which it shouldn't do yet). One more issue: looks like xprt_inject_disconnect was incorrectly placed in the failure path in call_allocate. It needs to be in the success path, as it is for other call-sites. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	b9c5bc03be	SUNRPC: Refactor rpc_xdr_buf_init() Clean up: there is some XDR initialization logic that is common to the forward channel and backchannel. Move it to an XDR header so it can be shared. rpc_rqst::rq_buffer points to a buffer containing big-endian data. Update its annotation as part of the clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	eb342e9a38	xprtrdma: Eliminate INLINE_THRESHOLD macros Clean up: r_xprt is already available everywhere these macros are invoked, so just dereference that directly. RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Andy Adamson	fda0ab4117	SUNRPC: rpc_clnt_add_xprt setup function for NFS layer Use a setup function to call into the NFS layer to test an rpc_xprt for session trunking so as to not leak the rpc_xprt_switch into the nfs layer. Search for the address in the rpc_xprt_switch first so as not to put an unnecessary EXCHANGE_ID on the wire. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	39e5d2df95	SUNRPC search xprt switch for sockaddr Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	dd69171769	SUNRPC rpc_clnt_xprt_switch_add_xprt Give the NFS layer access to the rpc_xprt_switch_add_xprt function Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	3b58a8a904	SUNRPC rpc_clnt_xprt_switch_put Give the NFS layer access to the xprt_switch_put function Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	7705f6abbb	SUNRPC remove rpc_task_release_client from rpc_task_set_client rpc_task_set_client is only called from rpc_run_task after rpc_new_task and rpc_task_release_client is not needed as the task is new. When called from rpc_new_task, rpc_task_set_client also removed the assigned rpc_xprt which is not desired. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Trond Myklebust	d002526886	SUNRPC: Initialise struct svc_serv backchannel fields during __svc_create() Clean up. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Amitoj Kaur Chawla	2813b626e3	sunrpc: Remove unnecessary variable The variable `err` is not used anywhere and just returns the predefined value `0` at the end of the function. Hence, remove the variable and return 0 explicitly. Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:35 -04:00
Ilan Tayari	b588479358	xfrm: Fix memory leak of aead algorithm name commit `1a6509d991` ("[IPSEC]: Add support for combined mode algorithms") introduced aead. The function attach_aead kmemdup()s the algorithm name during xfrm_state_construct(). However this memory is never freed. Implementation has since been slightly modified in commit `ee5c23176f` ("xfrm: Clone states properly on migration") without resolving this leak. This patch adds a kfree() call for the aead algorithm name. Fixes: `1a6509d991` ("[IPSEC]: Add support for combined mode algorithms") Signed-off-by: Ilan Tayari <ilant@mellanox.com> Acked-by: Rami Rosen <roszenrami@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2016-09-19 12:08:58 +02:00
David S. Miller	e867e87ae8	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV90ZzvSw1s6N8H32AQIBsg//StI2J/tyGGTvvVXoeWPAGeDZp8907Qbc O6671V3feft0HvenQ/uFfC9PdpRTYHTlQhWidk9/wTx5nvg/5b7HGHSD8ghPnCRj Jyu4XbpXM6vw650qSBpRAg5TLIAkSsrdjaANzFSYw70sX6eTLdKJF3YFyoDRD2Cc T2Y0XIUAcACl5/A/KIHRaKjTtmzhoc5Vih9lOWQRZ/dw2u0A6GdwN0qoFfHqXwvI t1S4MtSkXrO8D9uWqmFhTpRzSAzO57L7bccVJbr+OzdNAm/Bkicjrxyzo6YepDAg VRdYaNtKCR9vPRFb9vZFlZy3vyUHb/22mrgM6/V5GG5qd/NFSnouFZ/WGxLnv1Yt 5X+F0qJXXgxSmaGt0f6cV7d9OO24HU/YWGl7wUCHsMZ4bWDySBUWtk0X4USpCsiD +AUxYJTEUNKZcCsW8XmC+cfnkRLNPycX603/pCH7aiPdQ5qY+ue89ICxu3PrJrf/ S8PuQPIoEWmEdTQ9QweG9FpIuQc2LO9fmzOJdiZXHYbGAKFOfNNz5Z64tSVRvPVE WbxW0HYSNCJmBt15OI3hyONmbgACgmuD+e3EFWzCATz94FwMogLBFNqPcw1YpbwX 48RdcP62gx7FMZ4MdqfWllviZKjTlTVmacCRghwGFBOuZL3ifu1gfhsTbrAxqXCs CSlnXgZV7gI= =/2cJ -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160917-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Tracepoint addition and improvement Here is a set of patches that add some more tracepoints and improve a couple of existing ones. New additions include: (1) Connection refcount tracking. (2) Client connection state machine tracking. (3) Tx and Rx packet lifecycle. (4) ACK reception and transmission. (5) recvmsg processing. Updates include: (1) Print the symbolic packet name in the Rx packet tracepoint. (2) Additional call refcount trace events. (3) Improvements to sk_buff tracking with AF_RXRPC. In addition: (1) Config option to inject packet loss during both transmission and reception. (2) Removal of some printks. This series needs to be applied on top of the previously posted fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:52:21 -04:00
David S. Miller	5b0c6fc8ef	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV90ZnPSw1s6N8H32AQKguA//Q7lvTa3n3DFvMQcIPsyJZ6VniUqksTA1 wmQrw4GHXRUgM8UWz7G9Y5aqxUp2q6y6Vm9BeHkQ2bYZjSrOx5Dc/AImWhBn5au+ h+HZYcEs4mFM5AoVT7GK8o/nNODDjNt2qwcH/Nf8+SM7Xf52zYHelrSteLZZ1YWO S1m4pa5YUBa/ICD3+K52lWq9SCG0VMmy41UHDXU6uSakfN62rn9ZYCcNeathlJGS 2D0cG3GzYYHiBZ1CkmdPQgGQhTM3wzI+0OvbNnidFlF78zVDlxX8C9zgWs/VlTCg 02ok4+ftzDojgXH9W+DziEiyCNq14GTDbrdSai5WA+vHVGagr6OoSwDWgPHkXBvW pYQh7jqRoBOKN2fsVkU0t19hPc2CCVGYMh49A6AFv8lgS+nWoDXOlmZ0snh8deQg Z0HO5mx+V+4yJplBlwH6ncvbRB9ywpsvIuLriZXC/aJg6aY4a8nrU35d1+6xUaM7 RBMud0uj+7oU+sC9N7CuM8m8HpBOg6+qAsbsfATSwadMRcMdS4LSoXcBg0WvljIH JmtL924yEnMDw1yPkmuDBcQ9K6DuxeOYZg4A2756tBtGulxuVjntmI1MVAQlsbqH CnNPWpxIDoLRQsHVcYWS5O1F16drGobzFhmj7Hf/6HmGa28x7nQhDafzfFj/3Dos MAdM2pdO2x8= =MjVT -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160917-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Fixes & miscellany Here are some more AF_RXRPC fix patches with a couple of miscellaneous changes also. Fixes include: (1) Make RxRPC IPv6 support conditional on IPv6 being available. (2) Move the condition check in rxrpc_locate_data() into the caller and check the error return. (3) Fix the detection of the last received packet in recvmsg. (4) Account calls that need acceptance and clean up any unaccepted ones if the socket gets closed. (5) Fix the cleanup of client connections. (6) Fix the soft-ACK parsing and the retransmission of packets based on those ACKs. (7) Suppress transmission of an ACK when there's no pending ACK to transmit because another thread stole it. And some miscellany: (8) Whitespace removal. (9) Switch-value consistency in rxrpc_send_call_packet(). (10) Fix the basic transmission packet size to allow for spur-of-the-moment jumbo DATA packet production. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:51:21 -04:00
Florian Westphal	48da34b7a7	sched: add and use qdisc_skb_head helpers This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head. Its similar to the skb_buff_head api, but does not use skb->prev pointers. Qdiscs will commonly enqueue at the tail of a list and dequeue at head. While skb_buff_head works fine for this, enqueue/dequeue needs to also adjust the prev pointer of next element. The ->prev pointer is not required for qdiscs so we can just leave it undefined and avoid one cacheline write access for en/dequeue. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Florian Westphal	ed760cb8aa	sched: replace __skb_dequeue with __qdisc_dequeue_head After previous patch these functions are identical. Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head. Next patch will then make __qdisc_dequeue_head handle single-linked list instead of strcut sk_buff_head argument. Doesn't change generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Florian Westphal	ec32336879	sched: remove qdisc arg from __qdisc_dequeue_head Moves qdisc stat accouting to qdisc_dequeue_head. The only direct caller of the __qdisc_dequeue_head version open-codes this now. This allows us to later use __qdisc_dequeue_head as a replacement of __skb_dequeue() (which operates on sk_buff_head list). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Florian Westphal	97d0678f91	sched: don't use skb queue helpers A followup change will replace the sk_buff_head in the qdisc struct with a slightly different list. Use of the sk_buff_head helpers will thus cause compiler warnings. Open-code these accesses in an extra change to ease review. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Florian Westphal	1486587b2f	pie: use qdisc_dequeue_head wrapper Doesn't change generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Christophe Jaillet	e8bc8f9a67	sctp: Remove some redundant code In commit `311b21774f` ("sctp: simplify sk_receive_queue locking"), a call to 'skb_queue_splice_tail_init()' has been made explicit. Previously it was hidden in 'sctp_skb_list_tail()' Now, the code around it looks redundant. The '_init()' part of 'skb_queue_splice_tail_init()' should already do the same. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:34:01 -04:00
Mahesh Bandewar	e8bffe0cf9	net: Add _nf_(un)register_hooks symbols Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow caller to hold RTNL mutex. Signed-off-by: Mahesh Bandewar <maheshb@google.com> CC: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:25:22 -04:00
Mahesh Bandewar	d409b84768	ipv6: Export p6_route_input_lookup symbol Make ip6_route_input_lookup available outside of ipv6 the module similar to ip_route_input_noref in the IPv4 world. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:25:22 -04:00
Nogah Frankel	69ae6ad2ff	net: core: Add offload stats to if_stats_msg Add a nested attribute of offload stats to if_stats_msg named IFLA_STATS_LINK_OFFLOAD_XSTATS. Under it, add SW stats, meaning stats only per packets that went via slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:33:42 -04:00
David S. Miller	c13ed534b8	This time we have various things - all across the board: * MU-MIMO sniffer support in mac80211 * a create_singlethread_workqueue() cleanup * interface dump filtering that was documented but not implemented * support for the new radiotap timestamp field * send delBA in two unexpected conditions (as required by the spec) * connect keys cleanups - allow only WEP with index 0-3 * per-station aggregation limit to work around broken APs * debugfs improvement for the integrated codel algorithm and various other small improvements and cleanups. -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJX2+umAAoJEGt7eEactAAdIMkP/jMmpbxkzD64L7nTkO4APGva r6RmMM1SmgVD/CtVkjlBLuvo5YOTWv/vWvy6KoUESOINAx/e6T3T7bmmCOXzbsOL e5/YYcS1AOqgn5SdhgIj1E5cpdYIhlUGRlNJ0qEjeLLrh4/TLUNbCcuPhOYybUMz fUrdPKgDeWb7x9EHLENhPsVtCXWwKnkDIS4qclPZCWgRj46XM4pNB4OlvCUzGY6k bOqGJfrtjYjgKFDmPFqfYA4JDA56980qqO41+eEKXeMvDKNs+pSiNco130Q+uU3E o7tk9DMnAnCy2GihpV1ZYVkLr6O+7o9xVuenj3NRlhyd1mn2gXxLcO4AkHcrZBkf Ei+2L+KgnWELyqiSOaGTJKlugsgS4DDoNnFEIVjSweQ9DIoBA/Gj/6+4uZeHXJ3M bEjtHnCLi5CuI067uBoevwXFoMi1poWra2KnZKOZzFS5OL3xHv4//x/Wmnn2/5Jz ffEwVyRmTY76sLWfnwXUDClrFWAYQrpNyTryc+k3cpYKzhnseiqt+z43cBuISm00 uh5B9PpPB8RhtUnXrL/SHRyf8YEluaidTsI2lc1LvwXOc0+Zbp73mTCgP+rzLs9p K2qVRiozpIXanW6hKmmaDwjKlcAKKLP0xN2v90MqwQt4YdLIKlXnll1AH2BawzuP OWB3n8D0I6y0PWH+Yo8o =s1MY -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== This time we have various things - all across the board: * MU-MIMO sniffer support in mac80211 * a create_singlethread_workqueue() cleanup * interface dump filtering that was documented but not implemented * support for the new radiotap timestamp field * send delBA in two unexpected conditions (as required by the spec) * connect keys cleanups - allow only WEP with index 0-3 * per-station aggregation limit to work around broken APs * debugfs improvement for the integrated codel algorithm and various other small improvements and cleanups. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:29:08 -04:00
David S. Miller	7ac327318e	Two more fixes: * reject aggregation sessions for TSID/TID 8-16 that we can never use anyway and which could confuse drivers * check return value of skb_linearize() -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJX2+l7AAoJEGt7eEactAAdedMP/0k225qWRR+19lavkZ3/lSqb YiQdYNwKCZf9WR+5FiQKFcTV4XMCfKiWX2eJJWqFI5r+V4mfTJAZDehZpddCvJou KW3dZYe25+DaFApdhQiJXTxKSeJAbRB7EWT4i0P54Ned0YUeFwwOjvxuH+fq2VY0 tScX5JogPuGFVzXhlqUVZTjbHlBkcILGv7QjOXXL9yf4d2x57uXyabOB14l351RF 4hqXm7u+JQedbGjhYHDMKnyuwXQW4mbhELBdyMh7heLPqufWf4Xakrwh8WYSKuBb H5dICeq+/JPKnwOStmwtz1FQVIlU08PMLnWzQbByrkDvQPCIkMHLu6h6oNierqAH z4vyy8w05KzFNG3R5egygopXbncZvSOCcOPkDQxbTYmh2tK97OMUjXbUfPEUXZs2 WbSuECm+RVSNKtLlUip7TrM3iYI7xmCN/qPB4Wuq+z/JiSOrnP+G047iUbhdDpnE fwLmRWWdEGkjnv3TFdzzxLVAJPCQWyuLfxaUBlHlYRNqBY8ufTR3FHUKj4dVwqJK EqnSiDUqbZE6qJw516MEp+CqZGb8hzeMCeAu+docWMa3CmnNlglZtYgNiou45y9Z 9/8dAO93/qAzsaw9A/XprnIVPcu1vsFSicrhjOdZsnYoVJDGoW6rjzz6HpCk9QxB 7FjxZ5LilJ5Ws4ltlwr7 =3dek -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Two more fixes: * reject aggregation sessions for TSID/TID 8-16 that we can never use anyway and which could confuse drivers * check return value of skb_linearize() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:26:49 -04:00
Eric Dumazet	695b4ec0f0	pkt_sched: fq: use proper locking in fq_dump_stats() When fq is used on 32bit kernels, we need to lock the qdisc before copying 64bit fields. Otherwise "tc -s qdisc ..." might report bogus values. Fixes: `afe4fd0624` ("pkt_sched: fq: Fair Queue packet scheduler") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:15:08 -04:00
Thadeu Lima de Souza Cascardo	db74a3335e	openvswitch: use percpu flow stats Instead of using flow stats per NUMA node, use it per CPU. When using megaflows, the stats lock can be a bottleneck in scalability. On a E5-2690 12-core system, usual throughput went from ~4Mpps to ~15Mpps when forwarding between two 40GbE ports with a single flow configured on the datapath. This has been tested on a system with possible CPUs 0-7,16-23. After module removal, there were no corruption on the slab cache. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Cc: pravin shelar <pshelar@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:14:01 -04:00
Thadeu Lima de Souza Cascardo	40773966cc	openvswitch: fix flow stats accounting when node 0 is not possible On a system with only node 1 as possible, all statistics is going to be accounted on node 0 as it will have a single writer. However, when getting and clearing the statistics, node 0 is not going to be considered, as it's not a possible node. Tested that statistics are not zero on a system with only node 1 possible. Also compile-tested with CONFIG_NUMA off. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:14:01 -04:00
Xin Long	41001faf95	sctp: not return ENOMEM err back in sctp_packet_transmit As David and Marcelo's suggestion, ENOMEM err shouldn't return back to user in transmit path. Instead, sctp's retransmit would take care of the chunks that fail to send because of ENOMEM. This patch is only to do some release job when alloc_skb fails, not to return ENOMEM back any more. Besides, it also cleans up sctp_packet_transmit's err path, and fixes some issues in err path: - It didn't free the head skb in nomem: path. - No need to check nskb in no_route: path. - It should goto err: path if alloc_skb fails for head. - Not all the NOMEMs should free nskb. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:33 -04:00
Xin Long	83dbc3d4a3	sctp: make sctp_outq_flush/tail/uncork return void sctp_outq_flush return value is meaningless now, this patch is to make sctp_outq_flush return void, as well as sctp_outq_fail and sctp_outq_uncork. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:33 -04:00
Xin Long	645194409b	sctp: save transmit error to sk_err in sctp_outq_flush Every time when sctp calls sctp_outq_flush, it sends out the chunks of control queue, retransmit queue and data queue. Even if some trunks are failed to transmit, it still has to flush all the transports, as it's the only chance to clean that transmit_list. So the latest transmit error here should be returned back. This transmit error is an internal error of sctp stack. I checked all the places where it uses the transmit error (the return value of sctp_outq_flush), most of them are actually just save it to sk_err. Except for sctp_assoc/endpoint_bh_rcv, they will drop the chunk if it's failed to send a REPLY, which is actually incorrect, as we can't be sure the error that sctp_outq_flush returns is from sending that REPLY. So it's meaningless for sctp_outq_flush to return error back. This patch is to save transmit error to sk_err in sctp_outq_flush, the new error can update the old value. Eventually, sctp_wait_for_* would check for it. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:32 -04:00
Xin Long	b61c654f9b	sctp: free msg->chunks when sctp_primitive_SEND return err Last patch "sctp: do not return the transmit err back to sctp_sendmsg" made sctp_primitive_SEND return err only when asoc state is unavailable. In this case, chunks are not enqueued, they have no chance to be freed if we don't take care of them later. This Patch is actually to revert commit `1cd4d5c432` ("sctp: remove the unused sctp_datamsg_free()"), commit `69b5777f2e` ("sctp: hold the chunks only after the chunk is enqueued in outq") and commit `8b570dc9f7` ("sctp: only drop the reference on the datamsg after sending a msg"), to use sctp_datamsg_free to free the chunks of current msg. Fixes: `8b570dc9f7` ("sctp: only drop the reference on the datamsg after sending a msg") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:32 -04:00
Xin Long	66388f2c08	sctp: do not return the transmit err back to sctp_sendmsg Once a chunk is enqueued successfully, sctp queues can take care of it. Even if it is failed to transmit (like because of nomem), it should be put into retransmit queue. If sctp report this error to users, it confuses them, they may resend that msg, but actually in kernel sctp stack is in charge of retransmit it already. Besides, this error probably is not from the failure of transmitting current msg, but transmitting or retransmitting another msg's chunks, as sctp_outq_flush just tries to send out all transports' chunks. This patch is to make sctp_cmd_send_msg return avoid, and not return the transmit err back to sctp_sendmsg Fixes: `8b570dc9f7` ("sctp: only drop the reference on the datamsg after sending a msg") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:32 -04:00
Xin Long	2c89791eeb	sctp: remove the unnecessary state check in sctp_outq_tail Data Chunks are only sent by sctp_primitive_SEND, in which sctp checks the asoc's state through statetable before calling sctp_outq_tail. So there's no need to check the asoc's state again in sctp_outq_tail. Besides, sctp_do_sm is protected by lock_sock, even if sending msg is interrupted by timer events, the event's processes still need to acquire lock_sock first. It means no others CMDs can be enqueue into side effect list before CMD_SEND_MSG to change asoc->state, so it's safe to remove it. This patch is to remove redundant asoc->state check from sctp_outq_tail. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:32 -04:00
Alexei Starovoitov	8d79266bc4	ip6_tunnel: add collect_md mode to IPv6 tunnels Similar to gre, vxlan, geneve tunnels allow IPIP6 and IP6IP6 tunnels to operate in 'collect metadata' mode. Unlike ipv4 code here it's possible to reuse ip6_tnl_xmit() function for both collect_md and traditional tunnels. bpf_skb_[gs]et_tunnel_key() helpers and ovs (in the future) are the users. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:13:07 -04:00
Alexei Starovoitov	cfc7381b30	ip_tunnel: add collect_md mode to IPIP tunnel Similar to gre, vxlan, geneve tunnels allow IPIP tunnels to operate in 'collect metadata' mode. bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away. ovs can use it as well in the future (once appropriate ovs-vport abstractions and user apis are added). Note that just like in other tunnels we cannot cache the dst, since tunnel_info metadata can be different for every packet. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:13:07 -04:00
Julia Lawall	eb94737d71	l2tp: constify net_device_ops structures Check for net_device_ops structures that are only stored in the netdev_ops field of a net_device structure. This field is declared const, so net_device_ops structures that have this property can be declared as const also. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r disable optional_qualifier@ identifier i; position p; @@ static struct net_device_ops i@p = { ... }; @ok@ identifier r.i; struct net_device e; position p; @@ e.netdev_ops = &i@p; @bad@ position p != {r.p,ok.p}; identifier r.i; struct net_device_ops e; @@ e@i@p @depends on !bad disable optional_qualifier@ identifier r.i; @@ static +const struct net_device_ops i = { ... }; // </smpl> The result of size on this file before the change is: text data bss dec hex filename 3401 931 44 4376 1118 net/l2tp/l2tp_eth.o and after the change it is: text data bss dec hex filename 3993 347 44 4384 1120 net/l2tp/l2tp_eth.o Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:07:23 -04:00
Alan Cox	5ff904d55d	llc: switch type to bool as the timeout is only tested versus 0 (As asked by Dave in Februrary) Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:05:05 -04:00
Eric Dumazet	3613b3dbd1	tcp: prepare skbs for better sack shifting With large BDP TCP flows and lossy networks, it is very important to keep a low number of skbs in the write queue. RACK and SACK processing can perform a linear scan of it. We should avoid putting any payload in skb->head, so that SACK shifting can be done if needed. With this patch, we allow to pack ~0.5 MB per skb instead of the 64KB initially cooked at tcp_sendmsg() time. This gives a reduction of number of skbs in write queue by eight. tcp_rack_detect_loss() likes this. We still allow payload in skb->head for first skb put in the queue, to not impact RPC workloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:05:05 -04:00
phil.turnbull@oracle.com	8ab86c00e3	irda: Free skb on irda_accept error path. skb is not freed if newsk is NULL. Rework the error path so free_skb is unconditionally called on function exit. Fixes: `c3ea9fa274` ("[IrDA] af_irda: IRDA_ASSERT cleanups") Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 09:59:31 -04:00
Eric Dumazet	ffb4d6c850	tcp: fix overflow in __tcp_retransmit_skb() If a TCP socket gets a large write queue, an overflow can happen in a test in __tcp_retransmit_skb() preventing all retransmits. The flow then stalls and resets after timeouts. Tested: sysctl -w net.core.wmem_max=1000000000 netperf -H dest -- -s 1000000000 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 09:59:30 -04:00
David Howells	8a681c3605	rxrpc: Add config to inject packet loss Add a configuration option to inject packet loss by discarding approximately every 8th packet received and approximately every 8th DATA packet transmitted. Note that no locking is used, but it shouldn't really matter. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:04 +01:00
David Howells	71f3ca408f	rxrpc: Improve skb tracing Improve sk_buff tracing within AF_RXRPC by the following means: (1) Use an enum to note the event type rather than plain integers and use an array of event names rather than a big multi ?: list. (2) Distinguish Rx from Tx packets and account them separately. This requires the call phase to be tracked so that we know what we might find in rxtx_buffer[]. (3) Add a parameter to rxrpc_{new,see,get,free}_skb() to indicate the event type. (4) A pair of 'rotate' events are added to indicate packets that are about to be rotated out of the Rx and Tx windows. (5) A pair of 'lost' events are added, along with rxrpc_lose_skb() for packet loss injection recording. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:04 +01:00
David Howells	ba39f3a0ed	rxrpc: Remove printks from rxrpc_recvmsg_data() to fix uninit var Remove _enter/_debug/_leave calls from rxrpc_recvmsg_data() of which one uses an uninitialised variable. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:04 +01:00
David Howells	849979051c	rxrpc: Add a tracepoint to follow what recvmsg does Add a tracepoint to follow what recvmsg does within AF_RXRPC. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	58dc63c998	rxrpc: Add a tracepoint to follow packets in the Rx buffer Add a tracepoint to follow the life of packets that get added to a call's receive buffer. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	f3639df2d9	rxrpc: Add a tracepoint to log ACK transmission Add a tracepoint to log information about ACK transmission. Signed-off-by: David Howels <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	ec71eb9ada	rxrpc: Add a tracepoint to log received ACK packets Add a tracepoint to log information from received ACK packets. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	a124fe3ee5	rxrpc: Add a tracepoint to follow the life of a packet in the Tx buffer Add a tracepoint to follow the insertion of a packet into the transmit buffer, its transmission and its rotation out of the buffer. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	363deeab6d	rxrpc: Add connection tracepoint and client conn state tracepoint Add a pair of tracepoints, one to track rxrpc_connection struct ref counting and the other to track the client connection cache state. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	a84a46d730	rxrpc: Add some additional call tracing Add additional call tracepoint points for noting call-connected, call-released and connection-failed events. Also fix one tracepoint that was using an integer instead of the corresponding enum value as the point type. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	a3868bfc8d	rxrpc: Print the packet type name in the Rx packet trace Print a symbolic packet type name for each valid received packet in the trace output, not just a number. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	182f505624	rxrpc: Fix the basic transmit DATA packet content size at 1412 bytes Fix the basic transmit DATA packet content size at 1412 bytes so that they can be arbitrarily assembled into jumbo packets. In the future, I'm thinking of moving to keeping a jumbo packet header at the beginning of each packet in the Tx queue and creating the packet header on the spot when kernel_sendmsg() is invoked. That way, jumbo packets can be assembled on the spur of the moment for (re-)transmission. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:54:32 +01:00
David Howells	2311e327cd	rxrpc: Be consistent about switch value in rxrpc_send_call_packet() rxrpc_send_call_packet() should use type in both its switch-statements rather than using pkt->whdr.type. This might give the compiler an easier job of uninitialised variable checking. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:54:21 +01:00
David Howells	27d0fc431c	rxrpc: Don't transmit an ACK if there's no reason set Don't transmit an ACK if call->ackr_reason in unset. There's the possibility of a race between recvmsg() sending an ACK and the background processing thread trying to send the same one. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:55 +01:00
David Howells	dfa7d92040	rxrpc: Fix retransmission algorithm Make the retransmission algorithm use for-loops instead of do-loops and move the counter increments into the for-statement increment slots. Though the do-loops are slighly more efficient since there will be at least one pass through the each loop, the counter increments are harder to get right as the continue-statements skip them. Without this, if there are any positive acks within the loop, the do-loop will cycle forever because the counter increment is never done. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:21 +01:00
David Howells	d01dc4c3c1	rxrpc: Fix the parsing of soft-ACKs The soft-ACK parser doesn't increment the pointer into the soft-ACK list, resulting in the first ACK/NACK value being applied to all the relevant packets in the Tx queue. This has the potential to miss retransmissions and cause excessive retransmissions. Fix this by incrementing the pointer. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:21 +01:00
David Howells	78883793f8	rxrpc: Fix unexposed client conn release If the last call on a client connection is release after the connection has had a bunch of calls allocated but before any DATA packets are sent (so that it's not yet marked RXRPC_CONN_EXPOSED), an assertion will happen in rxrpc_disconnect_client_call(). af_rxrpc: Assertion failed - 1(0x1) >= 2(0x2) is false ------------[ cut here ]------------ kernel BUG at ../net/rxrpc/conn_client.c:753! This is because it's expecting the conn to have been exposed and to have 2 or more refs - but this isn't necessarily the case. Simply remove the assertion. This allows the conn to be moved into the inactive state and deleted if it isn't resurrected before the final put is called. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:21 +01:00
David Howells	357f5ef646	rxrpc: Call rxrpc_release_call() on error in rxrpc_new_client_call() Call rxrpc_release_call() on getting an error in rxrpc_new_client_call() rather than trying to do the cleanup ourselves. This isn't a problem, provided we set RXRPC_CALL_HAS_USERID only if we actually add the call to the calls tree as cleanup code fragments that would otherwise cause problems are conditional. Without this, we miss some of the cleanup. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:21 +01:00
David Howells	66d58af7f4	rxrpc: Fix the putting of client connections In rxrpc_put_one_client_conn(), if a connection has RXRPC_CONN_COUNTED set on it, then it's accounted for in rxrpc_nr_client_conns and may be on various lists - and this is cleaned up correctly. However, if the connection doesn't have RXRPC_CONN_COUNTED set on it, then the put routine returns rather than just skipping the extra bit of cleanup. Fix this by making the extra bit of clean up conditional instead and always killing off the connection. This manifests itself as connections with a zero usage count hanging around in /proc/net/rxrpc_conns because the connection allocated, but discarded, due to a race with another process that set up a parallel connection, which was then shared instead. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:20 +01:00
David Howells	0360da6db7	rxrpc: Purge the to_be_accepted queue on socket release Purge the queue of to_be_accepted calls on socket release. Note that purging sock_calls doesn't release the ref owned by to_be_accepted. Probably the sock_calls list is redundant given a purges of the recvmsg_q, the to_be_accepted queue and the calls tree. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:51:54 +01:00
David Howells	e6f3afb3fc	rxrpc: Record calls that need to be accepted Record calls that need to be accepted using sk_acceptq_added() otherwise the backlog counter goes negative because sk_acceptq_removed() is called. This causes the preallocator to malfunction. Calls that are preaccepted by AFS within the kernel aren't affected by this. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:51:54 +01:00
David Howells	816c9fce12	rxrpc: Fix handling of the last packet in rxrpc_recvmsg_data() The code for determining the last packet in rxrpc_recvmsg_data() has been using the RXRPC_CALL_RX_LAST flag to determine if the rx_top pointer points to the last packet or not. This isn't a good idea, however, as the input code may be running simultaneously on another CPU and that sets the flag before updating the top pointer. Fix this by the following means: (1) Restrict the use of RXRPC_CALL_RX_LAST to the input routines only. There's otherwise a synchronisation problem between detecting the flag and checking tx_top. This could probably be dealt with by appropriate application of memory barriers, but there's a simpler way. (2) Set RXRPC_CALL_RX_LAST after setting rx_top. (3) Make rxrpc_rotate_rx_window() consult the flags header field of the DATA packet it's about to discard to see if that was the last packet. Use this as the basis for ending the Rx phase. This shouldn't be a problem because the recvmsg side of things is guaranteed to see the packets in order. (4) Make rxrpc_recvmsg_data() return 1 to indicate the end of the data if: (a) the packet it has just processed is marked as RXRPC_LAST_PACKET (b) the call's Rx phase has been ended. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:51:54 +01:00
David Howells	2e2ea51dec	rxrpc: Check the return value of rxrpc_locate_data() Check the return value of rxrpc_locate_data() in rxrpc_recvmsg_data(). Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:50:49 +01:00
David Howells	4b22457c06	rxrpc: Move the check of rx_pkt_offset from rxrpc_locate_data() to caller Move the check of rx_pkt_offset from rxrpc_locate_data() to the caller, rxrpc_recvmsg_data(), so that it's more clear what's going on there. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:50:48 +01:00
David Howells	fabf920180	rxrpc: Remove some whitespace. Remove a tab that's on a line that should otherwise be blank. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:50:15 +01:00
David Howells	d19127473a	rxrpc: Make IPv6 support conditional on CONFIG_IPV6 Add CONFIG_AF_RXRPC_IPV6 and make the IPv6 support code conditional on it. This is then made conditional on CONFIG_IPV6. Without this, the following can be seen: net/built-in.o: In function `rxrpc_init_peer': >> peer_object.c:(.text+0x18c3c8): undefined reference to `ip6_route_output_flags' Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 03:58:45 -04:00
Linus Torvalds	87ee1280ff	Fix a memory corruption bug that I introduced in 4.7. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJX3GKYAAoJECebzXlCjuG+7jEQAIgMFl0feSlYrWqPjHZuUXcu uNdvnxKgWWP+JJp5pEh+uFLUsXqamfMqcEj4B0OgcoZcsYMDVsAEafeSWkdNVQeO Wr+j+IBcC1oECGA5MbZWzHBzw1CMq5/l411pvU+2xeZl8iG83MCl9su8dgf4ctAe hc7asU5fiQ+nazhfBaKpVzMAQO+G76E5J90B/Y+VbjYvtQoITUaT3R/PqBIKoG6j ezlElPHKddD2Yd8JnqPfnQ521I4yQL2AhrtJzQayCAnDlBXJ0B41IchcFRxPJPBX teNybHjHX9XQMGyE13EM4MO43AM9Bfcn7hAUPppag4tnvmS34fkIbBsWwWJjTmhR sjfLoBbP11KgGUjg7vRbNNsQ1zHhDyvss5ChUQAqfkDDH7CpnmOPXUqKKSGIfVvw fCpBMqxh5xwSzX3m2Wm2wVsUYH0rSyTKmcomdrZw0o577L2HsoaqWa2hcD/v7fE2 MtPnYLszJXJA3k1ZLoJ6L4sFphznOwzwrj8oLpan5YGsJnonyHcequZw9Z4YoMYG YPztd/kwctIezKhHyRqmr8lghTJ5vjLK49FfJEjrDAb8vfXq3Gwb7ZkwtZBH9RSw CV2vmJ1ny9PRRCKw0UCYAfUv5ucS2JiRbk31qjoIMYpDWJQghrZHnltRxgKg5UDW ZxTGNLwH/WXRVHJzTaeE =gr4D -----END PGP SIGNATURE----- Merge tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux Pull nfsd bugfix from Bruce Fields: "Fix a memory corruption bug that I introduced in 4.7" * tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux: svcauth_gss: Revert `64c59a3726` ("Remove unnecessary allocation")	2016-09-16 17:00:26 -07:00
Luca Coelho	fbd05e4a6e	cfg80211: add helper to find an IE that matches a byte-array There are a few places where an IE that matches not only the EID, but also other bytes inside the element, needs to be found. To simplify that and reduce the amount of similar code, implement a new helper function to match the EID and an extra array of bytes. Additionally, simplify cfg80211_find_vendor_ie() by using the new match function. Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-16 14:49:52 +02:00
Emmanuel Grumbach	c68df2e7be	mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE In `46fa38e84b` ("mac80211: allow software PS-Poll/U-APSD with AP_LINK_PS"), Johannes allowed to use mac80211's code for handling stations that go to PS or send PS-Poll / uAPSD trigger frames for devices that enable RSS. This means that mac80211 doesn't look at frames anymore but rather relies on a notification that will come from the device when a PS transition occurs or when a PS-Poll / trigger frame is detected by the device. iwlwifi will need this capability but still needs mac80211 to take care of the TIM IE. Today, if a driver sets AP_LINK_PS, mac80211 will not update the TIM IE. Change mac80211 to check existence of the set_tim driver callback rather than using AP_LINK_PS to decide if the driver handles the TIM IE internally or not. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> [reword commit message a bit] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-16 14:49:23 +02:00
John Crispin	cafdc45c94	net-next: dsa: add Qualcomm tag RX/TX handler Add support for the 2-bytes Qualcomm tag that gigabit switches such as the QCA8337/N might insert when receiving packets, or that we need to insert while targeting specific switch ports. The tag is inserted directly behind the ethernet header. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 04:31:51 -04:00
Mark Tomlinson	d6f64d725b	net: VRF: Pass original iif to ip_route_input() The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except the global VRF, this replaces skb->dev with the VRF master interface. When calling ip_route_input_noref() from here, the checks for forwarding look at this master device instead of the initial ingress interface. This will allow packets to be routed which normally would be dropped. For example, an interface that is not assigned an IP address should drop packets, but because the checking is against the master device, the packet will be forwarded. The fix here is to still call l3mdev_ip_rcv(), but remember the initial net_device. This is passed to the other functions within ip_rcv_finish, so they still see the original interface. Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 04:24:07 -04:00
David S. Miller	6777ae8083	Here are two batman-adv bugfix patches: - Fix reference counting for last_bonding_candidate, by Sven Eckelmann - Fix head room reservation for ELP packets, by Linus Luessing -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdBQJX2UCwFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS nqHp+w/+MGG87sjUfObJKrSTmsIDgWHl2VSQwKBivbkvIuBm9gxIhFblaBQEJd5A 88Z5LOGWpnxh2rfFIPurTLxkYAZyARpIxWOLVtmPE3TdfMi4savTsvkOywd0ZHqE kXZH1QHE/Y3CQBf9FaM5cgTiwAXjZ+KWr5kRg3WNmH0oQUepndUBb5AAbjpD2G3f Gt4TsfunplXCmA/5uJMISOjKiub8usUhXVXBHVpxR+ZItdoIolwN197wDzXT8pWK FJe+Flqrvxi3n0kNFknUzCDdt09TFLms4m0AEu+8f4P6t1mR7v+YHNM4IlBx8BX0 6Kwiz009h9+5JFZvOSTCy6tkrodn5cAk9LNNsam5uPTWXeY76gFOzegdHIcIaBxq MLEqnTUgONqOakZs+4NopUz0HUvtJXDGYJcy7SDLYIYEwhaHP7seyuPkjA+xap2p uPZR3dTYESdeNGs/uPTEGVh1q8w6xXqXe9IRzP1KvGAOsg5IXCboLvJHpVMc25aT 4CT9qz5H94sVqdR6d5pBb+0CLoxnhbl+4IjKnHwMBzVM/0LVDJZRmx9qgiNHJ+0f PGQeQEbAfrihORRn2jCEq0YLCHKhjvLw9O7GVj353wmiVkvvuR+MQEv+bcDgu0DE mH9maOpcOahud6S+x+m9of3f3ytwWR4UO2Qk5t1RdIuVRSD7sXM= =mHPX -----END PGP SIGNATURE----- Merge tag 'batadv-net-for-davem-20160914' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here are two batman-adv bugfix patches: - Fix reference counting for last_bonding_candidate, by Sven Eckelmann - Fix head room reservation for ELP packets, by Linus Luessing ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 04:17:33 -04:00
Eric Dumazet	76f0dcbb5a	tcp: fix a stale ooo_last_skb after a replace When skb replaces another one in ooo queue, I forgot to also update tp->ooo_last_skb as well, if the replaced skb was the last one in the queue. To fix this, we simply can re-use the code that runs after an insertion, trying to merge skbs at the right of current skb. This not only fixes the bug, but also remove all small skbs that might be a subset of the new one. Example: We receive segments 2001:3001, 4001:5001 Then we receive 2001:8001 : We should replace 2001:3001 with the big skb, but also remove 4001:50001 from the queue to save space. packetdrill test demonstrating the bug 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> +0.100 < . 1:1(0) ack 1 win 1024 +0 accept(3, ..., ...) = 4 +0.01 < . 1001:2001(1000) ack 1 win 1024 +0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001> +0.01 < . 1001:3001(2000) ack 1 win 1024 +0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001 1001:3001> Fixes: `9f5afeae51` ("tcp: use an RB tree for ooo receive queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Cc: Yaogong Wang <wygivan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 04:09:49 -04:00
David S. Miller	364eac0c8b	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV9h/5PSw1s6N8H32AQLJCQ//RFbu0SNSoJnnZbOTwkxBaGYnGg4KbNVt iR4zumQfFssyYr7WcH1S6kuPzM/dJfjkRYqollyUGCEfnWyDwyfnjM9Na9PQoZ9F k7xnbim8N65njHLdGF6QMhenmoRXSBVCN2E0uPTbBXurFHJ8ZgQQs+DhogalvGUl 2TL/aMdpqRoo1Vg0/APVOKeLGqgHEhrXxelTZB/74IXyYT+rzjfzu+ZfwxUAijsM d+FBSwY+D8RYSV4LXQzMNNFCwNORbG2Rse2nEqd7bVqdVywWsuhbgeESjx1Y3+ge /mofVyxrpoblT9qsScbISbIQEe6cLxRiQgQHEudennRI2/3EbpNSijhNFWVon2Em NAa7r+tfOPtVx5JTL9NyvwtrXPfAgDi7Stpml3Yhhr/CjRHYK9kfKysowMXL5vOz NHD0fUozNLecpGCmdxG+alwf5BJ5q9DRPP7bI7KE/4FfVsYe8bO6pQ9G+myeP/A4 h8DuvK4xSJUEpEa5dpDLA1wSC2XH5PgYdXIr2DFBaFjllIdf1cGNKKIYRF2eIX/I obVD7c72oZV1kIK3URTLJpE+CdA4KgTFuL3YqIxquA2Iedb1t2uOrQcp4WwaWf7V REY9KbBn1F0+yJfO3Fjckerzle+MlAmrAHQpkZUduo5JzdRm3DY++YzswXQT5Fpr 8S3T30nwY0k= =T1wh -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160913-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Support IPv6 Here is a set of patches that add IPv6 support. They need to be applied on top of the just-posted miscellaneous fix patches. They are: (1) Make autobinding of an unconnected socket work when sendmsg() is called to initiate a client call. (2) Don't specify the protocol when creating the client socket, but rather take the default instead. (3) Use rxrpc_extract_addr_from_skb() in a couple of places that were doing the same thing manually. This allows the IPv6 address extraction to be done in fewer places. (4) Add IPv6 support. With this, calls can be made to IPv6 servers from userspace AF_RXRPC programs; AFS, however, can't use IPv6 yet as the RPC calls need to be upgradeable. ==================== Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 01:57:19 -04:00
David S. Miller	39caa8bf6d	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV9h5A/Sw1s6N8H32AQJOPA//UI0606GZV2zjGqvWYbwquxjhWbbiVfEx CB5BeiQjKs8MxrJeHT/+bh6Z1Y6YorkyrVCc7kI1RQ+yiN0hw49bhFfF9Kr46DBF gYI2VdiKjIFEgC9fTenLkhMDQC7Hhf9O50hzk9QcC4y7w1Lhytah97d9w+Df0ECy a2QLMe2Ad9K5qR08ih3yTH7+G9K1m4/iqIrON2Hd9Opb+oFJgOiixvUVPr9f/6Xd /2YeAPDy/2A1MQ2nNE+oSW4C5uD+mJICqjjSw9YyhYl31lIfwBZ7+DE9hjR1qCXj UzMJLKrutXQQ1U7/Fbbke6UU5yKVm1djQB1qTF8t1hCHp/q88E7T06UUU9oBDqe0 98CjPofEXBcqn9hjrXIvJgxCEISTPHx9ikaq0i5yF/6pSHZ9G8gLUfrqbMwipkfk mXItd6HAHXhX7cS5u76v7I4c9u5olexX5cJ91/ibtOdsupiJTMLwCx4twR6knEcS /6SSqjklFL4f6HjuNlNJ8m2dB98DII+Ym0qo/ZQy4KUm/+0yzrkpGHvt32CR4wng qjtDN+KgxNss1duu4zkHgQe22u3iSRToxwydWTIQYY6tx4e08X1eSIFRL5ddYpEC bjnOtmniAyDP5YF1jRwFDLS3YzT9Uvrf0TVAOvU7/FjPh3KCGa8fn38xIbEsX6eI 1uadG1bf9wg= =vHfH -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160913-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Miscellaneous fixes Here's a set of miscellaneous fix patches. There are a couple of points of note: (1) There is one non-fix patch that adjusts the call ref tracking tracepoint to make kernel API-held refs on calls more obvious. This is a prerequisite for the patch that fixes prealloc refcounting. (2) The final patch alters how jumbo packets that partially exceed the receive window are handled. Previously, space was being left in the Rx buffer for them, but this significantly hurts performance as the Rx window can't be increased to match the OpenAFS Tx window size. Instead, the excess subpackets are discarded and an EXCEEDS_WINDOW ACK is generated for the first. To avoid the problem of someone trying to run the kernel out of space by feeding the kernel a series of overlapping maximal jumbo packets, we stop allowing jumbo packets on a call if we encounter more than three jumbo packets with duplicate or excessive subpackets. ==================== Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 01:52:20 -04:00
David S. Miller	3e454fd5db	A few more fixes: * better mesh path fixing, from Thomas * fix TIM IE recalculation after sending frames to a sleeping station, from Felix * fix sequence number assignment while sending frames to a sleeping station, also from Felix * validate number of probe response CSA counter offsets, fixing a copy/paste bug (from myself) -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJX2FrGAAoJEGt7eEactAAd/ZUP/ix4CokfC/WYxAYoOI6R2X+X UVvkFgG8hsZMzm2PZ4bg14jx9cQdnisgzxJnDgwoYvAnjBR0mbPl50LXoJ+JTlVr JnoIeMEKXqtxP099B4ssSCpW8YEr6Ex8RFweqN5vLuyIhlW63To1pOMvj9+zvUDB 3P7CmX2D7cWYys8ciDEaNhHziJRAdd9AruWf3emS1IhScDw05NlDQlff6FZpG/gW JMVGjKs+v5DdT3SH677/S6KfpwkPEpIoWzgYH32OW7nXxfuYZCoMWSdeoCfgNe+i WgZ154RPBD/YcXsCdNtw36uped7qlrS55Fnlyi1bKMx7vdKLJHABh+FB7qxFpO2F 2gh3G/KBcuGFfOihPXp11m3LlUn+ml1Ri3F8ffwvoO2ZoZGDplrH6oX2+Q9Hlm7U zvhuVzLriq0xHGZhv4EXqMt+70Zk0GvwzBviSjYXEwUJ+h/FkCyrVyUCpfmko9lf 8QMhI+rm0b5BtOvTkGI4ghEbpz80x1UnnUEGiMdzFhFIunSy43EzJ/RbpiAiaOCI pou/moi9+0yMAuJQkMEK2Yg+DdluTsSNuMHB7NKTx8o/wzwgFWFJ6jHlQO72SM43 k0Ku66LyJfovmv03dPaTxDSISaoxsn+1+SlsUPAF0usLyLka9A0WBjOE8edfDI0h hAamoQ0QHd4OlHtgKib6 =xzhM -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2016-09-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A few more fixes: * better mesh path fixing, from Thomas * fix TIM IE recalculation after sending frames to a sleeping station, from Felix * fix sequence number assignment while sending frames to a sleeping station, also from Felix * validate number of probe response CSA counter offsets, fixing a copy/paste bug (from myself) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 01:34:08 -04:00
Lance Richardson	2679d04041	openvswitch: avoid deferred execution of recirc actions The ovs kernel data path currently defers the execution of all recirc actions until stack utilization is at a minimum. This is too limiting for some packet forwarding scenarios due to the small size of the deferred action FIFO (10 entries). For example, broadcast traffic sent out more than 10 ports with recirculation results in packet drops when the deferred action FIFO becomes full, as reported here: http://openvswitch.org/pipermail/dev/2016-March/067672.html Since the current recursion depth is available (it is already tracked by the exec_actions_level pcpu variable), we can use it to determine whether to execute recirculation actions immediately (safe when recursion depth is low) or defer execution until more stack space is available. With this change, the deferred action fifo size becomes a non-issue for currently failing scenarios because it is no longer used when there are three or fewer recursions through ovs_execute_actions(). Suggested-by: Pravin Shelar <pshelar@ovn.org> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 20:35:52 -04:00
Or Gerlitz	a53d850a79	net/sched: cls_flower: Remove an unused field from the filter key structure Commit `c3f8324188` "net: Add full IPv6 addresses to flow_keys" added an unused instance of struct flow_dissector_key_addrs into struct fl_flow_key, remove it. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reported-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 20:27:23 -04:00
Or Gerlitz	aa72d70837	net/sched: cls_flower: Support masking for matching on tcp/udp ports Add the definitions for src/dst udp/tcp port masks and use them when setting && dumping the relevant keys. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Paul Blakey <paulb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 20:27:23 -04:00
Jamal Hadi Salim	86da71b573	net_sched: Introduce skbmod action This action is intended to be an upgrade from a usability perspective from pedit (as well as operational debugability). Compare this: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action pedit munge offset -14 u8 set 0x02 \ munge offset -13 u8 set 0x15 \ munge offset -12 u8 set 0x15 \ munge offset -11 u8 set 0x15 \ munge offset -10 u16 set 0x1515 \ pipe to: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbmod dmac 02:15:15:15:15:15 Also try to do a MAC address swap with pedit or worse try to debug a policy with destination mac, source mac and etherype. Then make few rules out of those and you'll get my point. In the future common use cases on pedit can be migrated to this action (as an example different fields in ip v4/6, transports like tcp/udp/sctp etc). For this first cut, this allows modifying basic ethernet header. The most important ethernet use case at the moment is when redirecting or mirroring packets to a remote machine. The dst mac address needs a re-write so that it doesnt get dropped or confuse an interconnecting (learning) switch or dropped by a target machine (which looks at the dst mac). And at times when flipping back the packet a swap of the MAC addresses is needed. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 19:33:47 -04:00
Daniel Borkmann	f53d8c7b18	bpf: use skb_at_tc_ingress helper in tcf_bpf We have a small skb_at_tc_ingress() helper for testing for ingress, so make use of it. cls_bpf already uses it and so should act_bpf. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 19:29:47 -04:00
Daniel Borkmann	04b3f8de4b	bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf The skb_mac_header_was_set() test in cls_bpf's and act_bpf's fast-path is actually unnecessary and can be removed altogether. This was added by commit `a166151cbe` ("bpf: fix bpf helpers to use skb->mac_header relative offsets"), which was later on improved by `3431205e03` ("bpf: make programs see skb->data == L2 for ingress and egress"). We're always guaranteed to have valid mac header at the time we invoke cls_bpf_classify() or tcf_bpf(). Reason is that since `6d1ccff627` ("net: reset mac header in dev_start_xmit()") we do skb_reset_mac_header() in __dev_queue_xmit() before we could call into sch_handle_egress() or any subsequent enqueue. sch_handle_ingress() always sees a valid mac header as well (things like skb_reset_mac_len() would badly fail otherwise). Thus, drop the unnecessary test in classifier and action case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 19:29:47 -04:00
Hadar Hen Zion	07c0f09e23	net/sched: act_tunnel_key: Remove rcu_read_lock protection Remove rcu_read_lock protection from tunnel_key_dump and use rtnl_dereference, dump operation is protected by rtnl lock. Also, remove rcu_read_lock from tunnel_key_release and use rcu_dereference_protected. Both operations are running exclusively and a writer couldn't modify t->params while those functions are executed. Fixes: 54d94fd89d90 ('net/sched: Introduce act_tunnel_key') Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 19:18:18 -04:00
Johannes Berg	ec53c832ee	cfg80211: remove unnecessary pointer-of For an array, there's no need to use &array, so just use the plain wiphy->addresses[i].addr here to silence smatch. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:20 +02:00
Rajkumar Manoharan	e8a24cd4b8	mac80211: allow driver to handle packet-loss mechanism Based on consecutive msdu failures, mac80211 triggers CQM packet-loss mechanism. Drivers like ath10k that have its own connection monitoring algorithm, offloaded to firmware for triggering station kickout. In case of station kickout, driver will report low ack status by mac80211 API (ieee80211_report_low_ack). This flag will enable the driver to completely rely on firmware events for station kickout and bypass mac80211 packet loss mechanism. Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:20 +02:00
Johannes Berg	c7e9dbcf09	mac80211: remove sta_remove_debugfs driver callback No drivers implement this, relying either on the recursive directory removal to remove their debugfs, or not having any to start with. Remove the dead driver callback. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:19 +02:00
Johannes Berg	8826fef95b	mac80211: remove pointless chanctx NULL check If chanctx is derived as container_of() from a non-NULL pointer, it can't ever be NULL. Since we checked conf before, that's true here, so remove the useless NULL check. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:19 +02:00
Johannes Berg	5140974dca	mac80211: remove unused assignment The next line overwrites this assignment, so remove it; there's no real value in using it for the next assignment either. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:18 +02:00
Johannes Berg	53b18980fd	nl80211: always check nla_put* return values A few instances were found where we didn't check them, add the missing checks even though they'll probably never trigger as the message should be large enough here. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:17 +02:00
Johannes Berg	76e1fb4b55	nl80211: always check nla_nest_start() return value If the message got full during nla_nest_start(), it can return NULL. None of the cases here seem like that can really happen, but check the return value nonetheless. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:17 +02:00
Johannes Berg	58bd7f1158	mac80211: fix scan completed tracing Passing the 'info' pointer where a 'info->aborted' is expected will always lead to tracing to erroneously record that the scan was aborted, fix that by passing the correct info->aborted. The remaining data will be collected in cfg80211, so I haven't duplicated it here. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:16 +02:00
Johannes Berg	93db1d9e6c	mac80211: fix possible out-of-bounds access In the unlikely situation that the supplicant has negotiated admission for the background AC (which it has no reason to as it's not supposed to be requiring admission control to start with, and we'd ignore such a requirement anyway), the loop here may terminate with non_acm_ac == 4, which leads to an array overrun. Check this explicitly just for completeness. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:16 +02:00
Johannes Berg	f1c1f17ac5	cfg80211: allow connect keys only with default (TX) key There's no point in allowing connect keys when one of them isn't also configured as the TX key, it would just confuse drivers and probably cause them to pick something for TX. Disallow this confusing and erroneous configuration. As wpa_supplicant will always send NL80211_ATTR_KEYS, even when there are no keys inside, allow that and treat it as though the attribute isn't present at all. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:45:41 +02:00
Johannes Berg	85d5313ed7	mac80211: reject TSPEC TIDs (TSIDs) for aggregation Since mac80211 doesn't currently support TSIDs 8-15 which can only be used after QoS TSPEC negotiation (and not even after WMM negotiation), reject attempts to set up aggregation sessions for them, which might confuse drivers. In mac80211 we do correctly handle that, but the TSIDs should never get used anyway, and drivers might not be able to handle it. Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 10:08:52 +02:00
Johannes Berg	0b97a484e5	mac80211: check skb_linearize() return value The A-MSDU TX code (within TXQs) didn't always check the return value of skb_linearize() properly, resulting in potentially passing a frag- list SKB down to the driver even when it said it can't handle it. Fix that. Fixes: `6e0456b545` ("mac80211: add A-MSDU tx support") Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-14 12:08:33 +02:00
David Howells	75b54cb57c	rxrpc: Add IPv6 support Add IPv6 support to AF_RXRPC. With this, AF_RXRPC sockets can be created: service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6); instead of: service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); The AFS filesystem doesn't support IPv6 at the moment, though, since that requires upgrades to some of the RPC calls. Note that a good portion of this patch is replacing "%pI4:%u" in print statements with "%pISpc" which is able to handle both protocols and print the port. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 23:09:13 +01:00
David Howells	1c2bc7b948	rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually There are two places that want to transmit a packet in response to one just received and manually pick the address to reply to out of the sk_buff. Make them use rxrpc_extract_addr_from_skb() instead so that IPv6 is handled automatically. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 23:09:13 +01:00
David Howells	aaa31cbc66	rxrpc: Don't specify protocol to when creating transport socket Pass 0 as the protocol argument when creating the transport socket rather than IPPROTO_UDP. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 23:09:13 +01:00
David Howells	cd5892c756	rxrpc: Create an address for sendmsg() to bind unbound socket with Create an address for sendmsg() to bind unbound socket with rather than using a completely blank address otherwise the transport socket creation will fail because it will try to use address family 0. We use the address family specified in the protocol argument when the AF_RXRPC socket was created and SOCK_DGRAM as the default. For anything else, bind() must be used. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 23:09:13 +01:00
David Howells	75e4212639	rxrpc: Correctly initialise, limit and transmit call->rx_winsize call->rx_winsize should be initialised to the sysctl setting and the sysctl setting should be limited to the maximum we want to permit. Further, we need to place this in the ACK info instead of the sysctl setting. Furthermore, discard the idea of accepting the subpackets of a jumbo packet that lie beyond the receive window when the first packet of the jumbo is within the window. Just discard the excess subpackets instead. This allows the receive window to be opened up right to the buffer size less one for the dead slot. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:38:45 +01:00
David Howells	3432a757b1	rxrpc: Fix prealloc refcounting The preallocated call buffer holds a ref on the calls within that buffer. The ref was being released in the wrong place - it worked okay for incoming calls to the AFS cache manager service, but doesn't work right for incoming calls to a userspace service. Instead of releasing an extra ref service calls in rxrpc_release_call(), the ref needs to be released during the acceptance/rejectance process. To this end: (1) The prealloc ref is now normally released during rxrpc_new_incoming_call(). (2) For preallocated kernel API calls, the kernel API's ref needs to be released when the call is discarded on socket close. (3) We shouldn't take a second ref in rxrpc_accept_call(). (4) rxrpc_recvmsg_new_call() needs to get a ref of its own when it adds the call to the to_be_accepted socket queue. In doing (4) above, we would prefer not to put the call's refcount down to 0 as that entails doing cleanup in softirq context, but it's unlikely as there are several refs held elsewhere, at least one of which must be put by someone in process context calling rxrpc_release_call(). However, it's not a problem if we do have to do that. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:38:37 +01:00
David Howells	cbd00891de	rxrpc: Adjust the call ref tracepoint to show kernel API refs Adjust the call ref tracepoint to show references held on a call by the kernel API separately as much as possible and add an additional trace to at the allocation point from the preallocation buffer for an incoming call. Note that this doesn't show the allocation of a client call for the kernel separately at the moment. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:38:30 +01:00
David Howells	01fd074224	rxrpc: Allow tx_winsize to grow in response to an ACK Allow tx_winsize to grow when the ACK info packet shows a larger receive window at the other end rather than only permitting it to shrink. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:38:24 +01:00
David Howells	89a80ed4c0	rxrpc: Use skb->len not skb->data_len skb->len should be used rather than skb->data_len when referring to the amount of data in a packet. This will only cause a malfunction in the following cases: (1) We receive a jumbo packet (validation and splitting both are wrong). (2) We see if there's extra ACK info in an ACK packet (we think it's not there and just ignore it). Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:22 +01:00
David Howells	b25de36053	rxrpc: Add missing unlock in rxrpc_call_accept() Add a missing unlock in rxrpc_call_accept() in the path taken if there's no call to wake up. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:22 +01:00
David Howells	33b603fda8	rxrpc: Requeue call for recvmsg if more data rxrpc_recvmsg() needs to make sure that the call it has just been processing gets requeued for further attention if the buffer has been filled and there's more data to be consumed. The softirq producer only queues the call and wakes the socket if it fills the first slot in the window, so userspace might end up sleeping forever otherwise, despite there being data available. This is not a problem provided the userspace buffer is big enough or it empties the buffer completely before more data comes in. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:21 +01:00
David Howells	91c2c7b656	rxrpc: The IDLE ACK packet should use rxrpc_idle_ack_delay The IDLE ACK packet should use the rxrpc_idle_ack_delay setting when the timer is set for it. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:21 +01:00
David Howells	bc4abfcf51	rxrpc: Add missing wakeup on Tx window rotation We need to wake up the sender when Tx window rotation due to an incoming ACK makes space in the buffer otherwise the sender is liable to just hang endlessly. This problem isn't noticeable if the Tx phase transfers no more than will fit in a single window or the Tx window rotates fast enough that it doesn't get full. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:21 +01:00
David Howells	08a39685a7	rxrpc: Make sure we initialise the peer hash key Peer records created for incoming connections weren't getting their hash key set. This meant that incoming calls wouldn't see more than one DATA packet - which is not a problem for AFS CM calls with small request data blobs. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:21 +01:00
Johannes Berg	89b706fb28	cfg80211: reduce connect key caching struct size After the previous patches, connect keys can only (correctly) be used for storing static WEP keys. Therefore, remove all the data for dealing with key index 4/5 and reduce the size of the key material to the maximum for WEP keys. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 20:20:54 +02:00

... 3 4 5 6 7 ...

43844 Commits