linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-11 22:51:42 +00:00

Author	SHA1	Message	Date
roel kluin	65a1c4fffa	net: Cleanup redundant tests on unsigned optlen is unsigned so the `< 0' test is never true. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:39:54 -07:00
Gilad Ben-Yossef	dc343475ed	Allow disabling of DSACK TCP option per route Add and use no DSCAK bit in the features field. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:28:48 -07:00
Gilad Ben-Yossef	345cda2fd6	Allow to turn off TCP window scale opt per route Add and use no window scale bit in the features field. Note that this is not the same as setting a window scale of 0 as would happen with window limit on route. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:28:47 -07:00
Gilad Ben-Yossef	cda42ebd67	Allow disabling TCP timestamp options per route Implement querying and acting upon the no timestamp bit in the feature field. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:28:45 -07:00
Gilad Ben-Yossef	1aba721eba	Add the no SACK route option feature Implement querying and acting upon the no sack bit in the features field. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:28:44 -07:00
Gilad Ben-Yossef	022c3f7d82	Allow tcp_parse_options to consult dst entry We need tcp_parse_options to be aware of dst_entry to take into account per dst_entry TCP options settings Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:28:41 -07:00
Gilad Ben-Yossef	f55017a93f	Only parse time stamp TCP option in time wait sock Since we only use tcp_parse_options here to check for the exietence of TCP timestamp option in the header, it is better to call with the "established" flag on. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Signed-off-by: Ori Finkelman <ori@comsleep.com> Signed-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:28:39 -07:00
Eric Dumazet	d17fa6fa81	ipmr: Optimize multiple unregistration Speedup module unloading by factorizing synchronize_rcu() calls Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:13:49 -07:00
Neil Horman	55888dfb6b	AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl (v2) Augment raw_send_hdrinc to correct for incorrect ip header length values A series of oopses was reported to me recently. Apparently when using AF_RAW sockets to send data to peers that were reachable via ipsec encapsulation, people could panic or BUG halt their systems. I've tracked the problem down to user space sending an invalid ip header over an AF_RAW socket with IP_HDRINCL set to 1. Basically what happens is that userspace sends down an ip frame that includes only the header (no data), but sets the ip header ihl value to a large number, one that is larger than the total amount of data passed to the sendmsg call. In raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr that was passed in, but assume the data is all valid. Later during ipsec encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff to provide headroom for the ipsec headers. During this operation, the skb->transport_header is repointed to a spot computed by skb->network_header + the ip header length (ihl). Since so little data was passed in relative to the value of ihl provided by the raw socket, we point transport header to an unknown location, resulting in various crashes. This fix for this is pretty straightforward, simply validate the value of of iph->ihl when sending over a raw socket. If (iph->ihl*4U) > user data buffer size, drop the frame and return -EINVAL. I just confirmed this fixes the reported crashes. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-29 01:09:58 -07:00
Andreas Petlund	ea84e5555a	net: Corrected spelling error heurestics->heuristics Corrected a spelling error in a function name. Signed-off-by: Andreas Petlund <apetlund@simula.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-28 04:00:03 -07:00
Eric Dumazet	eef6dd65e3	gre: Optimize multiple unregistration Speedup module unloading by factorizing synchronize_rcu() calls Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-28 02:22:09 -07:00
Eric Dumazet	0694c4c016	ipip: Optimize multiple unregistration Speedup module unloading by factorizing synchronize_rcu() calls Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-28 02:22:08 -07:00
David S. Miller	cfadf853f6	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/sh_eth.c	2009-10-27 01:03:26 -07:00
Eric Dumazet	8d5b2c084d	gre: convert hash tables locking to RCU GRE tunnels use one rwlock to protect their hash tables. This locking scheme can be converted to RCU for free, since netdevice already must wait for a RCU grace period at dismantle time. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-24 06:07:59 -07:00
Eric Dumazet	8f95dd63a2	ipip: convert hash tables locking to RCU IPIP tunnels use one rwlock to protect their hash tables. This locking scheme can be converted to RCU for free, since netdevice already must wait for a RCU grace period at dismantle time. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-24 06:07:57 -07:00
Arjan van de Ven	c62f4c453a	net: use WARN() for the WARN_ON in commit `b6b39e8f3f` Commit `b6b39e8f3f` (tcp: Try to catch MSG_PEEK bug) added a printk() to the WARN_ON() that's in tcp.c. This patch changes this combination to WARN(); the advantage of WARN() is that the printk message shows up inside the message, so that kerneloops.org will collect the message. In addition, this gets rid of an extra if() statement. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-22 21:37:56 -07:00
Krishna Kumar	ea94ff3b55	net: Fix for dst_negative_advice dst_negative_advice() should check for changed dst and reset sk_tx_queue_mapping accordingly. Pass sock to the callers of dst_negative_advice. (sk_reset_txq is defined just for use by dst_negative_advice. The only way I could find to get around this is to move dst_negative_() from dst.h to dst.c, include sock.h in dst.c, etc) Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-20 18:55:46 -07:00
Herbert Xu	b6b39e8f3f	tcp: Try to catch MSG_PEEK bug This patch tries to print out more information when we hit the MSG_PEEK bug in tcp_recvmsg. It's been around since at least 2005 and it's about time that we finally fix it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-20 00:51:57 -07:00
John Dykstra	0eae750e60	IP: Cleanups Use symbols instead of magic constants while checking PMTU discovery setsockopt. Remove redundant test in ip_rt_frag_needed() (done by caller). Signed-off-by: John Dykstra <john.dykstra1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-19 23:22:52 -07:00
Eric Dumazet	55b8050353	net: Fix IP_MULTICAST_IF ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls. This function should be called only with RTNL or dev_base_lock held, or reader could see a corrupt hash chain and eventually enter an endless loop. Fix is to call dev_get_by_index()/dev_put(). If this happens to be performance critical, we could define a new dev_exist_by_index() function to avoid touching dev refcount. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-19 21:34:20 -07:00
Julian Anastasov	b103cf3438	tcp: fix TCP_DEFER_ACCEPT retrans calculation Fix TCP_DEFER_ACCEPT conversion between seconds and retransmission to match the TCP SYN-ACK retransmission periods because the time is converted to such retransmissions. The old algorithm selects one more retransmission in some cases. Allow up to 255 retransmissions. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-19 19:19:06 -07:00
Julian Anastasov	0c3d79bce4	tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPT Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT users to not retransmit SYN-ACKs during the deferring period if ACK from client was received. The goal is to reduce traffic during the deferring period. When the period is finished we continue with sending SYN-ACKs (at least one) but this time any traffic from client will change the request to established socket allowing application to terminate it properly. Also, do not drop acked request if sending of SYN-ACK fails. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-19 19:19:03 -07:00
Julian Anastasov	d1b99ba41d	tcp: accept socket after TCP_DEFER_ACCEPT period Willy Tarreau and many other folks in recent years were concerned what happens when the TCP_DEFER_ACCEPT period expires for clients which sent ACK packet. They prefer clients that actively resend ACK on our SYN-ACK retransmissions to be converted from open requests to sockets and queued to the listener for accepting after the deferring period is finished. Then application server can decide to wait longer for data or to properly terminate the connection with FIN if read() returns EAGAIN which is an indication for accepting after the deferring period. This change still can have side effects for applications that expect always to see data on the accepted socket. Others can be prepared to work in both modes (with or without TCP_DEFER_ACCEPT period) and their data processing can ignore the read=EAGAIN notification and to allocate resources for clients which proved to have no data to send during the deferring period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not as a timeout) to wait for data will notice clients that didn't send data for 3 seconds but that still resend ACKs. Thanks to Willy Tarreau for the initial idea and to Eric Dumazet for the review and testing the change. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-19 19:19:01 -07:00
David S. Miller	a1a2ad9151	Revert "tcp: fix tcp_defer_accept to consider the timeout" This reverts commit `6d01a026b7`. Julian Anastasov, Willy Tarreau and Eric Dumazet have come up with a more correct way to deal with this. Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-19 19:12:36 -07:00
Steffen Klassert	dff3bb0626	ah4: convert to ahash This patch converts ah4 to the new ahash interface. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-18 21:31:59 -07:00
Eric Dumazet	8edf19c2fe	net: sk_drops consolidation part 2 - skb_kill_datagram() can increment sk->sk_drops itself, not callers. - UDP on IPV4 & IPV6 dropped frames (because of bad checksum or policy checks) increment sk_drops Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-18 18:52:54 -07:00
Eric Dumazet	c720c7e838	inet: rename some inet_sock fields In order to have better cache layouts of struct sock (separate zones for rx/tx paths), we need this preliminary patch. Goal is to transfert fields used at lookup time in the first read-mostly cache line (inside struct sock_common) and move sk_refcnt to a separate cache line (only written by rx path) This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr, sport and id fields. This allows a future patch to define these fields as macros, like sk_refcnt, without name clashes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-18 18:52:53 -07:00
Eric Dumazet	766e9037cc	net: sk_drops consolidation sock_queue_rcv_skb() can update sk_drops itself, removing need for callers to take care of it. This is more consistent since sock_queue_rcv_skb() also reads sk_drops when queueing a skb. This adds sk_drops managment to many protocols that not cared yet. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-14 20:40:11 -07:00
David S. Miller	421355de87	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2009-10-13 12:55:20 -07:00
Eric Dumazet	f373b53b5f	tcp: replace ehash_size by ehash_mask Storing the mask (size - 1) instead of the size allows fast path to be a bit faster. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-13 03:44:02 -07:00
Eric Dumazet	8558467201	udp: Fix udp_poll() and ioctl() udp_poll() can in some circumstances drop frames with incorrect checksums. Problem is we now have to lock the socket while dropping frames, or risk sk_forward corruption. This bug is present since commit `95766fff6b` ([UDP]: Add memory accounting.) While we are at it, we can correct ioctl(SIOCINQ) to also drop bad frames. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-13 03:16:54 -07:00
Willy Tarreau	6d01a026b7	tcp: fix tcp_defer_accept to consider the timeout I was trying to use TCP_DEFER_ACCEPT and noticed that if the client does not talk, the connection is never accepted and remains in SYN_RECV state until the retransmits expire, where it finally is deleted. This is bad when some firewall such as netfilter sits between the client and the server because the firewall sees the connection in ESTABLISHED state while the server will finally silently drop it without sending an RST. This behaviour contradicts the man page which says it should wait only for some time : TCP_DEFER_ACCEPT (since Linux 2.4) Allows a listener to be awakened only when data arrives on the socket. Takes an integer value (seconds), this can bound the maximum number of attempts TCP will make to complete the connection. This option should not be used in code intended to be portable. Also, looking at ipv4/tcp.c, a retransmit counter is correctly computed : case TCP_DEFER_ACCEPT: icsk->icsk_accept_queue.rskq_defer_accept = 0; if (val > 0) { /* Translate value in seconds to number of * retransmits */ while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && val > ((TCP_TIMEOUT_INIT / HZ) << icsk->icsk_accept_queue.rskq_defer_accept)) icsk->icsk_accept_queue.rskq_defer_accept++; icsk->icsk_accept_queue.rskq_defer_accept++; } break; ==> rskq_defer_accept is used as a counter of retransmits. But in tcp_minisocks.c, this counter is only checked. And in fact, I have found no location which updates it. So I think that what was intended was to decrease it in tcp_minisocks whenever it is checked, which the trivial patch below does. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-13 01:35:28 -07:00
Neil Horman	3b885787ea	net: Generalize socket rx gap / receive queue overflow cmsg Create a new socket level option to report number of queue overflows Recently I augmented the AF_PACKET protocol to report the number of frames lost on the socket receive queue between any two enqueued frames. This value was exported via a SOL_PACKET level cmsg. AFter I completed that work it was requested that this feature be generalized so that any datagram oriented socket could make use of this option. As such I've created this patch, It creates a new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue overflowed between any two given frames. It also augments the AF_PACKET protocol to take advantage of this new feature (as it previously did not touch sk->sk_drops, which this patch uses to record the overflow count). Tested successfully by me. Notes: 1) Unlike my previous patch, this patch simply records the sk_drops value, which is not a number of drops between packets, but rather a total number of drops. Deltas must be computed in user space. 2) While this patch currently works with datagram oriented protocols, it will also be accepted by non-datagram oriented protocols. I'm not sure if thats agreeable to everyone, but my argument in favor of doing so is that, for those protocols which aren't applicable to this option, sk_drops will always be zero, and reporting no drops on a receive queue that isn't used for those non-participating protocols seems reasonable to me. This also saves us having to code in a per-protocol opt in mechanism. 3) This applies cleanly to net-next assuming that commit `977750076d` (my af packet cmsg patch) is reverted Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-12 13:26:31 -07:00
David S. Miller	7fe13c5733	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2009-10-11 23:15:47 -07:00
Eric Dumazet	f86dcc5aa8	udp: dynamically size hash tables at boot time UDP_HTABLE_SIZE was initialy defined to 128, which is a bit small for several setups. 4000 active UDP sockets -> 32 sockets per chain in average. An incoming frame has to lookup all sockets to find best match, so long chains hurt latency. Instead of a fixed size hash table that cant be perfect for every needs, let UDP stack choose its table size at boot time like tcp/ip route, using alloc_large_system_hash() helper Add an optional boot parameter, uhash_entries=x so that an admin can force a size between 256 and 65536 if needed, like thash_entries and rhash_entries. dmesg logs two new lines : [ 0.647039] UDP hash table entries: 512 (order: 0, 4096 bytes) [ 0.647099] UDP Lite hash table entries: 512 (order: 0, 4096 bytes) Maximal size on 64bit arches would be 65536 slots, ie 1 MBytes for non debugging spinlocks. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-07 22:00:22 -07:00
Hagen Paul Pfeifer	4b17d50f9e	ipv4: Define cipso_v4_delopt static There is no reason that cipso_v4_delopt() is not defined as a static function. Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-07 14:45:58 -07:00
Atis Elsts	ffce908246	net: Add sk_mark route lookup support for IPv4 listening sockets Add support for route lookup using sk_mark on IPv4 listening sockets. Signed-off-by: Atis Elsts <atis@mikrotik.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-07 13:55:57 -07:00
Stephen Hemminger	a21090cff2	ipv4: arp_notify address list bug This fixes a bug with arp_notify. If arp_notify is enabled, kernel will crash if address is changed and no IP address is assigned. http://bugzilla.kernel.org/show_bug.cgi?id=14330 Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-07 03:18:17 -07:00
Stephen Hemminger	ec1b4cf74c	net: mark net_proto_ops as const All usages of structure net_proto_ops should be declared const. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-07 01:10:46 -07:00
Ilia K	ee5e81f000	add vif using local interface index instead of IP When routing daemon wants to enable forwarding of multicast traffic it performs something like: struct vifctl vc = { .vifc_vifi = 1, .vifc_flags = 0, .vifc_threshold = 1, .vifc_rate_limit = 0, .vifc_lcl_addr = ip, /* <--- ip address of physical interface, e.g. eth0 / .vifc_rmt_addr.s_addr = htonl(INADDR_ANY), }; setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc)); This leads (in the kernel) to calling vif_add() function call which search the (physical) device using assigned IP address: dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr); The current API (struct vifctl) does not allow to specify an interface other way than using it's IP, and if there are more than a single interface with specified IP only the first one will be found. The attached patch (against 2.6.30.4) allows to specify an interface by its index, instead of IP address: struct vifctl vc = { .vifc_vifi = 1, .vifc_flags = VIFF_USE_IFINDEX, / NEW / .vifc_threshold = 1, .vifc_rate_limit = 0, .vifc_lcl_ifindex = if_nametoindex("eth0"), / NEW */ .vifc_rmt_addr.s_addr = htonl(INADDR_ANY), }; setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc)); Signed-off-by: Ilia K. <mail4ilia@gmail.com> === modified file 'include/linux/mroute.h' Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-07 00:48:41 -07:00
Eric Dumazet	0bfbedb14a	tunnels: Optimize tx path We currently dirty a cache line to update tunnel device stats (tx_packets/tx_bytes). We better use the txq->tx_bytes/tx_packets counters that already are present in cpu cache, in the cache line shared with txq->_xmit_lock This patch extends IPTUNNEL_XMIT() macro to use txq pointer provided by the caller. Also &tunnel->dev->stats can be replaced by &dev->stats Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-05 00:21:57 -07:00
Stephen Hemminger	16c6cf8bb4	ipv4: fib table algorithm performance improvement The FIB algorithim for IPV4 is set at compile time, but kernel goes through the overhead of function call indirection at runtime. Save some cycles by turning the indirect calls to direct calls to either hash or trie code. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-05 00:21:56 -07:00
Eric Dumazet	b3a5b6cc7c	icmp: No need to call sk_write_space() We can make icmp messages tx completion callback a litle bit faster. Setting SOCK_USE_WRITE_QUEUE sk flag tells sock_wfree() to not call sk_write_space() on a socket we know no thread is posssibly waiting for write space. (on per cpu kernel internal icmp sockets only) This avoids the sock_def_write_space() call and read_lock(&sk->sk_callback_lock)/read_unlock(&sk->sk_callback_lock) calls as well. We avoid three atomic ops. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-05 00:21:54 -07:00
Eric Dumazet	42324c6270	net: splice() from tcp to pipe should take into account O_NONBLOCK tcp_splice_read() doesnt take into account socket's O_NONBLOCK flag Before this patch : splice(socket,0,pipe,0,1281024,SPLICE_F_MOVE); causes a random endless block (if pipe is full) and splice(socket,0,pipe,0,1281024,SPLICE_F_MOVE \| SPLICE_F_NONBLOCK); will return 0 immediately if the TCP buffer is empty. User application has no way to instruct splice() that socket should be in blocking mode but pipe in nonblock more. Many projects cannot use splice(tcp -> pipe) because of this flaw. http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html Linus introduced SPLICE_F_NONBLOCK in commit `29e350944f` (splice: add SPLICE_F_NONBLOCK flag ) It doesn't make the splice itself necessarily nonblocking (because the actual file descriptors that are spliced from/to may block unless they have the O_NONBLOCK flag set), but it makes the splice pipe operations nonblocking. Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only This patch instruct tcp_splice_read() to use the underlying file O_NONBLOCK flag, as other socket operations do. Users will then call : splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE \| SPLICE_F_NONBLOCK ); to block on data coming from socket (if file is in blocking mode), and not block on pipe output (to avoid deadlock) First version of this patch was submitted by Octavian Purdila Reported-by: Volker Lendecke <vl@samba.org> Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-02 09:46:05 -07:00
Atis Elsts	914a9ab386	net: Use sk_mark for routing lookup in more places This patch against v2.6.31 adds support for route lookup using sk_mark in some more places. The benefits from this patch are the following. First, SO_MARK option now has effect on UDP sockets too. Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing lookup correctly if TCP sockets with SO_MARK were used. Signed-off-by: Atis Elsts <atis@mikrotik.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>	2009-10-01 15:16:49 -07:00
Ori Finkelman	89e95a613c	IPv4 TCP fails to send window scale option when window scale is zero Acknowledge TCP window scale support by inserting the proper option in SYN/ACK and SYN headers even if our window scale is zero. This fixes the following observed behavior: 1. Client sends a SYN with TCP window scaling option and non zero window scale value to a Linux box. 2. Linux box notes large receive window from client. 3. Linux decides on a zero value of window scale for its part. 4. Due to compare against requested window scale size option, Linux does not to send windows scale TCP option header on SYN/ACK at all. With the following result: Client box thinks TCP window scaling is not supported, since SYN/ACK had no TCP window scale option, while Linux thinks that TCP window scaling is supported (and scale might be non zero), since SYN had TCP window scale option and we have a mismatched idea between the client and server regarding window sizes. Probably it also fixes up the following bug (not observed in practice): 1. Linux box opens TCP connection to some server. 2. Linux decides on zero value of window scale. 3. Due to compare against computed window scale size option, Linux does not to set windows scale TCP option header on SYN. With the expected result that the server OS does not use window scale option due to not receiving such an option in the SYN headers, leading to suboptimal performance. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Signed-off-by: Ori Finkelman <ori@comsleep.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-01 15:14:51 -07:00
Andrew Morton	4fdb78d309	net/ipv4/tcp.c: fix min() type mismatch warning net/ipv4/tcp.c: In function 'do_tcp_setsockopt': net/ipv4/tcp.c:2050: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-01 15:02:20 -07:00
David S. Miller	b7058842c9	net: Make setsockopt() optlen be unsigned. This provides safety against negative optlen at the type level instead of depending upon (sometimes non-trivial) checks against this sprinkled all over the the place, in each and every implementation. Based upon work done by Arjan van de Ven and feedback from Linus Torvalds. Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-30 16:12:20 -07:00
Eric Dumazet	a43912ab19	tunnel: eliminate recursion field It seems recursion field from "struct ip_tunnel" is not anymore needed. recursion prevention is done at the upper level (in dev_queue_xmit()), since we use HARD_TX_LOCK protection for tunnels. This avoids a cache line ping pong on "struct ip_tunnel" : This structure should be now mostly read on xmit and receive paths. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-24 15:39:22 -07:00
Shan Wei	0915921bde	ipv4: check optlen for IP_MULTICAST_IF option Due to man page of setsockopt, if optlen is not valid, kernel should return -EINVAL. But a simple testcase as following, errno is 0, which means setsockopt is successful. addr.s_addr = inet_addr("192.1.2.3"); setsockopt(s, IPPROTO_IP, IP_MULTICAST_IF, &addr, 1); printf("errno is %d\n", errno); Xiaotian Feng(dfeng@redhat.com) caught the bug. We fix it firstly checking the availability of optlen and then dealing with the logic like other options. Reported-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-24 15:38:44 -07:00
Alexey Dobriyan	8d65af789f	sysctl: remove "struct file *" argument of ->proc_handler It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:04 -07:00
Jan Beulich	4481374ce8	mm: replace various uses of num_physpages by totalram_pages Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Linus Torvalds	f205ce83a7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (66 commits) be2net: fix some cmds to use mccq instead of mbox atl1e: fix 2.6.31-git4 -- ATL1E 0000:03:00.0: DMA-API: device driver frees DMA pkt_sched: Fix qstats.qlen updating in dump_stats ipv6: Log the affected address when DAD failure occurs wl12xx: Fix print_mac() conversion. af_iucv: fix race when queueing skbs on the backlog queue af_iucv: do not call iucv_sock_kill() twice af_iucv: handle non-accepted sockets after resuming from suspend af_iucv: fix race in __iucv_sock_wait() iucv: use correct output register in iucv_query_maxconn() iucv: fix iucv_buffer_cpumask check when calling IUCV functions iucv: suspend/resume error msg for left over pathes wl12xx: switch to %pM to print the mac address b44: the poll handler b44_poll must not enable IRQ unconditionally ipv6: Ignore route option with ROUTER_PREF_INVALID bonding: make ab_arp select active slaves as other modes cfg80211: fix SME connect rc80211_minstrel: fix contention window calculation ssb/sdio: fix printk format warnings p54usb: add Zcomax XG-705A usbid ...	2009-09-17 20:53:52 -07:00
Robert Varga	657e9649e7	tcp: fix CONFIG_TCP_MD5SIG + CONFIG_PREEMPT timer BUG() I have recently came across a preemption imbalance detected by: <4>huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101? <0>------------[ cut here ]------------ <2>kernel BUG at /usr/src/linux/kernel/timer.c:664! <0>invalid opcode: 0000 [1] PREEMPT SMP with ffffffff80644630 being inet_twdr_hangman(). This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a bit, so I looked at what might have caused it. One thing that struck me as strange is tcp_twsk_destructor(), as it calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well, as far as I can tell. Signed-off-by: Robert Varga <nite@hq.alert.sk> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-15 23:49:21 -07:00
Linus Torvalds	ada3fa1505	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits) powerpc64: convert to dynamic percpu allocator sparc64: use embedding percpu first chunk allocator percpu: kill lpage first chunk allocator x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA percpu: update embedding first chunk allocator to handle sparse units percpu: use group information to allocate vmap areas sparsely vmalloc: implement pcpu_get_vm_areas() vmalloc: separate out insert_vmalloc_vm() percpu: add chunk->base_addr percpu: add pcpu_unit_offsets[] percpu: introduce pcpu_alloc_info and pcpu_group_info percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward percpu: add @align to pcpu_fc_alloc_fn_t percpu: make @dyn_size mandatory for pcpu_setup_first_chunk() percpu: drop @static_size from first chunk allocators percpu: generalize first chunk allocator selection percpu: build first chunk allocators selectively percpu: rename 4k first chunk allocator to page percpu: improve boot messages percpu: fix pcpu_reclaim() locking ... Fix trivial conflict as by Tejun Heo in kernel/sched.c	2009-09-15 09:39:44 -07:00
Moni Shoua	75c78500dd	bonding: remap muticast addresses without using dev_close() and dev_open() This patch fixes commit `e36b9d16c6`. The approach there is to call dev_close()/dev_open() whenever the device type is changed in order to remap the device IP multicast addresses to HW multicast addresses. This approach suffers from 2 drawbacks: . It assumes tha the device is UP when calling dev_close(), or otherwise dev_close() has no affect. It is worth to mention that initscripts (Redhat) and sysconfig (Suse) doesn't act the same in this matter. . dev_close() has other side affects, like deleting entries from the routing table, which might be unnecessary. The fix here is to directly remap the IP multicast addresses to HW multicast addresses for a bonding device that changes its type, and nothing else. Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Moni Shoua <monis@voltaire.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-15 02:37:40 -07:00
Ilpo Järvinen	0b6a05c1db	tcp: fix ssthresh u16 leftover It was once upon time so that snd_sthresh was a 16-bit quantity. ...That has not been true for long period of time. I run across some ancient compares which still seem to trust such legacy. Put all that magic into a single place, I hopefully found all of them. Compile tested, though linking of allyesconfig is ridiculous nowadays it seems. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-15 01:30:10 -07:00
Alexey Dobriyan	32613090a9	net: constify struct net_protocol Remove long removed "inet_protocol_base" declaration. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-14 17:03:01 -07:00
Linus Torvalds	d7e9660ad9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits) netxen: update copyright netxen: fix tx timeout recovery netxen: fix file firmware leak netxen: improve pci memory access netxen: change firmware write size tg3: Fix return ring size breakage netxen: build fix for INET=n cdc-phonet: autoconfigure Phonet address Phonet: back-end for autoconfigured addresses Phonet: fix netlink address dump error handling ipv6: Add IFA_F_DADFAILED flag net: Add DEVTYPE support for Ethernet based devices mv643xx_eth.c: remove unused txq_set_wrr() ucc_geth: Fix hangs after switching from full to half duplex ucc_geth: Rearrange some code to avoid forward declarations phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs drivers/net/phy: introduce missing kfree drivers/net/wan: introduce missing kfree net: force bridge module(s) to be GPL Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded ... Fixed up trivial conflicts: - arch/x86/include/asm/socket.h converted to <asm-generic/socket.h> in the x86 tree. The generic header has the same new #define's, so that works out fine. - drivers/net/tun.c fix conflict between `89f56d1e9` ("tun: reuse struct sock fields") that switched over to using 'tun->socket.sk' instead of the redundantly available (and thus removed) 'tun->sk', and `2b980dbd` ("lsm: Add hooks to the TUN driver") which added a new 'tun->sk' use. Noted in 'next' by Stephen Rothwell.	2009-09-14 10:37:28 -07:00
David S. Miller	9a0da0d19c	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6	2009-09-10 18:17:09 -07:00
James Morris	a3c8b97396	Merge branch 'next' into for-linus	2009-09-11 08:04:49 +10:00
Alexey Dobriyan	fa1a9c6813	headers: net/ipv[46]/protocol.c header trim Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-09 03:43:50 -07:00
Wu Fengguang	aa1330766c	tcp: replace hard coded GFP_KERNEL with sk_allocation This fixed a lockdep warning which appeared when doing stress memory tests over NFS: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock mount_root => nfs_root_data => tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim David raised a concern that if the allocation fails in tcp_send_fin(), and it's GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting for the allocation to succeed. But fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks weird, but it is no worse the implicit sleep inside GFP_KERNEL. Both could loop endlessly under memory pressure. CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> CC: David S. Miller <davem@davemloft.net> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-02 23:45:45 -07:00
Eric Dumazet	6ce9e7b5fe	ip: Report qdisc packet drops Christoph Lameter pointed out that packet drops at qdisc level where not accounted in SNMP counters. Only if application sets IP_RECVERR, drops are reported to user (-ENOBUFS errors) and SNMP counters updated. IP_RECVERR is used to enable extended reliable error message passing, but these are not needed to update system wide SNMP stats. This patch changes things a bit to allow SNMP counters to be updated, regardless of IP_RECVERR being set or not on the socket. Example after an UDP tx flood # netstat -s ... IP: 1487048 outgoing packets dropped ... Udp: ... SndbufErrors: 1487048 send() syscalls, do however still return an OK status, to not break applications. Note : send() manual page explicitly says for -ENOBUFS error : "The output queue for a network interface was full. This generally indicates that the interface has stopped sending, but may be caused by transient congestion. (Normally, this does not occur in Linux. Packets are just silently dropped when a device queue overflows.) " This is not true for IP_RECVERR enabled sockets : a send() syscall that hit a qdisc drop returns an ENOBUFS error. Many thanks to Christoph, David, and last but not least, Alexey ! Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-02 18:05:33 -07:00
Stephen Hemminger	3b401a81c0	inet: inet_connection_sock_af_ops const The function block inet_connect_sock_af_ops contains no data make it constant. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-02 01:03:49 -07:00
Stephen Hemminger	b2e4b3debc	tcp: MD5 operations should be const Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-02 01:03:43 -07:00
David S. Miller	6cdee2f96a	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/yellowfin.c	2009-09-02 00:32:56 -07:00
Stephen Hemminger	89d69d2b75	net: make neigh_ops constant These tables are never modified at runtime. Move to read-only section. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-01 17:40:57 -07:00
Damian Lukowski	6fa12c8503	Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value. RFC 1122 specifies two threshold values R1 and R2 for connection timeouts, which may represent a number of allowed retransmissions or a timeout value. Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds in number of allowed retransmissions. For any desired threshold R2 (by means of time) one can specify tcp_retries2 (by means of number of retransmissions) such that TCP will not time out earlier than R2. This is the case, because the RTO schedule follows a fixed pattern, namely exponential backoff. However, the RTO behaviour is not predictable any more if RTO backoffs can be reverted, as it is the case in the draft "Make TCP more Robust to Long Connectivity Disruptions" (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd). In the worst case TCP would time out a connection after 3.2 seconds, if the initial RTO equaled MIN_RTO and each backoff has been reverted. This patch introduces a function retransmits_timed_out(N), which calculates the timeout of a TCP connection, assuming an initial RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions. Whenever timeout decisions are made by comparing the retransmission counter to some value N, this function can be used, instead. The meaning of tcp_retries2 will be changed, as many more RTO retransmissions can occur than the value indicates. However, it yields a timeout which is similar to the one of an unpatched, exponentially backing off TCP in the same scenario. As no application could rely on an RTO greater than MIN_RTO, there should be no risk of a regression. Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-01 02:45:47 -07:00
Damian Lukowski	f1ecd5d9e7	Revert Backoff [v3]: Revert RTO on ICMP destination unreachable Here, an ICMP host/network unreachable message, whose payload fits to TCP's SND.UNA, is taken as an indication that the RTO retransmission has not been lost due to congestion, but because of a route failure somewhere along the path. With true congestion, a router won't trigger such a message and the patched TCP will operate as standard TCP. This patch reverts one RTO backoff, if an ICMP host/network unreachable message, whose payload fits to TCP's SND.UNA, arrives. Based on the new RTO, the retransmission timer is reset to reflect the remaining time, or - if the revert clocked out the timer - a retransmission is sent out immediately. Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if there have been retransmissions and reversible backoffs, already. Changes from v2: 1) Renaming of skb in tcp_v4_err() moved to another patch. 2) Reintroduced tcp_bound_rto() and __tcp_set_rto(). 3) Fixed code comments. Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-01 02:45:42 -07:00
Damian Lukowski	4d1a2d9ec1	Revert Backoff [v3]: Rename skb to icmp_skb in tcp_v4_err() This supplementary patch renames skb to icmp_skb in tcp_v4_err() in order to disambiguate from another sk_buff variable, which will be introduced in a separate patch. Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-01 02:45:38 -07:00
Stephen Hemminger	6fef4c0c8e	netdev: convert pseudo-devices to netdev_tx_t Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-01 01:13:07 -07:00
John Dykstra	9a7030b76a	tcp: Remove redundant copy of MD5 authentication key Remove the copy of the MD5 authentication key from tcp_check_req(). This key has already been copied by tcp_v4_syn_recv_sock() or tcp_v6_syn_recv_sock(). Signed-off-by: John Dykstra <john.dykstra1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-29 00:19:25 -07:00
Octavian Purdila	80a1096bac	tcp: fix premature termination of FIN_WAIT2 time-wait sockets There is a race condition in the time-wait sockets code that can lead to premature termination of FIN_WAIT2 and, subsequently, to RST generation when the FIN,ACK from the peer finally arrives: Time TCP header 0.000000 30755 > http [SYN] Seq=0 Win=2920 Len=0 MSS=1460 TSV=282912 TSER=0 0.000008 http > 30755 aSYN, ACK] Seq=0 Ack=1 Win=2896 Len=0 MSS=1460 TSV=... 0.136899 HEAD /1b.html?n1Lg=v1 HTTP/1.0 [Packet size limited during capture] 0.136934 HTTP/1.0 200 OK [Packet size limited during capture] 0.136945 http > 30755 [FIN, ACK] Seq=187 Ack=207 Win=2690 Len=0 TSV=270521... 0.136974 30755 > http [ACK] Seq=207 Ack=187 Win=2734 Len=0 TSV=283049 TSER=... 0.177983 30755 > http [ACK] Seq=207 Ack=188 Win=2733 Len=0 TSV=283089 TSER=... 0.238618 30755 > http [FIN, ACK] Seq=207 Ack=188 Win=2733 Len=0 TSV=283151... 0.238625 http > 30755 [RST] Seq=188 Win=0 Len=0 Say twdr->slot = 1 and we are running inet_twdr_hangman and in this instance inet_twdr_do_twkill_work returns 1. At that point we will mark slot 1 and schedule inet_twdr_twkill_work. We will also make twdr->slot = 2. Next, a connection is closed and tcp_time_wait(TCP_FIN_WAIT2, timeo) is called which will create a new FIN_WAIT2 time-wait socket and will place it in the last to be reached slot, i.e. twdr->slot = 1. At this point say inet_twdr_twkill_work will run which will start destroying the time-wait sockets in slot 1, including the just added TCP_FIN_WAIT2 one. To avoid this issue we increment the slot only if all entries in the slot have been purged. This change may delay the slots cleanup by a time-wait death row period but only if the worker thread didn't had the time to run/purge the current slot in the next period (6 seconds with default sysctl settings). However, on such a busy system even without this change we would probably see delays... Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-29 00:00:35 -07:00
Jens Låås	80b71b80df	fib_trie: resize rework Here is rework and cleanup of the resize function. Some bugs we had. We were using ->parent when we should use node_parent(). Also we used ->parent which is not assigned by inflate in inflate loop. Also a fix to set thresholds to power 2 to fit halve and double strategy. max_resize is renamed to max_work which better indicates it's function. Reaching max_work is not an error, so warning is removed. max_work only limits amount of work done per resize. (limits CPU-usage, outstanding memory etc). The clean-up makes it relatively easy to add fixed sized root-nodes if we would like to decrease the memory pressure on routers with large routing tables and dynamic routing. If we'll need that... Its been tested with 280k routes. Work done together with Robert Olsson. Signed-off-by: Jens Låås <jens.laas@its.uu.se> Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-28 23:57:15 -07:00
Eric Dumazet	30038fc61a	net: ip_rt_send_redirect() optimization While doing some forwarding benchmarks, I noticed ip_rt_send_redirect() is rather expensive, even if send_redirects is false for the device. Fix is to avoid two atomic ops, we dont really need to take a reference on in_dev Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-28 23:52:01 -07:00
Eric Dumazet	df19a62677	tcp: keepalive cleanups Introduce keepalive_probes(tp) helper, and use it, like keepalive_time_when(tp) and keepalive_intvl_when(tp) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-28 23:48:54 -07:00
Eric Dumazet	3d1427f870	ipv4: af_inet.c cleanups Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-28 23:45:21 -07:00
Julien TINNES	788d908f28	ipv4: make ip_append_data() handle NULL routing table Add a check in ip_append_data() for NULL *rtp to prevent future bugs in callers from being exploitable. Signed-off-by: Julien Tinnes <julien@cr0.org> Signed-off-by: Tavis Ormandy <taviso@sdf.lonestar.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-27 12:23:43 -07:00
Patrick McHardy	3993832464	netfilter: nfnetlink: constify message attributes and headers Signed-off-by: Patrick McHardy <kaber@trash.net>	2009-08-25 16:07:58 +02:00
Patrick McHardy	74f7a6552c	netfilter: nf_conntrack: log packets dropped by helpers Log packets dropped by helpers using the netfilter logging API. This is useful in combination with nfnetlink_log to analyze those packets in userspace for debugging. Signed-off-by: Patrick McHardy <kaber@trash.net>	2009-08-25 15:33:08 +02:00
Maximilian Engelhardt	cce5a5c302	netfilter: nf_nat: fix inverted logic for persistent NAT mappings Kernel 2.6.30 introduced a patch [1] for the persistent option in the netfilter SNAT target. This is exactly what we need here so I had a quick look at the code and noticed that the patch is wrong. The logic is simply inverted. The patch below fixes this. Also note that because of this the default behavior of the SNAT target has changed since kernel 2.6.30 as it now ignores the destination IP in choosing the source IP for nating (which should only be the case if the persistent option is set). [1] http://git.eu.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=98d500d66cb7940747b424b245fc6a51ecfbf005 Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2009-08-24 19:24:54 +02:00
Jan Engelhardt	35aad0ffdf	netfilter: xtables: mark initial tables constant The inputted table is never modified, so should be considered const. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2009-08-24 14:56:30 +02:00
James Morris	ece13879e7	Merge branch 'master' into next Conflicts: security/Kconfig Manual fix. Signed-off-by: James Morris <jmorris@namei.org>	2009-08-20 09:18:42 +10:00
Tom Goff	8cdb045632	gre: Fix MTU calculation for bound GRE tunnels The GRE header length should be subtracted when the tunnel MTU is calculated. This just corrects for the associativity change introduced by commit `42aa916265` ("gre: Move MTU setting out of ipgre_tunnel_bind_dev"). Signed-off-by: Tom Goff <thomas.goff@boeing.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-14 16:41:18 -07:00
Tejun Heo	384be2b18a	Merge branch 'percpu-for-linus' into percpu-for-next Conflicts: arch/sparc/kernel/smp_64.c arch/x86/kernel/cpu/perf_counter.c arch/x86/kernel/setup_percpu.c drivers/cpufreq/cpufreq_ondemand.c mm/percpu.c Conflicts in core and arch percpu codes are mostly from commit ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all the first chunk allocators into mm/percpu.c, the changes are moved from arch code to mm/percpu.c. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 14:45:31 +09:00
Eric Paris	a8f80e8ff9	Networking: use CAP_NET_ADMIN when deciding to call request_module The networking code checks CAP_SYS_MODULE before using request_module() to try to load a kernel module. While this seems reasonable it's actually weakening system security since we have to allow CAP_SYS_MODULE for things like /sbin/ip and bluetoothd which need to be able to trigger module loads. CAP_SYS_MODULE actually grants those binaries the ability to directly load any code into the kernel. We should instead be protecting modprobe and the modules on disk, rather than granting random programs the ability to load code directly into the kernel. Instead we are going to gate those networking checks on CAP_NET_ADMIN which still limits them to root but which does not grant those processes the ability to load arbitrary code into the kernel. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Paul Moore <paul.moore@hp.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: James Morris <jmorris@namei.org>	2009-08-14 11:18:34 +10:00
Patrick McHardy	dc05a564ab	Merge branch 'master' of git://dev.medozas.de/linux	2009-08-10 17:14:59 +02:00
Jan Engelhardt	e2fe35c17f	netfilter: xtables: check for standard verdicts in policies This adds the second check that Rusty wanted to have a long time ago. :-) Base chain policies must have absolute verdicts that cease processing in the table, otherwise rule execution may continue in an unexpected spurious fashion (e.g. next chain that follows in memory). Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2009-08-10 13:35:31 +02:00
Jan Engelhardt	90e7d4ab5c	netfilter: xtables: check for unconditionality of policies This adds a check that iptables's original author Rusty set forth in a FIXME comment. Underflows in iptables are better known as chain policies, and are required to be unconditional or there would be a stochastical chance for the policy rule to be skipped if it does not match. If that were to happen, rule execution would continue in an unexpected spurious fashion. Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2009-08-10 13:35:29 +02:00
Jan Engelhardt	a7d51738e7	netfilter: xtables: ignore unassigned hooks in check_entry_size_and_hooks The "hook_entry" and "underflow" array contains values even for hooks not provided, such as PREROUTING in conjunction with the "filter" table. Usually, the values point to whatever the next rule is. For the upcoming unconditionality and underflow checking patches however, we must not inspect that arbitrary rule. Skipping unassigned hooks seems like a good idea, also because newinfo->hook_entry and newinfo->underflow will then continue to have the poison value for detecting abnormalities. Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2009-08-10 13:35:28 +02:00
Jan Engelhardt	47901dc2c4	netfilter: xtables: use memcmp in unconditional check Instead of inspecting each u32/char open-coded, clean up and make use of memcmp. On some arches, memcmp is implemented as assembly or GCC's __builtin_memcmp which can possibly take advantages of known alignment. Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2009-08-10 13:35:27 +02:00
Jan Engelhardt	e5afbba186	netfilter: iptables: remove unused datalen variable Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2009-08-10 13:35:25 +02:00
Jan Engelhardt	f88e6a8a50	netfilter: xtables: switch table AFs to nfproto Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2009-08-10 13:35:23 +02:00
Jan Engelhardt	24c232d8e9	netfilter: xtables: switch hook PFs to nfproto Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2009-08-10 13:35:21 +02:00
Jan Engelhardt	57750a22ed	netfilter: conntrack: switch hook PFs to nfproto Simple substitution to indicate that the fields indeed use the NFPROTO_ space. Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2009-08-10 13:35:20 +02:00
Rafael Laufer	549812799c	netfilter: nf_conntrack: add SCTP support for SO_ORIGINAL_DST Signed-off-by: Patrick McHardy <kaber@trash.net>	2009-08-10 10:08:27 +02:00
Jan Engelhardt	36cbd3dcc1	net: mark read-only arrays as const String literals are constant, and usually, we can also tag the array of pointers const too, moving it to the .rodata section. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-05 10:42:58 -07:00
Randy Dunlap	f816700aa9	xfrm4: fix build when SYSCTLs are disabled Fix build errors when SYSCTLs are not enabled: (.init.text+0x5154): undefined reference to `net_ipv4_ctl_path' (.init.text+0x5176): undefined reference to `register_net_sysctl_table' xfrm4_policy.c:(.exit.text+0x573): undefined reference to `unregister_net_sysctl_table Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-08-04 20:18:33 -07:00
David S. Miller	df597efb57	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-3945.h drivers/net/wireless/iwlwifi/iwl-tx.c drivers/net/wireless/iwlwifi/iwl3945-base.c	2009-07-30 19:22:43 -07:00

1 2 3 4 5 ...

3403 Commits