Commit Graph

4177 Commits

Author SHA1 Message Date
Ilpo Järvinen
d254bcdbf2 [TCP]: Fixes IW > 2 cases when TCP is application limited
Whenever a transfer is application limited, we are allowed at least
initial window worth of data per window unless cwnd is previously
less than that.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-04 22:59:52 -07:00
Alexey Dobriyan
29bbd72d6e [NET]: Fix more per-cpu typos
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 15:02:31 -07:00
Catherine Zhang
dc49c1f94e [AF_UNIX]: Kernel memory leak fix for af_unix datagram getpeersec patch
From: Catherine Zhang <cxzhang@watson.ibm.com>

This patch implements a cleaner fix for the memory leak problem of the
original unix datagram getpeersec patch.  Instead of creating a
security context each time a unix datagram is sent, we only create the
security context when the receiver requests it.

This new design requires modification of the current
unix_getsecpeer_dgram LSM hook and addition of two new hooks, namely,
secid_to_secctx and release_secctx.  The former retrieves the security
context and the latter releases it.  A hook is required for releasing
the security context because it is up to the security module to decide
how that's done.  In the case of Selinux, it's a simple kfree
operation.

Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 14:12:06 -07:00
Wei Dong
dafee49085 [IPV6]: SNMPv2 "ipv6IfStatsOutFragCreates" counter error
When I tested linux kernel 2.6.71.7 about statistics
"ipv6IfStatsOutFragCreates", and found that it couldn't increase
correctly. The criteria is RFC 2465:

  ipv6IfStatsOutFragCreates OBJECT-TYPE
      SYNTAX      Counter32
      MAX-ACCESS  read-only
      STATUS      current
      DESCRIPTION
         "The number of output datagram fragments that have
         been generated as a result of fragmentation at
         this output interface."
      ::= { ipv6IfStatsEntry 15 }

I think there are two issues in Linux kernel. 
1st:
RFC2465 specifies the counter is "The number of output datagram
fragments...". I think increasing this counter after output a fragment
successfully is better. And it should not be increased even though a
fragment is created but failed to output.

2nd:
If we send a big ICMP/ICMPv6 echo request to a host, and receive
ICMP/ICMPv6 echo reply consisted of some fragments. As we know that in
Linux kernel first fragmentation occurs in ICMP layer(maybe saying
transport layer is better), but this is not the "real"
fragmentation,just do some "pre-fragment" -- allocate space for date,
and form a frag_list, etc. The "real" fragmentation happens in IP layer
-- set offset and MF flag and so on. So I think in "fast path" for
ip_fragment/ip6_fragment, if we send a fragment which "pre-fragment" by
upper layer we should also increase "ipv6IfStatsOutFragCreates".

Signed-off-by: Wei Dong <weid@nanjing-fnst.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 13:41:21 -07:00
Patrick McHardy
3ab720881b [NETFILTER]: xt_hashlimit/xt_string: missing string validation
The hashlimit table name and the textsearch algorithm need to be
terminated, the textsearch pattern length must not exceed the
maximum size.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 13:38:29 -07:00
Patrick McHardy
b10866fd7d [NETFILTER]: SIP helper: expect RTP streams in both directions
Since we don't know in which direction the first packet will arrive, we
need to create one expectation for each direction, which is currently
prevented by max_expected beeing set to 1.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 13:38:28 -07:00
David S. Miller
52499afe40 [TCP]: Process linger2 timeout consistently.
Based upon guidance from Alexey Kuznetsov.

When linger2 is active, we check to see if the fin_wait2
timeout is longer than the timewait.  If it is, we schedule
the keepalive timer for the difference between the timewait
timeout and the fin_wait2 timeout.

When this orphan socket is seen by tcp_keepalive_timer()
it will try to transform this fin_wait2 socket into a
fin_wait2 mini-socket, again if linger2 is active.

Not all paths were setting this initial keepalive timer correctly.
The tcp input path was doing it correctly, but tcp_close() wasn't,
potentially making the socket linger longer than it really needs to.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 13:38:24 -07:00
Tom Tucker
8d71740c56 [NET]: Core net changes to generate netevents
Generate netevents for:
- neighbour changes
- routing redirects
- pmtu changes

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 13:38:21 -07:00
Wei Yongjun
3687b1dc6f [TCP]: SNMPv2 tcpAttemptFails counter error
Refer to RFC2012, tcpAttemptFails is defined as following:
  tcpAttemptFails OBJECT-TYPE
      SYNTAX      Counter32
      MAX-ACCESS  read-only
      STATUS      current
      DESCRIPTION
              "The number of times TCP connections have made a direct
              transition to the CLOSED state from either the SYN-SENT
              state or the SYN-RCVD state, plus the number of times TCP
              connections have made a direct transition to the LISTEN
              state from the SYN-RCVD state."
      ::= { tcp 7 }

When I lookup into RFC793, I found that the state change should occured
under following condition:
  1. SYN-SENT -> CLOSED
     a) Received ACK,RST segment when SYN-SENT state.

  2. SYN-RCVD -> CLOSED
     b) Received SYN segment when SYN-RCVD state(came from LISTEN).
     c) Received RST segment when SYN-RCVD state(came from SYN-SENT).
     d) Received SYN segment when SYN-RCVD state(came from SYN-SENT).

  3. SYN-RCVD -> LISTEN
     e) Received RST segment when SYN-RCVD state(came from LISTEN).

In my test, those direct state transition can not be counted to
tcpAttemptFails.

Signed-off-by: Wei Yongjun <yjwei@nanjing-fnst.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 13:38:19 -07:00
James Morris
118075b3cd [TCP]: fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()
Based upon a patch by Jesper Juhl.

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-02 13:38:18 -07:00
Tetsuo Handa
f59fc7f30b [IPV4/IPV6]: Setting 0 for unused port field in RAW IP recvmsg().
From: Tetsuo Handa from-linux-kernel@i-love.sakura.ne.jp

The recvmsg() for raw socket seems to return random u16 value
from the kernel stack memory since port field is not initialized.
But I'm not sure this patch is correct.
Does raw socket return any information stored in port field?

[ BSD defines RAW IP recvmsg to return a sin_port value of zero.
  This is described in Steven's TCP/IP Illustrated Volume 2 on
  page 1055, which is discussing the BSD rip_input() implementation. ]
    
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-25 17:05:35 -07:00
Alexey Kuznetsov
7228749092 [IPV4] ipmr: ip multicast route bug fix.
IP multicast route code was reusing an skb which causes use after free
and double free.

From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>

Note, it is real skb_clone(), not alloc_skb(). Equeued skb contains
the whole half-prepared netlink message plus room for the rest.
It could be also skb_copy(), if we want to be puristic about mangling
cloned data, but original copy is really not going to be used.  

Acked-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-25 16:45:12 -07:00
Guillaume Chazarain
d569f1d72f [IPV4]: Clear the whole IPCB, this clears also IPCB(skb)->flags.
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-24 23:45:16 -07:00
Patrick McHardy
8cf8fb5687 [NETFILTER]: SNMP NAT: fix byteorder confusion
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-24 22:53:35 -07:00
Adrian Bunk
72b5582359 [NETFILTER]: conntrack: fix SYSCTL=n compile
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-24 22:53:12 -07:00
Patrick McHardy
083edca05a [NETFILTER]: H.323 helper: fix possible NULL-ptr dereference
An RCF message containing a timeout results in a NULL-ptr dereference if
no RRQ has been seen before.

Noticed by the "SATURN tool", reported by Thomas Dillig <tdillig@stanford.edu>
and Isil Dillig <isil@stanford.edu>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-24 22:52:10 -07:00
Patrick McHardy
8265abc082 [IPV4]: Fix nexthop realm dumping for multipath routes
Routing realms exist per nexthop, but are only returned to userspace
for the first nexthop. This is due to the fact that iproute2 only
allows to set the realm for the first nexthop and the kernel refuses
multipath routes where only a single realm is present.

Dump all realms for multipath routes to enable iproute to correctly
display them.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-21 15:09:55 -07:00
Panagiotis Issaris
0da974f4f3 [NET]: Conversions from kmalloc+memset to k(z|c)alloc.
Signed-off-by: Panagiotis Issaris <takis@issaris.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-21 14:51:30 -07:00
Herbert Xu
5d9c5a3292 [IPV4]: Get rid of redundant IPCB->opts initialisation
Now that we always zero the IPCB->opts in ip_rcv, it is no longer
necessary to do so before calling netif_rx for tunneled packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-21 14:29:53 -07:00
Stephen Hemminger
53602f92dd [IPV4]: Clear skb cb on IP input
when data arrives at IP through loopback (and possibly other devices).
So the field needs to be cleared before it confuses the route code.
This was seen when running netem over loopback, but there are probably
other device cases. Maybe this should go into stable?

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-14 14:49:32 -07:00
Herbert Xu
b47b2ec198 [IPV4]: Fix error handling for fib_insert_node call
The error handling around fib_insert_node was broken because we always
zeroed the error before checking it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-12 13:59:04 -07:00
Herbert Xu
da952315c9 [IPCOMP]: Fix truesize after decompression
The truesize check has uncovered the fact that we forgot to update truesize
after pskb_expand_head.  Unfortunately pskb_expand_head can't update it for
us because it's used in all sorts of different contexts, some of which would
not allow truesize to be updated by itself.

So the solution for now is to simply update it in IPComp.

This patch also changes skb_put to __skb_put since we've just expanded
tailroom by exactly that amount so we know it's there (but gcc does not).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-12 13:58:55 -07:00
Xiaoliang (David) Wei
6150c22e2a [TCP] tcp_highspeed: Fix AI updates.
I think there is still a problem with the AIMD parameter update in
HighSpeed TCP code.

Line 125~138 of the code (net/ipv4/tcp_highspeed.c):

	/* Update AIMD parameters */
	if (tp->snd_cwnd > hstcp_aimd_vals[ca->ai].cwnd) {
		while (tp->snd_cwnd > hstcp_aimd_vals[ca->ai].cwnd &&
		       ca->ai < HSTCP_AIMD_MAX - 1)
			ca->ai++;
	} else if (tp->snd_cwnd < hstcp_aimd_vals[ca->ai].cwnd) {
		while (tp->snd_cwnd > hstcp_aimd_vals[ca->ai].cwnd &&
		       ca->ai > 0)
			ca->ai--;

In fact, the second part (decreasing ca->ai) never decreases since the
while loop's inequality is in the reverse direction. This leads to
unfairness with multiple flows (once a flow happens to enjoy a higher
ca->ai, it keeps enjoying that even its cwnd decreases)

Here is a tentative fix (I also added a comment, trying to keep the
change clear):

Acked-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-12 13:58:50 -07:00
David S. Miller
c427d27452 [TCP]: Remove TCP Compound
This reverts: f890f92104

The inclusion of TCP Compound needs to be reverted at this time
because it is not 100% certain that this code conforms to the
requirements of Developer's Certificate of Origin 1.1 paragraph (b).

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-10 14:50:35 -07:00
Herbert Xu
7466d90f85 [IPV4] inetpeer: Get rid of volatile from peer_total
The variable peer_total is protected by a lock.  The volatile marker
makes no sense.  This shaves off 20 bytes on i386.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-10 14:50:30 -07:00
Patrick McHardy
26e0fd1ce2 [NET]: Fix IPv4/DECnet routing rule dumping
When more rules are present than fit in a single skb, the remaining
rules are incorrectly skipped.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-08 13:38:55 -07:00
Herbert Xu
a430a43d08 [NET] gso: Fix up GSO packets with broken checksums
Certain subsystems in the stack (e.g., netfilter) can break the partial
checksum on GSO packets.  Until they're fixed, this patch allows this to
work by recomputing the partial checksums through the GSO mechanism.

Once they've all been converted to update the partial checksum instead of
clearing it, this workaround can be removed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-08 13:34:56 -07:00
Herbert Xu
89114afd43 [NET] gso: Add skb_is_gso
This patch adds the wrapper function skb_is_gso which can be used instead
of directly testing skb_shinfo(skb)->gso_size.  This makes things a little
nicer and allows us to change the primary key for indicating whether an skb
is GSO (if we ever want to do that).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-08 13:34:32 -07:00
Herbert Xu
bbcf467dab [NET]: Verify gso_type too in gso_segment
We don't want nasty Xen guests to pass a TCPv6 packet in with gso_type set
to TCPv4 or even UDP (or a packet that's both TCP and UDP).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-03 19:38:35 -07:00
Ingo Molnar
c636618485 [PATCH] lockdep: annotate bh_lock_sock()
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:08 -07:00
Ingo Molnar
6205120044 [PATCH] lockdep: fix RT_HASH_LOCK_SZ
On lockdep we have a quite big spinlock_t, so keep the size down.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:05 -07:00
Ingo Molnar
8a25d5debf [PATCH] lockdep: prove spinlock rwlock locking correctness
Use the lock validator framework to prove spinlock and rwlock locking
correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
e4d9191885 [PATCH] lockdep: locking init debugging improvement
Locking init improvement:

 - introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
   to pass in the name string of locks, used by debugging

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:02 -07:00
Linus Torvalds
e37a72de84 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [IPV6]: Added GSO support for TCPv6
  [NET]: Generalise TSO-specific bits from skb_setup_caps
  [IPV6]: Added GSO support for TCPv6
  [IPV6]: Remove redundant length check on input
  [NETFILTER]: SCTP conntrack: fix crash triggered by packet without chunks
  [TG3]: Update version and reldate
  [TG3]: Add TSO workaround using GSO
  [TG3]: Turn on hw fix for ASF problems
  [TG3]: Add rx BD workaround
  [TG3]: Add tg3_netif_stop() in vlan functions
  [TCP]: Reset gso_segs if packet is dodgy
2006-06-30 15:40:17 -07:00
Herbert Xu
f83ef8c0b5 [IPV6]: Added GSO support for TCPv6
This patch adds GSO support for IPv6 and TCPv6.  This is based on a patch
by Ananda Raju <Ananda.Raju@neterion.com>.  His original description is:

	This patch enables TSO over IPv6. Currently Linux network stacks
	restricts TSO over IPv6 by clearing of the NETIF_F_TSO bit from
	"dev->features". This patch will remove this restriction.

	This patch will introduce a new flag NETIF_F_TSO6 which will be used
	to check whether device supports TSO over IPv6. If device support TSO
	over IPv6 then we don't clear of NETIF_F_TSO and which will make the
	TCP layer to create TSO packets. Any device supporting TSO over IPv6
	will set NETIF_F_TSO6 flag in "dev->features" along with NETIF_F_TSO.

	In case when user disables TSO using ethtool, NETIF_F_TSO will get
	cleared from "dev->features". So even if we have NETIF_F_TSO6 we don't
	get TSO packets created by TCP layer.

	SKB_GSO_TCPV4 renamed to SKB_GSO_TCP to make it generic GSO packet.
	SKB_GSO_UDPV4 renamed to SKB_GSO_UDP as UFO is not a IPv4 feature.
	UFO is supported over IPv6 also

	The following table shows there is significant improvement in
	throughput with normal frames and CPU usage for both normal and jumbo.

	--------------------------------------------------
	|          |     1500        |      9600         |
	|          ------------------|-------------------|
	|          | thru     CPU    |  thru     CPU     |
	--------------------------------------------------
	| TSO OFF  | 2.00   5.5% id  |  5.66   20.0% id  |
	--------------------------------------------------
	| TSO ON   | 2.63   78.0 id  |  5.67   39.0% id  |
	--------------------------------------------------

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-30 14:12:10 -07:00
Herbert Xu
bcd7611117 [NET]: Generalise TSO-specific bits from skb_setup_caps
This patch generalises the TSO-specific bits from sk_setup_caps by adding
the sk_gso_type member to struct sock.  This makes sk_setup_caps generic
so that it can be used by TCPv6 or UFO.

The only catch is that whoever uses this must provide a GSO implementation
for their protocol which I think is a fair deal :) For now UFO continues to
live without a GSO implementation which is OK since it doesn't use the sock
caps field at the moment.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-30 14:12:08 -07:00
Herbert Xu
adcfc7d0b4 [IPV6]: Added GSO support for TCPv6
This patch adds GSO support for IPv6 and TCPv6.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-30 14:12:06 -07:00
Patrick McHardy
dd7271feba [NETFILTER]: SCTP conntrack: fix crash triggered by packet without chunks
When a packet without any chunks is received, the newconntrack variable
in sctp_packet contains an out of bounds value that is used to look up an
pointer from the array of timeouts, which is then dereferenced, resulting
in a crash. Make sure at least a single chunk is present.

Problem noticed by George A. Theall <theall@tenablesecurity.com>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-30 14:12:01 -07:00
Herbert Xu
3820c3f3e4 [TCP]: Reset gso_segs if packet is dodgy
I wasn't paranoid enough in verifying GSO information.  A bogus gso_segs
could upset drivers as much as a bogus header would.  Let's reset it in
the per-protocol gso_segment functions.

I didn't verify gso_size because that can be verified by the source of
the dodgy packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-30 14:11:47 -07:00
Jörn Engel
6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Matt LaPlante
c22751b73a [NETFILTE] ipv4: Fix typo (Bugzilla #6753)
This patch fixes bugzilla #6753, a typo in the netfilter Kconfig

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:58:28 -07:00
Michael Chan
b0da853703 [NET]: Add ECN support for TSO
In the current TSO implementation, NETIF_F_TSO and ECN cannot be
turned on together in a TCP connection.  The problem is that most
hardware that supports TSO does not handle CWR correctly if it is set
in the TSO packet.  Correct handling requires CWR to be set in the
first packet only if it is set in the TSO header.

This patch adds the ability to turn on NETIF_F_TSO and ECN using
GSO if necessary to handle TSO packets with CWR set.  Hardware
that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev->
features flag.

All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set.  If
the output device does not have the NETIF_F_TSO_ECN feature set, GSO
will split the packet up correctly with CWR only set in the first
segment.

With help from Herbert Xu <herbert@gondor.apana.org.au>.

Since ECN can always be enabled with TSO, the SOCK_NO_LARGESEND sock
flag is completely removed.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:58:08 -07:00
Sridhar Samudrala
47da8ee681 [TCP]: Export accept queue len of a TCP listening socket via rx_queue
While debugging a TCP server hang issue, we noticed that currently there is
no way for a user to get the acceptq backlog value for a TCP listen socket.

All the standard networking utilities that display socket info like netstat,
ss and /proc/net/tcp have 2 fields called rx_queue and tx_queue. These
fields do not mean much for listening sockets. This patch uses one of these
unused fields(rx_queue) to export the accept queue len for listening sockets.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:57 -07:00
Darrel Goeddel
c7bdb545d2 [NETLINK]: Encapsulate eff_cap usage within security framework.
This patch encapsulates the usage of eff_cap (in netlink_skb_params) within
the security framework by extending security_netlink_recv to include a required
capability parameter and converting all direct usage of eff_caps outside
of the lsm modules to use the interface.  It also updates the SELinux
implementation of the security_netlink_send and security_netlink_recv
hooks to take advantage of the sid in the netlink_skb_params struct.
This also enables SELinux to perform auditing of netlink capability checks.
Please apply, for 2.6.18 if possible.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by:  James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:55 -07:00
Herbert Xu
576a30eb64 [NET]: Added GSO header verification
When GSO packets come from an untrusted source (e.g., a Xen guest domain),
we need to verify the header integrity before passing it to the hardware.

Since the first step in GSO is to verify the header, we can reuse that
code by adding a new bit to gso_type: SKB_GSO_DODGY.  Packets with this
bit set can only be fed directly to devices with the corresponding bit
NETIF_F_GSO_ROBUST.  If the device doesn't have that bit, then the skb
is fed to the GSO engine which will allow the packet to be sent to the
hardware if it passes the header check.

This patch changes the sg flag to a full features flag.  The same method
can be used to implement TSO ECN support.  We simply have to mark packets
with CWR set with SKB_GSO_ECN so that only hardware with a corresponding
NETIF_F_TSO_ECN can accept them.  The GSO engine can either fully segment
the packet, or segment the first MTU and pass the rest to the hardware for
further segmentation.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:53 -07:00
Patrick McHardy
ef47c6a7b8 [NETFILTER]: ip_queue/nfnetlink_queue: drop bridge port references when dev disappears
When a device that is acting as a bridge port is unregistered, the
ip_queue/nfnetlink_queue notifier doesn't check if its one of
physindev/physoutdev and doesn't release the references if it is.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:48 -07:00
Patrick McHardy
da298d3a4f [NETFILTER]: x_tables: fix xt_register_table error propagation
When xt_register_table fails the error is not properly propagated back.
Based on patch by Lepton Wu <ytht.net@gmail.com>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:40 -07:00
Herbert Xu
0718bcc09b [NET]: Fix CHECKSUM_HW GSO problems.
Fix checksum problems in the GSO code path for CHECKSUM_HW packets.

The ipv4 TCP pseudo header checksum has to be adjusted for GSO
segmented packets.

The adjustment is needed because the length field in the pseudo-header
changes.  However, because we have the inequality oldlen > newlen, we
know that delta = (u16)~oldlen + newlen is still a 16-bit quantity.
This also means that htonl(delta) + th->check still fits in 32 bits.
Therefore we don't have to use csum_add on this operations.

This is based on a patch by Michael Chan <mchan@broadcom.com>.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-25 23:55:46 -07:00
Paul Mackerras
bfe5d83419 [PATCH] Define __raw_get_cpu_var and use it
There are several instances of per_cpu(foo, raw_smp_processor_id()), which
is semantically equivalent to __get_cpu_var(foo) but without the warning
that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled.  For
those architectures with optimized per-cpu implementations, namely ia64,
powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower
code than __get_cpu_var(), so it would be preferable to use __get_cpu_var
on those platforms.

This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x,
raw_smp_processor_id()) on architectures that use the generic per-cpu
implementation, and turns into __get_cpu_var(x) on the architectures that
have an optimized per-cpu implementation.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:01 -07:00
Andrew Morton
fb1bb34d45 [PATCH] remove for_each_cpu()
Convert a few stragglers over to for_each_possible_cpu(), remove
for_each_cpu().

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:00:54 -07:00
Herbert Xu
09b8f7a93e [IPSEC]: Handle GSO packets
This patch segments GSO packets received by the IPsec stack.  This can
happen when a NIC driver injects GSO packets into the stack which are
then forwarded to another host.

The primary application of this is going to be Xen where its backend
driver may inject GSO packets into dom0.

Of course this also can be used by other virtualisation schemes such as
VMWare or UML since the tap device could be modified to inject GSO packets
received through splice.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23 02:07:38 -07:00
Herbert Xu
f4c50d990d [NET]: Add software TSOv4
This patch adds the GSO implementation for IPv4 TCP.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23 02:07:33 -07:00
Herbert Xu
7967168cef [NET]: Merge TSO/UFO fields in sk_buff
Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not
going to scale if we add any more segmentation methods (e.g., DCCP).  So
let's merge them.

They were used to tell the protocol of a packet.  This function has been
subsumed by the new gso_type field.  This is essentially a set of netdev
feature bits (shifted by 16 bits) that are required to process a specific
skb.  As such it's easy to tell whether a given device can process a GSO
skb: you just have to and the gso_type field and the netdev's features
field.

I've made gso_type a conjunction.  The idea is that you have a base type
(e.g., SKB_GSO_TCPV4) that can be modified further to support new features.
For example, if we add a hardware TSO type that supports ECN, they would
declare NETIF_F_TSO | NETIF_F_TSO_ECN.  All TSO packets with CWR set would
have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO
packets would be SKB_GSO_TCPV4.  This means that only the CWR packets need
to be emulated in software.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23 02:07:29 -07:00
Linus Torvalds
4c84a39c8a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (46 commits)
  IB/uverbs: Don't serialize with ib_uverbs_idr_mutex
  IB/mthca: Make all device methods truly reentrant
  IB/mthca: Fix memory leak on modify_qp error paths
  IB/uverbs: Factor out common idr code
  IB/uverbs: Don't decrement usecnt on error paths
  IB/uverbs: Release lock on error path
  IB/cm: Use address handle helpers
  IB/sa: Add ib_init_ah_from_path()
  IB: Add ib_init_ah_from_wc()
  IB/ucm: Get rid of duplicate P_Key parameter
  IB/srp: Factor out common request reset code
  IB/srp: Support SRP rev. 10 targets
  [SCSI] srp.h: Add I/O Class values
  IB/fmr: Use device's max_map_map_per_fmr attribute in FMR pool.
  IB/mthca: Fill in max_map_per_fmr device attribute
  IB/ipath: Add client reregister event generation
  IB/mthca: Add client reregister event generation
  IB: Move struct port_info from ipath to <rdma/ib_smi.h>
  IPoIB: Handle client reregister events
  IB: Add client reregister event type
  ...
2006-06-19 19:01:59 -07:00
Herbert Xu
8648b3053b [NET]: Add NETIF_F_GEN_CSUM and NETIF_F_ALL_CSUM
The current stack treats NETIF_F_HW_CSUM and NETIF_F_NO_CSUM
identically so we test for them in quite a few places.  For the sake
of brevity, I'm adding the macro NETIF_F_GEN_CSUM for these two.  We
also test the disjunct of NETIF_F_IP_CSUM and the other two in various
places, for that purpose I've added NETIF_F_ALL_CSUM.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 22:06:05 -07:00
David S. Miller
35089bb203 [TCP]: Add tcp_slow_start_after_idle sysctl.
A lot of people have asked for a way to disable tcp_cwnd_restart(),
and it seems reasonable to add a sysctl to do that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:53 -07:00
Luca De Cicco
bc726a71d2 [TCP] Westwood: reset RTT min after FRTO
RTT_min is updated each time a timeout event occurs
in order to cope with hard handovers in wireless scenarios such as UMTS.

Signed-off-by: Luca De Cicco <ldecicco@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@dxpl.pdx.osdl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:38 -07:00
Luca De Cicco
b3a92eabe5 [TCP] Westwood: bandwidth filter startup
The bandwidth estimate filter is now initialized with the first
sample in order to have better performances in the case of small
file transfers.

Signed-off-by: Luca De Cicco <ldecicco@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@dxpl.pdx.osdl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:36 -07:00
Luca De Cicco
b7d7a9e3c9 [TCP] Westwood: comment fixes
Cleanup some comments and add more references

Signed-off-by: Luca De Cicco <ldecicco@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@dxpl.pdx.osdl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:34 -07:00
Stephen Hemminger
f61e29018a [TCP] Westwood: fix first sample
Need to update send sequence number tracking after first ack.
Rework of patch from Luca De Cicco.

Signed-off-by: Stephen Hemminger <shemminger@dxpl.pdx.osdl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:32 -07:00
Stephen Hemminger
bdeb04c6d9 [NET]: net.ipv4.ip_autoconfig sysctl removal
The sysctl net.ipv4.ip_autoconfig is a legacy value that is not used.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:30 -07:00
Herbert Xu
364c6badde [NET]: Clean up skb_linearize
The linearisation operation doesn't need to be super-optimised.  So we can
replace __skb_linearize with __pskb_pull_tail which does the same thing but
is more general.

Also, most users of skb_linearize end up testing whether the skb is linear
or not so it helps to make skb_linearize do just that.

Some callers of skb_linearize also use it to copy cloned data, so it's
useful to have a new function skb_linearize_cow to copy the data if it's
either non-linear or cloned.

Last but not least, I've removed the gfp argument since nobody uses it
anymore.  If it's ever needed we can easily add it back.

Misc bugs fixed by this patch:

* via-velocity error handling (also, no SG => no frags)

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:16 -07:00
Patrick McHardy
bf0857ea32 [NETFILTER]: hashlimit match: fix random initialization
hashlimit does:

        if (!ht->rnd)
                get_random_bytes(&ht->rnd, 4);

ignoring that 0 is also a valid random number.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:11 -07:00
Patrick McHardy
2b2283d030 [NETFILTER]: recent match: missing refcnt initialization
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:09 -07:00
Patrick McHardy
a0e889bb1b [NETFILTER]: recent match: fix "sleeping function called from invalid context"
create_proc_entry must not be called with locks held. Use a mutex
instead to protect data only changed in user context.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:07 -07:00
James Morris
7c9728c393 [SECMARK]: Add secmark support to conntrack
Add a secmark field to IP and NF conntracks, so that security markings
on packets can be copied to their associated connections, and also
copied back to packets as required.  This is similar to the network
mark field currently used with conntrack, although it is intended for
enforcement of security policy rather than network policy.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:30:01 -07:00
James Morris
984bc16cc9 [SECMARK]: Add secmark support to core networking.
Add a secmark field to the skbuff structure, to allow security subsystems to
place security markings on network packets.  This is similar to the nfmark
field, except is intended for implementing security policy, rather than than
networking policy.

This patch was already acked in principle by Dave Miller.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:57 -07:00
David S. Miller
f86502bfc1 [IPV4] icmp: Kill local 'ip' arg in icmp_redirect().
It is typed wrong, and it's only assigned and used once.
So just pass in iph->daddr directly which fixes both problems.

Based upon a patch by Alexey Dobriyan.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:41 -07:00
Alexey Dobriyan
6d74165350 [IPV4]: Right prototype of __raw_v4_lookup()
All users pass 32-bit values as addresses and internally they're
compared with 32-bit entities. So, change "laddr" and "raddr" types to
__be32.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:39 -07:00
Alexey Dobriyan
338fcf9886 [IPV4] igmp: Fixup struct ip_mc_list::multiaddr type
All users except two expect 32-bit big-endian value. One is of

	->multiaddr = ->multiaddr

variety. And last one is "%08lX".

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:37 -07:00
David S. Miller
70df2311ee [TCP]: Fix compile warning in tcp_probe.c
The suseconds_t et al. are not necessarily any particular type on
every platform, so cast to unsigned long so that we can use one printf
format string and avoid warnings across the board

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:35 -07:00
Stephen Hemminger
738980ffa6 [TCP]: Limited slow start for Highspeed TCP
Implementation of RFC3742 limited slow start. Added as part
of the TCP highspeed congestion control module.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:33 -07:00
Stephen Hemminger
a42e9d6ce8 [TCP]: TCP Probe congestion window tracing
This adds a new module for tracking TCP state variables non-intrusively
using kprobes.  It has a simple /proc interface that outputs one line
for each packet received. A sample usage is to collect congestion
window and ssthresh over time graphs.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:31 -07:00
Stephen Hemminger
72dc5b9225 [TCP]: Minimum congestion window consolidation.
Many of the TCP congestion methods all just use ssthresh
as the minimum congestion window on decrease.  Rather than
duplicating the code, just have that be the default if that
handle in the ops structure is not set.

Minor behaviour change to TCP compound.  It probably wants
to use this (ssthresh) as lower bound, rather than ssthresh/2
because the latter causes undershoot on loss.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:29 -07:00
Stephen Hemminger
a4ed258495 [TCP]: TCP Compound quad root function
The original code did a 64 bit divide directly, which won't work on
32 bit platforms.  Rather than doing a 64 bit square root twice,
just implement a 4th root function in one pass using Newton's method.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:27 -07:00
Angelo P. Castellani
f890f92104 [TCP]: TCP Compound congestion control
TCP Compound is a sender-side only change to TCP that uses
a mixed Reno/Vegas approach to calculate the cwnd.

For further details look here:
  ftp://ftp.research.microsoft.com/pub/tr/TR-2005-86.pdf

Signed-off-by: Angelo P. Castellani <angelo.castellani@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:25 -07:00
Bin Zhou
76f1017757 [TCP]: TCP Veno congestion control
TCP Veno module is a new congestion control module to improve TCP
performance over wireless networks. The key innovation in TCP Veno is
the enhancement of TCP Reno/Sack congestion control algorithm by using
the estimated state of a connection based on TCP Vegas. This scheme
significantly reduces "blind" reduction of TCP window regardless of
the cause of packet loss.

This work is based on the research paper "TCP Veno: TCP Enhancement
for Transmission over Wireless Access Networks." C. P. Fu, S. C. Liew,
IEEE Journal on Selected Areas in Communication, Feb. 2003.

Original paper and many latest research works on veno:
 http://www.ntu.edu.sg/home/ascpfu/veno/veno.html

Signed-off-by: Bin Zhou <zhou0022@ntu.edu.sg>
	       Cheng Peng Fu <ascpfu@ntu.edu.sg>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:23 -07:00
Wong Hoi Sing Edison
7c106d7e78 [TCP]: TCP Low Priority congestion control
TCP Low Priority is a distributed algorithm whose goal is to utilize only
 the excess network bandwidth as compared to the ``fair share`` of
 bandwidth as targeted by TCP. Available from:
   http://www.ece.rice.edu/~akuzma/Doc/akuzma/TCP-LP.pdf

Original Author:
 Aleksandar Kuzmanovic <akuzma@northwestern.edu>

See http://www-ece.rice.edu/networks/TCP-LP/ for their implementation.
As of 2.6.13, Linux supports pluggable congestion control algorithms.
Due to the limitation of the API, we take the following changes from
the original TCP-LP implementation:
 o We use newReno in most core CA handling. Only add some checking
   within cong_avoid.
 o Error correcting in remote HZ, therefore remote HZ will be keeped
   on checking and updating.
 o Handling calculation of One-Way-Delay (OWD) within rtt_sample, sicne
   OWD have a similar meaning as RTT. Also correct the buggy formular.
 o Handle reaction for Early Congestion Indication (ECI) within
   pkts_acked, as mentioned within pseudo code.
 o OWD is handled in relative format, where local time stamp will in
   tcp_time_stamp format.

Port from 2.4.19 to 2.6.16 as module by:
 Wong Hoi Sing Edison <hswong3i@gmail.com>
 Hung Hing Lun <hlhung3i@gmail.com>

Signed-off-by: Wong Hoi Sing Edison <hswong3i@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:21 -07:00
Alexey Dobriyan
c45fb1089e [NETFILTER]: PPTP helper: fixup gre_keymap_lookup() return type
GRE keys are 16-bit wide.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:17 -07:00
Patrick McHardy
ae5b7d8ba2 [NETFILTER]: Add SIP connection tracking helper
Add SIP connection tracking helper. Originally written by
Christian Hentschel <chentschel@arnet.com.ar>, some cleanup, minor
fixes and bidirectional SIP support added by myself.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:15 -07:00
Patrick McHardy
e44ab66a75 [NETFILTER]: H.323 helper: replace internal_net_addr parameter by routing-based heuristic
Call Forwarding doesn't need to create an expectation if both peers can
reach each other without our help. The internal_net_addr parameter
lets the user explicitly specify a single network where this is true,
but is not very flexible and even fails in the common case that calls
will both be forwarded to outside parties and inside parties. Use an
optional heuristic based on routing instead, the assumption is that
if bpth the outgoing device and the gateway are equal, both peers can
reach each other directly.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:13 -07:00
Jing Min Zhao
c0d4cfd96d [NETFILTER]: H.323 helper: Add support for Call Forwarding
Signed-off-by: Jing Min Zhao <zhaojingmin@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:11 -07:00
Patrick McHardy
c952616934 [NETFILTER]: amanda helper: convert to textsearch infrastructure
When a port number within a packet is replaced by a differently sized
number only the packet is resized, but not the copy of the data.
Following port numbers are rewritten based on their offsets within
the copy, leading to packet corruption.

Convert the amanda helper to the textsearch infrastructure to avoid
the copy entirely.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:09 -07:00
Patrick McHardy
7d8c501817 [NETFILTER]: FTP helper: search optimization
Instead of skipping search entries for the wrong direction simply index
them by direction.

Based on patch by Pablo Neira <pablo@netfilter.org>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:07 -07:00
Patrick McHardy
695ecea329 [NETFILTER]: SNMP helper: fix debug module param type
debug is the debug level, not a bool.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:05 -07:00
Patrick McHardy
89f2e21883 [NETFILTER]: ctnetlink: change table dumping not to require an unique ID
Instead of using the ID to find out where to continue dumping, take a
reference to the last entry dumped and try to continue there.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:03 -07:00
Patrick McHardy
3726add766 [NETFILTER]: ctnetlink: fix NAT configuration
The current configuration only allows to configure one manip and overloads
conntrack status flags with netlink semantic.

Signed-off-by: Patrick Mchardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:01 -07:00
Eric Leblond
997ae831ad [NETFILTER]: conntrack: add fixed timeout flag in connection tracking
Add a flag in a connection status to have a non updated timeout.
This permits to have connection that automatically die at a given
time.

Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:59 -07:00
Patrick McHardy
39a27a35c5 [NETFILTER]: conntrack: add sysctl to disable checksumming
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:57 -07:00
Patrick McHardy
6442f1cf89 [NETFILTER]: conntrack: don't call helpers for related ICMP messages
None of the existing helpers expects to get called for related ICMP
packets and some even drop them if they can't parse them.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:55 -07:00
Patrick McHardy
404bdbfd24 [NETFILTER]: recent match: replace by rewritten version
Replace the unmaintainable ipt_recent match by a rewritten version that
should be fully compatible.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:53 -07:00
Patrick McHardy
957dc80ac3 [NETFILTER]: x_tables: add SCTP/DCCP support where missing
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:47 -07:00
Patrick McHardy
3e72b2fe5b [NETFILTER]: x_tables: remove some unnecessary casts
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:45 -07:00
Herbert Xu
31a4ab9302 [IPSEC] proto: Move transport mode input path into xfrm_mode_transport
Now that we have xfrm_mode objects we can move the transport mode specific
input decapsulation code into xfrm_mode_transport.  This removes duplicate
code as well as unnecessary header movement in case of tunnel mode SAs
since we will discard the original IP header immediately.

This also fixes a minor bug for transport-mode ESP where the IP payload
length is set to the correct value minus the header length (with extension
headers for IPv6).

Of course the other neat thing is that we no longer have to allocate
temporary buffers to hold the IP headers for ESP and IPComp.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:41 -07:00
Herbert Xu
b59f45d0b2 [IPSEC] xfrm: Abstract out encapsulation modes
This patch adds the structure xfrm_mode.  It is meant to represent
the operations carried out by transport/tunnel modes.

By doing this we allow additional encapsulation modes to be added
without clogging up the xfrm_input/xfrm_output paths.

Candidate modes include 4-to-6 tunnel mode, 6-to-4 tunnel mode, and
BEET modes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:39 -07:00
Herbert Xu
546be2405b [IPSEC] xfrm: Undo afinfo lock proliferation
The number of locks used to manage afinfo structures can easily be reduced
down to one each for policy and state respectively.  This is based on the
observation that the write locks are only held by module insertion/removal
which are very rare events so there is no need to further differentiate
between the insertion of modules like ipv6 versus esp6.

The removal of the read locks in xfrm4_policy.c/xfrm6_policy.c might look
suspicious at first.  However, after you realise that nobody ever takes
the corresponding write lock you'll feel better :)

As far as I can gather it's an attempt to guard against the removal of
the corresponding modules.  Since neither module can be unloaded at all
we can leave it to whoever fixes up IPv6 unloading :)

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:37 -07:00
David S. Miller
15986e1aad [TCP]: tcp_rcv_rtt_measure_ts() call in pure-ACK path is superfluous
We only want to take receive RTT mesaurements for data
bearing frames, here in the header prediction fast path
for a pure-sender, we know that we have a pure-ACK and
thus the checks in tcp_rcv_rtt_mesaure_ts() will not pass.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:26:16 -07:00
Chris Leech
1a2449a87b [I/OAT]: TCP recv offload to I/OAT
Locks down user pages and sets up for DMA in tcp_recvmsg, then calls
dma_async_try_early_copy in tcp_v4_do_rcv

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:25:56 -07:00
Chris Leech
9593782585 [I/OAT]: Add a sysctl for tuning the I/OAT offloaded I/O threshold
Any socket recv of less than this ammount will not be offloaded

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:25:54 -07:00
Chris Leech
624d116473 [I/OAT]: Make sk_eat_skb I/OAT aware.
Add an extra argument to sk_eat_skb, and make it move early copied
packets to the async_wait_queue instead of freeing them.

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:25:52 -07:00
Chris Leech
0e4b4992b8 [I/OAT]: Rename cleanup_rbuf to tcp_cleanup_rbuf and make non-static
Needed to be able to call tcp_cleanup_rbuf in tcp_input.c for I/OAT

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:25:50 -07:00
Sean Hefty
a1e8733e55 [NET]: Export ip_dev_find()
Export ip_dev_find() to allow locating a net_device given an IP address.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2006-06-17 20:37:28 -07:00
Weidong
42d1d52e69 [IPV4]: Increment ipInHdrErrors when TTL expires.
Signed-off-by: Weidong <weid@nanjing-fnst.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-12 13:09:59 -07:00
Aki M Nyrhinen
79320d7e14 [TCP]: continued: reno sacked_out count fix
From: Aki M Nyrhinen <anyrhine@cs.helsinki.fi>

IMHO the current fix to the problem (in_flight underflow in reno)
is incorrect.  it treats the symptons but ignores the problem. the
problem is timing out packets other than the head packet when we
don't have sack. i try to explain (sorry if explaining the obvious).

with sack, scanning the retransmit queue for timed out packets is
fine because we know which packets in our retransmit queue have been
acked by the receiver.

without sack, we know only how many packets in our retransmit queue the
receiver has acknowledged, but no idea which packets.

think of a "typical" slow-start overshoot case, where for example
every third packet in a window get lost because a router buffer gets
full.

with sack, we check for timeouts on those every third packet (as the
rest have been sacked). the packet counting works out and if there
is no reordering, we'll retransmit exactly the packets that were 
lost.

without sack, however, we check for timeout on every packet and end up
retransmitting consecutive packets in the retransmit queue. in our
slow-start example, 2/3 of those retransmissions are unnecessary. these
unnecessary retransmissions eat the congestion window and evetually
prevent fast recovery from continuing, if enough packets were lost.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-11 21:18:56 -07:00
Herbert Xu ~{PmVHI~}
f291196979 [TCP]: Avoid skb_pull if possible when trimming head
Trimming the head of an skb by calling skb_pull can cause the packet
to become unaligned if the length pulled is odd.  Since the length is
entirely arbitrary for a FIN packet carrying data, this is actually
quite common.

Unaligned data is not the end of the world, but we should avoid it if
it's easily done.  In this case it is trivial.  Since we're discarding
all of the head data it doesn't matter whether we move skb->data forward
or back.

However, it is still possible to have unaligned skb->data in general.
So network drivers should be prepared to handle it instead of crashing.

This patch also adds an unlikely marking on len < headlen since partial
ACKs on head data are extremely rare in the wild.  As the return value
of __pskb_trim_head is no longer ever NULL that has been removed.

Signed-off-by: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-05 15:03:37 -07:00
Stephen Hemminger
fb80a6e1a5 [TCP] tcp_highspeed: Fix problem observed by Xiaoliang (David) Wei
When snd_cwnd is smaller than 38 and the connection is in
congestion avoidance phase (snd_cwnd > snd_ssthresh), the snd_cwnd
seems to stop growing.

The additive increase was confused because C array's are 0 based.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-02 17:51:08 -07:00
Alexey Dobriyan
7114b0bb6d [NETFILTER]: PPTP helper: fix sstate/cstate typo
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-28 22:51:05 -07:00
Patrick McHardy
ca3ba88d0c [NETFILTER]: mark H.323 helper experimental
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-28 22:50:40 -07:00
Marcel Holtmann
6c813c3fe9 [NETFILTER]: Fix small information leak in SO_ORIGINAL_DST (CVE-2006-1343)
It appears that sockaddr_in.sin_zero is not zeroed during
getsockopt(...SO_ORIGINAL_DST...) operation. This can lead
to an information leak (CVE-2006-1343).

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-28 22:50:18 -07:00
Chris Wright
4a06373913 [NETFILTER]: SNMP NAT: fix memleak in snmp_object_decode
If kmalloc fails, error path leaks data allocated from asn1_oid_decode().

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-23 15:15:13 -07:00
Patrick McHardy
4d942d8b39 [NETFILTER]: H.323 helper: fix sequence extension parsing
When parsing unknown sequence extensions the "son"-pointer points behind
the last known extension for this type, don't try to interpret it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-23 15:15:10 -07:00
Patrick McHardy
7185989db4 [NETFILTER]: H.323 helper: fix parser error propagation
The condition "> H323_ERROR_STOP" can never be true since H323_ERROR_STOP
is positive and is the highest possible return code, while real errors are
negative, fix the checks. Also only abort on real errors in some spots
that were just interpreting any return value != 0 as error.

Fixes crashes caused by use of stale data after a parsing error occured:

BUG: unable to handle kernel paging request at virtual address bfffffff
 printing eip:
c01aa0f8
*pde = 1a801067
*pte = 00000000
Oops: 0000 [#1]
PREEMPT
Modules linked in: ip_nat_h323 ip_conntrack_h323 nfsd exportfs sch_sfq sch_red cls_fw sch_hfsc  xt_length ipt_owner xt_MARK iptable_mangle nfs lockd sunrpc pppoe pppoxx
CPU:    0
EIP:    0060:[<c01aa0f8>]    Not tainted VLI
EFLAGS: 00210646   (2.6.17-rc4 #8)
EIP is at memmove+0x19/0x22
eax: d77264e9   ebx: d77264e9   ecx: e88d9b17   edx: d77264e9
esi: bfffffff   edi: bfffffff   ebp: de6a7680   esp: c0349db8
ds: 007b   es: 007b   ss: 0068
Process asterisk (pid: 3765, threadinfo=c0349000 task=da068540)
Stack: <0>00000006 c0349e5e d77264e3 e09a2b4e e09a38a0 d7726052 d7726124 00000491
       00000006 00000006 00000006 00000491 de6a7680 d772601e d7726032 c0349f74
       e09a2dc2 00000006 c0349e5e 00000006 00000000 d76dda28 00000491 c0349f74
Call Trace:
 [<e09a2b4e>] mangle_contents+0x62/0xfe [ip_nat]
 [<e09a2dc2>] ip_nat_mangle_tcp_packet+0xa1/0x191 [ip_nat]
 [<e0a2712d>] set_addr+0x74/0x14c [ip_nat_h323]
 [<e0ad531e>] process_setup+0x11b/0x29e [ip_conntrack_h323]
 [<e0ad534f>] process_setup+0x14c/0x29e [ip_conntrack_h323]
 [<e0ad57bd>] process_q931+0x3c/0x142 [ip_conntrack_h323]
 [<e0ad5dff>] q931_help+0xe0/0x144 [ip_conntrack_h323]
...

Found by the PROTOS c07-h2250v4 testsuite.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-23 15:15:08 -07:00
Patrick McHardy
f41d5bb1d9 [NETFILTER]: SNMP NAT: fix memory corruption
Fix memory corruption caused by snmp_trap_decode:

- When snmp_trap_decode fails before the id and address are allocated,
  the pointers contain random memory, but are freed by the caller
  (snmp_parse_mangle).

- When snmp_trap_decode fails after allocating just the ID, it tries
  to free both address and ID, but the address pointer still contains
  random memory. The caller frees both ID and random memory again.

- When snmp_trap_decode fails after allocating both, it frees both,
  and the callers frees both again.

The corruption can be triggered remotely when the ip_nat_snmp_basic
module is loaded and traffic on port 161 or 162 is NATed.

Found by multiple testcases of the trap-app and trap-enc groups of the
PROTOS c06-snmpv1 testsuite.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-22 16:55:14 -07:00
Alexey Dobriyan
4195f81453 [NET]: Fix "ntohl(ntohs" bugs
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-22 16:53:22 -07:00
Solar Designer
2c8ac66bb2 [NETFILTER]: Fix do_add_counters race, possible oops or info leak (CVE-2006-0039)
Solar Designer found a race condition in do_add_counters(). The beginning
of paddc is supposed to be the same as tmp which was sanity-checked
above, but it might not be the same in reality. In case the integer
overflow and/or the race condition are triggered, paddc->num_counters
might not match the allocation size for paddc. If the check below
(t->private->number != paddc->num_counters) nevertheless passes (perhaps
this requires the race condition to be triggered), IPT_ENTRY_ITERATE()
would read kernel memory beyond the allocation size, potentially causing
an oops or leaking sensitive data (e.g., passwords from host system or
from another VPS) via counter increments. This requires CAP_NET_ADMIN.

Signed-off-by: Solar Designer <solar@openwall.com>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-19 02:16:52 -07:00
Alexey Dobriyan
a467704dcb [NETFILTER]: GRE conntrack: fix htons/htonl confusion
GRE keys are 16 bit.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-19 02:16:29 -07:00
Philip Craig
5c170a09d9 [NETFILTER]: fix format specifier for netfilter log targets
The prefix argument for nf_log_packet is a format specifier,
so don't pass the user defined string directly to it.

Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-19 02:15:47 -07:00
Jesper Juhl
493e2428aa [NETFILTER]: Fix memory leak in ipt_recent
The Coverity checker spotted that we may leak 'hold' in
net/ipv4/netfilter/ipt_recent.c::checkentry() when the following
is true:
  if (!curr_table->status_proc) {
    ...
    if(!curr_table) {
    ...
      return 0;  <-- here we leak.
Simply moving an existing vfree(hold); up a bit avoids the possible leak.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-19 02:15:13 -07:00
Angelo P. Castellani
8872d8e1c4 [TCP]: reno sacked_out count fix
From: "Angelo P. Castellani" <angelo.castellani+lkml@gmail.com>

Using NewReno, if a sk_buff is timed out and is accounted as lost_out,
it should also be removed from the sacked_out.

This is necessary because recovery using NewReno fast retransmit could
take up to a lot RTTs and the sk_buff RTO can expire without actually
being really lost.

left_out = sacked_out + lost_out
in_flight = packets_out - left_out + retrans_out

Using NewReno without this patch, on very large network losses,
left_out becames bigger than packets_out + retrans_out (!!).

For this reason unsigned integer in_flight overflows to 2^32 - something.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-16 21:42:11 -07:00
Wei Yongjun
63cbd2fda3 [IPV4]: ip_options_fragment() has no effect on fragmentation
Fix error point to options in ip_options_fragment(). optptr get a
error pointer to the ipv4 header, correct is pointer to ipv4 options.

Signed-off-by: Wei Yongjun <weiyj@soft.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-09 15:18:50 -07:00
Hua Zhong
0182bd2b1e [IPV4]: Remove likely in ip_rcv_finish()
This is another result from my likely profiling tool
(dwalker@mvista.com just sent the patch of the profiling tool to
linux-kernel mailing list, which is similar to what I use).

On my system (not very busy, normal development machine within a
VMWare workstation), I see a 6/5 miss/hit ratio for this "likely".

Signed-off-by: Hua Zhong <hzhong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-06 18:11:39 -07:00
John Heffner
5528e568a7 [TCP]: Fix snd_cwnd adjustments in tcp_highspeed.c
Xiaoliang (David) Wei wrote:
> Hi gurus,
> 
>    I am reading the code of tcp_highspeed.c in the kernel and have a
> question on the hstcp_cong_avoid function, specifically the following
> AI part (line 136~143 in net/ipv4/tcp_highspeed.c ):
> 
>                /* Do additive increase */
>                if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
>                        tp->snd_cwnd_cnt += ca->ai;
>                        if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
>                                tp->snd_cwnd++;
>                                tp->snd_cwnd_cnt -= tp->snd_cwnd;
>                        }
>                }
> 
>    In this part, when (tp->snd_cwnd_cnt == tp->snd_cwnd),
> snd_cwnd_cnt will be -1... snd_cwnd_cnt is defined as u16, will this
> small chance of getting -1 becomes a problem?
> Shall we change it by reversing the order of the cwnd++ and cwnd_cnt -= 
> cwnd?

Absolutely correct.  Thanks.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-05 17:41:44 -07:00
Herbert Xu
75c2d9077c [TCP]: Fix sock_orphan dead lock
Calling sock_orphan inside bh_lock_sock in tcp_close can lead to dead
locks.  For example, the inet_diag code holds sk_callback_lock without
disabling BH.  If an inbound packet arrives during that admittedly tiny
window, it will cause a dead lock on bh_lock_sock.  Another possible
path would be through sock_wfree if the network device driver frees the
tx skb in process context with BH enabled.

We can fix this by moving sock_orphan out of bh_lock_sock.

The tricky bit is to work out when we need to destroy the socket
ourselves and when it has already been destroyed by someone else.

By moving sock_orphan before the release_sock we can solve this
problem.  This is because as long as we own the socket lock its
state cannot change.

So we simply record the socket state before the release_sock
and then check the state again after we regain the socket lock.
If the socket state has transitioned to TCP_CLOSE in the time being,
we know that the socket has been destroyed.  Otherwise the socket is
still ours to keep.

Note that I've also moved the increment on the orphan count forward.
This may look like a problem as we're increasing it even if the socket
is just about to be destroyed where it'll be decreased again.  However,
this simply enlarges a window that already exists.  This also changes
the orphan count test by one.

Considering what the orphan count is meant to do this is no big deal.

This problem was discoverd by Ingo Molnar using his lock validator.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:31:35 -07:00
Patrick McHardy
7800007c1e [NETFILTER]: x_tables: don't use __copy_{from,to}_user on unchecked memory in compat layer
Noticed by Linus Torvalds <torvalds@osdl.org>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:20:27 -07:00
Jing Min Zhao
7582e9d17e [NETFILTER]: H.323 helper: Change author's email address
Signed-off-by: Jing Min Zhao <zhaojingmin@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:19:59 -07:00
Patrick McHardy
2354feaeb2 [NETFILTER]: NAT: silence unused variable warnings with CONFIG_XFRM=n
net/ipv4/netfilter/ip_nat_standalone.c: In function 'ip_nat_out':
net/ipv4/netfilter/ip_nat_standalone.c:223: warning: unused variable 'ctinfo'
net/ipv4/netfilter/ip_nat_standalone.c:222: warning: unused variable 'ct'

Surprisingly no complaints so far ..

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:19:26 -07:00
Patrick McHardy
4228e2a989 [NETFILTER]: H.323 helper: fix use of uninitialized data
When a Choice element contains an unsupported choice no error is returned
and parsing continues normally, but the choice value is not set and
contains data from the last parsed message. This may in turn lead to
parsing of more stale data and following crashes.

Fixes a crash triggered by testcase 0003243 from the PROTOS c07-h2250v4
testsuite following random other testcases:

CPU:    0
EIP:    0060:[<c01a9554>]    Not tainted VLI
EFLAGS: 00210646   (2.6.17-rc2 #3)
EIP is at memmove+0x19/0x22
eax: d7be0307   ebx: d7be0307   ecx: e841fcf9   edx: d7be0307
esi: bfffffff   edi: bfffffff   ebp: da5eb980   esp: c0347e2c
ds: 007b   es: 007b   ss: 0068
Process events/0 (pid: 4, threadinfo=c0347000 task=dff86a90)
Stack: <0>00000006 c0347ea6 d7be0301 e09a6b2c 00000006 da5eb980 d7be003e d7be0052
       c0347f6c e09a6d9c 00000006 c0347ea6 00000006 00000000 d7b9a548 00000000
       c0347f6c d7b9a548 00000004 e0a1a119 0000028f 00000006 c0347ea6 00000006
Call Trace:
 [<e09a6b2c>] mangle_contents+0x40/0xd8 [ip_nat]
 [<e09a6d9c>] ip_nat_mangle_tcp_packet+0xa1/0x191 [ip_nat]
 [<e0a1a119>] set_addr+0x60/0x14d [ip_nat_h323]
 [<e0ab6e66>] q931_help+0x2da/0x71a [ip_conntrack_h323]
 [<e0ab6e98>] q931_help+0x30c/0x71a [ip_conntrack_h323]
 [<e09af242>] ip_conntrack_help+0x22/0x2f [ip_conntrack]
 [<c022934a>] nf_iterate+0x2e/0x5f
 [<c025d357>] xfrm4_output_finish+0x0/0x39f
 [<c02294ce>] nf_hook_slow+0x42/0xb0
 [<c025d357>] xfrm4_output_finish+0x0/0x39f
 [<c025d732>] xfrm4_output+0x3c/0x4e
 [<c025d357>] xfrm4_output_finish+0x0/0x39f
 [<c0230370>] ip_forward+0x1c2/0x1fa
 [<c022f417>] ip_rcv+0x388/0x3b5
 [<c02188f9>] netif_receive_skb+0x2bc/0x2ec
 [<c0218994>] process_backlog+0x6b/0xd0
 [<c021675a>] net_rx_action+0x4b/0xb7
 [<c0115606>] __do_softirq+0x35/0x7d
 [<c0104294>] do_softirq+0x38/0x3f

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:17:11 -07:00
Patrick McHardy
6fd737031e [NETFILTER]: H.323 helper: fix endless loop caused by invalid TPKT len
When the TPKT len included in the packet is below the lowest valid value
of 4 an underflow occurs which results in an endless loop.

Found by testcase 0000058 from the PROTOS c07-h2250v4 testsuite.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:16:29 -07:00
Patrick McHardy
e17df688f7 [NETFILTER] SCTP conntrack: fix infinite loop
fix infinite loop in the SCTP-netfilter code: check SCTP chunk size to
guarantee progress of for_each_sctp_chunk(). (all other uses of
for_each_sctp_chunk() are preceded by do_basic_checks(), so this fix
should be complete.)

Based on patch from Ingo Molnar <mingo@elte.hu>

CVE-2006-1527

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-02 17:26:39 -07:00
Patrick McHardy
46c5ea3c9a [NETFILTER] x_tables: fix compat related crash on non-x86
When iptables userspace adds an ipt_standard_target, it calculates the size
of the entire entry as:

sizeof(struct ipt_entry) + XT_ALIGN(sizeof(struct ipt_standard_target))

ipt_standard_target looks like this:

  struct xt_standard_target
  {
        struct xt_entry_target target;
        int verdict;
  };

xt_entry_target contains a pointer, so when compiled for 64 bit the
structure gets an extra 4 byte of padding at the end. On 32 bit
architectures where iptables aligns to 8 byte it will also have 4
byte padding at the end because it is only 36 bytes large.

The compat_ipt_standard_fn in the kernel adjusts the offsets by

  sizeof(struct ipt_standard_target) - sizeof(struct compat_ipt_standard_target),

which will always result in 4, even if the structure from userspace
was already padded to a multiple of 8. On x86 this works out by
accident because userspace only aligns to 4, on all other
architectures this is broken and causes incorrect adjustments to
the size and following offsets.

Thanks to Linus for lots of debugging help and testing.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-01 20:48:32 -07:00
Hua Zhong
83de47cd0c [TCP]: Fix unlikely usage in tcp_transmit_skb()
The following unlikely should be replaced by likely because the
condition happens every time unless there is a hard error to transmit
a packet.

Signed-off-by: Hua Zhong <hzhong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-29 18:33:19 -07:00
Herbert Xu
a76e07acd0 [IPSEC]: Fix IP ID selection
I was looking through the xfrm input/output code in order to abstract
out the address family specific encapsulation/decapsulation code.  During
that process I found this bug in the IP ID selection code in xfrm4_output.c.

At that point dst is still the xfrm_dst for the current SA which
represents an internal flow as far as the IPsec tunnel is concerned.
Since the IP ID is going to sit on the outside of the encapsulated
packet, we obviously want the external flow which is just dst->child.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-29 18:33:16 -07:00
Heiko Carstens
a536e07787 [IPV4]: inet_init() -> fs_initcall
Convert inet_init to an fs_initcall to make sure its called before any
device driver's initcall.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-29 18:33:14 -07:00
Thomas Voegtle
44adf28f4a [NETFILTER]: ULOG target is not obsolete
The backend part is obsoleted, but the target itself is still needed.

Signed-off-by: Thomas Voegtle <tv@lio96.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-24 17:27:29 -07:00
Herbert Xu
b60b49ea6a [TCP]: Account skb overhead in tcp_fragment
Make sure that we get the full sizeof(struct sk_buff)
plus the data size accounted for in skb->truesize.

This will create invariants that will allow adding
assertion checks on skb->truesize.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-19 21:35:00 -07:00
Jesper Juhl
63903ca6af [NET]: Remove redundant NULL checks before [kv]free
Redundant NULL check before kfree removal
from net/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-18 15:57:55 -07:00
Herbert Xu
ef5cb9738b [TCP]: Fix truesize underflow
There is a problem with the TSO packet trimming code.  The cause of
this lies in the tcp_fragment() function.

When we allocate a fragment for a completely non-linear packet the
truesize is calculated for a payload length of zero.  This means that
truesize could in fact be less than the real payload length.

When that happens the TSO packet trimming can cause truesize to become
negative.  This in turn can cause sk_forward_alloc to be -n * PAGE_SIZE
which would trigger the warning.

I've copied the code DaveM used in tso_fragment which should work here.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-18 15:57:49 -07:00
Stephen Hemminger
d2c962b853 [IPV4]: ip_route_input panic fix
This fixes http://bugzilla.kernel.org/show_bug.cgi?id=6388
The bug is caused by ip_route_input dereferencing skb->nh.protocol of
the dummy skb passed dow from inet_rtm_getroute (Thanks Thomas for seeing
it). It only happens if the route requested is for a multicast IP
address.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-17 17:27:11 -07:00
Zach Brown
3d9dd7564d [PATCH] ip_output: account for fraggap when checking to add trailer_len
During other work I noticed that ip_append_data() seemed to be forgetting to
include the frag gap in its calculation of a fragment that consumes the rest of
the payload.  Herbert confirmed that this was a bug that snuck in during a
previous rework.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-14 16:04:18 -07:00
Adrian Bunk
6c97e72a16 [IPV4]: Possible cleanups.
This patch contains the following possible cleanups:
- make the following needlessly global function static:
  - arp.c: arp_rcv()
- remove the following unused EXPORT_SYMBOL's:
  - devinet.c: devinet_ioctl
  - fib_frontend.c: ip_rt_ioctl
  - inet_hashtables.c: inet_bind_bucket_create
  - inet_hashtables.c: inet_bind_hash
  - tcp_input.c: sysctl_tcp_abc
  - tcp_ipv4.c: sysctl_tcp_tw_reuse
  - tcp_output.c: sysctl_tcp_mtu_probing
  - tcp_output.c: sysctl_tcp_base_mss

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-14 15:00:20 -07:00
KAMEZAWA Hiroyuki
6f91204225 [PATCH] for_each_possible_cpu: network codes
for_each_cpu() actually iterates across all possible CPUs.  We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs.  This is inefficient and
possibly buggy.

We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.

This patch replaces for_each_cpu with for_each_possible_cpu under /net

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:31 -07:00
David S. Miller
55c0022e53 [IPV4] ip_fragment: Always compute hash with ipfrag_lock held.
Otherwise we could compute an inaccurate hash due to the
random seed changing.

Noticed by Zach Brown and patch is based upon some feedback
from Herbert Xu.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:43:55 -07:00
Patrick McHardy
19910d1aec [NETFILTER]: Fix DNAT in LOCAL_OUT
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:38:29 -07:00
Patrick McHardy
7a43c99551 [NETFILTER]: H.323 helper: remove changelog
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:43 -07:00
Patrick McHardy
96f6bf82ea [NETFILTER]: Convert conntrack/ipt_REJECT to new checksumming functions
Besides removing lots of duplicate code, all converted users benefit
from improved HW checksum error handling. Tested with and without HW
checksums in almost all combinations.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:42 -07:00
Patrick McHardy
422c346fad [NETFILTER]: Add address family specific checksum helpers
Add checksum operation which takes care of verifying the checksum and
dealing with HW checksum errors and avoids multiple checksum
operations by setting ip_summed to CHECKSUM_UNNECESSARY after
successful verification.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:41 -07:00
Patrick McHardy
bce8032ef3 [NETFILTER]: Introduce infrastructure for address family specific operations
Change the queue rerouter intrastructure to a generic usable
infrastructure for address family specific operations as a base for
some cleanups.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:40 -07:00
Patrick McHardy
a0aed49bdb [NETFILTER]: Fix IP_NF_CONNTRACK_NETLINK dependency
When NAT is built as a module, ip_conntrack_netlink can not be linked
statically.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:39 -07:00
Jing Min Zhao
a0b7db5e86 [NETFILTER]: H.323 helper: add parameter 'default_rrq_ttl'
default_rrq_ttl is used when no TTL is included in the RRQ.

Signed-off-by: Jing Min Zhao <zhaojingmin@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:38 -07:00
Jing Min Zhao
51d42f5e4e [NETFILTER]: H.323 helper: make get_h245_addr() static
Signed-off-by: Jing Min Zhao <zhaojingmin@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:37 -07:00
Jing Min Zhao
0f249685fd [NETFILTER]: H.323 helper: change EXPORT_SYMBOL to EXPORT_SYMBOL_GPL
Signed-off-by: Jing Min Zhao <zhaojingmin@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:36 -07:00
Jing Min Zhao
48bfee5fad [NETFILTER]: H.323 helper: move some function prototypes to ip_conntrack_h323.h
Move prototypes of NAT callbacks to ip_conntrack_h323.h. Because the
use of typedefs as arguments, some header files need to be moved as
well.

Signed-off-by: Jing Min Zhao <zhaojingmin@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:35 -07:00
Patrick McHardy
32292a7ff1 [NETFILTER]: Fix section mismatch warnings
Fix section mismatch warnings caused by netfilter's init_or_cleanup
functions used in many places by splitting the init from the cleanup
parts.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:34 -07:00
Patrick McHardy
964ddaa10d [NETFILTER]: Clean up hook registration
Clean up hook registration by makeing use of the new mass registration and
unregistration helpers.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:33 -07:00
Herbert Xu
45af08be6d [INET]: Use port unreachable instead of proto for tunnels
This patch changes GRE and SIT to generate port unreachable instead of
protocol unreachable errors when we can't find a matching tunnel for a
packet.

This removes the ambiguity as to whether the error is caused by no
tunnel being found or by the lack of support for the given tunnel
type.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:29 -07:00
Herbert Xu
50fba2aa7c [INET]: Move no-tunnel ICMP error to tunnel4/tunnel6
This patch moves the sending of ICMP messages when there are no IPv4/IPv6
tunnels present to tunnel4/tunnel6 respectively.  Please note that for now
if xfrm4_tunnel/xfrm6_tunnel is loaded then no ICMP messages will ever be
sent.  This is similar to how we handle AH/ESP/IPCOMP.

This move fixes the bug where we always send an ICMP message when there is
no ip6_tunnel device present for a given packet even if it is later handled
by IPsec.  It also causes ICMP messages to be sent when no IPIP tunnel is
present.

I've decided to use the "port unreachable" ICMP message over the current
value of "address unreachable" (and "protocol unreachable" by GRE) because
it is not ambiguous unlike the other ones which can be triggered by other
conditions.  There seems to be no standard specifying what value must be
used so this change should be OK.  In fact we should change GRE to use
this value as well.

Incidentally, this patch also fixes a fairly serious bug in xfrm6_tunnel
where we don't check whether the embedded IPv6 header is present before
dereferencing it for the inside source address.

This patch is inspired by a previous patch by Hugo Santos <hsantos@av.it.pt>.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:25 -07:00
Patrick McHardy
2e2f7aefa8 [NETFILTER]: Fix fragmentation issues with bridge netfilter
The conntrack code doesn't do re-fragmentation of defragmented packets
anymore but relies on fragmentation in the IP layer. Purely bridged
packets don't pass through the IP layer, so the bridge netfilter code
needs to take care of fragmentation itself.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:23 -07:00
Robert Olsson
550e29bc96 [FIB_TRIE]: Fix leaf freeing.
Seems like leaf (end-nodes) has been freed by __tnode_free_rcu and not
by __leaf_free_rcu. This fixes the problem. Only tnode_free is now
used which checks for appropriate node type. free_leaf can be removed.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:23 -07:00
Herbert Xu
8bf4b8a108 [IPSEC]: Check x->encap before dereferencing it
We need to dereference x->encap before dereferencing it for encap_type.
If it's absent then the encap_type is zero.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-09 22:25:22 -07:00
Dmitry Mishin
2722971cbe [NETFILTER]: iptables 32bit compat layer
This patch extends current iptables compatibility layer in order to get
32bit iptables to work on 64bit kernel. Current layer is insufficient due
to alignment checks both in kernel and user space tools.

Patch is for current net-2.6.17 with addition of move of ipt_entry_{match|
target} definitions to xt_entry_{match|target}.

Signed-off-by: Dmitry Mishin <dim@openvz.org>
Acked-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-01 02:25:19 -08:00
Martin Josefsson
e64a70be51 [NETFILTER]: {ip,nf}_conntrack_netlink: fix expectation notifier unregistration
This patch fixes expectation notifier unregistration on module unload to
use ip_conntrack_expect_unregister_notifier(). This bug causes a soft
lockup at the first expectation created after a rmmod ; insmod of this
module.

Should go into -stable as well.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-01 02:24:48 -08:00
Yasuyuki Kozakai
a89ecb6a2e [NETFILTER]: x_tables: unify IPv4/IPv6 multiport match
This unifies ipt_multiport and ip6t_multiport to xt_multiport.
As a result, this addes support for inversion and port range match
to IPv6 packets.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-01 02:22:54 -08:00
Yasuyuki Kozakai
dc5ab2faec [NETFILTER]: x_tables: unify IPv4/IPv6 esp match
This unifies ipt_esp and ip6t_esp to xt_esp. Please note that now
a user program needs to specify IPPROTO_ESP as protocol to use esp match
with IPv6. This means that ip6tables requires '-p esp' like iptables.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-01 02:22:30 -08:00
Herbert Xu
dbe5b4aaaf [IPSEC]: Kill unused decap state structure
This patch removes the *_decap_state structures which were previously
used to share state between input/post_input.  This is no longer
needed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-01 00:54:16 -08:00
Herbert Xu
e695633e21 [IPSEC]: Kill unused decap state argument
This patch removes the decap_state argument from the xfrm input hook.
Previously this function allowed the input hook to share state with
the post_input hook.  The latter has since been removed.

The only purpose for it now is to check the encap type.  However, it
is easier and better to move the encap type check to the generic
xfrm_rcv function.  This allows us to get rid of the decap state
argument altogether.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-01 00:52:46 -08:00
Andrew Morton
65b4b4e81a [NETFILTER]: Rename init functions.
Every netfilter module uses `init' for its module_init() function and
`fini' or `cleanup' for its module_exit() function.

Problem is, this creates uninformative initcall_debug output and makes
ctags rather useless.

So go through and rename them all to $(filename)_init and
$(filename)_fini.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-28 17:02:48 -08:00
S P
c3e5d877aa [TCP]: Fix RFC2465 typo.
Signed-off-by: S P <speattle@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-28 17:02:47 -08:00
Herbert Xu
d2acc3479c [INET]: Introduce tunnel4/tunnel6
Basically this patch moves the generic tunnel protocol stuff out of
xfrm4_tunnel/xfrm6_tunnel and moves it into the new files of tunnel4.c
and tunnel6 respectively.

The reason for this is that the problem that Hugo uncovered is only
the tip of the iceberg.  The real problem is that when we removed the
dependency of ipip on xfrm4_tunnel we didn't really consider the module
case at all.

For instance, as it is it's possible to build both ipip and xfrm4_tunnel
as modules and if the latter is loaded then ipip simply won't load.

After considering the alternatives I've decided that the best way out of
this is to restore the dependency of ipip on the non-xfrm-specific part
of xfrm4_tunnel.  This is acceptable IMHO because the intention of the
removal was really to be able to use ipip without the xfrm subsystem.
This is still preserved by this patch.

So now both ipip/xfrm4_tunnel depend on the new tunnel4.c which handles
the arbitration between the two.  The order of processing is determined
by a simple integer which ensures that ipip gets processed before
xfrm4_tunnel.

The situation for ICMP handling is a little bit more complicated since
we may not have enough information to determine who it's for.  It's not
a big deal at the moment since the xfrm ICMP handlers are basically
no-ops.  In future we can deal with this when we look at ICMP caching
in general.

The user-visible change to this is the removal of the TUNNEL Kconfig
prompts.  This makes sense because it can only be used through IPCOMP
as it stands.

The addition of the new modules shouldn't introduce any problems since
module dependency will cause them to be loaded.

Oh and I also turned some unnecessary pskb's in IPv6 related to this
patch to skb's.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-28 17:02:46 -08:00
Alan Stern
e041c68341 [PATCH] Notifier chain update: API changes
The kernel's implementation of notifier chains is unsafe.  There is no
protection against entries being added to or removed from a chain while the
chain is in use.  The issues were discussed in this thread:

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2

We noticed that notifier chains in the kernel fall into two basic usage
classes:

	"Blocking" chains are always called from a process context
	and the callout routines are allowed to sleep;

	"Atomic" chains can be called from an atomic context and
	the callout routines are not allowed to sleep.

We decided to codify this distinction and make it part of the API.  Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name).  New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain.  The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.

With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed.  For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections.  (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)

There are some limitations, which should not be too hard to live with.  For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem.  Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain.  (This did happen in a couple of places and the code
had to be changed to avoid it.)

Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization.  Instead we use RCU.  The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.

Here is the list of chains that we adjusted and their classifications.  None
of them use the raw API, so for the moment it is only a placeholder.

  ATOMIC CHAINS
  -------------
arch/i386/kernel/traps.c:		i386die_chain
arch/ia64/kernel/traps.c:		ia64die_chain
arch/powerpc/kernel/traps.c:		powerpc_die_chain
arch/sparc64/kernel/traps.c:		sparc64die_chain
arch/x86_64/kernel/traps.c:		die_chain
drivers/char/ipmi/ipmi_si_intf.c:	xaction_notifier_list
kernel/panic.c:				panic_notifier_list
kernel/profile.c:			task_free_notifier
net/bluetooth/hci_core.c:		hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_expect_chain
net/ipv6/addrconf.c:			inet6addr_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_expect_chain
net/netlink/af_netlink.c:		netlink_chain

  BLOCKING CHAINS
  ---------------
arch/powerpc/platforms/pseries/reconfig.c:	pSeries_reconfig_chain
arch/s390/kernel/process.c:		idle_chain
arch/x86_64/kernel/process.c		idle_notifier
drivers/base/memory.c:			memory_chain
drivers/cpufreq/cpufreq.c		cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c		cpufreq_transition_notifier_list
drivers/macintosh/adb.c:		adb_client_list
drivers/macintosh/via-pmu.c		sleep_notifier_list
drivers/macintosh/via-pmu68k.c		sleep_notifier_list
drivers/macintosh/windfarm_core.c	wf_client_list
drivers/usb/core/notify.c		usb_notifier_list
drivers/video/fbmem.c			fb_notifier_list
kernel/cpu.c				cpu_chain
kernel/module.c				module_notify_list
kernel/profile.c			munmap_notifier
kernel/profile.c			task_exit_notifier
kernel/sys.c				reboot_notifier_list
net/core/dev.c				netdev_chain
net/decnet/dn_dev.c:			dnaddr_chain
net/ipv4/devinet.c:			inetaddr_chain

It's possible that some of these classifications are wrong.  If they are,
please let us know or submit a patch to fix them.  Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)

The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.

[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:50 -08:00
Ingo Molnar
14cc3e2b63 [PATCH] sem2mutex: misc static one-file mutexes
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jens Axboe <axboe@suse.de>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:55 -08:00
Linus Torvalds
b55813a2e5 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [NETFILTER] x_table.c: sem2mutex
  [IPV4]: Aggregate route entries with different TOS values
  [TCP]: Mark tcp_*mem[] __read_mostly.
  [TCP]: Set default max buffers from memory pool size
  [SCTP]: Fix up sctp_rcv return value
  [NET]: Take RTNL when unregistering notifier
  [WIRELESS]: Fix config dependencies.
  [NET]: Fill in a 32-bit hole in struct sock on 64-bit platforms.
  [NET]: Ensure device name passed to SO_BINDTODEVICE is NULL terminated.
  [MODULES]: Don't allow statically declared exports
  [BRIDGE]: Unaligned accesses in the ethernet bridge
2006-03-25 08:39:20 -08:00
Davide Libenzi
f348d70a32 [PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notifications
Implement the half-closed devices notifiation, by adding a new POLLRDHUP
(and its alias EPOLLRDHUP) bit to the existing poll/select sets.  Since the
existing POLLHUP handling, that does not report correctly half-closed
devices, was feared to be changed, this implementation leaves the current
POLLHUP reporting unchanged and simply add a new bit that is set in the few
places where it makes sense.  The same thing was discussed and conceptually
agreed quite some time ago:

http://lkml.org/lkml/2003/7/12/116

Since this new event bit is added to the existing Linux poll infrastruture,
even the existing poll/select system calls will be able to use it.  As far
as the existing POLLHUP handling, the patch leaves it as is.  The
pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing
archs and sets the bit in the six relevant files.  The other attached diff
is the simple change required to sys/epoll.h to add the EPOLLRDHUP
definition.

There is "a stupid program" to test POLLRDHUP delivery here:

 http://www.xmailserver.org/pollrdhup-test.c

It tests poll(2), but since the delivery is same epoll(2) will work equally.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:56 -08:00
Ilia Sotnikov
cef2685e00 [IPV4]: Aggregate route entries with different TOS values
When we get an ICMP need-to-frag message, the original TOS value in the
ICMP payload cannot be used as a key to look up the routes to update.
This is because the TOS field may have been modified by routers on the
way.  Similarly, ip_rt_redirect should also ignore the TOS as the router
that gave us the message may have modified the TOS value.

The patch achieves this objective by aggregating entries with different
TOS values (but are otherwise identical) into the same bucket.  This
makes it easy to update them at the same time when an ICMP message is
received.

In future we should use a twin-hashing scheme where teh aggregation
occurs at the entry level.  That is, the TOS goes back into the hash
for normal lookups while ICMP lookups will end up with a node that
gives us a list that contains all other route entries that differ
only by TOS.

Signed-off-by: Ilia Sotnikov <hostcc@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-25 01:38:55 -08:00
David S. Miller
b8059eadf9 [TCP]: Mark tcp_*mem[] __read_mostly.
Suggested by Stephen Hemminger.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-25 01:36:56 -08:00
John Heffner
7b4f4b5ebc [TCP]: Set default max buffers from memory pool size
This patch sets the maximum TCP buffer sizes (available to automatic
buffer tuning, not to setsockopt) based on the TCP memory pool size.
The maximum sndbuf and rcvbuf each will be up to 4 MB, but no more
than 1/128 of the memory pressure threshold.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-25 01:34:07 -08:00
Alexey Dobriyan
53b3531bbb [PATCH] s/;;/;/g
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:24 -08:00
Patrick McHardy
a5cdc03003 [IPV4]: Add fib rule netlink notifications
To really make sense of route notifications in the presence of
multiple tables, userspace also needs to be notified about routing
rule updates.  Notifications are sent to the so far unused
RTNLGRP_NOP1 (now RTNLGRP_RULE) group.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-23 01:16:06 -08:00
Alexey Kuznetsov
1a55d57b10 [TCP]: Do not use inet->id of global tcp_socket when sending RST.
The problem is in ip_push_pending_frames(), which uses:

        if (!df) {
                __ip_select_ident(iph, &rt->u.dst, 0);
        } else {
                iph->id = htons(inet->id++);
        }

instead of ip_select_ident().

Right now I think the code is a nonsense. Most likely, I copied it from
old ip_build_xmit(), where it was really special, we had to decide
whether to generate unique ID when generating the first (well, the last)
fragment.

In ip_push_pending_frames() it does not make sense, it should use plain
ip_select_ident() instead.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22 14:27:59 -08:00
Patrick McHardy
6a534ee35c [NETFILTER]: Fix undefined references to get_h225_addr
get_h225_addr is exported, but declared static, which fails when
linking statically.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22 13:57:25 -08:00
Pablo Neira Ayuso
b9f78f9fca [NETFILTER]: nf_conntrack: support for layer 3 protocol load on demand
x_tables matches and targets that require nf_conntrack_ipv[4|6] to work
don't have enough information to load on demand these modules. This
patch introduces the following changes to solve this issue:

o nf_ct_l3proto_try_module_get: try to load the layer 3 connection
tracker module and increases the refcount.
o nf_ct_l3proto_module put: drop the refcount of the module.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22 13:56:08 -08:00
Pablo Neira Ayuso
a45049c51c [NETFILTER]: x_tables: set the protocol family in x_tables targets/matches
Set the family field in xt_[matches|targets] registered.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22 13:55:40 -08:00
Pablo Neira Ayuso
4e3882f773 [NETFILTER]: conntrack: cleanup the conntrack ID initialization
Currently the first conntrack ID assigned is 2, use 1 instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22 13:55:11 -08:00
Pablo Neira Ayuso
1cde64365b [NETFILTER]: ctnetlink: Fix expectaction mask dumping
The expectation mask has some particularities that requires a different
handling. The protocol number fields can be set to non-valid protocols,
ie. l3num is set to 0xFFFF. Since that protocol does not exist, the mask
tuple will not be dumped. Moreover, this results in a kernel panic when
nf_conntrack accesses the array of protocol handlers, that is PF_MAX (0x1F)
long.

This patch introduces the function ctnetlink_exp_dump_mask, that correctly
dumps the expectation mask. Such function uses the l3num value from the
expectation tuple that is a valid layer 3 protocol number. The value of the
l3num mask isn't dumped since it is meaningless from the userspace side.

Thanks to Yasuyuki Kozakai and Patrick McHardy for the feedback.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22 13:54:15 -08:00
Jing Min Zhao
5e35941d99 [NETFILTER]: Add H.323 conntrack/NAT helper
Signed-off-by: Jing Min Zhao <zhaojignmin@hotmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 23:41:17 -08:00
David S. Miller
dbeff12b4d [INET]: Fix typo in Arnaldo's connection sock compat fixups.
"struct inet_csk" --> "struct inet_connection_sock" :-)

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:52:32 -08:00
Arnaldo Carvalho de Melo
543d9cfeec [NET]: Identation & other cleanups related to compat_[gs]etsockopt cset
No code changes, just tidying up, in some cases moving EXPORT_SYMBOLs
to just after the function exported, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:48:35 -08:00
Arnaldo Carvalho de Melo
dec73ff029 [ICSK] compat: Introduce inet_csk_compat_[gs]etsockopt
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:46:16 -08:00
Dmitry Mishin
3fdadf7d27 [NET]: {get|set}sockopt compatibility layer
This patch extends {get|set}sockopt compatibility layer in order to
move protocol specific parts to their place and avoid huge universal
net/compat.c file in the future.

Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:45:21 -08:00
Catherine Zhang
2c7946a7bf [SECURITY]: TCP/UDP getpeersec
This patch implements an application of the LSM-IPSec networking
controls whereby an application can determine the label of the
security association its TCP or UDP sockets are currently connected to
via getsockopt and the auxiliary data mechanism of recvmsg.

Patch purpose:

This patch enables a security-aware application to retrieve the
security context of an IPSec security association a particular TCP or
UDP socket is using.  The application can then use this security
context to determine the security context for processing on behalf of
the peer at the other end of this connection.  In the case of UDP, the
security context is for each individual packet.  An example
application is the inetd daemon, which could be modified to start
daemons running at security contexts dependent on the remote client.

Patch design approach:

- Design for TCP
The patch enables the SELinux LSM to set the peer security context for
a socket based on the security context of the IPSec security
association.  The application may retrieve this context using
getsockopt.  When called, the kernel determines if the socket is a
connected (TCP_ESTABLISHED) TCP socket and, if so, uses the dst_entry
cache on the socket to retrieve the security associations.  If a
security association has a security context, the context string is
returned, as for UNIX domain sockets.

- Design for UDP
Unlike TCP, UDP is connectionless.  This requires a somewhat different
API to retrieve the peer security context.  With TCP, the peer
security context stays the same throughout the connection, thus it can
be retrieved at any time between when the connection is established
and when it is torn down.  With UDP, each read/write can have
different peer and thus the security context might change every time.
As a result the security context retrieval must be done TOGETHER with
the packet retrieval.

The solution is to build upon the existing Unix domain socket API for
retrieving user credentials.  Linux offers the API for obtaining user
credentials via ancillary messages (i.e., out of band/control messages
that are bundled together with a normal message).

Patch implementation details:

- Implementation for TCP
The security context can be retrieved by applications using getsockopt
with the existing SO_PEERSEC flag.  As an example (ignoring error
checking):

getsockopt(sockfd, SOL_SOCKET, SO_PEERSEC, optbuf, &optlen);
printf("Socket peer context is: %s\n", optbuf);

The SELinux function, selinux_socket_getpeersec, is extended to check
for labeled security associations for connected (TCP_ESTABLISHED ==
sk->sk_state) TCP sockets only.  If so, the socket has a dst_cache of
struct dst_entry values that may refer to security associations.  If
these have security associations with security contexts, the security
context is returned.

getsockopt returns a buffer that contains a security context string or
the buffer is unmodified.

- Implementation for UDP
To retrieve the security context, the application first indicates to
the kernel such desire by setting the IP_PASSSEC option via
getsockopt.  Then the application retrieves the security context using
the auxiliary data mechanism.

An example server application for UDP should look like this:

toggle = 1;
toggle_len = sizeof(toggle);

setsockopt(sockfd, SOL_IP, IP_PASSSEC, &toggle, &toggle_len);
recvmsg(sockfd, &msg_hdr, 0);
if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
    cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
    if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
        cmsg_hdr->cmsg_level == SOL_IP &&
        cmsg_hdr->cmsg_type == SCM_SECURITY) {
        memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
    }
}

ip_setsockopt is enhanced with a new socket option IP_PASSSEC to allow
a server socket to receive security context of the peer.  A new
ancillary message type SCM_SECURITY.

When the packet is received we get the security context from the
sec_path pointer which is contained in the sk_buff, and copy it to the
ancillary message space.  An additional LSM hook,
selinux_socket_getpeersec_udp, is defined to retrieve the security
context from the SELinux space.  The existing function,
selinux_socket_getpeersec does not suit our purpose, because the
security context is copied directly to user space, rather than to
kernel space.

Testing:

We have tested the patch by setting up TCP and UDP connections between
applications on two machines using the IPSec policies that result in
labeled security associations being built.  For TCP, we can then
extract the peer security context using getsockopt on either end.  For
UDP, the receiving end can retrieve the security context using the
auxiliary data mechanism of recvmsg.

Signed-off-by: Catherine Zhang <cxzhang@watson.ibm.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:41:23 -08:00
Rick Jones
15d99e02ba [TCP]: sysctl to allow TCP window > 32767 sans wscale
Back in the dark ages, we had to be conservative and only allow 15-bit
window fields if the window scale option was not negotiated.  Some
ancient stacks used a signed 16-bit quantity for the window field of
the TCP header and would get confused.

Those days are long gone, so we can use the full 16-bits by default
now.

There is a sysctl added so that we can still interact with such old
stacks

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:40:29 -08:00
Neil Horman
abd596a4b6 [IPV4] ARP: Alloc acceptance of unsolicited ARP via netdevice sysctl.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:39:47 -08:00
David S. Miller
edb2c34fb2 [NETFILTER]: Fix warnings in ip_nat_snmp_basic.c
net/ipv4/netfilter/ip_nat_snmp_basic.c: In function 'asn1_header_decode':
net/ipv4/netfilter/ip_nat_snmp_basic.c:248: warning: 'len' may be used uninitialized in this function
net/ipv4/netfilter/ip_nat_snmp_basic.c:248: warning: 'def' may be used uninitialized in this function
net/ipv4/netfilter/ip_nat_snmp_basic.c: In function 'snmp_translate':
net/ipv4/netfilter/ip_nat_snmp_basic.c:672: warning: 'l' may be used uninitialized in this function
net/ipv4/netfilter/ip_nat_snmp_basic.c:668: warning: 'type' may be used uninitialized in this function

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:36:21 -08:00
Ingo Molnar
57b47a53ec [NET]: sem2mutex part 2
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:35:41 -08:00
Arjan van de Ven
4a3e2f711a [NET] sem2mutex: net/
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:33:17 -08:00
Stephen Hemminger
1533306186 [NET]: dev_put/dev_hold cleanup
Get rid of the old __dev_put macro that is just a hold over from pre 2.6
kernel.  And turn dev_hold into an inline instead of a macro.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:32:28 -08:00
Stephen Hemminger
6756ae4b4e [NET]: Convert RTNL to mutex.
This patch turns the RTNL from a semaphore to a new 2.6.16 mutex and
gets rid of some of the leftover legacy.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:23:58 -08:00
Baruch Even
50bf3e224a [TCP] H-TCP: Better time accounting
Instead of estimating the time since the last congestion event, count
it directly.

Signed-off-by: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:23:10 -08:00
Baruch Even
0bc6d90b82 [TCP] H-TCP: Account for delayed-ACKs
Account for delayed-ACKs in H-TCP.

Delayed-ACKs cause H-TCP to be less aggressive than its design calls
for. It is especially true when the receiver is a Linux machine where
the average delayed ack is over 3 packets with values of 7 not unheard
of.

Signed-off-By: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:22:47 -08:00
Baruch Even
c33ad6e476 [TCP] H-TCP: Use msecs_to_jiffies
Use functions to calculate jiffies from milliseconds and not the old,
crude method of dividing HZ by a value. Ensures more accurate values
even in the face of strange HZ values.

Signed-off-By: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:22:20 -08:00
Arnaldo Carvalho de Melo
c4d9390941 [ICSK]: Introduce inet_csk_ctl_sock_create
Consolidating open coded sequences in tcp and dccp, v4 and v6.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 22:01:03 -08:00
Robert Olsson
06ef921d60 [IPV4]: fib_trie stats fix
fib_triestats has been buggy and caused oopses some platforms as
openwrt.  The patch below should cure those problems.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 21:35:01 -08:00
Robert Olsson
5ddf0eb2bf [IPV4]: fib_trie initialzation fix
In some kernel configs /proc functions seems to be accessed before the
trie is initialized. The patch below checks for this.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 21:34:12 -08:00
John Heffner
0e7b13685f [TCP] mtu probing: move tcp-specific data out of inet_connection_sock
This moves some TCP-specific MTU probing state out of
inet_connection_sock back to tcp_sock.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 21:32:58 -08:00
Patrick McHardy
a193a4abdd [NETFILTER]: Fix skb->nf_bridge lifetime issues
The bridge netfilter code simulates the NF_IP_PRE_ROUTING hook and skips
the real hook by registering with high priority and returning NF_STOP if
skb->nf_bridge is present and the BRNF_NF_BRIDGE_PREROUTING flag is not
set. The flag is only set during the simulated hook.

Because skb->nf_bridge is only freed when the packet is destroyed, the
packet will not only skip the first invocation of NF_IP_PRE_ROUTING, but
in the case of tunnel devices on top of the bridge also all further ones.
Forwarded packets from a bridge encapsulated by a tunnel device and sent
as locally outgoing packet will also still have the incorrect bridge
information from the input path attached.

We already have nf_reset calls on all RX/TX paths of tunnel devices,
so simply reset the nf_bridge field there too. As an added bonus,
the bridge information for locally delivered packets is now also freed
when the packet is queued to a socket.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 19:23:05 -08:00
Jamal Hadi Salim
9500e8a81f [IPSEC]: Sync series - fast path
Fast path sequence updates that will generate ipsec async
events

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 19:15:29 -08:00
Patrick McHardy
a242769248 [NETFILTER]: ctnetlink: avoid unneccessary event message generation
Avoid unneccessary event message generation by checking for netlink
listeners before building a message.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 18:03:59 -08:00
Patrick McHardy
c4b8851392 [NETFILTER]: x_tables: replace IPv4/IPv6 policy match by address family independant version
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 18:03:40 -08:00
Patrick McHardy
c498673474 [NETFILTER]: x_tables: add xt_{match,target} arguments to match/target functions
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 18:02:56 -08:00
Patrick McHardy
1c524830d0 [NETFILTER]: x_tables: pass registered match/target data to match/target functions
This allows to make decisions based on the revision (and address family
with a follow-up patch) at runtime.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 18:02:15 -08:00
Patrick McHardy
aa83c1ab43 [NETFILTER]: Convert arp_tables targets to centralized error checking
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 18:01:28 -08:00
Patrick McHardy
1d5cd90976 [NETFILTER]: Convert ip_tables matches/targets to centralized error checking
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 18:01:14 -08:00
Patrick McHardy
3cdc7c953e [NETFILTER]: Change {ip,ip6,arp}_tables to use centralized error checking
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 18:00:36 -08:00
Holger Eitzenberger
f2ad52c9da [NETFILTER]: Fix CID offset bug in PPTP NAT helper debug message
The recent (kernel 2.6.15.1) fix for PPTP NAT helper introduced a
bug - which only appears if DEBUGP is enabled though.

The calculation of the CID offset into a PPTP request struct is
not correct, so that at least not the correct CID is displayed
if DEBUGP is enabled.

This patch corrects CID offset calculation and introduces a #define
for that.

Signed-off-by: Holger Eitzenberger <heitzenberger@astaro.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 17:58:21 -08:00
Harald Welte
dc808fe28d [NETFILTER] nf_conntrack: clean up to reduce size of 'struct nf_conn'
This patch moves all helper related data fields of 'struct nf_conn'
into a separate structure 'struct nf_conn_help'.  This new structure
is only present in conntrack entries for which we actually have a
helper loaded.

Also, this patch cleans up the nf_conntrack 'features' mechanism to
resemble what the original idea was: Just glue the feature-specific
data structures at the end of 'struct nf_conn', and explicitly
re-calculate the pointer to it when needed rather than keeping
pointers around.

Saves 20 bytes per conntrack on my x86_64 box. A non-helped conntrack
is 276 bytes. We still need to save another 20 bytes in order to fit
into to target of 256bytes.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 17:56:32 -08:00
John Heffner
5d424d5a67 [TCP]: MTU probing
Implementation of packetization layer path mtu discovery for TCP, based on
the internet-draft currently found at
<http://www.ietf.org/internet-drafts/draft-ietf-pmtud-method-05.txt>.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 17:53:41 -08:00
Adrian Bunk
d15150f755 [IPV4] fib_rules.c: make struct fib_rules static again
struct fib_rules became global for no good reason.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 17:46:56 -08:00
Robert Olsson
7b204afd45 [IPV4]: Use RCU locking in fib_rules.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 17:18:53 -08:00
Patrick McHardy
31fe4d3317 [NETFILTER]: arp_tables: fix NULL pointer dereference
The check is wrong and lets NULL-ptrs slip through since !IS_ERR(NULL)
is true.

Coverity #190

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-12 20:40:43 -08:00
Patrick McHardy
baa829d892 [IPV4/6]: Fix UFO error propagation
When ufo_append_data fails err is uninitialized, but returned back.
Strangely gcc doesn't notice it.

Coverity #901 and #902

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-12 20:39:40 -08:00
Patrick McHardy
4a1ff6e2bd [TCP]: tcp_highspeed: fix AIMD table out-of-bounds access
Covertiy #547

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-12 20:39:39 -08:00
David S. Miller
ba244fe900 [TCP]: Fix tcp_tso_should_defer() when limit>=65536
That's >= a full sized TSO frame, so we should always
return 0 in that case.

Based upon a report and initial patch from Lachlan
Andrew, final patch suggested by Herbert Xu.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-11 18:51:49 -08:00
Thomas Graf
850a9a4e3c [NETFILTER] ip_queue: Fix wrong skb->len == nlmsg_len assumption
The size of the skb carrying the netlink message is not
equivalent to the length of the actual netlink message
due to padding. ip_queue matches the length of the payload
against the original packet size to determine if packet
mangling is desired, due to the above wrong assumption
arbitary packets may not be mangled depening on their
original size.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-07 14:56:12 -08:00
Patrick McHardy
bafac2a512 [NETFILTER]: Restore {ipt,ip6t,ebt}_LOG compatibility
The nfnetlink_log infrastructure changes broke compatiblity of the LOG
targets. They currently use whatever log backend was registered first,
which means that if ipt_ULOG was loaded first, no messages will be printed
to the ring buffer anymore.

Restore compatiblity by using the old log functions by default and only use
the nf_log backend if the user explicitly said so.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-27 13:04:17 -08:00
Herbert Xu
752c1f4c78 [IPSEC]: Kill post_input hook and do NAT-T in esp_input directly
The only reason post_input exists at all is that it gives us the
potential to adjust the checksums incrementally in future which
we ought to do.

However, after thinking about it for a bit we can adjust the
checksums without using this post_input stuff at all.  The crucial
point is that only the inner-most NAT-T SA needs to be considered
when adjusting checksums.  What's more, the checksum adjustment
comes down to a single u32 due to the linearity of IP checksums.

We just happen to have a spare u32 lying around in our skb structure :)
When ip_summed is set to CHECKSUM_NONE on input, the value of skb->csum
is currently unused.  All we have to do is to make that the checksum
adjustment and voila, there goes all the post_input and decap structures!

I've left in the decap data structures for now since it's intricately
woven into the sec_path stuff.  We can kill them later too.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-27 13:00:40 -08:00
Herbert Xu
4bf05eceec [IPSEC] esp: Kill unnecessary block and indentation
We used to keep sg on the stack which is why the extra block was useful.
We've long since stopped doing that so let's kill the block and save
some indentation.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-27 13:00:01 -08:00
Herbert Xu
4da3089f2b [IPSEC]: Use TOS when doing tunnel lookups
We should use the TOS because it's one of the routing keys.  It also
means that we update the correct routing cache entry when PMTU occurs.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-23 16:19:26 -08:00
Suresh Bhogavilli
8525987849 [IPV4]: Fix garbage collection of multipath route entries
When garbage collecting route cache entries of multipath routes
in rt_garbage_collect(), entries were deleted from the hash bucket
'i' while holding a spin lock on bucket 'k' resulting in a system
hang.  Delete entries, if any, from bucket 'k' instead.

Signed-off-by: Suresh Bhogavilli <sbhogavilli@verisign.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-23 16:10:52 -08:00
Patrick McHardy
8e249f0881 [NETFILTER]: Fix outgoing redirects to loopback
When redirecting an outgoing packet to loopback, it keeps the original
conntrack reference and information from the outgoing path, which
falsely triggers the check for DNAT on input and the dst_entry is
released to trigger rerouting. ip_route_input refuses to route the
packet because it has a local source address and it is dropped.

Look at the packet itself to dermine if it was NATed. Also fix a
missing inversion that causes unneccesary xfrm lookups.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-19 22:29:47 -08:00
Patrick McHardy
bc6e14b6f0 [NETFILTER]: Fix NAT PMTUD problems
ICMP errors are only SNATed when their source matches the source of the
connection they are related to, otherwise the source address is not
changed. This creates problems with ICMP frag. required messages
originating from a router behind the NAT, if private IPs are used the
packet has a good change of getting dropped on the path to its destination.

Always NAT ICMP errors similar to the original connection.

Based on report by Al Viro.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-19 22:26:40 -08:00
Yasuyuki Kozakai
7d3cdc6b55 [NETFILTER]: nf_conntrack: move registration of __nf_ct_attach
Move registration of __nf_ct_attach to nf_conntrack_core to make it usable
for IPv6 connection tracking as well.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-15 15:22:21 -08:00
Patrick McHardy
48d5cad87c [XFRM]: Fix SNAT-related crash in xfrm4_output_finish
When a packet matching an IPsec policy is SNATed so it doesn't match any
policy anymore it looses its xfrm bundle, which makes xfrm4_output_finish
crash because of a NULL pointer dereference.

This patch directs these packets to the original output path instead. Since
the packets have already passed the POST_ROUTING hook, but need to start at
the beginning of the original output path which includes another
POST_ROUTING invocation, a flag is added to the IPCB to indicate that the
packet was rerouted and doesn't need to pass the POST_ROUTING hook again.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-15 15:10:22 -08:00
Patrick McHardy
ee68cea2c2 [NETFILTER]: Fix xfrm lookup after SNAT
To find out if a packet needs to be handled by IPsec after SNAT, packets
are currently rerouted in POST_ROUTING and a new xfrm lookup is done. This
breaks SNAT of non-unicast packets to non-local addresses because the
packet is routed as incoming packet and no neighbour entry is bound to the
dst_entry. In general, it seems to be a bad idea to replace the dst_entry
after the packet was already sent to the output routine because its state
might not match what's expected.

This patch changes the xfrm lookup in POST_ROUTING to re-use the original
dst_entry without routing the packet again. This means no policy routing
can be used for transport mode transforms (which keep the original route)
when packets are SNATed to match the policy, but it looks like the best
we can do for now.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-15 01:34:23 -08:00
Dave Jones
77decfc716 [IPV4] ICMP: Invert default for invalid icmp msgs sysctl
isic can trigger these msgs to be spewed at a very high rate.
There's already a sysctl to turn them off. Given these messages
aren't useful for most people, this patch disables them by
default.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-13 15:36:21 -08:00
John Heffner
6fcf9412de [TCP]: rcvbuf lock when tcp_moderate_rcvbuf enabled
The rcvbuf lock should probably be honored here.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-09 17:06:57 -08:00
Alexey Kuznetsov
28633514af [NETLINK]: illegal use of pid in rtnetlink
When a netlink message is not related to a netlink socket,
it is issued by kernel socket with pid 0. Netlink "pid" has nothing
to do with current->pid. I called it incorrectly, if it was named "port",
the confusion would be avoided.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-09 16:43:41 -08:00
Al Viro
76edc6051e [PATCH] ipv4 NULL noise removal
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07 20:57:37 -05:00
Al Viro
1b8623545b [PATCH] remove bogus asm/bug.h includes.
A bunch of asm/bug.h includes are both not needed (since it will get
pulled anyway) and bogus (since they are done too early).  Removed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07 20:56:35 -05:00
Linus Torvalds
98bd0c07b6 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2006-02-05 11:10:29 -08:00
Eric Dumazet
88a2a4ac6b [PATCH] percpu data: only iterate over possible CPUs
percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
cpudata, instead of allocating memory only for possible cpus.

As a preparation for changing that, we need to convert various 0 -> NR_CPUS
loops to use for_each_cpu().

(The above only applies to users of asm-generic/percpu.h.  powerpc has gone it
alone and is presently only allocating memory for present CPUs, so it's
currently corrupting memory).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@suse.de>
Cc: Anton Blanchard <anton@samba.org>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:51 -08:00
Patrick McHardy
7918d212df [NETFILTER]: Fix check whether dst_entry needs to be released after NAT
After DNAT the original dst_entry needs to be released if present
so the packet doesn't skip input routing with its new address. The
current check for DNAT in ip_nat_in is reversed and checks for SNAT.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:29 -08:00
Patrick McHardy
0047c65a60 [NETFILTER]: Prepare {ipt,ip6t}_policy match for x_tables unification
The IPv4 and IPv6 version of the policy match are identical besides address
comparison and the data structure used for userspace communication. Unify
the data structures to break compatiblity now (before it is released), so
we can port it to x_tables in 2.6.17.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:28 -08:00
Patrick McHardy
e55f1bc5dc [NETFILTER]: Check policy length in policy match strict mode
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:26 -08:00
Kirill Korotaev
ee4bb818ae [NETFILTER]: Fix possible overflow in netfilters do_replace()
netfilter's do_replace() can overflow on addition within SMP_ALIGN()
and/or on multiplication by NR_CPUS, resulting in a buffer overflow on
the copy_from_user().  In practice, the overflow on addition is
triggerable on all systems, whereas the multiplication one might require
much physical memory to be present due to the check above.  Either is
sufficient to overwrite arbitrary amounts of kernel memory.

I really hate adding the same check to all 4 versions of do_replace(),
but the code is duplicate...

Found by Solar Designer during security audit of OpenVZ.org

Signed-Off-By: Kirill Korotaev <dev@openvz.org>
Signed-Off-By: Solar Designer <solar@openwall.com>
Signed-off-by: Patrck McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:25 -08:00
Patrick McHardy
6f16930078 [NETFILTER]: Fix missing src port initialization in tftp expectation mask
Reported by David Ahern <dahern@avaya.com>, netfilter bugzilla #426.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:21 -08:00
Patrick McHardy
ad2ad0f965 [NETFILTER]: Fix undersized skb allocation in ipt_ULOG/ebt_ulog/nfnetlink_log
The skb allocated is always of size nlbufsize, even if that is smaller than
the size needed for the current packet.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:19 -08:00
Holger Eitzenberger
c2db292438 [NETFILTER]: ULOG/nfnetlink_log: Use better default value for 'nlbufsiz'
Performance tests showed that ULOG may fail on heavy loaded systems
because of failed order-N allocations (N >= 1).

The default value of 4096 is not optimal in the sense that it actually
allocates _two_ contigous physical pages.  Reasoning: ULOG uses
alloc_skb(), which adds another ~300 bytes for skb_shared_info.

This patch sets the default value to NLMSG_GOODSIZE and adds some
documentation at the top.

Signed-off-by: Holger Eitzenberger <heitzenberger@astaro.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:18 -08:00
Pablo Neira Ayuso
34f9a2e4de [NETFILTER]: ctnetlink: add MODULE_ALIAS for expectation subsystem
Add load-on-demand support for expectation request. eg. conntrack -L expect

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:16 -08:00
Marcus Sundberg
b633ad5fbf [NETFILTER]: ctnetlink: Fix subsystem used for expectation events
The ctnetlink expectation events should use the NFNL_SUBSYS_CTNETLINK_EXP
subsystem, not NFNL_SUBSYS_CTNETLINK.

Signed-off-by: Marcus Sundberg <marcus@ingate.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:15 -08:00
Herbert Xu
fa60cf7f64 [ICMP]: Fix extra dst release when ip_options_echo fails
When two ip_route_output_key lookups in icmp_send were combined I
forgot to change the error path for ip_options_echo to not drop the
dst reference since it now sits before the dst lookup.  To fix it we
simply jump past the ip_rt_put call.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-04 23:51:14 -08:00
Horms
f00c401b9b [IPV4]: Remove suprious use of goto out: in icmp_reply
This seems to be an artifact of the follwoing commit in February '02.

e7e173af42dbf37b1d946f9ee00219cb3b2bea6a

In a nutshell, goto out and return actually do the same thing,
and both are called in this function. This patch removes out.

Signed-Off-By: Horms <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-02 17:03:18 -08:00
Herbert Xu
f8addb3215 [IPV4] multipath_wrandom: Fix softirq-unsafe spin lock usage
The spin locks in multipath_wrandom may be obtained from either process
context or softirq context depending on whether the packet is locally
or remotely generated.  Therefore we need to disable BH processing when
taking these locks.

This bug was found by Ingo's lock validator.
 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-02 16:59:16 -08:00
Sam Ravnborg
f9d9516db7 [NET]: Do not export inet_bind_bucket_create twice.
inet_bind_bucket_create was exported twice.  Keep the export in the
file where inet_bind_bucket_create is defined.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-31 17:47:02 -08:00
Patrick McHardy
5d39a795bf [IPV4]: Always set fl.proto in ip_route_newports
ip_route_newports uses the struct flowi from the struct rtable returned
by ip_route_connect for the new route lookup and just replaces the port
numbers if they have changed. If an IPsec policy exists which doesn't match
port 0 the struct flowi won't have the proto field set and no xfrm lookup
is done for the changed ports.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-31 17:35:35 -08:00
Linus Torvalds
dd1c1853e2 Fix ipv4/igmp.c compile with gcc-4 and IP_MULTICAST
Modern versions of gcc do not like case statements at the end of a block
statement: you need at least an empty statement.  Using just a "break;"
is preferred for visual style.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-31 13:11:41 -08:00
Baruch Even
2c74088e41 [TCP] H-TCP: Fix accounting
This fixes the accounting in H-TCP, the ccount variable is also
adjusted a few lines above this one.

This line was not supposed to be there and wasn't there in the patches
originally submitted, the four patches submitted were merged to one
and in that merge the bug was introduced.

Signed-Off-By: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-30 20:54:39 -08:00
Dave Jones
c5d90e0004 [IPV4] igmp: remove pointless printk
This is easily triggerable by sending bogus packets,
allowing a malicious user to flood remote logs.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-30 20:27:17 -08:00
Alan Cox
715b49ef2d [PATCH] EDAC: atomic scrub operations
EDAC requires a way to scrub memory if an ECC error is found and the chipset
does not do the work automatically.  That means rewriting memory locations
atomically with respect to all CPUs _and_ bus masters.  That means we can't
use atomic_add(foo, 0) as it gets optimised for non-SMP

This adds a function to include/asm-foo/atomic.h for the platforms currently
supported which implements a scrub of a mapped block.

It also adjusts a few other files include order where atomic.h is included
before types.h as this now causes an error as atomic_scrub uses u32.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:30 -08:00
David L Stevens
ad12583f46 [IPV4]: Fix multiple bugs in IGMPv3
1) fix "mld_marksources()" to
        a) send nothing when all queried sources are excluded
        b) send full exclude report when source queried sources are
                not excluded
        c) don't schedule a timer when there's nothing to report

2) fix "add_grec()" to send empty-source records when it should
        The original check doesn't account for a non-empty source
        list with all sources inactive; the new code keeps that
        short-circuit case, and also generates the group header
        with an empty list if needed.

3) fix mca_crcount decrement to be after add_grec(), which needs
        its original value

4) add/remove delete records and prevent current advertisements
        when an exclude-mode filter moves from "active" to "inactive"
        or vice versa based on new filter additions.

        Items 1-3 are just IPv4 versions of the IPv6 bugs found
by Yan Zheng and fixed earlier. Item #4 is a related bug that
affects exclude-mode change records only (but not queries) and
also occurs in IPv6 (IPv6 version coming soon).

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-18 14:20:56 -08:00
Andrew Morton
dbd2915ce8 [IPV4]: RT_CACHE_STAT_INC() warning fix
BUG: using smp_processor_id() in preemptible [00000001] code: rpc.statd/2408

And it _is_ a bug, but I guess we don't care enough to add preempt_disable().

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-17 22:46:49 -08:00
Eric Dumazet
2f970d8357 [IPV4]: rt_cache_stat can be statically defined
Using __get_cpu_var(obj) is slightly faster than per_cpu_ptr(obj, 
raw_smp_processor_id()).

1) Smaller code and memory use
For static and small objects, DEFINE_PER_CPU(type, object) is preferred over a 
alloc_percpu() : Better and smaller code to access them, and no extra memory 
(storing the pointer, and the percpu array of pointers)

x86_64 code before patch

mov    1237577(%rip),%rax        # ffffffff803e5990 <rt_cache_stat>
not    %rax  # part of per_cpu machinery
mov    %gs:0x3c,%edx # get cpu number
movslq %edx,%rdx # extend 32 bits cpu number to 64 bits
mov    (%rax,%rdx,8),%rax # get the pointer for this cpu
incl   0x38(%rax)

x86_64 code after patch

mov    $per_cpu__rt_cache_stat,%rdx
mov    %gs:0x48,%rax # get percpu data offset
incl   0x38(%rax,%rdx,1)

2) False sharing avoidance for SMP :
For a small NR_CPUS, the array of per cpu pointers allocated in alloc_percpu() 
can be <= 32 bytes. This let slab code gives a part of a cache line. If the 
other part of this 64 bytes (or 128 bytes) cache line is used by a mostly 
written object, we can have false sharing and expensive per_cpu_ptr() operations.

Size of rt_cache_stat is 64 bytes, so this patch is not a danger of a too big 
increase of bss (in UP mode) or static per_cpu data for SMP 
(PERCPU_ENOUGH_ROOM is currently 32768 bytes)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-17 02:54:36 -08:00
David S. Miller
f09484ff87 [NETFILTER]: ip_conntrack_proto_gre.c needs linux/interrupt.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-17 02:42:02 -08:00
Yasuyuki Kozakai
6dd42af790 [NETFILTER] Makefile cleanup
These are replaced with x_tables matches and no longer exist.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-17 02:38:56 -08:00
Benoit Boissinot
ccc91324a1 [NETFILTER] ip[6]t_policy: Fix compilation warnings
ip[6]t_policy argument conversion slipped when merging with x_tables

Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-17 02:26:34 -08:00
Linus Torvalds
caf5b04c82 x86: Work around compiler code generation bug with -Os
Some versions of gcc generate incorrect code for the inet_check_attr()
function, apparently due to a totally bogus index -> pointer comparison
transformation.

At least "gcc version 4.0.1 20050727 (Red Hat 4.0.1-5)" from FC4 is
affected, possibly others too.

This changes the function subtly so that the buggy gcc transformation
doesn't trigger.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14 22:08:28 -08:00
Patrick McHardy
ee51b1b6ce [XFRM]: IPsec tunnel wildcard address support
When the source address of a tunnel is given as 0.0.0.0 do a routing lookup
to get the real source address for the destination and fill that into the
acquire message. This allows to specify policies like this:

spdadd 172.16.128.13/32 172.16.0.0/20 any -P out ipsec
        esp/tunnel/0.0.0.0-x.x.x.x/require;
spdadd 172.16.0.0/20 172.16.128.13/32 any -P in ipsec
        esp/tunnel/x.x.x.x-0.0.0.0/require;

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-13 14:34:36 -08:00
Harald Welte
2e4e6a17af [NETFILTER] x_tables: Abstraction layer for {ip,ip6,arp}_tables
This monster-patch tries to do the best job for unifying the data
structures and backend interfaces for the three evil clones ip_tables,
ip6_tables and arp_tables.  In an ideal world we would never have
allowed this kind of copy+paste programming... but well, our world
isn't (yet?) ideal.

o introduce a new x_tables module
o {ip,arp,ip6}_tables depend on this x_tables module
o registration functions for tables, matches and targets are only
  wrappers around x_tables provided functions
o all matches/targets that are used from ip_tables and ip6_tables
  are now implemented as xt_FOOBAR.c files and provide module aliases
  to ipt_FOOBAR and ip6t_FOOBAR
o header files for xt_matches are in include/linux/netfilter/,
  include/linux/netfilter_{ipv4,ipv6} contains compatibility wrappers
  around the xt_FOOBAR.h headers

Based on this patchset we're going to further unify the code,
gradually getting rid of all the layer 3 specific assumptions.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-12 14:06:43 -08:00
Randy Dunlap
4fc268d24c [PATCH] capable/capability.h (net/)
net: Use <linux/capability.h> where capable() is used.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:14 -08:00
Kris Katterjohn
8b3a70058b [NET]: Remove more unneeded typecasts on *malloc()
This removes more unneeded casts on the return value for kmalloc(),
sock_kmalloc(), and vmalloc().

Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-11 16:32:14 -08:00
David S. Miller
a776809755 [NETFILTER]: ip_ct_proto_gre_fini() cannot be __exit
It is invoked from failures paths of __init code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-11 16:32:12 -08:00
Nicolas Kaiser
b8ab50bc55 netfilter: headers included twice
Headers included twice.

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-01-11 02:04:35 +01:00
Patrick McHardy
babbdb1a18 [NETFILTER]: Fix timeout sysctls on big-endian 64bit architectures
The connection tracking timeout variables are unsigned long, but
proc_dointvec_jiffies is used with sizeof(unsigned int) in the sysctl
tables. Since there is no proc_doulongvec_jiffies function, change the
timeout variables to unsigned int.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-10 12:54:35 -08:00
Patrick McHardy
9d28026b7e [NETFILTER]: Remove unused function from NAT protocol helpers
->print and ->print_range are not used (and apparently never were).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-10 12:54:34 -08:00
Patrick McHardy
c07bc1ffbd [NETFILTER]: Fix return value confusion in PPTP NAT helper
ip_nat_mangle_tcp_packet doesn't return NF_* values but 0/1 for
failure/success.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-10 12:54:33 -08:00
Patrick McHardy
03b9feca89 [NETFILTER]: Fix another crash in ip_nat_pptp
The PPTP NAT helper calculates the offset at which the packet needs
to be mangled as difference between two pointers to the header. With
non-linear skbs however the pointers may point to two seperate buffers
on the stack and the calculation results in a wrong offset beeing
used.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-10 12:54:32 -08:00
Patrick McHardy
15db34702c [NETFILTER]: Fix crash in ip_nat_pptp
When an inbound PPTP_IN_CALL_REQUEST packet is received the
PPTP NAT helper uses a NULL pointer in pointer arithmentic to
calculate the offset in the packet which needs to be mangled
and corrupts random memory or crashes.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-10 12:54:30 -08:00
Patrick McHardy
bb94aa169e [NETFILTER]: net/ipv[46]/netfilter.c cleanups
Don't wrap entire file in #ifdef CONFIG_NETFILTER, remove a few
unneccessary includes.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-10 12:54:29 -08:00
Kris Katterjohn
d3f4a687f6 [NET]: Change memcmp(,,ETH_ALEN) to compare_ether_addr()
This changes some memcmp(one,two,ETH_ALEN) to compare_ether_addr(one,two).

Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-10 12:54:28 -08:00
Linus Torvalds
a457aa6c2b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial 2006-01-09 17:06:53 -08:00
Adrian Bunk
93b1fae491 spelling: s/trough/through/
Additionally, one comment was reformulated by Joe Perches <joe@perches.com>.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-01-10 00:13:33 +01:00
Arnaldo Carvalho de Melo
dff2c03534 [INET_DIAG]: Introduce sk_diag_fill
To be called from inet_diag_get_exact, also rename inet_diag_fill to
inet_csk_diag_fill, for consistency with inet_twsk_diag_fill.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-09 14:56:56 -08:00
Arnaldo Carvalho de Melo
c7d58aabdc [INET_DIAG]: Introduce inet_twsk_diag_dump & inet_twsk_diag_fill
To properly dump TIME_WAIT sockets and to reduce complexity a bit by
having per socket class accessor routines.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-09 14:56:38 -08:00
Arnaldo Carvalho de Melo
4e852c0279 [INET_DIAG]: whitespace/simple cleanups
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-09 14:56:19 -08:00
Arnaldo Carvalho de Melo
7dbf075524 [INET_DIAG]: Use inet_twsk() with TIME_WAIT sockets
The fields being accessed in inet_diag_dump are outside sock_common, the
common part of struct sock and struct inet_timewait_sock.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-09 14:56:03 -08:00
Patrick McHardy
cfacb0577e [IPV4]: ip_output.c needs xfrm.h
This patch fixes a warning from my IPsec patches:

   CC      net/ipv4/ip_output.o
net/ipv4/ip_output.c: In function 'ip_finish_output':
net/ipv4/ip_output.c:208: warning: implicit declaration of function
'xfrm4_output_finish'

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-09 14:16:28 -08:00
Kris Katterjohn
09a626600b [NET]: Change some "if (x) BUG();" to "BUG_ON(x);"
This changes some simple "if (x) BUG();" statements to "BUG_ON(x);"

Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-09 14:16:18 -08:00
Patrick McHardy
2941a48631 [NET]: Convert net/{ipv4,ipv6,sched} to netdev_priv
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-09 14:16:03 -08:00
Adrian Bunk
97dc627fb3 [IPV4]: make ip_fragment() static
Since there's no longer any external user of ip_fragment() we can make 
it static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 13:23:39 -08:00
Joe Kappus
da7bc6ee8e [NETFILTER]: ip_conntrack_proto_sctp.c needs linux/interrupt.h
Signed-off-by: Joe Kappus <joecool1029@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:41 -08:00
Patrick McHardy
e16a8f0b8c [NETFILTER]: Add ipt_policy/ip6t_policy matches
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:38 -08:00
Patrick McHardy
eb9c7ebe69 [NETFILTER]: Handle NAT in IPsec policy checks
Handle NAT of decapsulated IPsec packets by reconstructing the struct flowi
of the original packet from the conntrack information for IPsec policy
checks.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:37 -08:00
Patrick McHardy
b59c270104 [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
Keep the conntrack reference until policy checks have been performed for
IPsec NAT support. The reference needs to be dropped before a packet is
queued to avoid having the conntrack module unloadable.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:36 -08:00
Patrick McHardy
5c901daaea [NETFILTER]: Redo policy lookups after NAT when neccessary
When NAT changes the key used for the xfrm lookup it needs to be done
again. If a new policy is returned in POST_ROUTING the packet needs
to be passed to xfrm4_output_one manually after all hooks were called
because POST_ROUTING is called with fixed okfn (ip_finish_output).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:35 -08:00
Patrick McHardy
4e8e9de7c2 [NETFILTER]: Use conntrack information to determine if packet was NATed
Preparation for IPsec support for NAT:
Use conntrack information instead of saving the saving and comparing the
addresses to determine if a packet was NATed and needs to be rerouted to
make it easier to extend the key.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:34 -08:00
Patrick McHardy
3e3850e989 [NETFILTER]: Fix xfrm lookup in ip_route_me_harder/ip6_route_me_harder
ip_route_me_harder doesn't use the port numbers of the xfrm lookup and
uses ip_route_input for non-local addresses which doesn't do a xfrm
lookup, ip6_route_me_harder doesn't do a xfrm lookup at all.

Use xfrm_decode_session and do the lookup manually, make sure both
only do the lookup if the packet hasn't been transformed already.

Makeing sure the lookup only happens once needs a new field in the
IP6CB, which exceeds the size of skb->cb. The size of skb->cb is
increased to 48b. Apparently the IPv6 mobile extensions need some
more room anyway.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:33 -08:00
Patrick McHardy
8cdfab8a43 [IPV4]: reset IPCB flags when neccessary
Reset IPSKB_XFRM_TUNNEL_SIZE flags in ipip and ip_gre hard_start_xmit
function before the packet reenters IP. This is neccessary so the
encapsulated packets are checked not to be oversized in xfrm4_output.c
again. Reset all flags in sit when a packet changes its address family.

Also remove some obsolete IPSKB flags.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:32 -08:00
Patrick McHardy
b05e106698 [IPV4/6]: Netfilter IPsec input hooks
When the innermost transform uses transport mode the decapsulated packet
is not visible to netfilter. Pass the packet through the PRE_ROUTING and
LOCAL_IN hooks again before handing it to upper layer protocols to make
netfilter-visibility symetrical to the output path.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:31 -08:00
Patrick McHardy
16a6677fdf [XFRM]: Netfilter IPsec output hooks
Call netfilter hooks before IPsec transforms. Packets visit the
FORWARD/LOCAL_OUT and POST_ROUTING hook before the first encapsulation
and the LOCAL_OUT and POST_ROUTING hook before each following tunnel mode
transform.

Patch from Herbert Xu <herbert@gondor.apana.org.au>:

Move the loop from dst_output into xfrm4_output/xfrm6_output since they're
the only ones who need to it. xfrm{4,6}_output_one() processes the first SA
all subsequent transport mode SAs and is called in a loop that calls the
netfilter hooks between each two calls.

In order to avoid the tail call issue, I've added the inline function
nf_hook which is nf_hook_slow plus the empty list check.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-07 12:57:28 -08:00
Linus Torvalds
d8d8f6a4fd Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2006-01-06 15:24:28 -08:00
Alexey Dobriyan
76ab608d86 [NET]: Endian-annotate struct iphdr
And fix trivial warnings that emerged.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-06 13:24:29 -08:00
Joe
3cbc4ab58f [NETFILTER]: ipt_helper.c needs linux/interrupt.h
From: Joe <joecool1029@gmail.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-06 13:15:11 -08:00
Sam Ravnborg
367cb70421 kbuild: un-stringnify KBUILD_MODNAME
Now when kbuild passes KBUILD_MODNAME with "" do not __stringify it when
used. Remove __stringnify for all users.
This also fixes the output of:

$ ls -l /sys/module/
drwxr-xr-x 4 root root 0 2006-01-05 14:24 pcmcia
drwxr-xr-x 4 root root 0 2006-01-05 14:24 pcmcia_core
drwxr-xr-x 3 root root 0 2006-01-05 14:24 "processor"
drwxr-xr-x 3 root root 0 2006-01-05 14:24 "psmouse"

The quoting of the module names will be gone again.
Thanks to GregKH + Kay Sievers for reproting this.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2006-01-06 21:17:50 +01:00
Kris Katterjohn
46f25dffba [NET]: Change 1500 to ETH_DATA_LEN in some files
These patches add the header linux/if_ether.h and change 1500 to
ETH_DATA_LEN in some files.

Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 16:48:56 -08:00
Andrew Morton
e924283bf9 [IPVS]: Another file needs linux/interrupt.h
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 16:48:55 -08:00
Yasuyuki Kozakai
e8eaedf2f8 [NETFILTER]: Use HOPLIMIT metric as TTL of TCP reset sent by REJECT
HOPLIMIT metric is appropriate to TCP reset sent by REJECT target
than hard-coded max TTL. Thanks to David S. Miller for hint.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:28:57 -08:00
Patrick McHardy
0ae2cfe7f3 [NETFILTER]: nf_conntrack_l3proto_ipv4.c needs net/route.h
CC [M]  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.o
net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c: In function 'ipv4_refrag':
net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c:198: error: dereferencing pointer to incomplete type
make[3]: *** [net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.o] Error 1

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:21:52 -08:00
Patrick McHardy
1bd9bef6f9 [NETFILTER]: Call POST_ROUTING hook before fragmentation
Call POST_ROUTING hook before fragmentation to get rid of the okfn use
in ip_refrag and save the useless fragmentation/defragmentation step
when NAT is used.

The patch introduces one user-visible change, the POSTROUTING chain
in the mangle table gets entire packets, not fragments, which should
simplify use of the MARK and CLASSIFY targets for queueing as a nice
side-effect.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:20:59 -08:00
Patrick McHardy
abbcc73982 [NETFILTER]: Remove okfn usage in ip_vs_core.c
okfn should only be used from different contexts to avoid deep call chains,
i.e. by nf_queue.

Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:20:40 -08:00
Patrick McHardy
a9b305c4e5 [NETFILTER]: ctnetlink: Fix dumping of helper name
Properly dump the helper name instead of internal kernel data.
Based on patch by Marcus Sundberg <marcus@ingate.com>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:20:02 -08:00
Patrick McHardy
e7be6994ec [NETFILTER]: Fix module_param types and permissions
Fix netfilter module_param types and permissions. Also fix an off-by-one in
the ipt_ULOG nlbufsiz < 128k check.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:19:46 -08:00
Pablo Neira Ayuso
c1d10adb4a [NETFILTER]: Add ctnetlink port for nf_conntrack
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:19:05 -08:00
Pablo Neira Ayuso
205d67c7d9 [NETFILTER]: ctnetlink: remove unused variable
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:18:44 -08:00
Pablo Neira Ayuso
d4d6bb41e0 [NETFILTER]: ctnetlink: fix conntrack mark race
Set conntrack mark before it is in hashes.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:18:25 -08:00
Pablo Neira Ayuso
0368309cb4 [NETFILTER]: ctnetlink: ctnetlink_event cleanup
Cleanup: Use 'else if' instead of a ugly 'goto' statement.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:18:08 -08:00
Pablo Neira Ayuso
47116eb201 [NETFILTER]: ctnetlink: use u_int32_t instead of unsigned int
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:17:50 -08:00
Pablo Neira Ayuso
984955b3d7 [NETFILTER]: ctnetlink: propagate ctnetlink_dump_tuples_proto return value back
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:17:29 -08:00
Yasuyuki Kozakai
90c4656eb4 [NETFILTER]: ctnetlink: Add sanity checkings for ICMP
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:17:03 -08:00
Pablo Neira Ayuso
684f7b296c [NETFILTER]: ctnetlink: remove bogus checks in ICMP protocol at dumping
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:16:41 -08:00
Adrian Bunk
4ffd2e4907 [IPVS]: Fix compilation
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-05 12:14:43 -08:00
Thomas Young
74cb879822 [TCP] tcp_vegas: Fix slow start
Vegas' slow start was only adding one MSS per RTT rather than one for
every ack. Slow start behavior should now match Reno.

Signed-off-by: Thomas Young <tyo@ee.mu.oz.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-04 13:59:32 -08:00
Arnaldo Carvalho de Melo
f190055ff5 [IPVS]: Add missing include <linux/net.h>
CC [M]  net/ipv4/ipvs/ip_vs_conn.o
  /pub/scm/linux/kernel/git/acme/net-2.6/net/ipv4/ipvs/ip_vs_conn.c: In
  function 'ip_vs_conn_new':
  /pub/scm/linux/kernel/git/acme/net-2.6/net/ipv4/ipvs/ip_vs_conn.c:606:
  warning: implicit declaration of function 'net_ratelimit'
  /pub/scm/linux/kernel/git/acme/net-2.6/net/ipv4/ipvs/ip_vs_conn.c: In
  function 'ip_vs_random_dropentry':
  /pub/scm/linux/kernel/git/acme/net-2.6/net/ipv4/ipvs/ip_vs_conn.c:810:
  warning: implicit declaration of function 'net_random'

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2006-01-04 02:02:20 -02:00
Arnaldo Carvalho de Melo
80e40daa47 [TCP]: syn_flood_warning is only needed if CONFIG_SYN_COOKIES is selected
CC      net/ipv4/tcp_ipv4.o
  /pub/scm/linux/kernel/git/acme/net-2.6/net/ipv4/tcp_ipv4.c:665: warning:
  'syn_flood_warning' defined but not used

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2006-01-04 01:58:06 -02:00
Stephen Hemminger
40efc6fa17 [TCP]: less inline's
TCP inline usage cleanup:
 * get rid of inline in several places
 * replace __inline__ with inline where possible
 * move functions used in one file out of tcp.h
 * let compiler decide on used once cases

On x86_64: 
   text	   data	    bss	    dec	    hex	filename
3594701	 648348	 567400	4810449	 4966d1	vmlinux.orig
3593133	 648580	 567400	4809113	 496199	vmlinux

On sparc64:
   text	   data	    bss	    dec	    hex	filename
2538278	 406152	 530392	3474822	 350586	vmlinux.ORIG
2536382	 406384	 530392	3473158	 34ff06	vmlinux

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 16:03:49 -08:00
Stephen Hemminger
cd8787ab04 [IPV4] fib_trie: build fix
Need this to fix build of fib_trie in net-2.6.16 (rebased) tree.
The code needs the new inet_make_mask inline.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 14:38:34 -08:00
Roberto Nibali
4b5bdf5cc3 [IPVS]: Cleanup IP_VS_DBG statements.
From: Roberto Nibali <ratz@drugphish.ch>

The attached patch (against current -GIT) is a cleanup patch which does
following:

o lookup debug messages shifted back to 9
o added more informational value to flags and refcnt since those
entries can be in multiple referenced structures
o cleanup 80 char violation

It's the prepatch to the session pool implementation and helps very much
to debug and monitor important variables and structures regarding the
threshold limitation and persistency without the thousands of lookup
messages which noone is interested in.

Signed-off-by: Horms <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 14:22:59 -08:00
Christoph Hellwig
b5e5fa5e09 [NET]: Add a dev_ioctl() fallback to sock_ioctl()
Currently all network protocols need to call dev_ioctl as the default
fallback in their ioctl implementations.  This patch adds a fallback
to dev_ioctl to sock_ioctl if the protocol returned -ENOIOCTLCMD.
This way all the procotol ioctl handlers can be simplified and we don't
need to export dev_ioctl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 14:18:33 -08:00
Arnaldo Carvalho de Melo
14c850212e [INET_SOCK]: Move struct inet_sock & helper functions to net/inet_sock.h
To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.

Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:11:21 -08:00
Eric Dumazet
90ddc4f047 [NET]: move struct proto_ops to const
I noticed that some of 'struct proto_ops' used in the kernel may share
a cache line used by locks or other heavily modified data. (default
linker alignement is 32 bytes, and L1_CACHE_LINE is 64 or 128 at
least)

This patch makes sure a 'struct proto_ops' can be declared as const,
so that all cpus can share all parts of it without false sharing.

This is not mandatory : a driver can still use a read/write structure
if it needs to (and eventually a __read_mostly)

I made a global stubstitute to change all existing occurences to make
them const.

This should reduce the possibility of false sharing on SMP, and
speedup some socket system calls.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:11:15 -08:00
Robert Olsson
fd9662555c [IPV4] fib_trie: Add credits.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:11:10 -08:00
Stephen Hemminger
9eb2d62719 [TCP] cubic: use Newton-Raphson
Replace cube root algorithim with a faster version using Newton-Raphson.
Surprisingly, doing the scaled div64_64 is faster than a true 64 bit
division on 64 bit CPU's.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:11:09 -08:00
Stephen Hemminger
89b3d9aaf4 [TCP] cubic: precompute constants
Revised version of patch to pre-compute values for TCP cubic.
  * d32,d64 replaced with descriptive names
  * cube_factor replaces
	 srtt[scaled by count] / HZ * ((1 << (10+2*BICTCP_HZ)) / bic_scale)
  * beta_scale replaces
	8*(BICTCP_BETA_SCALE+beta)/3/(BICTCP_BETA_SCALE-beta);

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:11:08 -08:00
Arnaldo Carvalho de Melo
d83d8461f9 [IP_SOCKGLUE]: Remove most of the tcp specific calls
As DCCP needs to be called in the same spots.

Now we have a member in inet_sock (is_icsk), set at sock creation time from
struct inet_protosw->flags (if INET_PROTOSW_ICSK is set, like for TCP and
DCCP) to see if a struct sock instance is a inet_connection_sock for places
like the ones in ip_sockglue.c (v4 and v6) where we previously were looking if
sk_type was SOCK_STREAM, that is insufficient because we now use the same code
for DCCP, that has sk_type SOCK_DCCP.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:58 -08:00
Arnaldo Carvalho de Melo
a7f5e7f164 [INET]: Generalise tcp_v4_hash_connect
Renaming it to inet_hash_connect, making it possible to ditch
dccp_v4_hash_connect and share the same code with TCP instead.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:55 -08:00
Arnaldo Carvalho de Melo
6d6ee43e0b [TWSK]: Introduce struct timewait_sock_ops
So that we can share several timewait sockets related functions and
make the timewait mini sockets infrastructure closer to the request
mini sockets one.

Next changesets will take advantage of this, moving more code out of
TCP and DCCP v4 and v6 to common infrastructure.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:54 -08:00
Arnaldo Carvalho de Melo
0fa1a53e1f [IPV6]: Introduce inet6_timewait_sock
Out of tcp6_timewait_sock, that now is just an aggregation of
inet_timewait_sock and inet6_timewait_sock, using tw_ipv6_offset in struct
inet_timewait_sock, that is common to the IPv6 transport protocols that use
timewait sockets, like DCCP and TCP.

tw_ipv6_offset plays the struct inet_sock pinfo6 role, i.e. for the generic
code to find the IPv6 area in a timewait sock.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:47 -08:00
Roberto Nibali
f1f71e03b1 [IPVS]: remove dead code
This patch removes dead code. I don't see the reason to keep this cruft
around, besides cluttering the nice and functionally working code.

Signed-off-by: Roberto Nibali <ratz@drugphish.ch>
Signed-off-by: Horms <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:43 -08:00
Stephen Hemminger
65a45441d7 [UDP]: udp_checksum_init return value
Since udp_checksum_init always returns 0 there is no point in
having it return a value.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:42 -08:00
Herbert Xu
3305b80c21 [IP]: Simplify and consolidate MSG_PEEK error handling
When a packet is obtained from skb_recv_datagram with MSG_PEEK enabled
it is left on the socket receive queue.  This means that when we detect
a checksum error we have to be careful when trying to free the packet
as someone could have dequeued it in the time being.

Currently this delicate logic is duplicated three times between UDPv4,
UDPv6 and RAWv6.  This patch moves them into a one place and simplifies
the code somewhat.

This is based on a suggestion by Eric Dumazet.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:41 -08:00
Arnaldo Carvalho de Melo
af05dc9394 [ICSK]: Move v4_addr2sockaddr from TCP to icsk
Renaming it to inet_csk_addr2sockaddr.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:39 -08:00
Arnaldo Carvalho de Melo
8292a17a39 [ICSK]: Rename struct tcp_func to struct inet_connection_sock_af_ops
And move it to struct inet_connection_sock. DCCP will use it in the
upcoming changesets.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:38 -08:00
Arnaldo Carvalho de Melo
ca304b6104 [IPV6]: Introduce inet6_rsk()
And inet6_rsk_offset in inet_request_sock, for the same reasons as
inet_sock's pinfo6 member.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:37 -08:00
Arnaldo Carvalho de Melo
c2977c2213 [ICSK]: make inet_csk_reqsk_queue_hash_add timeout arg unsigned long
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:34 -08:00
Arnaldo Carvalho de Melo
971af18bbf [IPV6]: Reuse inet_csk_get_port in tcp_v6_get_port
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:33 -08:00
Herbert Xu
89cee8b1cb [IPV4]: Safer reassembly
Another spin of Herbert Xu's "safer ip reassembly" patch
for 2.6.16.

(The original patch is here:
http://marc.theaimsgroup.com/?l=linux-netdev&m=112281936522415&w=2
and my only contribution is to have tested it.)

This patch (optionally) does additional checks before accepting IP
fragments, which can greatly reduce the possibility of reassembling
fragments which originated from different IP datagrams.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:31 -08:00
Eric Dumazet
3183606469 [NETFILTER] ip_tables: NUMA-aware allocation
Part of a performance problem with ip_tables is that memory allocation
is not NUMA aware, but 'only' SMP aware (ie each CPU normally touch
separate cache lines)

Even with small iptables rules, the cost of this misplacement can be
high on common workloads.  Instead of using one vmalloc() area
(located in the node of the iptables process), we now allocate an area
for each possible CPU, using vmalloc_node() so that memory should be
allocated in the CPU's node if possible.

Port to arp_tables and ip6_tables by Harald Welte.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:29 -08:00
Stephen Hemminger
df3271f336 [TCP] BIC: CUBIC window growth (2.0)
Replace existing BIC version 1.1 with new version 2.0.
The main change is to replace the window growth function
with a cubic function as described in:
  http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/cubic-paper.pdf

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:28 -08:00
Stephen Hemminger
05d054503a [TCP] BIC: spelling and whitespace
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:27 -08:00
Stephen Hemminger
018da8f44c [TCP] BIC: remove low utilization code.
The latest BICTCP patch at:
http://www.csc.ncsu.edu:8080/faculty/rhee/export/bitcp/index_files/Page546.htm

disables the low_utilization feature of BICTCP because it doesn't work
in some cases. This patch removes it.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:10:26 -08:00
Patrick McHardy
9e999993c7 [XFRM]: Handle DCCP in xfrm{4,6}_decode_session
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-19 14:03:46 -08:00
Patrick McHardy
0476f171af [NETFILTER]: Fix NAT init order
As noticed by Phil Oester, the GRE NAT protocol helper is initialized
before the NAT core, which makes registration fail.

Change the linking order to make NAT be initialized first.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-19 13:53:09 -08:00
Herbert Xu
1542272a60 [GRE]: Fix hardware checksum modification
The skb_postpull_rcsum introduced a bug to the checksum modification.
Although the length pulled is offset bytes, the origin of the pulling
is the GRE header, not the IP header.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-14 12:55:24 -08:00
Marcus Sundberg
2f9616d4c4 [NETFILTER]: ip_nat_tftp: Fix expectation NAT
When a TFTP client is SNATed so that the port is also changed, the
port is never changed back for the expected connection.

Signed-off-by: Marcus Sundberg <marcus@ingate.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-12 15:02:48 -08:00
David S. Miller
dfb4b9dceb [TCP] Vegas: timestamp before clone
We have to store the congestion control timestamp on the SKB before we
clone it, not after.  Else we get no timestamping information at all.

tcp_transmit_skb() has been reworked so that we can do the timestamp
still in one spot, instead of at all the call sites.

Problem discovered, and initial fix, from Tom Young
<tyo@ee.unimelb.edu.au>.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-06 16:24:52 -08:00
Thomas Young
0d7bef600a [TCP] Vegas: Remove extra call to tcp_vegas_rtt_calc
Remove unneeded call to tcp_vegas_rtt_calc. The more accurate
microsecond value has already been registered prior to calling
tcp_vegas_cong_avoid.

Signed-off-by: Thomas Young <tyo@ee.mu.oz.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-06 16:17:11 -08:00
Thomas Young
5b49561381 [TCP] Vegas: stop resetting rtt every ack
Move the resetting of rtt measurements to inside the once per RTT
block of code.

Signed-off-by: Thomas Young <tyo@ee.mu.oz.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-06 16:16:34 -08:00
Patrick McHardy
2fdf1faa8e [NETFILTER]: Don't use conntrack entry after dropping the reference
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-05 13:38:16 -08:00
Patrick McHardy
266c854348 [NETFILTER]: Fix unbalanced read_unlock_bh in ctnetlink
NFA_NEST calls NFA_PUT which jumps to nfattr_failure if the skb has no
room left. We call read_unlock_bh at nfattr_failure for the NFA_PUT inside
the locked section, so move NFA_NEST inside the locked section too.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-05 13:37:33 -08:00
Patrick McHardy
a795756333 [NETFILTER]: Mark ctnetlink as EXPERIMENTAL
Should have been marked EXPERIMENTAL from the beginning, as the current
bunch of fixes show.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-05 13:36:25 -08:00
Patrick McHardy
0be7fa92ca [NETFILTER]: Fix CTA_PROTO_NUM attribute size in ctnetlink
CTA_PROTO_NUM is a u_int8_t.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-05 13:34:51 -08:00
Patrick McHardy
afe5c6bb03 [NETFILTER]: Fix ip_conntrack_flush abuse in ctnetlink
ip_conntrack_flush() used to be part of ip_conntrack_cleanup(), which needs
to drop _all_ references on module unload. Table flushed using ctnetlink
just needs to clean the table and doesn't need to flush the event cache or
wait for any references attached to skbs. Move everything but pure table
flushing back to ip_conntrack_cleanup().

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-05 13:33:50 -08:00
Pablo Neira Ayuso
8d1ca69984 [NETFILTER]: Fix incorrect argument to ip_nat_initialized() in ctnetlink
ip_nat_initialized() takes enum ip_nat_manip_type as it's second argument,
not a hook number.

Noticed and initial patch by Marcus Sundberg <marcus@ingate.com>.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-05 13:32:14 -08:00
Herbert Xu
86c8f9d158 [IPV4] Fix EPROTONOSUPPORT error in inet_create
There is a coding error in inet_create that causes it to always return
ESOCKTNOSUPPORT.  It should return EPROTONOSUPPORT when there are
protocols registered for a given socket type but none of them match
the requested protocol.

This is based on a patch by Jayachandran C.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-02 20:43:26 -08:00
David Stevens
24c6927505 [IGMP]: workaround for IGMP v1/v2 bug
From: David Stevens <dlstevens@us.ibm.com>

As explained at:

	http://www.cs.ucsb.edu/~krishna/igmp_dos/

With IGMP version 1 and 2 it is possible to inject a unicast
report to a client which will make it ignore multicast
reports sent later by the router.

The fix is to only accept the report if is was sent to a
multicast or unicast address.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-02 20:32:59 -08:00
Thomas Graf
ea86575eaf [NETLINK]: Fix processing of fib_lookup netlink messages
The receive path for fib_lookup netlink messages is lacking sanity
checks for header and payload and is thus vulnerable to malformed
netlink messages causing illegal memory references.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-01 14:30:00 -08:00
Phil Oester
2a43c4af3f [NETFILTER]: Fix recent match jiffies wrap mismatches
Around jiffies wrap time (i.e. within first 5 mins after boot), recent
match rules which contain both --seconds and --hitcount arguments
experience false matches.

This is because the last_pkts array is filled with zeros on creation, and
when comparing 'now' to 0 (+ --seconds argument), time_before_eq thinks it
has found a hit.

Below patch adds a break if the packet value is zero.  This has the
unfortunate side effect of causing mismatches if a packet was received
when jiffies really was equal to zero.  The odds of that happening are
slim compared to the problems caused by not adding the break however.
Plus, the author used this same method just below, so it is "good enough".

This fixes netfilter bugs #383 and #395.

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-01 14:29:24 -08:00
Jozsef Kadlecsik
73f306024c [NETFILTER]: Ignore ACKs ACKs on half open connections in TCP conntrack
Mounting NFS file systems after a (warm) reboot could take a long time if
firewalling and connection tracking was enabled.

The reason is that the NFS clients tends to use the same ports (800 and
counting down). Now on reboot, the server would still have a TCB for an
existing TCP connection client:800 -> server:2049. The client sends a
SYN from port 800 to server:2049, which elicits an ACK from the server.
The firewall on the client drops the ACK because (from its point of
view) the connection is still in half-open state, and it expects to see
a SYNACK.

The client will eventually time out after several minutes.

The following patch corrects this, by accepting ACKs on half open
connections as well.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-12-01 14:28:58 -08:00
Adrian Bunk
d127e94a5c [NETFILTER] ipv4: small cleanups
This patch contains the following cleanups:
- make needlessly global code static
- ip_conntrack_core.c: ip_conntrack_flush() -> ip_conntrack_flush(void)

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-29 16:28:18 -08:00
Adrian Bunk
4b30b1c6a3 [IPV4]: make two functions static
This patch makes two needlessly global functions static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-29 16:27:20 -08:00
Arjan van de Ven
9b5b5cff9a [NET]: Add const markers to various variables.
the patch below marks various variables const in net/; the goal is to
move them to the .rodata section so that they can't false-share
cachelines with things that get written to, as well as potentially
helping gcc a bit with optimisations.  (these were found using a gcc
patch to warn about such variables)

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-29 16:21:38 -08:00
Mike Stroyan
18955cfcb2 [IPV4] tcp/route: Another look at hash table sizes
The tcp_ehash hash table gets too big on systems with really big memory.
It is worse on systems with pages larger than 4KB.  It wastes memory that
could be better used.  It also makes the netstat command slow because reading
/proc/net/tcp and /proc/net/tcp6 needs to go through the full hash table.

  The default value should not be larger for larger page sizes.  It seems
that the effect of page size is an unintended error dating back a long
time.  I also wonder if the default value really should be a larger
fraction of memory for systems with more memory.  While systems with
really big ram can afford more space for hash tables, it is not clear to
me that they benefit from increasing the allocation ratio for this table.

  The amount of memory allocated is determined by net/ipv4/tcp.c:tcp_init and
mm/page_alloc.c:alloc_large_system_hash.

tcp_init calls alloc_large_system_hash passing parameters-
    bucketsize=sizeof(struct tcp_ehash_bucket)
    numentries=thash_entries
    scale=(num_physpages >= 128 * 1024) ? (25-PAGE_SHIFT) : (27-PAGE_SHIFT)
    limit=0

On i386, PAGE_SHIFT is 12 for a page size of 4K
On ia64, PAGE_SHIFT defaults to 14 for a page size of 16K

The num_physpages test above makes the allocation take a larger fraction
of the total memory on systems with larger memory.  The threshold size
for a i386 system is 512MB.  For an ia64 system with 16KB pages the
threshold is 2GB.

For smaller memory systems-
On i386, scale = (27 - 12) = 15
On ia64, scale = (27 - 14) = 13
For larger memory systems-
On i386, scale = (25 - 12) = 13
On ia64, scale = (25 - 14) = 11

  For the rest of this discussion, I'll just track the larger memory case.

  The default behavior has numentries=thash_entries=0, so the allocated
size is determined by either scale or by the default limit of 1/16 of
total memory.

In alloc_large_system_hash-
|	numentries = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
|	numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
|	numentries >>= 20 - PAGE_SHIFT;
|	numentries <<= 20 - PAGE_SHIFT;

  At this point, numentries is pages for all of memory, rounded up to the
nearest megabyte boundary.

|	/* limit to 1 bucket per 2^scale bytes of low memory */
|	if (scale > PAGE_SHIFT)
|		numentries >>= (scale - PAGE_SHIFT);
|	else
|		numentries <<= (PAGE_SHIFT - scale);

On i386, numentries >>= (13 - 12), so numentries is 1/8196 of
bytes of total memory.
On ia64, numentries <<= (14 - 11), so numentries is 1/2048 of
bytes of total memory.

|        log2qty = long_log2(numentries);
|
|        do {
|                size = bucketsize << log2qty;

bucketsize is 16, so size is 16 times numentries, rounded
down to a power of two.

On i386, size is 1/512 of bytes of total memory.
On ia64, size is 1/128 of bytes of total memory.

For smaller systems the results are
On i386, size is 1/2048 of bytes of total memory.
On ia64, size is 1/512 of bytes of total memory.

  The large page effect can be removed by just replacing
the use of PAGE_SHIFT with a constant of 12 in the calls to
alloc_large_system_hash.  That makes them more like the other uses of
that function from fs/inode.c and fs/dcache.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-29 16:12:55 -08:00
Benoit Boissinot
de919820cf [NETFILTER]: ip_conntrack_netlink.c needs linux/interrupt.h
net/ipv4/netfilter/ip_conntrack_netlink.c: In function 'ctnetlink_dump_table':
net/ipv4/netfilter/ip_conntrack_netlink.c:409: warning: implicit declaration of function 'local_bh_disable'
net/ipv4/netfilter/ip_conntrack_netlink.c:427: warning: implicit declaration of function 'local_bh_enable'

Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-23 19:03:46 -08:00
Pablo Neira Ayuso
00cb277a4a [NETFILTER] ctnetlink: Fix refcount leak ip_conntrack/nat_proto
Remove proto == NULL checking since ip_conntrack_[nat_]proto_find_get
always returns a valid pointer.

Fix missing ip_conntrack_proto_put in some paths.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-22 14:54:34 -08:00
Jamal Hadi Salim
0ff60a4567 [IPV4]: Fix secondary IP addresses after promotion
This patch fixes the problem with promoting aliases when:
a) a single primary and > 1 secondary addresses
b) multiple primary addresses each with at least one secondary address

Based on earlier efforts from Brian Pomerantz <bapper@piratehaven.org>,
Patrick McHardy <kaber@trash.net> and Thomas Graf <tgraf@suug.ch>

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-22 14:47:37 -08:00
Yasuyuki Kozakai
2b8f2ff6f4 [NETFILTER]: fixed dependencies between modules related with ip_conntrack
- IP_NF_CONNTRACK_MARK is bool and depends on only IP_NF_CONNTRACK
  which is tristate. If a variable depends on IP_NF_CONNTRACK_MARK and
  doesn't care about IP_NF_CONNTRACK, it can be y. This must be avoided.
- IP_NF_CT_ACCT has same problem.
- IP_NF_TARGET_CLUSTERIP also depends on IP_NF_MANGLE.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-20 21:09:55 -08:00
Patrick McHardy
c9e53cbe7a [FIB_TRIE]: Don't show local table in /proc/net/route output
Don't show local table to behave similar to fib_hash.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-20 21:09:00 -08:00
Harald Welte
2fce76afdb [NETFILTER] ip_conntrack: fix ftp/irc/tftp helpers on ports >= 32768
Since we've converted the ftp/irc/tftp helpers to use the new
module_parm_array() some time ago, we ware accidentially using signed data
types - thus preventing those modules from being used on ports >= 32768.

This patch fixes it by using 'ushort' module parameters.

Thanks to Jan Nijs for reporting this bug.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-17 15:06:47 -08:00
Stephen Hemminger
bd6af700a7 [TCP]: TCP highspeed build error
There is a compile error that crept in with the last patch of
TCP patches.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-17 14:11:18 -08:00
Yasuyuki Kozakai
e7c8a41e81 [IPV4,IPV6]: replace handmade list with hlist in IPv{4,6} reassembly
Both of ipq and frag_queue have *next and **prev, and they can be replaced
with hlist. Thanks Arnaldo Carvalho de Melo for the suggestion.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-16 12:55:37 -08:00
Stephen Hemminger
31f3426904 [TCP]: More spelling fixes.
From Joe Perches

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-15 15:17:10 -08:00
Harald Welte
37d2e7a20d [NETFILTER] nfnetlink: unconditionally require CAP_NET_ADMIN
This patch unconditionally requires CAP_NET_ADMIN for all nfnetlink
messages.  It also removes the per-message cap_required field, since all
existing subsystems use CAP_NET_ADMIN for all their messages anyway.

Patrick McHardy owes me a beer if we ever need to re-introduce this.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14 15:24:59 -08:00
Pablo Neira Ayuso
5655820852 [NETFILTER] ctnetlink: More thorough size checking of attributes
Add missing size checks. Thanks Patrick McHardy for the hint.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14 15:22:11 -08:00
Pablo Neira Ayuso
dbd36ea496 [NETFILTER] ctnetlink: use size_t to make gcc-4.x happy
Make gcc-4.x happy. Use size_t instead of int. Thanks to Patrick McHardy
for the hint.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14 15:21:01 -08:00
Vlad Drukker
a2d7222f0f [NETFILTER] {ip,nf}_conntrack TCP: Accept SYN+PUSH like SYN
Some devices (e.g. Qlogic iSCSI HBA hardware like QLA4010 up to firmware
3.0.0.4) initiates TCP with SYN and PUSH flags set.

The Linux TCP/IP stack deals fine with that, but the connection tracking
code doesn't.

This patch alters TCP connection tracking to accept SYN+PUSH as a valid
flag combination.

Signed-off-by: Vlad Drukker <vlad@storewiz.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-12 12:13:14 -08:00
Jeff Garzik
c050970a25 [PATCH] TCP: fix vegas build
Recent TCP changes broke the build.

Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-11 09:21:28 -08:00
Stephen Hemminger
6a438bbe68 [TCP]: speed up SACK processing
Use "hints" to speed up the SACK processing. Various forms 
of this have been used by TCP developers (Web100, STCP, BIC)
to avoid the 2x linear search of outstanding segments.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:14:59 -08:00
Stephen Hemminger
caa20d9abe [TCP]: spelling fixes
Minor spelling fixes for TCP code.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:13:47 -08:00
John Heffner
326f36e9e7 [TCP]: receive buffer growth limiting with mixed MTU
This is a patch for discussion addressing some receive buffer growing issues.
This is partially related to the thread "Possible BUG in IPv4 TCP window
handling..." last week.

Specifically it addresses the problem of an interaction between rcvbuf
moderation (receiver autotuning) and rcv_ssthresh.  The problem occurs when
sending small packets to a receiver with a larger MTU.  (A very common case I
have is a host with a 1500 byte MTU sending to a host with a 9k MTU.)  In
such a case, the rcv_ssthresh code is targeting a window size corresponding
to filling up the current rcvbuf, not taking into account that the new rcvbuf
moderation may increase the rcvbuf size.

One hunk makes rcv_ssthresh use tcp_rmem[2] as the size target rather than
rcvbuf.  The other changes the behavior when it overflows its memory bounds
with in-order data so that it tries to grow rcvbuf (the same as with
out-of-order data).

These changes should help my problem of mixed MTUs, and should also help the
case from last week's thread I think.  (In both cases though you still need
tcp_rmem[2] to be set much larger than the TCP window.)  One question is if
this is too aggressive at trying to increase rcvbuf if it's under memory
stress.

Orignally-from: John Heffner <jheffner@psc.edu>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:11:48 -08:00
Stephen Hemminger
9772efb970 [TCP]: Appropriate Byte Count support
This is an updated version of the RFC3465 ABC patch originally
for Linux 2.6.11-rc4 by Yee-Ting Li. ABC is a way of counting
bytes ack'd rather than packets when updating congestion control.

The orignal ABC described in the RFC applied to a Reno style
algorithm. For advanced congestion control there is little
change after leaving slow start.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:09:53 -08:00
Stephen Hemminger
7faffa1c7f [TCP]: add tcp_slow_start helper
Move all the code that does linear TCP slowstart to one
inline function to ease later patch to add ABC support.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:07:24 -08:00
Stephen Hemminger
2d2abbab63 [TCP]: simplify microsecond rtt sampling
Simplify the code that comuputes microsecond rtt estimate used
by TCP Vegas. Move the callback out of the RTT sampler and into
the end of the ack cleanup.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 16:56:12 -08:00
Stephen Hemminger
f4805eded7 [TCP]: fix congestion window update when using TSO deferal
TCP peformance with TSO over networks with delay is awful.
On a 100Mbit link with 150ms delay, we get 4Mbits/sec with TSO and
50Mbits/sec without TSO.

The problem is with TSO, we intentionally do not keep the maximum
number of packets in flight to fill the window, we hold out to until 
we can send a MSS chunk. But, we also don't update the congestion window 
unless we have filled, as per RFC2861.

This patch replaces the check for the congestion window being full
with something smarter that accounts for TSO.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 16:53:30 -08:00
Herbert Xu
fb286bb299 [NET]: Detect hardware rx checksum faults correctly
Here is the patch that introduces the generic skb_checksum_complete
which also checks for hardware RX checksum faults.  If that happens,
it'll call netdev_rx_csum_fault which currently prints out a stack
trace with the device name.  In future it can turn off RX checksum.

I've converted every spot under net/ that does RX checksum checks to
use skb_checksum_complete or __skb_checksum_complete with the
exceptions of:

* Those places where checksums are done bit by bit.  These will call
netdev_rx_csum_fault directly.

* The following have not been completely checked/converted:

ipmr
ip_vs
netfilter
dccp

This patch is based on patches and suggestions from Stephen Hemminger
and David S. Miller.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 13:01:24 -08:00
Thomas Graf
a8f74b2288 [NETLINK]: Make netlink_callback->done() optional
Most netlink families make no use of the done() callback, making
it optional gets rid of all unnecessary dummy implementations.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 02:26:40 +01:00
Yasuyuki Kozakai
9fb9cbb108 [NETFILTER]: Add nf_conntrack subsystem.
The existing connection tracking subsystem in netfilter can only
handle ipv4.  There were basically two choices present to add
connection tracking support for ipv6.  We could either duplicate all
of the ipv4 connection tracking code into an ipv6 counterpart, or (the
choice taken by these patches) we could design a generic layer that
could handle both ipv4 and ipv6 and thus requiring only one sub-protocol
(TCP, UDP, etc.) connection tracking helper module to be written.

In fact nf_conntrack is capable of working with any layer 3
protocol.

The existing ipv4 specific conntrack code could also not deal
with the pecularities of doing connection tracking on ipv6,
which is also cured here.  For example, these issues include:

1) ICMPv6 handling, which is used for neighbour discovery in
   ipv6 thus some messages such as these should not participate
   in connection tracking since effectively they are like ARP
   messages

2) fragmentation must be handled differently in ipv6, because
   the simplistic "defrag, connection track and NAT, refrag"
   (which the existing ipv4 connection tracking does) approach simply
   isn't feasible in ipv6

3) ipv6 extension header parsing must occur at the correct spots
   before and after connection tracking decisions, and there were
   no provisions for this in the existing connection tracking
   design

4) ipv6 has no need for stateful NAT

The ipv4 specific conntrack layer is kept around, until all of
the ipv4 specific conntrack helpers are ported over to nf_conntrack
and it is feature complete.  Once that occurs, the old conntrack
stuff will get placed into the feature-removal-schedule and we will
fully kill it off 6 months later.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-09 16:38:16 -08:00
Krzysztof Piotr Oledzki
5fd52fe098 [NETFILTER] ctnetlink: ICMP_ID is u_int16_t not u_int8_t.
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:04:32 -08:00
Krzysztof Piotr Oledzki
439a9994bb [NETFILTER] ctnetlink: Fix oops when no ICMP ID info in message
This patch fixes an userspace triggered oops. If there is no ICMP_ID
info the reference to attr will be NULL.

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:04:08 -08:00
Pablo Neira Ayuso
a856a19a9f [NETFILTER] ctnetlink: Add support to identify expectations by ID's
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:03:42 -08:00
Pablo Neira Ayuso
fcda46128d [NETFILTER] ctnetlink: propagate error instaed of returning -EPERM
Propagate the error to userspace instead of returning -EPERM if the get
conntrack operation fails.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:03:26 -08:00
Pablo Neira Ayuso
fe902a91ff [NETFILTER] ctnetlink: return -EINVAL if size is wrong
Return -EINVAL if the size isn't OK instead of -EPERM.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:03:09 -08:00
Yasuyuki Kozakai
d63a928108 [NETFILTER]: stop tracking ICMP error at early point
Currently connection tracking handles ICMP error like normal packets
if it failed to get related connection. But it fails that after all.

This makes connection tracking stop tracking ICMP error at early point.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:02:45 -08:00
Philip Craig
5978a9b82c [NETFILTER] PPTP helper: fix PNS-PAC expectation call id
The reply tuple of the PNS->PAC expectation was using the wrong call id.

So we had the following situation:
- PNS behind NAT firewall
- PNS call id requires NATing
- PNS->PAC gre packet arrives first

then the PNS->PAC expectation is matched, and the other expectation
is deleted, but the PAC->PNS gre packets do not match the gre conntrack
because the call id is wrong.

We also cannot use ip_nat_follow_master().

Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:01:53 -08:00
Pablo Neira Ayuso
81e5c27d08 [NETFILTER] ctnetlink: get_conntrack can use GFP_KERNEL
ctnetlink_get_conntrack is always called from user context, so GFP_KERNEL
is enough.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:01:19 -08:00
Pablo Neira Ayuso
7a4fe3664b [NETFILTER] ctnetlink: kill unused includes
Kill some useless headers included in ctnetlink. They aren't used in any
way.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:00:47 -08:00
Pablo Neira Ayuso
119a318494 [NETFILTER] ctnetlink: add module alias to fix autoloading
Add missing module alias. This is a must to load ctnetlink on demand. For
example, the conntrack tool will fail if the module isn't loaded.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:00:29 -08:00
Pablo Neira Ayuso
02a78cdf42 [NETFILTER] ctnetlink: add marking support from userspace
This patch adds support for conntrack marking from user space.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:00:04 -08:00
Pablo Neira Ayuso
51df784ed7 [NETFILTER] ctnetlink: check if protoinfo is present
This fixes an oops triggered from userspace. If we don't pass information
about the private protocol info, the reference to attr will be NULL. This is
likely to happen in update messages.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 12:59:41 -08:00
Harald Welte
a2506c0432 [NETFILTER] nfnetlink: nfattr_parse() can never fail, make it void
nfattr_parse (and thus nfattr_parse_nested) always returns success. So we
can make them 'void' and remove all the checking at the caller side.

Based on original patch by Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 12:59:13 -08:00
Yasuyuki Kozakai
eaae4fa45e [NETFILTER]: refcount leak of proto when ctnetlink dumping tuple
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 12:58:46 -08:00
Yasuyuki Kozakai
46998f59c0 [NETFILTER]: packet counter of conntrack is 32bits
The packet counter variable of conntrack was changed to 32bits from 64bits.
This follows that change.
		    
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 12:58:05 -08:00
Herbert Xu
89f5f0aeed [IPV4]: Fix ip_queue_xmit identity increment for TSO packets
When ip_queue_xmit calls ip_select_ident_more for IP identity selection
it gives it the wrong packet count for TSO packets.  The ip_select_*
functions expect one less than the number of packets, so we need to
subtract one for TSO packets.

This bug was diagnosed and fixed by Tom Young.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-08 09:41:56 -08:00
Jesper Juhl
a51482bde2 [NET]: kfree cleanup
From: Jesper Juhl <jesper.juhl@gmail.com>

This is the net/ part of the big kfree cleanup patch.

Remove pointless checks for NULL prior to calling kfree() in net/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2005-11-08 09:41:34 -08:00
Julian Anastasov
dc8103f25f [IPVS]: fix connection leak if expire_nodest_conn=1
There was a fix in 2.6.13 that changed the behaviour of
ip_vs_conn_expire_now function not to put reference to connection,
its callers should hold write lock or connection refcnt. But we
forgot to convert one caller, when the real server for connection
is unavailable caller should put the connection reference. It
happens only when sysctl var expire_nodest_conn is set to 1 and
such connections never expire. Thanks to Roberto Nibali who found
the problem and tested a 2.4.32-rc2 patch, which is equal to this
2.6 version. Patch for 2.4 is already sent to Marcelo.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Roberto Nibali <ratz@drugphish.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-08 09:40:05 -08:00
Stephen Hemminger
6df716340d [TCP/DCCP]: Randomize port selection
This patch randomizes the port selected on bind() for connections
to help with possible security attacks. It should also be faster
in most cases because there is no need for a global lock.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-05 21:23:15 -02:00
Harald Welte
433a4d3b54 [NETFILTER]: CONNMARK target needs ip_conntrack
There's a missing dependency from the CONNMARK target to ip_conntrack.

Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-05 16:39:20 -02:00
Harald Welte
0f81eb4db4 [NETFILTER]: Fix double free after netlink_unicast() in ctnetlink
It's not necessary to free skb if netlink_unicast() failed.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-05 03:28:37 -02:00
Harald Welte
d2a7bb7141 [NETFILTER] NAT: Fix module refcount dropping too far
The unknown protocol is used as a fallback when a protocol isn't known.
Hence we cannot handle it failing, so don't set ".me".  It's OK, since we
only grab a reference from within the same module (iptable_nat.ko), so we
never take the module refcount from 0 to 1.

Also, remove the "protocol is NULL" test: it's never NULL.

Signed-off-by: Rusty Rusty <rusty@rustcorp.com.au>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-05 01:23:34 -02:00
Harald Welte
d811552eda [NETFILTER] PPTP helper: Fix endianness bug in GRE key / CallID NAT
This endianness bug slipped through while changing the 'gre.key' field in the
conntrack tuple from 32bit to 16bit.

None of my tests caught the problem, since the linux pptp client always has
'0' as call id / gre key.  Only windows clients actually trigger the bug.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-04 23:19:17 -02:00
Harald Welte
3428c209c6 [NETFILTER] PPTP helper: Fix compilation of conntrack helper without NAT
This patch fixes compilation of the PPTP conntrack helper when NAT is
configured off.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-04 23:02:53 -02:00
Stephen Hemminger
450b5b1898 [TCP]: BIC max increment too large
The max growth of BIC TCP is too large. Original code was based on
BIC 1.0 and the default there was 32. Later code (2.6.13) included
compensation for delayed acks, and should have reduced the default
value to 16; since normally TCP gets one ack for every two packets sent.

The current value of 32 makes BIC too aggressive and unfair to other
flows.

Submitted-by: Injong Rhee <rhee@eos.ncsu.edu>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-02 21:24:01 -02:00
Yan Zheng
8713dbf057 [MCAST]: ip[6]_mc_add_src should be called when number of sources is zero
And filter mode is exclude.

Further explanation by David Stevens:

Multicast source filters aren't widely used yet, and that's really the only
feature that's affected if an application actually exercises this bug, as far
as I can tell. An ordinary filter-less multicast join should still work, and
only forwarded multicast traffic making use of filters and doing empty-source
filters with the MSFILTER ioctl would be at risk of not getting multicast
traffic forwarded to them because the reports generated would not be based on
the correct counts.

Signed-off-by: Yan Zheng <yanzheng@21cn.com
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-02 21:03:57 -02:00
Harald Welte
6b7d31fcdd [NETFILTER]: Add "revision" support to arp_tables and ip6_tables
Like ip_tables already has it for some time, this adds support for
having multiple revisions for each match/target.  We steal one byte from
the name in order to accomodate a 8 bit version number.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-31 16:36:08 -02:00
Jean Delvare
3fa63c7d82 [PATCH] Typo fix: dot after newline in printk strings
Typo fix: dots appearing after a newline in printk strings.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:20 -08:00
Jayachandran C
9fcc2e8a75 [IPV4]: Fix issue reported by Coverity in ipv4/fib_frontend.c
fib_del_ifaddr() dereferences ifa->ifa_dev, so the code already assumes that
ifa->ifa_dev is non-NULL, the check is unnecessary.

Signed-off-by: Jayachandran C. <c.jayachandran at gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-29 02:53:39 -02:00
Ananda Raju
e89e9cf539 [IPv4/IPv6]: UFO Scatter-gather approach
Attached is kernel patch for UDP Fragmentation Offload (UFO) feature.

1. This patch incorporate the review comments by Jeff Garzik.
2. Renamed USO as UFO (UDP Fragmentation Offload)
3. udp sendfile support with UFO

This patches uses scatter-gather feature of skb to generate large UDP
datagram. Below is a "how-to" on changes required in network device
driver to use the UFO interface.

UDP Fragmentation Offload (UFO) Interface:
-------------------------------------------
UFO is a feature wherein the Linux kernel network stack will offload the
IP fragmentation functionality of large UDP datagram to hardware. This
will reduce the overhead of stack in fragmenting the large UDP datagram to
MTU sized packets

1) Drivers indicate their capability of UFO using
dev->features |= NETIF_F_UFO | NETIF_F_HW_CSUM | NETIF_F_SG

NETIF_F_HW_CSUM is required for UFO over ipv6.

2) UFO packet will be submitted for transmission using driver xmit routine.
UFO packet will have a non-zero value for

"skb_shinfo(skb)->ufo_size"

skb_shinfo(skb)->ufo_size will indicate the length of data part in each IP
fragment going out of the adapter after IP fragmentation by hardware.

skb->data will contain MAC/IP/UDP header and skb_shinfo(skb)->frags[]
contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW
indicating that hardware has to do checksum calculation. Hardware should
compute the UDP checksum of complete datagram and also ip header checksum of
each fragmented IP packet.

For IPV6 the UFO provides the fragment identification-id in
skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating
IPv6 fragments.

Signed-off-by: Ananda Raju <ananda.raju@neterion.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (forwarded)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-28 16:30:00 -02:00
Linus Torvalds
236fa08168 Merge master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.15 2005-10-28 08:50:37 -07:00
Herbert Xu
2ad41065d9 [TCP]: Clear stale pred_flags when snd_wnd changes
This bug is responsible for causing the infamous "Treason uncloaked"
messages that's been popping up everywhere since the printk was added.
It has usually been blamed on foreign operating systems.  However,
some of those reports implicate Linux as both systems are running
Linux or the TCP connection is going across the loopback interface.

In fact, there really is a bug in the Linux TCP header prediction code
that's been there since at least 2.1.8.  This bug was tracked down with
help from Dale Blount.

The effect of this bug ranges from harmless "Treason uncloaked"
messages to hung/aborted TCP connections.  The details of the bug
and fix is as follows.

When snd_wnd is updated, we only update pred_flags if
tcp_fast_path_check succeeds.  When it fails (for example,
when our rcvbuf is used up), we will leave pred_flags with
an out-of-date snd_wnd value.

When the out-of-date pred_flags happens to match the next incoming
packet we will again hit the fast path and use the current snd_wnd
which will be wrong.

In the case of the treason messages, it just happens that the snd_wnd
cached in pred_flags is zero while tp->snd_wnd is non-zero.  Therefore
when a zero-window packet comes in we incorrectly conclude that the
window is non-zero.

In fact if the peer continues to send us zero-window pure ACKs we
will continue making the same mistake.  It's only when the peer
transmits a zero-window packet with data attached that we get a
chance to snap out of it.  This is what triggers the treason
message at the next retransmit timeout.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-27 15:11:04 -02:00
David Engel
dcab5e1eec [IPV4]: Fix setting broadcast for SIOCSIFNETMASK
Fix setting of the broadcast address when the netmask is set via
SIOCSIFNETMASK in Linux 2.6.  The code wanted the old value of
ifa->ifa_mask but used it after it had already been overwritten with
the new value.

Signed-off-by: David Engel <gigem@comcast.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-26 01:20:21 -02:00
Jayachandran C
0d0d2bba97 [IPV4]: Remove dead code from ip_output.c
skb_prev is assigned from skb, which cannot be NULL. This patch removes the
unnecessary NULL check.

Signed-off-by: Jayachandran C. <c.jayachandran at gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-26 00:58:54 -02:00
Herbert Xu
1371e37da2 [IPV4]: Kill redundant rcu_dereference on fa_info
This patch kills a redundant rcu_dereference on fa->fa_info in fib_trie.c.
As this dereference directly follows a list_for_each_entry_rcu line, we
have already taken a read barrier with respect to getting an entry from
the list.

This read barrier guarantees that all values read out of fa are valid.
In particular, the contents of structure pointed to by fa->fa_info is
initialised before fa->fa_info is actually set (see fn_trie_insert);
the setting of fa->fa_info itself is further separated with a write
barrier from the insertion of fa into the list.

Therefore by taking a read barrier after obtaining fa from the list
(which is given by list_for_each_entry_rcu), we can be sure that
fa->fa_info contains a valid pointer, as well as the fact that the
data pointed to by fa->fa_info is itself valid.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-26 00:25:03 -02:00
Harald Welte
eed75f191d [NETFILTER] ip_conntrack: Make "hashsize" conntrack parameter writable
It's fairly simple to resize the hash table, but currently you need to
remove and reinsert the module.  That's bad (we lose connection
state).  Harald has even offered to write a daemon which sets this
based on load.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-26 00:19:27 -02:00
John Hawkes
670c02c2bf [NET]: Wider use of for_each_*cpu()
In 'net' change the explicit use of for-loops and NR_CPUS into the
general for_each_cpu() or for_each_online_cpu() constructs, as
appropriate.  This widens the scope of potential future optimizations
of the general constructs, as well as takes advantage of the existing
optimizations of first_cpu() and next_cpu(), which is advantageous
when the true CPU count is much smaller than NR_CPUS.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-25 23:54:01 -02:00
Julian Anastasov
c98d80edc8 [SK_BUFF]: ipvs_property field must be copied
IPVS used flag NFC_IPVS_PROPERTY in nfcache but as now nfcache was removed the
new flag 'ipvs_property' still needs to be copied. This patch should be
included in 2.6.14.

Further comments from Harald Welte:

Sorry, seems like the bug was introduced by me.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-22 17:06:01 -02:00
Herbert Xu
b2cc99f04c [TCP] Allow len == skb->len in tcp_fragment
It is legitimate to call tcp_fragment with len == skb->len since
that is done for FIN packets and the FIN flag counts as one byte.
So we should only check for the len > skb->len case.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-20 17:13:13 -02:00
Herbert Xu
046d20b739 [TCP]: Ratelimit debugging warning.
Better safe than sorry.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-13 14:42:24 -07:00
David S. Miller
c8923c6b85 [NETFILTER]: Fix OOPSes on machines with discontiguous cpu numbering.
Original patch by Harald Welte, with feedback from Herbert Xu
and testing by Sbastien Bernard.

EBTABLES, ARP tables, and IP/IP6 tables all assume that cpus
are numbered linearly.  That is not necessarily true.

This patch fixes that up by calculating the largest possible
cpu number, and allocating enough per-cpu structure space given
that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-13 14:41:23 -07:00
Herbert Xu
9ff5c59ce2 [TCP]: Add code to help track down "BUG at net/ipv4/tcp_output.c:438!"
This is the second report of this bug.  Unfortunately the first
reporter hasn't been able to reproduce it since to provide more
debugging info.

So let's apply this patch for 2.6.14 to

1) Make this non-fatal.
2) Provide the info we need to track it down.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-12 15:59:39 -07:00
Arnaldo Carvalho de Melo
eeb2b85606 [TWSK]: Grab the module refcount for timewait sockets
This is required to avoid unloading a module that has active timewait
sockets, such as DCCP.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 21:25:23 -07:00
Pablo Neira Ayuso
061cb4a0ec [NETFILTER] ctnetlink: add support to change protocol info
This patch add support to change the state of the private protocol
information via conntrack_netlink.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 21:23:46 -07:00
Pablo Neira Ayuso
3392315375 [NETFILTER] ctnetlink: allow userspace to change TCP state
This patch adds the ability of changing the state a TCP connection. I know
that this must be used with care but it's required to provide a complete
conntrack creation via conntrack_netlink. So I'll document this aspect on
the upcoming docs.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 21:23:28 -07:00
Harald Welte
a051a8f730 [NETFILTER]: Use only 32bit counters for CONNTRACK_ACCT
Initially we used 64bit counters for conntrack-based accounting, since we
had no event mechanism to tell userspace that our counters are about to
overflow.  With nfnetlink_conntrack, we now have such a event mechanism and
thus can save 16bytes per connection.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 21:21:10 -07:00
Herbert Xu
d4875b049b [IPSEC] Fix block size/MTU bugs in ESP
This patch fixes the following bugs in ESP:

* Fix transport mode MTU overestimate.  This means that the inner MTU
  is smaller than it needs be.  Worse yet, given an input MTU which
  is a multiple of 4 it will always produce an estimate which is not
  a multiple of 4.

  For example, given a standard ESP/3DES/MD5 transform and an MTU of
  1500, the resulting MTU for transport mode is 1462 when it should
  be 1464.

  The reason for this is because IP header lengths are always a multiple
  of 4 for IPv4 and 8 for IPv6.

* Ensure that the block size is at least 4.  This is required by RFC2406
  and corresponds to what the esp_output function does.  At the moment
  this only affects crypto_null as its block size is 1.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 21:11:34 -07:00
Herbert Xu
a02a64223e [IPSEC]: Use ALIGN macro in ESP
This patch uses the macro ALIGN in all the applicable spots for ESP.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 21:11:08 -07:00
Pablo Neira Ayuso
e1c73b78e3 [NETFILTER] ctnetlink: add one nesting level for TCP state
To keep consistency, the TCP private protocol information is nested
attributes under CTA_PROTOINFO_TCP. This way the sequence of attributes to
access the TCP state information looks like here below:

CTA_PROTOINFO
CTA_PROTOINFO_TCP
CTA_PROTOINFO_TCP_STATE

instead of:

CTA_PROTOINFO
CTA_PROTOINFO_TCP_STATE

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 20:55:49 -07:00
Pablo Neira Ayuso
a1bcc3f268 [NETFILTER] ctnetlink: ICMP ID is not mandatory
The ID is only required by ICMP type 8 (echo), so it's not
mandatory for all sort of ICMP connections. This patch makes
mandatory only the type and the code for ICMP netlink messages.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 20:53:16 -07:00
Harald Welte
d000eaf772 [NETFILTER] conntrack_netlink: Fix endian issue with status from userspace
When we send "status" from userspace, we forget to convert the endianness.
This patch adds the reqired conversion.  Thanks to Pablo Neira for
discovering this.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 20:52:51 -07:00
Harald Welte
f40863cec8 [NETFILTER] ipt_ULOG: Mark ipt_ULOG as OBSOLETE
Similar to nfnetlink_queue and ip_queue, we mark ipt_ULOG as obsolete.
This should have been part of the original nfnetlink_log merge, but
I somehow missed it.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 20:51:53 -07:00
Harald Welte
85d9b05d9b [NETFILTER] PPTP helper: Add missing Kconfig dependency
PPTP should not be selectable without conntrack enabled

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10 20:47:42 -07:00
Al Viro
dd0fc66fb3 [PATCH] gfp flags annotations - part 1
- added typedef unsigned int __nocast gfp_t;

 - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
   the same warnings as far as sparse is concerned, doesn't change
   generated code (from gcc point of view we replaced unsigned int with
   typedef) and documents what's going on far better.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08 15:00:57 -07:00
Stephen Hemminger
42a39450f8 [TCP]: BIC coding bug in Linux 2.6.13
Missing parenthesis in causes BIC to be slow in increasing congestion
window.

Spotted by Injong Rhee.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-05 12:09:31 -07:00
Randy Dunlap
8eea00a44d [IPVS]: fix sparse gfp nocast warnings
From: Randy Dunlap <rdunlap@xenotime.net>

Fix implicit nocast warnings in ip_vs code:
net/ipv4/ipvs/ip_vs_app.c:631:54: warning: implicit cast to nocast type

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-04 22:42:15 -07:00
Horst H. von Brand
a5181ab06d [NETFILTER]: Fix Kconfig typo
Signed-off-by: Horst H. von Brand <vonbrand@inf.utfsm.cl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-04 15:58:56 -07:00
Robert Olsson
e6308be85a [IPV4]: fib_trie root-node expansion
The patch below introduces special thresholds to keep root node in the trie 
large. This gives a flatter tree at the cost of a modest memory increase.
Overall it seems to be gain and this was also proposed by one the authors 
of the paper in recent a seminar.

Main table after loading 123 k routes.

	Aver depth:     3.30
	Max depth:      9
        Root-node size  12 bits
        Total size: 4044  kB

With the patch:
	Aver depth:     2.78
	Max depth:      8
        Root-node size  15 bits
        Total size: 4150  kB

An increase of 8-10% was seen in forwading performance for an rDoS attack. 

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-04 13:01:58 -07:00
David S. Miller
7ce312467e [IPV4]: Update icmp sysctl docs and disable broadcast ECHO/TIMESTAMP by default
It's not a good idea to be smurf'able by default.
The few people who need this can turn it on.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03 16:07:30 -07:00
Herbert Xu
e5ed639913 [IPV4]: Replace __in_dev_get with __in_dev_get_rcu/rtnl
The following patch renames __in_dev_get() to __in_dev_get_rtnl() and
introduces __in_dev_get_rcu() to cover the second case.

1) RCU with refcnt should use in_dev_get().
2) RCU without refcnt should use __in_dev_get_rcu().
3) All others must hold RTNL and use __in_dev_get_rtnl().

There is one exception in net/ipv4/route.c which is in fact a pre-existing
race condition.  I've marked it as such so that we remember to fix it.

This patch is based on suggestions and prior work by Suzanne Wood and
Paul McKenney.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03 14:35:55 -07:00
Herbert Xu
444fc8fc3a [IPV4]: Fix "Proxy ARP seems broken"
Meelis Roos <mroos@linux.ee> wrote:
> RK> My firewall setup relies on proxyarp working.  However, with 2.6.14-rc3,
> RK> it appears to be completely broken.  The firewall is 212.18.232.186,
> 
> Same here with some kernel between 14-rc2 and 14-rc3 - no reposnse to
> ARP on a proxyarp gateway. Sorry, no exact revison and no more debugging
> yet since it'a a production gateway.

The breakage is caused by the change to use the CB area for flagging
whether a packet has been queued due to proxy_delay.  This area gets
cleared every time arp_rcv gets called.  Unfortunately packets delayed
due to proxy_delay also go through arp_rcv when they are reprocessed.

In fact, I can't think of a reason why delayed proxy packets should go
through netfilter again at all.  So the easiest solution is to bypass
that and go straight to arp_process.

This is essentially what would've happened before netfilter support
was added to ARP.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> 
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03 14:18:10 -07:00
Eric Dumazet
81c3d5470e [INET]: speedup inet (tcp/dccp) lookups
Arnaldo and I agreed it could be applied now, because I have other
pending patches depending on this one (Thank you Arnaldo)

(The other important patch moves skc_refcnt in a separate cache line,
so that the SMP/NUMA performance doesnt suffer from cache line ping pongs)

1) First some performance data :
--------------------------------

tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()

The most time critical code is :

sk_for_each(sk, node, &head->chain) {
     if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
         goto hit; /* You sunk my battleship! */
}

The sk_for_each() does use prefetch() hints but only the begining of
"struct sock" is prefetched.

As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far
away from the begining of "struct sock", it has to bring into CPU
cache cold cache line. Each iteration has to use at least 2 cache
lines.

This can be problematic if some chains are very long.

2) The goal
-----------

The idea I had is to change things so that INET_MATCH() may return
FALSE in 99% of cases only using the data already in the CPU cache,
using one cache line per iteration.

3) Description of the patch
---------------------------

Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
filling a 32 bits hole on 64 bits platform.

struct sock_common {
	unsigned short		skc_family;
	volatile unsigned char	skc_state;
	unsigned char		skc_reuse;
	int			skc_bound_dev_if;
	struct hlist_node	skc_node;
	struct hlist_node	skc_bind_node;
	atomic_t		skc_refcnt;
+	unsigned int		skc_hash;
	struct proto		*skc_prot;
};

Store in this 32 bits field the full hash, not masked by (ehash_size -
1) Using this full hash as the first comparison done in INET_MATCH
permits us immediatly skip the element without touching a second cache
line in case of a miss.

Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to
sk_hash and tw_hash) already contains the slot number if we mask with
(ehash_size - 1)

File include/net/inet_hashtables.h

64 bits platforms :
#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
     (((__sk)->sk_hash == (__hash))
     ((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie))   &&  \
     ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports))   &&  \
     (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))

32bits platforms:
#define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
     (((__sk)->sk_hash == (__hash))                 &&  \
     (inet_sk(__sk)->daddr          == (__saddr))   &&  \
     (inet_sk(__sk)->rcv_saddr      == (__daddr))   &&  \
     (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))


- Adds a prefetch(head->chain.first) in 
__inet_lookup_established()/__tcp_v4_check_established() and 
__inet6_lookup_established()/__tcp_v6_check_established() and 
__dccp_v4_check_established() to bring into cache the first element of the 
list, before the {read|write}_lock(&head->lock);

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03 14:13:38 -07:00
Herbert Xu
325ed82393 [NET]: Fix packet timestamping.
I've found the problem in general.  It affects any 64-bit
architecture.  The problem occurs when you change the system time.

Suppose that when you boot your system clock is forward by a day.
This gets recorded down in skb_tv_base.  You then wind the clock back
by a day.  From that point onwards the offset will be negative which
essentially overflows the 32-bit variables they're stored in.

In fact, why don't we just store the real time stamp in those 32-bit
variables? After all, we're not going to overflow for quite a while
yet.

When we do overflow, we'll need a better solution of course.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-03 13:57:23 -07:00
Alexey Kuznetsov
09e9ec8711 [TCP]: Don't over-clamp window in tcp_clamp_window()
From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>

Handle better the case where the sender sends full sized
frames initially, then moves to a mode where it trickles
out small amounts of data at a time.

This known problem is even mentioned in the comments
above tcp_grow_window() in tcp_input.c, specifically:

...
 * The scheme does not work when sender sends good segments opening
 * window and then starts to feed us spagetti. But it should work
 * in common situations. Otherwise, we have to rely on queue collapsing.
...

When the sender gives full sized frames, the "struct sk_buff" overhead
from each packet is small.  So we'll advertize a larger window.
If the sender moves to a mode where small segments are sent, this
ratio becomes tilted to the other extreme and we start overrunning
the socket buffer space.

tcp_clamp_window() tries to address this, but it's clamping of
tp->window_clamp is a wee bit too aggressive for this particular case.

Fix confirmed by Ion Badulescu.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-29 17:17:15 -07:00
David S. Miller
01ff367e62 [TCP]: Revert 6b251858d3
But retain the comment fix.

Alexey Kuznetsov has explained the situation as follows:

--------------------

I think the fix is incorrect. Look, the RFC function init_cwnd(mss) is
not continuous: f.e. for mss=1095 it needs initial window 1095*4, but
for mss=1096 it is 1096*3. We do not know exactly what mss sender used
for calculations. If we advertised 1096 (and calculate initial window
3*1096), the sender could limit it to some value < 1096 and then it
will need window his_mss*4 > 3*1096 to send initial burst.

See?

So, the honest function for inital rcv_wnd derived from
tcp_init_cwnd() is:

	init_rcv_wnd(mss)=
	  min { init_cwnd(mss1)*mss1 for mss1 <= mss }

It is something sort of:

	if (mss < 1096)
		return mss*4;
	if (mss < 1096*2)
		return 1096*4;
	return mss*2;

(I just scrablled a graph of piece of paper, it is difficult to see or
to explain without this)

I selected it differently giving more window than it is strictly
required.  Initial receive window must be large enough to allow sender
following to the rfc (or just setting initial cwnd to 2) to send
initial burst.  But besides that it is arbitrary, so I decided to give
slack space of one segment.

Actually, the logic was:

If mss is low/normal (<=ethernet), set window to receive more than
initial burst allowed by rfc under the worst conditions
i.e. mss*4. This gives slack space of 1 segment for ethernet frames.

For msses slighlty more than ethernet frame, take 3. Try to give slack
space of 1 frame again.

If mss is huge, force 2*mss. No slack space.

Value 1460*3 is really confusing. Minimal one is 1096*2, but besides
that it is an arbitrary value. It was meant to be ~4096. 1460*3 is
just the magic number from RFC, 1460*3 = 1095*4 is the magic :-), so
that I guess hands typed this themselves.

--------------------

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-29 17:07:20 -07:00
David S. Miller
6b251858d3 [TCP]: Fix init_cwnd calculations in tcp_select_initial_window()
Match it up to what RFC2414 really specifies.
Noticed by Rick Jones.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28 16:31:48 -07:00
Harald Welte
188bab3ae0 [NETFILTER]: Fix invalid module autoloading by splitting iptable_nat
When you've enabled conntrack and NAT as a module (standard case in all
distributions), and you've also enabled the new conntrack netlink
interface, loading ip_conntrack_netlink.ko will auto-load iptable_nat.ko.
This causes a huge performance penalty, since for every packet you iterate
the nat code, even if you don't want it.

This patch splits iptable_nat.ko into the NAT core (ip_nat.ko) and the
iptables frontend (iptable_nat.ko).  Threfore, ip_conntrack_netlink.ko will
only pull ip_nat.ko, but not the frontend.  ip_nat.ko will "only" allocate
some resources, but not affect runtime performance.

This separation is also a nice step in anticipation of new packet filters
(nf-hipac, ipset, pkttables) being able to use the NAT core.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-26 15:25:11 -07:00
Harald Welte
8ddec7460d [NETFILTER] ip_conntrack: Update event cache when status changes
The GRE, SCTP and TCP protocol helpers did not call
ip_conntrack_event_cache() when updating ct->status.  This patch adds
the respective calls.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-24 16:56:08 -07:00
Harald Welte
d67b24c40f [NETFILTER]: Fix ip[6]t_NFQUEUE Kconfig dependency
We have to introduce a separate Kconfig menu entry for the NFQUEUE targets.
They cannot "just" depend on nfnetlink_queue, since nfnetlink_queue could
be linked into the kernel, whereas iptables can be a module.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-24 16:52:03 -07:00
Harald Welte
1dfbab5949 [NETFILTER] Fix conntrack event cache deadlock/oops
This patch fixes a number of bugs.  It cannot be reasonably split up in
multiple fixes, since all bugs interact with each other and affect the same
function:

Bug #1:
The event cache code cannot be called while a lock is held.  Therefore, the
call to ip_conntrack_event_cache() within ip_ct_refresh_acct() needs to be
moved outside of the locked section.  This fixes a number of 2.6.14-rcX
oops and deadlock reports.

Bug #2:
We used to call ct_add_counters() for unconfirmed connections without
holding a lock.  Since the add operations are not atomic, we could race
with another CPU.

Bug #3:
ip_ct_refresh_acct() lost REFRESH events in some cases where refresh
(and the corresponding event) are desired, but no accounting shall be
performed.  Both, evenst and accounting implicitly depended on the skb
parameter bein non-null.   We now re-introduce a non-accounting
"ip_ct_refresh()" variant to explicitly state the desired behaviour.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-22 23:46:57 -07:00
Alexey Dobriyan
67497205b1 [NETFILTER] Fix sparse endian warnings in pptp helper
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-22 23:45:24 -07:00
Harald Welte
0ae5d253ad [NETFILTER] fix DEBUG statement in PPTP helper
As noted by Alexey Dobriyan, the DEBUGP statement prints the wrong
callID.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-22 23:44:58 -07:00
Herbert Xu
83ca28befc [TCP]: Adjust Reno SACK estimate in tcp_fragment
Since the introduction of TSO pcount a year ago, it has been possible
for tcp_fragment() to cause packets_out to decrease.  Prior to that,
tcp_retrans_try_collapse() was the only way for that to happen on the
retransmission path.

When this happens with Reno, it is possible for sasked_out to become
invalid because it is only an estimate and not tied to any particular
packet on the retransmission queue.

Therefore we need to adjust sacked_out as well as left_out in the Reno
case.  The following patch does exactly that.

This bug is pretty difficult to trigger in practice though since you
need a SACKless peer with a retransmission that occurs just as the
cached MTU value expires.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-22 23:32:56 -07:00
Stephen Hemminger
7957aed72b [TCP]: Set default congestion control correctly for incoming connections.
Patch from Joel Sing to fix the default congestion control algorithm
for incoming connections. If a new congestion control handler is added
(via module), it should become the default for new
connections. Instead, the incoming connections use reno. The cause is
incorrect initialisation causes the tcp_init_congestion_control()
function to return after the initial if test fails.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-21 00:19:46 -07:00
Stephen Hemminger
78c6671a88 [FIB_TRIE]: message cleanup
Cleanup the printk's in fib_trie:
	* Convert a couple of places in the dump code to BUG_ON
	* Put log level's on each message
The version message really needed the message since it leaks out
on the pretty Fedora bootup.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Robert Olsson <Robert.Olsson@data.slu.se>,
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-21 00:15:39 -07:00
Linus Torvalds
875bd5ab01 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2005-09-19 18:46:11 -07:00
Mark J Cox
6d1cfe3f17 [PATCH] raw_sendmsg DoS on 2.6
Fix unchecked __get_user that could be tricked into generating a
memory read on an arbitrary address.  The result of the read is not
returned directly but you may be able to divine some information about
it, or use the read to cause a crash on some architectures by reading
hardware state.  CAN-2004-2492.

Fix from Al Viro, ack from Dave Miller.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-19 18:45:42 -07:00
Herbert Xu
e14c3caf60 [TCP]: Handle SACK'd packets properly in tcp_fragment().
The problem is that we're now calling tcp_fragment() in a context
where the packets might be marked as SACKED_ACKED or SACKED_RETRANS.
This was not possible before as you never retransmitted packets that
are so marked.

Because of this, we need to adjust sacked_out and retrans_out in
tcp_fragment().  This is exactly what the following patch does.

We also need to preserve the SACKED_ACKED/SACKED_RETRANS marking
if they exist.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-19 18:18:38 -07:00
Harald Welte
8922bc93aa [NETFILTER]: Export ip_nat_port_{nfattr_to_range,range_to_nfattr}
Those exports are needed by the PPTP helper following in the next
couple of changes.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-19 15:35:57 -07:00
Patrick McHardy
a41bc00234 [NETFILTER]: Rename misnamed function
Both __ip_conntrack_expect_find and ip_conntrack_expect_find_get take
a reference to the expectation, the difference is that callers of
__ip_conntrack_expect_find must hold ip_conntrack_lock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-19 15:35:31 -07:00
Harald Welte
926b50f92a [NETFILTER]: Add new PPTP conntrack and NAT helper
This new "version 3" PPTP conntrack/nat helper is finally ready for
mainline inclusion.  Special thanks to lots of last-minute bugfixing
by Patric McHardy.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-19 15:33:08 -07:00
Robert Olsson
772cb712b1 [IPV4]: fib_trie RCU refinements
* This patch is from Paul McKenney's RCU reviewing. 

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-19 15:31:18 -07:00
Robert Olsson
1d25cd6cc2 [IPV4]: fib_trie tnode stats refinements
* Prints the route tnode and set the stats level deepth as before.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-19 15:29:52 -07:00
Harald Welte
628f87f3d5 [NETFILTER]: Solve Kconfig dependency problem
As suggested by Roman Zippel.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-18 00:33:02 -07:00
Harald Welte
777ed97f3e [NETFILTER] Fix Kconfig dependencies for nfnetlink/ctnetlink
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-17 00:41:02 -07:00
Harald Welte
a8f39143ac [NETFILTER]: Fix oops in conntrack event cache
ip_ct_refresh_acct() can be called without a valid "skb" pointer.
This used to work, since ct_add_counters() deals with that fact.
However, the recently-added event cache doesn't handle this at all.

This patch is a quick fix that is supposed to be replaced soon by a cleaner
solution during the pending redesign of the event cache.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-16 17:00:38 -07:00
KOVACS Krisztian
136e92bbec [NETFILTER] CLUSTERIP: use a bitmap to store node responsibility data
Instead of maintaining an array containing a list of nodes this instance
is responsible for let's use a simple bitmap. This provides the
following features:

  * clusterip_responsible() and the add_node()/delete_node() operations
    become very simple and don't need locking
  * the config structure is much smaller

In spite of the completely different internal data representation the
user-space interface remains almost unchanged; the only difference is
that the proc file does not list nodes in the order they were added.
(The target info structure remains the same.)

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-16 17:00:04 -07:00
KOVACS Krisztian
4451362445 [NETFILTER] CLUSTERIP: introduce reference counting for entries
The CLUSTERIP target creates a procfs entry for all different cluster
IPs.  Although more than one rules can refer to a single cluster IP (and
thus a single config structure), removal of the procfs entry is done
unconditionally in destroy(). In more complicated situations involving
deferred dereferencing of the config structure by procfs and creating a
new rule with the same cluster IP it's also possible that no entry will
be created for the new rule.

This patch fixes the problem by counting the number of entries
referencing a given config structure and moving the config list
manipulation and procfs entry deletion parts to the
clusterip_config_entry_put() function.

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-16 16:59:46 -07:00
Julian Anastasov
87375ab47c [IPVS]: ip_vs_ftp breaks connections using persistence
ip_vs_ftp when loaded can create NAT connections with unknown client
port for passive FTP. For such expectations we lookup with cport=0 on
incoming packet but it matches the format of the persistence templates
causing packets to other persistent virtual servers to be forwarded to
real server without creating connection. Later the reply packets are
treated as foreign and not SNAT-ed.

This patch changes the connection lookup for packets from clients:

* introduce IP_VS_CONN_F_TEMPLATE connection flag to mark the
  connection as template

* create new connection lookup function just for templates -
  ip_vs_ct_in_get

* make sure ip_vs_conn_in_get hits only connections with
  IP_VS_CONN_F_NO_CPORT flag set when s_port is 0. By this way
  we avoid returning template when looking for cport=0 (ftp)

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-14 21:08:51 -07:00
Julian Anastasov
f5e229db9c [IPVS]: Really invalidate persistent templates
Agostino di Salle noticed that persistent templates are not
invalidated due to buggy optimization.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-14 21:04:23 -07:00
Denis Lukianov
de9daad90e [MCAST]: Fix MCAST_EXCLUDE line dupes
This patch fixes line dupes at /ipv4/igmp.c and /ipv6/mcast.c in the  
2.6 kernel, where MCAST_EXCLUDE is mistakenly used instead of  
MCAST_INCLUDE.

Signed-off-by: Denis Lukianov <denis@voxelsoft.com>
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-14 20:53:42 -07:00
Herbert Xu
3c05d92ed4 [TCP]: Compute in_sacked properly when we split up a TSO frame.
The problem is that the SACK fragmenting code may incorrectly call
tcp_fragment() with a length larger than the skb->len.  This happens
when the skb on the transmit queue completely falls to the LHS of the
SACK.

And add a BUG() check to tcp_fragment() so we can spot this kind of
error more quickly in the future.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-14 20:50:35 -07:00
Patrick McHardy
adcb5ad1e5 [NETFILTER]: Fix DHCP + MASQUERADE problem
In 2.6.13-rcX the MASQUERADE target was changed not to exclude local
packets for better source address consistency. This breaks DHCP clients
using UDP sockets when the DHCP requests are caught by a MASQUERADE rule
because the MASQUERADE target drops packets when no address is configured
on the outgoing interface. This patch makes it ignore packets with a
source address of 0.

Thanks to Rusty for this suggestion.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-13 13:49:15 -07:00
Patrick McHardy
cd0bf2d796 [NETFILTER]: Fix rcu race in ipt_REDIRECT
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-13 13:48:58 -07:00
Patrick McHardy
e7fa1bd93f [NETFILTER]: Simplify netbios helper
Don't parse the packet, the data is already available in the conntrack
structure.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-13 13:48:34 -07:00
Patrick McHardy
5cb30640ce [NETFILTER]: Use correct type for "ports" module parameter
With large port numbers the helper_names buffer can overflow.
Noticed by Samir Bellabes <sbellabes@mandriva.com>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-13 13:48:00 -07:00
Nishanth Aravamudan
121caf577d [NET]: fix-up schedule_timeout() usage
Use schedule_timeout_{,un}interruptible() instead of
set_current_state()/schedule_timeout() to reduce kernel size.  Also use
human-time conversion functions instead of hard-coded division to avoid
rounding issues.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-12 14:15:34 -07:00
Herbert Xu
e130af5dab [TCP]: Fix double adjustment of tp->{lost,left}_out in tcp_fragment().
There is an extra left_out/lost_out adjustment in tcp_fragment which
means that the lost_out accounting is always wrong.  This patch removes
that chunk of code.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-10 17:19:09 -07:00
Linus Torvalds
1d8674edb5 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2005-09-09 14:25:22 -07:00
Ingo Molnar
8d06afab73 [PATCH] timer initialization cleanup: DEFINE_TIMER
Clean up timer initialization by introducing DEFINE_TIMER a'la
DEFINE_SPINLOCK.  Build and boot-tested on x86.  A similar patch has been
been in the -RT tree for some time.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 14:03:48 -07:00
Dipankar Sarma
b835996f62 [PATCH] files: lock-free fd look-up
With the use of RCU in files structure, the look-up of files using fds can now
be lock-free.  The lookup is protected by rcu_read_lock()/rcu_read_unlock().
This patch changes the readers to use lock-free lookup.

Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Ravikiran Thirumalai <kiran_th@gmail.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:55 -07:00
Stephen Hemminger
cb7b593c2c [IPV4] fib_trie: fix proc interface
Create one iterator for walking over FIB trie, and use it
for all the /proc functions. Add a /proc/net/route
output for backwards compatibility with old applications.

Make initialization of fib_trie same as fib_hash so no #ifdef
is needed in af_inet.c

Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=5209

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-09 13:35:42 -07:00
Patrick McHardy
e104411b82 [XFRM]: Always release dst_entry on error in xfrm_lookup
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-08 15:11:55 -07:00
Herbert Xu
cf0b450cd5 [TCP]: Fix off by one in tcp_fragment() "already sent" test.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-08 15:10:52 -07:00
Andrew Morton
3a93481589 [NETFILTER]: ip_conntrack_netbios_ns.c gcc-2.95.x build fix
gcc-2.95.x can't do this sort of initialisation

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-08 13:36:34 -07:00
Julian Anastasov
ce723d8e04 [IPV4]: Fix refcount damaging in net/ipv4/route.c
One such place that can damage the dst refcnts is route.c with
CONFIG_IP_ROUTE_MULTIPATH_CACHED enabled, i don't see the user's
.config. In this new code i see that rt_intern_hash is called before
dst->refcnt is set to 1, dst is the 2nd arg to rt_intern_hash.

Arg 2 of rt_intern_hash must come with refcnt 1 as it is added to
table or dropped depending on error/add/update. One such example is
ip_mkroute_input where __mkroute_input return rth with refcnt 0 which
is provided to rt_intern_hash. ip_mkroute_output looks like a 2nd such
place. Appending untested patch for comments and review.  The idea is
to put previous reference as we are going to return next result/error.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-08 13:34:47 -07:00
Stephen Hemminger
e308e25c97 [IPV4] udp: trim forgets about CHECKSUM_HW
A UDP packet may contain extra data that needs to be trimmed off.
But when doing so, UDP forgets to fixup the skb checksum if CHECKSUM_HW
is being used.

I think this explains the case of a NFS receive using skge driver
causing 'udp hw checksum failures' when interacting with a crufty
settop box.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-08 12:32:21 -07:00
Stephen Hemminger
48bc41a49c [IPV4]: Reassembly trim not clearing CHECKSUM_HW
This was found by inspection while looking for checksum problems
with the skge driver that sets CHECKSUM_HW. It did not fix the
problem, but it looks like it is needed.

If IP reassembly is trimming an overlapping fragment, it
should reset (or adjust) the hardware checksum flag on the skb.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-06 15:51:48 -07:00
Patrick McHardy
e446639939 [NETFILTER]: Missing unlock in TCP connection tracking error path
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-06 15:11:10 -07:00
Pablo Neira Ayuso
49719eb355 [NETFILTER]: kill __ip_ct_expect_unlink_destroy
The following patch kills __ip_ct_expect_unlink_destroy and export
unlink_expect as ip_ct_unlink_expect. As it was discussed [1], the function
__ip_ct_expect_unlink_destroy is a bit confusing so better do the following
sequence: ip_ct_destroy_expect and ip_conntrack_expect_put.

[1] https://lists.netfilter.org/pipermail/netfilter-devel/2005-August/020794.html

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-06 15:10:46 -07:00
Pablo Neira Ayuso
91c46e2e60 [NETFILTER]: Don't increase master refcount on expectations
As it's been discussed [1][2]. We shouldn't increase the master conntrack
refcount for non-fulfilled conntracks. During the conntrack destruction,
the expectations are always killed before the conntrack itself, this
guarantees that there won't be any orphan expectation.

[1]https://lists.netfilter.org/pipermail/netfilter-devel/2005-August/020783.html
[2]https://lists.netfilter.org/pipermail/netfilter-devel/2005-August/020904.html

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-06 15:10:23 -07:00
Patrick McHardy
03486a4f83 [NETFILTER]: Handle NAT module load race
When the NAT module is loaded when connections are already confirmed
it must not change their tuples anymore. This is especially important
with CONFIG_NETFILTER_DEBUG, the netfilter listhelp functions will
refuse to remove an entry from a list when it can not be found on
the list, so when a changed tuple hashes to a new bucket the entry
is kept in the list until and after the conntrack is freed.

Allocate the exact conntrack tuple for NAT for already confirmed
connections or drop them if that fails.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-06 15:09:43 -07:00
Yasuyuki Kozakai
31c913e7fd [NETFILTER]: Fix CONNMARK Kconfig dependency
Connection mark tracking support is one of the feature in connection
tracking, so IP_NF_CONNTRACK_MARK depends on IP_NF_CONNTRACK.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-06 15:09:20 -07:00
Patrick McHardy
a2978aea39 [NETFILTER]: Add NetBIOS name service helper
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-06 15:08:51 -07:00
Patrick McHardy
2248bcfcd8 [NETFILTER]: Add support for permanent expectations
A permanent expectation exists until timeing out and can expect
multiple related connections.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-06 15:06:42 -07:00
Herbert Xu
fb5f5e6e0c [TCP]: Fix TCP_OFF() bug check introduced by previous change.
The TCP_OFF assignment at the bottom of that if block can indeed set
TCP_OFF without setting TCP_PAGE.  Since there is not much to be
gained from avoiding this situation, we might as well just zap the
offset.  The following patch should fix it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-05 18:55:48 -07:00
Adrian Bunk
43d60661ac [IPV4]: net/ipv4/ipconfig.c should #include <linux/nfs_fs.h>
Every file should #include the header files containing the prototypes of 
it's global functions.

nfs_fs.h contains the prototype of root_nfs_parse_addr().

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-05 18:05:52 -07:00
David S. Miller
6475be16fd [TCP]: Keep TSO enabled even during loss events.
All we need to do is resegment the queue so that
we record SACK information accurately.  The edges
of the SACK blocks guide our resegmenting decisions.

With help from Herbert Xu.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-01 22:47:01 -07:00
Herbert Xu
ef01578615 [TCP]: Fix sk_forward_alloc underflow in tcp_sendmsg
I've finally found a potential cause of the sk_forward_alloc underflows
that people have been reporting sporadically.

When tcp_sendmsg tacks on extra bits to an existing TCP_PAGE we don't
check sk_forward_alloc even though a large amount of time may have
elapsed since we allocated the page.  In the mean time someone could've
come along and liberated packets and reclaimed sk_forward_alloc memory.

This patch makes tcp_sendmsg check sk_forward_alloc every time as we
do in do_tcp_sendpages.
 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-01 17:48:59 -07:00
Herbert Xu
d80d99d643 [NET]: Add sk_stream_wmem_schedule
This patch introduces sk_stream_wmem_schedule as a short-hand for
the sk_forward_alloc checking on egress.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-01 17:48:23 -07:00
Jesper Juhl
573dbd9596 [CRYPTO]: crypto_free_tfm() callers no longer need to check for NULL
Since the patch to add a NULL short-circuit to crypto_free_tfm() went in,
there's no longer any need for callers of that function to check for NULL.
This patch removes the redundant NULL checks and also a few similar checks
for NULL before calls to kfree() that I ran into while doing the
crypto_free_tfm bits.

I've succesfuly compile tested this patch, and a kernel with the patch 
applied boots and runs just fine.

When I posted the patch to LKML (and other lists/people on Cc) it drew the
following comments :

 J. Bruce Fields commented
  "I've no problem with the auth_gss or nfsv4 bits.--b."

 Sridhar Samudrala said
  "sctp change looks fine."

 Herbert Xu signed off on the patch.

So, I guess this is ready to be dropped into -mm and eventually mainline.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-01 17:44:29 -07:00
KOVACS Krisztian
5170dbebbb [NETFILTER]: CLUSTERIP: fix memcpy() length typo
Fix a trivial typo in clusterip_config_init().

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-01 17:44:06 -07:00
Harald Welte
5f2c3b9107 [NETFILTER]: Add new iptables TTL target
This new iptables target allows manipulation of the TTL of an IPv4 packet.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:13:22 -07:00
Eric Dumazet
ba89966c19 [NET]: use __read_mostly on kmem_cache_t , DEFINE_SNMP_STAT pointers
This patch puts mostly read only data in the right section
(read_mostly), to help sharing of these data between CPUS without
memory ping pongs.

On one of my production machine, tcp_statistics was sitting in a
heavily modified cache line, so *every* SNMP update had to force a
reload.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:11:18 -07:00
David S. Miller
29cb9f9c55 [LIB]: Make TEXTSEARCH_BM plain tristate like the others
And select it when the relevant modules are enabled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:11:11 -07:00
Robert Olsson
2373ce1ca0 [IPV4]: Convert FIB Trie to RCU.
* Removes RW-lock
* Proteced read functions uses
  rcu_dereference proteced with rcu_read_lock()
* writing of procted pointer w. rcu_assigen_pointer
* Insert/Replace atomic list_replace_rcu
* A BUG_ON condition removed.in trie_rebalance

With help from Paul E. McKenney.

Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:09:03 -07:00
Robert Olsson
e5b4376074 [IPV4]: Prepare FIB core for RCU.
* RCU versions of hlist_***_rcu
* fib_alias partial rcu port just whats needed now.

Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:08:31 -07:00
Ralf Baechle
3625796806 [IPV4]: Module export of ip_rcv() no longer needed.
With ip_rcv nowhere outside the IP stack being used anymore it's
EXPORT_SYMBOL is not needed any longer either.

Signed-off-by: Ralf Baechle DL5RB <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:08:23 -07:00
Stephen Hemminger
0c7770c740 [IPV4]: FIB trie cleanup
This is a redo of earlier cleanup stuff:
	* replace DBG() macro with pr_debug()
	* get rid of duplicate extern's that are already in fib_lookup.h
	* use BUG_ON and WARN_ON
	* don't use BUG checks for null pointers where next statement would
	  get a fault anyway
	* remove debug printout when rebalance causes deep tree
	* remove trailing blanks

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:08:01 -07:00
Arnaldo Carvalho de Melo
dc40c7bc76 [ICSK]: Generalise tcp_listen_poll
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:05:32 -07:00
Patrick McHardy
05465343bf [NETFILTER]: Add goto target
Originally written by Henrik Nordstrom <hno@marasystems.com>, taken
from netfilter patch-o-matic and added ip6_tables support.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:04:18 -07:00
Pablo Neira Ayuso
7567662ba8 [NETFILTER]: Add string match
Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:04:07 -07:00
Thomas Graf
33d043d65b [IPV4]: ip_finish_output() can be inlined
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:03:10 -07:00
Thomas Graf
9070683bda [IPV4]: Remove some dead code from ip_forward()
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:03:06 -07:00
Thomas Graf
3e192beaf5 [IPV4]: Avoid common branch mispredictions in ip_rcv_finish()
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:03:03 -07:00
Thomas Graf
d245407e75 [IPV4]: Move ip options parsing out of ip_rcv_finish()
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:03:00 -07:00
Thomas Graf
e9c6042273 [IPV4]: Avoid common branch misprediction while checking csum in ip_rcv()
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:02:57 -07:00
Thomas Graf
5861524241 [IPV4]: Consistency and whitespace cleanup of ip_rcv()
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:02:53 -07:00
David S. Miller
bf0ff9e578 [IPVS]: ipv4_table --> ipvs_ipv4_table
Fix conflict with symbol of same name in global
namespace.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:02:29 -07:00
David S. Miller
d179cd1292 [NET]: Implement SKB fast cloning.
Protocols that make extensive use of SKB cloning,
for example TCP, eat at least 2 allocations per
packet sent as a result.

To cut the kmalloc() count in half, we implement
a pre-allocation scheme wherein we allocate
2 sk_buff objects in advance, then use a simple
reference count to free up the memory at the
correct time.

Based upon an initial patch by Thomas Graf and
suggestions from Herbert Xu.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:01:54 -07:00
David S. Miller
ba602a8161 [IPVS]: Rename tcp_{init,exit}() --> ip_vs_tcp_{init,exit}()
Conflicts with global namespace functions with the
same name.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:01:47 -07:00
Arnaldo Carvalho de Melo
4c6ea29d82 [IP]: Introduce ip_options_get_from_user
This variant is needed to satisfy sparse __user annotations.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:01:39 -07:00
Arnaldo Carvalho de Melo
20380731bc [NET]: Fix sparse warnings
Of this type, mostly:

CHECK   net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:01:32 -07:00
Patrick McHardy
066286071d [NETLINK]: Add "groups" argument to netlink_kernel_create
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:01:11 -07:00
Patrick McHardy
ac6d439d20 [NETLINK]: Convert netlink users to use group numbers instead of bitmasks
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:00:54 -07:00
Patrick McHardy
db08052979 [NETLINK]: Remove unused groups member from struct netlink_skb_parms
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:00:39 -07:00
Christoph Hellwig
34b4a4a624 [NETFILTER]: Remove tasklist_lock abuse in ipt{,6}owner
Rip out cmd/sid/pid matching since its unfixable broken and stands in the
way of locking changes to tasklist_lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:59:07 -07:00
Gary Wayne Smith
000efe1d86 [NETFILTER]: Make NETMAP target usable in OUTPUT
Signed-off-by: Gary Wayne Smith <gary.w.smith@primeexalia.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:58:41 -07:00
Patrick McHardy
9baa5c67ff [NETFILTER]: Don't exclude local packets from MASQUERADING
Increases consistency in source-address selection.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:58:36 -07:00
Patrick McHardy
a61bbcf28a [NET]: Store skb->timestamp as offset to a base timestamp
Reduces skb size by 8 bytes on 64-bit.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:58:24 -07:00
Patrick McHardy
25ed891019 [NETFILTER]: Nicer names for ipt_connbytes constants
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:58:17 -07:00
Patrick McHardy
8ffde67173 [NETFILTER]: Fix div64_64 in ipt_connbytes
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:58:11 -07:00
Harald Welte
9d810fd2d2 [NETFILTER]: Add new iptables "connbytes" match
This patch ads a new "connbytes" match that utilizes the CONFIG_NF_CT_ACCT
per-connection byte and packet counters.  Using it you can do things like
packet classification on average packet size within a connection.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:58:04 -07:00
Arnaldo Carvalho de Melo
17b085eace [INET_DIAG]: Move the tcp_diag interface to the proper place
With this the previous setup is back, i.e. tcp_diag can be built as a module,
as dccp_diag and both share the infrastructure available in inet_diag.

If one selects CONFIG_INET_DIAG as module CONFIG_INET_TCP_DIAG will also be
built as a module, as will CONFIG_INET_DCCP_DIAG, if CONFIG_IP_DCCP was
selected static or as a module, if CONFIG_INET_DIAG is y, being statically
linked CONFIG_INET_TCP_DIAG will follow suit and CONFIG_INET_DCCP_DIAG will be
built in the same manner as CONFIG_IP_DCCP.

Now to aim at UDP, converting it to use inet_hashinfo, so that we can use
iproute2 for UDP sockets as well.

Ah, just to show an example of this new infrastructure working for DCCP :-)

[root@qemu ~]# ./ss -dane
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      0                  *:5001             *:*     ino:942 sk:cfd503a0
ESTAB      0      0          127.0.0.1:5001     127.0.0.1:32770 ino:943 sk:cfd50a60
ESTAB      0      0          127.0.0.1:32770    127.0.0.1:5001  ino:947 sk:cfd50700
TIME-WAIT  0      0          127.0.0.1:32769    127.0.0.1:5001  timer:(timewait,3.430ms,0) ino:0 sk:cf209620

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:57:54 -07:00
Arnaldo Carvalho de Melo
a8c2190ee7 [INET_DIAG]: Rename tcp_diag.[ch] to inet_diag.[ch]
Next changeset will introduce net/ipv4/tcp_diag.c, moving the code that was put
transitioanlly in inet_diag.c.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:57:48 -07:00
Arnaldo Carvalho de Melo
73c1f4a033 [TCPDIAG]: Just rename everything to inet_diag
Next changeset will rename tcp_diag.[ch] to inet_diag.[ch].

I'm taking this longer route so as to easy review, making clear the changes
made all along the way.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:57:44 -07:00
Arnaldo Carvalho de Melo
4f5736c4c7 [TCPDIAG]: Introduce inet_diag_{register,unregister}
Next changeset will rename tcp_diag to inet_diag and move the tcp_diag code out
of it and into a new tcp_diag.c, similar to the net/dccp/diag.c introduced in
this changeset, completing the transition to a generic inet_diag
infrastructure.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:57:38 -07:00
Arnaldo Carvalho de Melo
5324a040cc [INET6_HASHTABLES]: Move inet6_lookup functions to net/ipv6/inet6_hashtables.c
Doing this we allow tcp_diag to support IPV6 even if tcp_diag is compiled
statically and IPV6 is compiled as a module, removing the previous restriction
while not building any IPV6 code if it is not selected.

Now to work on the tcpdiag_register infrastructure and then to rename the whole
thing to inetdiag, reflecting its by then completely generic nature.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:57:29 -07:00
Arnaldo Carvalho de Melo
505cbfc577 [IPV6]: Generalise the tcp_v6_lookup routines
In the same way as was done with the v4 counterparts, this will be moved
to inet6_hashtables.c.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:57:24 -07:00
Arnaldo Carvalho de Melo
e41aac41e3 [TCPDIAG]: Introduce CONFIG_IP_TCPDIAG_DCCP
Similar to CONFIG_IP_TCPDIAG_IPV6

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:56:49 -07:00
Arnaldo Carvalho de Melo
540722ffc3 [TCPDIAG]: Implement cheapest way of supporting DCCPDIAG_GETSOCK
With ugly ifdefs, etc, but this actually:

1. keeps the existing ABI, i.e. no need to recompile the iproute2
   utilities if not interested in DCCP.

2. Provides all the tcp_diag functionality in DCCP, with just a
   small patch that makes iproute2 support DCCP.

Of course I'll get this cleaned-up in time, but for now I think its
OK to be this way to quickly get this functionality.

iproute2-ss050808 patch at:

http://vger.kernel.org/~acme/iproute2-ss050808.dccp.patch

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:56:23 -07:00
Arnaldo Carvalho de Melo
6687e988d9 [ICSK]: Move TCP congestion avoidance members to icsk
This changeset basically moves tcp_sk()->{ca_ops,ca_state,etc} to inet_csk(),
minimal renaming/moving done in this changeset to ease review.

Most of it is just changes of struct tcp_sock * to struct sock * parameters.

With this we move to a state closer to two interesting goals:

1. Generalisation of net/ipv4/tcp_diag.c, becoming inet_diag.c, being used
   for any INET transport protocol that has struct inet_hashinfo and are
   derived from struct inet_connection_sock. Keeps the userspace API, that will
   just not display DCCP sockets, while newer versions of tools can support
   DCCP.

2. INET generic transport pluggable Congestion Avoidance infrastructure, using
   the current TCP CA infrastructure with DCCP.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:56:18 -07:00
Patrick McHardy
64ce207306 [NET]: Make NETDEBUG pure printk wrappers
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:56:08 -07:00
Arnaldo Carvalho de Melo
696ab2d3bf [TIMEWAIT]: Move inet_timewait_death_row routines to net/ipv4/inet_timewait_sock.c
Also export the ones that will be used in the next changeset, when
DCCP uses this infrastructure.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:55:58 -07:00
Arnaldo Carvalho de Melo
295ff7edb8 [TIMEWAIT]: Introduce inet_timewait_death_row
That groups all of the tables and variables associated to the TCP timewait
schedulling/recycling/killing code, that now can be isolated from the TCP
specific code and used by other transport protocols, such as DCCP.

Next changeset will move this code to net/ipv4/inet_timewait_sock.c

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:55:48 -07:00
Harald Welte
1d3de414eb [NETFILTER]: New iptables DCCP protocol header match
Using this new iptables DCCP protocol header match, it is possible to
create simplistic stateless packet filtering rules for DCCP.  It
permits matching of port numbers, packet type and options.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:54:28 -07:00
Stephen Hemmigner
bb435b8d81 [IPV4]: fib_trie: Use const
Use const where possible and get rid of EXTRACT() macro
that was never used.

Signed-off-by: Stephen Hemmigner <shemminger@osdl.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:54:08 -07:00
Robert Olsson
2f80b3c826 [IPV4]: fib_trie: Use ERR_PTR to handle errno return
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:54:00 -07:00
Olof Johansson
91b9a277fc [IPV4]: FIB Trie cleanups.
Below is a patch that cleans up some of this, supposedly without
changing any behaviour:

* Whitespace cleanups
* Introduce DBG()
* BUG_ON() instead of if () { BUG(); }
* Remove some of the deep nesting to make the code flow more
  comprehensible
* Some mask operations were simplified

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:53:52 -07:00
Yasuyuki Kozakai
7663f18807 [NETFILTER]: return ENOMEM when ip_conntrack_alloc() fails.
This patch fixes the bug which doesn't return ERR_PTR(-ENOMEM) if it
failed to allocate memory space from slab cache.  This bug leads to
erroneously not dropped packets under stress, and wrong statistic
counters ('invalid' is incremented instead of 'drop').  It was
introduced during the ctnetlink merge in the net-2.6.14 tree, so no
stable or mainline releases affected.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:51:28 -07:00
Harald Welte
bbd86b9fc4 [NETFILTER]: add /proc/net/netfilter interface to nf_queue
This patch adds a /proc/net/netfilter/nf_queue file, similar to the
recently-added /proc/net/netfilter/nf_log.  It indicates which queue
handler is registered to which protocol family.  This is useful since
there are now multiple queue handlers in the treee (ip[6]_queue,
nfnetlink_queue).

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:51:18 -07:00
Harald Welte
210a9ebef2 [NETFILTER]: ip{6}_queue: prevent unregistration race with nfnetlink_queue
Since nfnetlink_queue can override ip{6}_queue as queue handlers, we
can no longer blindly unregister whoever is registered for PF_INET[6],
but only unregister ourselves.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:51:08 -07:00
Harald Welte
2669d63d20 [NETFILTER]: move conntrack helper buffers from BSS to kmalloc()ed memory
According to DaveM, it is preferrable to have large data structures be
allocated dynamically from the module init() function rather than
putting them as static global variables into BSS.

This patch moves the conntrack helper packet buffers into dynamically
allocated memory.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:50:57 -07:00
Arnaldo Carvalho de Melo
bb97d31f51 [INET]: Make inet_create try to load protocol modules
Syntax is net-pf-PROTOCOL_FAMILY-PROTOCOL-SOCK_TYPE and if this
fails net-pf-PROTOCOL_FAMILY-PROTOCOL.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:50:54 -07:00
Arnaldo Carvalho de Melo
a019d6fe2b [ICSK]: Move generalised functions from tcp to inet_connection_sock
This also improves reqsk_queue_prune and renames it to
inet_csk_reqsk_queue_prune, as it deals with both inet_connection_sock
and inet_request_sock objects, not just with request_sock ones thus
belonging to inet_request_sock.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:49:50 -07:00
Arnaldo Carvalho de Melo
d8c97a9451 [NET]: Export symbols needed by the current DCCP code
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:49:35 -07:00
Arnaldo Carvalho de Melo
295f7324ff [ICSK]: Introduce reqsk_queue_prune from code in tcp_synack_timer
With this we're very close to getting all of the current TCP
refactorings in my dccp-2.6 tree merged, next changeset will export
some functions needed by the current DCCP code and then dccp-2.6.git
will be born!

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:49:29 -07:00
Arnaldo Carvalho de Melo
0a5578cf8e [ICSK]: Generalise tcp_listen_{start,stop}
This also moved inet_iif from tcp to inet_hashtables.h, as it is
needed by the inet_lookup callers, perhaps this needs a bit of
polishing, but for now seems fine.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:49:24 -07:00
Arnaldo Carvalho de Melo
9f1d2604c7 [ICSK]: Introduce inet_csk_clone
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:49:20 -07:00
Arnaldo Carvalho de Melo
3f421baa47 [NET]: Just move the inet_connection_sock function from tcp sources
Completing the previous changeset, this also generalises tcp_v4_synq_add,
renaming it to inet_csk_reqsk_queue_hash_add, already geing used in the
DCCP tree, which I plan to merge RSN.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:49:14 -07:00
Arnaldo Carvalho de Melo
463c84b97f [NET]: Introduce inet_connection_sock
This creates struct inet_connection_sock, moving members out of struct
tcp_sock that are shareable with other INET connection oriented
protocols, such as DCCP, that in my private tree already uses most of
these members.

The functions that operate on these members were renamed, using a
inet_csk_ prefix while not being moved yet to a new file, so as to
ease the review of these changes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:43:19 -07:00
Arnaldo Carvalho de Melo
87d11ceb9d [SOCK]: Introduce sk_clone
Out of tcp_create_openreq_child, will be used in
dccp_create_openreq_child, and is a nice sock function anyway.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:36 -07:00
Arnaldo Carvalho de Melo
c676270bcd [INET_TWSK]: Introduce inet_twsk_alloc
With the parts of tcp_time_wait that are not TCP specific, tcp_time_wait uses
it and so will dccp_time_wait.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:26 -07:00
Arnaldo Carvalho de Melo
e48c414ee6 [INET]: Generalise the TCP sock ID lookup routines
And also some TIME_WAIT functions.

[acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
/tmp/before.size: 282955   13122    9312  305389   4a8ed net/ipv4/built-in.o
/tmp/after.size:  281566   13122    9312  304000   4a380 net/ipv4/built-in.o
[acme@toy net-2.6.14]$

I kept them still inlined, will uninline at some point to see what
would be the performance difference.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:18 -07:00
Arnaldo Carvalho de Melo
8feaf0c0a5 [INET]: Generalise tcp_tw_bucket, aka TIME_WAIT sockets
This paves the way to generalise the rest of the sock ID lookup
routines and saves some bytes in TCPv4 TIME_WAIT sockets on distro
kernels (where IPv6 is always built as a module):

[root@qemu ~]# grep tw_sock /proc/slabinfo
tw_sock_TCPv6  0  0  128  31  1
tw_sock_TCP    0  0   96  41  1
[root@qemu ~]#

Now if a protocol wants to use the TIME_WAIT generic infrastructure it
only has to set the sk_prot->twsk_obj_size field with the size of its
inet_timewait_sock derived sock and proto_register will create
sk_prot->twsk_slab, for now its only for INET sockets, but we can
introduce timewait_sock later if some non INET transport protocolo
wants to use this stuff.

Next changesets will take advantage of this new infrastructure to
generalise even more TCP code.

[acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
/tmp/before.size: 188646   11764    5068  205478   322a6 net/ipv4/built-in.o
/tmp/after.size:  188144   11764    5068  204976   320b0 net/ipv4/built-in.o
[acme@toy net-2.6.14]$

Tested with both IPv4 & IPv6 (::1 (localhost) & ::ffff:172.20.0.1
(qemu host)).

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:13 -07:00
Arnaldo Carvalho de Melo
33b6223190 [INET]: Generalise tcp_v4_lookup_listener
[acme@toy net-2.6.14]$ grep built-in /tmp/before /tmp/after
/tmp/before: 282560       13122    9312  304994   4a762 net/ipv4/built-in.o
/tmp/after:  282560       13122    9312  304994   4a762 net/ipv4/built-in.o

Will be used in DCCP, not exporting it right now not to get in Adrian
Bunk's exported-but-not-used-on-modules radar 8)

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:08 -07:00
Arnaldo Carvalho de Melo
81849d106b [INET]: Generalise tcp_v4_hash & tcp_unhash
It really just makes the existing code be a helper function that
tcp_v4_hash and tcp_unhash uses, specifying the right inet_hashinfo,
tcp_hashinfo.

One thing I'll investigate at some point is to have the inet_hashinfo
pointer in sk_prot, so that we get all the hashtable information from
the sk pointer, this can lead to some extra indirections that may well
hurt performance/code size, we'll see. Ultimate idea would be that
sk_prot would provide _all_ the information about a protocol
implementation.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:42:02 -07:00
Arnaldo Carvalho de Melo
c752f0739f [TCP]: Move the tcp sock states to net/tcp_states.h
Lots of places just needs the states, not even linux/tcp.h, where this
enum was, needs it.

This speeds up development of the refactorings as less sources are
rebuilt when things get moved from net/tcp.h.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:41:54 -07:00
Arnaldo Carvalho de Melo
f3f05f7046 [INET]: Generalise the tcp_listen_ lock routines
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:41:49 -07:00
Arnaldo Carvalho de Melo
6e04e02165 [INET]: Move tcp_port_rover to inet_hashinfo
Also expose all of the tcp_hashinfo members, i.e. killing those
tcp_ehash, etc macros, this will more clearly expose already generic
functions and some that need just a bit of work to become generic, as
we'll see in the upcoming changesets.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:41:44 -07:00
Arnaldo Carvalho de Melo
2d8c4ce519 [INET]: Generalise tcp_bind_hash & tcp_inherit_port
This required moving tcp_bucket_cachep to inet_hashinfo.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:40:29 -07:00
Pablo Neira Ayuso
ff21d5774b [NETFILTER]: fix list traversal order in ctnetlink
Currently conntracks are inserted after the head. That means that
conntracks are sorted from the biggest to the smallest id. This happens
because we use list_prepend (list_add) instead list_add_tail. This can
result in problems during the list iteration.

                 list_for_each(i, &ip_conntrack_hash[cb->args[0]]) {
                         h = (struct ip_conntrack_tuple_hash *) i;
                         if (DIRECTION(h) != IP_CT_DIR_ORIGINAL)
                                 continue;
                         ct = tuplehash_to_ctrack(h);
                         if (ct->id <= *id)
                                 continue;

In that case just the first conntrack in the bucket will be dumped. To
fix this, we iterate the list from the tail to the head via
list_for_each_prev. Same thing for the list of expectations.

Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:40:25 -07:00
Pablo Neira Ayuso
28b19d99ac [NETFILTER]: Fix typo in ctnl_exp_cb array (no bug, just memory waste)
This fixes the size of the ctnl_exp_cb array that is IPCTNL_MSG_EXP_MAX
instead of IPCTNL_MSG_MAX. Simple typo.

Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:40:21 -07:00
Pablo Neira Ayuso
37012f7fd3 [NETFILTER]: fix conntrack refcount leak in unlink_expect()
In unlink_expect(), the expectation is removed from the list so the
refcount must be dropped as well.

Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:40:17 -07:00
Pablo Neira Ayuso
14a50bbaa5 [NETFILTER]: ctnetlink: make sure event order is correct
The following sequence is displayed during events dumping of an ICMP
connection: [NEW] [DESTROY] [UPDATE]

This happens because the event IPCT_DESTROY is delivered in
death_by_timeout(), that is called from the icmp protocol helper
(ct->timeout.function) once we see the reply.

To fix this, we move this event to destroy_conntrack().

Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:40:13 -07:00
Harald Welte
1444fc559b [NETFILTER]: don't use nested attributes for conntrack_expect
We used to use nested nfattr structures for ip_conntrack_expect.  This is
bogus, since ip_conntrack and ip_conntrack_expect are communicated in
different netlink message types.  both should be encoded at the top level
attributes, no extra nesting required.  This patch addresses the issue.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:40:09 -07:00
Harald Welte
927ccbcc28 [NETFILTER]: attribute count is an attribute of message type, not subsytem
Prior to this patch, every nfnetlink subsystem had to specify it's
attribute count.  However, in reality the attribute count depends on
the message type within the subsystem, not the subsystem itself.  This
patch moves 'attr_count' from 'struct nfnetlink_subsys' into
nfnl_callback to fix this.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:39:14 -07:00
Harald Welte
bd9a26b7f2 [NETFILTER]: fix ctnetlink 'create_expect' parsing
There was a stupid copy+paste mistake where we parse the MASK nfattr into
the "tuple" variable instead of the "mask" variable.  This patch fixes it.
Thanks to Pablo Neira.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:39:10 -07:00
Pablo Neira
88aa042904 [NETFILTER]: conntrack_netlink: Fix locking during conntrack_create
The current codepath allowed for ip_conntrack_lock to be unlock'ed twice.

Signed-off-by: Pablo Neira <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:39:05 -07:00
Pablo Neira
94cd2b6764 [NETFILTER]: remove bogus memset() calls from ip_conntrack_netlink.c
nfattr_parse_nested() calls nfattr_parse() which in turn does a memset
on the 'tb' array.  All callers therefore don't need to memset before
calling it.

Signed-off-by: Pablo Neira <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:39:00 -07:00
Patrick McHardy
a86888b925 [NETFILTER]: Fix multiple problems with the conntrack event cache
refcnt underflow: the reference count is decremented when a conntrack
entry is removed from the hash but it is not incremented when entering
new entries.

missing protection of process context against softirq context: all
cache operations need to locally disable softirqs to avoid races.
Additionally the event cache can't be initialized when a packet
enteres the conntrack code but needs to be initialized whenever we
cache an event and the stored conntrack entry doesn't match the
current one.

incorrect flushing of the event cache in ip_ct_iterate_cleanup:
without real locking we can't flush the cache for different CPUs
without incurring races. The cache for different CPUs can only be
flushed when no packets are going through the
code. ip_ct_iterate_cleanup doesn't need to drop all references, so
flushing is moved to the cleanup path.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:54 -07:00
Arnaldo Carvalho de Melo
a55ebcc4c4 [INET]: Move bind_hash from tcp_sk to inet_sk
This should really be in a inet_connection_sock, but I'm leaving it
for a later optimization, when some more fields common to INET
transport protocols now in tcp_sk or inet_sk will be chunked out into
inet_connection_sock, for now its better to concentrate on getting the
changes in the core merged to leave the DCCP tree with only DCCP
specific code.

Next changesets will take advantage of this move to generalise things
like tcp_bind_hash, tcp_put_port, tcp_inherit_port, making the later
receive a inet_hashinfo parameter, and even __tcp_tw_hashdance, etc in
the future, when tcp_tw_bucket gets transformed into the struct
timewait_sock hierarchy.

tcp_destroy_sock also is eligible as soon as tcp_orphan_count gets
moved to sk_prot.

A cascade of incremental changes will ultimately make the tcp_lookup
functions be fully generic.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:48 -07:00
Arnaldo Carvalho de Melo
77d8bf9c62 [INET]: Move the TCP hashtable functions/structs to inet_hashtables.[ch]
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:39 -07:00
Arnaldo Carvalho de Melo
0f7ff9274e [INET]: Just rename the TCP hashtable functions/structs to inet_
This is to break down the complexity of the series of patches,
making it very clear that this one just does:

1. renames tcp_ prefixed hashtable functions and data structures that
   were already mostly generic to inet_ to share it with DCCP and
   other INET transport protocols.

2. Removes not used functions (__tb_head & tb_head)

3. Removes some leftover prototypes in the headers (tcp_bucket_unlock &
   tcp_v4_build_header)

Next changesets will move tcp_sk(sk)->bind_hash to inet_sock so that we can
make functions such as tcp_inherit_port, __tcp_inherit_port, tcp_v4_get_port,
__tcp_put_port,  generic and get others like tcp_destroy_sock closer to generic
(tcp_orphan_count will go to sk->sk_prot to allow this).

Eventually most of these functions will be used passing the transport protocol
inet_hashinfo structure.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:32 -07:00
Arnaldo Carvalho de Melo
304a16180f [INET]: Move the TCP ehash functions to include/net/inet_hashtables.h
To be shared with DCCP (and others), this is the start of a series of patches
that will expose the already generic TCP hash table routines.

The few changes noticed when calling gcc -S before/after on a pentium4 were of
this type:

        movl    40(%esp), %edx
        cmpl    %esi, 472(%edx)
        je      .L168
-       pushl   $291
+       pushl   $272
        pushl   $.LC0
        pushl   $.LC1
        pushl   $.LC2

[acme@toy net-2.6.14]$ size net/ipv4/tcp_ipv4.before.o net/ipv4/tcp_ipv4.after.o
   text    data     bss     dec     hex filename
  17804     516     140   18460    481c net/ipv4/tcp_ipv4.before.o
  17804     516     140   18460    481c net/ipv4/tcp_ipv4.after.o

Holler if some weird architecture has issues with things like this 8)

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:22 -07:00
Harald Welte
608c8e4f7b [NETFILTER]: Extend netfilter logging API
This patch is in preparation to nfnetlink_log:
- loggers now have to register struct nf_logger instead of nf_logfn
- nf_log_unregister() replaced by nf_log_unregister_pf() and
  nf_log_unregister_logger()
- add comment to ip[6]t_LOG.h to assure nobody redefines flags
- add /proc/net/netfilter/nf_log to tell user which logger is currently
  registered for which address family
- if user has configured logging, but no logging backend (logger) is
  available, always spit a message to syslog, not just the first time.
- split ip[6]t_LOG.c into two parts:
  Backend: Always try to register as logger for the respective address family
  Frontend: Always log via nf_log_packet() API
- modify all users of nf_log_packet() to accomodate additional argument

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:38:07 -07:00
Arnaldo Carvalho de Melo
32519f11d3 [INET]: Introduce inet_sk_rebuild_header
From tcp_v4_rebuild_header, that already was pretty generic, I only
needed to use sk->sk_protocol instead of the hardcoded IPPROTO_TCP and
establish the requirement that INET transport layer protocols that
want to use this function map TCP_SYN_SENT to its equivalent state.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:37:55 -07:00
Arnaldo Carvalho de Melo
6cbb0df788 [SOCK]: Introduce sk_setup_caps
From tcp_v4_setup_caps, that always is preceded by a call to
__sk_dst_set, so coalesce this sequence into sk_setup_caps, removing
one call to a TCP function in the IP layer.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:37:48 -07:00
Arnaldo Carvalho de Melo
614c6cb4f2 [SOCK]: Rename __tcp_v4_rehash to __sk_prot_rehash
This operation was already generic and DCCP will use it.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:37:42 -07:00
Arnaldo Carvalho de Melo
e6848976b7 [NET]: Cleanup INET_REFCNT_DEBUG code
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:37:29 -07:00
Patrick McHardy
d13964f449 [IPV4/6]: Check if packet was actually delivered to a raw socket to decide whether to send an ICMP unreachable
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:37:22 -07:00
Harald Welte
7af4cc3fa1 [NETFILTER]: Add "nfnetlink_queue" netfilter queue handler over nfnetlink
- Add new nfnetlink_queue module
- Add new ipt_NFQUEUE and ip6t_NFQUEUE modules to access queue numbers 1-65535
- Mark ip_queue and ip6_queue Kconfig options as OBSOLETE
- Update feature-removal-schedule to remove ip[6]_queue in December

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:36:56 -07:00
Harald Welte
0ab43f8499 [NETFILTER]: Core changes required by upcoming nfnetlink_queue code
- split netfiler verdict in 16bit verdict and 16bit queue number
- add 'queuenum' argument to nf_queue_outfn_t and its users ip[6]_queue
- move NFNL_SUBSYS_ definitions from enum to #define
- introduce autoloading for nfnetlink subsystem modules
- add MODULE_ALIAS_NFNL_SUBSYS macro
- add nf_unregister_queue_handlers() to register all handlers for a given
  nf_queue_outfn_t
- add more verbose DEBUGP macro definition to nfnetlink.c
- make nfnetlink_subsys_register fail if subsys already exists
- add some more comments and debug statements to nfnetlink.c

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:36:49 -07:00
Harald Welte
2cc7d57309 [NETFILTER]: Move reroute-after-queue code up to the nf_queue layer.
The rerouting functionality is required by the core, therefore it has
to be implemented by the core and not in individual queue handlers.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:36:19 -07:00
Harald Welte
4fdb3bb723 [NETLINK]: Add properly module refcounting for kernel netlink sockets.
- Remove bogus code for compiling netlink as module
- Add module refcounting support for modules implementing a netlink
  protocol
- Add support for autoloading modules that implement a netlink protocol
  as soon as someone opens a socket for that protocol

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:35:08 -07:00
Harald Welte
020b4c12db [NETFILTER]: Move ipv4 specific code from net/core/netfilter.c to net/ipv4/netfilter.c
Netfilter cleanup
- Move ipv4 code from net/core/netfilter.c to net/ipv4/netfilter.c
- Move ipv6 netfilter code from net/ipv6/ip6_output.c to net/ipv6/netfilter.c

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:35:01 -07:00
Harald Welte
089af26c70 [NETFILTER]: Rename skb_ip_make_writable() to skb_make_writable()
There is nothing IPv4-specific in it.  In fact, it was already used by
IPv6, too...  Upcoming nfnetlink_queue code will use it for any kind
of packet.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:34:40 -07:00
Patrick McHardy
373ac73595 [NETFILTER]: C99 initizalizers for NAT protocols
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:33:34 -07:00
Adrian Bunk
0742fd53a3 [IPV4]: possible cleanups
This patch contains the following possible cleanups:
- make needlessly global code static
- #if 0 the following unused global function:
  - xfrm4_state.c: xfrm4_state_fini
- remove the following unneeded EXPORT_SYMBOL's:
  - ip_output.c: ip_finish_output
  - ip_output.c: sysctl_ip_default_ttl
  - fib_frontend.c: ip_dev_find
  - inetpeer.c: inet_peer_idlock
  - ip_options.c: ip_options_compile
  - ip_options.c: ip_options_undo
  - net/core/request_sock.c: sysctl_max_syn_backlog

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:33:20 -07:00
David S. Miller
f2ccd8fa06 [NET]: Kill skb->real_dev
Bonding just wants the device before the skb_bond()
decapsulation occurs, so simply pass that original
device into packet_type->func() as an argument.

It remains to be seen whether we can use this same
exact thing to get rid of skb->input_dev as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:32:25 -07:00
Arnaldo Carvalho de Melo
83e3609eba [REQSK]: Move the syn_table destroy from tcp_listen_stop to reqsk_queue_destroy
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:32:11 -07:00
Harald Welte
080774a243 [NETFILTER]: Add ctnetlink subsystem
Add ctnetlink subsystem for userspace-access to ip_conntrack table.
This allows reading and updating of existing entries, as well as
creating new ones (and new expect's) via nfnetlink.

Please note the 'strange' byte order: nfattr (tag+length) are in host
byte order, while the payload is always guaranteed to be in network
byte order.  This allows a simple userspace process to encapsulate netlink
messages into arch-independent udp packets by just processing/swapping the
headers and not knowing anything about the actual payload.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:31:49 -07:00
Harald Welte
ac3247baf8 [NETFILTER]: connection tracking event notifiers
This adds a notifier chain based event mechanism for ip_conntrack state
changes.  As opposed to the previous implementations in patch-o-matic, we
do no longer need a field in the skb to achieve this.

Thanks to the valuable input from Patrick McHardy and Rusty on the idea
of a per_cpu implementation.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:31:24 -07:00
David S. Miller
8728b834b2 [NET]: Kill skb->list
Remove the "list" member of struct sk_buff, as it is entirely
redundant.  All SKB list removal callers know which list the
SKB is on, so storing this in sk_buff does nothing other than
taking up some space.

Two tricky bits were SCTP, which I took care of, and two ATM
drivers which Francois Romieu <romieu@fr.zoreil.com> fixed
up.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
2005-08-29 15:31:14 -07:00
Harald Welte
6869c4d8e0 [NETFILTER]: reduce netfilter sk_buff enlargement
As discussed at netconf'05, we're trying to save every bit in sk_buff.
The patch below makes sk_buff 8 bytes smaller.  I did some basic
testing on my notebook and it seems to work.

The only real in-tree user of nfcache was IPVS, who only needs a
single bit.  Unfortunately I couldn't find some other free bit in
sk_buff to stuff that bit into, so I introduced a separate field for
them.  Maybe the IPVS guys can resolve that to further save space.

Initially I wanted to shrink pkt_type to three bits (PACKET_HOST and
alike are only 6 values defined), but unfortunately the bluetooth code
overloads pkt_type :(

The conntrack-event-api (out-of-tree) uses nfcache, but Rusty just
came up with a way how to do it without any skb fields, so it's safe
to remove it.

- remove all never-implemented 'nfcache' code
- don't have ipvs code abuse 'nfcache' field. currently get's their own
  compile-conditional skb->ipvs_property field.  IPVS maintainers can
  decide to move this bit elswhere, but nfcache needs to die.
- remove skb->nfcache field to save 4 bytes
- move skb->nfctinfo into three unused bits to save further 4 bytes

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:31:04 -07:00
Harald Welte
bf3a46aa9b [NETFILTER]: convert nfmark and conntrack mark to 32bit
As discussed at netconf'05, we convert nfmark and conntrack-mark to be
32bits even on 64bit architectures.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:29:31 -07:00
Patrick McHardy
06c7427021 [FIB_TRIE]: Don't ignore negative results from fib_semantic_match
When a semantic match occurs either success, not found or an error
(for matching unreachable routes/blackholes) is returned. fib_trie
ignores the errors and looks for a different matching route. Treat
results other than "no match" as success and end lookup.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-23 22:06:09 -07:00
David S. Miller
d5d283751e [TCP]: Document non-trivial locking path in tcp_v{4,6}_get_port().
This trips up a lot of folks reading this code.
Put an unlikely() around the port-exhaustion test
for good measure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-23 10:49:54 -07:00
David S. Miller
89ebd197eb [TCP]: Unconditionally clear TCP_NAGLE_PUSH in skb_entail().
Intention of this bit is to force pushing of the existing
send queue when TCP_CORK or TCP_NODELAY state changes via
setsockopt().

But it's easy to create a situation where the bit never
clears.  For example, if the send queue starts empty:

1) set TCP_NODELAY
2) clear TCP_NODELAY
3) set TCP_CORK
4) do small write()

The current code will leave TCP_NAGLE_PUSH set after that
sequence.  Unconditionally clearing the bit when new data
is added via skb_entail() solves the problem.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-23 10:13:06 -07:00
Patrick McHardy
66a79a19a7 [NETFILTER]: Fix HW checksum handling in ip_queue/ip6_queue
The checksum needs to be filled in on output, after mangling a packet
ip_summed needs to be reset.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-23 10:10:35 -07:00
Dave Johnson
1344a41637 [IPV4]: Fix negative timer loop with lots of ipv4 peers.
From: Dave Johnson <djohnson+linux-kernel@sw.starentnetworks.com>

Found this bug while doing some scaling testing that created 500K inet
peers.

peer_check_expire() in net/ipv4/inetpeer.c isn't using inet_peer_gc_mintime
correctly and will end up creating an expire timer with less than the
minimum duration, and even zero/negative if enough active peers are
present.

If >65K peers, the timer will be less than inet_peer_gc_mintime, and with
>70K peers, the timer duration will reach zero and go negative.

The timer handler will continue to schedule another zero/negative timer in
a loop until peers can be aged.  This can continue for at least a few
minutes or even longer if the peers remain active due to arriving packets
while the loop is occurring.

Bug is present in both 2.4 and 2.6.  Same patch will apply to both just
fine.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-23 10:10:15 -07:00
Dmitry Yusupov
14869c3886 [TCP]: Do TSO deferral even if tail SKB can go out now.
If the tail SKB fits into the window, it is still
benefitical to defer until the goal percentage of
the window is available.  This give the application
time to feed more data into the send queue and thus
results in larger TSO frames going out.

Patch from Dmitry Yusupov <dima@neterion.com>.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-23 10:09:27 -07:00
Patrick McHardy
7e71af49d4 [NETFILTER]: Fix HW checksum handling in TCPMSS target
Most importantly, remove bogus BUG() in receive path.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-20 17:40:41 -07:00
Patrick McHardy
f93592ff4f [NETFILTER]: Fix HW checksum handling in ECN target
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-20 17:39:15 -07:00
Patrick McHardy
fd841326d7 [NETFILTER]: Fix ECN target TCP marking
An incorrect check made it bail out before doing anything.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-20 17:38:40 -07:00
Herbert Xu
6fc8b9e7c6 [IPCOMP]: Fix false smp_processor_id warning
This patch fixes a false-positive from debug_smp_processor_id().

The processor ID is only used to look up crypto_tfm objects.
Any processor ID is acceptable here as long as it is one that is
iterated on by for_each_cpu().

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-18 14:36:59 -07:00
Patrick McHardy
cb94c62c25 [IPV4]: Fix DST leak in icmp_push_reply()
Based upon a bug report and initial patch by
Ollie Wild.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-18 14:05:44 -07:00
Herbert Xu
c8ac377464 [TCP]: Fix bug #5070: kernel BUG at net/ipv4/tcp_output.c:864
1) We send out a normal sized packet with TSO on to start off.
2) ICMP is received indicating a smaller MTU.
3) We send the current sk_send_head which needs to be fragmented
since it was created before the ICMP event.  The first fragment
is then sent out.

At this point the remaining fragment is allocated by tcp_fragment.
However, its size is padded to fit the L1 cache-line size therefore
creating tail-room up to 124 bytes long.

This fragment will also be sitting at sk_send_head.

4) tcp_sendmsg is called again and it stores data in the tail-room of
of the fragment.
5) tcp_push_one is called by tcp_sendmsg which then calls tso_fragment
since the packet as a whole exceeds the MTU.

At this point we have a packet that has data in the head area being
fed to tso_fragment which bombs out.

My take on this is that we shouldn't ever call tcp_fragment on a TSO
socket for a packet that is yet to be transmitted since this creates
a packet on sk_send_head that cannot be extended.

So here is a patch to change it so that tso_fragment is always used
in this case.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-16 20:43:40 -07:00
Herbert Xu
b5da623ae9 [TCP]: Adjust {p,f}ackets_out correctly in tcp_retransmit_skb()
Well I've only found one potential cause for the assertion
failure in tcp_mark_head_lost.  First of all, this can only
occur if cnt > 1 since tp->packets_out is never zero here.
If it did hit zero we'd have much bigger problems.

So cnt is equal to fackets_out - reordering.  Normally
fackets_out is less than packets_out.  The only reason
I've found that might cause fackets_out to exceed packets_out
is if tcp_fragment is called from tcp_retransmit_skb with a
TSO skb and the current MSS is greater than the MSS stored
in the TSO skb.  This might occur as the result of an expiring
dst entry.

In that case, packets_out may decrease (line 1380-1381 in
tcp_output.c).  However, fackets_out is unchanged which means
that it may in fact exceed packets_out.

Previously tcp_retrans_try_collapse was the only place where
packets_out can go down and it takes care of this by decrementing
fackets_out.

So we should make sure that fackets_out is reduced by an appropriate
amount here as well.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-10 18:32:36 -07:00
Linus Torvalds
92e52b2e82 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2005-08-08 16:06:01 -07:00
Heikki Orsila
ca9334523c [IPV4]: Debug cleanup
Here's a small patch to cleanup NETDEBUG() use in net/ipv4/ for Linux 
kernel 2.6.13-rc5. Also weird use of indentation is changed in some
places.

Signed-off-by: Heikki Orsila <heikki.orsila@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-08 14:26:52 -07:00
Harald Welte
8b83bc77bf [PATCH] don't try to do any NAT on untracked connections
With the introduction of 'rustynat' in 2.6.11, the old tricks of preventing
NAT of 'untracked' connections (e.g. NOTRACK target in 'raw' table) are no
longer sufficient.

The ip_conntrack_untracked.status |= IPS_NAT_DONE_MASK effectively
prevents iteration of the 'nat' table, but doesn't prevent nat_packet()
to be executed.  Since nr_manips is gone in 'rustynat', nat_packet() now
implicitly thinks that it has to do NAT on the packet.

This patch fixes that problem by explicitly checking for
ip_conntrack_untracked in ip_nat_fn().

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-08 11:48:28 -07:00
Herbert Xu
6fc0b4a7a7 [IPSEC]: Restrict socket policy loading to CAP_NET_ADMIN.
The interface needs much redesigning if we wish to allow
normal users to do this in some way.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-06 06:33:15 -07:00
David S. Miller
b7656e7f29 [IPV4]: Fix memory leak during fib_info hash expansion.
When we grow the tables, we forget to free the olds ones
up.

Noticed by Yan Zheng.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-05 04:12:48 -07:00
Herbert Xu
b68e9f8572 [PATCH] tcp: fix TSO cwnd caching bug
tcp_write_xmit caches the cwnd value indirectly in cwnd_quota.  When
tcp_transmit_skb reduces the cwnd because of tcp_enter_cwr, the cached
value becomes invalid.

This patch ensures that the cwnd value is always reread after each
tcp_transmit_skb call.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 21:43:14 -07:00
David S. Miller
846998ae87 [PATCH] tcp: fix TSO sizing bugs
MSS changes can be lost since we preemptively initialize the tso_segs count
for an SKB before we %100 commit to sending it out.

So, by the time we send it out, the tso_size information can be stale due
to PMTU events.  This mucks up all of the logic in our send engine, and can
even result in the BUG() triggering in tcp_tso_should_defer().

Another problem we have is that we're storing the tp->mss_cache, not the
SACK block normalized MSS, as the tso_size.  That's wrong too.

Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 21:43:14 -07:00
Alexey Kuznetsov
db44575f6f [NET]: fix oops after tunnel module unload
Tunnel modules used to obtain module refcount each time when
some tunnel was created, which meaned that tunnel could be unloaded
only after all the tunnels are deleted.

Since killing old MOD_*_USE_COUNT macros this protection has gone.
It is possible to return it back as module_get/put, but it looks
more natural and practically useful to force destruction of all
the child tunnels on module unload.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-30 17:46:44 -07:00
Harald Welte
1f494c0e04 [NETFILTER] Inherit masq_index to slave connections
masq_index is used for cleanup in case the interface address changes
(such as a dialup ppp link with dynamic addreses).  Without this patch,
slave connections are not evicted in such a case, since they don't inherit
masq_index.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-30 17:44:07 -07:00
Baruch Even
d1b04c081e [NET]: Spelling mistakes threshoulds -> thresholds
Just simple spelling mistake fixes.

Signed-Off-By: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-30 17:41:59 -07:00
Matt Mackall
5e43db7730 [NET]: Move in_aton from net/ipv4/utils.c to net/core/utils.c
Move in_aton to allow netpoll and pktgen to work without the rest of
the IPv4 stack. Fix whitespace and add comment for the odd placement.

Delete now-empty net/ipv4/utils.c

Re-enable netpoll/netconsole without CONFIG_INET

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-27 15:24:42 -07:00
Nick Sillik
7cee432a22 [NETFILTER]: Fix -Wunder error in ip_conntrack_core.c
Signed-off-by: Nick Sillik <n.sillik@temple.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-27 14:46:03 -07:00
Hans-Juergen Tappe (SYSGO AG)
eaa1c5d059 [IPV4]: Fix Kconfig syntax error
From: "Hans-Juergen Tappe (SYSGO AG)" <hjt@sysgo.com>

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-27 13:00:04 -07:00
Patrick McHardy
74bb421da7 [NETFILTER]: Use correct byteorder in ICMP NAT
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-22 12:51:38 -07:00
Patrick McHardy
21f930e4ab [NETFILTER]: Wait until all references to ip_conntrack_untracked are dropped on unload
Fixes a crash when unloading ip_conntrack.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-22 12:51:03 -07:00
Patrick McHardy
d04b4f8c1c [NETFILTER]: Fix potential memory corruption in NAT code (aka memory NAT)
The portptr pointing to the port in the conntrack tuple is declared static,
which could result in memory corruption when two packets of the same
protocol are NATed at the same time and one conntrack goes away.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-22 12:50:29 -07:00
Rusty Russell
4acdbdbe50 [NETFILTER]: ip_conntrack_expect_related must not free expectation
If a connection tracking helper tells us to expect a connection, and
we're already expecting that connection, we simply free the one they
gave us and return success.

The problem is that NAT helpers (eg. FTP) have to allocate the
expectation first (to see what port is available) then rewrite the
packet.  If that rewrite fails, they try to remove the expectation,
but it was freed in ip_conntrack_expect_related.

This is one example of a larger problem: having registered the
expectation, the pointer is no longer ours to use.  Reference counting
is needed for ctnetlink anyway, so introduce it now.

To have a single "put" path, we need to grab the reference to the
connection on creation, rather than open-coding it in the caller.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-21 13:14:46 -07:00
Patrick McHardy
0303770deb [NET]: Make ipip/ip6_tunnel independant of XFRM
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-19 14:03:34 -07:00
Stephen Hemminger
c877efb207 [IPV4]: Fix up lots of little whitespace indentation stuff in fib_trie.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-19 14:01:51 -07:00
Patrick McHardy
abaacad9bc [IPV4]: Don't select XFRM for ip_gre
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-19 13:59:17 -07:00
Adrian Bunk
6876f95f20 [IPV4]: fix IP_FIB_HASH kconfig warning
This patch fixes the following kconfig warning:
  net/ipv4/Kconfig:92:warning: defaults for choice values not supported

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-18 13:55:19 -07:00
Phil Oester
84531c24f2 [NETFILTER]: Revert nf_reset change
Revert the nf_reset change that caused so much trouble, drop conntrack
references manually before packets are queued to packet sockets.

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-12 11:57:52 -07:00
Sam Ravnborg
6a2e9b738c [NET]: move config options out to individual protocols
Move the protocol specific config options out to the specific protocols.
With this change net/Kconfig now starts to become readable and serve as a
good basis for further re-structuring.

The menu structure is left almost intact, except that indention is
fixed in most cases. Most visible are the INET changes where several
"depends on INET" are replaced with a single ifdef INET / endif pair.

Several new files were created to accomplish this change - they are
small but serve the purpose that config options are now distributed
out where they belongs.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-11 21:13:56 -07:00
Olaf Kirch
0b7f22aab4 [IPV4]: Prevent oops when printing martian source
In some cases, we may be generating packets with a source address that
qualifies as martian. This can happen when we're in the middle of setting
up the network, and netfilter decides to reject a packet with an RST.
The IPv4 routing code would try to print a warning and oops, because
locally generated packets do not have a valid skb->mac.raw pointer
at this point.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-11 21:01:42 -07:00
Julian Anastasov
af9debd461 [IPVS]: Add and reorder bh locks after moving to keventd.
An addition to the last ipvs changes that move
update_defense_level/si_meminfo to keventd:

- ip_vs_random_dropentry now runs in process context and should use _bh
  locks to protect from softirqs

- update_defense_level still needs _bh locks after si_meminfo is called,
  for the same purpose

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-11 20:59:57 -07:00
David L Stevens
84b42baef7 [IPV4]: fix IPv4 leave-group group matching
This patch fixes the multicast group matching for 
IP_DROP_MEMBERSHIP, similar to the IP_ADD_MEMBERSHIP fix in a prior
patch. Groups are identifiedby <group address,interface> and including
the interface address in the match will fail if a leave-group is done
by address when the join was done by index, or if different addresses
on the same interface are used in the join and leave.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-08 17:48:38 -07:00
David L Stevens
9951f036fe [IPV4]: (INCLUDE,empty)/leave-group equivalence for full-state MSF APIs & errno fix
1) Adds (INCLUDE, empty)/leave-group equivalence to the full-state 
   multicast source filter APIs (IPv4 and IPv6)

2) Fixes an incorrect errno in the IPv6 leave-group (ENOENT should be
   EADDRNOTAVAIL)

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-08 17:47:28 -07:00
David L Stevens
917f2f105e [IPV4]: multicast API "join" issues
1) In the full-state API when imsf_numsrc == 0
   errno should be "0", but returns EADDRNOTAVAIL

2) An illegal filter mode change
   errno should be EINVAL, but returns EADDRNOTAVAIL

3) Trying to do an any-source option without IP_ADD_MEMBERSHIP
   errno should be EINVAL, but returns EADDRNOTAVAIL

4) Adds comments for the less obvious error return values

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-08 17:45:16 -07:00
David L Stevens
8cdaaa15da [IPV4]: multicast API "join" issues
1) Changes IP_ADD_SOURCE_MEMBERSHIP and MCAST_JOIN_SOURCE_GROUP to ignore
   EADDRINUSE errors on a "courtesy join" -- prior membership or not
   is ok for these.

2) Adds "leave group" equivalence of (INCLUDE, empty) filters in the 
   delta-based API. Without this, mixing delta-based API calls that
   end in an (INCLUDE, empty) filter would not allow a subsequent
   regular IP_ADD_MEMBERSHIP. It also frees socket buffer memory that
   isn't needed for both the multicast group record and source filter.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-08 17:39:23 -07:00
David L Stevens
ca9b907d14 [IPV4]: multicast API "join" issues
This patch corrects a few problems with the IP_ADD_MEMBERSHIP
socket option:

1) The existing code makes an attempt at reference counting joins when
   using the ip_mreqn/imr_ifindex interface. Joining the same group
   on the same socket is an error, whatever the API. This leads to
   unexpected results when mixing ip_mreqn by index with ip_mreqn by
   address, ip_mreq, or other API's. For example, ip_mreq followed by
   ip_mreqn of the same group will "work" while the same two reversed
   will not.
           Fixed to always return EADDRINUSE on a duplicate join and
   removed the (now unused) reference count in ip_mc_socklist.

2) The group-search list in ip_mc_join_group() is comparing a full 
   ip_mreqn structure and all of it must match for it to find the
   group. This doesn't correctly match a group that was joined with
   ip_mreq or ip_mreqn with an address (with or without an index). It
   also doesn't match groups that are joined by different addresses on
   the same interface. All of these are the same multicast group,
   which is identified by group address and interface index.
           Fixed the check to correctly match groups so we don't get
   duplicate group entries on the ip_mc_socklist.

3) The old code allocates a multicast address before searching for
   duplicates requiring it to free in various error cases. This
   patch moves the allocate until after the search and
   igmp_max_memberships check, so never a need to allocate, then free
   an entry.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-08 17:38:07 -07:00
Alexey Kuznetsov
4c866aa798 [IPV4]: Apply sysctl_icmp_echo_ignore_broadcasts to ICMP_TIMESTAMP as well.
This was the full intention of the original code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-08 17:34:46 -07:00
Victor Fusco
86a76caf87 [NET]: Fix sparse warnings
From: Victor Fusco <victor@cetuc.puc-rio.br>

Fix the sparse warning "implicit cast to nocast type"

Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-08 14:57:47 -07:00
David S. Miller
b03efcfb21 [NET]: Transform skb_queue_len() binary tests into skb_queue_empty()
This is part of the grand scheme to eliminate the qlen
member of skb_queue_head, and subsequently remove the
'list' member of sk_buff.

Most users of skb_queue_len() want to know if the queue is
empty or not, and that's trivially done with skb_queue_empty()
which doesn't use the skb_queue_head->qlen member and instead
uses the queue list emptyness as the test.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-08 14:57:23 -07:00
David S. Miller
908a75c17a [TCP]: Never TSO defer under periods of congestion.
Congestion window recover after loss depends upon the fact
that if we have a full MSS sized frame at the head of the
send queue, we will send it.  TSO deferral can defeat the
ACK clocking necessary to exit cleanly from recovery.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:43:58 -07:00
David S. Miller
c1b4a7e695 [TCP]: Move to new TSO segmenting scheme.
Make TSO segment transmit size decisions at send time not earlier.

The basic scheme is that we try to build as large a TSO frame as
possible when pulling in the user data, but the size of the TSO frame
output to the card is determined at transmit time.

This is guided by tp->xmit_size_goal.  It is always set to a multiple
of MSS and tells sendmsg/sendpage how large an SKB to try and build.

Later, tcp_write_xmit() and tcp_push_one() chop up the packet if
necessary and conditions warrant.  These routines can also decide to
"defer" in order to wait for more ACKs to arrive and thus allow larger
TSO frames to be emitted.

A general observation is that TSO elongates the pipe, thus requiring a
larger congestion window and larger buffering especially at the sender
side.  Therefore, it is important that applications 1) get a large
enough socket send buffer (this is accomplished by our dynamic send
buffer expansion code) 2) do large enough writes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:24:38 -07:00
David S. Miller
0d9901df62 [TCP]: Break out send buffer expansion test.
This makes it easier to understand, and allows easier
tweaking of the heuristic later on.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:21:10 -07:00
David S. Miller
cb83199a29 [TCP]: Do not call tcp_tso_acked() if no work to do.
In tcp_clean_rtx_queue(), if the TSO packet is not even partially
acked, do not waste time calling tcp_tso_acked().

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:20:55 -07:00
David S. Miller
a56476962e [TCP]: Kill bogus comment above tcp_tso_acked().
Everything stated there is out of data.  tcp_trim_skb()
does adjust the available socket send buffer space and
skb->truesize now.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:20:41 -07:00
David S. Miller
b4e26f5ea0 [TCP]: Fix send-side cpu utiliziation regression.
Only put user data purely to pages when doing TSO.

The extra page allocations cause two problems:

1) Add the overhead of the page allocations themselves.
2) Make us do small user copies when we get to the end
   of the TCP socket cache page.

It is still beneficial to purely use pages for TSO,
so we will do it for that case.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:20:27 -07:00
David S. Miller
aa93466bdf [TCP]: Eliminate redundant computations in tcp_write_xmit().
tcp_snd_test() is run for every packet output by a single
call to tcp_write_xmit(), but this is not necessary.

For one, the congestion window space needs to only be
calculated one time, then used throughout the duration
of the loop.

This cleanup also makes experimenting with different TSO
packetization schemes much easier.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:20:09 -07:00
David S. Miller
7f4dd0a943 [TCP]: Break out tcp_snd_test() into it's constituent parts.
tcp_snd_test() does several different things, use inline
functions to express this more clearly.

1) It initializes the TSO count of SKB, if necessary.
2) It performs the Nagle test.
3) It makes sure the congestion window is adhered to.
4) It makes sure SKB fits into the send window.

This cleanup also sets things up so that things like the
available packets in the congestion window does not need
to be calculated multiple times by packet sending loops
such as tcp_write_xmit().
    
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:19:54 -07:00
David S. Miller
55c97f3e99 [TCP]: Fix __tcp_push_pending_frames() 'nonagle' handling.
'nonagle' should be passed to the tcp_snd_test() function
as 'TCP_NAGLE_PUSH' if we are checking an SKB not at the
tail of the write_queue.  This is because Nagle does not
apply to such frames since we cannot possibly tack more
data onto them.

However, while doing this __tcp_push_pending_frames() makes
all of the packets in the write_queue use this modified
'nonagle' value.

Fix the bug and simplify this function by just calling
tcp_write_xmit() directly if sk_send_head is non-NULL.

As a result, we can now make tcp_data_snd_check() just call
tcp_push_pending_frames() instead of the specialized
__tcp_data_snd_check().

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:19:38 -07:00
David S. Miller
a2e2a59c93 [TCP]: Fix redundant calculations of tcp_current_mss()
tcp_write_xmit() uses tcp_current_mss(), but some of it's callers,
namely __tcp_push_pending_frames(), already has this value available
already.

While we're here, fix the "cur_mss" argument to be "unsigned int"
instead of plain "unsigned".

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:19:23 -07:00
David S. Miller
92df7b518d [TCP]: tcp_write_xmit() tabbing cleanup
Put the main basic block of work at the top-level of
tabbing, and mark the TCP_CLOSE test with unlikely().

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:19:06 -07:00
David S. Miller
a762a98007 [TCP]: Kill extra cwnd validate in __tcp_push_pending_frames().
The tcp_cwnd_validate() function should only be invoked
if we actually send some frames, yet __tcp_push_pending_frames()
will always invoke it.  tcp_write_xmit() does the call for us,
so the call here can simply be removed.

Also, tcp_write_xmit() can be marked static.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:18:51 -07:00
David S. Miller
f44b527177 [TCP]: Add missing skb_header_release() call to tcp_fragment().
When we add any new packet to the TCP socket write queue,
we must call skb_header_release() on it in order for the
TSO sharing checks in the drivers to work.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:18:34 -07:00
David S. Miller
84d3e7b957 [TCP]: Move __tcp_data_snd_check into tcp_output.c
It reimplements portions of tcp_snd_check(), so it
we move it to tcp_output.c we can consolidate it's
logic much easier in a later change.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:18:18 -07:00
David S. Miller
f6302d1d78 [TCP]: Move send test logic out of net/tcp.h
This just moves the code into tcp_output.c, no code logic changes are
made by this patch.

Using this as a baseline, we can begin to untangle the mess of
comparisons for the Nagle test et al.  We will also be able to reduce
all of the redundant computation that occurs when outputting data
packets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:18:03 -07:00
David S. Miller
fc6415bcb0 [TCP]: Fix quick-ack decrementing with TSO.
On each packet output, we call tcp_dec_quickack_mode()
if the ACK flag is set.  It drops tp->ack.quick until
it hits zero, at which time we deflate the ATO value.

When doing TSO, we are emitting multiple packets with
ACK set, so we should decrement tp->ack.quick that many
segments.

Note that, unlike this case, tcp_enter_cwr() should not
take the tcp_skb_pcount(skb) into consideration.  That
function, one time, readjusts tp->snd_cwnd and moves
into TCP_CA_CWR state.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:17:45 -07:00
David S. Miller
c65f7f00c5 [TCP]: Simplify SKB data portion allocation with NETIF_F_SG.
The ideal and most optimal layout for an SKB when doing
scatter-gather is to put all the headers at skb->data, and
all the user data in the page array.

This makes SKB splitting and combining extremely simple,
especially before a packet goes onto the wire the first
time.

So, when sk_stream_alloc_pskb() is given a zero size, make
sure there is no skb_tailroom().  This is achieved by applying
SKB_DATA_ALIGN() to the header length used here.

Next, make select_size() in TCP output segmentation use a
length of zero when NETIF_F_SG is true on the outgoing
interface.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:17:25 -07:00
Robert Olsson
2f36895aa7 [IPV4]: More broken memory allocation fixes for fib_trie
Below a patch to preallocate memory when doing resize of trie (inflate halve)
If preallocations fails it just skips the resize of this tnode for this time.

The oops we got when killing bgpd (with full routing) is now gone. 
Patrick memory patch is also used.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:02:40 -07:00
Eric Dumazet
bb1d23b026 [IPV4]: Bug fix in rt_check_expire()
- rt_check_expire() fixes (an overflow occured if size of the hash
  was >= 65536)

reminder of the bugfix:

The rt_check_expire() has a serious problem on machines with large
route caches, and a standard HZ value of 1000.

With default values, ie ip_rt_gc_interval = 60*HZ = 60000 ;

the loop count :

     for (t = ip_rt_gc_interval << rt_hash_log; t >= 0;


overflows (t is a 31 bit value) as soon rt_hash_log is >= 16  (65536
slots in route cache hash table).

In this case, rt_check_expire() does nothing at all

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 15:00:32 -07:00
Eric Dumazet
424c4b70cc [IPV4]: Use the fancy alloc_large_system_hash() function for route hash table
- rt hash table allocated using alloc_large_system_hash() function.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 14:58:19 -07:00
Eric Dumazet
22c047ccbc [NET]: Hashed spinlocks in net/ipv4/route.c
- Locking abstraction
- Spinlocks moved out of rt hash table : Less memory (50%) used by rt 
  hash table. it's a win even on UP.
- Sizing of spinlocks table depends on NR_CPUS

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 14:55:24 -07:00
Patrick McHardy
f0e36f8cee [IPV4]: Handle large allocations in fib_trie
Inflating a node a couple of times makes it exceed the 128k kmalloc limit.
Use __get_free_pages for allocations > PAGE_SIZE, as in fib_hash.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 14:44:55 -07:00
Herbert Xu
30e224d76f [IPV4]: Fix crash in ip_rcv while booting related to netconsole
Makes IPv4 ip_rcv registration happen last in af_inet.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 14:40:10 -07:00
Thomas Graf
e176fe8954 [NET]: Remove unused security member in sk_buff
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-05 14:12:44 -07:00
Patrick McHardy
9666dae510 [NETFILTER]: Fix connection tracking bug in 2.6.12
In 2.6.12 we started dropping the conntrack reference when a packet
leaves the IP layer. This broke connection tracking on a bridge,
because bridge-netfilter defers calling some NF_IP_* hooks to the bridge
layer for locally generated packets going out a bridge, where the
conntrack reference is no longer available. This patch keeps the
reference in this case as a temporary solution, long term we will
remove the defered hook calling. No attempt is made to drop the
reference in the bridge-code when it is no longer needed, tc actions
could already have sent the packet anywhere.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-28 16:04:44 -07:00
Neil Horman
fb3d89498d [IPVS]: Close race conditions on ip_vs_conn_tab list modification
In an smp system, it is possible for an connection timer to expire, calling
ip_vs_conn_expire while the connection table is being flushed, before
ct_write_lock_bh is acquired.

Since the list iterator loop in ip_vs_con_flush releases and re-acquires the
spinlock (even though it doesn't re-enable softirqs), it is possible for the
expiration function to modify the connection list, while it is being traversed
in ip_vs_conn_flush.

The result is that the next pointer gets set to NULL, and subsequently
dereferenced, resulting in an oops.

Signed-off-by: Neil Horman <nhorman@redhat.com>
Acked-by: JulianAnastasov
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-28 15:40:02 -07:00
Robert Olsson
f835e471b5 [IPV4]: Broken memory allocation in fib_trie
This should help up the insertion... but the resize is more crucial.
and complex and needs some thinking. 

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-28 15:00:39 -07:00
Maxime Bizon
7a1af5d7bb [IPV4]: ipconfig.c: fix dhcp timeout behaviour
I think there is a small bug in ipconfig.c in case IPCONFIG_DHCP is set
and dhcp is used.

When a DHCPOFFER is received, ip address is kept until we get DHCPACK.
If no ack is received, ic_dynamic() returns negatively, but leaves the
offered ip address in ic_myaddr.

This makes the main loop in ip_auto_config() break and uses the maybe
incomplete configuration.

Not sure if it's the best way to do, but the following trivial patch
correct this. 

Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-28 13:21:12 -07:00
Dietmar Eggemann
2c2910a401 [IPV4]: Snmpv2 Mib IP counter ipInAddrErrors support
I followed Thomas' proposal to see every martian destination as a case
where the ipInAddrErrors counter has to be incremented. There are
two advantages by doing so: (1) The relation between the ipInReceive
counter and all the other ipInXXX counters is more accurate in the
case the RTN_UNICAST code check fails and (2) it makes the code in
ip_route_input_slow easier.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-28 13:06:23 -07:00
Patrick McHardy
9ef1d4c7c7 [NETLINK]: Missing initializations in dumped data
Mostly missing initialization of padding fields of 1 or 2 bytes length,
two instances of uninitialized nlmsgerr->msg of 16 bytes length.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-28 12:55:30 -07:00
Harald Welte
4095ebf1e6 [NETFILTER]: ipt_CLUSTERIP: fix ARP mangling
This patch adds mangling of ARP requests (in addition to replies),
since ARP caches are made from snooping both requests and replies.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-28 12:49:30 -07:00
pageexec
4da62fc70d [IPVS]: Fix for overflows
From: <pageexec@freemail.hu>

$subject was fixed in 2.4 already, 2.6 needs it as well.

The impact of the bugs is a kernel stack overflow and privilege escalation
from CAP_NET_ADMIN via the IP_VS_SO_SET_STARTDAEMON/IP_VS_SO_GET_DAEMON
ioctls.  People running with 'root=all caps' (i.e., most users) are not
really affected (there's nothing to escalate), but SELinux and similar
users should take it seriously if they grant CAP_NET_ADMIN to other users.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-26 16:00:19 -07:00
Adrian Bunk
60fe740320 [TCP]: Let TCP_CONG_ADVANCED default to n
It doesn't seem to make much sense to let an "If unsure, say N." option 
default to y.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-26 15:21:15 -07:00
David S. Miller
6c3607676c [IPV4]: Fix thinko in TCP_CONG_BIC default.
Since it is tristate when we offer it as a choice, we should
definte it also as tristate when forcing it as the default.
Otherwise kconfig warns.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-26 15:20:20 -07:00
David S. Miller
a6484045fd [TCP]: Do not present confusing congestion control options by default.
Create TCP_CONG_ADVANCED option, akin to IP_ADVANCED_ROUTER, which
when disabled will bypass all of the congestion control Kconfig
options and leave the user with a safe default.

That safe default is currently BIC-TCP with new Reno as a fallback.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-24 18:07:51 -07:00
David S. Miller
bb298ca3ce [IPV4]: Move FIB lookup algorithm choice under IP_ADVANCED_ROUTING
Most users need not be concerned with a complex choice of what
FIB lookup algorithm to use.  So give them the safe default of
IP_FIB_HASH if IP_ADVANCED_ROUTING is disabled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-24 17:50:53 -07:00
Stephen Hemminger
5f8ef48d24 [TCP]: Allow choosing TCP congestion control via sockopt.
Allow using setsockopt to set TCP congestion control to use on a per
socket basis.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 20:37:36 -07:00
John Heffner
0e57976b63 [TCP]: Add Scalable TCP congestion control module.
This patch implements Tom Kelly's Scalable TCP congestion control algorithm 
for the modular framework.

The algorithm has some nice scaling properties, and has been used a fair bit 
in research, though is known to have significant fairness issues, so it's not 
really suitable for general purpose use.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:29:07 -07:00
Baruch Even
a7868ea68d [TCP]: Add H-TCP congestion control module.
H-TCP is a congestion control algorithm developed at the Hamilton Institute, by
Douglas Leith and Robert Shorten. It is extending the standard Reno algorithm
with mode switching is thus a relatively simple modification.

H-TCP is defined in a layered manner as it is still a research platform. The
basic form includes the modification of beta according to the ratio of maxRTT
to min RTT and the alpha=2*factor*(1-beta) relation, where factor is dependant
on the time since last congestion.

The other layers improve convergence by adding appropriate factors to alpha.

The following patch implements the H-TCP algorithm in it's basic form.

Signed-Off-By: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:28:11 -07:00
Stephen Hemminger
b87d8561d8 [TCP]: Add TCP Vegas congestion control module.
TCP Vegas code modified for the new TCP infrastructure.  
Vegas now uses microsecond resolution timestamps for 
better estimation of performance over higher speed links.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:27:19 -07:00
Daniele Lacamera
835b3f0c0d [TCP]: Add TCP Hybla congestion control module.
TCP Hybla congestion avoidance.

- "In heterogeneous networks, TCP connections that incorporate a
terrestrial or satellite radio link are greatly disadvantaged with
respect to entirely wired connections, because of their longer round
trip times (RTTs). To cope with this problem, a new TCP proposal, the
TCP Hybla, is presented and discussed in the paper[1]. It stems from an
analytical evaluation of the congestion window dynamics in the TCP
standard versions (Tahoe, Reno, NewReno), which suggests the necessary
modifications to remove the performance dependence on RTT.[...]"[1]

[1]: Carlo Caini, Rosario Firrincieli, "TCP Hybla: a TCP enhancement for
heterogeneous networks",
International Journal of Satellite Communications and Networking
Volume 22, Issue 5 , Pages 547 - 566. September 2004.

Signed-off-by: Daniele Lacamera (root at danielinux.net)net
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:26:34 -07:00
John Heffner
a628d29b56 [TCP]: Add High Speed TCP congestion control module.
Sally Floyd's high speed TCP congestion control.
This is useful for comparison and research.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:24:58 -07:00
Stephen Hemminger
8727076289 [TCP]: Add TCP Westwood congestion control module.
This is the existing 2.6.12 Westwood code moved from tcp_input
to the new congestion framework. A lot of the inline functions
have been eliminated to try and make it clearer.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:24:09 -07:00
Stephen Hemminger
83803034f4 [TCP]: Add TCP BIC congestion control module.
TCP BIC congestion control reworked to use the new congestion control 
infrastructure. This version is more up to date than the BIC
code in 2.6.12; it incorporates enhancements from BICTCP 1.1, 
to handle low latency links.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:23:25 -07:00
Stephen Hemminger
056ede6cfa [TCP]: Report congestion control algorithm in tcp_diag.
Enhancement to the tcp_diag interface used by the iproute2 ss command
to report the tcp congestion control being used by a socket.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:21:28 -07:00
Stephen Hemminger
7c99c909fa [TCP]: Change tcp_diag to use the existing __RTA_PUT() macro.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:20:36 -07:00
Stephen Hemminger
317a76f9a4 [TCP]: Add pluggable congestion control algorithm infrastructure.
Allow TCP to have multiple pluggable congestion control algorithms.
Algorithms are defined by a set of operations and can be built in
or modules.  The legacy "new RENO" algorithm is used as a starting
point and fallback.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-23 12:19:55 -07:00
Paulo Marques
543537bd92 [PATCH] create a kstrdup library function
This patch creates a new kstrdup library function and changes the "local"
implementations in several places to use this function.

Most of the changes come from the sound and net subsystems.  The sound part
had already been acknowledged by Takashi Iwai and the net part by David S.
Miller.

I left UML alone for now because I would need more time to read the code
carefully before making changes there.

Signed-off-by: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:18 -07:00
Chuck Short
7abaa27c1c [IPV4]: Fix route.c gcc4 warnings
Signed-off by: Chuck Short <zulcss@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-22 22:10:23 -07:00
Harald Welte
5d927eb010 [NETFILTER]: Fix handling of ICMP packets (RELATED) in ipt_CLUSTERIP target.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-22 12:37:50 -07:00
Kumar Gala
b535420739 [PATCH] Fix extra double quote in IPV4 Kconfig
Kconfig option had an extra double quote at the end of the line
which was causing in warning when building.

Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-22 10:40:39 -07:00
David S. Miller
90f66914c8 [IPV4]: Fix fib_trie.c's args to fib_dump_info().
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-21 14:43:28 -07:00
Patrick McHardy
2715bcf9ef [NETFILTER]: Drop conntrack reference in ip_call_ra_chain()/ip_mr_input()
Drop reference before handing the packets to raw_rcv()

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-21 14:06:24 -07:00
Patrick McHardy
6150bacfec [NETFILTER]: Check TCP checksum in ipt_REJECT
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-21 14:03:46 -07:00
Keir Fraser
e3be8ba792 [NETFILTER]: Avoid unncessary checksum validation in UDP connection tracking
Signed-off-by: Keir Fraser <Keir.Fraser@xl.cam.ac.uk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-21 14:03:23 -07:00
Phil Oester
1d3cdb41f5 [NETFILTER]: expectation timeouts are compulsory
Since expectation timeouts were made compulsory [1], there is no need to
check for them in ip_conntrack_expect_insert.

[1] https://lists.netfilter.org/pipermail/netfilter-devel/2005-January/018143.html

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-21 14:02:42 -07:00
Patrick McHardy
18b8afc771 [NETFILTER]: Kill nf_debug
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-21 14:01:57 -07:00
Patrick McHardy
e45b1be8bc [NETFILTER]: Kill lockhelp.h
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-21 14:01:30 -07:00
Robert Olsson
19baf839ff [IPV4]: Add LC-Trie FIB lookup algorithm.
Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-21 12:43:18 -07:00
Robert Olsson
246955fe4c [NETLINK]: fib_lookup() via netlink
Below is a more generic patch to do fib_lookup via netlink. For others 
we should say that we discussed this as a way to verify route selection.
It's also possible there are others uses for this.

In short the fist half of struct fib_result_nl is filled in by caller 
and netlink call fills in the other half and returns it.

In case anyone is interested there is a corresponding user app to compare 
the full routing table this was used to test implementation of the LC-trie. 

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-20 13:36:39 -07:00
Herbert Xu
dd87147eed [IPSEC]: Add XFRM_STATE_NOPMTUDISC flag
This patch adds the flag XFRM_STATE_NOPMTUDISC for xfrm states.  It is
similar to the nopmtudisc on IPIP/GRE tunnels.  It only has an effect
on IPv4 tunnel mode states.  For these states, it will ensure that the
DF flag is always cleared.

This is primarily useful to work around ICMP blackholes.

In future this flag could also allow a larger MTU to be set within the
tunnel just like IPIP/GRE tunnels.  This could be useful for short haul
tunnels where temporary fragmentation outside the tunnel is desired over
smaller fragments inside the tunnel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-20 13:21:43 -07:00
Herbert Xu
72cb6962a9 [IPSEC]: Add xfrm_init_state
This patch adds xfrm_init_state which is simply a wrapper that calls
xfrm_get_type and subsequently x->type->init_state.  It also gets rid
of the unused args argument.

Abstracting it out allows us to add common initialisation code, e.g.,
to set family-specific flags.

The add_time setting in xfrm_user.c was deleted because it's already
set by xfrm_state_alloc.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-20 13:18:08 -07:00
David S. Miller
7df551254a [TCP]: Fix sysctl_tcp_low_latency
When enabled, this should disable UCOPY prequeue'ing altogether,
but it does not due to a missing test.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 23:01:10 -07:00
Jesper Juhl
f7d7fc0322 [IPV4]: [4/4] signed vs unsigned cleanup in net/ipv4/raw.c
This patch changes the type of the third parameter 'length' of the 
raw_send_hdrinc() function from 'int' to 'size_t'.
This makes sense since this function is only ever called from one 
location, and the value passed as the third parameter in that location is 
itself of type size_t, so this makes the recieving functions parameter 
type match. Also, inside raw_send_hdrinc() the 'length' variable is 
used in comparisons with unsigned values and passed as parameter to 
functions expecting unsigned values (it's used in a single comparison with 
a signed value, but that one can never actually be negative so the patch 
also casts that one to size_t to stop gcc worrying, and it is passed in a 
single instance to memcpy_fromiovecend() which expects a signed int, but 
as far as I can see that's not a problem since the value of 'length' 
shouldn't ever exceed the value of a signed int).

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 23:00:34 -07:00
Jesper Juhl
93765d8a43 [IPV4]: [3/4] signed vs unsigned cleanup in net/ipv4/raw.c
This patch changes the type of the local variable 'i' in 
raw_probe_proto_opt() from 'int' to 'unsigned int'. The only use of 'i' in 
this function is as a counter in a for() loop and subsequent index into 
the msg->msg_iov[] array.
Since 'i' is compared in a loop to the unsigned variable msg->msg_iovlen 
gcc -W generates this warning : 

net/ipv4/raw.c:340: warning: comparison between signed and unsigned

Changing 'i' to unsigned silences this warning and is safe since the array 
index can never be negative anyway, so unsigned int is the logical type to 
use for 'i' and also enables a larger msg_iov[] array (but I don't know if 
that will ever matter).

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 23:00:15 -07:00
Jesper Juhl
926d4b8122 [IPV4]: [2/4] signed vs unsigned cleanup in net/ipv4/raw.c
This patch gets rid of the following gcc -W warning in net/ipv4/raw.c :

net/ipv4/raw.c:387: warning: comparison of unsigned expression < 0 is always false

Since 'len' is of type size_t it is unsigned and can thus never be <0, and 
since this is obvious from the function declaration just a few lines above 
I think it's ok to remove the pointless check for len<0.


Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 23:00:00 -07:00
Jesper Juhl
5418c6926f [IPV4]: [1/4] signed vs unsigned cleanup in net/ipv4/raw.c
This patch silences these two gcc -W warnings in net/ipv4/raw.c :

net/ipv4/raw.c:517: warning: signed and unsigned type in conditional expression
net/ipv4/raw.c:613: warning: signed and unsigned type in conditional expression

It doesn't change the behaviour of the code, simply writes the conditional 
expression with plain 'if()' syntax instead of '? :' , but since this 
breaks it into sepperate statements gcc no longer complains about having 
both a signed and unsigned value in the same conditional expression.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:59:45 -07:00
Herbert Xu
e0f9f8586a [IPV4/IPV6]: Replace spin_lock_irq with spin_lock_bh
In light of my recent patch to net/ipv4/udp.c that replaced the
spin_lock_irq calls on the receive queue lock with spin_lock_bh,
here is a similar patch for all other occurences of spin_lock_irq
on receive/error queue locks in IPv4 and IPv6.

In these stacks, we know that they can only be entered from user
or softirq context.  Therefore it's safe to disable BH only.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:56:18 -07:00
Jamal Hadi Salim
9ed19f339e [NETLINK]: Set correct pid for ioctl originating netlink events
This patch ensures that netlink events created as a result of programns
using ioctls (such as ifconfig, route etc) contains the correct PID of
those events.
 
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:55:51 -07:00
Jamal Hadi Salim
b6544c0b4c [NETLINK]: Correctly set NLM_F_MULTI without checking the pid
This patch rectifies some rtnetlink message builders that derive the
flags from the pid. It is now explicit like the other cases
which get it right. Also fixes half a dozen dumpers which did not
set NLM_F_MULTI at all.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:54:12 -07:00
David S. Miller
e52c1f17e4 [NET]: Move sysctl_max_syn_backlog into request_sock.c
This fixes the CONFIG_INET=n build failure noticed
by Andrew Morton.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:49:40 -07:00
Arnaldo Carvalho de Melo
2ad69c55a2 [NET] rename struct tcp_listen_opt to struct listen_sock
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:48:55 -07:00
Arnaldo Carvalho de Melo
0e87506fcc [NET] Generalise tcp_listen_opt
This chunks out the accept_queue and tcp_listen_opt code and moves
them to net/core/request_sock.c and include/net/request_sock.h, to
make it useful for other transport protocols, DCCP being the first one
to use it.

Next patches will rename tcp_listen_opt to accept_sock and remove the
inline tcp functions that just call a reqsk_queue_ function.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:47:59 -07:00
Arnaldo Carvalho de Melo
60236fdd08 [NET] Rename open_request to request_sock
Ok, this one just renames some stuff to have a better namespace and to
dissassociate it from TCP:

struct open_request  -> struct request_sock
tcp_openreq_alloc    -> reqsk_alloc
tcp_openreq_free     -> reqsk_free
tcp_openreq_fastfree -> __reqsk_free

With this most of the infrastructure closely resembles a struct
sock methods subset.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:47:21 -07:00
Arnaldo Carvalho de Melo
2e6599cb89 [NET] Generalise TCP's struct open_request minisock infrastructure
Kept this first changeset minimal, without changing existing names to
ease peer review.

Basicaly tcp_openreq_alloc now receives the or_calltable, that in turn
has two new members:

->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
  a specific protocol

The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.

I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.

Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)

Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-18 22:46:52 -07:00
David S. Miller
bcfff0b471 [NETFILTER]: ipt_recent: last_pkts is an array of "unsigned long" not "u_int32_t"
This fixes various crashes on 64-bit when using this module.

Based upon a patch by Juergen Kreileder <jk@blackdown.de>.

Signed-off-by: David S. Miller <davem@davemloft.net>
ACKed-by: Patrick McHardy <kaber@trash.net>
2005-06-15 20:51:14 -07:00
Patrick McHardy
a96aca88ac [NETFILTER]: Advance seq-file position in exp_next_seq()
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-13 18:27:13 -07:00
J. Simonetti
1c2fb7f93c [IPV4]: Sysctl configurable icmp error source address.
This patch alows you to change the source address of icmp error
messages. It applies cleanly to 2.6.11.11 and retains the default
behaviour.

In the old (default) behaviour icmp error messages are sent with the ip
of the exiting interface.

The new behaviour (when the sysctl variable is toggled on), it will send
the message with the ip of the interface that received the packet that
caused the icmp error. This is the behaviour network administrators will
expect from a router. It makes debugging complicated network layouts
much easier. Also, all 'vendor routers' I know of have the later
behaviour.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-13 15:19:03 -07:00
Neil Horman
cdac4e0774 [SCTP] Add support for ip_nonlocal_bind sysctl & IP_FREEBIND socket option
Signed-off-by: Neil Horman <nhorman@redhat.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-13 15:12:33 -07:00
Randy Dunlap
6efd8455cf [IPV4]: Multipath modules need a license to prevent kernel tainting.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-13 14:29:06 -07:00
Andi Kleen
e7626486c3 [TCP]: Adjust TCP mem order check to new alloc_large_system_hash
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-13 14:24:52 -07:00
Adrian Bunk
64a6c7aa38 [IPVS]: remove net/ipv4/ipvs/ip_vs_proto_icmp.c
ip_vs_proto_icmp.c was never finished.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-02 13:02:25 -07:00
Edgar E Iglesias
36839836e8 [IPSEC]: Fix esp_decap_data size verification in esp4.
Signed-off-by: Edgar E Iglesias <edgar@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-31 17:08:05 -07:00
Herbert Xu
208d89843b [IPV4]: Fix BUG() in 2.6.x, udp_poll(), fragments + CONFIG_HIGHMEM
Steven Hand <Steven.Hand@cl.cam.ac.uk> wrote:
> 
> Reconstructed forward trace: 
> 
>   net/ipv4/udp.c:1334   spin_lock_irq() 
>   net/ipv4/udp.c:1336   udp_checksum_complete() 
> net/core/skbuff.c:1069   skb_shinfo(skb)->nr_frags > 1
> net/core/skbuff.c:1086   kunmap_skb_frag()
> net/core/skbuff.h:1087   local_bh_enable()
> kernel/softirq.c:0140   WARN_ON(irqs_disabled());

The receive queue lock is never taken in IRQs (and should never be) so
we can simply substitute bh for irq.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-30 15:50:15 -07:00
Harald Welte
9bb7bc942d [NETFILTER]: Fix deadlock with ip_queue and tcp local input path.
When we have ip_queue being used from LOCAL_IN, then we end up with a
situation where the verdicts coming back from userspace traverse the TCP
input path from syscall context.  While this seems to work most of the
time, there's an ugly deadlock:

syscall context is interrupted by the timer interrupt.  When the timer
interrupt leaves, the timer softirq get's scheduled and calls
tcp_delack_timer() and alike.  They themselves do bh_lock_sock(sk),
which is already held from somewhere else -> boom.

I've now tested the suggested solution by Patrick McHardy and Herbert Xu to
simply use local_bh_{en,dis}able().

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-30 15:35:26 -07:00
Pravin B. Shelar
37e20a66db [IPV4]: Kill MULTIPATHHOLDROUTE flag.
It cannot work properly, so just ignore it in drr
and rr multipath algorithms just like the random
multipath algorithm does.

Suggested by Herbert Xu.

Signed-off by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-29 20:26:44 -07:00
Harald Welte
8f937c6099 [IPV4]: Primary and secondary addresses
Add an option to make secondary IP addresses get promoted
when primary IP addresses are removed from the device.
It defaults to off to preserve existing behavior.

Signed-off-by: Harald Welte <laforge@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-29 20:23:46 -07:00
David S. Miller
314324121f [TCP]: Fix stretch ACK performance killer when doing ucopy.
When we are doing ucopy, we try to defer the ACK generation to
cleanup_rbuf().  This works most of the time very well, but if the
ucopy prequeue is large, this ACKing behavior kills performance.

With TSO, it is possible to fill the prequeue so large that by the
time the ACK is sent and gets back to the sender, most of the window
has emptied of data and performance suffers significantly.

This behavior does help in some cases, so we should think about
re-enabling this trick in the future, using some kind of limit in
order to avoid the bug case.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-23 12:03:06 -07:00
David S. Miller
8be58932ca [NETFILTER]: Do not be clever about SKB ownership in ip_ct_gather_frags().
Just do an skb_orphan() and be done with it.
Based upon discussions with Herbert Xu on netdev.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-19 12:36:33 -07:00
Julian Anastasov
d9fa0f392b [IP_VS]: Remove extra __ip_vs_conn_put() for incoming ICMP.
Remove extra __ip_vs_conn_put for incoming ICMP in direct routing
mode. Mark de Vries reports that IPVS connections are not leaked anymore.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-19 12:29:59 -07:00
Herbert Xu
2fdba6b085 [IPV4/IPV6] Ensure all frag_list members have NULL sk
Having frag_list members which holds wmem of an sk leads to nightmares
with partially cloned frag skb's.  The reason is that once you unleash
a skb with a frag_list that has individual sk ownerships into the stack
you can never undo those ownerships safely as they may have been cloned
by things like netfilter.  Since we have to undo them in order to make
skb_linearize happy this approach leads to a dead-end.

So let's go the other way and make this an invariant:

	For any skb on a frag_list, skb->sk must be NULL.

That is, the socket ownership always belongs to the head skb.
It turns out that the implementation is actually pretty simple.

The above invariant is actually violated in the following patch
for a short duration inside ip_fragment.  This is OK because the
offending frag_list member is either destroyed at the end of the
slow path without being sent anywhere, or it is detached from
the frag_list before being sent.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-18 22:52:33 -07:00
Jesper Juhl
02c30a84e6 [PATCH] update Ross Biro bouncing email address
Ross moved.  Remove the bad email address so people will find the correct
one in ./CREDITS.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05 16:36:49 -07:00
Patrick McHardy
60d5306553 [IPV4]: multipath_wrandom.c GPF fixes
multipath_wrandom needs to use GFP_ATOMIC.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-05 14:30:15 -07:00
Herbert Xu
aabc9761b6 [IPSEC]: Store idev entries
I found a bug that stopped IPsec/IPv6 from working.  About
a month ago IPv6 started using rt6i_idev->dev on the cached socket dst
entries.  If the cached socket dst entry is IPsec, then rt6i_idev will
be NULL.

Since we want to look at the rt6i_idev of the original route in this
case, the easiest fix is to store rt6i_idev in the IPsec dst entry just
as we do for a number of other IPv6 route attributes.  Unfortunately
this means that we need some new code to handle the references to
rt6i_idev.  That's why this patch is bigger than it would otherwise be.

I've also done the same thing for IPv4 since it is conceivable that
once these idev attributes start getting used for accounting, we
probably need to dereference them for IPv4 IPsec entries too.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03 16:27:10 -07:00
Patrick McHardy
bd96535b81 [NETFILTER]: Drop conntrack reference in ip_dev_loopback_xmit()
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03 16:21:37 -07:00
Herbert Xu
2a0a6ebee1 [NETLINK]: Synchronous message processing.
Let's recap the problem.  The current asynchronous netlink kernel
message processing is vulnerable to these attacks:

1) Hit and run: Attacker sends one or more messages and then exits
before they're processed.  This may confuse/disable the next netlink
user that gets the netlink address of the attacker since it may
receive the responses to the attacker's messages.

Proposed solutions:

a) Synchronous processing.
b) Stream mode socket.
c) Restrict/prohibit binding.

2) Starvation: Because various netlink rcv functions were written
to not return until all messages have been processed on a socket,
it is possible for these functions to execute for an arbitrarily
long period of time.  If this is successfully exploited it could
also be used to hold rtnl forever.

Proposed solutions:

a) Synchronous processing.
b) Stream mode socket.

Firstly let's cross off solution c).  It only solves the first
problem and it has user-visible impacts.  In particular, it'll
break user space applications that expect to bind or communicate
with specific netlink addresses (pid's).

So we're left with a choice of synchronous processing versus
SOCK_STREAM for netlink.

For the moment I'm sticking with the synchronous approach as
suggested by Alexey since it's simpler and I'd rather spend
my time working on other things.

However, it does have a number of deficiencies compared to the
stream mode solution:

1) User-space to user-space netlink communication is still vulnerable.

2) Inefficient use of resources.  This is especially true for rtnetlink
since the lock is shared with other users such as networking drivers.
The latter could hold the rtnl while communicating with hardware which
causes the rtnetlink user to wait when it could be doing other things.

3) It is still possible to DoS all netlink users by flooding the kernel
netlink receive queue.  The attacker simply fills the receive socket
with a single netlink message that fills up the entire queue.  The
attacker then continues to call sendmsg with the same message in a loop.

Point 3) can be countered by retransmissions in user-space code, however
it is pretty messy.

In light of these problems (in particular, point 3), we should implement
stream mode netlink at some point.  In the mean time, here is a patch
that implements synchronous processing.  

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03 14:55:09 -07:00
Folkert van Heusden
0b2531bdc5 [TCP]: Optimize check in port-allocation code.
Signed-off-by: Folkert van Heusden <folkert@vanheusden.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03 14:36:08 -07:00
Thomas Graf
db46edc6d3 [RTNETLINK] Cleanup rtnetlink_link tables
Converts remaining rtnetlink_link tables to use c99 designated
initializers to make greping a little bit easier.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03 14:29:39 -07:00
Patrick McHardy
31da185d81 [NETFILTER]: Don't checksum CHECKSUM_UNNECESSARY skbs in TCP connection tracking
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03 14:23:50 -07:00
Patrick McHardy
b433095784 [NETFILTER]: Missing owner-field initialization in iptable_raw
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03 14:23:13 -07:00
Olaf Rempel
5bec0039f4 [NET]: /proc/net/stat/* header cleanup
Signed-off-by: Olaf Rempel <razzor@kopf-tisch.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-28 12:16:08 -07:00
Dave Jones
7e3e0360b7 [IPV4]: Incorrect permissions on route flush sysctl
This has been brought up before.. http://lkml.org/lkml/2000/1/21/116
but didnt seem to get resolved.  This morning I got someone
file a bugzilla about it breaking sysctl(8).

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-28 12:11:03 -07:00
Al Viro
5523662c4c [NET]: kill gratitious includes of major.h
A lot of places in there are including major.h for no reason
whatsoever.  Removed.  And yes, it still builds.

	The history of that stuff is often amusing.  E.g. for net/core/sock.c
the story looks so, as far as I've been able to reconstruct it: we used to
need major.h in net/socket.c circa 1.1.early.  In 1.1.13 that need had
disappeared, along with register_chrdev(SOCKET_MAJOR, "socket", &net_fops)
in sock_init().  Include had not.  When 1.2 -> 1.3 reorg of net/* had moved
a lot of stuff from net/socket.c to net/core/sock.c, this crap had followed...

Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-25 21:40:39 -07:00
James Morris
088dd3a45f [TCP]: Trivial tcp_data_queue() cleanup
This patch removes a superfluous intialization from tcp_data_queue().

Signed-off-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-25 21:39:29 -07:00
Patrick McHardy
b31e5b1bb5 [NETFILTER]: Drop conntrack reference when packet leaves IP
In the event a raw socket is created for sending purposes only, the creator
never bothers to check the socket's receive queue.  But we continue to
add skbs to its queue until it fills up.

Unfortunately, if ip_conntrack is loaded on the box, each skb we add to the
queue potentially holds a reference to a conntrack.  If the user attempts
to unload ip_conntrack, we will spin around forever since the queued skbs
are pinned.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-25 12:01:07 -07:00
Yasuyuki KOZAKAI
f649a3bfd1 [NETFILTER]: Fix truncated sequence numbers in FTP helper
Signed-off-by: Yasuyuki KOZAKAI <yasuyuki.kozkaai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-25 12:00:04 -07:00
David S. Miller
d5ac99a648 [TCP]: skb pcount with MTU discovery
The problem is that when doing MTU discovery, the too-large segments in
the write queue will be calculated as having a pcount of >1.  When
tcp_write_xmit() is trying to send, tcp_snd_test() fails the cwnd test
when pcount > cwnd.

The segments are eventually transmitted one at a time by keepalive, but
this can take a long time.

This patch checks if TSO is enabled when setting pcount.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-24 19:12:33 -07:00
Patrick McHardy
3b2d59d1fc [NETFILTER]: Ignore PSH on SYN/ACK in TCP connection tracking
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-24 18:42:39 -07:00
Patrick McHardy
e281e3ac2b [NETFILTER]: Fix NAT sequence number adjustment
The NAT changes in 2.6.11 changed the position where helpers
are called and perform packet mangling. Before 2.6.11, a NAT
helper was called before the packet was NATed and had its
sequence number adjusted. Since 2.6.11, the helpers get packets
with already adjusted sequence numbers.

This breaks sequence number adjustment, adjust_tcp_sequence()
needs the original sequence number to determine whether
a packet was a retransmission and to store it for further
corrections. It can't be reconstructed without more information
than available, so this patch restores the old order by
calling helpers from a new conntrack hook two priorities
below ip_conntrack_confirm() and adjusting the sequence number
from a new NAT hook one priority below ip_conntrack_confirm().

Tracked down by Phil Oester <kernel@linuxace.com>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-24 18:41:38 -07:00
Herbert Xu
4d78b6c78a [IPSEC]: COW skb header in UDP decap
The following patch just makes the header part of the skb writeable.
This is needed since we modify the IP headers just a few lines below.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-19 22:48:59 -07:00
Stephen Hemminger
9c2b3328f7 [NET]: skbuff: remove old NET_CALLER macro
Here is a revised alternative that uses BUG_ON/WARN_ON
(as suggested by Herbert Xu) to eliminate NET_CALLER.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-19 22:39:42 -07:00
Linus Torvalds
1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00