An IPoIB subnet on an IB fabric that spans multiple IB subnets can't
use link-local scope in multicast GIDs. The existing routines that
map IP/IPv6 multicast addresses into IB link-level addresses hard-code
the scope to link-local, and they also leave the partition key field
uninitialised. This patch adds a parameter (the link-level broadcast
address) to the mapping routines, allowing them to initialise both the
scope and the P_Key appropriately, and fixes up the call sites.
The next step will be to add a way to configure the scope for an IPoIB
interface.
Signed-off-by: Rolf Manderscheid <rvm@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
As it is ip_append_data only counts page fragments to the skb that
allocated it. As such it means that the first skb gets hit with a
4K charge even though it might have only used a fraction of it while
all subsequent skb's that use the same page gets away with no charge
at all.
This bug was exposed by the UDP accounting patch.
[ The wmem_alloc bumping needs to be moved with the truesize,
noticed by Takahiro Yasui. -DaveM ]
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The snmp6 entry name was changed, and it broke compatibility
to RFC 2011.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
icmpv6_send() calls ip6_push_pending_frames() indirectly.
Both ip6_push_pending_frames() and icmpv6_send() increment
counter ICMP6_MIB_OUTMSGS.
This patch remove the increment from icmpv6_send.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We omit (or delay) sending NSes for known-to-unreachable routers (in
NUD_FAILED state) according to RFC 4191 (Default Router Preferences
and More-Specific Routes). But this is not fully compatible with RFC
4861 (Neighbor Discovery Protocol for IPv6), which does not remember
unreachability of neighbors.
So, let's avoid mixing sending algorithm of RFC 4191 and that of RFC
4861, and make the algorithm more friendly with RFC 4861 if RFC 4191
is disabled.
Issue was found by IPv6 Ready Logo Core Self_Test 1.5.0b2 (by TAHI
Project), and has been tracked down by Mitsuru Chinen
<mitch@linux.vnet.ibm.com>.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When looking for a conflicting connection the !sk->sk_bound_dev_if
check is performed only for live sockets, but not for timewait-ed.
This is not the case for ipv4, for __inet6_lookup_established in
both ipv4 and ipv6 and for other places that check for tw-s.
Was this missed accidentally? If so, then this patch fixes it and
besides makes use if the dif variable declared in the function.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC2464 says that the next to lowerst order bit of the first octet
of the Interface Identifier is formed by complementing
the Universal/Local bit of the EUI-64. But ip6t_eui64 uses OR not XOR.
Thanks Peter Ivancik for reporing this bug and posting a patch
for it.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_NETFILTER if not selected when compile the kernel source code,
ipv6_getsockopt will returen an EINVAL error if optname is not supported by
the kernel. But if CONFIG_NETFILTER is selected, ENOPROTOOPT error will
be return.
This patch fix to always return ENOPROTOOPT error if optname argument of
ipv6_getsockopt is not supported by the kernel.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC4303 introduces dummy packets with a nexthdr value of 59
to implement traffic confidentiality. Such packets need to
be dropped silently and the payload may not be attempted to
be parsed as it consists of random chunk.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
RTCF_xxx flags, defined in include/linux/in_route.h) are available for
IPv4 route (rtable) entries only. Use RTF_xxx flags instead, defined
in include/linux/ipv6_route.h, for IPv6 route entries (rt6_info).
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 stack doesn't increment OutNoRoutes counter when IP datagrams
is being discarded because no route could be found to transmit them
to their destination. IPv6 stack should increment the counter.
Incidentally, IPv4 stack increments that counter in such situation.
Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avaid provided test application, so bug got fixed.
IPv6 addrconf removes ipv6 inner device from netdev each time cmu
changes and new value is less than IPV6_MIN_MTU (1280 bytes).
When mtu is changed and new value is greater than IPV6_MIN_MTU,
it does not add ipv6 addresses and inner device bac.
This patch fixes that.
Tested with Avaid's application, which works ok now.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Due to the bug, refcnt for md5sig pool was leaked when
an user try to delete a key if we have more than one key.
In addition to the leakage, we returned incorrect return
result value for userspace.
This fix should close Bug #9418, reported by <ming-baini@163.com>.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Userland neighbor discovery options are typically heavily involved with
the interface on which thay are received: add a missing ifindex field to
the original struct. Thanks to Rmi Denis-Courmont.
Signed-off-by: Pierre Ynard <linkfanel@yahoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a small memory leak. Default fib rules can be deleted by
the user if the rule does not carry FIB_RULE_PERMANENT flag, f.e. by
ip rule flush
Such a rule will not be freed as the ref-counter has 2 on start and becomes
clearly unreachable after removal.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are many places that get the dst entry, increase the
__use counter and set the "lastuse" time stamp.
Make a helper for this.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As done two years ago on IP route cache table (commit
22c047ccbc) , we can avoid using one
lock per hash bucket for the huge TCP/DCCP hash tables.
On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
litle performance differences. (we hit a different cache line for the
rwlock, but then the bucket cache line have a better sharing factor
among cpus, since we dirty it less often). For netstat or ss commands
that want a full scan of hash table, we perform fewer memory accesses.
Using a 'small' table of hashed rwlocks should be more than enough to
provide correct SMP concurrency between different buckets, without
using too much memory. Sizing of this table depends on
num_possible_cpus() and various CONFIG settings.
This patch provides some locking abstraction that may ease a future
work using a different model for TCP/DCCP table.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function crypto_alloc_comp returns an errno instead of NULL
to indicate error. So it needs to be tested with IS_ERR.
This is based on a patch by Vicen Beltran Querol.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes last proc_net_create() user. Kudos to Benjamin Thery and
Stephen Hemminger for comments on previous version.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial patch to make "tcpv6,udpv6,udplitev6,rawv6" protocols uses the
fast "inuse sockets" infrastructure
Each protocol use then a static percpu var, instead of a dynamic one.
This saves some ram and some cpu cycles
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"struct proto" currently uses an array stats[NR_CPUS] to track change on
'inuse' sockets per protocol.
If NR_CPUS is big, this means we use a big memory area for this.
Moreover, all this memory area is located on a single node on NUMA
machines, increasing memory pressure on the boot node.
In this patch, I tried to :
- Keep a fast !CONFIG_SMP implementation
- Keep a fast CONFIG_SMP implementation for often used protocols
(tcp,udp,raw,...)
- Introduce a NUMA efficient implementation
Some helper macros are defined in include/net/sock.h
These macros take into account CONFIG_SMP
If a "struct proto" is declared without using DEFINE_PROTO_INUSE /
REF_PROTO_INUSE
macros, it will automatically use a default implementation, using a
dynamically allocated percpu zone.
This default implementation will be NUMA efficient, but might use 32/64
bytes per possible cpu
because of current alloc_percpu() implementation.
However it still should be better than previous implementation based on
stats[NR_CPUS] field.
When a "struct proto" is changed to use the new macros, we use a single
static "int" percpu variable,
lowering the memory and cpu costs, still preserving NUMA efficiency.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the checksum verification is postponed till user calls recv or poll,
the inrementation of Udp6InErrors counter should be also postponed.
Currently, it is postponed in non-blocking operation case. However it
should be postponed in all case like the IPv4 code.
Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip6_push_pending_frames and ip6_flush_pending_frames do the
same things to flush the sock's cork. Move this into a separate
function and save ~100 bytes from the .text
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sort matches and targets in the NF makefiles.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
I plan to kill ->get_info which means killing proc_net_create().
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
sg_mark_end() overwrites the page_link information, but all users want
__sg_mark_end() behaviour where we just set the end bit. That is the most
natural way to use the sg list, since you'll fill it in and then mark the
end point.
So change sg_mark_end() to only set the termination bit. Add a sg_magic
debug check as well, and clear a chain pointer if it is set.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Not architecture specific code should not #include <asm/scatterlist.h>.
This patch therefore either replaces them with
#include <linux/scatterlist.h> or simply removes them if they were
unused.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Finally, the zero_it argument can be completely removed from
the callers and from the function prototype.
Besides, fix the checkpatch.pl warnings about using the
assignments inside if-s.
This patch is rather big, and it is a part of the previous one.
I splitted it wishing to make the patches more readable. Hope
this particular split helped.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes scatterlist corruptions added by
commit 68e3f5dd4d
[CRYPTO] users: Fix up scatterlist conversion errors
The issue is that the code calls sg_mark_end() which clobbers the
sg_page() pointer of the final scatterlist entry.
The first part fo the fix makes skb_to_sgvec() do __sg_mark_end().
After considering all skb_to_sgvec() call sites the most correct
solution is to call __sg_mark_end() in skb_to_sgvec() since that is
what all of the callers would end up doing anyways.
I suspect this might have fixed some problems in virtio_net which is
the sole non-crypto user of skb_to_sgvec().
Other similar sg_mark_end() cases were converted over to
__sg_mark_end() as well.
Arguably sg_mark_end() is a poorly named function because it doesn't
just "mark", it clears out the page pointer as a side effect, which is
what led to these bugs in the first place.
The one remaining plain sg_mark_end() call is in scsi_alloc_sgtable()
and arguably it could be converted to __sg_mark_end() if only so that
we can delete this confusing interface from linux/scatterlist.h
Signed-off-by: David S. Miller <davem@davemloft.net>
The file /proc/net/if_inet6 is removed twice.
First time in:
inet6_exit
->addrconf_cleanup
And followed a few lines after by:
inet6_exit
-> if6_proc_exit
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
while reviewing the tcp_md5-related code further i came across with
another two of these casts which you probably have missed. I don't
actually think that they impose a problem by now, but as you said we
should remove them.
Signed-off-by: Matthias M. Dellweg <2500@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This bug was introduced by the commit
d12af679bc (sysctl: fix neighbour table
sysctls).
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the errors made in the users of the crypto layer during
the sg_init_table conversion. It also adds a few conversions that were
missing altogether.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the following compile errors in some configurations:
<-- snip -->
...
CC net/ipv4/esp4.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/net/ipv4/esp4.c: In function 'esp_output':
/home/bunk/linux/kernel-2.6/git/linux-2.6/net/ipv4/esp4.c:113: error: implicit declaration of function 'sg_init_table'
make[3]: *** [net/ipv4/esp4.o] Error 1
...
/home/bunk/linux/kernel-2.6/git/linux-2.6/net/ipv6/esp6.c: In function 'esp6_output':
/home/bunk/linux/kernel-2.6/git/linux-2.6/net/ipv6/esp6.c:112: error: implicit declaration of function 'sg_init_table'
make[3]: *** [net/ipv6/esp6.o] Error 1
<-- snip -->
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv6/tcp_ipv6.c: In function 'tcp_v6_rcv':
net/ipv6/tcp_ipv6.c:1736: error: implicit declaration of function
'get_softnet_dma'
net/ipv6/tcp_ipv6.c:1736: warning: assignment makes pointer from integer
without a cast
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some places, the result of skb_headroom() is compared to an unsigned
integer, and in others, the result is compared to a signed integer. Make
the comparisons consistent and correct.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a justifying patch for Stephen's patches. Stephen's patches
disallows using a port range of one single port and brakes the meaning
of the 'remaining' variable, in some places it has different meaning.
My patch gives back the sense of 'remaining' variable. It should mean
how many ports are remaining and nothing else. Also my patch allows
using a single port.
I sure we must be able to use mentioned port range, this does not
restricted by documentation and does not brake current behavior.
usefull links:
Patches posted by Stephen Hemminger
http://marc.info/?l=linux-netdev&m=119206106218187&w=2http://marc.info/?l=linux-netdev&m=119206109918235&w=2
Andrew Morton's comment
http://marc.info/?l=linux-kernel&m=119248225007737&w=2
1. Allows using a port range of one single port.
2. Gives back sense of 'remaining' variable.
Signed-off-by: Anton Arapov <aarapov@redhat.com>
Acked-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (51 commits)
[IPV6]: Fix again the fl6_sock_lookup() fixed locking
[NETFILTER]: nf_conntrack_tcp: fix connection reopening fix
[IPV6]: Fix race in ipv6_flowlabel_opt() when inserting two labels
[IPV6]: Lost locking in fl6_sock_lookup
[IPV6]: Lost locking when inserting a flowlabel in ipv6_fl_list
[NETFILTER]: xt_sctp: fix mistake to pass a pointer where array is required
[NET]: Fix OOPS due to missing check in dev_parse_header().
[TCP]: Remove lost_retrans zero seqno special cases
[NET]: fix carrier-on bug?
[NET]: Fix uninitialised variable in ip_frag_reasm()
[IPSEC]: Rename mode to outer_mode and add inner_mode
[IPSEC]: Disallow combinations of RO and AH/ESP/IPCOMP
[IPSEC]: Use the top IPv4 route's peer instead of the bottom
[IPSEC]: Store afinfo pointer in xfrm_mode
[IPSEC]: Add missing BEET checks
[IPSEC]: Move type and mode map into xfrm_state.c
[IPSEC]: Fix length check in xfrm_parse_spi
[IPSEC]: Move ip_summed zapping out of xfrm6_rcv_spi
[IPSEC]: Get nexthdr from caller in xfrm6_rcv_spi
[IPSEC]: Move tunnel parsing for IPv4 out of xfrm4_input
...
No one has bothered to set strategy routine for the the netfilter sysctls that
return jiffies to be sysctl_jiffies.
So it appears the sys_sysctl path is unused and untested, so this patch
removes the binary sysctl numbers.
Which fixes the netfilter oops in 2.6.23-rc2-mm2 for me.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't preoperly support the sysctl binary path for flushing the ipv6
routes. So remove support for a binary path.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- In ipv6 ndisc_ifinfo_syctl_change so it doesn't depend on binary
sysctl names for a function that works with proc.
- In neighbour.c reorder the table to put the possibly unused entries
at the end so we can remove them by terminating the table early.
- In neighbour.c kill the entries with questionable binary sysctl
handling behavior.
- In neighbour.c if we don't have a strategy routine remove the
binary path. So we don't the default sysctl strategy routine
on data that is not ready for it.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>