RDS 3.0 connections (in OFED 1.3 and earlier) put the
header at the end. 3.1 connections put it at the head.
The code has significant added complexity in order to
handle both configurations. In OFED 1.6 we can
drop this and simplify the code by only supporting
"header-first" configuration.
This patch checks the protocol version, and if prior
to 3.1, does not complete the connection.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
both atomics and rdmas need to convert ib-specific completion codes
into RDS status codes. Rename rds_ib_rdma_send_complete to
rds_ib_send_complete, and have it take a pointer to the function to
call with the new error code.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Instead of using a constant for initiator_depth and
responder_resources, read the per-QP values when the
device is enumerated, and then use these values when creating
the connection.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Implement a CMSG-based interface to do FADD and CSWP ops.
Alter send routines to handle atomic ops.
Add atomic counters to stats.
Add xmit_atomic() to struct rds_transport
Inline rds_ib_send_unmap_rdma into unmap_rm
Signed-off-by: Andy Grover <andy.grover@oracle.com>
The previous code was correct, but made the assumption that
if r_notifier was non-NULL then either r_recverr or r_notify
was true. Valid, but fragile. Changed to explicitly check
r_recverr (shows up in greps for recverr now, too.)
Signed-off-by: Andy Grover <andy.grover@oracle.com>
rds_message_alloc_sgs() now returns correctly-initialized
sg lists, so calleds need not do this themselves.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
This eliminates a separate memory alloc, although
it is now necessary to add an "r_active" flag, since
it is no longer to use the m_rdma_op pointer as an
indicator of if an rdma op is present.
rdma SGs allocated from rm sg pool.
rds_rm_size also gets bigger. It's a little inefficient to
run through CMSGs twice, but it makes later steps a lot smoother.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
r_m_copy_from_user used to allocate the rm as well as kernel
buffers for the data, and then copy the data in. Now, sendmsg()
allocates the rm, although the data buffer alloc still happens
in r_m_copy_from_user.
SGs are still allocated with rm, but now r_m_alloc_sgs() is
used to reserve them. This allows multiple SG lists to be
allocated from the one rm -- this is important once we also
want to alloc our rdma sgl from this pool.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
First, it looks to me like the atomic_inc is wrong.
We should be decrementing refcount only once here, no? It's
already being done by the mr_put() at the end.
Second, simplify the logic a bit by bailing early (with a warning)
if !mr.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Clearly separate rdma-related variables in rm from data-related ones.
This is in anticipation of adding atomic support.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
This function has been the source of numerous bugs; it's just
too complicated. Simplified to nest spinlocks cleanly within
the second loop body, and kick out early if there are no
rms to drop.
This will be a little slower because conn lock is grabbed for
each entry instead of "caching" the lock across rms, but this
should be entirely irrelevant to fastpath performance.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
On second look at this bug (OFED #2002), it seems that the
collision is not with the retransmission queue (packet acked
by the peer), but with the local send completion. A theoretical
sequence of events (from time t0 to t3) is thought to be as
follows,
Thread #1
t0:
sock_release
rds_release
rds_send_drop_to /* wait on send completion */
t2:
rds_rdma_drop_keys() /* destroy & free all mrs */
Thread #2
t1:
rds_ib_send_cq_comp_handler
rds_ib_send_unmap_rm
rds_message_unmapped /* wake up #1 @ t0 */
t3:
rds_message_put
rds_message_purge
rds_mr_put /* memory corruption detected */
The problem with the rds_rdma_drop_keys() is it could
remove a mr's refcount more than its due (i.e. repeatedly
as long as it still remains in the tree (mr->r_refcount > 0)).
Theoretically it should remove only one reference - reference
by the tree.
/* Release any MRs associated with this socket */
while ((node = rb_first(&rs->rs_rdma_keys))) {
mr = container_of(node, struct rds_mr, r_rb_node);
if (mr->r_trans == rs->rs_transport)
mr->r_invalidate = 0;
rds_mr_put(mr);
}
I think the correct way of doing it is to remove the mr from
the tree and rds_destroy_mr it first, then a rds_mr_put()
to decrement its reference count by one. Whichever thread
holds the last reference will free the mr via rds_mr_put().
Signed-off-by: Tina Yang <tina.yang@oracle.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
in_interrupt() is true in softirqs. The BUG_ONs are supposed
to check for if irqs are disabled, so we should use
BUG_ON(irqs_disabled()) instead, duh.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Blackhole routes are used when xfrm_lookup() returns -EREMOTE (error
triggered by IKE for example), hence this kind of route is always
temporary and so we should check if a better route exists for next
packets.
Bug has been introduced by commit d11a4dc18b.
Signed-off-by: Jianzhao Wang <jianzhao.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Casts __kernel to __user pointer require __force markup, so add it. Also
sock_get/setsockopt() takes @optval and/or @optlen arguments as user pointers
but were taking kernel pointers, use new variables 'uoptval' and/or 'uoptlen'
to fix it. These remove following warnings from sparse:
net/socket.c:1922:46: warning: cast adds address space to expression (<asn:1>)
net/socket.c:3061:61: warning: incorrect type in argument 4 (different address spaces)
net/socket.c:3061:61: expected char [noderef] <asn:1>*optval
net/socket.c:3061:61: got char *optval
net/socket.c:3061:69: warning: incorrect type in argument 5 (different address spaces)
net/socket.c:3061:69: expected int [noderef] <asn:1>*optlen
net/socket.c:3061:69: got int *optlen
net/socket.c:3063:67: warning: incorrect type in argument 4 (different address spaces)
net/socket.c:3063:67: expected char [noderef] <asn:1>*optval
net/socket.c:3063:67: got char *optval
net/socket.c:3064:45: warning: incorrect type in argument 5 (different address spaces)
net/socket.c:3064:45: expected int [noderef] <asn:1>*optlen
net/socket.c:3064:45: got int *optlen
net/socket.c:3078:61: warning: incorrect type in argument 4 (different address spaces)
net/socket.c:3078:61: expected char [noderef] <asn:1>*optval
net/socket.c:3078:61: got char *optval
net/socket.c:3080:67: warning: incorrect type in argument 4 (different address spaces)
net/socket.c:3080:67: expected char [noderef] <asn:1>*optval
net/socket.c:3080:67: got char *optval
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When there is only one rps_cpus, skb_get_rxhash() can be eliminated.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This simple patch copies the current approach for SIOCINQ ioctl() from DCCP
into SCTP so that the userland code working with SCTP can use a similar
interface across different protocols to know how much space to allocate for
a buffer.
Signed-off-by: Diego Elio Pettenò <flameeyes@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Do not create expectation when forwarding the PORT
command to avoid blocking the connection. The problem is that
nf_conntrack_ftp.c:help() tries to create the same expectation later in
POST_ROUTING and drops the packet with "dropping packet" message after
failure in nf_ct_expect_related.
- Change ip_vs_update_conntrack to alter the conntrack
for related connections from real server. If we do not alter the reply in
this direction the next packet from client sent to vport 20 comes as NEW
connection. We alter it but may be some collision happens for both
conntracks and the second conntrack gets destroyed immediately. The
connection stucks too.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch: "gro: fix different skb headrooms" in its part:
"2) allocate a minimal skb for head of frag_list" is buggy. The copied
skb has p->data set at the ip header at the moment, and skb_gro_offset
is the length of ip + tcp headers. So, after the change the length of
mac header is skipped. Later skb_set_mac_header() sets it into the
NET_SKB_PAD area (if it's long enough) and ip header is misaligned at
NET_SKB_PAD + NET_IP_ALIGN offset. There is no reason to assume the
original skb was wrongly allocated, so let's copy it as it was.
bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626
fixes commit: 3d3be4333f
Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (26 commits)
pkt_sched: Fix lockdep warning on est_tree_lock in gen_estimator
ipvs: avoid oops for passive FTP
Revert "sky2: don't do GRO on second port"
gro: fix different skb headrooms
bridge: Clear INET control block of SKBs passed into ip_fragment().
3c59x: Remove incorrect locking; correct documented lock hierarchy
sky2: don't do GRO on second port
ipv4: minor fix about RPF in help of Kconfig
xfrm_user: avoid a warning with some compiler
net/sched/sch_hfsc.c: initialize parent's cl_cfmin properly in init_vf()
pxa168_eth: fix a mdiobus leak
net sched: fix kernel leak in act_police
vhost: stop worker only if created
MAINTAINERS: Add ehea driver as Supported
ath9k_hw: fix parsing of HT40 5 GHz CTLs
ath9k_hw: Fix EEPROM uncompress block reading on AR9003
wireless: register wiphy rfkill w/o holding cfg80211_mutex
netlink: Make NETLINK_USERSOCK work again.
irda: Correctly clean up self->ias_obj on irda_bind() failure.
wireless extensions: fix kernel heap content leak
...
Actually iterate over the next-hops to make sure we have
a device match. Otherwise RP filtering is always elided
when the route matched has multiple next-hops.
Reported-by: Igor M Podlesny <for.poige@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We assumed that unix_autobind() never fails if kzalloc() succeeded.
But unix_autobind() allows only 1048576 names. If /proc/sys/fs/file-max is
larger than 1048576 (e.g. systems with more than 10GB of RAM), a local user can
consume all names using fork()/socket()/bind().
If all names are in use, those who call bind() with addr_len == sizeof(short)
or connect()/sendmsg() with setsockopt(SO_PASSCRED) will continue
while (1)
yield();
loop at unix_autobind() till a name becomes available.
This patch adds a loop counter in order to give up after 1048576 attempts.
Calling yield() for once per 256 attempts may not be sufficient when many names
are already in use, for __unix_find_socket_byname() can take long time under
such circumstance. Therefore, this patch also adds cond_resched() call.
Note that currently a local user can consume 2GB of kernel memory if the user
is allowed to create and autobind 1048576 UNIX domain sockets. We should
consider adding some restriction for autobind operation.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an off by one. We would go past the end when we NUL terminate
the "value" string at end of the function. The "value" buffer is
allocated in irlan_client_parse_response() or
irlan_provider_parse_command().
CC: stable@kernel.org
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC5722 prohibits reassembling IPv6 fragments when some data overlaps.
Bug spotted by Zhang Zuotao <zuotao.zhang@6wind.com>.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC5722 prohibits reassembling fragments when some data overlaps.
Bug spotted by Zhang Zuotao <zuotao.zhang@6wind.com>.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a net device is implementing the select_queue callback and is part of
a bridge, frames coming from the bridge already have a tx queue associated
to the socket (introduced in commit a4ee3ce329,
"net: Use sk_tx_queue_mapping for connected sockets"). The call to
sk_tx_queue_get will then return the tx queue used by the bridge instead
of calling the select_queue callback.
In case of mac80211 this broke QoS which is implemented by using the
select_queue callback. Furthermore it introduced problems with rt2x00
because frames with the same TID and RA sometimes appeared on different
tx queues which the hw cannot handle correctly.
Fix this by always calling select_queue first if it is available and only
afterwards use the socket tx queue mapping.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to test twice sk->sk_shutdown & RCV_SHUTDOWN
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert pr_<level>("%s" ..., (struct netdev *)->name ...)
to netdev_<level>((struct netdev *), ...)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch standardizes caif message logging prefixes.
Add #define pr_fmt(fmt) KBUILD_MODNAME ":%s(): " fmt, __func__
Add missing "\n"s to some logging messages
Convert pr_warning to pr_warn
This changes the logging message prefix from CAIF: to caif:
for all uses but caif_socket.c and chnl_net.c. Those now use
their filename without extension.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function has an unsigned return type, but returns a negative constant
to indicate an error condition. The result of calling the function is
always stored in a variable of type (signed) int, and thus unsigned can be
dropped from the return type.
A sematic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@exists@
identifier f;
constant C;
@@
unsigned f(...)
{ <+...
* return -C;
...+> }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_expand_head() blindly takes references on fragments before calling
skb_release_data(), potentially releasing these references.
We can add a fast path, avoiding these atomic operations, if we own the
last reference on skb->head.
Based on a previous patch from David
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cause TIPC to return EAGAIN if it is unable to enable a new Ethernet
bearer because one or more recently disabled Ethernet bearers are
temporarily consuming resources during shut down. (The previous error
code, EDQUOT, is now returned only if all available Ethernet bearer
data structures are fully enabled at the time the request to enable an
additional bearer is received.)
Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add code to expand the headroom of an outgoing TIPC message if the
sk_buff has insufficient room to hold the header for the associated
Ethernet device. This change is necessary to ensure that messages
TIPC does not create itself (eg. incoming messages that are being
routed to another node) do not cause problems, since TIPC has no
control over the amount of headroom available in such messages.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Optimizes TIPC's name table translation code to avoid unnecessary
manipulation of the node address field of the resulting port id when
name translation fails. This change is possible because a valid port
id cannot have a reference field of zero, so examining the reference
only is sufficient to determine if the translation was successful.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__alloc_skb() uses a memset() to clear all the beginning of skb,
including bitfields contained in 'flags1' & 'flags2'.
We dont need any more to use kmemcheck_annotate_bitfield() on these
fields. However, we still need it for the clone part, which is not
cleared.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to accepting router advertisement, the IPv6 stack does not send router
solicitations if forwarding is enabled.
This patch enables this behavior to be overruled by setting forwarding to the
special value 2.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current IPv6 behavior is to not accept router advertisements while
forwarding, i.e. configured as router.
This does make sense, a router is typically not supposed to be auto
configured. However there are exceptions and we should allow the
current behavior to be overwritten.
Therefore this patch enables the user to overrule the "if forwarding
enabled then don't listen to RAs" rule by setting accept_ra to the
special value of 2.
An alternative would be to ignore the forwarding switch alltogether
and solely accept RAs based on the value of accept_ra. However, I
found that if not intended, accepting RAs as a router can lead to
strange unwanted behavior therefore we it seems wise to only do so
if the user explicitely asks for this behavior.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a lockdep warning:
[ 516.287584] =========================================================
[ 516.288386] [ INFO: possible irq lock inversion dependency detected ]
[ 516.288386] 2.6.35b #7
[ 516.288386] ---------------------------------------------------------
[ 516.288386] swapper/0 just changed the state of lock:
[ 516.288386] (&qdisc_tx_lock){+.-...}, at: [<c12eacda>] est_timer+0x62/0x1b4
[ 516.288386] but this lock took another, SOFTIRQ-unsafe lock in the past:
[ 516.288386] (est_tree_lock){+.+...}
[ 516.288386]
[ 516.288386] and interrupts could create inverse lock ordering between them.
...
So, est_tree_lock needs BH protection because it's taken by
qdisc_tx_lock, which is used both in BH and process contexts.
(Full warning with this patch at netdev, 02 Sep 2010.)
Fixes commit: ae638c47dc
("pkt_sched: gen_estimator: add a new lock")
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean the code up according to Documentation/CodingStyle.
Don't initialize the variable dont_send in arp_process().
Remove the temporary varialbe flags in arp_state_to_flags().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a small helper ptype_head() to get the head to manipulate
dev_add_pack() & __dev_remove_pack() can use a spinlock without
blocking BH, since softirq use RCU, and these functions are run from
process context only.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix Passive FTP problem in ip_vs_ftp:
- Do not oops in nf_nat_set_seq_adjust (adjust_tcp_sequence) when
iptable_nat module is not loaded
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use correctly the in_pkts packet counter also for SCTP
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packets entering GRO might have different headrooms, even for a given
flow (because of implementation details in drivers, like copybreak).
We cant force drivers to deliver packets with a fixed headroom.
1) fix skb_segment()
skb_segment() makes the false assumption headrooms of fragments are same
than the head. When CHECKSUM_PARTIAL is used, this can give csum_start
errors, and crash later in skb_copy_and_csum_dev()
2) allocate a minimal skb for head of frag_list
skb_gro_receive() uses netdev_alloc_skb(headroom + skb_gro_offset(p)) to
allocate a fresh skb. This adds NET_SKB_PAD to a padding already
provided by netdevice, depending on various things, like copybreak.
Use alloc_skb() to allocate an exact padding, to reduce cache line
needs:
NET_SKB_PAD + NET_IP_ALIGN
bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626
Many thanks to Plamen Petrov, testing many debugging patches !
With help of Jarek Poplawski.
Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a similar vain to commit 17762060c2
("bridge: Clear IPCB before possible entry into IP stack")
Any time we call into the IP stack we have to make sure the state
there is as expected by the ipv4 code.
With help from Eric Dumazet and Herbert Xu.
Reported-by: Bandan Das <bandan.das@stratus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thanks to Ilpo Jarvinen, this updates also the initial window
setting for tcp_output with regard to RFC 5681.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Attached is a small patch to remove a warning ("warning: ISO C90 forbids
mixed declarations and code" with gcc 4.3.2).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes init_vf() function, so on each new backlog period parent's
cl_cfmin is properly updated (including further propgation towards the root),
even if the activated leaf has no upperlimit curve defined.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reviewing commit 1c40be12f7, I
audited other users of tc_action_ops->dump for information leaks.
That commit covered almost all of them but act_police still had a leak.
opt.limit and opt.capab aren't zeroed out before the structure is
passed out.
This patch uses the C99 initializers to zero everything unused out.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise the hardware scan handler could access an invalid scan request
structure. The driver should cancel any pending hardware scans during
the suspend process anyway, so also add a warning if the hardware scan
is still pending when the device resumes.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
(skb->data - skb->head) can be changed by skb_headroom(skb)
Remove some uses of NET_SKBUFF_DATA_USES_OFFSET, using
(skb_end_pointer(skb) - skb->head) or
(skb_tail_pointer(skb) - skb->head) : compiler does the right thing,
and this is more readable for us ;)
(struct skb_shared_info *) casts in pskb_expand_head() to help memcpy()
to use aligned moves.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- napi_gro_flush() is exported from net/core/dev.c, to avoid
an irq_save/irq_restore in the packet receive path.
- use napi_gro_receive() instead of netif_receive_skb()
- use napi_gro_flush() before calling __napi_complete()
- turn on NETIF_F_GRO by default
- Tested on a Marvell 88E8001 Gigabit NIC
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tunnel4_handlers, tunnel64_handlers, tunnel6_handlers and
tunnel46_handlers are protected by RCU, but we dont use appropriate rcu
primitives to scan them. rcu_lock() is already held by caller.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp4_gro_receive() and tcp4_gro_complete() dont need to be exported.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove non used variable "queue" in pg_cleanup
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[patch net-next-2.6] vlan: Use vlan_dev_real_dev in vlan_hwaccel_do_receive
Use helper as in other places.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function exists to clean-up after a hardware error or something
similar. The restart is accomplished using the same infrastructure used
to resume after a suspend. The suspend path cancels running scans, so
it seems appropriate to do that here as well for software-based scans.
If a hardware-based scan is pending, issue a warning message since this
indicates that the drivers has failed to clean-up after itself.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The same expression is tested twice and the result is the same each time.
The sematic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@expression@
expression E;
@@
(
* E
|| ... || E
|
* E
&& ... && E
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The signal strength value in a single RX frame is not that reliable,
so it is better to delay start of CQM events until there is a real
average signal strength from more than a single Beacon frame
available.
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The ave_beacon_signal value uses 1/16 dB unit and as such, must be
initialized with the signal level of the first Beacon frame multiplied
by 16. This fixes an issue where the initial CQM events are reported
incorrectly with a burst of events while the running average
approaches the correct value after the incorrect initialization. This
could cause user space -based roaming decision process to get quite
confused at the moment when we would like to go through authentication
and DHCP.
Cc: stable@kernel.org
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Once we started enforcing the a nl_table[] entry exist for
a protocol, NETLINK_USERSOCK stopped working. Add a dummy
table entry so that it works again.
Reported-by: Thomas Voegtle <tv@lio96.de>
Tested-by: Thomas Voegtle <tv@lio96.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
If irda_open_tsap() fails, the irda_bind() code tries to destroy
the ->ias_obj object by hand, but does so wrongly.
In particular, it fails to a) release the hashbin attached to the
object and b) reset the self->ias_obj pointer to NULL.
Fix both problems by using irias_delete_object() and explicitly
setting self->ias_obj to NULL, just as irda_release() does.
Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tunnel6_handlers chain being scanned for each incoming packet,
make sure it doesnt share an often dirtied cache line.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tunnel4_handlers chain being scanned for each incoming packet,
make sure it doesnt share an often dirtied cache line.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes RTAX_RTO_MIN also available to CCID-3, replacing the compile-time
RTO lower bound with a per-route tunable value.
The original Kconfig option solved the problem that a very low RTT (in the
order of HZ) can trigger too frequent and unnecessary reductions of the
sending rate.
This tunable does not affect the initial RTO value of 2 seconds specified in
RFC 5348, section 4.2 and Appendix B. But like the hardcoded Kconfig value,
it allows to adapt to network conditions.
The same effect as the original Kconfig option of 100ms is now achieved by
> ip route replace to unicast 192.168.0.0/24 rto_min 100j dev eth0
(assuming HZ=1000).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a fixed RTO_MIN of 0.2 seconds was found to cause problems for CCID-2
over 802.11g: at least once per session there was a spurious timeout. It
helped to then increase the the value of RTO_MIN over this link.
Since the problem is the same as in TCP, this patch makes the solution from
commit "05bb1fad1cde025a864a90cfeb98dcbefe78a44a"
"[TCP]: Allow minimum RTO to be configurable via routing metrics."
available to DCCP.
This avoids reinventing the wheel, so that e.g. the following works in the
expected way now also for CCID-2:
> ip route change 10.0.0.2 rto_min 800 dev ath0
Luckily this useful rto_min function was recently moved to net/tcp.h,
which simplifies sharing code originating from TCP.
Documentation also updated (plus minor whitespace fixes).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch consolidates initial-window code common to TCP and CCID-2:
* TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and
* CCID-2 uses RFC 3390 in packet-oriented manner (RFC 4341).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the wrappers around the sk timer functions, since not much is
gained from using them: the BUG_ON in start_rto_timer will never trigger
since that function is called only if:
* the RTO timer expires (rto_expire, and then timer_pending() is false);
* in tx_packet_sent only if !timer_pending() (BUG_ON is redundant here);
* previously in new_ack, after stopping the timer (timer_pending() false).
Removing the wrappers also clears the way for eventually replacing the
RTO timer with the icsk-retransmission-timer, as it is already part of the
DCCP socket.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since CCID-2 is de facto a mini implementation of TCP, it makes sense to share
as much code as possible.
Hence this patch aligns CCID-2 timestamping with TCP timestamping.
This also halves the space consumption (on 64-bit systems).
The necessary include file <net/tcp.h> is already included by way of
net/dccp.h. Redundant includes have been removed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wireless extensions have an unfortunate, undocumented
requirement which requires drivers to always fill
iwp->length when returning a successful status. When
a driver doesn't do this, it leads to a kernel heap
content leak when userspace offers a larger buffer
than would have been necessary.
Arguably, this is a driver bug, as it should, if it
returns 0, fill iwp->length, even if it separately
indicated that the buffer contents was not valid.
However, we can also at least avoid the memory content
leak if the driver doesn't do this by setting the iwp
length to max_tokens, which then reflects how big the
buffer is that the driver may fill, regardless of how
big the userspace buffer is.
To illustrate the point, this patch also fixes a
corresponding cfg80211 bug (since this requirement
isn't documented nor was ever pointed out by anyone
during code review, I don't trust all drivers nor
all cfg80211 handlers to implement it correctly).
Cc: stable@kernel.org [all the way back]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch provides a "user timeout" support as described in RFC793. The
socket option is also needed for the the local half of RFC5482 "TCP User
Timeout Option".
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.
Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
upto 20 minutes with the current system defaults in a normal WAN
environment.
The socket option can be made during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
connection due to keepalive failure.
The option does not change in anyway when TCP retransmits a packet, nor
when a keepalive probe will be sent.
This option, like many others, will be inherited by an acceptor from its
listener.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new workqueue changes helped me find this bug
that's been lingering since the changes to the work
processing in mac80211 -- the work timer is never
deleted properly. Do that to avoid having it fire
after all data structures have been freed. It can't
be re-armed because all it will do, if running, is
schedule the work, but that gets flushed later and
won't have anything to do since all work items are
gone by now (by way of interface removal).
Cc: stable@kernel.org [2.6.34+]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Fixes this build error:
net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_nat_icmp_v6':
net/netfilter/ipvs/ip_vs_core.c:640: error: implicit declaration of function 'csum_ipv6_magic'
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
net/ipv4: Eliminate kstrdup memory leak
net/caif/cfrfml.c: use asm/unaligned.h
ax25: missplaced sock_put(sk)
qlge: reset the chip before freeing the buffers
l2tp: test for ethernet header in l2tp_eth_dev_recv()
tcp: select(writefds) don't hang up when a peer close connection
tcp: fix three tcp sysctls tuning
tcp: Combat per-cpu skew in orphan tests.
pxa168_eth: silence gcc warnings
pxa168_eth: update call to phy_mii_ioctl()
pxa168_eth: fix error handling in prope
pxa168_eth: remove unneeded null check
phylib: Fix race between returning phydev and calling adjust_link
caif-driver: add HAS_DMA dependency
3c59x: Fix deadlock between boomerang_interrupt and boomerang_start_tx
qlcnic: fix poll implementation
netxen: fix poll implementation
bridge: netfilter: fix a memory leak
Replace open-coded loop with for_each_set_bit().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The spinlock aun_queue_lock is initialized statically. It is unnecessary
to initialize by spin_lock_init() at module load time.
This is detected by the semantic patch.
// <smpl>
@def@
declarer name DEFINE_SPINLOCK;
identifier spinlock;
@@
DEFINE_SPINLOCK(spinlock);
@@
identifier def.spinlock;
@@
- spin_lock_init(&spinlock);
// </smpl>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The string clone is only used as a temporary copy of the argument val
within the while loop, and so it should be freed before leaving the
function. The call to strsep, however, modifies clone, so a pointer to the
front of the string is kept in saved_clone, to make it possible to free it.
The sematic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
local idexpression x;
expression E;
identifier l;
statement S;
@@
*x= \(kasprintf\|kstrdup\)(...);
...
if (x == NULL) S
... when != kfree(x)
when != E = x
if (...) {
<... when != kfree(x)
* goto l;
...>
* return ...;
}
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Somebody noticed this problem, and I outlined
to them how to fix it, but haven't heard back
from them. So while I was adding the state
field I figured I could use it to fix it.
The problem, as I understand it, is that when
we go offchannel while the driver has a queue
stopped, the driver will likely start draining
the queue and then enable it while offchannel.
This in turn will enable the interface queue,
and that leads to transmitting data frames on
the wrong channel.
Fix this by keeping track of offchannel status
per interface, and not enabling the interface
queues on interfaces that are offchannel when
the driver enables a queue.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Add support to mac80211 for changing the interface
type even when the interface is UP, if the driver
supports it.
To achieve this
* add a new driver callback for switching,
* split some of the interface up/down code out
into new functions (do_open/do_stop), and
* maintain an own __SDATA_RUNNING bit that will
not be set during interface type, so that any
other code doesn't use the interface.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Split the concurrent virtual interface checks
into a new function that can be used to check
for any given new interface type.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The libertas_tf special code for zero addresses
is a bit too complex, it compares against a stack
value instead of using is_zero_ether_addr() and
tries to update all interfaces even if just the
one that's being brought up needs to be changed.
Additionally, the repeated check for a valid MAC
address need only be done if we actually changed
it on the fly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Since the introduction of ieee80211_sdata_running(),
some new code was introduced that uses netif_running()
instead. Switch all these instances over.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There's a lot of redundant code in mac80211's
interface cleanup/down, for example freeing
AP beacons is done both when the interface is
set DOWN as well as when it is torn down, of
which only the former has any effect.
Also, a bunch of things should be closer to
where they matter, like the MLME timers that
we should cancel when disassociating, rather
than only when the interface is set DOWN.
Clean up all this code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There are subqueue helpers so that we don't
need to get the TX queue and then wake/stop
it, use those helpers.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Some vendor specified mechanisms for 802.1X-style
functionality use a different protocol than EAP
(even if EAP is vendor-extensible). Support this
in mac80211 via the cfg80211 API for it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Some vendor specified mechanisms for 802.1X-style
functionality use a different protocol than EAP
(even if EAP is vendor-extensible). Allow setting
the ethertype for the protocol when a driver has
support for this. The default if unspecified is
EAP, of course.
Note: This is suitable only for station mode, not
for AP implementation.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Allow drivers to specify their own set of cipher
suites to advertise vendor-specific ciphers. The
driver is then required to implement hardware
crypto offload for it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
cfg80211 currently rejects all cipher suites it
doesn't know about for key length checking
purposes. This can lead to inconsistencies when
a driver advertises an algorithm that cfg80211
doesn't know about. Remove this rejection so
drivers can specify any algorithm they like.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The ieee80211_scan_completed() function was a frequent
source of potential deadlocks, since it is called by
drivers but may call back into drivers, so drivers had
to make sure to call it without any locks held, which
frequently lead to more complex code in drivers. Avoid
that problem by allowing the function to be called in
any context, and queueing the actual work it does.
Also update the documentation for it to indicate this.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Since cfg80211 manages the BSS list completely,
this define hasn't been used for a long time
and will never be used again.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
compare_ether_header() can have a special implementation on 64 bit
arches if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
__napi_gro_receive() and vlan_gro_common() can avoid a conditional
branch to perform device match.
On x86_64, __napi_gro_receive() has now 38 instructions instead of 53
As gcc-4.4.3 still choose to not inline it, add inline keyword to this
performance critical function.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
caif does not build on ia64 starting with 2.6.32-rc1. Using
asm/unaligned.h instead of linux/unaligned/le_byteshift.h fixes the issue.
include/linux/unaligned/le_byteshift.h:40:50: error: redefinition of 'get_unaligned_le16'
include/linux/unaligned/le_byteshift.h:45:50: error: redefinition of 'get_unaligned_le32'
include/linux/unaligned/le_byteshift.h:50:50: error: redefinition of 'get_unaligned_le64'
include/linux/unaligned/le_byteshift.h:55:51: error: redefinition of 'put_unaligned_le16'
include/linux/unaligned/le_byteshift.h:60:51: error: redefinition of 'put_unaligned_le32'
include/linux/unaligned/le_byteshift.h:65:51: error: redefinition of 'put_unaligned_le64'
include/linux/unaligned/le_struct.h:31:51: note: previous definition of 'put_unaligned_le64' was here
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves a missplaced sock_put(sk) after
bh_unlock_sock(sk)
like in other parts of AX25 driver.
Signed-off-by: Bernard Pidoux <f6bvp@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
strlcpy() returns the total length of the string they tried to create, so
we should not use its return value without any check. scnprintf() returns
the number of characters written into @buf not including the trailing '\0',
so use it instead here.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change SCTP_DEBUG_PRINTK and SCTP_DEBUG_PRINTK_IPADDR to
use do { print } while (0) guards.
Add SCTP_DEBUG_PRINTK_CONT to fix errors in log when
lines were continued.
Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Add a missing newline in "Failed bind hash alloc"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
close https://bugzilla.kernel.org/show_bug.cgi?id=16529
Before calling dev_forward_skb(), we should make sure skb head contains
at least an ethernet header, even if length included in upper layer said
so. Use pskb_may_pull() to make sure this ethernet header is present in
skb head.
Reported-by: Thomas Heil <heil@terminal-consulting.de>
Reported-by: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch from GFP_ATOMIC allocations to GFP_KERNEL ones in
ip_vs_add_service() and ip_vs_new_dest(), as we hold a mutex and are
allowed to sleep in this context.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also rename __ip_vs_securetcp_lock to ip_vs_securetcp_lock.
Spinlock conversion was suggested by Eric Dumazet.
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also rename __ip_vs_sched_lock to ip_vs_sched_lock.
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Xiaoyu Du <tingsrain@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This issue come from ruby language community. Below test program
hang up when only run on Linux.
% uname -mrsv
Linux 2.6.26-2-486 #1 Sat Dec 26 08:37:39 UTC 2009 i686
% ruby -rsocket -ve '
BasicSocket.do_not_reverse_lookup = true
serv = TCPServer.open("127.0.0.1", 0)
s1 = TCPSocket.open("127.0.0.1", serv.addr[1])
s2 = serv.accept
s2.close
s1.write("a") rescue p $!
s1.write("a") rescue p $!
Thread.new {
s1.write("a")
}.join'
ruby 1.9.3dev (2010-07-06 trunk 28554) [i686-linux]
#<Errno::EPIPE: Broken pipe>
[Hang Here]
FreeBSD, Solaris, Mac doesn't. because Ruby's write() method call
select() internally. and tcp_poll has a bug.
SUS defined 'ready for writing' of select() as following.
| A descriptor shall be considered ready for writing when a call to an output
| function with O_NONBLOCK clear would not block, whether or not the function
| would transfer data successfully.
That said, EPIPE situation is clearly one of 'ready for writing'.
We don't have read-side issue because tcp_poll() already has read side
shutdown care.
| if (sk->sk_shutdown & RCV_SHUTDOWN)
| mask |= POLLIN | POLLRDNORM | POLLRDHUP;
So, Let's insert same logic in write side.
- reference url
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/31065http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/31068
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As discovered by Anton Blanchard, current code to autotune
tcp_death_row.sysctl_max_tw_buckets, sysctl_tcp_max_orphans and
sysctl_max_syn_backlog makes little sense.
The bigger a page is, the less tcp_max_orphans is : 4096 on a 512GB
machine in Anton's case.
(tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket))
is much bigger if spinlock debugging is on. Its wrong to select bigger
limits in this case (where kernel structures are also bigger)
bhash_size max is 65536, and we get this value even for small machines.
A better ground is to use size of ehash table, this also makes code
shorter and more obvious.
Based on a patch from Anton, and another from David.
Reported-and-tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If bridge port is offline, don't call ethtool to query speed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The carrier check is not called from work queue in current code.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__ip_vs_service_get and __ip_vs_svc_fwm_get increment a reference count, so
that reference count should be decremented before leaving the function in an
error case.
A simplified version of the semantic match that finds this problem is:
(http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
local idexpression x;
expression E;
identifier f1;
iterator I;
@@
x = __ip_vs_service_get(...);
<... when != x
when != true (x == NULL || ...)
when != if (...) { <+...x...+> }
when != I (...) { <+...x...+> }
(
x == NULL
|
x == E
|
x->f1
)
...>
* return ...;
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a mac80211-based driver advertises mesh mode
support, this will be advertised to userspace.
However, if mac80211 was compiled without mesh
support, then that won't actually be true. Fix
this by removing the bit for mesh if mesh isn't
compiled in.
Since this synchronizes what we advertise to
cfg80211 and actually support, it means we can
now rely on cfg80211's interface type checks
and need not check again in mac80211.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch fixes a potential crash (null-pointer de-
reference) which was introduced in my previous patch:
"mac80211: AMPDU rx reorder timeout timer"
During a BA teardown, the pointer to the soon-to-be-gone
tid_ampdu_rx element will be nullified. Therefore the
release timer mechanism has to be careful not to
accidentally access the item without any RCU protection.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
commit 95a6ccbb46c70cff376684c752831c014c87029d
Author: Johannes Berg <johannes.berg@intel.com>
Date: Thu Aug 12 15:38:38 2010 +0200
cfg80211/mac80211: extensible frame processing
introduced a netlink bug that caused parsing errors
in userspace because it forgot to close a nesting,
which would advertise a nesting length of zero to
userspace, which then completely threw off parsing
and led to
Illegal nla->nla_type == 0
being printed by libnl.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Unlike most other workqueue-tasks, the restart_work is
not scheduled onto mac80211's private per-interface
workqueue, but onto one of the system-wide workqueues.
Therefore the mac80211-stack has to cancel any pending
restarts, before destroying the shared device context
and handing back the memory. Otherwise - under very
unlucky circumstances - there could be a stale work-
item left, because some other kernel component might
have delayed the execution of ieee80211_restart_work
for too long.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
mesh_hdr only used when CONFIG_MAC80211_MESH is defined
Signed-off-by: Wey-Yi Guy <wey-yi.w.guy@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Standardize logging messages from
printk(KERN_<level> "%s: " fmt , wiphy_name(foo), args);
to
wiphy_<level>(foo, fmt, args);
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
As reported by Anton Blanchard when we use
percpu_counter_read_positive() to make our orphan socket limit checks,
the check can be off by up to num_cpus_online() * batch (which is 32
by default) which on a 128 cpu machine can be as large as the default
orphan limit itself.
Fix this by doing the full expensive sum check if the optimized check
triggers.
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Trivial extension to existing meta data match rules to allow
matching on skb receive hash value.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Compiler is not smart enough to avoid a conditional branch.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow userspace to register for more than just
action frames by giving the frame subtype, and
make it possible to use this in various modes
as well.
With some tweaks and some added functionality
this will, in the future, also be usable in AP
mode and be able to replace the cooked monitor
interface currently used in that case.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When MFP is disabled, action frames will not
be encrypted since they are management frames
and the only management frames that can then
be encrypted are authentication frames.
Therefore, setting the don't-encrypt flag on
action frames is unnecessary.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This function analyses only its single, value-passed
argument, and has no side effects. Thus it can be
const, which makes mac80211 smaller, for example:
text data bss dec hex filename
362518 16720 884 380122 5ccda mac80211.ko (before)
362358 16720 884 379962 5cc3a mac80211.ko (after)
a 160 byte saving in text size, and an optimisation
because the function won't be called as often.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The SNMP daemon uses ethtool to determine the speed of
network interfaces. This fails on Debian (and probably elsewhere)
because for security SNMP daemon runs as non-root user (snmp).
Note: A similar patch was rejected previously because of a concern about
the possibility that on some hardware querying the ethtool settings
requires access to the PHY and could slow the machine down. But the
security risk of requiring SNMP daemon (and related services)
to run as root far out weighs the risk of denial-of-service.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to use a temporary struct rtnl_link_stats64 variable,
just copy the source to skb buffer.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_bridge_alloc() always reset the skb->nf_bridge, so we should always
put the old one.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current CCID-2 RTT estimator code is in parts broken and lags behind the
suggestions in RFC2988 of using scaled variants for SRTT/RTTVAR.
That code is replaced by the present patch, which reuses the Linux TCP RTT
estimator code.
Further details:
----------------
1. The minimum RTO of previously one second has been replaced with TCP's, since
RFC4341, sec. 5 says that the minimum of 1 sec. (suggested in RFC2988, 2.4)
is not necessary. Instead, the TCP_RTO_MIN is used, which agrees with DCCP's
concept of a default RTT (RFC 4340, 3.4).
2. The maximum RTO has been set to DCCP_RTO_MAX (64 sec), which agrees with
RFC2988, (2.5).
3. De-inlined the function ccid2_new_ack().
4. Added a FIXME: the RTT is sampled several times per Ack Vector, which will
give the wrong estimate. It should be replaced with one sample per Ack.
However, at the moment this can not be resolved easily, since
- it depends on TX history code (which also needs some work),
- the cleanest solution is not to use the `sent' time at all (saves 4 bytes
per entry) and use DCCP timestamps / elapsed time to estimated the RTT,
which however is non-trivial to get right (but needs to be done).
Reasons for reusing the Linux TCP estimator algorithm:
------------------------------------------------------
Some time was spent to find a better alternative, using basic RFC2988 as a first
step. Further analysis and experimentation showed that the Linux TCP RTO
estimator is superior to a basic RFC2988 implementation. A summary is on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid2/rto_estimator/
In addition, this estimator fared well in a recent empirical evaluation:
Rewaskar, Sushant, Jasleen Kaur and F. Donelson Smith.
A Performance Study of Loss Detection/Recovery in Real-world TCP
Implementations. Proceedings of 15th IEEE International
Conference on Network Protocols (ICNP-07), 2007.
Thus there is significant benefit in reusing the existing TCP code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the dec_pipe function and improves the way the RTO timer is rearmed
when a new acknowledgment comes in.
Details and justification for removal:
--------------------------------------
1) The BUG_ON in dec_pipe is never triggered: pipe is only decremented for TX
history entries between tail and head, for which it had previously been
incremented in tx_packet_sent; and it is not decremented twice for the same
entry, since it is
- either decremented when a corresponding Ack Vector cell in state 0 or 1
was received (and then ccid2s_acked==1),
- or it is decremented when ccid2s_acked==0, as part of the loss detection
in tx_packet_recv (and hence it can not have been decremented earlier).
2) Restarting the RTO timer happens for every single entry in each Ack Vector
parsed by tx_packet_recv (according to RFC 4340, 11.4 this can happen up to
16192 times per Ack Vector).
3) The RTO timer should not be restarted when all outstanding data has been
acknowledged. This is currently done similar to (2), in dec_pipe, when
pipe has reached 0.
The patch onsolidates the code which rearms the RTO timer, combining the
segments from new_ack and dec_pipe. As a result, the code becomes clearer
(compare with tcp_rearm_rto()).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the ccid2_hc_tx_check_sanity function: it is redundant.
Details:
The tx_check_sanity function performs three tests:
1) it checks that the circular TX list is sorted
- in ascending order of sequence number (ccid2s_seq)
- and time (ccid2s_sent),
- in the direction from `tail' (hctx_seqt) to `head' (hctx_seqh);
2) it ensures that the entire list has the length seqbufc * CCID2_SEQBUF_LEN;
3) it ensures that pipe equals the number of packets that were not
marked `acked' (ccid2s_acked) between `tail' and `head'.
The following argues that each of these tests is redundant, this can be verified
by going through the code.
(1) is not necessary, since both time and GSS increase from one packet to the
next, so that subsequent insertions in tx_packet_sent (which advance the `head'
pointer) will be in ascending order of time and sequence number.
In (2), the length of the list is always equal to seqbufc times CCID2_SEQBUF_LEN
(set to 1024) unless allocation caused an earlier failure, because:
* at initialisation (tx_init), there is one chunk of size 1024 and seqbufc=1;
* subsequent calls to tx_alloc_seq take place whenever head->next == tail in
tx_packet_sent; then a new chunk of size 1024 is inserted between head and
tail, and seqbufc is incremented by one.
To show that (3) is redundant requires looking at two cases.
The `pipe' variable of the TX socket is incremented only in tx_packet_sent, and
decremented in tx_packet_recv. When head == tail (TX history empty) then pipe
should be 0, which is the case directly after initialisation and after a
retransmission timeout has occurred (ccid2_hc_tx_rto_expire).
The first case involves parsing Ack Vectors for packets recorded in the live
portion of the buffer, between tail and head. For each packet marked by the
receiver as received (state 0) or ECN-marked (state 1), pipe is decremented by
one, so for all such packets the BUG_ON in tx_check_sanity will not trigger.
The second case is the loss detection in the second half of tx_packet_recv,
below the comment "Check for NUMDUPACK".
The first while-loop here ensures that the sequence number of `seqp' is either
above or equal to `high_ack', or otherwise equal to the highest sequence number
sent so far (of the entry head->prev, as head points to the next unsent entry).
The next while-loop ("while (1)") counts the number of acked packets starting
from that position of seqp, going backwards in the direction from head->prev to
tail. If NUMDUPACK=3 such packets were counted within this loop, `seqp' points
to the last acknowledged packet of these, and the "if (done == NUMDUPACK)" block
is entered next.
The while-loop contained within that block in turn traverses the list backwards,
from head to tail; the position of `seqp' is saved in the variable `last_acked'.
For each packet not marked as `acked', a congestion event is triggered within
the loop, and pipe is decremented. The loop terminates when `seqp' has reached
`tail', whereupon tail is set to the position previously stored in `last_acked'.
Thus, between `last_acked' and the previous position of `tail',
- pipe has been decremented earlier if the packet was marked as state 0 or 1;
- pipe was decremented if the packet was not marked as acked.
That is, pipe has been decremented by the number of packets between `last_acked'
and the previous position of `tail'. As a consequence, pipe now again reflects
the number of packets which have not (yet) been acked between the new position
of tail (at `last_acked') and head->prev, or 0 if head==tail. The result is that
the BUG_ON condition in check_sanity will also not be triggered, hence the test
(3) is also redundant.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CCIDs are activated as last of the features, at the end of the handshake,
were the LISTEN state of the master socket is inherited into the server
state of the child socket. Thus, the only states visible to CCIDs now are
OPEN/PARTOPEN, and the closing states.
This allows to remove tests which were previously necessary to protect
against referencing a socket in the listening state (in CCID-3), but which
now have become redundant.
As a further byproduct of enabling the CCIDs only after the connection has been
fully established, several typecast-initialisations of ccid3_hc_{rx,tx}_sock
can now be eliminated:
* the CCID is loaded, so it is not necessary to test if it is NULL,
* if it is possible to load a CCID and leave the private area NULL, then this
is a bug, which should crash loudly - and earlier,
* the test for state==OPEN || state==PARTOPEN now reduces only to the closing
phase (e.g. when the node has received an unexpected Reset).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch collects cosmetics-only changes to separate these from
code changes:
* update with regard to CodingStyle and whitespace changes,
* documentation:
- adding/revising comments,
- remove CCID-3 RX socket documentation which is either
duplicate or refers to fields that no longer exist,
* expand embedded tfrc_tx_info struct inline for consistency,
removing indirections via #define.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
netfilter: fix CONFIG_COMPAT support
isdn/avm: fix build when PCMCIA is not enabled
header: fix broken headers for user space
e1000e: don't check for alternate MAC addr on parts that don't support it
e1000e: disable ASPM L1 on 82573
ll_temac: Fix poll implementation
netxen: fix a race in netxen_nic_get_stats()
qlnic: fix a race in qlcnic_get_stats()
irda: fix a race in irlan_eth_xmit()
net: sh_eth: remove unused variable
netxen: update version 4.0.74
netxen: fix inconsistent lock state
vlan: Match underlying dev carrier on vlan add
ibmveth: Fix opps during MTU change on an active device
ehea: Fix synchronization between HW and SW send queue
bnx2x: Update bnx2x version to 1.52.53-4
bnx2x: Fix PHY locking problem
rds: fix a leak of kernel memory
netlink: fix compat recvmsg
netfilter: fix userspace header warning
...
commit f3c5c1bfd4
(netfilter: xtables: make ip_tables reentrant) forgot to
also compute the jumpstack size in the compat handlers.
Result is that "iptables -I INPUT -j userchain" turns into -j DROP.
Reported by Sebastian Roesner on #netfilter, closes
http://bugzilla.netfilter.org/show_bug.cgi?id=669.
Note: arptables change is compile-tested only.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).
Since skb_has_frags() tests the latter, it's name is confusing
since it sounds more like it's testing the former.
Signed-off-by: David S. Miller <davem@davemloft.net>
Via setsockopt it is possible to reduce the socket RX buffer
(SO_RCVBUF). TCP method to select the initial window and window scaling
option in tcp_select_initial_window() currently misbehaves and do not
consider a reduced RX socket buffer via setsockopt.
Even though the server's RX buffer is reduced via setsockopt() to 256
byte (Initial Window 384 byte => 256 * 2 - (256 * 2 / 4)) the window
scale option is still 7:
192.168.1.38.40676 > 78.47.222.210.5001: Flags [S], seq 2577214362, win 5840, options [mss 1460,sackOK,TS val 338417 ecr 0,nop,wscale 0], length 0
78.47.222.210.5001 > 192.168.1.38.40676: Flags [S.], seq 1570631029, ack 2577214363, win 384, options [mss 1452,sackOK,TS val 2435248895 ecr 338417,nop,wscale 7], length 0
192.168.1.38.40676 > 78.47.222.210.5001: Flags [.], ack 1, win 5840, options [nop,nop,TS val 338421 ecr 2435248895], length 0
Within tcp_select_initial_window() the original space argument - a
representation of the rx buffer size - is expanded during
tcp_select_initial_window(). Only sysctl_tcp_rmem[2], sysctl_rmem_max
and window_clamp are considered to calculate the initial window.
This patch adjust the window_clamp argument if the user explicitly
reduce the receive buffer.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>