* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
route: Take the right src and dst addresses in ip_route_newports
ipv4: Fix nexthop caching wrt. scoping.
ipv4: Invalidate nexthop cache nh_saddr more correctly.
net: fix pch_gbe section mismatch warning
ipv4: fix fib metrics
mlx4_en: Removing HW info from ethtool -i report.
net_sched: fix THROTTLED/RUNNING race
drivers/net/a2065.c: Convert release_resource to release_region/release_mem_region
drivers/net/ariadne.c: Convert release_resource to release_region/release_mem_region
bonding: fix rx_handler locking
myri10ge: fix rmmod crash
mlx4_en: updated driver version to 1.5.4.1
mlx4_en: Using blue flame support
mlx4_core: reserve UARs for userspace consumers
mlx4_core: maintain available field in bitmap allocator
mlx4: Add blue flame support for kernel consumers
mlx4_en: Enabling new steering
mlx4: Add support for promiscuous mode in the new steering model.
mlx4: generalization of multicast steering.
mlx4_en: Reporting HW revision in ethtool -i
...
Move the scope value out of the fib alias entries and into fib_info,
so that we always use the correct scope when recomputing the nexthop
cached source address.
Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Any operation that:
1) Brings up an interface
2) Adds an IP address to an interface
3) Deletes an IP address from an interface
can potentially invalidate the nh_saddr value, requiring
it to be recomputed.
Perform the recomputation lazily using a generation ID.
Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alessandro Suardi reported that we could not change route metrics :
ip ro change default .... advmss 1400
This regression came with commit 9c150e82ac (Allocate fib metrics
dynamically). fib_metrics is no longer an array, but a pointer to an
array.
Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Alessandro Suardi <alessandro.suardi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sec=<something> is not presented as a mount option,
we should attempt to determine what security flavor the
server is using.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A submount may use different security than the parent
mount does. We should figure out what sec flavor the
submount uses at mount time.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-2.6.39' of git://linux-nfs.org/~bfields/linux:
SUNRPC: Remove resource leak in svc_rdma_send_error()
nfsd: wrong index used in inner loop
nfsd4: fix comment and remove unused nfsd4_file fields
nfs41: make sure nfs server return right ca_maxresponsesize_cached
nfsd: fix compile error
svcrpc: fix bad argument in unix_domain_find
nfsd4: fix struct file leak
nfsd4: minor nfs4state.c reshuffling
svcrpc: fix rare race on unix_domain creation
nfsd41: modify the members value of nfsd4_op_flags
nfsd: add proc file listing kernel's gss_krb5 enctypes
gss:krb5 only include enctype numbers in gm_upcall_enctypes
NFSD, VFS: Remove dead code in nfsd_rename()
nfsd: kill unused macro definition
locks: use assign_type()
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Andy Grover <andy.grover@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
asm-generic/bitops/le.h is only intended to be included directly from
asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h
which implements generic ext2 or minix bit operations.
This stops including asm-generic/bitops/le.h directly and use ext2
non-atomic bit operations instead.
It seems odd to use ext2_*_bit() on rds, but it will replaced with
__{set,clear,test}_bit_le() after introducing little endian bit operations
for all architectures. This indirect step is necessary to maintain
bisectability for some architectures which have their own little-endian
bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Andy Grover <andy.grover@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 2c8cec5c10 (Cache learned PMTU information in inetpeer) added
an extra inet_putpeer() call in ip_rt_update_pmtu().
This results in various problems, since we can free one inetpeer, while
it is still in use.
Ref: http://www.spinics.net/lists/netdev/msg159121.html
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 9435eb1cf0
("ipv4: Implement __ip_dev_find using new interface address hash.")
we reimplemented __ip_dev_find() so that it doesn't have to
do a full FIB table lookup.
Instead, it consults a hash table of addresses configured to
interfaces.
This works identically to the old code in all except one case,
and that is for loopback subnets.
The old code would match the loopback device for any IP address
that falls within a subnet configured to the loopback device.
Handle this corner case by doing the FIB lookup.
We could implement this via inet_addr_onlink() but:
1) Someone could configure many addresses to loopback and
inet_addr_onlink() is a simple list traversal.
2) We know the old code works.
Reported-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current undo logic, cwnd is moderated after it was restored
to the value prior entering fast-recovery. It was moderated first
in tcp_try_undo_recovery then again in tcp_complete_cwr.
Since the undo indicates recovery was false, these moderations
are not necessary. If the undo is triggered when most of the
outstanding data have been acknowledged, the (restored) cwnd is
falsely pulled down to a small value.
This patch removes these cwnd moderations if cwnd is undone
a) during fast-recovery
b) by receiving DSACKs past fast-recovery
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv6_dev_get_saddr() is currently called with an uninitialized
destination address. Although in tests it usually seemed to nevertheless
always fetch the right source address, there seems to be a possible race
condition.
Therefore this commit changes this, first setting the destination
address and only after that fetching the source address.
Reported-by: Jan Beulich <JBeulich@novell.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This avoids explicit cast to avoid 'discards qualifiers'
compiler warning in a netfilter patch that i've been working on.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
ksoftirqd, kworker, migration, and pktgend kthreads can be created with
kthread_create_on_node(), to get proper NUMA affinities for their stack and
task_struct.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
[net/9p]: Introduce basic flow-control for VirtIO transport.
9p: use the updated offset given by generic_write_checks
[net/9p] Don't re-pin pages on retrying virtqueue_add_buf().
[net/9p] Set the condition just before waking up.
[net/9p] unconditional wake_up to proc waiting for space on VirtIO ring
fs/9p: Add v9fs_dentry2v9ses
fs/9p: Attach writeback_fid on first open with WR flag
fs/9p: Open writeback fid in O_SYNC mode
fs/9p: Use truncate_setsize instead of vmtruncate
net/9p: Fix compile warning
net/9p: Convert the in the 9p rpc call path to GFP_NOFS
fs/9p: Fix race in initializing writeback fid
* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: use watch/notify for changes in rbd header
libceph: add lingering request and watch/notify event framework
rbd: update email address in Documentation
ceph: rename dentry_release -> d_release, fix comment
ceph: add request to the tail of unsafe write list
ceph: remove request from unsafe list if it is canceled/timed out
ceph: move readahead default to fs/ceph from libceph
ceph: add ino32 mount option
ceph: update common header files
ceph: remove debugfs debug cruft
libceph: fix osd request queuing on osdmap updates
ceph: preserve I_COMPLETE across rename
libceph: Fix base64-decoding when input ends in newline.
If we call xs_close(), we're in one of two situations:
- Autoclose, which means we don't expect to resend a request
- bind+connect failed, which probably means the port is in use
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Recent zerocopy work in the 9P VirtIO transport maps and pins
user buffers into kernel memory for the server to work on them.
Since the user process can initiate this kind of pinning with a simple
read/write call, thousands of IO threads initiated by the user process can
hog the system resources and could result into denial of service.
This patch introduces flow control to avoid that extreme scenario.
The ceiling limit to avoid denial of service attacks is set to relatively
high (nr_free_pagecache_pages()/4) so that it won't interfere with
regular usage, but can step in extreme cases to limit the total system
hang. Since we don't have a global structure to accommodate this variable,
I choose the virtio_chan as the home for this.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Given that the sprious wake-ups are common, we need to move the
condition setting right next to the wake_up(). After setting the condition
to req->status = REQ_STATUS_RCVD, sprious wakeups may cause the
virtqueue back on the free list for someone else to use.
This may result in kernel panic while relasing the pinned pages
in p9_release_req_pages().
Also rearranged the while loop in req_done() for better redability.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Process may wait to get space on VirtIO ring to send a transaction to
VirtFS server. Current code just does a conditional wake_up() which
means only one process will be woken up even if multiple processes
are waiting.
This fix makes the wake_up unconditional. Hence we won't have any
processes waiting for-ever.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Lingering requests are requests that are sent to the OSD normally but
tracked also after we get a successful request. This keeps the OSD
connection open and resends the original request if the object moves to
another OSD. The OSD can then send notification messages back to us
if another client initiates a notify.
This framework will be used by RBD so that the client gets notification
when a snapshot is created by another node or tool.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Optimize the calling of fib_add_ifaddr for all
secondary addresses after the promoted one to start from
their place, not from the new place of the promoted
secondary. It will save some CPU cycles because we
are sure the promoted secondary was first for the subnet
and all next secondaries do not change their place.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
The secondary address promotion relies on fib_sync_down_addr
to remove all routes created for the secondary addresses when
the old primary address is deleted. It does not happen for cases
when the primary address is also in another subnet. Fix that
by deleting local and broadcast routes for all secondaries while
they are on device list and by faking that all addresses from
this subnet are to be deleted. It relies on fib_del_ifaddr being
able to ignore the IPs from the concerned subnet while checking
for duplication.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Sidorenko reported for problems with local
routes left after IP addresses are deleted. It happens
when same IPs are used in more than one subnet for the
device.
Fix fib_del_ifaddr to restrict the checks for duplicate
local and broadcast addresses only to the IFAs that use
our primary IFA or another primary IFA with same address.
And we expect the prefsrc to be matched when the routes
are deleted because it is possible they to differ only by
prefsrc. This patch prevents local and broadcast routes
to be leaked until their primary IP is deleted finally
from the box.
As the secondary address promotion needs to delete
the routes for all secondaries that used the old primary IFA,
add option to ignore these secondaries from the checks and
to assume they are already deleted, so that we can safely
delete the route while these IFAs are still on the device list.
Reported-by: Alex Sidorenko <alexandre.sidorenko@hp.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_table_delete forgets to match the routes by prefsrc.
Callers can specify known IP in fc_prefsrc and we should remove
the exact route. This is needed for cases when same local or
broadcast addresses are used in different subnets and the
routes differ only in prefsrc. All callers that do not provide
fc_prefsrc will ignore the route prefsrc as before and will
delete the first occurence. That is how the ip route del default
magic works.
Current callers are:
- ip_rt_ioctl where rtentry_to_fib_config provides fc_prefsrc only
when the provided device name matches IP label with colon.
- inet_rtm_delroute where RTA_PREFSRC is optional too
- fib_magic which deals with routes when deleting addresses
and where the fc_prefsrc is always set with the primary IP
for the concerned IFA.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement compatibility with new hw_features for dev_disable_lro().
This is a transition path - dev_disable_lro() should be later
integrated into netdev_fix_features() after all drivers are converted.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I was fixing issues with unregisgtering tables under /proc/sys/net/ipv6/neigh
by adding a mount point it appears I missed a critical ordering issue, in the
ipv6 initialization. I had not realized that ipv6_sysctl_register is called
at the very end of the ipv6 initialization and in particular after we call
neigh_sysctl_register from ndisc_init.
"neigh" needs to be initialized in ipv6_static_sysctl_register which is
the first ipv6 table to initialized, and definitely before ndisc_init.
This removes the weirdness of duplicate tables while still providing a
"neigh" mount point which prevents races in sysctl unregistering.
This was initially reported at https://bugzilla.kernel.org/show_bug.cgi?id=31232
Reported-by: sunkan@zappa.cx
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was pointed out to me recently that my spelling could be better :)
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The BKL removal in appletalk introduced a use-after-free problem,
where atalk_destroy_socket frees a sock, but we still release
the socket lock on it.
An easy fix is to take an extra reference on the sock and sock_put
it when returning from atalk_release.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b0d0d915d1 (remove the BKL) added a regression, because
sock_put() can free memory while we are going to use it later.
Fix is to delay sock_put() _after_ release_sock().
Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
A struct used in the l2tp_eth driver for registering network namespace
ops was incorrectly marked as __net_initdata, leading to oops when
module unloaded.
BUG: unable to handle kernel paging request at ffffffffa00ec098
IP: [<ffffffff8123dbd8>] ops_exit_list+0x7/0x4b
PGD 142d067 PUD 1431063 PMD 195da8067 PTE 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/module/l2tp_eth/refcnt
Call Trace:
[<ffffffff8123dc94>] ? unregister_pernet_operations+0x32/0x93
[<ffffffff8123dd20>] ? unregister_pernet_device+0x2b/0x38
[<ffffffff81068b6e>] ? sys_delete_module+0x1b8/0x222
[<ffffffff810c7300>] ? do_munmap+0x254/0x318
[<ffffffff812c64e5>] ? page_fault+0x25/0x30
[<ffffffff812c6952>] ? system_call_fastpath+0x16/0x1b
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we send a request to osd A, and the request's pg remaps to osd B and
then back to A in quick succession, we need to resend the request to A. The
old code was only calling kick_requests after processing all incremental
maps in a message, so it was very possible to not resend a request that
needed to be resent. This would make the osd eventually time out (at least
with the current default of osd timeouts enabled).
The correct approach is to scan requests on every map incremental. This
patch refactors the kick code in a few ways:
- all requests are either on req_lru (in flight), req_unsent (ready to
send), or req_notarget (currently map to no up osd)
- mapping always done by map_request (previous map_osds)
- if the mapping changes, we requeue. requests are resent only after all
map incrementals are processed.
- some osd reset code is moved out of kick_requests into a separate
function
- the "kick this osd" functionality is moved to kick_osd_requests, as it
is unrelated to scanning for request->pg->osd mapping changes
Signed-off-by: Sage Weil <sage@newdream.net>
This field is used to determine the inactivity time. When in AP mode,
hostapd uses it for kicking out inactive clients after a while. Without this
patch, hostapd immediately deauthenticates a new client if it checks the
inactivity time before the client sends its first data frame.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Cc: stable@kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
'buffer' string is copied from userspace. It is not checked whether it is
zero terminated. This may lead to overflow inside of simple_strtoul().
Changli Gao suggested to copy not more than user supplied 'size' bytes.
It was introduced before the git epoch. Files "ipt_CLUSTERIP/*" are
root writable only by default, however, on some setups permissions might be
relaxed to e.g. network admin user.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
commit f3c5c1bfd4 (make ip_tables reentrant) introduced a race in
handling the stackptr restore, at the end of ipt_do_table()
We should do it before the call to xt_info_rdunlock_bh(), or we allow
cpu preemption and another cpu overwrites stackptr of original one.
A second fix is to change the underflow test to check the origptr value
instead of 0 to detect underflow, or else we allow a jump from different
hooks.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The revision of the set type was not checked at the create command: if the
userspace sent a valid set type but with not supported revision number,
it'd create a loop.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The hash:*port* types with IPv4 silently ignored when address ranges
with non TCP/UDP were added/deleted from the set and used the first
address from the range only.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Whenever we enter the IP stack proper from bridge netfilter we
need to ensure that the skb is in a form the IP stack expects
it to be in.
The entry point on NF_FORWARD did not meet the requirements of
the IP stack, therefore leading to potential crashes/panics.
This patch fixes the problem.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c95b819ad7 (gre: Use needed_headroom)
made gre use needed_headroom instead of hard_header_len
This uncover a bug in vlan code.
We should make sure vlan devices take into account their
real_dev->needed_headroom or we risk a crash in ipgre_header(), because
we dont have enough room to push IP header in skb.
Reported-by: Diddi Oscarsson <diddi@diddi.se>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>