This patch adds the ability to allocate bucket table with GFP_ATOMIC
instead of GFP_KERNEL. This is needed when we perform an immediate
rehash during insertion.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the missing bits to allow multiple rehashes. The
read-side as well as remove already handle this correctly. So it's
only the rehasher and insertion that need modification to handle
this.
Note that this patch doesn't actually enable it so for now rehashing
is still only performed by the worker thread.
This patch also disables the explicit expand/shrink interface because
the table is meant to expand and shrink automatically, and continuing
to export these interfaces unnecessarily complicates the life of the
rehasher since the rehash process is now composed of two parts.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes rhashtable_shrink to shrink to the smallest
size possible rather than halving the table. This is needed
because with multiple rehashing we will defer shrinking until
all other rehashing is done, meaning that when we do shrink
we may be able to shrink a lot.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since every current rhashtable user uses jhash as their hash
function, the fact that jhash is an inline function causes each
user to generate a copy of its code.
This function provides a solution to this problem by allowing
hashfn to be unset. In which case rhashtable will automatically
set it to jhash. Furthermore, if the key length is a multiple
of 4, we will switch over to jhash2.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The walker is a lockless reader so it too needs an smp_rmb before
reading the future_tbl field in order to see any new tables that
may contain elements that we should have walked over.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
net/core/sysctl_net_core.c
net/ipv4/inet_diag.c
The be_main.c conflict resolution was really tricky. The conflict
hunks generated by GIT were very unhelpful, to say the least. It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().
So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged. And this worked beautifully.
The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all rhashtable users have been converted over to the
inline interface, this patch removes the unused out-of-line
interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts test_rhashtable to the inlined rhashtable
interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch deals with the complaint that we make indirect function
calls on the fast paths unnecessarily in rhashtable. We resolve
it by moving the fast paths into inline functions that take struct
rhashtable_param (which obviously must be the same set of parameters
supplied to rhashtable_init) as an argument.
The only remaining indirect call is to obj_hashfn (or key_hashfn it
obj_hashfn is unset) on the rehash as well as the insert-during-
rehash slow path.
This patch also extends the support of vairable-length keys to
include those where the key is fixed but scattered in the object.
For example, in netlink we want to key off the namespace and the
portid but they're not next to each other.
This patch does this by directly using the object hash function
as the indicator of whether the key is accessible or not. It
also adds a new function obj_cmpfn to compare a key against an
object. This means that the caller no longer needs to supply
explicit compare functions.
All this is done in a backwards compatible manner so no existing
users are affected until they convert to the new interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch marks the rhashtable_init params argument const as
there is no reason to modify it since we will always make a copy
of it in the rhashtable.
This patch also fixes a bug where we don't actually round up the
value of min_size unless it is less than HASH_MIN_SIZE.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Round up min_size respectively round down max_size to the next power
of two to make sure we always respect the limit specified by the
user. This is required because we compare the table size against the
limit before we expand or shrink.
Also fixes a minor bug where we modified min_size in the params
provided instead of the copy stored in struct rhashtable.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that nobody uses max_shift and min_shift, we can safely remove
them.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts test_rhashtable to use rhashtable max_size
instead of the obsolete max_shift.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the parameters max_size and min_size which are
meant to replace max_shift and min_shift.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Keeping both size and shift is silly. We only need one.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Caching the lock pointer avoids having to hash on the object
again to unlock the bucket locks.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warnings:
lib/rhashtable.c:767:5: warning: context imbalance in 'rhashtable_walk_start' - wrong count at exit
lib/rhashtable.c:849:6: warning: context imbalance in 'rhashtable_walk_stop' - unexpected unlock
Fixes: f2dba9c6ff ("rhashtable: Introduce rhashtable_walk_*")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 9d901bc051 ("rhashtable:
Free bucket tables asynchronously after rehash") causes gratuitous
failures in rhashtable_remove.
The reason is that it inadvertently introduced multiple rehashing
from the perspective of readers. IOW it is now possible to see
more than two tables during a single RCU critical section.
Fortunately the other reader rhashtable_lookup already deals with
this correctly thanks to c4db8848af
("rhashtable: rhashtable: Move future_tbl into struct bucket_table")
so only rhashtable_remove is broken by this change.
This patch fixes this by looping over every table from the first
one to the last or until we find the element that we were trying
to delete.
Incidentally the simple test for detecting rehashing to prevent
starting another shrinking no longer works. Since it isn't needed
anyway (the work queue and the mutex serves as a natural barrier
to unnecessary rehashes) I've simply killed the test.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit c4db8848af ("rhashtable:
Move future_tbl into struct bucket_table") introduced a use-after-
free bug in rhashtable_walk_stop because it dereferences tbl after
droping the RCU read lock.
This patch fixes it by moving the RCU read unlock down to the bottom
of rhashtable_walk_stop. In fact this was how I had it originally
but it got dropped while rearranging patches because this one
depended on the async freeing of bucket_table.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves future_tbl to open up the possibility of having
multiple rehashes on the same table.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a rehash counter to bucket_table to indicate
the last bucket that has been rehashed. This serves two purposes:
1. Any bucket that has been rehashed can never gain a new object.
2. If the rehash counter reaches the size of the table, the table
will forever remain empty.
This patch also downsizes bucket_table->size to an unsigned int
since we do not support sizes greater than 32 bits yet.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is in fact no need to wait for an RCU grace period in the
rehash function, since all insertions are guaranteed to go into
the new table through spin locks.
This patch uses call_rcu to free the old/rehashed table at our
leisure.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems that I have already made every rehash redo the random
seed even though my commit message indicated otherwise :)
Since we have already taken that step, this patch goes one step
further and moves the seed initialisation into bucket_table_alloc.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only nest one level deep there is no need to roll our own
subclasses.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously whenever the walker encountered a resize it simply
snaps back to the beginning and starts again. However, this only
works if the rehash started and completed while the walker was
idle.
If the walker attempts to restart while the rehash is still ongoing,
we may miss objects that we shouldn't have.
This patch fixes this by making the walker walk the old table
followed by the new table just like all other readers. If a
rehash is detected we will still signal our caller of the fact
so they can prepare for duplicates but we will simply continue
the walk onto the new table after the old one is finished either
by us or by the rehasher.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull gadgetfs fixes from Al Viro:
"Assorted fixes around AIO on gadgetfs: leaks, use-after-free, troubles
caused by ->f_op flipping"
* 'gadget' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
gadgetfs: really get rid of switching ->f_op
gadgetfs: get rid of flipping ->f_op in ep_config()
gadget: switch ep_io_operations to ->read_iter/->write_iter
gadgetfs: use-after-free in ->aio_read()
gadget/function/f_fs.c: switch to ->{read,write}_iter()
gadget/function/f_fs.c: use put iov_iter into io_data
gadget/function/f_fs.c: close leaks
move iov_iter.c from mm/ to lib/
new helper: dup_iter()
This patch fixes a typo rhashtable_lookup_compare where we fail
to recompute the hash when looking up the new table. This causes
elements to be missed and potentially a crash during a resize.
Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c0c09bfdc4 ("rhashtable: avoid unnecessary wakeup for worker
queue") changed ht->shift to be atomic, which is actually unnecessary.
Instead of leaving the current shift in the core rhashtable structure,
it can be cached inside the individual bucket tables.
There, it will only be initialized once during a new table allocation
in the shrink/expansion slow path, and from then onward it stays immutable
for the rest of the bucket table liftime.
That allows shift to be non-atomic. The patch also moves hash_rnd
management into the table setup. The rhashtable structure now consumes
3 instead of 4 cachelines.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ying Xue <ying.xue@windriver.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a potential race condition between readers and the rehasher.
In particular, the rehasher could have started a rehash while the
reader finishes a scan of the old table but fails to see the new
table pointer.
This patch closes this window by adding smp_wmb/smp_rmb.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the only caller of obj_raw_hashfn is head_hashfn, we can
simply kill it and fold it into the latter.
This patch also moves the common shift from head_hashfn/key_hashfn
into rht_bucket_index.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
key_hashfn has only one caller and it doesn't really need to supply
the key length as an extra parameter.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we don't have cross-table hashes, we no longer need to
keep the entire hash value so all users of obj_raw_hashfn can
use head_hashfn instead.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reverts commit c88455ce50
("rhashtable: key_hashfn() must return full hash value") because
the only user of it always masks the hash value.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit aa34a6cb04 ("rhashtable:
Add arbitrary rehash function") killed the annotation on the
nested lock which leads to bitching from lockdep.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a rehash function that supports the use of any
hash function for the new table. This is needed to support changing
the random seed value during the lifetime of the hash table.
However for now the random seed value is still constant and the
rehash function is simply used to replace the existing expand/shrink
functions.
[ ASSERT_BUCKET_LOCK() and thus debug_dump_table() +
debug_dump_buckets() are not longer used, so delete them
entirely. -DaveM ]
Signed-off-by: Herbert Xu <herbert.xu@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently hash_rnd is a parameter that users can set. However,
no existing users set this parameter. It is also something that
people are unlikely to want to set directly since it's just a
random number.
In preparation for allowing the reseeding/rehashing of rhashtable,
this patch moves hash_rnd into bucket_table so that it's now an
internal state rather than a parameter.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
contains fixes to ftrace when /proc/sys/kernel/ftrace_enabled and
function tracing are started. Doing the following causes some issues:
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
As well as with function tracing too. Pratyush Anand first reported
this issue to me and supplied a patch. When I tested this on my x86
test box, it caused thousands of backtraces and warnings to appear in
dmesg, which also caused a denial of service (a warning for every
function that was listed). I applied Pratyush's patch but it did not
fix the issue for me. I looked into it and found a slight problem
with trampoline accounting. I fixed it and sent Pratyush a patch, but
he said that it did not fix the issue for him.
I later learned tha Pratyush was using an ARM64 server, and when I tested
on my ARM board, I was able to reproduce the same issue as Pratyush.
After applying his patch, it fixed the problem. The above test uncovered
two different bugs, one in x86 and one in ARM and ARM64. As this looked
like it would affect PowerPC, I tested it on my PPC64 box. It too broke,
but neither the patch that fixed ARM or x86 fixed this box (the changes
were all in generic code!). The above test, uncovered two more bugs that
affected PowerPC. Again, the changes were only done to generic code.
It's the way the arch code expected things to be done that was different
between the archs. Some where more sensitive than others.
The rest of this series fixes the PPC bugs as well.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU/cQSAAoJEEjnJuOKh9lde9sH/1MAPq+6jr7YaEFru0GKajE9
rVHjw8rde/I4tN2UxIVk+Qm6pXRZYpv3OKxHT48EHzkvgm++voioykpJP4IEVrP5
mEDuIcYe28csE2nV5u5Q9kwnZoC86TQW5nVV6zB1Gx/3IEzA8Z046jAov40Jya0y
zqHc/U43JeeVIDIOkwjzbH6OaFEDP13FkF3TO502WJhJLqMo+kPOalIgv0eauKzy
lVCQBSC4WS3rVsgW4W3dSrEBaUxbJxgunjxOuV2DwHj5eghHq0M2MKeIUxBz0PuN
wnhTrpf5cAfshTvYHxKlE0uItdyYfVb7UChAD5zTbBL4kMUFhpb183zVKH8K8kU=
=8R8y
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v4.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull seq-buf/ftrace fixes from Steven Rostedt:
"This includes fixes for seq_buf_bprintf() truncation issue. It also
contains fixes to ftrace when /proc/sys/kernel/ftrace_enabled and
function tracing are started. Doing the following causes some issues:
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
As well as with function tracing too. Pratyush Anand first reported
this issue to me and supplied a patch. When I tested this on my x86
test box, it caused thousands of backtraces and warnings to appear in
dmesg, which also caused a denial of service (a warning for every
function that was listed). I applied Pratyush's patch but it did not
fix the issue for me. I looked into it and found a slight problem
with trampoline accounting. I fixed it and sent Pratyush a patch, but
he said that it did not fix the issue for him.
I later learned tha Pratyush was using an ARM64 server, and when I
tested on my ARM board, I was able to reproduce the same issue as
Pratyush. After applying his patch, it fixed the problem. The above
test uncovered two different bugs, one in x86 and one in ARM and
ARM64. As this looked like it would affect PowerPC, I tested it on my
PPC64 box. It too broke, but neither the patch that fixed ARM or x86
fixed this box (the changes were all in generic code!). The above
test, uncovered two more bugs that affected PowerPC. Again, the
changes were only done to generic code. It's the way the arch code
expected things to be done that was different between the archs. Some
where more sensitive than others.
The rest of this series fixes the PPC bugs as well"
* tag 'trace-fixes-v4.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled
ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctl
ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctl
seq_buf: Fix seq_buf_bprintf() truncation
seq_buf: Fix seq_buf_vprintf() truncation
In seq_buf_bprintf(), bstr_printf() is used to copy the format into the
buffer remaining in the seq_buf structure. The return of bstr_printf()
is the amount of characters written to the buffer excluding the '\0',
unless the line was truncated!
If the line copied does not fit, it is truncated, and a '\0' is added
to the end of the buffer. But in this case, '\0' is included in the length
of the line written. To know if the buffer had overflowed, the return
length will be the same or greater than the length of the buffer passed in.
The check in seq_buf_bprintf() only checked if the length returned from
bstr_printf() would fit in the buffer, as the seq_buf_bprintf() is only
to be an all or nothing command. It either writes all the string into
the seq_buf, or none of it. If the string is truncated, the pointers
inside the seq_buf must be reset to what they were when the function was
called. This is not the case. On overflow, it copies only part of the string.
The fix is to change the overflow check to see if the length returned from
bstr_printf() is less than the length remaining in the seq_buf buffer, and not
if it is less than or equal to as it currently does. Then seq_buf_bprintf()
will know if the write from bstr_printf() was truncated or not.
Link: http://lkml.kernel.org/r/1425500481.2712.27.camel@perches.com
Cc: stable@vger.kernel.org
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In seq_buf_vprintf(), vsnprintf() is used to copy the format into the
buffer remaining in the seq_buf structure. The return of vsnprintf()
is the amount of characters written to the buffer excluding the '\0',
unless the line was truncated!
If the line copied does not fit, it is truncated, and a '\0' is added
to the end of the buffer. But in this case, '\0' is included in the length
of the line written. To know if the buffer had overflowed, the return
length will be the same as the length of the buffer passed in.
The check in seq_buf_vprintf() only checked if the length returned from
vsnprintf() would fit in the buffer, as the seq_buf_vprintf() is only
to be an all or nothing command. It either writes all the string into
the seq_buf, or none of it. If the string is truncated, the pointers
inside the seq_buf must be reset to what they were when the function was
called. This is not the case. On overflow, it copies only part of the string.
The fix is to change the overflow check to see if the length returned from
vsnprintf() is less than the length remaining in the seq_buf buffer, and not
if it is less than or equal to as it currently does. Then seq_buf_vprintf()
will know if the write from vsnpritnf() was truncated or not.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull networking fixes from David Miller:
1) If an IPVS tunnel is created with a mixed-family destination
address, it cannot be removed. Fix from Alexey Andriyanov.
2) Fix module refcount underflow in netfilter's nft_compat, from Pablo
Neira Ayuso.
3) Generic statistics infrastructure can reference variables sitting on
a released function stack, therefore use dynamic allocation always.
Fix from Ignacy Gawędzki.
4) skb_copy_bits() return value test is inverted in ip_check_defrag().
5) Fix network namespace exit in openvswitch, we have to release all of
the per-net vports. From Pravin B Shelar.
6) Fix signedness bug in CAIF's cfpkt_iterate(), from Dan Carpenter.
7) Fix rhashtable grow/shrink behavior, only expand during inserts and
shrink during deletes. From Daniel Borkmann.
8) Netdevice names with semicolons should never be allowed, because
they serve as a separator. From Matthew Thode.
9) Use {,__}set_current_state() where appropriate, from Fabian
Frederick.
10) Revert byte queue limits support in r8169 driver, it's causing
regressions we can't figure out.
11) tcp_should_expand_sndbuf() erroneously uses tp->packets_out to
measure packets in flight, properly use tcp_packets_in_flight()
instead. From Neal Cardwell.
12) Fix accidental removal of support for bluetooth in CSR based Intel
wireless cards. From Marcel Holtmann.
13) We accidently added a behavioral change between native and compat
tasks, wrt testing the MSG_CMSG_COMPAT bit. Just ignore it if the
user happened to set it in a native binary as that was always the
behavior we had. From Catalin Marinas.
14) Check genlmsg_unicast() return valud in hwsim netlink tx frame
handling, from Bob Copeland.
15) Fix stale ->radar_required setting in mac80211 that can prevent
starting new scans, from Eliad Peller.
16) Fix memory leak in nl80211 monitor, from Johannes Berg.
17) Fix race in TX index handling in xen-netback, from David Vrabel.
18) Don't enable interrupts in amx-xgbe driver until all software et al.
state is ready for the interrupt handler to run. From Thomas
Lendacky.
19) Add missing netlink_ns_capable() checks to rtnl_newlink(), from Eric
W Biederman.
20) The amount of header space needed in macvtap was not calculated
properly, fix it otherwise we splat past the beginning of the
packet. From Eric Dumazet.
21) Fix bcmgenet TCP TX perf regression, from Jaedon Shin.
22) Don't raw initialize or mod timers, use setup_timer() and
mod_timer() instead. From Vaishali Thakkar.
23) Fix software maintained statistics in bcmgenet and systemport
drivers, from Florian Fainelli.
24) DMA descriptor updates in sh_eth need proper memory barriers, from
Ben Hutchings.
25) Don't do UDP Fragmentation Offload on RAW sockets, from Michal
Kubecek.
26) Openvswitch's non-masked set actions aren't constructed properly
into netlink messages, fix from Joe Stringer.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
openvswitch: Fix serialization of non-masked set actions.
gianfar: Reduce logging noise seen due to phy polling if link is down
ibmveth: Add function to enable live MAC address changes
net: bridge: add compile-time assert for cb struct size
udp: only allow UFO for packets from SOCK_DGRAM sockets
sh_eth: Really fix padding of short frames on TX
Revert "sh_eth: Enable Rx descriptor word 0 shift for r8a7790"
sh_eth: Fix RX recovery on R-Car in case of RX ring underrun
sh_eth: Ensure proper ordering of descriptor active bit write/read
net/mlx4_en: Disbale GRO for incoming loopback/selftest packets
net/mlx4_core: Fix wrong mask and error flow for the update-qp command
net: systemport: fix software maintained statistics
net: bcmgenet: fix software maintained statistics
rxrpc: don't multiply with HZ twice
rxrpc: terminate retrans loop when sending of skb fails
net/hsr: Fix NULL pointer dereference and refcnt bugs when deleting a HSR interface.
net: pasemi: Use setup_timer and mod_timer
net: stmmac: Use setup_timer and mod_timer
net: 8390: axnet_cs: Use setup_timer and mod_timer
net: 8390: pcnet_cs: Use setup_timer and mod_timer
...
If a hash table has 128 slots and 16384 elems, expand to 256 slots
takes more than one second. For larger sets, a soft lockup is detected.
Holding cpu for that long, even in a work queue is a show stopper
for non preemptable kernels.
cond_resched() at strategic points to allow process scheduler
to reschedule us.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, all real users of rhashtable default their grow and shrink
decision functions to rht_grow_above_75() and rht_shrink_below_30(),
so that there's currently no need to have this explicitly selectable.
It can/should be generic and private inside rhashtable until a real
use case pops up. Since we can make this private, we'll save us this
additional indirection layer and can improve insertion/deletion time
as well.
Reference: http://patchwork.ozlabs.org/patch/443040/
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
While commit c0c09bfdc4 ("rhashtable: avoid unnecessary wakeup for
worker queue") rightfully moved part of the decision making of
whether we should expand or shrink from the expand/shrink functions
themselves into insert/delete functions in order to avoid unnecessary
worker wake-ups, it however introduced a regression by doing so.
Before that change, if no max_shift was specified (= 0) on rhashtable
initialization, rhashtable_expand() would just grow unconditionally
and lets the available memory be the limiting factor. After that
change, if no max_shift was specified, there would be _no_ expansion
step at all.
Given that netlink and tipc have a max_shift specified, it was not
visible there, but Josh Hunt reported that if nft that starts out
with a default element hint of 3 if not otherwise provided, would
slow i.e. inserts down trememdously as it cannot grow larger to
relax table occupancy.
Given that the test case verifies shrinks/expands manually, we also
must remove pointer to the helper functions to explicitly avoid
parallel resizing on insertions/deletions. test_bucket_stats() and
test_rht_lookup() could also be wrapped around rhashtable mutex to
explicitly synchronize a walk from resizing, but I think that defeats
the actual test case which intended to have explicit test steps,
i.e. 1) inserts, 2) expands, 3) shrinks, 4) deletions, with object
verification after each stage.
Reported-by: Josh Hunt <johunt@akamai.com>
Fixes: c0c09bfdc4 ("rhashtable: avoid unnecessary wakeup for worker queue")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: Josh Hunt <johunt@akamai.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f2dba9c6ff ("rhashtable: Introduce rhashtable_walk_*") forgot to
initialize the members of struct rhashtable_walker after allocating it, which
caused an undefined value for 'resize' which is used later on.
Fixes: f2dba9c6ff ("rhashtable: Introduce rhashtable_walk_*")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no good reason why to disallow unloading of the rhashtable
test case module.
Commit 9d6dbe1bba moved the code from a boot test into a stand-alone
module, but only converted the subsys_initcall() handler into a
module_init() function without a related exit handler, and thus
preventing the test module from unloading.
Fixes: 9d6dbe1bba ("rhashtable: Make selftest modular")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When trying to allocate future tables via bucket_table_alloc(), it seems
overkill on large table shifts that we probe for kzalloc() unconditionally
first, as it's likely to fail.
Only probe with kzalloc() for more reasonable table sizes and use vzalloc()
either as a fallback on failure or directly in case of large table sizes.
Fixes: 7e1e77636e ("lib: Resizable, Scalable, Concurrent Hash Table")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Restore pre 54c5b7d311 behaviour and only probe for expansions on inserts
and shrinks on deletes. Currently, it will happen that on initial inserts
into a sparse hash table, we may i.e. shrink it first simply because it's
not fully populated yet, only to later realize that we need to grow again.
This however is counter intuitive, e.g. an initial default size of 64
elements is already small enough, and in case an elements size hint is given
to the hash table by a user, we should avoid unnecessary expansion steps,
so a shrink is clearly unintended here.
Fixes: 54c5b7d311 ("rhashtable: introduce rhashtable_wakeup_worker helper function")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ying Xue <ying.xue@windriver.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
With object runtime debugging enabled, the rhashtable test suite
will rightfully throw a warning "ODEBUG: object is on stack, but
not annotated" from rhashtable_init().
This is because run_work is (correctly) being initialized via
INIT_WORK(), and not annotated by INIT_WORK_ONSTACK(). Meaning,
rhashtable_init() is okay as is, we just need to move ht e.g.,
into global scope.
It never triggered anything, since test_rhashtable is rather a
controlled environment and effectively runs to completion, so
that stack memory is not vanishing underneath us, we shouldn't
confuse any testers with it though.
Fixes: 7e1e77636e ("lib: Resizable, Scalable, Concurrent Hash Table")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull kconfig updates from Michal Marek:
"Yann E Morin was supposed to take over kconfig maintainership, but
this hasn't happened. So I'm sending a few kconfig patches that I
collected:
- Fix for missing va_end in kconfig
- merge_config.sh displays used if given too few arguments
- s/boolean/bool/ in Kconfig files for consistency, with the plan to
only support bool in the future"
* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kconfig: use va_end to match corresponding va_start
merge_config.sh: Display usage if given too few arguments
kconfig: use bool instead of boolean for type definition attributes
On top of tht is the major rework of lguest, to use PCI and virtio 1.0, to
double-check the implementation.
Then comes the inevitable fixes and cleanups from that work.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU5B9cAAoJENkgDmzRrbjxPacP/jajliXX353JJ/g/hkZ6oDN5
o7FhELBKiUMr7enVZYwj2BBYk5OM36nB9pQkiqHMSbjJGoS5IK70enxb4YRxSHBn
YCLblZMNqutGS0kclZ9DDysztjAhxH7CvLM6pMZ7eHP0f3+FM/QhbxHfbG9DTBUH
2U/nybvd3M/+YBe7ptwQdrH8aOCAD6RTIsXellfm99dNMK6K/5lqnWQ98WSXmNXq
vyvdaAQsqqUkmxtajjcBumaCH4/SehOJJjUqojCMsR3aBkgOBWDZJURMek+KA5Dt
X996fBsTAlvTtCUKRrmLTb2ScDH7fu+jwbWRqMYDk8zpEr3XqiLTTPV4/TiHGmi7
Wiw3g1wIY1YbETlZyongB5MIoVyUfmDAd+bT8nBsj3KIITD84gOUQFDMl6d63c0I
z6A9Pu/UzpJGsXZT3WoFLi6TO67QyhOseqZnhS4wBgLabjxffNM7yov9RVKUVH/n
JHunnpUk2iTtSgscBarOBz5867dstuurnaUIspZthVBo6y6N0z+GrU+agJ8Y4DXx
mvwzeYLhQH2208PjxPFiah/kA/gHNm1m678TbpS+CUsgmpQiJ4gTwtazDSi4TwZY
Hs9T9GulkzpZIzEyKL3qG2TsfyDhW5Avn+GvKInAT9+Fkig4BnP3DUONBxcwGZ78
eI3FDUWsE36NqE5ECWmz
=ivCe
-----END PGP SIGNATURE-----
Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio updates from Rusty Russell:
"OK, this has the big virtio 1.0 implementation, as specified by OASIS.
On top of tht is the major rework of lguest, to use PCI and virtio
1.0, to double-check the implementation.
Then comes the inevitable fixes and cleanups from that work"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (80 commits)
virtio: don't set VIRTIO_CONFIG_S_DRIVER_OK twice.
virtio_net: unconditionally define struct virtio_net_hdr_v1.
tools/lguest: don't use legacy definitions for net device in example launcher.
virtio: Don't expose legacy net features when VIRTIO_NET_NO_LEGACY defined.
tools/lguest: use common error macros in the example launcher.
tools/lguest: give virtqueues names for better error messages
tools/lguest: more documentation and checking of virtio 1.0 compliance.
lguest: don't look in console features to find emerg_wr.
tools/lguest: don't start devices until DRIVER_OK status set.
tools/lguest: handle indirect partway through chain.
tools/lguest: insert driver references from the 1.0 spec (4.1 Virtio Over PCI)
tools/lguest: insert device references from the 1.0 spec (4.1 Virtio Over PCI)
tools/lguest: rename virtio_pci_cfg_cap field to match spec.
tools/lguest: fix features_accepted logic in example launcher.
tools/lguest: handle device reset correctly in example launcher.
virtual: Documentation: simplify and generalize paravirt_ops.txt
lguest: remove NOTIFY call and eventfd facility.
lguest: remove NOTIFY facility from demonstration launcher.
lguest: use the PCI console device's emerg_wr for early boot messages.
lguest: always put console in PCI slot #1.
...