This patch converts stab qdisc management to RCU, so that we can perform
the qdisc_calculate_pkt_len() call before getting qdisc lock.
This shortens the lock's held time in __dev_xmit_skb().
This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding lot of
cache misses and so reducing latencies.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 3711210576 (net: QDISC_STATE_RUNNING dont need atomic bit
ops) I moved QDISC_STATE_RUNNING flag to __state container, located in
the cache line containing qdisc lock and often dirtied fields.
I now move TCQ_F_THROTTLED bit too, so that we let first cache line read
mostly, and shared by all cpus. This should speedup HTB/CBQ for example.
Not using test_bit()/__clear_bit()/__test_and_set_bit allows to use an
"unsigned int" for __state container, reducing by 8 bytes Qdisc size.
Introduce helpers to hide implementation details.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SFQ currently uses a 1024 slots hash table, and its internal structure
(sfq_sched_data) allocation needs order-1 page on x86_64
Allow tc command to specify a divisor value (hash table size), between 1
and 65536.
If no value is provided, assume the 1024 default size.
This allows admins to setup smaller (or bigger) SFQ for specific needs.
This also brings back sfq_sched_data allocations to order-0 ones, saving
3KB per SFQ qdisc.
Jesper uses ~55.000 SFQ in one machine, this patch should free 165 MB of
memory.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleanup net/sched code to current CodingStyle and practices.
Reduce inline abuse
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements a mqprio queueing discipline that by default creates
a pfifo_fast qdisc per tx queue and provides the needed configuration
interface.
Using the mqprio qdisc the number of tcs currently in use along
with the range of queues alloted to each class can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.
Configurable parameters,
struct tc_mqprio_qopt {
__u8 num_tc;
__u8 prio_tc_map[TC_BITMASK + 1];
__u8 hw;
__u16 count[TC_MAX_QUEUE];
__u16 offset[TC_MAX_QUEUE];
};
Here the count/offset pairing give the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc.
The hw bit determines if the hardware should configure the count
and offset values. If the hardware bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.
It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_tc_setup().
One expected use case is drivers will use the ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
GRETH: resolve SMP issues and other problems
GRETH: handle frame error interrupts
GRETH: avoid writing bad speed/duplex when setting transfer mode
GRETH: fixed skb buffer memory leak on frame errors
GRETH: GBit transmit descriptor handling optimization
GRETH: fix opening/closing
GRETH: added raw AMBA vendor/device number to match against.
cassini: Fix build bustage on x86.
e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs
e1000e: update Copyright for 2011
e1000: Avoid unhandled IRQ
r8169: keep firmware in memory.
netdev: tilepro: Use is_unicast_ether_addr helper
etherdevice.h: Add is_unicast_ether_addr function
ks8695net: Use default implementation of ethtool_ops::get_link
ks8695net: Disable non-working ethtool operations
USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable.
vxge: Remember to release firmware after upgrading firmware
netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr
ipsec: update MAX_AH_AUTH_LEN to support sha512
...
Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
which itself depends on NET_SCHED; this dependency is missing from netfilter.
Since matching on realms is also useful without having NET_SCHED enabled and
the option really only controls whether the tclassid member is included in
route and dst entries, rename the config option to IP_ROUTE_CLASSID and move
it outside of traffic scheduling context to get rid of the NET_SCHED dependeny.
Reported-by: Vladis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
After recent changes, (percpu stats on vlan/tunnels...), we dont need
anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.
Only remaining users are ixgbe, sch_teql, gianfar & macvlan :
1) ixgbe can be converted to use existing tx_ring counters.
2) macvlan incremented txq->tx_dropped, it can use the
dev->stats.tx_dropped counter.
3) sch_teql : almost revert ab35cd4b8f (Use net_device internal stats)
Now we have ndo_get_stats64(), use it, even for "unsigned long"
fields (No need to bring back a struct net_device_stats)
4) gianfar adds a stats structure per tx queue to hold
tx_bytes/tx_packets
This removes a lockdep warning (and possible lockup) in rndis gadget,
calling dev_get_stats() from hard IRQ context.
Ref: http://www.spinics.net/lists/netdev/msg149202.html
Reported-by: Neil Jones <neiljay@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Sandeep Gopalpet <sandeep.kumar@freescale.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
HTB takes into account skb is segmented in stats updates.
Generalize this to all schedulers.
They should use qdisc_bstats_update() helper instead of manipulating
bstats.bytes and bstats.packets
Add bstats_update() helper too for classes that use
gnet_stats_basic_packed fields.
Note : Right now, TCQ_F_CAN_BYPASS shortcurt can be taken only if no
stab is setup on qdisc.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 57dbb2d83d (sched: add head drop fifo queue)
introduced pfifo_head_drop, and broke the invariant that
sch->bstats.bytes and sch->bstats.packets are COUNTER (increasing
counters only)
This can break estimators because est_timer() handles unsigned deltas
only. A decreasing counter can then give a huge unsigned delta.
My mid term suggestion would be to change things so that
sch->bstats.bytes and sch->bstats.packets are incremented in dequeue()
only, not at enqueue() time. We also could add drop_bytes/drop_packets
and provide estimations of drop rates.
It would be more sensible anyway for very low speeds, and big bursts.
Right now, if we drop packets, they still are accounted in byte/packets
abolute counters and rate estimators.
Before this mid term change, this patch makes pfifo_head_drop behavior
similar to other qdiscs in case of drops :
Dont decrement sch->bstats.bytes and sch->bstats.packets
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide child qdisc backlog (byte count) information so that "tc -s
qdisc" can report it to user.
packet count is already correctly provided.
qdisc red 11: parent 1:11 limit 60Kb min 15Kb max 45Kb ecn
Sent 3116427684 bytes 1415782 pkt (dropped 8, overlimits 7866 requeues 0)
rate 242385Kbit 13630pps backlog 13560b 8p requeues 0
marked 7865 early 1 pdrop 7 other 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
slot_dequeue_head() should make sure slot skb chain is correct in both
ways, or we can crash if all possible flows are in use.
Jarek pointed out slot_queue_init() can now be done in sfq_init() once,
instead each time a flow is setup.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SFQ is currently 'limited' to small packets, because it uses a 15bit
allotment number per flow. Introduce a scale by 8, so that we can handle
full size TSO/GRO packets.
Use appropriate handling to make sure allot is positive before a new
packet is dequeued, so that fairness is respected.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Jarek Poplawski <jarkao2@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfq_walk() runs without qdisc lock. By the time it selects a non empty
hash slot and sfq_dump_class_stats() is run (with lock held), slot might
have been freed : We then access q->slots[SFQ_EMPTY_SLOT], out of
bounds, and crash in slot_queue_walk()
On previous kernels, bug is here but out of bounds qs[SFQ_DEPTH] and
allot[SFQ_DEPTH] are located in struct sfq_sched_data, so no illegal
memory access happens, only possibly wrong data reported to user.
Also, slot_dequeue_tail() should make sure slot skb chain is correctly
terminated, or sfq_dump_class_stats() can access freed skbs.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
Here is a respin of patch.
I'll send a short patch to make SFQ more fair in presence of large
packets as well.
Thanks
[PATCH v3 net-next-2.6] net_sched: sch_sfq: better struct layouts
This patch shrinks sizeof(struct sfq_sched_data)
from 0x14f8 (or more if spinlocks are bigger) to 0x1180 bytes, and
reduce text size as well.
text data bss dec hex filename
4821 152 0 4973 136d old/net/sched/sch_sfq.o
4627 136 0 4763 129b new/net/sched/sch_sfq.o
All data for a slot/flow is now grouped in a compact and cache friendly
structure, instead of being spreaded in many different points.
struct sfq_slot {
struct sk_buff *skblist_next;
struct sk_buff *skblist_prev;
sfq_index qlen; /* number of skbs in skblist */
sfq_index next; /* next slot in sfq chain */
struct sfq_head dep; /* anchor in dep[] chains */
unsigned short hash; /* hash value (index in ht[]) */
short allot; /* credit for this slot */
};
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When deploying SFQ/IFB here at work, I found the allot management was
pretty wrong in sfq, even changing allot from short to int...
We should init allot for each new flow, not using a previous value found
in slot.
Before patch, I saw bursts of several packets per flow, apparently
denying the default "quantum 1514" limit I had on my SFQ class.
class sfq 11:1 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 7p requeues 0
allot 11546
class sfq 11:46 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 1p requeues 0
allot -23873
class sfq 11:78 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 5p requeues 0
allot 11393
After patch, better fairness among each flow, allot limit being
respected, allot is positive :
class sfq 11:e parent 11:
(dropped 0, overlimits 0 requeues 86)
backlog 0b 3p requeues 86
allot 596
class sfq 11:94 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 1468
class sfq 11:a4 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 4p requeues 0
allot 650
class sfq 11:bb parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 596
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently return for each active SFQ slot the number of packets in
queue. We can also give number of bytes accounted for these packets.
tc -s class show dev ifb0
Before patch :
class sfq 11:3d9 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 1266
After patch :
class sfq 11:3e4 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 4380b 3p requeues 0
allot 1212
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dev_close_many and dev_deactivate_many to factorize another
sync-rcu operation on the netdevice unregister path.
$ modprobe dummy numdummies=10000
$ ip link set dev dummy* up
$ time rmmod dummy
Without the patch With the patch
real 0m 24.63s real 0m 5.15s
user 0m 0.00s user 0m 0.00s
sys 0m 6.05s sys 0m 5.14s
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocate qdisc memory according to NUMA properties of cpus included in
xps map.
To be effective, qdisc should be (re)setup after changes
of /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
I added a numa_node field in struct netdev_queue, containing NUMA node
if all cpus included in xps_cpus share same node, else -1.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When testing struct netdev_queue state against FROZEN bit, we also test
XOFF bit. We can test both bits at once and save some cycles.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of the documentation refers to web pages under
the domain `osdl.org'. However, `osdl.org' now
redirects to `linuxfoundation.org'.
Rather than rely on redirections, this patch updates
the addresses appropriately; for the most part, only
documentation that is meant to be current has been
updated.
The patch should be pretty quick to scan and check;
each new web-page url was gotten by trying out the
original URL in a browser and then simply copying the
the redirected URL (formatting as necessary).
There is some conflict as to which one of these domain
names is preferred:
linuxfoundation.org
linux-foundation.org
So, I wrote:
info@linuxfoundation.org
and got this reply:
Message-ID: <4CE17EE6.9040807@linuxfoundation.org>
Date: Mon, 15 Nov 2010 10:41:42 -0800
From: David Ames <david@linuxfoundation.org>
...
linuxfoundation.org is preferred. The canonical name for our web site is
www.linuxfoundation.org. Our list site is actually
lists.linux-foundation.org.
Regarding email linuxfoundation.org is preferred there are a few people
who choose to use linux-foundation.org for their own reasons.
Consequently, I used `linuxfoundation.org' for web pages and
`lists.linux-foundation.org' for mailing-list web pages and email addresses;
the only personal email address I updated from `@osdl.org' was that of
Andrew Morton, who prefers `linux-foundation.org' according `git log'.
Signed-off-by: Michael Witten <mfwitten@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The basic classifier keeps statistics but does not report it to user space.
This showed up when using basic classifier (with police) as a default catch
all on ingress; no statistics were reported.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Somewhere along the lines net_cls_subsys_id became a macro when
cls_cgroup is built as a module. Not only did it make cls_cgroup
completely useless, it also causes it to crash on module unload.
This patch fixes this by removing that macro.
Thanks to Eric Dumazet for diagnosing this problem.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While validating the configuration em_ops is already set, thus the
individual destroy functions are called, but the ematch data has
not been allocated and associated with the ematch yet.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
vlan: Calling vlan_hwaccel_do_receive() is always valid.
tproxy: use the interface primary IP address as a default value for --on-ip
tproxy: added IPv6 support to the socket match
cxgb3: function namespace cleanup
tproxy: added IPv6 support to the TPROXY target
tproxy: added IPv6 socket lookup function to nf_tproxy_core
be2net: Changes to use only priority codes allowed by f/w
tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
tproxy: added tproxy sockopt interface in the IPV6 layer
tproxy: added udp6_lib_lookup function
tproxy: added const specifiers to udp lookup functions
tproxy: split off ipv6 defragmentation to a separate module
l2tp: small cleanup
nf_nat: restrict ICMP translation for embedded header
can: mcp251x: fix generation of error frames
can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
can-raw: add msg_flags to distinguish local traffic
9p: client code cleanup
rds: make local functions/variables static
...
Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
The first parameter dev isn't in use in qdisc_create_dflt().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING on this thread:
http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
The problem is, softirq processing uses local_bh_disable internally. There
is no way, later in the flow, to differentiate between whether softirq is
being processed or is it just that bh has been disabled. So, a hardirq when bh
is disabled results in time being wrongly accounted as softirq.
Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well. As account_system_time() in normal tick based accouting also uses
softirq_count, which will be set even when not in softirq with bh disabled.
Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count
for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
processing. The patch below does that and adds API in_serving_softirq() which
returns whether we are currently processing softirq or not.
Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.
Looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case by case basis.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
dirtying neighbour in stress situation (many different flows / dsts)
Dirtying takes place because of read_lock(&n->lock) and n->used writes.
Switching to a seqlock, and writing n->used only on jiffies changes
permits less dirtying.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_headroom() is unsigned so "skb_headroom(skb) + toff" is also
unsigned and can't be less than zero. This test was added in 66d50d25:
"u32: negative offset fix" It was supposed to fix a regression.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ingress being not used very much, and net_device->ingress_queue being
quite a big object (128 or 256 bytes), use a dynamic allocation if
needed (tc qdisc add dev eth0 ingress ...)
dev_ingress_queue(dev) helper should be used only with RTNL taken.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is some confusion with rx_queue name after RPS, and net drivers
private rx_queue fields.
I suggest to rename "struct net_device"->rx_queue to ingress_queue.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The list_head conversion unearther an unnecessary flow
check. Since flow is always NULL here we don't need to
see if a matching flow exists already.
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes init_vf() function, so on each new backlog period parent's
cl_cfmin is properly updated (including further propgation towards the root),
even if the activated leaf has no upperlimit curve defined.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reviewing commit 1c40be12f7, I
audited other users of tc_action_ops->dump for information leaks.
That commit covered almost all of them but act_police still had a leak.
opt.limit and opt.capab aren't zeroed out before the structure is
passed out.
This patch uses the C99 initializers to zero everything unused out.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial extension to existing meta data match rules to allow
matching on skb receive hash value.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
for the declararion of csum_ipv6_magic.
Fixes this build error on PowerPC (at least):
net/sched/act_csum.c: In function 'tcf_csum_ipv6_icmp':
net/sched/act_csum.c:178: error: implicit declaration of function 'csum_ipv6_magic'
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can use rxhash to classify the traffic into flows. As rxhash maybe
supplied by NIC or RPS, it is cheaper.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/sched: add ACT_CSUM action to update packets checksums
ACT_CSUM can be called just after ACT_PEDIT in order to re-compute some
altered checksums in IPv4 and IPv6 packets. The following checksums are
supported by this patch:
- IPv4: IPv4 header, ICMP, IGMP, TCP, UDP & UDPLite
- IPv6: ICMPv6, TCP, UDP & UDPLite
It's possible to request in the same action to update different kind of
checksums, if the packets flow mix TCP, UDP and UDPLite, ...
An example of usage is done in the associated iproute2 patch.
Version 3 changes:
- remove useless goto instructions
- improve IPv6 hop options decoding
Version 2 changes:
- coding style correction
- remove useless arguments of some functions
- use stack in tcf_csum_dump()
- add tcf_csum_skb_nextlayer() to factor code
Signed-off-by: Gregoire Baron <baronchon@n7mm.org>
Acked-by: jamal <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to check "s". nla_data() doesn't return NULL. Also we
already dereferenced "s" at this point so it would have oopsed ealier if
it were NULL.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We leak at least 32bits of kernel memory to user land in tc dump,
because we dont init all fields (capab ?) of the dumped structure.
Use C99 initializers so that holes and non explicit fields are zeroed.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Require qdisc class ops .walk and .leaf for classful qdisc in
register_qdisc(). The checks could be done later insted, but these
ops are really needed and used by most of classful qdiscs.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sch_sfq as a classful qdisc needs the .leaf handler. Otherwise, there
is an oops possible in tc_modify_qdisc()/check_loop().
Fixes commit 7d2681a6ff
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is based on work originally done by Patric McHardy.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Verify in register_qdisc() some basic qdisc class handlers are present.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dummy .unbind_tcf and .put qdisc class ops for easier verification.
(All other schedulers have it like this.)
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since there was added ->tcf_chain() method without ->bind_tcf() to
sch_sfq class options, there is oops when a filter is added with
the classid parameter.
Fixes commit 7d2681a6ff
netdev thread: null pointer at cls_api.c
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Reported-by: Franchoze Eric <franchoze@yandex.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
The packet length should be checked before the packet data is dereferenced.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The packet length should be checked before the packet data is dereferenced.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The packet length should be checked before the packet data is dereferenced.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
On the TX path, skb->data points to the ethernet header, not the network
header. So when validating the packet length for accessing we should
take the ethernet header into account.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was possible to use a negative offset in a u32 match to reference
the ethernet header or other parts of the link layer header.
This fixes the regression caused by:
commit fbc2e7d9cf
Author: Changli Gao <xiaosuo@gmail.com>
Date: Wed Jun 2 07:32:42 2010 -0700
cls_u32: use skb_header_pointer() to dereference data safely
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
after updating the value of the ICMP payload, inet_proto_csum_replace4() should
be called with zero pseudohdr.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull() may change skb pointers, so adjust icmph after pskb_may_pull().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes hang when target device of mirred packet classifier
action is removed.
If a mirror or redirection action is configured to cause packets
to go to another device, the classifier holds a ref count, but was assuming
the adminstrator cleaned up all redirections before removing. The fix
is to add a notifier and cleanup during unregister.
The new list is implicitly protected by RTNL mutex because
it is held during filter add/delete as well as notifier.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/vhost/net.c
net/bridge/br_device.c
Fix merge conflict in drivers/vhost/net.c with guidance from
Stephen Rothwell.
Revert the effects of net-2.6 commit 573201f36f
since net-next-2.6 has fixes that make bridge netpoll work properly thus
we don't need it disabled.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use modern this_cpu_xxx() api, saving few bytes on x86
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reset_policy() does:
memset(d->tcfd_defdata, 0, SIMP_MAX_DATA);
strlcpy(d->tcfd_defdata, defdata, SIMP_MAX_DATA);
In the original code, the size of d->tcfd_defdata wasn't fixed and if
strlen(defdata) was less than 31, reset_policy() would cause memory
corruption.
Please Note: The original alloc_defdata() assumes defdata is 32
characters and a NUL terminator while reset_policy() assumes defdata is
31 characters and a NUL. This patch updates alloc_defdata() to match
reset_policy() (ie a shorter string). I'm not very familiar with this
code so please review carefully.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
not all of the ICMP packets need an IP header payload, so we check the length
of the skbs only when the packets should have an IP header payload.
Based upon analysis and initial patch by Rodrigo Partearroyo González.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
----
net/sched/act_nat.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
act_nat: use stack variable
structure tc_nat isn't too big for stack, so we can put it in stack.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/sched/act_nat.c | 31 ++++++++++---------------------
1 file changed, 10 insertions(+), 21 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
act_mirred: combine duplicate code
tcf_bstats is updated in any way, so we can do it earlier to reduce the size of
the code.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
----
net/sched/act_mirred.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
don't clone skb when skb isn't shared
When the tcf_action is TC_ACT_STOLEN, and the skb isn't shared, we don't need
to clone a new skb. As the skb will be freed after this function returns, we
can use it freely once we get a reference to it.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
include/net/sch_generic.h | 11 +++++++++--
net/sched/act_mirred.c | 6 +++---
2 files changed, 12 insertions(+), 5 deletions(-)
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
https://bugzilla.kernel.org/show_bug.cgi?id=16183
The sch_teql module, which can be used to load balance over a set of
underlying interfaces, stopped working after 2.6.30 and has been
broken in all kernels since then for any underlying interface which
requires the addition of link level headers.
The problem is that the transmit routine relies on being able to
access the destination address in the skb in order to do address
resolution once it has decided which underlying interface it is going
to transmit through.
In 2.6.31 the IFF_XMIT_DST_RELEASE flag was introduced, and set by
default for all interfaces, which causes the destination address to be
released before the transmit routine for the interface is called.
The solution is to clear that flag for teql interfaces.
Signed-off-by: Tom Hughes <tom@compton.nu>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen_kill_estimator() API is incomplete or not well documented, since
caller should make sure an RCU grace period is respected before
freeing stats_lock.
This was partially addressed in commit 5d944c640b
(gen_estimator: deadlock fix), but same problem exist for all
gen_kill_estimator() users, if lock they use is not already RCU
protected.
A code review shows xt_RATEEST.c, act_api.c, act_police.c have this
problem. Other are ok because they use qdisc lock, already RCU
protected.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the core path doesnt set OK to munge we detect
writable skbs by looking to see if they are cloned.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove two unnecessary assignments
we don't need to assign NULL when initialize structure objects.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/sched/sch_htb.c | 2 --
1 file changed, 2 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
access skb->data safely
we should use skb_header_pointer() and skb_store_bits() to access skb->data to
handle small or non-linear skbs.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/sched/act_pedit.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
use skb_header_pointer() to dereference data safely
the original skb->data dereference isn't safe, as there isn't any skb->len or
skb_is_nonlinear() check. skb_header_pointer() is used instead in this patch.
And when the skb isn't long enough, we terminate the function u32_classify()
immediately with -1.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix the wrong checksum when addr isn't in old_addr/mask
For TCP and UDP packets, when addr isn't in old_addr/mask we don't do SNAT or
DNAT, and we should not update layer 4 checksum.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/sched/act_nat.c | 4 ++++
1 file changed, 4 insertions(+)
Signed-off-by: David S. Miller <davem@davemloft.net>
When many cpus compete for sending frames on a given qdisc, the qdisc
spinlock suffers from very high contention.
The cpu owning __QDISC_STATE_RUNNING bit has same priority to acquire
the lock, and cannot dequeue packets fast enough, since it must wait for
this lock for each dequeued packet.
One solution to this problem is to force all cpus spinning on a second
lock before trying to get the main lock, when/if they see
__QDISC_STATE_RUNNING already set.
The owning cpu then compete with at most one other cpu for the main
lock, allowing for higher dequeueing rate.
Based on a previous patch from Alexander Duyck. I added the heuristic to
avoid the atomic in fast path, and put the new lock far away from the
cache line used by the dequeue worker. Also try to release the busylock
lock as late as possible.
Tests with following script gave a boost from ~50.000 pps to ~600.000
pps on a dual quad core machine (E5450 @3.00GHz), tg3 driver.
(A single netperf flow can reach ~800.000 pps on this platform)
for j in `seq 0 3`; do
for i in `seq 0 7`; do
netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
done
done
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define three helpers to manipulate QDISC_STATE_RUNNIG flag, that a
second patch will move on another location.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently such notifications are only generated when the device comes up or the
address changes. However one use case for these notifications is to enable
faster network recovery after a virtual machine migration (by causing switches
to relearn their MAC tables). A migration appears to the network stack as a
temporary loss of carrier and therefore does not trigger either of the current
conditions. Rather than adding carrier up as a trigger (which can cause issues
when interfaces a flapping) simply add an interface which the driver can use
to explicitly trigger the notification.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread. This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.
Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.
In fact, the current semantics is inherently broken, as a single
skb may be constructed out of the writes of two different tasks.
A different manifestation of this problem is when the TCP stack
transmits in response of an incoming ACK. This is currently
unclassified.
As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.
This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)
The value is set at socket creation time. So all sockets created
via socket(2) automatically gains the ID of the thread creating it.
Whenever another process touches the socket by either reading or
writing to it, we will change the socket classid to that of the
process if it has a valid (non-zero) classid.
For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.
In order to minimise risks, I have not made this the authoritative
classid. For now it is only used as a backup when we execute
with soft IRQs disabled. Once we're completely happy with its
semantics we can use it as the sole classid.
Footnote: I have rearranged the error path on cls_group module
creation. If we didn't do this, then there is a window where
someone could create a tc rule using cls_group before the cgroup
subsystem has been registered.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Pfaff reported a kernel oops and provided a test program to
reproduce it.
https://kerneltrap.org/mailarchive/linux-netdev/2010/5/21/6277805
tc_fill_qdisc() should not be called for builtin qdisc, or it
dereference a NULL pointer to get device ifindex.
Fix is to always use tc_qdisc_dump_ignore() before calling
tc_fill_qdisc().
Reported-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes from net/ (but not any netfilter files)
all the unnecessary return; statements that precede the
last closing brace of void functions.
It does not remove the returns that are immediately
preceded by a label as gcc doesn't like that.
Done via:
$ grep -rP --include=*.[ch] -l "return;\n}" net/ | \
xargs perl -i -e 'local $/ ; while (<>) { s/\n[ \t\n]+return;\n}/\n}/g; print; }'
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the user has a bad classification configuration, and gets a packet
that goes through too many steps. Chances are more packets will arrive,
and the message spew will overrun syslog because it is not rate limited.
And because it is not tagged with appropriate priority it can't not be screened.
Added the qdisc to the message to try and give some more context when
the message does arrive.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch encourage me to go look at all the messages in
the network scheduler and fix them. Many messages were missing
any severity level. Some serious ones that should never happen
were turned into WARN(), and the random noise messages that were
handled changed to pr_debug().
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>