Add support to allow getting/setting netfilter tables settings.
Currently these are IFLA_BR_NF_CALL_IPTABLES, IFLA_BR_NF_CALL_IP6TABLES
and IFLA_BR_NF_CALL_ARPTABLES.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to set/get all of the igmp's configurable intervals via
netlink. These currently are:
IFLA_BR_MCAST_LAST_MEMBER_INTVL
IFLA_BR_MCAST_MEMBERSHIP_INTVL
IFLA_BR_MCAST_QUERIER_INTVL
IFLA_BR_MCAST_QUERY_INTVL
IFLA_BR_MCAST_QUERY_RESPONSE_INTVL
IFLA_BR_MCAST_STARTUP_QUERY_INTVL
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_STARTUP_QUERY_CNT to allow setting/getting
br->multicast_startup_query_count via netlink. Also align the ifla
comments.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_LAST_MEMBER_CNT to allow setting/getting
br->multicast_last_member_count via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_HASH_MAX to allow setting/getting br->hash_max via
netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_HASH_ELASTICITY to allow setting/getting
br->hash_elasticity via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_QUERIER to allow setting/getting br->multicast_querier
via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_QUERY_USE_IFADDR to allow setting/getting
br->multicast_query_use_ifaddr via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_SNOOPING to allow enabling/disabling multicast
snooping via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_ROUTER to allow setting and retrieving
br->multicast_router when igmp snooping is enabled.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple attribute that flushes the bridge's fdb.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_GROUP_ADDR attribute to allow setting and retrieving the
group_addr via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export the following bridge timers (also exported via sysfs):
IFLA_BR_HELLO_TIMER, IFLA_BR_TCN_TIMER, IFLA_BR_TOPOLOGY_CHANGE_TIMER,
IFLA_BR_GC_TIMER via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_TOPOLOGY_CHANGE and IFLA_BR_TOPOLOGY_CHANGE_DETECTED and
export them via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_ROOT_PATH_COST and export it via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_ROOT_PORT and export it via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_BRIDGE_ID and export br->bridge_id via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_ROOT_ID and export br->designated_root via netlink. For this
purpose add struct ifla_bridge_id that would represent struct bridge_id.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_GROUP_FWD_MASK attribute to allow setting and retrieving the
group_fwd_mask via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The checks that lead to num_vlans change are always what
br_vlan_should_use checks for, namely if the vlan is only a context or
not and depending on that it's either not counted or counted
as a real/used vlan respectively.
Also give better explanation in br_vlan_should_use's comment.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's only one user now and we can include the flag directly.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce br_vlan_(get|put)_master which take a reference (or create the
master vlan first if it didn't exist) and drop a reference respectively.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I did the conversion to rhashtable I missed the required locking of
one important user of the vlan list - br_get_link_af_size_filtered()
which is called:
br_ifinfo_notify() -> br_nlmsg_size() -> br_get_link_af_size_filtered()
and the notifications can be sent without holding rtnl. Before this
conversion the function relied on using rcu and since we already use rcu to
destroy the vlans, we can simply migrate the list to use the rcu helpers.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace "void *obj" with a generic structure. Introduce couple of
helpers along that.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the struct name in sync with object id name.
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the struct name in sync with object id name.
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be aligned with obj.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not pass the original flags when creating a context vlan only
because they may contain some flags that change behaviour in the bridge.
The new global context should be with minimal set of flags, so pass 0
and let br_vlan_add() set the master flag only.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a new port is being added we need to make vlgrp available after
rhashtable has been initialized and when removing a port we need to
flush the vlans and free the resources after we're sure noone can use
the port, i.e. after it's removed from the port list and synchronize_rcu
is executed.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One obvious way to converge more code (which was also used by the
previous vlan code) is to move pvid inside net_bridge_vlan_group. This
allows us to simplify some and remove other port-specific functions.
Also gives us the ability to simply pass the vlan group and use all of the
contained information.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While a new port is being initialized the rx_handler gets set, but the
vlans get initialized later in br_add_if() and in that window if we
receive a frame with a link-local address we can try to dereference
p->vlgrp in:
br_handle_frame() -> br_handle_local_finish() -> br_should_learn()
Fix this by checking vlgrp before using it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Stephen pointed out the default initial size is more than we need, so
let's start small (4 elements, thus nelem_hint = 3). Also limit the hash
locks to the number of CPUs as we don't need any write-side scaling and
this looks like the minimum.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following pull request contains Netfilter/IPVS updates for net-next
containing 90 patches from Eric Biederman.
The main goal of this batch is to avoid recurrent lookups for the netns
pointer, that happens over and over again in our Netfilter/IPVS code. The idea
consists of passing netns pointer from the hook state to the relevant functions
and objects where this may be needed.
You can find more information on the IPVS updates from Simon Horman's commit
merge message:
c3456026ad ("Merge tag 'ipvs2-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next").
Exceptionally, this time, I'm not posting the patches again on netdev, Eric
already Cc'ed this mailing list in the original submission. If you need me to
make, just let me know.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the notifier_call callback of a notifier_block, change the
function signature of switchdev add and del operations to:
int switchdev_port_obj_add/del(struct net_device *dev,
enum switchdev_obj_id id, void *obj);
This allows the caller to pass a specific switchdev_obj_* structure
instead of the generic switchdev_obj one.
Drivers implementation of these operations and switchdev have been
changed accordingly.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the bridge vlan implementation to use rhashtables
instead of bitmaps. The main motivation behind this change is that we
need extensible per-vlan structures (both per-port and global) so more
advanced features can be introduced and the vlan support can be
extended. I've tried to break this up but the moment net_port_vlans is
changed and the whole API goes away, thus this is a larger patch.
A few short goals of this patch are:
- Extensible per-vlan structs stored in rhashtables and a sorted list
- Keep user-visible behaviour (compressed vlans etc)
- Keep fastpath ingress/egress logic the same (optimizations to come
later)
Here's a brief list of some of the new features we'd like to introduce:
- per-vlan counters
- vlan ingress/egress mapping
- per-vlan igmp configuration
- vlan priorities
- avoid fdb entries replication (e.g. local fdb scaling issues)
The structure is kept single for both global and per-port entries so to
avoid code duplication where possible and also because we'll soon introduce
"port0 / aka bridge as port" which should simplify things further
(thanks to Vlad for the suggestion!).
Now we have per-vlan global rhashtable (bridge-wide) and per-vlan port
rhashtable, if an entry is added to a port it'll get a pointer to its
global context so it can be quickly accessed later. There's also a
sorted vlan list which is used for stable walks and some user-visible
behaviour such as the vlan ranges, also for error paths.
VLANs are stored in a "vlan group" which currently contains the
rhashtable, sorted vlan list and the number of "real" vlan entries.
A good side-effect of this change is that it resembles how hw keeps
per-vlan data.
One important note after this change is that if a VLAN is being looked up
in the bridge's rhashtable for filtering purposes (or to check if it's an
existing usable entry, not just a global context) then the new helper
br_vlan_should_use() needs to be used if the vlan is found. In case the
lookup is done only with a port's vlan group, then this check can be
skipped.
Things tested so far:
- basic vlan ingress/egress
- pvids
- untagged vlans
- undef CONFIG_BRIDGE_VLAN_FILTERING
- adding/deleting vlans in different scenarios (with/without global ctx,
while transmitting traffic, in ranges etc)
- loading/removing the module while having/adding/deleting vlans
- extracting bridge vlan information (user ABI), compressed requests
- adding/deleting fdbs on vlans
- bridge mac change, promisc mode
- default pvid change
- kmemleak ON during the whole time
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network namespace is easiliy available in state->net so use it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Allow bridge forward delay to be configured when Spanning Tree is enabled.
Signed-off-by: Ian Wilson <iwilson@brocade.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv4/arp.c
The net/ipv4/arp.c conflict was one commit adding a new
local variable while another commit was deleting one.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Siva Mannem <siva.mannem.lnx@gmail.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Premkumar Jonnala <pjonnala@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only pass the void *priv parameter out of the nf_hook_ops. That is
all any of the functions are interested now, and by limiting what is
passed it becomes simpler to change implementation details.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As xt_action_param lives on the stack this does not bloat any
persistent data structures.
This is a first step in making netfilter code that needs to know
which network namespace it is executing in simpler.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- Add nft_pktinfo.pf to replace ops->pf
- Add nft_pktinfo.hook to replace ops->hooknum
This simplifies the code, makes it more readable, and likely reduces
cache line misses. Maintainability is enhanced as the details of
nft_hook_ops are of no concern to the recpients of nft_pktinfo.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Nearly everything thing of interest to ebt_do_table is already present
in nf_hook_state. Simplify ebt_do_table by just passing in the skb,
nf_hook_state, and the table. This make the code easier to read and
maintenance easier.
To support this create an nf_hook_state on the stack in ebt_broute
(the only caller without a nf_hook_state already available). This new
nf_hook_state adds no new computations to ebt_broute, but does use a
few more bytes of stack.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.
As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.
To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of saying "net = dev_net(state->in?state->in:state->out)"
just say "state->net". As that information is now availabe,
much less confusing and much less error prone.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a network namespace parameter into the netfilter hooks. At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known which allows the network namespace to
be easily and reliabily.
This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".
In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.
The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev
In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in then the historic
"dev_net(in?in:out)". I am documenting them in case something odd
pops up and someone starts trying to track down what happened.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_receive_skb_sk is only called once in the bridge code, replace
it with a bridge specific function that calls netif_receive_skb.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is prep work for passing net to the netfilter hooks.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When struct net starts being passed through the ipv4 and ipv6 fragment
routines br_nf_push_frag_xmit will need to take a net parameter.
Prepare br_nf_push_frag_xmit before that is needed and introduce
br_nf_push_frag_xmit_sk for the call sites that still need the old
calling conventions.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a prep work for passing struct net through ip_do_fragment and
later the netfilter okfn. Doing this independently makes the later
code changes clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function dev_queue_xmit_skb_sk is unncessary and very confusing.
Introduce br_send_bpdu_finish to remove the need for dev_queue_xmit_skb_sk,
and have br_send_bpdu_finish call dev_queue_xmit.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the newly introduced helper functions the skb pulling is hidden in
the checksumming function - and undone before returning to the caller.
The IGMPv3 and MLDv2 report parsing functions in the bridge still
assumed that the skb is pointing to the beginning of the IGMP/MLD
message while it is now kept at the beginning of the IPv4/6 header,
breaking the message parsing and creating packet loss.
Fixing this by taking the offset between IP and IGMP/MLD header into
account, too.
Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Tested-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unnecessary switchdev.h include from br_netlink.c.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since __vlan_del can return an error code, change its inner function
__vlan_vid_del to return an eventual error from switchdev_port_obj_del.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next tree.
In sum, patches to address fallout from the previous round plus updates from
the IPVS folks via Simon Horman, they are:
1) Add a new scheduler to IPVS: The weighted overflow scheduling algorithm
directs network connections to the server with the highest weight that is
currently available and overflows to the next when active connections exceed
the node's weight. From Raducu Deaconu.
2) Fix locking ordering in IPVS, always take rtnl_lock in first place. Patch
from Julian Anastasov.
3) Allow to indicate the MTU to the IPVS in-kernel state sync daemon. From
Julian Anastasov.
4) Enhance multicast configuration for the IPVS state sync daemon. Also from
Julian.
5) Resolve sparse warnings in the nf_dup modules.
6) Fix a linking problem when CONFIG_NF_DUP_IPV6 is not set.
7) Add ICMP codes 5 and 6 to IPv6 REJECT target, they are more informative
subsets of code 1. From Andreas Herz.
8) Revert the jumpstack size calculation from mark_source_chains due to chain
depth miscalculations, from Florian Westphal.
9) Calm down more sparse warning around the Netfilter tree, again from Florian
Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge/netfilter/ebtables.c:290:26: warning: incorrect type in assignment (different modifiers)
-> remove __pure annotation.
ipv6/netfilter/ip6t_SYNPROXY.c:240:27: warning: cast from restricted __be16
-> switch ntohs to htons and vice versa.
netfilter/core.c:391:30: warning: symbol 'nfq_ct_nat_hook' was not declared. Should it be static?
-> delete it, got removed
net/netfilter/nf_synproxy_core.c:221:48: warning: cast to restricted __be32
-> Use __be32 instead of u32.
Tested with objdiff that these changes do not affect generated code.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While looking into fixing the local entries scalability issue I noticed
that the structure is badly arranged because vlan_id would fall in a
second cache line while keeping rcu which is used only when deleting
in the first, so re-arrange the structure and push rcu to the end so we
can get 16 bytes which can be used for other fields (by pushing rcu
fully in the second 64 byte chunk). With this change all the core
necessary information when doing fdb lookups will be available in a
single cache line.
pahole before (note vlan_id):
struct net_bridge_fdb_entry {
struct hlist_node hlist; /* 0 16 */
struct net_bridge_port * dst; /* 16 8 */
struct callback_head rcu; /* 24 16 */
long unsigned int updated; /* 40 8 */
long unsigned int used; /* 48 8 */
mac_addr addr; /* 56 6 */
unsigned char is_local:1; /* 62: 7 1 */
unsigned char is_static:1; /* 62: 6 1 */
unsigned char added_by_user:1; /* 62: 5 1 */
unsigned char added_by_external_learn:1; /* 62: 4 1 */
/* XXX 4 bits hole, try to pack */
/* XXX 1 byte hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
__u16 vlan_id; /* 64 2 */
/* size: 72, cachelines: 2, members: 11 */
/* sum members: 65, holes: 1, sum holes: 1 */
/* bit holes: 1, sum bit holes: 4 bits */
/* padding: 6 */
/* last cacheline: 8 bytes */
}
pahole after (note vlan_id):
struct net_bridge_fdb_entry {
struct hlist_node hlist; /* 0 16 */
struct net_bridge_port * dst; /* 16 8 */
long unsigned int updated; /* 24 8 */
long unsigned int used; /* 32 8 */
mac_addr addr; /* 40 6 */
__u16 vlan_id; /* 46 2 */
unsigned char is_local:1; /* 48: 7 1 */
unsigned char is_static:1; /* 48: 6 1 */
unsigned char added_by_user:1; /* 48: 5 1 */
unsigned char added_by_external_learn:1; /* 48: 4 1 */
/* XXX 4 bits hole, try to pack */
/* XXX 7 bytes hole, try to pack */
struct callback_head rcu; /* 56 16 */
/* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
/* size: 72, cachelines: 2, members: 11 */
/* sum members: 65, holes: 1, sum holes: 7 */
/* bit holes: 1, sum bit holes: 4 bits */
/* last cacheline: 8 bytes */
}
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables bridge vlan_protocol to be configured through netlink.
When CONFIG_BRIDGE_VLAN_FILTERING is disabled, kernel behaves the
same way as this feature is not implemented.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
.maxtype should match .policy. Probably just been getting lucky here
because IFLA_BRPORT_MAX > IFLA_BR_MAX.
Fixes: 13323516 ("bridge: implement rtnl_link_ops->changelink")
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent refactoring of the IGMP and MLD parsing code into
ipv6_mc_check_mld() / ip_mc_check_igmp() introduced a potential crash /
BUG() invocation for bridges:
I wrongly assumed that skb_get() could be used as a simple reference
counter for an skb which is not the case. skb_get() bears additional
semantics, a user count. This leads to a BUG() invocation in
pskb_expand_head() / kernel panic if pskb_may_pull() is called on an skb
with a user count greater than one - unfortunately the refactoring did
just that.
Fixing this by removing the skb_get() call and changing the API: The
caller of ipv6_mc_check_mld() / ip_mc_check_igmp() now needs to
additionally check whether the returned skb_trimmed is a clone.
Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cavium/Kconfig
The cavium conflict was overlapping dependency
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the ability to toggle the vlan filtering support via
netlink. Since we're already running with rtnl in .changelink() we don't
need to take any additional locks.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The address in the switchdev_obj_fdb structure is currently represented
as a pointer. Replacing it for a 6-byte array allows switchdev to carry
addresses directly read from hardware registers, not stored by the
switch chip driver (as in Rocker).
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 842a9ae08a ("bridge: Extend Proxy ARP design to allow optional rules for Wi-Fi")
Signed-off-by: David S. Miller <davem@davemloft.net>
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) A couple of cleanups for the netfilter core hook from Eric Biederman.
2) Net namespace hook registration, also from Eric. This adds a dependency with
the rtnl_lock. This should be fine by now but we have to keep an eye on this
because if we ever get the per-subsys nfnl_lock before rtnl we have may
problems in the future. But we have room to remove this in the future by
propagating the complexity to the clients, by registering hooks for the init
netns functions.
3) Update nf_tables to use the new net namespace hook infrastructure, also from
Eric.
4) Three patches to refine and to address problems from the new net namespace
hook infrastructure.
5) Switch to alternate jumpstack in xtables iff the packet is reentering. This
only applies to a very special case, the TEE target, but Eric Dumazet
reports that this is slowing down things for everyone else. So let's only
switch to the alternate jumpstack if the tee target is in used through a
static key. This batch also comes with offline precalculation of the
jumpstack based on the callchain depth. From Florian Westphal.
6) Minimal SCTP multihoming support for our conntrack helper, from Michal
Kubecek.
7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian
Westphal.
8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler.
9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of trying to access br->vlan_enabled directly use the provided
helper br_vlan_enabled().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch when a vid was not specified, the entry was added with
vid 0 which is useless when vlan_filtering is enabled. This patch makes
the entry to be added on all configured vlans when vlan filtering is
enabled and respectively deleted from all, if the entry vid is 0.
This is also closer to the way fdb works with regard to vid 0 and vlan
filtering.
Example:
Setup:
$ bridge vlan add vid 256 dev eth4
$ bridge vlan add vid 1024 dev eth4
$ bridge vlan add vid 64 dev eth3
$ bridge vlan add vid 128 dev eth3
$ bridge vlan
port vlan ids
eth3 1 PVID Egress Untagged
64
128
eth4 1 PVID Egress Untagged
256
1024
$ echo 1 > /sys/class/net/br0/bridge/vlan_filtering
Before:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp
After:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp vid 1
dev br0 port eth3 grp 239.0.0.1 temp vid 128
dev br0 port eth3 grp 239.0.0.1 temp vid 64
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge devices don't need to segment multiple tagged packets since thier
ports can segment them.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/s390/net/bpf_jit_comp.c
drivers/net/ethernet/ti/netcp_ethss.c
net/bridge/br_multicast.c
net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix checkpatch.pl "ERROR: do not initialise statics to 0 or NULL" for
all statics explicitly initialized to 0.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We can use union for most of the temporary cruft (original ipv4/ipv6
address, source mac, physoutdev) since they're used during different
stages of br netfilter traversal.
Also get rid of the last two ->mask users.
Shrinks struct from 48 to 32 on 64bit arch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since mdb states were introduced when deleting an entry the state was
left as it was set in the delete request from the user which leads to
the following output when doing a monitor (for example):
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 temp
^^^
Note the "temp" state in the delete notification which is wrong since
the entry was permanent, the state in a delete is always reported as
"temp" regardless of the real state of the entry.
After this patch:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
There's one important note to make here that the state is actually not
matched when doing a delete, so one can delete a permanent entry by
stating "temp" in the end of the command, I've chosen this fix in order
not to break user-space tools which rely on this (incorrect) behaviour.
So to give an example after this patch and using the wrong state:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
Note the state of the entry that got deleted is correct in the
notification.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: ccb1c31a7a ("bridge: add flags to distinguish permanent mdb entires")
Signed-off-by: David S. Miller <davem@davemloft.net>
When fast leave is configured on a bridge port and an IGMP leave is
received for a group, the group is not deleted immediately if there is
a router detected or if multicast querier is configured.
Ideally the group should be deleted immediately when fast leave is
configured.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several devices that can receive vlan tagged packets with
CHECKSUM_PARTIAL like tap, possibly veth and xennet.
When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
by bridge to a device with the IP_CSUM feature, they end up with checksum
error because before entering bridge, the network header is set to
ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.
Since the network header is exepected to be ETH_HLEN in flow-dissection
and hash-calculation in RPS in rx path, and since the header pointer fix
is needed only in tx path, set the appropriate network header on forwarding
packets.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
These should be handled only by the respective STP which is in control.
They become problematic for devices with limited resources with many
ports because the hold_timer is per port and fires each second and the
hello timer fires each 2 seconds even though it's global. While in
user-space STP mode these timers are completely unnecessary so it's better
to keep them off.
Also ensure that when the bridge is up these timers are started only when
running with kernel STP.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send notifications on router port add and del/expire, re-use the already
existing MDBA_ROUTER and send NEWMDB/DELMDB netlink notifications
respectively.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since slave_changelink support was added there have been a few race
conditions when using br_setport() since some of the port functions it
uses require the bridge lock. It is very easy to trigger a lockup due to
some internal spin_lock() usage without bh disabled, also it's possible to
get the bridge into an inconsistent state.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 3ac636b859 ("bridge: implement rtnl_link_ops->slave_changelink")
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes setting of vinfo.flags in the br_fill_ifvlaninfo_range() method. The
assignment of vinfo.flags &= ~BRIDGE_VLAN_INFO_RANGE_BEGIN has no effect and is
unneeded, as vinfo.flags value is overriden by the immediately following
vinfo.flags = flags | BRIDGE_VLAN_INFO_RANGE_END assignement.
Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bridge/br_mdb.c
br_mdb.c conflict was a function call being removed to fix a bug in
'net' but whose signature was changed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix:
net/bridge/br_if.c: In function 'br_dev_delete':
>> net/bridge/br_if.c:284:2: error: implicit declaration of function
>> 'br_multicast_dev_del' [-Werror=implicit-function-declaration]
br_multicast_dev_del(br);
^
cc1: some warnings being treated as errors
when igmp snooping is not defined.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the bridge (or port) is brought down/up flush only temp entries and
leave the perm ones. Flush perm entries only when deleting the bridge
device or the associated port.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Group notifications were not sent when a group expired or was deleted
due to bridge/port device being deleted. So add br_mdb_notify() to
br_multicast_del_pg().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the mdb add/del code was introduced there have been 2 br_mdb_notify
calls when doing br_mdb_add() resulting in 2 notifications on each add.
Example:
Command: bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
Before patch:
root@debian:~# bridge monitor all
[MDB]dev br0 port eth1 grp 239.0.0.1 permanent
[MDB]dev br0 port eth1 grp 239.0.0.1 permanent
After patch:
root@debian:~# bridge monitor all
[MDB]dev br0 port eth1 grp 239.0.0.1 permanent
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: cfd5675435 ("bridge: add support of adding and deleting mdb entries")
Signed-off-by: David S. Miller <davem@davemloft.net>
A report with INCLUDE/Change_to_include and empty source list should be
treated as a leave, specified by RFC 3376, section 3.1:
"If the requested filter mode is INCLUDE *and* the requested source
list is empty, then the entry corresponding to the requested
interface and multicast address is deleted if present. If no such
entry is present, the request is ignored."
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bridge/br_mdb.c
Minor conflict in br_mdb.c, in 'net' we added a memset of the
on-stack 'ip' variable whereas in 'net-next' we assign a new
member 'vid'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now all user mdb entries were added in vlan 0, this patch adds
support to allow the user to specify the vlan for the entry.
About the uapi change a hole in struct br_mdb_entry is used so the size
and offsets are kept the same (verified with pahole and tested with older
iproute2).
Example:
$ bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent vlan 2000
dev br0 port eth1 grp 239.0.0.1 permanent vlan 200
dev br0 port eth1 grp 239.0.0.1 permanent
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now when a querier was present static entries couldn't be deleted.
Fix this and allow the user to manipulate the mdb with or without a
querier.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c29390c6df ("xps: must clear sender_cpu before forwarding")
fixed an issue in normal forward path, caused by sender_cpu & napi_id
skb fields being an union.
Bridge is another point where skb can be forwarded, so we need
the same cure.
Bug triggers if packet was received on a NIC using skb_mark_napi_id()
Fixes: 2bd82484bb ("xps: fix xps for stacked devices")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Bob Liu <bob.liu@oracle.com>
Tested-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fill also the port group state when sending notifications.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree. This batch
mostly comes with patches to address fallout from the previous merge window
cycle, they are:
1) Use entry->state.hook_list from nf_queue() instead of the global nf_hooks
which is not valid when used from NFPROTO_NETDEV, this should cause no
problems though since we have no userspace queueing for that family, but
let's fix this now for the sake of correctness. Patch from Eric W. Biederman.
2) Fix compilation breakage in bridge netfilter if CONFIG_NF_DEFRAG_IPV4 is not
set, from Bernhard Thaler.
3) Use percpu jumpstack in arptables too, now that there's a single copy of the
rule blob we can't store the return address there anymore. Patch from
Florian Westphal.
4) Fix a skb leak in the xmit path of bridge netfilter, problem there since
2.6.37 although it should be not possible to hit invalid traffic there, also
from Florian.
5) Eric Leblond reports that when loading a large ruleset with many missing
modules after a fresh boot, nf_tables can take long time commit it. Fix this
by processing the full batch until the end, even on missing modules, then
abort only once and restart processing.
6) Add bridge netfilter files to the MAINTAINER files.
7) Fix a net_device refcount leak in the new IPV6 bridge netfilter code, from
Julien Grall.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit b0e9a30dd6 ("bridge: Add vlan id to multicast groups")
there's a check in br_ip_equal() for a matching vlan id, but the mdb
functions were not modified to use (or at least zero it) so when an
entry was added it would have a garbage vlan id (from the local br_ip
variable in __br_mdb_add/del) and this would prevent it from being
matched and also deleted. So zero out the whole local ip var to protect
ourselves from future changes and also to fix the current bug, since
there's no vlan id support in the mdb uapi - use always vlan id 0.
Example before patch:
root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
RTNETLINK answers: Invalid argument
After patch:
root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Fixes: b0e9a30dd6 ("bridge: Add vlan id to multicast groups")
Signed-off-by: David S. Miller <davem@davemloft.net>
Start the delete timer when adding temp static entries so they can expire.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: ccb1c31a7a ("bridge: add flags to distinguish permanent mdb entires")
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit efb6de9b4b "netfilter: bridge:
forward IPv6 fragmented packets" introduced a new function
br_validate_ipv6 which take a reference on the inet6 device. Although,
the reference is not released at the end.
This will result to the impossibility to destroy any netdevice using
ipv6 and bridge.
It's possible to directly retrieve the inet6 device without taking a
reference as all netfilter hooks are protected by rcu_read_lock via
nf_hook_slow.
Spotted while trying to destroy a Xen guest on the upstream Linux:
"unregister_netdevice: waiting for vif1.0 to become free. Usage count = 1"
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Cc: Bernhard Thaler <bernhard.thaler@wvnet.at>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: fw@strlen.de
Cc: ian.campbell@citrix.com
Cc: wei.liu2@citrix.com
Cc: Bob Liu <bob.liu@oracle.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Vlan ids 0 and 4095 were disallowed by commit:
8adff41c3d ("bridge: Don't use VID 0 and 4095 in vlan filtering")
but then the check was removed when vlan ranges were introduced by:
bdced7ef78 ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests")
So reintroduce the vlan range check.
Before patch:
[root@testvm ~]# bridge vlan add vid 0 dev eth0 master
(succeeds)
After Patch:
[root@testvm ~]# bridge vlan add vid 0 dev eth0 master
RTNETLINK answers: Invalid argument
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: bdced7ef78 ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests")
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_nf_dev_queue_xmit must free skb in its error path.
NF_DROP is misleading -- its an okfn, not a netfilter hook.
Fixes: 462fb2af97 ("bridge : Sanitize skb before it enters the IP stack")
Fixes: efb6de9b4b ("netfilter: bridge: forward IPv6 fragmented packets")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_nf_ip_fragment() is not needed when neither CONFIG_NF_DEFRAG_IPV4 nor
CONFIG_NF_DEFRAG_IPV6 is set.
struct brnf_frag_data must be available if either CONFIG_NF_DEFRAG_IPV4
or CONFIG_NF_DEFRAG_IPV6 is set.
Fixes: efb6de9b4b ("netfilter: bridge: forward IPv6 fragmented packets")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
Add a new argument to br_fdb_delete_by_port which allows to specify a
vid to match when flushing entries and use it in nbp_vlan_delete() to
flush the dynamically learned entries of the vlan/port pair when removing
a vlan from a port. Before this patch only the local mac was being
removed and the dynamically learned ones were left to expire.
Note that the do_all argument is still respected and if specified, the
vid will be ignored.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a comment to explain why we're not disabling port's multicast when it
goes in blocking state. Since there's a check in the timer's function which
bypasses the timer if the port's in blocking/disabled state, the timer will
simply expire and stop without sending more queries.
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/main.c
net/packet/af_packet.c
Both conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use vid_begin/end to be consistent with BRIDGE_VLAN_INFO_RANGE_BEGIN/END.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a port goes through a link down/up the multicast router configuration
is not restored.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 0909e11758 ("bridge: Add multicast_router sysfs entries")
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When STP is running in user-space and querier is configured, the
querier timer is not started when a port goes to a non-blocking state.
This patch unifies the user- and kernel-space stp multicast port enable
path and enables it in all states different from blocking. Note that when a
port goes in BR_STATE_DISABLED it's not enabled because that is handled
in the beginning of the port list loop.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resolve compilation breakage when CONFIG_IPV6 is not set by moving the IPv6
code into a separated br_netfilter_ipv6.c file.
Fixes: efb6de9b4b ("netfilter: bridge: forward IPv6 fragmented packets")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After the ->set() spinlocks were removed br_stp_set_bridge_priority
was left running without any protection when used via sysfs. It can
race with port add/del and could result in use-after-free cases and
corrupted lists. Tested by running port add/del in a loop with stp
enabled while setting priority in a loop, crashes are easily
reproducible.
The spinlocks around sysfs ->set() were removed in commit:
14f98f258f ("bridge: range check STP parameters")
There's also a race condition in the netlink priority support that is
fixed by this change, but it was introduced recently and the fixes tag
covers it, just in case it's needed the commit is:
af615762e9 ("bridge: add ageing_time, stp_state, priority over netlink")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Fixes: 14f98f258f ("bridge: range check STP parameters")
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to delete from offload the device externally learnded fdbs when any
one of these events happen:
1) Bridge ages out fdb. (When bridge is doing ageing vs. device doing
ageing. If device is doing ageing, it would send SWITCHDEV_FDB_DEL
directly).
2) STP state change flushes fdbs on port.
3) User uses sysfs interface to flush fdbs from bridge or bridge port:
echo 1 >/sys/class/net/BR_DEV/bridge/flush
echo 1 >/sys/class/net/BR_PORT/brport/flush
4) Offload driver send event SWITCHDEV_FDB_DEL to delete fdb entry.
For rocker, we can now get called to delete fdb entry in wait and nowait
contexts, so set NOWAIT flag when deleting fdb entry.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2:
Move struct switchdev_obj automatics to inner scope where there used.
v1:
To maintain backward compatibility with the existing iproute2 "bridge vlan"
command, let bridge's setlink/dellink handler call into either the port
driver's 8021q ndo ops or the port driver's bridge_setlink/dellink ops.
This allows port driver to choose 8021q ops or the newer
bridge_setlink/dellink ops when implementing VLAN add/del filtering on the
device. The iproute "bridge vlan" command does not need to be modified.
To summarize using the "bridge vlan" command examples, we have:
1) bridge vlan add|del vid VID dev DEV
Here iproute2 sets MASTER flag. Bridge's bridge_setlink/dellink is called.
Vlan is set on bridge for port. If port driver implements ndo 8021q ops,
call those to port driver can install vlan filter on device. Otherwise, if
port driver implements bridge_setlink/dellink ops, call those to install
vlan filter to device. This option only works if port is bridged.
2) bridge vlan add|del vid VID dev DEV master
Same as 1)
3) bridge vlan add|del vid VID dev DEV self
Bridge's bridge_setlink/dellink isn't called. Port driver's
bridge_setlink/dellink is called, if implemented. This option works if
port is bridged or not. If port is not bridged, a VLAN can still be
added/deleted to device filter using this variant.
4) bridge vlan add|del vid VID dev DEV master self
This is a combination of 1) and 3), but will only work if port is bridged.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This a bit large (and late) patchset that contains Netfilter updates for
net-next. Most relevantly br_netfilter fixes, ipset RCU support, removal of
x_tables percpu ruleset copy and rework of the nf_tables netdev support. More
specifically, they are:
1) Warn the user when there is a better protocol conntracker available, from
Marcelo Ricardo Leitner.
2) Fix forwarding of IPv6 fragmented traffic in br_netfilter, from Bernhard
Thaler. This comes with several patches to prepare the change in first place.
3) Get rid of special mtu handling of PPPoE/VLAN frames for br_netfilter. This
is not needed anymore since now we use the largest fragment size to
refragment, from Florian Westphal.
4) Restore vlan tag when refragmenting in br_netfilter, also from Florian.
5) Get rid of the percpu ruleset copy in x_tables, from Florian. Plus another
follow up patch to refine it from Eric Dumazet.
6) Several ipset cleanups, fixes and finally RCU support, from Jozsef Kadlecsik.
7) Get rid of parens in Netfilter Kconfig files.
8) Attach the net_device to the basechain as opposed to the initial per table
approach in the nf_tables netdev family.
9) Subscribe to netdev events to detect the removal and registration of a
device that is referenced by a basechain.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If bridge netfilter is used with both
bridge-nf-call-iptables and bridge-nf-filter-vlan-tagged enabled
then ip fragments in VLAN frames are sent without the vlan header.
This has never worked reliably. Turns out this relied on pre-3.5
behaviour where skb frag_list was used to store ip fragments;
ip_fragment() then re-used these skbs.
But since commit 3cc4949269
("ipv4: use skb coalescing in defragmentation") this is no longer
the case. ip_do_fragment now needs to allocate new skbs, but these
don't contain the vlan tag information anymore.
Fix it by storing vlan information of the ressembled skb in the
br netfilter percpu frag area, and restore them for each of the
fragments.
Fixes: 3cc4949269 ("ipv4: use skb coalescing in defragmentation")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
since commit d6b915e29f
("ip_fragment: don't forward defragmented DF packet") the largest
fragment size is available in the IPCB.
Therefore we no longer need to care about 'encapsulation'
overhead of stripped PPPOE/VLAN headers since ip_do_fragment
doesn't use device mtu in such cases.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPv6 fragmented packets are not forwarded on an ethernet bridge
with netfilter ip6_tables loaded. e.g. steps to reproduce
1) create a simple bridge like this
modprobe br_netfilter
brctl addbr br0
brctl addif br0 eth0
brctl addif br0 eth2
ifconfig eth0 up
ifconfig eth2 up
ifconfig br0 up
2) place a host with an IPv6 address on each side of the bridge
set IPv6 address on host A:
ip -6 addr add fd01:2345:6789:1::1/64 dev eth0
set IPv6 address on host B:
ip -6 addr add fd01:2345:6789:1::2/64 dev eth0
3) run a simple ping command on host A with packets > MTU
ping6 -s 4000 fd01:2345:6789:1::2
4) wait some time and run e.g. "ip6tables -t nat -nvL" on the bridge
IPv6 fragmented packets traverse the bridge cleanly until somebody runs.
"ip6tables -t nat -nvL". As soon as it is run (and netfilter modules are
loaded) IPv6 fragmented packets do not traverse the bridge any more (you
see no more responses in ping's output).
After applying this patch IPv6 fragmented packets traverse the bridge
cleanly in above scenario.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
[pablo@netfilter.org: small changes to br_nf_dev_queue_xmit]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Prepare check_hbh_len() to be called from newly introduced
br_validate_ipv6() in next commit.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_parse_ip_options() does not parse any IP options, it validates IP
packets as a whole and the function name is misleading.
Rename br_parse_ip_options() to br_validate_ipv4() and remove unneeded
commments.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently frag_max_size is member of br_input_skb_cb and copied back and
forth using IPCB(skb) and BR_INPUT_SKB_CB(skb) each time it is changed or
used.
Attach frag_max_size to nf_bridge_info and set value in pre_routing and
forward functions. Use its value in forward and xmit functions.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPv4 iptables allows to REDIRECT/DNAT/SNAT any traffic over a bridge.
e.g. REDIRECT
$ sysctl -w net.bridge.bridge-nf-call-iptables=1
$ iptables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
-j REDIRECT --to-ports 81
This does not work with ip6tables on a bridge in NAT66 scenario
because the REDIRECT/DNAT/SNAT is not correctly detected.
The bridge pre-routing (finish) netfilter hook has to check for a possible
redirect and then fix the destination mac address. This allows to use the
ip6tables rules for local REDIRECT/DNAT/SNAT REDIRECT similar to the IPv4
iptables version.
e.g. REDIRECT
$ sysctl -w net.bridge.bridge-nf-call-ip6tables=1
$ ip6tables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
-j REDIRECT --to-ports 81
This patch makes it possible to use IPv6 NAT66 on a bridge. It was tested
on a bridge with two interfaces using SNAT/DNAT NAT66 rules.
Reported-by: Artie Hamilton <artiemhamilton@yahoo.com>
Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
[bernhard.thaler@wvnet.at: rebased, add indirect call to ip6_route_input()]
[bernhard.thaler@wvnet.at: rebased, split into separate patches]
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Put br_nf_pre_routing_finish_ipv6() after daddr_was_changed() and
br_nf_pre_routing_finish_bridge() to prepare calling these functions
from there.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
use binary AND on complement of BRNF_NF_BRIDGE_PREROUTING to unset
bit in nf_bridge->mask.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since the addition of sysfs multicast router support if one set
multicast_router to "2" more than once, then the port would be added to
the hlist every time and could end up linking to itself and thus causing an
endless loop for rlist walkers.
So to reproduce just do:
echo 2 > multicast_router; echo 2 > multicast_router;
in a bridge port and let some igmp traffic flow, for me it hangs up
in br_multicast_flood().
Fix this by adding a check in br_multicast_add_router() if the port is
already linked.
The reason this didn't happen before the addition of multicast_router
sysfs entries is because there's a !hlist_unhashed check that prevents
it.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Fixes: 0909e11758 ("bridge: Add multicast_router sysfs entries")
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch the user-specified bridge port was ignored when
deleting an fdb entry and thus one could delete an entry that belonged
to any port.
Example (eth0 and eth1 are br0 ports):
bridge fdb add 00:11:22:33:44:55 dev eth0 master
bridge fdb del 00:11:22:33:44:55 dev eth1 master
(succeeds)
after the patch:
bridge fdb add 00:11:22:33:44:55 dev eth0 master
bridge fdb del 00:11:22:33:44:55 dev eth1 master
RTNETLINK answers: No such file or directory
Based on a patch by Wilson Kok.
Reported-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_update() can be called in process context in the following way:
br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set)
so we need to disable softirqs because there are softirq users of the
hash_lock. One easy way to reproduce this is to modify the bridge utility
to set NTF_USE, enable stp and then set maxageing to a low value so
br_fdb_cleanup() is called frequently and then just add new entries in
a loop. This happens because br_fdb_cleanup() is called from timer/softirq
context. The spin locks in br_fdb_update were _bh before commit f8ae737dee
("[BRIDGE]: forwarding remove unneeded preempt and bh diasables")
and at the time that commit was correct because br_fdb_update() couldn't be
called from process context, but that changed after commit:
292d139898 ("bridge: add NTF_USE support")
Using local_bh_disable/enable around br_fdb_update() allows us to keep
using the spin_lock/unlock in br_fdb_update for the fast-path.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 292d139898 ("bridge: add NTF_USE support")
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_update() can be called in process context in the following way:
br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set)
so we need to use spin_lock_bh because there are softirq users of the
hash_lock. One easy way to reproduce this is to modify the bridge utility
to set NTF_USE, enable stp and then set maxageing to a low value so
br_fdb_cleanup() is called frequently and then just add new entries in
a loop. This happens because br_fdb_cleanup() is called from timer/softirq
context. These locks were _bh before commit f8ae737dee
("[BRIDGE]: forwarding remove unneeded preempt and bh diasables")
and at the time that commit was correct because br_fdb_update() couldn't be
called from process context, but that changed after commit:
292d139898 ("bridge: add NTF_USE support")
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 292d139898 ("bridge: add NTF_USE support")
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/phy/amd-xgbe-phy.c
drivers/net/wireless/iwlwifi/Kconfig
include/net/mac80211.h
iwlwifi/Kconfig and mac80211.h were both trivial overlapping
changes.
The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
the bug fix that happened on the 'net' side is already integrated
into the rest of the amd-xgbe driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fix for net
The following patch reverts the ebtables chunk that enforces counters that was
introduced in the recently applied d26e2c9ffa ('Revert "netfilter: ensure
number of counters is >0 in do_replace()"') since this breaks ebtables.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This partially reverts commit 1086bbe97a ("netfilter: ensure number of
counters is >0 in do_replace()") in net/bridge/netfilter/ebtables.c.
Setting rules with ebtables does not work any more with 1086bbe97a place.
There is an error message and no rules set in the end.
e.g.
~# ebtables -t nat -A POSTROUTING --src 12:34:56:78:9a:bc -j DROP
Unable to update the kernel. Two possible causes:
1. Multiple ebtables programs were executing simultaneously. The ebtables
userspace tool doesn't by default support multiple ebtables programs
running
Reverting the ebtables part of 1086bbe97a makes this work again.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_multicast_query_expired() querier argument is a pointer to
a struct bridge_mcast_querier :
struct bridge_mcast_querier {
struct br_ip addr;
struct net_bridge_port __rcu *port;
};
Intent of the code was to clear port field, not the pointer to querier.
Fixes: 2cd4143192 ("bridge: memorize and export selected IGMP/MLD querier port")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Cc: Linus Lüssing <linus.luessing@web.de>
Cc: Steinar H. Gunderson <sesse@samfundet.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check in fdb_add_entry() if the source port should learn, similar
check is used in br_fdb_update.
Note that new fdb entries which are added manually or
as local ones are still permitted.
This patch has been tested by running traffic via a bridge port and
switching the port's state, also by manually adding/removing entries
from the bridge's fdb.
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Network managers like netifd (used in OpenWRT for instance) try to
configure interface options after creation but before setting the
interface up.
Unfortunately the sysfs / bridge currently only allows to configure the
hash_max and multicast_router options when the bridge interface is up.
But since br_multicast_init() doesn't start any timers and only sets
default values and initializes timers it should be save to reconfigure
the default values after that, before things actually get active after
the bridge is set up.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
When more than a multicast address is present in a MLDv2 report, all but
the first address is ignored, because the code breaks out of the loop if
there has not been an error adding that address.
This has caused failures when two guests connected through the bridge
tried to communicate using IPv6. Neighbor discoveries would not be
transmitted to the other guest when both used a link-local address and a
static address.
This only happens when there is a MLDv2 querier in the network.
The fix will only break out of the loop when there is a failure adding a
multicast address.
The mdb before the patch:
dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
After the patch:
dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::fb temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
dev ovirtmgmt port bond0.86 grp ff02::d temp
dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
dev ovirtmgmt port bond0.86 grp ff02::16 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp
Fixes: 08b202b672 ("bridge br_multicast: IPv6 MLD support.")
Reported-by: Rik Theys <Rik.Theys@esat.kuleuven.be>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Tested-by: Rik Theys <Rik.Theys@esat.kuleuven.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit c055d5b03b.
There are two issues:
'dnat_took_place' made me think that this is related to
-j DNAT/MASQUERADE.
But thats only one part of the story. This is also relevant for SNAT
when we undo snat translation in reverse/reply direction.
Furthermore, I originally wanted to do this mainly to avoid
storing ipv6 addresses once we make DNAT/REDIRECT work
for ipv6 on bridges.
However, I forgot about SNPT/DNPT which is stateless.
So we can't escape storing address for ipv6 anyway. Might as
well do it for ipv4 too.
Reported-and-tested-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After improving setsockopt() coverage in trinity, I started triggering
vmalloc failures pretty reliably from this code path:
warn_alloc_failed+0xe9/0x140
__vmalloc_node_range+0x1be/0x270
vzalloc+0x4b/0x50
__do_replace+0x52/0x260 [ip_tables]
do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
nf_setsockopt+0x65/0x90
ip_setsockopt+0x61/0xa0
raw_setsockopt+0x16/0x60
sock_common_setsockopt+0x14/0x20
SyS_setsockopt+0x71/0xd0
It turns out we don't validate that the num_counters field in the
struct we pass in from userspace is initialized.
The same problem also exists in ebtables, arptables, ipv6, and the
compat variants.
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When bridge netfilter re-fragments an IP packet for output, all
packets that can not be re-fragmented to their original input size
should be silently discarded.
However, current bridge netfilter output path generates an ICMP packet
with 'size exceeded MTU' message for such packets, this is a bug.
This patch refactors the ip_fragment() API to allow two separate
use cases. The bridge netfilter user case will not
send ICMP, the routing output will, as before.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next. Briefly
speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and
Serget Popovich, more incremental updates to make br_netfilter a better
place from Florian Westphal, ARP support to the x_tables mark match /
target from and context Zhang Chunyu and the addition of context to know
that the x_tables runs through nft_compat. More specifically, they are:
1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the
IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik.
2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef.
3) Use skb->network_header to calculate the transport offset in
ip_set_get_ip{4,6}_port(). From Alexander Drozdov.
4) Reduce memory consumption per element due to size miscalculation,
this patch and follow up patches from Sergey Popovich.
5) Expand nomatch field from 1 bit to 8 bits to allow to simplify
mtype_data_reset_flags(), also from Sergey.
6) Small clean for ipset macro trickery.
7) Fix error reporting when both ip_set_get_hostipaddr4() and
ip_set_get_extensions() from per-set uadt functions.
8) Simplify IPSET_ATTR_PORT netlink attribute validation.
9) Introduce HOST_MASK instead of hardcoded 32 in ipset.
10) Return true/false instead of 0/1 in functions that return boolean
in the ipset code.
11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute.
12) Allow to dereference from ext_*() ipset macros.
13) Get rid of incorrect definitions of HKEY_DATALEN.
14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match.
15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal.
16) Release nf_bridge_info after POSTROUTING since this is only needed
from the physdev match, also from Florian.
17) Reduce size of ipset code by deinlining ip_set_put_extensions(),
from Denys Vlasenko.
18) Oneliner to add ARP support to the x_tables mark match/target, from
Zhang Chunyu.
19) Add context to know if the x_tables extension runs from nft_compat,
to address minor problems with three existing extensions.
20) Correct return value in several seqfile *_show() functions in the
netfilter tree, from Joe Perches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we have four xtables extensions that cannot be used from the
xt over nft compat layer. The problem is that they need real access to
the full blown xt_entry to validate that the rule comes with the right
dependencies. This check was introduced to overcome the lack of
sufficient userspace dependency validation in iptables.
To resolve this problem, this patch introduces a new field to the
xt_tgchk_param structure that tell us if the extension is run from
nft_compat context.
The three affected extensions are:
1) CLUSTERIP, this target has been superseded by xt_cluster. So just
bail out by returning -EINVAL.
2) TCPMSS. Relax the checking when used from nft_compat. If used with
the wrong configuration, it will corrupt !syn packets by adding TCP
MSS option.
3) ebt_stp. Relax the check to make sure it uses the reserved
destination MAC address for STP.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
nf_bridge information is only needed for -m physdev, so we can always free
it after POST_ROUTING. This has the advantage that allocation and free will
typically happen on the same cpu.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The neigh_header is only needed when we detect DNAT after prerouting
and neigh cache didn't have a mac address for us.
The output port has not been chosen yet so we can re-use the storage
area, bringing struct size down to 32 bytes on x86_64.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Older gcc versions (e.g. gcc version 4.4.6) don't like anonymous unions
which was causing build issues on the newly added switchdev attr/obj
structs. Fix this by using named union on structs.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is revert of:
commit 68e331c785 ("bridge: offload bridge port attributes to switch asic
if feature flag set")
Restore br_dellink back to original and don't call into SELF port driver.
rtnetlink.c:bridge_dellink() already does a call into port driver for SELF.
bridge vlan add/del cmd defaults to MASTER. From man page for bridge vlan
add/del cmd:
self the vlan is configured on the specified physical device.
Required if the device is the bridge device.
master the vlan is configured on the software bridge (default).
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is revert of:
commit 68e331c785 ("bridge: offload bridge port attributes to switch asic
if feature flag set")
Restore br_setlink back to original and don't call into SELF port driver.
rtnetlink.c:bridge_setlink() already does a call into port driver for SELF.
bridge set link cmd defaults to MASTER. From man page for bridge link set
cmd:
self link setting is configured on specified physical device
master link setting is configured on the software bridge (default)
The link setting has two values: the device-side value and the software
bridge-side value. These are independent and settable using the bridge
link set cmd by specifying some combination of [master] | [self].
Furthermore, the device-side and bridge-side settings have their own
initial value, viewable from bridge -d link show cmd.
Restoring br_setlink back to original makes rocker (the only in-kernel user
of SELF link settings) work as first implement: two-sided values.
It's true that when both MASTER and SELF are specified from the command,
two netlink notifications are generated, one for each side of the settings.
The user-space app can distiquish between the two notifications by
observing the MASTER or SELF flag.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
STP update is just a settable port attribute, so convert
switchdev_port_stp_update to an attr set.
For DSA, the prepare phase is skipped and STP updates are only done in the
commit phase. This is because currently the DSA drivers don't need to
allocate any memory for STP updates and the STP update will not fail to HW
(unless something horrible goes wrong on the MDIO bus, in which case the
prepare phase wouldn't have been able to predict anyway).
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto).
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BR_GROUPFWD_RESTRICTED bitmask restricts users from setting values to
/sys/class/net/brX/bridge/group_fwd_mask that allow forwarding of
some IEEE 802.1D Table 7-10 Reserved addresses:
(MAC Control) 802.3 01-80-C2-00-00-01
(Link Aggregation) 802.3 01-80-C2-00-00-02
802.1AB LLDP 01-80-C2-00-00-0E
Change BR_GROUPFWD_RESTRICTED to allow to forward LLDP frames and document
group_fwd_mask.
e.g.
echo 16384 > /sys/class/net/brX/bridge/group_fwd_mask
allows to forward LLDP frames.
This may be needed for bridge setups used for network troubleshooting or
any other scenario where forwarding of LLDP frames is desired (e.g. bridge
connecting a virtual machine to real switch transmitting LLDP frames that
virtual machine needs to receive).
Tested on a simple bridge setup with two interfaces and host transmitting
LLDP frames on one side of this bridge (used lldpd). Setting group_fwd_mask
as described above lets LLDP frames traverse bridge.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, the IGMP and MLD message validation functions are moved
from the bridge code to IPv4/IPv6 multicast files. Some small
refactoring was done to enhance readibility and to iron out some
differences in behaviour between the IGMP and MLD parsing code (e.g. the
skb-cloning of MLD messages is now only done if necessary, just like the
IGMP part always did).
Finally, these IGMP and MLD message validation functions are exported so
that not only the bridge can use it but batman-adv later, too.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's use these new, neat helpers.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.
Libraries like libnl will wait forever for NLMSG_DONE.
Fixes: e5a55a8987 ("net: create generic bridge ops")
Fixes: 815cccbf10 ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Sathya Perla <sathya.perla@emulex.com>
CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
CC: Ajit Khaparde <ajit.khaparde@emulex.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: intel-wired-lan@lists.osuosl.org
CC: Jiri Pirko <jiri@resnulli.us>
CC: Scott Feldman <sfeldma@gmail.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.
Libraries like libnl will wait forever for NLMSG_DONE.
Fixes: 37a393bc49 ("bridge: notify mdb changes via netlink")
CC: Cong Wang <amwang@redhat.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch the nf_tables registers from 128 bit addressing to 32 bit
addressing to support so called concatenations, where multiple values
can be concatenated over multiple registers for O(1) exact matches of
multiple dimensions using sets.
The old register values are mapped to areas of 128 bits for compatibility.
When dumping register numbers, values are expressed using the old values
if they refer to the beginning of a 128 bit area for compatibility.
To support concatenations, register loads of less than a full 32 bit
value need to be padded. This mainly affects the payload and exthdr
expressions, which both unconditionally zero the last word before
copying the data.
Userspace fully passes the testsuite using both old and new register
addressing.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add helper functions to parse and dump register values in netlink attributes.
These helpers will later be changed to take care of translation between the
old 128 bit and the new 32 bit register numbers.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simple conversion to use u32 pointers to the beginning of the registers
to keep follow up patches smaller.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Replace the array of registers passed to expressions by a struct nft_regs,
containing the verdict as a seperate member, which aliases to the
NFT_REG_VERDICT register.
This is needed to seperate the verdict from the data registers completely,
so their size can be changed.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All users of nft_validate_register_store() first invoke
nft_validate_output_register(). There is in fact no use for using it
on its own, so simplify the code by folding the functionality into
nft_validate_register_store() and kill it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The existing name is ambiguous, data is loaded as well when we read from
a register. Rename to nft_validate_register_store() for clarity and
consistency with the upcoming patch to introduce its counterpart,
nft_validate_register_load().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For values spanning multiple registers, we need to validate that enough
space is available from the destination register onwards. Add a len
argument to nft_validate_data_load() and consolidate the existing length
validations in preparation of that.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree.
They are:
* nf_tables set timeout infrastructure from Patrick Mchardy.
1) Add support for set timeout support.
2) Add support for set element timeouts using the new set extension
infrastructure.
4) Add garbage collection helper functions to get rid of stale elements.
Elements are accumulated in a batch that are asynchronously released
via RCU when the batch is full.
5) Add garbage collection synchronization helpers. This introduces a new
element busy bit to address concurrent access from the netlink API and the
garbage collector.
5) Add timeout support for the nft_hash set implementation. The garbage
collector peridically checks for stale elements from the workqueue.
* iptables/nftables cgroup fixes:
6) Ignore non full-socket objects from the input path, otherwise cgroup
match may crash, from Daniel Borkmann.
7) Fix cgroup in nf_tables.
8) Save some cycles from xt_socket by skipping packet header parsing when
skb->sk is already set because of early demux. Also from Daniel.
* br_netfilter updates from Florian Westphal.
9) Save frag_max_size and restore it from the forward path too.
10) Use a per-cpu area to restore the original source MAC address when traffic
is DNAT'ed.
11) Add helper functions to access physical devices.
12) Use these new physdev helper function from xt_physdev.
13) Add another nf_bridge_info_get() helper function to fetch the br_netfilter
state information.
14) Annotate original layer 2 protocol number in nf_bridge info, instead of
using kludgy flags.
15) Also annotate the pkttype mangling when the packet travels back and forth
from the IP to the bridge layer, instead of using a flag.
* More nf_tables set enhancement from Patrick:
16) Fix possible usage of set variant that doesn't support timeouts.
17) Avoid spurious "set is full" errors from Netlink API when there are pending
stale elements scheduled to be released.
18) Restrict loop checks to set maps.
19) Add support for dynamic set updates from the packet path.
20) Add support to store optional user data (eg. comments) per set element.
BTW, I have also pulled net-next into nf-next to anticipate the conflict
resolution between your okfn() signature changes and Florian's br_netfilter
updates.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
More recent GCC warns about two kinds of switch statement uses:
1) Switching on an enumeration, but not having an explicit case
statement for all members of the enumeration. To show the
compiler this is intentional, we simply add a default case
with nothing more than a break statement.
2) Switching on a boolean value. I think this warning is dumb
but nevertheless you get it wholesale with -Wswitch.
This patch cures all such warnings in netfilter.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_bridge_info->mask is used for several things, for example to
remember if skb->pkt_type was set to OTHER_HOST.
For a bridge, OTHER_HOST is expected case. For ip forward its a non-starter
though -- routing expects PACKET_HOST.
Bridge netfilter thus changes OTHER_HOST to PACKET_HOST before hook
invocation and then un-does it after hook traversal.
This information is irrelevant outside of br_netfilter.
After this change, ->mask now only contains flags that need to be
known outside of br_netfilter in fast-path.
Future patch changes mask into a 2bit state field in sk_buff, so that
we can remove skb->nf_bridge pointer for good and consider all remaining
places that access nf_bridge info content a not-so fastpath.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
->mask is a bit info field that mixes various use cases.
In particular, we have flags that are mutually exlusive, and flags that
are only used within br_netfilter while others need to be exposed to
other parts of the kernel.
Remove BRNF_8021Q/PPPoE flags. They're mutually exclusive and only
needed within br_netfilter context.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Don't access skb->nf_bridge directly, this pointer will be removed soon.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_netfilter maintains an extra state, nf_bridge_info, which is attached
to skb via skb->nf_bridge pointer.
Amongst other things we use skb->nf_bridge->data to store the original
mac header for every processed skb.
This is required for ip refragmentation when using conntrack
on top of bridge, because ip_fragment doesn't copy it from original skb.
However there is no need anymore to do this unconditionally.
Move this to the one place where its needed -- when br_netfilter calls
ip_fragment().
Also switch to percpu storage for this so we can handle fragmenting
without accessing nf_bridge meta data.
Only user left is neigh resolution when DNAT is detected, to hold
the original source mac address (neigh resolution builds new mac header
using bridge mac), so rename ->data and reduce its size to whats needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.
And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.
We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.
The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.
Signed-off-by: David S. Miller <davem@davemloft.net>
The goal of this patch is to prepare the removal of the iflink field. It
introduces a new ndo function, which will be implemented by virtual interfaces.
There is no functional change into this patch. All readers of iflink field
now call dev_get_iflink().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We also need to save/store in forward, else br_parse_ip_options call
will zero frag_max_size as well.
Fixes: 93fdd47e5 ('bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING')
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next.
Basically, more incremental updates for br_netfilter from Florian
Westphal, small nf_tables updates (including one fix for rb-tree
locking) and small two-liner to add extra validation for the REJECT6
target.
More specifically, they are:
1) Use the conntrack status flags from br_netfilter to know that DNAT is
happening. Patch for Florian Westphal.
2) nf_bridge->physoutdev == NULL already indicates that the traffic is
bridged, so let's get rid of the BRNF_BRIDGED flag. Also from Florian.
3) Another patch to prepare voidization of seq_printf/seq_puts/seq_putc,
from Joe Perches.
4) Consolidation of nf_tables_newtable() error path.
5) Kill nf_bridge_pad used by br_netfilter from ip_fragment(),
from Florian Westphal.
6) Access rb-tree root node inside the lock and remove unnecessary
locking from the get path (we already hold nfnl_lock there), from
Patrick McHardy.
7) You cannot use a NFT_SET_ELEM_INTERVAL_END when the set doesn't
support interval, also from Patrick.
8) Enforce IP6T_F_PROTO from ip6t_REJECT to make sure the core is
actually restricting matches to TCP.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The br_netfilter frag output function calls skb_cow_head() so in
case it needs a larger headroom to e.g. re-add a previously stripped PPPOE
or VLAN header things will still work (at cost of reallocation).
We can then move nf_bridge_encap_header_len to br_netfilter.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
net/core/sysctl_net_core.c
net/ipv4/inet_diag.c
The be_main.c conflict resolution was really tricky. The conflict
hunks generated by GIT were very unhelpful, to say the least. It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().
So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged. And this worked beautifully.
The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.
Signed-off-by: David S. Miller <davem@davemloft.net>
Its not needed anymore since 2bf540b73e
([NETFILTER]: bridge-netfilter: remove deferred hooks).
Before this it was possible to have physoutdev set for locally generated
packets -- this isn't the case anymore:
BRNF_STATE_BRIDGED flag is set when we assign nf_bridge->physoutdev,
so physoutdev != NULL means BRNF_STATE_BRIDGED is set.
If physoutdev is NULL, then we are looking at locally-delivered and
routed packet.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ask conntrack instead of storing ipv4 address in nf_bridge_info->data.
Ths avoids the need to use ->data during NF_PRE_ROUTING.
Only two functions that need ->data remain.
These will be addressed in followup patches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On adding an interface br_add_if() sets the MTU to the min of
all the interfaces. Do the same thing on removing an interface too
in br_del_if.
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e5de75b ("netfilter: bridge: move DNAT helper to br_netfilter") results
in the following link problem:
net/bridge/br_device.c:29: undefined reference to `br_nf_prerouting_finish_bridge`
Moreover it creates a hard dependency between br_netfilter and the
bridge core, which is what we've been trying to avoid so far.
Resolve this problem by using a hook structure so we reduce #ifdef
pollution and keep bridge netfilter specific code under br_netfilter.c
which was the original intention.
Reported-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
make C=1 CF=-D__CHECK_ENDIAN__ shows following:
net/bridge/netfilter/nft_reject_bridge.c:65:50: warning: incorrect type in argument 3 (different base types)
net/bridge/netfilter/nft_reject_bridge.c:65:50: expected restricted __be16 [usertype] protocol [..]
net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: cast from restricted __be16
net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: incorrect type in argument 1 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:121:50: warning: incorrect type in argument 3 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:168:52: warning: incorrect type in argument 3 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:233:52: warning: incorrect type in argument 3 (different base types) [..]
Caused by two (harmless) errors:
1. htons() instead of ntohs()
2. __be16 for protocol in nf_reject_ipXhdr_put API, use u8 instead.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After my change to neigh_hh_init to obtain the protocol from the
neigh_table there are no more users of protocol in struct dst_ops.
Remove the protocol field from dst_ops and all of it's initializers.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Basically, improvements for the packet rejection infrastructure,
deprecation of CLUSTERIP, cleanups for nf_tables and some untangling for
br_netfilter. More specifically they are:
1) Send packet to reset flow if checksum is valid, from Florian Westphal.
2) Fix nf_tables reject bridge from the input chain, also from Florian.
3) Deprecate the CLUSTERIP target, the cluster match supersedes it in
functionality and it's known to have problems.
4) A couple of cleanups for nf_tables rule tracing infrastructure, from
Patrick McHardy.
5) Another cleanup to place transaction declarations at the bottom of
nf_tables.h, also from Patrick.
6) Consolidate Kconfig dependencies wrt. NF_TABLES.
7) Limit table names to 32 bytes in nf_tables.
8) mac header copying in bridge netfilter is already required when
calling ip_fragment(), from Florian Westphal.
9) move nf_bridge_update_protocol() to br_netfilter.c, also from
Florian.
10) Small refactor in br_netfilter in the transmission path, again from
Florian.
11) Move br_nf_pre_routing_finish_bridge_slow() to br_netfilter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Only one caller, there is no need to keep this in a header.
Move it to br_netfilter.c where this belongs to.
Based on patch from Florian Westphal.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The mac header only has to be copied back into the skb for
fragments generated by ip_fragment(), which only happens
for bridge forwarded packets with nf-call-iptables=1 && active nf_defrag.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This extends the design in commit 958501163d ("bridge: Add support for
IEEE 802.11 Proxy ARP") with optional set of rules that are needed to
meet the IEEE 802.11 and Hotspot 2.0 requirements for ProxyARP. The
previously added BR_PROXYARP behavior is left as-is and a new
BR_PROXYARP_WIFI alternative is added so that this behavior can be
configured from user space when required.
In addition, this enables proxyarp functionality for unicast ARP
requests for both BR_PROXYARP and BR_PROXYARP_WIFI since it is possible
to use unicast as well as broadcast for these frames.
The key differences in functionality:
BR_PROXYARP:
- uses the flag on the bridge port on which the request frame was
received to determine whether to reply
- block bridge port flooding completely on ports that enable proxy ARP
BR_PROXYARP_WIFI:
- uses the flag on the bridge port to which the target device of the
request belongs
- block bridge port flooding selectively based on whether the proxyarp
functionality replied
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the STP timer fires, it can call br_ifinfo_notify(),
which in turn ends up in the new br_get_link_af_size().
This function is annotated to be using RTNL locking, which
clearly isn't the case here, and thus lockdep warns:
===============================
[ INFO: suspicious RCU usage. ]
3.19.0+ #569 Not tainted
-------------------------------
net/bridge/br_private.h:204 suspicious rcu_dereference_protected() usage!
Fix this by doing RCU locking here.
Fixes: b7853d73e3 ("bridge: add vlan info to bridge setlink and dellink notification messages")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/rocker/rocker.c
The rocker commit was two overlapping changes, one to rename
the ->vport member to ->pport, and another making the bitmask
expression use '1ULL' instead of plain '1'.
Signed-off-by: David S. Miller <davem@davemloft.net>
make build fail if structure no longer fits into ->cb storage.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge reject handling is not straightforward, there are many subtle
differences depending on configuration.
skb->dev is either the bridge port (PRE_ROUTING) or the bridge
itself (INPUT), so we need to use indev instead.
Also, checksum validation will only work reliably if we trim skb
according to the l3 header size.
While at it, add csum validation for ipv6 and skip existing tests
if skb was already checked e.g. by GRO.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
my previous patch skipped vlan range optimizations during skb size
calculations for simplicity.
This incremental patch considers vlan ranges during
skb size calculations. This leads to a bit of code duplication
in the fill and size calculation functions. But, I could not find a
prettier way to do this. will take any suggestions.
Previously, I had reused the existing br_get_link_af_size size calculation
function to calculate skb size for notifications. Reusing it this time
around creates some change in behaviour issues for the usual
.get_link_af_size callback.
This patch adds a new br_get_link_af_size_filtered() function to
base the size calculation on the incoming filter flag and include
vlan ranges.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vlan add/deletes are not notified to userspace today. This patch adds
vlan info to bridge newlink/dellink notifications generated from the
bridge driver. Notifications use the RTEXT_FILTER_BRVLAN_COMPRESSED
flag to compress vlans into ranges whereever applicable.
The size calculations does not take ranges into account for
simplicity. This has the potential for allocating a larger skb than
required.
There is an existing inconsistency with bridge NEWLINK and DELLINK
change notifications. Both generate NEWLINK notifications. Since its
always a NEWLINK notification, this patch includes all vlans the port
belongs to in the notification. The NEWLINK and DELLINK request
messages however only include the vlans to be added and deleted.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_SYSCTL=n:
net/bridge/br_netfilter.c: In function ‘br_netfilter_init’:
net/bridge/br_netfilter.c:996: warning: label ‘err1’ defined but not used
Move the label and the code after it inside the existing #ifdef to get
rid of the warning.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge's default_pvid adds a vid by default, by which we cannot add a
non-vlan fdb entry by default, because br_fdb_add() adds fdb entries for
all vlans instead of a non-vlan one when any vlan is configured.
# ip link add br0 type bridge
# ip link set eth0 master br0
# bridge fdb add 12:34:56:78:90:ab dev eth0 master temp
# bridge fdb show brport eth0 | grep 12:34:56:78:90:ab
12:34:56:78:90:ab dev eth0 vlan 1 static
We expect a non-vlan fdb entry as well as vlan 1:
12:34:56:78:90:ab dev eth0 static
To fix this, we need to insert a non-vlan fdb entry if vlan is not
specified, even when any vlan is configured.
Fixes: 5be5a2df40 ("bridge: Add filtering support for default_pvid")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a missing bridge port check caught by smatch.
setlink/dellink of attributes like vlans can come for a bridge device
and there is no need to offload those today. So, this patch adds a bridge
port check. (In these cases however, the BRIDGE_SELF flags will always be set
and we may not hit a problem with the current code).
smatch complaint:
The patch 68e331c785: "bridge: offload bridge port attributes to
switch asic if feature flag set" from Jan 29, 2015, leads to the
following Smatch complaint:
net/bridge/br_netlink.c:552 br_setlink()
error: we previously assumed 'p' could be null (see line 518)
net/bridge/br_netlink.c
517
518 if (p && protinfo) {
^
Check for NULL.
Reported-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
When 'learned_sync' flag is turned on, the offloaded switch
port syncs learned MAC addresses to bridge's FDB via switchdev notifier
(NETDEV_SWITCH_FDB_ADD). Currently, FDB entries learnt via this mechanism are
wrongly being deleted by bridge aging logic. This patch ensures that FDB
entries synced from offloaded switch ports are not deleted by bridging logic.
Such entries can only be deleted via switchdev notifier
(NETDEV_SWITCH_FDB_DEL).
Signed-off-by: Siva Mannem <siva.mannem.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to set/del bridge port attributes in hardware from
the bridge driver.
With this, when the user sends a bridge setlink message with no flags or
master flags set,
- the bridge driver ndo_bridge_setlink handler sets settings in the kernel
- calls the swicthdev api to propagate the attrs to the switchdev
hardware
You can still use the self flag to go to the switch hw or switch port
driver directly.
With this, it also makes sure a notification goes out only after the
attributes are set both in the kernel and hw.
The patch calls switchdev api only if BRIDGE_FLAGS_SELF is not set.
This is because the offload cases with BRIDGE_FLAGS_SELF are handled in
the caller (in rtnetlink.c).
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge flags are needed inside ndo_bridge_setlink/dellink handlers to
avoid another call to parse IFLA_AF_SPEC inside these handlers
This is used later in this series
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
They are all either written once or extremly rarely (e.g. from init
code), so we can move them to the .data..read_mostly section.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Static checkers complain that we should maybe set "ret" before we do the
"goto out;". They interpret the NULL return from br_port_get_rtnl() as
a failure and forgetting to set the error code is a common bug in this
situation.
The code is confusing but it's actually correct. We are returning zero
deliberately. Let's re-write it a bit to be more clear.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA-enabled master network devices with a switch tagging protocol should
strip the protocol specific format before handing the frame over to
higher layer.
When adding such a DSA master network device as a bridge member, we go
through the following code path when receiving a frame:
__netif_receive_skb_core
-> first ptype check against ptype_all is not returning any
handler for this skb
-> check and invoke rx_handler:
-> deliver frame to the bridge layer: br_handle_frame
DSA registers a ptype handler with the fake ETH_XDSA ethertype, which is
called *after* the bridge-layer rx_handler has run. br_handle_frame()
tries to parse the frame it received from the DSA master network device,
and will not be able to match any of its conditions and jumps straight
at the end of the end of br_handle_frame() and returns
RX_HANDLER_CONSUMED there.
Since we returned RX_HANDLER_CONSUMED, __netif_receive_skb_core() stops
RX processing for this frame and returns NET_RX_SUCCESS, so we never get
a chance to call our switch tag packet processing logic and deliver
frames to the DSA slave network devices, and so we do not get any
functional bridge members at all.
Instead of cluttering the bridge receive path with DSA-specific checks,
and rely on assumptions about how __netif_receive_skb_core() is
processing frames, we simply deny adding the DSA master network device
(conduit interface) as a bridge member, leaving only the slave DSA
network devices to be bridge members, since those will work correctly in
all circumstances.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The user can crash the kernel if it uses any of the existing NAT
expressions from the wrong hook, so add some code to validate this
when loading the rule.
This patch introduces nft_chain_validate_hooks() which is based on
an existing function in the bridge version of the reject expression.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.
This makes the very common pattern of
if (genlmsg_end(...) < 0) { ... }
be a whole bunch of dead code. Many places also simply do
return nlmsg_end(...);
and the caller is expected to deal with it.
This also commonly (at least for me) causes errors, because it is very
common to write
if (my_function(...))
/* error condition */
and if my_function() does "return nlmsg_end()" this is of course wrong.
Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.
Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did
- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;
I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.
One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch benefits from newly introduced switchdev notifier and uses it
to propagate fdb learn events from rocker driver to bridge. That avoids
direct function calls and possible use by other listeners (ovs).
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
problems with bridge getlink/setlink notifications today:
- bridge setlink generates two notifications to userspace
- one from the bridge driver
- one from rtnetlink.c (rtnl_bridge_notify)
- dellink generates one notification from rtnetlink.c. Which
means bridge setlink and dellink notifications are not
consistent
- Looking at the code it appears,
If both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF were set,
the size calculation in rtnl_bridge_notify can be wrong.
Example: if you set both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF
in a setlink request to rocker dev, rtnl_bridge_notify will
allocate skb for one set of bridge attributes, but,
both the bridge driver and rocker dev will try to add
attributes resulting in twice the number of attributes
being added to the skb. (rocker dev calls ndo_dflt_bridge_getlink)
There are multiple options:
1) Generate one notification including all attributes from master and self:
But, I don't think it will work, because both master and self may use
the same attributes/policy. Cannot pack the same set of attributes in a
single notification from both master and slave (duplicate attributes).
2) Generate one notification from master and the other notification from
self (This seems to be ideal):
For master: the master driver will send notification (bridge in this
example)
For self: the self driver will send notification (rocker in the above
example. It can use helpers from rtnetlink.c to do so. Like the
ndo_dflt_bridge_getlink api).
This patch implements 2) (leaving the 'rtnl_bridge_notify' around to be used
with 'self').
v1->v2 :
- rtnl_bridge_notify is now called only for self,
so, remove 'BRIDGE_FLAGS_SELF' check and cleanup a few things
- rtnl_bridge_dellink used to always send a RTM_NEWLINK msg
earlier. So, I have changed the notification from br_dellink to
go as RTM_NEWLINK
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is just a cleanup, because in the current code MDBA_SET_ENTRY_MAX ==
MDBA_SET_ENTRY.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/xen-netfront.c
Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.
Signed-off-by: David S. Miller <davem@davemloft.net>
When IPV4 support is disabled, we cannot call arp_send from
the bridge code, which would result in a kernel link error:
net/built-in.o: In function `br_handle_frame_finish':
:(.text+0x59914): undefined reference to `arp_send'
:(.text+0x59a50): undefined reference to `arp_tbl'
This makes the newly added proxy ARP support in the bridge
code depend on the CONFIG_INET symbol and lets the compiler
optimize the code out to avoid the link error.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Cc: Kyeyoon Park <kyeyoonp@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same macros are used for rx as well. So rename it.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/bridge/br_netlink.c: In function ‘br_fill_ifinfo’:
net/bridge/br_netlink.c:146:32: warning: ‘vid_range_flags’ may be used uninitialized in this function [-Wmaybe-uninitialized]
err = br_fill_ifvlaninfo_range(skb, vid_range_start,
^
net/bridge/br_netlink.c:108:6: note: ‘vid_range_flags’ was declared here
u16 vid_range_flags;
Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds new function to pack vlans into ranges
whereever applicable using the flags BRIDGE_VLAN_INFO_RANGE_BEGIN
and BRIDGE VLAN_INFO_RANGE_END
Old vlan packing code is moved to a new function and continues to be
called when filter_mask is RTEXT_FILTER_BRVLAN.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes bridge IFLA_AF_SPEC netlink attribute parser to
look for more than one IFLA_BRIDGE_VLAN_INFO attribute. This allows
userspace to pack more than one vlan in the setlink msg.
The dumps were already sending more than one vlan info in the getlink msg.
This patch also adds bridge_vlan_info flags BRIDGE_VLAN_INFO_RANGE_BEGIN and
BRIDGE_VLAN_INFO_RANGE_END to indicate start and end of vlan range
This patch also deletes unused ifla_br_policy.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently a bridge device turns off TSO feature if no bridge ports
support it. We can always enable it, since packets can be segmented on
ports by software as well as on the bridge device.
This will reduce the number of packets processed in the bridge.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add checking whether the call to ndo_dflt_fdb_dump is needed.
It is not expected to call ndo_dflt_fdb_dump unconditionally
by some drivers (i.e. qlcnic or macvlan) that defines
own ndo_fdb_dump. Other drivers define own ndo_fdb_dump
and don't want ndo_dflt_fdb_dump to be called at all.
At the same time it is desirable to call the default dump
function on a bridge device.
Fix attributes that are passed to dev->netdev_ops->ndo_fdb_dump.
Add extra checking in br_fdb_dump to avoid duplicate entries
as now filter_dev can be NULL.
Following tests for filtering have been performed before
the change and after the patch was applied to make sure
they are the same and it doesn't break the filtering algorithm.
[root@localhost ~]# cd /root/iproute2-3.18.0/bridge
[root@localhost bridge]# modprobe dummy
[root@localhost bridge]# ./bridge fdb add f1:f2:f3:f4:f5:f6 dev dummy0
[root@localhost bridge]# brctl addbr br0
[root@localhost bridge]# brctl addif br0 dummy0
[root@localhost bridge]# ip link set dev br0 address 02:00:00:12:01:04
[root@localhost bridge]# # show all
[root@localhost bridge]# ./bridge fdb show
33:33:00:00:00:01 dev p2p1 self permanent
01:00:5e:00:00:01 dev p2p1 self permanent
33:33:ff:ac:ce:32 dev p2p1 self permanent
33:33:00:00:02:02 dev p2p1 self permanent
01:00:5e:00:00:fb dev p2p1 self permanent
33:33:00:00:00:01 dev p7p1 self permanent
01:00:5e:00:00:01 dev p7p1 self permanent
33:33:ff:79:50:53 dev p7p1 self permanent
33:33:00:00:02:02 dev p7p1 self permanent
01:00:5e:00:00:fb dev p7p1 self permanent
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by bridge
[root@localhost bridge]# ./bridge fdb show br br0
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by port
[root@localhost bridge]# ./bridge fdb show brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]# # filter by port + bridge
[root@localhost bridge]# ./bridge fdb show br br0 brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]#
Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains netfilter updates for net-next. Basically,
enhancements for xt_recent, skip zeroing of timer in conntrack, fix
linking problem with recent redirect support for nf_tables, ipset
updates and a couple of cleanups. More specifically, they are:
1) Rise maximum number per IP address to be remembered in xt_recent
while retaining backward compatibility, from Florian Westphal.
2) Skip zeroing timer area in nf_conn objects, also from Florian.
3) Inspect IPv4 and IPv6 traffic from the bridge to allow filtering using
using meta l4proto and transport layer header, from Alvaro Neira.
4) Fix linking problems in the new redirect support when CONFIG_IPV6=n
and IP6_NF_IPTABLES=n.
And ipset updates from Jozsef Kadlecsik:
5) Support updating element extensions when the set is full (fixes
netfilter bugzilla id 880).
6) Fix set match with 32-bits userspace / 64-bits kernel.
7) Indicate explicitly when /0 networks are supported in ipset.
8) Simplify cidr handling for hash:*net* types.
9) Allocate the proper size of memory when /0 networks are supported.
10) Explicitly add padding elements to hash:net,net and hash:net,port,
because the elements must be u32 sized for the used hash function.
Jozsef is also cooking ipset RCU conversion which should land soon if
they reach the merge window in time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
When the swdev device learns a new mac/vlan on a port, it sends some async
notification to the driver and the driver installs an FDB in the device.
To give a holistic system view, the learned mac/vlan should be reflected
in the bridge's FBD table, so the user, using normal iproute2 cmds, can view
what is currently learned by the device. This API on the bridge driver gives
a way for the swdev driver to install an FBD entry in the bridge FBD table.
(And remove one).
This is equivalent to the device running these cmds:
bridge fdb [add|del] <mac> dev <dev> vid <vlan id> master
This patch needs some extra eyeballs for review, in paricular around the
locking and contexts.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To notify switch driver of change in STP state of bridge port, add new
.ndo op and provide switchdev wrapper func to call ndo op. Use it in bridge
code then.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do the work of parsing NDA_VLAN directly in rtnetlink code, pass simple
u16 vid to drivers from there.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current name might seem that this actually offloads the fdb entry to
hw. So rename it to clearly present that this for hardware address
addition/removal.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the missing bits to allow to match per meta l4proto from
the bridge. Example:
nft add rule bridge filter input ether type {ip, ip6} meta l4proto udp counter
Signed-off-by: Alvaro Neira Ayuso <alvaroneay@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch exports the functions nft_reject_iphdr_validate and
nft_reject_ip6hdr_validate to use it in follow up patches.
These functions check if the IPv4/IPv6 header is correct.
Signed-off-by: Alvaro Neira Ayuso <alvaroneay@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: c2d3babf ("bridge: implement multicast fast leave")
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter/ipvs updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, this includes the NAT redirection support for nf_tables, the
cgroup support for nft meta and conntrack zone support for the connlimit
match. Coming after those, a bunch of sparse warning fixes, missing
netns bits and cleanups. More specifically, they are:
1) Prepare IPv4 and IPv6 NAT redirect code to use it from nf_tables,
patches from Arturo Borrero.
2) Introduce the nf_tables redir expression, from Arturo Borrero.
3) Remove an unnecessary assignment in ip_vs_xmit/__ip_vs_get_out_rt().
Patch from Alex Gartrell.
4) Add nft_log_dereference() macro to the nf_log infrastructure, patch
from Marcelo Leitner.
5) Add some extra validation when registering logger families, also
from Marcelo.
6) Some spelling cleanups from stephen hemminger.
7) Fix sparse warning in nf_logger_find_get().
8) Add cgroup support to nf_tables meta, patch from Ana Rey.
9) A Kconfig fix for the new redir expression and fix sparse warnings in
the new redir expression.
10) Fix several sparse warnings in the netfilter tree, from
Florian Westphal.
11) Reduce verbosity when OOM in nfnetlink_log. User can basically do
nothing when this situation occurs.
12) Add conntrack zone support to xt_connlimit, again from Florian.
13) Add netnamespace support to the h323 conntrack helper, contributed
by Vasily Averin.
14) Remove unnecessary nul-pointer checks before free_percpu() and
module_put(), from Markus Elfring.
15) Use pr_fmt in nfnetlink_log, again patch from Marcelo Leitner.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ieee802154/fakehard.c
A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.
Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.
Signed-off-by: David S. Miller <davem@davemloft.net>
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ebtables on the OUTPUT chain (NF_BR_LOCAL_OUT) would not work as expected
for both locally generated IGMP and MLD queries. The IP header specific
filter options are off by 14 Bytes for netfilter (actual output on
interfaces is fine).
NF_HOOK() expects the skb->data to point to the IP header, not the
ethernet one (while dev_queue_xmit() does not). Luckily there is an
br_dev_queue_push_xmit() helper function already - let's just use that.
Introduced by eb1d164143
("bridge: Add core IGMP snooping support")
Ebtables example:
$ ebtables -I OUTPUT -p IPv6 -o eth1 --logical-out br0 \
--log --log-level 6 --log-ip6 --log-prefix="~EBT: " -j DROP
before (broken):
~EBT: IN= OUT=eth1 MAC source = 02:04:64:a4:39:c2 \
MAC dest = 33:33:00:00:00:01 proto = 0x86dd IPv6 \
SRC=64a4:39c2:86dd:6000:0000:0020:0001:fe80 IPv6 \
DST=0000:0000:0000:0004:64ff:fea4:39c2:ff02, \
IPv6 priority=0x3, Next Header=2
after (working):
~EBT: IN= OUT=eth1 MAC source = 02:04:64:a4:39:c2 \
MAC dest = 33:33:00:00:00:01 proto = 0x86dd IPv6 \
SRC=fe80:0000:0000:0000:0004:64ff:fea4:39c2 IPv6 \
DST=ff02:0000:0000:0000:0000:0000:0000:0001, \
IPv6 priority=0x0, Next Header=0
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
net/bridge/br_netfilter.c:870:6: symbol 'br_netfilter_enable' was not declared. Should it be static?
no; add include
net/ipv4/netfilter/nft_reject_ipv4.c:22:6: symbol 'nft_reject_ipv4_eval' was not declared. Should it be static?
yes
net/ipv6/netfilter/nf_reject_ipv6.c:16:6: symbol 'nf_send_reset6' was not declared. Should it be static?
no; add include
net/ipv6/netfilter/nft_reject_ipv6.c:22:6: symbol 'nft_reject_ipv6_eval' was not declared. Should it be static?
yes
net/netfilter/core.c:33:32: symbol 'nf_ipv6_ops' was not declared. Should it be static?
no; add include
net/netfilter/xt_DSCP.c:40:57: cast truncates bits from constant value (ffffff03 becomes 3)
net/netfilter/xt_DSCP.c:57:59: cast truncates bits from constant value (ffffff03 becomes 3)
add __force, 3 is what we want.
net/ipv4/netfilter/nf_log_arp.c:77:6: symbol 'nf_log_arp_packet' was not declared. Should it be static?
yes
net/ipv4/netfilter/nf_reject_ipv4.c:17:6: symbol 'nf_send_reset' was not declared. Should it be static?
no; add include
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix:
net/bridge/netfilter/nft_reject_bridge.c:
In function 'nft_reject_br_send_v6_unreach':
net/bridge/netfilter/nft_reject_bridge.c:240:3:
error: implicit declaration of function 'csum_ipv6_magic'
csum_ipv6_magic(&nip6h->saddr, &nip6h->daddr,
^
make[3]: *** [net/bridge/netfilter/nft_reject_bridge.o] Error 1
Seen with powerpc:allmodconfig.
Fixes: 523b929d54 ("netfilter: nft_reject_bridge: don't use IP stack to reject traffic")
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Restrict the reject expression to the prerouting and input bridge
hooks. If we allow this to be used from forward or any other later
bridge hook, if the frame is flooded to several ports, we'll end up
sending several reject packets, one per cloned packet.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the packet is received via the bridge stack, this cannot reject
packets from the IP stack.
This adds functions to build the reject packet and send it from the
bridge stack. Comments and assumptions on this patch:
1) Validate the IPv4 and IPv6 headers before further processing,
given that the packet comes from the bridge stack, we cannot assume
they are clean. Truncated packets are dropped, we follow similar
approach in the existing iptables match/target extensions that need
to inspect layer 4 headers that is not available. This also includes
packets that are directed to multicast and broadcast ethernet
addresses.
2) br_deliver() is exported to inject the reject packet via
bridge localout -> postrouting. So the approach is similar to what
we already do in the iptables reject target. The reject packet is
sent to the bridge port from which we have received the original
packet.
3) The reject packet is forged based on the original packet. The TTL
is set based on sysctl_ip_default_ttl for IPv4 and per-net
ipv6.devconf_all hoplimit for IPv6.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This feature is defined in IEEE Std 802.11-2012, 10.23.13. It allows
the AP devices to keep track of the hardware-address-to-IP-address
mapping of the mobile devices within the WLAN network.
The AP will learn this mapping via observing DHCP, ARP, and NS/NA
frames. When a request for such information is made (i.e. ARP request,
Neighbor Solicitation), the AP will respond on behalf of the
associated mobile device. In the process of doing so, the AP will drop
the multicast request frame that was intended to go out to the wireless
medium.
It was recommended at the LKS workshop to do this implementation in
the bridge layer. vxlan.c is already doing something very similar.
The DHCP snooping code will be added to the userspace application
(hostapd) per the recommendation.
This RFC commit is only for IPv4. A similar approach in the bridge
layer will be taken for IPv6 as well.
Signed-off-by: Kyeyoon Park <kyeyoonp@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 462fb2af97
bridge : Sanitize skb before it enters the IP stack
broke when IP options are actually used because it mangles the
skb as if it entered the IP stack which is wrong because the
bridge is supposed to operate below the IP stack.
Since nobody has actually requested for parsing of IP options
this patch fixes it by simply reverting to the previous approach
of ignoring all IP options, i.e., zeroing the IPCB.
If and when somebody who uses IP options and actually needs them
to be parsed by the bridge complains then we can revisit this.
Reported-by: David Newall <davidn@davidnewall.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As we may defragment the packet in IPv4 PRE_ROUTING and refragment
it after POST_ROUTING we should save the value of frag_max_size.
This is still very wrong as the bridge is supposed to leave the
packets intact, meaning that the right thing to do is to use the
original frag_list for fragmentation.
Unfortunately we don't currently guarantee that the frag_list is
left untouched throughout netfilter so until this changes this is
the best we can do.
There is also a spot in FORWARD where it appears that we can
forward a packet without going through fragmentation, mark it
so that we can fix it later.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains another batch with Netfilter/IPVS updates
for net-next, they are:
1) Add abstracted ICMP codes to the nf_tables reject expression. We
introduce four reasons to reject using ICMP that overlap in IPv4
and IPv6 from the semantic point of view. This should simplify the
maintainance of dual stack rule-sets through the inet table.
2) Move nf_send_reset() functions from header files to per-family
nf_reject modules, suggested by Patrick McHardy.
3) We have to use IS_ENABLED(CONFIG_BRIDGE_NETFILTER) everywhere in the
code now that br_netfilter can be modularized. Convert remaining spots
in the network stack code.
4) Use rcu_barrier() in the nf_tables module removal path to ensure that
we don't leave object that are still pending to be released via
call_rcu (that may likely result in a crash).
5) Remove incomplete arch 32/64 compat from nft_compat. The original (bad)
idea was to probe the word size based on the xtables match/target info
size, but this assumption is wrong when you have to dump the information
back to userspace.
6) Allow to filter from prerouting and postrouting in the nf_tables bridge.
In order to emulate the ebtables NAT chains (which are actually simple
filter chains with no special semantics), we have support filtering from
this hooks too.
7) Add explicit module dependency between xt_physdev and br_netfilter.
This provides a way to detect if the user needs br_netfilter from
the configuration path. This should reduce the breakage of the
br_netfilter modularization.
8) Cleanup coding style in ip_vs.h, from Simon Horman.
9) Fix crash in the recently added nf_tables masq expression. We have
to register/unregister the notifiers to clean up the conntrack table
entries from the module init/exit path, not from the rule addition /
deletion path. From Arturo Borrero.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when vlan filtering is turned on on the bridge, the bridge
will drop all traffic untill the user configures the filter. This
isn't very nice for ports that don't care about vlans and just
want untagged traffic.
A concept of a default_pvid was recently introduced. This patch
adds filtering support for default_pvid. Now, ports that don't
care about vlans and don't define there own filter will belong
to the VLAN of the default_pvid and continue to receive untagged
traffic.
This filtering can be disabled by setting default_pvid to 0.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, if the pvid is not set, we return an illegal vlan value
even though the pvid value is set to 0. Since pvid of 0 is currently
invalid, just return 0 instead. This makes the current and future
checks simpler.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows the user to set and retrieve default_pvid
value. A new value can only be stored when vlan filtering
is disabled.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
You can use physdev to match the physical interface enslaved to the
bridge device. This information is stored in skb->nf_bridge and it is
set up by br_netfilter. So, this is only available when iptables is
used from the bridge netfilter path.
Since 34666d4 ("netfilter: bridge: move br_netfilter out of the core"),
the br_netfilter code is modular. To reduce the impact of this change,
we can autoload the br_netfilter if the physdev match is used since
we assume that the users need br_netfilter in place.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows us to emulate the NAT table in ebtables, which is actually
a plain filter chain that hooks at prerouting, output and postrouting.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces the NFT_REJECT_ICMPX_UNREACH type which provides
an abstraction to the ICMP and ICMPv6 codes that you can use from the
inet and bridge tables, they are:
* NFT_REJECT_ICMPX_NO_ROUTE: no route to host - network unreachable
* NFT_REJECT_ICMPX_PORT_UNREACH: port unreachable
* NFT_REJECT_ICMPX_HOST_UNREACH: host unreachable
* NFT_REJECT_ICMPX_ADMIN_PROHIBITED: administratevely prohibited
You can still use the specific codes when restricting the rule to match
the corresponding layer 3 protocol.
I decided to not overload the existing NFT_REJECT_ICMP_UNREACH to have
different semantics depending on the table family and to allow the user
to specify ICMP family specific codes if they restrict it to the
corresponding family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In preparation for being able to propagate port states to e.g: notifiers
or other kernel parts, do not manipulate the port state directly, but
instead use a helper function which will allow us to do a bit more than
just setting the state.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric reports build failure with
CONFIG_BRIDGE_NETFILTER=n
We insist to build br_nf_core.o unconditionally, but we must only do so
if br_netfilter was enabled, else it fails to build due to
functions being defined to empty stubs (and some structure members
being defined out).
Also, BRIDGE_NETFILTER=y|m makes no sense when BRIDGE=n.
Fixes: 34666d467 (netfilter: bridge: move br_netfilter out of the core)
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
pull request: netfilter/ipvs updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next,
most relevantly they are:
1) Four patches to make the new nf_tables masquerading support
independent of the x_tables infrastructure. This also resolves a
compilation breakage if the masquerade target is disabled but the
nf_tables masq expression is enabled.
2) ipset updates via Jozsef Kadlecsik. This includes the addition of the
skbinfo extension that allows you to store packet metainformation in the
elements. This can be used to fetch and restore this to the packets through
the iptables SET target, patches from Anton Danilov.
3) Add the hash:mac set type to ipset, from Jozsef Kadlecsick.
4) Add simple weighted fail-over scheduler via Simon Horman. This provides
a fail-over IPVS scheduler (unlike existing load balancing schedulers).
Connections are directed to the appropriate server based solely on
highest weight value and server availability, patch from Kenny Mathis.
5) Support IPv6 real servers in IPv4 virtual-services and vice versa.
Simon Horman informs that the motivation for this is to allow more
flexibility in the choice of IP version offered by both virtual-servers
and real-servers as they no longer need to match: An IPv4 connection
from an end-user may be forwarded to a real-server using IPv6 and
vice versa. No ip_vs_sync support yet though. Patches from Alex Gartrell
and Julian Anastasov.
6) Add global generation ID to the nf_tables ruleset. When dumping from
several different object lists, we need a way to identify that an update
has ocurred so userspace knows that it needs to refresh its lists. This
also includes a new command to obtain the 32-bits generation ID. The
less significant 16-bits of this ID is also exposed through res_id field
in the nfnetlink header to quickly detect the interference and retry when
there is no risk of ID wraparound.
7) Move br_netfilter out of the bridge core. The br_netfilter code is
built in the bridge core by default. This causes problems of different
kind to people that don't want this: Jesper reported performance drop due
to the inconditional hook registration and I remember to have read complains
on netdev from people regarding the unexpected behaviour of our bridging
stack when br_netfilter is enabled (fragmentation handling, layer 3 and
upper inspection). People that still need this should easily undo the
damage by modprobing the new br_netfilter module.
8) Dump the set policy nf_tables that allows set parameterization. So
userspace can keep user-defined preferences when saving the ruleset.
From Arturo Borrero.
9) Use __seq_open_private() helper function to reduce boiler plate code
in x_tables, From Rob Jones.
10) Safer default behaviour in case that you forget to load the protocol
tracker. Daniel Borkmann and Florian Westphal detected that if your
ruleset is stateful, you allow traffic to at least one single SCTP port
and the SCTP protocol tracker is not loaded, then any SCTP traffic may
be pass through unfiltered. After this patch, the connection tracking
classifies SCTP/DCCP/UDPlite/GRE packets as invalid if your kernel has
been compiled with support for these modules.
====================
Trivially resolved conflict in include/linux/skbuff.h, Eric moved some
netfilter skbuff members around, and the netfilter tree adjusted the
ifdef guards for the bridging info pointer.
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper reported that br_netfilter always registers the hooks since
this is part of the bridge core. This harms performance for people that
don't need this.
This patch modularizes br_netfilter so it can be rmmod'ed, thus,
the hooks can be unregistered. I think the bridge netfilter should have
been a separated module since the beginning, Patrick agreed on that.
Note that this is breaking compatibility for users that expect that
bridge netfilter is going to be available after explicitly 'modprobe
bridge' or via automatic load through brctl.
However, the damage can be easily undone by modprobing br_netfilter.
The bridge core also spots a message to provide a clue to people that
didn't notice that this has been deprecated.
On top of that, the plan is that nftables will not rely on this software
layer, but integrate the connection tracking into the bridge layer to
enable stateful filtering and NAT, which is was bridge netfilter users
seem to require.
This patch still keeps the fake_dst_ops in the bridge core, since this
is required by when the bridge port is initialized. So we can safely
modprobe/rmmod br_netfilter anytime.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
Move nf_bridge_copy_header() as static inline in netfilter_bridge.h
header file. This patch prepares the modularization of the br_netfilter
code.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c
Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
As Toshiaki Makita pointed out, the BRIDGE_INPUT_SKB_CB will
not be initialized in br_should_learn() as that function
is called only from br_handle_local_finish(). That is
an input handler for link-local ethernet traffic so it perfectly
correct to check br->vlan_enabled here.
Reported-by: Toshiaki Makita<toshiaki.makita1@gmail.com>
Fixes: 20adfa1 bridge: Check if vlan filtering is enabled only once.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, it is possible to modify the vlan filter
configuration to add pvid or untagged support.
For example:
bridge vlan add vid 10 dev eth0
bridge vlan add vid 10 dev eth0 untagged pvid
The second statement will modify vlan 10 to
include untagged and pvid configuration.
However, it is currently impossible to go backwards
bridge vlan add vid 10 dev eth0 untagged pvid
bridge vlan add vid 10 dev eth0
Here nothing happens. This patch correct this so
that any modifiers not supplied are removed from
the configuration.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge code checks if vlan filtering is enabled on both
ingress and egress. When the state flip happens, it
is possible for the bridge to currently be forwarding packets
and forwarding behavior becomes non-deterministic. Bridge
may drop packets on some interfaces, but not others.
This patch solves this by caching the filtered state of the
packet into skb_cb on ingress. The skb_cb is guaranteed to
not be over-written between the time packet entres bridge
forwarding path and the time it leaves it. On egress, we
can then check the cached state to see if we need to
apply filtering information.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
nf-next pull request
The following patchset contains Netfilter/IPVS updates for your
net-next tree. Regarding nf_tables, most updates focus on consolidating
the NAT infrastructure and adding support for masquerading. More
specifically, they are:
1) use __u8 instead of u_int8_t in arptables header, from
Mike Frysinger.
2) Add support to match by skb->pkttype to the meta expression, from
Ana Rey.
3) Add support to match by cpu to the meta expression, also from
Ana Rey.
4) A smatch warning about IPSET_ATTR_MARKMASK validation, patch from
Vytas Dauksa.
5) Fix netnet and netportnet hash types the range support for IPv4,
from Sergey Popovich.
6) Fix missing-field-initializer warnings resolved, from Mark Rustad.
7) Dan Carperter reported possible integer overflows in ipset, from
Jozsef Kadlecsick.
8) Filter out accounting objects in nfacct by type, so you can
selectively reset quotas, from Alexey Perevalov.
9) Move specific NAT IPv4 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
10) Use the new NAT IPv4 functions from nft_chain_nat_ipv4.
11) Move specific NAT IPv6 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
12) Use the new NAT IPv6 functions from nft_chain_nat_ipv6.
13) Refactor code to add nft_delrule(), which can be reused in the
enhancement of the NFT_MSG_DELTABLE to remove a table and its
content, from Arturo Borrero.
14) Add a helper function to unregister chain hooks, from
Arturo Borrero.
15) A cleanup to rename to nft_delrule_by_chain for consistency with
the new nft_*() functions, also from Arturo.
16) Add support to match devgroup to the meta expression, from Ana Rey.
17) Reduce stack usage for IPVS socket option, from Julian Anastasov.
18) Remove unnecessary textsearch state initialization in xt_string,
from Bojan Prtvar.
19) Add several helper functions to nf_tables, more work to prepare
the enhancement of NFT_MSG_DELTABLE, again from Arturo Borrero.
20) Enhance NFT_MSG_DELTABLE to delete a table and its content, from
Arturo Borrero.
21) Support NAT flags in the nat expression to indicate the flavour,
eg. random fully, from Arturo.
22) Add missing audit code to ebtables when replacing tables, from
Nicolas Dichtel.
23) Generalize the IPv4 masquerading code to allow its re-use from
nf_tables, from Arturo.
24) Generalize the IPv6 masquerading code, also from Arturo.
25) Add the new masq expression to support IPv4/IPv6 masquerading
from nf_tables, also from Arturo.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes an unncessary check in the br_afspec() method of
br_netlink.c.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow rtnetlink users to set bridge master info via IFLA_INFO_DATA attr
This initial part implements forward_delay, hello_time, max_age options.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow rtnetlink users to get bridge master info in IFLA_INFO_DATA attr
This initial part implements forward_delay, hello_time, max_age options.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow rtnetlink users to set port info via IFLA_INFO_SLAVE_DATA attr
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow rtnetlink users to get port info in IFLA_INFO_SLAVE_DATA attr
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
The thing is that netdev_master_upper_dev_link calls
call_netdevice_notifiers(NETDEV_CHANGEUPPER, dev). That generates rtnl
link message and during that, rtnl_link_ops->fill_slave_info is called.
But with current ordering, rx_handler and IFF_BRIDGE_PORT are not set
yet so there would have to be check for that in fill_slave_info callback.
Resolve this by reordering to similar what bonding and team does to
avoid the check.
Also add removal of IFF_BRIDGE_PORT flag into error path.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is already done for x_tables (family AF_INET and AF_INET6), let's
do it for AF_BRIDGE also.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(..., NULL)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the functionality to untag traffic on input resides
as part of the vlan module and is build only when VLAN support
is enabled in the kernel. When VLAN is disabled, the function
vlan_untag() turns into a stub and doesn't really untag the
packets. This seems to create an interesting interaction
between VMs supporting checksum offloading and some network drivers.
There are some drivers that do not allow the user to change
tx-vlan-offload feature of the driver. These drivers also seem
to assume that any VLAN-tagged traffic they transmit will
have the vlan information in the vlan_tci and not in the vlan
header already in the skb. When transmitting skbs that already
have tagged data with partial checksum set, the checksum doesn't
appear to be updated correctly by the card thus resulting in a
failure to establish TCP connections.
The following is a packet trace taken on the receiver where a
sender is a VM with a VLAN configued. The host VM is running on
doest not have VLAN support and the outging interface on the
host is tg3:
10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
offset 0, flags [DF], proto TCP (6), length 60)
10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294837885 ecr 0,nop,wscale 7], length 0
10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
offset 0, flags [DF], proto TCP (6), length 60)
10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294838888 ecr 0,nop,wscale 7], length 0
This connection finally times out.
I've only access to the TG3 hardware in this configuration thus have
only tested this with TG3 driver. There are a lot of other drivers
that do not permit user changes to vlan acceleration features, and
I don't know if they all suffere from a similar issue.
The patch attempt to fix this another way. It moves the vlan header
stipping code out of the vlan module and always builds it into the
kernel network core. This way, even if vlan is not supported on
a virtualizatoin host, the virtual machines running on top of such
host will still work with VLANs enabled.
CC: Patrick McHardy <kaber@trash.net>
CC: Nithin Nayak Sujir <nsujir@broadcom.com>
CC: Michael Chan <mchan@broadcom.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet reports that getsockopt() or setsockopt() sometimes
returns -EINTR instead of -ENOPROTOOPT, causing headaches to
application developers.
This patch replaces all the mutex_lock_interruptible() by mutex_lock()
in the netfilter tree, as there is no reason we should sleep for a
long time there.
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
All other add functions for lists have the new item as first argument
and the position where it is added as second argument. This was changed
for no good reason in this function and makes using it unnecessary
confusing.
The name was changed to hlist_add_behind() to cause unconverted code to
generate a compile error instead of using the wrong parameter order.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ken Helias <kenhelias@firemail.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [intel driver bits]
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
drivers/net/Makefile
net/ipv6/sysctl_net_ipv6.c
Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.
In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now bridge ports can be non-promiscuous, vlan_vid_add() is no longer an
unnecessary operation.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 6cbdceeb1c
bridge: Dump vlan information from a bridge port
introduced a comment in an attempt to explain the
code logic. The comment is unfinished so it confuses more
than it explains, remove it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
An FDB entry with vlan_id 0 doesn't mean it is used in vlan 0, but used when
vlan_filtering is disabled.
There is inconsistency around NDA_VLAN whose payload is 0 - even if we add
an entry by RTM_NEWNEIGH without any NDA_VLAN, and even though adding an
entry with NDA_VLAN 0 is prohibited, we get an entry with NDA_VLAN 0 by
RTM_GETNEIGH.
Dumping an FDB entry with vlan_id 0 shouldn't include NDA_VLAN.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ulog targets were recently killed. A few references to the Kconfig
macros CONFIG_IP_NF_TARGET_ULOG and CONFIG_BRIDGE_EBT_ULOG were left
untouched. Kill these too.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
So you can reject IPv4 and IPv6 packets from bridge tables. If the ether
proto is now known, default on dropping the packet instead.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains updates for your net-next tree,
they are:
1) Use kvfree() helper function from x_tables, from Eric Dumazet.
2) Remove extra timer from the conntrack ecache extension, use a
workqueue instead to redeliver lost events to userspace instead,
from Florian Westphal.
3) Removal of the ulog targets for ebtables and iptables. The nflog
infrastructure superseded this almost 9 years ago, time to get rid
of this code.
4) Replace the list of loggers by an array now that we can only have
two possible non-overlapping logger flavours, ie. kernel ring buffer
and netlink logging.
5) Move Eric Dumazet's log buffer code to nf_log to reuse it from
all of the supported per-family loggers.
6) Consolidate nf_log_packet() as an unified interface for packet logging.
After this patch, if the struct nf_loginfo is available, it explicitly
selects the logger that is used.
7) Move ip and ip6 logging code from xt_LOG to the corresponding
per-family loggers. Thus, x_tables and nf_tables share the same code
for packet logging.
8) Add generic ARP packet logger, which is used by nf_tables. The
format aims to be consistent with the output of xt_LOG.
9) Add generic bridge packet logger. Again, this is used by nf_tables
and it routes the packets to the real family loggers. As a result,
we get consistent logging format for the bridge family. The ebt_log
logging code has been intentionally left in place not to break
backward compatibility since the logging output differs from xt_LOG.
10) Update nft_log to explicitly request the required family logger when
needed.
11) Finish nft_log so it supports arp, ip, ip6, bridge and inet families.
Allowing selection between netlink and kernel buffer ring logging.
12) Several fixes coming after the netfilter core logging changes spotted
by robots.
13) Use IS_ENABLED() macros whenever possible in the netfilter tree,
from Duan Jiong.
14) Removal of a couple of unnecessary branch before kfree, from Fabian
Frederick.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Actually better than brctl showmacs because we can filter by bridge
port in the kernel.
The current bridge netlink interface doesnt scale when you have many
bridges each with large fdbs or even bridges with many bridge ports
And now for the science non-fiction novel you have all been
waiting for..
//lets see what bridge ports we have
root@moja-1:/configs/may30-iprt/bridge# ./bridge link show
8: eth1 state DOWN : <BROADCAST,MULTICAST> mtu 1500 master br0 state
disabled priority 32 cost 19
17: sw1-p1 state DOWN : <BROADCAST,NOARP> mtu 1500 master br0 state
disabled priority 32 cost 100
// show all..
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show
33:33:00:00:00:01 dev bond0 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev ifb0 self permanent
33:33:00:00:00:01 dev ifb1 self permanent
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:22:01:01 dev eth0 self permanent
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
33:33:00:00:00:01 dev gretap0 self permanent
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
//filter by bridge
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
// bridge sw1 has no ports attached..
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br sw1
//filter by port
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show brport eth1
02:00:00:12:01:02 vlan 0 master br0 permanent
00:17:42:8a:b4:05 vlan 0 master br0 permanent
00:17:42:8a:b4:07 self permanent
33:33:00:00:00:01 self permanent
// filter by port + bridge
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0 brport
sw1-p1
da:ac:46:27:d9:53 vlan 0 master br0 permanent
33:33:00:00:00:01 self permanent
// for shits and giggles (as they say in New Brunswick), lets
// change the mac that br0 uses
// Note: a magical fdb entry with no brport is added ...
root@moja-1:/configs/may30-iprt/bridge# ip link set dev br0 address
02:00:00:12:01:04
// lets see if we can see the unicorn ..
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show
33:33:00:00:00:01 dev bond0 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev ifb0 self permanent
33:33:00:00:00:01 dev ifb1 self permanent
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:22:01:01 dev eth0 self permanent
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
33:33:00:00:00:01 dev gretap0 self permanent
02:00:00:12:01:04 dev br0 vlan 0 master br0 permanent <=== there it is
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
//can we see it if we filter by bridge?
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
02:00:00:12:01:04 dev br0 vlan 0 master br0 permanent <=== there it is
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dumping a bridge fdb dumps every fdb entry
held. With this change we are going to filter
on selected bridge port.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch other modules are able to ask the bridge whether an
IGMP or MLD querier exists on the according, bridged link layer.
Multicast snooping can only be performed if a valid, selected querier
exists on a link.
Just like the bridge only enables its multicast snooping if a querier
exists, e.g. batman-adv too can only activate its multicast
snooping in bridged scenarios if a querier is present.
For instance this export avoids having to reimplement IGMP/MLD
querier message snooping and parsing in e.g. batman-adv, when
multicast optimizations for bridged scenarios are added in the
future.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the generic plain text packet loggger for bridged packets.
It routes the logging message to the real protocol packet logger.
I decided not to refactor the ebt_log code for two reasons:
1) The ebt_log output is not consistent with the IPv4 and IPv6
Netfilter packet loggers. The output is different for no good
reason and it adds redundant code to handle packet logging.
2) To avoid breaking backward compatibility for applications
outthere that are parsing the specific ebt_log output, the ebt_log
output has been left as is. So only nftables will use the new
consistent logging format for logged bridged packets.
More decisions coming in this patch:
1) This also removes ebt_log as default logger for bridged packets.
Thus, nf_log_packet() routes packet to this new packet logger
instead. This doesn't break backward compatibility since
nf_log_packet() is not used to log packets in plain text format
from anywhere in the ebtables/netfilter bridge code.
2) The new bridge packet logger also performs a lazy request to
register the real IPv4, ARP and IPv6 netfilter packet loggers.
If the real protocol logger is no available (not compiled or the
module is not available in the system, not packet logging happens.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that legacy ulog targets are not available anymore in the tree, we
can have up to two possible loggers:
1) The plain text logging via kernel logging ring.
2) The nfnetlink_log infrastructure which delivers log messages
to userspace.
This patch replaces the list of loggers by an array of two pointers
per family for each possible logger and it also introduces a new field
to the nf_logger structure which indicates the position in the logger
array (based on the logger type).
This prepares a follow up patch that consolidates the nf_log_packet()
interface by allowing to specify the logger as parameter.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This has been marked as deprecated for quite some time and the NFLOG
target replacement has been also available since 2006.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
use list_for_each_entry_continue_reverse to rollback in fdb_add_hw
when add address failed
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some fields in "struct net_bridge" aren't available when compiling the
kernel without IPv6 support. Therefore adding a check/macro to skip the
complaining code sections in that case.
Introduced by 2cd4143192
("bridge: memorize and export selected IGMP/MLD querier port")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
"New smatch warnings:
net/bridge/br_multicast.c:1368 br_ip6_multicast_query() error:
we previously assumed 'group' could be null (see line 1349)"
In the rare (sort of broken) case of a query having a Maximum
Response Delay of zero, we could create a potential null pointer
dereference.
Fixing this by skipping the multicast specific MLD Query parsing again
if no multicast group address is available.
Introduced by dc4eb53a99
("bridge: adhere to querier election mechanism specified by RFCs")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables us to change the vlan protocol for vlan filtering.
We come to be able to filter frames on the basis of 802.1ad vlan tags
through a bridge.
This also changes br->group_addr if it has not been set by user.
This is needed for an 802.1ad bridge.
(See IEEE 802.1Q-2011 8.13.5.)
Furthermore, this sets br->group_fwd_mask_required so that an 802.1ad
bridge can forward the Nearest Customer Bridge group addresses except
for br->group_addr, which should be passed to higher layer.
To change the vlan protocol, write a protocol in sysfs:
# echo 0x88a8 > /sys/class/net/br0/bridge/vlan_protocol
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a bridge is an 802.1ad bridge, it must forward another bridge group
addresses (the Nearest Customer Bridge group addresses).
(For details, see IEEE 802.1Q-2011 8.6.3.)
As user might not want group_fwd_mask to be modified by enabling 802.1ad,
introduce a new mask, group_fwd_mask_required, which indicates addresses
the bridge wants to forward. This will be set by enabling 802.1ad.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables a bridge to have vlan protocol informantion and allows vlan
tag manipulation (retrieve, insert and remove tags) according to the vlan
protocol.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge device doesn't need to embed S-tag into skb->data.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding bridge support to the batman-adv multicast optimization requires
batman-adv knowing about the existence of bridged-in IGMP/MLD queriers
to be able to reliably serve any multicast listener behind this same
bridge.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this new, exported function br_multicast_list_adjacent(net_dev) a
list of IPv4/6 addresses is returned. This list contains all multicast
addresses sensed by the bridge multicast snooping feature on all bridge
ports of the bridge interface of net_dev, excluding addresses from the
specified net_device itself.
Adding bridge support to the batman-adv multicast optimization requires
batman-adv knowing about the existence of bridged-in multicast
listeners to be able to reliably serve them with multicast packets.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
MLDv1 (RFC2710 section 6), MLDv2 (RFC3810 section 7.6.2), IGMPv2
(RFC2236 section 3) and IGMPv3 (RFC3376 section 6.6.2) specify that the
querier with lowest source address shall become the selected
querier.
So far the bridge stopped its querier as soon as it heard another
querier regardless of its source address. This results in the "wrong"
querier potentially becoming the active querier or a potential,
unnecessary querying delay.
With this patch the bridge memorizes the source address of the currently
selected querier and ignores queries from queriers with a higher source
address than the currently selected one. This slight optimization is
supposed to make it more RFC compliant (but is rather uncritical and
therefore probably not necessary to be queued for stable kernels).
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current naming of these two structs is very random, in that
reversing their naming would not make any semantical difference.
This patch tries to make the naming less confusing by giving them a more
specific, distinguishable naming.
This is also useful for the upcoming patches reintroducing the
"struct bridge_mcast_querier" but for storing information about the
selected querier (no matter if our own or a foreign querier).
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_manage_promisc() incorrectly expects br_auto_port() to return only 0
or 1, while it actually returns flags, i.e., a subset of BR_AUTO_MASK.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c
Changes in net were fixing bugs in code removed in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
(This patch was previously posted as RFC at
http://patchwork.ozlabs.org/patch/352677/)
This patch adds NDA_MASTER attribute to neighbour attributes enum for
bridge/master ifindex. And adds NDA_MASTER to bridge fdb notify msgs.
Today bridge fdb notifications dont contain bridge information.
Userspace can derive it from the port information in the fdb
notification. However this is tricky in some scenarious.
Example, bridge port delete notification comes before bridge fdb
delete notifications. And we have seen problems in userspace
when using libnl where, the bridge fdb delete notification handling code
does not understand which bridge this fdb entry is part of because
the bridge and port association has already been deleted.
And these notifications (port membership and fdb) are generated on
separate rtnl groups.
Fixing the order of notifications could possibly solve the problem
for some cases (I can submit a separate patch for that).
This patch chooses to add NDA_MASTER to bridge fdb notify msgs
because it not only solves the problem described above, but also helps
userspace avoid another lookup into link msgs to derive the master index.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_handle_local_finish() is allowing us to insert an FDB entry with
disallowed vlan. For example, when port 1 and 2 are communicating in
vlan 10, and even if vlan 10 is disallowed on port 3, port 3 can
interfere with their communication by spoofed src mac address with
vlan id 10.
Note: Even if it is judged that a frame should not be learned, it should
not be dropped because it is destined for not forwarding layer but higher
layer. See IEEE 802.1Q-2011 8.13.10.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There has been a number incidents recently where customers running KVM have
reported that VM hosts on different Hypervisors are unreachable. Based on
pcap traces we found that the bridge was broadcasting the ARP request out
onto the network. However some NICs have an inbuilt switch which on occasions
were broadcasting the VMs ARP request back through the physical NIC on the
Hypervisor. This resulted in the bridge changing ports and incorrectly learning
that the VMs mac address was external. As a result the ARP reply was directed
back onto the external network and VM never updated it's ARP cache. This patch
will notify the bridge command, after a fdb has been updated to identify such
port toggling.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As commit 2796d0c648 ("bridge: Automatically manage port
promiscuous mode."), make the add_if use dev_set_allmulti
instead of dev_set_promiscuous, so when add_if failed, we
should do dev_set_allmulti(dev, -1).
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Reviewed-by: Amos Kong <akong@redhat.com>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before f5efc69 ("netfilter: nf_tables: Add meta expression key for
bridge interface name"), the entire net/bridge/netfilter/ directory
depended on BRIDGE_NF_EBTABLES, ie. on ebtables. However, that
directory already contained the nf_tables bridge extension that
we should allow to compile separately. In f5efc69, we tried to
generalize this by using CONFIG_BRIDGE_NETFILTER which was not a good
idea since this option already existed and it is dedicated to enable
the Netfilter bridge IP/ARP filtering.
Let's try to fix this mess by:
1) making net/bridge/netfilter/ dependent on the toplevel
CONFIG_NETFILTER option, just like we do with the net/netfilter and
net/ipv{4,6}/netfilter/ directories.
2) Changing 'selects' to 'depends on' NETFILTER_XTABLES for
BRIDGE_NF_EBTABLES. I believe this problem was already before
f5efc69:
warning: (BRIDGE_NF_EBTABLES) selects NETFILTER_XTABLES which has
unmet direct dependencies (NET && INET && NETFILTER)
3) Fix ebtables/nf_tables bridge dependencies by making NF_TABLES_BRIDGE
and BRIDGE_NF_EBTABLES dependent on BRIDGE and NETFILTER:
warning: (NF_TABLES_BRIDGE && BRIDGE_NF_EBTABLES) selects
BRIDGE_NETFILTER which has unmet direct dependencies (NET && BRIDGE &&
NETFILTER && INET && NETFILTER_ADVANCED)
net/built-in.o: In function `br_parse_ip_options':
br_netfilter.c:(.text+0x4a5ba): undefined reference to `ip_options_compile'
br_netfilter.c:(.text+0x4a5ed): undefined reference to `ip_options_rcv_srr'
net/built-in.o: In function `br_nf_pre_routing_finish':
br_netfilter.c:(.text+0x4a8a4): undefined reference to `ip_route_input_noref'
br_netfilter.c:(.text+0x4a987): undefined reference to `ip_route_output_flow'
make: *** [vmlinux] Error 1
Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge net/bridge/br_notify.c into net/bridge/br.c,
since it has only br_device_event() and br.c is small.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/nftables updates for net-next
The following patchset contains Netfilter/nftables updates for net-next,
most relevantly they are:
1) Add set element update notification via netlink, from Arturo Borrero.
2) Put all object updates in one single message batch that is sent to
kernel-space. Before this patch only rules where included in the batch.
This series also introduces the generic transaction infrastructure so
updates to all objects (tables, chains, rules and sets) are applied in
an all-or-nothing fashion, these series from me.
3) Defer release of objects via call_rcu to reduce the time required to
commit changes. The assumption is that all objects are destroyed in
reverse order to ensure that dependencies betweem them are fulfilled
(ie. rules and sets are destroyed first, then chains, and finally
tables).
4) Allow to match by bridge port name, from Tomasz Bursztyka. This series
include two patches to prepare this new feature.
5) Implement the proper set selection based on the characteristics of the
data. The new infrastructure also allows you to specify your preferences
in terms of memory and computational complexity so the underlying set
type is also selected according to your needs, from Patrick McHardy.
6) Several cleanup patches for nft expressions, including one minor possible
compilation breakage due to missing mark support, also from Patrick.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
fix build when BRIDGE_VLAN_FILTERING is not set
Fixes: 2796d0c648 ("bridge: Automatically manage port promiscuous mode")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There exist configurations where the administrator or another management
entity has the foreknowledge of all the mac addresses of end systems
that are being bridged together.
In these environments, the administrator can statically configure known
addresses in the bridge FDB and disable flooding and learning on ports.
This makes it possible to turn off promiscuous mode on the interfaces
connected to the bridge.
Here is why disabling flooding and learning allows us to control
promiscuity:
Consider port X. All traffic coming into this port from outside the
bridge (ingress) will be either forwarded through other ports of the
bridge (egress) or dropped. Forwarding (egress) is defined by FDB
entries and by flooding in the event that no FDB entry exists.
In the event that flooding is disabled, only FDB entries define
the egress. Once learning is disabled, only static FDB entries
provided by a management entity define the egress. If we provide
information from these static FDBs to the ingress port X, then we'll
be able to accept all traffic that can be successfully forwarded and
drop all the other traffic sooner without spending CPU cycles to
process it.
Another way to define the above is as following equations:
ingress = egress + drop
expanding egress
ingress = static FDB + learned FDB + flooding + drop
disabling flooding and learning we a left with
ingress = static FDB + drop
By adding addresses from the static FDB entries to the MAC address
filter of an ingress port X, we fully define what the bridge can
process without dropping and can thus turn off promiscuous mode,
thus dropping packets sooner.
There have been suggestions that we may want to allow learning
and update the filters with learned addresses as well. This
would require mac-level authentication similar to 802.1x to
prevent attacks against the hw filters as they are limited
resource.
Additionally, if the user places the bridge device in promiscuous mode,
all ports are placed in promiscuous mode regardless of the changes
to flooding and learning.
Since the above functionality depends on full static configuration,
we have also require that vlan filtering be enabled to take
advantage of this. The reason is that the bridge has to be
able to receive and process VLAN-tagged frames and the there
are only 2 ways to accomplish this right now: promiscuous mode
or vlan filtering.
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a static fdb entry is created, add the mac address
from this fdb entry to any ports that are currently running
in non-promiscuous mode. These ports need this data so that
they can receive traffic destined to these addresses.
By default ports start in promiscuous mode, so this feature
is disabled.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a BR_PROMISC per-port flag that will help us track if the
current port is supposed to be in promiscuous mode or not. For now,
always start in promiscuous mode.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add code that allows static fdb entires to be synced to the
hw list for a specified port. This will be used later to
program ports that can function in non-promiscuous mode.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, ports on the bridge are capable of automatic
discovery of nodes located behind the port. This is accomplished
via flooding of unknown traffic (BR_FLOOD) and learning the
mac addresses from these packets (BR_LEARNING).
If the above functionality is disabled by turning off these
flags, the port requires static configuration in the form
of static FDB entries to function properly.
This patch adds functionality to keep track of all ports
capable of automatic discovery. This will later be used
to control promiscuity settings.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turn the flag change macro into a function to allow
easier updates and to reduce space.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net: get rid of SET_ETHTOOL_OPS
Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone.
This does that.
Mostly done via coccinelle script:
@@
struct ethtool_ops *ops;
struct net_device *dev;
@@
- SET_ETHTOOL_OPS(dev, ops);
+ dev->ethtool_ops = ops;
Compile tested only, but I'd seriously wonder if this broke anything.
Suggested-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The __vlan_find_dev_deep should always called in RCU, according
David's suggestion, rename to __vlan_find_dev_deep_rcu looks more
reasonable.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently bridge can silently drop ipv4 fragments.
If node have loaded nf_defrag_ipv4 module but have no nf_conntrack_ipv4,
br_nf_pre_routing defragments incoming ipv4 fragments
but nfct check in br_nf_dev_queue_xmit does not allow re-fragment combined
packet back, and therefore it is dropped in br_dev_queue_push_xmit without
incrementing of any failcounters
It seems the only way to hit the ip_fragment code in the bridge xmit
path is to have a fragment list whose reassembled fragments go over
the mtu. This only happens if nf_defrag is enabled. Thanks to
Florian Westphal for providing feedback to clarify this.
Defragmentation ipv4 is required not only in conntracks but at least in
TPROXY target and socket match, therefore #ifdef is changed from
NF_CONNTRACK_IPV4 to NF_DEFRAG_IPV4
Signed-off-by: Vasily Averin <vvs@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When bridge device is created with IFLA_ADDRESS, we are not calling
br_stp_change_bridge_id(), which leads to incorrect local fdb
management and bridge id calculation, and prevents us from receiving
frames on the bridge device.
Reported-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFT_META_BRI_IIFNAME to get packet input bridge interface name
NFT_META_BRI_OIFNAME to get packet output bridge interface name
Such meta key are accessible only through NFPROTO_BRIDGE family, on a
dedicated nft meta module: nft_meta_bridge.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_allowed_ingress() has two problems.
1. If br_allowed_ingress() is called by br_handle_frame_finish() and
vlan_untag() in br_allowed_ingress() fails, skb will be freed by both
vlan_untag() and br_handle_frame_finish().
2. If br_allowed_ingress() is called by br_dev_xmit() and
br_allowed_ingress() fails, the skb will not be freed.
Fix these two problems by freeing the skb in br_allowed_ingress()
if it fails.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
All xtables variants suffer from the defect that the copy_to_user()
to copy the counters to user memory may fail after the table has
already been exchanged and thus exposed. Return an error at this
point will result in freeing the already exposed table. Any
subsequent packet processing will result in a kernel panic.
We can't copy the counters before exposing the new tables as we
want provide the counter state after the old table has been
unhooked. Therefore convert this into a silent error.
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use existing function instead of trying to use our own.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/marvell/mvneta.c
The mvneta.c conflict is a case of overlapping changes,
a conversion to devm_ioremap_resource() vs. a conversion
to netdev_alloc_pcpu_stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
The gfp parameter was added in:
commit 47be03a28c
Author: Amerigo Wang <amwang@redhat.com>
Date: Fri Aug 10 01:24:37 2012 +0000
netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup()
slave_enable_netpoll() and __netpoll_setup() may be called
with read_lock() held, so should use GFP_ATOMIC to allocate
memory. Eric suggested to pass gfp flags to __netpoll_setup().
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reason for the gfp parameter was removed in:
commit c4cdef9b71
Author: dingtianhong <dingtianhong@huawei.com>
Date: Tue Jul 23 15:25:27 2013 +0800
bonding: don't call slave_xxx_netpoll under spinlocks
The slave_xxx_netpoll will call synchronize_rcu_bh(),
so the function may schedule and sleep, it should't be
called under spinlocks.
bond_netpoll_setup() and bond_netpoll_cleanup() are always
protected by rtnl lock, it is no need to take the read lock,
as the slave list couldn't be changed outside rtnl lock.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nothing else that calls __netpoll_setup or ndo_netpoll_setup
requires a gfp paramter, so remove the gfp parameter from both
of these functions making the code clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the vlan filtering is enabled on the bridge, but
the filter is not configured on the bridge device itself,
running tcpdump on the bridge device will result in a
an Oops with NULL pointer dereference. The reason
is that br_pass_frame_up() will bypass the vlan
check because promisc flag is set. It will then try
to get the table pointer and process the packet based
on the table. Since the table pointer is NULL, we oops.
Catch this special condition in br_handle_vlan().
Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
CC: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a bridge with vlan_filtering enabled receives frames with stacked
vlan tags, i.e., they have two vlan tags, br_vlan_untag() strips not
only the outer tag but also the inner tag.
br_vlan_untag() is called only from br_handle_vlan(), and in this case,
it is enough to set skb->vlan_tci to 0 here, because vlan_tci has already
been set before calling br_handle_vlan().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge vlan code (br_vlan_get_tag()) assumes that all frames have vlan_tci
if they are tagged, but if vlan tx offload is manually disabled on bridge
device and frames are sent from vlan device on the bridge device, the tags
are embedded in skb->data and they break this assumption.
Extract embedded vlan tags and move them to vlan_tci at ingress.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)
The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.
And in the case of the NULL pointer, there is no structure to initialize.
So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL)
Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the bh safe variant with the hard irq safe variant.
We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.
Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/r8152.c
drivers/net/xen-netback/netback.c
Both the r8152 and netback conflicts were simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this check someone could easily create a denial of service
by injecting multicast-specific queries to enable the bridge
snooping part if no real querier issuing periodic general queries
is present on the link which would result in the bridge wrongly
shutting down ports for multicast traffic as the bridge did not learn
about these listeners.
With this patch the snooping code is enabled upon receiving valid,
general queries only.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
General IGMP and MLD queries are supposed to have the multicast
link-local all-nodes address as their destination according to RFC2236
section 9, RFC3376 section 4.1.12/9.1, RFC2710 section 8 and RFC3810
section 5.1.15.
Without this check, such malformed IGMP/MLD queries can result in a
denial of service: The queries are ignored by most IGMP/MLD listeners
therefore they will not respond with an IGMP/MLD report. However,
without this patch these malformed MLD queries would enable the
snooping part in the bridge code, potentially shutting down the
according ports towards these hosts for multicast traffic as the
bridge did not learn about these listeners.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e688a60480 ("net: introduce DST_NOPEER dst flag") introduced
DST_NOPEER because because of crashes in ipv6_select_ident called from
udp6_ufo_fragment.
Since commit 916e4cf46d ("ipv6: reuse ip6_frag_id from
ip6_ufo_append_data") we don't call ipv6_select_ident any more from
ip6_ufo_append_data, thus this flag lost its purpose and can be removed.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MLD queries are supposed to have an IPv6 link-local source address
according to RFC2710, section 4 and RFC3810, section 5.1.14. This patch
adds a sanity check to ignore such broken MLD queries.
Without this check, such malformed MLD queries can result in a
denial of service: The queries are ignored by any MLD listener
therefore they will not respond with an MLD report. However,
without this patch these malformed MLD queries would enable the
snooping part in the bridge code, potentially shutting down the
according ports towards these hosts for multicast traffic as the
bridge did not learn about these listeners.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the uses of memcpy to ether_addr_copy because
for some architectures it is smaller and faster.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the more obvious uses of memcpy to ether_addr_copy.
There are still uses of memcpy that could be converted but
these addresses are __aligned(2).
Convert a couple uses of 6 in gr_private.h to ETH_ALEN.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_change_mac_address() calls fdb_insert()/fdb_delete() without
br->hash_lock.
These hash list updates are racy with br_fdb_update()/br_fdb_cleanup().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlan codes unconditionally delete local fdb entries.
We should consider the possibility that other ports have the same
address and vlan.
Example of problematic case:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
bridge vlan add dev eth0 vid 10
bridge vlan add dev eth1 vid 10
bridge vlan add dev br0 vid 10 self
We will have fdb entry such that f->dst == eth0, f->vlan_id == 10 and
f->addr == 12:34:56:78:90:ab at this time.
Next, delete eth0 vlan 10.
bridge vlan del dev eth0 vid 10
In this case, we still need the entry for br0, but it will be deleted.
Note that br0 needs the entry even though its mac address is not set
manually. To delete the entry with proper condition checking,
fdb_delete_local() is suitable to use.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_delete_by_port() doesn't care about vlan and mac address of the
bridge device.
As the check is almost the same as mac address changing, slightly modify
fdb_delete_local() and use it.
Note that we can always set added_by_user to 0 in fdb_delete_local() because
- br_fdb_delete_by_port() calls fdb_delete_local() for local entries
regardless of its added_by_user. In this case, we have to check if another
port has the same address and vlan, and if found, we have to create the
entry (by changing dst). This is kernel-added entry, not user-added.
- br_fdb_changeaddr() doesn't call fdb_delete_local() for user-added entry.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_change_mac_address() doesn't check if the local entry has the
same address as any of bridge ports.
Although I'm not sure when it is beneficial, current implementation allow
the bridge device to receive any mac address of its ports.
To preserve this behavior, we have to check if the mac address of the
entry being deleted is identical to that of any port.
As this check is almost the same as that in br_fdb_changeaddr(), create
a common function fdb_delete_local() and call it from
br_fdb_changeadddr() and br_fdb_change_mac_address().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should take into account the followings when deleting a local fdb
entry.
- nbp_vlan_find() can be used only when vid != 0 to check if an entry is
deletable, because a fdb entry with vid 0 can exist at any time while
nbp_vlan_find() always return false with vid 0.
Example of problematic case:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address 12:34:56:78:90:ab
brctl addif br0 eth0
brctl addif br0 eth1
ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the fdb entry 12:34:56:78:90:ab will be deleted even though the
bridge port eth1 still has that address.
- The port to which the bridge device is attached might needs a local entry
if its mac address is set manually.
Example of problematic case:
ip link set eth0 address 12:34:56:78:90:ab
brctl addif br0 eth0
ip link set br0 address 12:34:56:78:90:ab
ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the fdb still must have the entry 12:34:56:78:90:ab, but it will be
deleted.
We can use br->dev->addr_assign_type to check if the address is manually
set or not, but I propose another approach.
Since we delete and insert local entries whenever changing mac address
of the bridge device, we can change dst of the entry to NULL regardless of
addr_assign_type when deleting an entry associated with a certain port,
and if it is found to be unnecessary later, then delete it.
That is, if changing mac address of a port, the entry might be changed
to its dst being NULL first, but is eventually deleted when recalculating
and changing bridge id.
This approach is especially useful when we want to share the code with
deleting vlan in which the bridge device might want such an entry regardless
of addr_assign_type, and makes things easy because we don't have to consider
if mac address of the bridge device will be changed or not at the time we
delete a local entry of a port, which means fdb code will not be bothered
even if the bridge id calculating logic is changed in the future.
Also, this change reduces inconsistent state, where frames whose dst is the
mac address of the bridge, can't reach the bridge because of premature fdb
entry deletion. This change reduces the possibility that the bridge device
replies unreachable mac address to arp requests, which could occur during
the short window between calling del_nbp() and br_stp_recalculate_bridge_id()
in br_del_if(). This will effective after br_fdb_delete_by_port() starts to
use the same code by following patch.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlan code may need fdb change when changing mac address of bridge device
even if it is caused by the mac address changing of a bridge port.
Example configuration:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
bridge vlan add dev br0 vid 10 self
bridge vlan add dev eth0 vid 10
We will have fdb entry such that f->dst == NULL, f->vlan_id == 10 and
f->addr == 12:34:56:78:90:ab at this time.
Next, change the mac address of eth0 to greater value.
ip link set eth0 address ee:ff:12:34:56:78
Then, mac address of br0 will be recalculated and set to aa:bb:cc:dd:ee:ff.
However, an entry aa:bb:cc:dd:ee:ff will not be created and we will be not
able to communicate using br0 on vlan 10.
Address this issue by deleting and adding local entries whenever
changing the mac address of the bridge device.
If there already exists an entry that has the same address, for example,
in case that br_fdb_changeaddr() has already inserted it,
br_fdb_change_mac_address() will simply fail to insert it and no
duplicated entry will be made, as it was.
This approach also needs br_add_if() to call br_fdb_insert() before
br_stp_recalculate_bridge_id() so that we don't create an entry whose
dst == NULL in this function to preserve previous behavior.
Note that this is a slight change in behavior where the bridge device can
receive the traffic to the new address before calling
br_stp_recalculate_bridge_id() in br_add_if().
However, it is not a problem because we have already the address on the
new port and such a way to insert new one before recalculating bridge id
is taken in br_device_event() as well.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have been always failed to delete the old entry at
br_fdb_change_mac_address() because br_set_mac_address() updates
dev->dev_addr before calling br_fdb_change_mac_address() and
br_fdb_change_mac_address() uses dev->dev_addr to find the old entry.
That update of dev_addr is completely unnecessary because the same work
is done in br_stp_change_bridge_id() which is called right away after
calling br_fdb_change_mac_address().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit bc9a25d21e ("bridge: Add vlan support for local fdb entries"),
br_fdb_changeaddr() has inserted a new local fdb entry only if it can
find old one. But if we have two ports where they have the same address
or user has deleted a local entry, there will be no entry for one of the
ports.
Example of problematic case:
ip link set eth0 address aa:bb:cc:dd:ee:ff
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
brctl addif br0 eth1 # eth1 will not have a local entry due to dup.
ip link set eth1 address 12:34:56:78:90:ab
Then, the new entry for the address 12:34:56:78:90:ab will not be
created, and the bridge device will not be able to communicate.
Insert new entries regardless of whether we can find old entries or not.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_changeaddr() assumes that there is at most one local entry per port
per vlan. It used to be true, but since commit 36fd2b63e3 ("bridge: allow
creating/deleting fdb entries via netlink"), it has not been so.
Therefore, the function might fail to search a correct previous address
to be deleted and delete an arbitrary local entry if user has added local
entries manually.
Example of problematic case:
ip link set eth0 address ee:ff:12:34:56:78
brctl addif br0 eth0
bridge fdb add 12:34:56:78:90:ab dev eth0 master
ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the address 12:34:56:78:90:ab might be deleted instead of
ee:ff:12:34:56:78, the original mac address of eth0.
Address this issue by introducing a new flag, added_by_user, to struct
net_bridge_fdb_entry.
Note that br_fdb_delete_by_port() has to set added_by_user to 0 in cases
like:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
bridge fdb add aa:bb:cc:dd:ee:ff dev eth0 master
brctl addif br0 eth1
brctl delif br0 eth0
In this case, kernel should delete the user-added entry aa:bb:cc:dd:ee:ff,
but it also should have been added by "brctl addif br0 eth1" originally,
so we don't delete it and treat it a new kernel-created entry.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 93d8bf9fb8 ("bridge: cleanup netpoll code") introduced
a check in br_netpoll_enable(), but this check is incorrect for
br_netpoll_setup(). This patch moves the code after the check
into __br_netpoll_enable() and calls it in br_netpoll_setup().
For br_add_if(), the check is still needed.
Fixes: 93d8bf9fb8 ("bridge: cleanup netpoll code")
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Tested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_handle_vlan() pushes HW accelerated vlan tag into skbuff when outgoing
port is the bridge device.
This is unnecessary because __netif_receive_skb_core() can handle skbs
with HW accelerated vlan tag. In current implementation,
__netif_receive_skb_core() needs to extract the vlan tag embedded in skb
data. This could cause low network performance especially when receiving
frames at a high frame rate on the bridge device.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And it can become static.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Minor nf_chain_type cleanups:
- reorder struct to plug a hoe
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In some cases we neither take a reference to the AF info nor to the
chain type, allowing the module to be unloaded while in use.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support to register chains to multiple hooks for different address
families for mixed IPv4/IPv6 tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Currently the AF-specific hook functions override the chain-type specific
hook functions. That doesn't make too much sense since the chain types
are a special case of the AF-specific hooks.
Make the AF-specific hook functions the default and make the optional
chain type hooks override them.
As a side effect, the necessary code restructuring reduces the code size,
f.i. in case of nf_tables_ipv4.o:
nf_tables_ipv4_init_net | -24
nft_do_chain_ipv4 | -113
2 functions changed, 137 bytes removed, diff: -137
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
net/ipv6/ip6_tunnel.c
net/ipv6/ip6_vti.c
ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.
qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use DEVICE_ATTR_RO/RW macros to simplify bridge sysfs attribute definitions.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_multicast_set_hash_max() is called from process context in
net/bridge/br_sysfs_br.c by the sysfs store_hash_max() function.
br_multicast_set_hash_max() calls spin_lock(&br->multicast_lock),
which can deadlock the CPU if a softirq that also tries to take the
same lock interrupts br_multicast_set_hash_max() while the lock is
held . This can happen quite easily when any of the bridge multicast
timers expire, which try to take the same lock.
The fix here is to use spin_lock_bh(), preventing other softirqs from
executing on this CPU.
Steps to reproduce:
1. Create a bridge with several interfaces (I used 4).
2. Set the "multicast query interval" to a low number, like 2.
3. Enable the bridge as a multicast querier.
4. Repeatedly set the bridge hash_max parameter via sysfs.
# brctl addbr br0
# brctl addif br0 eth1 eth2 eth3 eth4
# brctl setmcqi br0 2
# brctl setmcquerier br0 1
# while true ; do echo 4096 > /sys/class/net/br0/bridge/hash_max; done
Signed-off-by: Curt Brune <curt@cumulusnetworks.com>
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
They are same, so unify them as one, pcpu_sw_netstats.
Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats
from if_tunnel and remove br_cpu_netstats from br_private.h
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function __rtnl_af_register is never called outside this
code, and the return value is always 0.
Signed-off-by: David S. Miller <davem@davemloft.net>
That open brace { should be on the previous line.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spaces required before the open parenthesis '(', before the open
brace '{', after that ',' and around that '?/:'.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return is not a function, parentheses are not required.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because err is always negative, remove unnecessary condition
judgment.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix spelling errors in bridge driver.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various spelling fixes in networking stack
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
br_stp_rcv() is reached by non-rx_handler path. That means there is no
guarantee that dev is bridge port and therefore simple NULL check of
->rx_handler_data is not enough. There is need to check if dev is really
bridge port and since only rcu read lock is held here, do it by checking
->rx_handler pointer.
Note that synchronize_net() in netdev_rx_handler_unregister() ensures
this approach as valid.
Introduced originally by:
commit f350a0a873
"bridge: use rx_handler_data pointer to store net_bridge_port pointer"
Fixed but not in the best way by:
commit b5ed54e94d
"bridge: fix RCU races with bridge port"
Reintroduced by:
commit 716ec052d2
"bridge: fix NULL pointer deref of br_port_get_rcu"
Please apply to stable trees as well. Thanks.
RH bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=1025770
Reported-by: Laine Stump <laine@redhat.com>
Debugged-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several files refer to an old address for the Free Software Foundation
in the file header comment. Resolve by replacing the address with
the URL <http://www.gnu.org/licenses/> so that we do not have to keep
updating the header comments anytime the address changes.
CC: netfilter@vger.kernel.org
CC: Pablo Neira Ayuso <pablo@netfilter.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter fixes for net
The following patchset contains fixes for your net tree, they are:
* Remove extra quote from connlimit configuration in Kconfig, from
Randy Dunlap.
* Fix missing mss option in syn packets sent to the backend in our
new synproxy target, from Martin Topholm.
* Use window scale announced by client when sending the forged
syn to the backend, from Martin Topholm.
* Fix IPv6 address comparison in ebtables, from Luís Fernando
Cornachioni Estrozi.
* Fix wrong endianess in sequence adjustment which breaks helpers
in NAT configurations, from Phil Oester.
* Fix the error path handling of nft_compat, from me.
* Make sure the global conntrack counter is decremented after the
object has been released, also from me.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge dev
When the following commands are executed:
brctl addbr br0
ifconfig br0 hw ether <addr>
rmmod bridge
The calltrace will occur:
[ 563.312114] device eth1 left promiscuous mode
[ 563.312188] br0: port 1(eth1) entered disabled state
[ 563.468190] kmem_cache_destroy bridge_fdb_cache: Slab cache still has objects
[ 563.468197] CPU: 6 PID: 6982 Comm: rmmod Tainted: G O 3.12.0-0.7-default+ #9
[ 563.468199] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 563.468200] 0000000000000880 ffff88010f111e98 ffffffff814d1c92 ffff88010f111eb8
[ 563.468204] ffffffff81148efd ffff88010f111eb8 0000000000000000 ffff88010f111ec8
[ 563.468206] ffffffffa062a270 ffff88010f111ed8 ffffffffa063ac76 ffff88010f111f78
[ 563.468209] Call Trace:
[ 563.468218] [<ffffffff814d1c92>] dump_stack+0x6a/0x78
[ 563.468234] [<ffffffff81148efd>] kmem_cache_destroy+0xfd/0x100
[ 563.468242] [<ffffffffa062a270>] br_fdb_fini+0x10/0x20 [bridge]
[ 563.468247] [<ffffffffa063ac76>] br_deinit+0x4e/0x50 [bridge]
[ 563.468254] [<ffffffff810c7dc9>] SyS_delete_module+0x199/0x2b0
[ 563.468259] [<ffffffff814e0922>] system_call_fastpath+0x16/0x1b
[ 570.377958] Bridge firewalling registered
--------------------------- cut here -------------------------------
The reason is that when the bridge dev's address is changed, the
br_fdb_change_mac_address() will add new address in fdb, but when
the bridge was removed, the address entry in the fdb did not free,
the bridge_fdb_cache still has objects when destroy the cache, Fix
this by flushing the bridge address entry when removing the bridge.
v2: according to the Toshiaki Makita and Vlad's suggestion, I only
delete the vlan0 entry, it still have a leak here if the vlan id
is other number, so I need to call fdb_delete_by_port(br, NULL, 1)
to flush all entries whose dst is NULL for the bridge.
Suggested-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Suggested-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Mostly these are fixes for fallout due to merge window changes, as
well as cures for problems that have been with us for a much longer
period of time"
1) Johannes Berg noticed two major deficiencies in our genetlink
registration. Some genetlink protocols we passing in constant
counts for their ops array rather than something like
ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
using fixed IDs for their multicast groups.
We have to retain these fixed IDs to keep existing userland tools
working, but reserve them so that other multicast groups used by
other protocols can not possibly conflict.
In dealing with these two problems, we actually now use less state
management for genetlink operations and multicast groups.
2) When configuring interface hardware timestamping, fix several
drivers that simply do not validate that the hwtstamp_config value
is one the driver actually supports. From Ben Hutchings.
3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
4) In dev_forward_skb(), set the skb->protocol in the right order
relative to skb_scrub_packet(). From Alexei Starovoitov.
5) Bridge erroneously fails to use the proper wrapper functions to make
calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
Makita.
6) When detaching a bridge port, make sure to flush all VLAN IDs to
prevent them from leaking, also from Toshiaki Makita.
7) Put in a compromise for TCP Small Queues so that deep queued devices
that delay TX reclaim non-trivially don't have such a performance
decrease. One particularly problematic area is 802.11 AMPDU in
wireless. From Eric Dumazet.
8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
here. Fix from Eric Dumzaet, reported by Dave Jones.
9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
10) When computing mergeable buffer sizes, virtio-net fails to take the
virtio-net header into account. From Michael Dalton.
11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
bumping, this one has been with us for a while. From Eric Dumazet.
12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
Hugne.
13) 6lowpan bit used for traffic classification was wrong, from Jukka
Rissanen.
14) macvlan has the same issue as normal vlans did wrt. propagating LRO
disabling down to the real device, fix it the same way. From Michal
Kubecek.
15) CPSW driver needs to soft reset all slaves during suspend, from
Daniel Mack.
16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
17) The xen-netfront RX buffer refill timer isn't properly scheduled on
partial RX allocation success, from Ma JieYue.
18) When ipv6 ping protocol support was added, the AF_INET6 protocol
initialization cleanup path on failure was borked a little. Fix
from Vlad Yasevich.
19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
blocks we can do the wrong thing with the msg_name we write back to
userspace. From Hannes Frederic Sowa. There is another fix in the
works from Hannes which will prevent future problems of this nature.
20) Fix route leak in VTI tunnel transmit, from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
genetlink: make multicast groups const, prevent abuse
genetlink: pass family to functions using groups
genetlink: add and use genl_set_err()
genetlink: remove family pointer from genl_multicast_group
genetlink: remove genl_unregister_mc_group()
hsr: don't call genl_unregister_mc_group()
quota/genetlink: use proper genetlink multicast APIs
drop_monitor/genetlink: use proper genetlink multicast APIs
genetlink: only pass array to genl_register_family_with_ops()
tcp: don't update snd_nxt, when a socket is switched from repair mode
atm: idt77252: fix dev refcnt leak
xfrm: Release dst if this dst is improper for vti tunnel
netlink: fix documentation typo in netlink_set_err()
be2net: Delete secondary unicast MAC addresses during be_close
be2net: Fix unconditional enabling of Rx interface options
net, virtio_net: replace the magic value
ping: prevent NULL pointer dereference on write to msg_name
bnx2x: Prevent "timeout waiting for state X"
bnx2x: prevent CFC attention
bnx2x: Prevent panic during DMAE timeout
...
This bug was introduced on commit 0898f99a2. This just recovers two
checks that existed before as suggested by Bart De Schuymer.
Signed-off-by: Luís Fernando Cornachioni Estrozi <lestrozi@uolinc.com>
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We should call vlan_vid_del for all vids at nbp_vlan_flush to prevent
vid_info->refcount from being leaked when detaching a bridge port.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should use wrapper functions vlan_vid_[add/del] instead of
ndo_vlan_rx_[add/kill]_vid. Otherwise, we might be not able to communicate
using vlan interface in a certain situation.
Example of problematic case:
vconfig add eth0 10
brctl addif br0 eth0
bridge vlan add dev eth0 vid 10
bridge vlan del dev eth0 vid 10
brctl delif br0 eth0
In this case, we cannot communicate via eth0.10 because vlan 10 is
filtered by NIC that has the vlan filtering feature.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull core locking changes from Ingo Molnar:
"The biggest changes:
- add lockdep support for seqcount/seqlocks structures, this
unearthed both bugs and required extra annotation.
- move the various kernel locking primitives to the new
kernel/locking/ directory"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
block: Use u64_stats_init() to initialize seqcounts
locking/lockdep: Mark __lockdep_count_forward_deps() as static
lockdep/proc: Fix lock-time avg computation
locking/doc: Update references to kernel/mutex.c
ipv6: Fix possible ipv6 seqlock deadlock
cpuset: Fix potential deadlock w/ set_mems_allowed
seqcount: Add lockdep functionality to seqcount/seqlock structures
net: Explicitly initialize u64_stats_sync structures for lockdep
locking: Move the percpu-rwsem code to kernel/locking/
locking: Move the lglocks code to kernel/locking/
locking: Move the rwsem code to kernel/locking/
locking: Move the rtmutex code to kernel/locking/
locking: Move the semaphore core to kernel/locking/
locking: Move the spinlock code to kernel/locking/
locking: Move the lockdep code to kernel/locking/
locking: Move the mutex code to kernel/locking/
hung_task debugging: Add tracepoint to report the hang
x86/locking/kconfig: Update paravirt spinlock Kconfig description
lockstat: Report avg wait and hold times
lockdep, x86/alternatives: Drop ancient lockdep fixup message
...
In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.
The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.
This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.
Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.
Feedback would be appreciated!
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pablo Neira Ayuso says:
====================
This batch contains fives nf_tables patches for your net-next tree,
they are:
* Fix possible use after free in the module removal path of the
x_tables compatibility layer, from Dan Carpenter.
* Add filter chain type for the bridge family, from myself.
* Fix Kconfig dependencies of the nf_tables bridge family with
the core, from myself.
* Fix sparse warnings in nft_nat, from Tomasz Bursztyka.
* Remove duplicated include in the IPv4 family support for nf_tables,
from Wei Yongjun.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
This is another batch containing Netfilter/IPVS updates for your net-next
tree, they are:
* Six patches to make the ipt_CLUSTERIP target support netnamespace,
from Gao feng.
* Two cleanups for the nf_conntrack_acct infrastructure, introducing
a new structure to encapsulate conntrack counters, from Holger
Eitzenberger.
* Fix missing verdict in SCTP support for IPVS, from Daniel Borkmann.
* Skip checksum recalculation in SCTP support for IPVS, also from
Daniel Borkmann.
* Fix behavioural change in xt_socket after IP early demux, from
Florian Westphal.
* Fix bogus large memory allocation in the bitmap port set type in ipset,
from Jozsef Kadlecsik.
* Fix possible compilation issues in the hash netnet set type in ipset,
also from Jozsef Kadlecsik.
* Define constants to identify netlink callback data in ipset dumps,
again from Jozsef Kadlecsik.
* Use sock_gen_put() in xt_socket to replace xt_socket_put_sk,
from Eric Dumazet.
* Improvements for the SH scheduler in IPVS, from Alexander Frolkin.
* Remove extra delay due to unneeded rcu barrier in IPVS net namespace
cleanup path, from Julian Anastasov.
* Save some cycles in ip6t_REJECT by skipping checksum validation in
packets leaving from our stack, from Stanislav Fomichev.
* Fix IPVS_CMD_ATTR_MAX definition in IPVS, larger that required, from
Julian Anastasov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/netconsole.c
net/bridge/br_private.h
Three mostly trivial conflicts.
The net/bridge/br_private.h conflict was a function signature (argument
addition) change overlapping with the extern removals from Joe Perches.
In drivers/net/netconsole.c we had one change adjusting a printk message
whilst another changed "printk(KERN_INFO" into "pr_info(".
Lastly, the emulex change was a new inline function addition overlapping
with Joe Perches's extern removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently multicast code attempts to extrace the vlan id from
the skb even when vlan filtering is disabled. This can lead
to mdb entries being created with the wrong vlan id.
Pass the already extracted vlan id to the multicast
filtering code to make the correct id is used in
creation as well as lookup.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the filter chain type which is required to
create filter chains in the bridge family from userspace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
when CONFIG_NF_TABLES[_MODULE] is not enabled,
but CONFIG_NF_TABLES_BRIDGE is enabled:
net/bridge/netfilter/nf_tables_bridge.c: In function 'nf_tables_bridge_init_net':
net/bridge/netfilter/nf_tables_bridge.c:24:5: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:25:9: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:28:2: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:30:34: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:35:11: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c: In function 'nf_tables_bridge_exit_net':
net/bridge/netfilter/nf_tables_bridge.c:41:27: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:42:11: error: 'struct net' has no member named 'nft'
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pekka Pietikäinen reports xt_socket behavioural change after commit
00028aa37098o (netfilter: xt_socket: use IP early demux).
Reason is xt_socket now no longer does an unconditional sk lookup -
it re-uses existing skb->sk if possible, assuming ->sk was set by
ip early demux.
However, when netfilter is invoked via bridge, this can cause 'bogus'
sockets to be examined by the match, e.g. a 'tun' device socket.
bridge netfilter should orphan the skb just like the routing path
before invoking ipv4/ipv6 netfilter hooks to avoid this.
Reported-and-tested-by: Pekka Pietikäinen <pp@ee.oulu.fi>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
The following patchset contains three netfilter fixes for your net
tree, they are:
* A couple of fixes to resolve info leak to userspace due to uninitialized
memory area in ulogd, from Mathias Krause.
* Fix instruction ordering issues that may lead to the access of
uninitialized data in x_tables. The problem involves the table update
(producer) and the main packet matching (consumer) routines. Detected in
SMP ARMv7, from Will Deacon.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/qmi_wwan.c
include/net/dst.h
Trivial merge conflicts, both were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
While this commit was a good attempt to fix issues occuring when no
multicast querier is present, this commit still has two more issues:
1) There are cases where mdb entries do not expire even if there is a
querier present. The bridge will unnecessarily continue flooding
multicast packets on the according ports.
2) Never removing an mdb entry could be exploited for a Denial of
Service by an attacker on the local link, slowly, but steadily eating up
all memory.
Actually, this commit became obsolete with
"bridge: disable snooping if there is no querier" (b00589af3b)
which included fixes for a few more cases.
Therefore reverting the following commits (the commit stated in the
commit message plus three of its follow up fixes):
====================
Revert "bridge: update mdb expiration timer upon reports."
This reverts commit f144febd93.
Revert "bridge: do not call setup_timer() multiple times"
This reverts commit 1faabf2aab.
Revert "bridge: fix some kernel warning in multicast timer"
This reverts commit c7e8e8a8f7.
Revert "bridge: only expire the mdb entry when query is received"
This reverts commit 9f00b2e7cf.
====================
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Reviewed-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently set the value that variable vid is pointing, which will be
used in FDB later, to 0 at br_allowed_ingress() when we receive untagged
or priority-tagged frames, even though the PVID is valid.
This leads to FDB updates in such a wrong way that they are learned with
VID 0.
Update the value to that of PVID if the PVID is applied.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are using the VLAN_TAG_PRESENT bit to detect whether the PVID is
set or not at br_get_pvid(), while we don't care about the bit in
adding/deleting the PVID, which makes it impossible to forward any
incomming untagged frame with vlan_filtering enabled.
Since vid 0 cannot be used for the PVID, we can use vid 0 to indicate
that the PVID is not set, which is slightly more efficient than using
the VLAN_TAG_PRESENT.
Fix the problem by getting rid of using the VLAN_TAG_PRESENT.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1Q says that when we receive priority-tagged (VID 0) frames
use the PVID for the port as its VID.
(See IEEE 802.1Q-2011 6.9.1 and Table 9-2)
Apply the PVID to not only untagged frames but also priority-tagged frames.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1Q says that:
- VID 0 shall not be configured as a PVID, or configured in any Filtering
Database entry.
- VID 4095 shall not be configured as a PVID, or transmitted in a tag
header. This VID value may be used to indicate a wildcard match for the VID
in management operations or Filtering Database entries.
(See IEEE 802.1Q-2011 6.9.1 and Table 9-2)
Don't accept adding these VIDs in the vlan_filtering implementation.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit be4f154d5e
bridge: Clamp forward_delay when enabling STP
had a typo when attempting to clamp maximum forward delay.
It is possible to set bridge_forward_delay to be higher then
permitted maximum when STP is off. When turning STP on, the
higher then allowed delay has to be clamed down to max value.
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register family per netnamespace to ensure that sets are
only visible in its approapriate namespace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.
In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:
* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.
Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.
nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).
This patch includes the following components:
* the netlink API: net/netfilter/nf_tables_api.c and
include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
net/ipv4/netfilter/nf_tables_ipv4.c
net/ipv6/netfilter/nf_tables_ipv6.c
net/ipv4/netfilter/nf_tables_arp.c
net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
net/ipv4/netfilter/nf_table_route_ipv4.c
net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
include/net/netfilter/nf_tables.h
include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
net/netfilter/nft_expr_template.c
and the preliminary implementation of the meta target
net/netfilter/nft_meta_target.c
It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.
This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:
From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps
From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release
From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation
From Florian Westphal:
* nft_log: group is u16, snaplen u32
From Phil Oester:
* nf_tables: operational limit match
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pass the hook ops to the hookfn to allow for generic hook
functions. This change is required by nf_tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit 9f00b2e7cf
bridge: only expire the mdb entry when query is received
changed the mdb expiration timer to be armed only when QUERY is
received. Howerver, this causes issues in an environment where
the multicast server socket comes and goes very fast while a client
is trying to send traffic to it.
The root cause is a race where a sequence of LEAVE followed by REPORT
messages can race against QUERY messages generated in response to LEAVE.
The QUERY ends up starting the expiration timer, and that timer can
potentially expire after the new REPORT message has been received signaling
the new join operation. This leads to a significant drop in multicast
traffic and possible complete stall.
The solution is to have REPORT messages update the expiration timer
on entries that already exist.
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the memset/memcpy uses of 6 to ETH_ALEN
where appropriate.
Also convert some struct definitions and u8 array
declarations of [6] to ETH_ALEN.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ulog messages leak heap bytes by the means of padding bytes and
incompletely filled string arrays. Fix those by memset(0)'ing the
whole struct before filling it.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The NULL deref happens when br_handle_frame is called between these
2 lines of del_nbp:
dev->priv_flags &= ~IFF_BRIDGE_PORT;
/* --> br_handle_frame is called at this time */
netdev_rx_handler_unregister(dev);
In br_handle_frame the return of br_port_get_rcu(dev) is dereferenced
without check but br_port_get_rcu(dev) returns NULL if:
!(dev->priv_flags & IFF_BRIDGE_PORT)
Eric Dumazet pointed out the testing of IFF_BRIDGE_PORT is not necessary
here since we're in rcu_read_lock and we have synchronize_net() in
netdev_rx_handler_unregister. So remove the testing of IFF_BRIDGE_PORT
and by the previous patch, make sure br_port_get_rcu is called in
bridging code.
Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
current br_port_get_rcu is problematic in bridging path
(NULL deref). Change these calls in netlink path first.
Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At some point limits were added to forward_delay. However, the
limits are only enforced when STP is enabled. This created a
scenario where you could have a value outside the allowed range
while STP is disabled, which then stuck around even after STP
is enabled.
This patch fixes this by clamping the value when we enable STP.
I had to move the locking around a bit to ensure that there is
no window where someone could insert a value outside the range
while we're in the middle of enabling STP.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
Signed-off-by: David S. Miller <davem@davemloft.net>
This changes the message_age_timer calculation to use the BPDU's max age as
opposed to the local bridge's max age. This is in accordance with section
8.6.2.3.2 Step 2 of the 802.1D-1998 sprecification.
With the current implementation, when running with very large bridge
diameters, convergance will not always occur even if a root bridge is
configured to have a longer max age.
Tested successfully on bridge diameters of ~200.
Signed-off-by: Chris Healy <cphealy@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking changes from David Miller:
"Noteworthy changes this time around:
1) Multicast rejoin support for team driver, from Jiri Pirko.
2) Centralize and simplify TCP RTT measurement handling in order to
reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
both timestamps and local RTT measurements are available prefer
the later because there are broken middleware devices which
scramble the timestamp.
From Yuchung Cheng.
3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
memory consumed to queue up unsend user data. From Eric Dumazet.
4) Add a "physical port ID" abstraction for network devices, from
Jiri Pirko.
5) Add a "suppress" operation to influence fib_rules lookups, from
Stefan Tomanek.
6) Add a networking development FAQ, from Paul Gortmaker.
7) Extend the information provided by tcp_probe and add ipv6 support,
from Daniel Borkmann.
8) Use RCU locking more extensively in openvswitch data paths, from
Pravin B Shelar.
9) Add SCTP support to openvswitch, from Joe Stringer.
10) Add EF10 chip support to SFC driver, from Ben Hutchings.
11) Add new SYNPROXY netfilter target, from Patrick McHardy.
12) Compute a rate approximation for sending in TCP sockets, and use
this to more intelligently coalesce TSO frames. Furthermore, add
a new packet scheduler which takes advantage of this estimate when
available. From Eric Dumazet.
13) Allow AF_PACKET fanouts with random selection, from Daniel
Borkmann.
14) Add ipv6 support to vxlan driver, from Cong Wang"
Resolved conflicts as per discussion.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
openvswitch: Fix alignment of struct sw_flow_key.
netfilter: Fix build errors with xt_socket.c
tcp: Add missing braces to do_tcp_setsockopt
caif: Add missing braces to multiline if in cfctrl_linkup_request
bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
vxlan: Fix kernel panic on device delete.
net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
icplus: Use netif_running to determine device state
ethernet/arc/arc_emac: Fix huge delays in large file copies
tuntap: orphan frags before trying to set tx timestamp
tuntap: purge socket error queue on detach
qlcnic: use standard NAPI weights
ipv6:introduce function to find route for redirect
bnx2x: VF RSS support - VF side
bnx2x: VF RSS support - PF side
vxlan: Notify drivers for listening UDP port changes
net: usbnet: update addr_assign_type if appropriate
driver/net: enic: update enic maintainers and driver
driver/net: enic: Exposing symbols for Cisco's low latency driver
...
Conflicts:
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
net/bridge/br_multicast.c
net/ipv6/sit.c
The conflicts were minor:
1) sit.c changes overlap with change to ip_tunnel_xmit() signature.
2) br_multicast.c had an overlap between computing max_delay using
msecs_to_jiffies and turning MLDV2_MRC() into an inline function
with a name using lowercase instead of uppercase letters.
3) stmmac had two overlapping changes, one which conditionally allocated
and hooked up a dma_cfg based upon the presence of the pbl OF property,
and another one handling store-and-forward DMA made. The latter of
which should not go into the new of_find_property() basic block.
Signed-off-by: David S. Miller <davem@davemloft.net>
The multicast snooping code should have matured enough to be safely
applicable to IPv6 link-local multicast addresses (excluding the
link-local all nodes address, ff02::1), too.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently if there is no listener for a certain group then IPv6 packets
for that group are flooded on all ports, even though there might be no
host and router interested in it on a port.
With this commit they are only forwarded to ports with a multicast
router.
Just like commit bd4265fe36 ("bridge: Only flood unregistered groups
to routers") did for IPv4, let's do the same for IPv6 with the same
reasoning.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt
hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS
97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ
NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+
XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR
+0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy
L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7
e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW
cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa
SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50
GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT
BiHX6YFWtDDqRlVv8Q0F
=LeaW
-----END PGP SIGNATURE-----
Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull PTR_RET() removal patches from Rusty Russell:
"PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle"
[ There are still some PTR_RET users scattered about, with some of them
possibly being new, but most of them existing in Rusty's tree too. We
have that
#define PTR_RET(p) PTR_ERR_OR_ZERO(p)
thing in <linux/err.h>, so they continue to work for now - Linus ]
* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
staging/zcache: don't use PTR_RET().
remoteproc: don't use PTR_RET().
pinctrl: don't use PTR_RET().
acpi: Replace weird use of PTR_RET.
s390: Replace weird use of PTR_RET.
PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
PTR_RET is now PTR_ERR_OR_ZERO
Get rid of MLDV2_MRC and use our new macros for mantisse and
exponent to calculate Maximum Response Delay out of the Maximum
Response Code.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking into MLDv1/v2 code, I noticed that bridging code does
not convert it's max delay into jiffies for MLDv2 messages as we do
in core IPv6' multicast code.
RFC3810, 5.1.3. Maximum Response Code says:
The Maximum Response Code field specifies the maximum time allowed
before sending a responding Report. The actual time allowed, called
the Maximum Response Delay, is represented in units of milliseconds,
and is derived from the Maximum Response Code as follows: [...]
As we update timers that work with jiffies, we need to convert it.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Linus Lüssing <linus.luessing@web.de>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we would still potentially suffer multicast packet loss if there
is just either an IGMP or an MLD querier: For the former case, we would
possibly drop IPv6 multicast packets, for the latter IPv4 ones. This is
because we are currently assuming that if either an IGMP or MLD querier
is present that the other one is present, too.
This patch makes the behaviour and fix added in
"bridge: disable snooping if there is no querier" (b00589af3b)
to also work if there is either just an IGMP or an MLD querier on the
link: It refines the deactivation of the snooping to be protocol
specific by using separate timers for the snooped IGMP and MLD queries
as well as separate timers for our internal IGMP and MLD queriers.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some slave devices may have set a dev->needed_headroom value which is
different than the default one, most likely in order to prepend a
hardware descriptor in front of the Ethernet frame to send. Whenever a
new slave is added to a bridge, ensure that we update the
needed_headroom value accordingly to account for the slave
needed_headroom value.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/iwlwifi/pcie/trans.c
include/linux/inetdevice.h
The inetdevice.h conflict involves moving the IPV4_DEVCONF values
into a UAPI header, overlapping additions of some new entries.
The iwlwifi conflict is a context overlap.
Signed-off-by: David S. Miller <davem@davemloft.net>
The VLAN code needs to know the length of the per-port VLAN bitmap to
perform its most basic operations (retrieving VLAN informations, removing
VLANs, forwarding database manipulation, etc). Unfortunately, in the
current implementation we are using a macro that indicates the bitmap
size in longs in places where the size in bits is expected, which in
some cases can cause what appear to be random failures.
Use the correct macro.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_sysfs_if.c is for sysfs attributes of bridge ports, while br_sysfs_br.c
is for sysfs attributes of bridge itself. Correct the comment here.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we are reading an uninitialized value for the max_delay
variable when snooping an MLD query message of invalid length and would
update our timers with that.
Fixing this by simply ignoring such broken MLD queries (just like we do
for IGMP already).
This is a regression introduced by:
"bridge: disable snooping if there is no querier" (b00589af3b)
Reported-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use of RCU here with out marked pointer and function doesn't match prototype
with sparse.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If there is no querier on a link then we won't get periodic reports and
therefore won't be able to learn about multicast listeners behind ports,
potentially leading to lost multicast packets, especially for multicast
listeners that joined before the creation of the bridge.
These lost multicast packets can appear since c5c2326059
("bridge: Add multicast_querier toggle and disable queries by default")
in particular.
With this patch we are flooding multicast packets if our querier is
disabled and if we didn't detect any other querier.
A grace period of the Maximum Response Delay of the querier is added to
give multicast responses enough time to arrive and to be learned from
before disabling the flooding behaviour again.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This started out with fixing a sparse warning, then I realized that
the wrapper function br_netpoll_info could just be collapsed away
by rolling it into the enable code.
Also, eliminate unnecessary goto's
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, bond_resend_igmp_join_requests() looks for vlans attached to
bonding device, bridge where bonding act as port manually. It does not
care of other scenarios, like stacked bonds or team device above. Make
this more generic and use netdev notifier to propagate the event to
upper devices and to actually call ip_mc_rejoin_groups().
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9f00b2e7cf ("bridge: only expire the mdb entry when query is
received") added a nasty bug as an active timer can be reinitialized.
setup_timer() must be done once, no matter how many time mod_timer()
is called. br_multicast_new_group() is the right place to do this.
Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Diagnosed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several people reported the warning: "kernel BUG at kernel/timer.c:729!"
and the stack trace is:
#7 [ffff880214d25c10] mod_timer+501 at ffffffff8106d905
#8 [ffff880214d25c50] br_multicast_del_pg.isra.20+261 at ffffffffa0731d25 [bridge]
#9 [ffff880214d25c80] br_multicast_disable_port+88 at ffffffffa0732948 [bridge]
#10 [ffff880214d25cb0] br_stp_disable_port+154 at ffffffffa072bcca [bridge]
#11 [ffff880214d25ce8] br_device_event+520 at ffffffffa072a4e8 [bridge]
#12 [ffff880214d25d18] notifier_call_chain+76 at ffffffff8164aafc
#13 [ffff880214d25d50] raw_notifier_call_chain+22 at ffffffff810858f6
#14 [ffff880214d25d60] call_netdevice_notifiers+45 at ffffffff81536aad
#15 [ffff880214d25d80] dev_close_many+183 at ffffffff81536d17
#16 [ffff880214d25dc0] rollback_registered_many+168 at ffffffff81537f68
#17 [ffff880214d25de8] rollback_registered+49 at ffffffff81538101
#18 [ffff880214d25e10] unregister_netdevice_queue+72 at ffffffff815390d8
#19 [ffff880214d25e30] __tun_detach+272 at ffffffffa074c2f0 [tun]
#20 [ffff880214d25e88] tun_chr_close+45 at ffffffffa074c4bd [tun]
#21 [ffff880214d25ea8] __fput+225 at ffffffff8119b1f1
#22 [ffff880214d25ef0] ____fput+14 at ffffffff8119b3fe
#23 [ffff880214d25f00] task_work_run+159 at ffffffff8107cf7f
#24 [ffff880214d25f30] do_notify_resume+97 at ffffffff810139e1
#25 [ffff880214d25f50] int_signal+18 at ffffffff8164f292
this is due to I forgot to check if mp->timer is armed in
br_multicast_del_pg(). This bug is introduced by
commit 9f00b2e7cf (bridge: only expire the mdb entry
when query is received).
Same for __br_mdb_del().
Tested-by: poma <pomidorabelisima@gmail.com>
Reported-by: LiYonghua <809674045@qq.com>
Reported-by: Robert Hancock <hancockrwd@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check for all-zero ether address was removed from rtnetlink core,
since Vxlan uses all-zero ether address to signify default address.
Need to add check back in for bridge.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/Kconfig
drivers/net/xen-netback/netback.c
net/batman-adv/bat_iv_ogm.c
net/wireless/nl80211.c
The ath9k Kconfig conflict was a change of a Kconfig option name right
next to the deletion of another option.
The xen-netback conflict was overlapping changes involving the
handling of the notify list in xen_netbk_rx_action().
Batman conflict resolution provided by Antonio Quartulli, basically
keep everything in both conflict hunks.
The nl80211 conflict is a little more involved. In 'net' we added a
dynamic memory allocation to nl80211_dump_wiphy() to fix a race that
Linus reported. Meanwhile in 'net-next' the handlers were converted
to use pre and post doit handlers which use a flag to determine
whether to hold the RTNL mutex around the operation.
However, the dump handlers to not use this logic. Instead they have
to explicitly do the locking. There were apparent bugs in the
conversion of nl80211_dump_wiphy() in that we were not dropping the
RTNL mutex in all the return paths, and it seems we very much should
be doing so. So I fixed that whilst handling the overlapping changes.
To simplify the initial returns, I take the RTNL mutex after we try
to allocate 'tb'.
Signed-off-by: David S. Miller <davem@davemloft.net>
General Queries (the one with the Multicast Address field
set to zero / '::') are supposed to have a Maximum Response Delay
of [Query Response Interval], while for Multicast-Address-Specific
Queries it is [Last Listener Query Interval] - not the other way
round. (see RFC2710, section 7.3+7.8)
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce the uses of this unnecessary typedef.
Done via perl script:
$ git grep --name-only -w ctl_table net | \
xargs perl -p -i -e '\
sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'
Reflow the modified lines that now exceed 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a flag to control flood of unicast traffic. By default, flood is
on and the bridge will flood unicast traffic if it doesn't know
the destination. When the flag is turned off, unicast traffic
without an FDB will not be forwarded to the specified port.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow user to control whether mac learning is enabled on the port.
By default, mac learning is enabled. Disabling mac learning will
cause new dynamic FDB entries to not be created for a particular port.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/netfilter/nf_log.c
The conflict in nf_log.c is that in 'net' we added CONFIG_PROC_FS
protection around foo_proc_entry() calls to fix a build failure,
whereas in Pablo's tree a guard if() test around a call is
remove_proc_entry() was removed. Trivially resolved.
Pablo Neira Ayuso says:
====================
The following patchset contains the first batch of
Netfilter/IPVS updates for your net-next tree, they are:
* Three patches with improvements and code refactorization
for nfnetlink_queue, from Florian Westphal.
* FTP helper now parses replies without brackets, as RFC1123
recommends, from Jeff Mahoney.
* Rise a warning to tell everyone about ULOG deprecation,
NFLOG has been already in the kernel tree for long time
and supersedes the old logging over netlink stub, from
myself.
* Don't panic if we fail to load netfilter core framework,
just bail out instead, from myself.
* Add cond_resched_rcu, used by IPVS to allow rescheduling
while walking over big hashtables, from Simon Horman.
* Change type of IPVS sysctl_sync_qlen_max sysctl to avoid
possible overflow, from Zhang Yanfei.
* Use strlcpy instead of strncpy to skip zeroing of already
initialized area to write the extension names in ebtables,
from Chen Gang.
* Use already existing per-cpu notrack object from xt_CT,
from Eric Dumazet.
* Save explicit socket lookup in xt_socket now that we have
early demux, also from Eric Dumazet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
v2->v3: fix typo on simeth
shortened dev_getter
shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge net into net-next because some upcoming net-next changes
build on top of bug fixes that went into net.
Signed-off-by: David S. Miller <davem@davemloft.net>
When vlan device is configured on top of the brige, it does
not support any offload capabilities because the bridge
device does not initiliaze vlan_fatures. Set vlan_fatures to
be equivalent to hw_fatures.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This target has been superseded by NFLOG. Spot a warning
so we prepare removal in a couple of years.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
'name' has already set all zero when it is defined, so not need let
strncpy() to pad it again.
'name' is a string, better always let is NUL terminated, so use
strlcpy() instead of strncpy().
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Acked-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Continue sending queries when leave is received if the user marks
it as a querier.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Adam Baker <linux@baker-net.org.uk>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we arm the expire timer when the mdb entry is added,
however, this causes problem when there is no querier sent
out after that.
So we should only arm the timer when a corresponding query is
received, as suggested by Herbert.
And he also mentioned "if there is no querier then group
subscriptions shouldn't expire. There has to be at least one querier
in the network for this thing to work. Otherwise it just degenerates
into a non-snooping switch, which is OK."
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Adam Baker <linux@baker-net.org.uk>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quote from Adam:
"If it is believed that the use of 0.0.0.0
as the IP address is what is causing strange behaviour on other devices
then is there a good reason that a bridge rather than a router shouldn't
be the active querier? If not then using the bridge IP address and
having the querier enabled by default may be a reasonable solution
(provided that our querier obeys the election rules and shuts up if it
sees a query from a lower IP address that isn't 0.0.0.0). Just because a
device is the elected querier for IGMP doesn't appear to mean it is
required to perform any other routing functions."
And introduce a new troggle for it, as suggested by Herbert.
Suggested-by: Adam Baker <linux@baker-net.org.uk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Adam Baker <linux@baker-net.org.uk>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since (69b34fb netfilter: xt_LOG: add net namespace support
for xt_LOG), we hit this:
[ 4224.708977] BUG: unable to handle kernel NULL pointer dereference at 0000000000000388
[ 4224.709074] IP: [<ffffffff8147f699>] ipt_log_packet+0x29/0x270
when callling log functions from conntrack both in and out
are NULL i.e. the net pointer is invalid.
Adding struct net *net in call to nf_logfn() will secure that
there always is a vaild net ptr.
Reported as netfilter's bugzilla bug 818:
https://bugzilla.netfilter.org/show_bug.cgi?id=818
Reported-by: Ronald <ronald645@gmail.com>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A bridge should only send topology change notice if it is not
the root bridge. It is possible for message age timer to elect itself
as a new root bridge, and still have a topology change timer running
but waiting for bridge lock on other CPU.
Solve the race by checking if we are root bridge before continuing.
This was the root cause of the cases where br_send_tcn_bpdu would OOPS.
Reported-by: JerryKang <jerry.kang@samsung.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge can crash while trying to send topology change packet.
This happens if root port can't be found. This was reported by user
but currently unable to reproduce it easily. The STP conditions that cause
this are not known yet, but the problem doesn't have to be fatal.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current bridge fdb update code does not seem to update the port
during fdb update. This patch adds a check for fdb dst (port)
change during fdb update. Also rearranges the call to
fdb_notify to send only one notification for create and update.
Changelog:
v2 - Change notify flag to bool
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
drivers/net/ethernet/intel/igb/igb_main.c
drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
include/net/scm.h
net/batman-adv/routing.c
net/ipv4/tcp_input.c
The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.
The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.
An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.
Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.
Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.
Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a protocol argument to the VLAN packet tagging functions. In case of HW
tagging, we need that protocol available in the ndo_start_xmit functions,
so it is stored in a new field in the skb. The new field fits into a hole
(on 64 bit) and doesn't increase the sks's size.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the encapsulation protocol value a property of VLAN devices and change
the device lookup functions to take the protocol value into account.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the rx_{add,kill}_vid callbacks to take a protocol argument in
preparation of 802.1ad support. The protocol argument used so far is
always htons(ETH_P_8021Q).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the hardware VLAN acceleration features to include "CTAG" to indicate
that they only support CTAGs. Follow up patches will introduce 802.1ad
server provider tagging (STAGs) and require the distinction for hardware not
supporting acclerating both.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Keep a STP port path cost value if it was set by a user.
Don't replace it with the link-speed based path cost
whenever the link goes down and comes back up.
Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter and IPVS updates for
your net-next tree, most relevantly they are:
* Add net namespace support to NFLOG, ULOG and ebt_ulog and NFQUEUE.
The LOG and ebt_log target has been also adapted, but they still
depend on the syslog netnamespace that seems to be missing, from
Gao Feng.
* Don't lose indications of congestion in IPv6 fragmentation handling,
from Hannes Frederic Sowa.i
* IPVS conversion to use RCU, including some code consolidation patches
and optimizations, also some from Julian Anastasov.
* cpu fanout support for NFQUEUE, from Holger Eitzenberger.
* Better error reporting to userspace when dropping packets from
all our _*_[xfrm|route]_me_harder functions, from Patrick McHardy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add pernet support to ebt_ulog by means of the new nf_log_set
function added in (30e0c6a netfilter: nf_log: prepare net
namespace support for loggers).
This patch also make ulog_buffers and netlink socket
ebtulognl per netns.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add pernet support to ebt_log by means of the new nf_log_set
function added in (30e0c6a netfilter: nf_log: prepare net
namespace support for loggers).
Since syslog ns has yet not been implemented, we don't want
the containers to DDOS host's syslogd. So only enable ebt_log
only from init_net and wait for syslog ns support.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds netns support to nf_log and it prepares netns
support for existing loggers. It is composed of four major
changes.
1) nf_log_register has been split to two functions: nf_log_register
and nf_log_set. The new nf_log_register is used to globally
register the nf_logger and nf_log_set is used for enabling
pernet support from nf_loggers.
Per netns is not yet complete after this patch, it comes in
separate follow up patches.
2) Add net as a parameter of nf_log_bind_pf. Per netns is not
yet complete after this patch, it only allows to bind the
nf_logger to the protocol family from init_net and it skips
other cases.
3) Adapt all nf_log_packet callers to pass netns as parameter.
After this patch, this function only works for init_net.
4) Make the sysctl net/netfilter/nf_log pernet.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit 00cfec3748 (net: add a synchronize_net() in
netdev_rx_handler_unregister())
allows us to remove the synchronized_net() call from del_nbp()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new constant ETH_P_802_3_MIN, the minimum ethernet type for
an 802.3 frame. Frames with a lower value in the ethernet type field
are Ethernet II.
Also update all the users of this value that David Miller and
I could find to use the new constant.
Also correct a bug in util.c. The comparison with ETH_P_802_3_MIN
should be >= not >.
As suggested by Jesse Gross.
Compile tested only.
Cc: David Miller <davem@davemloft.net>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: John W. Linville <linville@tuxdriver.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Bart De Schuymer <bart.de.schuymer@pandora.be>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: linux-bluetooth@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: linux-wireless@vger.kernel.org
Cc: linux1394-devel@lists.sourceforge.net
Cc: linux-media@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: dev@openvswitch.org
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/ipip.h
The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter/IPVS updates for
your net-next tree, they are:
* Better performance in nfnetlink_queue by avoiding copy from the
packet to netlink message, from Eric Dumazet.
* Remove unnecessary locking in the exit path of ebt_ulog, from Gao Feng.
* Use new function ipv6_iface_scope_id in nf_ct_ipv6, from Hannes Frederic Sowa.
* A couple of sparse fixes for IPVS, from Julian Anastasov.
* Use xor hashing in nfnetlink_queue, as suggested by Eric Dumazet, from
myself.
* Allow to dump expectations per master conntrack via ctnetlink, from myself.
* A couple of cleanups to use PTR_RET in module init path, from Silviu-Mihai
Popescu.
* Remove nf_conntrack module a bit faster if netns are in use, from
Vladimir Davydov.
* Use checksum_partial in ip6t_NPT, from YOSHIFUJI Hideaki.
* Sparse fix for nf_conntrack, from Stephen Hemminger.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When I tried to set mac address of a bridge interface to a mac
address which already learned on this bridge, I got system hang.
The cause is straight forward: function br_fdb_change_mac_address
calls fdb_insert with NULL source nbp. Then an fdb lookup is
performed. If an fdb entry is found and it's local, it's OK. But
if it's not local, source is dereferenced for printk without NULL
check.
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When neither IFF_BRIDGE nor IFF_BRIDGE_PORT is set,
and afspec == NULL but protinfo != NULL, we run into
"if (err == 0) br_ifinfo_notify(RTM_NEWLINK, p);" with
random value in ret.
Thanks to Sergei for pointing out the error in commit comments.
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With decnet converted, we can finally get rid of rta_buf and its
computations around it. It also gets rid of the minimal header
length verification since all message handlers do that explicitly
anyway.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of the support was already there. The only thing that was missing
was the call to set the flag. Add this call.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need for spinlock to protect the netlink skb in the
ebt_ulog_fini path. We are sure there is noone using it
at that stage.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This uses PTR_RET instead of IS_ERR and PTR_ERR in order to increase
readability.
Signed-off-by: Silviu-Mihai Popescu <silviupopescu1990@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Using for_each_set_bit() to simplify the code.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using for_each_set_bit_from() to simplify the code.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/e1000e/netdev.c
Minor conflict in e1000e, a line that got fixed in 'net'
has been removed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge multicast fast leave feature was added sufficient space
was not reserved in the netlink message. This means the flag may be
lost in netlink events and results of queries.
Found by observation while looking up some netlink stuff for discussion with Vlad.
Problem introduced by commit c2d3babfaf
Author: David S. Miller <davem@davemloft.net>
Date: Wed Dec 5 16:24:45 2012 -0500
bridge: implement multicast fast leave
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridging code discloses heap and stack bytes via the RTM_GETMDB
netlink interface and via the notify messages send to group RTNLGRP_MDB
afer a successful add/del.
Fix both cases by initializing all unset members/padding bytes with
memset(0).
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Obviously, vid should be considered when searching for multicast
group.
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They are only used within this file.
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an ability to configure a separate "untagged" egress
policy to the VLAN information of the bridge. This superseeds PVID
policy and makes PVID ingress-only. The policy is configured with a
new flag and is represented as a port bitmap per vlan. Egress frames
with a VLAN id in "untagged" policy bitmap would egress
the port without VLAN header.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When VLAN is added to the port, a local fdb entry for that port
(the entry with the mac address of the port) is added for that
VLAN. This way we can correctly determine if the traffic
is for the bridge itself. If the address of the port changes,
we try to change all the local fdb entries we have for that port.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a user adds bridge neighbors, allow him to specify VLAN id.
If the VLAN id is not specified, the neighbor will be added
for VLANs currently in the ports filter list. If no VLANs are
configured on the port, we use vlan 0 and only add 1 entry.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add vlan_id to multicasts groups so that we know which vlan
each group belongs to and can correctly forward to appropriate vlan.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds vlan to unicast fdb entries that are created for
learned addresses (not the manually configured ones). It adds
vlan id into the hash mix and uses vlan as an addditional parameter
for an entry match.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A user may designate a certain vlan as PVID. This means that
any ingress frame that does not contain a vlan tag is assigned to
this vlan and any forwarding decisions are made with this vlan in mind.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At ingress, any untagged traffic is assigned to the PVID.
Any tagged traffic is filtered according to membership bitmap.
At egress, if the vlan matches the PVID, the frame is sent
untagged. Otherwise the frame is sent tagged.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using the RTM_GETLINK dump the vlan filter list of a given
bridge port. The information depends on setting the filter
flag similar to how nic VF info is dumped.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a netlink interface to add and remove vlan configuration on bridge port.
The interface uses the RTM_SETLINK message and encodes the vlan
configuration inside the IFLA_AF_SPEC. It is possble to include multiple
vlans to either add or remove in a single message.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When bridge forwards a frame, make sure that a frame is allowed
to egress on that port.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a frame arrives on a port or transmitted by the bridge,
if we have VLANs configured, validate that a given VLAN is allowed
to enter the bridge.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds an optional infrustructure component to bridge that would allow
native vlan filtering in the bridge. Each bridge port (as well
as the bridge device) now get a VLAN bitmap. Each bit in the bitmap
is associated with a vlan id. This way if the bit corresponding to
the vid is set in the bitmap that the packet with vid is allowed to
enter and exit the port.
Write access the bitmap is protected by RTNL and read access
protected by RCU.
Vlan functionality is disabled by default.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
The bnx2x gso_type setting bug fix in 'net' conflicted with
changes in 'net-next' that broke the gso_* setting logic
out into a seperate function, which also fixes the bug in
question. Thus, use the 'net-next' version.
Signed-off-by: David S. Miller <davem@davemloft.net>
__netpoll_rcu_free is used to free netpoll structures when the rtnl_lock is
already held. The mechanism is used to asynchronously call __netpoll_cleanup
outside of the holding of the rtnl_lock, so as to avoid deadlock.
Unfortunately, __netpoll_cleanup modifies pointers (dev->np), which means the
rtnl_lock must be held while calling it. Further, it cannot be held, because
rcu callbacks may be issued in softirq contexts, which cannot sleep.
Fix this by converting the rcu callback to a work queue that is guaranteed to
get scheduled in process context, so that we can hold the rtnl properly while
calling __netpoll_cleanup
Tested successfully by myself.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Cong Wang <amwang@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spanning Tree Protocol packets should have always been marked as
control packets, this causes them to get queued in the high prirority
FIFO. As Radia Perlman mentioned in her LCA talk, STP dies if bridge
gets overloaded and can't communicate. This is a long-standing bug back
to the first versions of Linux bridge.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
And remove no longer used br->flags.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As in del_timer() there has already placed a timer_pending() function
to check whether the timer to be deleted is pending or not, it's
unnecessary to check timer pending state again before del_timer() is
called.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
since the mdb table is belong to bridge device,and the
bridge device can only be seen in one netns.
So it's safe to allow unprivileged user which is the
creator of userns and netns to modify the mdb table.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ebt_table is a private resource of netns, operating ebtables
in one netns will not affect other netns, we can allow the
creator user of userns and netns to change the ebtables.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
Documentation/networking/ip-sysctl.txt
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
Both conflicts were simply overlapping context.
A build fix for qlcnic is in here too, simply removing the added
devinit annotations which no longer exist.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use strlcpy where possible to ensure the string is \0 terminated.
Use always sizeof(string) instead of 32, ETHTOOL_BUSINFO_LEN
and custom defines.
Use snprintf instead of sprint.
Remove unnecessary inits of ->fw_version
Remove unnecessary inits of drvinfo struct.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
NET_ADDR_SET is set in dev_set_mac_address() no need to alter
dev->addr_assign_type value in drivers.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds empty br_mdb_init() and br_mdb_uninit() definitions in
br_private.h to avoid build failure when CONFIG_BRIDGE_IGMP_SNOOPING is not set.
These methods were moved from br_multicast.c to br_netlink.c by
commit 3ec8e9f085
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 63233159fd:
bridge: Do not unregister all PF_BRIDGE rtnl operations
introduced a bug where a removal of a single bridge from a
multi-bridge system would remove MDB netlink handlers.
The handlers should only be removed once all bridges are gone, but
since we don't keep track of the number of bridge interfaces, it's
simpler to do it when the bridge module is unloaded. To make it
consistent, move the registration code into module initialization
code path.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge link detection should follow the operational state
of the lower device, rather than the carrier bit. This allows devices
like tunnels that are controlled by userspace control plane to work
with bridge STP link management.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When netdev_set_master faild in br_add_if, we should
call br_netpoll_disable to do some cleanup jobs,such
as free the memory of struct netpoll which allocated
in br_netpoll_enable.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When dumping mdb table, set the addresses the kernel returns
based on the address protocol type.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge fdb and link rtnl operations are registered in
core/rtnetlink. Bridge mdb operations are registred
in bridge/mdb. When removing bridge module, do not
unregister ALL PF_BRIDGE ops since that would remove
the ops from rtnetlink as well. Do remove mdb ops when
bridge is destroyed.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a flag to each mdb entry, so that we can distinguish
permanent entries with temporary entries.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by Stephen Hemminger, this remove the temporary variable
introduced in commit eca2a43bb0
("bridge: fix icmpv6 endian bug and other sparse warnings")
Signed-off-by: Ang Way Chuang <wcang@sfc.wide.ad.jp>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the warnings reported by sparse on recent bridge multicast
changes. Mostly just rcu annotation issues but in this case
sparse found a real bug! The ICMPv6 mld2 query mrc
values is in network byte order.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implents adding/deleting mdb entries via netlink.
Currently all entries are temp, we probably need a flag to distinguish
permanent entries too.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Stephen mentioned, we need to monitor the mdb
changes in user-space, so add notifications via netlink too.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of rehashing, introduce a global variable 'br_mdb_rehash_seq'
which gets increased every time when rehashing, and assign
net->dev_base_seq + br_mdb_rehash_seq to cb->seq.
In theory cb->seq could be wrapped to zero, but this is not
easy to fix, as net->dev_base_seq is not visible inside
br_mdb_rehash(). In practice, this is rare.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We pass IFLA_BRPORT_MAX to nla_parse_nested() so we need
IFLA_BRPORT_MAX + 1 elements. Also Smatch complains that we read past
the end of the array when in br_set_port_flag() when it's called with
IFLA_BRPORT_FAST_LEAVE.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
V5: fix two bugs pointed out by Thomas
remove seq check for now, mark it as TODO
V4: remove some useless #include
some coding style fix
V3: drop debugging printk's
update selinux perm table as well
V2: drop patch 1/2, export ifindex directly
Redesign netlink attributes
Improve netlink seq check
Handle IPv6 addr as well
This patch exports bridge multicast database via netlink
message type RTM_GETMDB. Similar to fdb, but currently bridge-specific.
We may need to support modify multicast database too (RTM_{ADD,DEL}MDB).
(Thanks to Thomas for patient reviews)
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
V3: make it a flag
V2: make the toggle per-port
Fast leave allows bridge to immediately stops the multicast
traffic on the port receives IGMP Leave when IGMP snooping is enabled,
no timeouts are observed.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
V2: make the toggle per-port
Fast leave allows bridge to immediately stops the multicast
traffic on the port receives IGMP Leave when IGMP snooping is enabled,
no timeouts are observed.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow an unpriviled user who has created a user namespace, and then
created a network namespace to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
Allow setting bridge paramters via sysfs.
Allow all of the bridge ioctls:
BRCTL_ADD_IF
BRCTL_DEL_IF
BRCTL_SET_BRDIGE_FORWARD_DELAY
BRCTL_SET_BRIDGE_HELLO_TIME
BRCTL_SET_BRIDGE_MAX_AGE
BRCTL_SET_BRIDGE_AGING_TIME
BRCTL_SET_BRIDGE_STP_STATE
BRCTL_SET_BRIDGE_PRIORITY
BRCTL_SET_PORT_PRIORITY
BRCTL_SET_PATH_COST
BRCTL_ADD_BRIDGE
BRCTL_DEL_BRDIGE
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
to ns_capable(net->user-ns, CAP_NET_ADMIN). Allowing unprivileged
users to make netlink calls to modify their local network
namespace.
- In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
that calls that are not safe for unprivileged users are still
protected.
Later patches will remove the extra capable calls from methods
that are safe for unprivilged users.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is Linux bridge implementation of root port guard.
If BPDU is received from a leaf (edge) port, it should not
be elected as root port.
Why would you want to do this?
If using STP on a bridge and the downstream bridges are not fully
trusted; this prevents a hostile guest for rerouting traffic.
Why not just use netfilter?
Netfilter does not track of follow spanning tree decisions.
It would be difficult and error prone to try and mirror STP
resolution in netfilter module.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is Linux bridge implementation of STP protection
(Cisco BPDU guard/Juniper BPDU block). BPDU block disables
the bridge port if a STP BPDU packet is received.
Why would you want to do this?
If running Spanning Tree on bridge, hostile devices on the network
may send BPDU and cause network failure. Enabling bpdu block
will detect and stop this.
How to recover the port?
The port will be restarted if link is brought down, or
removed and reattached. For example:
# ip li set dev eth0 down; ip li set dev eth0 up
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide macro to build sysfs data structures and functions
for accessing flag bits. If flag bits change do netlink
notification.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose bridge port parameter over netlink. By switching to a nested
message, this can be used for other bridge parameters.
This changes IFLA_PROTINFO attribute from one byte to a full nested
set of attributes. This is safe for application interface because the
old message used IFLA_PROTINFO and new one uses
IFLA_PROTINFO | NLA_F_NESTED.
The code adapts to old format requests, and therefore stays
compatible with user mode RSTP daemon. Since the type field
for nested and unnested attributes are different, and the old
code in libnetlink doesn't do the mask, it is also safe to use
with old versions of bridge monitor command.
Note: although mode is only a boolean, treating it as a
full byte since in the future someone will probably want to add more
values (like macvlan has).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of issuing (0) statements when !CONFIG_SYSFS which will cause
'warning: ', we'll use inline statements instead. This will effectively
do the same thing, but suppress any unnecessary warnings.
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: bridge@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since this array is no longer part of the bridge driver, it should
have an 'eth' prefix not 'br'.
We also assume that either it's 16-bit-aligned or the architecture has
efficient unaligned access. Ensure the first of these is true by
explicitly aligning it.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function name should include '_ether_addr'.
Return type should be bool.
Parameter name should be 'addr' not 'dest' (also matching kernel-doc).
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parse the string into an array of bytes rather than ints, so we can
use is_link_local() rather than reimplementing it.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
This series contains updates to ixgbe, ixgbevf, igbvf, igb and
networking core (bridge). Most notably is the addition of support
for local link multicast addresses in SR-IOV mode to the networking
core.
Also note, the ixgbe patch "ixgbe: Add support for pipeline reset" and
"ixgbe: Fix return value from macvlan filter function" is revised based
on community feedback.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hardware switches may support enabling and disabling the
loopback switch which puts the device in a VEPA mode defined
in the IEEE 802.1Qbg specification. In this mode frames are
not switched in the hardware but sent directly to the switch.
SR-IOV capable NICs will likely support this mode I am
aware of at least two such devices. Also I am told (but don't
have any of this hardware available) that there are devices
that only support VEPA modes. In these cases it is important
at a minimum to be able to query these attributes.
This patch adds an additional IFLA_BRIDGE_MODE attribute that can be
set and dumped via the PF_BRIDGE:{SET|GET}LINK operations. Also
anticipating bridge attributes that may be common for both embedded
bridges and software bridges this adds a flags attribute
IFLA_BRIDGE_FLAGS currently used to determine if the command or event
is being generated to/from an embedded bridge or software bridge.
Finally, the event generation is pulled out of the bridge module and
into rtnetlink proper.
For example using the macvlan driver in VEPA mode on top of
an embedded switch requires putting the embedded switch into
a VEPA mode to get the expected results.
-------- --------
| VEPA | | VEPA | <-- macvlan vepa edge relays
-------- --------
| |
| |
------------------
| VEPA | <-- embedded switch in NIC
------------------
|
|
-------------------
| external switch | <-- shiny new physical
------------------- switch with VEPA support
A packet sent from the macvlan VEPA at the top could be
loopbacked on the embedded switch and never seen by the
external switch. So in order for this to work the embedded
switch needs to be set in the VEPA state via the above
described commands.
By making these attributes nested in IFLA_AF_SPEC we allow
future extensions to be made as needed.
CC: Lennert Buytenhek <buytenh@wantstofly.org>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PF_BRIDGE:RTM_{GET|SET}LINK nlmsg family and type are
currently embedded in the ./net/bridge module. This prohibits
them from being used by other bridging devices. One example
of this being hardware that has embedded bridging components.
In order to use these nlmsg types more generically this patch
adds two net_device_ops hooks. One to set link bridge attributes
and another to dump the current bride attributes.
ndo_bridge_setlink()
ndo_bridge_getlink()
CC: Lennert Buytenhek <buytenh@wantstofly.org>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In SR-IOV mode the PF driver acts as the uplink port and is
used to send control packets e.g. lldpad, stp, etc.
eth0.1 eth0.2 eth0
VF VF PF
| | | <-- stand-in for uplink
| | |
--------------------------
| Embedded Switch |
--------------------------
|
MAC <-- uplink
But the embedded switch is setup to forward multicast addresses
to all interfaces both VFs and PF and onto the physical link.
This results in reserved MAC addresses used by control protocols
to be forwarded over the switch onto the VF.
In the LLDP case the PF sends an LLDPDU and it is currently
being forwarded to all the VFs who then see the PF as a peer.
This is incorrect.
This patch adds the multicast addresses to the RAR table in the
hardware to prevent this behavior.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
If lower layer driver leaves the ip header in the skb fragment, it needs to
be first pulled into skb->data before inspecting ip header length or ip version
number.
Signed-off-by: Sarveshwar Bandi <sarveshwar.bandi@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Later changes need to be able to refer to neighbour attributes
when doing fdb_add.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The internal functions for add/deleting addresses don't change
their argument.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/netfilter/nfnetlink_log.c
net/netfilter/xt_LOG.c
Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.
Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.
Signed-off-by: David S. Miller <davem@davemloft.net>
auto75914331@hushmail.com reports that iptables does not correctly
output the KERN_<level>.
$IPTABLES -A RULE_0_in -j LOG --log-level notice --log-prefix "DENY in: "
result with linux 3.6-rc5
Sep 12 06:37:29 xxxxx kernel: <5>DENY in: IN=eth0 OUT= MAC=.......
result with linux 3.5.3 and older:
Sep 9 10:43:01 xxxxx kernel: DENY in: IN=eth0 OUT= MAC......
commit 04d2c8c83d
("printk: convert the format for KERN_<LEVEL> to a 2 byte pattern")
updated the syslog header style but did not update netfilter uses.
Do so.
Use KERN_SOH and string concatenation instead of "%c" KERN_SOH_ASCII
as suggested by Eric Dumazet.
Signed-off-by: Joe Perches <joe@perches.com>
cc: auto75914331@hushmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.
I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.
I have successfully built an allyesconfig kernel with this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch defines netlink_kernel_create as a wrapper function of
__netlink_kernel_create to hide the struct module *me parameter
(which seems to be THIS_MODULE in all existing netlink subsystems).
Suggested by David S. Miller.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
This is the first batch of Netfilter and IPVS updates for your
net-next tree. Mostly cleanups for the Netfilter side. They are:
* Remove unnecessary RTNL locking now that we have support
for namespace in nf_conntrack, from Patrick McHardy.
* Cleanup to eliminate unnecessary goto in the initialization
path of several Netfilter tables, from Jean Sacren.
* Another cleanup from Wu Fengguang, this time to PTR_RET instead
of if IS_ERR then return PTR_ERR.
* Use list_for_each_entry_continue_rcu in nf_iterate, from
Michael Wang.
* Add pmtu_disc sysctl option to disable PMTU in their tunneling
transmitter, from Julian Anastasov.
* Generalize application protocol registration in IPVS and modify
IPVS FTP helper to use it, from Julian Anastasov.
* update Kconfig. The IPVS FTP helper depends on the Netfilter FTP
helper for NAT support, from Julian Anastasov.
* Add logic to update PMTU for IPIP packets in IPVS, again
from Julian Anastasov.
* A couple of sparse warning fixes for IPVS and Netfilter from
Claudiu Ghioc and Patrick McHardy respectively.
Patrick's IPv6 NAT changes will follow after this batch, I need
to flush this batch first before refreshing my tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alternative solution for problem found by Linux Driver Verification
project (linuxtesting.org).
As it noted in the comment before the br_handle_frame_finish
function, this function should be called under rcu_read_lock.
The problem callgraph:
br_dev_xmit -> br_nf_pre_routing_finish_bridge_slow ->
-> br_handle_frame_finish -> br_port_get_rcu -> rcu_dereference
And in this case there is no read-lock section.
Reported-by: Denis Efremov <yefremov.denis@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although this doesn't matter actually, because netpoll_tx_running()
doesn't use the parameter, the code will be more readable.
For team_dev_queue_xmit() we have to move it down to avoid
compile errors.
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't delete 'p' from the list in the loop,
so we can just use list_for_each_entry().
Cc: David Miller <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add comments on why we don't notify NETDEV_RELEASE.
Cc: David Miller <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like the previous patch, slave_disable_netpoll() and __netpoll_cleanup()
may be called with read_lock() held too, so we should make them
non-block, by moving the cleanup and kfree() to call_rcu_bh() callbacks.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
slave_enable_netpoll() and __netpoll_setup() may be called
with read_lock() held, so should use GFP_ATOMIC to allocate
memory. Eric suggested to pass gfp flags to __netpoll_setup().
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This quiets the coccinelle warnings:
net/bridge/netfilter/ebtable_filter.c:107:1-3: WARNING: PTR_RET can be used
net/bridge/netfilter/ebtable_nat.c:107:1-3: WARNING: PTR_RET can be used
net/ipv6/netfilter/ip6table_filter.c:65:1-3: WARNING: PTR_RET can be used
net/ipv6/netfilter/ip6table_mangle.c💯1-3: WARNING: PTR_RET can be used
net/ipv6/netfilter/ip6table_raw.c:44:1-3: WARNING: PTR_RET can be used
net/ipv6/netfilter/ip6table_security.c:62:1-3: WARNING: PTR_RET can be used
net/ipv4/netfilter/iptable_filter.c:72:1-3: WARNING: PTR_RET can be used
net/ipv4/netfilter/iptable_mangle.c:107:1-3: WARNING: PTR_RET can be used
net/ipv4/netfilter/iptable_raw.c:51:1-3: WARNING: PTR_RET can be used
net/ipv4/netfilter/iptable_security.c:70:1-3: WARNING: PTR_RET can be used
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Various /proc/net files sometimes report crazy timer values, expressed
in clock_t units.
This happens when an expired timer delta (expires - jiffies) is passed
to jiffies_to_clock_t().
This function has an overflow in :
return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
commit cbbc719fcc (time: Change jiffies_to_clock_t() argument type
to unsigned long) only got around the problem.
As we cant output negative values in /proc/net/tcp without breaking
various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
that caps the negative delta to a 0 value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: hank <pyu@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix race condition in several network drivers when reading stats on 32bit
UP architectures. These drivers update their stats in a BH context and
therefore should use u64_stats_fetch_begin_bh/u64_stats_fetch_retry_bh
instead of u64_stats_fetch_begin/u64_stats_fetch_retry when reading the
stats.
Signed-off-by: Kevin Groeneveld <kgroeneveld@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These patches implement the final mechanism necessary to really allow
us to go without the route cache in ipv4.
We need a place to have long-term storage of PMTU/redirect information
which is independent of the routes themselves, yet does not get us
back into a situation where we have to write to metrics or anything
like that.
For this we use an "next-hop exception" table in the FIB nexthops.
The one thing I desperately want to avoid is having to create clone
routes in the FIB trie for this purpose, because that is very
expensive. However, I'm willing to entertain such an idea later
if this current scheme proves to have downsides that the FIB trie
variant would not have.
In order to accomodate this any such scheme, we need to be able to
produce a full flow key at PMTU/redirect time. That required an
adjustment of the interface call-sites used to propagate these events.
For a PMTU/redirect with a fully specified socket, we pass that socket
and use it to produce the flow key.
Otherwise we use a passed in SKB to formulate the key. There are two
cases that need to be distinguished, ICMP message processing (in which
case the IP header is at skb->data) and output packet processing
(mostly tunnels, and in all such cases the IP header is at ip_hdr(skb)).
We also have to make the code able to handle the case where the dst
itself passed into the dst_ops->{update_pmtu,redirect} method is
invalidated. This matters for calls from sockets that have cached
that route. We provide a inet{,6} helper function for this purpose,
and edit SCTP specially since it caches routes at the transport rather
than socket level.
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be used so that we can compose a full flow key.
Even though we have a route in this context, we need more. In the
future the routes will be without destination address, source address,
etc. keying. One ipv4 route will cover entire subnets, etc.
In this environment we have to have a way to possess persistent storage
for redirects and PMTU information. This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here. Using that flow key will do a fib_lookup()
and create/update the persistent entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
The hash size is doubled when it needs to grow and compared against
hash_max. The >= comparison will limit the hash table size to half
of what is expected i.e. the default 512 hash_max will not allow
the hash table to grow larger than 256.
Also print the hash table limit instead of the desirable size when
the limit is reached.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
mld->mld_maxdelay is net endian, so we should use ntohs, not htons
CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Causes the handler to use the daddr in the ipv4/ipv6 header when
the route gateway is unspecified (local subnet).
Signed-off-by: David S. Miller <davem@davemloft.net>
This was should be a kfree_skb() here to free the sk_buff pointer.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the following structure:
struct netlink_kernel_cfg {
unsigned int groups;
void (*input)(struct sk_buff *skb);
struct mutex *cb_mutex;
};
That can be passed to netlink_kernel_create to set optional configurations
for netlink kernel sockets.
I've populated this structure by looking for NULL and zero parameters at the
existing code. The remaining parameters that always need to be set are still
left in the original interface.
That includes optional parameters for the netlink socket creation. This allows
easy extensibility of this interface in the future.
This patch also adapts all callers to use this new interface.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/caif/caif_hsi.c
drivers/net/usb/qmi_wwan.c
The qmi_wwan merge was trivial.
The caif_hsi.c, on the other hand, was not. It's a conflict between
1c385f1fdf ("caif-hsi: Replace platform
device with ops structure.") in the net-next tree and commit
39abbaef19 ("caif-hsi: Postpone init of
HIS until open()") in the net tree.
I did my best with that one and will ask Sjur to check it out.
Signed-off-by: David S. Miller <davem@davemloft.net>
And use nlmsg_data() while we're here too.
Also, free and NULL out skb when nlmsg_put() fails and remove
pointless kernel log message.
Signed-off-by: David S. Miller <davem@davemloft.net>
This ensures that bridges created with brctl(8) or ioctl(2) directly
also carry IFLA_LINKINFO when dumped over netlink. This also allows
to create a bridge with ioctl(2) and delete it with RTM_DELLINK.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup. Use NFPROTO_* for consistency with other
netfilter code.
Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk>
Reviewed-by: Vincent Sanders <vincent.sanders@collabora.co.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The padding destination or hop-by-hop option is called Pad1 and not Pad0.
See RFC2460 (4.2) or the IANA ipv6-parameters registry:
http://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xml
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new bool function ether_addr_equal to add
some clarity and reduce the likelihood for misuse
of compare_ether_addr for sorting.
Done via cocci script:
$ cat compare_ether_addr.cocci
@@
expression a,b;
@@
- !compare_ether_addr(a, b)
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- compare_ether_addr(a, b)
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- !ether_addr_equal(a, b) == 0
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- !ether_addr_equal(a, b) != 0
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- ether_addr_equal(a, b) == 0
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- ether_addr_equal(a, b) != 0
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- !!ether_addr_equal(a, b)
+ ether_addr_equal(a, b)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new bool function ether_addr_equal to add
some clarity and reduce the likelihood for misuse
of compare_ether_addr for sorting.
Done via cocci script:
$ cat compare_ether_addr.cocci
@@
expression a,b;
@@
- !compare_ether_addr(a, b)
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- compare_ether_addr(a, b)
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- !ether_addr_equal(a, b) == 0
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- !ether_addr_equal(a, b) != 0
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- ether_addr_equal(a, b) == 0
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- ether_addr_equal(a, b) != 0
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- !!ether_addr_equal(a, b)
+ ether_addr_equal(a, b)
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
if net.bridge.bridge-nf-filter-vlan-tagged sysctl is enabled, bridge
netfilter removes the vlan header temporarily and then feeds the packet
to ip(6)tables.
When the new "bridge-nf-pass-vlan-input-device" sysctl is on
(default off), then bridge netfilter will also set the
in-interface to the vlan interface; if such an interface exists.
This is needed to make iptables REDIRECT target work with
"vlan-on-top-of-bridge" setups and to allow use of "iptables -i" to
match the vlan device name.
Also update Documentation with current brnf default settings.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/ethernet/intel/e1000e/param.c
drivers/net/wireless/iwlwifi/iwl-agn-rx.c
drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
drivers/net/wireless/iwlwifi/iwl-trans.h
Resolved the iwlwifi conflict with mainline using 3-way diff posted
by John Linville and Stephen Rothwell. In 'net' we added a bug
fix to make iwlwifi report a more accurate skb->truesize but this
conflicted with RX path changes that happened meanwhile in net-next.
In e1000e a conflict arose in the validation code for settings of
adapter->itr. 'net-next' had more sophisticated logic so that
logic was used.
Signed-off-by: David S. Miller <davem@davemloft.net>
Unfortunately it seems that I didn't properly test the case of
an expired external querier in the recent multicast bridge series.
The setup of the timer in that case is completely broken and leads
to a NULL-pointer dereference. This patch fixes it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge: set fake_rtable's dst to NULL to avoid kernel Oops
when bridge is deleted before tap/vif device's delete, kernel may
encounter an oops because of NULL reference to fake_rtable's dst.
Set fake_rtable's dst to NULL before sending packets out can solve
this problem.
v4 reformat, change br_drop_fake_rtable(skb) to {}
v3 enrich commit header
v2 introducing new flag DST_FAKE_RTABLE to dst_entry struct.
[ Use "do { } while (0)" for nop br_drop_fake_rtable()
implementation -DaveM ]
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Peter Huang <peter.huangpeng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This results in code with less boiler plate that is a bit easier
to read.
Additionally stops us from using compatibility code in the sysctl
core, hastening the day when the compatibility code can be removed.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes it clearer which sysctls are relative to your current network
namespace.
This makes it a little less error prone by not exposing sysctls for the
initial network namespace in other namespaces.
This is the same way we handle all of our other network interfaces to
userspace and I can't honestly remember why we didn't do this for
sysctls right from the start.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds two new flags NTF_MASTER and NTF_SELF that can
now be used to specify where PF_BRIDGE netlink commands should
be sent. NTF_MASTER sends the commands to the 'dev->master'
device for parsing. Typically this will be the linux net/bridge,
or open-vswitch devices. Also without any flags set the command
will be handled by the master device as well so that current user
space tools continue to work as expected.
The NTF_SELF flag will push the PF_BRIDGE commands to the
device. In the basic example below the commands are then parsed
and programmed in the embedded bridge.
Note if both NTF_SELF and NTF_MASTER bits are set then the
command will be sent to both 'dev->master' and 'dev' this allows
user space to easily keep the embedded bridge and software bridge
in sync.
There is a slight complication in the case with both flags set
when an error occurs. To resolve this the rtnl handler clears
the NTF_ flag in the netlink ack to indicate which sets completed
successfully. The add/del handlers will abort as soon as any
error occurs.
To support this new net device ops were added to call into
the device and the existing bridging code was refactored
to use these. There should be no required changes in user space
to support the current bridge behavior.
A basic setup with a SR-IOV enabled NIC looks like this,
veth0 veth2
| |
------------
| bridge0 | <---- software bridging
------------
/
/
ethx.y ethx
VF PF
\ \ <---- propagate FDB entries to HW
\ \
--------------------
| Embedded Bridge | <---- hardware offloaded switching
--------------------
In this case the embedded bridge must be managed to allow 'veth0'
to communicate with 'ethx.y' correctly. At present drivers managing
the embedded bridge either send frames onto the network which
then get dropped by the switch OR the embedded bridge will flood
these frames. With this patch we have a mechanism to manage the
embedded bridge correctly from user space. This example is specific
to SR-IOV but replacing the VF with another PF or dropping this
into the DSA framework generates similar management issues.
Examples session using the 'br'[1] tool to add, dump and then
delete a mac address with a new "embedded" option and enabled
ixgbe driver:
# br fdb add 22:35:19:ac:60:59 dev eth3
# br fdb
port mac addr flags
veth0 22:35:19:ac:60:58 static
veth0 9a:5f:81:f7:f6:ec local
eth3 00:1b:21:55:23:59 local
eth3 22:35:19:ac:60:59 static
veth0 22:35:19:ac:60:57 static
#br fdb add 22:35:19:ac:60:59 embedded dev eth3
#br fdb
port mac addr flags
veth0 22:35:19:ac:60:58 static
veth0 9a:5f:81:f7:f6:ec local
eth3 00:1b:21:55:23:59 local
eth3 22:35:19:ac:60:59 static
veth0 22:35:19:ac:60:57 static
eth3 22:35:19:ac:60:59 local embedded
#br fdb del 22:35:19:ac:60:59 embedded dev eth3
I added a couple lines to 'br' to set the flags correctly is all. It
is my opinion that the merit of this patch is now embedded and SW
bridges can both be modeled correctly in user space using very nearly
the same message passing.
[1] 'br' tool was published as an RFC here and will be renamed 'bridge'
http://patchwork.ozlabs.org/patch/117664/
Thanks to Jamal Hadi Salim, Stephen Hemminger and Ben Hutchings for
valuable feedback, suggestions, and review.
v2: fixed api descriptions and error case with both NTF_SELF and
NTF_MASTER set plus updated patch description.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sending general queries was implemented as an optimisation to speed
up convergence on start-up. In order to prevent interference with
multicast routers a zero source address has to be used.
Unfortunately these packets appear to cause some multicast-aware
switches to misbehave, e.g., by disrupting multicast packets to us.
Since the multicast snooping feature still functions without sending
our own queries, this patch will change the default to not send
queries.
For those that need queries in order to speed up convergence on start-up,
a toggle is provided to restore the previous behaviour.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it stands when we discover that a real querier (one that queries
with a non-zero source address) we stop querying. However, even
after said querier has fallen off the edge of the earth, we will
never restart querying (unless the bridge itself is restarted).
This patch fixes this by kicking our own querier into gear when
the timer for other queriers expire.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the helper br_multicast_start_querier so that
the code which starts the queriers in br_multicast_toggle can
be reused elsewhere.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use of "unsigned int" is preferred to bare "unsigned" in net tree.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it stands the bridge IGMP snooping system will respond to
group leave messages with queries for remaining membership.
This is both unnecessary and undesirable. First of all any
multicast routers present should be doing this rather than us.
What's more the queries that we send may end up upsetting other
multicast snooping swithces in the system that are buggy.
In fact, we can simply remove the code that send these queries
because the existing membership expiry mechanism doesn't rely
on them anyway.
So this patch simply removes all code associated with group
queries in response to group leave messages.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we have:
eth0: link *down*
br0: port 1(eth0) entered *forwarding* state
br_log_state(p) should be called *after* p->state is set
to BR_STATE_DISABLED.
Reported-by: Zilvinas Valinskas <zilvinas@wilibox.com>
Signed-off-by: Paulius Zaleckas <paulius.zaleckas@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When br_log_state() is reporting state it should say "entered"
istead of "entering" since state at this point is already
changed.
Signed-off-by: Paulius Zaleckas <paulius.zaleckas@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When net.bridge.bridge-nf-filter-vlan-tagged is 0 (default), vlan packets
arriving should not be sent to ip(6)tables by bridge netfilter.
However, it turns out that we currently always send VLAN packets to
netfilter, if ..
a), CONFIG_VLAN_8021Q is enabled ; or
b), CONFIG_VLAN_8021Q is not set but rx vlan offload is enabled
on the bridge port.
This is because bridge netfilter treats skb with
skb->protocol == ETH_P_IP{V6} as "non-vlan packet".
With rx vlan offload on or CONFIG_VLAN_8021Q=y, the vlan header has
already been removed here, and we cannot rely on skb->protocol alone.
Fix this by only using skb->protocol if the skb has no vlan tag,
or if a vlan tag is present and filter-vlan-tagged bridge netfilter
sysctl is enabled.
We cannot remove the skb->protocol == htons(ETH_P_8021Q) test
because the vlan tag is still around in the CONFIG_VLAN_8021Q=n &&
"ethtool -K $itf rxvlan off" case.
reproducer:
iptables -t raw -I PREROUTING -i br0
iptables -t raw -I PREROUTING -i br0.1
Then send packets to an ip address configured on br0.1 interface.
Even with net.bridge.bridge-nf-filter-vlan-tagged=0, the 1st rule
will match instead of the 2nd one.
With this patch applied, the 2nd rule will match instead.
In the non-local address case, netfilter won't be consulted after
this patch unless the sysctl is switched on.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In adf7ff8, a invalid dereference was added in ebt_make_names.
CC [M] net/bridge/netfilter/ebtables.o
net/bridge/netfilter/ebtables.c: In function `ebt_make_names':
net/bridge/netfilter/ebtables.c:1371:20: warning: `t' may be used uninitialized in this function [-Wuninitialized]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
user-space ebtables expects 32 bytes-long names, but xt_match names
use 29 bytes. We have to copy less 29 bytes and then, make sure we
fill the remaining bytes with zeroes.
Signed-off-by: Santosh Nayak <santoshprasadnayak@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vmxnet3/vmxnet3_drv.c
Small vmxnet3 conflict with header size bug fix in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
otherwise source IPv6 address of ICMPV6_MGM_QUERY packet
might be random junk if IPv6 is disabled on interface or
link-local address is not yet ready (DAD).
Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bridge: send proper message_age in config BPDU
added this gem:
bpdu.message_age = (jiffies - root->designated_age)
p->designated_age = jiffies + bpdu->message_age;
Notice how bpdu->message_age is negated when reassigned to
bpdu.message_age. This causes message age to decrease breaking the
STP protocol.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
min age increment needs to round up its min age tick for all
HZ values to guarantee message age is increasing.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/tg3.c
Conflicts in the statistics regression bug fix from 'net',
but happily Matt Carlson originally posted the fix against
'net-next' so I used that to resolve this.
Signed-off-by: David S. Miller <davem@davemloft.net>
We expected 0 if module doesn't exist, which is no longer the case
(42046e2e45,
netfilter: x_tables: return -ENOENT for non-existant matches/targets).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Unify return value of .ndo_set_mac_address if the given address
isn't valid. Return -EADDRNOTAVAIL as eth_mac_addr() already does
if is_valid_ether_addr() fails.
Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use eth_hw_addr_random() instead of calling random_ether_addr()
to set addr_assign_type correctly to NET_ADDR_RANDOM.
Reset the state to NET_ADDR_PERM as soon as the MAC get
changed via .ndo_set_mac_address.
v2: adapt to renamed eth_hw_addr_random()
Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_cleanup() is run from timer interrupt, BH already masked.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Štefan Gula <steweg@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bluetooth/l2cap_core.c
Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 618f9bc74a (net: Move mtu handling down to the protocol
depended handlers) forgot the bridge netfilter case, adding a NULL
dereference in ip_fragment().
Reported-by: Chris Boot <bootc@bootc.net>
CC: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use IS_ENABLED(CONFIG_FOO)
instead of defined(CONFIG_FOO) || defined (CONFIG_FOO_MODULE)
Signed-off-by: Igor Maravić <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If user has configured a MAC address that is not one of the existing
ports of the bridge, then we need to add a special entry in the forwarding
table. This forwarding table entry has no outgoing port so it has to be
treated a little differently. The special entry is reported by the netlink
interface with ifindex of bridge, but ignored by the old interface since there
is no usable way to put it in the ABI.
Reported-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass bridge to fdb_notify so it can determine correct namespace based
on namespace of bridge rather than namespace of destination port.
Also makes next patch easier.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move fdb_notify outside of fdb_create. This fixes the problem
that notification of local entries are not flagged correctly.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
While parsing through IPv6 extension headers, fragment headers are
skipped making them invisible to the caller. This reports the
fragment offset of the last header in order to make it possible to
determine whether the packet is fragmented and, if so whether it is
a first or last fragment.
Signed-off-by: Jesse Gross <jesse@nicira.com>
When in user-stp mode, bridge master do not follow state of its slaves, so
after the following sequence of events it can stuck forever in no-carrier
state:
1) turn stp off
2) put all slaves down - master device will follow their state and also go in
no-carrier state
3) turn stp on with bridge-stp script returning 0 (go to the user-stp mode)
Now bridge master won't follow slaves' state and will never reach running
state.
This patch solves the problem by making user-stp and kernel-stp behavior
similar regarding master following slaves' states.
Signed-off-by: Vitalii Demianets <vitas@nppfactor.kiev.ua>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
C assignment can handle struct in6_addr copying.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The forcedeth changes had a conflict with the conversion over
to atomic u64 statistics in net-next.
The libertas cfg.c code had a conflict with the bss reference
counting fix by John Linville in net-next.
Conflicts:
drivers/net/ethernet/nvidia/forcedeth.c
drivers/net/wireless/libertas/cfg.c
Only distinct use is checking if NETIF_F_NOCACHE_COPY should be
enabled by default. The check heuristics is altered a bit here,
so it hits other people than before. The default shouldn't be
trusted for performance-critical cases anyway.
For all other uses NETIF_F_NO_CSUM is equivalent to NETIF_F_HW_CSUM.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2: add couple missing conversions in drivers
split unexporting netdev_fix_features()
implemented %pNF
convert sock::sk_route_(no?)caps
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge multicast snooping of ICMPv6 would incorrectly report a checksum problem
when used with Ethernet devices like sky2 that use CHECKSUM_COMPLETE.
When bytes are removed from skb, the computed checksum needs to be adjusted.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Martin Volf <martin.volf.42@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
More changes to the recent code to support control of forwarding
database via netlink.
* Support NTF_USE like neighbour table
* Validate state bits from application
* Only send notifications (and change bits) if new entry is
different.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
multicast_lock is taken in softirq context, so we should use
spin_lock_bh() in userspace.
call-chain in softirq context:
run_timer_softirq()
br_multicast_query_expired()
call-chain in userspace:
sysfs_write_file()
store_multicast_snooping()
br_multicast_toggle()
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
Site specific OOM messages are duplications of a generic MM
out of memory message and aren't really useful, so just
delete them.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
To fix this, once the implicit presence of module.h is removed:
net/bridge/br_stp_if.c: In function ‘br_stp_start’:
net/bridge/br_stp_if.c:131: error: implicit declaration of function ‘call_usermodehelper’
net/bridge/br_stp_if.c:131: error: ‘UMH_WAIT_PROC’ undeclared (first use in this function)
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Need to cleanup bridge device timers and ports when being bridge
device is being removed via netlink.
This fixes the problem of observed when doing:
ip link add br0 type bridge
ip link set dev eth1 master br0
ip link set br0 up
ip link del br0
which would cause br0 to hang in unregister_netdev because
of leftover reference count.
Reported-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is based on an earlier patch by Nick Carter with comments
by David Lamparter but with some refinements. Thanks for their patience
this is a confusing area with overlap of standards, user requirements,
and compatibility with earlier releases.
It adds a new sysfs attribute
/sys/class/net/brX/bridge/group_fwd_mask
that controls forwarding of frames with address of: 01-80-C2-00-00-0X
The default setting has no forwarding to retain compatibility.
One change from earlier releases is that forwarding of group
addresses is not dependent on STP being enabled or disabled. This
choice was made based on interpretation of tie 802.1 standards.
I expect complaints will arise because of this, but better to follow
the standard than continue acting incorrectly by default.
The filtering mask is writeable, but only values that don't forward
known control frames are allowed. It intentionally blocks attempts
to filter control protocols. For example: writing a 8 allows
forwarding 802.1X PAE addresses which is the most common request.
Reported-by: David Lamparter <equinox@diac24.net>
Original-patch-by: Nick Carter <ncarter100@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Benjamin Poirier <benjamin.poirier@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This resolves a regression seen by some users of bridging.
Some users use the bridge like a dummy device.
They expect to be able to put an IPv6 address on the device
with no ports attached. Although there are better ways of doing
this, there is no reason to not allow it.
Note: the bridge still will reflect the state of ports in the
bridge if there are any added.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to allow application to update existing fdb entries that already
exist. This makes bridge netlink neighbor API have same flags and
semantics as ip neighbor table.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When port is added to a bridge, the old code would send the new neighbor
netlink message before the subsequent new link message. This bug makes
it difficult to use the monitoring API in an application.
This code changes the ordering to add the forwarding entry
after the port is setup. One of the error checks (for invalid address)
is moved earlier in the process to avoid having to do unwind.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does several things:
- introduces __ethtool_get_settings which is called from ethtool code and
from drivers as well. Put ASSERT_RTNL there.
- dev_ethtool_get_settings() is replaced by __ethtool_get_settings()
- changes calling in drivers so rtnl locking is respected. In
iboe_get_rate was previously ->get_settings() called unlocked. This
fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same
problem. Also fixed by calling __dev_get_by_index() instead of
dev_get_by_index() and holding rtnl_lock for both calls.
- introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create()
so bnx2fc_if_create() and fcoe_if_create() are called locked as they
are from other places.
- use __ethtool_get_settings() in bonding code
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
v2->v3:
-removed dev_ethtool_get_settings()
-added ASSERT_RTNL into __ethtool_get_settings()
-prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock
around it and __ethtool_get_settings() call
v1->v2:
add missing export_symbol
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits]
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since these checks and initialization are done in
dev_ethtool_get_settings called later on, remove this redundancy.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_multicast_ipv6_rcv() can call pskb_trim_rcsum() and therefore skb
head can be reallocated.
Cache icmp6_type field instead of dereferencing twice the struct
icmp6hdr pointer.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Checksum of ICMPv6 is not properly computed because the pseudo header is not used.
Thus, the MLD packet gets dropped by the bridge.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reported-by: Ang Way Chuang <wcang@sfc.wide.ad.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jan Beulich reported a possible net_device leak in bridge code after
commit bb900b27a2 (bridge: allow creating bridge devices with netlink)
Reported-by: Jan Beulich <JBeulich@novell.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Free the locally allocated table and newinfo as done in adjacent error
handling code.
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This ensures the neighbor entries associated with the bridge
dev are flushed, also invalidating the associated cached L2 headers.
This means we br_add_if/br_del_if ports to implement hand-over and
not wind up with bridge packets going out with stale MAC.
This means we can also change MAC of port device and also not wind
up with bridge packets going out with stale MAC.
This builds on Stephen Hemminger's patch, also handling the br_del_if
case and the port MAC change case.
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Andrei Warkentin <andreiw@motorola.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When assigning a NULL value to an RCU protected pointer, no barrier
is needed. The rcu_assign_pointer, used to handle that but will soon
change to not handle the special case.
Convert all rcu_assign_pointer of NULL value.
//smpl
@@ expression P; @@
- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)
// </smpl>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The configuration of ebtables shouldn't depend on
CONFIG_BRIDGE_NETFILTER, only on CONFIG_NETFILTER.
Reported-by: Sbastien Laveze <slaveze@gmail.com>
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
fs: Merge split strings
treewide: fix potentially dangerous trailing ';' in #defined values/expressions
uwb: Fix misspelling of neighbourhood in comment
net, netfilter: Remove redundant goto in ebt_ulog_packet
trivial: don't touch files that are removed in the staging tree
lib/vsprintf: replace link to Draft by final RFC number
doc: Kconfig: `to be' -> `be'
doc: Kconfig: Typo: square -> squared
doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
drivers/net: static should be at beginning of declaration
drivers/media: static should be at beginning of declaration
drivers/i2c: static should be at beginning of declaration
XTENSA: static should be at beginning of declaration
SH: static should be at beginning of declaration
MIPS: static should be at beginning of declaration
ARM: static should be at beginning of declaration
rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
Update my e-mail address
PCIe ASPM: forcedly -> forcibly
gma500: push through device driver tree
...
Fix up trivial conflicts:
- arch/arm/mach-ep93xx/dma-m2p.c (deleted)
- drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
- drivers/net/r8169.c (just context changes)
Some minor cleanups that won't impact code:
1. Remove inline from non-critical functions; compiler will most
likely inline them anyway.
2. Make function args const where possible.
3. Whitespace cleanup
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When STP changes state of interface need to send a new link
message to reflect that change.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a new device is added to a bridge, the ethernet address of the
bridge network device may change. When the address changes, the
appropriate callback is called, but with the wrong device argument.
The address of the bridge device (ie br0) changes not the address
of the device being passed to add_if (ie eth0).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the message_age is already greater than the max_age, then the
BPDU is bogus. Linux won't generate BPDU, but conformance tester
or buggy implementation might.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A bridge topology with three systems:
+------+ +------+
| A(2) |--| B(1) |
+------+ +------+
\ /
+------+
| C(3) |
+------+
What is supposed to happen:
* bridge with the lowest ID is elected root (for example: B)
* C detects that A->C is higher cost path and puts in blocking state
What happens. Bridge with lowest id (B) is elected correctly as
root and things start out fine initially. But then config BPDU
doesn't get transmitted from A -> C. Because of that
the link from A-C is transistioned to the forwarding state.
The root cause of this is that the configuration messages
is generated with bogus message age, and dropped before
sending.
In the standardmessage_age is supposed to be:
the time since the generation of the Configuration BPDU by
the Root that instigated the generation of this Configuration BPDU.
Reimplement this by recording the timestamp (age + jiffies) when
recording config information. The old code incorrectly used the time
elapsed on the ageing timer which was incorrect.
See also:
https://bugzilla.vyatta.com/show_bug.cgi?id=7164
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In net/bridge/netfilter/ebt_ulog.c:ebt_ulog_packet() the 'goto unlock'
before the 'alloc_failure' label is completely redundant. This patch
removes it.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
In the future dst entries will be neigh-less. In that environment we
need to have an easy transition point for current users of
dst->neighbour outside of the packet output fast path.
Signed-off-by: David S. Miller <davem@davemloft.net>
This will get us closer to being able to do "neigh stuff"
completely independent of the underlying dst_entry for
protocols (ipv4/ipv6) that wish to do so.
We will also be able to make dst entries neigh-less.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that there is a one-to-one correspondance between neighbour
and hh_cache entries, we no longer need:
1) dynamic allocation
2) attachment to dst->hh
3) refcounting
Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.
Signed-off-by: David S. Miller <davem@davemloft.net>
As is_multicast_ether_addr returns true on broadcast packets as
well, we need to explicitly exclude broadcast packets so that
they're always flooded. This wasn't an issue before as broadcast
packets were considered to be an unregistered multicast group,
which were always flooded. However, as we now only flood such
packets to router ports, this is no longer acceptable.
Reported-by: Michael Guntsche <mike@it-loops.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge currently floods packets to groups that we have never
seen before to all ports. This is not required by RFC4541 and
in fact it is not desirable in environment where traffic to
unregistered group is always present.
This patch changes the behaviour so that we only send traffic
to unregistered groups to ports marked as routers.
The user can always force flooding behaviour to any given port
by marking it as a router.
Note that this change does not apply to traffic to 224.0.0.X
as traffic to those groups must always be flooded to all ports.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise we will not see the name of the slave dev in error
message:
[ 388.469446] (null): doesn't support polling, aborting.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon reception of a MGM report packet the kernel sets the mrouters_only flag
in a skb that is a clone of the original skb, which means that the bridge
loses track of MGM packets (cb buffers are tied to a specific skb and not
shared) and it ends up forwading join requests to the bridge interface.
This can cause unexpected membership timeouts and intermitent/permanent loss
of connectivity as described in RFC 4541 [2.1.1. IGMP Forwarding Rules]:
A snooping switch should forward IGMP Membership Reports only to
those ports where multicast routers are attached.
[...]
Sending membership reports to other hosts can result, for IGMPv1
and IGMPv2, in unintentionally preventing a host from joining a
specific multicast group.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
Upon reception of a IGMP/IGMPv2 membership report the kernel sets the
mrouters_only flag in a skb that may be a clone of the original skb, which
means that sometimes the bridge loses track of membership report packets (cb
buffers are tied to a specific skb and not shared) and it ends up forwading
join requests to the bridge interface.
This can cause unexpected membership timeouts and intermitent/permanent loss
of connectivity as described in RFC 4541 [2.1.1. IGMP Forwarding Rules]:
A snooping switch should forward IGMP Membership Reports only to
those ports where multicast routers are attached.
[...]
Sending membership reports to other hosts can result, for IGMPv1
and IGMPv2, in unintentionally preventing a host from joining a
specific multicast group.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Tested-by: Hayato Kakuta <kakuta.hayato@oss.ntt.co.jp>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
The message size allocated for rtnl ifinfo dumps was limited to
a single page. This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.
Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kill set but not used 'entry_offset'.
Add a default case to the switch statement so the compiler
can see that we always initialize off and size_kern before
using them.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
bridge netfilter code uses a fake_rtable, and we must init its _metric
field or risk NULL dereference later.
Ref: https://bugzilla.kernel.org/show_bug.cgi?id=35672
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the old days, we used to access dev->master in __netif_receive_skb()
in a rcu_read_lock section.
So one synchronize_net() call was needed in netdev_set_master() to make
sure another cpu could not use old master while/after we release it.
We now use netdev_rx_handler infrastructure and added one
synchronize_net() call in bond_release()/bond_release_all()
Remove the obsolete synchronize_net() from netdev_set_master() and add
one in bridge del_nbp() after its netdev_rx_handler_unregister() call.
This makes enslave -d a bit faster.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the previous patch I added NETDEV_JOIN, now
we can notify netconsole when adding a device to a bridge too.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 6b1e960fdb
bridge: Reset IPCB when entering IP stack on NF_FORWARD
broke forwarding of IPV6 packets in bridge because it would
call bp_parse_ip_options with an IPV6 packet.
Reported-by: Noah Meyerhans <noahm@debian.org>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The optimizations in commit 255d0dc340
(netfilter: x_table: speedup compat operations) assume that
xt_compat_add_offset is called once per rule.
ebtables however called it for each match/target found in a rule.
The match/watcher/target parser already returns the needed delta, so it
is sufficient to move the xt_compat_add_offset call to a more reasonable
location.
While at it, also get rid of the unused COMPAT iterator macros.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
commit 255d0dc340 (netfilter: x_table: speedup compat operations)
made ebtables not working anymore.
1) xt_compat_calc_jump() is not an exact match lookup
2) compat_table_info() has a typo in xt_compat_init_offsets() call
3) compat_do_replace() misses a xt_compat_init_offsets() call
Reported-by: dann frazier <dannf@dannf.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Four years ago, Patrick made a change to hold rtnl mutex during netlink
dump callbacks.
I believe it was a wrong move. This slows down concurrent dumps, making
good old /proc/net/ files faster than rtnetlink in some situations.
This occurred to me because one "ip link show dev ..." was _very_ slow
on a workload adding/removing network devices in background.
All dump callbacks are able to use RCU locking now, so this patch does
roughly a revert of commits :
1c2d670f36 : [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks
6313c1e099 : [RTNETLINK]: Remove unnecessary locking in dump callbacks
This let writers fight for rtnl mutex and readers going full speed.
It also takes care of phonet : phonet_route_get() is now called from rcu
read section. I renamed it to phonet_route_get_rcu()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes sure the ethtool's set_settings() callback of network
drivers don't ignore the 16 most significant bits when ethtool calls
their set_settings().
All drivers compiled with make allyesconfig on x86_64 have been
updated.
Signed-off-by: David Decotigny <decot@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note: netdev_update_features() needs only rtnl_lock as br->port_list
is only changed while holding it.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resolved logic conflicts causing a build failure due to
drivers/net/r8169.c changes using a patch from Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 1e253c3b8a.
It breaks 802.3ad bonding inside of a bridge.
The commit was meant to support transport bridging, and specifically
virtual machines bridged to an ethernet interface connected to a
switch port wiht 802.1x enabled.
But this isn't the way to do it, it breaks too many other things.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit bb900b27a2 ("bridge: allow
creating bridge devices with netlink") introduced a bug in net-next
because of a typo in notifier. Every device would have the sysfs
bridge directory (and files).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 462fb2af97 (bridge : Sanitize skb before it enters the IP
stack), missed one IPCB init before calling ip_options_compile()
Thanks to Scot Doyle for his tests and bug reports.
Reported-by: Scot Doyle <lkml@scotdoyle.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Acked-by: Bandan Das <bandan.das@stratus.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Cc: Jan Lübbe <jluebbe@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Apply restrictions on STP parameters based 802.1D 1998 standard.
* Fixes missing locking in set path cost ioctl
* Uses common code for both ioctl and sysfs
This is based on an earlier patch Sasikanth V but with overhaul.
Note:
1. It does NOT enforce the restriction on the relationship max_age and
forward delay or hello time because in existing implementation these are
set as independant operations.
2. If STP is disabled, there is no restriction on forward delay
3. No restriction on holding time because users use Linux code to act
as hub or be sticky.
4. Although standard allow 0-255, Linux only allows 0-63 for port priority
because more bits are reserved for port number.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink device ops to allow creating bridge device via netlink.
This works in a manner similar to vlan, macvlan and bonding.
Example:
# ip link add link dev br0 type bridge
# ip link del dev br0
The change required rearranging initializtion code to deal with
being called by create link. Most of the initialization happens
in br_dev_setup, but allocation of stats is done in ndo_init callback
to deal with allocation failure. Sysfs setup has to wait until
after the network device kobject is registered.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RTM_NEWNEIGH and RTM_DELNEIGH to allow updating of entries
in bridge forwarding table. This allows manipulating static entries
which is not possible with existing tools.
Example (using bridge extensions to iproute2)
# br fdb add 00:02:03:04:05:06 dev eth0
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows applications to query and monitor bridge forwarding
table in the same method used for neighbor table. The forward table
entries are returned in same structure format as used by the ioctl.
If more information is desired in future, the netlink method is
extensible.
Example (using bridge extensions to iproute2)
# br monitor
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some cases, look up of forward database entry is done with RCU;
and for others no RCU is needed because of locking. Split the two
cases into two differnt loops (and take off inline).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds tracking the last used time in forwarding table.
Rename ageing_timer to updated to better describe it.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Later patch provides ability to create non-local static entry.
To make this easier move the updating of the flag values to
after the code that creates entry.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"len = ntohs(ip6h->payload_len)" does not include the length of the ipv6
header itself, which the rest of this function assumes, though.
This leads to a length check less restrictive as it should be in the
following line for one thing. For another, it very likely leads to an
integer underrun when substracting the offset and therefore to a very
high new value of 'len' due to its unsignedness. This will ultimately
lead to the pskb_trim_rcsum() practically never being called, even in
the cases where it should.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/bridge/br_stp_if.c: In function ‘br_stp_recalculate_bridge_id’:
net/bridge/br_stp_if.c:216:3: warning: ‘return’ with no value, in function returning non-void
Signed-off-by: G.Balaji <balajig81@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mac address of the bridge device may be changed when a new interface
is added to the bridge. If this happens, then the bridge needs to call
the network notifiers to tickle any other systems that care. Since bridge
can be a module, this also means exporting the notifier function.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv6_dev_get_saddr() is currently called with an uninitialized
destination address. Although in tests it usually seemed to nevertheless
always fetch the right source address, there seems to be a possible race
condition.
Therefore this commit changes this, first setting the destination
address and only after that fetching the source address.
Reported-by: Jan Beulich <JBeulich@novell.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever we enter the IP stack proper from bridge netfilter we
need to ensure that the skb is in a form the IP stack expects
it to be in.
The entry point on NF_FORWARD did not meet the requirements of
the IP stack, therefore leading to potential crashes/panics.
This patch fixes the problem.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows rx_handlers to better signalize what to do next to
it's caller. That makes skb->deliver_no_wcard no longer needed.
kernel-doc for rx_handler_result is taken from Nicolas' patch.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
If Spanning Tree Protocol is not enabled, there is no good reason for
the bridge code to wait for the forwarding delay period before enabling
the link. The purpose of the forwarding delay is to allow STP to
learn about other bridges before nominating itself.
The only possible impact is that when starting up a new port
the bridge may flood a packet now, where previously it might have
seen traffic from the other host and preseeded the forwarding table.
Includes change for local variable br already available in that func.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes the bridge device behave like a physical device.
In earlier releases the bridge always asserted carrier. This
changes the behavior so that bridge device carrier is on only
if one or more ports are in the forwarding state. This
should help IPv6 autoconfiguration, DHCP, and routing daemons.
I did brief testing with Network and Virt manager and they
seem fine, but since this changes behavior of bridge, it should
wait until net-next (2.6.39).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Tested-By: Adam Majer <adamm@zombino.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The idea here is this minimizes the number of places one has to edit
in order to make changes to how flows are defined and used.
Signed-off-by: David S. Miller <davem@davemloft.net>
When configs BRIDGE=y and IPV6=m, this build error occurs:
br_multicast.c:(.text+0xa3341): undefined reference to `ipv6_dev_get_saddr'
BRIDGE_IGMP_SNOOPING is boolean; if it were tristate, then adding
depends on IPV6 || IPV6=n
to BRIDGE_IGMP_SNOOPING would be a good fix. As it is currently,
making BRIDGE depend on the IPV6 config works.
Reported-by: Patrick Schaaf <netdev@bof.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the bridge multicast snooping feature periodically issues
IPv6 general multicast listener queries to sense the absence of a
listener.
For this, it uses :: as its source address - however RFC 2710 requires:
"To be valid, the Query message MUST come from a link-local IPv6 Source
Address". Current Linux kernel versions seem to follow this requirement
and ignore our bogus MLD queries.
With this commit a link local address from the bridge interface is being
used to issue the MLD query, resulting in other Linux devices which are
multicast listeners in the network to respond with a MLD response (which
was not the case before).
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Map the IPv6 header's destination multicast address to an ethernet
source address instead of the MLD queries multicast address.
For instance for a general MLD query (multicast address in the MLD query
set to ::), this would wrongly be mapped to 33:33:00:00:00:00, although
an MLD queries destination MAC should always be 33:33:00:00:00:01 which
matches the IPv6 header's multicast destination ff02::1.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the multicast bridge snooping support is not active for
link local multicast. I assume this has been done to leave
important multicast data untouched, like IPv6 Neighborhood Discovery.
In larger, bridged, local networks it could however be desirable to
optimize for instance local multicast audio/video streaming too.
With the transient flag in IPv6 multicast addresses we have an easy
way to optimize such multimedia traffic without tempering with the
high priority multicast data from well-known addresses.
This patch alters the multicast bridge snooping for IPv6, to take
effect for transient multicast addresses instead of non-link-local
addresses.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nsrcs number is 2 Byte wide, therefore we need to call ntohs()
before using it.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We actually want a pointer to the grec_nsrcr and not the following
field. Otherwise we can get very high values for *nsrcs as the first two
bytes of the IPv6 multicast address are being used instead, leading to
a failing pskb_may_pull() which results in MLDv2 reports not being
parsed.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The protocol type for IPv6 entries in the hash table for multicast
bridge snooping is falsely set to ETH_P_IP, marking it as an IPv4
address, instead of setting it to ETH_P_IPV6, which results in negative
look-ups in the hash table later.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Struct tmp is copied from userspace. It is not checked whether the "name"
field is NULL terminated. This may lead to buffer overflow and passing
contents of kernel stack as a module name to try_then_request_module() and,
consequently, to modprobe commandline. It would be seen by all userspace
processes.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
As it turns out we never need to walk through the list of multicast
groups subscribed by the bridge interface itself (the only time we'd
want to do that is when we shut down the bridge, in which case we
simply walk through all multicast groups), we don't really need to
keep an hlist for mp->mglist.
This means that we can replace it with just a single bit to indicate
whether the bridge interface is subscribed to a group.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a couple of spots where we are supposed to modify the port
group timer (p->timer) we instead modify the bridge interface
group timer (mp->timer).
The effect of this is mostly harmless. However, it can cause
port subscriptions to be longer than they should be, thus making
snooping less effective.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The list mp->mglist is used to indicate whether a multicast group
is active on the bridge interface itself as opposed to one of the
constituent interfaces in the bridge.
Unfortunately the operation that adds the mp->mglist node to the
list neglected to check whether it has already been added. This
leads to list corruption in the form of nodes pointing to itself.
Normally this would be quite obvious as it would cause an infinite
loop when walking the list. However, as this list is never actually
walked (which means that we don't really need it, I'll get rid of
it in a subsequent patch), this instead is hidden until we perform
a delete operation on the affected nodes.
As the same node may now be pointed to by more than one node, the
delete operations can then cause modification of freed memory.
This was observed in practice to cause corruption in 512-byte slabs,
most commonly leading to crashes in jbd2.
Thanks to Josef Bacik for pointing me in the right direction.
Reported-by: Ian Page Hands <ihands@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fdb_create() puts a new fdb into hash with only addr set. This is
not good, since there are callers, that search the hash w/o the lock
and access all the other its fields.
Applies to current netdev tree.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce printk() levels to KERN_INFO in netdev_fix_features() as this will
be used by ethtool and might spam dmesg unnecessarily.
This converts the function to use netdev_info() instead of plain printk().
As a side effect, bonding and bridge devices will now log dropped features
on every slave device change.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.
Occurences found by `grep -r` on net/, drivers/net, include/
[ Move features and vlan_features next to each other in
struct netdev, as per Eric Dumazet's suggestion -DaveM ]
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid adding a new match revision icmp type/code are stored
in the sport/dport area.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Holger Eitzenberger <holger@eitzenberger.org>
Reviewed-by: Bart De Schuymer<bdschuym@pandora.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
One iptables invocation with 135000 rules takes 35 seconds of cpu time
on a recent server, using a 32bit distro and a 64bit kernel.
We eventually trigger NMI/RCU watchdog.
INFO: rcu_sched_state detected stall on CPU 3 (t=6000 jiffies)
COMPAT mode has quadratic behavior and consume 16 bytes of memory per
rule.
Switch the xt_compat algos to use an array instead of list, and use a
binary search to locate an offset in the sorted array.
This halves memory need (8 bytes per rule), and removes quadratic
behavior [ O(N*N) -> O(N*log2(N)) ]
Time of iptables goes from 35 s to 150 ms.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since nf_bridge_maybe_copy_header() may change the length of skb,
we should check the length of skb after it to handle the ppoe skbs.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/bridge//br_stp_if.c:148:66: warning: conversion of
net/bridge//br_stp_if.c:148:66: int to
net/bridge//br_stp_if.c:148:66: int enum umh_wait
net/bridge//netfilter/ebtables.c:1150:30: warning: Using plain integer as NULL pointer
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bf9ae5386b
(llc: use dev_hard_header) removed the
skb_reset_mac_header call from llc_mac_hdr_init.
This seems fine itself, but br_send_bpdu() invokes ebtables LOCAL_OUT.
We oops in ebt_basic_match() because it assumes eth_hdr(skb) returns
a meaningful result.
Cc: acme@ghostprotocols.net
References: https://bugzilla.kernel.org/show_bug.cgi?id=24532
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a missing ntohs() for bridge IPv6 multicast snooping.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_pre_routing functions in bridging have collected two
distinct ways of returning NF_DROP over the years, inline and
via goto. There is no reason for preferring either one.
So this patch arbitrarily picks the inline variant and converts
the all the gotos.
Also removes a redundant comment.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
If br_multicast_new_group returns NULL, we would return 0 (no error) to
the caller of br_multicast_add_group, which is not what we want. Instead
br_multicast_new_group should return ERR_PTR(-ENOMEM) in this case.
Also propagate the error number returned by br_mdb_rehash properly.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use helper functions to hide all direct accesses, especially writes,
to dst_entry metrics values.
This will allow us to:
1) More easily change how the metrics are stored.
2) Implement COW for metrics.
In particular this will help us put metrics into the inetpeer
cache if that is what we end up doing. We can make the _metrics
member a pointer instead of an array, initially have it point
at the read-only metrics in the FIB, and then on the first set
grab an inetpeer entry and point the _metrics member there.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Use the macros defined for the members of flowi to clean the code up.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_port_get() renamed to br_port_get_rtnl() to make clear RTNL is held.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The macro br_port_exists() is not enough protection when only
RCU is being used. There is a tiny race where other CPU has cleared port
handler hook, but is bridge port flag might still be set.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add br_should_route_hook_t typedef, this is the only way we can
get a clean RCU implementation for function pointer.
Move route_hook to location where it is used.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add modern __rcu annotatations to bridge multicast table.
Use newer hlist macros to avoid direct access to hlist internals.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make all frames sent to reserved group MAC addresses (01:80:c2:00:00:00 to
01:80:c2:00:00:0f) be forwarded if STP is disabled. This enables
forwarding EAPOL frames, among other things.
Signed-off-by: Benjamin Poirier <benjamin.poirier@polymtl.ca>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If some of the underlying devices support it, enable vlan offload on
transmit for bridge devices. This allows senders to take advantage of the
hardware support, similar to other forms of acceleration.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VLAN_GROUP_ARRAY_LEN is simply the number of possible vlan VIDs.
Since vlan groups will soon be more of an implementation detail
for vlan devices, rename the constant to be descriptive of its
actual purpose.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An upcoming commit will allow packets with hardware vlan acceleration
information to be passed though more parts of the network stack, including
packets trunked through the bridge. This adds support for matching and
filtering those packets through ebtables.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct dst_ops tracks number of allocated dst in an atomic_t field,
subject to high cache line contention in stress workload.
Switch to a percpu_counter, to reduce number of time we need to dirty a
central location. Place it on a separate cache line to avoid dirtying
read only fields.
Stress test :
(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE, SLUB/NUMA)
Before:
real 0m51.179s
user 0m15.329s
sys 10m15.942s
After:
real 0m45.570s
user 0m15.525s
sys 9m56.669s
With a small reordering of struct neighbour fields, subject of a
following patch, (to separate refcnt from other read mostly fields)
real 0m41.841s
user 0m15.261s
sys 8m45.949s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Related dicussion here : http://lkml.org/lkml/2010/9/3/16
Introduce a function br_parse_ip_options that will audit the
skb and possibly refill IP options before a packet enters the
IP stack. If no options are present, the function will zero out
the skb cb area so that it is not misinterpreted as options by some
unsuspecting IP layer routine. If packet consistency fails, drop it.
Signed-off-by: Bandan Das <bandan.das@stratus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a similar vain to commit 17762060c2
("bridge: Clear IPCB before possible entry into IP stack")
Any time we call into the IP stack we have to make sure the state
there is as expected by the ipv4 code.
With help from Eric Dumazet and Herbert Xu.
Reported-by: Bandan Das <bandan.das@stratus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If bridge port is offline, don't call ethtool to query speed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The carrier check is not called from work queue in current code.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_bridge_alloc() always reset the skb->nf_bridge, so we should always
put the old one.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at using netdev_rx_handler_register for openvswitch Jesse
Gross suggested that an unlikely() might be worthwhile in that code.
I'm interested to see if its appropriate for the bridge code.
Cc: Jesse Gross <jesse@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently you cannot disable multicast snooping while a device is
down. There is no good reason for this restriction and this patch
removes it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
On the bridge TX path we're leaking an skb when br_multicast_rcv
returns an error.
Reported-by: David Lamparter <equinox@diac24.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Long ago, when bridge was converted to RCU, rcu lock was equivalent
to having preempt disabled. RCU has changed a lot since then and
bridge code was still assuming the since transmit was called with
bottom half disabled, it was RCU safe.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new netpoll code in bridging contains use-after-free bugs
that are non-trivial to fix.
This patch fixes this by removing the code that uses skbs after
they're freed.
As a consequence, this means that we can no longer call bridge
from the netpoll path, so this patch also removes the controller
function in order to disable netpoll.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Thanks,
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_skip_exthdr() can return error code that is below zero.
'offset' is unsigned, so it makes no sense.
ipv6_skip_exthdr() returns 'int' so we can painlessly change type of
offset to int.
Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a small possibility that a reader gets incorrect values on 32
bit arches. SNMP applications could catch incorrect counters when a
32bit high part is changed by another stats consumer/provider.
One way to solve this is to add a rtnl_link_stats64 param to all
ndo_get_stats64() methods, and also add such a parameter to
dev_get_stats().
Rule is that we are not allowed to use dev->stats64 as a temporary
storage for 64bit stats, but a caller provided area (usually on stack)
Old drivers (only providing get_stats() method) need no changes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge protocol lives dangerously by having incestuous relations
with the IP stack. In this instance an abomination has been created
where a bogus IPCB area from a bridged packet leads to a crash in
the IP stack because it's interpreted as IP options.
This patch papers over the problem by clearing the IPCB area in that
particular spot. To fix this properly we'd also need to parse any
IP options if present but I'm way too lazy for that.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
Signed-off-by: David S. Miller <davem@davemloft.net>
On Tue, Jul 06, 2010 at 08:48:35AM +0800, Herbert Xu wrote:
>
> bridge: Restore NULL check in br_mdb_ip_get
Resend with proper attribution.
bridge: Restore NULL check in br_mdb_ip_get
Somewhere along the line the NULL check in br_mdb_ip_get went
AWOL, causing crashes when we receive an IGMP packet with no
multicast table allocated.
This patch restores it and ensures all br_mdb_*_get functions
use it.
Reported-by: Frank Arnold <frank.arnold@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Thanks,
Signed-off-by: David S. Miller <davem@davemloft.net>
Support more fine grained control of bridge netfilter iptables invocation
by adding seperate brnf_call_*tables parameters for each device using the
sysfs interface. Packets are passed to layer 3 netfilter when either the
global parameter or the per bridge parameter is enabled.
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Use u64_stats_sync infrastructure to provide 64bit rx/tx
counters even on 32bit hosts.
It is safe to use a single u64_stats_sync for rx and tx,
because BH is disabled on both, and we use per_cpu data.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is common in end-node, non STP bridges to set forwarding
delay to zero; which causes the forwarding database cleanup
to run every clock tick. Change to run only as soon as needed
or at next ageing timer interval which ever is sooner.
Use round_jiffies_up macro rather than attempting round up
by changing value.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The version of br_netpoll_send_skb used when netpoll is off is
missing a const thus causing a warning.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge multicast patches introduced an OOM crash in the forward
path, when deliver_clone fails to clone the skb.
Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register net_bridge_port pointer as rx_handler data pointer. As br_port is
removed from struct net_device, another netdev priv_flag is added to indicate
the device serves as a bridge port. Also rcuized pointers are now correctly
dereferenced in br_fdb.c and in netfilter parts.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add possibility to register rx_handler data pointer along with a rx_handler.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are multiple problems with the newly added netpoll support:
1) Use-after-free on each netpoll packet.
2) Invoking unsafe code on netpoll/IRQ path.
3) Breaks when netpoll is enabled on the underlying device.
This patch fixes all of these problems. In particular, we now
allocate proper netpoll structures for each underlying device.
We only allow netpoll to be enabled on the bridge when all the
devices underneath it support netpoll. Once it is enabled, we
do not allow non-netpoll devices to join the bridge (until netpoll
is disabled again).
This allows us to do away with the npinfo juggling that caused
problem number 1.
Incidentally this patch fixes number 2 by bypassing unsafe code
such as multicast snooping and netfilter.
Reported-by: Qianfeng Zhang <frzhang@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that netpoll always zaps npinfo we no longer need to do it
in bridge.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove useless union keyword in rtable, rt6_info and dn_route.
Since there is only one member in a union, the union keyword isn't useful.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
What this patch does is it removes two receive frame hooks (for bridge and for
macvlan) from __netif_receive_skb. These are replaced them with a single
hook for both. It only supports one hook per device because it makes no
sense to do bridging and macvlan on the same device.
Then a network driver (of virtual netdev like macvlan or bridge) can register
an rx_handler for needed net device.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid dirtying bridge_parent_rtable refcount, using new dst noref
infrastructure.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This allows bin_attr->read,write,mmap callbacks to check file specific data
(such as inode owner) as part of any privilege validation.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix build when CONFIG_SYSFS is not enabled:
net/bridge/br_if.c:136: error: 'struct net_bridge_port' has no member named 'sysfs_name'
Note: dev->name == sysfs_name except when change name is in
progress, and we are protected from that by RTNL mutex.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Links for each port are created in sysfs using the device
name, but this could be changed after being added to the
bridge.
As well as being unable to remove interfaces after this
occurs (because userspace tools don't recognise the new
name, and the kernel won't recognise the old name), adding
another interface with the old name to the bridge will
cause an error trying to create the sysfs link.
This fixes the problem by listening for NETDEV_CHANGENAME
notifications and renaming the link.
https://bugzilla.kernel.org/show_bug.cgi?id=12743
Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use one set of macro's for all bridge messages.
Note: can't use netdev_XXX macro's because bridge is purely
virtual and has no device parent.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move code around so that the ifdef for NETPOLL_CONTROLLER don't have to
show up in main code path. The control functions should be in helpers
that are only compiled if needed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since xt_action_param is writable, let's use it. The pointer to
'bool hotdrop' always worried (8 bytes (64-bit) to write 1 byte!).
Surprisingly results in a reduction in size:
text data bss filename
5457066 692730 357892 vmlinux.o-prev
5456554 692730 357892 vmlinux.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
In future, layer-3 matches will be an xt module of their own, and
need to set the fragoff and thoff fields. Adding more pointers would
needlessy increase memory requirements (esp. so for 64-bit, where
pointers are wider).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The structures carried - besides match/target - almost the same data.
It is possible to combine them, as extensions are evaluated serially,
and so, the callers end up a little smaller.
text data bss filename
-15318 740 104 net/ipv4/netfilter/ip_tables.o
+15286 740 104 net/ipv4/netfilter/ip_tables.o
-15333 540 152 net/ipv6/netfilter/ip6_tables.o
+15269 540 152 net/ipv6/netfilter/ip6_tables.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Based on the previous patch, make bridge support netpoll by:
1) implement the 2 methods to support netpoll for bridge;
2) modify netpoll during forwarding packets via bridge;
3) disable netpoll support of bridge when a netpoll-unabled device
is added to bridge;
4) enable netpoll support when all underlying devices support netpoll.
Cc: David Miller <davem@davemloft.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some declarations around to make it clearer which variables
are being used inside loop.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recently introduced bridge mulitcast port group list was only
partially using RCU correctly. It was missing rcu_dereference()
and missing the necessary barrier on deletion.
The code should have used one of the standard list methods (list or hlist)
instead of open coding a RCU based link list.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix unsafe usage of RCU. Would never work on Alpha SMP because
of lack of rcu_dereference()
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By coding slightly differently, there are only two cases
to deal with: add at head and add after previous entry.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit ff65e8275f.
As explained by Stephen Hemminger, the traversal doesn't require
RCU handling as we hold a lock.
The list addition et al. calls, on the other hand, do.
Signed-off-by: David S. Miller <davem@davemloft.net>
I prefer that the hlist be only accessed through the hlist macro
objects. Explicit twiddling of links (especially with RCU) exposes
the code to future bugs.
Compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a report from Stephen Rothwell:
--------------------
net/bridge/br_multicast.c: In function 'br_ip6_multicast_alloc_query':
net/bridge/br_multicast.c:469: error: implicit declaration of function 'csum_ipv6_magic'
Introduced by commit 08b202b672 ("bridge
br_multicast: IPv6 MLD support") from the net tree.
csum_ipv6_magic is declared in net/ip6_checksum.h ...
--------------------
Signed-off-by: David S. Miller <davem@davemloft.net>
Even with commit 32dec5dd02 ("bridge
br_multicast: Don't refer to BR_INPUT_SKB_CB(skb)->mrouters_only
without IGMP snooping."), BR_INPUT_SKB_CB(skb)->mrouters_only is
not appropriately initialized if IGMP/MLD snooping support is
compiled and disabled, so we can see garbage.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce struct br_ip{} to store ip address and protocol
and make functions more generic so that we can support
both IPv4 and IPv6 with less pain.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Sparse can help us find endianness bugs, but we need to make some
cleanups to be able to more easily spot real bugs.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
grec_nsrcs is in network order, we should convert to host horder in
br_multicast_igmp3_report()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MTU for IP traffic encapsulated inside PPPoE traffic is smaller
than the MTU of the Ethernet device (1500). Connection tracking
gathers all IP packets and sometimes will refragment them in
ip_fragment(). We then need to subtract the length of the
encapsulating header from the mtu used in ip_fragment(). The check in
br_nf_dev_queue_xmit() which determines if ip_fragment() has to be
called is also updated for the PPPoE-encapsulated packets.
nf_bridge_copy_header() is also updated to make sure the PPPoE data
length field has the correct value.
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
- fix IP DNAT on vlan- or pppoe-encapsulated traffic: The functions
neigh_hh_output() or dst->neighbour->output() overwrite the complete
Ethernet header, although we only need the destination MAC address.
For encapsulated packets, they ended up overwriting the encapsulating
header. The new code copies the Ethernet source MAC address and
protocol number before calling dst->neighbour->output(). The Ethernet
source MAC and protocol number are copied back in place in
br_nf_pre_routing_finish_bridge_slow(). This also makes the IP DNAT
more transparent because in the old scheme the source MAC of the
bridge was copied into the source address in the Ethernet header. We
also let skb->protocol equal ETH_P_IP resp. ETH_P_IPV6 during the
execution of the PF_INET resp. PF_INET6 hooks.
- Speed up IP DNAT by calling neigh_hh_bridge() instead of
neigh_hh_output(): if dst->hh is available, we already know the MAC
address so we can just copy it.
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Remove br_netfilter.c::br_nf_local_out(). The function
br_nf_local_out() was needed because the PF_BRIDGE::LOCAL_OUT hook
could be called when IP DNAT happens on to-be-bridged traffic. The
new scheme eliminates this mess.
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ip_refrag isn't used anymore in the bridge-netfilter code
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
bridge-netfilter: cleanup br_netfilter.c
- remove some of the graffiti at the head of br_netfilter.c
- remove __br_dnat_complain()
- remove KERN_INFO messages when CONFIG_NETFILTER_DEBUG is defined
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The IGMP3 report parsing is looking at the wrong address for
group records. This patch fixes it.
Reported-by: Banyeer <banyeer@yahoo.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
// <smpl>
@@
type bool;
identifier check, par;
@@
-bool check
+int check
(struct xt_tgchk_param *par) { ... }
// </smpl>
Minus the change it does to xt_ct_find_proto.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
This semantic patch may not be too precise (checking for functions
that use xt_mtchk_param rather than functions referenced by
xt_match.checkentry), but reviewed, it produced the intended result.
// <smpl>
@@
type bool;
identifier check, par;
@@
-bool check
+int check
(struct xt_mtchk_param *par) { ... }
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The first argument to NF_HOOK* is an nfproto since quite some time.
Commit v2.6.27-2457-gfdc9314 was the first to practically start using
the new names. Do that now for the remaining NF_HOOK calls.
The semantic patch used was:
// <smpl>
@@
@@
(NF_HOOK
|NF_HOOK_THRESH
)(
-PF_BRIDGE,
+NFPROTO_BRIDGE,
...)
@@
@@
NF_HOOK(
-PF_INET6,
+NFPROTO_IPV6,
...)
@@
@@
NF_HOOK(
-PF_INET,
+NFPROTO_IPV4,
...)
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Supplement to 1159683ef4.
Downgrade the log level to INFO for most checkentry messages as they
are, IMO, just an extra information to the -EINVAL code that is
returned as part of a parameter "constraint violation". Leave errors
to real errors, such as being unable to create a LED trigger.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
We never actually use iph again so this assignment can be removed.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's not desired for underlaying devices to change type. At the time,
there is for example possible to have bond with changed type from
Ethernet to Infiniband as a port of a bridge. This patch fixes this.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ENOMEM is a very obvious error code (cf. EINVAL), so I think we do not
really need a warning message. Not to mention that if the allocation
fails, the user is most likely going to get a stack trace from slab
already.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The shared packet statistics are a potential source of slow down
on bridged traffic. Convert to per-cpu array, but only keep those
statistics which change per-packet.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without CONFIG_BRIDGE_IGMP_SNOOPING,
BR_INPUT_SKB_CB(skb)->mrouters_only is not appropriately
initialized, so we can see garbage.
A clear option to fix this is to set it even without that
config, but we cannot optimize out the branch.
Let's introduce a macro that returns value of mrouters_only
and let it return 0 without CONFIG_BRIDGE_IGMP_SNOOPING.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Michael Braun <michael-dev@fami-braun.de>
bridge: Fix br_forward crash in promiscuous mode
It's a linux-next kernel from 2010-03-12 on an x86 system and it
OOPs in the bridge module in br_pass_frame_up (called by
br_handle_frame_finish) because brdev cannot be dereferenced (its set to
a non-null value).
Adding some BUG_ON statements revealed that
BR_INPUT_SKB_CB(skb)->brdev == br-dev
(as set in br_handle_frame_finish first)
only holds until br_forward is called.
The next call to br_pass_frame_up then fails.
Digging deeper it seems that br_forward either frees the skb or passes
it to NF_HOOK which will in turn take care of freeing the skb. The
same is holds for br_pass_frame_ip. So it seems as if two independent
skb allocations are required. As far as I can see, commit
b33084be19 ("bridge: Avoid unnecessary
clone on forward path") removed skb duplication and so likely causes
this crash. This crash does not happen on 2.6.33.
I've therefore modified br_forward the same way br_flood has been
modified so that the skb is not freed if skb0 is going to be used
and I can confirm that the attached patch resolves the issue for me.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>