Commit Graph

519808 Commits

Author SHA1 Message Date
Alexei Starovoitov
5bacd7805a samples/bpf: bpf_tail_call example for tracing
kprobe example that demonstrates how future seccomp programs may look like.
It attaches to seccomp_phase1() function and tail-calls other BPF programs
depending on syscall number.

Existing optimized classic BPF seccomp programs generated by Chrome look like:
if (sd.nr < 121) {
  if (sd.nr < 57) {
    if (sd.nr < 22) {
      if (sd.nr < 7) {
        if (sd.nr < 4) {
          if (sd.nr < 1) {
            check sys_read
          } else {
            if (sd.nr < 3) {
              check sys_write and sys_open
            } else {
              check sys_close
            }
          }
        } else {
      } else {
    } else {
  } else {
} else {
}

the future seccomp using native eBPF may look like:
  bpf_tail_call(&sd, &syscall_jmp_table, sd.nr);
which is simpler, faster and leaves more room for per-syscall checks.

Usage:
$ sudo ./tracex5
<...>-366   [001] d...     4.870033: : read(fd=1, buf=00007f6d5bebf000, size=771)
<...>-369   [003] d...     4.870066: : mmap
<...>-369   [003] d...     4.870077: : syscall=110 (one of get/set uid/pid/gid)
<...>-369   [003] d...     4.870089: : syscall=107 (one of get/set uid/pid/gid)
   sh-369   [000] d...     4.891740: : read(fd=0, buf=00000000023d1000, size=512)
   sh-369   [000] d...     4.891747: : write(fd=1, buf=00000000023d3000, size=512)
   sh-369   [000] d...     4.891747: : read(fd=1, buf=00000000023d3000, size=512)

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 17:07:59 -04:00
Alexei Starovoitov
b52f00e6a7 x86: bpf_jit: implement bpf_tail_call() helper
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table

In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue, so the callee program reuses the same stack.

The logic can be roughly expressed in C like:

u32 tail_call_cnt;

void *jumptable[2] = { &&label1, &&label2 };

int bpf_prog1(void *ctx)
{
label1:
    ...
}

int bpf_prog2(void *ctx)
{
label2:
    ...
}

int bpf_prog1(void *ctx)
{
    ...
    if (tail_call_cnt++ < MAX_TAIL_CALL_CNT)
        goto *jumptable[index]; ... and pass my 'ctx' to callee ...

    ... fall through if no entry in jumptable ...
}

Note that 'skip current program epilogue and next program prologue' is
an optimization. Other JITs don't have to do it the same way.
>From safety point of view it's valid as well, since programs always
initialize the stack before use, so any residue in the stack left by
the current program is not going be read. The same verifier checks are
done for the calls from the kernel into all bpf programs.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 17:07:59 -04:00
Alexei Starovoitov
04fd61ab36 bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  bpf_tail_call(ctx, &jmp_table, index);
  ...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  if (jmp_table[index])
    return (*jmp_table[index])(ctx);
  ...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.

bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table

Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.

New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.

The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.

Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>

==========
- simplify complex programs by splitting them into a sequence of small programs

- dispatch routine
  For tracing and future seccomp the program may be triggered on all system
  calls, but processing of syscall arguments will be different. It's more
  efficient to implement them as:
  int syscall_entry(struct seccomp_data *ctx)
  {
     bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
     ... default: process unknown syscall ...
  }
  int sys_write_event(struct seccomp_data *ctx) {...}
  int sys_read_event(struct seccomp_data *ctx) {...}
  syscall_jmp_table[__NR_write] = sys_write_event;
  syscall_jmp_table[__NR_read] = sys_read_event;

  For networking the program may call into different parsers depending on
  packet format, like:
  int packet_parser(struct __sk_buff *skb)
  {
     ... parse L2, L3 here ...
     __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
     bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
     ... default: process unknown protocol ...
  }
  int parse_tcp(struct __sk_buff *skb) {...}
  int parse_udp(struct __sk_buff *skb) {...}
  ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
  ipproto_jmp_table[IPPROTO_UDP] = parse_udp;

- for TC use case, bpf_tail_call() allows to implement reclassify-like logic

- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
  are atomic, so user space can build chains of BPF programs on the fly

Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
  It could have been implemented without JIT changes as a wrapper on top of
  BPF_PROG_RUN() macro, but with two downsides:
  . all programs would have to pay performance penalty for this feature and
    tail call itself would be slower, since mandatory stack unwind, return,
    stack allocate would be done for every tailcall.
  . tailcall would be limited to programs running preempt_disabled, since
    generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
    need to be either global per_cpu variable accessed by helper and by wrapper
    or global variable protected by locks.

  In this implementation x64 JIT bypasses stack unwind and jumps into the
  callee program after prologue.

- bpf_prog_array_compatible() ensures that prog_type of callee and caller
  are the same and JITed/non-JITed flag is the same, since calling JITed
  program from non-JITed is invalid, since stack frames are different.
  Similarly calling kprobe type program from socket type program is invalid.

- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
  abstraction, its user space API and all of verifier logic.
  It's in the existing arraymap.c file, since several functions are
  shared with regular array map.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 17:07:59 -04:00
Daniel Borkmann
e7582bab5d net: dev: reduce both ingress hook ifdefs
Reduce ifdef pollution slightly, no functional change. We can simply
remove the extra alternative definition of handle_ing() and nf_ingress().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 16:58:53 -04:00
Eric Dumazet
eb9344781a tcp: add a force_schedule argument to sk_stream_alloc_skb()
In commit 8e4d980ac2 ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.

It turns out there are other cases where we want to force memory
schedule :

tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be definitive, and socket
has no chance to effectively reduce its memory usage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 16:56:40 -04:00
Erik Kline
765c9c639f neigh: Better handling of transition to NUD_PROBE state
[1] When entering NUD_PROBE state via neigh_update(), perhaps received
    from userspace, correctly (re)initialize the probes count to zero.

    This is useful for forcing revalidation of a neighbor (for example
    if the host is attempting to do DNA [IPv4 4436, IPv6 6059]).

[2] Notify listeners when a neighbor goes into NUD_PROBE state.

    By sending notifications on entry to NUD_PROBE state listeners get
    more timely warnings of imminent connectivity issues.

    The current notifications on entry to NUD_STALE have somewhat
    limited usefulness: NUD_STALE is a perfectly normal state, as is
    NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after
    a neighbor reachability problem has been confirmed (typically after
    three probes).

Signed-off-by: Erik Kline <ek@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 16:52:17 -04:00
Andy Zhou
06b2c61c92 ip: remove unused function prototype
ip_do_nat() function was removed prior to kernel 3.4. Remove the
unnecessary function prototype as well.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:54:36 -04:00
Daniel Borkmann
492135557d tcp: add rfc3168, section 6.1.1.1. fallback
This work as a follow-up of commit f7b3bec6f5 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:

  [...] A host that receives no reply to an ECN-setup SYN within the
  normal SYN retransmission timeout interval MAY resend the SYN and
  any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

 1) Normal ECN-capable path:

    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->

 2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->

In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)

In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:

  Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
  Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
  http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.

tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:53:37 -04:00
David S. Miller
134e0dbe72 Merge branch 'cxgb4-next'
Hariprasad Shenai says:

====================
cxgb4: Remove dead code and replace byte-oder functions

This series removes dead fn t4_read_edc and t4_read_mc, also replaces
ntoh{s,l} and hton{s,l} calls with the generic byteorder.

PATCH 2/2 was sent as a single PATCH, but had some byte-ordering issues
in t4_read_edc and t4_read_mc function. Found that t4_read_edc and
t4_read_mc is unused, so PATCH 1/2 is added to remove it.

This patch series is created against net-next tree and includes
patches on cxgb4 driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:47:32 -04:00
Hariprasad Shenai
f404f80c70 cxgb4: replace ntoh{s, l} and hton{s, l} calls with the generic byteorder
replace ntoh{s,l} and hton{s,l} calls with the generic byteorder in
cxgb4/t4_hw.c file

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:47:31 -04:00
Hariprasad Shenai
75daacc7ea cxgb4: Remove dead function t4_read_edc and t4_read_mc
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:47:31 -04:00
David S. Miller
b7a3a8e31f This just has a few fixes:
* LED throughput trigger was crashing
  * fast-xmit wasn't treating QoS changes in IBSS correctly
  * TDLS could use the wrong channel definition
  * using a reserved channel context could use the wrong channel width
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJVWuPvAAoJEDBSmw7B7bqr/A4P/0TqzkCC5L2qJvi3a6QNFxvf
 s3riQMJ8WQeUxCRNgFNeeeNUAgSJn3hhiINGrjRmkwXXxYC4mbCwM0YXNT+WhSRL
 /Kx4mRJr7u5ZU0olW+KRvIV5CyTsbr9zVnaraCh5NV43nT87ZVZRBKC9vz2UkSM5
 AsN6fUvhWMGhhHoGGDqtjRBjve8Xs5iKiEcE1iQTzLOPnFP3dKtB1zKKiA0JCQs4
 OjxkQ7uaF0T1IfkMFr0gyzgQi4A8iPoMKV3qcRIH/QZN5dpJ6DR1dgaU50CrzQ+R
 JD9W09ifF9U8GnvQU/baJHKCxEvnQWO2XwlV4+mV6bXF1j5Ng4LRiXntIeu2d3T3
 5JuvPV9cNJb8dSTzsYw+TRJg73hStlJCAjVMJ7hiOMQc1YCCY9Exrff0pWzJPJfE
 NygIkMHXymcy66yL3b7DIIXro5jHNVGVoHq3vMB+W+/EcEDFN6L9LeCzUVo+oKjl
 Qg4kC7VHDjcdt0f7Vgv2Cal76ZVfCZaq74QZV1cySF2sCiD27LnAAfoHVeMY979K
 qBsCRqhkBlc7ntnstv6tGz9LfG8ro+Fv548HIUDG80capZl6N6FR6g+8hYIuvJwu
 2abJq36bp/NAGDt43UofmtDxyZNyvoKmzcQKSdn2QpryGKQ3uRcDHG/I+WD0yWX9
 4WFNEm86sXmfL/Eyu0lU
 =iUFe
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This just has a few fixes:
 * LED throughput trigger was crashing
 * fast-xmit wasn't treating QoS changes in IBSS correctly
 * TDLS could use the wrong channel definition
 * using a reserved channel context could use the wrong channel width
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:43:17 -04:00
Arnd Bergmann
9a03259c3d be2net: make hwmon interface optional
The hwmon interface in the be2net driver causes a link error when
be2net is built-in while the hwmon subsystem is a loadable module:

drivers/built-in.o: In function `be_probe':
drivers/net/ethernet/emulex/benet/be_main.c:5761: undefined reference to `devm_hwmon_device_register_with_groups'

This adds a new Kconfig symbol, following the example of multiple
other drivers that have the same problem. The new CONFIG_BE2NET_HWMON
will not be available when (BE2NET=y && HWMON=m) to avoid this
problem.

We have to also mark be_hwmon_show_temp as 'static' to ensure the
compiler can optimize out all the unused code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 29e9122b3a ("be2net: Export board temperature using hwmon-sysfs interface.")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:40:04 -04:00
Eric B Munson
aea0929e51 tcp: Return error instead of partial read for saved syn headers
Currently the getsockopt() requesting the cached contents of the syn
packet headers will fail silently if the caller uses a buffer that is
too small to contain the requested data.  Rather than fail silently and
discard the headers, getsockopt() should return an error and report the
required size to hold the data.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:33:34 -04:00
Parav Pandit
c9a70d4346 net-next: ethtool: Added port speed macros.
Signed-off-by: Parav Pandit <parav.pandit@avagotech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:32:18 -04:00
David S. Miller
76d7c45765 Merge branch 'icmp_frag'
Andy Zhou says:

====================
fragmentation ICMP

Currently, we send ICMP packets when errors occur during fragmentation or
de-fragmentation.  However, it is a bug when sending those ICMP packets
in the context of using netfilter for bridging.

Those ICMP packets are only expected in the context of routing, not in
bridging mode.

The local stack is not involved in bridging forward decisions, thus
should be not used for deciding the reverse path for those ICMP messages.

This bug only affects IPV4, not in IPv6.

v1->v2:  restructure the patches into two patches that fix defragmentation and
         fragmentation respectively.

	 A bit is add in IPCB to control whether ICMP packet should be
	 generated for defragmentation.

	 Fragmentation ICMP is now removed by restructuring the
	 ip_fragment() API.

v2->v3:  Add droping icmp for bridging contrack users
         drop exporting ip_fragment() API.

v3->v4:  Remove unnecessary parentheses in 'return' statements

v4->v5:  Drop the patch that sets and checks a bit in IPCB
         that prevents ip_defrag to send ICMP.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:50 -04:00
Andy Zhou
49d16b23cd bridge_netfilter: No ICMP packet on IPv4 fragmentation error
When bridge netfilter re-fragments an IP packet for output, all
packets that can not be re-fragmented to their original input size
should be silently discarded.

However, current bridge netfilter output path generates an ICMP packet
with 'size exceeded MTU' message for such packets, this is a bug.

This patch refactors the ip_fragment() API to allow two separate
use cases. The bridge netfilter user case will not
send ICMP, the routing output will, as before.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:39 -04:00
Andy Zhou
8bc04864ac IPv4: skip ICMP for bridge contrack users when defrag expires
users in [IP_DEFRAG_CONNTRACK_BRIDGE_IN, __IP_DEFRAG_CONNTRACK_BR_IN]
should not ICMP message also.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:27 -04:00
Andy Zhou
5cf4228082 ipv4: introduce frag_expire_skip_icmp()
Improve readability of skip ICMP for de-fragmentation expiration logic.
This change will also make the logic easier to maintain when the
following patches in this series are applied.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:26 -04:00
Willem de Bruijn
a2ad5d2ad9 selftests/net: expect headroom in psock_fanout rollover
psock_fanout tests the various fanout modes. Change the test for
rollover mode to expect early rollover due to socket pressure
as implemented in 2ccdbaa6d5 ("packet: rollover lock contention
avoidance").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 17:04:18 -04:00
Edward Cree
7e91a210b4 sfc: nicer log message on Siena SR-IOV probe fail
We expect that MC_CMD_SRIOV will fail if the card has no VFs configured.
So output a readable message instead of a cryptic MCDI error.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 16:20:07 -04:00
David S. Miller
0bc4c07046 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next. Briefly
speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and
Serget Popovich, more incremental updates to make br_netfilter a better
place from Florian Westphal, ARP support to the x_tables mark match /
target from and context Zhang Chunyu and the addition of context to know
that the x_tables runs through nft_compat. More specifically, they are:

1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the
   IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik.

2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef.

3) Use skb->network_header to calculate the transport offset in
   ip_set_get_ip{4,6}_port(). From Alexander Drozdov.

4) Reduce memory consumption per element due to size miscalculation,
   this patch and follow up patches from Sergey Popovich.

5) Expand nomatch field from 1 bit to 8 bits to allow to simplify
   mtype_data_reset_flags(), also from Sergey.

6) Small clean for ipset macro trickery.

7) Fix error reporting when both ip_set_get_hostipaddr4() and
   ip_set_get_extensions() from per-set uadt functions.

8) Simplify IPSET_ATTR_PORT netlink attribute validation.

9) Introduce HOST_MASK instead of hardcoded 32 in ipset.

10) Return true/false instead of 0/1 in functions that return boolean
    in the ipset code.

11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute.

12) Allow to dereference from ext_*() ipset macros.

13) Get rid of incorrect definitions of HKEY_DATALEN.

14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match.

15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal.

16) Release nf_bridge_info after POSTROUTING since this is only needed
    from the physdev match, also from Florian.

17) Reduce size of ipset code by deinlining ip_set_put_extensions(),
    from Denys Vlasenko.

18) Oneliner to add ARP support to the x_tables mark match/target, from
    Zhang Chunyu.

19) Add context to know if the x_tables extension runs from nft_compat,
    to address minor problems with three existing extensions.

20) Correct return value in several seqfile *_show() functions in the
    netfilter tree, from Joe Perches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 14:47:36 -04:00
David S. Miller
17032ae32d Merge branch 'qeth-next'
Ursula Braun says:

====================
s390: network patches for net-next

here are s390 related patches for net-next
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 12:14:18 -04:00
Peter Oberparleiter
89a2a8e76b s390/lcs: Fix null-pointer access in msg
An attempt to configure a CTC device as LCS results in the
following error message:

  (null): Detecting a network adapter for LCS devices failed
          with rc=-5 (0xfffffffb)

"(null)" results from access to &card->dev->dev in lcs_new_device()
which is only initialized later in the function. Fix this by using
&ccwgdev->dev instead which is initialized before lcs_new_device()
is called.

Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 12:14:18 -04:00
Eugene Crosser
ffb9525141 qeth: replace ENOSYS with EOPNOTSUPP
Since recently, `checkpatch.pl` advices that ENOSYS should not be
used for anything other than "invalid syscall nr". This patch
replaces ENOSYS return code with EOPNOTSUPP for the "unsupported
function" conditions.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 12:14:17 -04:00
Eugene Crosser
ff1d929110 qeth: BRIDGEPORT "sanity check"
Forbid enabling IFF_PROMISC reflection to BRIDGEPORT when a role
is already assigned, and forbid direct manipulation of the role
when reflection mode is engaged.

Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 12:14:17 -04:00
Eugene Crosser
9c23f4dab1 qeth: OSA version of SETBRIDGEPORT command
OSA Ethernet hardware is introducing BRIDGEPORT functionality
similar (but not identical) to HiperSockets BRIDGEPORT. This
patch makes HiperSockets BRIDGEPORT related sysfs attributes
and udev events work with OSA hardware too.

Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 12:14:17 -04:00
Eugene Crosser
0db587b065 qeth: IFF_PROMISC flag to BRIDGE PORT mode
OSA and HiperSocket devices do not support promiscuous mode proper,
but they support "BRIDGE PORT" mode that is functionally similar.
This update introduces sysfs attribute that, when set, makes the driver
try to "reflect" setting and resetting of the IFF_PROMISC flag on the
interface into setting and resetting PRIMARY or SECONDARY bridge port
role on the underlying OSA or HiperSocket device.

Reviewed-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 12:14:17 -04:00
Eugene Crosser
d3c29a5c3f qeth: remove locks from sysfs _show
Locking is probably unnecessary in this case, and the rest of the
qeth sysfs code does not use locks in the *_show() functions.
Remove locks from the layer2 *_show() functions in which they where
accidentally introduced.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 12:14:17 -04:00
Eugene Crosser
c88394e7ee qeth: fix handling of IPA return codes
Function that executes IPA commands returns the result code from the
IPA response block. If non-negative, it needs to be transformed into
errno-compatible code before returning to the caller.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 12:14:17 -04:00
Thomas Richter
c7258d8637 qeth: fix rx checksum offload handling
ethtool is used to change some device driver features
such as RX/TX hardware checksum offloading.
The qeth device driver callback function to
turn on/off RX hardware check sum handling never changes
the hardware configuration.
The NETIF_F_RXCSUM bit is cleared when the feature bitset
type netdev_features_t(64bit) is assigned to 32 a bit
variable.

This patch fixes the NETIF_F_RXCSUM handling.
Also there is no need to manipulate the device's features
bit set as this is done by the caller when no error occurs.

Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 12:14:16 -04:00
Herbert Xu
b9fbe709de netlink: Use random autobind rover
Currently we use a global rover to select a port ID that is unique.
This used to work consistently when it was protected with a global
lock.  However as we're now lockless, the global rover can exhibit
pathological behaviour should multiple threads all stomp on it at
the same time.

Granted this will eventually resolve itself but the process is
suboptimal.

This patch replaces the global rover with a pseudorandom starting
point to avoid this issue.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 23:43:31 -04:00
WANG Cong
de133464c9 netns: make nsid_lock per net
The spinlock is used to protect netns_ids which is per net,
so there is no need to use a global spinlock.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 23:41:11 -04:00
Florian Fainelli
4ab7f91381 net: dsa: bcm_sf2: properly propagate carrier down state for MoCA
MoCA interfaces require the use of an user-space daemon (mocad) which
will typically use cmd->autoneg to force the link. This is causing other
network manager applications not to get proper carrier down
notifications because of the following sequence of events:

- link down interrupt is received, link is set to 0 by the interrupt
  handler
- fixed_link update callback runs and updates the BMSR register
  accordingly
- PHY library polls the PHY for link status, sees the link is down,
  proceeds with reporting that
- mocad gets notified of the link state and call phy_ethtool_sset()
  with cmd->autoneg set to the link status (0)
- phy_start_aneg() is called at the end of phy_ethtool_sset() and sets
  the PHY state to PHY_FORCING

Just make sure we notify the interface carrier appropriately when we
detect that the link is down in our fixed_link update callback. This is
made local to the bcm_sf2 driver as the PHY library does the right thing
in any case. This is similar to the GENET change introduced in
54d7c01d3e ("net: bcmgenet: enable MoCA
link state change detection").

Fixes: 246d7f773c ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 23:40:24 -04:00
Jiri Pirko
74b80e841b flow_dissector: remove bogus return in tipc section
Fixes: 06635a35d1 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 23:38:23 -04:00
sixiao@microsoft.com
4b02b58b52 hv_netvsc: change member name of struct netvsc_stats
Currently the struct netvsc_stats has a member s_sync
of type u64_stats_sync.
This definition will break kernel build as the macro
netdev_alloc_pcpu_stats requires this member name to be syncp.
(see netdev_alloc_pcpu_stats definition in ./include/linux/netdevice.h)

This patch changes netvsc_stats's member name from s_sync to syncp to fix
the build break.

Signed-off-by: Simon Xiao <sixiao@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 23:37:38 -04:00
Samudrala, Sridhar
45d4122ca7 switchdev: add support for fdb add/del/dump via switchdev_port_obj ops.
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump()
- use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops.
- add support for fdb obj in switchdev_port_obj_add/del/dump()
- switch rocker to implement fdb ops via switchdev_ops

v3: updated to sync with named union changes.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:49:09 -04:00
David S. Miller
5d48ef3e95 Merge branch 'tcp_mem_pressure'
Eric Dumazet says:

====================
tcp: better handling of memory pressure

When testing commit 790ba4566c ("tcp: set SOCK_NOSPACE under memory
pressure") using edge triggered epoll applications, I found various
issues under memory pressure and thousands of active sockets.

This patch series is a first round to solve these issues, in send
and receive paths. There are probably other fixes needed, but
with this series, my tests now all succeed.

v2: fix typo in "allow one skb to be received per socket under memory pressure",
as spotted by Jason Baron.
====================

Acked-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:46:26 -04:00
Eric Dumazet
b66e91ccbc tcp: halves tcp_mem[] limits
Allowing tcp to use ~19% of physical memory is way too much,
and allowed bugs to be hidden. Add to this that some drivers use a full
page per incoming frame, so real cost can be twice the advertized one.

Reduce tcp_mem by 50 % as a first step to sanity.

tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:49 -04:00
Eric Dumazet
76dfa60820 tcp: allow one skb to be received per socket under memory pressure
While testing tight tcp_mem settings, I found tcp sessions could be
stuck because we do not allow even one skb to be received on them.

By allowing one skb to be received, we introduce fairness and
eventuallu force memory hogs to release their allocation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:49 -04:00
Eric Dumazet
8e4d980ac2 tcp: fix behavior for epoll edge trigger
Under memory pressure, tcp_sendmsg() can fail to queue a packet
while no packet is present in write queue. If we return -EAGAIN
with no packet in write queue, no ACK packet will ever come
to raise EPOLLOUT.

We need to allow one skb per TCP socket, and make sure that
tcp sockets can release their forward allocations under pressure.

This is a followup to commit 790ba4566c ("tcp: set SOCK_NOSPACE
under memory pressure")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet
b8da51ebb1 tcp: introduce tcp_under_memory_pressure()
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet
a6c5ea4ccf tcp: rename sk_forced_wmem_schedule() to sk_forced_mem_schedule()
We plan to use sk_forced_wmem_schedule() in input path as well,
so make it non static and rename it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet
1a24e04e4b net: fix sk_mem_reclaim_partial()
sk_mem_reclaim_partial() goal is to ensure each socket has
one SK_MEM_QUANTUM forward allocation. This is needed both for
performance and better handling of memory pressure situations in
follow up patches.

SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
as some arches have 64KB pages.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Willem de Bruijn
4633c9e07b net-packet: fix null pointer exception in rollover mode
Rollover can be enabled as flag or mode. Allocate state in both cases.
This solves a NULL pointer exception in fanout_demux_rollover on
referencing po->rollover if using mode rollover.

Also make sure that in rollover mode each silo is tried (contrary
to rollover flag, where the main socket is excluded after an initial
try_self).

Tested:
  Passes tools/testing/net/psock_fanout.c, which tests both modes and
  flag. My previous tests were limited to bench_rollover, which only
  stresses the flag. The test now completes safely. it still gives an
  error for mode rollover, because it does not expect the new headroom
  (ROOM_NORMAL) requirement. I will send a separate patch to the test.

Fixes: 0648ab70af ("packet: rollover prepare: per-socket state")

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

I should have run this test and caught this before submission, of
course. Apologies for the oversight.
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:41:38 -04:00
Eric Dumazet
c91d460652 net: fix two sparse errors
First one in __skb_checksum_validate_complete() fixes the following
(and other callers)

make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/tcp_ipv4.o
  CHECK   net/ipv4/tcp_ipv4.c
include/linux/skbuff.h:3052:24: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3052:24:    expected restricted __sum16
include/linux/skbuff.h:3052:24:    got int

Second is fixing gso_make_checksum() :

  CHECK   net/ipv4/gre_offload.c
include/linux/skbuff.h:3360:14: warning: incorrect type in assignment (different base types)
include/linux/skbuff.h:3360:14:    expected unsigned short [unsigned] [usertype] csum
include/linux/skbuff.h:3360:14:    got restricted __sum16
include/linux/skbuff.h:3365:16: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3365:16:    expected restricted __sum16
include/linux/skbuff.h:3365:16:    got unsigned short [unsigned] [usertype] csum

Fixes: 5a21232983 ("net: Support for csum_bad in skbuff")
Fixes: 7e2b10c1e5 ("net: Support for multiple checksums with gso")
Signed-off-by: Eric Dumazet <edumazet@google.com>
CC: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 13:08:29 -04:00
Eric Dumazet
ba6d05641c netfilter: synproxy: fix sparse errors
Fix verbose sparse errors :

make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/ipt_SYNPROXY.o

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 13:08:29 -04:00
Eric Dumazet
252a8fbe81 ipip: fix one sparse error
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/ipip.o
  CHECK   net/ipv4/ipip.c
net/ipv4/ipip.c:254:27: warning: incorrect type in assignment (different base types)
net/ipv4/ipip.c:254:27:    expected restricted __be32 [addressable] [usertype] o_key
net/ipv4/ipip.c:254:27:    got restricted __be16 [addressable] [usertype] i_flags

Fixes: 3b7b514f44 ("ipip: fix a regression in ioctl")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 13:08:29 -04:00
Eric Dumazet
d53a2aa3a1 net: fix sparse error in csum_replace4()
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/nf_nat_l3proto_ipv4.o
  CHECK   net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64:    got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71:    got restricted __be32 [usertype] to
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64:    got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71:    got restricted __be32 [usertype] to

Fixes: 4565af0d40 ("net: optimise csum_replace4()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 13:08:29 -04:00
Joe Perches
861fb1078f netfilter: Use correct return for seq_show functions
Using seq_has_overflowed doesn't produce the right return value.
Either 0 or -1 is, but 0 is much more common and works well when
seq allocation retries.

I believe this doesn't matter as the initial allocation is always
sufficient, this is just a correctness patch.

Miscellanea:

o Don't use strlen, use *ptr to determine if a string
  should be emitted like all the other tests here
o Delete unnecessary return statements

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-17 17:25:35 +02:00