This patch is almost to revert commit 02f3d4ce9e ("sctp: Adjust PMTU
updates to accomodate route invalidation."). As t->asoc can't be NULL
in sctp_transport_update_pmtu, it could get sk from asoc, and no need
to pass sk into that function.
It is also to remove some duplicated codes from that function.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cb_running is reported in /proc/self/net/netlink and it is reported by
the ss tool, when it gets information from the proc files.
sock_diag is a new interface which is used instead of proc files, so it
looks reasonable that this interface has to report no less information
about sockets than proc files.
We use these flags to dump and restore netlink sockets.
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We accidentally left this dead code behind after commit 5952fde10c
("net: sched: choke: remove dead filter classify code").
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a private copy of struct net_device_stats in struct
batadv_priv, use stats from struct net_device.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
PPP pseudo-wire type is 7 (11 is L2TP_PWTYPE_IP).
Fixes: f1f39f9110 ("l2tp: auto load type modules")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take a reference on the sessions returned by l2tp_session_find_nth()
(and rename it l2tp_session_get_nth() to reflect this change), so that
caller is assured that the session isn't going to disappear while
processing it.
For procfs and debugfs handlers, the session is held in the .start()
callback and dropped in .show(). Given that pppol2tp_seq_session_show()
dereferences the associated PPPoL2TP socket and that
l2tp_dfs_seq_session_show() might call pppol2tp_show(), we also need to
call the session's .ref() callback to prevent the socket from going
away from under us.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Fixes: 0ad6614048 ("l2tp: Add debugfs files for dumping l2tp debug info")
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds initial support for network namespaces. The changes only
enable support in the CAN raw, proc and af_can code. GW and BCM still
have their checks that ensure that they are used only from the main
namespace.
The patch boils down to moving the global structures, i.e. the global
filter list and their /proc stats, into a per-namespace structure and passing
around the corresponding "struct net" in a lot of different places.
Changes since v1:
- rebased on current HEAD (2bfe01e)
- fixed overlong line
Signed-off-by: Mario Kicherer <dev@kicherer.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Number of sockets is limited by 16-bit, so 64-bit allocation will never
happen.
16-bit ops are the worst code density-wise on x86_64 because of
additional prefix (66).
Space savings:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-3 (-3)
function old new delta
reuseport_add_sock 539 536 -3
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make ->hash_count, ->low_watermark and ->high_watermark unsigned int
and propagate unsignedness to other variables.
This change doesn't change code generation because these fields aren't
used in 64-bit contexts but make it anyway: these fields can't be
negative numbers.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flow keys aren't 4GB+ numbers so 64-bit arithmetic is excessive.
Space savings (I'm not sure what CSWTCH is):
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-48 (-48)
function old new delta
flow_cache_lookup 1163 1159 -4
CSWTCH 75997 75953 -44
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Trippelsdorf reported that after commit dcb17d22e1 ("tcp: warn
on bogus MSS and try to amend it") the kernel started logging the
warning for a NIC driver that doesn't even support GRO.
It was diagnosed that it was possibly caused on connections that were
using TCP Timestamps but some packets lacked the Timestamps option. As
we reduce rcv_mss when timestamps are used, the lack of them would cause
the packets to be bigger than expected, although this is a valid case.
As this warning is more as a hint, getting a clean-cut on the
threshold is probably not worth the execution time spent on it. This
patch thus alleviates the false-positives with 2 quick checks: by
accounting for the entire TCP option space and also checking against the
interface MTU if it's available.
These changes, specially the MTU one, might mask some real positives,
though if they are really happening, it's possible that sooner or later
it will be triggered anyway.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to move sctp_transport_dst_check into sctp_packet_config
from sctp_packet_transmit and add pathmtu check in sctp_packet_config.
With this fix, sctp can update dst or pathmtu before appending chunks,
which can void dropping packets in sctp_packet_transmit when dst is
obsolete or dst's mtu is changed.
This patch is also to improve some other codes in sctp_packet_config.
It updates packet max_size with gso_max_size, checks for dst and
pathmtu, and appends ecne chunk only when packet is empty and asoc
is not NULL.
It makes sctp flush work better, as we only need to set up them once
for one flush schedule. It's also safe, since asoc is NULL only when
the packet is created by sctp_ootb_pkt_new in which it just gets the
new dst, no need to do more things for it other than set packet with
transport's pathmtu.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before when implementing sctp prsctp, SCTP_PR_STREAM_STATUS wasn't
added, as it needs to save abandoned_(un)sent for every stream.
After sctp stream reconf is added in sctp, assoc has structure
sctp_stream_out to save per stream info.
This patch is to add SCTP_PR_STREAM_STATUS by putting the prsctp
per stream statistics into sctp_stream_out.
v1->v2:
fix an indent issue.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last argument to __skb_header_pointer() should be a buffer large
enough to store struct arphdr. This can be a pointer to a struct arphdr
structure. The code was previously using a pointer to a pointer to
struct arphdr.
By my counting the storage available both before and after is 8 bytes on
x86_64.
Fixes: 55733350e5 ("flow disector: ARP support")
Reported-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ethtool code was spread in soft-interface.c. This makes reading the
code and working on it unnecessary complicated. Having everything in a
common place next to the other code which references it, makes it slightly
easier.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The .get_settings function pointer and the related API was deprecated.
Fortunately, batman-adv is a virtual interface and never provided any
useful information via .get_settings. The stub can therefore be
removed.
This also avoids that incorrect information is shown in ethtool about the
batadv interface.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
batadv devices don't support msglevel. The ethtool stubs therefore returned
that it isn't supported. But instead, the complete function can be dropped
to avoid that bogus values are shown in ethtool.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The ethtool_ops of batman-adv never contained more than a stub for the
get_link function pointer. It was always returning that a link exists even
when the devices was not yet up and therefore nothing resampling a link
could have been available.
Instead use the ethtool helper which returns the current carrier state.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The rds_connect_worker() has a bug in the check that enforces the
canonical connection order described in the comments of
rds_tcp_state_change(). The intention is to make sure that all
the multipath connections are always initiated by the smaller IP
address via rds_start_mprds. To achieve this, rds_connection_worker
should check that cp_index > 0.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_conn_shutdown() runs in workq context, and marks the rds_connection
as DISCONNECTING before quiescing Tx/Rx paths. However, after all I/O
has quiesced, we may still find the rds_connection state to be
RDS_CONN_ERROR if an intervening FIN was processed in softirq context.
This is not a fatal error: rds_conn_shutdown() should continue the
shutdown, and there is no need to log noisy messages about this event.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alow users to push down more labels per MPLS encap. Similar to LSR case,
move label array to the end of mpls_iptunnel_encap and allocate based on
the number of labels for the route.
For consistency with the LSR case, re-use the same maximum number of
labels.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow users to push down more labels per MPLS route. With the previous
patches, no memory allocations are based on MAX_NEW_LABELS; the limit
is only used to keep userspace in check.
At this point MAX_NEW_LABELS is only used for mpls_route_config (copying
route data from userspace) and processing nexthops looking for the max
number of labels across the route spec.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Limit memory allocation size for mpls_route to 4096.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move labels to the end of mpls_nh as a 0-sized array and within mpls_route
move the via for a nexthop after the mpls_nh. The new layout becomes:
+----------------------+
| mpls_route |
+----------------------+
| mpls_nh 0 |
+----------------------+
| alignment padding | 4 bytes for odd number of labels; 0 for even
+----------------------+
| via[rt_max_alen] 0 |
+----------------------+
| alignment padding | via's aligned on sizeof(unsigned long)
+----------------------+
| ... |
+----------------------+
| mpls_nh n-1 |
+----------------------+
| via[rt_max_alen] n-1 |
+----------------------+
Memory allocated for nexthop + via is constant across all nexthops and
their via. It is based on the maximum number of labels across all nexthops
and the maximum via length. The size is saved in the mpls_route as
rt_nh_size. Accessing a nexthop becomes rt->rt_nh + index * rt->rt_nh_size.
The offset of the via address from a nexthop is saved as rt_via_offset
so that given an mpls_nh pointer the via for that hop is simply
nh + rt->rt_via_offset.
With prior code, memory allocated per mpls_route with 1 nexthop:
via is an ethernet address - 64 bytes
via is an ipv4 address - 64
via is an ipv6 address - 72
With this patch set, memory allocated per mpls_route with 1 nexthop and
1 or 2 labels:
via is an ethernet address - 56 bytes
via is an ipv4 address - 56
via is an ipv6 address - 64
The 8-byte reduction is due to the previous patch; the change introduced
by this patch has no impact on the size of allocations for 1 or 2 labels.
Performance impact of this change was examined using network namespaces
with veth pairs connecting namespaces. ns0 inserts the packet to the
label-switched path using an lwt route with encap mpls. ns1 adds 1 or 2
labels depending on test, ns2 (and ns3 for 2-label test) pops the label
and forwards. ns3 (or ns4) for a 2-label is the destination. Similar
series of namespaces used for 2-nexthop test.
Intent is to measure changes to latency (overhead in manipulating the
packet) in the forwarding path. Tests used netperf with UDP_RR.
IPv4: current patches
1 label, 1 nexthop 29908 30115
2 label, 1 nexthop 29071 29612
1 label, 2 nexthop 29582 29776
2 label, 2 nexthop 29086 29149
IPv6: current patches
1 label, 1 nexthop 24502 24960
2 label, 1 nexthop 24041 24407
1 label, 2 nexthop 23795 23899
2 label, 2 nexthop 23074 22959
In short, the change has no effect to a modest increase in performance.
This is expected since this patch does not really have an impact on routes
with 1 or 2 labels (the current limit) and 1 or 2 nexthops.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Number of nexthops and number of alive nexthops are tracked using an
unsigned int. A route should never have more than 255 nexthops so
convert both to u8. Update all references and intermediate variables
to consistently use u8 as well.
Shrinks the size of mpls_route from 32 bytes to 24 bytes with a 2-byte
hole before the nexthops.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The number of alive nexthops for a route (rt->rt_nhn_alive) and the
flags for a next hop (nh->nh_flags) are modified by netdev event
handlers. The event handlers run with rtnl_lock held so updates are
always done with the lock held. The packet path accesses the fields
under the rcu lock. Since those fields can change at any moment in
the packet path, both fields should be accessed using READ_ONCE. Updates
to both fields should use WRITE_ONCE.
Update mpls_select_multipath (packet path) and mpls_ifdown and mpls_ifup
(event handlers) accordingly.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Callers of l2tp_nl_session_find() need to hold a reference on the
returned session since there's no guarantee that it isn't going to
disappear from under them.
Relying on the fact that no l2tp netlink message may be processed
concurrently isn't enough: sessions can be deleted by other means
(e.g. by closing the PPPOL2TP socket of a ppp pseudowire).
l2tp_nl_cmd_session_delete() is a bit special: it runs a callback
function that may require a previous call to session->ref(). In
particular, for ppp pseudowires, the callback is l2tp_session_delete(),
which then calls pppol2tp_session_close() and dereferences the PPPOL2TP
socket. The socket might already be gone at the moment
l2tp_session_delete() calls session->ref(), so we need to take a
reference during the session lookup. So we need to pass the do_ref
variable down to l2tp_session_get() and l2tp_session_get_by_ifname().
Since all callers have to be updated, l2tp_session_find_by_ifname() and
l2tp_nl_session_find() are renamed to reflect their new behaviour.
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp_session_find() doesn't take any reference on the returned session.
Therefore, the session may disappear while sending the notification.
Use l2tp_session_get() instead and decrement session's refcount once
the notification is sent.
Fixes: 33f72e6f0c ("l2tp : multicast notification to the registered listeners")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp_session_create() relies on its caller for checking for duplicate
sessions. This is racy since a session can be concurrently inserted
after the caller's verification.
Fix this by letting l2tp_session_create() verify sessions uniqueness
upon insertion. Callers need to be adapted to check for
l2tp_session_create()'s return code instead of calling
l2tp_session_find().
pppol2tp_connect() is a bit special because it has to work on existing
sessions (if they're not connected) or to create a new session if none
is found. When acting on a preexisting session, a reference must be
held or it could go away on us. So we have to use l2tp_session_get()
instead of l2tp_session_find() and drop the reference before exiting.
Fixes: d9e31d17ce ("l2tp: Add L2TP ethernet pseudowire support")
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Holding a reference on session is required before calling
pppol2tp_session_ioctl(). The session could get freed while processing the
ioctl otherwise. Since pppol2tp_session_ioctl() uses the session's socket,
we also need to take a reference on it in l2tp_session_get().
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Taking a reference on sessions in l2tp_recv_common() is racy; this
has to be done by the callers.
To this end, a new function is required (l2tp_session_get()) to
atomically lookup a session and take a reference on it. Callers then
have to manually drop this reference.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since sctp reconf was added in sctp, the real cnt of in/out stream
have not been c.sinit_max_instreams and c.sinit_num_ostreams any
more.
This patch is to replace them with stream->in/outcnt.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 96567d5dac ("net: dsa: dsa2: Add basic support of devlink")
I see the following link error with CONFIG_NET_DSA=y and CONFIG_NET_DEVLINK=m:
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe226b): undefined reference to `devlink_alloc'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe2284): undefined reference to `devlink_register'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe243e): undefined reference to `devlink_port_register'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe24e1): undefined reference to `devlink_port_register'
net/built-in.o: In function 'dsa_register_switch':
(.text+0xe24fa): undefined reference to `devlink_port_type_eth_set'
net/built-in.o: In function 'dsa_dst_unapply.part.8':
dsa2.c:(.text.unlikely+0x345): undefined reference to 'devlink_port_unregister'
dsa2.c:(.text.unlikely+0x36c): undefined reference to 'devlink_port_unregister'
dsa2.c:(.text.unlikely+0x38e): undefined reference to 'devlink_port_unregister'
dsa2.c:(.text.unlikely+0x3f2): undefined reference to 'devlink_unregister'
dsa2.c:(.text.unlikely+0x3fb): undefined reference to 'devlink_free'
Fix this by adding a dependency on MAY_USE_DEVLINK so that CONFIG_NET_DSA
get switched to be build as module when CONFIG_NET_DEVLINK=m.
Fixes: 96567d5dac ("net: dsa: dsa2: Add basic support of devlink")
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
development and testing of networking bpf programs is quite cumbersome.
Despite availability of user space bpf interpreters the kernel is
the ultimate authority and execution environment.
Current test frameworks for TC include creation of netns, veth,
qdiscs and use of various packet generators just to test functionality
of a bpf program. XDP testing is even more complicated, since
qemu needs to be started with gro/gso disabled and precise queue
configuration, transferring of xdp program from host into guest,
attaching to virtio/eth0 and generating traffic from the host
while capturing the results from the guest.
Moreover analyzing performance bottlenecks in XDP program is
impossible in virtio environment, since cost of running the program
is tiny comparing to the overhead of virtio packet processing,
so performance testing can only be done on physical nic
with another server generating traffic.
Furthermore ongoing changes to user space control plane of production
applications cannot be run on the test servers leaving bpf programs
stubbed out for testing.
Last but not least, the upstream llvm changes are validated by the bpf
backend testsuite which has no ability to test the code generated.
To improve this situation introduce BPF_PROG_TEST_RUN command
to test and performance benchmark bpf programs.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce crosschip_bridge_{join,leave} operations in the dsa_switch_ops
structure, which can be used by switches supporting interconnection.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ovs_flow_key_update() is called when the flow key is invalid, and it is
used to update and revalidate the flow key. Commit 329f45bc4f
("openvswitch: add mac_proto field to the flow key") introduces mac_proto
field to flow key and use it to determine whether the flow key is valid.
However, the commit does not update the code path in ovs_flow_key_update()
to revalidate the flow key which may cause BUG_ON() on execute_recirc().
This patch addresses the aforementioned issue.
Fixes: 329f45bc4f ("openvswitch: add mac_proto field to the flow key")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
backchannel; fix. Also some minor refinements to the nfsd
version-setting interface that we'd like to get fixed before release.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJY3weFAAoJECebzXlCjuG+fQYQAKh7QVB/vDZwvjWZ7OGSBhoe
LJfS1mzzf52MfQk6qNEM9to86qu+ywmH4XD2rvgMdif6Kc0qF1xBzzVEnp527lVW
nt1nLq/oIXeRl7a+5p+FxTnqHn0OBH+iCnxcVZI8tvHgptpR+6TwEzR36k4n4esN
kvNrrv9dFIoBGx0nSBScHTC3zNArU+C9oZTTHAuPJICNBHsOMqYF7sSAXPg9NFR+
HnowZfGWWXpSnWZ9DaM14Zb/dX0B9Pexv1MsgfiKYS9Beh7g4JN48+CHtWyVliwd
M/LVBpT2ZwFM/NzvJ2exVIqm/hwj4El2Sjy67XHwQzDvGjnUn/fz55SfXfzZSMyD
PMj+IeHKuT3jipNui1AXAlzYEz8gPvuKQ0vQ0vVuNX4Ln28KydGwvVTkUNltDnfq
E7L7RI6mj03OY1j1p6zeK6UJeueZq1gmjfq1NVTPO+TgQFPhh8A50NsDEVcaiwNN
W8uX7Qa39y79BT+4OFuYL05AbuqxKR+nAmCbVSLy9Kq4sc/6/YErBZtXXAghzPPl
4Es4tzlAd8skkxWlcVeCeUqPfcDCqHL6xKqa62m5wUuYXmqPtIMyAyVutF4lWpH/
dAV5Lcjz7HeQnpCgFZXXtIQW5OIIfFa08s0f7fuFm+uxz+nM/x8gg99tuwWP6OsT
Za7EtdB84M1ZGgjO3JhY
=oi0+
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"The restriction of NFSv4 to TCP went overboard and also broke the
backchannel; fix.
Also some minor refinements to the nfsd version-setting interface that
we'd like to get fixed before release"
* tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux:
svcrdma: set XPT_CONG_CTRL flag for bc xprt
NFSD: fix nfsd_reset_versions for NFSv4.
NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
NFSD: further refinement of content of /proc/fs/nfsd/versions
nfsd: map the ENOKEY to nfserr_perm for avoiding warning
SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt
sock_recv_ts_and_drops() unconditionally set sk->sk_stamp for
every packet, even if the SOCK_TIMESTAMP flag is not set in the
related socket.
If selinux is enabled, this cause a cache miss for every packet
since sk->sk_stamp and sk->sk_security share the same cacheline.
With this change sk_stamp is set only if the SOCK_TIMESTAMP
flag is set, and is cleared for the first packet, so that the user
perceived behavior is unchanged.
This gives up to 5% speed-up under udp-flood with small packets.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. Move the "window = tp->rcv_wnd;" into the condition block without
tp->rx_opt.rcv_wscale.
Because it is unnecessary when enable wscale;
2. Use the macro ALIGN instead of two statements.
The two statements are used to make window align to 1<<wscale.
Use the ALIGN is more clearer.
3. Use the rounddown to make codes clearer.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending a msg without asoc established, sctp will send INIT packet
first and then enqueue chunks.
Before receiving INIT_ACK, stream info is not yet alloced. But enqueuing
chunks needs to access stream info, like out stream state and out stream
cnt.
This patch is to fix it by allocing out stream info when initializing an
asoc, allocing in stream and re-allocing out stream when processing init.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than assign the positive errno values to ret and then
checking if it is positive and flip the sign, just return the
errno value.
Detected by CoverityScan, CID#986649 ("Logically Dead Code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These files all use functions declared in interrupt.h, but currently rely
on implicit inclusion of this file (via netns/xfrm.h).
That won't work anymore when the flow cache is removed so include that
header where needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When calculating po->tp_hdrlen + po->tp_reserve the result can overflow.
Fix by checking that tp_reserve <= INT_MAX on assign.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow.
Add a check that tp_block_size * tp_block_nr <= UINT_MAX.
Since frames_per_block <= tp_block_size, the expression would
never overflow.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subtracting tp_sizeof_priv from tp_block_size and casting to int
to check whether one is less then the other doesn't always work
(both of them are unsigned ints).
Compare them as is instead.
Also cast tp_sizeof_priv to u64 before using BLK_PLUS_PRIV, as
it can overflow inside BLK_PLUS_PRIV otherwise.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains a rather large update with Netfilter
fixes, specifically targeted to incorrect RCU usage in several spots and
the userspace conntrack helper infrastructure (nfnetlink_cthelper),
more specifically they are:
1) expect_class_max is incorrect set via cthelper, as in kernel semantics
mandate that this represents the array of expectation classes minus 1.
Patch from Liping Zhang.
2) Expectation policy updates via cthelper are currently broken for several
reasons: This code allows illegal changes in the policy such as changing
the number of expeciation classes, it is leaking the updated policy and
such update occurs with no RCU protection at all. Fix this by adding a
new nfnl_cthelper_update_policy() that describes what is really legal on
the update path.
3) Fix several memory leaks in cthelper, from Jeffy Chen.
4) synchronize_rcu() is missing in the removal path of several modules,
this may lead to races since CPU may still be running on code that has
just gone. Also from Liping Zhang.
5) Don't use the helper hashtable from cthelper, it is not safe to walk
over those bits without the helper mutex. Fix this by introducing a
new independent list for userspace helpers. From Liping Zhang.
6) nf_ct_extend_unregister() needs synchronize_rcu() to make sure no
packets are walking on any conntrack extension that is gone after
module removal, again from Liping.
7) nf_nat_snmp may crash if we fail to unregister the helper due to
accidental leftover code, from Gao Feng.
8) Fix leak in nfnetlink_queue with secctx support, from Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
for socketpairs using connectionless transport, we cache
the respective node local TIPC portid to use in subsequent
calls to send() in the socket's private data.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sockets A and B are connected back-to-back, similar to what
AF_UNIX does.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge xfrm_user validation fixes from Andy Whitcroft:
"Two patches we are applying to Ubuntu for XFRM_MSG_NEWAE validation
issue reported by ZDI.
The first of these is the primary fix, and the second is for a more
theoretical issue that Kees pointed out when reviewing the first"
* emailed patches from Andy Whitcroft <apw@canonical.com>:
xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder
xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
A recent commit skips nexthops in a route if the device has been
deleted. Update lfib_nlmsg_size accordingly.
Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Rx path may grab the socket right before pppol2tp_release(), but
nothing guarantees that it will enqueue packets before
skb_queue_purge(). Therefore, the socket can be destroyed without its
queues fully purged.
Fix this by purging queues in pppol2tp_session_destruct() where we're
guaranteed nothing is still referencing the socket.
Fixes: 9e9cb6221a ("l2tp: fix userspace reception on plain L2TP sockets")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code following l2tp_tunnel_find() expects that a new reference is
held on sk. Either sk_receive_skb() or the discard_put error path will
drop a reference from the tunnel's socket.
This issue exists in both l2tp_ip and l2tp_ip6.
Fixes: a3c18422a4 ("l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv()")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues. To ensure we are correctly ensuring that the two ESN
structures are the same size compare both the overall size as reported
by xfrm_replay_state_esn_len() and the internal length are the same.
CVE-2017-7184
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer. However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call. There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents. We do
not at this point check that the replay_window is within the allocated
memory. This leads to out-of-bounds reads and writes triggered by
netlink packets. This leads to memory corruption and the potential for
priviledge escalation.
We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len(). This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn. It however does not check the replay_window
remains within that buffer. Add validation of the contained
replay_window.
CVE-2017-7184
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When internal mac80211 TXQs aren't supported, netdev queues must
always started out started even when driver queues are stopped
while the interface is added. This is necessary because with the
internal TXQ support netdev queues are never stopped and packet
scheduling/dropping is done in mac80211.
Cc: stable@vger.kernel.org # 4.9+
Fixes: 80a83cfc43 ("mac80211: skip netdev queue control with software queuing")
Reported-and-tested-by: Sven Eckelmann <sven.eckelmann@openmesh.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We must call security_release_secctx to free the memory returned by
security_secid_to_secctx, otherwise memory may be leaked forever.
Fixes: ef493bd930 ("netfilter: nfnetlink_queue: add security context information")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We got the following use-after-free KASAN report:
BUG: KASAN: use-after-free in wiphy_resume+0x591/0x5a0 [cfg80211]
at addr ffff8803fc244090
Read of size 8 by task kworker/u16:24/2587
CPU: 6 PID: 2587 Comm: kworker/u16:24 Tainted: G B 4.9.13-debug+
Hardware name: Dell Inc. XPS 15 9550/0N7TVV, BIOS 1.2.19 12/22/2016
Workqueue: events_unbound async_run_entry_fn
ffff880425d4f9d8 ffffffffaeedb541 ffff88042b80ef00 ffff8803fc244088
ffff880425d4fa00 ffffffffae84d7a1 ffff880425d4fa98 ffff8803fc244080
ffff88042b80ef00 ffff880425d4fa88 ffffffffae84da3a ffffffffc141f7d9
Call Trace:
[<ffffffffaeedb541>] dump_stack+0x85/0xc4
[<ffffffffae84d7a1>] kasan_object_err+0x21/0x70
[<ffffffffae84da3a>] kasan_report_error+0x1fa/0x500
[<ffffffffc141f7d9>] ? cfg80211_bss_age+0x39/0xc0 [cfg80211]
[<ffffffffc141f83a>] ? cfg80211_bss_age+0x9a/0xc0 [cfg80211]
[<ffffffffae48d46d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffffc13fb1c0>] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
[<ffffffffae84def1>] __asan_report_load8_noabort+0x61/0x70
[<ffffffffc13fb100>] ? wiphy_suspend+0xbb0/0xc70 [cfg80211]
[<ffffffffc13fb751>] ? wiphy_resume+0x591/0x5a0 [cfg80211]
[<ffffffffc13fb751>] wiphy_resume+0x591/0x5a0 [cfg80211]
[<ffffffffc13fb1c0>] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
[<ffffffffaf3b206e>] dpm_run_callback+0x6e/0x4f0
[<ffffffffaf3b31b2>] device_resume+0x1c2/0x670
[<ffffffffaf3b367d>] async_resume+0x1d/0x50
[<ffffffffae3ee84e>] async_run_entry_fn+0xfe/0x610
[<ffffffffae3d0666>] process_one_work+0x716/0x1a50
[<ffffffffae3d05c9>] ? process_one_work+0x679/0x1a50
[<ffffffffafdd7b6d>] ? _raw_spin_unlock_irq+0x3d/0x60
[<ffffffffae3cff50>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
[<ffffffffae3d1a80>] worker_thread+0xe0/0x1460
[<ffffffffae3d19a0>] ? process_one_work+0x1a50/0x1a50
[<ffffffffae3e54c2>] kthread+0x222/0x2e0
[<ffffffffae3e52a0>] ? kthread_park+0x80/0x80
[<ffffffffae3e52a0>] ? kthread_park+0x80/0x80
[<ffffffffae3e52a0>] ? kthread_park+0x80/0x80
[<ffffffffafdd86aa>] ret_from_fork+0x2a/0x40
Object at ffff8803fc244088, in cache kmalloc-1024 size: 1024
Allocated:
PID = 71
save_stack_trace+0x1b/0x20
save_stack+0x46/0xd0
kasan_kmalloc+0xad/0xe0
kasan_slab_alloc+0x12/0x20
__kmalloc_track_caller+0x134/0x360
kmemdup+0x20/0x50
brcmf_cfg80211_attach+0x10b/0x3a90 [brcmfmac]
brcmf_bus_start+0x19a/0x9a0 [brcmfmac]
brcmf_pcie_setup+0x1f1a/0x3680 [brcmfmac]
brcmf_fw_request_nvram_done+0x44c/0x11b0 [brcmfmac]
request_firmware_work_func+0x135/0x280
process_one_work+0x716/0x1a50
worker_thread+0xe0/0x1460
kthread+0x222/0x2e0
ret_from_fork+0x2a/0x40
Freed:
PID = 2568
save_stack_trace+0x1b/0x20
save_stack+0x46/0xd0
kasan_slab_free+0x71/0xb0
kfree+0xe8/0x2e0
brcmf_cfg80211_detach+0x62/0xf0 [brcmfmac]
brcmf_detach+0x14a/0x2b0 [brcmfmac]
brcmf_pcie_remove+0x140/0x5d0 [brcmfmac]
brcmf_pcie_pm_leave_D3+0x198/0x2e0 [brcmfmac]
pci_pm_resume+0x186/0x220
dpm_run_callback+0x6e/0x4f0
device_resume+0x1c2/0x670
async_resume+0x1d/0x50
async_run_entry_fn+0xfe/0x610
process_one_work+0x716/0x1a50
worker_thread+0xe0/0x1460
kthread+0x222/0x2e0
ret_from_fork+0x2a/0x40
Memory state around the buggy address:
ffff8803fc243f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8803fc244000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8803fc244080: fc fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8803fc244100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8803fc244180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
What is happening is that brcmf_pcie_resume() detects a device that
is no longer responsive and it decides to unbind resulting in a
wiphy_unregister() and wiphy_free() call. Now the wiphy instance
remains allocated, because PM needs to call wiphy_resume() for it.
However, brcmfmac already does a kfree() for the struct
cfg80211_registered_device::ops field. Change the checks in
wiphy_resume() to only access the struct cfg80211_registered_device::ops
if the wiphy instance is still registered at this time.
Cc: stable@vger.kernel.org # 4.10.x, 4.9.x
Reported-by: Daniel J Blueman <daniel@quora.org>
Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Franky Lin <franky.lin@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Register the switch and its ports with devlink.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is an include loop between netdevice.h, dsa.h, devlink.h because
of NETDEV_ALIGN, making it impossible to use devlink structures in
dsa.h.
Break this loop by taking dsa.h out of netdevice.h, add a forward
declaration of dsa_switch_tree and netdev_set_default_ethtool_ops()
function, which is what netdevice.h requires.
No longer having dsa.h in netdevice.h means the includes in dsa.h no
longer get included. This breaks a few other files which depend on
these includes. Add these directly in the affected file.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send netconf notifications for MPLS when the device registers and
unregisters.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor mpls_netconf_notify_devconf to take the event as an input arg.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send RTM_DELNETCONF notifications when a device is deleted. The message only
needs the device index, so modify inet6_netconf_fill_devconf to skip devconf
references if it is NULL.
Allows a userspace cache to remove entries as devices are deleted.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor inet6_netconf_notify_devconf to take the event as an input arg.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send RTM_DELNETCONF notifications when a device is deleted. The message only
needs the device index, so modify inet_netconf_fill_devconf to skip devconf
references if it is NULL.
Allows a userspace cache to remove entries as devices are deleted.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor inet_netconf_notify_devconf to take the event as an input arg.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I do not hold the copyright of the DSA core and drivers source files,
since these changes have been written as an initiative of my day job.
Fix this.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for NETDEV_RESEND_IGMP event similar
to how it works for IPv4.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reference count held for skb needs to be released when the skb's
nfct pointer is cleared regardless of if nf_ct_delete() is called or
not.
Failing to release the skb's reference cound led to deferred conntrack
cleanup spinning forever within nf_conntrack_cleanup_net_list() when
cleaning up a network namespace:
kworker/u16:0-19025 [004] 45981067.173642: sched_switch: kworker/u16:0:19025 [120] R ==> rcu_preempt:7 [120]
kworker/u16:0-19025 [004] 45981067.173651: kernel_stack: <stack trace>
=> ___preempt_schedule (ffffffffa001ed36)
=> _raw_spin_unlock_bh (ffffffffa0713290)
=> nf_ct_iterate_cleanup (ffffffffc00a4454)
=> nf_conntrack_cleanup_net_list (ffffffffc00a5e1e)
=> nf_conntrack_pernet_exit (ffffffffc00a63dd)
=> ops_exit_list.isra.1 (ffffffffa06075f3)
=> cleanup_net (ffffffffa0607df0)
=> process_one_work (ffffffffa0084c31)
=> worker_thread (ffffffffa008592b)
=> kthread (ffffffffa008bee2)
=> ret_from_fork (ffffffffa071b67c)
Fixes: dd41d33f0b ("openvswitch: Add force commit.")
Reported-by: Yang Song <yangsong@vmware.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same change as Kinglong Mee's fix for the TCP backchannel service.
Fixes: 5283b03ee5 ("nfs/nfsd/sunrpc: enforce transport...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When a new subscription object is inserted into name_seq->subscriptions
list, it's under name_seq->lock protection; when a subscription is
deleted from the list, it's also under the same lock protection;
similarly, when accessing a subscription by going through subscriptions
list, the entire process is also protected by the name_seq->lock.
Therefore, if subscription refcount is increased before it's inserted
into subscriptions list, and its refcount is decreased after it's
deleted from the list, it will be unnecessary to hold refcount at all
before accessing subscription object which is obtained by going through
subscriptions list under name_seq->lock protection.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After a subscription object is created, it's inserted into its
subscriber subscrp_list list under subscriber lock protection,
similarly, before it's destroyed, it should be first removed from
its subscriber->subscrp_list. Since the subscription list is
accessed with subscriber lock, all the subscriptions are valid
during the lock duration. Hence in tipc_subscrb_subscrp_delete(), we
remove subscription get/put and the extra subscriber unlock/lock.
After this change, the subscriptions refcount cleanup is very simple
and does not access any lock.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By moving these client drivers to use RPMSG instead of the direct SMD
API we can reuse them ontop of the newly added GLINK wire-protocol
support found in the 820 and 835 Qualcomm platforms.
As the new (RPMSG-based) and old SMD implementations are mutually
exclusive we have to change all client drivers in one commit, to make
sure we have a working system before and after this transition.
Acked-by: Andy Gross <andy.gross@linaro.org>
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Laight noticed the support for MSG_MORE with datamsg->force_delay
didn't really work as we expected, as the first msg with MSG_MORE set
would always block the following chunks' dequeuing.
This Patch is to rewrite it by saving the MSG_MORE flag into assoc as
David Laight suggested.
asoc->force_delay is used to save MSG_MORE flag before a msg is sent.
All chunks in queue would not be sent out if asoc->force_delay is set
by the msg with MSG_MORE flag, until a new msg without MSG_MORE flag
clears asoc->force_delay.
Note that this change would not affect the flush is generated by other
triggers, like asoc->state != ESTABLISHED, queue size > pmtu etc.
v1->v2:
Not clear asoc->force_delay after sending the msg with MSG_MORE flag.
Fixes: 4ea0c32f5f ("sctp: add support for MSG_MORE")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Laight <david.laight@aculab.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pipeline debug is used to export the pipeline abstractions for the
main objects - tables, headers and entries. The only support for set is
for changing the counter parameter on specific table.
The basic structures:
Header - can represent a real protocol header information or internal
metadata. Generic protocol headers like IPv4 can be shared
between drivers. Each driver can add local headers.
Field - part of a header. Can represent protocol field or specific ASIC
metadata field. Hardware special metadata fields can be mapped
to different resources, for example switch ASIC ports can have
internal number which from the systems point of view is mapped
to netdeivce ifindex.
Match - represent specific match rule. Can describe match on specific
field or header. The header index should be specified as well
in order to support several header instances of the same type
(tunneling).
Action - represents specific action rule. Actions can describe operations
on specific field values for example like set, increment, etc.
And header operation like add and delete.
Value - represents value which can be associated with specific match or
action.
Table - represents a hardware block which can be described with match/
action behavior. The match/action can be done on the packets
data or on the internal metadata that it gathered along the
packets traversal throw the pipeline which is vendor specific
and should be exported in order to provide understanding of
ASICs behavior.
Entry - represents single record in a specific table. The entry is
identified by specific combination of values for match/action.
Prior to accessing the tables/entries the drivers provide the header/
field data base which is used by driver to user-space. The data base
is split between the shared headers and unique headers.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our chosen ic_dev may be anywhere in our list of ic_devs, and we may
free it before attempting to close others. When we compare d->dev and
ic_dev->dev, we're potentially dereferencing memory returned to the
allocator. This causes KASAN to scream for each subsequent ic_dev we
check.
As there's a 1-1 mapping between ic_devs and netdevs, we can instead
compare d and ic_dev directly, which implicitly handles the !ic_dev
case, and avoids the use-after-free. The ic_dev pointer may be stale,
but we will not dereference it.
Original splat:
[ 6.487446] ==================================================================
[ 6.494693] BUG: KASAN: use-after-free in ic_close_devs+0xc4/0x154 at addr ffff800367efa708
[ 6.503013] Read of size 8 by task swapper/0/1
[ 6.507452] CPU: 5 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc3-00002-gda42158 #8
[ 6.514993] Hardware name: AppliedMicro Mustang/Mustang, BIOS 3.05.05-beta_rc Jan 27 2016
[ 6.523138] Call trace:
[ 6.525590] [<ffff200008094778>] dump_backtrace+0x0/0x570
[ 6.530976] [<ffff200008094d08>] show_stack+0x20/0x30
[ 6.536017] [<ffff200008bee928>] dump_stack+0x120/0x188
[ 6.541231] [<ffff20000856d5e4>] kasan_object_err+0x24/0xa0
[ 6.546790] [<ffff20000856d924>] kasan_report_error+0x244/0x738
[ 6.552695] [<ffff20000856dfec>] __asan_report_load8_noabort+0x54/0x80
[ 6.559204] [<ffff20000aae86ac>] ic_close_devs+0xc4/0x154
[ 6.564590] [<ffff20000aaedbac>] ip_auto_config+0x2ed4/0x2f1c
[ 6.570321] [<ffff200008084b04>] do_one_initcall+0xcc/0x370
[ 6.575882] [<ffff20000aa31de8>] kernel_init_freeable+0x5f8/0x6c4
[ 6.581959] [<ffff20000a16df00>] kernel_init+0x18/0x190
[ 6.587171] [<ffff200008084710>] ret_from_fork+0x10/0x40
[ 6.592468] Object at ffff800367efa700, in cache kmalloc-128 size: 128
[ 6.598969] Allocated:
[ 6.601324] PID = 1
[ 6.603427] save_stack_trace_tsk+0x0/0x418
[ 6.607603] save_stack_trace+0x20/0x30
[ 6.611430] kasan_kmalloc+0xd8/0x188
[ 6.615087] ip_auto_config+0x8c4/0x2f1c
[ 6.619002] do_one_initcall+0xcc/0x370
[ 6.622832] kernel_init_freeable+0x5f8/0x6c4
[ 6.627178] kernel_init+0x18/0x190
[ 6.630660] ret_from_fork+0x10/0x40
[ 6.634223] Freed:
[ 6.636233] PID = 1
[ 6.638334] save_stack_trace_tsk+0x0/0x418
[ 6.642510] save_stack_trace+0x20/0x30
[ 6.646337] kasan_slab_free+0x88/0x178
[ 6.650167] kfree+0xb8/0x478
[ 6.653131] ic_close_devs+0x130/0x154
[ 6.656875] ip_auto_config+0x2ed4/0x2f1c
[ 6.660875] do_one_initcall+0xcc/0x370
[ 6.664705] kernel_init_freeable+0x5f8/0x6c4
[ 6.669051] kernel_init+0x18/0x190
[ 6.672534] ret_from_fork+0x10/0x40
[ 6.676098] Memory state around the buggy address:
[ 6.680880] ffff800367efa600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 6.688078] ffff800367efa680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 6.695276] >ffff800367efa700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 6.702469] ^
[ 6.705952] ffff800367efa780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 6.713149] ffff800367efa800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 6.720343] ==================================================================
[ 6.727536] Disabling lock debugging due to kernel taint
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_IPV6_SEG6_LWTUNNEL is selected, automatically select DST_CACHE.
This allows to remove multiple ifdefs.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
When all devices for all nexthops in a route have been deleted, the
route is effectively dead, so remove it.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the device for a nexthop in a multipath route is deleted, the nexthop
is effectively removed from the route. Currently, a route dump still
returns the nexhop though without the device set:
$ ip -f mpls ro ls
100
nexthopvia inet 10.11.1.2 dev br0
nexthopvia inet 10.100.3.1 dev eth3
$ ip li del br0
$ ip -f mpls ro ls
100
nexthopvia inet 10.11.1.2 dev * dead linkdown
nexthopvia inet 10.100.3.1 dev eth3
Since the nexthop is effectively deleted, drop the hop from the route
dump.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the commit 93557f53e1 ("netfilter: nf_conntrack: nf_conntrack snmp
helper"), the snmp_helper is replaced by nf_nat_snmp_hook. So the
snmp_helper is never registered. But it still tries to unregister the
snmp_helper, it could cause the panic.
Now remove the useless snmp_helper and the unregister call in the
error handler.
Fixes: 93557f53e1 ("netfilter: nf_conntrack: nf_conntrack snmp helper")
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If one cpu is doing nf_ct_extend_unregister while another cpu is doing
__nf_ct_ext_add_length, then we may hit BUG_ON(t == NULL). Moreover,
there's no synchronize_rcu invocation after set nf_ct_ext_types[id] to
NULL, so it's possible that we may access invalid pointer.
But actually, most of the ct extends are built-in, so the problem listed
above will not happen. However, there are two exceptions: NF_CT_EXT_NAT
and NF_CT_EXT_SYNPROXY.
For _EXT_NAT, the panic will not happen, since adding the nat extend and
unregistering the nat extend are located in the same file(nf_nat_core.c),
this means that after the nat module is removed, we cannot add the nat
extend too.
For _EXT_SYNPROXY, synproxy extend may be added by init_conntrack, while
synproxy extend unregister will be done by synproxy_core_exit. So after
nf_synproxy_core.ko is removed, we may still try to add the synproxy
extend, then kernel panic may happen.
I know it's very hard to reproduce this issue, but I can play a tricky
game to make it happen very easily :)
Step 1. Enable SYNPROXY for tcp dport 1234 at FORWARD hook:
# iptables -I FORWARD -p tcp --dport 1234 -j SYNPROXY
Step 2. Queue the syn packet to the userspace at raw table OUTPUT hook.
Also note, in the userspace we only add a 20s' delay, then
reinject the syn packet to the kernel:
# iptables -t raw -I OUTPUT -p tcp --syn -j NFQUEUE --queue-num 1
Step 3. Using "nc 2.2.2.2 1234" to connect the server.
Step 4. Now remove the nf_synproxy_core.ko quickly:
# iptables -F FORWARD
# rmmod ipt_SYNPROXY
# rmmod nf_synproxy_core
Step 5. After 20s' delay, the syn packet is reinjected to the kernel.
Now you will see the panic like this:
kernel BUG at net/netfilter/nf_conntrack_extend.c:91!
Call Trace:
? __nf_ct_ext_add_length+0x53/0x3c0 [nf_conntrack]
init_conntrack+0x12b/0x600 [nf_conntrack]
nf_conntrack_in+0x4cc/0x580 [nf_conntrack]
ipv4_conntrack_local+0x48/0x50 [nf_conntrack_ipv4]
nf_reinject+0x104/0x270
nfqnl_recv_verdict+0x3e1/0x5f9 [nfnetlink_queue]
? nfqnl_recv_verdict+0x5/0x5f9 [nfnetlink_queue]
? nla_parse+0xa0/0x100
nfnetlink_rcv_msg+0x175/0x6a9 [nfnetlink]
[...]
One possible solution is to make NF_CT_EXT_SYNPROXY extend built-in, i.e.
introduce nf_conntrack_synproxy.c and only do ct extend register and
unregister in it, similar to nf_conntrack_timeout.c.
But having such a obscure restriction of nf_ct_extend_unregister is not a
good idea, so we should invoke synchronize_rcu after set nf_ct_ext_types
to NULL, and check the NULL pointer when do __nf_ct_ext_add_length. Then
it will be easier if we add new ct extend in the future.
Last, we use kfree_rcu to free nf_ct_ext, so rcu_barrier() is unnecessary
anymore, remove it too.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The nf_ct_helper_hash table is protected by nf_ct_helper_mutex, while
nfct_helper operation is protected by nfnl_lock(NFNL_SUBSYS_CTHELPER).
So it's possible that one CPU is walking the nf_ct_helper_hash for
cthelper add/get/del, another cpu is doing nf_conntrack_helpers_unregister
at the same time. This is dangrous, and may cause use after free error.
Note, delete operation will flush all cthelpers added via nfnetlink, so
using rcu to do protect is not easy.
Now introduce a dummy list to record all the cthelpers added via
nfnetlink, then we can walk the dummy list instead of walking the
nf_ct_helper_hash. Also, keep nfnl_cthelper_dump_table unchanged, it
may be invoked without nfnl_lock(NFNL_SUBSYS_CTHELPER) held.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Otherwise, another CPU may access the invalid pointer. For example:
CPU0 CPU1
- rcu_read_lock();
- pfunc = _hook_;
_hook_ = NULL; -
mod unload -
- pfunc(); // invalid, panic
- rcu_read_unlock();
So we must call synchronize_rcu() to wait the rcu reader to finish.
Also note, in nf_nat_snmp_basic_fini, synchronize_rcu() will be invoked
by later nf_conntrack_helper_unregister, but I'm inclined to add a
explicit synchronize_rcu after set the nf_nat_snmp_hook to NULL. Depend
on such obscure assumptions is not a good idea.
Last, in nfnetlink_cttimeout, we use kfree_rcu to free the time object,
so in cttimeout_exit, invoking rcu_barrier() is not necessary at all,
remove it too.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch refactors the num_packets counter of a forw_packet in the
following three ways:
1) Removed dual-use of forw_packet::num_packets:
-> now for aggregation purposes only
2) Using forw_packet::skb::cb::num_bcasts instead:
-> for easier access in aggregation code later
3) make access to num_bcasts private to batadv_forw_packet_*()
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
[sven@narfation.org: Change num_bcasts to unsigned]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
An skb is assigned to a forw_packet only once, shortly after the
forw_packet allocation.
With this patch the assignment is moved into the this allocation
function.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
We got a report of yet another bug in ping
http://www.openwall.com/lists/oss-security/2017/03/24/6
->disconnect() is not called with socket lock held.
Fix this by acquiring ping rwlock earlier.
Thanks to Daniel, Alexander and Andrey for letting us know this problem.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Daniel Jiang <danieljiang0415@gmail.com>
Reported-by: Solar Designer <solar@openwall.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This socket option returns the NAPI ID associated with the queue on which
the last frame is received. This information can be used by the apps to
split the incoming flows among the threads based on the Rx queue on which
they are received.
If the NAPI ID actually represents a sender_cpu then the value is ignored
and 0 is returned.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the core functionality in sk_busy_loop() to napi_busy_loop() and
make it independent of sk.
This enables re-using this function in epoll busy loop implementation.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch flips the logic we were using to determine if the busy polling
has timed out. The main motivation for this is that we will need to
support two different possible timeout values in the future and by
recording the start time rather than when we would want to end we can focus
on making the end_time specific to the task be it epoll or socket based
polling.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
checking the return value of sk_busy_loop. As there are only a few
consumers of that data, and the data being checked for can be replaced
with a check for !skb_queue_empty() we might as well just pull the code
out of sk_busy_loop and place it in the spots that actually need it.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on some recent busy poll changes we found that child sockets
were being instantiated without NAPI ID being set. In our first attempt to
fix it, it was suggested that we should just pull programming the NAPI ID
into the function itself since all callers will need to have it set.
In addition to the NAPI ID change I have dropped the code that was
populating the Rx hash since it was actually being populated in
tcp_get_cookie_sock.
Reported-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup/fix for NAPI IDs following the changes that made it
so that sender_cpu and napi_id were doing a better job of sharing the same
location in the sk_buff.
One issue I found is that we weren't validating the napi_id as being valid
before we started trying to setup the busy polling. This change corrects
that by using the MIN_NAPI_ID value that is now used in both allocating the
NAPI IDs, as well as validating them.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unfortunately too many devices (not under our control) use tcp_tw_recycle=1,
which depends on timestamps being identical of the same saddr.
Although tcp_tw_recycle got removed in net-next we can't make
such end hosts disappear so downgrade to per-host timestamp offsets.
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Reported-by: Yvan Vanrossomme <yvan@vanrossomme.net>
Fixes: 95a22caee3 ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change basically codifies what I think was already the limitations on
the busy_poll and busy_read sysctl interfaces. We weren't checking the
lower bounds and as such could input negative values. The behavior when
that was used was dependent on the architecture. In order to prevent any
issues with that I am just disabling support for values less than 0 since
this way we don't have to worry about any odd behaviors.
By limiting the sysctl values this way it also makes it consistent with how
we handle the SO_BUSY_POLL socket option since the value appears to be
reported as a signed integer value and negative values are rejected.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already use dst_cache in seg6_output, when handling locally generated
packets. We extend it in seg6_input, to also handle forwarded packets, and avoid
unnecessary fib lookups.
Performances for SRH encapsulation before the patch:
Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
884006pps 7072Mb/sec (7072048000bps) errors: 0
Performances after the patch:
Result: OK: 4774543(c4774084+d459) usec, 5000000 (1000byte,0frags)
1047220pps 8377Mb/sec (8377760000bps) errors: 0
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
To insert or encapsulate a packet with an SRH, we need a large enough skb
headroom. Currently, we are using pskb_expand_head to inconditionally increase
the size of the headroom by the amount needed by the SRH (and IPv6 header).
If this reallocation is performed by another CPU than the one that initially
allocated the skb, then when the initial CPU kfree the skb, it will enter the
__slab_free slowpath, impacting performances.
This patch replaces pskb_expand_head with skb_cow_head, that will reallocate the
skb head only if the headroom is not large enough.
Performances for SRH encapsulation before the patch:
Result: OK: 7348320(c7347271+d1048) usec, 5000000 (1000byte,0frags)
680427pps 5443Mb/sec (5443416000bps) errors: 0
Performances after the patch:
Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
884006pps 7072Mb/sec (7072048000bps) errors: 0
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use setup_deferrable_timer() instead of init_timer_deferrable() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
reclaim path, tagged for stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJY1QoTAAoJEEp/3jgCEfOLtnoH/0uj3Y2QUVHDo0Je9HkQhzVC
Mp2a/WzN1YO/wdYRdxs6LbVqMIukPJ5kKbKO3WIgpJZR9lxSM9k5Y0YC98YVBHuB
v1uuTPe60fmdJRYnoD4/SfwXiIiAF5/dzPp80SI4xQ9zN24dS1wBREkH2eXUeoKL
FCEoHUv7cJju1dGNbcGpv4MV75b0e+HHAxPuDG4fiUdT79Vp+wP7exx0lMkS9ub8
i5T1GU5gDDBR+SKhpqMIvNgR7s3zdInUs476tk6jz/o049BwybkLIuDhx0+x93/g
1Y1KXHYPCT3yZeL5kdJxM6Dul6wDYOsr2/wIPHzY0ym9RXqCEy/PFowJnL+7eWk=
=cTM6
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.11-rc4' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a writeback deadlock caused by a GFP_KERNEL allocation on
the reclaim path, tagged for stable"
* tag 'ceph-for-4.11-rc4' of git://github.com/ceph/ceph-client:
libceph: force GFP_NOIO for socket allocations
Fix copy and paste error setting rt_ttl_propagate.
Fixes: 5b441ac878 ("mpls: allow TTL propagation to IP packets to be configured")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>