Add variables to track RPC level errors so that we can distinguish
between issue that arose in the RPC transport layer as opposed to
those arising from the reply message.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that when in the transport layer, we don't sleep past
a major timeout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Don't wake idle CPUs only for the purpose of servicing an RPC
queue timeout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Simplify the setting of queue timeouts by using the timer_reduce()
function.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add a helper to ensure that debugfs and friends print out the
correct current task timeout value.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up the RPC task sleep interfaces by replacing the task->tk_timeout
'hidden parameter' to rpc_sleep_on() with a new function that takes an
absolute timeout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
None of the callers set the 'action' argument, so let's just remove it.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpc_sleep_on() does not need to set the task->tk_callback under the
queue lock, so move that out.
Also refactor the check for whether the task is active.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Convert the transport callback to actually put the request to sleep
instead of just setting a timeout. This is in preparation for
rpc_sleep_on_timeout().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The RPC_TASK_KILLED flag should really not be set from another context
because it can clobber data in the struct task when task->tk_flags is
changed non-atomically.
Let's therefore swap out RPC_TASK_KILLED with an atomic flag, and add
a function to set that flag and safely wake up the task.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The minimum encryption key size for LE connections is 56 bits and to
align LE with BR/EDR, enforce 56 bits of minimum encryption key size for
BR/EDR connections as well.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Cc: stable@vger.kernel.org
The flags field in 'struct shash_desc' never actually does anything.
The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP.
However, no shash algorithm ever sleeps, making this flag a no-op.
With this being the case, inevitably some users who can't sleep wrongly
pass MAY_SLEEP. These would all need to be fixed if any shash algorithm
actually started sleeping. For example, the shash_ahash_*() functions,
which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP
from the ahash API to the shash API. However, the shash functions are
called under kmap_atomic(), so actually they're assumed to never sleep.
Even if it turns out that some users do need preemption points while
hashing large buffers, we could easily provide a helper function
crypto_shash_update_large() which divides the data into smaller chunks
and calls crypto_shash_update() and cond_resched() for each chunk. It's
not necessary to have a flag in 'struct shash_desc', nor is it necessary
to make individual shash algorithms aware of this at all.
Therefore, remove shash_desc::flags, and document that the
crypto_shash_*() functions can be called from any context.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull networking fixes from David Miller:
"Just the usual assortment of small'ish fixes:
1) Conntrack timeout is sometimes not initialized properly, from
Alexander Potapenko.
2) Add a reasonable range limit to tcp_min_rtt_wlen to avoid
undefined behavior. From ZhangXiaoxu.
3) des1 field of descriptor in stmmac driver is initialized with the
wrong variable. From Yue Haibing.
4) Increase mlxsw pci sw reset timeout a little bit more, from Ido
Schimmel.
5) Match IOT2000 stmmac devices more accurately, from Su Bao Cheng.
6) Fallback refcount fix in TLS code, from Jakub Kicinski.
7) Fix max MTU check when using XDP in mlx5, from Maxim Mikityanskiy.
8) Fix recursive locking in team driver, from Hangbin Liu.
9) Fix tls_set_device_offload_Rx() deadlock, from Jakub Kicinski.
10) Don't use napi_alloc_frag() outside of softiq context of socionext
driver, from Ilias Apalodimas.
11) MAC address increment overflow in ncsi, from Tao Ren.
12) Fix a regression in 8K/1M pool switching of RDS, from Zhu Yanjun.
13) ipv4_link_failure has to validate the headers that are actually
there because RAW sockets can pass in arbitrary garbage, from Eric
Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
ipv4: add sanity checks in ipv4_link_failure()
net/rose: fix unbound loop in rose_loopback_timer()
rxrpc: fix race condition in rxrpc_input_packet()
net: rds: exchange of 8K and 1M pool
net: vrf: Fix operation not supported when set vrf mac
net/ncsi: handle overflow when incrementing mac address
net: socionext: replace napi_alloc_frag with the netdev variant on init
net: atheros: fix spelling mistake "underun" -> "underrun"
spi: ST ST95HF NFC: declare missing of table
spi: Micrel eth switch: declare missing of table
net: stmmac: move stmmac_check_ether_addr() to driver probe
netfilter: fix nf_l4proto_log_invalid to log invalid packets
netfilter: never get/set skb->tstamp
netfilter: ebtables: CONFIG_COMPAT: drop a bogus WARN_ON
Documentation: decnet: remove reference to CONFIG_DECNET_ROUTE_FWMARK
dt-bindings: add an explanation for internal phy-mode
net/tls: don't leak IV and record seq when offload fails
net/tls: avoid potential deadlock in tls_set_device_offload_rx()
selftests/net: correct the return value for run_afpackettests
team: fix possible recursive locking when add slaves
...
Before the commit 490ea5967b ("RDS: IB: move FMR code to its own file"),
when the dirty_count is greater than 9/10 of max_items of 8K pool,
1M pool is used, Vice versa. After the commit 490ea5967b ("RDS: IB: move
FMR code to its own file"), the above is removed. When we make the
following tests.
Server:
rds-stress -r 1.1.1.16 -D 1M
Client:
rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M
The following will appear.
"
connecting to 1.1.1.16:4000
negotiated options, tasks will start in 2 seconds
Starting up..header from 1.1.1.166:4001 to id 4001 bogus
..
tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us
cpu %
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
...
"
So this exchange between 8K and 1M pool is added back.
Fixes: commit 490ea5967b ("RDS: IB: move FMR code to its own file")
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes that introduced unlocked flower did not properly account for
case when reoffload is initiated concurrently with filter updates. To fix
the issue, extend flower with 'hw_filters' list that is used to store
filters that don't have 'skip_hw' flag set. Filter is added to the list
when it is inserted to hardware and only removed from it after being
unoffloaded from all drivers that parent block is attached to. This ensures
that concurrent reoffload can still access filter that is being deleted and
prevents race condition when driver callback can be removed when filter is
no longer accessible trough idr, but is still present in hardware.
Refactor fl_change() to respect new filter reference counter and to release
filter reference with __fl_put() in case of error, instead of directly
deallocating filter memory. This allows for concurrent access to filter
from fl_reoffload() and protects it with reference counting. Refactor
fl_reoffload() to iterate over hw_filters list instead of idr. Implement
fl_get_next_hw_filter() helper function that is used to iterate over
hw_filters list with reference counting and skips filters that are being
concurrently deleted.
Fixes: 9214919006 ("net: sched: flower: set unlocked flag for flower proto ops")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First thing tipc_udp_recv() does is to use rcu_dereference_sk_user_data(),
and this is really hinting we already own rcu_read_lock() from the caller
(UDP stack).
No need to add another rcu_read_lock()/rcu_read_unlock() pair.
Also use rcu_dereference() instead of rcu_dereference_rtnl()
in the data path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rsi_parse() is part of a downcall, so we must assume that the uids
and gids are encoded using the current user namespace.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
gid_parse() is part of a downcall, so uids and gids should be assumed
encoded using the current user namespace.
svcauth_unix_accept() is, on the other hand, decoding uids and gids from
the wire, so we assume those are encoded to match the user namespace of
the server process.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Temporary sockets should inherit the credential (and hence the user
namespace) from the parent listener transport.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In order to be able to interpret uids and gids correctly in knfsd, we
should cache the user namespace of the process that created the RPC
server's listener. To do so, we refcount the credential of that process.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add a callback to allow customisation of the rpcbind registration.
When clients have the ability to turn on and off version support,
we want to allow them to also prevent registration of those
versions with the rpc portmapper.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Simplify the generic server dispatcher.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add a callback to help initialise server requests before they are
processed. This will allow us to clean up the NFS server version
support, and to make it container safe.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
RPC server procedures are normally expected to return a __be32 encoded
status value of type 'enum rpc_accept_stat', however at least one function
wants to return an authentication status of type 'enum rpc_auth_stat'
in the case where authentication fails.
This patch adds functionality to allow this.
Fixes: a4e187d83d ("NFS: Don't drop CB requests with invalid principals")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If no handler (such as rpc.mountd) has opened
a cache 'channel', the sunrpc cache responds to
all lookup requests with -ENOENT. This is particularly
important for the auth.unix.gid cache which is
optional.
If the channel was open briefly and an upcall was written to it,
this upcall remains pending even when the handler closes the
channel. When an upcall is pending, the code currently
doesn't check if there are still listeners, it only performs
that check before sending an upcall.
As the cache treads a recently closes channel (closed less than
30 seconds ago) as "potentially still open", there is a
reasonable sized window when a request can become pending
in a closed channel, and thereby block lookups indefinitely.
This can easily be demonstrated by running
cat /proc/net/rpc/auth.unix.gid/channel
and then trying to mount an NFS filesystem from this host. It
will block indefinitely (unless mountd is run with --manage-gids,
or krb5 is used).
When cache_check() finds that an upcall is pending, it should
perform the "cache_listeners_exist()" exist test. If no
listeners do exist, the request should be negated.
With this change in place, there can still be a 30second wait on
mount, until the cache gives up waiting for a handler to come
back, but this is much better than an indefinite wait.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When flag HCI_QUIRK_USE_BDADDR_PROPERTY is set, we will read the
bluetooth address from dts. If the bluetooth address node is missing
from the dts we will enable it controller UNCONFIGURED state.
This patch enables the normal flow even if the BD address is missing
from the dts tree.
Signed-off-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
Tested-by: Harish Bandi <c-hbandi@codeaurora.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
arg.result is sometimes used as fib6_result and sometimes used to
hold the rt6_info. Add rt6_info to fib6_result and make the use
of arg.result consistent through ipv6 rules.
The rt6 entry is filled in for lookups returning a dst_entry, but not
for direct fib_lookups that just want a fib6_info.
Fixes: effda4dd97 ("ipv6: Pass fib6_result to fib lookups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib rule actions should return -EAGAIN for the rules to continue to the
next one. A recent change overwrote err with the lookup always returning
0 (future change will make it more like IPv4) which means the rules
stopped at the first (e.g., local table lookup only). Catch and reset err
to -EAGAIN.
Fixes: effda4dd97 ("ipv6: Pass fib6_result to fib lookups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously BMC's MAC address is calculated by simply adding 1 to the
last byte of network controller's MAC address, and it produces incorrect
result when network controller's MAC address ends with 0xFF.
The problem can be fixed by calling eth_addr_inc() function to increment
MAC address; besides, the MAC address is also validated before assigning
to BMC.
Fixes: cb10c7c0df ("net/ncsi: Add NCSI Broadcom OEM command")
Signed-off-by: Tao Ren <taoren@fb.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we don't have 'guard' or 'budget' to transmit the skb, we should
continue traversing the qdisc list since the remaining guard/budget
might be enough to transmit a skb from other children qdiscs.
Fixes: 5a781ccbd1 (“tc: Add support for configuring the taprio scheduler”)
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While traversing taprio's children qdisc list, if the gate is closed for
a given traffic class, we should continue traversing the list since the
remaining qdiscs may have skb ready for transmission.
This patch also takes this opportunity and changes the function to use
the TAPRIO_ALL_GATES_OPEN macro instead of the magic number '-1'.
Fixes: 5a781ccbd1 (“tc: Add support for configuring the taprio scheduler”)
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'entry' argument from should_restart_cycle() cannot be NULL since it
is already checked by the caller so the WARN_ON() within should_
restart_cycle() could be removed. By doing that, that function becomes
a dummy wrapper on list_is_last() so this patch simply gets rid of it
and call list_is_last() within advance_sched() instead.
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does a code refactoring to taprio_get_start_time() function
to improve readability and report error properly.
If 'base' time is later than 'now', the start time is equal to 'base'
and taprio_get_start_time() is done. That's the natural case so we move
that code to the beginning of the function. Also, if 'cycle' calculation
is zero, something went really wrong with taprio and we should log that
internal error properly.
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes a pointless variable assigment in taprio_change().
The 'err' variable is not used from this assignment to the next one so
this patch removes it.
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nhc_flags holds the RTNH_F flags for a given nexthop (fib{6}_nh).
All of the RTNH_F_ flags fit in an unsigned char, and since the API to
userspace (rtnh_flags and lower byte of rtm_flags) is 1 byte it can not
grow. Make nhc_flags in fib_nh_common an unsigned char and shrink the
size of the struct by 8, from 56 to 48 bytes.
Update the flags arguments for up netdevice events and fib_nexthop_info
which determines the RTNH_F flags to return on a dump/event. The RTNH_F
flags are passed in the lower byte of rtm_flags which is an unsigned int
so use a temp variable for the flags to fib_nexthop_info and combine
with rtm_flags in the caller.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, lwtunnel_fill_encap hardcodes the encap and encap type
attributes as RTA_ENCAP and RTA_ENCAP_TYPE, respectively. The nexthop
objects want to re-use this code but the encap attributes passed to
userspace as NHA_ENCAP and NHA_ENCAP_TYPE. Since that is the only
difference, change lwtunnel_fill_encap to take the attribute type as
an input.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We suspect some issues involving fib6_ref 0 -> 1 transitions might
cause strange syzbot reports.
Lets convert fib6_ref to refcount_t to catch them earlier.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using atomic_inc(), prefer fib6_info_hold()
so that upcoming refcount_t conversion is simpler.
Only fib6_info_alloc() is using atomic_set() since we
just allocated a new object.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We do not need to clear f6i->rt6i_exception_bucket right before
freeing f6i.
Note that f6i->rt6i_exception_bucket is properly protected by
f6i->exception_bucket_flushed being set to one in rt6_flush_exceptions()
under the protection of rt6_exception_lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some tunnels, like sit, change the network protocol of packet.
If so, update skb->protocol to match the new type.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
lock-notification callbacks, NFSv3 readdir encoding, and the
cache/upcall code.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcv3EfAAoJECebzXlCjuG+j8IQAIMJFCpMTSu3lw0X9BjJFwe/
TELdvH4eXm+xtnz0Rbp4RTmlNgt/ox1ySp+sPQqcDutJcU6ImtJS1JY5BmoSU80p
2ZTRzVOusSS1dP6EDwnYQlR3rI3ObCdDYtkjfboqDd4UJM0hvuCrdQols3DRZF1/
gBMUDNk8Sch64JMs8EcpItmNWgZ1QDmL6vbKsYDB8HWfBcFkNOs/Q1OecSD7F9Dh
k+GojPb4pXRCgN8heeFr2H2mXxJSDOPyyybBSQOtbv58rT7IXTpVvAt+WHpjy917
ljOSD/EuCY+IvX6oY8xPrly9a3t4Mh4eQQFj1EZGSlNucqDgsi1B421QcxGZSkF4
WZkTRC5BppairSx4dnW96Nx0BbCysj2dvZ0MOnUa59YRwRbyrnHKCv221SzFBidI
0slxNBYW7Rw2ViFJ6fBPjsjH0XWM9Hq4wC84kj8XXNeDDh2KzXPtUtUzp3iDpKQg
jP0UiEYnt6OXWvHOH0hfh52Mds48iIUSRmKL4huDT8Ijnt9KBZxnTunClJt6g1SH
MsIttRsZ2fLo/M5SlWzvQ7XNoKNn/s49Z3SZjqYEKqL9aMwuMoMcIgolqlh5oFPu
j3cqprDNXuij+2pl35IuQRbk57G2DZ1maShJ5n8I7l0uBe6DMX7T+4d4rHy75Tu9
otqz4WYU2Yjnnmv5Kp/H
=9hKY
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.1-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
"Fix miscellaneous nfsd bugs, in NFSv4.1 callbacks, NFSv4.1
lock-notification callbacks, NFSv3 readdir encoding, and the
cache/upcall code"
* tag 'nfsd-5.1-1' of git://linux-nfs.org/~bfields/linux:
nfsd: wake blocked file lock waiters before sending callback
nfsd: wake waiters blocked on file_lock before deleting it
nfsd: Don't release the callback slot unless it was actually held
nfsd/nfsd3_proc_readdir: fix buffer count and page pointers
sunrpc: don't mark uninitialised items as VALID.
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
size = sizeof(struct foo) + count * sizeof(struct boo);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
size = struct_size(instance, entry, count);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In ext_adv_report_event rssi comes before data (not after data as
in legacy adv_report_evt) so "+ 1" is not required in the ptr arithmatic
to point to next report.
Signed-off-by: Jaganath Kanakkassery <jaganath.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
NEXTHDR_MAX is 255. What happens here is that we take a u8 value
"hdr->nexthdr" from the network and then look it up in
lowpan_nexthdr_nhcs[]. The problem is that if hdr->nexthdr is 0xff then
we read one element beyond the end of the array so the array needs to
be one element larger.
Fixes: 92aa7c65d2 ("6lowpan: add generic nhc layer interface")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Acked-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Struct ca is copied from userspace. It is not checked whether the "name"
field is NULL terminated, which allows local users to obtain potentially
sensitive information from kernel stack memory, via a HIDPCONNADD command.
This vulnerability is similar to CVE-2011-1079.
Signed-off-by: Young Xiao <YangX92@hotmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
Now that we use skb-less flow dissector let's return true nhoff and
thoff. We used to adjust them by ETH_HLEN because that's how it was
done in the skb case. For VLAN tests that looks confusing: nhoff is
pointing to vlan parts :-\
Warning, this is an API change for BPF_PROG_TEST_RUN! Feel free to drop
if you think that it's too late at this point to fix it.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Update all users of eth_get_headlen to pass network device, fetch
network namespace from it and pass it down to the flow dissector.
This commit is a noop until administrator inserts BPF flow dissector
program.
Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When called without skb, gather all required data from the
__skb_flow_dissect's arguments and use recently introduces
no-skb mode of bpf flow dissector.
Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This new argument will be used in the next patches for the
eth_get_headlen use case. eth_get_headlen calls flow dissector
with only data (without skb) so there is currently no way to
pull attached BPF flow dissector program. With this new argument,
we can amend the callers to explicitly pass network namespace
so we can use attached BPF program.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Now that we have bpf_flow_dissect which can work on raw data,
use it when doing BPF_PROG_TEST_RUN for flow dissector.
Simplifies bpf_prog_test_run_flow_dissector and allows us to
test no-skb mode.
Note, that previously, with bpf_flow_dissect_skb we used to call
eth_type_trans which pulled L2 (ETH_HLEN) header and we explicitly called
skb_reset_network_header. That means flow_keys->nhoff would be
initialized to 0 (skb_network_offset) in init_flow_keys.
Now we call bpf_flow_dissect with nhoff set to ETH_HLEN and need
to undo it once the dissection is done to preserve the existing behavior.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
struct bpf_flow_dissector has a small subset of sk_buff fields that
flow dissector BPF program is allowed to access and an optional
pointer to real skb. Real skb is used only in bpf_skb_load_bytes
helper to read non-linear data.
The real motivation for this is to be able to call flow dissector
from eth_get_headlen context where we don't have an skb and need
to dissect raw bytes.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add return check for security level set for socket interface since
stack will check the return value.
Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
l2cap_le_flowctl_init was reseting the tx_credits which works only for
outgoing connection since that set the tx_credits on the response, for
incoming connections that was not the case which leaves the channel
without any credits causing it to be suspended.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.20+
We need to dereference the directory to get its parent to
be able to rename it, so it's clearly not safe to try to
do this with ERR_PTR() pointers. Skip in this case.
It seems that this is most likely what was causing the
report by syzbot, but I'm not entirely sure as it didn't
come with a reproducer this time.
Cc: stable@vger.kernel.org
Reported-by: syzbot+4ece1a28b8f4730547c9@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit c82c06ce43d3("cfg80211: Notify all User Hints To self managed wiphys")
notified all new user hints to self managed wiphy's after device registration.
But it didn't do this for anything other than cell base hints done before
registration.
This needs to be done during wiphy registration of a self managed device also,
so that the previous user settings are retained.
Fixes: c82c06ce43 ("cfg80211: Notify all User Hints To self managed wiphys")
Signed-off-by: Sriram R <srirrama@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The txq of vif is added to active_txqs list for ATF TXQ scheduling
in the function ieee80211_queue_skb(), but it was not properly removed
before freeing the txq object. It was causing use after free of the txq
objects from the active_txqs list, result was kernel panic
due to invalid memory access.
Fix kernel invalid memory access by properly removing txq object
from active_txqs list before free the object.
Signed-off-by: Bhagavathi Perumal S <bperumal@codeaurora.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
None of them have any external callers, make them static.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
No external dependencies, might as well handle this directly.
xfrm_afinfo_policy is now 40 bytes on x86_64.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
handle this directly, its only used by ipv6.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Only used by ipv4, we can read the fl4 tos value directly instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Add extack to shared buffer set operations, so that meaningful error
messages could be propagated to the user.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Kconfig currently controlling compilation of this code is:
net/strparser/Kconfig:config STREAM_PARSER
net/strparser/Kconfig: def_bool n
...meaning that it currently is not being built as a module by anyone.
Lets remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit. For
clarity, we change the fcn name mod_init to dev_init at the same time.
We replace module.h with init.h and export.h ; the latter since this
file exports some syms.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Kconfig controlling this code is:
bpfilter/Kconfig:menuconfig BPFILTER
bpfilter/Kconfig: bool "BPF based packet filtering framework (BPFILTER)"
Since it isn't a module, we shouldn't use module_init(). Instead we
use device_initcall() - which is exactly what module_init() defaults
to for non-modular code/builds.
We don't remove <linux/module.h> from the includes since this file does
a request_module() and hence is a valid user of that header file, even
though it is not modular itself.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Kconfig currently controlling compilation of this code is:
net/Kconfig:config CGROUP_NET_PRIO
net/Kconfig: bool "Network priority cgroup"
...meaning that it currently is not being built as a module by anyone,
as module support was discontinued in 2014.
We delete the MODULE_LICENSE tag since all that information is already
contained at the top of the file in the comments.
We don't delete module.h from the includes since it was no longer there
to begin with.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rosen, Rami" <rami.rosen@intel.com>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The header contains rtnh_ macros so rename the file accordingly.
Allows a later patch to use the nexthop.h name for the new
nexthop code.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2019-04-22
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) allow stack/queue helpers from more bpf program types, from Alban.
2) allow parallel verification of root bpf programs, from Alexei.
3) introduce bpf sysctl hook for trusted root cases, from Andrey.
4) recognize var/datasec in btf deduplication, from Andrii.
5) cpumap performance optimizations, from Jesper.
6) verifier prep for alu32 optimization, from Jiong.
7) libbpf xsk cleanup, from Magnus.
8) other various fixes and cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree:
1) Add a selftest for icmp packet too big errors with conntrack, from
Florian Westphal.
2) Validate inner header in ICMP error message does not lie to us
in conntrack, also from Florian.
3) Initialize ct->timeout to calm down KASAN, from Alexander Potapenko.
4) Skip ICMP error messages from tunnels in IPVS, from Julian Anastasov.
5) Use a hash to expose conntrack and expectation ID, from Florian Westphal.
6) Prevent shift wrap in nft_chain_parse_hook(), from Dan Carpenter.
7) Fix broken ICMP ID randomization with NAT, also from Florian.
8) Remove WARN_ON in ebtables compat that is reached via syzkaller,
from Florian Westphal.
9) Fix broken timestamps since fb420d5d91 ("tcp/fq: move back to
CLOCK_MONOTONIC"), from Florian.
10) Fix logging of invalid packets in conntrack, from Andrei Vagin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
It doesn't log a packet if sysctl_log_invalid isn't equal to protonum
OR sysctl_log_invalid isn't equal to IPPROTO_RAW. This sentence is
always true. I believe we need to replace OR to AND.
Cc: Florian Westphal <fw@strlen.de>
Fixes: c4f3db1595 ("netfilter: conntrack: add and use nf_l4proto_log_invalid")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
setting net.netfilter.nf_conntrack_timestamp=1 breaks xmit with fq
scheduler. skb->tstamp might be "refreshed" using ktime_get_real(),
but fq expects CLOCK_MONOTONIC.
This patch removes all places in netfilter that check/set skb->tstamp:
1. To fix the bogus "start" time seen with conntrack timestamping for
outgoing packets, never use skb->tstamp and always use current time.
2. In nfqueue and nflog, only use skb->tstamp for incoming packets,
as determined by current hook (prerouting, input, forward).
3. xt_time has to use system clock as well rather than skb->tstamp.
We could still use skb->tstamp for prerouting/input/foward, but
I see no advantage to make this conditional.
Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It means userspace gave us a ruleset where there is some other
data after the ebtables target but before the beginning of the next rule.
Fixes: 81e675c227 ("netfilter: ebtables: add CONFIG_COMPAT support")
Reported-by: syzbot+659574e7bcc7f7eb4df7@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pointers should be printed with %p or %px rather than
cast to long type and printed with %8.8lx.
Change %8.8lx to %p to print the pointer.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When device refuses the offload in tls_set_device_offload_rx()
it calls tls_sw_free_resources_rx() to clean up software context
state.
Unfortunately, tls_sw_free_resources_rx() does not free all
the state tls_set_sw_offload() allocated - it leaks IV and
sequence number buffers. All other code paths which lead to
tls_sw_release_resources_rx() (which tls_sw_free_resources_rx()
calls) free those right before the call.
Avoid the leak by moving freeing of iv and rec_seq into
tls_sw_release_resources_rx().
Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If device supports offload, but offload fails tls_set_device_offload_rx()
will call tls_sw_free_resources_rx() which (unhelpfully) releases
and reacquires the socket lock.
For a small fix release and reacquire the device_offload_lock.
Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Bugfixes:
- Fix a regression in which an RPC call can be tagged with an error despite
the transmission being successful
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcu3IgAAoJEA4mA3inWBJcYEAP/jb/O5l2ANmUyMaxew6JMTBH
ULxTyqM8R6B0eifWpBzFjoQjEhywyY6Y3WtY5SnPszHEAR9xZdkLtfTQRySNKOAG
l19VbhIlSKFbBOvXAOXkcdaaIsk3XPl66Drk9SD3DMa5RYdBcF0Kc7bUfF1N4dS7
SS9T97AelHSpMJLpy7CxMMccIXj/Z/65nHNtKlbG7IRqKBdcM6mJeiut+6R/CUv9
NTq+f+ZgnJYsra1oIWbnn5CvQc0c0owl//NIqhos5GFwOAdfdKxNcEuDMcy7QM06
QW+T51omJgQuLwP3EOsOge6apo8Y8FlmxeCCraLGRDKHUlvTJzqtBpO35bJSsJSE
6qjpS8xu2luFmIDGumQib/xVYzVJJjmOJCMUYM6x+XV0TUhJ2l7WASmt7M+1OJCT
j7MJqLBdl2rCM+iNa2tDWLHUElsLOqBCqyFHvngR45yYzyCfUPUM3J+BMij/3yGA
7dO4R2Ii346Iv8iafxr2FfOIB24OxkkDb+I/cVI7GmzbyCTixKaCjrDcJVlsMJCf
rGfZXrep2zrP6AmompOCxUeU3RzJZ4RJRdJ3gTm3aFmqxGlPA5U1MR7MIkRSN4MB
HGShI3jW/Ph9ZoYj7nZ+afyu0V8WJDDtdyqDs/k7HCKeFv2/TL/XxMjlas4n0puB
yhvwt1iW0ZO5f5RlJ6vy
=3T3x
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.1-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfix from Trond Myklebust:
"Fix a regression in which an RPC call can be tagged with an error
despite the transmission being successful"
* tag 'nfs-for-5.1-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Ignore queue transmission errors on successful transmission
tcp sendmsg() and sendpage() normally advance skb->data_len
and skb->truesize by the payload added to an skb.
But sendmsg(fd, ..., MSG_ZEROCOPY) has to account for whole pages,
even if a single byte of payload is used in the page.
This means that we can not assume skb->truesize can be adjusted
by skb->data_len. We must instead overwrite its value.
Otherwise skb->truesize is too big and can hit socket sndbuf limit,
especially if the skb is recycled multiple times :/
Fixes: 472c2e07ee ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using TIPC_SOCK_RECVQ_DEPTH for getsockopt(), it returns the
number of buffers in receive socket buffer which is not so helpful
for user space applications.
This commit introduces the new option TIPC_SOCK_RECVQ_USED which
returns the current allocated bytes of the receive socket buffer.
This helps user space applications dimension its buffer usage to
avoid buffer overload issue.
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'timeval' and 'timespec' data structures used for socket timestamps
are going to be redefined in user space based on 64-bit time_t in future
versions of the C library to deal with the y2038 overflow problem,
which breaks the ABI definition.
Unlike many modern ioctl commands, SIOCGSTAMP and SIOCGSTAMPNS do not
use the _IOR() macro to encode the size of the transferred data, so it
remains ambiguous whether the application uses the old or new layout.
The best workaround I could find is rather ugly: we redefine the command
code based on the size of the respective data structure with a ternary
operator. This lets it get evaluated as late as possible, hopefully after
that structure is visible to the caller. We cannot use an #ifdef here,
because inux/sockios.h might have been included before any libc header
that could determine the size of time_t.
The ioctl implementation now interprets the new command codes as always
referring to the 64-bit structure on all architectures, while the old
architecture specific command code still refers to the old architecture
specific layout. The new command number is only used when they are
actually different.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
socket protocol handlers, and all of those end up calling the same
sock_get_timestamp()/sock_get_timestampns() helper functions, which
results in a lot of duplicate code.
With the introduction of 64-bit time_t on 32-bit architectures, this
gets worse, as we then need four different ioctl commands in each
socket protocol implementation.
To simplify that, let's add a new .gettstamp() operation in
struct proto_ops, and move ioctl implementation into the common
sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
through.
We can reuse the sock_get_timestamp() implementation, but generalize
it so it can deal with both native and compat mode, as well as
timeval and timespec structures.
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If vlan bridge binding is enabled, then the link state of a vlan device
that is an upper device of the bridge tracks the state of bridge ports
that are members of that vlan. But this can only be done when the link
state of the bridge is up. If it is down, then the link state of the
vlan devices must also be down. This is to maintain existing behavior
for when STP is enabled and there are no live ports, in which case the
link state for the bridge and any vlan devices is down.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If vlan bridge binding is enabled, then the link state of a vlan device
that is an upper device of the bridge should track the state of bridge
ports that are members of that vlan. So if a bridge port becomes or
stops being a member of a vlan, then update the link state of the
vlan device if necessary.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of vlan filtering on bridges, the bridge may also have the
corresponding vlan devices as upper devices. A vlan bridge binding mode
is added to allow the link state of the vlan device to track only the
state of the subset of bridge ports that are also members of the vlan,
rather than that of all bridge ports. This mode is set with a vlan flag
rather than a bridge sysfs so that the 8021q module is aware that it
should not set the link state for the vlan device.
If bridge vlan is configured, the bridge device event handling results
in the link state for an upper device being set, if it is a vlan device
with the vlan bridge binding mode enabled. This also sets a
vlan_bridge_binding flag so that subsequent UP/DOWN/CHANGE events for
the ports in that bridge result in a link state update of the vlan
device if required.
The link state of the vlan device is up if there is at least one bridge
port that is a vlan member that is admin & oper up, otherwise its oper
state is IF_OPER_LOWERLAYERDOWN.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In vlan bridge binding mode, the link state is no longer transferred
from the lower device. Instead it is set by the bridge module according
to the state of bridge ports that are members of the vlan.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of vlan filtering on bridges, the bridge may also have the
corresponding vlan devices as upper devices. Currently the link state
of vlan devices is transferred from the lower device. So this is up if
the bridge is in admin up state and there is at least one bridge port
that is up, regardless of the vlan that the port is a member of.
The link state of the vlan device may need to track only the state of
the subset of ports that are also members of the corresponding vlan,
rather than that of all ports.
Add a flag to specify a vlan bridge binding mode, by which the link
state is no longer automatically transferred from the lower device,
but is instead determined by the bridge ports that are members of the
vlan.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes to taprio did not use the correct div64 helpers,
leading to:
net/sched/sch_taprio.o: In function `taprio_dequeue':
sch_taprio.c:(.text+0x34a): undefined reference to `__divdi3'
net/sched/sch_taprio.o: In function `advance_sched':
sch_taprio.c:(.text+0xa0b): undefined reference to `__divdi3'
net/sched/sch_taprio.o: In function `taprio_init':
sch_taprio.c:(.text+0x1450): undefined reference to `__divdi3'
/home/jkicinski/devel/linux/Makefile:1032: recipe for target 'vmlinux' failed
Use math64 helpers.
Fixes: 7b9eba7ba0 ("net/sched: taprio: fix picos_per_byte miscalculation")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GCC complains:
net/l2tp/l2tp_ppp.c: In function ‘pppol2tp_ioctl’:
net/l2tp/l2tp_ppp.c:1073:6: warning: variable ‘val’ set but not used [-Wunused-but-set-variable]
int val;
^~~
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make ICMPv6 closer to ICMPv4, add ratemask parameter. Since the ICMP
message types use larger numeric values, a simple bitmask doesn't fit.
I use large bitmap. The input and output are the in form of list of
ranges. Set the default to rate limit all error messages but Packet Too
Big. For Packet Too Big, use ratemask instead of hard-coded.
There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
aren't called. This patch only adds them to icmpv6_echo_reply().
Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
that it is also acceptable to rate limit informational messages. Thus,
I removed the current hard-coded behavior of icmpv6_mask_allow() that
doesn't rate limit informational messages.
v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
isn't defined, expand the description in ip-sysctl.txt and remove
unnecessary conditional before kfree().
v3: Inline the bitmap instead of dynamically allocated. Still is a
pointer to it is needed because of the way proc_do_large_bitmap work.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike atomic_add(), refcount_add() does not deal well
with a negative argument. TLS fallback code reallocates
the skb and is very likely to shrink the truesize, leading to:
[ 189.513254] WARNING: CPU: 5 PID: 0 at lib/refcount.c:81 refcount_add_not_zero_checked+0x15c/0x180
Call Trace:
refcount_add_checked+0x6/0x40
tls_enc_skb+0xb93/0x13e0 [tls]
Once wmem_allocated count saturates the application can no longer
send data on the socket. This is similar to Eric's fixes for GSO,
TCP:
commit 7ec318feee ("tcp: gso: avoid refcount_t warning from tcp_gso_segment()")
and UDP:
commit 575b65bc5b ("udp: avoid refcount_t saturation in __udp_gso_segment()").
Unlike the GSO case, for TLS fallback it's likely that the skb has
shrunk, so the "likely" annotation is the other way around (likely
branch being "sub").
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a spelling mistake in a NL_SET_ERR_MSG_MOD error message,
fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull RCU and LKMM commits from Paul E. McKenney:
- An LKMM commit adding support for synchronize_srcu_expedited()
- A couple of straggling RCU flavor consolidation updates
- Documentation updates.
- Miscellaneous fixes
- SRCU updates
- RCU CPU stall-warning updates
- Torture-test updates
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Disabling IPv6 on an interface removes existing entries but nothing prevents
new entries from being manually added. To that end, add a new neigh_table
operation, allow_add, that is called on RTM_NEWNEIGH to see if neighbor
entries are allowed on a given device. If IPv6 is disabled on the device,
allow_add returns false and passes a message back to the user via extack.
$ echo 1 > /proc/sys/net/ipv6/conf/eth1/disable_ipv6
$ ip -6 neigh add fe80::4c88:bff:fe21:2704 dev eth1 lladdr de:ad:be:ef:01:01
Error: IPv6 is disabled on this device.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the fib6_flags and fib6_type to fib6_result. Update the lookup helpers
to set them and update post fib lookup users to use the version from the
result.
This allows nexthop objects to have blackhole nexthop.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change fib6_lookup and fib6_table_lookup to take a fib6_result and set
f6i and nh rather than returning a fib6_info. For now both always
return 0.
A later patch set can make these more like the IPv4 counterparts and
return EINVAL, EACCESS, etc based on fib6_type.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change fib6_table_lookup tracepoint to take the fib6_result and use
the fib6_info and fib6_nh from it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass fib6_result to rt6_select. Instead of returning the fib entry, it
will set f6i and nh based on the lookup.
find_rr_leaf is changed to remove the match option in favor of taking
fib6_result and having __find_rr_leaf set f6i in the result.
In the process, update fib6_info references in __find_rr_leaf to f6i names.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass fib6_result to rt6_device_match with f6i set. rt6_device_match
updates f6i in the result if it finds a better match and sets nh.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change ip6_mtu_from_fib6 and fib6_mtu to take a fib6_result over a
fib6_info. Update both to use the fib6_nh from fib6_result.
Since the signature of ip6_mtu_from_fib6 is already changing, add const
to daddr and saddr.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update rt6_insert_exception to take a fib6_result over a fib6_info.
Change ort to f6i from the fib6_result and rename to better reflect
what it references (a fib6_info).
Since this function is already getting changed, update the comments
to reference fib6_info variables rather than the older rt6_info.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all callers are update to have a fib6_result, pass it down
to ip6_rt_get_dev_rcu, ip6_rt_copy_init, and ip6_rt_init_dst.
In the process, change ort to f6i in ip6_rt_copy_init to make it
clear it is a reference to a fib6_info.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update ip6_rt_pcpu_alloc, rt6_get_pcpu_route and rt6_make_pcpu_route
to a fib6_result over a fib6_info.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change ip6_create_rt_rcu to take fib6_result over a fib6_info.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change ip6_rt_cache_alloc to take a fib6_result over a fib6_info.
Since ip6_rt_cache_alloc is only the caller, update the
rt6_is_gw_or_nonexthop helper to take fib6_result.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify rt6_find_cached_rt for the fast path cases and pass fib6_result
to rt6_find_cached_rt. Rename the local return variable to ret to maintain
consisting with fib6_result name.
Update the comment in rt6_find_cached_rt to reference the new names in
a fib6_info vs the old name when fib entries were an rt6_info.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add 'struct fib6_result' to hold the fib entry and fib6_nh from a fib
lookup as separate entries, similar to what IPv4 now has with fib_result.
Rename fib6_multipath_select to fib6_select_path, pass fib6_result to
it, and set f6i and nh in the result once a path selection is done.
Call fib6_select_path unconditionally for path selection which means
moving the sibling and oif check to fib6_select_path. To handle the two
different call paths (2 only call multipath_select if flowi6_oif == 0 and
the other always calls it), add a new have_oif_match that controls the
sibling walk if relevant.
Update callers of fib6_multipath_select accordingly and have them use the
fib6_info and fib6_nh from the result.
This is needed for multipath nexthop objects where a single f6i can
point to multiple fib6_nh (similar to IPv4).
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function build_skb() also have the responsibility to allocate and clear
the SKB structure. Introduce a new function build_skb_around(), that moves
the responsibility of allocation and clearing to the caller. This allows
caller to use kmem_cache (slab/slub) bulk allocation API.
Next patch use this function combined with kmem_cache_alloc_bulk.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If a request transmission fails due to write space or slot unavailability
errors, but the queued task then gets transmitted before it has time to
process the error in call_transmit_status() or call_bc_transmit_status(),
we need to suppress the transmission error code to prevent it from leaking
out of the RPC layer.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Pull networking fixes from David Miller:
1) Handle init flow failures properly in iwlwifi driver, from Shahar S
Matityahu.
2) mac80211 TXQs need to be unscheduled on powersave start, from Felix
Fietkau.
3) SKB memory accounting fix in A-MDSU aggregation, from Felix Fietkau.
4) Increase RCU lock hold time in mlx5 FPGA code, from Saeed Mahameed.
5) Avoid checksum complete with XDP in mlx5, also from Saeed.
6) Fix netdev feature clobbering in ibmvnic driver, from Thomas Falcon.
7) Partial sent TLS record leak fix from Jakub Kicinski.
8) Reject zero size iova range in vhost, from Jason Wang.
9) Allow pending work to complete before clcsock release from Karsten
Graul.
10) Fix XDP handling max MTU in thunderx, from Matteo Croce.
11) A lot of protocols look at the sa_family field of a sockaddr before
validating it's length is large enough, from Tetsuo Handa.
12) Don't write to free'd pointer in qede ptp error path, from Colin Ian
King.
13) Have to recompile IP options in ipv4_link_failure because it can be
invoked from ARP, from Stephen Suryaputra.
14) Doorbell handling fixes in qed from Denis Bolotin.
15) Revert net-sysfs kobject register leak fix, it causes new problems.
From Wang Hai.
16) Spectre v1 fix in ATM code, from Gustavo A. R. Silva.
17) Fix put of BROPT_VLAN_STATS_PER_PORT in bridging code, from Nikolay
Aleksandrov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (111 commits)
socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW
tcp: tcp_grow_window() needs to respect tcp_space()
ocelot: Clean up stats update deferred work
ocelot: Don't sleep in atomic context (irqs_disabled())
net: bridge: fix netlink export of vlan_stats_per_port option
qed: fix spelling mistake "faspath" -> "fastpath"
tipc: set sysctl_tipc_rmem and named_timeout right range
tipc: fix link established but not in session
net: Fix missing meta data in skb with vlan packet
net: atm: Fix potential Spectre v1 vulnerabilities
net/core: work around section mismatch warning for ptp_classifier
net: bridge: fix per-port af_packet sockets
bnx2x: fix spelling mistake "dicline" -> "decline"
route: Avoid crash from dereferencing NULL rt->from
MAINTAINERS: normalize Woojung Huh's email address
bonding: fix event handling for stacked bonds
Revert "net-sysfs: Fix memory leak in netdev_register_kobject"
rtnetlink: fix rtnl_valid_stats_req() nlmsg_len check
qed: Fix the DORQ's attentions handling
qed: Fix missing DORQ attentions
...
It looks like the new socket options only work correctly
for native execution, but in case of compat mode fall back
to the old behavior as we ignore the 'old_timeval' flag.
Rework so we treat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW the
same way in compat and native 32-bit mode.
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Fixes: a9beb86ae6 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For some reason, tcp_grow_window() correctly tests if enough room
is present before attempting to increase tp->rcv_ssthresh,
but does not prevent it to grow past tcp_space()
This is causing hard to debug issues, like failing
the (__tcp_select_window(sk) >= tp->rcv_wnd) test
in __tcp_ack_snd_check(), causing ACK delays and possibly
slow flows.
Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio,
we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000"
after about 60 round trips, when the active side no longer sends
immediate acks.
This bug predates git history.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the introduction of the vlan_stats_per_port option the netlink
export of it has been broken since I made a typo and used the ifla
attribute instead of the bridge option to retrieve its state.
Sysfs export is fine, only netlink export has been affected.
Fixes: 9163a0fc1f ("net: bridge: add support for per-port vlan stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We find that sysctl_tipc_rmem and named_timeout do not have the right minimum
setting. sysctl_tipc_rmem should be larger than zero, like sysctl_tcp_rmem.
And named_timeout as a timeout setting should be not less than zero.
Fixes: cc79dd1ba9 ("tipc: change socket buffer overflow control to respect sk_rcvbuf")
Fixes: a5325ae5b8 ("tipc: add name distributor resiliency queue")
Signed-off-by: Jie Liu <liujie165@huawei.com>
Reported-by: Qiang Ning <ningqiang1@huawei.com>
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the link FSM, when a link endpoint got RESET_MSG (- a
traditional one without the stopping bit) from its peer, it moves to
PEER_RESET state and raises a LINK_DOWN event which then resets the
link itself. Its state will become ESTABLISHING after the reset event
and the link will be re-established soon after this endpoint starts to
send ACTIVATE_MSG to the peer.
There is no problem with this mechanism, however the link resetting has
cleared the link 'in_session' flag (along with the other important link
data such as: the link 'mtu') that was correctly set up at the 1st step
(i.e. when this endpoint received the peer RESET_MSG). As a result, the
link will become ESTABLISHED, but the 'in_session' flag is not set, and
all STATE_MSG from its peer will be dropped at the link_validate_msg().
It means the link not synced and will sooner or later face a failure.
Since the link reset action is obviously needed for a new link session
(this is also true in the other situations), the problem here is that
the link is re-established a bit too early when the link endpoints are
not really in-sync yet. The commit forces a resync as already done in
the previous commit 91986ee166 ("tipc: fix link session and
re-establish issues") by simply varying the link 'peer_session' value
at the link_reset().
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_reorder_vlan_header() should move XDP meta data with ethernet header
if XDP meta data exists.
Fixes: de8f3a83b0 ("bpf: add meta pointer for direct access")
Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com>
Signed-off-by: Takeru Hayasaka <taketarou2@gmail.com>
Co-developed-by: Takeru Hayasaka <taketarou2@gmail.com>
Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
arg is controlled by user-space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
net/atm/lec.c:715 lec_mcast_attach() warn: potential spectre issue 'dev_lec' [r] (local cap)
Fix this by sanitizing arg before using it to index dev_lec.
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://lore.kernel.org/lkml/20180423164740.GY17484@dhcp22.suse.cz/
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The routine ptp_classifier_init() uses an initializer for an
automatic struct type variable which refers to an __initdata
symbol. This is perfectly legal, but may trigger a section
mismatch warning when running the compiler in -fpic mode, due
to the fact that the initializer may be emitted into an anonymous
.data section thats lack the __init annotation. So work around it
by using assignments instead.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the commit below was introduced it changed two visible things:
- the skb was no longer passed through the protocol handlers with the
original device
- the skb was passed up the stack with skb->dev = bridge
The first change broke af_packet sockets on bridge ports. For example we
use them for hostapd which listens for ETH_P_PAE packets on the ports.
We discussed two possible fixes:
- create a clone and pass it through NF_HOOK(), act on the original skb
based on the result
- somehow signal to the caller from the okfn() that it was called,
meaning the skb is ok to be passed, which this patch is trying to
implement via returning 1 from the bridge link-local okfn()
Note that we rely on the fact that NF_QUEUE/STOLEN would return 0 and
drop/error would return < 0 thus the okfn() is called only when the
return was 1, so we signal to the caller that it was called by preserving
the return value from nf_hook().
Fixes: 8626c56c82 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ring buffer code of XDP sockets is missing a memory barrier on the
consumer side between the load of the data and the write that signals
that it is ok for the producer to put new data into the buffer. On
architectures that does not guarantee that stores are not reordered
with older loads, the producer might put data into the ring before the
consumer had the chance to read it. As IA does guarantee this
ordering, it would only need a compiler barrier here, but there are no
primitives in Linux for this specific case (hinder writes to be ordered
before older reads) so I had to add a smp_mb() here which will
translate into a run-time synch operation on IA.
Added a longish comment in the code explaining what each barrier in
the ring implementation accomplishes and what would happen if we
removed one of them.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The helper function bpf_sock_ops_cb_flags_set() can be used to both
set and clear the sock_ops callback flags. However, its current
behavior is not consistent. BPF program may clear a flag if more than
one were set, or replace a flag with another one, but cannot clear all
flags.
This patch also updates the documentation to clarify the ability to
clear flags of this helper function.
Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The ENCAP flags in bpf_skb_adjust_room are ignored on decap with
bpf_skb_net_shrink. Reserve these bits for future use.
Fixes: 868d523535 ("bpf: add bpf_skb_adjust_room encap flags")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add tx stats to hsr interface. Without this
ifconfig for hsr interface doesn't show tx packet stats.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the path of hsr debugfs root directory to use the net device
name so that it can work with multiple interfaces. While at it,
also fix some typos.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the file name and functions to match with existing implementation.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_forward_alloc's updating is also done on rx path, but to be consistent
we change to use sk_mem_charge() in sctp_skb_set_owner_r().
In sctp_eat_data(), it's not enough to check sctp_memory_pressure only,
which doesn't work for mem_cgroup_sockets_enabled, so we change to use
sk_under_memory_pressure().
When it's under memory pressure, sk_mem_reclaim() and sk_rmem_schedule()
should be called on both RENEGE or CHUNK DELIVERY path exit the memory
pressure status as soon as possible.
Note that sk_rmem_schedule() is using datalen to make things easy there.
Reported-by: Matteo Croce <mcroce@redhat.com>
Tested-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when sending packets, sk_mem_charge() and sk_mem_uncharge() have been
used to set sk_forward_alloc. We just need to call sk_wmem_schedule() to
check if the allocated should be raised, and call sk_mem_reclaim() to
check if the allocated should be reduced when it's under memory pressure.
If sk_wmem_schedule() returns false, which means no memory is allowed to
allocate, it will block and wait for memory to become available.
Note different from tcp, sctp wait_for_buf happens before allocating any
skb, so memory accounting check is done with the whole msg_len before it
too.
Reported-by: Matteo Croce <mcroce@redhat.com>
Tested-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Remove the broute pseudo hook, implement this from the bridge
prerouting hook instead. Now broute becomes real table in ebtables,
from Florian Westphal. This also includes a size reduction patch for the
bridge control buffer area via squashing boolean into bitfields and
a selftest.
2) Add OS passive fingerprint version matching, from Fernando Fernandez.
3) Support for gue encapsulation for IPVS, from Jacky Hu.
4) Add support for NAT to the inet family, from Florian Westphal.
This includes support for masquerade, redirect and nat extensions.
5) Skip interface lookup in flowtable, use device in the dst object.
6) Add jiffies64_to_msecs() and use it, from Li RongQing.
7) Remove unused parameter in nf_tables_set_desc_parse(), from Colin Ian King.
8) Statify several functions, patches from YueHaibing and Florian Westphal.
9) Add an optimized version of nf_inet_addr_cmp(), from Li RongQing.
10) Merge route extension to core, also from Florian.
11) Use IS_ENABLED(CONFIG_NF_NAT) instead of NF_NAT_NEEDED, from Florian.
12) Merge ip/ip6 masquerade extensions, from Florian. This includes
netdevice notifier unification.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After merging the netfilter-next tree, today's linux-next build (powerpc
ppc44x_defconfig) failed like this:
In file included from net/bridge/br_input.c:19:
include/net/netfilter/nf_queue.h:16:23: error: field 'state' has incomplete type
struct nf_hook_state state;
^~~~~
Fixes: 971502d77f ("bridge: netfilter: unroll NF_HOOK helper in bridge input path")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
when CONFIG_INET is not enabled:
net/xfrm/xfrm_output.c: In function ‘xfrm4_tunnel_encap_add’:
net/xfrm/xfrm_output.c:234:2: error: implicit declaration of function ‘ip_select_ident’ [-Werror=implicit-function-declaration]
ip_select_ident(dev_net(dst->dev), skb, NULL);
XFRM only supports ipv4 and ipv6 so change dependency to INET and place
user-visible options (pfkey sockets, migrate support and the like)
under 'if INET' guard as well.
Fixes: 1de7083006 ("xfrm: remove output2 indirection from xfrm_mode")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Sven Auhagen reported that a 2nd ping request will fail if 'fully-random'
mode is used.
Reason is that if no proto information is given, min/max are both 0,
so we set the icmp id to 0 instead of chosing a random value between
0 and 65535.
Update test case as well to catch this, without fix this yields:
[..]
ERROR: cannot ping ns1 from ns2 with ip masquerade fully-random (attempt 2)
ERROR: cannot ping ns1 from ns2 with ipv6 masquerade fully-random (attempt 2)
... becaus 2nd ping clashes with existing 'id 0' icmp conntrack and gets
dropped.
Fixes: 203f2e7820 ("netfilter: nat: remove l4proto->unique_tuple")
Reported-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
I believe that "hook->num" can be up to UINT_MAX. Shifting more than
31 bits would is undefined in C but in practice it would lead to shift
wrapping. That would lead to an array overflow in nf_tables_addchain():
ops->hook = hook.type->hooks[ops->hooknum];
Fixes: fe19c04ca1 ("netfilter: nf_tables: remove nhooks field from struct nft_af_info")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
else, we leak the addresses to userspace via ctnetlink events
and dumps.
Compute an ID on demand based on the immutable parts of nf_conn struct.
Another advantage compared to using an address is that there is no
immediate re-use of the same ID in case the conntrack entry is freed and
reallocated again immediately.
Fixes: 3583240249 ("[NETFILTER]: nf_conntrack_expect: kill unique ID")
Fixes: 7f85f91472 ("[NETFILTER]: nf_conntrack: kill unique ID")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jakub forgot to either use nlmsg_len() or nlmsg_msg_size(),
allowing KMSAN to detect a possible uninit-value in rtnl_stats_get
BUG: KMSAN: uninit-value in rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997
CPU: 0 PID: 10428 Comm: syz-executor034 Not tainted 5.1.0-rc2+ #24
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x173/0x1d0 lib/dump_stack.c:113
kmsan_report+0x131/0x2a0 mm/kmsan/kmsan.c:619
__msan_warning+0x7a/0xf0 mm/kmsan/kmsan_instr.c:310
rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997
rtnetlink_rcv_msg+0x115b/0x1550 net/core/rtnetlink.c:5192
netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2485
rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5210
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x127f/0x1300 net/netlink/af_netlink.c:1925
sock_sendmsg_nosec net/socket.c:622 [inline]
sock_sendmsg net/socket.c:632 [inline]
___sys_sendmsg+0xdb3/0x1220 net/socket.c:2137
__sys_sendmsg net/socket.c:2175 [inline]
__do_sys_sendmsg net/socket.c:2184 [inline]
__se_sys_sendmsg+0x305/0x460 net/socket.c:2182
__x64_sys_sendmsg+0x4a/0x70 net/socket.c:2182
do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
entry_SYSCALL_64_after_hwframe+0x63/0xe7
Fixes: 51bc860d4a ("rtnetlink: stats: validate attributes in get as well as dumps")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can receive ICMP errors from client or from
tunneling real server. While the former can be
scheduled to real server, the latter should
not be scheduled, they are decapsulated only when
existing connection is found.
Fixes: 6044eeffaf ("ipvs: attempt to schedule icmp packets")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Luca Moro says:
------
The issue lies in the filtering of ICMP and ICMPv6 errors that include an
inner IP datagram.
For these packets, icmp_error_message() extract the ICMP error and inner
layer to search of a known state.
If a state is found the packet is tagged as related (IP_CT_RELATED).
The problem is that there is no correlation check between the inner and
outer layer of the packet.
So one can encapsulate an error with an inner layer matching a known state,
while its outer layer is directed to a filtered host.
In this case the whole packet will be tagged as related.
This has various implications from a rule bypass (if a rule to related
trafic is allow), to a known state oracle.
Unfortunately, we could not find a real statement in a RFC on how this case
should be filtered.
The closest we found is RFC5927 (Section 4.3) but it is not very clear.
A possible fix would be to check that the inner IP source is the same than
the outer destination.
We believed this kind of attack was not documented yet, so we started to
write a blog post about it.
You can find it attached to this mail (sorry for the extract quality).
It contains more technical details, PoC and discussion about the identified
behavior.
We discovered later that
https://www.gont.com.ar/papers/filtering-of-icmp-error-messages.pdf
described a similar attack concept in 2004 but without the stateful
filtering in mind.
-----
This implements above suggested fix:
In icmp(v6) error handler, take outer destination address, then pass
that into the common function that does the "related" association.
After obtaining the nf_conn of the matching inner-headers connection,
check that the destination address of the opposite direction tuple
is the same as the outer address and only set RELATED if thats the case.
Reported-by: Luca Moro <luca.moro@synacktiv.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Recompile IP options since IPCB may not be valid anymore when
ipv4_link_failure is called from arp_error_report.
Refer to the commit 3da1ed7ac3 ("net: avoid use IPCB in cipso_v4_error")
and the commit before that (9ef6b42ad6) for a similar issue.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the review of 0b34eb0043 ("ipv6: Refactor __ip6_route_redirect"),
Martin noted that the flowi6_oif compare is moved to the new helper and
should be removed from __ip6_route_redirect. Fix the oversight.
Fixes: 0b34eb0043 ("ipv6: Refactor __ip6_route_redirect")
Reported-by: Martin Lau <kafai@fb.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rxrpc packet serial number cannot be safely used to compute out of
order ack packets for several reasons:
1. The allocation of serial numbers cannot be assumed to imply the order
by which acks are populated and transmitted. In some rxrpc
implementations, delayed acks and ping acks are transmitted
asynchronously to the receipt of data packets and so may be transmitted
out of order. As a result, they can race with idle acks.
2. Serial numbers are allocated by the rxrpc connection and not the call
and as such may wrap independently if multiple channels are in use.
In any case, what matters is whether the ack packet provides new
information relating to the bounds of the window (the firstPacket and
previousPacket in the ACK data).
Fix this by discarding packets that appear to wind back the window bounds
rather than on serial number procession.
Fixes: 298bc15b20 ("rxrpc: Only take the rwind and mtu values from latest ACK")
Signed-off-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trace received calls that are aborted due to a connection abort, typically
because of authentication failure. Without this, connection aborts don't
show up in the trace log.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change rxrpc_queue_packet()'s signature so that it can return any error
code it may encounter when trying to send the packet.
This allows the caller to eventually do something in case of error - though
it should be noted that the packet has been queued and a resend is
scheduled.
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make rxrpc_kernel_check_life() pass back the life counter through the
argument list and return true if the call has not yet completed.
Suggested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an ICMP or ICMPV6 error is received, the error will be attached
to the socket (sk_err) and the report function will get called.
Clear any pending error here by calling sock_error().
This would cause the following attempt to use the socket to fail with
the error code stored by the ICMP error, resulting in unexpected errors
with various side effects depending on the context.
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jonathan Billings <jsbillin@umich.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rework smc_conn_create() to always return a valid DECLINE reason code.
This removes the need to translate the return codes on 4 different
places and allows to easily add more detailed return codes by changing
smc_conn_create() only.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rework smc_listen_work() to provide improved reason codes when an
SMC connection is declined. This allows better debugging on user side.
This also adds 3 more detailed reason codes in smc_clc.h to indicate
what type of device was not found (ism or rdma or both), or if ism
cannot talk to the peer.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In smc_listen_work() the variables rc and reason_code are defined which
have the same meaning. Eliminate reason_code in favor of the shorter
name rc. No functional changes.
Rename the functions smc_check_ism() and smc_check_rdma() into
smc_find_ism_device() and smc_find_rdma_device() to make there purpose
more clear. No functional changes.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The vlan_id of the underlying CLC socket was retrieved two times
during processing of the listen handshaking. Change this to get the
vlan id one time in connect and in listen processing, and reuse the id.
And add a new CLC DECLINE return code for the case when the retrieval
of the vlan id failed.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During initialization of an SMC socket a lot of function parameters need
to get passed down the function call path. Consolidate the parameters
in a helper struct so there are less enough parameters to get all passed
by register.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check for a matching ip prefix and subnet was only done for SMC-R
in smc_listen_rdma_check() but not when an SMC-D connection was
possible. Rename the function into smc_listen_prfx_check() and move its
call to a place where it is called for both SMC variants.
And add a new CLC DECLINE reason for the case when the IP prefix or
subnet check fails so the reason for the failing SMC connection can be
found out more easily.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct the CLC decline reason codes for internal problems to not have
the sign bit set, negative reason codes are interpreted as not eligible
for TCP fallback.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For nonblocking sockets move the kernel_connect() from the connect
worker into the initial smc_connect part to return kernel_connect()
errors other than -EINPROGRESS to user space.
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KMSAN will complain if valid address length passed to udpv6_pre_connect()
is shorter than sizeof("struct sockaddr"->sa_family) bytes.
(This patch is bogus if it is guaranteed that udpv6_pre_connect() is
always called after checking "struct sockaddr"->sa_family. In that case,
we want a comment why we don't need to check valid address length here.)
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KMSAN will complain if valid address length passed to bpf_bind() is
shorter than sizeof("struct sockaddr"->sa_family) bytes.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KMSAN will complain if valid address length passed to bind() is shorter
than sizeof(struct sockaddr_llc) bytes.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
KMSAN will complain if valid address length passed to bind() is shorter
than sizeof(struct sockaddr_sco) bytes.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
KMSAN will complain if valid address length passed to bind() is shorter
than sizeof(struct sockaddr_rxrpc) bytes.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
KMSAN will complain if valid address length passed to bind() is shorter
than sizeof(struct sockaddr_nl) bytes.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
KMSAN will complain if valid address length passed to connect() is shorter
than sizeof("struct sockaddr"->sa_family) bytes.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot is reporting uninitialized value at rds_connect() [1] and
rds_bind() [2]. This is because syzbot is passing ulen == 0 whereas
these functions expect that it is safe to access sockaddr->family field
in order to determine minimal address length for validation.
[1] https://syzkaller.appspot.com/bug?id=f4e61c010416c1e6f0fa3ffe247561b60a50ad71
[2] https://syzkaller.appspot.com/bug?id=a4bf9e41b7e055c3823fdcd83e8c58ca7270e38f
Reported-by: syzbot <syzbot+0049bebbf3042dbd2e8f@syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+915c9f99f3dbc4bd6cd1@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now the SKB list implementation assumption can be removed.
And now that we know that the list head is always non-NULL
we can remove the code blocks dealing with that as well.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass this, instead of an event. Then everything trickles down and we
always have events a non-empty list.
Then we needs a list creating stub to place into .enqueue_event for sctp_stream_interleave_1.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This way we can make sure events sent this way to
sctp_ulpq_tail_event() are on a list as well. Now all such code paths
are fully covered.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This way we can simplify the logic and remove assumptions
about the implementation of skb lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inside the loop, we always start with event non-NULL.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix net reference counting in fl_change() and remove redundant call to
tcf_exts_get_net() from __fl_delete(). __fl_put() already tries to get net
before releasing exts and deallocating a filter, so this code caused flower
classifier to obtain net twice per filter that is being deleted.
Implementation of __fl_delete() called tcf_exts_get_net() to pass its
result as 'async' flag to fl_mask_put(). However, 'async' flag is redundant
and only complicates fl_mask_put() implementation. This functionality seems
to be copied from filter cleanup code, where it was added by Cong with
following explanation:
This patchset tries to fix the race between call_rcu() and
cleanup_net() again. Without holding the netns refcnt the
tc_action_net_exit() in netns workqueue could be called before
filter destroy works in tc filter workqueue. This patchset
moves the netns refcnt from tc actions to tcf_exts, without
breaking per-netns tc actions.
This doesn't apply to flower mask, which doesn't call any tc action code
during cleanup. Simplify fl_mask_put() by removing the flag parameter and
always use tcf_queue_work() to free mask objects.
Fixes: 061775583e ("net: sched: flower: introduce reference counting for filters")
Fixes: 1f17f7742e ("net: sched: flower: insert filter to ht before offloading it to hw")
Fixes: 05cd271fd6 ("cls_flower: Support multiple masks per priority")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit e21db6f69a ("tcp: track total bytes delivered with ECN CE marks")
core TCP stack does a very good job tracking ECN signals.
The "sender's best estimate of CE information" Yuchung mentioned in his
patch is indeed the best we can do.
DCTCP can use tp->delivered_ce and tp->delivered to not duplicate the logic,
and use the existing best estimate.
This solves some problems, since current DCTCP logic does not deal with losses
and/or GRO or ack aggregation very well.
This also removes a dubious use of inet_csk(sk)->icsk_ack.rcv_mss
(this should have been tp->mss_cache), and a 64 bit divide.
Finally, we can see that the DCTCP logic, calling dctcp_update_alpha() for
every ACK could be done differently, calling it only once per RTT.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Abdul Kabbani <akabbani@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-04-12
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Improve BPF verifier scalability for large programs through two
optimizations: i) remove verifier states that are not useful in pruning,
ii) stop walking parentage chain once first LIVE_READ is seen. Combined
gives approx 20x speedup. Increase limits for accepting large programs
under root, and add various stress tests, from Alexei.
2) Implement global data support in BPF. This enables static global variables
for .data, .rodata and .bss sections to be properly handled which allows
for more natural program development. This also opens up the possibility
to optimize program workflow by compiling ELFs only once and later only
rewriting section data before reload, from Daniel and with test cases and
libbpf refactoring from Joe.
3) Add config option to generate BTF type info for vmlinux as part of the
kernel build process. DWARF debug info is converted via pahole to BTF.
Latter relies on libbpf and makes use of BTF deduplication algorithm which
results in 100x savings compared to DWARF data. Resulting .BTF section is
typically about 2MB in size, from Andrii.
4) Add BPF verifier support for stack access with variable offset from
helpers and add various test cases along with it, from Andrey.
5) Extend bpf_skb_adjust_room() growth BPF helper to mark inner MAC header
so that L2 encapsulation can be used for tc tunnels, from Alan.
6) Add support for input __sk_buff context in BPF_PROG_TEST_RUN so that
users can define a subset of allowed __sk_buff fields that get fed into
the test program, from Stanislav.
7) Add bpf fs multi-dimensional array tests for BTF test suite and fix up
various UBSAN warnings in bpftool, from Yonghong.
8) Generate a pkg-config file for libbpf, from Luca.
9) Dump program's BTF id in bpftool, from Prashant.
10) libbpf fix to use smaller BPF log buffer size for AF_XDP's XDP
program, from Magnus.
11) kallsyms related fixes for the case when symbols are not present in
BPF selftests and samples, from Daniel
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes broute a normal ebtables table, hooking at PREROUTING.
The broute hook is removed.
It uses skb->cb to signal to bridge rx handler that the skb should be
routed instead of being bridged.
This change is backwards compatible with ebtables as no userspace visible
parts are changed.
This means we can also remove the !ops test in ebt_register_table,
it was only there for broute table sake.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Replace NF_HOOK() based invocation of the netfilter hooks with a private
copy of nf_hook_slow().
This copy has one difference: it can return the rx handler value expected
by the stack, i.e. RX_HANDLER_CONSUMED or RX_HANDLER_PASS.
This is needed by the next patch to invoke the ebtables
"broute" table via the standard netfilter hooks rather than the custom
"br_should_route_hook" indirection that is used now.
When the skb is to be "brouted", we must return RX_HANDLER_PASS from the
bridge rx input handler, but there is no way to indicate this via
NF_HOOK(), unless perhaps by some hack such as exposing bridge_cb in the
netfilter core or a percpu flag.
text data bss dec filename
3369 56 0 3425 net/bridge/br_input.o.before
3458 40 0 3498 net/bridge/br_input.o.after
This allows removal of the "br_should_route_hook" in the next patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reduce size of br_input_skb_cb from 24 to 16 bytes by
using bitfield for those values that can only be 0 or 1.
igmp is the igmp type value, so it needs to be at least u8.
Furthermore, the bridge currently relies on step-by-step initialization
of br_input_skb_cb fields as the skb passes through the stack.
Explicitly zero out the bridge input cb instead, this avoids having to
review/validate that no BR_INPUT_SKB_CB(skb)->foo test can see a
'random' value from previous protocol cb.
AFAICS all current fields are always set up before they are read again,
so this is not a bug fix.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This should allow us later to extend BPF_PROG_TEST_RUN for non-skb case
and be sure that nobody is erroneously setting ctx_{in,out}.
Fixes: b0b9395d86 ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Move the nexthop evaluation of a fib entry to a helper that can be
leveraged for each fib6_nh in a multipath nexthop object.
In the move, 'continue' statements means the helper returns false
(loop should continue) and 'break' means return true (found the entry
of interest).
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the device and gateway checks in the fib6_next loop to a helper
that can be called per fib6_nh entry.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the siblings and fib6_multipath_select after the null entry check
since a null entry can not have siblings.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up the fib6_null_entry handling in ip6_pol_route_lookup.
rt6_device_match can return fib6_null_entry, but fib6_multipath_select
can not. Consolidate the fib6_null_entry handling and on the final
null_entry check set rt and goto out - no need to defer to a second
check after rt6_find_cached_rt.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
find_rr_leaf has 3 loops over fib_entries calling find_match. The loops
are very similar with differences in start point and whether the metric
is evaluated:
1. start at rr_head, no extra loop compare, check fib metric
2. start at leaf, compare rt against rr_head, check metric
3. start at cont (potential saved point from earlier loops), no
extra loop compare, no metric check
Create 1 loop that is called 3 different times. This will make a
later change with multipath nexthop objects much simpler.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
find_match primarily needs a fib6_nh (and fib6_flags which it passes
through to rt6_score_route). Move fib6_check_expired up to the call
sites so find_match is only called for relevant entries. Remove the
match argument which is mostly a pass through and use the return
boolean to decide if match gets set in the call sites.
The end result is a helper that can be called per fib6_nh struct
which is needed once fib entries reference nexthop objects that
have more than one fib6_nh.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_score_route only needs the fib6_flags and nexthop data. Change
it accordingly. Allows re-use later for nexthop based fib6_nh.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_probe sends probes for gateways in a nexthop. As such it really
depends on a fib6_nh, not a fib entry. Move last_probe to fib6_nh and
update rt6_probe to a fib6_nh struct.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_check_dev is a simpler helper with only 1 caller. Fold the code
into rt6_score_route.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change rt6_check_neigh to take a fib6_nh instead of a fib entry.
Move the check on fib_flags and whether the nexthop has a gateway
up to the one caller.
Remove the inline from the definition as well. Not necessary.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The zero namelen check is redundant as it has already been checked
for zero at the start of the function. Remove the redundant check.
Addresses-Coverity: ("Logically Dead Code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 868d523535 ("bpf: add bpf_skb_adjust_room encap flags")
introduced support to bpf_skb_adjust_room for GSO-friendly GRE
and UDP encapsulation.
For GSO to work for skbs, the inner headers (mac and network) need to
be marked. For L3 encapsulation using bpf_skb_adjust_room, the mac
and network headers are identical. Here we provide a way of specifying
the inner mac header length for cases where L2 encap is desired. Such
an approach can support encapsulated ethernet headers, MPLS headers etc.
For example to convert from a packet of form [eth][ip][tcp] to
[eth][ip][udp][inner mac][ip][tcp], something like the following could
be done:
headroom = sizeof(iph) + sizeof(struct udphdr) + inner_maclen;
ret = bpf_skb_adjust_room(skb, headroom, BPF_ADJ_ROOM_MAC,
BPF_F_ADJ_ROOM_ENCAP_L4_UDP |
BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
BPF_F_ADJ_ROOM_ENCAP_L2(inner_maclen));
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In the function tipc_node_create() we protect the peer capability field
by using the node rw_lock. However, we access the lock directly instead
of using the dedicated functions for this, as we do everywhere else in
node.c. This cosmetic spot is fixed here.
Fixes: 40999f11ce ("tipc: make link capability update thread safe")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b0b9395d86 ("bpf: support input __sk_buff context in
BPF_PROG_TEST_RUN") started using bpf_check_uarg_tail_zero in
BPF_PROG_TEST_RUN. However, bpf_check_uarg_tail_zero is not defined
for !CONFIG_BPF_SYSCALL:
net/bpf/test_run.c: In function ‘bpf_ctx_init’:
net/bpf/test_run.c:142:9: error: implicit declaration of function ‘bpf_check_uarg_tail_zero’ [-Werror=implicit-function-declaration]
err = bpf_check_uarg_tail_zero(data_in, max_size, size);
^~~~~~~~~~~~~~~~~~~~~~~~
Let's not build net/bpf/test_run.c when CONFIG_BPF_SYSCALL is not set.
Reported-by: kbuild test robot <lkp@intel.com>
Fixes: b0b9395d86 ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This reverts commit 009a82f643.
The ability to optimise here relies on compiler being able to optimise
away tail calls to avoid stack overflows. Unfortunately, we are seeing
reports of problems, so let's just revert.
Reported-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We want to drain only the RQ first. Otherwise the transport can
deadlock on ->close if there are outstanding Send completions.
Fixes: 6d2d0ee27c ("xprtrdma: Replace rpcrdma_receive_wq ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Only reason for having two different register functions was because of
ipt_MASQUERADE and ip6t_MASQUERADE being two different modules.
Previous patch merged those into xt_MASQUERADE, so we can merge this too.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to have separate modules for this.
before:
text data bss dec filename
2038 1168 0 3206 net/ipv4/netfilter/ipt_MASQUERADE.ko
1526 1024 0 2550 net/ipv6/netfilter/ip6t_MASQUERADE.ko
after:
text data bss dec filename
2521 1296 0 3817 net/netfilter/xt_MASQUERADE.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Both are now implemented by nf_nat_masquerade.c, so no need to keep
different headers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Implementation of function rhashtable_insert_fast() check if its internal
helper function __rhashtable_insert_fast() returns non-NULL pointer and
seemingly return -EEXIST in such case. However, since
__rhashtable_insert_fast() is called with NULL key pointer, it never
actually checks for duplicates, which means that -EEXIST is never returned
to the user. Use rhashtable_lookup_insert_fast() hash table API instead. In
order to verify that it works as expected and prevent the problem from
happening in future, extend tc-tests with new test that verifies that no
new filters with existing key can be inserted to flower classifier.
Fixes: 1f17f7742e ("net: sched: flower: insert filter to ht before offloading it to hw")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NETNSA_NSID is signed. Use nla_get_s32() to avoid confusion.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_multicast_start_querier() walks over the port list but it can be
called from a timer with only multicast_lock held which doesn't protect
the port list, so use RCU to walk over it.
Fixes: c83b8fab06 ("bridge: Restart queries when last querier expires")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit <26d92e951fe0>
("net/smc: move unhash as early as possible in smc_release()")
fixes one occurrence in the smc code, but the same pattern exists
in other places. This patch covers the remaining occurrences and
makes sure, the unhash operation is done before the smc->clcsock is
released. This avoids a potential use-after-free in smc_diag_dump().
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FLUSH command is used to empty the pnet table. No return code is
expected from the command. Commit a9d8b0b1e3d6 added namespace support
for the pnet table and changed the FLUSH command processing to call
smc_pnet_remove_by_pnetid() to remove the pnet entries. This function
returns -ENOENT when no entry was deleted, which is now the return code
of the FLUSH command. As a result the FLUSH command will return an error
when the pnet table is already empty.
Restore the expected behavior and let FLUSH always return 0.
Fixes: a9d8b0b1e3d6 ("net/smc: add pnet table namespace support")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fcntl(fd, F_SETOWN, getpid()) selects the recipient of SIGURG signals
that are delivered when out-of-band data arrives on socket fd.
If an SMC socket program makes use of such an fcntl() call, it fails
in case of fallback to TCP-mode. In case of fallback the traffic is
processed with the internal TCP socket. Propagating field "file" from the
SMC socket to the internal TCP socket fixes the issue.
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case alloc_ordered_workqueue fails, the fix returns NULL
to avoid NULL pointer dereference.
Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the clcsock is already released using sock_release() and a pending
smc_listen_work accesses the clcsock than that will fail. Solve this
by canceling and waiting for the work to complete first. Because the
work holds the sock_lock it must make sure that the lock is not hold
before the new helper smc_clcsock_release() is invoked. And before the
smc_listen_work starts working check if the parent listen socket is
still valid, otherwise stop the work early.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new set of arguments to bpf_attr for BPF_PROG_TEST_RUN:
* ctx_in/ctx_size_in - input context
* ctx_out/ctx_size_out - output context
The intended use case is to pass some meta data to the test runs that
operate on skb (this has being brought up on recent LPC).
For programs that use bpf_prog_test_run_skb, support __sk_buff input and
output. Initially, from input __sk_buff, copy _only_ cb and priority into
skb, all other non-zero fields are prohibited (with EINVAL).
If the user has set ctx_out/ctx_size_out, copy the potentially modified
__sk_buff back to the userspace.
We require all fields of input __sk_buff except the ones we explicitly
support to be set to zero. The expectation is that in the future we might
add support for more fields and we want to fail explicitly if the user
runs the program on the kernel where we don't yet support them.
The API is intentionally vague (i.e. we don't explicitly add __sk_buff
to bpf_attr, but ctx_in) to potentially let other test_run types use
this interface in the future (this can be xdp_md for xdp types for
example).
v4:
* don't copy more than allowed in bpf_ctx_init [Martin]
v3:
* handle case where ctx_in is NULL, but ctx_out is not [Martin]
* convert size==0 checks to ptr==NULL checks and add some extra ptr
checks [Martin]
v2:
* Addressed comments from Martin Lau
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Remove not useful protocol version check in gue_udp_recv since just
gue version 0 can hit that code. Moreover remove duplicated hdrlen
computation
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gue tunnels run iptunnel_pull_offloads on received skbs. This can
determine a possible use-after-free accessing guehdr pointer since
the packet will be 'uncloned' running pskb_expand_head if it is a
cloned gso skb (e.g if the packet has been sent though a veth device)
Fixes: a09a4c8dd1 ("tunnels: Remove encapsulation offloads on decap")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When binding multiple services with specific type 1Ki, 2Ki..,
this leads to some entries in the name table of publications
missing when listed out via 'tipc name show'.
The problem is at identify zero last_type conditional provided
via netlink. The first is initial 'type' when starting name table
dummping. The second is continuously with zero type (node state
service type). Then, lookup function failure to finding node state
service type in next iteration.
To solve this, adding more conditional to marked as dirty type and
lookup correct service type for the next iteration instead of select
the first service as initial 'type' zero.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct spelling of encapsulation.
Found by inspection.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.
As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.
Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.
It's less risky to lift up the rename restriction on failover slave
which is already UP. Although it's possible this change may potentially
break userspace component (most likely configuration scripts or
management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to listen for the rename
events on failover slaves. Userspace component interacting with slaves
is expected to be changed to operate on failover master interface
instead, as the failover slave is dynamic in nature which may come and
go at any point. The goal is to make the role of failover slaves less
relevant, and userspace components should only deal with failover master
in the long run.
Fixes: 30c8bd5aa8 ("net: Introduce generic failover module")
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Credit Based Shaper heavily depends on link speed to calculate
the scheduling credits, we can't properly calculate the credits if the
device has failed to report the link speed.
In that case we can't dequeue packets assuming a wrong port rate that will
result into an inconsistent credit distribution.
This patch makes sure we fail to dequeue case:
1) __ethtool_get_link_ksettings() reports error or 2) the ethernet driver
failed to set the ksettings' speed value (setting link speed to
SPEED_UNKNOWN).
Additionally we properly re calculate the port rate whenever the link speed
is changed.
Fixes: 3d0bd028ff ("net/sched: Add support for HW offloading for CBS")
Signed-off-by: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Reviewed-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Time Aware Priority Scheduler is heavily dependent to link speed,
it relies on it to calculate transmission bytes per cycle, we can't
properly calculate the so called budget if the device has failed
to report the link speed.
In that case we can't dequeue packets assuming a wrong budget.
This patch makes sure we fail to dequeue case:
1) __ethtool_get_link_ksettings() reports error or 2) the ethernet
driver failed to set the ksettings' speed value (setting link speed
to SPEED_UNKNOWN).
Additionally we re calculate the budget whenever the link speed is
changed.
Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Signed-off-by: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Reviewed-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
buildbot noticed that TLS_HW is not defined if CONFIG_TLS_DEVICE=n.
Wrap the cleanup branch into an ifdef, tls_device_free_resources_tx()
wouldn't be compiled either in this case.
Fixes: 35b71a34ad ("net/tls: don't leak partially sent record in device mode")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts the first part of commit 4e485d06bb ("strparser: Call
skb_unclone conditionally"). To build a message with multiple
fragments we need our own root of frag_list. We can't simply
use the frag_list of orig_skb, because it will lead to linking
all orig_skbs together creating very long frag chains, and causing
stack overflow on kfree_skb() (which is called recursively on
the frag_lists).
BUG: stack guard page was hit at 00000000d40fad41 (stack is 0000000029dde9f4..000000008cce03d5)
kernel stack overflow (double-fault): 0000 [#1] PREEMPT SMP
RIP: 0010:free_one_page+0x2b/0x490
Call Trace:
__free_pages_ok+0x143/0x2c0
skb_release_data+0x8e/0x140
? skb_release_data+0xad/0x140
kfree_skb+0x32/0xb0
[...]
skb_release_data+0xad/0x140
? skb_release_data+0xad/0x140
kfree_skb+0x32/0xb0
skb_release_data+0xad/0x140
? skb_release_data+0xad/0x140
kfree_skb+0x32/0xb0
skb_release_data+0xad/0x140
? skb_release_data+0xad/0x140
kfree_skb+0x32/0xb0
skb_release_data+0xad/0x140
? skb_release_data+0xad/0x140
kfree_skb+0x32/0xb0
skb_release_data+0xad/0x140
__kfree_skb+0xe/0x20
tcp_disconnect+0xd6/0x4d0
tcp_close+0xf4/0x430
? tcp_check_oom+0xf0/0xf0
tls_sk_proto_close+0xe4/0x1e0 [tls]
inet_release+0x36/0x60
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0xa2/0x1d0
task_work_run+0x89/0xb0
exit_to_usermode_loop+0x9a/0xa0
do_syscall_64+0xc0/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Let's leave the second unclone conditional, as I'm not entirely
sure what is its purpose :)
Fixes: 4e485d06bb ("strparser: Call skb_unclone conditionally")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David reports that tls triggers warnings related to
sk->sk_forward_alloc not being zero at destruction time:
WARNING: CPU: 5 PID: 6831 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110
WARNING: CPU: 5 PID: 6831 at net/ipv4/af_inet.c:160 inet_sock_destruct+0x15b/0x170
When sender fills up the write buffer and dies from
SIGPIPE. This is due to the device implementation
not cleaning up the partially_sent_record.
This is because commit a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
moved the partial record cleanup to the SW-only path.
Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f66de3ee2c ("net/tls: Split conf to rx + tx") made
freeing of IV and record sequence number conditional to SW
path only, but commit e8f6979981 ("net/tls: Add generic NIC
offload infrastructure") also allocates that state for the
device offload configuration. Remember to free it.
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Govindarajulu reported a regression with Network Manager which sends an
RTA_GATEWAY attribute with the address set to 0. Fixup the handling of
RTA_GATEWAY to only set fc_gw_family if the gateway address is actually
set.
Fixes: f35b794b3b ("ipv4: Prepare fib_config for IPv6 gateway")
Reported-by: Govindarajulu Varadarajan <govind.varadar@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This revert commit 46b1c18f9d ("net: sched: put back q.qlen into
a single location").
After the previous patch, when a NOLOCK qdisc is enslaved to a
locking qdisc it switches to global stats accounting. As a consequence,
when a classful qdisc accesses directly a child qdisc's qlen, such
qdisc is not doing per CPU accounting and qlen value is consistent.
In the control path nobody uses directly qlen since commit
e5f0e8f8e4 ("net: sched: introduce and use qdisc tree flush/purge
helpers"), so we can remove the contented atomic ops from the
datapath.
v1 -> v2:
- complete the qdisc_qstats_atomic_qlen_dec() ->
qdisc_qstats_cpu_qlen_dec() replacement, fix build issue
- more descriptive commit message
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since stats updating is always consistent with TCQ_F_CPUSTATS flag,
we can disable it at qdisc creation time flipping such bit.
In my experiments, if the NOLOCK flag is cleared, per CPU stats
accounting does not give any measurable performance gain, but it
waste some memory.
Let's clear TCQ_F_CPUSTATS together with NOLOCK, when enslaving
a NOLOCK qdisc to 'lock' one.
Use stats update helper inside pfifo_fast, to cope correctly with
TCQ_F_CPUSTATS flag change.
As a side effect, q.qlen value for any child qdiscs is always
consistent for all lock classfull qdiscs.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The core sched implementation checks independently for NOLOCK flag
to acquire/release the root spin lock and for qdisc_is_percpu_stats()
to account per CPU values in many places.
This change update the last few places checking the TCQ_F_NOLOCK to
do per CPU stats accounting according to qdisc_is_percpu_stats()
value.
The above allows to clean dev_requeue_skb() implementation a bit
and makes stats update always consistent with a single flag.
v1 -> v2:
- do not move qdisc_is_empty definition, fix build issue
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Such helper does not cope correctly with NOLOCK qdiscs.
In the following patches we will move back qlen to per CPU
values for such qdiscs, so qdisc_qlen_sum() is not an option,
too.
Instead, use qlen only for lock qdiscs, and always set
flow off for NOLOCK qdiscs with a not empty tx queue.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original patch neglected to take byte order conversions
into account, fix that.
Fixes: d9bb410888 ("mac80211: allow overriding HT STBC capabilities")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* iTXQ fixes from Felix
* tracing fix - increase message length
* fix SW_CRYPTO_CONTROL enforcement
* WMM rule handling for regdomain intersection
* max_interfaces in hwsim - reported by syzbot
* clear private data in some more commands
* a clang compiler warning fix
I added a patch with two new (unused) macros for
rate-limited printing to simplify getting the users
into the tree.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlyshYoACgkQB8qZga/f
l8Q1EQ//YO53GM06uvgdOlQkoF3D3uSLtPbJyxWcZCaJXjJ6jOOZsSrZZBLn7avf
i80UziS34ep0zdRHyVmGe6Q/VqufWUnzd6KBNdINEDdepOw6jFlBe9Ue+7mGwBnl
bqzJJ+rxQj1wVqMKEQMtznqCEX58ULiDKcPUPTG2Szw7moqMoTVaw7u2d3eX5mTS
Ka/rdsA/V9v4VOkBi//JqYmlZjdq83el8r4rshI2YbNdxwljTchK5YDsJreDawzF
X7pDjAou2EAskot2Pahz1qAEew0XUnSliTDRVTsgcsellnUHFsEGXytXQ7sp8Myy
kpGuBdeBqJuPWwKTbHfxiQb1VudpBWuMB2x/rotszgzrqSoVWot5j6/UyDSvMPyC
U/drsGJrN1iF/mYTn6G/r8sKPIaCxN7h02Z75IV7LBokGmz9Yt7srEjrOKt0uuK+
l5nb/YQNcN37lkREa6zLzec0f3wDzoqw1GWL0IuQyb+cFC2+6q5iFQ8dyXnu3FdQ
1cSJGJdnyG3bM0b42WrOcEOFt3EKmhb4UeuusoxkOd4aKuhBXRjVEzM+X6ADqTdR
mWvk7bEfYD0UMsqhtvrYBdBfquPuy3S4QzA0IMzxjk8bhCug233KpOPcLgVFjlzh
g7wCgfahl0TRhH7R1D3hnRUtcdsjifoa75XMAaDnWfnR9PiN6uI=
=2EV9
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2019-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Various fixes:
* iTXQ fixes from Felix
* tracing fix - increase message length
* fix SW_CRYPTO_CONTROL enforcement
* WMM rule handling for regdomain intersection
* max_interfaces in hwsim - reported by syzbot
* clear private data in some more commands
* a clang compiler warning fix
I added a patch with two new (unused) macros for
rate-limited printing to simplify getting the users
into the tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The RCU flavors have been consolidated, so this commit replaces a
comment's mention of call_rcu_bh() with call_rcu().
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: <netfilter-devel@vger.kernel.org>
Cc: <coreteam@netfilter.org>
Cc: <netdev@vger.kernel.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Restore SW_CRYPTO_CONTROL operation on AP_VLAN interfaces for unicast
keys, the original override was intended to be done for group keys as
those are treated specially by mac80211 and would always have been
rejected.
Now the situation is that AP_VLAN support must be enabled by the driver
if it can support it (meaning it can support software crypto GTK TX).
Thus, also simplify the code - if we get here with AP_VLAN and non-
pairwise key, software crypto must be used (driver doesn't know about
the interface) and can be used (driver must've advertised AP_VLAN if
it also uses SW_CRYPTO_CONTROL).
Fixes: db3bdcb9c3 ("mac80211: allow AP_VLAN operation on crypto controlled devices")
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
erspan_v6 tunnels run __iptunnel_pull_header on received skbs to remove
erspan header. This can determine a possible use-after-free accessing
pkt_md pointer in ip6erspan_rcv since the packet will be 'uncloned'
running pskb_expand_head if it is a cloned gso skb (e.g if the packet has
been sent though a veth device). Fix it resetting pkt_md pointer after
__iptunnel_pull_header
Fixes: 1d7e2ed22f ("net: erspan: refactor existing erspan code")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
erspan tunnels run __iptunnel_pull_header on received skbs to remove
gre and erspan headers. This can determine a possible use-after-free
accessing pkt_md pointer in erspan_rcv since the packet will be 'uncloned'
running pskb_expand_head if it is a cloned gso skb (e.g if the packet has
been sent though a veth device). Fix it resetting pkt_md pointer after
__iptunnel_pull_header
Fixes: 1d7e2ed22f ("net: erspan: refactor existing erspan code")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for RTA_VIA and allow an IPv6 nexthop for v4 routes:
$ ip ro add 172.16.1.0/24 via inet6 2001:db8::1 dev eth0
$ ip ro ls
...
172.16.1.0/24 via inet6 2001:db8::1 dev eth0
For convenience and simplicity, userspace can use RTA_VIA to specify
AF_INET or AF_INET6 gateway.
The common fib_nexthop_info dump function compares the gateway address
family to the nh_common family to know if the gateway should be encoded
as RTA_VIA or RTA_GATEWAY.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until support is added to the offload drivers, they need to be able to
reject routes with an IPv6 gateway. To that end add a flag to fib_info
that indicates if any fib_nh has a v6 gateway. The flag allows the drivers
to efficiently know the use of a v6 gateway without walking all fib_nh
tied to a fib_info each time a route is added.
Update mlxsw and rocker to reject the routes with extack message as to why.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update ipv4_confirm_neigh to handle an ipv6 gateway.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update bpf_ipv4_fib_lookup to handle an ipv6 gateway.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A common theme in the output path is looking up a neigh entry for a
nexthop, either the gateway in an rtable or a fallback to the daddr
in the skb:
nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
To allow the nexthop to be an IPv6 address we need to consider the
family of the nexthop and then call __ipv{4,6}_neigh_lookup_noref based
on it.
To make this simpler, add a ip_neigh_gw4 helper similar to ip_neigh_gw6
added in an earlier patch which handles:
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
And then add a second one, ip_neigh_for_gw, that calls either
ip_neigh_gw4 or ip_neigh_gw6 based on the address family of the gateway.
Update the output paths in the VRF driver and core v4 code to use
ip_neigh_for_gw simplifying the family based lookup and making both
ready for a v6 nexthop.
ipv4_neigh_lookup has a different need - the potential to resolve a
passed in address in addition to any gateway in the rtable or skb. Since
this is a one-off, add ip_neigh_gw4 and ip_neigh_gw6 diectly. The
difference between __neigh_create used by the helpers and neigh_create
called by ipv4_neigh_lookup is taking a refcount, so add rcu_read_lock_bh
and bump the refcnt on the neigh entry.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A later patch allows an IPv6 gateway with an IPv4 route. The neighbor
entry will exist in the v6 ndisc table and the cached header will contain
the ipv6 protocol which is wrong for an IPv4 packet. For an IPv4 packet to
use the v6 neighbor entry, neigh_output needs to skip the cached header
and just use the output callback for the neigh entry.
A future patchset can look at expanding the hh_cache to handle 2
protocols. For now, IPv6 gateways with an IPv4 route will take the
extra overhead of generating the header.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helper to use fib6_nh_init to validate a nexthop spec with an IPv6
gateway.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_check_nh is currently huge covering multiple uses cases - device only,
device + gateway, and device + gateway with ONLINK. The next patch adds
validation checks for IPv6 which only further complicates it. So, break
fib_check_nh into 2 helpers - one for gateway validation and one for device
only.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for an IPv6 gateway to fib_config. Since a gateway is either
IPv4 or IPv6, make it a union with fc_gw4 where fc_gw_family decides
which address is in use. Update current checks on family and gw4 to
handle ipv6 as well.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for an IPv6 gateway to rtable. Since a gateway is either
IPv4 or IPv6, make it a union with rt_gw4 where rt_gw_family decides
which address is in use.
When dumping the route data, encode an ipv6 nexthop using RTA_VIA.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to rtable, fib_config needs to allow the gateway to be either an
IPv4 or an IPv6 address. To that end, rename fc_gw to fc_gw4 to mean an
IPv4 address and add fc_gw_family. Checks on 'is a gateway set' are changed
to see if fc_gw_family is set. In the process prepare the code for a
fc_gw_family == AF_INET6.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To allow the gateway to be either an IPv4 or IPv6 address, remove
rt_uses_gateway from rtable and replace with rt_gw_family. If
rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
to rt_gw4 to represent the IPv4 version.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the gateway in a fib_nh_common to be from a different address
family than the outer fib{6}_nh. To that end, replace nhc_has_gw with
nhc_gw_family and update users of nhc_has_gw to check nhc_gw_family.
Now nhc_family is used to know if the nh_common is part of a fib_nh
or fib6_nh (used for container_of to get to route family specific data),
and nhc_gw_family represents the address family for the gateway.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ipv6 helpers to handle ndisc references via the stub. Update
bpf_ipv6_fib_lookup to use __ipv6_neigh_lookup_noref_stub instead of
the open code ___neigh_lookup_noref with the stub.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add fib6_nh_init and fib6_nh_release to ipv6_stubs. If fib6_nh_init fails,
callers should not invoke fib6_nh_release, so there is no reason to have
a dummy stub for the IPv6 is not enabled case.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add version option support to the nftables "osf" expression.
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
allows to redirect both ipv4 and ipv6 with a single rule in an
inet nat table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows use of a single masquerade rule in nat inet family
to handle both ipv4 and ipv6.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NF_NAT_NEEDED is true whenever nat support for either ipv4 or ipv6 is
enabled. Now that the af-specific nat configuration switches have been
removed, IS_ENABLED(CONFIG_NF_NAT) has the same effect.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
very little code, so it really doesn't make sense to have extra
modules or even a kconfig knob for this.
Merge them and make functionality available unconditionally.
The merge makes inet family route support trivial, so add it
as well here.
Before:
text data bss dec hex filename
835 832 0 1667 683 nft_chain_route_ipv4.ko
870 832 0 1702 6a6 nft_chain_route_ipv6.ko
111568 2556 529 114653 1bfdd nf_tables.ko
After:
text data bss dec hex filename
113133 2556 529 116218 1c5fa nf_tables.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We need minimal support from the nat core for this, as we do not
want to register additional base hooks.
When an inet hook is registered, interally register ipv4 and ipv6
hooks for them and unregister those when inet hooks are removed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ipip packets are blocked in some public cloud environments, this patch
allows gue encapsulation with the tunneling method, which would make
tunneling working in those environments.
Signed-off-by: Jacky Hu <hengqing.hu@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix sparse warning:
net/netfilter/nft_redir.c:85:5:
warning: symbol 'nft_redir_dump' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Function nf_tables_set_desc_parse parameter ctx is not being used
so remove it as it is redundant.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
there is a similar helper in net/netfilter/nf_tables_api.c,
this maybe become a common request someday, so move it to
time.c
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After commit a297569fe0 ("net/udp: do not touch skb->peeked unless
really needed") the 'peeked' argument of __skb_try_recv_datagram()
and friends is always equal to !!'flags & MSG_PEEK'.
Since such argument is really a boolean info, and the callers have
already 'flags & MSG_PEEK' handy, we can remove it and clean-up the
code a bit.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This interface allows the host driver to offload OWE processing
to user space. This intends to support OWE (Opportunistic Wireless
Encryption) AKM by the drivers that implement SME but rely on the
user space for the cryptographic/OWE processing in AP mode. Such
drivers are not capable of processing/deriving the DH IE.
A new NL80211 command - NL80211_CMD_UPDATE_OWE_INFO is introduced
to send the request/event between the host driver and user space.
Driver shall provide the OWE info (MAC address and DH IE) of
the peer to user space for cryptographic processing of the DH IE
through the event. Accordingly, the user space shall update the
OWE info/DH IE to the driver.
Following is the sequence in AP mode for OWE authentication.
Driver passes the OWE info obtained from the peer in the
Association Request to the user space through the event
cfg80211_update_owe_info_event. User space shall process the
OWE info received and generate new OWE info. This OWE info is
passed to the driver through NL80211_CMD_UPDATE_OWE_INFO
request. Driver eventually uses this OWE info to send the
Association Response to the peer.
This OWE info in the command interface carries the IEs that include
PMKID of the peer if the PMKSA is still valid or an updated DH IE
for generating a new PMKSA with the peer.
Signed-off-by: Liangwei Dong <liangwei@codeaurora.org>
Signed-off-by: Sunil Dutt <usdutt@codeaurora.org>
Signed-off-by: Srinivas Dasari <dasaris@codeaurora.org>
[remove policy initialization - no longer exists]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for mesh airtime link metric attribute
NL80211_STA_INFO_AIRTIME_LINK_METRIC.
Signed-off-by: Narayanraddi Masti <team.nmasti@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This commit adds the support to specify the RSSI thresholds per
band for each match set. This enhances the current behavior which
specifies a single rssi_threshold across all the bands by
introducing the rssi_threshold_per_band. These per band rssi
thresholds are referred through NL80211_BAND_* (enum nl80211_band)
variables as attribute types. Such attributes/values per each
band are nested through NL80211_ATTR_SCHED_SCAN_MIN_RSSI.
These band specific rssi thresholds shall take precedence over
the current rssi_thold per match set.
Drivers indicate this support through
%NL80211_EXT_FEATURE_SCHED_SCAN_BAND_SPECIFIC_RSSI_THOLD.
These per band rssi attributes/values does not specify
"default RSSI filter" as done by
NL80211_SCHED_SCAN_MATCH_ATTR_RSSI to stay backward compatible.
That said, these per band rssi values have to be specified for
the corresponding matchset.
Signed-off-by: vamsi krishna <vamsin@codeaurora.org>
Signed-off-by: Srinivas Dasari <dasaris@codeaurora.org>
[rebase on refactoring, add policy]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The sched scan code here is really deep - avoid one level
of indentation by short-circuiting the loop instead of
putting everything into the if block.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently there is no way for the driver to signal to mac80211 that it should
schedule a TXQ even if there are no packets on the mac80211 part of that queue.
This is problematic if the driver has an internal retry queue to deal with
software A-MPDU retry.
This patch changes the behavior of ieee80211_schedule_txq to always schedule
the queue, as its only user (ath9k) seems to expect such behavior already:
it calls this function on tx status and on powersave wakeup whenever its
internal retry queue is not empty.
Also add an extra argument to ieee80211_return_txq to get the same behavior.
This fixes an issue on ath9k where tx queues with packets to retry (and no
new packets in mac80211) would not get serviced.
Fixes: 89cea7493a ("ath9k: Switch to mac80211 TXQ scheduling and airtime APIs")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This structure is now only 4 bytes, so its more efficient
to cache a copy rather than its address.
No significant size difference in allmodconfig vmlinux.
With non-modular kernel that has all XFRM options enabled, this
series reduces vmlinux image size by ~11kb. All xfrm_mode
indirections are gone and all modes are built-in.
before (ipsec-next master):
text data bss dec filename
21071494 7233140 11104324 39408958 vmlinux.master
after this series:
21066448 7226772 11104324 39397544 vmlinux.patched
With allmodconfig kernel, the size increase is only 362 bytes,
even all the xfrm config options removed in this series are
modular.
before:
text data bss dec filename
15731286 6936912 4046908 26715106 vmlinux.master
after this series:
15731492 6937068 4046908 26715468 vmlinux
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
after previous changes, xfrm_mode contains no function pointers anymore
and all modules defining such struct contain no code except an init/exit
functions to register the xfrm_mode struct with the xfrm core.
Just place the xfrm modes core and remove the modules,
the run-time xfrm_mode register/unregister functionality is removed.
Before:
text data bss dec filename
7523 200 2364 10087 net/xfrm/xfrm_input.o
40003 628 440 41071 net/xfrm/xfrm_state.o
15730338 6937080 4046908 26714326 vmlinux
7389 200 2364 9953 net/xfrm/xfrm_input.o
40574 656 440 41670 net/xfrm/xfrm_state.o
15730084 6937068 4046908 26714060 vmlinux
The xfrm*_mode_{transport,tunnel,beet} modules are gone.
v2: replace CONFIG_INET6_XFRM_MODE_* IS_ENABLED guards with CONFIG_IPV6
ones rather than removing them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Adds an EXPORT_SYMBOL for afinfo_get_rcu, as it will now be called from
ipv6 in case of CONFIG_IPV6=m.
This change has virtually no effect on vmlinux size, but it reduces
afinfo size and allows followup patch to make xfrm modes const.
v2: mark if (afinfo) tests as likely (Sabrina)
re-fetch afinfo according to inner_mode in xfrm_prepare_input().
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
similar to previous patch: no external module dependencies,
so we can avoid the indirection by placing this in the core.
This change removes the last indirection from xfrm_mode and the
xfrm4|6_mode_{beet,tunnel}.c modules contain (almost) no code anymore.
Before:
text data bss dec hex filename
3957 136 0 4093 ffd net/xfrm/xfrm_output.o
587 44 0 631 277 net/ipv4/xfrm4_mode_beet.o
649 32 0 681 2a9 net/ipv4/xfrm4_mode_tunnel.o
625 44 0 669 29d net/ipv6/xfrm6_mode_beet.o
599 32 0 631 277 net/ipv6/xfrm6_mode_tunnel.o
After:
text data bss dec hex filename
5359 184 0 5543 15a7 net/xfrm/xfrm_output.o
171 24 0 195 c3 net/ipv4/xfrm4_mode_beet.o
171 24 0 195 c3 net/ipv4/xfrm4_mode_tunnel.o
172 24 0 196 c4 net/ipv6/xfrm6_mode_beet.o
172 24 0 196 c4 net/ipv6/xfrm6_mode_tunnel.o
v2: fold the *encap_add functions into xfrm*_prepare_output
preserve (move) output2 comment (Sabrina)
use x->outer_mode->encap, not inner
fix a build breakage on ppc (kbuild robot)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
No external dependencies on any module, place this in the core.
Increase is about 1800 byte for xfrm_input.o.
The beet helpers get added to internal header, as they can be reused
from xfrm_output.c in the next patch (kernel contains several
copies of them in the xfrm{4,6}_mode_beet.c files).
Before:
text data bss dec filename
5578 176 2364 8118 net/xfrm/xfrm_input.o
1180 64 0 1244 net/ipv4/xfrm4_mode_beet.o
171 40 0 211 net/ipv4/xfrm4_mode_transport.o
1163 40 0 1203 net/ipv4/xfrm4_mode_tunnel.o
1083 52 0 1135 net/ipv6/xfrm6_mode_beet.o
172 40 0 212 net/ipv6/xfrm6_mode_ro.o
172 40 0 212 net/ipv6/xfrm6_mode_transport.o
1056 40 0 1096 net/ipv6/xfrm6_mode_tunnel.o
After:
text data bss dec filename
7373 200 2364 9937 net/xfrm/xfrm_input.o
587 44 0 631 net/ipv4/xfrm4_mode_beet.o
171 32 0 203 net/ipv4/xfrm4_mode_transport.o
649 32 0 681 net/ipv4/xfrm4_mode_tunnel.o
625 44 0 669 net/ipv6/xfrm6_mode_beet.o
172 32 0 204 net/ipv6/xfrm6_mode_ro.o
172 32 0 204 net/ipv6/xfrm6_mode_transport.o
599 32 0 631 net/ipv6/xfrm6_mode_tunnel.o
v2: pass inner_mode to xfrm_inner_mode_encap_remove to fix
AF_UNSPEC selector breakage (bisected by Benedict Wong)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
There are only two versions (tunnel and transport). The ip/ipv6 versions
are only differ in sizeof(iphdr) vs ipv6hdr.
Place this in the core and use x->outer_mode->encap type to call the
correct adjustment helper.
Before:
text data bss dec filename
15730311 6937008 4046908 26714227 vmlinux
After:
15730428 6937008 4046908 26714344 vmlinux
(about 117 byte increase)
v2: use family from x->outer_mode, not inner
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Same is input indirection. Only exception: we need to export
xfrm_outer_mode_output for pktgen.
Increases size of vmlinux by about 163 byte:
Before:
text data bss dec filename
15730208 6936948 4046908 26714064 vmlinux
After:
15730311 6937008 4046908 26714227 vmlinux
xfrm_inner_extract_output has no more external callers, make it static.
v2: add IS_ENABLED(IPV6) guard in xfrm6_prepare_output
add two missing breaks in xfrm_outer_mode_output (Sabrina Dubroca)
add WARN_ON_ONCE for 'call AF_INET6 related output function, but
CONFIG_IPV6=n' case.
make xfrm_inner_extract_output static
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
No need for any indirection or abstraction here, both functions
are pretty much the same and quite small, they also have no external
dependencies.
xfrm_prepare_input can then be made static.
With allmodconfig build, size increase of vmlinux is 25 byte:
Before:
text data bss dec filename
15730207 6936924 4046908 26714039 vmlinux
After:
15730208 6936948 4046908 26714064 vmlinux
v2: Fix INET_XFRM_MODE_TRANSPORT name in is-enabled test (Sabrina Dubroca)
change copied comment to refer to transport and network header,
not skb->{h,nh}, which don't exist anymore. (Sabrina)
make xfrm_prepare_input static (Eyal Birger)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Now that we have the family available directly in the
xfrm_mode struct, we can use that and avoid one extra dereference.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This will be useful to know if we're supposed to decode ipv4 or ipv6.
While at it, make the unregister function return void, all module_exit
functions did just BUG(); there is never a point in doing error checks
if there is no way to handle such error.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
John reports:
Recent refactoring of fl_change aims to use the classifier spinlock to
avoid the need for rtnl lock. In doing so, the fl_hw_replace_filer()
function was moved to before the lock is taken. This can create problems
for drivers if duplicate filters are created (commmon in ovs tc offload
due to filters being triggered by user-space matches).
Drivers registered for such filters will now receive multiple copies of
the same rule, each with a different cookie value. This means that the
drivers would need to do a full match field lookup to determine
duplicates, repeating work that will happen in flower __fl_lookup().
Currently, drivers do not expect to receive duplicate filters.
To fix this, verify that filter with same key is not present in flower
classifier hash table and insert the new filter to the flower hash table
before offloading it to hardware. Implement helper function
fl_ht_insert_unique() to atomically verify/insert a filter.
This change makes filter visible to fast path at the beginning of
fl_change() function, which means it can no longer be freed directly in
case of error. Refactor fl_change() error handling code to deallocate the
filter with rcu timeout.
Fixes: 620da48608 ("net: sched: flower: refactor fl_change")
Reported-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
bucket pointer to lock the hash chain for that bucket.
The benefits of a bit spin_lock are:
- no need to allocate a separate array of locks.
- no need to have a configuration option to guide the
choice of the size of this array
- locking cost is often a single test-and-set in a cache line
that will have to be loaded anyway. When inserting at, or removing
from, the head of the chain, the unlock is free - writing the new
address in the bucket head implicitly clears the lock bit.
For __rhashtable_insert_fast() we ensure this always happens
when adding a new key.
- even when lockings costs 2 updates (lock and unlock), they are
in a cacheline that needs to be read anyway.
The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.
Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend of the same lock, one CPU can
easily be starved. This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.
As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.
To enhance type checking, a new struct is introduced to represent the
pointer plus lock-bit
that is stored in the bucket-table. This is "struct rhash_lock_head"
and is empty. A pointer to this needs to be cast to either an
unsigned lock, or a "struct rhash_head *" to be useful.
Variables of this type are most often called "bkt".
Previously "pprev" would sometimes point to a bucket, and sometimes a
->next pointer in an rhash_head. As these are now different types,
pprev is NULL when it would have pointed to the bucket. In that case,
'blk' is used, together with correct locking protocol.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
HSR should forget nodes after configured node forget time expiry based
on HSR_NODE_FORGET_TIME. As part of hsr_prune_nodes(), code checks to
see if entries are to be flushed out if not heard for longer than forget
time. But currently hsr_prune_nodes() is called only once during device
creation. Restart the timer at the end of hsr_prune_nodes() so that
hsr_prune_nodes() gets called periodically and forgotten entries are
removed from node table.
Signed-off-by: Aaron Kramer <a-kramer@ti.com>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a debugfs interface to allow display the nodes learned
by the hsr master.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use SPDX-License-Identifier instead of a verbose license text.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a blank line after function declaration as suggested by
checkpatch.pl -f
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current driver code uses camel case in many places. This is
seen when ran checkpatch.pl -f on files under net/hsr. This
patch fixes the code to remove camel case usage.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add missing space around operator in code. This is
seen when ran checkpatch.pl -f on files under net/hsr.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a multi-line statement exceeding 80 characters, logical operator
should be at the end of a line instead of being at the start. This
is seen when ran checkpatch.pl -f on files under net/hsr. The change
is per suggestion from checkpatch.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes unnecessary space after a cast. This is seen
when ran checkpatch.pl -f on files under net/hsr.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces all instance of NULL checks such as
if (foo == NULL) with if (!foo)
Also
if (foo != NULL) with if (foo)
This is seen when ran checkpatch.pl -f on files under net/hsr
and suggestion is to replace as above.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes function calls that ends with '(' in a line.
This is seen when ran checkpatch.pl -f option on files under
net/hsr.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes alignment issues in code for functions. This is
seen when ran checkpatch.pl -f option on files under net/hsr.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes unnecessary paranthesis from the code. This is
seen when ran checkpatch.pl -f option on files under net/hsr.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes multiple blank lines in the code. This is seen
when ran checkpatch.pl -f option for files under net/hsr
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes lines exceeding 80 characters. This is seen
when ran checkpatch.pl with -f option for files under
net/hsr.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The non-null check on tskb is always false because it is in an else
path of a check on tskb and hence tskb is null in this code block.
This is check is therefore redundant and can be removed as well
as the label coalesc.
if (tsbk) {
...
} else {
...
if (unlikely(!skb)) {
if (tskb) /* can never be true, redundant code */
goto coalesc;
return;
}
}
Addresses-Coverity: ("Logically dead code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is similar to commit 674d9de02a ("NFC: Fix possible memory
corruption when handling SHDLC I-Frame commands").
I'm not totally sure, but I think that commit description may have
overstated the danger. I was under the impression that this data came
from the firmware? If you can't trust your networking firmware, then
you're already in trouble.
Anyway, these days we add bounds checking where ever we can and we call
it kernel hardening. Better safe than sorry.
Fixes: 11f54f2286 ("NFC: nci: Add HCI over NCI protocol support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent commit added a call to cache_fresh_locked()
when an expired item was found.
The call sets the CACHE_VALID flag, so it is important
that the item actually is valid.
There are two ways it could be valid:
1/ If ->update has been called to fill in relevant content
2/ if CACHE_NEGATIVE is set, to say that content doesn't exist.
An expired item that is waiting for an update will be neither.
Setting CACHE_VALID will mean that a subsequent call to cache_put()
will be likely to dereference uninitialised pointers.
So we must make sure the item is valid, and we already have code to do
that in try_to_negate_entry(). This takes the hash lock and so cannot
be used directly, so take out the two lines that we need and use them.
Now cache_fresh_locked() is certain to be called only on
a valid item.
Cc: stable@kernel.org # 2.6.35
Fixes: 4ecd55ea07 ("sunrpc: fix cache_head leak due to queued request")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Minor comment merge conflict in mlx5.
Staging driver has a fixup due to the skb->xmit_more changes
in 'net-next', but was removed in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Several hash table refcount fixes in batman-adv, from Sven
Eckelmann.
2) Use after free in bpf_evict_inode(), from Daniel Borkmann.
3) Fix mdio bus registration in ixgbe, from Ivan Vecera.
4) Unbounded loop in __skb_try_recv_datagram(), from Paolo Abeni.
5) ila rhashtable corruption fix from Herbert Xu.
6) Don't allow upper-devices to be added to vrf devices, from Sabrina
Dubroca.
7) Add qmi_wwan device ID for Olicard 600, from Bjørn Mork.
8) Don't leave skb->next poisoned in __netif_receive_skb_list_ptype,
from Alexander Lobakin.
9) Missing IDR checks in mlx5 driver, from Aditya Pakki.
10) Fix false connection termination in ktls, from Jakub Kicinski.
11) Work around some ASPM issues with r8169 by disabling rx interrupt
coalescing on certain chips. From Heiner Kallweit.
12) Properly use per-cpu qstat values on NOLOCK qdiscs, from Paolo
Abeni.
13) Fully initialize sockaddr_in structures in SCTP, from Xin Long.
14) Various BPF flow dissector fixes from Stanislav Fomichev.
15) Divide by zero in act_sample, from Davide Caratti.
16) Fix bridging multicast regression introduced by rhashtable
conversion, from Nikolay Aleksandrov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
ibmvnic: Fix completion structure initialization
ipv6: sit: reset ip header pointer in ipip6_rcv
net: bridge: always clear mcast matching struct on reports and leaves
libcxgb: fix incorrect ppmax calculation
vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x
sch_cake: Make sure we can write the IP header before changing DSCP bits
sch_cake: Use tc_skb_protocol() helper for getting packet protocol
tcp: Ensure DCTCP reacts to losses
net/sched: act_sample: fix divide by zero in the traffic path
net: thunderx: fix NULL pointer dereference in nicvf_open/nicvf_stop
net: hns: Fix sparse: some warnings in HNS drivers
net: hns: Fix WARNING when remove HNS driver with SMMU enabled
net: hns: fix ICMP6 neighbor solicitation messages discard problem
net: hns: Fix probabilistic memory overwrite when HNS driver initialized
net: hns: Use NAPI_POLL_WEIGHT for hns driver
net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()
flow_dissector: rst'ify documentation
ipv6: Fix dangling pointer when ipv6 fragment
net-gro: Fix GRO flush when receiving a GSO packet.
flow_dissector: document BPF flow dissector environment
...
In commit 0ae955e2656d ("tipc: improve TIPC throughput by Gap ACK
blocks"), we enhance the link transmq by releasing as many packets as
possible with the multi-ACKs from peer node. This also means the queue
is now non-linear and the peer link deferdq becomes vital.
Whereas, in the case of link failover, all messages in the link transmq
need to be transmitted as tunnel messages in such a way that message
sequentiality and cardinality per sender is preserved. This requires us
to maintain the link deferdq somehow, so that when the tunnel messages
arrive, the inner user messages along with the ones in the deferdq will
be delivered to upper layer correctly.
The commit accomplishes this by defining a new queue in the TIPC link
structure to hold the old link deferdq when link failover happens and
process it upon receipt of tunnel messages.
Also, in the case of link syncing, the link deferdq will not be purged
to avoid unnecessary retransmissions that in the worst case will fail
because the packets might have been freed on the sending side.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
For unicast transmission, the current NACK sending althorithm is over-
active that forces the sending side to retransmit a packet that is not
really lost but just arrived at the receiving side with some delay, or
even retransmit same packets that have already been retransmitted
before. As a result, many duplicates are observed also under normal
condition, ie. without packet loss.
One example case is: node1 transmits 1 2 3 4 10 5 6 7 8 9, when node2
receives packet #10, it puts into the deferdq. When the packet #5 comes
it sends NACK with gap [6 - 9]. However, shortly after that, when
packet #6 arrives, it pulls out packet #10 from the deferfq, but it is
still out of order, so it makes another NACK with gap [7 - 9] and so on
... Finally, node1 has to retransmit the packets 5 6 7 8 9 a number of
times, but in fact all the packets are not lost at all, so duplicates!
This commit reduces duplicates by changing the condition to send NACK,
also restricting the retransmissions on individual packets via a timer
of about 1ms. However, it also needs to say that too tricky condition
for NACKs or too long timeout value for retransmissions will result in
performance reducing! The criterias in this commit are found to be
effective for both the requirements to reduce duplicates but not affect
performance.
The tipc_link_rcv() is also improved to only dequeue skb from the link
deferdq if it is expected (ie. its seqno <= rcv_nxt).
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
During unicast link transmission, it's observed very often that because
of one or a few lost/dis-ordered packets, the sending side will fastly
reach the send window limit and must wait for the packets to be arrived
at the receiving side or in the worst case, a retransmission must be
done first. The sending side cannot release a lot of subsequent packets
in its transmq even though all of them might have already been received
by the receiving side.
That is, one or two packets dis-ordered/lost and dozens of packets have
to wait, this obviously reduces the overall throughput!
This commit introduces an algorithm to overcome this by using "Gap ACK
blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
that describes the link deferdq where packets have been got by the
receiving side but with gaps, for example:
link deferdq: [1 2 3 4 10 11 13 14 15 20]
--> Gap ACK blocks: <4, 5>, <11, 1>, <15, 4>, <20, 0>
The Gap ACK blocks will be sent to the sending side along with the
traditional ACK or NACK message. Immediately when receiving the message
the sending side will now not only release from its transmq the packets
ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
be enqueued and transmitted.
In addition, the sending side can now do "multi-retransmissions"
according to the Gaps reported in the Gap ACK blocks.
The new algorithm as verified helps greatly improve the TIPC throughput
especially under packet loss condition.
So far, a maximum of 32 blocks is quite enough without any "Too few Gap
ACK blocks" reports with a 5.0% packet loss rate, however this number
can be increased in the furture if needed.
Also, the patch is backward compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the mcast conversion to rhashtable this function has been unused, so
remove it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to be careful and always zero the whole br_ip struct when it is
used for matching since the rhashtable change. This patch fixes all the
places which didn't properly clear it which in turn might've caused
mismatches.
Thanks for the great bug report with reproducing steps and bisection.
Steps to reproduce (from the bug report):
ip link add br0 type bridge mcast_querier 1
ip link set br0 up
ip link add v2 type veth peer name v3
ip link set v2 master br0
ip link set v2 up
ip link set v3 up
ip addr add 3.0.0.2/24 dev v3
ip netns add test
ip link add v1 type veth peer name v1 netns test
ip link set v1 master br0
ip link set v1 up
ip -n test link set v1 up
ip -n test addr add 3.0.0.1/24 dev v1
# Multicast receiver
ip netns exec test socat
UDP4-RECVFROM:5588,ip-add-membership=224.224.224.224:3.0.0.1,fork -
# Multicast sender
echo hello | nc -u -s 3.0.0.2 224.224.224.224 5588
Reported-by: liam.mcbirnie@boeing.com
Fixes: 19e3a9c90c ("net: bridge: convert multicast to generic rhashtable")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linux currently disable ECN for incoming connections when the SYN
requests ECN and the IP header has ECT(0)/ECT(1) set, as some
networks were reportedly mangling the ToS byte, hence could later
trigger false congestion notifications.
RFC8311 §4.3 relaxes RFC3168's requirements such that ECT can be set
one TCP control packets (including SYNs). The main benefit of this
is the decreased probability of losing a SYN in a congested
ECN-capable network (i.e., it avoids the initial 1s timeout).
Additionally, this allows the development of newer TCP extensions,
such as AccECN.
This patch relaxes the previous check, by enabling ECN on incoming
connections using SYN+ECT if at least one bit of the reserved flags
of the TCP header is set. Such bit would indicate that the sender of
the SYN is using a newer TCP feature than what the host implements,
such as AccECN, and is thus implementing RFC8311. This enables
end-hosts not supporting such extensions to still negociate ECN, and
to have some of the benefits of using ECN on control packets.
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Suggested-by: Bob Briscoe <research@bobbriscoe.net>
Cc: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently if the driver registers devlink port instance, he should set
the devlink port attributes as well. Then the devlink core is able to
obtain switch id itself, no need for driver to implement the ndo.
Once all drivers will implement devlink port registration, this ndo
should be removed. This warning guides new drivers to do things as
they should be done.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the switch ID down the to devlink through devlink_port_attrs_set()
so it can be used by devlink_compat_switch_id_get(). Leave
ndo_get_port_parent_id implementation only for legacy.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce devlink_compat_switch_id_get() helper which fills up switch_id
according to passed netdev pointer. Call it directly from
dev_get_port_parent_id() as a fallback when ndo_get_port_parent_id
is not defined for given netdev.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend devlink_port_attrs_set() to pass switch ID for ports which are
part of switch and store it in port attrs. For other ports, this is
NULL.
Note that this allows the driver to group devlink ports into one or more
switches according to the actual topology.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can optimize the fdb convergence when a backup_port is present by not
immediately flushing the entries of the stopped port since traffic for
those entries will flow towards the backup_port.
There are 2 cases specifically that benefit most:
- when the stopped port comes up before the entries expire by themselves
- when there's an external entry refresh and they're kept while the
backup_port is operating (e.g. mlag)
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb somehow dequeued out of inputq before processing, it causes to
NULL pointer and kernel crashed.
Add checking skb valid before using.
Fixes: c55c8edafa ("tipc: smooth change between replicast and broadcast")
Reported-by: Tuong Lien Tong <tuong.t.lien@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes to TC flower remove the requirement for rtnl lock when
accessing and modifying filters. Refcounts now ensure access and deletion
do not happen concurrently. However, the reoffload function which cycles
through all filters and replays them to registered hw drivers is not
protected.
Use the fl_get_next_filter() function to cycle the filters for reoffload
and ensure the ref taken by this function is put when done with each
filter.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Way back in 3c9c36bced the
ndo_fcoe_get_wwn pointer was switched from depending on CONFIG_FCOE to
CONFIG_LIBFCOE in order to allow building FCoE support into the bnx2x
driver and used by bnx2fc without including the generic software fcoe
module.
But, FCoE is generally used over an 802.1q VLAN, and the implementation
of ndo_fcoe_get_wwn in the 8021q module was not similarly changed. The
result is that if CONFIG_FCOE is disabled, then bnz2fc cannot make a
call to ndo_fcoe_get_wwn through the 8021q interface to the underlying
bnx2x interface. The bnx2fc driver then falls back to a potentially
different mapping of Ethernet MAC to Fibre Channel WWN, creating an
incompatibility with the fabric and target configurations when compared
to the WWNs used by pre-boot firmware and differently-configured
kernels.
So make the conditional inclusion of FCoE code in 8021q match the
conditional inclusion in netdevice.h
Signed-off-by: Chris Leech <cleech@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-04-04
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Batch of fixes to the existing BPF flow dissector API to support
calling BPF programs from the eth_get_headlen context (support for
latter is planned to be added in bpf-next), from Stanislav.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There is not actually any guarantee that the IP headers are valid before we
access the DSCP bits of the packets. Fix this using the same approach taken
in sch_dsmark.
Reported-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We shouldn't be using skb->protocol directly as that will miss cases with
hardware-accelerated VLAN tags. Use the helper instead to get the right
protocol number.
Reported-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC8257 §3.5 explicitly states that "A DCTCP sender MUST react to
loss episodes in the same way as conventional TCP".
Currently, Linux DCTCP performs no cwnd reduction when losses
are encountered. Optionally, the dctcp_clamp_alpha_on_loss resets
alpha to its maximal value if a RTO happens. This behavior
is sub-optimal for at least two reasons: i) it ignores losses
triggering fast retransmissions; and ii) it causes unnecessary large
cwnd reduction in the future if the loss was isolated as it resets
the historical term of DCTCP's alpha EWMA to its maximal value (i.e.,
denoting a total congestion). The second reason has an especially
noticeable effect when using DCTCP in high BDP environments, where
alpha normally stays at low values.
This patch replace the clamping of alpha by setting ssthresh to
half of cwnd for both fast retransmissions and RTOs, at most once
per RTT. Consequently, the dctcp_clamp_alpha_on_loss module parameter
has been removed.
The table below shows experimental results where we measured the
drop probability of a PIE AQM (not applying ECN marks) at a
bottleneck in the presence of a single TCP flow with either the
alpha-clamping option enabled or the cwnd halving proposed by this
patch. Results using reno or cubic are given for comparison.
| Link | RTT | Drop
TCP CC | speed | base+AQM | probability
==================|=========|==========|============
CUBIC | 40Mbps | 7+20ms | 0.21%
RENO | | | 0.19%
DCTCP-CLAMP-ALPHA | | | 25.80%
DCTCP-HALVE-CWND | | | 0.22%
------------------|---------|----------|------------
CUBIC | 100Mbps | 7+20ms | 0.03%
RENO | | | 0.02%
DCTCP-CLAMP-ALPHA | | | 23.30%
DCTCP-HALVE-CWND | | | 0.04%
------------------|---------|----------|------------
CUBIC | 800Mbps | 1+1ms | 0.04%
RENO | | | 0.05%
DCTCP-CLAMP-ALPHA | | | 18.70%
DCTCP-HALVE-CWND | | | 0.06%
We see that, without halving its cwnd for all source of losses,
DCTCP drives the AQM to large drop probabilities in order to keep
the queue length under control (i.e., it repeatedly faces RTOs).
Instead, if DCTCP reacts to all source of losses, it can then be
controlled by the AQM using similar drop levels than cubic or reno.
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Cc: Bob Briscoe <research@bobbriscoe.net>
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <borkmann@iogearbox.net>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify this code by updating bridge multicast stats from
maybe_deliver().
Note that commit 6db6f0eae6 ("bridge: multicast to unicast"), in case
the port flag BR_MULTICAST_TO_UNICAST is set, never updates the previous
port pointer, therefore it is always going to be different from the
existing port in this deduplicated list iteration.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just like 46cfd725c3 ("net: use kfree_skb_list() helper in more places").
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export fib_nexthop_info and fib_add_nexthop for use by IPv6 code.
Remove rt6_nexthop_info and rt6_add_nexthop in favor of the IPv4
versions. Update fib_nexthop_info for IPv6 linkdown check and
RTA_GATEWAY for AF_INET6.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the exception of the nexthop weight, the nexthop attributes used by
fib_nexthop_info and fib_add_nexthop come from the fib_nh_common struct.
Update both to use it and change fib_nexthop_info to check the family
as needed.
nexthop weight comes from the common struct for existing use cases, but
for nexthop groups the weight is outside of the fib_nh_common to allow
the same nexthop definition to be used in multiple groups with different
weights.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to ipv6, move addition of nexthop attributes to dump
message into helpers that are called for both single path and
multipath routes. Align the new helpers to the IPv6 variant
which most notably means computing the flags argument based on
settings in nh_flags.
The RTA_FLOW argument is unique to IPv4, so it is appended after
the new fib_nexthop_info helper. The intent of a later patch is to
make both fib_nexthop_info and fib_add_nexthop usable for both IPv4
and IPv6. This patch is stepping stone in that direction.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of the ipv4 code only needs data from fib_nh_common. Add
fib_nh_common selection to fib_result and update users to use it.
Right now, fib_nh_common in fib_result will point to a fib_nh struct
that is embedded within a fib_info:
fib_info --> fib_nh
fib_nh
...
fib_nh
^
fib_result->nhc ----+
Later, nhc can point to a fib_nh within a nexthop struct:
fib_info --> nexthop --> fib_nh
^
fib_result->nhc ---------------+
or for a nexthop group:
fib_info --> nexthop --> nexthop --> fib_nh
nexthop --> fib_nh
...
nexthop --> fib_nh
^
fib_result->nhc ---------------------------+
In all cases nhsel within fib_result will point to which leg in the
multipath route is used.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update fib_table_lookup tracepoint to take a fib_nh_common struct and
dump the v6 gateway address if the nexthop uses it.
Over the years saddr has not proven useful and the output of the
tracepoint produces very long lines. Since saddr is not part of
fib_nh_common, drop it. If it needs to be added later, fib_nh which
contains saddr can be obtained from a fib_nh_common via container_of.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the beginning of ip6_fragment func, the prevhdr pointer is
obtained in the ip6_find_1stfragopt func.
However, all the pointers pointing into skb header may change
when calling skb_checksum_help func with
skb->ip_summed = CHECKSUM_PARTIAL condition.
The prevhdr pointe will be dangling if it is not reloaded after
calling __skb_linearize func in skb_checksum_help func.
Here, I add a variable, nexthdr_offset, to evaluate the offset,
which does not changes even after calling __skb_linearize func.
Fixes: 405c92f7a5 ("ipv6: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wenhao Zhang <zhangwenhao8@huawei.com>
Reported-by: syzbot+e8ce541d095e486074fc@syzkaller.appspotmail.com
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we may merge incorrectly a received GSO packet
or a packet with frag_list into a packet sitting in the
gro_hash list. skb_segment() may crash case because
the assumptions on the skb layout are not met.
The correct behaviour would be to flush the packet in the
gro_hash list and send the received GSO packet directly
afterwards. Commit d61d072e87 ("net-gro: avoid reorders")
sets NAPI_GRO_CB(skb)->flush in this case, but this is not
checked before merging. This patch makes sure to check this
flag and to not merge in that case.
Fixes: d61d072e87 ("net-gro: avoid reorders")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
This patch fixes the following warning:
net/rxrpc/local_object.c: In function ‘rxrpc_open_socket’:
net/rxrpc/local_object.c:175:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (ret < 0) {
^
net/rxrpc/local_object.c:184:2: note: here
case AF_INET:
^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
Currently, GCC is expecting to find the fall-through annotations
at the very bottom of the case and on its own line. That's why
I had to add the annotation, although the intentional fall-through
is already mentioned in a few lines above.
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use whitelist instead of a blacklist and allow only a small set of
fields that might be relevant in the context of flow dissector:
* data
* data_end
* flow_keys
This is required for the eth_get_headlen case where we have only a
chunk of data to dissect (i.e. trying to read the other skb fields
doesn't make sense).
Note, that it is a breaking API change! However, we've provided
flow_keys->n_proto as a substitute for skb->protocol; and there is
no need to manually handle skb->vlan_present. So even if we
break somebody, the migration is trivial. Unfortunately, we can't
support eth_get_headlen use-case without those breaking changes.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Don't allow BPF program to set flow_keys->nhoff to less than initial
value. We currently don't read the value afterwards in anything but
the tests, but it's still a good practice to return consistent
values to the test programs.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This is a preparation for the next commit that would prohibit access to
the most fields of __sk_buff from the BPF programs.
Instead of requiring BPF flow dissector programs to look into skb,
pass all input data in the flow_keys.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Action tunnel_key doesn't have a metadata/tunnel for release(decap) action.
Drivers do not dereference entry->tunnel pointer for that action type, so
this behavior doesn't result in a crash at the moment. However, this needs
to be corrected as a preparation for updating hardware offloads API to not
rely on rtnl lock, for which flow_action code will copy the tunnel data to
temporary buffer to prevent concurrent action overwrite from
invalidating/freeing it.
Fixes: 3a7b68617d ("cls_api: add translator to flow_action representation")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The device type for ip6 tunnels is set to
ARPHRD_TUNNEL6. However, the ip4ip6_err function
is expecting the device type of the tunnel to be
ARPHRD_TUNNEL. Since the device types do not
match, the function exits and the ICMP error
packet is not sent to the originating host. Note
that the device type for IPv4 tunnels is set to
ARPHRD_TUNNEL.
Fix is to expect a tunnel device type of
ARPHRD_TUNNEL6 instead. Now the tunnel device
type matches and the ICMP error packet is sent
to the originating host.
Signed-off-by: Sheena Mira-ato <sheena.mira-ato@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
We free "ct_info->ct" and then use it on the next line when we pass it
to nf_ct_destroy_timeout(). This patch swaps the order to avoid the use
after free.
Fixes: 06bd2bdf19 ("openvswitch: Add timeout support to ct action")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently don't reload pointers pointing into skb header
after doing pskb_may_pull() in _decode_session4(). So in case
pskb_may_pull() changed the pointers, we read from random
memory. Fix this by putting all the needed infos on the
stack, so that we don't need to access the header pointers
after doing pskb_may_pull().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This fills a hole in softnet data, so no change in structure size.
Also prepares for xmit_more placement in the same spot;
skb->xmit_more will be removed in followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
If dccp_feat_push_change fails, we forget free the mem
which is alloced by kmemdup in dccp_feat_clone_sp_val.
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: e8ef967a54 ("dccp: Registration routines for changing feature values")
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Eric noticed, in .sk_destruct, sk->sk_dst_cache update is prevented, and
no barrier is needed for this. So change to use rcu_dereference_protected()
instead of rcu_dereference_check() to fetch sk_dst_cache in there.
v1->v2:
- no change, repost after net-next is open.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot report a kernel-infoleak:
BUG: KMSAN: kernel-infoleak in _copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
Call Trace:
_copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
copy_to_user include/linux/uaccess.h:174 [inline]
sctp_getsockopt_peer_addrs net/sctp/socket.c:5911 [inline]
sctp_getsockopt+0x1668e/0x17f70 net/sctp/socket.c:7562
...
Uninit was stored to memory at:
sctp_transport_init net/sctp/transport.c:61 [inline]
sctp_transport_new+0x16d/0x9a0 net/sctp/transport.c:115
sctp_assoc_add_peer+0x532/0x1f70 net/sctp/associola.c:637
sctp_process_param net/sctp/sm_make_chunk.c:2548 [inline]
sctp_process_init+0x1a1b/0x3ed0 net/sctp/sm_make_chunk.c:2361
...
Bytes 8-15 of 16 are uninitialized
It was caused by that th _pad field (the 8-15 bytes) of a v4 addr (saved in
struct sockaddr_in) wasn't initialized, but directly copied to user memory
in sctp_getsockopt_peer_addrs().
So fix it by calling memset(addr->v4.sin_zero, 0, 8) to initialize _pad of
sockaddr_in before copying it to user memory in sctp_v4_addr_to_user(), as
sctp_v6_addr_to_user() does.
Reported-by: syzbot+86b5c7c236a22616a72f@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Tested-by: Alexander Potapenko <glider@google.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When kcm is loaded while many processes try to create a KCM socket, a
crash occurs:
BUG: unable to handle kernel NULL pointer dereference at 000000000000000e
IP: mutex_lock+0x27/0x40 kernel/locking/mutex.c:240
PGD 8000000016ef2067 P4D 8000000016ef2067 PUD 3d6e9067 PMD 0
Oops: 0002 [#1] SMP KASAN PTI
CPU: 0 PID: 7005 Comm: syz-executor.5 Not tainted 4.12.14-396-default #1 SLE15-SP1 (unreleased)
RIP: 0010:mutex_lock+0x27/0x40 kernel/locking/mutex.c:240
RSP: 0018:ffff88000d487a00 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 000000000000000e RCX: 1ffff100082b0719
...
CR2: 000000000000000e CR3: 000000004b1bc003 CR4: 0000000000060ef0
Call Trace:
kcm_create+0x600/0xbf0 [kcm]
__sock_create+0x324/0x750 net/socket.c:1272
...
This is due to race between sock_create and unfinished
register_pernet_device. kcm_create tries to do "net_generic(net,
kcm_net_id)". but kcm_net_id is not initialized yet.
So switch the order of the two to close the race.
This can be reproduced with mutiple processes doing socket(PF_KCM, ...)
and one process doing module removal.
Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before creating a slave netdevice, get the mac address from DTS and
apply in case it is valid.
Signed-off-by: Xiaofei Shen <xiaofeis@codeaurora.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same code to flush qdisc tree and purge the qdisc queue
is duplicated in many places and in most cases it does not
respect NOLOCK qdisc: the global backlog len is used and the
per CPU values are ignored.
This change addresses the above, factoring-out the relevant
code and using the helpers introduced by the previous patch
to fetch the correct backlog len.
Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Classful qdiscs can't access directly the child qdiscs backlog
length: if such qdisc is NOLOCK, per CPU values should be
accounted instead.
Most qdiscs no not respect the above. As a result, qstats fetching
for most classful qdisc is currently incorrect: if the child qdisc is
NOLOCK, it always reports 0 len backlog.
This change introduces a pair of helpers to safely fetch
both backlog and qlen and use them in stats class dumping
functions, fixing the above issue and cleaning a bit the code.
DRR needs also to access the child qdisc queue length, so it
needs custom handling.
Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It returned always NULL, thus it was never possible to get the filter.
Example:
$ ip link add foo type dummy
$ ip link add bar type dummy
$ tc qdisc add dev foo clsact
$ tc filter add dev foo protocol all pref 1 ingress handle 1234 \
matchall action mirred ingress mirror dev bar
Before the patch:
$ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall
Error: Specified filter handle not found.
We have an error talking to the kernel
After:
$ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall
filter ingress protocol all pref 1 matchall chain 0 handle 0x4d2
not_in_hw
action order 1: mirred (Ingress Mirror to device bar) pipe
index 1 ref 1 bind 1
CC: Yotam Gigi <yotamg@mellanox.com>
CC: Jiri Pirko <jiri@mellanox.com>
Fixes: fd62d9f5c5 ("net/sched: matchall: Fix configuration race")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Configuration check to accept source route IP options should be made on
the incoming netdevice when the skb->dev is an l3mdev master. The route
lookup for the source route next hop also needs the incoming netdev.
v2->v3:
- Simplify by passing the original netdevice down the stack (per David
Ahern).
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that struct ipv6_pinfo is not located as we think.
inet6_sk_generic() and tcp_inet6_sk() disagree on 32bit kernels by 4-bytes,
because struct tcp_sock has 8-bytes alignment,
but ipv6_pinfo size is not a multiple of 8.
sizeof(struct ipv6_pinfo): 116 (not padded to 8)
I actually first coded tcp_inet6_sk() as this patch does, but thought
that "container_of(tcp_sk(sk), struct tcp6_sock, tcp)" was cleaner.
As Julian told me : Nobody should use tcp6_sock.inet6
directly, it should be accessed via tcp_inet6_sk() or inet6_sk().
This happened when we added the first u64 field in struct tcp_sock.
Fixes: 93a77c11ae ("tcp: add tcp_inet6_sk() helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcp_sk_init() failed in inet_ctl_sock_create(),
'net->ipv4.tcp_congestion_control' will be left
uninitialized, but tcp_sk_exit() hasn't check for
that.
This patch add checking on 'net->ipv4.tcp_congestion_control'
in tcp_sk_exit() to prevent NULL-ptr dereference.
Fixes: 6670e15244 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>