Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.
As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.
Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When binding a non-abstract AF_UNIX socket it will gain a representation
in the filesystem. Enable the socket infrastructure to handle idmapped
mounts by passing down the user namespace of the mount the socket will
be created from. If the initial user namespace is passed nothing changes
so non-idmapped mounts will see identical behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-18-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The various vfs_*() helpers are called by filesystems or by the vfs
itself to perform core operations such as create, link, mkdir, mknod, rename,
rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
inode is accessed through an idmapped mount map it into the
mount's user namespace and pass it down. Afterwards the checks and
operations are identical to non-idmapped mounts. If the initial user
namespace is passed nothing changes so non-idmapped mounts will see
identical behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.
The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.
In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.
Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Add two simple helpers to check permissions on a file and path
respectively and convert over some callers. It simplifies quite a few
codepaths and also reduces the churn in later patches quite a bit.
Christoph also correctly points out that this makes codepaths (e.g.
ioctls) way easier to follow that would otherwise have to do more
complex argument passing than necessary.
Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Current release - regressions:
- fix feature enforcement to allow NETIF_F_HW_TLS_TX
if IP_CSUM && IPV6_CSUM
- dcb: accept RTM_GETDCB messages carrying set-like DCB commands
if user is admin for backward-compatibility
- selftests/tls: fix selftests build after adding ChaCha20-Poly1305
Current release - always broken:
- ppp: fix refcount underflow on channel unbridge
- bnxt_en: clear DEFRAG flag in firmware message when retry flashing
- smc: fix out of bound access in the new netlink interface
Previous releases - regressions:
- fix use-after-free with UDP GRO by frags
- mptcp: better msk-level shutdown
- rndis_host: set proper input size for OID_GEN_PHYSICAL_MEDIUM request
- i40e: xsk: fix potential NULL pointer dereferencing
Previous releases - always broken:
- skb frag: kmap_atomic fixes
- avoid 32 x truesize under-estimation for tiny skbs
- fix issues around register_netdevice() failures
- udp: prevent reuseport_select_sock from reading uninitialized socks
- dsa: unbind all switches from tree when DSA master unbinds
- dsa: clear devlink port type before unregistering slave netdevs
- can: isotp: isotp_getname(): fix kernel information leak
- mlxsw: core: Thermal control fixes
- ipv6: validate GSO SKB against MTU before finish IPv6 processing
- stmmac: use __napi_schedule() for PREEMPT_RT
- net: mvpp2: remove Pause and Asym_Pause support
Misc:
- remove from MAINTAINERS folks who had been inactive for >5yrs
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmAAnyYACgkQMUZtbf5S
IrsdmhAAotkTNVS1zEsvwIirI9KUKKMXvNvscpO0+HJgsQHVnCGkfrj0BQmqQR21
D9njJIkGRiIANRO/Y/3wVCew55a0bxLmyE3JaU6krGLpvcNUFX6+fvuuzFSiWtKu
1c/AaXFIDTa8uVtXP/Ve8DfxKZmh3YPX5pNtk3fS6OlymbUfu8pOEPY5k69/Nlmr
QwbGZO0Q5Ab18rmPztgWpcZi8wLbpZYbrIR2E45u3k+LnXG3UUVYeYTC9Hi89wkz
8YiS0PIs6GmWeSWnWK9TWXFSaxV8ttABsFxpbmzWW6oqkaviGjLfPg7kYYRgPu08
nCyYx7LN58shQ8FTfZm1yBpJ1fbPV/5RIMZKQ6Fg4cICgCab63E4N6xxoA9mLNu9
hP/qgeynQ2w1FbPw5yQVbDCVmcyfPb5V4WC1OccHQdgaAzz2SFPxvsUTOoBRxY8m
DmZDHjBi2ZXB3/PSkwWmIsW9PuPq6de8xgHIQtjrCeduvVvmOYkrcdfkMxTx9HC0
LH2a5x9VCL/cf/Y/tQ2TZSntweSq8MhlRV9vOIO1FOqiviYHlnD8+EuIBMe8To14
XRIDMl92lpY5xjJpKdRhZ7Yh4CNMk199yFf5bt3xSlM4A3ALUlwqRKES6I2MZiiF
0Yvxsr2qVShaHx6XpmBAimaUXxTmmUV7X1hf19EEzzmTdiMjad4=
=e8t6
-----END PGP SIGNATURE-----
Merge tag 'net-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"We have a few fixes for long standing issues, in particular Eric's fix
to not underestimate the skb sizes, and my fix for brokenness of
register_netdevice() error path. They may uncover other bugs so we
will keep an eye on them. Also included are Willem's fixes for
kmap(_atomic).
Looking at the "current release" fixes, it seems we are about one rc
behind a normal cycle. We've previously seen an uptick of "people had
run their test suites" / "humans actually tried to use new features"
fixes between rc2 and rc3.
Summary:
Current release - regressions:
- fix feature enforcement to allow NETIF_F_HW_TLS_TX if IP_CSUM &&
IPV6_CSUM
- dcb: accept RTM_GETDCB messages carrying set-like DCB commands if
user is admin for backward-compatibility
- selftests/tls: fix selftests build after adding ChaCha20-Poly1305
Current release - always broken:
- ppp: fix refcount underflow on channel unbridge
- bnxt_en: clear DEFRAG flag in firmware message when retry flashing
- smc: fix out of bound access in the new netlink interface
Previous releases - regressions:
- fix use-after-free with UDP GRO by frags
- mptcp: better msk-level shutdown
- rndis_host: set proper input size for OID_GEN_PHYSICAL_MEDIUM
request
- i40e: xsk: fix potential NULL pointer dereferencing
Previous releases - always broken:
- skb frag: kmap_atomic fixes
- avoid 32 x truesize under-estimation for tiny skbs
- fix issues around register_netdevice() failures
- udp: prevent reuseport_select_sock from reading uninitialized socks
- dsa: unbind all switches from tree when DSA master unbinds
- dsa: clear devlink port type before unregistering slave netdevs
- can: isotp: isotp_getname(): fix kernel information leak
- mlxsw: core: Thermal control fixes
- ipv6: validate GSO SKB against MTU before finish IPv6 processing
- stmmac: use __napi_schedule() for PREEMPT_RT
- net: mvpp2: remove Pause and Asym_Pause support
Misc:
- remove from MAINTAINERS folks who had been inactive for >5yrs"
* tag 'net-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
mptcp: fix locking in mptcp_disconnect()
net: Allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM
MAINTAINERS: dccp: move Gerrit Renker to CREDITS
MAINTAINERS: ipvs: move Wensong Zhang to CREDITS
MAINTAINERS: tls: move Aviad to CREDITS
MAINTAINERS: ena: remove Zorik Machulsky from reviewers
MAINTAINERS: vrf: move Shrijeet to CREDITS
MAINTAINERS: net: move Alexey Kuznetsov to CREDITS
MAINTAINERS: altx: move Jay Cliburn to CREDITS
net: avoid 32 x truesize under-estimation for tiny skbs
nt: usb: USB_RTL8153_ECM should not default to y
net: stmmac: fix taprio configuration when base_time is in the past
net: stmmac: fix taprio schedule configuration
net: tip: fix a couple kernel-doc markups
net: sit: unregister_netdevice on newlink's error path
net: stmmac: Fixed mtu channged by cache aligned
cxgb4/chtls: Fix tid stuck due to wrong update of qid
i40e: fix potential NULL pointer dereferencing
net: stmmac: use __napi_schedule() for PREEMPT_RT
can: mcp251xfd: mcp251xfd_handle_rxif_one(): fix wrong NULL pointer check
...
tcp_disconnect() expects the caller acquires the sock lock,
but mptcp_disconnect() is not doing that. Add the missing
required lock.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 76e2a55d16 ("mptcp: better msk-level shutdown.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/f818e82b58a556feeb71dcccc8bf1c87aafc6175.1610638176.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cited patch below blocked the TLS TX device offload unless HW_CSUM
is set. This broke devices that use IP_CSUM && IP6_CSUM.
Here we fix it.
Note that the single HW_TLS_TX feature flag indicates support for
both IPv4/6, hence it should still be disabled in case only one of
(IP_CSUM | IPV6_CSUM) is set.
Fixes: ae0b04b238 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reported-by: Rohit Maheshwari <rohitm@chelsio.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head
While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
under estimating memory usage.
For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (order-3 page in x86), or even 64 on PowerPC
We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768
This patch makes sure that small skb head are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.
Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb()
Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page)
I would like to thank Greg Thelen for his precious help on this matter,
analysing crash dumps is always a time consuming task.
Fixes: fd11a83dd3 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A function has a different name between their prototype
and its kernel-doc markup:
../net/tipc/link.c:2551: warning: expecting prototype for link_reset_stats(). Prototype was for tipc_link_reset_stats() instead
../net/tipc/node.c:1678: warning: expecting prototype for is the general link level function for message sending(). Prototype was for tipc_node_xmit() instead
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to unregister the netdevice if config failed.
.ndo_uninit takes care of most of the heavy lifting.
This was uncovered by recent commit c269a24ce0 ("net: make
free_netdev() more lenient with unregistering devices").
Previously the partially-initialized device would be left
in the system.
Reported-and-tested-by: syzbot+2393580080a2da190f04@syzkaller.appspotmail.com
Fixes: e2f1f072db ("sit: allow to configure 6rd tunnels via netlink")
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20210114012947.2515313-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The call state may be changed at any time by the data-ready routine in
response to received packets, so if the call state is to be read and acted
upon several times in a function, READ_ONCE() must be used unless the call
state lock is held.
As it happens, we used READ_ONCE() to read the state a few lines above the
unmarked read in rxrpc_input_data(), so use that value rather than
re-reading it.
Fixes: a158bdd324 ("rxrpc: Fix call timeouts")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/161046715522.2450566.488819910256264150.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Clang static analysis reports the following:
net/rxrpc/key.c:657:11: warning: Assigned value is garbage or undefined
toksize = toksizes[tok++];
^ ~~~~~~~~~~~~~~~
rxrpc_read() contains two consecutive loops. The first loop calculates the
token sizes and stores the results in toksizes[] and the second one uses
the array. When there is an error in identifying the token in the first
loop, the token is skipped, no change is made to the toksizes[] array.
When the same error happens in the second loop, the token is not skipped.
This will cause the toksizes[] array to be out of step and will overrun
past the calculated sizes.
Fix this by making both loops log a message and return an error in this
case. This should only happen if a new token type is incompletely
implemented, so it should normally be impossible to trigger this.
Fixes: 9a059cd5ca ("rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read()")
Reported-by: Tom Rix <trix@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/161046503122.2445787.16714129930607546635.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Pass conntrack -f to specify family in netfilter conntrack helper
selftests, from Chen Yi.
2) Honor hashsize modparam from nf_conntrack_buckets sysctl,
from Jesper D. Brouer.
3) Fix memleak in nf_nat_init() error path, from Dinghao Liu.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nf_nat: Fix memleak in nf_nat_init
netfilter: conntrack: fix reading nf_conntrack_buckets
selftests: netfilter: Pass family parameter "-f" to conntrack tool
====================
Link: https://lore.kernel.org/r/20210112222033.9732-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using snprintf() to convert not null-terminated strings to null
terminated strings may cause out of bounds read in the source string.
Therefore use memcpy() and terminate the target string with a null
afterwards.
Fixes: a3db10efcc ("net/smc: Add support for obtaining SMCR device list")
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_clc_get_hostname() sets the host pointer to a buffer
which is not NULL-terminated (see smc_clc_init()).
Reported-by: syzbot+f4708c391121cfc58396@syzkaller.appspotmail.com
Fixes: 099b990bd1 ("net/smc: Add support for obtaining system information")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of re-implementing most of inet_shutdown, re-use
such helper, and implement the MPTCP-specific bits at the
'proto' level.
The msk-level disconnect() can now be invoked, lets provide a
suitable implementation.
As a side effect, this fixes bad state management for listener
sockets. The latter could lead to division by 0 oops since
commit ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling").
Fixes: 43b54c6ee3 ("mptcp: Use full MPTCP-level disconnect state machine")
Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller found a way to trigger division by zero
in mptcp_subflow_cleanup_rbuf().
The current checks implemented into tcp_can_send_ack()
are too week, let's be more accurate.
Reported-by: Christoph Paasch <cpaasch@apple.com>
Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Fixes: fd8976790a ("mptcp: be careful on MPTCP-level ack.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian reported a use-after-free bug in devlink_nl_port_fill found with
KASAN:
(devlink_nl_port_fill)
(devlink_port_notify)
(devlink_port_unregister)
(dsa_switch_teardown.part.3)
(dsa_tree_teardown_switches)
(dsa_unregister_switch)
(bcm_sf2_sw_remove)
(platform_remove)
(device_release_driver_internal)
(device_links_unbind_consumers)
(device_release_driver_internal)
(device_driver_detach)
(unbind_store)
Allocated by task 31:
alloc_netdev_mqs+0x5c/0x50c
dsa_slave_create+0x110/0x9c8
dsa_register_switch+0xdb0/0x13a4
b53_switch_register+0x47c/0x6dc
bcm_sf2_sw_probe+0xaa4/0xc98
platform_probe+0x90/0xf4
really_probe+0x184/0x728
driver_probe_device+0xa4/0x278
__device_attach_driver+0xe8/0x148
bus_for_each_drv+0x108/0x158
Freed by task 249:
free_netdev+0x170/0x194
dsa_slave_destroy+0xac/0xb0
dsa_port_teardown.part.2+0xa0/0xb4
dsa_tree_teardown_switches+0x50/0xc4
dsa_unregister_switch+0x124/0x250
bcm_sf2_sw_remove+0x98/0x13c
platform_remove+0x44/0x5c
device_release_driver_internal+0x150/0x254
device_links_unbind_consumers+0xf8/0x12c
device_release_driver_internal+0x84/0x254
device_driver_detach+0x30/0x34
unbind_store+0x90/0x134
What happens is that devlink_port_unregister emits a netlink
DEVLINK_CMD_PORT_DEL message which associates the devlink port that is
getting unregistered with the ifindex of its corresponding net_device.
Only trouble is, the net_device has already been unregistered.
It looks like we can stub out the search for a corresponding net_device
if we clear the devlink_port's type. This looks like a bit of a hack,
but also seems to be the reason why the devlink_port_type_clear function
exists in the first place.
Fixes: 3122433eb5 ("net: dsa: Register devlink ports before calling DSA driver setup()")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian fainelli <f.fainelli@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210112004831.3778323-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the following happens when a DSA master driver unbinds while
there are DSA switches attached to it:
$ echo 0000:00:00.5 > /sys/bus/pci/drivers/mscc_felix/unbind
------------[ cut here ]------------
WARNING: CPU: 0 PID: 392 at net/core/dev.c:9507
Call trace:
rollback_registered_many+0x5fc/0x688
unregister_netdevice_queue+0x98/0x120
dsa_slave_destroy+0x4c/0x88
dsa_port_teardown.part.16+0x78/0xb0
dsa_tree_teardown_switches+0x58/0xc0
dsa_unregister_switch+0x104/0x1b8
felix_pci_remove+0x24/0x48
pci_device_remove+0x48/0xf0
device_release_driver_internal+0x118/0x1e8
device_driver_detach+0x28/0x38
unbind_store+0xd0/0x100
Located at the above location is this WARN_ON:
/* Notifier chain MUST detach us all upper devices. */
WARN_ON(netdev_has_any_upper_dev(dev));
Other stacked interfaces, like VLAN, do indeed listen for
NETDEV_UNREGISTER on the real_dev and also unregister themselves at that
time, which is clearly the behavior that rollback_registered_many
expects. But DSA interfaces are not VLAN. They have backing hardware
(platform devices, PCI devices, MDIO, SPI etc) which have a life cycle
of their own and we can't just trigger an unregister from the DSA
framework when we receive a netdev notifier that the master unregisters.
Luckily, there is something we can do, and that is to inform the driver
core that we have a runtime dependency to the DSA master interface's
device, and create a device link where that is the supplier and we are
the consumer. Having this device link will make the DSA switch unbind
before the DSA master unbinds, which is enough to avoid the WARN_ON from
rollback_registered_many.
Note that even before the blamed commit, DSA did nothing intelligent
when the master interface got unregistered either. See the discussion
here:
https://lore.kernel.org/netdev/20200505210253.20311-1-f.fainelli@gmail.com/
But this time, at least the WARN_ON is loud enough that the
upper_dev_link commit can be blamed.
The advantage with this approach vs dev_hold(master) in the attached
link is that the latter is not meant for long term reference counting.
With dev_hold, the only thing that will happen is that when the user
attempts an unbind of the DSA master, netdev_wait_allrefs will keep
waiting and waiting, due to DSA keeping the refcount forever. DSA would
not access freed memory corresponding to the master interface, but the
unbind would still result in a freeze. Whereas with device links,
graceful teardown is ensured. It even works with cascaded DSA trees.
$ echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind
[ 1818.797546] device swp0 left promiscuous mode
[ 1819.301112] sja1105 spi2.0: Link is Down
[ 1819.307981] DSA: tree 1 torn down
[ 1819.312408] device eno2 left promiscuous mode
[ 1819.656803] mscc_felix 0000:00:00.5: Link is Down
[ 1819.667194] DSA: tree 0 torn down
[ 1819.711557] fsl_enetc 0000:00:00.2 eno2: Link is Down
This approach allows us to keep the DSA framework absolutely unchanged,
and the driver core will just know to unbind us first when the master
goes away - as opposed to the large (and probably impossible) rework
required if attempting to listen for NETDEV_UNREGISTER.
As per the documentation at Documentation/driver-api/device_link.rst,
specifying the DL_FLAG_AUTOREMOVE_CONSUMER flag causes the device link
to be automatically purged when the consumer fails to probe or later
unbinds. So we don't need to keep the consumer_link variable in struct
dsa_switch.
Fixes: 2f1e8ea726 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210111230943.3701806-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit 826f328e2b ("net: dcb: Validate netlink message in DCB
handler"), Linux started rejecting RTM_GETDCB netlink messages if they
contained a set-like DCB_CMD_ command.
The reason was that privileges were only verified for RTM_SETDCB messages,
but the value that determined the action to be taken is the command, not
the message type. And validation of message type against the DCB command
was the obvious missing piece.
Unfortunately it turns out that mlnx_qos, a somewhat widely deployed tool
for configuration of DCB, accesses the DCB set-like APIs through
RTM_GETDCB.
Therefore do not bounce the discrepancy between message type and command.
Instead, in addition to validating privileges based on the actual message
type, validate them also based on the expected message type. This closes
the loophole of allowing DCB configuration on non-admin accounts, while
maintaining backward compatibility.
Fixes: 2f90b8657e ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
Fixes: 826f328e2b ("net: dcb: Validate netlink message in DCB handler")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/a3edcfda0825f2aa2591801c5232f2bbf2d8a554.1610384801.git.me@pmachata.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Highlights include:
Bugfixes:
- Fix parsing of link-local IPv6 addresses
- Fix confusing logging of mount errors that was introduced by the
fsopen() patchset.
- Fix a tracing use after free in _nfs4_do_setlk()
- Layout return-on-close fixes when called from nfs4_evict_inode()
- Layout segments were being leaked in pnfs_generic_clear_request_commit()
- Don't leak DS commits in pnfs_generic_retry_commit()
- Fix an Oopsable use-after-free when nfs_delegation_find_inode_server()
calls iput() on an inode after the super block has gone away.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl/806IACgkQZwvnipYK
APKIZA/+L+LvkMXflS9TQGGccpOPw+BBW5ixi2DabFYLqHz6WXNnIcUStU0NtF3q
uHM2YrJT0XtWtQ8W6fWcsfdeS/1ixciXDS/5RH/o2e+fFMNg1lPWAeOc4brQSDFd
DYEc7lSqw0D/pX8vY4dFIrpQorU2hnasjMK582JU7mDYXveRMLB/Bhcq9qBP2XgQ
LVUpnHU/3dayvFGmr/sPzzZk/rIEfPaHU/J0YLbPfrEGFOo/mZKqstfS4ZkINAWp
0yRD90s1hWTfRcxAiDaUoYPoxEw5AYjdbwC82owOaEa0zNWA2U7tD94UeVS51JCJ
DtCn81znWaF4jVzes4VGzPlWirYoumthJwrKpKh04tEwo0a4V4AtsOAg2IbxfE/O
CYsfwjwikzW4nOEerv22zOHICLNd2IP65kHAACaN0NVhS7dlLSuckwnMILdstD2Z
x0LHxFhyRQe5c7bf6W6Jal2E/ThyD2qaUmSIxWweTq93OldD0mTLGHO7e2/chXwP
3xkcuZLpU6bmg9QzmylWZWBB3ncDtC95VlRv/IV29mbN3a8XjJaugSOAwjx14JNT
OFlJtLav2pvCwFLUutvgAMSgbshhfkwdUoUUHrcabXNL/4QBeeZB/pp9Ytr3NoBT
xxC6nmB/Af7FtRnTrTpOSlH9s1NEB3JN4uMNx4kAKC+ZLySdMPQ=
=08H3
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
- Fix parsing of link-local IPv6 addresses
- Fix confusing logging of mount errors that was introduced by the
fsopen() patchset.
- Fix a tracing use after free in _nfs4_do_setlk()
- Layout return-on-close fixes when called from nfs4_evict_inode()
- Layout segments were being leaked in
pnfs_generic_clear_request_commit()
- Don't leak DS commits in pnfs_generic_retry_commit()
- Fix an Oopsable use-after-free when nfs_delegation_find_inode_server()
calls iput() on an inode after the super block has gone away"
* tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: nfs_igrab_and_active must first reference the superblock
NFS: nfs_delegation_find_inode_server must first reference the superblock
NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
pNFS: Stricter ordering of layoutget and layoutreturn
pNFS: Clean up pnfs_layoutreturn_free_lsegs()
pNFS: We want return-on-close to complete when evicting the inode
pNFS: Mark layout for return if return-on-close was not sent
net: sunrpc: interpret the return value of kstrtou32 correctly
NFS: Adjust fs_context error logging
NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for
the esp trailer.
It accesses the page with kmap_atomic to handle highmem. But
skb_page_frag_refill can return compound pages, of which
kmap_atomic only maps the first underlying page.
skb_page_frag_refill does not return highmem, because flag
__GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP.
That also does not call kmap_atomic, but directly uses page_address,
in skb_copy_to_page_nocache. Do the same for ESP.
This issue has become easier to trigger with recent kmap local
debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.
Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
skb_seq_read iterates over an skb, returning pointer and length of
the next data range with each call.
It relies on kmap_atomic to access highmem pages when needed.
An skb frag may be backed by a compound page, but kmap_atomic maps
only a single page. There are not enough kmap slots to always map all
pages concurrently.
Instead, if kmap_atomic is needed, iterate over each page.
As this increases the number of calls, avoid this unless needed.
The necessary condition is captured in skb_frag_must_loop.
I tried to make the change as obvious as possible. It should be easy
to verify that nothing changes if skb_frag_must_loop returns false.
Tested:
On an x86 platform with
CONFIG_HIGHMEM=y
CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
CONFIG_NETFILTER_XT_MATCH_STRING=y
Run
ip link set dev lo mtu 1500
iptables -A OUTPUT -m string --string 'badstring' -algo bm -j ACCEPT
dd if=/dev/urandom of=in bs=1M count=20
nc -l -p 8000 > /dev/null &
nc -w 1 -q 0 localhost 8000 < in
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When register_pernet_subsys() fails, nf_nat_bysource
should be freed just like when nf_ct_extend_register()
fails.
Fixes: 1cd472bf03 ("netfilter: nf_nat: add nat hook register functions to nf_nat")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A return value of 0 means success. This is documented in lib/kstrtox.c.
This was found by trying to mount an NFS share from a link-local IPv6
address with the interface specified by its index:
mount("[fe80::1%1]:/srv/nfs", "/mnt", "nfs", 0, "nolock,addr=fe80::1%1")
Before this commit this failed with EINVAL and also caused the following
message in dmesg:
[...] NFS: bad IP address specified: addr=fe80::1%1
The syscall using the same address based on the interface name instead
of its index succeeds.
Credits for this patch go to my colleague Christian Speich, who traced
the origin of this bug to this line of code.
Signed-off-by: Johannes Nixdorf <j.nixdorf@avm.de>
Fixes: 00cfaa943e ("replace strict_strto calls")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The old way of changing the conntrack hashsize runtime was through changing
the module param via file /sys/module/nf_conntrack/parameters/hashsize. This
was extended to sysctl change in commit 3183ab8997 ("netfilter: conntrack:
allow increasing bucket size via sysctl too").
The commit introduced second "user" variable nf_conntrack_htable_size_user
which shadow actual variable nf_conntrack_htable_size. When hashsize is
changed via module param this "user" variable isn't updated. This results in
sysctl net/netfilter/nf_conntrack_buckets shows the wrong value when users
update via the old way.
This patch fix the issue by always updating "user" variable when reading the
proc file. This will take care of changes to the actual variable without
sysctl need to be aware.
Fixes: 3183ab8997 ("netfilter: conntrack: allow increasing bucket size via sysctl too")
Reported-by: Yoel Caspersen <yoel@kviknet.dk>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are cases where GSO segment's length exceeds the egress MTU:
- Forwarding of a TCP GRO skb, when DF flag is not set.
- Forwarding of an skb that arrived on a virtualisation interface
(virtio-net/vhost/tap) with TSO/GSO size set by other network
stack.
- Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
interface with a smaller MTU.
- Arriving GRO skb (or GSO skb in a virtualised environment) that is
bridged to a NETIF_F_TSO tunnel stacked over an interface with an
insufficient MTU.
If so:
- Consume the SKB and its segments.
- Issue an ICMP packet with 'Packet Too Big' message containing the
MTU, allowing the source host to reduce its Path MTU appropriately.
Note: These cases are handled in the same manner in IPv4 output finish.
This patch aligns the behavior of IPv6 and the one of IPv4.
Fixes: 9e50849054 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If register_netdevice() fails at the very last stage - the
notifier call - some subsystems may have already seen it and
grabbed a reference. struct net_device can't be freed right
away without calling netdev_wait_all_refs().
Now that we have a clean interface in form of dev->needs_free_netdev
and lenient free_netdev() we can undo what commit 93ee31f14f ("[NET]:
Fix free_netdev on register_netdev failure.") has done and complete
the unregistration path by bringing the net_set_todo() call back.
After registration fails user is still expected to explicitly
free the net_device, so make sure ->needs_free_netdev is cleared,
otherwise rolling back the registration will cause the old double
free for callers who release rtnl_lock before the free.
This also solves the problem of priv_destructor not being called
on notifier error.
net_set_todo() will be moved back into unregister_netdevice_queue()
in a follow up.
Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are two flavors of handling netdev registration:
- ones called without holding rtnl_lock: register_netdev() and
unregister_netdev(); and
- those called with rtnl_lock held: register_netdevice() and
unregister_netdevice().
While the semantics of the former are pretty clear, the same can't
be said about the latter. The netdev_todo mechanism is utilized to
perform some of the device unregistering tasks and it hooks into
rtnl_unlock() so the locked variants can't actually finish the work.
In general free_netdev() does not mix well with locked calls. Most
drivers operating under rtnl_lock set dev->needs_free_netdev to true
and expect core to make the free_netdev() call some time later.
The part where this becomes most problematic is error paths. There is
no way to unwind the state cleanly after a call to register_netdevice(),
since unreg can't be performed fully without dropping locks.
Make free_netdev() more lenient, and defer the freeing if device
is being unregistered. This allows error paths to simply call
free_netdev() both after register_netdevice() failed, and after
a call to unregister_netdevice() but before dropping rtnl_lock.
Simplify the error paths which are currently doing gymnastics
around free_netdev() handling.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
reuse->socks[] is modified concurrently by reuseport_add_sock. To
prevent reading values that have not been fully initialized, only read
the array up until the last known safe index instead of incorrectly
re-reading the last index of the array.
Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
writes, it will call skb_ensure_writable -> pskb_expand_head to create
a private linear section for the head_skb. And then call
skb_clone_fraglist -> skb_get on each skb in the fraglist.
skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in PF_PACKET.
Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
a link for the same frag_skbs chain. If a new skb (not frags) is
queued to one of the sk_receive_queue, multiple ptypes can see and
release this. It causes use-after-free.
[ 4443.426215] ------------[ cut here ]------------
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823] skb_release_data+0x144/0x264
[ 4443.426828] kfree_skb+0x58/0xc4
[ 4443.426832] skb_queue_purge+0x64/0x9c
[ 4443.426844] packet_set_ring+0x5f0/0x820
[ 4443.426849] packet_setsockopt+0x5a4/0xcd0
[ 4443.426853] __sys_setsockopt+0x188/0x278
[ 4443.426858] __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869] el0_svc_common+0xf0/0x1d0
[ 4443.426873] el0_svc_handler+0x74/0x98
[ 4443.426880] el0_svc+0x8/0xc
Fixes: 3a1296a38d (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function nh_check_attr_group() is called to validate nexthop groups.
The intention of that code seems to have been to bounce all attributes
above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
these attributes except when NHA_FDB attribute is present--then it accepts
them.
NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.
But that still leaves NHA_GATEWAY as an attribute that would be accepted in
FDB nexthop groups (with no meaning), so long as it keeps the address
family as unspecified:
# ip nexthop add id 1 fdb via 127.0.0.1
# ip nexthop add id 10 fdb via default group 1
The nexthop code is still relatively new and likely not used very broadly,
and the FDB bits are newer still. Even though there is a reproducer out
there, it relies on an improbable gateway arguments "via default", "via
all" or "via any". Given all this, I believe it is OK to reformulate the
condition to do the right thing and bounce NHA_GATEWAY.
Fixes: 38428d6871 ("nexthop: support for fdb ecmp nexthops")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In case of error, remove the nexthop group entry from the list to which
it was previously added.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A reference was not taken for the current nexthop entry, so do not try
to put it in the error path.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Conntrack reassembly records the largest fragment size seen in IPCB.
However, when this gets forwarded/transmitted, fragmentation will only
be forced if one of the fragmented packets had the DF bit set.
In that case, a flag in IPCB will force fragmentation even if the
MTU is large enough.
This should work fine, but this breaks with ip tunnels.
Consider client that sends a UDP datagram of size X to another host.
The client fragments the datagram, so two packets, of size y and z, are
sent. DF bit is not set on any of these packets.
Middlebox netfilter reassembles those packets back to single size-X
packet, before routing decision.
packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit
isn't set. At output time, ip refragmentation is skipped as well
because x is still smaller than the mtu of the output device.
If ttransmit device is an ip tunnel, the packet size increases to
x+overhead.
Also, tunnel might be configured to force DF bit on outer header.
In this case, packet will be dropped (exceeds MTU) and an ICMP error is
generated back to sender.
But sender already respects the announced MTU, all the packets that
it sent did fit the announced mtu.
Force refragmentation as per original sizes unconditionally so ip tunnel
will encapsulate the fragments instead.
The only other solution I see is to place ip refragmentation in
the ip_tunnel code to handle this case.
Fixes: d6b915e29f ("ip_fragment: don't forward defragmented DF packet")
Reported-by: Christian Perle <christian.perle@secunet.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For some reason ip_tunnel insist on setting the DF bit anyway when the
inner header has the DF bit set, EVEN if the tunnel was configured with
'nopmtudisc'.
This means that the script added in the previous commit
cannot be made to work by adding the 'nopmtudisc' flag to the
ip tunnel configuration. Doing so breaks connectivity even for the
without-conntrack/netfilter scenario.
When nopmtudisc is set, the tunnel will skip the mtu check, so no
icmp error is sent to client. Then, because inner header has DF set,
the outer header gets added with DF bit set as well.
IP stack then sends an error to itself because the packet exceeds
the device MTU.
Fixes: 23a3647bc4 ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Route removal is handled by two code paths. The main removal path is via
fib6_del_route() which will handle purging any PMTU exceptions from the
cache, removing all per-cpu copies of the DST entry used by the route, and
releasing the fib6_info struct.
The second removal location is during fib6_add_rt2node() during a route
replacement operation. This path also calls fib6_purge_rt() to handle
cleaning up the per-cpu copies of the DST entries and releasing the
fib6_info associated with the older route, but it does not flush any PMTU
exceptions that the older route had. Since the older route is removed from
the tree during the replacement, we lose any way of accessing it again.
As these lingering DSTs and the fib6_info struct are holding references to
the underlying netdevice struct as well, unregistering that device from the
kernel can never complete.
Fixes: 2b760fcf5c ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A null-ptr-deref bug is reported by Hulk Robot like this:
--------------
KASAN: null-ptr-deref in range [0x0000000000000128-0x000000000000012f]
Call Trace:
qrtr_ns_remove+0x22/0x40 [ns]
qrtr_proto_fini+0xa/0x31 [qrtr]
__x64_sys_delete_module+0x337/0x4e0
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x468ded
--------------
When qrtr_ns_init fails in qrtr_proto_init, qrtr_ns_remove which would
be called later on would raise a null-ptr-deref because qrtr_ns.workqueue
has been destroyed.
Fix it by making qrtr_ns_init have a return value and adding a check in
qrtr_proto_init.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VLAN checks for NETREG_UNINITIALIZED to distinguish between
registration failure and unregistration in progress.
Since commit cb626bf566 ("net-sysfs: Fix reference count leak")
registration failure may, however, result in NETREG_UNREGISTERED
as well as NETREG_UNINITIALIZED.
This fix is similer to cebb69754f ("rtnetlink: Fix
memory(net_device) leak when ->newlink fails")
Fixes: cb626bf566 ("net-sysfs: Fix reference count leak")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without crc32 support, this fails to link:
arm-linux-gnueabi-ld: net/wireless/scan.o: in function `cfg80211_scan_6ghz':
scan.c:(.text+0x928): undefined reference to `crc32_le'
Fixes: c8cb5b854b ("nl80211/cfg80211: support 6 GHz scanning")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
and bpf trees.
Current release - regressions:
- mt76: - usb: fix NULL pointer dereference in mt76u_status_worker
- sdio: fix NULL pointer dereference in mt76s_process_tx_queue
- net: ipa: fix interconnect enable bug
Current release - always broken:
- netfilter: ipset: fixes possible oops in mtype_resize
- ath11k: fix number of coding issues found by static analysis tools
and spurious error messages
Previous releases - regressions:
- e1000e: re-enable s0ix power saving flows for systems with
the Intel i219-LM Ethernet controllers to fix power
use regression
- virtio_net: fix recursive call to cpus_read_lock() to avoid
a deadlock
- ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()
- net-sysfs: take the rtnl lock around XPS configuration
- xsk: - fix memory leak for failed bind
- rollback reservation at NETDEV_TX_BUSY
- r8169: work around power-saving bug on some chip versions
Previous releases - always broken:
- dcb: validate netlink message in DCB handler
- tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
to prevent unnecessary retries
- vhost_net: fix ubuf refcount when sendmsg fails
- bpf: save correct stopping point in file seq iteration
- ncsi: use real net-device for response handler
- neighbor: fix div by zero caused by a data race (TOCTOU)
- bareudp: - fix use of incorrect min_headroom size
- fix false positive lockdep splat from the TX lock
- net: mvpp2: - clear force link UP during port init procedure
in case bootloader had set it
- add TCAM entry to drop flow control pause frames
- fix PPPoE with ipv6 packet parsing
- fix GoP Networking Complex Control config of port 3
- fix pkt coalescing IRQ-threshold configuration
- xsk: fix race in SKB mode transmit with shared cq
- ionic: account for vlan tag len in rx buffer len
- net: stmmac: ignore the second clock input, current clock framework
does not handle exclusive clock use well, other drivers
may reconfigure the second clock
Misc:
- ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
existing scheme
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/zsqQACgkQMUZtbf5S
IrvfqA/+MbjN9TRccZRgYVzPVzlP5jswi7VZIjikPrNxCdwgQd8bDMfeaD6I1PcX
WHf35vtD8zh729qz9DheWXFp7kDQ1fY0Z59KA25xf/ulFEkZPl3RBg70rSgv4rc+
T82dVo6x33DPe6NkspDC+Uhjz2IxcS/P7F9N7DtbavrfNuDyX8+0U/FFQIL0xOyG
DuhwecCh0vJFGcWXTWtK1vP1CPD98L28KS2Od+EZsUUZOKt1WMyGrAgNcT6uYXmO
NIYNy+FPyvvIwTLupoFE7oU4LA0sZozyvzcTDugXBF5EKoR8BwBFk0FfWzN9Oxge
LrmhNBSTeYyiw8XMOwSIfxwZnBm7mJFQqTHR1+Y83Qw1SR6PfSUZgkEkW2SYgprL
9CzE3O3P3Ci7TSx7fvZUn8B1q5J0DfZR6ZYyor9zl55e+ikraRYtXsk47bf9AGXl
owpHXEYWHFmgOP+LVdf1BUjuiE3vnCBJBsHlMbRkxiNPKravWtPSiM2yTu6fEbpT
pMXCgFQBL/IqwzX01zuw7teg40YLVaFnmFdQbYDwA5p9VODlQvHzn2K4GyuktswX
wxHYU5WRWtCkBfE+nbAROKzE7MuH9jtPtV1ZeuseTqYGBRuvEvudX8ypEvKS45pP
OWkzFsSXd9q7M6cxftipwjcyLiIO+UGdizNHvDUyEQOPAyYPKb4=
=N4/x
-----END PGP SIGNATURE-----
Merge tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter, wireless and bpf
trees.
Current release - regressions:
- mt76: fix NULL pointer dereference in mt76u_status_worker and
mt76s_process_tx_queue
- net: ipa: fix interconnect enable bug
Current release - always broken:
- netfilter: fixes possible oops in mtype_resize in ipset
- ath11k: fix number of coding issues found by static analysis tools
and spurious error messages
Previous releases - regressions:
- e1000e: re-enable s0ix power saving flows for systems with the
Intel i219-LM Ethernet controllers to fix power use regression
- virtio_net: fix recursive call to cpus_read_lock() to avoid a
deadlock
- ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()
- sysfs: take the rtnl lock around XPS configuration
- xsk: fix memory leak for failed bind and rollback reservation at
NETDEV_TX_BUSY
- r8169: work around power-saving bug on some chip versions
Previous releases - always broken:
- dcb: validate netlink message in DCB handler
- tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
to prevent unnecessary retries
- vhost_net: fix ubuf refcount when sendmsg fails
- bpf: save correct stopping point in file seq iteration
- ncsi: use real net-device for response handler
- neighbor: fix div by zero caused by a data race (TOCTOU)
- bareudp: fix use of incorrect min_headroom size and a false
positive lockdep splat from the TX lock
- mvpp2:
- clear force link UP during port init procedure in case
bootloader had set it
- add TCAM entry to drop flow control pause frames
- fix PPPoE with ipv6 packet parsing
- fix GoP Networking Complex Control config of port 3
- fix pkt coalescing IRQ-threshold configuration
- xsk: fix race in SKB mode transmit with shared cq
- ionic: account for vlan tag len in rx buffer len
- stmmac: ignore the second clock input, current clock framework does
not handle exclusive clock use well, other drivers may reconfigure
the second clock
Misc:
- ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
existing scheme"
* tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access
net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs
net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_event
r8169: work around power-saving bug on some chip versions
net: usb: qmi_wwan: add Quectel EM160R-GL
selftests: mlxsw: Set headroom size of correct port
net: macb: Correct usage of MACB_CAPS_CLK_HW_CHG flag
ibmvnic: fix: NULL pointer dereference.
docs: networking: packet_mmap: fix old config reference
docs: networking: packet_mmap: fix formatting for C macros
vhost_net: fix ubuf refcount incorrectly when sendmsg fails
bareudp: Fix use of incorrect min_headroom size
bareudp: set NETIF_F_LLTX flag
net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
atlantic: remove architecture depends
erspan: fix version 1 check in gre_parse_header()
net: hns: fix return value check in __lb_other_process()
net: sched: prevent invalid Scell_log shift count
net: neighbor: fix a crash caused by mod zero
ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
...
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Missing sanitization of rateest userspace string, bug has been
triggered by syzbot, patch from Florian Westphal.
2) Report EOPNOTSUPP on missing set features in nft_dynset, otherwise
error reporting to userspace via EINVAL is misleading since this is
reserved for malformed netlink requests.
3) New binaries with old kernels might silently accept several set
element expressions. New binaries set on the NFT_SET_EXPR and
NFT_DYNSET_F_EXPR flags to request for several expressions per
element, hence old kernels which do not support for this bail out
with EOPNOTSUPP.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nftables: add set expression flags
netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
netfilter: xt_RATEEST: reject non-null terminated string from userspace
====================
Link: https://lore.kernel.org/r/20210103192920.18639-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In lapb_device_event, lapb_devtostruct is called to get a reference to
an object of "struct lapb_cb". lapb_devtostruct increases the refcount
of the object and returns a pointer to it. However, we didn't decrease
the refcount after we finished using the pointer. This patch fixes this
problem.
Fixes: a4989fa911 ("net/lapb: support netdev events")
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20201231174331.64539-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2020-12-28
The following pull-request contains BPF updates for your *net* tree.
There is a small merge conflict between bpf tree commit 69ca310f34
("bpf: Save correct stopping point in file seq iteration") and net tree
commit 66ed594409 ("bpf/task_iter: In task_file_seq_get_next use
task_lookup_next_fd_rcu"). The get_files_struct() does not exist anymore
in net, so take the hunk in HEAD and add the `info->tid = curr_tid` to
the error path:
[...]
curr_task = task_seq_get_next(ns, &curr_tid, true);
if (!curr_task) {
info->task = NULL;
info->tid = curr_tid;
return NULL;
}
/* set info->task and info->tid */
[...]
We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 11 files changed, 75 insertions(+), 20 deletions(-).
The main changes are:
1) Various AF_XDP fixes such as fill/completion ring leak on failed bind and
fixing a race in skb mode's backpressure mechanism, from Magnus Karlsson.
2) Fix latency spikes on lockdep enabled kernels by adding a rescheduling
point to BPF hashtab initialization, from Eric Dumazet.
3) Fix a splat in task iterator by saving the correct stopping point in the
seq file iteration, from Jonathan Lemon.
4) Fix BPF maps selftest by adding retries in case hashtab returns EBUSY
errors on update/deletes, from Andrii Nakryiko.
5) Fix BPF selftest error reporting to something more user friendly if the
vmlinux BTF cannot be found, from Kamal Mostafa.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
have an erspan header. So the check in gre_parse_header() is wrong,
we have to distinguish version 1 from version 0.
We can just check the gre header length like is_erspan_type1().
Fixes: cb73ee40b1 ("net: ip_gre: use erspan key field for tunnel lookup")
Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
Cc: William Tu <u9012063@gmail.com>
Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>