Commit Graph

78656 Commits

Author SHA1 Message Date
Atte Heikkilä
dbab80e207 ksmbd: make utf-8 file name comparison work in __caseless_lookup()
Case-insensitive file name lookups with __caseless_lookup() use
strncasecmp() for file name comparison. strncasecmp() assumes an
ISO8859-1-compatible encoding, which is not the case here as UTF-8
is always used. As such, use of strncasecmp() here produces correct
results only if both strings use characters in the ASCII range only.
Fix this by using utf8_strncasecmp() if CONFIG_UNICODE is set. On
failure or if CONFIG_UNICODE is not set, fallback to strncasecmp().
Also, as we are adding an include for `linux/unicode.h', include it
in `fs/ksmbd/connection.h' as well since it should be explicit there.

Signed-off-by: Atte Heikkilä <atteh.mailbox@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Mickaël Salaün
7c88c1e0ab ksmbd: Fix user namespace mapping
A kernel daemon should not rely on the current thread, which is unknown
and might be malicious.  Before this security fix,
ksmbd_override_fsids() didn't correctly override FS UID/GID which means
that arbitrary user space threads could trick the kernel to impersonate
arbitrary users or groups for file system access checks, leading to
file system access bypass.

This was found while investigating truncate support for Landlock:
https://lore.kernel.org/r/CAKYAXd8fpMJ7guizOjHgxEyyjoUwPsx3jLOPZP=wPYcbhkVXqA@mail.gmail.com

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Steve French <smfrench@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20220929100447.108468-1-mic@digikod.net
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Namjae Jeon
5876e99611 ksmbd: hide socket error message when ipv6 config is disable
When ipv6 config is disable(CONFIG_IPV6 is not set), ksmbd fallback to
create ipv4 socket. User reported that this error message lead to
misunderstood some issue. Users have requested not to print this error
message that occurs even though there is no problem.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Tom Talpey
78af146e10 ksmbd: reduce server smbdirect max send/receive segment sizes
Reduce ksmbd smbdirect max segment send and receive size to 1364
to match protocol norms. Larger buffers are unnecessary and add
significant memory overhead.

Signed-off-by: Tom Talpey <tom@talpey.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Tom Talpey
2b4eeeaa90 ksmbd: decrease the number of SMB3 smbdirect server SGEs
The server-side SMBDirect layer requires no more than 6 send SGEs
The previous default of 8 causes ksmbd to fail on the SoftiWARP
(siw) provider, and possibly others. Additionally, large numbers
of SGEs reduces performance significantly on adapter implementations.

Signed-off-by: Tom Talpey <tom@talpey.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Zhang Xiaoxu
b1763d265a ksmbd: Fix wrong return value and message length check in smb2_ioctl()
Commit c7803b05f7 ("smb3: fix ksmbd bigendian bug in oplock
break, and move its struct to smbfs_common") use the defination
of 'struct validate_negotiate_info_req' in smbfs_common, the
array length of 'Dialects' changed from 1 to 4, but the protocol
does not require the client to send all 4. This lead the request
which satisfied with protocol and server to fail.

So just ensure the request payload has the 'DialectCount' in
smb2_ioctl(), then fsctl_validate_negotiate_info() will use it
to validate the payload length and each dialect.

Also when the {in, out}_buf_len is less than the required, should
goto out to initialize the status in the response header.

Fixes: f7db8fd03a ("ksmbd: add validation in smb2_ioctl")
Cc: stable@vger.kernel.org
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Namjae Jeon
5bedae90b3 ksmbd: set NTLMSSP_NEGOTIATE_SEAL flag to challenge blob
If NTLMSSP_NEGOTIATE_SEAL flags is set in negotiate blob from client,
Set NTLMSSP_NEGOTIATE_SEAL flag to challenge blob.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Namjae Jeon
af705ef2b0 ksmbd: fix encryption failure issue for session logoff response
If client send encrypted session logoff request on seal mount,
Encryption for that response fails.

ksmbd: Could not get encryption key
CIFS: VFS: cifs_put_smb_ses: Session Logoff failure rc=-512

Session lookup fails in ksmbd_get_encryption_key() because sess->state is
set to SMB2_SESSION_EXPIRED in session logoff. There is no need to do
session lookup again to encrypt the response. This patch change to use
ksmbd_session in ksmbd_work.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Namjae Jeon
360c8ee6fe ksmbd: fix endless loop when encryption for response fails
If ->encrypt_resp return error, goto statement cause endless loop.
It send an error response immediately after removing it.

Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Namjae Jeon
d5919f2a14 ksmbd: fill sids in SMB_FIND_FILE_POSIX_INFO response
This patch fill missing sids in SMB_FIND_FILE_POSIX_INFO response.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Namjae Jeon
f6c2b201da ksmbd: set file permission mode to match Samba server posix extension behavior
Set file permission mode to match Samba server posix extension behavior.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Namjae Jeon
5609bdd9ff ksmbd: change security id to the one samba used for posix extension
Samba set SIDOWNER and SIDUNIX_GROUP in create posix context and
set SIDUNIX_USER/GROUP in other sids for posix extension.
This patch change security id to the one samba used.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:44 -05:00
Atte Heikkilä
16b5f54e30 ksmbd: casefold utf-8 share names and fix ascii lowercase conversion
strtolower() corrupts all UTF-8 share names that have a byte in the C0
(À ISO8859-1) to DE (Þ ISO8859-1) range, since the non-ASCII part of
ISO8859-1 is incompatible with UTF-8. Prevent this by checking that a
byte is in the ASCII range with isascii(), before the conversion to
lowercase with tolower(). Properly handle case-insensitivity of UTF-8
share names by casefolding them, but fallback to ASCII lowercase
conversion on failure or if CONFIG_UNICODE is not set. Refactor to move
the share name casefolding immediately after the share name extraction.
Also, make the associated constness corrections.

Signed-off-by: Atte Heikkilä <atteh.mailbox@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:37 -05:00
Christian Brauner
276a3f7cf1 ksmbd: port to vfs{g,u}id_t and associated helpers
A while ago we introduced a dedicated vfs{g,u}id_t type in commit
1e5267cd08 ("mnt_idmapping: add vfs{g,u}id_t"). We already switched
over a good part of the VFS. Ultimately we will remove all legacy
idmapped mount helpers that operate only on k{g,u}id_t in favor of the
new type safe helpers that operate on vfs{g,u}id_t.

Cc: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:37 -05:00
Namjae Jeon
88541cb414 ksmbd: fix incorrect handling of iterate_dir
if iterate_dir() returns non-negative value, caller has to treat it
as normal and check there is any error while populating dentry
information. ksmbd doesn't have to do anything because ksmbd already
checks too small OutputBufferLength to store one file information.

And because ctx->pos is set to file->f_pos when iterative_dir is called,
remove restart_ctx(). And if iterate_dir() return -EIO, which mean
directory entry is corrupted, return STATUS_FILE_CORRUPT_ERROR error
response.

This patch fixes some failure of SMB2_QUERY_DIRECTORY, which happens when
ntfs3 is local filesystem.

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:37 -05:00
Namjae Jeon
823d0d3e2b ksmbd: remove generic_fillattr use in smb2_open()
Removed the use of unneeded generic_fillattr() in smb2_open().

Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:37 -05:00
Al Viro
c22180a5e2 ksmbd: constify struct path
... in particular, there should never be a non-const pointers to
any file->f_path.

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:37 -05:00
Al Viro
369c1634cc ksmbd: don't open-code %pD
a bunch of places used %pd with file->f_path.dentry; shorter (and saner)
way to spell that is %pD with file...

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:37 -05:00
Al Viro
2f5930c1d7 ksmbd: don't open-code file_path()
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-10-05 01:15:37 -05:00
Linus Torvalds
0326074ff4 Networking changes for 6.1.
Core
 ----
 
  - Introduce and use a single page frag cache for allocating small skb
    heads, clawing back the 10-20% performance regression in UDP flood
    test from previous fixes.
 
  - Run packets which already went thru HW coalescing thru SW GRO.
    This significantly improves TCP segment coalescing and simplifies
    deployments as different workloads benefit from HW or SW GRO.
 
  - Shrink the size of the base zero-copy send structure.
 
  - Move TCP init under a new slow / sleepable version of DO_ONCE().
 
 BPF
 ---
 
  - Add BPF-specific, any-context-safe memory allocator.
 
  - Add helpers/kfuncs for PKCS#7 signature verification from BPF
    programs.
 
  - Define a new map type and related helpers for user space -> kernel
    communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).
 
  - Allow targeting BPF iterators to loop through resources of one
    task/thread.
 
  - Add ability to call selected destructive functions.
    Expose crash_kexec() to allow BPF to trigger a kernel dump.
    Use CAP_SYS_BOOT check on the loading process to judge permissions.
 
  - Enable BPF to collect custom hierarchical cgroup stats efficiently
    by integrating with the rstat framework.
 
  - Support struct arguments for trampoline based programs.
    Only structs with size <= 16B and x86 are supported.
 
  - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
    sockets (instead of just TCP and UDP sockets).
 
  - Add a helper for accessing CLOCK_TAI for time sensitive network
    related programs.
 
  - Support accessing network tunnel metadata's flags.
 
  - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.
 
  - Add support for writing to Netfilter's nf_conn:mark.
 
 Protocols
 ---------
 
  - WiFi: more Extremely High Throughput (EHT) and Multi-Link
    Operation (MLO) work (802.11be, WiFi 7).
 
  - vsock: improve support for SO_RCVLOWAT.
 
  - SMC: support SO_REUSEPORT.
 
  - Netlink: define and document how to use netlink in a "modern" way.
    Support reporting missing attributes via extended ACK.
 
  - IPSec: support collect metadata mode for xfrm interfaces.
 
  - TCPv6: send consistent autoflowlabel in SYN_RECV state
    and RST packets.
 
  - TCP: introduce optional per-netns connection hash table to allow
    better isolation between namespaces (opt-in, at the cost of memory
    and cache pressure).
 
  - MPTCP: support TCP_FASTOPEN_CONNECT.
 
  - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.
 
  - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.
 
  - Open vSwitch:
    - Allow specifying ifindex of new interfaces.
    - Allow conntrack and metering in non-initial user namespace.
 
  - TLS: support the Korean ARIA-GCM crypto algorithm.
 
  - Remove DECnet support.
 
 Driver API
 ----------
 
  - Allow selecting the conduit interface used by each port
    in DSA switches, at runtime.
 
  - Ethernet Power Sourcing Equipment and Power Device support.
 
  - Add tc-taprio support for queueMaxSDU parameter, i.e. setting
    per traffic class max frame size for time-based packet schedules.
 
  - Support PHY rate matching - adapting between differing host-side
    and link-side speeds.
 
  - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.
 
  - Validate OF (device tree) nodes for DSA shared ports; make
    phylink-related properties mandatory on DSA and CPU ports.
    Enforcing more uniformity should allow transitioning to phylink.
 
  - Require that flash component name used during update matches one
    of the components for which version is reported by info_get().
 
  - Remove "weight" argument from driver-facing NAPI API as much
    as possible. It's one of those magic knobs which seemed like
    a good idea at the time but is too indirect to use in practice.
 
  - Support offload of TLS connections with 256 bit keys.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Microchip KSZ9896 6-port Gigabit Ethernet Switch
    - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
    - Analog Devices ADIN1110 and ADIN2111 industrial single pair
      Ethernet (10BASE-T1L) MAC+PHY.
    - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).
 
  - Ethernet SFPs / modules:
    - RollBall / Hilink / Turris 10G copper SFPs
    - HALNy GPON module
 
  - WiFi:
    - CYW43439 SDIO chipset (brcmfmac)
    - CYW89459 PCIe chipset (brcmfmac)
    - BCM4378 on Apple platforms (brcmfmac)
 
 Drivers
 -------
 
  - CAN:
    - gs_usb: HW timestamp support
 
  - Ethernet PHYs:
    - lan8814: cable diagnostics
 
  - Ethernet NICs:
    - Intel (100G):
      - implement control of FCS/CRC stripping
      - port splitting via devlink
      - L2TPv3 filtering offload
    - nVidia/Mellanox:
      - tunnel offload for sub-functions
      - MACSec offload, w/ Extended packet number and replay
        window offload
      - significantly restructure, and optimize the AF_XDP support,
        align the behavior with other vendors
    - Huawei:
      - configuring DSCP map for traffic class selection
      - querying standard FEC statistics
      - querying SerDes lane number via ethtool
    - Marvell/Cavium:
      - egress priority flow control
      - MACSec offload
    - AMD/SolarFlare:
      - PTP over IPv6 and raw Ethernet
    - small / embedded:
      - ax88772: convert to phylink (to support SFP cages)
      - altera: tse: convert to phylink
      - ftgmac100: support fixed link
      - enetc: standard Ethtool counters
      - macb: ZynqMP SGMII dynamic configuration support
      - tsnep: support multi-queue and use page pool
      - lan743x: Rx IP & TCP checksum offload
      - igc: add xdp frags support to ndo_xdp_xmit
 
  - Ethernet high-speed switches:
    - Marvell (prestera):
      - support SPAN port features (traffic mirroring)
      - nexthop object offloading
    - Microchip (sparx5):
      - multicast forwarding offload
      - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - support RGMII cmode
    - NXP (felix):
      - standardized ethtool counters
    - Microchip (lan966x):
      - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
      - traffic policing and mirroring
      - link aggregation / bonding offload
      - QUSGMII PHY mode support
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - cold boot calibration support on WCN6750
    - support to connect to a non-transmit MBSSID AP profile
    - enable remain-on-channel support on WCN6750
    - Wake-on-WLAN support for WCN6750
    - support to provide transmit power from firmware via nl80211
    - support to get power save duration for each client
    - spectral scan support for 160 MHz
 
  - MediaTek WiFi (mt76):
    - WiFi-to-Ethernet bridging offload for MT7986 chips
 
  - RealTek WiFi (rtw89):
    - P2P support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmM7vtkACgkQMUZtbf5S
 Irvotg//dmh53rC+UMKO3OgOqPlSMnaqzbUdDEfN6mj4Mpox7Csb8zERVURHhBHY
 fvlXWsDgxmvgTebI5fvNC5+f1iW5xcqgJV2TWnNmDOKWwvQwb6qQfgixVmunvkpe
 IIukMXYt0dAf9bXeeEfbNXcCb85cPwB76stX0tMV6BX7osp3T0TL1fvFk0NJkL0j
 TeydLad/yAQtPb4TbeWYjNDoxPVDf0cVpUrevLGmWE88UMYmgTqPze+h1W5Wri52
 bzjdLklY/4cgcIZClHQ6F9CeRWqEBxvujA5Hj/cwOcn/ptVVJWUGi7sQo3sYkoSs
 HFu+F8XsTec14kGNC0Ab40eVdqs5l/w8+E+4jvgXeKGOtVns8DwoiUIzqXpyty89
 Ib04mffrwWNjFtHvo/kIsNwP05X2PGE9HUHfwsTUfisl/ASvMmQp7D7vUoqQC/4B
 AMVzT5qpjkmfBHYQQGuw8FxJhMeAOjC6aAo6censhXJyiUhIfleQsN0syHdaNb8q
 9RZlhAgQoVb6ZgvBV8r8unQh/WtNZ3AopwifwVJld2unsE/UNfQy2KyqOWBES/zf
 LP9sfuX0JnmHn8s1BQEUMPU1jF9ZVZCft7nufJDL6JhlAL+bwZeEN4yCiAHOPZqE
 ymSLHI9s8yWZoNpuMWKrI9kFexVnQFKmA3+quAJUcYHNMSsLkL8=
 =Gsio
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Introduce and use a single page frag cache for allocating small skb
     heads, clawing back the 10-20% performance regression in UDP flood
     test from previous fixes.

   - Run packets which already went thru HW coalescing thru SW GRO. This
     significantly improves TCP segment coalescing and simplifies
     deployments as different workloads benefit from HW or SW GRO.

   - Shrink the size of the base zero-copy send structure.

   - Move TCP init under a new slow / sleepable version of DO_ONCE().

  BPF:

   - Add BPF-specific, any-context-safe memory allocator.

   - Add helpers/kfuncs for PKCS#7 signature verification from BPF
     programs.

   - Define a new map type and related helpers for user space -> kernel
     communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).

   - Allow targeting BPF iterators to loop through resources of one
     task/thread.

   - Add ability to call selected destructive functions. Expose
     crash_kexec() to allow BPF to trigger a kernel dump. Use
     CAP_SYS_BOOT check on the loading process to judge permissions.

   - Enable BPF to collect custom hierarchical cgroup stats efficiently
     by integrating with the rstat framework.

   - Support struct arguments for trampoline based programs. Only
     structs with size <= 16B and x86 are supported.

   - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
     sockets (instead of just TCP and UDP sockets).

   - Add a helper for accessing CLOCK_TAI for time sensitive network
     related programs.

   - Support accessing network tunnel metadata's flags.

   - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.

   - Add support for writing to Netfilter's nf_conn:mark.

  Protocols:

   - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation
     (MLO) work (802.11be, WiFi 7).

   - vsock: improve support for SO_RCVLOWAT.

   - SMC: support SO_REUSEPORT.

   - Netlink: define and document how to use netlink in a "modern" way.
     Support reporting missing attributes via extended ACK.

   - IPSec: support collect metadata mode for xfrm interfaces.

   - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST
     packets.

   - TCP: introduce optional per-netns connection hash table to allow
     better isolation between namespaces (opt-in, at the cost of memory
     and cache pressure).

   - MPTCP: support TCP_FASTOPEN_CONNECT.

   - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.

   - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.

   - Open vSwitch:
      - Allow specifying ifindex of new interfaces.
      - Allow conntrack and metering in non-initial user namespace.

   - TLS: support the Korean ARIA-GCM crypto algorithm.

   - Remove DECnet support.

  Driver API:

   - Allow selecting the conduit interface used by each port in DSA
     switches, at runtime.

   - Ethernet Power Sourcing Equipment and Power Device support.

   - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per
     traffic class max frame size for time-based packet schedules.

   - Support PHY rate matching - adapting between differing host-side
     and link-side speeds.

   - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.

   - Validate OF (device tree) nodes for DSA shared ports; make
     phylink-related properties mandatory on DSA and CPU ports.
     Enforcing more uniformity should allow transitioning to phylink.

   - Require that flash component name used during update matches one of
     the components for which version is reported by info_get().

   - Remove "weight" argument from driver-facing NAPI API as much as
     possible. It's one of those magic knobs which seemed like a good
     idea at the time but is too indirect to use in practice.

   - Support offload of TLS connections with 256 bit keys.

  New hardware / drivers:

   - Ethernet:
      - Microchip KSZ9896 6-port Gigabit Ethernet Switch
      - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
      - Analog Devices ADIN1110 and ADIN2111 industrial single pair
        Ethernet (10BASE-T1L) MAC+PHY.
      - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).

   - Ethernet SFPs / modules:
      - RollBall / Hilink / Turris 10G copper SFPs
      - HALNy GPON module

   - WiFi:
      - CYW43439 SDIO chipset (brcmfmac)
      - CYW89459 PCIe chipset (brcmfmac)
      - BCM4378 on Apple platforms (brcmfmac)

  Drivers:

   - CAN:
      - gs_usb: HW timestamp support

   - Ethernet PHYs:
      - lan8814: cable diagnostics

   - Ethernet NICs:
      - Intel (100G):
         - implement control of FCS/CRC stripping
         - port splitting via devlink
         - L2TPv3 filtering offload
      - nVidia/Mellanox:
         - tunnel offload for sub-functions
         - MACSec offload, w/ Extended packet number and replay window
           offload
         - significantly restructure, and optimize the AF_XDP support,
           align the behavior with other vendors
      - Huawei:
         - configuring DSCP map for traffic class selection
         - querying standard FEC statistics
         - querying SerDes lane number via ethtool
      - Marvell/Cavium:
         - egress priority flow control
         - MACSec offload
      - AMD/SolarFlare:
         - PTP over IPv6 and raw Ethernet
      - small / embedded:
         - ax88772: convert to phylink (to support SFP cages)
         - altera: tse: convert to phylink
         - ftgmac100: support fixed link
         - enetc: standard Ethtool counters
         - macb: ZynqMP SGMII dynamic configuration support
         - tsnep: support multi-queue and use page pool
         - lan743x: Rx IP & TCP checksum offload
         - igc: add xdp frags support to ndo_xdp_xmit

   - Ethernet high-speed switches:
      - Marvell (prestera):
         - support SPAN port features (traffic mirroring)
         - nexthop object offloading
      - Microchip (sparx5):
         - multicast forwarding offload
         - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - support RGMII cmode
      - NXP (felix):
         - standardized ethtool counters
      - Microchip (lan966x):
         - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
         - traffic policing and mirroring
         - link aggregation / bonding offload
         - QUSGMII PHY mode support

   - Qualcomm 802.11ax WiFi (ath11k):
      - cold boot calibration support on WCN6750
      - support to connect to a non-transmit MBSSID AP profile
      - enable remain-on-channel support on WCN6750
      - Wake-on-WLAN support for WCN6750
      - support to provide transmit power from firmware via nl80211
      - support to get power save duration for each client
      - spectral scan support for 160 MHz

   - MediaTek WiFi (mt76):
      - WiFi-to-Ethernet bridging offload for MT7986 chips

   - RealTek WiFi (rtw89):
      - P2P support"

* tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits)
  eth: pse: add missing static inlines
  once: rename _SLOW to _SLEEPABLE
  net: pse-pd: add regulator based PSE driver
  dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller
  ethtool: add interface to interact with Ethernet Power Equipment
  net: mdiobus: search for PSE nodes by parsing PHY nodes.
  net: mdiobus: fwnode_mdiobus_register_phy() rework error handling
  net: add framework to support Ethernet PSE and PDs devices
  dt-bindings: net: phy: add PoDL PSE property
  net: marvell: prestera: Propagate nh state from hw to kernel
  net: marvell: prestera: Add neighbour cache accounting
  net: marvell: prestera: add stub handler neighbour events
  net: marvell: prestera: Add heplers to interact with fib_notifier_info
  net: marvell: prestera: Add length macros for prestera_ip_addr
  net: marvell: prestera: add delayed wq and flush wq on deinit
  net: marvell: prestera: Add strict cleanup of fib arbiter
  net: marvell: prestera: Add cleanup of allocated fib_nodes
  net: marvell: prestera: Add router nexthops ABI
  eth: octeon: fix build after netif_napi_add() changes
  net/mlx5: E-Switch, Return EBUSY if can't get mode lock
  ...
2022-10-04 13:38:03 -07:00
Chao Yu
95fa90c9e5 f2fs: support recording errors into superblock
This patch supports to record detail reason of FSCORRUPTED error into
f2fs_super_block.s_errors[].

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:45 -07:00
Chao Yu
a9cfee0ef9 f2fs: support recording stop_checkpoint reason into super_block
This patch supports to record stop_checkpoint error into
f2fs_super_block.s_stop_reason[].

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:44 -07:00
Zhang Qilong
ca7efd71c3 f2fs: remove the unnecessary check in f2fs_xattr_fiemap
Whehter or not error occurs, checking "err == 1" is unnecessary
in f2fs_xattr_fiemap(), and just remove it here.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:44 -07:00
Chao Yu
718693c84d f2fs: introduce cp_status sysfs entry
This patch adds a new sysfs entry named cp_status, it can output
checkpoint flags in real time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:44 -07:00
Chao Yu
fcc2d8cc96 f2fs: fix to detect corrupted meta ino
It is possible that ino of dirent or orphan inode is corrupted in a
fuzzed image, occasionally, if corrupted ino is equal to meta ino:
meta_ino, node_ino or compress_ino, caller of f2fs_iget() from below
call paths will get meta inode directly, it's not allowed, let's
add sanity check to detect such cases.

case #1
- recover_dentry
 - __f2fs_find_entry
 - f2fs_iget_retry

case #2
- recover_orphan_inode
 - f2fs_iget_retry

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:44 -07:00
Chao Yu
d80afefb17 f2fs: fix to account FS_CP_DATA_IO correctly
f2fs_inode_info.cp_task was introduced for FS_CP_DATA_IO accounting
since commit b0af6d491a ("f2fs: add app/fs io stat").

However, cp_task usage coverage has been increased due to below
commits:
commit 040d2bb318 ("f2fs: fix to avoid deadloop if data_flush is on")
commit 186857c5a1 ("f2fs: fix potential recursive call when enabling data_flush")

So that, if data_flush mountoption is on, when data flush was
triggered from background, the IO from data flush will be accounted
as checkpoint IO type incorrectly.

In order to fix this issue, this patch splits cp_task into two:
a) cp_task: used for IO accounting
b) wb_task: used to avoid deadlock

Fixes: 040d2bb318 ("f2fs: fix to avoid deadloop if data_flush is on")
Fixes: 186857c5a1 ("f2fs: fix potential recursive call when enabling data_flush")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:44 -07:00
Zhang Qilong
544b53dadc f2fs: code clean and fix a type error
ERROR: code indent should use tabs where possible
ERROR: spaces required around that ':'
ERROR: incorrect tab

Found serveral code type errors when review the code and fix it.
There is no function change.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:44 -07:00
Zhang Qilong
a834aa3ec9 f2fs: add "c_len" into trace_f2fs_update_extent_tree_range for compressed file
The trace_f2fs_update_extent_tree_range could not record compressed
block length in the cluster of compress file and we just add it.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:43 -07:00
Chao Yu
c6ad7fd166 f2fs: fix to do sanity check on summary info
As Wenqing Liu reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=216456

BUG: KASAN: use-after-free in recover_data+0x63ae/0x6ae0 [f2fs]
Read of size 4 at addr ffff8881464dcd80 by task mount/1013

CPU: 3 PID: 1013 Comm: mount Tainted: G        W          6.0.0-rc4 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Call Trace:
 dump_stack_lvl+0x45/0x5e
 print_report.cold+0xf3/0x68d
 kasan_report+0xa8/0x130
 recover_data+0x63ae/0x6ae0 [f2fs]
 f2fs_recover_fsync_data+0x120d/0x1fc0 [f2fs]
 f2fs_fill_super+0x4665/0x61e0 [f2fs]
 mount_bdev+0x2cf/0x3b0
 legacy_get_tree+0xed/0x1d0
 vfs_get_tree+0x81/0x2b0
 path_mount+0x47e/0x19d0
 do_mount+0xce/0xf0
 __x64_sys_mount+0x12c/0x1a0
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root cause is: in fuzzed image, SSA table is corrupted: ofs_in_node
is larger than ADDRS_PER_PAGE(), result in out-of-range access on 4k-size
page.

- recover_data
 - do_recover_data
  - check_index_in_prev_nodes
   - f2fs_data_blkaddr

This patch adds sanity check on summary info in recovery and GC flow
in where the flows rely on them.

After patch:
[   29.310883] F2FS-fs (loop0): Inconsistent ofs_in_node:65286 in summary, ino:0, nid:6, max:1018

Cc: stable@vger.kernel.org
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:43 -07:00
Christian Brauner
1e8a9191cc f2fs: port to vfs{g,u}id_t and associated helpers
A while ago we introduced a dedicated vfs{g,u}id_t type in commit
1e5267cd08 ("mnt_idmapping: add vfs{g,u}id_t"). We already switched
over a good part of the VFS. Ultimately we will remove all legacy
idmapped mount helpers that operate only on k{g,u}id_t in favor of the
new type safe helpers that operate on vfs{g,u}id_t.

Cc: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:43 -07:00
Chao Yu
0ef4ca04a3 f2fs: fix to do sanity check on destination blkaddr during recovery
As Wenqing Liu reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=216456

loop5: detected capacity change from 0 to 131072
F2FS-fs (loop5): recover_inode: ino = 6, name = hln, inline = 1
F2FS-fs (loop5): recover_data: ino = 6 (i_size: recover) err = 0
F2FS-fs (loop5): recover_inode: ino = 6, name = hln, inline = 1
F2FS-fs (loop5): recover_data: ino = 6 (i_size: recover) err = 0
F2FS-fs (loop5): recover_inode: ino = 6, name = hln, inline = 1
F2FS-fs (loop5): recover_data: ino = 6 (i_size: recover) err = 0
F2FS-fs (loop5): Bitmap was wrongly set, blk:5634
------------[ cut here ]------------
WARNING: CPU: 3 PID: 1013 at fs/f2fs/segment.c:2198
RIP: 0010:update_sit_entry+0xa55/0x10b0 [f2fs]
Call Trace:
 <TASK>
 f2fs_do_replace_block+0xa98/0x1890 [f2fs]
 f2fs_replace_block+0xeb/0x180 [f2fs]
 recover_data+0x1a69/0x6ae0 [f2fs]
 f2fs_recover_fsync_data+0x120d/0x1fc0 [f2fs]
 f2fs_fill_super+0x4665/0x61e0 [f2fs]
 mount_bdev+0x2cf/0x3b0
 legacy_get_tree+0xed/0x1d0
 vfs_get_tree+0x81/0x2b0
 path_mount+0x47e/0x19d0
 do_mount+0xce/0xf0
 __x64_sys_mount+0x12c/0x1a0
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

If we enable CONFIG_F2FS_CHECK_FS config, it will trigger a kernel panic
instead of warning.

The root cause is: in fuzzed image, SIT table is inconsistent with inode
mapping table, result in triggering such warning during SIT table update.

This patch introduces a new flag DATA_GENERIC_ENHANCE_UPDATE, w/ this
flag, data block recovery flow can check destination blkaddr's validation
in SIT table, and skip f2fs_replace_block() to avoid inconsistent status.

Cc: stable@vger.kernel.org
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:43 -07:00
Weichao Guo
f3b23c785a f2fs: let FI_OPU_WRITE override FADVISE_COLD_BIT
Cold files may be fragmented due to SSR, defragment is needed as
sequential reads are dominant scenarios of these files. FI_OPU_WRITE
should override FADVISE_COLD_BIT to avoid defragment fails.

Signed-off-by: Weichao Guo <guoweichao@oppo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:43 -07:00
Zhang Qilong
07725adc55 f2fs: fix race condition on setting FI_NO_EXTENT flag
The following scenarios exist.
process A:               process B:
->f2fs_drop_extent_tree  ->f2fs_update_extent_cache_range
                          ->f2fs_update_extent_tree_range
                           ->write_lock
 ->set_inode_flag
                           ->is_inode_flag_set
                           ->__free_extent_tree // Shouldn't
                                                // have been
                                                // cleaned up
                                                // here
  ->write_lock

In this case, the "FI_NO_EXTENT" flag is set between
f2fs_update_extent_tree_range and is_inode_flag_set
by other process. it leads to clearing the whole exten
tree which should not have happened. And we fix it by
move the setting it to the range of write_lock.

Fixes:5f281fab9b9a3 ("f2fs: disable extent_cache for fcollapse/finsert inodes")
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:43 -07:00
Zhang Qilong
9df6d6f9be f2fs: remove redundant check in f2fs_sanity_check_cluster
It have checked "compressed" at the entry of
f2fs_sanity_check_cluster, just remove the
redundant check for better performance here.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Zhang Qilong
049ea86cb5 f2fs: add static init_idisk_time function to reduce the code
We can use a inner function to init the disk time
of f2fs_inode_info for cleaning redundant code.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Yonggil Song
d382e36970 f2fs: fix typo
Fix typo in f2fs.h
Detected by Jaeyoon Choi

Signed-off-by: Yonggil Song <yonggil.song@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Shuqi Zhang
9b7eadd9bd f2fs: fix wrong dirty page count when race between mmap and fallocate.
This is a BUG_ON issue as follows when running xfstest-generic-503:
WARNING: CPU: 21 PID: 1385 at fs/f2fs/inode.c:762 f2fs_evict_inode+0x847/0xaa0
Modules linked in:
CPU: 21 PID: 1385 Comm: umount Not tainted 5.19.0-rc5+ #73
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014

Call Trace:
evict+0x129/0x2d0
dispose_list+0x4f/0xb0
evict_inodes+0x204/0x230
generic_shutdown_super+0x5b/0x1e0
kill_block_super+0x29/0x80
kill_f2fs_super+0xe6/0x140
deactivate_locked_super+0x44/0xc0
deactivate_super+0x79/0x90
cleanup_mnt+0x114/0x1a0
__cleanup_mnt+0x16/0x20
task_work_run+0x98/0x100
exit_to_user_mode_prepare+0x3d0/0x3e0
syscall_exit_to_user_mode+0x12/0x30
do_syscall_64+0x42/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0

Function flow analysis when BUG occurs:
f2fs_fallocate                    mmap
                                  do_page_fault
                                    pte_spinlock  // ---lock_pte
                                    do_wp_page
                                      wp_page_shared
                                        pte_unmap_unlock   // unlock_pte
                                          do_page_mkwrite
                                          f2fs_vm_page_mkwrite
                                            down_read(invalidate_lock)
                                            lock_page
                                            if (PageMappedToDisk(page))
                                              goto out;
                                            // set_page_dirty  --NOT RUN
                                            out: up_read(invalidate_lock);
                                        finish_mkwrite_fault // unlock_pte
f2fs_collapse_range
  down_write(i_mmap_sem)
  truncate_pagecache
    unmap_mapping_pages
      i_mmap_lock_write // down_write(i_mmap_rwsem)
        ......
        zap_pte_range
          pte_offset_map_lock // ---lock_pte
           set_page_dirty
            f2fs_dirty_data_folio
              if (!folio_test_dirty(folio)) {
                                        fault_dirty_shared_page
                                          set_page_dirty
                                            f2fs_dirty_data_folio
                                              if (!folio_test_dirty(folio)) {
                                                filemap_dirty_folio
                                                f2fs_update_dirty_folio // ++
                                              }
                                            unlock_page
                filemap_dirty_folio
                f2fs_update_dirty_folio // page count++
              }
          pte_unmap_unlock  // --unlock_pte
      i_mmap_unlock_write  // up_write(i_mmap_rwsem)
  truncate_inode_pages
  up_write(i_mmap_sem)

When race happens between mmap-do_page_fault-wp_page_shared and
fallocate-truncate_pagecache-zap_pte_range, the zap_pte_range calls
function set_page_dirty without page lock. Besides, though
truncate_pagecache has immap and pte lock, wp_page_shared calls
fault_dirty_shared_page without any. In this case, two threads race
in f2fs_dirty_data_folio function. Page is set to dirty only ONCE,
but the count is added TWICE by calling filemap_dirty_folio.
Thus the count of dirty page cannot accord with the real dirty pages.

Following is the solution to in case of race happens without any lock.
Since folio_test_set_dirty in filemap_dirty_folio is atomic, judge return
value will not be at risk of race.

Signed-off-by: Shuqi Zhang <zhangshuqi3@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Zhang Qilong
173cdf2c32 f2fs: use COMPRESS_MAPPING to get compress cache mapping
Just use the defined COMPRESS_MAPPING to get compress cache
mapping instaed of direct accessing name.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Zhang Qilong
280dfeae56 f2fs: return the tmp_ptr directly in __bitmap_ptr
Just return tmp_ptr here, it's no need to dereference
checkpoint pointer again.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Xiubo Li
aa87052dd9 ceph: fix incorrectly showing the .snap size for stat
We should set the 'stat->size' to the real number of snapshots for
snapdirs.

Link: https://tracker.ceph.com/issues/57342
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04 19:18:08 +02:00
Xiubo Li
bd04b9192e ceph: fail the open_by_handle_at() if the dentry is being unlinked
When unlinking a file the kclient will send a unlink request to MDS
by holding the dentry reference, and then the MDS will return 2 replies,
which are unsafe reply and a deferred safe reply.

After the unsafe reply received the kernel will return and succeed
the unlink request to user space apps.

Only when the safe reply received the dentry's reference will be
released. Or the dentry will only be unhashed from dcache. But when
the open_by_handle_at() begins to open the unlinked files it will
succeed.

The inode->i_count couldn't be used to check whether the inode is
opened or not.

Link: https://tracker.ceph.com/issues/56524
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Tested-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04 19:18:08 +02:00
Jeff Layton
b4b924c7a1 ceph: increment i_version when doing a setattr with caps
When the client has enough caps to satisfy a setattr locally without
having to talk to the server, we currently do the setattr without
incrementing the change attribute.

Ensure that if the ctime changes locally, then the change attribute
does too.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04 19:18:08 +02:00
Kenneth Lee
aa1d627207 ceph: Use kcalloc for allocating multiple elements
Prefer using kcalloc(a, b) over kzalloc(a * b) as this improves
semantics since kcalloc is intended for allocating an array of memory.

Signed-off-by: Kenneth Lee <klee33@uw.edu>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04 19:18:08 +02:00
Xiubo Li
7c3ea9870e ceph: no need to wait for transition RDCACHE|RD -> RD
For write when trying to get the Fwb caps we need to keep waiting
on transition from WRBUFFER|WR -> WR to avoid a new WR sync write
from going before a prior buffered writeback happens.

While for read there is no need to wait on transition from
RDCACHE|RD -> RD, and we can just exclude the revoking caps and
force to sync read.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04 19:18:08 +02:00
Xiubo Li
6eb06c4621 ceph: fail the request if the peer MDS doesn't support getvxattr op
Just fail the request instead sending the request out, or the peer
MDS will crash.

Link: https://tracker.ceph.com/issues/56529
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04 19:18:08 +02:00
Xiubo Li
f791357330 ceph: wake up the waiters if any new caps comes
When new caps comes we need to wake up the waiters and also when
revoking the caps, there also could be new caps comes.

Link: https://tracker.ceph.com/issues/54044
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04 19:18:08 +02:00
Petr Mladek
c77ae0b863 Merge branch 'rework/kthreads' into for-linus 2022-10-04 15:55:55 +02:00
Darrick J. Wong
e033f40be2 xfs: on memory failure, only shut down fs after scanning all mappings
xfs_dax_failure_fn is used to scan the filesystem during a memory
failure event to look for memory mappings to revoke.  Unfortunately,
if it encounters an rmap record for filesystem metadata, it will
shut down the filesystem and the scan immediately.  This means that
we don't complete the mapping revocation scan and instead leave live
mappings to failed memory.  Fix the function to defer the shutdown
until after we've finished culling mappings.

While we're at it, add the usual "xfs_" prefix to struct
failure_info, and actually initialize mf_flags.

Fixes: 6f643c57d5 ("xfs: implement ->notify_failure() for XFS")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-10-04 16:40:01 +11:00
Shida Zhang
c098576f5f xfs: rearrange the logic and remove the broken comment for xfs_dir2_isxx
xfs_dir2_isleaf is used to see if the directory is a single-leaf
form directory instead, as commented right above the function.

Besides getting rid of the broken comment, we rearrange the logic by
converting everything over to standard formatting and conventions,
at the same time, to make it easier to understand and self documenting.

Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-10-04 16:39:58 +11:00
Shida Zhang
44159659df xfs: trim the mapp array accordingly in xfs_da_grow_inode_int
Take a look at the for-loop in xfs_da_grow_inode_int:
======
for(){
        nmap = min(XFS_BMAP_MAX_NMAP, count);
        ...
        error = xfs_bmapi_write(...,&mapp[mapi], &nmap);//(..., $1, $2)
        ...
        mapi += nmap;
}
=====
where $1 stands for the start address of the array,
while $2 is used to indicate the size of the array.

The array $1 will advance by $nmap in each iteration after
the allocation of extents.
But the size $2 still remains unchanged, which is determined by
min(XFS_BMAP_MAX_NMAP, count).

It seems that it has forgotten to trim the mapp array after each
iteration, so change it.

Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-10-04 16:39:42 +11:00
Linus Torvalds
725737e7c2 STATX_DIOALIGN for 6.1
Make statx() support reporting direct I/O (DIO) alignment information.
 This provides a generic interface for userspace programs to determine
 whether a file supports DIO, and if so with what alignment restrictions.
 Specifically, STATX_DIOALIGN works on block devices, and on regular
 files when their containing filesystem has implemented support.
 
 An interface like this has been requested for years, since the
 conditions for when DIO is supported in Linux have gotten increasingly
 complex over time.  Today, DIO support and alignment requirements can be
 affected by various filesystem features such as multi-device support,
 data journalling, inline data, encryption, verity, compression,
 checkpoint disabling, log-structured mode, etc.  Further complicating
 things, Linux v6.0 relaxed the traditional rule of DIO needing to be
 aligned to the block device's logical block size; now user buffers (but
 not file offsets) only need to be aligned to the DMA alignment.
 
 The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was
 discarded in favor of creating a clean new interface with statx().
 
 For more information, see the individual commits and the man page update
 https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYzpV2xQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKwF1AQDetPX5hyuq0/mwikOywLTTJsoHgGY5
 euO+dISqjH/InwD9HAQqfPRkdM1j4ml82BjjkAfrhzZXOOWPKJm0zOhMIQg=
 =0Oav
 -----END PGP SIGNATURE-----

Merge tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull STATX_DIOALIGN support from Eric Biggers:
 "Make statx() support reporting direct I/O (DIO) alignment information.

  This provides a generic interface for userspace programs to determine
  whether a file supports DIO, and if so with what alignment
  restrictions. Specifically, STATX_DIOALIGN works on block devices, and
  on regular files when their containing filesystem has implemented
  support.

  An interface like this has been requested for years, since the
  conditions for when DIO is supported in Linux have gotten increasingly
  complex over time. Today, DIO support and alignment requirements can
  be affected by various filesystem features such as multi-device
  support, data journalling, inline data, encryption, verity,
  compression, checkpoint disabling, log-structured mode, etc.

  Further complicating things, Linux v6.0 relaxed the traditional rule
  of DIO needing to be aligned to the block device's logical block size;
  now user buffers (but not file offsets) only need to be aligned to the
  DMA alignment.

  The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was
  discarded in favor of creating a clean new interface with statx().

  For more information, see the individual commits and the man page
  update[1]"

Link: https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org [1]

* tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  xfs: support STATX_DIOALIGN
  f2fs: support STATX_DIOALIGN
  f2fs: simplify f2fs_force_buffered_io()
  f2fs: move f2fs_force_buffered_io() into file.c
  ext4: support STATX_DIOALIGN
  fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN
  vfs: support STATX_DIOALIGN on block devices
  statx: add direct I/O alignment information
2022-10-03 20:33:41 -07:00
Linus Torvalds
5779aa2dac fsverity updates for 6.1
Minor changes to convert uses of kmap() to kmap_local_page().
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYzpKvRQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK9KSAP9PyrI3kDLBpiG9os3HIZtHyPHgt6OZ
 FA978i0UuAxgHAD7BPiIT55oBdOrn6CVy2g8PkwmkcKQx0kvhVQq8Pyz5ww=
 =49pm
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fsverity updates from Eric Biggers:
 "Minor changes to convert uses of kmap() to kmap_local_page()"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fs-verity: use kmap_local_page() instead of kmap()
  fs-verity: use memcpy_from_page()
2022-10-03 20:27:34 -07:00
Linus Torvalds
438b2cdd17 fscrypt updates for 6.1
This release contains some implementation changes, but no new features:
 
 - Rework the implementation of the fscrypt filesystem-level keyring to
   not be as tightly coupled to the keyrings subsystem.  This resolves
   several issues.
 
 - Eliminate most direct uses of struct request_queue from fs/crypto/,
   since struct request_queue is considered to be a block layer
   implementation detail.
 
 - Stop using the PG_error flag to track decryption failures.  This is a
   prerequisite for freeing up PG_error for other uses.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYzpMMRQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKxYbAP0VrWjlqonO75gYkIxwX0aTxajoKC3m
 awUDAC/feQ910gD6A4WbJivanLngJKgcxfbhN5paalZJEGNOBBrOUB1WLgs=
 =CxSh
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:
 "This release contains some implementation changes, but no new
  features:

   - Rework the implementation of the fscrypt filesystem-level keyring
     to not be as tightly coupled to the keyrings subsystem. This
     resolves several issues.

   - Eliminate most direct uses of struct request_queue from fs/crypto/,
     since struct request_queue is considered to be a block layer
     implementation detail.

   - Stop using the PG_error flag to track decryption failures. This is
     a prerequisite for freeing up PG_error for other uses"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: work on block_devices instead of request_queues
  fscrypt: stop holding extra request_queue references
  fscrypt: stop using keyrings subsystem for fscrypt_master_key
  fscrypt: stop using PG_error to track error status
  fscrypt: remove fscrypt_set_test_dummy_encryption()
2022-10-03 20:18:34 -07:00
Linus Torvalds
f4309528f3 dlm for 6.1
This set of commits includes:
 . Fix a couple races found with a new torture test.
 . Improve errors when api functions are used incorrectly.
 . Improve tracing for lock requests from user space.
 . Fix use after free in recently added tracing code.
 . Small internal code cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJjOyfeAAoJEDgbc8f8gGmqHF4QALKGo+95JGzfXN37dNL2ve8L
 DAKxESYIwaTEWuKxmD4AGogClEl55UoC8kxMB3dHwLZEd4U0v5ZDULR6NUYXMpos
 6miaoF+pJfBnpNRqpCieWRW5dYXD4TwSdquv5rUSmUBrdOSy34s/nORWB4kL443K
 hFPcbo5Mv1L0W70/+gdj1uBlBsenZxnXu6aEmrckONqwj9Q2SBjJTik9WuNwh+FF
 tEcmUt8kDanGkbwtMCxnbT3HDOdfQyW+qq4IJ6MOYHlW9Cqbp9QUvAIho4DEpr7f
 eGurQ/urSD3dltzuYQcZ81zGhaGxzaRt5d2AEHRrGugQ2ZvnsG74oSAmEINZTSw4
 RV2EXyJ4hXcXK/yJXo3fGzFm2/5JFvYhnvddo6wts3vQZHwefExIRCHVz2cJL9eS
 gFpfFu4uB8z7w7l9s9LJKv7cTriaDd1WHuIWZGonz3wlFSUOn7IxunDxM3Hc5YO3
 okawhr6sWe03fFcKsw1WeWymfDUwmk/7OV15OSDanItAwX5vkBYDBvAcA/cwm8cj
 P0Vb3c1/Sf1IjjHGGA13vHpD1JXJ7FHafg6jyWmjJNqaS+wtShvs2As9MqbtSWMb
 o2OcYTEEzME4mMIXZzVlKP7hhkLMaVR5PwGmbPovlyAkEUX0soH7nefyLMAqP3JG
 7VZYV46VCL7wm3yjrKYw
 =sL1G
 -----END PGP SIGNATURE-----

Merge tag 'dlm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:

 - Fix a couple races found with a new torture test

 - Improve errors when api functions are used incorrectly

 - Improve tracing for lock requests from user space

 - Fix use after free in recently added tracing cod.

 - Small internal code cleanups

* tag 'dlm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  fs: dlm: fix possible use after free if tracing
  fs: dlm: const void resource name parameter
  fs: dlm: LSFL_CB_DELAY only for kernel lockspaces
  fs: dlm: remove DLM_LSFL_FS from uapi
  fs: dlm: trace user space callbacks
  fs: dlm: change ls_clear_proc_locks to spinlock
  fs: dlm: remove dlm_del_ast prototype
  fs: dlm: handle rcom in else if branch
  fs: dlm: allow lockspaces have zero lvblen
  fs: dlm: fix invalid derefence of sb_lvbptr
  fs: dlm: handle -EINVAL as log_error()
  fs: dlm: use __func__ for function name
  fs: dlm: handle -EBUSY first in unlock validation
  fs: dlm: handle -EBUSY first in lock arg validation
  fs: dlm: fix race between test_bit() and queue_work()
  fs: dlm: fix race in lowcomms
2022-10-03 20:11:59 -07:00
Linus Torvalds
f90497a16e NFSD 6.1 Release Notes
This release is mostly bug fixes, clean-ups, and optimizations.
 
 One notable set of fixes addresses a subtle buffer overflow issue
 that occurs if a small RPC Call message arrives in an oversized
 RPC record. This is only possible on a framed RPC transport such
 as TCP.
 
 Because NFSD shares the receive and send buffers in one set of
 pages, an oversized RPC record steals pages from the send buffer
 that will be used to construct the RPC Reply message. NFSD must
 not assume that a full-sized buffer is always available to it;
 otherwise, it will walk off the end of the send buffer while
 constructing its reply.
 
 In this release, we also introduce the ability for the server to
 wait a moment for clients to return delegations before it responds
 with NFS4ERR_DELAY. This saves a retransmit and a network round-
 trip when a delegation recall is needed. This work will be built
 upon in future releases.
 
 The NFS server adds another shrinker to its collection. Because
 courtesy clients can linger for quite some time, they might be
 freeable when the server host comes under memory pressure. A new
 shrinker has been added that releases courtesy client resources
 during low memory scenarios.
 
 Lastly, of note: the maximum number of operations per NFSv4
 COMPOUND that NFSD can handle is increased from 16 to 50. There
 are NFSv4 client implementations that need more than 16 to
 successfully perform a mount operation that uses a pathname
 with many components.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmM66P4ACgkQM2qzM29m
 f5fhMg//afS2mp4fgPz4MjoFIqD/Icep8qFEPA8Gy6I1dDGRxd9wNgjoN4JALFdr
 NKX1oRVISBvDrOG/C84GbYnXEDlzY8q1HmyPoJA8VAR57hnJXfPZN6CBN//Bx4mU
 nISPJeNGY9SMNVhS8916V/yzd41uWDQuD+H+i5mluBTJHONgSzwzc80sQ+eq+yZQ
 PV6mlJN6hcm14LCaDOTXF7oY2Wm6dQc2rV87YChJWnc+vdXKnme/LWTMY1ABkePD
 g88mSL6w3YDKEuKciWda5/QU1ETp/Q7XTjFGDKEQSnnNsvCLmUKogJTKVa2QqyLY
 P1qlrj6XwukqAe414W4amlLL3q4NUFmJZPNWDxdf+qtTrQrBBEFrsKy/bSt27XoD
 cTvBWcorMG2riSlYPViVeh8RpyC6qwhttPbvGAmflVF2KEyXpfgc5Pnn0/Xm1Ac9
 XKzaCTJlUyRb/2wdqVtQbIpyh3sbhzp8zhv7sWKXgQOEXxKOO3ZAIrQXeL6oFN/b
 HlXDty7wKhFRj8IbkZfQ9SvN1saTONwB3clYHbCXTetkw/nnrUgLYcu8NIDBK9ou
 wkBcz1++XgVTqjRFwUwagb62cPJnRM6UiROYCVbQp7qcUe4/U+WP9t6dnZlnGZVZ
 dtipKlH/LTGKW+d7ysZOqb4hsRza5Kaduz5a7lML7UIGQXmxjM0=
 =hE6t
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "This release is mostly bug fixes, clean-ups, and optimizations.

  One notable set of fixes addresses a subtle buffer overflow issue that
  occurs if a small RPC Call message arrives in an oversized RPC record.
  This is only possible on a framed RPC transport such as TCP.

  Because NFSD shares the receive and send buffers in one set of pages,
  an oversized RPC record steals pages from the send buffer that will be
  used to construct the RPC Reply message. NFSD must not assume that a
  full-sized buffer is always available to it; otherwise, it will walk
  off the end of the send buffer while constructing its reply.

  In this release, we also introduce the ability for the server to wait
  a moment for clients to return delegations before it responds with
  NFS4ERR_DELAY. This saves a retransmit and a network round- trip when
  a delegation recall is needed. This work will be built upon in future
  releases.

  The NFS server adds another shrinker to its collection. Because
  courtesy clients can linger for quite some time, they might be
  freeable when the server host comes under memory pressure. A new
  shrinker has been added that releases courtesy client resources during
  low memory scenarios.

  Lastly, of note: the maximum number of operations per NFSv4 COMPOUND
  that NFSD can handle is increased from 16 to 50. There are NFSv4
  client implementations that need more than 16 to successfully perform
  a mount operation that uses a pathname with many components"

* tag 'nfsd-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (53 commits)
  nfsd: extra checks when freeing delegation stateids
  nfsd: make nfsd4_run_cb a bool return function
  nfsd: fix comments about spinlock handling with delegations
  nfsd: only fill out return pointer on success in nfsd4_lookup_stateid
  NFSD: fix use-after-free on source server when doing inter-server copy
  NFSD: Cap rsize_bop result based on send buffer size
  NFSD: Rename the fields in copy_stateid_t
  nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_file_cache_stats_fops
  nfsd: use DEFINE_SHOW_ATTRIBUTE to define nfsd_reply_cache_stats_fops
  nfsd: use DEFINE_SHOW_ATTRIBUTE to define client_info_fops
  nfsd: use DEFINE_SHOW_ATTRIBUTE to define export_features_fops and supported_enctypes_fops
  nfsd: use DEFINE_PROC_SHOW_ATTRIBUTE to define nfsd_proc_ops
  NFSD: Pack struct nfsd4_compoundres
  NFSD: Remove unused nfsd4_compoundargs::cachetype field
  NFSD: Remove "inline" directives on op_rsize_bop helpers
  NFSD: Clean up nfs4svc_encode_compoundres()
  SUNRPC: Fix typo in xdr_buf_subsegment's kdoc comment
  NFSD: Clean up WRITE arg decoders
  NFSD: Use xdr_inline_decode() to decode NFSv3 symlinks
  NFSD: Refactor common code out of dirlist helpers
  ...
2022-10-03 20:07:15 -07:00
Linus Torvalds
3497640a80 Changes since last update:
- Introduce fscache-based domain to share blobs between images;
 
  - Support recording fragments in a special packed inode;
 
  - Support partial-referenced pclusters for global compressed data
    deduplication;
 
  - Fix an order >= MAX_ORDER warning due to crafted negative i_size;
 
  - Several cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYzq3FxEceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBJbRAQDab/0DJu7iDktzupazfCibkg8vWzakXIi+
 KE0y5O8VaQEAwn9bdPU4cp+raowoMt3z8eGsj4H9ZO9NM8NfPUX0uQQ=
 =TNVH
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "In this cycle, for container use cases, fscache-based shared domain is
  introduced [1] so that data blobs in the same domain will be storage
  deduplicated and it will also be used for page cache sharing later.

  Also, a special packed inode is now introduced to record inode
  fragments which keep the tail part of files by Yue Hu [2]. You can
  keep arbitary length or (at will) the whole file as a fragment and
  then fragments can be optionally compressed in the packed inode
  together and even deduplicated for smaller image sizes.

  In addition to that, global compressed data deduplication by sharing
  partial-referenced pclusters is also supported in this cycle.

  Summary:

   - Introduce fscache-based domain to share blobs between images

   - Support recording fragments in a special packed inode

   - Support partial-referenced pclusters for global compressed data
     deduplication

   - Fix an order >= MAX_ORDER warning due to crafted negative i_size

   - Several cleanups"

Link: https://lore.kernel.org/r/20220916085940.89392-1-zhujia.zj@bytedance.com [1]
Link: https://lore.kernel.org/r/cover.1663065968.git.huyue2@coolpad.com [2]

* tag 'erofs-for-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: clean up erofs_iget()
  erofs: clean up unnecessary code and comments
  erofs: fold in z_erofs_reload_indexes()
  erofs: introduce partial-referenced pclusters
  erofs: support on-disk compressed fragments data
  erofs: support interlaced uncompressed data for compressed files
  erofs: clean up .read_folio() and .readahead() in fscache mode
  erofs: introduce 'domain_id' mount option
  erofs: Support sharing cookies in the same domain
  erofs: introduce a pseudo mnt to manage shared cookies
  erofs: introduce fscache-based domain
  erofs: code clean up for fscache
  erofs: use kill_anon_super() to kill super in fscache mode
  erofs: fix order >= MAX_ORDER warning due to crafted negative i_size
2022-10-03 20:01:40 -07:00
Linus Torvalds
8bea8ff34a fs.vfsuid.fat.v6.1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYzqjGgAKCRCRxhvAZXjc
 ovVFAQCbZXflZk/DGy1CVEWHwJDkMlmN62jCY+3gZP6UPsCeCQEA4uB3Hub7jihO
 b2q9yhR+p6G7DHkeNAo2qUu4bI0lDgQ=
 =zG6Z
 -----END PGP SIGNATURE-----

Merge tag 'fs.vfsuid.fat.v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull fatfs vfsuid conversion from Christian Brauner:
 "Last cycle we introduced the new vfs{g,u}id_t types that we had agreed
  on. The most important parts of the vfs have been converted but there
  are a few more places we need to switch before we can remove the old
  helpers completely.

  This cycle we converted all filesystems that called idmapped mount
  helpers directly. The affected filesystems are f2fs, fat, fuse, ksmbd,
  overlayfs, and xfs. We've sent patches for all of them. Looking at
  -next f2fs, ksmbd, overlayfs, and xfs have all picked up these patches
  and they should land in mainline during the v6.1 merge window.

  So all filesystems that have a separate tree should send the vfsuid
  conversion themselves. Onle the fat conversion is going through this
  generic fs trees because there is no fat tree.

  In order to change time settings on an inode fat checks that the
  caller either is the owner of the inode or the inode's group is in the
  caller's group list. If fat is on an idmapped mount we compare whether
  the inode mapped into the mount is equivalent to the caller's fsuid.
  If it isn't we compare whether the inode's group mapped into the mount
  is in the caller's group list.

  We now use the new vfsuid based helpers for that"

* tag 'fs.vfsuid.fat.v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fat: port to vfs{g,u}id_t and associated helpers
2022-10-03 19:54:29 -07:00
Linus Torvalds
223b845253 fs.acl.rework.prep.v6.1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYzqi8gAKCRCRxhvAZXjc
 orKNAQCGKPJ3Kc3LVVnh8qdjm9npP+j9UQAB7jDZi9q7RijIIAD/VYjj+z5XLg4V
 k96ibCyir1+4EOF8ihY0WQi40MSWYws=
 =S/Wf
 -----END PGP SIGNATURE-----

Merge tag 'fs.acl.rework.prep.v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs acl updates from Christian Brauner:
 "These are general fixes and preparatory changes related to the ongoing
  posix acl rework. The actual rework where we build a type safe posix
  acl api wasn't ready for this merge window but we're hopeful for the
  next merge window.

  General fixes:

   - Some filesystems like 9p and cifs have to implement custom posix
     acl handlers because they require access to the dentry in order to
     set and get posix acls while the set and get inode operations
     currently don't. But the ntfs3 filesystem has no such requirement
     and thus implemented custom posix acl xattr handlers when it really
     didn't have to. So this pr contains patch that just implements set
     and get inode operations for ntfs3 and switches it to rely on the
     generic posix acl xattr handlers. (We would've appreciated reviews
     from the ntfs3 maintainers but we didn't get any. But hey, if we
     really broke it we'll fix it. But fstests for ntfs3 said it's
     fine.)

   - The posix_acl_fix_xattr_common() helper has been adapted so it can
     be used by a few more callers and avoiding open-coding the same
     checks over and over.

  Other than the two general fixes this series introduces a new helper
  vfs_set_acl_prepare(). The reason for this helper is so that we can
  mitigate one of the source that change {g,u}id values directly in the
  uapi struct. With the vfs_set_acl_prepare() helper we can move the
  idmapped mount fixup into the generic posix acl set handler.

  The advantage of this is that it allows us to remove the
  posix_acl_setxattr_idmapped_mnt() helper which so far we had to call
  in vfs_setxattr() to account for idmapped mounts. While semantically
  correct the problem with this approach was that we had to keep the
  value parameter of the generic vfs_setxattr() call as non-const. This
  is rectified in this series.

  Ultimately, we will get rid of all the extreme kludges and type
  unsafety once we have merged the posix api - hopefully during the next
  merge window - built solely around get and set inode operations. Which
  incidentally will also improve handling of posix acls in security and
  especially in integrity modesl. While this will come with temporarily
  having two inode operation for posix acls that is nothing compared to
  the problems we have right now and so well worth it. We'll end up with
  something that we can actually reason about instead of needing to
  write novels to explain what's going on"

* tag 'fs.acl.rework.prep.v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  xattr: always us is_posix_acl_xattr() helper
  acl: fix the comments of posix_acl_xattr_set
  xattr: constify value argument in vfs_setxattr()
  ovl: use vfs_set_acl_prepare()
  acl: move idmapping handling into posix_acl_xattr_set()
  acl: add vfs_set_acl_prepare()
  acl: return EOPNOTSUPP in posix_acl_fix_xattr_common()
  ntfs3: rework xattr handlers and switch to POSIX ACL VFS helpers
2022-10-03 19:48:54 -07:00
Linus Torvalds
da380aefdd fs/coredump fix
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzuCiQAKCRBZ7Krx/gZQ
 65z6AQDbHgeZ3vXLyHdxs3VWsUkKWMUV1gb5MmzBs/eGq0K3hAD9FfGsOoT2ow9h
 2ics8AGrvMZMHrFOwFAmolXjLW7qQws=
 =y0rq
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull coredump fix from Al Viro:
 "Brown paper bag bug fix for the coredumping fix late in the 6.0
  release cycle"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [brown paperbag] fix coredump breakage
2022-10-03 18:03:58 -07:00
Linus Torvalds
26b84401da lsm/stable-6.1 PR 20221003
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmM68YIUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOTbA//TR8i+Wy8iswUCmtfmYg91h1uebpl
 /kjNsSmfgivAUTGamr3eN2WRlGhZfkFDPIHa25uybSA6Q+75p4lst83Rt3HDbjkv
 Ga7grCXnHwSDwJoHOSeFh0pojV2u7Zvfmiib2U5hPZEmd3kBw3NCgAJVcSGN80B2
 dct36fzZNXjvpWDbygmFtRRkmEseslSkft8bUVvNZBP+B0zvv3vcNY1QFuKuK+W2
 8wWpvO/cCSmke5i2c2ktHSk2f8/Y6n26Ik/OTHcTVfoKZLRaFbXEzLyxzLrNWd6m
 hujXgcxszTtHdmoXx+J6uBauju7TR8pi1x8mO2LSGrlpRc1cX0A5ED8WcH71+HVE
 8L1fIOmZShccPZn8xRok7oYycAUm/gIfpmSLzmZA76JsZYAe+mp9Ze9FA6fZtSwp
 7Q/rfw/Rlz25WcFBe4xypP078HkOmqutkCk2zy5liR+cWGrgy/WKX15vyC0TaPrX
 tbsRKuCLkipgfXrTk0dX3kmhz+3bJYjqeZEt7sfPSZYpaOGkNXVmAW0wnCOTuLMU
 +8pIVktvQxMmACEj2gBMz11iooR4DpWLxOcQQR/impgCpNdZ60nA0a6KPJoIXC+5
 NfTa422FZkc99QRVblUZyWSgJBW78Z3ZAQcQlo1AGLlFydbfrSFTRLbmNJZo/Nkl
 KwpGvWs5nB0rVw0=
 =VZl5
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20221003' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull LSM updates from Paul Moore:
 "Seven patches for the LSM layer and we've got a mix of trivial and
  significant patches. Highlights below, starting with the smaller bits
  first so they don't get lost in the discussion of the larger items:

   - Remove some redundant NULL pointer checks in the common LSM audit
     code.

   - Ratelimit the lockdown LSM's access denial messages.

     With this change there is a chance that the last visible lockdown
     message on the console is outdated/old, but it does help preserve
     the initial series of lockdown denials that started the denial
     message flood and my gut feeling is that these might be the more
     valuable messages.

   - Open userfaultfds as readonly instead of read/write.

     While this code obviously lives outside the LSM, it does have a
     noticeable impact on the LSMs with Ondrej explaining the situation
     in the commit description. It is worth noting that this patch
     languished on the VFS list for over a year without any comments
     (objections or otherwise) so I took the liberty of pulling it into
     the LSM tree after giving fair notice. It has been in linux-next
     since the end of August without any noticeable problems.

   - Add a LSM hook for user namespace creation, with implementations
     for both the BPF LSM and SELinux.

     Even though the changes are fairly small, this is the bulk of the
     diffstat as we are also including BPF LSM selftests for the new
     hook.

     It's also the most contentious of the changes in this pull request
     with Eric Biederman NACK'ing the LSM hook multiple times during its
     development and discussion upstream. While I've never taken NACK's
     lightly, I'm sending these patches to you because it is my belief
     that they are of good quality, satisfy a long-standing need of
     users and distros, and are in keeping with the existing nature of
     the LSM layer and the Linux Kernel as a whole.

     The patches in implement a LSM hook for user namespace creation
     that allows for a granular approach, configurable at runtime, which
     enables both monitoring and control of user namespaces. The general
     consensus has been that this is far preferable to the other
     solutions that have been adopted downstream including outright
     removal from the kernel, disabling via system wide sysctls, or
     various other out-of-tree mechanisms that users have been forced to
     adopt since we haven't been able to provide them an upstream
     solution for their requests. Eric has been steadfast in his
     objections to this LSM hook, explaining that any restrictions on
     the user namespace could have significant impact on userspace.
     While there is the possibility of impacting userspace, it is
     important to note that this solution only impacts userspace when it
     is requested based on the runtime configuration supplied by the
     distro/admin/user. Frederick (the pathset author), the LSM/security
     community, and myself have tried to work with Eric during
     development of this patchset to find a mutually acceptable
     solution, but Eric's approach and unwillingness to engage in a
     meaningful way have made this impossible. I have CC'd Eric directly
     on this pull request so he has a chance to provide his side of the
     story; there have been no objections outside of Eric's"

* tag 'lsm-pr-20221003' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  lockdown: ratelimit denial messages
  userfaultfd: open userfaultfds with O_RDONLY
  selinux: Implement userns_create hook
  selftests/bpf: Add tests verifying bpf lsm userns_create hook
  bpf-lsm: Make bpf_lsm_userns_create() sleepable
  security, lsm: Introduce security_create_user_ns()
  lsm: clean up redundant NULL pointer check
2022-10-03 17:51:52 -07:00
Al Viro
4f526fef91 [brown paperbag] fix coredump breakage
Let me count the ways in which I'd screwed up:

* when emitting a page, handling of gaps in coredump should happen
before fetching the current file position.
* fix for a problem that occurs on rather uncommon setups (and hadn't
been observed in the wild) had been sent very late in the cycle.
* ... with badly insufficient testing, introducing an easily
reproducible breakage.  Without giving it time to soak in -next.

Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: "J. R. Okajima" <hooanon05g@gmail.com>
Tested-by: "J. R. Okajima" <hooanon05g@gmail.com>
Fixes: 06bbaa6dc5 "[coredump] don't use __kernel_write() on kmap_local_page()"
Cc: stable@kernel.org	# v6.0-only
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-10-03 20:28:38 -04:00
Linus Torvalds
12ed00ba01 execve updates for v6.1-rc1
- Remove a.out implementation globally (Eric W. Biederman)
 
 - Remove unused linux_binprm::taso member (Lukas Bulwahn)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmM4bQsWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkftD/4gTcnAd3BCUgerhiQfq64kYNPt
 l47p0BM6GzXvl1Mf4Q0TDV35WJ/JD5Yd3ij3V7J2XJWSHANUAlHbxm9yfChVLACU
 99YRVhuWSdohJkF7p0b8dkAQO551aeodj/JUKGiNrJyNR4L336r+YG5aqEjOSPji
 jQH5I4SonDeaGdLy8nYO/aRhEryIF1FqvLH6egNp6Tt8Q69UqDYIojdCgZ/MS5lb
 dldrFsDb3ZjoXET0NdeIzZEZVS6zDM2iehb2W8dtRFoNsjMXz4jSiy7AEoprETBz
 wAKZ16t0dj2sARLLVGL8i3m2k6tzD6zzkIIoc9X3VOyeOADa6aghagDMyDpRo6ZB
 2ML7wMNCHCboCVVfG3n2rWTIFmrqeycAiny0hZxU4bjBBYSxTK4qD9lFQtlXk0cD
 BESZhnM6gg87vVFgLV/8aefCvwRd5eb8Pugtwb3qF4NsSvgosZtIXhWnIeU3cjg2
 425+4XOPbLsBv/u8NVkG8yIHaHZbtXH78JDIcNFMgSvw9iBwn2NH9184EpHI4qRx
 9aBjHPz+VOqzPTnNR6ISP5J4VXaOvHeocb/ckbrS+xhY8zQmbWX9QHaAZ+XnDYby
 PVY0HjmYTFSlijHSRhqLqNMHwtOAeiSZwJNk4wh+39H0ynpTzThGjQ9NlGYQ5cUE
 TVF5ukO5QRa5GbmhIQ==
 =Ns31
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve updates from Kees Cook:
 "This removes a.out support globally; it has been disabled for a while
  now.

   - Remove a.out implementation globally (Eric W. Biederman)

   - Remove unused linux_binprm::taso member (Lukas Bulwahn)"

* tag 'execve-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt: remove taso from linux_binprm struct
  a.out: Remove the a.out implementation
2022-10-03 16:56:40 -07:00
Alexey Dobriyan
ef1d61781b proc: mark more files as permanent
Mark
	/proc/devices
	/proc/kpagecount
	/proc/kpageflags
	/proc/kpagecgroup
	/proc/loadavg
	/proc/meminfo
	/proc/softirqs
	/proc/uptime
	/proc/version

as permanent /proc entries, saving alloc/free and some list/spinlock ops
per use.

These files are never removed by the kernel so it is OK to mark them.

Link: https://lkml.kernel.org/r/Yyn527DzDMa+r0Yj@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:21:45 -07:00
ye xingchen
da6f79164e nilfs2: remove the unneeded result variable
Return the value nilfs_segctor_sync() directly instead of storing it in
another redundant variable.

Link: https://lkml.kernel.org/r/20220831033403.302184-1-ye.xingchen@zte.com.cn
Link: https://lkml.kernel.org/r/20220921034803.2476-3-konishi.ryusuke@gmail.com
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:21:44 -07:00
Minghao Chi
0badb2e46a nilfs2: delete unnecessary checks before brelse()
Patch series "nilfs2 minor amendments".


This patch (of 2):

The brelse() inline function tests whether its argument is NULL and then
returns immediately.  Thus remove the tests which are not needed around
the shown calls.

Link: https://lkml.kernel.org/r/20220921034803.2476-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20220819081700.96279-1-chi.minghao@zte.com.cn
Link: https://lkml.kernel.org/r/20220921034803.2476-2-konishi.ryusuke@gmail.com
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:21:44 -07:00
wangjianli
e77999c1d4 fs/ocfs2: fix repeated words in comments
Delete the redundant word 'to'.

Link: https://lkml.kernel.org/r/20220908130036.31149-1-wangjianli@cdjrlc.com
Signed-off-by: wangjianli <wangjianli@cdjrlc.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:21:43 -07:00
Lukas Bulwahn
7ec354baa2 proc: make config PROC_CHILDREN depend on PROC_FS
Commit 2e13ba54a2 ("fs, proc: introduce CONFIG_PROC_CHILDREN")
introduces the config PROC_CHILDREN to configure kernels to provide the
/proc/<pid>/task/<tid>/children file.

When one deselects PROC_FS for kernel builds without /proc/, the config
PROC_CHILDREN has no effect anymore, but is still visible in menuconfig.

Add the dependency on PROC_FS to make the PROC_CHILDREN option disappear
for kernel builds without /proc/.

Link: https://lkml.kernel.org/r/20220909122529.1941-1-lukas.bulwahn@gmail.com
Fixes: 2e13ba54a2 ("fs, proc: introduce CONFIG_PROC_CHILDREN")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>

Cc: Iago López Galeiras <iago@endocode.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:21:43 -07:00
Andrew Morton
5ca14835dc fs: uninline inode_maybe_inc_iversion()
It has many callsites and is large.

   text	   data	    bss	    dec	    hex	filename
  91796	  15984	    512	 108292	  1a704	mm/shmem.o-before
  91180	  15984	    512	 107676	  1a49c	mm/shmem.o-after

Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:21:43 -07:00
Jiangshan Yi
1c320cfa17 fs/ocfs2/suballoc.h: fix spelling typo in comment
Fix spelling typo in comment.

Link: https://lkml.kernel.org/r/20220905061656.1829179-1-13667453960@163.com
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Reported-by: k2ci <kernel-bot@kylinos.cn>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:21:42 -07:00
Gustavo A. R. Silva
6e4a53ee79 ocfs2: replace zero-length arrays with DECLARE_FLEX_ARRAY() helper
Zero-length arrays are deprecated and we are moving towards adopting C99
flexible-array members, instead.  So, replace zero-length array
declarations in a couple of structures and unions with the new
DECLARE_FLEX_ARRAY() helper macro.

This helper allows for a flexible-array member in a union and as only
member in a structure.

Also, this addresses multiple warnings reported when building with
Clang-15 and -Wzero-length-array.

Lastly, this will also help memcpy (in a coming hardening update) execute
proper bounds-checking on variable length object i_symlink at
fs/ocfs2/namei.c:1973:

fs/ocfs2/namei.c:
1973                 memcpy((char *) fe->id2.i_symlink, symname, l);

Link: https://github.com/KSPP/linux/issues/21
Link: https://github.com/KSPP/linux/issues/193
Link: https://github.com/KSPP/linux/issues/197
Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Link: https://lkml.kernel.org/r/YxKY6O2hmdwNh8r8@work
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:21:42 -07:00
Alexander Potapenko
1468c6f455 mm: fs: initialize fsdata passed to write_begin/write_end interface
Functions implementing the a_ops->write_end() interface accept the `void
*fsdata` parameter that is supposed to be initialized by the corresponding
a_ops->write_begin() (which accepts `void **fsdata`).

However not all a_ops->write_begin() implementations initialize `fsdata`
unconditionally, so it may get passed uninitialized to a_ops->write_end(),
resulting in undefined behavior.

Fix this by initializing fsdata with NULL before the call to
write_begin(), rather than doing so in all possible a_ops implementations.

This patch covers only the following cases found by running x86 KMSAN
under syzkaller:

 - generic_perform_write()
 - cont_expand_zero() and generic_cont_expand_simple()
 - page_symlink()

Other cases of passing uninitialized fsdata may persist in the codebase.

Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:25 -07:00
Mike Kravetz
fa27759af4 hugetlb: clean up code checking for fault/truncation races
With the new hugetlb vma lock in place, it can also be used to handle page
fault races with file truncation.  The lock is taken at the beginning of
the code fault path in read mode.  During truncation, it is taken in write
mode for each vma which has the file mapped.  The file's size (i_size) is
modified before taking the vma lock to unmap.

How are races handled?

The page fault code checks i_size early in processing after taking the vma
lock.  If the fault is beyond i_size, the fault is aborted.  If the fault
is not beyond i_size the fault will continue and a new page will be added
to the file.  It could be that truncation code modifies i_size after the
check in fault code.  That is OK, as truncation code will soon remove the
page.  The truncation code will wait until the fault is finished, as it
must obtain the vma lock in write mode.

This patch cleans up/removes late checks in the fault paths that try to
back out pages racing with truncation.  As noted above, we just let the
truncation code remove the pages.

[mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
  Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:17 -07:00
Mike Kravetz
40549ba8f8 hugetlb: use new vma_lock for pmd sharing synchronization
The new hugetlb vma lock is used to address this race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...

The vma_lock is used as follows:
- During fault processing. The lock is acquired in read mode before
  doing a page table lock and allocation (huge_pte_alloc).  The lock is
  held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
  called.

Lock ordering issues come into play when unmapping a page from all
vmas mapping the page.  The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare.  This is done today in:
- try_to_migrate_one and try_to_unmap_ for page migration and memory
  error handling.  In these routines we 'try' to obtain the vma lock and
  fail to unmap if unsuccessful.  Calling routines already deal with the
  failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch.  This routine
  also tries to acquire the vma lock.  If it fails, it skips the
  unmapping.  However, we can not have file truncation or hole punch
  fail because of contention.  After hugetlb_vmdelete_list, truncation
  and hole punch call remove_inode_hugepages.  remove_inode_hugepages
  checks for mapped pages and call hugetlb_unmap_file_page to unmap them.
  hugetlb_unmap_file_page is designed to drop locks and reacquire in the
  correct order to guarantee unmap success.

Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:17 -07:00
Mike Kravetz
378397ccb8 hugetlb: create hugetlb_unmap_file_folio to unmap single file folio
Create the new routine hugetlb_unmap_file_folio that will unmap a single
file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
modified to do locking within the routine itself and check whether the
page is mapped within a specific vma before unmapping.

This refactoring will be put to use and expanded upon in a subsequent
patch adding vma specific locking.

Link: https://lkml.kernel.org/r/20220914221810.95771-8-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:17 -07:00
Mike Kravetz
c86272287b hugetlb: create remove_inode_single_folio to remove single file folio
Create the new routine remove_inode_single_folio that will remove a single
folio from a file.  This is refactored code from remove_inode_hugepages. 
It checks for the uncommon case in which the folio is still mapped and
unmaps.

No functional change.  This refactoring will be put to use and expanded
upon in a subsequent patches.

Link: https://lkml.kernel.org/r/20220914221810.95771-5-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:16 -07:00
Mike Kravetz
7e1813d48d hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
remove_huge_page removes a hugetlb page from the page cache.  Change to
hugetlb_delete_from_page_cache as it is a more descriptive name. 
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.

Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:16 -07:00
Mike Kravetz
3a47c54f09 hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues.  Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.

Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.

NOTE: Reverting this code does expose the following race condition.

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

It is unknown if the above race was ever experienced by a user.  It was
discovered via code inspection when initially addressed.

In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.

Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:16 -07:00
Mike Kravetz
188a39725a hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
Patch series "hugetlb: Use new vma lock for huge pmd sharing
synchronization", v2.

hugetlb fault scalability regressions have recently been reported [1]. 
This is not the first such report, as regressions were also noted when
commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

The regression and benefit of this patch series is not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running,
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"

			48 sample Avg
next-20220913		next-20220913			next-20220913
unmodified	revert i_mmap_sema locking	vma sema locking, this series
-----------------------------------------------------------------------------
498150 KB/s		501934 KB/s			504793 KB/s

The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings.  To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
   Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
   Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory.  The shared
mapping size was 250GB.  For comparison, a single instance of the program
was run.  Then, multiple instances were run in parallel to introduce
lock contention.  Changing the locking scheme results in a significant
performance benefit.

test		instances	unmodified	revert		vma
--------------------------------------------------------------------------
faults per sec	1		393043		395680		389932
faults per sec  24		 71405		 81191		 79048
forks per sec   1		  2802		  2747		  2725
forks per sec   24		   439		   536		   500
Combined faults 24		  1621		 68070		 53662
Combined forks  24		   358		    67		   142

Combined test is when running both faulting program and forking program
simultaneously.

Patches 1 and 2 of this series revert c0d0381ade and 87bf91d39b which
depends on c0d0381ade.  Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share.  With c0d0381ade reverted, this race is exposed:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

Reverting 87bf91d39b exposes races in page fault/file truncation.  When
the new vma lock is put to use in patch 8, this will handle the fault/file
truncation races.  This is explained in patch 9 where code associated with
these races is cleaned up.

Patches 3 - 5 restructure existing code in preparation for using the new
vma lock (rw semaphore) for pmd sharing synchronization.  The idea is that
this semaphore will be held in read mode for the duration of fault
processing, and held in write mode for unmap operations which may call
huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
synchronize huge pmd sharing.  However it is only required in the fault
path when setting up sharing, and will be acquired in huge_pmd_share().

Patch 6 adds the new vma lock and all supporting routines, but does not
actually change code to use the new lock.

Patch 7 refactors code in preparation for using the new lock.  And, patch
8 finally adds code to make use of this new vma lock.  Unfortunately, the
fault code and truncate/hole punch code would naturally take locks in the
opposite order which could lead to deadlock.  Since the performance of
page faults is more important, the truncation/hole punch code is modified
to back out and take locks in the correct order if necessary.

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/


This patch (of 9):

Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  The use of i_mmap_rwsem to prevent
fault/truncate races depends on this.  However, this has been shown to
cause performance/scaling issues.  As a result, that code will be
reverted.  Since the use i_mmap_rwsem to address page fault/truncate races
depends on this, it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:16 -07:00
Jeff Layton
36f05cab0a tmpfs: add support for an i_version counter
NFSv4 mandates a change attribute to avoid problems with timestamp
granularity, which Linux implements using the i_version counter. This is
particularly important when the underlying filesystem is fast.

Give tmpfs an i_version counter. Since it doesn't have to be persistent,
we can just turn on SB_I_VERSION and sprinkle some inode_inc_iversion
calls in the right places.

Also, while there is no formal spec for xattrs, most implementations
update the ctime on setxattr. Fix shmem_xattr_handler_set to update the
ctime and bump the i_version appropriately.

Link: https://lkml.kernel.org/r/20220909130031.15477-1-jlayton@kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:06 -07:00
Mike Marshall
2ad4b6f5e1 Orangefs: change iterate to iterate_shared
Changed	.iterate to .iterate_shared in orangefs_dir_operations.	I
didn't change anything else, there were	no xfstests regressions
and no problem with any	of my other tests...

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2022-10-03 13:05:38 -04:00
Trond Myklebust
90377158bd NFSv4/pNFS: Always return layout stats on layout return for flexfiles
We want to ensure that the server never misses the layout stats when
we're closing the file, so that it knows whether or not to update its
internal state. Otherwise, if we were racing with a layout stat, we
might cause the server to invalidate its layout before the layout stat
got processed.

Fixes: 06946c6a3d ("pNFS/flexfiles: Only send layoutstats updates for mirrors that were updated")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-03 11:26:37 -04:00
Wolfram Sang
0dd7439f38 NFS: move from strlcpy with unused retval to strscpy
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-03 11:26:36 -04:00
Lukas Bulwahn
384edeb46f NFS: clean up a needless assignment in nfs_file_write()
Commit 064109db53 ("NFS: remove redundant code in nfs_file_write()")
identifies that filemap_fdatawait_range() will always return 0 and removes
a dead error-handling case in nfs_file_write(). With this change however,
assigning the return of filemap_fdatawait_range() to the result variable is
a dead store.

Remove this needless assignment.

No functional change. No change in object code.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-03 11:26:36 -04:00
yuzhe
7e7ce2ccba nfs: remove unnecessary (void*) conversions.
remove unnecessary void* type castings.

Signed-off-by: yuzhe <yuzhe@nfschina.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-03 11:26:36 -04:00
Bo Liu
724e2df95b NFSv4: Directly use ida_alloc()/free()
Use ida_alloc()/ida_free() instead of
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.

Signed-off-by: Bo Liu <liubo03@inspur.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-03 11:26:36 -04:00
Darrick J. Wong
adc9c2e5a7 iomap: add a tracepoint for mappings returned by map_blocks
Add a new tracepoint so we can see what mapping the filesystem returns
to writeback a dirty page.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-02 11:42:19 -07:00
Darrick J. Wong
3d5f3ba1ac iomap: iomap: fix memory corruption when recording errors during writeback
Every now and then I see this crash on arm64:

Unable to handle kernel NULL pointer dereference at virtual address 00000000000000f8
Buffer I/O error on dev dm-0, logical block 8733687, async page read
Mem abort info:
  ESR = 0x0000000096000006
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x06: level 2 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000006
  CM = 0, WnR = 0
user pgtable: 64k pages, 42-bit VAs, pgdp=0000000139750000
[00000000000000f8] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000, pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Buffer I/O error on dev dm-0, logical block 8733688, async page read
Dumping ftrace buffer:
Buffer I/O error on dev dm-0, logical block 8733689, async page read
   (ftrace buffer empty)
XFS (dm-0): log I/O error -5
Modules linked in: dm_thin_pool dm_persistent_data
XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ec/0x590 [xfs] (fs/xfs/xfs_trans_buf.c:296).
 dm_bio_prison
XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
XFS (dm-0): xfs_imap_lookup: xfs_ialloc_read_agi() returned error -5, agno 0
 dm_bufio dm_log_writes xfs nft_chain_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT
potentially unexpected fatal signal 6.
 nf_reject_ipv6
potentially unexpected fatal signal 6.
 ipt_REJECT nf_reject_ipv4
CPU: 1 PID: 122166 Comm: fsstress Tainted: G        W          6.0.0-rc5-djwa #rc5 3004c9f1de887ebae86015f2677638ce51ee7
 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables
Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
pstate: 60001000 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
 ip_tables
pc : 000003fd6d7df200
 x_tables
lr : 000003fd6d7df1ec
 overlay nfsv4
CPU: 0 PID: 54031 Comm: u4:3 Tainted: G        W          6.0.0-rc5-djwa #rc5 3004c9f1de887ebae86015f2677638ce51ee7405
Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
Workqueue: writeback wb_workfn
sp : 000003ffd9522fd0
 (flush-253:0)
pstate: 60401005 (nZCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
pc : errseq_set+0x1c/0x100
x29: 000003ffd9522fd0 x28: 0000000000000023 x27: 000002acefeb6780
x26: 0000000000000005 x25: 0000000000000001 x24: 0000000000000000
x23: 00000000ffffffff x22: 0000000000000005
lr : __filemap_set_wb_err+0x24/0xe0
 x21: 0000000000000006
sp : fffffe000f80f760
x29: fffffe000f80f760 x28: 0000000000000003 x27: fffffe000f80f9f8
x26: 0000000002523000 x25: 00000000fffffffb x24: fffffe000f80f868
x23: fffffe000f80fbb0 x22: fffffc0180c26a78 x21: 0000000002530000
x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000

x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000001 x13: 0000000000470af3 x12: fffffc0058f70000
x11: 0000000000000040 x10: 0000000000001b20 x9 : fffffe000836b288
x8 : fffffc00eb9fd480 x7 : 0000000000f83659 x6 : 0000000000000000
x5 : 0000000000000869 x4 : 0000000000000005 x3 : 00000000000000f8
x20: 000003fd6d740020 x19: 000000000001dd36 x18: 0000000000000001
x17: 000003fd6d78704c x16: 0000000000000001 x15: 000002acfac87668
x2 : 0000000000000ffa x1 : 00000000fffffffb x0 : 00000000000000f8
Call trace:
 errseq_set+0x1c/0x100
 __filemap_set_wb_err+0x24/0xe0
 iomap_do_writepage+0x5e4/0xd5c
 write_cache_pages+0x208/0x674
 iomap_writepages+0x34/0x60
 xfs_vm_writepages+0x8c/0xcc [xfs 7a861f39c43631f15d3a5884246ba5035d4ca78b]
x14: 0000000000000000 x13: 2064656e72757465 x12: 0000000000002180
x11: 000003fd6d8a82d0 x10: 0000000000000000 x9 : 000003fd6d8ae288
x8 : 0000000000000083 x7 : 00000000ffffffff x6 : 00000000ffffffee
x5 : 00000000fbad2887 x4 : 000003fd6d9abb58 x3 : 000003fd6d740020
x2 : 0000000000000006 x1 : 000000000001dd36 x0 : 0000000000000000
CPU: 1 PID: 122167 Comm: fsstress Tainted: G        W          6.0.0-rc5-djwa #rc5 3004c9f1de887ebae86015f2677638ce51ee7
 do_writepages+0x90/0x1c4
 __writeback_single_inode+0x4c/0x4ac
Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
 writeback_sb_inodes+0x214/0x4ac
 wb_writeback+0xf4/0x3b0
pstate: 60001000 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
 wb_workfn+0xfc/0x580
 process_one_work+0x1e8/0x480
pc : 000003fd6d7df200
 worker_thread+0x78/0x430

This crash is a result of iomap_writepage_map encountering some sort of
error during writeback and wanting to set that error code in the file
mapping so that fsync will report it.  Unfortunately, the code
dereferences folio->mapping after unlocking the folio, which means that
another thread could have removed the page from the page cache
(writeback doesn't hold the invalidation lock) and give it to somebody
else.

At best we crash the system like above; at worst, we corrupt memory or
set an error on some other unsuspecting file while failing to record the
problems with *this* file.  Regardless, fix the problem by reporting the
error to the inode mapping.

NOTE: Commit 598ecfbaa7 lifted the XFS writeback code to iomap, so
this fix should be backported to XFS in the 4.6-5.4 kernels in addition
to iomap in the 5.5-5.19 kernels.

Fixes: e735c00794 ("iomap: Convert iomap_add_to_ioend() to take a folio") # 5.17 onward
Fixes: 598ecfbaa7 ("iomap: lift the xfs writeback code to iomap") # 5.5-5.16, needs backporting
Fixes: 150d5be09c ("xfs: remove xfs_cancel_ioend") # 4.6-5.4, needs backporting
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-10-02 11:42:19 -07:00
Ye Bin
1b45cc5c7b ext4: fix potential out of bound read in ext4_fc_replay_scan()
For scan loop must ensure that at least EXT4_FC_TAG_BASE_LEN space. If remain
space less than EXT4_FC_TAG_BASE_LEN which will lead to out of bound read
when mounting corrupt file system image.
ADD_RANGE/HEAD/TAIL is needed to add extra check when do journal scan, as this
three tags will read data during scan, tag length couldn't less than data length
which will read.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20220924075233.2315259-4-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Ye Bin
dcc5827484 ext4: factor out ext4_fc_get_tl()
Factor out ext4_fc_get_tl() to fill 'tl' with host byte order.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20220924075233.2315259-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Ye Bin
fdc2a3c75d ext4: introduce EXT4_FC_TAG_BASE_LEN helper
Introduce EXT4_FC_TAG_BASE_LEN helper for calculate length of
struct ext4_fc_tl.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20220924075233.2315259-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Ye Bin
7ff5fddadd ext4: factor out ext4_free_ext_path()
Factor out ext4_free_ext_path() to free extent path. As after previous patch
'ext4_ext_drop_refs()' is only used in 'extents.c', so make it static.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220924021211.3831551-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Ye Bin
b6a750c019 ext4: remove unnecessary drop path references in mext_check_coverage()
According to Jan Kara's suggestion:
"The use in mext_check_coverage() can be actually removed
- get_ext_path() -> ext4_find_extent() takes care of dropping the references."
So remove unnecessary call ext4_ext_drop_refs() in mext_check_coverage().

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220924021211.3831551-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Ye Bin
27cd497803 ext4: update 'state->fc_regions_size' after successful memory allocation
To avoid to 'state->fc_regions_size' mismatch with 'state->fc_regions'
when fail to reallocate 'fc_reqions',only update 'state->fc_regions_size'
after 'state->fc_regions' is allocated successfully.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220921064040.3693255-4-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Ye Bin
7069d105c1 ext4: fix potential memory leak in ext4_fc_record_regions()
As krealloc may return NULL, in this case 'state->fc_regions' may not be
freed by krealloc, but 'state->fc_regions' already set NULL. Then will
lead to 'state->fc_regions' memory leak.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220921064040.3693255-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Ye Bin
9305721a30 ext4: fix potential memory leak in ext4_fc_record_modified_inode()
As krealloc may return NULL, in this case 'state->fc_modified_inodes'
may not be freed by krealloc, but 'state->fc_modified_inodes' already
set NULL. Then will lead to 'state->fc_modified_inodes' memory leak.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220921064040.3693255-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Guoqing Jiang
78ed9354c5 ext4: remove redundant checking in ext4_ioctl_checkpoint
It is already checked after comment "check for invalid bits set",
so let's remove this one.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220918115219.12407-1-guoqing.jiang@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Ye Bin
dfff66f30f jbd2: add miss release buffer head in fc_do_one_pass()
In fc_do_one_pass() miss release buffer head after use which will lead
to reference count leak.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220917093805.1782845-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Jason Yan
3df11e27f0 ext4: move DIOREAD_NOLOCK setting to ext4_set_def_opts()
Now since all preparations is done, we can move the DIOREAD_NOLOCK
setting to ext4_set_def_opts().

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220916141527.1012715-17-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Jason Yan
c8267c5142 ext4: remove useless local variable 'blocksize'
Since sb->s_blocksize is now initialized at the very beginning, the
local variable 'blocksize' in __ext4_fill_super() is not needed now.
Remove it and use sb->s_blocksize instead.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220916141527.1012715-16-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:54 -04:00
Jason Yan
a7a79c292a ext4: unify the ext4 super block loading operation
Now we load the super block from the disk in two steps. First we load
the super block with the default block size(EXT4_MIN_BLOCK_SIZE). Second
we load the super block with the real block size. The second step is a
little far from the first step. This patch move these two steps together
in a new function.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220916141527.1012715-15-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
a5991e539c ext4: factor out ext4_journal_data_mode_check()
Factor out ext4_journal_data_mode_check(). No functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara<jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-14-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
9c1dd22d74 ext4: factor out ext4_load_and_init_journal()
This patch group the journal load and initialize code together and
factor out ext4_load_and_init_journal(). This patch also removes the
lable 'no_journal' which is not needed after refactor.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-13-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
a4e6a511d7 ext4: factor out ext4_group_desc_init() and ext4_group_desc_free()
Factor out ext4_group_desc_init() and ext4_group_desc_free(). No
functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-12-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
bc62dbf914 ext4: factor out ext4_geometry_check()
Factor out ext4_geometry_check(). No functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-11-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
d7f3542b32 ext4: factor out ext4_check_feature_compatibility()
Factor out ext4_check_feature_compatibility(). No functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-10-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
b26458d151 ext4: factor out ext4_init_metadata_csum()
Factor out ext4_init_metadata_csum(). No functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-9-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
39c135b08c ext4: factor out ext4_encoding_init()
Factor out ext4_encoding_init(). No functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220916141527.1012715-8-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
0e495f7cc3 ext4: factor out ext4_inode_info_init()
Factor out ext4_inode_info_init(). No functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-7-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
f7314a6732 ext4: factor out ext4_fast_commit_init()
Factor out ext4_fast_commit_init(). No functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-6-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
4a8557b094 ext4: factor out ext4_handle_clustersize()
Factor out ext4_handle_clustersize(). No functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-5-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:53 -04:00
Jason Yan
5f6d662d12 ext4: factor out ext4_set_def_opts()
Factor out ext4_set_def_opts(). No functional change.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-4-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Jason Yan
a5fc511935 ext4: remove cantfind_ext4 error handler
The 'cantfind_ext4' error handler is just a error msg print and then
goto failed_mount. This two level goto makes the code complex and not
easy to read. The only benefit is that is saves a little bit code.
However some branches can merge and some branches dot not even need it.
So do some refactor and remove it.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-3-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Jason Yan
43bd6f1b49 ext4: goto right label 'failed_mount3a'
Before these two branches neither loaded the journal nor created the
xattr cache. So the right label to goto is 'failed_mount3a'. Although
this did not cause any issues because the error handler validated if the
pointer is null. However this still made me confused when reading
the code. So it's still worth to modify to goto the right label.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220916141527.1012715-2-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
e64e6ca909 ext4: adjust fast commit disable judgement order in ext4_fc_track_inode
If fastcommit is already disabled, there isn't need to mark inode ineligible.
So move 'ext4_fc_disabled()' judgement bofore 'ext4_should_journal_data(inode)'
judgement which can avoid to do meaningless judgement.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220916083836.388347-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
b7b80a35fb ext4: factor out ext4_fc_disabled()
Factor out ext4_fc_disabled(). No functional change.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220916083836.388347-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
ccbf8eeb39 ext4: fix miss release buffer head in ext4_fc_write_inode
In 'ext4_fc_write_inode' function first call 'ext4_get_inode_loc' get 'iloc',
after use it miss release 'iloc.bh'.
So just release 'iloc.bh' before 'ext4_fc_write_inode' return.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220914100859.1415196-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
243d1a5d50 jbd2: fix potential use-after-free in jbd2_fc_wait_bufs
In 'jbd2_fc_wait_bufs' use 'bh' after put buffer head reference count
which may lead to use-after-free.
So judge buffer if uptodate before put buffer head reference count.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220914100812.1414768-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Ye Bin
e0d5fc7a6d jbd2: fix potential buffer head reference count leak
As in 'jbd2_fc_wait_bufs' if buffer isn't uptodate, will return -EIO without
update 'journal->j_fc_off'. But 'jbd2_fc_release_bufs' will release buffer head
from ‘j_fc_off - 1’ if 'bh' is NULL will terminal release which will lead to
buffer head buffer head reference count leak.
To solve above issue, update 'journal->j_fc_off' before return -EIO.

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220914100812.1414768-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Zhihao Cheng
7177dd009c ext4: fix dir corruption when ext4_dx_add_entry() fails
Following process may lead to fs corruption:
1. ext4_create(dir/foo)
 ext4_add_nondir
  ext4_add_entry
   ext4_dx_add_entry
     a. add_dirent_to_buf
      ext4_mark_inode_dirty
      ext4_handle_dirty_metadata   // dir inode bh is recorded into journal
     b. ext4_append    // dx_get_count(entries) == dx_get_limit(entries)
       ext4_bread(EXT4_GET_BLOCKS_CREATE)
        ext4_getblk
         ext4_map_blocks
          ext4_ext_map_blocks
            ext4_mb_new_blocks
             dquot_alloc_block
              dquot_alloc_space_nodirty
               inode_add_bytes    // update dir's i_blocks
            ext4_ext_insert_extent
	     ext4_ext_dirty  // record extent bh into journal
              ext4_handle_dirty_metadata(bh)
	      // record new block into journal
       inode->i_size += inode->i_sb->s_blocksize   // new size(in mem)
     c. ext4_handle_dirty_dx_node(bh2)
	// record dir's new block(dx_node) into journal
     d. ext4_handle_dirty_dx_node((frame - 1)->bh)
     e. ext4_handle_dirty_dx_node(frame->bh)
     f. do_split    // ret err!
     g. add_dirent_to_buf
	 ext4_mark_inode_dirty(dir)  // update raw_inode on disk(skipped)
2. fsck -a /dev/sdb
 drop last block(dx_node) which beyonds dir's i_size.
  /dev/sdb: recovering journal
  /dev/sdb contains a file system with errors, check forced.
  /dev/sdb: Inode 12, end of extent exceeds allowed value
	(logical block 128, physical block 3938, len 1)
3. fsck -fn /dev/sdb
 dx_node->entry[i].blk > dir->i_size
  Pass 2: Checking directory structure
  Problem in HTREE directory inode 12 (/dir): bad block number 128.
  Clear HTree index? no
  Problem in HTREE directory inode 12: block #3 has invalid depth (2)
  Problem in HTREE directory inode 12: block #3 has bad max hash
  Problem in HTREE directory inode 12: block #3 not referenced

Fix it by marking inode dirty directly inside ext4_append().
Fetch a reproducer in [Link].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216466
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220911045204.516460-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Gaosheng Cui
ebd5d23e88 ext4: remove ext4_inline_data_fiemap() declaration
ext4_inline_data_fiemap() has been removed since
commit d3b6f23f71 ("ext4: move ext4_fiemap to use iomap framework"),
so remove it.

Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220909065307.1155201-1-cuigaosheng1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Jeff Layton
a642c2c082 ext4: fix i_version handling in ext4
ext4 currently updates the i_version counter when the atime is updated
during a read. This is less than ideal as it can cause unnecessary cache
invalidations with NFSv4 and unnecessary remeasurements for IMA.

The increment in ext4_mark_iloc_dirty is also problematic since it can
corrupt the i_version counter for ea_inodes. We aren't bumping the file
times in ext4_mark_iloc_dirty, so changing the i_version there seems
wrong, and is the cause of both problems.

Remove that callsite and add increments to the setattr, setxattr and
ioctl codepaths, at the same times that we update the ctime. The
i_version bump that already happens during timestamp updates should take
care of the rest.

In ext4_move_extents, increment the i_version on both inodes, and also
add in missing ctime updates.

[ Some minor updates since we've already enabled the i_version counter
  unconditionally already via another patch series. -- TYT ]

Cc: stable@kernel.org
Cc: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20220908172448.208585-3-jlayton@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Jan Kara
5fc4cbd9fd mbcache: Avoid nesting of cache->c_list_lock under bit locks
Commit 307af6c879 ("mbcache: automatically delete entries from cache
on freeing") started nesting cache->c_list_lock under the bit locks
protecting hash buckets of the mbcache hash table in
mb_cache_entry_create(). This causes problems for real-time kernels
because there spinlocks are sleeping locks while bitlocks stay atomic.
Luckily the nesting is easy to avoid by holding entry reference until
the entry is added to the LRU list. This makes sure we cannot race with
entry deletion.

Cc: stable@kernel.org
Fixes: 307af6c879 ("mbcache: automatically delete entries from cache on freeing")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908091032.10513-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Andrew Perepechko
34fc8768ec jbd2: wake up journal waiters in FIFO order, not LIFO
LIFO wakeup order is unfair and sometimes leads to a journal
user not being able to get a journal handle for hundreds of
transactions in a row.

FIFO wakeup can make things more fair.

Cc: stable@kernel.org
Signed-off-by: Alexey Lyashkov <alexey.lyashkov@gmail.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220907165959.1137482-1-alexey.lyashkov@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Jinke Han
d1052d236e ext4: place buffer head allocation before handle start
In our product environment, we encounter some jbd hung waiting handles to
stop while several writters were doing memory reclaim for buffer head
allocation in delay alloc write path. Ext4 do buffer head allocation with
holding transaction handle which may be blocked too long if the reclaim
works not so smooth. According to our bcc trace, the reclaim time in
buffer head allocation can reach 258s and the jbd transaction commit also
take almost the same time meanwhile. Except for these extreme cases,
we often see several seconds delays for cgroup memory reclaim on our
servers. This is more likely to happen considering docker environment.

One thing to note, the allocation of buffer heads is as often as page
allocation or more often when blocksize less than page size. Just like
page cache allocation, we should also place the buffer head allocation
before startting the handle.

Cc: stable@kernel.org
Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Link: https://lore.kernel.org/r/20220903012429.22555-1-hanjinke.666@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:52 -04:00
Zhang Yi
0b73284c56 ext4: ext4_read_bh_lock() should submit IO if the buffer isn't uptodate
Recently we notice that ext4 filesystem would occasionally fail to read
metadata from disk and report error message, but the disk and block
layer looks fine. After analyse, we lockon commit 88dbcbb3a4
("blkdev: avoid migration stalls for blkdev pages"). It provide a
migration method for the bdev, we could move page that has buffers
without extra users now, but it lock the buffers on the page, which
breaks the fragile metadata read operation on ext4 filesystem,
ext4_read_bh_lock() was copied from ll_rw_block(), it depends on the
assumption of that locked buffer means it is under IO. So it just
trylock the buffer and skip submit IO if it lock failed, after
wait_on_buffer() we conclude IO error because the buffer is not
uptodate.

This issue could be easily reproduced by add some delay just after
buffer_migrate_lock_buffers() in __buffer_migrate_folio() and do
fsstress on ext4 filesystem.

  EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #73193:
  comm fsstress: reading directory lblock 0
  EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #75334:
  comm fsstress: reading directory lblock 0

Fix it by removing the trylock logic in ext4_read_bh_lock(), just lock
the buffer and submit IO if it's not uptodate, and also leave over
readahead helper.

Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220831074629.3755110-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:46:51 -04:00
Jeff Layton
1ff2030739 ext4: unconditionally enable the i_version counter
The original i_version implementation was pretty expensive, requiring a
log flush on every change. Because of this, it was gated behind a mount
option (implemented via the MS_I_VERSION mountoption flag).

Commit ae5e165d85 (fs: new API for handling inode->i_version) made the
i_version flag much less expensive, so there is no longer a performance
penalty from enabling it. xfs and btrfs already enable it
unconditionally when the on-disk format can support it.

Have ext4 ignore the SB_I_VERSION flag, and just enable it
unconditionally.  While we're in here, mark the i_version mount
option Opt_removed.

[ Removed leftover bits of i_version from ext4_apply_options() since it
  now can't ever be set in ctx->mask_s_flags -- lczerner ]

Cc: stable@kernel.org
Cc: Dave Chinner <david@fromorbit.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220824160349.39664-3-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30 23:45:00 -04:00
Linus Torvalds
d649d2c49b pstore revert for v6.0-rc8
- Revert crypto acomp migration (Guilherme G. Piccoli)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmM3CssWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJl75D/42DaeB92odK/3XG9cI1Frp4YS0
 vcTIecUeheNrTf4okUjeL0kQjUJgDQLWYNxZU6O3ljNM3lItDegV14Ij4tiPJlNK
 pOg8e+ddnWmhTB6c1BrLwVIjPDUnJmhqd1L2G6D/1djlcQU2TzN/amjBE4PQYrC4
 Kyqn8nmrVt5CgCvkV4PPDuIiyeGx808bFu+rK13J6BZYDH4mTZWqky25+wFuresn
 HNBQ0Xh7otaeeIMdbdohaitANlr7xHpUOLDuHnQaZ9Od1wJdFqNFpQ0sYDwvsqhY
 aTyNAAkZig9xMp10Sr6lCDfT/ZHkHDtSNOqn30iwPrFX83QVbbnaQhRsN7EK8Caz
 v7dx2HDjP5MxkYetT+3qPW4waScc5/s7dJNXoK/3E3oWrKoFRwx2CrmZUEuHmjHC
 S1M/kG+MoizxGAWbtvB93XbxXJlULdgmv1VKQyBjZRZ9JVBP7AnQmstljKjK0yn2
 o/pgTFRduiTj7+N9iU7h7GxHNLXJm0YHwCZw1QhORc1dg14jbMkNEu6GN6ZNfuvg
 ADYs2tRN7YA7oGjACaCNAEuGxjY5/quvPAdYHdwankXBj8S8O2/BCxa02mbrPevC
 7p32nwzRtUz4G+bjJr4N/RGO1q/Qzn5IcZnzOxGUVn5TC729JxkfRUFrb/m0l0PP
 PgfmGgNjSQ4KDaYIMQ==
 =KoAn
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore revert from Kees Cook:
 "A misbehavior with some compression backends in pstore was just
  discovered due to the recent crypto acomp migration.

  Since we're so close to release, it seems better to just simply revert
  it, and we can figure out what's going on without leaving it broken
  for a release.

   - Revert crypto acomp migration (Guilherme G. Piccoli)"

* tag 'pstore-v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  Revert "pstore: migrate to crypto acomp interface"
2022-09-30 08:54:14 -07:00
Guilherme G. Piccoli
40158dbf7e Revert "pstore: migrate to crypto acomp interface"
This reverts commit e4f0a7ec58.

When using this new interface, both efi_pstore and ramoops
backends are unable to properly decompress dmesg if using
zstd, lz4 and lzo algorithms (and maybe more). It does succeed
with deflate though.

The message observed in the kernel log is:

[2.328828] pstore: crypto_acomp_decompress failed, ret = -22!

The pstore infrastructure is able to collect the dmesg with
both backends tested, but since decompression fails it's
unreadable. With this revert everything is back to normal.

Fixes: e4f0a7ec58 ("pstore: migrate to crypto acomp interface")
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220929215515.276486-1-gpiccoli@igalia.com
2022-09-30 08:16:06 -07:00
Lukas Czerner
cbfecb927f fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
Currently the I_DIRTY_TIME will never get set if the inode already has
I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME.  That's
true, however ext4 will only update the on-disk inode in
->dirty_inode(), not on actual writeback. As a result if the inode
already has I_DIRTY_INODE state by the time we get to
__mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
into on-disk inode and will not get updated until the next I_DIRTY_INODE
update, which might never come if we crash or get a power failure.

The problem can be reproduced on ext4 by running xfstest generic/622
with -o iversion mount option.

Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
I_DIRTY_INODE. Also make sure that the case is properly handled in
writeback_single_inode() as well. Additionally changes in
xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.

Thanks Jan Kara for suggestions on how to make this work properly.

Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:02:00 -04:00
Lukas Czerner
50f094a558 ext4: don't increase iversion counter for ea_inodes
ea_inodes are using i_version for storing part of the reference count so
we really need to leave it alone.

The problem can be reproduced by xfstest ext4/026 when iversion is
enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
inodes in ext4_mark_iloc_dirty().

Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220824160349.39664-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:44 -04:00
Jan Kara
61a1d87a32 ext4: fix check for block being out of directory size
The check in __ext4_read_dirblock() for block being outside of directory
size was wrong because it compared block number against directory size
in bytes. Fix it.

Fixes: 65f8ea4cd5 ("ext4: check if directory block is within i_size")
CVE: CVE-2022-1184
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220822114832.1482-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Ritesh Harjani (IBM)
5bdf402a05 fs/buffer: make submit_bh & submit_bh_wbc return type as void
submit_bh/submit_bh_wbc are non-blocking functions which just submit
the bio and return. The caller of submit_bh/submit_bh_wbc needs to wait
on buffer till I/O completion and then check buffer head's b_state field
to know if there was any I/O error.

Hence there is no need for these functions to have any return type.
Even now they always returns 0. Hence drop the return value and make
their return type as void to avoid any confusion.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/cb66ef823374cdd94d2d03083ce13de844fffd41.1660788334.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Ritesh Harjani (IBM)
ab620620bf fs/buffer: drop useless return value of submit_bh
submit_bh always returns 0. This patch drops the useless return value of
submit_bh from __sync_dirty_buffer(). Once all of submit_bh callers are
cleaned up, we can make it's return type as void.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a98a6ddfac68f73d684c2724952e825bc1f4d238.1660788334.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Ritesh Harjani (IBM)
c547524852 fs/ntfs: drop useless return value of submit_bh from ntfs_submit_bh_for_read
submit_bh always returns 0. This patch drops the useless return value of
submit_bh from ntfs_submit_bh_for_read(). Once all of submit_bh callers are
cleaned up, we can make it's return type as void.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/d82eb29e8dbc52fe13a7affef5c907ea4076aa31.1660788334.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Ritesh Harjani (IBM)
f3ed5df398 jbd2: drop useless return value of submit_bh
submit_bh always returns 0. This patch cleans up 2 of it's caller
in jbd2 to drop submit_bh's useless return value.
Once all submit_bh callers are cleaned up, we can make it's return
type as void.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/e069c0539be0aec61abcdc6f6141982ec85d489d.1660788334.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:40 -04:00
Lalith Rajendran
3b575495ab ext4: make ext4_lazyinit_thread freezable
ext4_lazyinit_thread is not set freezable. Hence when the thread calls
try_to_freeze it doesn't freeze during suspend and continues to send
requests to the storage during suspend, resulting in suspend failures.

Cc: stable@kernel.org
Signed-off-by: Lalith Rajendran <lalithkraj@google.com>
Link: https://lore.kernel.org/r/20220818214049.1519544-1-lalithkraj@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:01:27 -04:00
Linus Torvalds
987a926c1d fix for breakage in dump_user_range()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzXkrwAKCRBZ7Krx/gZQ
 6+BcAP4+OfgZO6qStFjroURhZ8o3qfCJpYde/fGN5S/F/5UMdAD6A+4yozxXLdMZ
 wb2oB0XhxHv33vFdFfEeodquTnn+MAQ=
 =1Mnt
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull coredump fix from Al Viro:
 "Fix for breakage in dump_user_range()"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [coredump] don't use __kernel_write() on kmap_local_page()
2022-09-29 14:37:45 -07:00
Jakub Kicinski
accc3b4a57 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29 14:30:51 -07:00
Tetsuo Handa
cbddcc4fa3 btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer
syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for
commit bc877d285c ("btrfs: Deduplicate extent_buffer init code")
missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must
not be moved to after clean_tree_block() because clean_tree_block() is
calling btrfs_header_generation() since commit 55c69072d6 ("Btrfs:
Fix extent_buffer usage when nodesize != leafsize").

Since memzero_extent_buffer() will reset "struct btrfs_header" part, we
can't move btrfs_set_header_generation() to before memzero_extent_buffer().
Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block().

Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1]
Reported-by: syzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com>
Fixes: bc877d285c ("btrfs: Deduplicate extent_buffer init code")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
db21370bff btrfs: drop extent map range more efficiently
Currently when dropping extent maps for a file range, through
btrfs_drop_extent_map_range(), we do the following non-optimal things:

1) We lookup for extent maps one by one, always starting the search from
   the root of the extent map tree. This is not efficient if we have
   multiple extent maps in the range;

2) We check on every iteration if we have the 'split' and 'split2' spare
   extent maps in case we need to split an extent map that intersects our
   range but also crosses its boundaries (to the left, to the right or
   both cases). If our target range is for example:

       [2M, 8M)

   And we have 3 extents maps in the range:

       [1M, 3M) [3M, 6M) [6M, 10M[

   The on the first iteration we allocate two extent maps for 'split' and
   'split2', and use the 'split' to split the first extent map, so after
   the split we set 'split' to 'split2' and then set 'split2' to NULL.

   On the second iteration, we don't need to split the second extent map,
   but because 'split2' is now NULL, we allocate a new extent map for
   'split2'.

   On the third iteration we need to split the third extent map, so we
   use the extent map pointed by 'split'.

   So we ended up allocating 3 extent maps for splitting, but all we
   needed was 2 extent maps. We never need to allocate more than 2,
   because extent maps that need to be split are always the first one
   and the last one in the target range.

Improve on this by:

1) Using rb_next() to move on to the next extent map. This results in
   iterating over less nodes of the tree and it does not require comparing
   the ranges of nodes to our start/end offset;

2) Allocate the 2 extent maps for splitting before entering the loop and
   never allocate more than 2. In practice it's very rare to have the
   combination of both extent map allocations fail, since we have a
   dedicated slab for extent maps, and also have the need to split two
   extent maps.

This patch is part of a patchset comprised of the following patches:

   btrfs: fix missed extent on fsync after dropping extent maps
   btrfs: move btrfs_drop_extent_cache() to extent_map.c
   btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
   btrfs: use cond_resched_rwlock_write() during inode eviction
   btrfs: move open coded extent map tree deletion out of inode eviction
   btrfs: add helper to replace extent map range with a new extent map
   btrfs: remove the refcount warning/check at free_extent_map()
   btrfs: remove unnecessary extent map initializations
   btrfs: assert tree is locked when clearing extent map from logging
   btrfs: remove unnecessary NULL pointer checks when searching extent maps
   btrfs: remove unnecessary next extent map search
   btrfs: avoid pointless extent map tree search when flushing delalloc
   btrfs: drop extent map range more efficiently

And the following fio test was done before and after applying the whole
patchset, on a non-debug kernel (Debian's default kernel config) on a 12
cores Intel box with 64G of ram:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/nvme0n1
   MNT=/mnt/nvme0n1
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS="-R free-space-tree -O no-holes"

   cat <<EOF > /tmp/fio-job.ini
   [writers]
   rw=randwrite
   fsync=8
   fallocate=none
   group_reporting=1
   direct=0
   bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
   ioengine=psync
   filesize=2G
   runtime=300
   time_based
   directory=$MNT
   numjobs=8
   thread
   EOF

   echo performance | \
       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

   echo
   echo "Using config:"
   echo
   cat /tmp/fio-job.ini
   echo

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   fio /tmp/fio-job.ini

   umount $MNT

Result before applying the patchset:

   WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec

Result after applying the patchset:

   WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
b54bb86556 btrfs: avoid pointless extent map tree search when flushing delalloc
When flushing delalloc, in COW mode at cow_file_range(), before entering
the loop that allocates extents and creates ordered extents, we do a call
to btrfs_drop_extent_map_range() for the whole range. This is pointless
because in the loop we call create_io_em(), which will also call
btrfs_drop_extent_map_range() before inserting the new extent map.

So remove that call at cow_file_range() not only because it is not needed,
but also because it will make the btrfs_drop_extent_map_range() calls made
from create_io_em() waste time searching the extent map tree, and that
tree can be large for files with many extents. It also makes us waste time
at btrfs_drop_extent_map_range() allocating and freeing the split extent
maps for nothing.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
6c05813ebb btrfs: remove unnecessary next extent map search
At __tree_search(), and its single caller __lookup_extent_mapping(), there
is no point in finding the next extent map that starts after the search
offset if we were able to find the previous extent map that ends before
our search offset, because __lookup_extent_mapping() ignores the next
acceptable extent map if we were able to find the previous one.

So just return immediately if we were able to find the previous extent
map, therefore avoiding wasting time iterating the tree looking for the
next extent map which will not be used by __lookup_extent_mapping().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
08f088dd63 btrfs: remove unnecessary NULL pointer checks when searching extent maps
The previous and next pointer arguments passed to __tree_search() are
never NULL as the only caller of this function, __lookup_extent_mapping(),
always passes the address of two on stack pointers. So remove the NULL
checks and add assertions to verify the pointers.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
74333c7d87 btrfs: assert tree is locked when clearing extent map from logging
When calling clear_em_logging() we should have a write lock on the extent
map tree, as we will try to merge the extent map with the previous and
next ones in the tree. So assert that we have a write lock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
2e0cdaa028 btrfs: remove unnecessary extent map initializations
When allocating an extent map, we use kmem_cache_zalloc() which guarantees
the returned memory is initialized to zeroes, therefore it's pointless
to initialize the generation and flags of the extent map to zero again.

Remove those initializations, as they are pointless and slightly increase
the object text size.

Before removing them:

   $ size fs/btrfs/extent_map.o
      text	   data	    bss	    dec	    hex	filename
      9241	    274	     24	   9539	   2543	fs/btrfs/extent_map.o

After removing them:

   $ size fs/btrfs/extent_map.o
      text	   data	    bss	    dec	    hex	filename
      9209	    274	     24	   9507	   2523	fs/btrfs/extent_map.o

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
ad5d6e9148 btrfs: remove the refcount warning/check at free_extent_map()
At free_extent_map(), it's pointless to have a WARN_ON() to check if the
refcount of the extent map is zero. Such check is already done by the
refcount_t module and refcount_dec_and_test(), which loudly complains if
we try to decrement a reference count that is currently 0.

The WARN_ON() dates back to the time when used a regular atomic_t type
for the reference counter, before we switched to the refcount_t type.
The main goal of the refcount_t type/module is precisely to catch such
types of bugs and loudly complain if they happen.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
a1ba4c080b btrfs: add helper to replace extent map range with a new extent map
We have several places that need to drop all the extent maps in a given
file range and then add a new extent map for that range. Currently they
call btrfs_drop_extent_map_range() to delete all extent maps in the range
and then keep trying to add the new extent map in a loop that keeps
retrying while the insertion of the new extent map fails with -EEXIST.

So instead of repeating this logic, add a helper to extent_map.c that
does these steps and name it btrfs_replace_extent_map_range(). Also add
a comment about why the retry loop is necessary.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
9c9d1b4f74 btrfs: move open coded extent map tree deletion out of inode eviction
Move the loop that removes all the extent maps from the inode's extent
map tree during inode eviction out of inode.c and into extent_map.c, to
btrfs_drop_extent_map_range(). Anything manipulating extent maps or the
extent map tree should be in extent_map.c.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
99ba0c8150 btrfs: use cond_resched_rwlock_write() during inode eviction
At evict_inode_truncate_pages(), instead of manually checking if
rescheduling is needed, then unlock the extent map tree, reschedule and
then write lock again the tree, use the helper cond_resched_rwlock_write()
which does all that.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
f3109e33bb btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
Instead of open coding the end offset calculation of an extent map, use
the helper extent_map_end() and cache its result in a local variable,
since it's used several times.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00