* for-6.1/block: (162 commits)
sbitmap: fix lockup while swapping
block: add rationale for not using blk_mq_plug() when applicable
block: adapt blk_mq_plug() to not plug for writes that require a zone lock
s390/dasd: use blk_mq_alloc_disk
blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
nvmet: don't look at the request_queue in nvmet_bdev_set_limits
nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
blk-mq: use quiesced elevator switch when reinitializing queues
block: replace blk_queue_nowait with bdev_nowait
nvme: remove nvme_ctrl_init_connect_q
nvme-loop: use the tagset alloc/free helpers
nvme-loop: store the generic nvme_ctrl in set->driver_data
nvme-loop: initialize sqsize later
nvme-fc: use the tagset alloc/free helpers
nvme-fc: store the generic nvme_ctrl in set->driver_data
nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
nvme-rdma: use the tagset alloc/free helpers
nvme-rdma: store the generic nvme_ctrl in set->driver_data
nvme-tcp: use the tagset alloc/free helpers
nvme-tcp: store the generic nvme_ctrl in set->driver_data
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 4acb83417c ("sbitmap: fix batched wait_cnt accounting")
is a big improvement: without it, I had to revert to before commit
040b83fcec ("sbitmap: fix possible io hung due to lost wakeup")
to avoid the high system time and freezes which that had introduced.
Now okay on the NVME laptop, but 4acb83417c is a disaster for heavy
swapping (kernel builds in low memory) on another: soon locking up in
sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
around with waitqueue_active() but wait_cnt 0. Here is a backtrace,
showing the common pattern of outer sbitmap_queue_wake_up() interrupted
before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
are idle, in other cases they're spinning for a lock in dd_bio_merge()):
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
scsi_end_request < scsi_io_completion < scsi_finish_command <
scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
__irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
_raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
__submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
submit_bio < __swap_writepage < swap_writepage < pageout <
shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
__alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
do_anonymous_page < __handle_mm_fault < handle_mm_fault <
do_user_addr_fault < exc_page_fault < asm_exc_page_fault
See how the process-context sbitmap_queue_wake_up() has been interrupted,
after bringing wait_cnt down to 0 (and in this example, after doing its
wakeups), before advancing wake_index and refilling wait_cnt: an
interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.
I have almost no grasp of all the possible sbitmap races, and their
consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
so it is better if sbq_wake_ptr() skips on to the next ws in that case:
which fixes the lockup and shows no adverse consequence for me.
The check for wait_cnt being 0 is obviously racy, and ultimately can lead
to lost wakeups: for example, when there is only a single waitqueue with
waiters. However, lost wakeups are unlikely to matter in these cases,
and a proper fix requires redesign (and benchmarking) of the batched
wakeup code: so let's plug the hole with this bandaid for now.
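Concretely, the bandaid amounts to one extra condition in sbq_wake_ptr().
A sketch against lib/sbitmap.c of this era (the shipped diff may differ
in detail):

static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
{
        int i, wake_index;

        if (!atomic_read(&sbq->ws_active))
                return NULL;

        wake_index = atomic_read(&sbq->wake_index);
        for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
                struct sbq_wait_state *ws = &sbq->ws[wake_index];

                /*
                 * Skip a ws whose wait_cnt was raced down to 0: nothing
                 * useful can be done with it until its batch is refilled.
                 */
                if (waitqueue_active(&ws->wait) &&
                    atomic_read(&ws->wait_cnt) > 0) {
                        if (wake_index != atomic_read(&sbq->wake_index))
                                atomic_set(&sbq->wake_index, wake_index);
                        return ws;
                }

                wake_index = sbq_index_inc(wake_index);
        }

        return NULL;
}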
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Alexey reported that the fraction of unknown filename instances in
kallsyms grew from ~0.3% to ~10% recently; Bill and Greg tracked it down
to assembler defined symbols, which regressed as a result of:
commit b8a9092330 ("Kbuild: do not emit debug info for assembly with LLVM_IAS=1")
In that commit, I alluded to restoring debug info for assembler-defined
symbols in a follow-up patch, but it seems I forgot to do so in
commit a66049e2cf ("Kbuild: make DWARF version a choice")
Link: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=31bf18645d98b4d3d7357353be840e320649a67d
Fixes: b8a9092330 ("Kbuild: do not emit debug info for assembly with LLVM_IAS=1")
Reported-by: Alexey Alexandrov <aalexand@google.com>
Reported-by: Bill Wendling <morbo@google.com>
Reported-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Use atomic_long_try_cmpxchg instead of
atomic_long_cmpxchg (*ptr, old, new) == old in __sbitmap_queue_get_batch.
x86 CMPXCHG instruction returns success in ZF flag, so this change
saves a compare after cmpxchg (and related move instruction in front
of cmpxchg).
Also, atomic_long_cmpxchg implicitly assigns old *ptr value to "old"
when cmpxchg fails, enabling further code simplifications, e.g.
an extra memory read can be avoided in the loop.
No functional change intended.
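To illustrate the pattern change (a generic sketch, not the actual
__sbitmap_queue_get_batch() body; the 'old | mask' update is a stand-in):

#include <linux/atomic.h>

/* Before: atomic_long_cmpxchg() returns the old value, so the caller
 * must compare it and re-read the target on failure. */
static inline void set_bits_cmpxchg(atomic_long_t *p, long mask)
{
        long old = atomic_long_read(p);

        while (atomic_long_cmpxchg(p, old, old | mask) != old)
                old = atomic_long_read(p);      /* extra compare + read */
}

/* After: atomic_long_try_cmpxchg() returns success directly (the ZF
 * check on x86) and stores the fresh value into 'old' on failure. */
static inline void set_bits_try_cmpxchg(atomic_long_t *p, long mask)
{
        long old = atomic_long_read(p);

        while (!atomic_long_try_cmpxchg(p, &old, old | mask))
                ;       /* 'old' was refreshed by the failed try_cmpxchg */
}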
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220908151200.9993-1-ubizjak@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When __sbq_wake_up() decrements wait_cnt to 0 but races with someone
else waking the waiter on the waitqueue (so the waitqueue becomes
empty), it exits without resetting wait_cnt to the wake_batch number. Once
wait_cnt is 0, nobody will ever reset it or wake the new
waiters, resulting in possible deadlocks or busy-loops. Fix the problem by
making sure we reset wait_cnt even if we didn't wake up anybody in the
end.
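A simplified sketch of the intended flow (not the literal diff; the
cmpxchg-based refill and error paths are elided):

/* Sketch of __sbq_wake_up(): whoever decrements wait_cnt to 0 owns the
 * batch and must refill it, whether or not a waiter is still queued. */
static bool __sbq_wake_up_sketch(struct sbitmap_queue *sbq,
                                 struct sbq_wait_state *ws)
{
        unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
        int wait_cnt = atomic_dec_return(&ws->wait_cnt);

        if (wait_cnt > 0)
                return false;

        /* Advance wake_index and reset wait_cnt unconditionally, even if
         * the waitqueue went empty because someone else woke the waiter. */
        sbq_index_atomic_inc(&sbq->wake_index);
        atomic_set(&ws->wait_cnt, wake_batch);
        wake_up_nr(&ws->wait, wake_batch);
        return true;
}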
Fixes: 040b83fcec ("sbitmap: fix possible io hung due to lost wakeup")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908130937.2795-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 16ede66973.
This is causing issues with CPU stalls on my test box, revert it for
now until we understand what is going on. It looks like infinite
looping off sbitmap_queue_wake_up(), but it is hard to tell with a lot of
CPUs hitting this issue and the console scrolling infinitely.
Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'v6.0-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"Fix a boot performance regression due to an unnecessary dependency on
XOR_BLOCKS"
* tag 'v6.0-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: lib - remove unneeded selection of XOR_BLOCKS
Hi Linus,
Please pull (hopefully) the last portion of fixes from Sander for his
UP rework series. The original series came from the -mm tree, and it was
not the latest version; that's why we need follow-ups. They fix only
a test introduced by that series, which fails under certain configs.
From Sander:
This series fixes the reported issues, and implements the suggested
improvements, for the version of the cpumask tests [1] that was merged
with commit c41e8866c2 ("lib/test: introduce cpumask KUnit test
suite").
These changes include fixes for the tests, and better alignment with the
KUnit style guidelines.
Thanks,
Yury
Merge tag 'bitmap-6.0-rc3' of github.com:/norov/linux
Pull bitmap fixes from Yury Norov:
"Fix the reported issues, and implements the suggested improvements,
for the version of the cpumask tests [1] that was merged with commit
c41e8866c2 ("lib/test: introduce cpumask KUnit test suite").
These changes include fixes for the tests, and better alignment with
the KUnit style guidelines"
* tag 'bitmap-6.0-rc3' of github.com:/norov/linux:
lib/cpumask_kunit: add tests file to MAINTAINERS
lib/cpumask_kunit: log mask contents
lib/test_cpumask: follow KUnit style guidelines
lib/test_cpumask: fix cpu_possible_mask last test
lib/test_cpumask: drop cpu_possible_mask full test
CRYPTO_LIB_CHACHA_GENERIC doesn't need to select XOR_BLOCKS. It perhaps
was thought that it's needed for __crypto_xor, but that's not the case.
Enabling XOR_BLOCKS is problematic because the XOR_BLOCKS code runs a
benchmark when it is initialized. That causes a boot time regression on
systems that didn't have it enabled before.
Therefore, remove this unnecessary and problematic selection.
Fixes: e56e189855 ("lib/crypto: add prompts back to crypto libraries")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Merge tag 'net-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from ipsec and netfilter (with one broken Fixes tag).
Current release - new code bugs:
- dsa: don't dereference NULL extack in dsa_slave_changeupper()
- dpaa: fix <1G ethernet on LS1046ARDB
- neigh: don't call kfree_skb() under spin_lock_irqsave()
Previous releases - regressions:
- r8152: fix the RX FIFO settings when suspending
- dsa: microchip: keep compatibility with device tree blobs with no
phy-mode
- Revert "net: macsec: update SCI upon MAC address change."
- Revert "xfrm: update SA curlft.use_time", comply with RFC 2367
Previous releases - always broken:
- netfilter: conntrack: work around exceeded TCP receive window
- ipsec: fix a null pointer dereference of dst->dev on a metadata dst
in xfrm_lookup_with_ifid
- moxa: get rid of asymmetry in DMA mapping/unmapping
- dsa: microchip: make learning configurable and keep it off while
standalone
- ice: xsk: prohibit usage of non-balanced queue id
- rxrpc: fix locking in rxrpc's sendmsg
Misc:
- another chunk of sysctl data race silencing"
* tag 'net-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
net: lantiq_xrx200: restore buffer if memory allocation failed
net: lantiq_xrx200: fix lock under memory pressure
net: lantiq_xrx200: confirm skb is allocated before using
net: stmmac: work around sporadic tx issue on link-up
ionic: VF initial random MAC address if no assigned mac
ionic: fix up issues with handling EAGAIN on FW cmds
ionic: clear broken state on generation change
rxrpc: Fix locking in rxrpc's sendmsg
net: ethernet: mtk_eth_soc: fix hw hash reporting for MTK_NETSYS_V2
MAINTAINERS: rectify file entry in BONDING DRIVER
i40e: Fix incorrect address type for IPv6 flow rules
ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
net: Fix a data-race around sysctl_somaxconn.
net: Fix a data-race around netdev_unregister_timeout_secs.
net: Fix a data-race around gro_normal_batch.
net: Fix data-races around sysctl_devconf_inherit_init_net.
net: Fix data-races around sysctl_fb_tunnels_only_for_init_net.
net: Fix a data-race around netdev_budget_usecs.
net: Fix data-races around sysctl_max_skb_frags.
net: Fix a data-race around netdev_budget.
...
The cpumask test suite doesn't follow the KUnit style guidelines, as
laid out in Documentation/dev-tools/kunit/style.rst. The file is
renamed to lib/cpumask_kunit.c to clearly distinguish it from other,
non-KUnit, tests.
Link: https://lore.kernel.org/lkml/346cb279-8e75-24b0-7d12-9803f2b41c73@riseup.net/
Suggested-by: Maíra Canal <mairacanal@riseup.net>
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Reviewed-by: Maíra Canal <mairacanal@riseup.net>
Reviewed-by: David Gow <davidgow@google.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Since cpumask_first() on the cpu_possible_mask must return at most
nr_cpu_ids - 1 for a valid result, cpumask_last() cannot return anything
larger than this value. As test_cpumask_weight() also verifies that the
total weight of cpu_possible_mask must equal nr_cpu_ids, the last bit
set in this mask must be at nr_cpu_ids - 1.
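So the corrected expectation reads (a sketch; the test file may phrase
it slightly differently):

/* The last set bit of cpu_possible_mask must sit at nr_cpu_ids - 1. */
KUNIT_EXPECT_EQ(test, nr_cpu_ids - 1, cpumask_last(cpu_possible_mask));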
Fixes: c41e8866c2 ("lib/test: introduce cpumask KUnit test suite")
Link: https://lore.kernel.org/lkml/346cb279-8e75-24b0-7d12-9803f2b41c73@riseup.net/
Reported-by: Maíra Canal <mairacanal@riseup.net>
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Tested-by: Maíra Canal <mairacanal@riseup.net>
Reviewed-by: David Gow <davidgow@google.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
When the number of CPUs that can possibly be brought online is known at
boot time, e.g. when HOTPLUG is disabled, nr_cpu_ids may be smaller than
NR_CPUS. In that case, cpu_possible_mask would not be completely filled,
and cpumask_full(cpu_possible_mask) can return false for valid system
configurations.
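The dropped assertion had this shape (a sketch per the reasoning above):

/* Invalid whenever nr_cpu_ids < NR_CPUS: cpu_possible_mask then has a
 * legitimately unset tail, so cpumask_full() may return false. */
KUNIT_EXPECT_TRUE(test, cpumask_full(cpu_possible_mask));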
Without this test, cpu_possible_mask contents are still constrained by
a check on cpumask_weight(), as well as tests in test_cpumask_first(),
test_cpumask_last(), test_cpumask_next(), and test_cpumask_iterators().
Fixes: c41e8866c2 ("lib/test: introduce cpumask KUnit test suite")
Link: https://lore.kernel.org/lkml/346cb279-8e75-24b0-7d12-9803f2b41c73@riseup.net/
Reported-by: Maíra Canal <mairacanal@riseup.net>
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Tested-by: Maíra Canal <mairacanal@riseup.net>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
rs->interval and rs->burst can be changed concurrently via sysctl
(e.g. net_ratelimit_state) while they are being read. Thus, we
need to add READ_ONCE() to their readers.
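Roughly, the reader side becomes (an abridged sketch of ___ratelimit();
only the snapshotting is shown):

int ___ratelimit(struct ratelimit_state *rs, const char *func)
{
        /* Snapshot the sysctl-writable fields once, tearing-free. */
        int interval = READ_ONCE(rs->interval);
        int burst = READ_ONCE(rs->burst);

        if (!interval)
                return 1;       /* ratelimiting disabled */

        /* ... the rest of the function works on the snapshots ... */
        return 0;
}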
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two problems that can lead to lost wakeups:
1) invalid wakeup on the wrong waitqueue:
For example, 2 * wake_batch tags are put, while only wake_batch threads
are woken:
__sbq_wake_up
 atomic_cmpxchg -> reset wait_cnt
                        __sbq_wake_up -> decrease wait_cnt
                        ...
                        __sbq_wake_up -> wait_cnt is decreased to 0 again
                         atomic_cmpxchg
                         sbq_index_atomic_inc -> increase wake_index
                         wake_up_nr -> wake up and waitqueue might be empty
 sbq_index_atomic_inc -> increase again, one waitqueue is skipped
 wake_up_nr -> invalid wake up because old waitqueue might be empty
To fix the problem, increase 'wake_index' before resetting 'wait_cnt'.
2) 'wait_cnt' can be decreased while the waitqueue is empty
As pointed out by Jan Kara, the following race is possible:
CPU1                                CPU2
__sbq_wake_up                       __sbq_wake_up
 sbq_wake_ptr()                      sbq_wake_ptr() -> the same
 wait_cnt = atomic_dec_return()
 /* decreased to 0 */
 sbq_index_atomic_inc()
 /* move to next waitqueue */
 atomic_set()
 /* reset wait_cnt */
 wake_up_nr()
 /* wake up on the old waitqueue */
                                     wait_cnt = atomic_dec_return()
                                     /*
                                      * decrease wait_cnt in the old
                                      * waitqueue, while it can be
                                      * empty.
                                      */
Fix the problem by waking up before updating 'wake_index' and
'wait_cnt'.
With this patch, note that 'wait_cnt' is still decreased in the old
empty waitqueue; however, the wakeup is redirected to an active
waitqueue, and the extra decrement on the old empty waitqueue is not
handled.
Fixes: 88459642cb ("blk-mq: abstract tag allocation out into sbitmap library")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220803121504.212071-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since lib/cpumask.o is only built for CONFIG_SMP=y, NR_CPUS will always
be greater than 1 at compile time. This makes checking for that
condition unnecessary, so it can be dropped.
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
In the uniprocessor case, cpumask_next_wrap() can be simplified, as the
number of valid argument combinations is limited:
- 'start' can only be 0
- 'n' can only be -1 or 0
The only valid CPU that can then be returned, if any, will be the first
one set in the provided 'mask'.
For NR_CPUS == 1, include/linux/cpumask.h now provides an inline
definition of cpumask_next_wrap(), which will conflict with the one
provided by lib/cpumask.c. Make building of lib/cpumask.o again depend
on CONFIG_SMP=y (i.e. NR_CPUS > 1) to avoid the re-definition.
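A sketch of the UP-only inline (per the argument analysis above; the
header may differ in detail):

#if NR_CPUS == 1
static inline unsigned int cpumask_next_wrap(int n, const struct cpumask *mask,
                                             int start, bool wrap)
{
        /*
         * 'start' can only be 0 and 'n' only -1 or 0 here; once we have
         * wrapped (wrap && n >= 0), there is nothing left to return.
         */
        if (wrap && n >= 0)
                return nr_cpumask_bits;

        return cpumask_first(mask);
}
#endif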
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Commit 36d4b36b69 ("lib/nodemask: inline next_node_in() and
node_random()") removed the lib/nodemask.c file, but the remove didn't
happen when the patch was applied.
Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more iov_iter updates from Al Viro:
- more new_sync_{read,write}() speedups - ITER_UBUF introduction
- ITER_PIPE cleanups
- unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
switching them to advancing semantics
- making ITER_PIPE take high-order pages without splitting them
- handling copy_page_from_iter() for high-order pages properly
* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
fix copy_page_from_iter() for compound destinations
hugetlbfs: copy_page_to_iter() can deal with compound pages
copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
expand those iov_iter_advance()...
pipe_get_pages(): switch to append_pipe()
get rid of non-advancing variants
ceph: switch the last caller of iov_iter_get_pages_alloc()
9p: convert to advancing variant of iov_iter_get_pages_alloc()
af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: saner helper for page array allocation
fold __pipe_get_pages() into pipe_get_pages()
ITER_XARRAY: don't open-code DIV_ROUND_UP()
unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
unify xarray_get_pages() and xarray_get_pages_alloc()
unify pipe_get_pages() and pipe_get_pages_alloc()
iov_iter_get_pages(): sanity-check arguments
iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
...
now that we are advancing the iterator, there's no need to
treat the first page separately - just call append_pipe()
in a loop.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
mechanical change; will be further massaged in subsequent commits
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All call sites of get_pages_array() are essentially identical now.
Replace with common helper...
Returns number of slots available in resulting array or 0 on OOM;
it's up to the caller to make sure it doesn't ask for a zero-entry
array (i.e. neither maxpages nor size are allowed to be zero).
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and don't mangle maxsize there - turn the loop into counting
one instead. Easier to see that we won't run out of array that
way. Note that special treatment of the partial buffer in that
thing is an artifact of the non-advancing semantics of
iov_iter_get_pages() - if not for that, it would be append_pipe(),
same as the body of the loop that follows it. IOW, once we make
iov_iter_get_pages() advancing, the whole thing will turn into
        calculate how many pages we want
        allocate an array (if needed)
        call append_pipe() that many times.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
same as for pipes and xarrays; after that iov_iter_get_pages() becomes
a wrapper for __iov_iter_get_pages_alloc().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The differences between those two are
* pipe_get_pages() gets a non-NULL struct page ** value pointing to
preallocated array + array size.
* pipe_get_pages_alloc() gets an address of struct page ** variable that
contains NULL, allocates the array and (on success) stores its address in
that variable.
Not hard to combine - always pass struct page ***, have
the previous pipe_get_pages_alloc() caller pass ~0U as cap for
array size.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
zero maxpages is bogus, but best treated as "just return 0";
NULL pages, OTOH, should be treated as a hard bug.
get rid of now completely useless checks in xarray_get_pages{,_alloc}().
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Incidentally, ITER_XARRAY did *not* free the sucker in case when
iter_xarray_populate_pages() returned 0...
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All their callers are next to each other; all of them
want the total amount of pages and, possibly, the
offset in the partial final buffer.
Combine them into a new helper (pipe_npages()) and fix the
bogosity in pipe_space_for_user() while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We often need to find whether the last buffer is anon or not, and
currently it's rather clumsy:
        check if ->iov_offset is non-zero (i.e. that pipe is not empty)
        if so, get the corresponding pipe_buffer and check its ->ops
        if it's &default_pipe_buf_ops, we have an anon buffer.
Let's replace the use of ->iov_offset (which is nowhere near similar to
its role for other flavours) with a signed field (->last_offset), with
the following rules:
        empty, no buffers occupied:                     0
        anon, with bytes up to N-1 filled:              N
        zero-copy, with bytes up to N-1 filled:         -N
That way abs(i->last_offset) is equal to what used to be in i->iov_offset
and empty vs. anon vs. zero-copy can be distinguished by the sign of
i->last_offset.
Checks for "should we extend the last buffer or should we start
a new one?" become easier to follow that way.
Note that most of the operations can only be done in a sane
state - i.e. when the pipe has nothing past the current position of
iterator. About the only thing that could be done outside of that
state is iov_iter_advance(), which transitions to the sane state by
truncating the pipe. There are only two cases where we leave the
sane state:
1) iov_iter_get_pages()/iov_iter_get_pages_alloc(). Will be
dealt with later, when we make get_pages advancing - the callers are
actually happier that way.
2) iov_iter copied, then something is put into the copy. Since
they share the underlying pipe, the original gets behind. When we
decide that we are done with the copy (original is not usable until then)
we advance the original. direct_io used to be done that way; nowadays
it operates on the original and we do iov_iter_revert() to discard
the excessive data. At the moment there's nothing in the kernel that
could do that to ITER_PIPE iterators, so this reason for insane state
is theoretical right now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fold pipe_truncate() into it, clean up. We can release buffers
in the same loop where we walk backwards to the iterator beginning
looking for the place where the new position will be.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
instead of setting ->iov_offset for new position and calling
pipe_truncate() to adjust ->len of the last buffer and discard
everything after it, adjust ->len at the same time we set ->iov_offset
and use pipe_discard_from() to deal with buffers past that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's only used to get to the partial buffer we can add to,
and that's always the last one, i.e. pipe->head - 1.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Expand the only remaining call of push_pipe() (in
__pipe_get_pages()), combine it with the page-collecting loop there.
Note that the only reason it's not a loop doing append_pipe() is
that append_pipe() is advancing, while iov_iter_get_pages() is not.
As soon as it switches to saner semantics, this thing will switch
to using append_pipe().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New helper: append_pipe(). Extends the last buffer if possible,
allocates a new one otherwise. Returns page and offset in it
on success, NULL on failure. iov_iter is advanced past the
data we've got.
Use that instead of push_pipe() in copy-to-pipe primitives;
they get simpler that way. Handling of short copy (in "mc" one)
is done simply by iov_iter_revert() - iov_iter is in consistent
state after that one, so we can use that.
[Fix for braino caught by Liu Xinpeng <liuxp11@chinatelecom.cn> folded in]
[another braino fix, this time in copy_pipe_to_iter() and pipe_zero();
caught by testcase from Hugh Dickins]
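A hypothetical caller shape implied by that contract (everything except
append_pipe() itself is illustrative):

static size_t copy_to_pipe_sketch(struct iov_iter *i, const char *addr,
                                  size_t bytes)
{
        size_t left = bytes;

        while (left) {
                unsigned int off;
                struct page *page = append_pipe(i, left, &off);
                size_t chunk;

                if (!page)
                        break;  /* pipe full or buffer allocation failed */
                chunk = min_t(size_t, left, PAGE_SIZE - off);
                memcpy_to_page(page, off, addr, chunk); /* iter already advanced */
                addr += chunk;
                left -= chunk;
        }
        return bytes - left;    /* may be a short copy */
}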
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are only two kinds of pipe_buffer in the area used by ITER_PIPE.
1) anonymous - copy_to_iter() et.al. end up creating those and copying
data there. They have zero ->offset, and their ->ops points to
default_pipe_page_ops.
2) zero-copy ones - those come from copy_page_to_iter(), and page
comes from caller. ->offset is also caller-supplied - it might be
non-zero. ->ops points to page_cache_pipe_buf_ops.
Move creation and insertion of those into helpers - push_anon(pipe, size)
and push_page(pipe, page, offset, size) resp., separating them from
the "could we avoid creating a new buffer by merging with the current
head?" logics.
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pipe_buffer instances of a pipe are organized as a ring buffer,
with power-of-2 size. Indices are kept *not* reduced modulo ring
size, so the buffer referred to by index N is
pipe->bufs[N & (pipe->ring_size - 1)].
Ring size can change over the lifetime of a pipe, but not while
the pipe is locked. So for any iov_iter primitives it's a constant.
Original conversion of pipes to this layout went overboard trying
to microoptimize that - calculating pipe->ring_size - 1, storing
it in a local variable and using it throughout the function. In some
cases it might be warranted, but most of the times it only
obfuscates what's going on in there.
Introduce a helper (pipe_buf(pipe, N)) that would encapsulate
that and use it in the obvious cases. More will follow...
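The helper is just the formula above given a name (sketch):

static inline struct pipe_buffer *pipe_buf(const struct pipe_inode_info *pipe,
                                           unsigned int slot)
{
        return &pipe->bufs[slot & (pipe->ring_size - 1)];
}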
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.
We are going to expose things like ->write_iter() et.al. to those
in subsequent commits.
New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.
DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.
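A sketch of the intended usage (the surrounding function is illustrative):

static void ubuf_example(void __user *buf, size_t len)
{
        struct iov_iter i;

        /* Single user buffer: no iovec array to set up or walk. */
        iov_iter_ubuf(&i, WRITE, buf, len);

        if (user_backed_iter(&i)) {
                /* true for ITER_IOVEC and ITER_UBUF: pages we modify after
                 * iov_iter_get_pages() must be dirtied */
        }
}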
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This branch consists of:
Qu Wenruo:
lib: bitmap: fix the duplicated comments on bitmap_to_arr64()
https://lore.kernel.org/lkml/0d85e1dbad52ad7fb5787c4432bdb36cbd24f632.1656063005.git.wqu@suse.com/
Alexander Lobakin:
bitops: let optimize out non-atomic bitops on compile-time constants
https://lore.kernel.org/lkml/20220624121313.2382500-1-alexandr.lobakin@intel.com/T/
Yury Norov:
lib: cleanup bitmap-related headers
https://lore.kernel.org/linux-arm-kernel/YtCVeOGLiQ4gNPSf@yury-laptop/T/#m305522194c4d38edfdaffa71fcaaf2e2ca00a961
Alexander Lobakin:
x86/olpc: fix 'logical not is only applied to the left hand side'
https://www.spinics.net/lists/kernel/msg4440064.html
Yury Norov:
lib/nodemask: inline wrappers around bitmap
https://lore.kernel.org/all/20220723214537.2054208-1-yury.norov@gmail.com/
Merge tag 'bitmap-6.0-rc1' of https://github.com/norov/linux
Pull bitmap updates from Yury Norov:
- fix the duplicated comments on bitmap_to_arr64() (Qu Wenruo)
- optimize out non-atomic bitops on compile-time constants (Alexander
Lobakin)
- cleanup bitmap-related headers (Yury Norov)
- x86/olpc: fix 'logical not is only applied to the left hand side'
(Alexander Lobakin)
- lib/nodemask: inline wrappers around bitmap (Yury Norov)
* tag 'bitmap-6.0-rc1' of https://github.com/norov/linux: (26 commits)
lib/nodemask: inline next_node_in() and node_random()
powerpc: drop dependency on <asm/machdep.h> in archrandom.h
x86/olpc: fix 'logical not is only applied to the left hand side'
lib/cpumask: move some one-line wrappers to header file
headers/deps: mm: align MANITAINERS and Docs with new gfp.h structure
headers/deps: mm: Split <linux/gfp_types.h> out of <linux/gfp.h>
headers/deps: mm: Optimize <linux/gfp.h> header dependencies
lib/cpumask: move trivial wrappers around find_bit to the header
lib/cpumask: change return types to unsigned where appropriate
cpumask: change return types to bool where appropriate
lib/bitmap: change type of bitmap_weight to unsigned long
lib/bitmap: change return types to bool where appropriate
arm: align find_bit declarations with generic kernel
iommu/vt-d: avoid invalid memory access via node_online(NUMA_NO_NODE)
lib/test_bitmap: test the tail after bitmap_to_arr64()
lib/bitmap: fix off-by-one in bitmap_to_arr64()
lib: test_bitmap: add compile-time optimization/evaluations assertions
bitmap: don't assume compiler evaluates small mem*() builtins calls
net/ice: fix initializing the bitmap in the switch code
bitops: let optimize out non-atomic bitops on compile-time constants
...
Merge tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"Updates to various subsystems which I help look after. lib, ocfs2,
fatfs, autofs, squashfs, procfs, etc. A relatively small amount of
material this time"
* tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
scripts/gdb: ensure the absolute path is generated on initial source
MAINTAINERS: kunit: add David Gow as a maintainer of KUnit
mailmap: add linux.dev alias for Brendan Higgins
mailmap: update Kirill's email
profile: setup_profiling_timer() is moslty not implemented
ocfs2: fix a typo in a comment
ocfs2: use the bitmap API to simplify code
ocfs2: remove some useless functions
lib/mpi: fix typo 'the the' in comment
proc: add some (hopefully) insightful comments
bdi: remove enum wb_congested_state
kernel/hung_task: fix address space of proc_dohung_task_timeout_secs
lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t()
squashfs: support reading fragments in readahead call
squashfs: implement readahead
squashfs: always build "file direct" version of page actor
Revert "squashfs: provide backing_dev_info in order to disable read-ahead"
fs/ocfs2: Fix spelling typo in comment
ia64: old_rr4 added under CONFIG_HUGETLB_PAGE
proc: fix test for "vsyscall=xonly" boot option
...
Merge tag 'livepatching-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching update from Petr Mladek:
- Make a selftest more reliable
* tag 'livepatching-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
selftests/livepatch: better synchronize test_klp_callbacks_busy
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...