linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-14 16:12:02 +00:00

Muchun Song 14c2404884 locking/rwsem: Optimize down_read_trylock() under highly contended case

We found that a process with 10 thousand threads encountered a regression when migrating from Linux-v4.14 to Linux-v5.4. The workload sometimes allocates lots of memory concurrently in different threads. In this case, down_read_trylock() shows up as a major hotspot. We therefore suspect that rwsem has regressed at least since Linux-v5.4. To make the problem easy to debug, we wrote a simple benchmark that creates a similar situation:

```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>

#include <cstdio>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>

volatile int mutex;

void trigger(int cpu, char* ptr, std::size_t sz)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

    while (mutex);

    for (std::size_t i = 0; i < sz; i += 4096) {
        *ptr = '\0';
        ptr += 4096;
    }
}

int main(int argc, char* argv[])
{
    std::size_t sz = 100;

    if (argc > 1)
        sz = atoi(argv[1]);

    auto nproc = std::thread::hardware_concurrency();

    std::vector<std::thread> thr;
    sz <<= 30;
    auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_PRIVATE, -1, 0);
    assert(ptr != MAP_FAILED);
    char* cptr = static_cast<char*>(ptr);
    auto run = sz / nproc;
    run = (run >> 12) << 12;

    mutex = 1;

    for (auto i = 0U; i < nproc; ++i) {
        thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
        cptr += run;
    }

    rusage usage_start;
    getrusage(RUSAGE_SELF, &usage_start);
    auto start = std::chrono::system_clock::now();

    mutex = 0;

    for (auto& t : thr)
        t.join();

    rusage usage_end;
    getrusage(RUSAGE_SELF, &usage_end);
    auto end = std::chrono::system_clock::now();
    timeval utime;
    timeval stime;
    timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
    timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
    printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
    printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
    printf("real: %lu\n",
           std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count());

    return 0;
}
```

The program simply creates `nproc` threads, each pinned to a different CPU, and has every thread touch memory (triggering page faults). `perf top` then shows a profile similar to the following.

  25.55%  [kernel]  [k] down_read_trylock
  14.78%  [kernel]  [k] handle_mm_fault
  13.45%  [kernel]  [k] up_read
   8.61%  [kernel]  [k] clear_page_erms
   3.89%  [kernel]  [k] __do_page_fault

The hottest instruction in down_read_trylock(), accounting for about 92%, is the cmpxchg shown below.

  91.89 │      lock   cmpxchg %rdx,(%rdi)

Since the problem was found by migrating from Linux-v4.14 to Linux-v5.4, we easily found that commit `ddb20d1d3a` ("locking/rwsem: Optimize down_read_trylock()") caused the regression. The reason is that the commit assumes the rwsem is not contended at all. That is not always true for the mmap lock, which can be contended by thousands of threads. As a result, most threads need to execute "cmpxchg" at least twice to acquire the lock. Atomic operations cost more than non-atomic instructions, which caused the regression.

Using the above benchmark, the real execution times on an x86-64 system before and after the patch were:

                 Before Patch  After Patch
  # of Threads    real (ms)     real (ms)   reduced by
  ------------   ------------  -----------  ----------
        1           65,373        65,206       ~0.0%
        4           15,467        15,378       ~0.5%
       40            6,214         5,528      ~11.0%

For the uncontended case, the new down_read_trylock() performs the same as before. For the contended cases, the new down_read_trylock() is faster than before, and the more contended the lock, the larger the gain.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
2021-11-23 09:45:36 +01:00
..
irqflag-debug.c	lockdep: Noinstr annotate warn_bogus_irq_restore()	2021-02-10 14:44:39 +01:00
lock_events_list.h	locking/rwsem: Remove reader optimistic spinning	2020-12-09 17:08:48 +01:00
lock_events.c	locking/lock_events: Don't show pvqspinlock events on bare metal	2019-04-10 10:56:05 +02:00
lock_events.h	locking/lock_events: Use raw_cpu_{add,inc}() for stats	2019-06-03 12:32:56 +02:00
lockdep_internals.h	lockdep: Allow tuning tracing capacity constants.	2021-04-05 20:33:57 +09:00
lockdep_proc.c	locking/lockdep: Fix meaningless /proc/lockdep output of lock classes on !CONFIG_PROVE_LOCKING	2021-07-05 10:44:52 +02:00
lockdep_states.h	locking/lockdep: Rework FS_RECLAIM annotation	2017-08-10 12:29:03 +02:00
lockdep.c	Merge branch 'akpm' (patches from Andrew)	2021-11-09 10:11:53 -08:00
locktorture.c	locktorture: Warn on individual lock_torture_init() error conditions	2021-09-13 16:36:16 -07:00
Makefile	locking/ww_mutex: Implement rtmutex based ww_mutex API functions	2021-08-17 19:05:26 +02:00
mcs_spinlock.h	locking: Fix typos in comments	2021-03-22 02:45:52 +01:00
mutex-debug.c	locking/ww_mutex: Gather mutex_waiter initialization	2021-08-17 19:04:41 +02:00
mutex.c	locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()	2021-10-19 17:27:06 +02:00
mutex.h	locking/mutex: Move the 'struct mutex_waiter' definition from <linux/mutex.h> to the internal header	2021-08-17 18:24:31 +02:00
osq_lock.c	locking: Fix typos in comments	2021-03-22 02:45:52 +01:00
percpu-rwsem.c	locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count	2020-09-16 16:26:56 +02:00
qrwlock.c	locking/qrwlock: Cleanup queued_write_lock_slowpath()	2021-05-06 15:33:49 +02:00
qspinlock_paravirt.h	Revert "locking/pvqspinlock: Don't wait if vCPU is preempted"	2019-09-25 10:22:37 +02:00
qspinlock_stat.h	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157	2019-05-30 11:26:37 -07:00
qspinlock.c	x86/kvm: Add "nopvspin" parameter to disable PV spinlocks	2020-07-08 16:21:57 -04:00
rtmutex_api.c	locking/rtmutex: Prevent lockdep false positive with PI futexes	2021-08-17 19:06:02 +02:00
rtmutex_common.h	locking/rtmutex: Dont dereference waiter lockless	2021-08-25 15:42:32 +02:00
rtmutex.c	rtmutex: Wake up the waiters lockless while dropping the read lock.	2021-10-01 13:57:52 +02:00
rwbase_rt.c	locking/rwbase: Optimize rwbase_read_trylock	2021-10-07 13:51:07 +02:00
rwsem.c	locking/rwsem: Optimize down_read_trylock() under highly contended case	2021-11-23 09:45:36 +01:00
semaphore.c	locking/semaphore: Add might_sleep() to down_*() family	2021-08-20 12:33:17 +02:00
spinlock_debug.c	locking/rwlock: Provide RT variant	2021-08-17 17:50:51 +02:00
spinlock_rt.c	locking/rt: Take RCU nesting into account for __might_resched()	2021-10-01 13:57:51 +02:00
spinlock.c	locking: Remove spin_lock_flags() etc	2021-10-30 16:37:28 +02:00
test-ww_mutex.c	locking/ww-mutex: Fix uninitialized use of ret in test_aa()	2021-10-01 13:57:49 +02:00
ww_mutex.h	locking/ww_mutex: Add rt_mutex based lock type and accessors	2021-08-17 19:05:11 +02:00
ww_rt_mutex.c	kernel/locking: Add context to ww_mutex_trylock()	2021-09-17 15:08:41 +02:00