linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-20 02:51:44 +00:00

Mathieu Desnoyers d7822b1e24 rseq: Introduce restartable sequences system call

Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitate porting to other
architectures and speed up the user-space fast path.

Here are benchmarks of various rseq use-cases.

Test hardware:

arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

The following benchmarks were all performed on a single thread.

* Per-CPU statistic counter increment

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                344.0                 31.4          11.0
x86-64:                15.3                  2.0           7.7

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
             per-cpu buffer

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:               2502.0               2250.0           1.1
x86-64:               117.4                 98.0           1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                751.0                128.5           5.8
x86-64:                53.4                 28.6           1.9

* jemalloc memory allocator adapted to use rseq

Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):

The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

Keeping the current cpu id in a memory area shared between kernel and
user-space improves on the mechanisms currently available to read the
current CPU number, with the following benefits over alternative
approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is not
portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                          8.4 ns
- Read CPU from rseq cpu_id:                     16.7 ns
- Read CPU from rseq cpu_id (lazy register):     19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:                 301.8 ns
- getcpu system call:                           234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                          0.8 ns
- Read CPU from rseq cpu_id:                      0.8 ns
- Read CPU from rseq cpu_id (lazy register):      0.8 ns
- Read using gs segment selector:                 0.8 ns
- "lsl" inline assembly:                         13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                    16.6 ns
- getcpu system call:                            53.9 ns

- Speed (benchmark taken on v8 of patchset)

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.:      41.37 s
std.dev.:   0.36 s

* CONFIG_RSEQ=y

avg.:      40.46 s
std.dev.:   0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

2018-06-06 11:58:31 +02:00
..
9p	9p: unify paths in v9fs_vfs_lookup()	2018-05-22 14:28:02 -04:00
adfs	adfs_lookup: do not fail with ENOENT on negatives, use d_splice_alias()	2018-05-22 14:27:56 -04:00
affs	affs: fix potential memory leak when parsing option 'prefix'	2018-05-28 12:36:41 +02:00
afs	Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 10:00:01 -07:00
autofs4	autofs: mount point create should honour passed in mode	2018-04-20 17:18:35 -07:00
befs	befs_lookup(): use d_splice_alias()	2018-05-21 14:30:07 -04:00
bfs	bfs_add_entry: pass name/len as qstr pointer	2018-05-22 14:27:50 -04:00
btrfs	for-4.18-tag	2018-06-04 14:29:13 -07:00
cachefiles	Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 10:00:01 -07:00
ceph	ceph: fix iov_iter issues in ceph_direct_read_write()	2018-05-10 10:15:12 +02:00
cifs	some smb3 fixes for stable, as well as addition of ftrace hooks for cifs.ko, and improvements in compounding and smbdirect (RDMA)	2018-06-04 14:42:46 -07:00
coda	vfs: do bulk POLL* -> EPOLL* replacement	2018-02-11 14:34:03 -08:00
configfs
cramfs	cramfs_lookup(): use d_splice_alias()	2018-05-22 14:27:51 -04:00
crypto	fscrypt: fix build with pre-4.6 gcc versions	2018-02-01 10:51:18 -05:00
debugfs	debugfs_lookup(): switch to lookup_one_len_unlocked()	2018-03-29 15:07:47 -04:00
devpts	devpts: comment devpts_mntget()	2018-03-14 13:31:23 +01:00
dlm	dlm: remove O_NONBLOCK flag in sctp_connect_to_sock	2018-05-29 10:48:35 -05:00
ecryptfs	Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into aio-base	2018-05-26 09:16:25 +02:00
efivarfs	efivarfs: Limit the rate for non-root to read files	2018-02-22 10:21:02 -08:00
efs
exofs	scsi/osd: remove the gfp argument to osd_start_request	2018-05-14 08:55:09 -06:00
exportfs	ovl: do not try to reconnect a disconnected origin dentry	2018-04-12 12:04:49 +02:00
ext2	Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into aio-base	2018-05-26 09:16:25 +02:00
ext4	Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 10:00:01 -07:00
f2fs	Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 10:00:01 -07:00
fat	vfat: simplify checks in vfat_lookup()	2018-05-13 12:09:14 -04:00
freevxfs	freevxfs_lookup(): use d_splice_alias()	2018-05-22 14:27:51 -04:00
fscache	proc: introduce proc_create_single{,_data}	2018-05-16 07:23:35 +02:00
fuse	fuse: define the filesystem as untrusted	2018-03-23 06:31:37 -04:00
gfs2	gfs2: Iomap cleanups and improvements	2018-06-04 07:56:51 -05:00
hfs	hfs: don't allow mounting over .../rsrc	2018-05-22 14:28:00 -04:00
hfsplus	Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 13:46:22 -07:00
hostfs	hostfs: rename do_rmdir() to hostfs_do_rmdir()	2018-04-02 20:15:53 +02:00
hpfs
hugetlbfs	hugetlbfs: fix bug in pgoff overflow checking	2018-04-05 21:36:21 -07:00
isofs	isofs: fix potential memory leak in mount option parsing	2018-04-16 09:47:41 +02:00
jbd2	ext4: set h_journal if there is a failure starting a reserved handle	2018-04-18 11:49:31 -04:00
jffs2	do d_instantiate/unlock_new_inode combinations safely	2018-05-11 15:36:37 -04:00
jfs	Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 10:00:01 -07:00
kernfs	kernfs: deal with kernfs_fill_super() failures	2018-05-21 14:30:08 -04:00
lockd	net: Drop pernet_operations::async	2018-03-27 13:18:09 -04:00
minix	minix_lookup: use d_splice_alias()	2018-05-22 14:27:52 -04:00
nfs	proc: introduce proc_create_net{,_data}	2018-05-16 07:24:30 +02:00
nfs_common	net: Drop pernet_operations::async	2018-03-27 13:18:09 -04:00
nfsd	for-4.18/block-20180603	2018-06-04 07:58:06 -07:00
nilfs2	do d_instantiate/unlock_new_inode combinations safely	2018-05-11 15:36:37 -04:00
nls
notify	fsnotify: fix ignore mask logic in send_to_group()	2018-04-13 15:52:49 +02:00
ntfs	ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call	2018-03-28 01:39:02 -04:00
ocfs2	ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio"	2018-05-25 18:12:10 -07:00
omfs	omfs_lookup(): report IO errors, use d_splice_alias()	2018-05-22 14:27:58 -04:00
openpromfs	openpromfs: switch to d_splice_alias()	2018-05-22 14:27:57 -04:00
orangefs	orangefs_lookup: simplify	2018-05-22 14:27:58 -04:00
overlayfs	ovl: add support for "xino" mount and config options	2018-04-12 12:04:50 +02:00
proc	Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 13:46:22 -07:00
pstore	pstore: fix crypto dependencies without compression	2018-04-06 15:45:33 -07:00
qnx4	qnx4_lookup: use d_splice_alias()	2018-05-22 14:27:52 -04:00
qnx6	qnx6_lookup: switch to d_splice_alias()	2018-05-22 14:27:54 -04:00
quota	fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init	2018-04-09 17:48:54 +02:00
ramfs
reiserfs	Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 10:00:01 -07:00
romfs	romfs_lookup: switch to d_splice_alias()	2018-05-22 14:27:55 -04:00
squashfs
sysfs	unfuck sysfs_mount()	2018-05-21 14:30:09 -04:00
sysv	sysv_lookup: use d_splice_alias()	2018-05-22 14:27:53 -04:00
tracefs
ubifs	ubifs_lookup: use d_splice_alias()	2018-05-22 14:27:54 -04:00
udf	Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into aio-base	2018-05-26 09:16:25 +02:00
ufs	do d_instantiate/unlock_new_inode combinations safely	2018-05-11 15:36:37 -04:00
xfs	Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 13:46:22 -07:00
aio.c	aio: sanitize the limit checking in io_submit(2)	2018-05-29 23:20:17 -04:00
anon_inodes.c
attr.c	fs: Allow superblock owner to replace invalid owners of inodes	2018-05-24 11:57:18 -05:00
bad_inode.c
binfmt_aout.c	exec: introduce finalize_exec() before start_thread()	2018-04-11 10:28:37 -07:00
binfmt_elf_fdpic.c	exec: introduce finalize_exec() before start_thread()	2018-04-11 10:28:37 -07:00
binfmt_elf.c	fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error	2018-04-20 17:18:36 -07:00
binfmt_em86.c
binfmt_flat.c	exec: introduce finalize_exec() before start_thread()	2018-04-11 10:28:37 -07:00
binfmt_misc.c	fs: add ksys_close() wrapper; remove in-kernel calls to sys_close()	2018-04-02 20:16:00 +02:00
binfmt_script.c
block_dev.c	fs: convert block_dev.c to bioset_init()	2018-05-30 15:33:32 -06:00
buffer.c	Merge branch 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-04-12 12:28:32 -07:00
char_dev.c	block, char_dev: Use correct format specifier for unsigned ints	2018-03-15 17:59:24 +01:00
compat_binfmt_elf.c
compat_ioctl.c
compat.c
coredump.c
d_path.c	split d_path() and friends into a separate file	2018-03-29 15:07:46 -04:00
dax.c	Merge branch 'mm-rst' into docs-next	2018-04-16 14:25:08 -06:00
dcache.c	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 10:14:28 -07:00
dcookies.c	fs: add do_lookup_dcookie() helper; remove in-kernel call to syscall	2018-04-02 20:15:39 +02:00
direct-io.c	block: consistently use GFP_NOIO instead of __GFP_NORECLAIM	2018-05-14 08:55:18 -06:00
drop_caches.c
eventfd.c	eventfd: switch to ->poll_mask	2018-05-26 09:16:44 +02:00
eventpoll.c	fs: add new vfs_poll and file_can_poll helpers	2018-05-26 09:16:44 +02:00
exec.c	rseq: Introduce restartable sequences system call	2018-06-06 11:58:31 +02:00
fcntl.c	fasync: Fix deadlock between task-context and interrupt-context kill_fasync()	2018-05-01 07:39:50 -04:00
fhandle.c
file_table.c
file.c	fs: add ksys_close() wrapper; remove in-kernel calls to sys_close()	2018-04-02 20:16:00 +02:00
filesystems.c	proc: introduce proc_create_single{,_data}	2018-05-16 07:23:35 +02:00
fs_pin.c
fs_struct.c
fs-writeback.c	bdi: Fix oops in wb_workfn()	2018-05-03 16:11:37 -06:00
inode.c	fs: clear writeback errors in inode_init_always	2018-05-30 19:43:53 -07:00
internal.h	Revert "fs: fold open_check_o_direct into do_dentry_open"	2018-06-03 10:58:23 -07:00
ioctl.c	fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems	2018-05-24 12:04:28 -05:00
iomap.c	iomap: warn on zero-length mappings	2018-01-29 07:27:24 -08:00
Kconfig	docs/admin-guide/mm: start moving here files from Documentation/vm	2018-04-27 17:02:48 -06:00
Kconfig.binfmt	treewide: simplify Kconfig dependencies for removed archs	2018-03-26 15:55:57 +02:00
libfs.c	fs, dax: prepare for dax-specific address_space_operations	2018-03-30 11:34:55 -07:00
locks.c	proc: introduce proc_create_seq_private	2018-05-16 07:23:35 +02:00
Makefile	split d_path() and friends into a separate file	2018-03-29 15:07:46 -04:00
mbcache.c
mount.h
mpage.c
namei.c	Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2018-06-04 15:21:19 -07:00
namespace.c	fs: Allow superblock owner to access do_remount_sb()	2018-05-24 12:02:25 -05:00
no-block.c
nsfs.c	net: Export open_related_ns()	2018-02-15 15:34:42 -05:00
open.c	Revert "fs: fold open_check_o_direct into do_dentry_open"	2018-06-03 10:58:23 -07:00
pipe.c	pipe: convert to ->poll_mask	2018-05-26 09:16:44 +02:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c	vfs: do bulk POLL* -> EPOLL* replacement	2018-02-11 14:34:03 -08:00
read_write.c	fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()	2018-04-15 23:36:26 -04:00
readdir.c	fs: add ksys_getdents64() helper; remove in-kernel calls to sys_getdents64()	2018-04-02 20:16:02 +02:00
select.c	fs: introduce new ->get_poll_head and ->poll_mask methods	2018-05-26 09:16:44 +02:00
seq_file.c	proc: fix smaps and meminfo alignment	2018-05-25 18:12:11 -07:00
signalfd.c	signal: Extend siginfo_layout with SIL_FAULT_{MCEERR|BNDERR|PKUERR}	2018-04-26 19:51:14 -05:00
splice.c	fs: add do_vmsplice() helper; remove in-kernel call to syscall	2018-04-02 20:15:40 +02:00
stack.c
stat.c	fs: add do_readlinkat() helper; remove internal call to sys_readlinkat()	2018-04-02 20:15:34 +02:00
statfs.c
super.c	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-06-04 10:14:28 -07:00
sync.c	Changes for this release:	2018-04-04 12:44:02 -07:00
timerfd.c	timerfd: convert to ->poll_mask	2018-05-26 09:16:44 +02:00
userfaultfd.c	vfs: do bulk POLL* -> EPOLL* replacement	2018-02-11 14:34:03 -08:00
utimes.c	fs: add do_compat_futimesat() helper; remove in-kernel call to compat syscall	2018-04-02 20:15:44 +02:00
xattr.c	vfs: delete unnecessary assignment in vfs_listxattr	2018-05-29 13:22:41 -04:00