linux/fs
Mathieu Desnoyers d7822b1e24 rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.

Here are benchmarks of various rseq use-cases.

Test hardware:

arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

The following benchmarks were all performed on a single thread.

* Per-CPU statistic counter increment

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                344.0                 31.4          11.0
x86-64:                15.3                  2.0           7.7

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
             per-cpu buffer

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:               2502.0                 2250.0         1.1
x86-64:               117.4                   98.0         1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                751.0                 128.5          5.8
x86-64:                53.4                  28.6          1.9

* jemalloc memory allocator adapted to use rseq

Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):

The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                                    8.4 ns
- Read CPU from rseq cpu_id:                               16.7 ns
- Read CPU from rseq cpu_id (lazy register):               19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
- getcpu system call:                                     234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                                    0.8 ns
- Read CPU from rseq cpu_id:                                0.8 ns
- Read CPU from rseq cpu_id (lazy register):                0.8 ns
- Read using gs segment selector:                           0.8 ns
- "lsl" inline assembly:                                   13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
- getcpu system call:                                      53.9 ns

- Speed (benchmark taken on v8 of patchset)

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.:      41.37 s
std.dev.:   0.36 s

* CONFIG_RSEQ=y

avg.:      40.46 s
std.dev.:   0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-06 11:58:31 +02:00
..
9p 9p: unify paths in v9fs_vfs_lookup() 2018-05-22 14:28:02 -04:00
adfs adfs_lookup: do not fail with ENOENT on negatives, use d_splice_alias() 2018-05-22 14:27:56 -04:00
affs affs: fix potential memory leak when parsing option 'prefix' 2018-05-28 12:36:41 +02:00
afs Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:00:01 -07:00
autofs4 autofs: mount point create should honour passed in mode 2018-04-20 17:18:35 -07:00
befs befs_lookup(): use d_splice_alias() 2018-05-21 14:30:07 -04:00
bfs bfs_add_entry: pass name/len as qstr pointer 2018-05-22 14:27:50 -04:00
btrfs for-4.18-tag 2018-06-04 14:29:13 -07:00
cachefiles Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:00:01 -07:00
ceph ceph: fix iov_iter issues in ceph_direct_read_write() 2018-05-10 10:15:12 +02:00
cifs some smb3 fixes for stable, as well as addition of ftrace hooks for cifs.ko, and improvements in compounding and smbdirect (RDMA) 2018-06-04 14:42:46 -07:00
coda vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
configfs
cramfs cramfs_lookup(): use d_splice_alias() 2018-05-22 14:27:51 -04:00
crypto fscrypt: fix build with pre-4.6 gcc versions 2018-02-01 10:51:18 -05:00
debugfs debugfs_lookup(): switch to lookup_one_len_unlocked() 2018-03-29 15:07:47 -04:00
devpts devpts: comment devpts_mntget() 2018-03-14 13:31:23 +01:00
dlm dlm: remove O_NONBLOCK flag in sctp_connect_to_sock 2018-05-29 10:48:35 -05:00
ecryptfs Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into aio-base 2018-05-26 09:16:25 +02:00
efivarfs efivarfs: Limit the rate for non-root to read files 2018-02-22 10:21:02 -08:00
efs
exofs scsi/osd: remove the gfp argument to osd_start_request 2018-05-14 08:55:09 -06:00
exportfs ovl: do not try to reconnect a disconnected origin dentry 2018-04-12 12:04:49 +02:00
ext2 Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into aio-base 2018-05-26 09:16:25 +02:00
ext4 Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:00:01 -07:00
f2fs Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:00:01 -07:00
fat vfat: simplify checks in vfat_lookup() 2018-05-13 12:09:14 -04:00
freevxfs freevxfs_lookup(): use d_splice_alias() 2018-05-22 14:27:51 -04:00
fscache proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
fuse fuse: define the filesystem as untrusted 2018-03-23 06:31:37 -04:00
gfs2 gfs2: Iomap cleanups and improvements 2018-06-04 07:56:51 -05:00
hfs hfs: don't allow mounting over .../rsrc 2018-05-22 14:28:00 -04:00
hfsplus Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 13:46:22 -07:00
hostfs hostfs: rename do_rmdir() to hostfs_do_rmdir() 2018-04-02 20:15:53 +02:00
hpfs
hugetlbfs hugetlbfs: fix bug in pgoff overflow checking 2018-04-05 21:36:21 -07:00
isofs isofs: fix potential memory leak in mount option parsing 2018-04-16 09:47:41 +02:00
jbd2 ext4: set h_journal if there is a failure starting a reserved handle 2018-04-18 11:49:31 -04:00
jffs2 do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
jfs Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:00:01 -07:00
kernfs kernfs: deal with kernfs_fill_super() failures 2018-05-21 14:30:08 -04:00
lockd net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00
minix minix_lookup: use d_splice_alias() 2018-05-22 14:27:52 -04:00
nfs proc: introduce proc_create_net{,_data} 2018-05-16 07:24:30 +02:00
nfs_common net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00
nfsd for-4.18/block-20180603 2018-06-04 07:58:06 -07:00
nilfs2 do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
nls
notify fsnotify: fix ignore mask logic in send_to_group() 2018-04-13 15:52:49 +02:00
ntfs ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call 2018-03-28 01:39:02 -04:00
ocfs2 ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio" 2018-05-25 18:12:10 -07:00
omfs omfs_lookup(): report IO errors, use d_splice_alias() 2018-05-22 14:27:58 -04:00
openpromfs openpromfs: switch to d_splice_alias() 2018-05-22 14:27:57 -04:00
orangefs orangefs_lookup: simplify 2018-05-22 14:27:58 -04:00
overlayfs ovl: add support for "xino" mount and config options 2018-04-12 12:04:50 +02:00
proc Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 13:46:22 -07:00
pstore pstore: fix crypto dependencies without compression 2018-04-06 15:45:33 -07:00
qnx4 qnx4_lookup: use d_splice_alias() 2018-05-22 14:27:52 -04:00
qnx6 qnx6_lookup: switch to d_splice_alias() 2018-05-22 14:27:54 -04:00
quota fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init 2018-04-09 17:48:54 +02:00
ramfs
reiserfs Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:00:01 -07:00
romfs romfs_lookup: switch to d_splice_alias() 2018-05-22 14:27:55 -04:00
squashfs
sysfs unfuck sysfs_mount() 2018-05-21 14:30:09 -04:00
sysv sysv_lookup: use d_splice_alias() 2018-05-22 14:27:53 -04:00
tracefs
ubifs ubifs_lookup: use d_splice_alias() 2018-05-22 14:27:54 -04:00
udf Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into aio-base 2018-05-26 09:16:25 +02:00
ufs do d_instantiate/unlock_new_inode combinations safely 2018-05-11 15:36:37 -04:00
xfs Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 13:46:22 -07:00
aio.c aio: sanitize the limit checking in io_submit(2) 2018-05-29 23:20:17 -04:00
anon_inodes.c
attr.c fs: Allow superblock owner to replace invalid owners of inodes 2018-05-24 11:57:18 -05:00
bad_inode.c
binfmt_aout.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_elf_fdpic.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_elf.c fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error 2018-04-20 17:18:36 -07:00
binfmt_em86.c
binfmt_flat.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_misc.c fs: add ksys_close() wrapper; remove in-kernel calls to sys_close() 2018-04-02 20:16:00 +02:00
binfmt_script.c
block_dev.c fs: convert block_dev.c to bioset_init() 2018-05-30 15:33:32 -06:00
buffer.c Merge branch 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-04-12 12:28:32 -07:00
char_dev.c block, char_dev: Use correct format specifier for unsigned ints 2018-03-15 17:59:24 +01:00
compat_binfmt_elf.c
compat_ioctl.c
compat.c
coredump.c
d_path.c split d_path() and friends into a separate file 2018-03-29 15:07:46 -04:00
dax.c Merge branch 'mm-rst' into docs-next 2018-04-16 14:25:08 -06:00
dcache.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:14:28 -07:00
dcookies.c fs: add do_lookup_dcookie() helper; remove in-kernel call to syscall 2018-04-02 20:15:39 +02:00
direct-io.c block: consistently use GFP_NOIO instead of __GFP_NORECLAIM 2018-05-14 08:55:18 -06:00
drop_caches.c
eventfd.c eventfd: switch to ->poll_mask 2018-05-26 09:16:44 +02:00
eventpoll.c fs: add new vfs_poll and file_can_poll helpers 2018-05-26 09:16:44 +02:00
exec.c rseq: Introduce restartable sequences system call 2018-06-06 11:58:31 +02:00
fcntl.c fasync: Fix deadlock between task-context and interrupt-context kill_fasync() 2018-05-01 07:39:50 -04:00
fhandle.c
file_table.c
file.c fs: add ksys_close() wrapper; remove in-kernel calls to sys_close() 2018-04-02 20:16:00 +02:00
filesystems.c proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
fs_pin.c
fs_struct.c
fs-writeback.c bdi: Fix oops in wb_workfn() 2018-05-03 16:11:37 -06:00
inode.c fs: clear writeback errors in inode_init_always 2018-05-30 19:43:53 -07:00
internal.h Revert "fs: fold open_check_o_direct into do_dentry_open" 2018-06-03 10:58:23 -07:00
ioctl.c fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems 2018-05-24 12:04:28 -05:00
iomap.c iomap: warn on zero-length mappings 2018-01-29 07:27:24 -08:00
Kconfig docs/admin-guide/mm: start moving here files from Documentation/vm 2018-04-27 17:02:48 -06:00
Kconfig.binfmt treewide: simplify Kconfig dependencies for removed archs 2018-03-26 15:55:57 +02:00
libfs.c fs, dax: prepare for dax-specific address_space_operations 2018-03-30 11:34:55 -07:00
locks.c proc: introduce proc_create_seq_private 2018-05-16 07:23:35 +02:00
Makefile split d_path() and friends into a separate file 2018-03-29 15:07:46 -04:00
mbcache.c
mount.h
mpage.c
namei.c Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2018-06-04 15:21:19 -07:00
namespace.c fs: Allow superblock owner to access do_remount_sb() 2018-05-24 12:02:25 -05:00
no-block.c
nsfs.c net: Export open_related_ns() 2018-02-15 15:34:42 -05:00
open.c Revert "fs: fold open_check_o_direct into do_dentry_open" 2018-06-03 10:58:23 -07:00
pipe.c pipe: convert to ->poll_mask 2018-05-26 09:16:44 +02:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
read_write.c fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range() 2018-04-15 23:36:26 -04:00
readdir.c fs: add ksys_getdents64() helper; remove in-kernel calls to sys_getdents64() 2018-04-02 20:16:02 +02:00
select.c fs: introduce new ->get_poll_head and ->poll_mask methods 2018-05-26 09:16:44 +02:00
seq_file.c proc: fix smaps and meminfo alignment 2018-05-25 18:12:11 -07:00
signalfd.c signal: Extend siginfo_layout with SIL_FAULT_{MCEERR|BNDERR|PKUERR} 2018-04-26 19:51:14 -05:00
splice.c fs: add do_vmsplice() helper; remove in-kernel call to syscall 2018-04-02 20:15:40 +02:00
stack.c
stat.c fs: add do_readlinkat() helper; remove internal call to sys_readlinkat() 2018-04-02 20:15:34 +02:00
statfs.c
super.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-04 10:14:28 -07:00
sync.c Changes for this release: 2018-04-04 12:44:02 -07:00
timerfd.c timerfd: convert to ->poll_mask 2018-05-26 09:16:44 +02:00
userfaultfd.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
utimes.c fs: add do_compat_futimesat() helper; remove in-kernel call to compat syscall 2018-04-02 20:15:44 +02:00
xattr.c vfs: delete unnecessary assignment in vfs_listxattr 2018-05-29 13:22:41 -04:00