linux/include/trace/events
Ross Zwisler 91d25ba8a6 dax: use common 4k zero page for dax mmap reads
When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree.

This has three major drawbacks:

1) It consumes memory unnecessarily. For every 4k page that is read via
   a DAX mmap() over a hole, we allocate a new page cache page. This
   means that if you read 1GiB worth of pages, you end up using 1GiB of
   zeroed memory. This is easily visible by looking at the overall
   memory consumption of the system or by looking at /proc/[pid]/smaps:

	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
	Size:            1048576 kB
	Rss:             1048576 kB
	Pss:             1048576 kB
	Shared_Clean:          0 kB
	Shared_Dirty:          0 kB
	Private_Clean:   1048576 kB
	Private_Dirty:         0 kB
	Referenced:      1048576 kB
	Anonymous:             0 kB
	LazyFree:              0 kB
	AnonHugePages:         0 kB
	ShmemPmdMapped:        0 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	KernelPageSize:        4 kB
	MMUPageSize:           4 kB
	Locked:                0 kB

2) It is slower than using a common zero page because each page fault
   has more work to do. Instead of just inserting a common zero page we
   have to allocate a page cache page, zero it, and then insert it. Here
   are the average latencies of dax_load_hole() as measured by ftrace on
   a random test box:

    Old method, using zeroed page cache pages:	3.4 us
    New method, using the common 4k zero page:	0.8 us

   This was the average latency over 1 GiB of sequential reads done by
   this simple fio script:

     [global]
     size=1G
     filename=/root/dax/data
     fallocate=none
     [io]
     rw=read
     ioengine=mmap

3) The fact that we had to check for both DAX exceptional entries and
   for page cache pages in the radix tree made the DAX code more
   complex.

Solve these issues by following the lead of the DAX PMD code and using a
common 4k zero page instead.  As with the PMD code we will now insert a
DAX exceptional entry into the radix tree instead of a struct page
pointer which allows us to remove all the special casing in the DAX
code.

Note that we do still pretty aggressively check for regular pages in the
DAX radix tree, especially where we take action based on the bits set in
the page.  If we ever find a regular page in our radix tree now that
most likely means that someone besides DAX is inserting pages (which has
happened lots of times in the past), and we want to find that out early
and fail loudly.

This solution also removes the extra memory consumption.  Here is that
same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
code:

	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
	Size:            1048576 kB
	Rss:                   0 kB
	Pss:                   0 kB
	Shared_Clean:          0 kB
	Shared_Dirty:          0 kB
	Private_Clean:         0 kB
	Private_Dirty:         0 kB
	Referenced:            0 kB
	Anonymous:             0 kB
	LazyFree:              0 kB
	AnonHugePages:         0 kB
	ShmemPmdMapped:        0 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	KernelPageSize:        4 kB
	MMUPageSize:           4 kB
	Locked:                0 kB

Overall system memory consumption is similarly improved.

Another major change is that we remove dax_pfn_mkwrite() from our fault
flow, and instead rely on the page fault itself to make the PTE dirty
and writeable.  The following description from the patch adding the
vm_insert_mixed_mkwrite() call explains this a little more:

   "To be able to use the common 4k zero page in DAX we need to have our
    PTE fault path look more like our PMD fault path where a PTE entry
    can be marked as dirty and writeable as it is first inserted rather
    than waiting for a follow-up dax_pfn_mkwrite() =>
    finish_mkwrite_fault() call.

    Right now we can rely on having a dax_pfn_mkwrite() call because we
    can distinguish between these two cases in do_wp_page():

            case 1: 4k zero page => writable DAX storage
            case 2: read-only DAX storage => writeable DAX storage

    This distinction is made by via vm_normal_page(). vm_normal_page()
    returns false for the common 4k zero page, though, just as it does
    for DAX ptes. Instead of special casing the DAX + 4k zero page case
    we will simplify our DAX PTE page fault sequence so that it matches
    our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
    We will instead use dax_iomap_fault() to handle write-protection
    faults.

    This means that insert_pfn() needs to follow the lead of
    insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
    'mkwrite' is set insert_pfn() will do the work that was previously
    done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"

Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
..
9p.h net/9p/tracing: Export enums in tracepoints to userspace 2015-04-08 09:39:59 -04:00
afs.h afs: Refcount the afs_call struct 2017-01-09 11:10:02 +00:00
alarmtimer.h ktime: Get rid of the union 2016-12-25 17:21:22 +01:00
asoc.h ASoC: trace: fix printing jack name 2016-02-26 10:52:48 +09:00
bcache.h block: better op and flags encoding 2016-10-28 08:48:16 -06:00
block.h block: remove the errors field from struct request 2017-04-20 12:16:10 -06:00
bpf.h bpf: map_get_next_key to return first key on NULL 2017-04-25 11:57:45 -04:00
btrfs.h btrfs: cleanup unused qgroup trace event 2017-06-19 18:25:57 +02:00
cgroup.h kernfs: handle null pointers while printing node name and path 2017-02-10 16:02:26 +01:00
clk.h clk: Add tracepoints for hardware operations 2015-03-12 12:18:51 -07:00
cma.h mm: cma: add trace events for CMA allocations and freeings 2015-04-15 16:35:19 -07:00
compaction.h mm, trace: extract COMPACTION_STATUS and ZONE_TYPE to a common header 2017-02-22 16:41:27 -08:00
context_tracking.h
cpuhp.h cpu/hotplug: Add multi instance support 2016-09-02 20:05:05 +02:00
devlink.h devlink: fix trace format string 2016-07-14 22:16:05 -07:00
dma_fence.h dma-buf: Rename struct fence to dma_fence 2016-10-25 14:40:39 +02:00
ext4.h ext4: remove unused metadata accounting variables 2017-07-30 22:30:11 -04:00
f2fs.h f2fs: split bio cache 2017-05-23 21:05:39 -07:00
fib6.h ipv6, trace: fix tos reporting on fib6_table_lookup 2016-03-20 13:44:34 -04:00
fib.h net: Make table id type u32 2015-09-01 14:32:44 -07:00
filelock.h locks: sprinkle some tracepoints around the file locking code 2016-01-08 11:38:13 -05:00
filemap.h fs: new infrastructure for writeback error handling and reporting 2017-07-06 07:02:25 -04:00
fs_dax.h dax: use common 4k zero page for dax mmap reads 2017-09-06 17:27:24 -07:00
fsi_master_gpio.h drivers/fsi/gpio: Add tracepoints for GPIO master 2017-06-09 11:52:09 +02:00
fsi.h drivers/fsi: Add tracepoints for low-level operations 2017-06-09 11:52:08 +02:00
gpio.h tracing: gpio: Add Kconfig option for enabling/disabling trace events 2015-10-20 21:56:10 -04:00
host1x.h gpu: host1x: Use struct host1x_bo pointers in traces 2014-11-13 16:11:32 +01:00
hswadsp.h
huge_memory.h mm, thp: convert from optimistic swapin collapsing to conservative 2016-07-26 16:19:19 -07:00
i2c.h i2c: break out smbus support into separate file 2017-05-31 21:01:03 +02:00
intel_ish.h HID: intel-ish-hid: ipc layer 2016-08-17 11:13:07 +02:00
intel-sst.h tracing: Add TRACE_SYSTEM_VAR to intel-sst 2015-04-07 12:31:12 -04:00
iommu.h iommu: Remove pci.h include from trace/events/iommu.h 2017-04-29 00:20:49 +02:00
ipi.h
irq.h irq: Fix typo in tracepoint.xml 2016-09-29 10:03:38 +02:00
jbd2.h
kmem.h Nothing major this round. Mostly small clean ups and fixes. 2016-03-24 10:52:25 -07:00
kvm.h KVM: x86: add KVM_CAP_X2APIC_API 2016-07-14 09:03:57 +02:00
libata.h ata: Handle ATA NCQ NO-DATA commands correctly 2016-07-15 08:08:13 -04:00
lock.h
mce.h x86/mce/AMD: Save MCA_IPID in MCE struct on SMCA systems 2016-09-13 15:23:12 +02:00
mdio.h net/phy: add trace events for mdio accesses 2016-11-24 11:55:43 -05:00
migrate.h mm: tracing: Export enums in tracepoints to user space 2015-04-08 09:40:01 -04:00
mmc.h mmc: core: Provide tracepoints for request processing 2016-05-02 10:33:11 +02:00
mmflags.h mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic 2017-07-12 16:26:03 -07:00
module.h tracing: %pF is only for function pointers 2015-03-25 08:57:22 -04:00
napi.h net: fixup for tracepoint napi:napi_poll 2016-07-15 15:55:01 -07:00
net.h net: rename vlan_tx_* helpers since "tx" is misleading there 2015-01-13 17:51:08 -05:00
nilfs2.h nilfs2: add tracepoints for analyzing reading and writing metadata files 2015-11-06 17:50:42 -08:00
nmi.h
oom.h mm/oom_kill.c: add tracepoints for oom reaper-related events 2017-07-10 16:32:32 -07:00
page_isolation.h mm/page_isolation: fix tracepoint to mirror check function behavior 2016-04-01 17:03:37 -05:00
page_ref.h mm/page_ref: add tracepoint to track down page reference manipulation 2016-03-17 15:09:34 -07:00
pagemap.h
percpu.h percpu: add tracepoint support for percpu memory 2017-06-20 15:31:43 -04:00
power_cpu_migrate.h
power.h cpufreq: intel_pstate: Add io_boost trace 2016-09-16 23:55:30 +02:00
printk.h printk, tracing: Avoiding unneeded blank lines 2016-07-15 15:52:41 -04:00
random.h tracing: %pF is only for function pointers 2015-03-25 08:57:22 -04:00
rcu.h rcutorture: Place event-traced strings into trace buffer 2017-07-24 16:04:12 -07:00
regulator.h
rpm.h
rxrpc.h rxrpc: Add service upgrade support for client connections 2017-06-05 14:30:49 +01:00
sched.h sched,tracing: Update trace_sched_pi_setprio() 2017-04-04 11:44:06 +02:00
scsi.h scsi-trace: define ZBC_IN and ZBC_OUT 2016-04-11 16:57:09 -04:00
signal.h
skb.h
smbus.h i2c: break out smbus support into separate file 2017-05-31 21:01:03 +02:00
sock.h
spi.h spi: Generalize SPI "master" to "controller" 2017-06-13 18:51:11 +01:00
spmi.h spmi: add command tracepoints for SPMI 2015-08-05 12:27:09 -07:00
sunrpc.h SUNRPC: Add tracepoints for dropped and deferred requests 2016-07-13 15:53:43 -04:00
sunvnet.h sunvnet: Add support for perf LDC event tracing 2016-02-07 14:13:05 -05:00
swiotlb.h swiotlb: Add swiotlb=noforce debug option 2016-12-19 09:05:20 -05:00
syscalls.h tracing: Add #undef to fix compile error 2017-03-03 09:45:01 -05:00
target.h target: Minimize SCSI header #include directives 2015-06-02 08:03:25 -07:00
task.h tracing: Don't make assumptions about length of string on task rename 2015-08-31 10:47:14 -04:00
thermal_power_allocator.h thermal: consistently use int for temperatures 2015-08-03 23:15:50 +08:00
thermal.h trace: thermal: add another parameter 'power' to the tracing function 2017-05-05 15:54:45 +08:00
thp.h powerpc/thp: Add tracepoints to track hugepage invalidate 2014-08-13 18:20:42 +10:00
timer.h This release has no new tracing features, just clean ups, minor fixes 2017-02-27 13:26:17 -08:00
tlb.h tracing: Remove duplicate checks for online CPUs 2016-03-08 11:19:28 -05:00
udp.h
ufs.h scsi: ufs: add trace event for ufs commands 2017-01-05 18:10:04 -05:00
v4l2.h [media] v4l: Add metadata buffer type and format 2017-04-14 22:37:02 -03:00
vb2.h [media] media: videobuf2: Move timestamp to vb2_buffer 2015-12-18 13:53:31 -02:00
vmscan.h mm, vmscan: add mm_vmscan_inactive_list_is_low tracepoint 2017-02-22 16:41:29 -08:00
vsock_virtio_transport_common.h VSOCK: Introduce virtio_vsock_common.ko 2016-08-02 02:57:29 +03:00
wbt.h blk-wbt: add general throttling mechanism 2016-11-10 13:53:32 -07:00
workqueue.h
writeback.h mm: vmscan: kick flushers when we encounter dirty pages on the LRU 2017-02-24 17:46:54 -08:00
xdp.h bpf: add initial bpf tracepoints 2017-01-25 13:17:47 -05:00
xen.h tracing: Add TRACE_DEFINE_SIZEOF() macros 2017-06-13 17:11:08 -04:00