linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-10 22:21:40 +00:00

Dave Chinner 64081362e8 mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock

We've recently seen a workload on XFS filesystems with a repeatable deadlock between background writeback and a multi-process application doing concurrent writes and fsyncs to a small range of a file.

range_cyclic
writeback               Process 1               Process 2

xfs_vm_writepages
  write_cache_pages
    writeback_index = 2
    cycled = 0
    ....
    find page 2 dirty
    lock Page 2
    ->writepage
      page 2 writeback
      page 2 clean
      page 2 added to bio
    no more pages
                        write()
                        locks page 1
                        dirties page 1
                        locks page 2
                        dirties page 1
                        fsync()
                        ....
                        xfs_vm_writepages
                        write_cache_pages
                          start index 0
                          find page 1 towrite
                          lock Page 1
                          ->writepage
                            page 1 writeback
                            page 1 clean
                            page 1 added to bio
                          find page 2 towrite
                          lock Page 2
                          page 2 is writeback
                          <blocks>
                                                write()
                                                locks page 1
                                                dirties page 1
                                                fsync()
                                                ....
                                                xfs_vm_writepages
                                                write_cache_pages
                                                  start index 0

    !done && !cycled
      sets index to 0, restarts lookup
    find page 1 dirty
                                                  find page 1 towrite
                                                  lock Page 1
                                                  page 1 is writeback
                                                  <blocks>

    lock Page 1
    <blocks>

DEADLOCK because:

  - process 1 needs page 2 writeback to complete to make enough progress to issue IO pending for page 1
  - writeback needs page 1 writeback to complete so process 2 can progress and unlock the page it is blocked on, then it can issue the IO pending for page 2
  - process 2 can't make progress until process 1 issues IO for page 1

The underlying cause of the problem here is that range_cyclic writeback is processing pages in descending index order as we hold higher index pages in a structure controlled from above write_cache_pages(). The write_cache_pages() caller needs to be able to submit these pages for IO before write_cache_pages restarts writeback at mapping index 0 to avoid wcp inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no private context held across write_cache_pages() - filesystems using this infrastructure always submit pages in ->writepage immediately and so there is no problem with range_cyclic going back to mapping index 0.

However:

  mpage_writepages() has a private bio context,
  exofs_writepages() has page_collect
  fuse_writepages() has fuse_fill_wb_data
  nfs_writepages() has nfs_pageio_descriptor
  xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under writeback in their private structures until write_cache_pages() returns, and hence they are all susceptible to this deadlock.

Also worth noting is that ext4 has its own bastardised version of write_cache_pages() and so it /may/ have an equivalent deadlock. I looked at the code long enough to understand that it has a similar retry loop for range_cyclic writeback reaching the end of the file and then promptly ran away before my eyes bled too much. I'll leave it for the ext4 developers to determine if their code actually has this deadlock and how to fix it if it has.

There are a few ways I can see to avoid this deadlock. There are probably more, but these are the first I've thought of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to indicate range cyclic has hit EOF. writepages implementations can then flush the current context and call wcp again to continue. i.e. lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it skips pages it can't lock without blocking. It will already do this for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid blocking as per pages under writeback.

I don't think #1 is an option - range_cyclic prevents frequently dirtied lower file offsets from starving background writeback of rarely touched higher file offsets.

#2 is simple, and I don't think it will have any impact on performance as going back to the start of the file implies an immediate seek. We'll have exactly the same number of seeks if we switch writeback to another inode, and then come back to this one later and restart from index 0.

#2a is pretty much "status quo without the deadlock". Moving the retry loop up into the wcp caller means we can issue IO on the pending pages before calling wcp again, and so avoid locking or waiting on pages in the wrong order. I'm not convinced we need to do this given that we get the same thing from #2 on the next writeback call from the writeback infrastructure.

#3 is really just a band-aid - it doesn't fix the access/wait inversion problem, just prevents it from becoming a deadlock situation. I'd prefer we fix the inversion, not sweep it under the carpet like this.

#3a is really an optimisation that just so happens to include the band-aid fix of #3.

So it seems that the simplest way to fix this issue is to implement solution #2.

Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2018-10-26 16:38:14 -07:00
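
The difference between the old wrap-around behaviour and solution #2 can be sketched with a small user-space model. This is not the kernel code: `model_mapping`, `scan_range()`, `scan_once()` and `NR_PAGES` are hypothetical names invented for illustration, page locking and IO are left out entirely, and the handling of `writeback_index` is simplified. The model only shows the order in which dirty page indices are visited within a single call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 4	/* hypothetical file size, in pages */

/* Hypothetical stand-in for the per-file state used by a cyclic scan. */
struct model_mapping {
	bool dirty[NR_PAGES];
	size_t writeback_index;	/* where the next cyclic scan starts */
};

/* Visit dirty pages in [from, to), appending their indices to out[]. */
static size_t scan_range(struct model_mapping *m, size_t from, size_t to,
			 size_t *out, size_t n)
{
	for (size_t i = from; i < to; i++) {
		if (m->dirty[i]) {
			m->dirty[i] = false;	/* model "page cleaned" */
			out[n++] = i;
		}
	}
	return n;
}

/*
 * One modelled scan call; out[] must have room for NR_PAGES entries.
 * wrap_in_call == true:  after reaching EOF, restart at index 0 inside
 *                        the same call (models the old behaviour).
 * wrap_in_call == false: stop at EOF and only reset writeback_index so
 *                        the next call starts at 0 (models solution #2).
 * Returns the number of page indices written to out[].
 */
static size_t scan_once(struct model_mapping *m, bool wrap_in_call,
			size_t *out)
{
	size_t start = m->writeback_index;
	size_t n;

	assert(start < NR_PAGES);
	n = scan_range(m, start, NR_PAGES, out, 0);
	if (wrap_in_call)
		n = scan_range(m, 0, start, out, n);	/* full cycle done */
	else
		m->writeback_index = 0;			/* wrap on next call */
	return n;
}
```

With pages 1 and 2 dirty and `writeback_index = 2`, the wrapping variant visits page 2 and then page 1 within one call, which is the descending order described in the commit message. The non-wrapping variant visits only page 2, and a second call then picks up page 1 starting from index 0, so each call sees pages in ascending order.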
kasan	kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN	2018-08-17 16:20:30 -07:00
backing-dev.c	blkcg: delay blkg destruction until after writeback has finished	2018-08-31 14:48:56 -06:00
balloon_compaction.c	virtio_balloon: fix deadlock on OOM	2017-11-14 23:57:38 +02:00
bootmem.c	docs/mm: bootmem: add overview documentation	2018-08-02 12:17:27 -06:00
cleancache.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
cma_debug.c	mm/cma: remove unsupported gfp_mask parameter from cma_alloc()	2018-08-17 16:20:32 -07:00
cma.c	mm/cma: remove unsupported gfp_mask parameter from cma_alloc()	2018-08-17 16:20:32 -07:00
cma.h	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
compaction.c	psi: pressure stall information for CPU, memory, and IO	2018-10-26 16:26:32 -07:00
debug_page_ref.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
debug.c	mm: provide kernel parameter to allow disabling page init poisoning	2018-10-26 16:26:34 -07:00
dmapool.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
early_ioremap.c	mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep	2017-12-11 14:54:44 +01:00
fadvise.c	vfs: implement readahead(2) using POSIX_FADV_WILLNEED	2018-08-30 20:01:32 +02:00
failslab.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
filemap.c	mm/filemap.c: use vmf_error()	2018-10-26 16:26:35 -07:00
frame_vector.c	mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()'	2017-12-14 16:00:48 -08:00
frontswap.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
gup_benchmark.c	mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl	2018-10-05 16:32:04 -07:00
gup.c	mm: remove unnecessary local variable addr in __get_user_pages_fast()	2018-10-26 16:26:34 -07:00
highmem.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
hmm.c	mm: defer ZONE_DEVICE page initialization to the point where we init pgmap	2018-10-26 16:26:34 -07:00
huge_memory.c	mm: workingset: tell cache transitions from workingset thrashing	2018-10-26 16:26:32 -07:00
hugetlb_cgroup.c	mm: rename page_counter's count/limit into usage/max	2018-06-07 17:34:35 -07:00
hugetlb.c	hugetlb: take PMD sharing into account when flushing tlb/caches	2018-10-05 16:32:04 -07:00
hwpoison-inject.c	mm/memory_failure: Remove unused trapno from memory_failure	2018-01-23 12:17:42 -06:00
init-mm.c	mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids	2018-07-17 09:35:30 +02:00
internal.h	mm: Change return type int to vm_fault_t for fault handlers	2018-08-23 18:48:44 -07:00
interval_tree.c	mm/interval_tree.c: use vma_pages() helper	2018-01-31 17:18:37 -08:00
Kconfig	mm: disable deferred struct page for 32-bit arches	2018-09-20 22:01:11 +02:00
Kconfig.debug	mm: clarify CONFIG_PAGE_POISONING and usage	2018-08-22 10:52:44 -07:00
khugepaged.c	mm: Change return type int to vm_fault_t for fault handlers	2018-08-23 18:48:44 -07:00
kmemleak-test.c
kmemleak.c	kmemleak: add module param to print warnings to dmesg	2018-10-26 16:25:19 -07:00
ksm.c	include/linux/compiler.h: make compiler-.h mutually exclusive	2018-08-22 17:31:34 -07:00
list_lru.c	mm/list_lru: introduce list_lru_shrink_walk_irq()	2018-08-17 16:20:32 -07:00
maccess.c	x86/fault: BUG() when uaccess helpers fault on kernel addresses	2018-09-03 15:12:09 +02:00
madvise.c	mm: madvise(MADV_DODUMP): allow hugetlbfs pages	2018-10-05 16:32:05 -07:00
Makefile	arm64 updates for 4.20:	2018-10-22 17:30:06 +01:00
memblock.c	mm: provide kernel parameter to allow disabling page init poisoning	2018-10-26 16:26:34 -07:00
memcontrol.c	mm/memcontrol.c: convert mem_cgroup_id::ref to refcount_t type	2018-10-26 16:26:35 -07:00
memfd.c	alloc_file(): switch to passing O_... flags instead of FMODE_... mode	2018-07-12 10:02:57 -04:00
memory_hotplug.c	mm/memory_hotplug.c: clean up node_states_check_changes_offline()	2018-10-26 16:26:33 -07:00
memory-failure.c	libnvdimm-for-4.19_dax-memory-failure	2018-08-25 18:43:59 -07:00
memory.c	mm/memory.c: recheck page table entry with page table lock held	2018-10-26 16:26:35 -07:00
mempolicy.c	mm/mempolicy.c: use match_string() helper to simplify the code	2018-10-26 16:26:33 -07:00
mempool.c	mm/mempool.c: add missing parameter description	2018-08-22 10:52:44 -07:00
memtest.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
migrate.c	mm: workingset: tell cache transitions from workingset thrashing	2018-10-26 16:26:32 -07:00
mincore.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
mlock.c	dax: remove VM_MIXEDMAP for fsdax and device dax	2018-08-17 16:20:27 -07:00
mm_init.c	mm: access zone->node via zone_to_nid() and zone_set_nid()	2018-08-22 10:52:45 -07:00
mmap.c	mm: brk: downgrade mmap_sem to read when shrinking	2018-10-26 16:26:35 -07:00
mmu_context.c
mmu_gather.c	mm/memory: Move mmu_gather and TLB invalidation code into its own file	2018-09-07 15:19:25 +01:00
mmu_notifier.c	Revert "mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks"	2018-10-26 16:25:19 -07:00
mmzone.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
mprotect.c	x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings	2018-06-20 19:10:01 +02:00
mremap.c	mm: mremap: downgrade mmap_sem to read when shrinking	2018-10-26 16:26:35 -07:00
msync.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
nobootmem.c	mm/memblock: add a name for memblock flags enumeration	2018-08-02 12:17:27 -06:00
nommu.c	mm: provide a fallback for PAGE_KERNEL_EXEC for architectures	2018-08-17 16:20:29 -07:00
oom_kill.c	Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2018-10-24 11:22:39 +01:00
page_alloc.c	mm: move mirrored memory specific code outside of memmap_init_zone	2018-10-26 16:38:10 -07:00
page_counter.c	memcg: introduce memory.min	2018-06-07 17:34:36 -07:00
page_ext.c	mm/page_ext.c: constify lookup_page_ext() argument	2018-08-17 16:20:28 -07:00
page_idle.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
page_io.c	blkcg: associate a blkg for pages being evicted by swap	2018-09-21 20:29:09 -06:00
page_isolation.c	mm, migrate: remove reason argument from new_page_t	2018-04-11 10:28:32 -07:00
page_owner.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
page_poison.c	mm/page_poison.c: make early_page_poison_param() __init	2018-04-05 21:36:26 -07:00
page_vma_mapped.c	mm, page_vma_mapped: Introduce pfn_in_hpage()	2018-01-22 12:15:57 -08:00
page-writeback.c	mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock	2018-10-26 16:38:14 -07:00
pagewalk.c	mm: kernel-doc: add missing parameter descriptions	2018-04-05 21:36:27 -07:00
percpu-internal.h	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
percpu-km.c	percpu: allow select gfp to be passed to underlying allocators	2018-02-18 05:33:01 -08:00
percpu-stats.c	treewide: Use array_size() in vmalloc()	2018-06-12 16:19:22 -07:00
percpu-vm.c	percpu: allow select gfp to be passed to underlying allocators	2018-02-18 05:33:01 -08:00
percpu.c	percpu: stop leaking bitmap metadata blocks	2018-10-07 14:50:12 -07:00
pgtable-generic.c	x86/mm: Page size aware flush_tlb_mm_range()	2018-10-09 16:51:11 +02:00
process_vm_access.c	mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors	2018-02-06 18:32:48 -08:00
quicklist.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
readahead.c	vfs: implement readahead(2) using POSIX_FADV_WILLNEED	2018-08-30 20:01:32 +02:00
rmap.c	mm: migration: fix migration of huge PMD shared pages	2018-10-05 16:32:04 -07:00
rodata_test.c
shmem.c	mm: shmem.c: Correctly annotate new inodes for lockdep	2018-09-20 22:01:11 +02:00
slab_common.c	mm, slab: shorten kmalloc cache names for large sizes	2018-10-26 16:26:32 -07:00
slab.c	mm, slab: combine kmalloc_caches and kmalloc_dma_caches	2018-10-26 16:26:31 -07:00
slab.h	mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB	2018-08-17 16:20:30 -07:00
slob.c	slab: __GFP_ZERO is incompatible with a constructor	2018-06-07 17:34:34 -07:00
slub.c	mm, slab: combine kmalloc_caches and kmalloc_dma_caches	2018-10-26 16:26:31 -07:00
sparse-vmemmap.c	mm/sparse: delete old sparse_init and enable new one	2018-08-17 16:20:32 -07:00
sparse.c	mm: provide kernel parameter to allow disabling page init poisoning	2018-10-26 16:26:34 -07:00
swap_cgroup.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
swap_slots.c	mm, swap, get_swap_pages: use entry_size instead of cluster in parameter	2018-08-22 10:52:44 -07:00
swap_state.c	mm: workingset: tell cache transitions from workingset thrashing	2018-10-26 16:26:32 -07:00
swap.c	mm/swap.c: remove duplicated include	2018-10-26 16:26:33 -07:00
swapfile.c	mm/swapfile.c: clear si->swap_map[] in swap_free_cluster()	2018-10-26 16:25:19 -07:00
truncate.c	page cache: use xa_lock	2018-04-11 10:28:39 -07:00
usercopy.c	usercopy: Allow boot cmdline disabling of hardening	2018-07-04 08:04:52 -07:00
userfaultfd.c	userfaultfd: prevent non-cooperative events vs mcopy_atomic races	2018-06-07 17:34:38 -07:00
util.c	kvfree(): fix misleading comment	2018-10-26 16:26:33 -07:00
vmacache.c	mm: get rid of vmacache_flush_all() entirely	2018-09-13 15:18:04 -10:00
vmalloc.c	vfree: add debug might_sleep()	2018-10-26 16:26:33 -07:00
vmpressure.c	mm/vmpressure.c: convert to use match_string() helper	2018-06-07 17:34:36 -07:00
vmscan.c	mm: zero-seek shrinkers	2018-10-26 16:26:33 -07:00
vmstat.c	mm/vmstat.c: assert that vmstat_text is in sync with stat_items_size	2018-10-26 16:26:35 -07:00
workingset.c	mm: zero-seek shrinkers	2018-10-26 16:26:33 -07:00
z3fold.c	z3fold: fix reclaim lock-ups	2018-05-11 17:28:45 -07:00
zbud.c	mm: docs: fix parameter names mismatch	2018-02-06 18:32:48 -08:00
zpool.c	mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc	2018-02-21 15:35:43 -08:00
zsmalloc.c	mm/zsmalloc.c: fix fall-through annotation	2018-10-26 16:26:35 -07:00
zswap.c	zswap: re-check zswap_is_full() after do zswap_shrink()	2018-07-26 19:38:03 -07:00