linux/lib
Toshiyuki Okajima ac15ee691f radix_tree: radix_tree_gang_lookup_tag_slot() may never return
Executed command: fsstress -d /mnt -n 600 -p 850

  crash> bt
  PID: 7947   TASK: ffff880160546a70  CPU: 0   COMMAND: "fsstress"
   #0 [ffff8800dfc07d00] machine_kexec at ffffffff81030db9
   #1 [ffff8800dfc07d70] crash_kexec at ffffffff810a7952
   #2 [ffff8800dfc07e40] oops_end at ffffffff814aa7c8
   #3 [ffff8800dfc07e70] die_nmi at ffffffff814aa969
   #4 [ffff8800dfc07ea0] do_nmi_callback at ffffffff8102b07b
   #5 [ffff8800dfc07f10] do_nmi at ffffffff814aa514
   #6 [ffff8800dfc07f50] nmi at ffffffff814a9d60
      [exception RIP: __lookup_tag+100]
      RIP: ffffffff812274b4  RSP: ffff88016056b998  RFLAGS: 00000287
      RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000006
      RDX: 000000000000001d  RSI: ffff88016056bb18  RDI: ffff8800c85366e0
      RBP: ffff88016056b9c8   R8: ffff88016056b9e8   R9: 0000000000000000
      R10: 000000000000000e  R11: ffff8800c8536908  R12: 0000000000000010
      R13: 0000000000000040  R14: ffffffffffffffc0  R15: ffff8800c85366e0
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  <NMI exception stack>
   #7 [ffff88016056b998] __lookup_tag at ffffffff812274b4
   #8 [ffff88016056b9d0] radix_tree_gang_lookup_tag_slot at ffffffff81227605
   #9 [ffff88016056ba20] find_get_pages_tag at ffffffff810fc110
  #10 [ffff88016056ba80] pagevec_lookup_tag at ffffffff81105e85
  #11 [ffff88016056baa0] write_cache_pages at ffffffff81104c47
  #12 [ffff88016056bbd0] generic_writepages at ffffffff81105014
  #13 [ffff88016056bbe0] do_writepages at ffffffff81105055
  #14 [ffff88016056bbf0] __filemap_fdatawrite_range at ffffffff810fb2cb
  #15 [ffff88016056bc40] filemap_write_and_wait_range at ffffffff810fb32a
  #16 [ffff88016056bc70] generic_file_direct_write at ffffffff810fb3dc
  #17 [ffff88016056bce0] __generic_file_aio_write at ffffffff810fcee5
  #18 [ffff88016056bda0] generic_file_aio_write at ffffffff810fd085
  #19 [ffff88016056bdf0] do_sync_write at ffffffff8114f9ea
  #20 [ffff88016056bf00] vfs_write at ffffffff8114fcf8
  #21 [ffff88016056bf30] sys_write at ffffffff81150691
  #22 [ffff88016056bf80] system_call_fastpath at ffffffff8100c0b2

I think this root cause is the following:

 radix_tree_range_tag_if_tagged() always tags the root tag with settag
 if the root tag is set with iftag even if there are no iftag tags
 in the specified range (Of course, there are some iftag tags
 outside the specified range).

===============================================================================
[[[Detailed description]]]

(1) Why cannot radix_tree_gang_lookup_tag_slot() return forever?

__lookup_tag():
 - Return with 0.
 - Return with the index which is not bigger than the old one as the
   input parameter.

Therefore the following "while" repeats forever because the above
conditions cause "ret" not to be updated and the cur_index cannot be
changed into the bigger one.

(So, radix_tree_gang_lookup_tag_slot() cannot return forever.)

radix_tree_gang_lookup_tag_slot():
1178         while (ret < max_items) {
1179                 unsigned int slots_found;
1180                 unsigned long next_index;       /* Index of next search */
1181
1182                 if (cur_index > max_index)
1183                         break;
1184                 slots_found = __lookup_tag(node, results + ret,
1185                                 cur_index, max_items - ret, &next_index,
tag);
1186                 ret += slots_found;
			// cannot update ret because slots_found == 0.
			// so, this while loops forever.
1187                 if (next_index == 0)
1188                         break;
1189                 cur_index = next_index;
1190         }

(2) Why does __lookup_tag() return with 0 and doesn't update the index?

Assuming the following:
  - the one of the slot in radix_tree_node is NULL.
  - the one of the tag which corresponds to the slot sets with
    PAGECACHE_TAG_TOWRITE or other.
  - In a certain height(!=0), the corresponding index is 0.

a) __lookup_tag() notices that the tag is set.

1005 static unsigned int
1006 __lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
1007         unsigned int max_items, unsigned long *next_index, unsigned int tag)
1008 {
1009         unsigned int nr_found = 0;
1010         unsigned int shift, height;
1011
1012         height = slot->height;
1013         if (height == 0)
1014                 goto out;
1015         shift = (height-1) * RADIX_TREE_MAP_SHIFT;
1016
1017         while (height > 0) {
1018                 unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ;
1019
1020                 for (;;) {
1021                         if (tag_get(slot, tag, i))
1022                                 break;
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* the index is not updated yet.

b) __lookup_tag() notices that the slot is NULL.

1023                         index &= ~((1UL << shift) - 1);
1024                         index += 1UL << shift;
1025                         if (index == 0)
1026                                 goto out;       /* 32-bit wraparound */
1027                         i++;
1028                         if (i == RADIX_TREE_MAP_SIZE)
1029                                 goto out;
1030                 }
1031                 height--;
1032                 if (height == 0) {      /* Bottom level: grab some items */
...
1055                 }
1056                 shift -= RADIX_TREE_MAP_SHIFT;
1057                 slot = rcu_dereference_raw(slot->slots[i]);
1058                 if (slot == NULL)
1059                         break;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

c) __lookup_tag() doesn't update the index and return with 0.

1060         }
1061 out:
1062         *next_index = index;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1063         return nr_found;
1064 }

(3) Why is the slot NULL even if the tag is set?

Because radix_tree_range_tag_if_tagged() always sets the root tag with
PAGECACHE_TAG_TOWRITE if the root tag is set with PAGECACHE_TAG_DIRTY,
even if there is no tag which can be set with PAGECACHE_TAG_TOWRITE
in the specified range (from *first_indexp to last_index). Of course,
some PAGECACHE_TAG_DIRTY nodes must exist outside the specified range.
(radix_tree_range_tag_if_tagged() is called only from tag_pages_for_writeback())

 640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
*root,
 641                 unsigned long *first_indexp, unsigned long last_index,
 642                 unsigned long nr_to_tag,
 643                 unsigned int iftag, unsigned int settag)
 644 {
 645         unsigned int height = root->height;
 646         struct radix_tree_path path[height];
 647         struct radix_tree_path *pathp = path;
 648         struct radix_tree_node *slot;
 649         unsigned int shift;
 650         unsigned long tagged = 0;
 651         unsigned long index = *first_indexp;
 652
 653         last_index = min(last_index, radix_tree_maxindex(height));
 654         if (index > last_index)
 655                 return 0;
 656         if (!nr_to_tag)
 657                 return 0;
 658         if (!root_tag_get(root, iftag)) {
 659                 *first_indexp = last_index + 1;
 660                 return 0;
 661         }
 662         if (height == 0) {
 663                 *first_indexp = last_index + 1;
 664                 root_tag_set(root, settag);
 665                 return 1;
 666         }
...
 733         root_tag_set(root, settag);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 734         *first_indexp = index;
 735
 736         return tagged;
 737 }

As the result, there is no radix_tree_node which is set with
PAGECACHE_TAG_TOWRITE but the root tag(radix_tree_root) is set with
PAGECACHE_TAG_TOWRITE.

[figure: inside radix_tree]
(Please see the figure with typewriter font)
===========================================
          [roottag = DIRTY]
                 |             tag=0:NOTHING
         tag[0 0 0 1]              1:DIRTY
            [x x x +]              2:WRITEBACK
                   |               3:DIRTY,WRITEBACK
                   p               4:TOWRITE
             <--->                 5:DIRTY,TOWRITE ...
     specified range (index: 0 to 2)

* There is no DIRTY tag within the specified range.
 (But there is a DIRTY tag outside that range.)

            | | | | | | | | |
    after calling tag_pages_for_writeback()
            | | | | | | | | |
            v v v v v v v v v

          [roottag = DIRTY,TOWRITE]
                 |                 p is "page".
         tag[0 0 0 1]              x is NULL.
            [x x x +]              +- is a pointer to "page".
                   |
                   p

* But TOWRITE tag is set on the root tag.
============================================

After that, radix_tree_extend() via radix_tree_insert() is called
when the page is added.
This function sets the new radix_tree_node with PAGECACHE_TAG_TOWRITE
to succeed the status of the root tag.

 246 static int radix_tree_extend(struct radix_tree_root *root, unsigned long
index)
 247 {
 248         struct radix_tree_node *node;
 249         unsigned int height;
 250         int tag;
 251
 252         /* Figure out what the height should be.  */
 253         height = root->height + 1;
 254         while (index > radix_tree_maxindex(height))
 255                 height++;
 256
 257         if (root->rnode == NULL) {
 258                 root->height = height;
 259                 goto out;
 260         }
 261
 262         do {
 263                 unsigned int newheight;
 264                 if (!(node = radix_tree_node_alloc(root)))
 265                         return -ENOMEM;
 266
 267                 /* Increase the height.  */
 268                 node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
 269
 270                 /* Propagate the aggregated tag info into the new root */
 271                 for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
 272                         if (root_tag_get(root, tag))
 273                                 tag_set(node, tag, 0);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 274                 }

===========================================
          [roottag = DIRTY,TOWRITE]
                 |     :
         tag[0 0 0 1] [0 0 0 0]
            [x x x +] [+ x x x]
                   |   |
                   p   p (new page)

            | | | | | | | | |
    after calling radix_tree_insert
            | | | | | | | | |
            v v v v v v v v v

          [roottag = DIRTY,TOWRITE]
                 |
         tag [5 0 0 0]    *  DIRTY and TOWRITE tags are
             [+ + x x]       succeeded to the new node.
              | |
  tag [0 0 0 1] [0 0 0 0]
      [x x x +] [+ x x x]
             |   |
             p   p
============================================

After that, the index 3 page is released by remove_from_page_cache().
Then we can make the situation that the tag is set with PAGECACHE_TAG_TOWRITE
and that the slot which corresponds to the tag is NULL.
===========================================
          [roottag = DIRTY,TOWRITE]
                 |
         tag [5 0 0 0]
             [+ + x x]
              | |
  tag [0 0 0 1] [0 0 0 0]
      [x x x +] [+ x x x]
             |   |
             p   p
         (remove)

            | | | | | | | | |
    after calling remove_page_cache
            | | | | | | | | |
            v v v v v v v v v

          [roottag = DIRTY,TOWRITE]
                 |
         tag [4 0 0 0]      * Only DIRTY tag is cleared
             [x + x x]        because no TOWRITE tag is existed
                |             in the bottom node.
                [0 0 0 0]
                [+ x x x]
                 |
                 p
============================================

To solve this problem

Change to that radix_tree_tag_if_tagged() doesn't tag the root tag
if it doesn't set any tags within the specified range.

Like this.
============================================
 640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
*root,
 641                 unsigned long *first_indexp, unsigned long last_index,
 642                 unsigned long nr_to_tag,
 643                 unsigned int iftag, unsigned int settag)
 644 {
 650         unsigned long tagged = 0;
...
 733 	     if (tagged)
^^^^^^^^^^^^^^^^^^^^^^^^
 734            root_tag_set(root, settag);
 735         *first_indexp = index;
 736
 737         return tagged;
 738 }

============================================

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:04 +10:00
..
lzo lib: add support for LZO-compressed kernels 2010-01-11 09:34:04 -08:00
raid6 Move .gitignore from drivers/md to lib/raid6 2010-08-30 17:35:52 +10:00
reed_solomon
xz kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT 2011-01-20 17:02:05 -08:00
zlib_deflate trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
zlib_inflate inflate_fast: sout is already a short so ptr arith was off by one. 2010-03-12 15:52:44 -08:00
.gitignore
argv_split.c tree-wide: convert open calls to remove spaces to skip_spaces() lib function 2009-12-15 08:53:32 -08:00
atomic64_test.c ARM: 6213/1: atomic64_test: add ARM as supported architecture 2010-07-27 10:43:46 +01:00
atomic64.c lib: Fix atomic64_add_unless return value convention 2010-03-01 11:38:46 -08:00
audit.c
average.c lib: Improve EWMA efficiency by using bitshifts 2010-12-06 15:58:43 -05:00
bcd.c
bitmap.c lib/bitmap.c: use hex_to_bin() 2010-10-26 16:52:18 -07:00
bitrev.c
btree.c lib/btree: fix possible NULL pointer dereference 2010-05-15 12:48:10 -07:00
bug.c modules: Fix module_bug_list list corruption race 2010-10-05 11:29:27 -07:00
bust_spinlocks.c oops handling: ensure that any oops is flushed to the mtdoops console 2009-01-06 15:59:11 -08:00
check_signature.c
checksum.c lib/checksum: fix one more thinko 2009-11-03 16:06:53 +01:00
cmdline.c
cpu-notifier-error-inject.c fault-injection: add CPU notifier error injection module 2010-05-27 09:12:48 -07:00
cpumask.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
crc7.c
crc16.c
crc32.c revert "crc32: use __BYTE_ORDER macro for endian detection" 2010-05-26 08:19:23 -07:00
crc32defs.h
crc-ccitt.c
crc-itu-t.c
crc-t10dif.c
ctype.c ctype: constify read-only _ctype string 2009-12-15 08:53:32 -08:00
debug_locks.c Revert "debug_locks: set oops_in_progress if we will log messages." 2010-11-29 15:18:28 -08:00
debugobjects.c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-05-18 08:17:58 -07:00
dec_and_lock.c atomic: only take lock when the counter drops to zero on UP as well 2009-06-16 19:47:47 -07:00
decompress_bunzip2.c Decompressors: include <linux/slab.h> in <linux/decompress/mm.h> 2011-01-13 08:03:23 -08:00
decompress_inflate.c decompressors: check input size in decompress_inflate.c 2011-01-13 08:03:25 -08:00
decompress_unlzma.c Decompressors: validate match distance in decompress_unlzma.c 2011-01-13 08:03:24 -08:00
decompress_unlzo.c Decompressors: fix callback-to-callback mode in decompress_unlzo.c 2011-01-13 08:03:24 -08:00
decompress_unxz.c decompressors: add boot-time XZ support 2011-01-13 08:03:25 -08:00
decompress.c decompressors: add boot-time XZ support 2011-01-13 08:03:25 -08:00
devres.c lib/devres.c: fix comment typo 2010-07-11 22:16:32 +02:00
div64.c div64_u64(): improve precision on 32bit platforms 2010-10-26 16:52:19 -07:00
dma-debug.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
dump_stack.c
dynamic_debug.c dynamic debug: Fix build issue with older gcc 2011-01-07 23:36:59 -05:00
extable.c module: trim exception table on init free. 2009-06-12 21:47:04 +09:30
fault-inject.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
find_last_bit.c bitmap: find_last_bit() 2009-01-01 10:12:19 +10:30
find_next_bit.c
flex_array.c flex_array: export symbols to modules 2011-01-13 08:03:11 -08:00
gcd.c lib: add lib/gcd.c 2009-06-18 13:04:05 -07:00
gen_crc32table.c crc32: major optimization 2010-05-25 08:07:06 -07:00
genalloc.c genalloc: fix allocation from end of pool 2010-06-29 15:29:30 -07:00
halfmd4.c
hexdump.c include/linux/printk.h lib/hexdump.c: neatening and add CONFIG_PRINTK guard 2011-01-13 08:03:10 -08:00
hweight.c x86: Add optimized popcnt variants 2010-04-06 15:52:11 -07:00
idr.c docbook: add idr/ida to kernel-api docbook 2010-10-26 17:40:56 -07:00
inflate.c MN10300: Don't try and #include <linux/slab.h> in lib/inflate.c from bootloader 2010-08-12 09:51:35 -07:00
int_sqrt.c
iomap_copy.c
iomap.c
iommu-helper.c iommu: inline iommu_num_pages 2010-08-09 20:45:05 -07:00
ioremap.c ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type support 2011-01-12 03:06:19 -05:00
irq_regs.c
is_single_threaded.c kernel: is_current_single_threaded: don't use ->mmap_sem 2009-07-17 09:11:31 +10:00
kasprintf.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
Kconfig decompressors: add boot-time XZ support 2011-01-13 08:03:25 -08:00
Kconfig.debug kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT 2011-01-20 17:02:05 -08:00
Kconfig.kgdb mips,kgdb: kdb low level trap catch and stack trace 2010-05-20 21:04:26 -05:00
Kconfig.kmemcheck kmemcheck: depend on HAVE_ARCH_KMEMCHECK 2009-07-01 22:28:44 +02:00
kernel_lock.c bkl: Fixup core_lock fallout 2009-12-14 23:55:33 +01:00
klist.c driver core: Remove completion from struct klist_node 2009-01-06 10:44:30 -08:00
kobject_uevent.c kobject_uevent: fix typo in comments 2010-08-23 18:12:46 -07:00
kobject.c kobject: Introduce kset_find_obj_hinted. 2010-10-22 10:16:44 -07:00
kref.c kref: Add a kref_sub function 2010-11-22 13:25:13 +10:00
lcm.c block: Fix overrun in lcm() and move it to lib 2010-03-15 12:47:59 +01:00
libcrc32c.c libcrc32c: Fix "crc32c undefined" compilation error 2008-12-25 11:01:42 +11:00
list_debug.c list debugging: warn when deleting a deleted entry 2010-08-09 20:45:08 -07:00
list_sort.c lib/list_sort: test: check element addresses 2010-10-26 16:52:19 -07:00
locking-selftest-hardirq.h
locking-selftest-mutex.h
locking-selftest-rlock-hardirq.h
locking-selftest-rlock-softirq.h
locking-selftest-rlock.h
locking-selftest-rsem.h
locking-selftest-softirq.h
locking-selftest-spin-hardirq.h
locking-selftest-spin-softirq.h
locking-selftest-spin.h
locking-selftest-wlock-hardirq.h
locking-selftest-wlock-softirq.h
locking-selftest-wlock.h
locking-selftest-wsem.h
locking-selftest.c locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit] 2009-03-13 01:32:36 +01:00
lru_cache.c The DRBD driver 2009-10-01 21:17:49 +02:00
Makefile decompressors: add boot-time XZ support 2011-01-13 08:03:25 -08:00
nlattr.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-01-13 10:05:56 -08:00
parser.c lib/parser: cleanup match_number() 2010-10-26 16:52:19 -07:00
percpu_counter.c percpucounter: Optimize __percpu_counter_add a bit through the use of this_cpu() options. 2010-12-17 15:07:18 +01:00
plist.c plist: Make plist debugging raw_spinlock aware 2009-12-14 23:55:33 +01:00
prio_heap.c lib: fix sparse shadowed variable warning 2009-01-06 15:59:11 -08:00
prio_tree.c
proportions.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-01-06 17:10:04 -08:00
radix-tree.c radix_tree: radix_tree_gang_lookup_tag_slot() may never return 2011-01-26 10:50:04 +10:00
random32.c Merge branch 'master' into for-next 2010-06-16 18:08:13 +02:00
ratelimit.c ratelimit: fix the return value when __ratelimit() fails to acquire the lock 2010-04-07 08:38:04 -07:00
rational.c lib/rational.c needs module.h 2010-01-11 09:34:05 -08:00
rbtree.c rbtree: Undo augmented trees performance damage and regression 2010-07-05 14:43:50 +02:00
reciprocal_div.c
rwsem-spinlock.c rwsem generic spinlock: use IRQ save/restore spinlocks 2010-04-07 16:15:05 -07:00
rwsem.c rwsem: smaller wrappers around rwsem_down_failed_common 2010-08-09 20:45:11 -07:00
scatterlist.c scatterlist: prevent invalid free when alloc fails 2010-08-30 19:55:09 +02:00
sha1.c
show_mem.c mm: use the same log level for show_mem() 2010-03-06 11:26:27 -08:00
smp_processor_id.c cpumask: convert lib/smp_processor_id to new cpumask ops 2009-01-30 15:47:34 +01:00
sort.c generic swap(): lib/sort.c: rename swap to swap_func 2009-01-08 08:31:14 -08:00
spinlock_debug.c locking: Further name space cleanups 2009-12-14 23:55:33 +01:00
string_helpers.c
string.c lib/string.c: simplify strnstr() 2010-03-06 11:26:35 -08:00
swiotlb.c tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
syscall.c
textsearch.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
timerqueue.c timerqueue: Make timerqueue_getnext() static inline 2010-12-11 12:34:34 +01:00
ts_bm.c
ts_fsm.c
ts_kmp.c
uuid.c Unified UUID/GUID definition 2010-05-19 22:40:47 -04:00
vsprintf.c lib/vsprintf.c: fix vscnprintf() if @size is == 0 2011-01-13 08:03:10 -08:00