linux/Documentation
Nick Piggin 31e6b01f41 fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, beginning when the starting
  path (eg. root/cwd/fd-path) is acquired. So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
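
The per-dentry sequence check described in the list above can be sketched in
user-space C. This is only an illustration of the general seqcount read/retry
pattern, not the kernel code: struct toy_dentry, struct toy_snapshot and every
toy_* function are hypothetical names, C11 atomics stand in for the kernel's
primitives, and the plain field accesses are only safe because the sketch is
exercised from a single thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dentry; not the kernel structure. */
struct toy_dentry {
    atomic_uint seq;            /* even: stable, odd: writer in progress */
    struct toy_dentry *parent;
    const char *name;
    int inode;                  /* stand-in for d_inode */
};

/* The (parent, name, inode) tuple a reader wants to load consistently. */
struct toy_snapshot {
    struct toy_dentry *parent;
    const char *name;
    int inode;
};

/* Wait until no writer is active and return the sequence seen. */
static unsigned toy_read_begin(struct toy_dentry *d)
{
    unsigned s;

    do {
        s = atomic_load_explicit(&d->seq, memory_order_acquire);
    } while (s & 1u);
    return s;
}

/* True if the dentry changed since toy_read_begin() returned `start`. */
static bool toy_read_retry(struct toy_dentry *d, unsigned start)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&d->seq, memory_order_relaxed) != start;
}

/* Writers bump the sequence before and after changing the tuple. */
static void toy_write_begin(struct toy_dentry *d)
{
    unsigned s = atomic_fetch_add_explicit(&d->seq, 1u, memory_order_acq_rel);

    assert((s & 1u) == 0);      /* sketch assumes one writer at a time */
    (void)s;
}

static void toy_write_end(struct toy_dentry *d)
{
    atomic_fetch_add_explicit(&d->seq, 1u, memory_order_release);
}

/* Load the tuple, retrying until it was read without interference. */
static struct toy_snapshot toy_load(struct toy_dentry *d, unsigned *seqp)
{
    struct toy_snapshot snap;
    unsigned s;

    do {
        s = toy_read_begin(d);
        snap.parent = d->parent;
        snap.name = d->name;
        snap.inode = d->inode;
    } while (toy_read_retry(d, s));
    *seqp = s;                  /* caller can recheck this sequence later */
    return snap;
}
```

A reader keeps the sequence returned through seqp and passes it to
toy_read_retry() again later, which is the same shape as rechecking the parent
sequence after a child has been found.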

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we cannot drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
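
A rough user-space sketch of this step follows. It assumes that the "does not
match" check corresponds to the lookup sequence recheck mentioned in the same
paragraph. struct toy_dest and the toy_* functions are hypothetical, and the
dentry lock, the mount refcount and the RCU/vfsmount unlocks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the destination dentry. */
struct toy_dest {
    unsigned seq;        /* changes whenever the dentry is modified */
    unsigned refcount;
};

/*
 * Drop rcu-walk at `d`, which was looked up when its sequence was `seq`.
 * (The real code holds the dentry lock around this; omitted here.)
 */
static int toy_drop_rcu(struct toy_dest *d, unsigned seq)
{
    assert(d);
    if (d->seq != seq)
        return -ECHILD;  /* changed under us: cannot drop gracefully */
    d->refcount++;       /* from here on the dentry is pinned by a ref */
    return 0;
}

/* One walk attempt; `rcu` selects rcu-walk or ref-walk. */
static int toy_walk(struct toy_dest *d, unsigned seq, bool rcu)
{
    if (rcu)
        return toy_drop_rcu(d, seq);
    d->refcount++;       /* ref-walk takes references as it goes */
    return 0;
}

/* Try rcu-walk first; on -ECHILD redo the entire lookup with ref-walk. */
static int toy_path_lookup(struct toy_dest *d, unsigned seq)
{
    int err = toy_walk(d, seq, true);

    if (err == -ECHILD)
        err = toy_walk(d, seq, false);
    return err;
}
```

The point of the sketch is that -ECHILD never reaches the caller of
toy_path_lookup(): it is consumed as a signal to restart in ref-walk mode.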

Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
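
The four cases above can be read as a per-element check. The sketch below is
only a restatement of that list in C: struct toy_elem, enum toy_drop_reason
and toy_check_elem() are hypothetical, and the boolean fields summarise
conditions that the real code would test on the dentry and inode themselves.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one path element; fields are illustrative. */
struct toy_elem {
    bool cached;              /* false: NULL dentry, element is uncached */
    bool parent_permission;   /* parent has i_op->permission or ACLs */
    bool has_d_revalidate;    /* dentry has a d_revalidate operation */
    bool is_link;             /* element is a link that must be followed */
};

enum toy_drop_reason {
    TOY_STAY_RCU,             /* rcu-walk can continue */
    TOY_DROP_UNCACHED,
    TOY_DROP_PERMISSION,
    TOY_DROP_REVALIDATE,
    TOY_DROP_LINK,
};

/* Return why rcu-walk must be dropped at `e`, or TOY_STAY_RCU. */
static enum toy_drop_reason toy_check_elem(const struct toy_elem *e)
{
    assert(e);
    if (!e->cached)
        return TOY_DROP_UNCACHED;
    if (e->parent_permission)
        return TOY_DROP_PERMISSION;
    if (e->has_d_revalidate)
        return TOY_DROP_REVALIDATE;
    if (e->is_link)
        return TOY_DROP_LINK;
    return TOY_STAY_RCU;
}
```

Any result other than TOY_STAY_RCU means the walk would drop to ref-walk at
that element. The order of the tests is an arbitrary choice of this sketch.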

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:27 +11:00
ABI Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mjg59/platform-drivers-x86 2010-12-07 08:14:04 -08:00
accounting taskstats: pad taskstats netlink response for aligment issues on ia64 2010-12-22 19:43:34 -08:00
acpi ACPI: introduce module parameter acpi.aml_debug_output 2010-08-14 23:02:14 -04:00
aoe Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
arm OMAP: DSS: Fix documentation regarding 'vram' kernel parameter 2010-11-10 20:51:14 +09:00
auxdisplay
blackfin Blackfin: document SPI CS limitations with CPHA=0 2010-08-06 12:55:52 -04:00
block Documentation: remove anticipatory scheduler info 2010-11-11 12:09:59 +01:00
blockdev Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
cdrom Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
cgroups cgroup: add clone_children control file 2010-10-27 18:03:09 -07:00
connector Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
console doc: fix console doc typo 2010-02-24 13:51:32 +01:00
cpu-freq [CPUFREQ] Processor Clocking Control interface driver 2010-01-13 10:55:16 -05:00
cpuidle
cris
crypto
development-process Documentation/development-process: more staging info 2010-11-18 15:00:47 -08:00
device-mapper Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
DocBook Merge branch 'sh-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6 2010-11-25 06:58:19 +09:00
driver-model driver core: prune docs about device_interface 2010-11-10 16:57:11 -08:00
dvb [media] lmedm04: driver for DM04/QQBOX updated to version 1.60 2010-10-22 22:22:47 -02:00
early-userspace
fault-injection lkdtm: add debugfs access and loosen KPROBE ties 2010-03-06 11:26:32 -08:00
fb fbdev: Update documentation index file. 2010-11-18 15:01:54 +09:00
filesystems fs: rcu-walk for path lookup 2011-01-07 17:50:27 +11:00
firmware_class firmware: Update hotplug script 2010-08-05 13:53:34 -07:00
frv
hwmon Documentation: change email address for Hans Koch 2010-11-18 15:00:46 -08:00
i2c i2c-i801: Add PCI idents for Patsburg 'IDF' SMBus controllers 2010-10-31 21:07:00 +01:00
i2o
ia64 Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
ide
infiniband Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
input HID: ntrig: add documention 2010-08-30 15:25:18 +02:00
ioctl Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 2010-10-28 12:13:00 -07:00
isdn Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2010-08-04 15:31:02 -07:00
ja_JP Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
kbuild Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6 2010-10-28 16:18:59 -07:00
kdump
ko_KR Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
kvm KVM: Document that KVM_GET_SUPPORTED_CPUID may return emulated values 2010-10-24 10:52:48 +02:00
laptops thinkpad-acpi: untangle ACPI/vendor backlight selection 2010-08-16 11:54:50 -04:00
leds Documentation: led drivers lp5521 and lp5523 2010-11-12 07:55:32 -08:00
lguest Merge branch 'v2.6.36-rc8' into for-2.6.37/barrier 2010-10-19 09:13:04 +02:00
m68k
make
mips
misc-devices Documentation: short descriptions for bh1770glc and apds990x drivers 2010-10-26 16:52:14 -07:00
mmc mmc: add erase, secure erase, trim and secure trim operations 2010-08-12 08:43:30 -07:00
mn10300
mtd Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
namespaces
netlabel Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
networking tcp: restrict net.ipv4.tcp_adv_win_scale (#20312) 2010-11-28 10:39:45 -08:00
parisc
PCI Documentation: pci.txt: fix typo 2010-07-11 22:17:45 +02:00
pcmcia pcmcia: use autoconfiguration feature for ioports and iomem 2010-09-29 17:20:24 +02:00
power PM / Runtime: Fix pm_runtime_suspended() 2010-12-16 17:12:25 +01:00
powerpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6 2010-10-22 20:30:48 -07:00
pps
prctl
RCU rcu: Add tracing data to support queueing models 2010-09-23 09:16:53 -07:00
s390 Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
scheduler sched: Remove USER_SCHED from documentation 2010-04-02 20:12:01 +02:00
scsi [SCSI] fix up documentation for change in ->queuecommand to lockless calling 2010-12-21 08:23:54 -06:00
serial Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
sh sh: clkfwk: Kill off unused clk_set_rate_ex(). 2010-11-15 18:25:12 +09:00
sound Merge branch 'topic/hda' into for-linus 2010-10-25 10:40:05 +02:00
sparc
spi spi/ep93xx: implemented driver for Cirrus EP93xx SPI controller 2010-05-25 00:23:16 -06:00
sysctl Restrict unprivileged access to kernel syslog 2010-11-12 07:55:32 -08:00
telephony Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
thermal
timers Documentation/timers/hpet_example.c: add supporting info for hpet_example 2010-10-26 16:52:11 -07:00
trace mm: vmscan: tracepoint: account for scanned pages similarly for both ftrace and vmstat 2010-12-22 19:43:33 -08:00
uml Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
usb USB: teach "devices" file about Wireless and SuperSpeed USB 2010-10-22 10:21:40 -07:00
video4linux [media] Support for Elgato Video Capture 2010-10-22 20:55:43 -02:00
vm mm: highmem documentation 2010-10-26 16:52:08 -07:00
w1 Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
watchdog watchdog: docs: add an entry for imx2_wdt 2010-07-01 16:02:55 +00:00
wimax
x86 Merge branch 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-10-22 08:54:21 -07:00
zh_CN Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
.gitignore add random binaries to .gitignore 2010-04-08 11:34:34 +02:00
00-INDEX mmc: add erase, secure erase, trim and secure trim operations 2010-08-12 08:43:30 -07:00
apparmor.txt AppArmor: update Maintainer and Documentation 2010-08-02 15:35:15 +10:00
applying-patches.txt
atomic_ops.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
bad_memory.txt
basic_profiling.txt
binfmt_misc.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
BUG-HUNTING
bus-virt-phys-mapping.txt documentation: fix almost duplicate filenames (IO/io-mapping.txt) 2010-07-20 17:49:30 +00:00
cachetlb.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
Changes Documentation update broken web addresses 2010-07-11 21:55:42 +02:00
circular-buffers.txt Document Linux's circular buffering capabilities 2010-03-24 16:31:22 -07:00
coccinelle.txt Coccinelle: Fix documentation 2010-10-28 00:32:23 +02:00
CodingStyle
cpu-hotplug.txt documentation: fix erroneous email address. 2010-08-11 23:04:10 +09:30
cpu-load.txt
cputopology.txt topology/sysfs: Provide book id and siblings attributes 2010-09-09 20:41:25 +02:00
credentials.txt CRED: Fix __task_cred()'s lockdep check and banner comment 2010-07-29 15:16:18 -07:00
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
devices.txt Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6 2010-10-28 09:35:11 -07:00
DMA-API-HOWTO.txt Documentation: DMA-API-HOWTO.txt: rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN 2010-08-14 11:56:46 -07:00
DMA-API.txt dma-mapping: remove dma_is_consistent API 2010-08-11 08:59:21 -07:00
DMA-attributes.txt
DMA-ISA-LPC.txt
dmaengine.txt
dontdiff Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-kconfig 2010-02-25 14:43:57 -08:00
dynamic-debug-howto.txt Dynamic Debug: Introduce ddebug_query= boot parameter 2010-10-22 10:16:42 -07:00
edac.txt EDAC: Fix typos in Documentation/edac.txt 2010-11-25 17:32:47 +01:00
eisa.txt doc: fix Defaultd -> Defaults typo in EISA doc 2010-02-05 12:22:39 +01:00
email-clients.txt Documentation/email-clients.txt: update gmail information 2010-03-12 15:52:35 -08:00
feature-removal-schedule.txt i2c: Mark i2c_adapter.id as deprecated 2010-11-15 22:40:38 +01:00
flexible-arrays.txt
futex-requeue-pi.txt
gcov.txt
gpio.txt Documentation/gpio.txt: explain poll/select usage 2010-11-18 15:00:46 -08:00
highuid.txt
HOWTO Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
hw_random.txt
init.txt init/main.c: improve usability in case of init binary failure 2010-03-06 11:26:29 -08:00
initrd.txt
intel_txt.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt ipmi: add parameter to limit CPU usage in kipmid 2010-03-12 15:52:39 -08:00
IRQ-affinity.txt
IRQ.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt docbook: warn on unused doc entries 2010-09-11 16:49:21 -07:00
kernel-docs.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
kernel-parameters.txt watchdog: Improve initialisation error message and documentation 2011-01-03 05:25:52 +01:00
keys-request-key.txt
keys.txt
kmemcheck.txt
kmemleak.txt
kobject.txt kobject: documentation: Update to refer to kset-example.c. 2010-03-19 07:12:20 -07:00
kprobes.txt kprobes: Update document about irq disabled state in kprobe handler 2010-10-14 08:55:27 +02:00
kref.txt
ldm.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
leds-class.txt led-class: always implement blinking 2010-11-12 07:55:32 -08:00
leds-lp3944.txt
local_ops.txt
lockdep-design.txt
lockstat.txt
logo.gif
logo.txt
magic-number.txt
Makefile Documentation/fs/: split txt and source files 2010-03-12 15:52:35 -08:00
ManagementStyle
mca.txt
md.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
memory-barriers.txt Document Linux's circular buffering capabilities 2010-03-24 16:31:22 -07:00
memory-hotplug.txt
memory.txt
mono.txt
mutex-design.txt mutex: Fix annotations to include it in kernel-locking docbook 2010-09-03 08:19:51 +02:00
nmi_watchdog.txt
nommu-mmap.txt
numastat.txt
oops-tracing.txt panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I') 2010-05-19 08:37:43 +01:00
padata.txt Documentation/padata.txt: fix typos etc. 2010-08-11 08:59:18 -07:00
parport-lowlevel.txt
parport.txt
pi-futex.txt
pnp.txt doc: capitalization and other minor fixes in pnp doc 2010-02-05 12:22:44 +01:00
preempt-locking.txt
printk-formats.txt
prio_tree.txt
rbtree.txt Documentation: remove anticipatory scheduler info 2010-11-11 12:09:59 +01:00
rfkill.txt Document the rfkill sysfs ABI 2010-03-10 17:09:33 -05:00
robust-futex-ABI.txt
robust-futexes.txt
rt-mutex-design.txt variable name fix to Documentation/rt-mutex-design.txt 2010-06-05 17:39:09 +02:00
rt-mutex.txt
rtc.txt
SAK.txt
SecurityBugs
SELinux.txt
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
SM501.txt
Smack.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
sparse.txt update email address 2010-07-19 10:56:54 +02:00
spinlocks.txt
stable_api_nonsense.txt
stable_kernel_rules.txt Documentation: -stable rules: upstream commit ID requirement reworded 2010-04-22 15:24:56 -07:00
SubmitChecklist Documentation: update SubmitChecklist for O=objdir and kconfig testing 2010-05-24 07:31:20 -07:00
SubmittingDrivers Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
SubmittingPatches SubmittingPatches: add more about patch descriptions 2010-08-09 20:45:05 -07:00
svga.txt
sysfs-rules.txt Fix typos in comments 2010-03-16 11:47:56 +01:00
sysrq.txt documentation: update sysrq.txt magic sysrq keys 2010-10-26 17:32:41 -07:00
tomoyo.txt TOMOYO: Update version to 2.3.0 2010-08-02 15:35:10 +10:00
unaligned-memory-access.txt
unicode.txt
unshare.txt
VGA-softcursor.txt
vgaarbiter.txt vgaarbiter: fix a typo in the vgaarbiter Documentation 2009-12-16 11:28:58 -08:00
video-output.txt
volatile-considered-harmful.txt Documentation/volatile-considered-harmful.txt: correct cpu_relax() documentation 2010-03-24 16:31:20 -07:00
workqueue.txt workqueue: add and use WQ_MEM_RECLAIM flag 2010-10-11 15:20:26 +02:00
zorro.txt