Instead of sending ->older_than_this to queue_io() and
move_expired_inodes(), send the entire wb_writeback_work
structure. There are other fields of a work item that are
useful in these routines and in tracepoints.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Useful for analyzing the dynamics of the throttling algorithms and
debugging user reported problems.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.
RATIONALS
=========
- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
If every thread doing writes and being throttled start foreground
writeback, it leads to N IO submitters from at least N different
inodes at the same time, end up with N different sets of IO being
issued with potentially zero locality to each other, resulting in
much lower elevator sort/merge efficiency and hence we seek the disk
all over the place to service the different sets of IO.
OTOH, if there is only one submission thread, it doesn't jump between
inodes in the same way when congestion clears - it keeps writing to
the same inode, resulting in large related chunks of sequential IOs
being issued to the disk. This is more efficient than the above
foreground writeback because the elevator works better and the disk
seeks less.
- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
* "CPU usage has dropped by ~55%", "it certainly appears that most of
the CPU time saving comes from the removal of contention on the
inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
cacheline bouncing, because the new code is able to call much less
frequently into balance_dirty_pages() and hence access the global
page states)
* the user space "App overhead" is reduced by 20%, by avoiding the
cacheline pollution by the complex writeback code path
* "for a ~5% throughput reduction", "the number of write IOs have
dropped by ~25%", and the elapsed time reduced from 41:42.17 to
40:53.23.
* On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
and improves IO throughput from 38MB/s to 42MB/s.
- IO size too small for fast arrays and too large for slow USB sticks
The write_chunk used by current balance_dirty_pages() cannot be
directly set to some large value (eg. 128MB) for better IO efficiency.
Because it could lead to more than 1 second user perceivable stalls.
Even the current 4MB write size may be too large for slow USB sticks.
The fact that balance_dirty_pages() starts IO on itself couples the
IO size to wait time, which makes it hard to do suitable IO size while
keeping the wait time under control.
Now it's possible to increase writeback chunk size proportional to the
disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
the larger writeback size dramatically reduces the seek count to 1/10
(far beyond my expectation) and improves the write throughput by 24%.
- long block time in balance_dirty_pages() hurts desktop responsiveness
Many of us may have the experience: it often takes a couple of seconds
or even long time to stop a heavy writing dd/cp/tar command with
Ctrl-C or "kill -9".
- IO pipeline broken by bumpy write() progress
There are a broad class of "loop {read(buf); write(buf);}" applications
whose read() pipeline will be under-utilized or even come to a stop if
the write()s have long latencies _or_ don't progress in a constant rate.
The current threshold based throttling inherently transfers the large
low level IO completion fluctuations to bumpy application write()s,
and further deteriorates with increasing number of dirtiers and/or bdi's.
For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
the rsync progresses very bumpy in legacy kernel, and throughput is
improved by 67% by this patchset. (plus the larger write chunk size,
it will be 93% speedup).
The new rate based throttling can support 1000+ dd's with excellent
smoothness, low latency and low overheads.
For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().
Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:
- in large NUMA systems, the per-cpu counters may have big accounting
errors, leading to big throttle wait time and jitters.
- NFS may kill large amount of unstable pages with one single COMMIT.
Because NFS server serves COMMIT with expensive fsync() IOs, it is
desirable to delay and reduce the number of COMMITs. So it's not
likely to optimize away such kind of bursty IO completions, and the
resulted large (and tiny) stall times in IO completion based throttling.
So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:
- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than 4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times
It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.
BEHAVIOR CHANGE
===============
(1) dirty threshold
Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.
Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.
(2) smoothness/responsiveness
Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Save inode->dirtied_when in the raw trace output for reliable scripting,
and to also show in formatted output the relative age in seconds for
easy human reading.
CC: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
Revert "cfq: Remove special treatment for metadata rqs."
block: fix flush machinery for stacking drivers with differring flush flags
block: improve rq_affinity placement
blktrace: add FLUSH/FUA support
Move some REQ flags to the common bio/request area
allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
xen/blkback: Make description more obvious.
cfq-iosched: Add documentation about idling
block: Make rq_affinity = 1 work as expected
block: swim3: fix unterminated of_device_id table
block/genhd.c: remove useless cast in diskstats_show()
drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
bsg-lib: add module.h include
cfq-iosched: Reduce linked group count upon group destruction
blk-throttle: correctly determine sync bio
loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
loop: add management interface for on-demand device allocation
loop: replace linked list of allocated devices with an idr index
...
Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or
FUA follows WRITE, use the same 'F' flag for both cases and
distinguish them by their (relative) position. The end results
look like (other flags might be shown also):
- WRITE: W
- WRITE_FLUSH: FW
- WRITE_FUA: WF
- WRITE_FLUSH_FUA: FWF
Note that we reuse TC_BARRIER due to lack of bit space of act_mask
so that the older versions of blktrace tools will report flush
requests as barriers from now on.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits)
ext4: prevent memory leaks from ext4_mb_init_backend() on error path
ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode #
ext4: use ext4_msg() instead of printk in mballoc
ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
ext4: use the correct error exit path in ext4_init_inode_table()
ext4: add missing kfree() on error return path in add_new_gdb()
ext4: change umode_t in tracepoint headers to be an explicit __u16
ext4: fix races in ext4_sync_parent()
ext4: Fix overflow caused by missing cast in ext4_fallocate()
ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole
ext4: simplify parameters of reserve_backup_gdb()
ext4: simplify parameters of add_new_gdb()
ext4: remove lock_buffer in bclean() and setup_new_group_blocks()
ext4: simplify journal handling in setup_new_group_blocks()
ext4: let setup_new_group_blocks() set multiple bits at a time
ext4: fix a typo in ext4_group_extend()
ext4: let ext4_group_add_blocks() handle 0 blocks quickly
ext4: let ext4_group_add_blocks() return an error code
ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks()
...
Fix up conflict in fs/ext4/inode.c: commit aacfc19c62 ("fs: simplify
the blockdev_direct_IO prototype") had changed the ext4_ind_direct_IO()
function for the new simplified calling convention, while commit
dae1e52cb1 ("ext4: move ext4_ind_* functions from inode.c to
indirect.c") moved the function to another file.
As requested by Al Viro, since umode_t may be changing to a u32 for
some architectures.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
jbd: change the field "b_cow_tid" of struct journal_head from type unsigned to tid_t
ext3.txt: update the links in the section "useful links" to the latest ones
ext3: Fix data corruption in inodes with journalled data
ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get
ext3: Fix compilation with -DDX_DEBUG
quota: Remove unused declaration
jbd: Use WRITE_SYNC in journal checkpoint.
jbd: Fix oops in journal_remove_journal_head()
ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs()
ext3/ioctl.c: silence sparse warnings about different address spaces
ext3/ext4 Documentation: remove bh/nobh since it has been deprecated
ext3: Improve truncate error handling
ext3: use proper little-endian bitops
ext2: include fs.h into ext2_fs.h
ext3: Fix oops in ext3_try_to_allocate_with_rsv()
jbd: fix a bug of leaking jh->b_jcount
jbd: remove dependency on __GFP_NOFAIL
ext3: Convert ext3 to new truncate calling convention
jbd: Add fixed tracepoints
ext3: Add fixed tracepoints
Resolve conflicts in fs/ext3/fsync.c due to fsync locking push-down and
new fixed tracepoints.
When CONFIG_FUNCTION_TRACER is disabled, compilation fails as follows:
CC arch/x86/xen/setup.o
In file included from arch/x86/include/asm/xen/hypercall.h:42,
from arch/x86/xen/setup.c:19:
include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
include/trace/events/xen.h:31: warning: its scope is only this definition or declaration, which is probably not what you want
include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
[...]
arch/x86/xen/trace.c:5: error: '__HYPERVISOR_set_trap_table' undeclared here (not in a function)
arch/x86/xen/trace.c:5: error: array index in initializer not of integer type
arch/x86/xen/trace.c:5: error: (near initialization for 'xen_hypercall_names')
arch/x86/xen/trace.c:6: error: '__HYPERVISOR_mmu_update' undeclared here (not in a function)
arch/x86/xen/trace.c:6: error: array index in initializer not of integer type
arch/x86/xen/trace.c:6: error: (near initialization for 'xen_hypercall_names')
Fix this by making sure struct multicall_entry has a declaration in
scope at all times, and don't bother compiling xen/trace.c when tracing
is disabled.
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6: (297 commits)
ALSA: asihpi - Replace with snd_ctl_boolean_mono_info()
ALSA: asihpi - HPI version 4.08
ALSA: asihpi - Add volume mute controls
ALSA: asihpi - Control name updates
ALSA: asihpi - Use size_t for sizeof result
ALSA: asihpi - Explicitly include mutex.h
ALSA: asihpi - Add new node and message defines
ALSA: asihpi - Make local function static
ALSA: asihpi - Fix minor typos and spelling
ALSA: asihpi - Remove unused structures, macros and functions
ALSA: asihpi - Remove spurious adapter index check
ALSA: asihpi - Revise snd_pcm_debug_name, get rid of DEBUG_NAME macro
ALSA: asihpi - DSP code loader API now independent of OS
ALSA: asihpi - Remove controlex structs and associated special data transfer code
ALSA: asihpi - Increase request and response buffer sizes
ALSA: asihpi - Give more meaningful name to hpi request message type
ALSA: usb-audio - Add quirk for Roland / BOSS BR-800
ALSA: hda - Remove a superfluous argument of via_auto_init_output()
ALSA: hda - Fix indep-HP path (de-)activation for VT1708* codecs
ALSA: hda - Add documentation for codec-specific mixer controls
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
isofs: Remove global fs lock
jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
mm/truncate.c: fix build for CONFIG_BLOCK not enabled
fs:update the NOTE of the file_operations structure
Remove dead code in dget_parent()
AFS: Fix silly characters in a comment
switch d_add_ci() to d_splice_alias() in "found negative" case as well
simplify gfs2_lookup()
jfs_lookup(): don't bother with . or ..
get rid of useless dget_parent() in btrfs rename() and link()
get rid of useless dget_parent() in fs/btrfs/ioctl.c
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
drivers: fix up various ->llseek() implementations
fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
Ext4: handle SEEK_HOLE/SEEK_DATA generically
Btrfs: implement our own ->llseek
fs: add SEEK_HOLE and SEEK_DATA flags
reiserfs: make reiserfs default to barrier=flush
...
Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add a some tracepoints to allow
insight to be gained.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Using function calls in TP_printk causes perf heartburn, so print the
MAJOR/MINOR device numbers instead.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add trace event balance_dirty_state for showing the global dirty page
counts and thresholds at each global_dirty_limits() invocation. This
will cover the callers throttle_vm_writeout(), over_bground_thresh()
and each balance_dirty_pages() loop.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.
struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.
It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.
It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.
Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
This commit adds fixed tracepoint for jbd. It has been based on fixed
tracepoints for jbd2, however there are missing those for collecting
statistics, since I think that it will require more intrusive patch so I
should have its own commit, if someone decide that it is needed. Also
there are new tracepoints in __journal_drop_transaction() and
journal_update_superblock().
The list of jbd tracepoints:
jbd_checkpoint
jbd_start_commit
jbd_commit_locking
jbd_commit_flushing
jbd_commit_logging
jbd_drop_transaction
jbd_end_commit
jbd_do_submit_data
jbd_cleanup_journal_tail
jbd_update_superblock_end
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
This commit adds fixed tracepoints to the ext3 code. It is based on ext4
tracepoints, however due to the differences of both file systems, there
are some tracepoints missing (those for delaloc and for multi-block
allocator) and there are some ext3 specific as well (for reservation
windows).
Here is a list:
ext3_free_inode
ext3_request_inode
ext3_allocate_inode
ext3_evict_inode
ext3_drop_inode
ext3_mark_inode_dirty
ext3_write_begin
ext3_ordered_write_end
ext3_writeback_write_end
ext3_journalled_write_end
ext3_ordered_writepage
ext3_writeback_writepage
ext3_journalled_writepage
ext3_readpage
ext3_releasepage
ext3_invalidatepage
ext3_discard_blocks
ext3_request_blocks
ext3_allocate_blocks
ext3_free_blocks
ext3_sync_file_enter
ext3_sync_file_exit
ext3_sync_fs
ext3_rsv_window_add
ext3_discard_reservation
ext3_alloc_new_reservation
ext3_reserved
ext3_forget
ext3_read_block_bitmap
ext3_direct_IO_enter
ext3_direct_IO_exit
ext3_unlink_enter
ext3_unlink_exit
ext3_truncate_enter
ext3_truncate_exit
ext3_get_blocks_enter
ext3_get_blocks_exit
ext3_load_inode
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
This patch adds 2 tracepoints to get a status of a socket receive queue
and related parameter.
One tracepoint is added to sock_queue_rcv_skb. It records rcvbuf size
and its usage. The other tracepoint is added to __sk_mem_schedule and
it records limitations of memory for sockets and current usage.
By using these tracepoints we're able to know detailed reason why kernel
drop the packet.
Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a tracepoint to __udp_queue_rcv_skb to get the
return value of ip_queue_rcv_skb. It indicates why kernel drops
a packet at this point.
ip_queue_rcv_skb returns following values in the packet drop case:
rcvbuf is full : -ENOMEM
sk_filter returns error : -EINVAL, -EACCESS, -ENOMEM, etc.
__sk_mem_schedule returns error: -ENOBUF
Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: Fix oops in jbd2_journal_remove_journal_head()
jbd2: Remove obsolete parameters in the comments for some jbd2 functions
ext4: fixed tracepoints cleanup
ext4: use FIEMAP_EXTENT_LAST flag for last extent in fiemap
ext4: Fix max file size and logical block counting of extent format file
ext4: correct comments for ext4_free_blocks()
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: Move RCU_BOOST #ifdefs to header file
rcu: use softirq instead of kthreads except when RCU_BOOST=y
rcu: Use softirq to address performance regression
rcu: Simplify curing of load woes
While testing for memcg aware swap token, I observed a swap token was
often grabbed an intermittent running process (eg init, auditd) and they
never release a token.
Why?
Some processes (eg init, auditd, audispd) wake up when a process exiting.
And swap token can be get first page-in process when a process exiting
makes no swap token owner. Thus such above intermittent running process
often get a token.
And currently, swap token priority is only decreased at page fault path.
Then, if the process sleep immediately after to grab swap token, the swap
token priority never be decreased. That's obviously undesirable.
This patch implement very poor (and lightweight) priority aging. It only
be affect to the above corner case and doesn't change swap tendency
workload performance (eg multi process qsbench load)
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a26ac2455ffcf3(rcu: move TREE_RCU from softirq to kthread)
introduced performance regression. In an AIM7 test, this commit degraded
performance by about 40%.
The commit runs rcu callbacks in a kthread instead of softirq. We observed
high rate of context switch which is caused by this. Out test system has
64 CPUs and HZ is 1000, so we saw more than 64k context switch per second
which is caused by RCU's per-CPU kthread. A trace showed that most of
the time the RCU per-CPU kthread doesn't actually handle any callbacks,
but instead just does a very small amount of work handling grace periods.
This means that RCU's per-CPU kthreads are making the scheduler do quite
a bit of work in order to allow a very small amount of RCU-related
processing to be done.
Alex Shi's analysis determined that this slowdown is due to lock
contention within the scheduler. Unfortunately, as Peter Zijlstra points
out, the scheduler's real-time semantics require global action, which
means that this contention is inherent in real-time scheduling. (Yes,
perhaps someone will come up with a workaround -- otherwise, -rt is not
going to do well on large SMP systems -- but this patch will work around
this issue in the meantime. And "the meantime" might well be forever.)
This patch therefore re-introduces softirq processing to RCU, but only
for core RCU work. RCU callbacks are still executed in kthread context,
so that only a small amount of RCU work runs in softirq context in the
common case. This should minimize ksoftirqd execution, allowing us to
skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Tested-by: "Alex,Shi" <alex.shi@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Note that it adds a little overheads to account the moved/enqueued
inodes from b_dirty to b_io. The "moved" accounting may be later used to
limit the number of inodes that can be moved in one shot, in order to
keep spinlock hold time under control.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
It is valuable to know how the dirty inodes are iterated and their IO size.
"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"
- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written
v2: add trace event writeback_single_inode_requeue as proposed by Dave.
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Remove two unused struct writeback_control fields:
.encountered_congestion (completely unused)
.nonblocking (never set, checked/showed in XFS,NFS/btrfs)
The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.
Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
When wbc.more_io was first introduced, it indicates whether there are
at least one superblock whose s_more_io contains more IO work. Now with
the per-bdi writeback, it can be replaced with a simple b_more_io test.
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
While creating fixed tracepoints for ext3, basically by porting them
from ext4, I found a lot of useless retyping, wrong type usage, useless
variable passing and other inconsistencies in the ext4 fixed tracepoint
code.
This patch cleans the fixed tracepoint code for ext4 and also simplify
some of them.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (40 commits)
tg3: Fix tg3_skb_error_unmap()
net: tracepoint of net_dev_xmit sees freed skb and causes panic
drivers/net/can/flexcan.c: add missing clk_put
net: dm9000: Get the chip in a known good state before enabling interrupts
drivers/net/davinci_emac.c: add missing clk_put
af-packet: Add flag to distinguish VID 0 from no-vlan.
caif: Fix race when conditionally taking rtnl lock
usbnet/cdc_ncm: add missing .reset_resume hook
vlan: fix typo in vlan_dev_hard_start_xmit()
net/ipv4: Check for mistakenly passed in non-IPv4 address
iwl4965: correctly validate temperature value
bluetooth l2cap: fix locking in l2cap_global_chan_by_psm
ath9k: fix two more bugs in tx power
cfg80211: don't drop p2p probe responses
Revert "net: fix section mismatches"
drivers/net/usb/catc.c: Fix potential deadlock in catc_ctrl_run()
sctp: stop pending timers and purge queues when peer restart asoc
drivers/net: ks8842 Fix crash on received packet when in PIO mode.
ip_options_compile: properly handle unaligned pointer
iwlagn: fix incorrect PCI subsystem id for 6150 devices
...
Because there is a possibility that skb is kfree_skb()ed and zero cleared
after ndo_start_xmit, we should not see the contents of skb like skb->len and
skb->dev->name after ndo_start_xmit. But trace_net_dev_xmit does that
and causes panic by NULL pointer dereference.
This patch fixes trace_net_dev_xmit not to see the contents of skb directly.
If you want to reproduce this panic,
1. Get tracepoint of net_dev_xmit on
2. Create 2 guests on KVM
2. Make 2 guests use virtio_net
4. Execute netperf from one to another for a long time as a network burden
5. host will panic(It takes about 30 minutes)
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>