linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-17 17:41:44 +00:00

Author	SHA1	Message	Date
Aditya Kali	5356f2615c	ext4: attempt to fix race in bigalloc code path Currently, there exists a race between delayed allocated writes and the writeback when bigalloc feature is in use. The race was because we wanted to determine what blocks in a cluster are under delayed allocation and we were using buffer_delayed(bh) check for it. But, the writeback codepath clears this bit without any synchronization which resulted in a race and an ext4 warning similar to: EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0 reserved data blocks The race existed in two places. (1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from writeback code path. (2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where buffer_delayed(bh) is set). To fix (1), this patch introduces a new buffer_head state bit - BH_Da_Mapped. This bit is set under the protection of EXT4_I(inode)->i_data_sem when we have actually mapped the delayed allocated blocks during the writeout time. We can now reliably check for this bit inside ext4_find_delalloc_range() to determine whether the reservation for the blocks has already been claimed or not. To fix (2), it was necessary to set buffer_delay(bh) under the protection of i_data_sem. So, I extracted the very beginning of ext4_map_blocks into a new function - ext4_da_map_blocks() - and performed the required setting of bh_delay bit and the quota reservation under the protection of i_data_sem. These two fixes make the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent, thus removing the race. Tested: I was able to reproduce the problem by running 'dd' and 'fsync' in parallel. Also, xfstests sometimes used to reproduce this race. After the fix both my test and xfstests were successful and no race (warning message) was observed. Google-Bug-Id: 4997027 Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:20:51 -04:00
Aditya Kali	d8990240d8	ext4: add some tracepoints in ext4/extents.c This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in ext4/inode.c. Tested: Built and ran the kernel and verified that these tracepoints work. Also ran xfstests. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:18:51 -04:00
Theodore Ts'o	df55c99dc8	ext4: rename ext4_has_free_blocks() to ext4_has_free_clusters() Rename the function so it is more clear what is going on. Also rename the various variables so it's clearer what's happening. Also fix a missing blocks to cluster conversion when reading the number of reserved blocks for root. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:16:51 -04:00
Theodore Ts'o	e7d5f3156e	ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters() This function really claims a number of free clusters, not blocks, so rename it so it's clearer what's going on. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:14:51 -04:00
Theodore Ts'o	cff1dfd767	ext4: rename ext4_free_blocks_after_init() to ext4_free_clusters_after_init() This function really returns the number of clusters after an uninitialized block bitmap has been initialized. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:12:51 -04:00
Theodore Ts'o	5dee54372c	ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters() This function really counts the free clusters reported in the block group descriptors, so rename it to reduce confusion. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:10:51 -04:00
Theodore Ts'o	021b65bb1e	ext4: Rename ext4_free_blks_{count,set}() to refer to clusters The field bg_free_blocks_count_{lo,high} in the block group descriptor has been repurposed to hold the number of free clusters for bigalloc functions. So rename the functions to make it easier to read and audit the block allocation and block freeing code. Note: at this point in bigalloc development we don't support online resize, so this also makes it really obvious all of the places we need to fix up to add support for online resize. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:08:51 -04:00
Theodore Ts'o	6f16b60690	ext4: enable mounting bigalloc as read/write Now that we have implemented all of the changes needed for bigalloc, we can finally enable it! Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:06:51 -04:00
Aditya Kali	7b415bf60f	ext4: Fix bigalloc quota accounting and i_blocks value With bigalloc changes, the i_blocks value was not correctly set (it was still set to number of blocks being used, but in case of bigalloc, we want i_blocks to represent the number of clusters being used). Since the quota subsystem sets the i_blocks value, this patch fixes the quota accounting and makes sure that the i_blocks value is set correctly. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:04:51 -04:00
Theodore Ts'o	27baebb849	ext4: tune mballoc's default group prealloc size for bigalloc file systems The default group preallocation size had been previously set to 512 blocks/clusters, regardless of the block/cluster size. This is probably too big for large cluster sizes. So adjust the default so that it is 2 megabytes or 32 clusters, whichever is larger. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:02:51 -04:00
Theodore Ts'o	f975d6bcc7	ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 19:00:51 -04:00
Theodore Ts'o	24aaa8ef4e	ext4: convert the free_blocks field in s_flex_groups to be free_clusters Convert the free_blocks to be free_clusters to make the final revised bigalloc changes easier to read/understand. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:58:51 -04:00
Theodore Ts'o	5704265188	ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter Convert the percpu counters s_dirtyblocks_counter and s_freeblocks_counter in struct ext4_super_info to be s_dirtyclusters_counter and s_freeclusters_counter. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:56:51 -04:00
Theodore Ts'o	0aa060000e	ext4: teach ext4_ext_truncate() about the bigalloc feature When we are truncating (as opposed to unlinking) a file, we need to worry about partial truncates of a file, especially in the light of sparse files. The changes here make sure that arbitrary truncates of sparse files work correctly. Yeah, it's messy. Note that these functions will need to be revisited when the punch ioctl is integrated --- in fact this commit will probably have merge conflicts with the punch changes which Allison Henderson and the IBM LTC have been working on. I will need to fix this up when either patch hits mainline. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:54:51 -04:00
Theodore Ts'o	4d33b1ef10	ext4: teach ext4_ext_map_blocks() about the bigalloc feature If we need to allocate a new block in ext4_ext_map_blocks(), the function needs to see if the cluster has already been allocated. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:52:51 -04:00
Theodore Ts'o	84130193e0	ext4: teach ext4_free_blocks() about bigalloc and clusters The ext4_free_blocks() function now has two new flags that indicate whether a partial cluster at the beginning or the end of the block extents should be freed or not. That will be up to the caller (i.e., truncate), who can figure out whether partial clusters at the beginning or the end of a block range can be freed. We also have to update the ext4_mb_free_metadata() and release_blocks_on_commit() machinery to be cluster-based, since it is used by ext4_free_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:50:51 -04:00
Theodore Ts'o	53accfa9f8	ext4: teach mballoc preallocation code about bigalloc clusters In most of mballoc.c, we do everything in units of clusters, since the block allocation bitmaps and buddy bitmaps are all denominated in clusters. The one place where we do deal with absolute block numbers is in the code that handles the preallocation regions, since in the case of inode-based preallocation regions, the start of the preallocation region can't be relative to the beginning of the group. So this adds a bit of complexity, where pa_pstart and pa_lstart are block numbers, while pa_free, pa_len, and fe_len are denominated in units of clusters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:48:51 -04:00
Theodore Ts'o	3212a80a58	ext4: convert block group-relative offsets to use clusters Certain parts of the ext4 code base, primarily in mballoc.c, use a block group number and offset from the beginning of the block group. This offset is invariably used to index into the allocation bitmap, so change the offset to be denominated in units of clusters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:46:51 -04:00
Theodore Ts'o	d5b8f31007	ext4: bigalloc changes to block bitmap initialization functions Add bigalloc support to ext4_init_block_bitmap() and ext4_free_blocks_after_init(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:44:51 -04:00
Theodore Ts'o	fd034a84e1	ext4: split out ext4_free_blocks_after_init() The function ext4_free_blocks_after_init() used to be a #define of ext4_init_block_bitmap(). This actually made it difficult to understand how the function worked, and made it hard to make changes to support clusters. So as an initial cleanup, I've separated out the functionality of initializing block bitmap from calculating the number of free blocks in the new block group. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:42:51 -04:00
Theodore Ts'o	49f7f9af4b	ext4: factor out block group accounting into functions This makes it easier to understand how ext4_init_block_bitmap() works, and it will assist when we split out ext4_free_blocks_after_init() in the next commit. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:40:51 -04:00
Theodore Ts'o	7137d7a48e	ext4: convert instances of EXT4_BLOCKS_PER_GROUP to EXT4_CLUSTERS_PER_GROUP Change the places in fs/ext4/mballoc.c where EXT4_BLOCKS_PER_GROUP are used to indicate the number of bits in a block bitmap (which is really a cluster allocation bitmap in bigalloc file systems). There are still some places in the ext4 codebase where usage of EXT4_BLOCKS_PER_GROUP needs to be audited/fixed, in code paths that aren't used given the initial restricted assumptions for bigalloc. These will need to be fixed before we can relax those restrictions. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:38:51 -04:00
Theodore Ts'o	bab08ab964	ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.) At least initially if the bigalloc feature is enabled, we will not support non-extent mapped inodes, online resizing, online defrag, or the FITRIM ioctl. This simplifies the initial implementation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:36:51 -04:00
Theodore Ts'o	281b599597	ext4: read-only support for bigalloc file systems This adds support for bigalloc file systems. It teaches the mount code just enough about bigalloc superblock fields that it will mount the file system without freaking out that the number of blocks per group is too big. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:34:51 -04:00
Theodore Ts'o	7c2e70879f	ext4: add ext4-specific kludge to avoid an oops after the disk disappears The del_gendisk() function uninitializes the disk-specific data structures, including the bdi structure, without telling anyone else. Once this happens, any attempt to call mark_buffer_dirty() (for example, by ext4_commit_super), will cause a kernel OOPS. Fix this for now until we can fix things in an architecturally correct way. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-09 18:28:51 -04:00
Allison Henderson	02fac1297e	ext4: fix partial page writes While running extended fsx tests to verify the preceding patches, a similar bug was also found in the write operation. Whenever a write operation begins or ends in a hole, or extends EOF, the partial page contained in the hole or beyond EOF needs to be zeroed out. To correct this the new ext4_discard_partial_page_buffers_no_lock routine is used to zero out the partial page, but only for buffer heads that are already unmapped. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-06 21:53:01 -04:00
Allison Henderson	189e868fa8	ext4: fix fsx truncate failure While running extended fsx tests to verify the first two patches, a similar bug was also found in the truncate operation. This bug happens because the truncate routine only zeros the unblock aligned portion of the last page. This means that the block aligned portions of the page appearing after i_size are left unzeroed, and the buffer heads still mapped. This bug is corrected by using ext4_discard_partial_page_buffers in the truncate routine to zero the partial page and unmap the buffer headers. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-06 21:49:44 -04:00
Theodore Ts'o	decbd919f4	ext4: only call ext4_jbd2_file_inode when an inode has been extended In delayed allocation mode, it's important to only call ext4_jbd2_file_inode when the file has been extended. This is necessary to avoid a race which first got introduced in commit `678aaf481`, but which was made much more common with the introduction of the "punch hole" functionality. (Especially when dioread_nolock was enabled, when I could reliably reproduce this problem with xfstests #74.) The race is this: If while trying to writeback a delayed allocation inode, there is a need to map delalloc blocks, and we run out of space in the journal, and at the same time the inode is already on the committing transaction's t_inode_list (because for example while doing the punch hole operation, ext4_jbd2_file_inode() is called), then the commit operation will wait for the inode to finish all of its pending writebacks by calling filemap_fdatawait(), but since that inode has one or more pages with the PageWriteback flag set, the commit operation will wait forever, and so the writeback of the inode can never take place, and the kjournald thread and the writeback thread end up waiting for each other --- forever. It's important at this point to recall why an inode is placed on the t_inode_list; it is to provide the data=ordered guarantees that we don't end up exposing stale data. In the case where we are truncating or punching a hole in the inode, there is no possibility that stale data could be exposed in the first place, so we don't need to put the inode on the t_inode_list! The right long-term fix is to get rid of data=ordered mode altogether, and only update the extent tree or indirect blocks after the data has been written. Until then, this change will also avoid some unnecessary waiting in the commit operation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Allison Henderson <achender@linux.vnet.ibm.com> Cc: Jan Kara <jack@suse.cz>	2011-09-06 02:37:06 -04:00
Theodore Ts'o	9ea7a0df63	jbd2: add debugging information to jbd2_journal_dirty_metadata() Add debugging information in case jbd2_journal_dirty_metadata() is called with a buffer_head which didn't have jbd2_journal_get_write_access() called on it, or if the journal_head has the wrong transaction in it. In addition, return an error code. This won't change anything for ocfs2, which will BUG_ON() the non-zero exit code. For ext4, the caller of this function is ext4_handle_dirty_metadata(), and on seeing a non-zero return code, will call __ext4_journal_stop(), which will print the function and line number of the (buggy) calling function and abort the journal. This will allow us to recover instead of bug halting, which is better from a robustness and reliability point of view. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-04 10:18:14 -04:00
Theodore Ts'o	56889787cf	ext4: improve handling of conflicting mount options If the user explicitly specifies conflicting mount options for delalloc or dioread_nolock and data=journal, fail the mount, instead of printing a warning and continuing (since many users won't look at dmesg and notice the warning). Also, print a single warning that data=journal implies that delayed allocation is not on by default (since it's not supported), and furthermore that O_DIRECT is not supported. Improve the text in Documentation/filesystems/ext4.txt so this is clear there as well. Similarly, if the dioread_nolock mount option is specified when the file system block size != PAGE_SIZE, fail the mount instead of printing a warning message and ignoring the mount option. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-03 18:22:38 -04:00
Allison Henderson	2be4751b21	ext4: fix 2nd xfstests 127 punch hole failure This patch fixes a second punch hole bug found by xfstests 127. This bug happens because punch hole needs to flush the pages of the hole to avoid race conditions. But if the end of the hole is in the same page as i_size, the buffer heads beyond i_size need to be unmapped and the page needs to be zeroed after it is flushed. To correct this, the new ext4_discard_partial_page_buffers routine is used to zero and unmap the partial page beyond i_size if the end of the hole appears in the same page as i_size. The code has also been optimized to set the end of the hole to the page after i_size if the specified hole exceeds i_size, and the code that flushes the pages has been simplified. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>	2011-09-03 11:56:52 -04:00
Allison Henderson	ba06208a13	ext4: fix xfstests 75, 112, 127 punch hole failure This patch addresses a bug found by xfstests 75, 112, 127 when blocksize = 1k. This bug happens because the punch hole code only zeros out non block aligned regions of the page. This means that if the blocks are smaller than a page, then the block aligned regions of the page inside the hole are left un-zeroed, and their buffer heads are still mapped. This bug is corrected by using ext4_discard_partial_page_buffers to properly zero the partial page at the head and tail of the hole, and unmap the corresponding buffer heads. This patch also addresses a bug reported by Lukas while working on a new patch to add discard support for loop devices using punch hole. The bug happened because the first and last block numbers needed to be cast to a larger data type before calculating the byte offset, but since now we only need the byte offsets of the pages, we no longer even need to be calculating the byte offsets of the blocks. The code to do the block offset calculations is removed in this patch. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>	2011-09-03 11:55:59 -04:00
Allison Henderson	4e96b2dbbf	ext4: Add new ext4_discard_partial_page_buffers routines This patch adds two new routines: ext4_discard_partial_page_buffers and ext4_discard_partial_page_buffers_no_lock. The ext4_discard_partial_page_buffers routine is a wrapper function to ext4_discard_partial_page_buffers_no_lock. The wrapper function locks the page and passes it to ext4_discard_partial_page_buffers_no_lock. Calling functions that already have the page locked can call ext4_discard_partial_page_buffers_no_lock directly. The ext4_discard_partial_page_buffers_no_lock function zeros a specified range in a page, and unmaps the corresponding buffer heads. Only block aligned regions of the page will have their buffer heads unmapped. Unblock aligned regions will be mapped if needed so that they can be updated with the partial zero out. This function is meant to be used to update a page and its buffer heads to be zeroed and unmapped when the corresponding blocks have been released or will be released. This routine is used in the following scenarios: * A hole is punched and the non page aligned regions of the head and tail of the hole need to be discarded * The file is truncated and the partial page beyond EOF needs to be discarded * The end of a hole is in the same page as EOF. After the page is flushed, the partial page beyond EOF needs to be discarded. * A write operation begins or ends inside a hole and the partial page appearing before or after the write needs to be discarded * A write operation extends EOF and the partial page beyond EOF needs to be discarded This function takes a flag EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED which is used when a write operation begins or ends in a hole. When the EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED flag is used, only buffer heads that are already unmapped will have the corresponding regions of the page zeroed. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-09-03 11:51:09 -04:00
Theodore Ts'o	5930ea6438	ext4: call ext4_handle_dirty_metadata with correct inode in ext4_dx_add_entry ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads that point to directory blocks assigned to the directory inode. However, the function calls ext4_handle_dirty_metadata with the inode of the file that's being added to the directory, not the directory inode itself. Therefore, correct the code to dirty the directory buffers with the directory inode, not the file inode. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-31 12:02:51 -04:00
Darrick J. Wong	f9287c1f2d	ext4: ext4_mkdir should dirty dir_block with newly created directory inode ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir". Unfortunately, dir_block belongs to the newly created directory (which is "inode"), not the parent directory (which is "dir"). Fix the incorrect association. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-31 12:00:51 -04:00
Darrick J. Wong	bcaa992975	ext4: ext4_rename should dirty dir_bh with the correct directory When ext4_rename performs a directory rename (move), dir_bh is a buffer that is modified to update the '..' link in the directory being moved (old_inode). However, ext4_handle_dirty_metadata is called with the old parent directory inode (old_dir) and dir_bh, which is incorrect because dir_bh does not belong to the parent inode. Fix this error. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-31 11:58:51 -04:00
Theodore Ts'o	84ebd79561	ext4: fake direct I/O mode for data=journal Currently attempts to open a file with O_DIRECT in data=journal mode causes the open to fail with -EINVAL. This makes it very hard to test data=journal mode. So we will let the open succeed, but then always fall back to O_DSYNC buffered writes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-31 11:56:51 -04:00
Theodore Ts'o	1cd9f0976a	ext2,ext3,ext4: don't inherit APPEND_FL or IMMUTABLE_FL for new inodes This doesn't make much sense, and it exposes a bug in the kernel where attempts to create a new file in an append-only directory using O_CREAT will fail (but still leave a zero-length file). This was discovered when xfstests #79 was generalized so it could run on all file systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc:stable@kernel.org	2011-08-31 11:54:51 -04:00
Jiaying Zhang	8c0bec2151	ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining The i_mutex lock and flush_completed_IO() added by commit `2581fdc810` in ext4_evict_inode() cause lockdep to complain about potential deadlocks in several places. In most/all of these LOCKDEP complaints it looks like it's a false positive, since many of the potential circular locking cases can't take place by the time the ext4_evict_inode() is called; but since at the very least it may mask real problems, we need to address this. This change removes the flush_completed_IO() and i_mutex lock in ext4_evict_inode(). Instead, we take a different approach to resolve the software lockup that commit `2581fdc810` intends to fix. Rather than having ext4-dio-unwritten thread wait for grabbing the i_mutex lock of an inode, we use mutex_trylock() instead, and simply requeue the work item if we fail to grab the inode's i_mutex lock. This should speed up work queue processing in general and also prevent the following deadlock scenario: During page fault, shrink_icache_memory is called that in turn evicts another inode B. Inode B has some pending io_end work so it calls ext4_ioend_wait() that waits for inode B's i_ioend_count to become zero. However, inode B's ioend work was queued behind some of inode A's ioend work on the same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten thread on that cpu is processing inode A's ioend work, it tries to grab inode A's i_mutex lock. Since the i_mutex lock of inode A is still held before the page fault happened, we enter a deadlock. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-31 11:50:51 -04:00
Linus Torvalds	c063d8a60f	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: flush any pending end_io requests before DIO reads w/dioread_nolock ext4: fix nomblk_io_submit option so it correctly converts uninit blocks ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN. ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode ext4: Fix ext4_should_writeback_data() for no-journal mode	2011-08-21 06:59:41 -07:00
Jiaying Zhang	dccaf33fa3	ext4: flush any pending end_io requests before DIO reads w/dioread_nolock There is a race between ext4 buffer write and direct_IO read with dioread_nolock mount option enabled. The problem is that we clear PageWriteback flag during end_io time but will do uninitialized-to-initialized extent conversion later with dioread_nolock. If an O_direct read request comes in during this period, ext4 will return zero instead of the recently written data. This patch checks whether there are any pending uninitialized-to-initialized extent conversion requests before doing O_direct read to close the race. Note that this is just a bandaid fix. The fundamental issue is that we clear PageWriteback flag before we really complete an IO, which is problem-prone. To fix the fundamental issue, we may need to implement an extent tree cache that we can use to look up pending to-be-converted extents. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-19 19:13:32 -04:00
Theodore Ts'o	9dd75f1f1a	ext4: fix nomblk_io_submit option so it correctly converts uninit blocks Bug discovered by Jan Kara: Finally, commit `1449032be1` returned back the old IO submission code but apparently it forgot to return the old handling of uninitialized buffers so we unconditionally call block_write_full_page() without specifying end_io function. So AFAICS we never convert unwritten extents to written in some cases. For example when I mount the fs as: mount -t ext4 -o nomblk_io_submit,dioread_nolock /dev/ubdb /mnt and do int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600); char buf[1024]; memset(buf, 'a', sizeof(buf)); fallocate(fd, 0, 0, 16384); write(fd, buf, sizeof(buf)); I get a file full of zeros (after remounting the filesystem so that pagecache is dropped) instead of seeing the first KB contain 'a's. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-13 12:58:21 -04:00
Tao Ma	32c80b32c0	ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN. EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should be done simultaneously since ext4_end_io_nolock always clears the flag and decreases the counter at the same time. We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so it will go negative and cause some process to wait forever. Part of the patch came from Eric in his e-mail, but it doesn't fix the problem met by Michael actually. http://marc.info/?l=linux-ext4&m=131316851417460&w=2 Reported-and-Tested-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-13 12:30:59 -04:00
Jiaying Zhang	2581fdc810	ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode Flush inode's i_completed_io_list before calling ext4_io_wait to prevent the following deadlock scenario: A page fault happens while some process is writing inode A. During page fault, shrink_icache_memory is called that in turn evicts another inode B. Inode B has some pending io_end work so it calls ext4_ioend_wait() that waits for inode B's i_ioend_count to become zero. However, inode B's ioend work was queued behind some of inode A's ioend work on the same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten thread on that cpu is processing inode A's ioend work, it tries to grab inode A's i_mutex lock. Since the i_mutex lock of inode A is still held before the page fault happened, we enter a deadlock. Also moves ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode() to ext4_evict_inode(). During inode deletion, ext4_evict_inode() is called before ext4_destroy_inode() and in ext4_evict_inode(), we may call ext4_truncate() without holding i_mutex lock. As a result, there is a race between flush_completed_IO that is called from ext4_ext_truncate() and ext4_end_io_work, which may cause corruption on an io_end structure. This change moves ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode() to ext4_evict_inode() to resolve the race between ext4_truncate() and ext4_end_io_work during inode deletion. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-13 12:17:13 -04:00
Curt Wohlgemuth	441c850857	ext4: Fix ext4_should_writeback_data() for no-journal mode ext4_should_writeback_data() had an incorrect sequence of tests to determine if it should return 0 or 1: in particular, even in no-journal mode, 0 was being returned for a non-regular-file inode. This meant that, in non-journal mode, we would use ext4_journalled_aops for directories, symlinks, and other non-regular files. However, calling journalled aop callbacks when there is no valid handle, can cause problems. This would cause a kernel crash with Jan Kara's commit `2d859db3e4` ("ext4: fix data corruption in inodes with journalled data"), because we now dereference 'handle' in ext4_journalled_write_end(). I also added BUG_ONs to check for a valid handle in the obviously journal-only aops callbacks. I tested this running xfstests with a scratch device in these modes: - no-journal - data=ordered - data=writeback - data=journal All work fine; the data=journal run has many failures and a crash in xfstests 074, but this is no different from a vanilla kernel. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-13 11:25:18 -04:00
Eric Sandeen	8c20871998	ext4: Properly count journal credits for long symlinks Commit `df5e622340` ("ext4: fix deadlock in ext4_symlink() in ENOSPC conditions") recalculated the number of credits needed for a long symlink, in the process of splitting it into two transactions. However, the first credit calculation under-counted because if selinux is enabled, credits are needed to create the selinux xattr as well. Overrunning the reservation will result in an OOPS in jbd2_journal_dirty_metadata() due to this assert: J_ASSERT_JH(jh, handle->h_buffer_credits > 0); Fix this by increasing the reservation size. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-11 17:23:40 -07:00
Mathias Krause	db9481c047	ext4: use kzalloc in ext4_kzalloc() Commit 9933fc0i (ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()) introduced wrappers around k*alloc/vmalloc but introduced a typo for ext4_kzalloc() by not using kzalloc() but kmalloc(). Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-03 14:57:11 -04:00
Linus Torvalds	60ad446682	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits) ext4: prevent memory leaks from ext4_mb_init_backend() on error path ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode # ext4: use ext4_msg() instead of printk in mballoc ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree() ext4: use the correct error exit path in ext4_init_inode_table() ext4: add missing kfree() on error return path in add_new_gdb() ext4: change umode_t in tracepoint headers to be an explicit __u16 ext4: fix races in ext4_sync_parent() ext4: Fix overflow caused by missing cast in ext4_fallocate() ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole ext4: simplify parameters of reserve_backup_gdb() ext4: simplify parameters of add_new_gdb() ext4: remove lock_buffer in bclean() and setup_new_group_blocks() ext4: simplify journal handling in setup_new_group_blocks() ext4: let setup_new_group_blocks() set multiple bits at a time ext4: fix a typo in ext4_group_extend() ext4: let ext4_group_add_blocks() handle 0 blocks quickly ext4: let ext4_group_add_blocks() return an error code ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks() ... Fix up conflict in fs/ext4/inode.c: commit `aacfc19c62` ("fs: simplify the blockdev_direct_IO prototype") had changed the ext4_ind_direct_IO() function for the new simplified calling convention, while commit `dae1e52cb1` ("ext4: move ext4_ind_* functions from inode.c to indirect.c") moved the function to another file.	2011-08-01 13:56:03 -10:00
Yu Jian	79a77c5ac3	ext4: prevent memory leaks from ext4_mb_init_backend() on error path In ext4_mb_init(), if the s_locality_group allocation fails it will currently cause the allocations made in ext4_mb_init_backend() to be leaked. Moving the ext4_mb_init_backend() allocation after the s_locality_group allocation avoids that problem. Signed-off-by: Yu Jian <yujian@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-01 17:41:46 -04:00
Yu Jian	48e6061bf4	ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode # Signed-off-by: Yu Jian <yujian@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-01 17:41:39 -04:00
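
Commit 27baebb849 above describes the new default group preallocation size as "2 megabytes or 32 clusters, whichever is larger". The arithmetic can be sketched as follows. This is only an illustration of the rule as the commit message states it, not the kernel's code: the function name and the byte-based parameter are assumptions made for the example.

```python
TWO_MEGABYTES = 2 * 1024 * 1024
MIN_PREALLOC_CLUSTERS = 32


def default_group_prealloc_clusters(cluster_size_bytes: int) -> int:
    """Return a default group preallocation size, in clusters.

    Hypothetical helper: takes the cluster size in bytes and applies the
    "2 megabytes or 32 clusters, whichever is larger" rule from the
    commit message.
    """
    if cluster_size_bytes <= 0:
        raise ValueError("cluster size must be positive")
    # How many whole clusters fit in 2 megabytes.
    clusters_in_two_mb = TWO_MEGABYTES // cluster_size_bytes
    # Never go below the 32-cluster floor.
    return max(clusters_in_two_mb, MIN_PREALLOC_CLUSTERS)
```

With 4 KiB clusters this gives 512, which matches the previous fixed default of 512 blocks/clusters. With 64 KiB clusters the two limits coincide at 32, and for anything larger the 32-cluster floor applies.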

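Commit 8c0bec2151 above replaces a blocking wait on an inode's i_mutex with mutex_trylock() and requeues the work item when the lock is busy. A minimal user-space sketch of that try-lock-and-requeue pattern is shown here. It is not the kernel implementation: `threading.Lock` stands in for i_mutex, a `deque` stands in for the workqueue, and `process_io_end_queue` and `max_passes` are hypothetical names introduced for the illustration.

```python
import threading
from collections import deque


def process_io_end_queue(queue, max_passes=100):
    """Process (lock, work) items without ever blocking on a lock.

    Each item is a pair of a threading.Lock and a zero-argument callable.
    If the lock cannot be taken immediately, the item is put back at the
    end of the queue so that later items can still make progress.
    max_passes bounds the loop for this single-threaded sketch.
    """
    completed = []
    passes = 0
    while queue and passes < max_passes:
        passes += 1
        lock, work = queue.popleft()
        # Non-blocking attempt, analogous to a trylock.
        if not lock.acquire(blocking=False):
            queue.append((lock, work))  # requeue instead of waiting
            continue
        try:
            completed.append(work())
        finally:
            lock.release()
    return completed
```

In this sketch, work queued behind an item whose lock is held still completes, which is the property the commit relies on to avoid the deadlock between inode A's and inode B's ioend work.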

1349 Commits