linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-12 23:23:03 +00:00

Author	SHA1	Message	Date
Kent Overstreet	790666c8ac	bcachefs: Improve trans_blocked_journal_reclaim tracepoint include information about the state of the btree key cache Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:34 -04:00
Kent Overstreet	7254555c44	bcachefs: Add hysteresis to waiting on btree key cache flush This helps ensure key cache reclaim isn't contending with threads waiting for the key cache to be helped, and fixes a severe performance bug. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:34 -04:00
Kent Overstreet	e4e49375a8	bcachefs; kill bch2_btree_key_cache_flush() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-01-01 11:47:41 -05:00
Brian Foster	e53d03fe39	bcachefs: don't bump key cache journal seq on nojournal commits fstest generic/388 occasionally reproduces corruptions where an inode has extents beyond i_size. This is a deliberate crash and recovery test, and the post crash+recovery characteristics are usually the same: the inode exists on disk in an early (i.e. just allocated) state based on the journal sequence number associated with the inode. Subsequent inode updates exist in the journal at higher sequence numbers, but the inode hadn't been written back before the associated crash and the post-crash recovery processes a set of journal sequence numbers that doesn't include updates to the inode. In fact, the sequence with the most recent inode key update always happens to be the sequence just before the front of the journal processed by recovery. This last bit is a significant hint that the problem relates to an on-disk journal update of the front of the journal. The root cause of this problem is basically that the inode is updated (multiple times) in-core and in the key cache, each time bumping the key cache sequence number used to control the cache flush. The cache flush skips one or more times, bumping the associated key cache journal pin to the key cache seq value. This has a side effect of holding the inode in memory a bit longer than normal, which helps exacerbate this problem, but is also unsafe in certain cases where the key cache seq may have been updated by a transaction commit that didn't journal the associated key. For example, consider an inode that has been allocated, updated several times in the key cache, journaled, but not yet written back. At this stage, everything should be consistent if the fs happens to crash because the latest update has been journal. Now consider a key update via bch2_extent_update_i_size_sectors() that uses the BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode state, it can have the side effect of bumping ck->seq in bch2_btree_insert_key_cached(). In turn, if a subsequent key cache flush skips due to seq not matching the former, the ck->journal pin is updated to ck->seq even though the most recent key update was not journaled. If this pin happens to reside at the front (tail) of the journal, this means a subsequent journal write can update last_seq to a value beyond that which includes the most recent update to the inode. If this occurs and the fs happens to crash before the inode happens to flush, recovery will see the latest last_seq, fail to recover the inode and leave the inode in the inconsistent state described above. To avoid this problem, skip the key cache seq update on NOJOURNAL commits, except on initial pin add. Pass the insert entry directly to bch2_btree_insert_key_cached() to make the associated flag available and be consistent with btree_insert_key_leaf(). Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:56 -04:00
Kent Overstreet	30ca6ece88	bcachefs: Kill trans->flags Recursive transaction commits are occasionally necessary - in particular, for the upcoming btree write buffer's flush path. This avoids bugs due to trans->flags being accidentally mutated mid-commit, which can cause c->writes refcount leaks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:50 -04:00
Kent Overstreet	a101957649	bcachefs: More style fixes Fixes for various checkpatch errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:45 -04:00
Kent Overstreet	12ce5b7df1	bcachefs: Btree key cache coherency - Updates to non key cache iterators will now be transparently redirected to the key cache for cached btrees. - Except when creating new keys: then the update goes to underlying btree For for iterating over a cached btree to work, we need to ensure that if a key exists in the key cache, it also exists in the btree - otherwise the iterator code will skip past it and not check the key cache. Otherwise, for consistency, all updates should go to the same place - the key cache. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:23 -04:00
Kent Overstreet	d8601afca8	bcachefs: Simplify journal replay With BTREE_ITER_WITH_JOURNAL, there's no longer any restrictions on the order we have to replay keys from the journal in, and we can also start up journal reclaim right away - and delete a bunch of code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:21 -04:00
Kent Overstreet	67e0dd8f0d	bcachefs: btree_path This splits btree_iter into two components: btree_iter is now the externally visible componont, and it points to a btree_path which is now reference counted. This means we no longer have to clone iterators up front if they might be mutated - btree_path can be shared by multiple iterators, and cloned if an iterator would mutate a shared btree_path. This will help us use iterators more efficiently, as well as slimming down the main long lived state in btree_trans, and significantly cleans up the logic for iterator lifetimes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:11 -04:00
Kent Overstreet	78cf784eaa	bcachefs: Further reduce iter->trans usage This is prep work for splitting btree_path out from btree_iter - btree_path will not have a pointer to btree_trans. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:11 -04:00
Kent Overstreet	241e26369e	bcachefs: Don't flush btree writes more aggressively because of btree key cache We need to flush the btree key cache when it's too dirty, because otherwise the shrinker won't be able to reclaim memory - this is done by journal reclaim. But journal reclaim also kicks btree node writes: this meant that btree node writes were getting kicked much too often just because we needed to flush btree key cache keys. This patch splits journal pins into two different lists, and teaches journal reclaim to not flush btree node writes when it only needs to flush key cache keys. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:59 -04:00
Kent Overstreet	c5f51cdd5f	bcachefs: Have journal reclaim thread flush more aggressively This adds a new watermark for the journal reclaim when flushing btree key cache entries - it should try and stay ahead of where foreground threads doing transaction commits will enter direct journal reclaim. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:58 -04:00
Kent Overstreet	331194a230	bcachefs: btree key cache locking improvements The btree key cache mutex was becoming a significant bottleneck - it was mainly used to protect the lists of dirty, clean and freed cached keys. This patch eliminates the dirty and clean lists - instead, when we need to scan for keys to drop from the cache we iterate over the rhashtable, and thus we're able to remove most uses of that lock. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:57 -04:00
Kent Overstreet	e323edd6d3	bcachefs: Fix for spinning in journal reclaim on startup We normally avoid having too many dirty keys in the btree key cache, to ensure that we can always shrink our caches to reclaim memory if needed. But this check was causing us to go into an infinite loop on startup, in the btree insert path before journal reclaim was started. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:50 -04:00
Kent Overstreet	f51e84fe24	bcachefs: Fix btree key cache dirty checks Had a type that meant we were triggering journal reclaim _much_ more aggressively than needed. Also, fix a potential integer overflow. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:49 -04:00
Kent Overstreet	d5425a3b22	bcachefs: Throttle updates when btree key cache is too dirty This is needed to ensure we don't deadlock because journal reclaim and thus memory reclaim isn't making forward progress. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:48 -04:00
Kent Overstreet	8a92e54559	bcachefs: Ensure journal reclaim runs when btree key cache is too dirty Ensuring the key cache isn't too dirty is critical for ensuring that the shrinker can reclaim memory. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:48 -04:00
Kent Overstreet	14ba3706b3	bcachefs: Add a kmem_cache for btree_key_cache objects We allocate a lot of these, and we're seeing sporading OOMs - this will help with tracking those down. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:47 -04:00
Kent Overstreet	8cad3e2f73	bcachefs: Use cached iterators for inode updates This switches inode updates to use cached btree iterators - which should be a nice performance boost, since lock contention on the inodes btree can be a bottleneck on multithreaded workloads. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:45 -04:00
Kent Overstreet	d211b408ab	bcachefs: Fix lock ordering with new btree cache code The code that checks lock ordering was recently changed to go off of the pos of the btree node, rather than the iterator, but the btree cache code didn't update to handle iterators that point to cached bkeys. Oops Also, update various debug code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:41 -04:00
Kent Overstreet	2ca88e5ad9	bcachefs: Btree key cache This introduces a new kind of btree iterator, cached iterators, which point to keys cached in a hash table. The cache also acts as a write cache - in the update path, we journal the update but defer updating the btree until the cached entry is flushed by journal reclaim. Cache coherency is for now up to the users to handle, which isn't ideal but should be good enough for now. These new iterators will be used for updating inodes and alloc info (the alloc and stripes btrees). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:41 -04:00

21 Commits