mirror of
https://github.com/torvalds/linux.git
synced 2024-11-10 22:21:40 +00:00
eef983ffea
Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to guarantee the ordering requirements the journal has w.r.t. metadata writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee that the dirty metadata has been written back in place and is stable.

2. we cannot write back dirty metadata until it has been written to the journal and guaranteed to be stable (and hence recoverable) in the journal.

The ordering guarantees of #1 are provided by REQ_PREFLUSH. This causes the journal IO to issue a cache flush and wait for it to complete before issuing the write IO to the journal. Hence all completed metadata IO is guaranteed to be stable before the journal overwrites the old metadata.

The ordering guarantees of #2 are provided by REQ_FUA, which ensures the journal writes do not complete until they are on stable storage. Hence by the time the last journal IO in a checkpoint completes, we know that the entire checkpoint is on stable storage and we can unpin the dirty metadata and allow it to be written back.

This is the mechanism by which ordering was first implemented in XFS way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96 ("Add support for drive write cache flushing") in the xfs-archive tree.

A lot has changed since then. Most notably, we now use delayed logging to checkpoint the filesystem to the journal rather than write each individual transaction to the journal. Cache flushes on journal IO are necessary when individual transactions are wholly contained within a single iclog. However, CIL checkpoints are single transactions that typically span hundreds to thousands of individual journal writes, and so the requirements for device cache flushing have changed.

That is, the ordering rules I state above apply to ordering of atomic transactions recorded in the journal, not to the journal IO itself.
Hence we need to ensure metadata is stable before we start writing a new transaction to the journal (guarantee #1), and we need to ensure the entire transaction is stable in the journal before we start metadata writeback (guarantee #2).

Hence we only need a REQ_PREFLUSH on the journal IO that starts a new journal transaction to provide #1, and it is not needed on any other journal IO done within the context of that journal transaction.

The CIL checkpoint already issues a cache flush before it starts writing to the log, so we no longer need the iclog IO to issue a REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed to xlog_write(), we no longer need to mark the first iclog in the log write with REQ_PREFLUSH for this case. As an added bonus, this ordering mechanism works for both internal and external logs, meaning we can remove the explicit data device cache flushes from the iclog write code when using external logs.

Given the new ordering semantics of commit records for the CIL, we need iclogs containing commit records to issue a REQ_PREFLUSH. We also require unmount records to do this. Hence for both XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark the first iclog being written with REQ_PREFLUSH.

For both commit records and unmount records, we also want them immediately on stable storage, so we also want the iclogs that contain these records to be marked REQ_FUA. That means if a record is split across multiple iclogs, they are all marked REQ_FUA and not just the last one, so that when the transaction is completed all the parts of the record are on stable storage.

And for external logs, unmount records need a pre-write data device cache flush similar to the CIL checkpoint cache pre-flush, as the internal iclog write code does not do this implicitly anymore.

As an optimisation, when the commit record lands in the same iclog in which the journal transaction starts, we don't need to wait for anything and can simply use REQ_FUA to provide guarantee #2. This means that for fsync() heavy workloads, the cache flush behaviour is completely unchanged and there is no degradation in performance as a result of optimising the multi-IO transaction case.

The most notable sign that there is less IO latency on my test machine (nvme SSDs) is that the "noiclogs" rate has dropped substantially. This metric indicates that the CIL push is blocking in xlog_get_iclog_space() waiting for iclog IO completion to occur. With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space() is blocking waiting for log IO. With the changes in this patch, this drops to 1 noiclog event for every 100 iclog writes. Hence it is clear that log IO is completing much faster than it was previously, but it is also clear that for large iclog sizes, this isn't the performance limiting factor on this hardware.

With smaller iclogs (32kB), however, there is a substantial difference. With the cache flush modifications, the journal is now running at over 4000 write IOPS, the journal throughput is largely identical to the 256kB iclogs, and the noiclog event rate stays low at about 1:50 iclog writes. The existing code tops out at about 2500 IOPS as the number of cache flushes dominates performance and latency. The noiclog event rate is about 1:4, and the performance variance is quite large, as the journal throughput can fall to less than half the peak sustained rate when the cache flush rate prevents metadata writeback from keeping up and the log runs out of space and throttles reservations.

As a result:

         logbsize  fsmark create rate  rm -rf
before   32kb      152851+/-5.3e+04    5m28s
patched  32kb      221533+/-1.1e+04    5m24s

before   256kb     220239+/-6.2e+03    4m58s
patched  256kb     228286+/-9.2e+03    5m06s

The rm -rf times are included because I ran them, but the differences are largely noise. This workload is largely metadata read IO latency bound and the changes to the journal cache flushing don't really make any noticeable difference to behaviour apart from a reduction in noiclog events from background CIL pushing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
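
The flag rules described in the commit message can be summarised in a small userspace sketch. This is not the kernel implementation: the `sk_` names, the `struct sk_iclog` fields, and the flag constants are all hypothetical stand-ins chosen for illustration, and the model assumes the CIL push has already issued its own cache flush before the first checkpoint write.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for REQ_PREFLUSH and REQ_FUA. */
#define SK_PREFLUSH	(1u << 0)
#define SK_FUA		(1u << 1)

/* Hypothetical description of what one iclog write contains. */
struct sk_iclog {
	bool has_start;		/* the checkpoint's first write landed here */
	bool has_record;	/* holds (part of) a commit or unmount record */
	bool record_first;	/* first iclog that holds that record */
};

/*
 * Pick the cache control flags for one iclog write, following the
 * rules in the commit message:
 *  - iclogs inside a checkpoint need no flags at all;
 *  - every iclog holding part of a commit/unmount record gets FUA;
 *  - the first iclog of that record also gets PREFLUSH, unless the
 *    checkpoint started in the same iclog, in which case there is
 *    nothing earlier to wait for.
 */
static unsigned int sk_iclog_flags(const struct sk_iclog *ic)
{
	unsigned int flags = 0;

	/* record_first only makes sense if a record is present */
	assert(!ic->record_first || ic->has_record);

	if (ic->has_record) {
		flags |= SK_FUA;
		if (ic->record_first && !ic->has_start)
			flags |= SK_PREFLUSH;
	}
	return flags;
}
```

Under this model, an fsync()-style checkpoint that starts and commits in a single iclog is issued with FUA only, while a large checkpoint issues flag-free writes until its commit record, which carries both flags.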
148 lines
4.2 KiB
C
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (c) 2000-2003,2005 Silicon Graphics, Inc.
 * All Rights Reserved.
 */
#ifndef __XFS_LOG_H__
#define __XFS_LOG_H__

struct xfs_cil_ctx;

struct xfs_log_vec {
	struct xfs_log_vec	*lv_next;	/* next lv in build list */
	int			lv_niovecs;	/* number of iovecs in lv */
	struct xfs_log_iovec	*lv_iovecp;	/* iovec array */
	struct xfs_log_item	*lv_item;	/* owner */
	char			*lv_buf;	/* formatted buffer */
	int			lv_bytes;	/* accounted space in buffer */
	int			lv_buf_len;	/* aligned size of buffer */
	int			lv_size;	/* size of allocated lv */
};

#define XFS_LOG_VEC_ORDERED	(-1)

static inline void *
xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
		uint type)
{
	struct xfs_log_iovec *vec = *vecp;

	if (vec) {
		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
		vec++;
	} else {
		vec = &lv->lv_iovecp[0];
	}

	vec->i_type = type;
	vec->i_addr = lv->lv_buf + lv->lv_buf_len;

	ASSERT(IS_ALIGNED((unsigned long)vec->i_addr, sizeof(uint64_t)));

	*vecp = vec;
	return vec->i_addr;
}

/*
 * We need to make sure the next buffer is naturally aligned for the biggest
 * basic data type we put into it. We already accounted for this padding when
 * sizing the buffer.
 *
 * However, this padding does not get written into the log, and hence we have to
 * track the space used by the log vectors separately to prevent log space hangs
 * due to inaccurate accounting (i.e. a leak) of the used log space through the
 * CIL context ticket.
 */
static inline void
xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
{
	lv->lv_buf_len += round_up(len, sizeof(uint64_t));
	lv->lv_bytes += len;
	vec->i_len = len;
}

static inline void *
xlog_copy_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
		uint type, void *data, int len)
{
	void *buf;

	buf = xlog_prepare_iovec(lv, vecp, type);
	memcpy(buf, data, len);
	xlog_finish_iovec(lv, *vecp, len);
	return buf;
}

/*
 * By comparing each component, we don't have to worry about extra
 * endian issues in treating two 32 bit numbers as one 64 bit number
 */
static inline xfs_lsn_t _lsn_cmp(xfs_lsn_t lsn1, xfs_lsn_t lsn2)
{
	if (CYCLE_LSN(lsn1) != CYCLE_LSN(lsn2))
		return (CYCLE_LSN(lsn1)<CYCLE_LSN(lsn2))? -999 : 999;

	if (BLOCK_LSN(lsn1) != BLOCK_LSN(lsn2))
		return (BLOCK_LSN(lsn1)<BLOCK_LSN(lsn2))? -999 : 999;

	return 0;
}

#define XFS_LSN_CMP(x,y) _lsn_cmp(x,y)

/*
 * Flags to xfs_log_force()
 *
 *	XFS_LOG_SYNC:	Synchronous force in-core log to disk
 */
#define XFS_LOG_SYNC		0x1

/* Log manager interfaces */
struct xfs_mount;
struct xlog_in_core;
struct xlog_ticket;
struct xfs_log_item;
struct xfs_item_ops;
struct xfs_trans;

int	xfs_log_force(struct xfs_mount *mp, uint flags);
int	xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
		int *log_forced);
int	xfs_log_mount(struct xfs_mount *mp,
		struct xfs_buftarg *log_target,
		xfs_daddr_t start_block,
		int num_bblocks);
int	xfs_log_mount_finish(struct xfs_mount *mp);
void	xfs_log_mount_cancel(struct xfs_mount *);
xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
void	xfs_log_space_wake(struct xfs_mount *mp);
int	xfs_log_reserve(struct xfs_mount *mp,
		int length,
		int count,
		struct xlog_ticket **ticket,
		uint8_t clientid,
		bool permanent);
int	xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
void	xfs_log_unmount(struct xfs_mount *mp);
int	xfs_log_force_umount(struct xfs_mount *mp, int logerror);
bool	xfs_log_writable(struct xfs_mount *mp);

struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
void	xfs_log_ticket_put(struct xlog_ticket *ticket);

void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
		xfs_lsn_t *commit_lsn, bool regrant);
void	xlog_cil_process_committed(struct list_head *list);
bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);

void	xfs_log_work_queue(struct xfs_mount *mp);
int	xfs_log_quiesce(struct xfs_mount *mp);
void	xfs_log_clean(struct xfs_mount *mp);
bool	xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t);
bool	xfs_log_in_recovery(struct xfs_mount *);

xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes);

#endif	/* __XFS_LOG_H__ */
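
The comment above xlog_finish_iovec() distinguishes the aligned buffer offset (lv_buf_len) from the bytes actually charged to the log (lv_bytes). The following userspace sketch illustrates that difference. It mirrors the helpers in this header but is not kernel code: every `sk_` name is hypothetical, the bounds assert is stricter than the original, and the payload sizes in `sk_demo()` are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirrors of xfs_log_iovec and xfs_log_vec. */
struct sk_iovec {
	void		*i_addr;
	int		i_len;
	unsigned int	i_type;
};

struct sk_log_vec {
	int		niovecs;	/* number of iovecs available */
	struct sk_iovec	*iovecp;	/* iovec array */
	char		*buf;		/* formatted buffer */
	int		bytes;		/* unpadded bytes, charged to the log */
	int		buf_len;	/* padded offset of the next region */
};

static void *sk_prepare_iovec(struct sk_log_vec *lv, struct sk_iovec **vecp,
			      unsigned int type)
{
	struct sk_iovec *vec = *vecp;

	if (vec) {
		/* sketch-only check: there must be room for one more vector */
		assert((vec - lv->iovecp) + 1 < lv->niovecs);
		vec++;
	} else {
		vec = &lv->iovecp[0];
	}
	vec->i_type = type;
	vec->i_addr = lv->buf + lv->buf_len;
	*vecp = vec;
	return vec->i_addr;
}

static void sk_finish_iovec(struct sk_log_vec *lv, struct sk_iovec *vec, int len)
{
	/* round the offset up to 8 bytes so the next region stays aligned */
	lv->buf_len += (len + 7) & ~7;
	lv->bytes += len;	/* padding is never written to the log */
	vec->i_len = len;
}

/* Copy three regions of 5, 8 and 13 bytes and report both counters. */
static void sk_demo(int *buf_len, int *bytes)
{
	static uint64_t storage[8];	/* 64 bytes, 8-byte aligned */
	static const char data[13] = "abcdefghijkl";
	static const int lens[3] = { 5, 8, 13 };
	struct sk_iovec vecs[3];
	struct sk_iovec *vec = NULL;
	struct sk_log_vec lv = {
		.niovecs = 3, .iovecp = vecs, .buf = (char *)storage,
	};

	for (int i = 0; i < 3; i++) {
		void *dst = sk_prepare_iovec(&lv, &vec, 0);

		memcpy(dst, data, lens[i]);
		sk_finish_iovec(&lv, vec, lens[i]);
	}
	*buf_len = lv.buf_len;
	*bytes = lv.bytes;
}
```

With these payload sizes the padded offset advances by 8 + 8 + 16 bytes while only 5 + 8 + 13 bytes are accounted, which is the gap the comment warns would leak log space if the padded figure were charged to the CIL context ticket.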