linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-13 23:51:39 +00:00

Author	SHA1	Message	Date
David Teigland	4f5957a980	dlm: change list and timer names The old terminology of "toss" and "keep" is no longer an accurate description of the rsb states and lists, so change the names to "inactive" and "active". The old names had also been copied into the scanning code, which is changed back to use the "scan" name. - "active" rsb structs have lkb's attached, and are ref counted. - "inactive" rsb structs have no lkb's attached, are not ref counted. - "scan" list is for rsb's that can be freed after a timeout period. - "slow" lists are for infrequent iterations through active or inactive rsb structs. - inactive rsb structs that are directory records will not be put on the scan list, since they are not freed based on timeouts. - inactive rsb structs that are not directory records will be put on the scan list to be freed, since they are not longer needed. Signed-off-by: David Teigland <teigland@redhat.com>	2024-06-10 15:11:46 -05:00
Alexander Aring	e91313591b	dlm: use rwlock for rsb hash table The conversion to rhashtable introduced a hash table lock per lockspace, in place of per bucket locks. To make this more scalable, switch to using a rwlock for hash table access. The common case fast path uses it as a read lock. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-16 14:45:31 -05:00
Alexander Aring	b1f2381c1a	dlm: drop dlm_scand kthread and use timers Currently the scand kthread acts like a garbage collection for expired rsbs on toss list, to clean them up after a certain timeout. It triggers every couple of seconds and iterates over the toss list while holding ls_rsbtbl_lock for the whole hash bucket iteration. To reduce the amount of time holding ls_rsbtbl_lock, we now handle the disposal of expired rsbs using a per-lockspace timer that expires for the earliest tossed rsb on the lockspace toss queue. This toss queue is ordered according to the rsb res_toss_time with the earliest tossed rsb as the first entry. The toss timer will only trylock() necessary locks, since it is low priority garbage collection, and will rearm the timer if trylock() fails. If the timer function does not find any expired rsb's, it rearms the timer with the next earliest expired rsb. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-16 14:40:27 -05:00
Alexander Aring	93a693d19d	dlm: add rsb lists for iteration To prepare for using rhashtable, add two rsb lists for iterating through rsb's in two uncommon cases where this is necesssary: - when dumping rsb state from debugfs, now using seq_list. - when looking at all rsb's during recovery. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-16 14:33:25 -05:00
Alexander Aring	2d90354027	dlm: merge toss and keep hash table lists into one list There are several places where lock processing can perform two hash table lookups, first in the "keep" list, and if not found, in the "toss" list. This patch introduces a new rsb state flag "RSB_TOSS" to represent the difference between the state of being on keep vs toss list, so that the two lists can be combined. This avoids cases of two lookups. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-16 13:49:13 -05:00
Alexander Aring	dcdaad05ca	dlm: change to single hashtable lock Prepare to replace our own hash table with rhashtable by replacing the per-bucket locks in our own hash table with a single lock. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-16 13:46:41 -05:00
Alexander Aring	578acf9a87	dlm: use spin_lock_bh for message processing Use spin_lock_bh for all spinlocks involved in message processing, in preparation for softirq message processing. DLM lock requests from user space involve dlm processing in user context, in addition to the standard kernel context, necessitating bh variants. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-09 11:45:23 -05:00
Alexander Aring	d52c9b8fef	dlm: convert ls_recv_active from rw_semaphore to rwlock Convert ls_recv_active rw_semaphore to an rwlock to avoid sleeping, in preparation for softirq message processing. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-09 11:44:49 -05:00
Alexander Aring	3ae6776056	dlm: add new struct to save position in dlm_copy_master_names Add a new struct to save the current position in the rsb masters_list while sending the rsb names to other nodes. The rsb names are sent in multiple chunks, and for each new chunk, the new "dlm_dir_dump" struct saves the last position in the masters_list. The new struct is also used to save more information to sanity check the recovery process. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-09 11:44:49 -05:00
Alexander Aring	3a747f4a2e	dlm: move rsb root_list to ls_recover() stack Move the rsb root_list from the lockspace to a stack variable since it is now only used by the ls_recover() function. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-09 11:44:49 -05:00
Alexander Aring	aff46e0f24	dlm: use a new list for recovery of master rsb names Add a new "masters_list" for master rsb structs, with a new rwlock. The new list is created and used during the recovery process to send the master rsb names to new nodes. With this change, the current "root_list" can be used without locking. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-09 11:44:49 -05:00
Alexander Aring	29e345f3c6	dlm: move root_list functionality to recover.c Move dlm_create_root_list() and dlm_release_root_list() to recover.c and declare them static because they are only used there. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-09 11:44:49 -05:00
Alexander Aring	c4f4e135c2	fs: dlm: get recovery sequence number as parameter This patch removes a read of the ls->ls_recover_seq uint64_t number in _create_rcom(). If the ls->ls_recover_seq is readed the ls_recover_lock need to held. However this number was always readed before when any rcom message is received and it's not necessary to read it again from a per lockspace variable to use it for the replying message. This patch will pass the sequence number as parameter so another read of ls->ls_recover_seq and holding the ls->ls_recover_lock is not required. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2023-08-10 10:33:03 -05:00
Alexander Aring	01c7a59789	fs: dlm: remove deprecated code parts This patch removes code parts which was declared deprecated by commit `6b0afc0cc3` ("fs: dlm: don't use deprecated timeout features by default"). This contains the following dlm functionality: - start a cancel of a dlm request did not complete after certain timeout: The current way how dlm cancellation works and interfering with other dlm requests triggered by the user can end in an overlapping and returning in -EBUSY. The most user don't handle this case and are unaware that DLM can return such errno in such situation. Due the timeout the user are mostly unaware when this happens. - start a netlink warning messages for user space if dlm requests did not complete after certain timeout: This feature was never being built in the only known dlm user space side. As we are to remove the timeout cancellation feature we can directly remove this feature as well. There might be the possibility to bring the timeout cancellation feature back. However the current way of handling the -EBUSY case which is only a software limitation and not a hardware limitation should be changed. We minimize the current code base in DLM cancellation feature to not have to deal with those existing features while solving the DLM cancellation feature in general. UAPI define DLM_LSFL_TIMEWARN is commented as deprecated and reserved value. We should avoid at first to give it a new meaning but let possible users still compile by keeping this define. In far future we can give this flag a new meaning. The same for the DLM_LKF_TIMEOUT lock request flag. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2023-03-06 15:49:07 -06:00
Alexander Aring	3182599f5f	fs: dlm: handle recovery result outside of ls_recover This patch cleans up the handling of recovery results by moving it from ls_recover() to the caller do_ls_recovery(). This makes the error handling clearer. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2022-06-24 11:57:48 -05:00
Alexander Aring	682bb91b6b	fs: dlm: make new_lockspace() wait until recovery completes Make dlm_new_lockspace() wait until a full recovery completes sucessfully or fails. Previously, dlm_new_lockspace() returned to the caller after dlm_recover_members() finished, which is only partially through recovery. The result of the previous behavior is that the new lockspace would not be usable for some time (especially with overlapping recoveries), and some errors in the later part of recovery could not be returned to the caller. Kernel callers gfs2 and cluster-md have their own wait handling to wait for recovery to complete after calling dlm_new_lockspace(). This continues to work, but will be unnecessary. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2022-06-24 11:57:47 -05:00
Alexander Aring	ca8031d917	fs: dlm: update comments about recovery and membership handling Make clear that a particular recovery iteration must not be aborted before membership changes are applied to the members list (ls_nodes) and midcomms layer. Interrupting recovery before this can result in missing node-specific changes in midcomms or through lsops. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2022-06-24 11:57:40 -05:00
Alexander Aring	e10249b190	fs: dlm: use dlm_recovery_stopped in condition This patch will change to evaluate the dlm_recovery_stopped() in the condition of the if branch instead fetch it before evaluating the condition. As this is an atomic test-set operation it should be evaluated in the condition itself. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2021-11-02 14:39:20 -05:00
Alexander Aring	aee742c992	fs: dlm: fix return -EINTR on recovery stopped This patch will return -EINTR instead of 1 if recovery is stopped. In case of ping_members() the return value will be checked if the error is -EINTR for signaling another recovery was triggered and the whole recovery process will come to a clean end to process the next one. Returning 1 will abort the recovery process and can leave the recovery in a broken state. It was reported with the following kernel log message attached and a gfs2 mount stopped working: "dlm: bobvirt1: dlm_recover_members error 1" whereas 1 was returned because of a conversion of "dlm_recovery_stopped()" to an errno was missing which this patch will introduce. While on it all other possible missing errno conversions at other places were added as they are done as in other places. It might be worth to check the error case at this recovery level, because some of the functionality also returns -ENOBUFS and check why recovery ends in a broken state. However this will fix the issue if another recovery was triggered at some points of recovery handling. Reported-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2021-08-19 11:33:03 -05:00
Thomas Gleixner	2522fe45a1	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 193 Based on 1 normalized pattern(s): this copyrighted material is made available to anyone wishing to use modify copy or redistribute it subject to the terms and conditions of the gnu general public license v 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 45 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528170027.342746075@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-30 11:29:21 -07:00
Guoqing Jiang	9e1b0211c5	dlm: recheck kthread_should_stop() before schedule() Call schedule() here could make the thread miss wake up from kthread_stop(), so it is better to recheck kthread_should_stop() before call schedule(), a symptom happened when I run indefinite test (which mostly created clustered raid1, assemble it in other nodes, then stop them) of clustered raid. $ ps aux\|grep md\|grep D root 4211 0.0 0.0 19760 2220 ? Ds 02:58 0:00 mdadm -Ssq $ cat /proc/4211/stack kthread_stop+0x4d/0x150 dlm_recoverd_stop+0x15/0x20 [dlm] dlm_release_lockspace+0x2ab/0x460 [dlm] leave+0xbf/0x150 [md_cluster] md_cluster_stop+0x18/0x30 [md_mod] bitmap_free+0x12e/0x140 [md_mod] bitmap_destroy+0x7f/0x90 [md_mod] __md_stop+0x21/0xa0 [md_mod] do_md_stop+0x15f/0x5c0 [md_mod] md_ioctl+0xa65/0x18a0 [md_mod] blkdev_ioctl+0x49e/0x8d0 block_ioctl+0x41/0x50 do_vfs_ioctl+0x96/0x5b0 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x1e/0xad This maybe not resolve the issue completely since the KTHREAD_SHOULD_STOP flag could be set between "break" and "schedule", but at least the chance for the symptom happen could be reduce a lot (The indefinite test runs more than 20 hours without problem and it happens easily without the change). Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: David Teigland <teigland@redhat.com>	2017-09-25 12:48:10 -05:00
tsutomu.owa@toshiba.co.jp	e412f9201d	DLM: fix race condition between dlm_recoverd_stop and dlm_recoverd When dlm_recoverd_stop() is called between kthread_should_stop() and set_task_state(TASK_INTERRUPTIBLE), dlm_recoverd will not wake up. Signed-off-by: Tadashi Miyauchi <miyauchi@toshiba-tops.co.jp> Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp> Signed-off-by: David Teigland <teigland@redhat.com>	2017-09-25 12:45:21 -05:00
David Teigland	075f01775f	dlm: use INFO for recovery messages The log messages relating to the progress of recovery are minimal and very often useful. Change these to the KERN_INFO level so they are always available. Signed-off-by: David Teigland <teigland@redhat.com>	2014-02-14 11:54:44 -06:00
David Teigland	475f230c60	dlm: fix unlock balance warnings The in_recovery rw_semaphore has always been acquired and released by different threads by design. To work around the "BUG: bad unlock balance detected!" messages, adjust things so the dlm_recoverd thread always does both down_write and up_write. Signed-off-by: David Teigland <teigland@redhat.com>	2012-08-08 11:33:49 -05:00
David Teigland	c04fecb4d9	dlm: use rsbtbl as resource directory Remove the dir hash table (dirtbl), and use the rsb hash table (rsbtbl) as the resource directory. It has always been an unnecessary duplication of information. This improves efficiency by using a single rsbtbl lookup in many cases where both rsbtbl and dirtbl lookups were needed previously. This eliminates the need to handle cases of rsbtbl and dirtbl being out of sync. In many cases there will be memory savings because the dir hash table no longer exists. Signed-off-by: David Teigland <teigland@redhat.com>	2012-07-16 14:16:19 -05:00
David Teigland	4875647a08	dlm: fixes for nodir mode The "nodir" mode (statically assign master nodes instead of using the resource directory) has always been highly experimental, and never seriously used. This commit fixes a number of problems, making nodir much more usable. - Major change to recovery: recover all locks and restart all in-progress operations after recovery. In some cases it's not possible to know which in-progess locks to recover, so recover all. (Most require recovery in nodir mode anyway since rehashing changes most master nodes.) - Change the way nodir mode is enabled, from a command line mount arg passed through gfs2, into a sysfs file managed by dlm_controld, consistent with the other config settings. - Allow recovering MSTCPY locks on an rsb that has not yet been turned into a master copy. - Ignore RCOM_LOCK and RCOM_LOCK_REPLY recovery messages from a previous, aborted recovery cycle. Base this on the local recovery status not being in the state where any nodes should be sending LOCK messages for the current recovery cycle. - Hold rsb lock around dlm_purge_mstcpy_locks() because it may run concurrently with dlm_recover_master_copy(). - Maintain highbast on process-copy lkb's (in addition to the master as is usual), because the lkb can switch back and forth between being a master and being a process copy as the master node changes in recovery. - When recovering MSTCPY locks, flag rsb's that have non-empty convert or waiting queues for granting at the end of recovery. (Rename flag from LOCKS_PURGED to RECOVER_GRANT and similar for the recovery function, because it's not only resources with purged locks that need grant a grant attempt.) - Replace a couple of unnecessary assertion panics with error messages. Signed-off-by: David Teigland <teigland@redhat.com>	2012-05-02 14:15:27 -05:00
David Teigland	6d40c4a708	dlm: improve error and debug messages Change some existing error/debug messages to collect more useful information, and add some new error/debug messages to address recently found problems. Signed-off-by: David Teigland <teigland@redhat.com>	2012-04-26 15:41:46 -05:00
David Teigland	60f98d1839	dlm: add recovery callbacks These new callbacks notify the dlm user about lock recovery. GFS2, and possibly others, need to be aware of when the dlm will be doing lock recovery for a failed lockspace member. In the past, this coordination has been done between dlm and file system daemons in userspace, which then direct their kernel counterparts. These callbacks allow the same coordination directly, and more simply. Signed-off-by: David Teigland <teigland@redhat.com>	2012-01-04 08:56:31 -06:00
David Teigland	f95a34c665	dlm: move recovery barrier calls Put all the calls to recovery barriers in the same function to clarify where they each happen. Should not change any behavior. Also modify some recovery debug lines to make them consistent. Signed-off-by: David Teigland <teigland@redhat.com>	2012-01-04 08:53:27 -06:00
David Teigland	23e8e1aaac	dlm: use workqueue for callbacks Instead of creating our own kthread (dlm_astd) to deliver callbacks for all lockspaces, use a per-lockspace workqueue to deliver the callbacks. This eliminates complications and slowdowns from many lockspaces sharing the same thread. Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-15 12:30:43 -05:00
David Teigland	d44e0fc704	dlm: recover nodes that are removed and re-added If a node is removed from a lockspace, and then added back before the dlm is notified of the removal, the dlm will not detect the removal and won't clear the old state from the node. This is fixed by using a list of added nodes so the membership recovery can detect when a newly added node is already in the member list. Signed-off-by: David Teigland <teigland@redhat.com>	2008-04-21 11:18:01 -05:00
David Teigland	85f0379aa0	dlm: keep cached master rsbs during recovery To prevent the master of an rsb from changing rapidly, an unused rsb is kept on the "toss list" for a period of time to be reused. The toss list was being cleared completely for each recovery, which is unnecessary. Much of the benefit of the toss list can be maintained if nodes keep rsb's in their toss list that they are the master of. These rsb's need to be included when the resource directory is rebuilt during recovery. Signed-off-by: David Teigland <teigland@redhat.com>	2008-01-30 11:04:43 -06:00
David Teigland	c36258b592	[DLM] block dlm_recv in recovery transition Introduce a per-lockspace rwsem that's held in read mode by dlm_recv threads while working in the dlm. This allows dlm_recv activity to be suspended when the lockspace transitions to, from and between recovery cycles. The specific bug prompting this change is one where an in-progress recovery cycle is aborted by a new recovery cycle. While dlm_recv was processing a recovery message, the recovery cycle was aborted and dlm_recoverd began cleaning up. dlm_recv decremented recover_locks_count on an rsb after dlm_recoverd had reset it to zero. This is fixed by suspending dlm_recv (taking write lock on the rwsem) before aborting the current recovery. The transitions to/from normal and recovery modes are simplified by using this new ability to block dlm_recv. The switch from normal to recovery mode means dlm_recv goes from processing locking messages, to saving them for later, and vice versa. Races are avoided by blocking dlm_recv when setting the flag that switches between modes. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-10-10 08:56:38 +01:00
David Teigland	3ae1acf93a	[DLM] add lock timeouts and warnings [2/6] New features: lock timeouts and time warnings. If the DLM_LKF_TIMEOUT flag is set, then the request/conversion will be canceled after waiting the specified number of centiseconds (specified per lock). This feature is only available for locks requested through libdlm (can be enabled for kernel dlm users if there's a use for it.) If the new DLM_LSFL_TIMEWARN flag is set when creating the lockspace, then a warning message will be sent to userspace (using genetlink) after a request/conversion has been waiting for a given number of centiseconds (configurable per node). The time warnings will be used in the future to do deadlock detection in userspace. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-07-09 08:22:33 +01:00
David Teigland	8ec6886748	[DLM] change some log_error to log_debug Some common, non-error messages should use log_debug instead of log_error so they can be turned off. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-02-05 13:36:34 -05:00
Ryusuke Konishi	57adf7eede	[DLM] fix format warnings in rcom.c and recoverd.c This fixes the following gcc warnings generated on the architectures where uint64_t != unsigned long long (e.g. ppc64). fs/dlm/rcom.c:154: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'uint64_t' fs/dlm/rcom.c:154: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'uint64_t' fs/dlm/recoverd.c:48: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t' fs/dlm/recoverd.c:202: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t' fs/dlm/recoverd.c:210: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t' Signed-off-by: Ryusuke Konishi <ryusuke@osrg.net> Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2006-11-30 10:37:22 -05:00
David Teigland	2896ee37cc	[DLM] fix add_requestqueue checking nodes list Requests that arrive after recovery has started are saved in the requestqueue and processed after recovery is done. Some of these requests are purged during recovery if they are from nodes that have been removed. We move the purging of the requests (dlm_purge_requestqueue) to later in the recovery sequence which allows the routine saving requests (dlm_add_requestqueue) to avoid filtering out requests by nodeid since the same will be done by the purge. The current code has add_requestqueue filtering by nodeid but doesn't hold any locks when accessing the list of current nodes. This also means that we need to call the purge routine when the lockspace is being shut down since the add routine will not be rejecting requests itself any more. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2006-11-30 10:37:00 -05:00
David Teigland	4b77f2c93d	[DLM] do full recover_locks barrier Red Hat BZ 211914 The previous patch "[DLM] fix aborted recovery during node removal" was incomplete as discovered with further testing. It set the bit for the RS_LOCKS barrier but did not then wait for the barrier. This is often ok, but sometimes it will cause yet another recovery hang. If it's a new node that also has the lowest nodeid that skips the barrier wait, then it misses the important step of collecting and reporting the barrier status from the other nodes (which is the job of the low nodeid in the barrier wait routine). Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2006-11-30 10:35:24 -05:00
David Teigland	2cdc98aaf0	[DLM] fix stopping unstarted recovery Red Hat BZ 211914 When many nodes are joining a lockspace simultaneously, the dlm gets a quick sequence of stop/start events, a pair for adding each node. dlm_controld in user space sends dlm_recoverd in the kernel each stop and start event. dlm_controld will sometimes send the stop before dlm_recoverd has had a chance to take up the previously queued start. The stop aborts the processing of the previous start by setting the RECOVERY_STOP flag. dlm_recoverd is erroneously clearing this flag and ignoring the stop/abort if it happens to take up the start after the stop meant to abort it. The fix is to check the sequence number that's incremented for each stop/start before clearing the flag. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2006-11-30 10:35:16 -05:00
David Teigland	91c0dc93a1	[DLM] fix aborted recovery during node removal Red Hat BZ 211914 With the new cluster infrastructure, dlm recovery for a node removal can be aborted and restarted for a node addition. When this happens, the restarted recovery isn't aware that it's doing recovery for the earlier removal as well as the addition. So, it then skips the recovery steps only required when nodes are removed. This can result in locks not being purged for failed/removed nodes. The fix is to check for removed nodes for which recovery has not been completed at the start of a new recovery sequence. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2006-11-30 10:35:13 -05:00
David Teigland	5f88f1ea16	[DLM] add new lockspace to list ealier When a new lockspace was being created, the recoverd thread was being started for it before the lockspace was added to the global list of lockspaces. The new thread was looking up the lockspace in the global list and sometimes not finding it due to the race with the original thread adding it to the list. We need to add the lockspace to the global list before starting the thread instead of after, and if the new thread can't find the lockspace for some reason, it should return an error. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2006-08-25 10:02:53 -04:00
David Teigland	f6db1b8e72	[DLM] abort recovery more quickly When we abort one recovery to do another, break out of the ping_members() routine more quickly, and wake up the dlm_recoverd thread more quickly instead of waiting for it to time out. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2006-08-09 09:45:31 -04:00
David Teigland	901359256b	[DLM] Update DLM to the latest patch level Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>	2006-01-20 08:47:07 +00:00
David Teigland	e7fd41792f	[DLM] The core of the DLM for GFS2/CLVM This is the core of the distributed lock manager which is required to use GFS2 as a cluster filesystem. It is also used by CLVM and can be used as a standalone lock manager independantly of either of these two projects. It implements VAX-style locking modes. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>	2006-01-18 09:30:29 +00:00

44 Commits