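How to clear failed actions from pcs status in a cluster
In this article I will share the commands to clear failed actions from the "pcs status" output of a High Availability Pacemaker cluster.
When a resource in the cluster fails to start, "pcs status" often records failed actions. These failed actions keep appearing in the "pcs status" output even after the resource has started successfully.
In such cases, we can clear the failed actions from "pcs status".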
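Problem: Clearing failed action messages from pcs status
Below is sample "pcs status" output from my KVM High Availability cluster. It shows two types of failed actions:
Failed Resource Actions
Failed Fencing Actions
To check the cluster status: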
[root@centos8-2 ~]# pcs status
Cluster name: ha-cluster
Stack: corosync
Current DC: centos8-3 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Sat Jan 2 14:38:27 2017
Last change: Sat Jan 2 14:38:23 2017 by root via cibadmin on centos8-2

3 nodes configured
4 resources configured

Online: [ centos8-2 centos8-3 centos8-4 ]

Full list of resources:

 fence-centos8-3 (stonith:fence_xvm): Started centos8-3
 fence-centos8-2 (stonith:fence_xvm): Started centos8-2
 ClusterIP (ocf::heartbeat:IPaddr2): Started centos8-4
 fence-centos8-4 (stonith:fence_xvm): Started centos8-3

Failed Resource Actions:
* fence-centos8-2_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=122, status=Timed Out, exitreason='', last-rc-change='Sat Jan 2 14:36:16 2017', queued=1ms, exec=20012ms
* fence-centos8-4_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=124, status=Timed Out, exitreason='', last-rc-change='Sat Jan 2 14:36:36 2017', queued=0ms, exec=20011ms

Failed Fencing Actions:
* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3, last-failed='Sat Jan 2 14:37:17 2017'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3, last-failed='Fri Jan 1 20:57:41 2017'

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
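My resources and fencing resources have now started successfully, so there is no need to keep these failed action messages.
The commands for clearing failed resource actions and failed fencing actions are different.
Solution: Clean up failed resource actions
To clear the failed action messages listed under "Failed Resource Actions", use pcs resource cleanup <resource>. We can get the resource name from the "Failed Resource Actions" output.
Below is the relevant part of my "pcs status" output: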
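Since the two failure types are cleared with different commands, it can help to see at a glance how many entries of each type are present. The following is a minimal sketch, assuming the "pcs status" layout shown in this article (entries are "* ..." lines under each "Failed ... Actions:" header). The function name count_failed_actions is hypothetical and not part of pcs.

```shell
#!/bin/sh
# Count the entries under each "Failed ... Actions:" header in `pcs status`
# output read from stdin. count_failed_actions is a hypothetical helper name.
count_failed_actions() {
    awk '
        /^Failed Resource Actions:/ { section = "resource"; next }
        /^Failed Fencing Actions:/  { section = "fencing";  next }
        # Entries are the "* ..." lines that follow a header.
        section != "" && /^[ \t]*\*/ { count[section]++; next }
        # Any other non-blank line ends the current section.
        NF > 0 { section = "" }
        END { printf "resource=%d fencing=%d\n", count["resource"], count["fencing"] }
    '
}

# Usage on a cluster node (needs pcs and sufficient privileges):
#   pcs status | count_failed_actions
```

For the status output shown above this would report two resource entries and two fencing entries.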
* fence-centos8-2_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=122, status=Timed Out, exitreason='', last-rc-change='Sat Jan 2 14:36:16 2017', queued=1ms, exec=20012ms
* fence-centos8-4_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=124, status=Timed Out, exitreason='', last-rc-change='Sat Jan 2 14:36:36 2017', queued=0ms, exec=20011ms
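Here the resource names are fence-centos8-2 and fence-centos8-4. We can also check the names with pcs resource status (or pcs stonith status for fence devices such as these).
So, to clean up the failed action messages for fence-centos8-2: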
[root@centos8-2 ~]# pcs resource cleanup fence-centos8-2
Cleaned up fence-centos8-2 on centos8-4
Cleaned up fence-centos8-2 on centos8-3
Cleaned up fence-centos8-2 on centos8-2
Waiting for 1 reply from the controller. OK
Similarly, clean up the failed action messages for the fence-centos8-4 resource:
[root@centos8-2 ~]# pcs resource cleanup fence-centos8-4
Cleaned up fence-centos8-4 on centos8-4
Cleaned up fence-centos8-4 on centos8-3
Cleaned up fence-centos8-4 on centos8-2
Waiting for 1 reply from the controller. OK
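When several resources have failed entries, the resource names can be derived from the "Failed Resource Actions" lines rather than copied by hand. The sketch below assumes the entry format shown in this article, where the second field is <resource>_<action>_<interval> (for example fence-centos8-2_start_0). Other Pacemaker versions may print these lines differently. The helper names failed_resources and print_resource_cleanups are hypothetical, and the script only prints the cleanup commands so they can be reviewed before being run.

```shell
#!/bin/sh
# List the resource names that appear under "Failed Resource Actions:" in
# `pcs status` output read from stdin (hypothetical helper name).
failed_resources() {
    awk '
        /^Failed Resource Actions:/ { in_section = 1; next }
        in_section && /^[ \t]*\*/ {
            key = $2    # e.g. fence-centos8-2_start_0
            # Drop the trailing _<action>_<interval> to leave the resource name.
            sub(/_(migrate_to|migrate_from|[a-z]+)_[0-9]+$/, "", key)
            print key
            next
        }
        # Any other non-blank line ends the section.
        in_section && NF > 0 { in_section = 0 }
    ' | sort -u
}

# Print one cleanup command per failed resource instead of running it.
print_resource_cleanups() {
    failed_resources | while read -r name; do
        printf 'pcs resource cleanup %s\n' "$name"
    done
}

# Usage on a cluster node:
#   pcs status | print_resource_cleanups
```

The pcs man page also documents that running pcs resource cleanup without a resource id cleans up all resources; it is worth confirming this with pcs resource cleanup --help on your version before relying on it.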
After the cleanup, check the cluster status:
[root@centos8-2 ~]# pcs status
Cluster name: ha-cluster
Stack: corosync
Current DC: centos8-3 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Sat Jan 2 14:39:19 2017
Last change: Sat Jan 2 14:39:17 2017 by hacluster via crmd on centos8-4

3 nodes configured
4 resources configured

Online: [ centos8-2 centos8-3 centos8-4 ]

Full list of resources:

 fence-centos8-3 (stonith:fence_xvm): Started centos8-3
 fence-centos8-2 (stonith:fence_xvm): Started centos8-2
 ClusterIP (ocf::heartbeat:IPaddr2): Started centos8-4
 fence-centos8-4 (stonith:fence_xvm): Started centos8-3

Failed Fencing Actions:
* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3, last-failed='Sat Jan 2 14:37:17 2017'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3, last-failed='Fri Jan 1 20:57:41 2017'

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
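We no longer have any "Failed Resource Actions". Next we will clear the failed action messages for fencing.
Solution: Clean up failed fencing actions
"pcs status" still shows failed action messages for fencing. To clear them, we will use pcs stonith history cleanup <node>.
Before running the cleanup, we can use pcs stonith history show <node> to check the full history of the Failed Fencing Actions.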
[root@centos8-2 ~]# pcs stonith history show centos8-2
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan 2 14:36:57 2017
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan 2 14:36:37 2017
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan 2 14:36:17 2017
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan 2 14:37:16 2017
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan 2 14:37:17 2017
0 events found
We can get the node name from the "Failed Fencing Actions" entries in the "pcs status" output:
* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3, last-failed='Sat Jan 2 14:37:17 2017'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3, last-failed='Fri Jan 1 20:57:41 2017'
Run the cleanup to clear the failed fencing action messages:
[root@centos8-2 ~]# pcs stonith history cleanup centos8-2
cleaning up fencing-history for node centos8-2
0 events found

[root@centos8-2 ~]# pcs stonith history cleanup centos8-4
cleaning up fencing-history for node centos8-4
0 events found
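The node names for the fencing cleanup can be pulled from the "Failed Fencing Actions" entries in the same way. This is a rough sketch that assumes each entry has the form "* <action> of <node> failed: ..." as in the output above. The helper names failed_fencing_nodes and print_fencing_cleanups are hypothetical, and the commands are only printed, not executed.

```shell
#!/bin/sh
# List the nodes that appear under "Failed Fencing Actions:" in `pcs status`
# output read from stdin (hypothetical helper name).
failed_fencing_nodes() {
    awk '
        /^Failed Fencing Actions:/ { in_section = 1; next }
        # Assumed entry format: "* <action> of <node> failed: ..."
        in_section && /^[ \t]*\*/ { if ($3 == "of") print $4; next }
        # Any other non-blank line ends the section.
        in_section && NF > 0 { in_section = 0 }
    ' | sort -u
}

# Print one fence-history cleanup command per node instead of running it.
print_fencing_cleanups() {
    failed_fencing_nodes | while read -r node; do
        printf 'pcs stonith history cleanup %s\n' "$node"
    done
}

# Usage on a cluster node:
#   pcs status | print_fencing_cleanups
```

Printing the commands first keeps the step reviewable, since clearing the fence history also removes the record that "pcs stonith history show" would display.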
Now check the Pacemaker cluster status with "pcs status":
[root@centos8-2 ~]# pcs status
Cluster name: ha-cluster
Stack: corosync
Current DC: centos8-3 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Sat Jan 2 14:41:05 2017
Last change: Sat Jan 2 14:39:17 2017 by hacluster via crmd on centos8-4

3 nodes configured
4 resources configured

Online: [ centos8-2 centos8-3 centos8-4 ]

Full list of resources:

 fence-centos8-3 (stonith:fence_xvm): Started centos8-3
 fence-centos8-2 (stonith:fence_xvm): Started centos8-2
 ClusterIP (ocf::heartbeat:IPaddr2): Started centos8-4
 fence-centos8-4 (stonith:fence_xvm): Started centos8-3

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
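We now have no more failed action messages.
Note:
This only clears errors that occurred in the past. If pcs status keeps showing new failures, the fault is still occurring and we must first debug the actual root cause.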
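One way to notice recurring failures after a cleanup is to re-check the status from a script. The sketch below only tests whether either "Failed ... Actions:" header is present in the text it reads; the function name no_failed_actions is hypothetical, and what to do when it fails (for example, inspecting the cluster logs) is left to the operator.

```shell
#!/bin/sh
# Succeed (exit 0) when the `pcs status` text on stdin contains no
# "Failed Resource Actions:" or "Failed Fencing Actions:" section.
# no_failed_actions is a hypothetical helper name.
no_failed_actions() {
    ! grep -Eq '^Failed (Resource|Fencing) Actions:'
}

# Usage on a cluster node, some time after the cleanup:
#   pcs status | no_failed_actions || echo "failed actions are back, investigate the root cause"
```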