How to Clear Failed Actions from pcs status of a Cluster


In this article, I will share the commands to clear failed actions from the "pcs status" output of a High Availability Pacemaker cluster.

Quite often, when a resource fails to start in the cluster, failed actions get logged in "pcs status". Even after the resource has started successfully, these failed actions continue to show up in the "pcs status" output.

So in such a situation we can clear the failed actions from "pcs status".

Problem: clear failed action messages from pcs status

Below is sample "pcs status" output from my KVM High Availability cluster. There are two types of failed actions here:

  • Failed resource actions

  • Failed fencing actions

To check the cluster status:

[root@centos8-2 ~]# pcs status
Cluster name: ha-cluster
Stack: corosync
Current DC: centos8-3 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Sat Jan  2 14:38:27 2017
Last change: Sat Jan  2 14:38:23 2017 by root via cibadmin on centos8-2
3 nodes configured
4 resources configured
Online: [ centos8-2 centos8-3 centos8-4 ]
Full list of resources:
 fence-centos8-3        (stonith:fence_xvm):    Started centos8-3
 fence-centos8-2        (stonith:fence_xvm):    Started centos8-2
 ClusterIP      (ocf::heartbeat:IPaddr2):       Started centos8-4
 fence-centos8-4        (stonith:fence_xvm):    Started centos8-3
Failed Resource Actions:
* fence-centos8-2_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=122, status=Timed Out, exitreason='',
    last-rc-change='Sat Jan  2 14:36:16 2017', queued=1ms, exec=20012ms
* fence-centos8-4_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=124, status=Timed Out, exitreason='',
    last-rc-change='Sat Jan  2 14:36:36 2017', queued=0ms, exec=20011ms
Failed Fencing Actions:
* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Sat Jan  2 14:37:17 2017'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Fri Jan  1 20:57:41 2017'
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Now my resources and fencing resources have started successfully, so there is no need to keep these failed action messages.

The commands to clear the failed actions for resources and for fencing are different.
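For quick reference, these are the two commands we will use; here <resource> and <node> are placeholders for your own resource and node names, and both commands are covered step by step below:

# Clear failed actions of one resource from "Failed Resource Actions"
pcs resource cleanup <resource>

# Clear the failed fencing history of one node from "Failed Fencing Actions"
pcs stonith history cleanup <node>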

Solution: clean up the failed actions of a resource

To clear the failed action messages listed under "Failed Resource Actions", use pcs resource cleanup <resource>. We can get the resource name from the "Failed Resource Actions" message output.

Below is the relevant part of my "pcs status" output:

* fence-centos8-2_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=122, status=Timed Out, exitreason='',
    last-rc-change='Sat Jan  2 14:36:16 2017', queued=1ms, exec=20012ms
* fence-centos8-4_start_0 on centos8-4 'OCF_TIMEOUT' (198): call=124, status=Timed Out, exitreason='',
    last-rc-change='Sat Jan  2 14:36:36 2017', queued=0ms, exec=20011ms

Here the resource names are fence-centos8-2 and fence-centos8-4. We can also verify this with pcs resource status. The annotated breakdown below shows how to read these entries.
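As a rough guide to reading a "Failed Resource Actions" entry (annotations added to the first entry from the output above):

* fence-centos8-2_start_0 on centos8-4 'OCF_TIMEOUT' (198): ...
  |_____________| |_____|    |_______|
   resource name  operation  node where the action failed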

So, to clean up the failed action messages for fence-centos8-2:

[root@centos8-2 ~]# pcs resource cleanup fence-centos8-2
Cleaned up fence-centos8-2 on centos8-4
Cleaned up fence-centos8-2 on centos8-3
Cleaned up fence-centos8-2 on centos8-2
Waiting for 1 reply from the controller. OK

Similarly, clean up the failed action messages for the fence-centos8-4 resource:

[root@centos8-2 ~]# pcs resource cleanup fence-centos8-4
Cleaned up fence-centos8-4 on centos8-4
Cleaned up fence-centos8-4 on centos8-3
Cleaned up fence-centos8-4 on centos8-2
Waiting for 1 reply from the controller. OK
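As a side note, pcs resource cleanup can also be run without a resource name, in which case it cleans up the failed actions of all resources at once (behavior as of pcs 0.10; check pcs resource cleanup --help on your version):

[root@centos8-2 ~]# pcs resource cleanup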

After performing the cleanup, check the cluster status:

[root@centos8-2 ~]# pcs status
Cluster name: ha-cluster
Stack: corosync
Current DC: centos8-3 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Sat Jan  2 14:39:19 2017
Last change: Sat Jan  2 14:39:17 2017 by hacluster via crmd on centos8-4
3 nodes configured
4 resources configured
Online: [ centos8-2 centos8-3 centos8-4 ]
Full list of resources:
 fence-centos8-3        (stonith:fence_xvm):    Started centos8-3
 fence-centos8-2        (stonith:fence_xvm):    Started centos8-2
 ClusterIP      (ocf::heartbeat:IPaddr2):       Started centos8-4
 fence-centos8-4        (stonith:fence_xvm):    Started centos8-3
Failed Fencing Actions:
* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Sat Jan  2 14:37:17 2017'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Fri Jan  1 20:57:41 2017'
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

So now we don't have any "Failed Resource Actions". Next we will clear the failed action messages for fencing.

Solution: clean up failed fencing actions

Now "pcs status" still shows failed action messages for fencing. To clear these, we will use pcs stonith history cleanup <node> (note that this command takes a node name, not a resource name).

But before performing the cleanup, we can check the complete history of failed fencing actions using pcs stonith history show <node>.

[root@centos8-2 ~]# pcs stonith history show centos8-2
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan  2 14:36:57 2017
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan  2 14:36:37 2017
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan  2 14:36:17 2017
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan  2 14:37:16 2017
We failed reboot node centos8-2 on behalf of pacemaker-controld.1548 from centos8-3 at Sat Jan  2 14:37:17 2017
0 events found

We can get the node name from the "Failed Fencing Actions" messages in the "pcs status" output:

* reboot of centos8-2 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Sat Jan  2 14:37:17 2017'
* reboot of centos8-4 failed: delegate=, client=pacemaker-controld.1548, origin=centos8-3,
    last-failed='Fri Jan  1 20:57:41 2017'

Execute the cleanup of the failed fencing action messages:

[root@centos8-2 ~]# pcs stonith history cleanup centos8-2
cleaning up fencing-history for node centos8-2
0 events found
[root@centos8-2 ~]# pcs stonith history cleanup centos8-4
cleaning up fencing-history for node centos8-4
0 events found
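Similarly, pcs stonith history cleanup can be run without a node name, which should clean up the fencing history of all nodes at once (again, verify against pcs stonith history --help on your pcs version):

[root@centos8-2 ~]# pcs stonith history cleanup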

Now check the Pacemaker cluster status using "pcs status":

[root@centos8-2 ~]# pcs status
Cluster name: ha-cluster
Stack: corosync
Current DC: centos8-3 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
Last updated: Sat Jan  2 14:41:05 2017
Last change: Sat Jan  2 14:39:17 2017 by hacluster via crmd on centos8-4
3 nodes configured
4 resources configured
Online: [ centos8-2 centos8-3 centos8-4 ]
Full list of resources:
 fence-centos8-3        (stonith:fence_xvm):    Started centos8-3
 fence-centos8-2        (stonith:fence_xvm):    Started centos8-2
 ClusterIP      (ocf::heartbeat:IPaddr2):       Started centos8-4
 fence-centos8-4        (stonith:fence_xvm):    Started centos8-3
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

So now we don't have any more failed action messages.

Note:

This only clears failures that were encountered in the past. If "pcs status" keeps showing new failed actions, it means the failures are still occurring, and we must first debug the actual root cause.
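A minimal starting point for that debugging, using standard pcs and systemd tooling (adjust resource names and the time window to your setup):

# Show the per-node failure counters of the resources
pcs resource failcount show

# Show the full cluster state, including failed operations and fencing history
pcs status --full

# Check the Pacemaker and Corosync logs for the actual error
journalctl -u pacemaker -u corosync --since "1 hour ago"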