Active/Passive Cluster With Pacemaker, Corosync and DRBD on CentOS 7: Part 4 – Configure Fencing (STONITH)

The following is part 4 of a 4-part series covering the installation and configuration of Pacemaker, Corosync, Apache, DRBD and a VMware STONITH agent.

Pacemaker Fencing Agents

Fencing is a very important concept in high-availability computer clusters. Unfortunately, because fencing does not offer a visible service to users, it is often neglected.

There are two kinds of fencing: resource level and node level.

Node-level fencing makes sure that a node does not run any resources at all. This is the fencing method that we are going to use in this article: the node will simply be rebooted via vCentre.

Be advised that node-level fencing configuration depends heavily on the environment. Commonly used STONITH devices include remote management services like Dell DRAC and HP iLO (the lights-out devices), uninterruptible power supplies (UPS) and blade control devices.

STONITH (Shoot The Other Node In The Head) is Pacemaker’s fencing implementation.

Fencing Agents Available

To see what packages are available for fencing, run the following command:

[pcmk01]# yum search fence-

You should get a couple of dozen agents listed.

Fencing Agent for APC

For those with an APC network power switch, the fence_apc agent is likely the best bet. It logs into the device via telnet/SSH and reboots a specified outlet.

Although such a configuration is not covered in this article, APC UPS equipment is commonly used and is therefore worth mentioning.
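Purely for illustration (this is not used anywhere in the series), creating such a resource could look roughly like the sketch below; the switch hostname, credentials and outlet numbers are made up, and the full parameter list should be checked with pcs stonith describe fence_apc:

[pcmk01]# pcs stonith create my_apc-fence fence_apc \
 ipaddr=apc-switch.local login="apc-account" passwd="passwd" \
 pcmk_host_map="pcmk01-cr:1;pcmk02-cr:2" \
 op monitor interval=60s

Here pcmk_host_map ties each cluster node name to the power outlet it is plugged into.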

Fencing Agent for VMware

As we run virtual machines on a VMware platform, we are going to use the fence_vmware_soap fencing agent:

[pcmk01]# yum search fence-|grep -i vmware
fence-agents-vmware-soap.x86_64 : Fence agent for VMWare with SOAP API v4.1+

Be sure to install the package on all cluster nodes:

[ALL]# yum install -y fence-agents-vmware-soap

Fencing Agent Script

We usually have to find the correct STONITH agent script; in our case, however, there should be just a single fence agent available:

[pcmk01]# pcs stonith list
fence_vmware_soap - Fence agent for VMWare over SOAP API

Find the parameters associated with the device:

[pcmk01]# pcs stonith describe fence_vmware_soap

It’s always handy to have a list of mandatory parameters:

[pcmk01]# pcs stonith describe fence_vmware_soap|grep required
  port (required): Physical plug number, name of virtual machine or UUID
  ipaddr (required): IP Address or Hostname
  action (required): Fencing Action
  login (required): Login Name

We can check the device’s metadata as well:

[pcmk01]# stonith_admin -M -a fence_vmware_soap

For the configuration of this VMware fencing device we need credentials for vCentre with minimal permissions. Once we have the credentials, we can get a list of the virtual machines available on VMware:

[pcmk01]# fence_vmware_soap --ip vcentre.local --ssl --ssl-insecure --action list \
  --username="vcentre-account" --password="passwd" | grep pcmk
vm-pcmk01,4224b9eb-579c-c0eb-0e85-794a3eee7d26
vm-pcmk02,42240e2f-31a2-3fc1-c4e7-8f22073587ae

Where:

  1. ip – the IP address or hostname of the vCentre,
  2. username – the vCentre username,
  3. password – the vCentre password,
  4. action – the fencing action to use,
  5. ssl – use an SSL connection,
  6. ssl-insecure – use an SSL connection without verifying the fence device’s certificate.

The vCentre account which we use above has “Power On/Power Off” privileges on VMware and is allowed to access all machines that we use in the series.
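Before adding the agent to the cluster, it may be worth confirming that it can query an individual machine as well. A quick sanity check, reusing the VM name returned by the list action above, could look like this (the status action simply reports the power state of the machine passed via --plug):

[pcmk01]# fence_vmware_soap --ip vcentre.local --ssl --ssl-insecure --action status \
  --username="vcentre-account" --password="passwd" --plug="vm-pcmk01"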

Configure Fencing (STONITH)

Get a local copy of the CIB:

[pcmk01]# pcs cluster cib stonith_cfg

Create a new STONITH resource called my_vcentre-fence:

[pcmk01]# pcs -f stonith_cfg stonith create my_vcentre-fence fence_vmware_soap \
 ipaddr=vcentre.local ipport=443 ssl_insecure=1 inet4_only=1 \
 login="vcentre-account" passwd="passwd" \
 action=reboot \
 pcmk_host_map="pcmk01-cr:vm-pcmk01;pcmk02-cr:vm-pcmk02" \
 pcmk_host_check=static-list \
 pcmk_host_list="vm-pcmk01,vm-pcmk02" \
 power_wait=3 op monitor interval=60s

We use pcmk_host_map to map host names to port numbers for devices that do not support host names.

The host names should be the ones used for the Corosync interface! Make sure pcmk_host_map contains the names of the Corosync interfaces; otherwise, if the Corosync interface is down, you may get the following error:

pcmk01 stonith-ng[23308]: notice: my_vcentre-fence can not fence (reboot) pcmk02-cr: static-list
pcmk01 stonith-ng[23308]: notice: Couldn't find anyone to fence (reboot) pcmk02-cr with any device
pcmk01 stonith-ng[23308]: error: Operation reboot of pcmk02-cr by  for [email protected]: No such device
pcmk01 crmd[23312]: notice: Initiating action 47: start my_vcentre-fence_start_0 on pcmk01-cr (local)
pcmk01 crmd[23312]: notice: Stonith operation 6/50:46:0:bf22c892-cf13-44b2-8fc6-67d13c05f4d4: No such device (-19)
pcmk01 crmd[23312]: notice: Stonith operation 6 for pcmk02-cr failed (No such device): aborting transition.
pcmk01 crmd[23312]: notice: Transition aborted: Stonith failed (source=tengine_stonith_callback:695, 0)
pcmk01 crmd[23312]: notice: Peer pcmk02-cr was not terminated (reboot) by  for pcmk01-cr: No such device
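Should the mapping need correcting, the resource does not have to be recreated; it can be amended in place with pcs stonith update, either in the local CIB copy (as below) or on the live cluster once the configuration has been pushed:

[pcmk01]# pcs -f stonith_cfg stonith update my_vcentre-fence \
 pcmk_host_map="pcmk01-cr:vm-pcmk01;pcmk02-cr:vm-pcmk02"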

Enable STONITH and commit the changes:

[pcmk01]# pcs -f stonith_cfg property set stonith-enabled=true
[pcmk01]# pcs cluster cib-push stonith_cfg
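As a quick sanity check, the property can be printed back to confirm that fencing is now enabled cluster-wide:

[pcmk01]# pcs property show stonith-enabled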

Check the STONITH status and review the cluster resources:

[pcmk01]# pcs stonith show
 my_vcentre-fence       (stonith:fence_vmware_soap):    Started pcmk01-cr
[pcmk01]# pcs status
Cluster name: test_webcluster
Last updated: Sun Dec 13 16:24:14 2015          Last change: Sun Dec 13 16:23:42 2015 by root via cibadmin on pcmk01-cr
Stack: corosync
Current DC: pcmk02-cr (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 6 resources configured

Online: [ pcmk01-cr pcmk02-cr ]

Full list of resources:

 Resource Group: my_webresource
     my_VIP     (ocf::heartbeat:IPaddr2):       Started pcmk02-cr
     my_website (ocf::heartbeat:apache):        Started pcmk02-cr
 Master/Slave Set: MyWebClone [my_webdata]
     Masters: [ pcmk02-cr ]
     Slaves: [ pcmk01-cr ]
 my_webfs       (ocf::heartbeat:Filesystem):    Started pcmk02-cr
 my_vcentre-fence       (stonith:fence_vmware_soap):    Started pcmk01-cr

PCSD Status:
  pcmk01-cr: Online
  pcmk02-cr: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Test Fencing

Reboot the second cluster node; make sure that you use the Corosync interface name for this:

[pcmk01]# stonith_admin --reboot pcmk02-cr
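Alternatively, the same test can be driven through pcs, which asks the cluster to fence the named node using the configured STONITH device:

[pcmk01]# pcs stonith fence pcmk02-cr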

We can also test it by killing the Corosync interface on pcmk01 and observing the node being fenced:

[pcmk02]# tail -f /var/log/messages
[pcmk01]# ifdown $(ip ro|grep "172\.16\.21"|awk '{print $3}')
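Once the fenced node has rebooted and rejoined the cluster, the fencing history for that target can be reviewed with stonith_admin (the exact output format depends on the Pacemaker version):

[pcmk01]# stonith_admin --history pcmk02-cr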

19 thoughts on “Active/Passive Cluster With Pacemaker, Corosync and DRBD on CentOS 7: Part 4 – Configure Fencing (STONITH)”

  1. I configured fencing (STONITH) using the fencing agent for VMware exactly as described above.
    My cluster configuration contains two VMs on the same physical server.
    After the configuration was done I ran 'pcs status' and got:
    Master/Slave Set: RADviewClone01 [RADview_data01]
    Masters: [ rvpcmk01-cr ]
    Slaves: [ rvpcmk02-cr ]
    RADview_fs01 (ocf::heartbeat:Filesystem): Started rvpcmk01-cr
    RADview_VIP01 (ocf::heartbeat:IPaddr2): Started rvpcmk01-cr
    oracle_service (ocf::heartbeat:oracle): Started rvpcmk01-cr
    oracle_listener (ocf::heartbeat:oralsnr): Started rvpcmk01-cr
    my_vcentre-fence (stonith:fence_vmware_soap): Started rvpcmk02-cr

    I enabled stonith using the command: pcs property set stonith-enabled=true

    Now I saw behavior where, after a few minutes, my whole (physical) server went down without me doing anything…
    I started my (physical) server again and started my VMs as well, and again after a few minutes the server powered off.

    Why did this kind of behavior happen?
    * In addition, the server powers off improperly and cannot start up without my manual intervention to power off and then power on the whole server.

    • STONITH is used to fence a virtual cluster node, not a physical VMware ESXi host. If a cluster node got powered off, then it’s likely that a split-brain situation occurred. Logs should indicate the reason.

    • After I fixed the resource I created (mistakenly configured with the ESXi IP instead of the vCenter IP), the STONITH/fencing resource started properly (without shutting down my whole server :) ), but after a few minutes it stopped.
      Please advise.

      pcs status shows:
      Master/Slave Set: RADviewClone01 [RADview_data01]
      Masters: [ rvpcmk01-cr ]
      Slaves: [ rvpcmk02-cr ]
      RADview_fs01 (ocf::heartbeat:Filesystem): Started rvpcmk01-cr
      RADview_VIP01 (ocf::heartbeat:IPaddr2): Started rvpcmk01-cr
      oracle_service (ocf::heartbeat:oracle): Started rvpcmk01-cr
      oracle_listener (ocf::heartbeat:oralsnr): Started rvpcmk01-cr
      RADviewServicesResource (ocf::heartbeat:RADviewServices): Started rvpcmk01-cr
      my_vcentre-fence (stonith:fence_vmware_soap): Stopped

      Failed Actions:
      * my_vcentre-fence_start_0 on rvpcmk01-cr 'unknown error' (1): call=61, status=Error, exitreason='none',
      last-rc-change='Tue Apr 4 14:42:56 2017', queued=0ms, exec=2646ms
      * my_vcentre-fence_start_0 on rvpcmk02-cr 'unknown error' (1): call=46, status=Error, exitreason='none',
      last-rc-change='Tue Apr 4 14:42:25 2017', queued=1ms, exec=1562ms

      Cluster corosync.log shows these errors:
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: warning: status_from_rc: Action 50 (my_vcentre-fence_start_0) on rvpcmk02-cr failed (target: 0 vs. rc: 1): Error
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: notice: abort_transition_graph: Transition aborted by operation my_vcentre-fence_start_0 'modify' on rvpcmk02-cr: Event failed | magic=4:1;50:14:0:c6128452-3a7f-4fa0-9a02-f01b334e502e cib=0.221.7 source=match_graph_event:310 complete=false
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: match_graph_event: Action my_vcentre-fence_start_0 (50) confirmed on rvpcmk02-cr (rc=1)
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: update_failcount: Updating failcount for my_vcentre-fence on rvpcmk02-cr after failed start: rc=1 (update=INFINITY, time=1491306176)
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: process_graph_event: Detected action (14.50) my_vcentre-fence_start_0.46=unknown error: failed
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: warning: status_from_rc: Action 50 (my_vcentre-fence_start_0) on rvpcmk02-cr failed (target: 0 vs. rc: 1): Error
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: abort_transition_graph: Transition aborted by operation my_vcentre-fence_start_0 'modify' on rvpcmk02-cr: Event failed | magic=4:1;50:14:0:c6128452-3a7f-4fa0-9a02-f01b334e502e cib=0.221.7 source=match_graph_event:310 complete=false
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: match_graph_event: Action my_vcentre-fence_start_0 (50) confirmed on rvpcmk02-cr (rc=1)
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: update_failcount: Updating failcount for my_vcentre-fence on rvpcmk02-cr after failed start: rc=1 (update=INFINITY, time=1491306176)
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: process_graph_event: Detected action (14.50) my_vcentre-fence_start_0.46=unknown error: failed
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: notice: run_graph: Transition 14 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacema

  2. I tried to run this command and it failed:
    [[email protected] ~]# stonith_admin --reboot rvpcmk01-cr
    Command failed: No route to host

    corosync.log on primary server show:
    Apr 04 16:09:07 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29029] stderr: [ Failed: Unable to obtain correct plug status or plug is not available ]
    Apr 04 16:09:07 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29029] stderr: [ ]
    Apr 04 16:09:07 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29029] stderr: [ ]
    Apr 04 16:09:07 [1403] rvpcmk01 stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_vmware_soap (reboot). remaining timeout is 109
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ InsecureRequestWarning) ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ Failed: Unable to obtain correct plug status or plug is not available ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_vmware_soap (reboot) the maximum number of times (2) allowed
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: error: log_operation: Operation 'reboot' [29110] (call 2 from stonith_admin.29713) for host 'rvpcmk01-cr' with device 'my_vcentre-fence' returned: -201 (Generic Pacemaker error)
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: notice: remote_op_done: Operation reboot of rvpcmk01-cr by for [email protected]: No route to host
    Apr 04 16:09:19 [1407] rvpcmk01 crmd: notice: tengine_stonith_notify: Peer rvpcmk01-cr was not terminated (reboot) by for rvpcmk02-cr: No route to host (ref=6096a1f3-ac9b-4f8c-9500-fb39d1070ef5) by client stonith_admin.29713

    PLEASE ADVISE WHAT IS WRONG…

  3. I have already installed fence_vmware_soap on all nodes. But when I execute "fence_vmware_soap --ip vcentre.local --ssl --ssl-insecure --action list", it returns an error: "unable to connect/login to fencing device". The username and password are the ones I use to log in to the vSphere client, and the same credentials can also log in to the ESXi shell over SSH.
    Does that error relate to what the article said: "For the configuration of this VMware fencing device we need credentials for vCentre with minimal permissions"?
    Please advise me what is wrong… Thank you!

    • It’s possible that the agent cannot connect to the vCentre. I used VMware some years ago but have since migrated to a different virtualisation platform. Unfortunately I’m unable to test/replicate it.

    • If you don’t use vSphere then you can pass the IP of the ESXi server directly. Simply configure fence_vmware_soap to point to the IP address of the ESXi host instead of the vCenter.

    • Oh, I see. Thank you!
      P.S. Thanksgiving is not a national holiday in the UK. Get :)

    • Hi, if you use libvirt you can do fencing by installing and configuring fence-agents-virsh. Proxmox does not use libvirt so you need to utilise fence_pve (comes with the fence-agents package on Debian, also available on GitHub).
