Active/Passive Cluster With Pacemaker, Corosync and DRBD on CentOS 7: Part 4 – Configure Fencing (STONITH)

The following is part 4 of a 4-part series covering the installation and configuration of Pacemaker, Corosync, Apache, DRBD and a VMware STONITH agent.

Pacemaker Fencing Agents

Fencing is a very important concept in high-availability computer clusters. Unfortunately, because fencing does not offer a visible service to users, it is often neglected.

There are two kinds of fencing: resource level and node level.

Node-level fencing makes sure that a node does not run any resources at all. This is the fencing method that we are going to use in this article: the node will simply be rebooted via vCentre.

Be advised that node-level fencing configuration depends heavily on the environment. Commonly used STONITH devices include remote management services like Dell DRAC and HP iLO (the lights-out devices), uninterruptible power supplies (UPS) and blade control devices.

STONITH (Shoot The Other Node In The Head) is Pacemaker’s fencing implementation.

Fencing Agents Available

To see what packages are available for fencing, run the following command:

[pcmk01]# yum search fence-

You should get a couple of dozen agents listed.

Fencing Agent for APC

For those with an APC network power switch, the fence_apc agent is likely the best bet. It logs into the device via telnet/SSH and reboots a specified outlet.

Although such a configuration is not covered in this article, APC UPS equipment is commonly used and is therefore worth mentioning.
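Purely for illustration (this is not used anywhere in the series), creating such a resource could look roughly like the sketch below; the switch hostname, credentials and outlet numbers are made up, and the full parameter list should be checked with pcs stonith describe fence_apc:

[pcmk01]# pcs stonith create my_apc-fence fence_apc \
 ipaddr=apc-switch.local login="apc-account" passwd="passwd" \
 pcmk_host_map="pcmk01-cr:1;pcmk02-cr:2" \
 op monitor interval=60s

Here pcmk_host_map ties each cluster node name to the power outlet it is plugged into.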

Fencing Agent for VMware

As we run virtual machines on a VMware platform, we are going to use the fence_vmware_soap fencing agent:

[pcmk01]# yum search fence-|grep -i vmware
fence-agents-vmware-soap.x86_64 : Fence agent for VMWare with SOAP API v4.1+

Be sure to install the package on all cluster nodes:

[ALL]# yum install -y fence-agents-vmware-soap

Fencing Agent Script

We usually have to find the correct STONITH agent script; in our case, however, there should be just a single fence agent available:

[pcmk01]# pcs stonith list
fence_vmware_soap - Fence agent for VMWare over SOAP API

Find the parameters associated with the device:

[pcmk01]# pcs stonith describe fence_vmware_soap

It’s always handy to have a list of mandatory parameters:

[pcmk01]# pcs stonith describe fence_vmware_soap|grep required
  port (required): Physical plug number, name of virtual machine or UUID
  ipaddr (required): IP Address or Hostname
  action (required): Fencing Action
  login (required): Login Name

We can check the device’s metadata as well:

[pcmk01]# stonith_admin -M -a fence_vmware_soap

For the configuration of this VMware fencing device we need credentials for vCentre with minimal permissions. Once we have the credentials, we can get a list of the virtual machines available on VMware:

[pcmk01]# fence_vmware_soap --ip vcentre.local --ssl --ssl-insecure --action list \
  --username="vcentre-account" --password="passwd" | grep pcmk
vm-pcmk01,4224b9eb-579c-c0eb-0e85-794a3eee7d26
vm-pcmk02,42240e2f-31a2-3fc1-c4e7-8f22073587ae

Where:

  1. ip – the IP address or hostname of the vCentre,
  2. username – the vCentre username,
  3. password – the vCentre password,
  4. action – the fencing action to use,
  5. ssl – use an SSL connection,
  6. ssl-insecure – use an SSL connection without verifying the fence device’s certificate.

The vCentre account which we use above has “Power On/Power Off” privileges on VMware and is allowed to access all machines that we use in the series.
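Before adding the agent to the cluster, it may be worth confirming that it can query an individual machine as well. A quick sanity check, reusing the VM name returned by the list action above, could look like this (the status action simply reports the power state of the machine passed via --plug):

[pcmk01]# fence_vmware_soap --ip vcentre.local --ssl --ssl-insecure --action status \
  --username="vcentre-account" --password="passwd" --plug="vm-pcmk01"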

Configure Fencing (STONITH)

Get a local copy of the CIB:

[pcmk01]# pcs cluster cib stonith_cfg

Create a new STONITH resource called my_vcentre-fence:

[pcmk01]# pcs -f stonith_cfg stonith create my_vcentre-fence fence_vmware_soap \
 ipaddr=vcentre.local ipport=443 ssl_insecure=1 inet4_only=1 \
 login="vcentre-account" passwd="passwd" \
 action=reboot \
 pcmk_host_map="pcmk01-cr:vm-pcmk01;pcmk02-cr:vm-pcmk02" \
 pcmk_host_check=static-list \
 pcmk_host_list="vm-pcmk01,vm-pcmk02" \
 power_wait=3 op monitor interval=60s

We use pcmk_host_map to map host names to port numbers for devices that do not support host names.

The host names should be the ones used for the Corosync interface! Make sure pcmk_host_map contains the names of the Corosync interfaces; otherwise, if the Corosync interface is down, you may get the following error:

pcmk01 stonith-ng[23308]: notice: my_vcentre-fence can not fence (reboot) pcmk02-cr: static-list
pcmk01 stonith-ng[23308]: notice: Couldn't find anyone to fence (reboot) pcmk02-cr with any device
pcmk01 stonith-ng[23308]: error: Operation reboot of pcmk02-cr by  for [email protected]: No such device
pcmk01 crmd[23312]: notice: Initiating action 47: start my_vcentre-fence_start_0 on pcmk01-cr (local)
pcmk01 crmd[23312]: notice: Stonith operation 6/50:46:0:bf22c892-cf13-44b2-8fc6-67d13c05f4d4: No such device (-19)
pcmk01 crmd[23312]: notice: Stonith operation 6 for pcmk02-cr failed (No such device): aborting transition.
pcmk01 crmd[23312]: notice: Transition aborted: Stonith failed (source=tengine_stonith_callback:695, 0)
pcmk01 crmd[23312]: notice: Peer pcmk02-cr was not terminated (reboot) by  for pcmk01-cr: No such device
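Should the mapping need correcting, the resource does not have to be recreated; it can be amended in place with pcs stonith update, either in the local CIB copy (as below) or on the live cluster once the configuration has been pushed:

[pcmk01]# pcs -f stonith_cfg stonith update my_vcentre-fence \
 pcmk_host_map="pcmk01-cr:vm-pcmk01;pcmk02-cr:vm-pcmk02"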

Enable STONITH and commit the changes:

[pcmk01]# pcs -f stonith_cfg property set stonith-enabled=true
[pcmk01]# pcs cluster cib-push stonith_cfg
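As a quick sanity check, the property can be printed back to confirm that fencing is now enabled cluster-wide:

[pcmk01]# pcs property show stonith-enabled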

Check the STONITH status and review the cluster resources:

[pcmk01]# pcs stonith show
 my_vcentre-fence       (stonith:fence_vmware_soap):    Started pcmk01-cr
[pcmk01]# pcs status
Cluster name: test_webcluster
Last updated: Sun Dec 13 16:24:14 2015          Last change: Sun Dec 13 16:23:42 2015 by root via cibadmin on pcmk01-cr
Stack: corosync
Current DC: pcmk02-cr (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 6 resources configured

Online: [ pcmk01-cr pcmk02-cr ]

Full list of resources:

 Resource Group: my_webresource
     my_VIP     (ocf::heartbeat:IPaddr2):       Started pcmk02-cr
     my_website (ocf::heartbeat:apache):        Started pcmk02-cr
 Master/Slave Set: MyWebClone [my_webdata]
     Masters: [ pcmk02-cr ]
     Slaves: [ pcmk01-cr ]
 my_webfs       (ocf::heartbeat:Filesystem):    Started pcmk02-cr
 my_vcentre-fence       (stonith:fence_vmware_soap):    Started pcmk01-cr

PCSD Status:
  pcmk01-cr: Online
  pcmk02-cr: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Test Fencing

Reboot the second cluster node; make sure that you use the Corosync interface name for this:

[pcmk01]# stonith_admin --reboot pcmk02-cr
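Alternatively, the same test can be driven through pcs, which asks the cluster to fence the named node using the configured STONITH device:

[pcmk01]# pcs stonith fence pcmk02-cr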

We can also test it by killing the Corosync interface on pcmk01 and observing the node being fenced:

[pcmk02]# tail -f /var/log/messages
[pcmk01]# ifdown $(ip ro|grep "172\.16\.21"|awk '{print $3}')
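Once the fenced node has rebooted and rejoined the cluster, the fencing history for that target can be reviewed with stonith_admin (the exact output format depends on the Pacemaker version):

[pcmk01]# stonith_admin --history pcmk02-cr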

19 thoughts on “Active/Passive Cluster With Pacemaker, Corosync and DRBD on CentOS 7: Part 4 – Configure Fencing (STONITH)”

  1. I configured fencing (STONITH) using the fencing agent for VMware exactly as described above.
    My cluster configuration contains two VMs on the same physical server.
    After the configuration was done I ran 'pcs status' and got:
    Master/Slave Set: RADviewClone01 [RADview_data01]
    Masters: [ rvpcmk01-cr ]
    Slaves: [ rvpcmk02-cr ]
    RADview_fs01 (ocf::heartbeat:Filesystem): Started rvpcmk01-cr
    RADview_VIP01 (ocf::heartbeat:IPaddr2): Started rvpcmk01-cr
    oracle_service (ocf::heartbeat:oracle): Started rvpcmk01-cr
    oracle_listener (ocf::heartbeat:oralsnr): Started rvpcmk01-cr
    my_vcentre-fence (stonith:fence_vmware_soap): Started rvpcmk02-cr

    I enabled stonith using the command: pcs property set stonith-enabled=true

    Now I saw behavior where, after a few minutes, my whole (physical) server went down without me doing anything…
    I started my (physical) server again and started my VMs as well, and again after a few minutes the server powered off.

    Why did this kind of behavior happen?
    * In addition, the server powers off improperly and cannot start up without my manual intervention to power off and then power on the whole server.

    • STONITH is used to fence a virtual cluster node, not a physical VMware ESXi host. If a cluster node got powered off, then it’s likely that a split-brain situation occurred. Logs should indicate the reason.

    • After I fixed the resource I created (mistakenly configured with the ESXi IP instead of the vCenter IP), the STONITH/fencing resource started properly (without shutting down my whole server :) ), but after a few minutes it stopped.
      Please advise.

      pcs status shows:
      Master/Slave Set: RADviewClone01 [RADview_data01]
      Masters: [ rvpcmk01-cr ]
      Slaves: [ rvpcmk02-cr ]
      RADview_fs01 (ocf::heartbeat:Filesystem): Started rvpcmk01-cr
      RADview_VIP01 (ocf::heartbeat:IPaddr2): Started rvpcmk01-cr
      oracle_service (ocf::heartbeat:oracle): Started rvpcmk01-cr
      oracle_listener (ocf::heartbeat:oralsnr): Started rvpcmk01-cr
      RADviewServicesResource (ocf::heartbeat:RADviewServices): Started rvpcmk01-cr
      my_vcentre-fence (stonith:fence_vmware_soap): Stopped

      Failed Actions:
      * my_vcentre-fence_start_0 on rvpcmk01-cr 'unknown error' (1): call=61, status=Error, exitreason='none',
      last-rc-change='Tue Apr 4 14:42:56 2017', queued=0ms, exec=2646ms
      * my_vcentre-fence_start_0 on rvpcmk02-cr 'unknown error' (1): call=46, status=Error, exitreason='none',
      last-rc-change='Tue Apr 4 14:42:25 2017', queued=1ms, exec=1562ms

      Cluster corosync.log shows these errors:
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: warning: status_from_rc: Action 50 (my_vcentre-fence_start_0) on rvpcmk02-cr failed (target: 0 vs. rc: 1): Error
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: notice: abort_transition_graph: Transition aborted by operation my_vcentre-fence_start_0 'modify' on rvpcmk02-cr: Event failed | magic=4:1;50:14:0:c6128452-3a7f-4fa0-9a02-f01b334e502e cib=0.221.7 source=match_graph_event:310 complete=false
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: match_graph_event: Action my_vcentre-fence_start_0 (50) confirmed on rvpcmk02-cr (rc=1)
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: update_failcount: Updating failcount for my_vcentre-fence on rvpcmk02-cr after failed start: rc=1 (update=INFINITY, time=1491306176)
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: process_graph_event: Detected action (14.50) my_vcentre-fence_start_0.46=unknown error: failed
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: warning: status_from_rc: Action 50 (my_vcentre-fence_start_0) on rvpcmk02-cr failed (target: 0 vs. rc: 1): Error
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: abort_transition_graph: Transition aborted by operation my_vcentre-fence_start_0 'modify' on rvpcmk02-cr: Event failed | magic=4:1;50:14:0:c6128452-3a7f-4fa0-9a02-f01b334e502e cib=0.221.7 source=match_graph_event:310 complete=false
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: match_graph_event: Action my_vcentre-fence_start_0 (50) confirmed on rvpcmk02-cr (rc=1)
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: update_failcount: Updating failcount for my_vcentre-fence on rvpcmk02-cr after failed start: rc=1 (update=INFINITY, time=1491306176)
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: info: process_graph_event: Detected action (14.50) my_vcentre-fence_start_0.46=unknown error: failed
      Apr 04 14:42:56 [1407] rvpcmk01 crmd: notice: run_graph: Transition 14 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacema

  2. I tried to run this command and it failed:
    [[email protected] ~]# stonith_admin --reboot rvpcmk01-cr
    Command failed: No route to host

    corosync.log on primary server show:
    Apr 04 16:09:07 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29029] stderr: [ Failed: Unable to obtain correct plug status or plug is not available ]
    Apr 04 16:09:07 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29029] stderr: [ ]
    Apr 04 16:09:07 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29029] stderr: [ ]
    Apr 04 16:09:07 [1403] rvpcmk01 stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_vmware_soap (reboot). remaining timeout is 109
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ InsecureRequestWarning) ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ Failed: Unable to obtain correct plug status or plug is not available ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: warning: log_action: fence_vmware_soap[29110] stderr: [ ]
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_vmware_soap (reboot) the maximum number of times (2) allowed
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: error: log_operation: Operation 'reboot' [29110] (call 2 from stonith_admin.29713) for host 'rvpcmk01-cr' with device 'my_vcentre-fence' returned: -201 (Generic Pacemaker error)
    Apr 04 16:09:19 [1403] rvpcmk01 stonith-ng: notice: remote_op_done: Operation reboot of rvpcmk01-cr by for [email protected]: No route to host
    Apr 04 16:09:19 [1407] rvpcmk01 crmd: notice: tengine_stonith_notify: Peer rvpcmk01-cr was not terminated (reboot) by for rvpcmk02-cr: No route to host (ref=6096a1f3-ac9b-4f8c-9500-fb39d1070ef5) by client stonith_admin.29713

    PLEASE ADVISE WHAT IS WRONG…

  3. I have already installed fence_vmware_soap on all nodes. But when I execute "fence_vmware_soap --ip vcentre.local --ssl --ssl-insecure --action list", it returns an error: "unable to connect/login to fencing device". The username and password are the ones I use to log in to the vSphere client, and the same credentials can also log in to the ESXi shell over SSH.
    Does that error relate to what the article said: "For the configuration of this VMware fencing device we need credentials for vCentre with minimal permissions"?
    Please advise me what is wrong… Thank you!

    • It’s possible that the agent cannot connect to the vCentre. I used VMware some years ago but have since migrated to a different virtualisation platform. Unfortunately I’m unable to test/replicate it.

    • If you don’t use vSphere then you can pass the IP of the ESXi server directly. Simply configure fence_vmware_soap to point to the IP address of the ESXi host instead of the vCenter.

    • Oh, I see. Thank you!
      P.S. Thanksgiving is not a national holiday in the UK. Get :)

    • Hi, if you use libvirt you can do fencing by installing and configuring fence-agents-virsh. Proxmox does not use libvirt so you need to utilise fence_pve (comes with the fence-agents package on Debian, also available on GitHub).
