Setting up an O2CB Cluster with Dual Primary DRBD and OCFS2 on Oracle Linux 7

We are going to build a two-node active/active HA cluster using O2CB and DRBD.

Before We Begin

O2CB is the default cluster stack of the OCFS2 file system. It is an in-kernel cluster stack that includes a node manager (o2nm) to keep track of the nodes in the cluster, a disk heartbeat agent (o2hb) to detect node liveness, a network agent (o2net) for intra-cluster node communication, and a distributed lock manager (o2dlm) to keep track of lock resources. It also includes a synthetic file system, dlmfs, that allows applications to access the in-kernel DLM.

This cluster stack has two configuration files, /etc/ocfs2/cluster.conf and /etc/sysconfig/o2cb. Whereas the former keeps track of the cluster layout, the latter keeps track of the cluster timeouts. Both files are only read when the cluster is brought online.

The convention followed in this article is that [ALL]# denotes a command that needs to be run on all cluster nodes, while [ora1]# or [ora2]# denotes a command to be run on that node only.

Software

Software used in this article:

  1. Oracle Linux Server release 7.2 (Maipo)
  2. kernel-uek-3.8.13
  3. ocfs2-tools 1.8.6
  4. drbd-8.4

Networking and Firewall Configuration

We have two Oracle Linux 7 virtual machines on VirtualBox, named ora1 and ora2.

Networking

The following networks will be used:

  1. 10.8.8.0/24 – LAN with access to the Internet,
  2. 172.16.21.0/24 – non-routable cluster heartbeat vlan for o2cb,
  3. 172.16.22.0/24 – non-routable cluster heartbeat vlan for DRBD.

Hostnames and IPs as defined in /etc/hosts file:

10.8.8.55 ora1
10.8.8.56 ora2
172.16.21.55 ora1-clust
172.16.21.56 ora2-clust
172.16.22.55 ora1-drbd
172.16.22.56 ora2-drbd
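The address plan above follows a simple pattern: the last octet is 54 plus the node number, and each node gets a LAN, an o2cb heartbeat and a DRBD address. A small shell sketch, using the names and subnets from this article, can generate the entries:

```shell
# Sketch: generate the /etc/hosts entries used in this article.
# The last octet is 54 + the node number; each node gets a LAN,
# an o2cb heartbeat and a DRBD address.
hosts_entries() {
  for i in 1 2; do
    last=$((54 + i))
    printf '10.8.8.%s ora%s\n' "$last" "$i"
    printf '172.16.21.%s ora%s-clust\n' "$last" "$i"
    printf '172.16.22.%s ora%s-drbd\n' "$last" "$i"
  done
}
hosts_entries
```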

We have set the following hostnames:

[ora1]# hostnamectl set-hostname ora1
[ora2]# hostnamectl set-hostname ora2

SELinux

SELinux is set to enforcing mode.

Firewall

The o2cb cluster stack requires iptables to be either disabled or modified to allow network traffic on the private network interface.

These are iptables rules that we have in use:

# iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -s 10.0.0.0/8 -p tcp -m tcp --dport 22 -m state --state NEW -j ACCEPT
-A INPUT -s 172.16.21.0/24 -d 172.16.21.0/24 -p tcp -m tcp --dport 7777 -j ACCEPT
-A INPUT -s 172.16.21.0/24 -d 172.16.21.0/24 -p udp -m udp --dport 7777 -j ACCEPT
-A INPUT -s 172.16.22.0/24 -d 172.16.22.0/24 -m comment --comment DRBD -j ACCEPT
-A INPUT -p udp -m multiport --dports 67,68 -m state --state NEW -j ACCEPT
-A INPUT -p udp -m multiport --dports 137,138,139,445 -j DROP
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT 
-A INPUT -j LOG --log-prefix "iptables_input " 
-A INPUT -j REJECT --reject-with icmp-port-unreachable

We have also disabled IPv6. Open /etc/sysctl.conf for editing and add the following:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

Apply the changes:

[ALL]# sysctl -p

DRBD

DRBD refers to block devices designed as a building block to form high availability clusters. This is done by mirroring a whole block device via an assigned network. DRBD can be understood as network-based RAID-1.

DRBD Installation

Import the ELRepo package signing key, enable the repository and install the DRBD kernel module with utilities:

[ALL]# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
[ALL]# rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
[ALL]# yum install -y kmod-drbd84 drbd84-utils

To avoid issues with SELinux until they are resolved, we are going to exempt DRBD processes from SELinux control:

[ALL]# yum install -y policycoreutils-python
[ALL]# semanage permissive -a drbd_t

LVM Volume for DRBD

We create a new 1GB logical volume for DRBD:

[ALL]# lvcreate --name lv_drbd --size 1024M vg_oracle7

DRBD Configuration

As we are going to use OCFS2, which is a shared cluster file system (expecting concurrent read/write storage access from all cluster nodes), the DRBD resource used for storing an OCFS2 filesystem must be configured in dual-primary mode.

It is not recommended to set the allow-two-primaries option to yes upon initial configuration. We should do so only after the initial resource synchronisation has completed.

[ALL]# cat /etc/drbd.d/ocfsdata.res
resource ocfsdata {
 protocol C;
 meta-disk internal;
 device /dev/drbd0;
 disk /dev/vg_oracle7/lv_drbd;
 handlers {
  split-brain "/usr/lib/drbd/notify-split-brain.sh root";
 }
 startup {
  wfc-timeout 20;
  become-primary-on both;
 }
 net {
  allow-two-primaries yes;
  after-sb-0pri discard-zero-changes;
  after-sb-1pri discard-secondary;
  after-sb-2pri disconnect;
  rr-conflict disconnect;
  csums-alg sha1;
 }
 disk {
  on-io-error detach;
  resync-rate 10M; # 100Mbps dedicated link
  # All cluster file systems require fencing
  fencing resource-and-stonith;
 }
 syncer {
  verify-alg sha1;
 }
 on ora1 {
  address  172.16.22.55:7789;
 }
 on ora2 {
  address  172.16.22.56:7789;
 }
}

Create the local metadata for the DRBD resource:

[ALL]# drbdadm create-md ocfsdata

Ensure the DRBD kernel module is loaded and bring up the DRBD resource:

[ALL]# modprobe drbd
[ALL]# drbdadm up ocfsdata

We should have inconsistent data:

[ora1]# drbdadm dstate ocfsdata
Inconsistent/Inconsistent
[ora1]# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
srcversion: D2C09D2CF4CCB8C91B02D14
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1048508

We see that state is Connected, meaning the two DRBD nodes are communicating properly, and that both nodes are in Secondary role with Inconsistent data.

Force ora1 to become the primary node:

[ora1]# drbdadm primary --force ocfsdata
[ALL]# drbdadm dstate ocfsdata
UpToDate/UpToDate

Force ora2 to become the primary node too:

[ora2]# drbdadm primary --force ocfsdata

Both nodes are primary and up to date:

[ora1]# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
srcversion: D2C09D2CF4CCB8C91B02D14
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:1049420 al:0 bm:64 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
[ora1]# drbd-overview
 0:ocfsdata/0  Connected Primary/Primary UpToDate/UpToDate
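For monitoring scripts it can be handy to pull the three state fields out of a /proc/drbd resource line. A minimal sketch (the helper name is ours; the sample line is copied from the output above):

```shell
# Sketch: extract connection state, roles and disk states from a
# /proc/drbd resource line (sample line copied from the output above).
drbd_state() {
  line=$1
  cs=$(echo "$line" | grep -o 'cs:[^ ]*' | cut -d: -f2)
  ro=$(echo "$line" | grep -o 'ro:[^ ]*' | cut -d: -f2)
  ds=$(echo "$line" | grep -o 'ds:[^ ]*' | cut -d: -f2)
  echo "connection=$cs roles=$ro disks=$ds"
}
drbd_state ' 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----'
```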

Enable DRBD on boot as we don’t use Pacemaker:

[ALL]# systemctl enable drbd

Cluster Configuration

OCFS2 Cluster Software

Each node should be running the same version of the OCFS2 software and a compatible version of the Oracle Linux Unbreakable Enterprise Kernel (UEK).

[ALL]# yum install -y ocfs2-tools

Create the Configuration File for the Cluster Stack

The cluster should be offline, as we haven't created one yet:

[ora1]# o2cb cluster-status
offline

Create a cluster definition:

[ora1]# o2cb add-cluster tclust

The command above creates the configuration file /etc/ocfs2/cluster.conf if it does not already exist.

Verify:

[ora1]# o2cb list-cluster tclust
cluster:
        node_count = 0
        heartbeat_mode = local
        name = tclust

We need to define each cluster node. The IP address is the one that the node will use for private communication in the cluster.

[ora1]# o2cb add-node tclust ora1 --ip 172.16.21.55
[ora1]# o2cb add-node tclust ora2 --ip 172.16.21.56

Verify:

[ora1]# o2cb list-cluster tclust
node:
        number = 0
        name = ora1
        ip_address = 172.16.21.55
        ip_port = 7777
        cluster = tclust

node:
        number = 1
        name = ora2
        ip_address = 172.16.21.56
        ip_port = 7777
        cluster = tclust

cluster:
        node_count = 2
        heartbeat_mode = local
        name = tclust

Now, copy the cluster configuration file /etc/ocfs2/cluster.conf to each node in the cluster.
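When distributing the file, a quick sanity check that node_count agrees with the number of node: stanzas can save a debugging session later. A sketch against an inline sample (the helper name is ours; on a real node, point it at /etc/ocfs2/cluster.conf instead):

```shell
# Sketch: verify that node_count in a cluster.conf matches the number
# of node: stanzas (inline sample; use /etc/ocfs2/cluster.conf for real).
check_conf() {
  declared=$(awk -F= '$1 ~ /node_count/ {gsub(/ /, "", $2); print $2}' "$1")
  actual=$(grep -c '^node:' "$1")
  if [ "$declared" = "$actual" ]; then
    echo "cluster.conf OK: $actual nodes"
  else
    echo "mismatch: node_count=$declared but $actual node: stanzas"
  fi
}
cat > /tmp/cluster.conf.sample <<'EOF'
node:
        number = 0
        name = ora1
node:
        number = 1
        name = ora2
cluster:
        node_count = 2
EOF
check_conf /tmp/cluster.conf.sample
```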

Configure the Cluster Stack

If you follow the Oracle Linux Administrator's Guide for Release 7, you'll see the following command for configuring the cluster stack:

# /etc/init.d/o2cb configure

However, on Oracle Linux 7.2 the o2cb init.d script is no longer present. Instead, open the file /etc/sysconfig/o2cb and configure the cluster stack manually.

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_STACK: The name of the cluster stack backing O2CB.
O2CB_STACK=o2cb

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=tclust

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=21

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=15000

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=2000

Copy the cluster configuration file /etc/sysconfig/o2cb to each node in the cluster.

Configure the o2cb and ocfs2 services so that they start at boot time after networking is enabled:

[ALL]# systemctl enable o2cb
[ALL]# systemctl enable ocfs2

Register the cluster with configfs:

[ALL]# o2cb register-cluster

Start the cluster:

[ALL]# systemctl start o2cb
[ora1]# systemctl status o2cb
 o2cb.service - Load o2cb Modules
   Loaded: loaded (/usr/lib/systemd/system/o2cb.service; enabled; vendor preset: disabled)
   Active: active (exited) since Sat 2016-03-12 15:37:06 GMT; 8s ago
  Process: 1223 ExecStop=/sbin/o2cb.init disable (code=exited, status=0/SUCCESS)
  Process: 1288 ExecStart=/sbin/o2cb.init enable (code=exited, status=0/SUCCESS)
 Main PID: 1288 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/o2cb.service
           └─1335 o2hbmonitor

Mar 12 15:37:06 ora1 o2cb.init[1288]: checking debugfs...
Mar 12 15:37:06 ora1 o2cb.init[1288]: Loading filesystem "configfs": OK
Mar 12 15:37:06 ora1 o2cb.init[1288]: Loading stack plugin "o2cb": OK
Mar 12 15:37:06 ora1 o2cb.init[1288]: Loading filesystem "ocfs2_dlmfs": OK
Mar 12 15:37:06 ora1 o2cb.init[1288]: Mounting ocfs2_dlmfs filesystem at /dlm: OK
Mar 12 15:37:06 ora1 o2cb.init[1288]: Setting cluster stack "o2cb": OK
Mar 12 15:37:06 ora1 o2cb.init[1288]: Registering O2CB cluster "tclust": OK
Mar 12 15:37:06 ora1 o2cb.init[1288]: Setting O2CB cluster timeouts : OK
Mar 12 15:37:06 ora1 systemd[1]: Started Load o2cb Modules.
Mar 12 15:37:06 ora1 o2hbmonitor[1335]: Starting

Configure the Kernel for the Cluster

Two sysctl values need to be set for o2cb to function properly. The first, panic_on_oops, must be enabled to turn a kernel oops into a panic. If a kernel thread required for o2cb to function crashes, the system must be reset to prevent a cluster hang. If it is not set, another node may not be able to distinguish whether a node is unable to respond or slow to respond.

The other related sysctl parameter is panic, which specifies the number of seconds after a panic that the system will be auto-reset.

On each node, open the file /etc/sysctl.conf and set the recommended values for panic and panic_on_oops:

kernel.panic = 30
kernel.panic_on_oops = 1

Reload:

[ALL]# sysctl -p

Create and Populate OCFS2 Filesystem

Note that we cannot change the block and cluster size of an OCFS2 volume after it has been created.

[ora1]# mkfs.ocfs2 --cluster-size 8K -J size=32M -T mail \
  --node-slots 2 --label ocfs2_fs --mount cluster \
  --fs-feature-level=max-features \
  --cluster-stack=o2cb --cluster-name=tclust \
  /dev/drbd0
mkfs.ocfs2 1.8.6
Cluster stack: o2cb
Cluster name: tclust
Stack Flags: 0x0
NOTE: Feature extended slot map may be enabled
Overwriting existing ocfs2 partition.
Proceed (y/N): y
Filesystem Type of mail
Label: ocfs2_fs
Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super metaecc xattr indexed-dirs usrquota grpquota refcount discontig-bg
Block size: 2048 (11 bits)
Cluster size: 8192 (13 bits)
Volume size: 1073668096 (131063 clusters) (524252 blocks)
Cluster groups: 9 (tail covers 4087 clusters, rest cover 15872 clusters)
Extent allocator size: 4194304 (1 groups)
Journal size: 33554432
Node slots: 2
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 0 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful
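As a cross-check of the geometry that mkfs.ocfs2 reported, the volume size should equal clusters × cluster size and also blocks × block size:

```shell
# Cross-check the geometry reported by mkfs.ocfs2 above:
# volume size = clusters * cluster size = blocks * block size.
CLUSTER_SIZE=8192; CLUSTERS=131063
BLOCK_SIZE=2048;   BLOCKS=524252
echo "from clusters: $((CLUSTER_SIZE * CLUSTERS)) bytes"
echo "from blocks:   $((BLOCK_SIZE * BLOCKS)) bytes"
```

Both products come out to 1073668096 bytes, matching the reported volume size.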

The filesystem can be checked with fsck.ocfs2 and tuned with tunefs.ocfs2.

Detect all OCFS2 volumes on a system:

[ora1]# mounted.ocfs2 -f
Device                          Stack  Cluster  F  Nodes
/dev/mapper/vg_oracle7-lv_drbd  o2cb   tclust      Not mounted
/dev/drbd0                      o2cb   tclust      Not mounted

Show OCFS2 file system information:

[ora1]# o2info --fs-features /dev/drbd0
backup-super strict-journal-super sparse extended-slotmap inline-data metaecc
xattr indexed-dirs refcount discontig-bg clusterinfo unwritten usrquota
grpquota
[ora1]# o2info --volinfo /dev/drbd0
       Label: ocfs2_fs
        UUID: 30F0A6E6705249FF9F9E39231772C1FE
  Block Size: 2048
Cluster Size: 8192
  Node Slots: 2
    Features: backup-super strict-journal-super sparse extended-slotmap
    Features: inline-data metaecc xattr indexed-dirs refcount discontig-bg
    Features: clusterinfo unwritten usrquota grpquota

We can copy metadata blocks from /dev/drbd0 device to drbd0.out file (handy for debugfs.ocfs2):

[ora1]# o2image /dev/drbd0 drbd0.out

Create a mountpoint:

[ALL]# mkdir -p /cluster/storage

Add to fstab on all nodes:

/dev/drbd0  /cluster/storage ocfs2 rw,noatime,nodiratime,_netdev 0 0
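The _netdev option matters here: without it, systemd may try to mount the volume before the network and the o2cb stack are up. A small sketch (the helper name is ours) that checks an fstab line for it:

```shell
# Sketch: confirm an fstab line carries the _netdev mount option
# (the fourth whitespace-separated field holds the options).
has_netdev() {
  opts=$(echo "$1" | awk '{print $4}')
  case ",$opts," in
    *,_netdev,*) echo "_netdev present" ;;
    *)           echo "WARNING: _netdev missing" ;;
  esac
}
has_netdev '/dev/drbd0  /cluster/storage ocfs2 rw,noatime,nodiratime,_netdev 0 0'
```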

Mount:

[ALL]# mount -a

The file system will not mount unless we have enabled the o2cb and ocfs2 services to start after networking is started.

Verify the volume is mounted.

[ora1]# mount -l|grep ocfs
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw,relatime)
/dev/drbd0 on /cluster/storage type ocfs2 (rw,relatime,seclabel,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,cluster_stack=o2cb,coherency=buffered,user_xattr,acl) [ocfs2_fs]
[ora1]# drbd-overview
 0:ocfsdata/0  Connected Primary/Primary UpToDate/UpToDate /cluster/storage ocfs2 1.0G 82M 943M 8%
[ora1]# mounted.ocfs2 -f
Device                          Stack  Cluster  F  Nodes
/dev/mapper/vg_oracle7-lv_drbd  o2cb   tclust      ora1, ora2
/dev/drbd0                      o2cb   tclust      ora1, ora2

Dealing with SELinux

The default security context applied should be unlabeled:

[ora1]# ls -dZ /cluster/storage
drwxr-xr-x. root root system_u:object_r:unlabeled_t:s0 /cluster/storage/

SELinux should report that the OCFS2 volume uses xattr:

[ora1]# dmesg|grep xattr
[11426.378173] SELinux: initialized (dev dm-3, type ext4), uses xattr
[11432.780965] SELinux: initialized (dev sda1, type ext2), uses xattr
[11804.273746] SELinux: initialized (dev drbd0, type ocfs2), uses xattr

We are going to change the security context to public content. Note that a context mount option applies a single security context to the entire filesystem. Configure fstab accordingly on all cluster nodes and remount the OCFS2 volume:

/dev/drbd0  /cluster/storage ocfs2 rw,noatime,nodiratime,_netdev,context=system_u:object_r:public_content_t:s0 0 0

Security context should be changed:

[ora1]# ls -dZ /cluster/storage
drwxr-xr-x. root root system_u:object_r:public_content_t:s0 /cluster/storage/

Reboot the system and check the kernel message buffer; the xattr entry for the OCFS2 volume should no longer appear:

[ora1]# systemctl reboot
[ora1]# dmesg|grep xattr
[12133.780006] SELinux: initialized (dev dm-3, type ext4), uses xattr
[12139.991193] SELinux: initialized (dev sda1, type ext2), uses xattr

Some Testing

Manual DRBD Split Brain Recovery

As per DRBD documentation, after split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in WFConnection (if the peer tore down the connection before the other node had a chance to detect split brain).

To test a split brain DRBD scenario, we can simply ifdown the network interface used for DRBD communications (172.16.22.0/24) on one of the cluster nodes.

Let us see what state the resources are in afterwards:

[ora1]# drbd-overview
 0:ocfsdata/0  StandAlone Primary/Unknown UpToDate/DUnknown /cluster/storage ocfs2 1.0G 186M 839M 19%
[ora2]# drbd-overview
 0:ocfsdata/0  WFConnection Primary/Unknown UpToDate/DUnknown /cluster/storage ocfs2 1.0G 186M 839M 19%

We are going to pick the ora1 node as the split brain victim:

[ora1]# umount /cluster/storage/
[ora1]# drbdadm disconnect ocfsdata
[ora1]# drbdadm secondary ocfsdata
[ora1]# drbd-overview
 0:ocfsdata/0  StandAlone Secondary/Unknown UpToDate/DUnknown
[ora1]# drbdadm connect --discard-my-data ocfsdata

On the other node (the split brain survivor), if its connection state is also StandAlone, we would enter:

[ora2]# drbdadm connect ocfsdata

However, we may omit this step as the node is already in the WFConnection state; it will then reconnect automatically.

[ora1]# drbdadm primary ocfsdata
[ora1]# mount -a
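The recovery procedure above boils down to a decision on each node's connection state, which can be sketched as a shell helper (the function name is ours; the states are as reported by drbd-overview):

```shell
# Sketch: map a node's post-split-brain connection state to the
# recovery action described above.
sb_action() {
  case "$1" in
    StandAlone)   echo "reconnect manually (the victim adds --discard-my-data)" ;;
    WFConnection) echo "nothing to do: the node reconnects automatically" ;;
    *)            echo "unexpected state: $1" ;;
  esac
}
sb_action StandAlone
sb_action WFConnection
```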

O2CB Fencing

As per Oracle documentation, a node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) seconds. The kernel thread, after every timestamp write, sets a timer to panic the system after that duration. If the next timestamp is written within that duration, as it should be, the thread first cancels that timer before setting up a new one. This ensures the system will self-fence if for some reason the kernel thread is unable to update the timestamp and is thus deemed dead by other nodes in the cluster.
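With the O2CB_HEARTBEAT_THRESHOLD=21 configured earlier, that formula gives a concrete deadline; a quick check of the arithmetic:

```shell
# Self-fence deadline: (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds.
# With the threshold of 21 configured earlier this works out to 40 seconds.
O2CB_HEARTBEAT_THRESHOLD=21
deadline=$(( (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 ))
echo "self-fence after ${deadline}s without a heartbeat timestamp update"
```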

To test o2cb fencing, ifdown the network interface used for o2cb communications (172.16.21.0/24) on one of the cluster nodes.

References

https://docs.oracle.com/cd/E52668_01/E54669/html/ol7-instcfg-ocfs2.html
https://oss.oracle.com/projects/ocfs2/dist/documentation/v1.8/ocfs2-1_8_2-manpages.pdf
http://drbd.linbit.com/en/doc/users-guide-84/s-resolve-split-brain

4 thoughts on "Setting up an O2CB Cluster with Dual Primary DRBD and OCFS2 on Oracle Linux 7"

  1. Hello Tomas,

    Thank you very much for your hard work. You have great articles!

    I was wondering though… Can we use Pacemaker with Dual primary DRBD and OCFS2? I have Googled for tutorials and explanations such as yours, but I haven’t found anything recent about Pacemaker + Dual Primary DRBD + OCFS2. Is there a reason not to use these 3 together?

    Regards,
    JG

    • You could use Pacemaker with DRBD and OCFS2 on CentOS 6 (I’ve seen several articles on the Internet).

      However, the ocfs2 provider is no longer available on CentOS 7:

      # pcs resource providers
      heartbeat
      linbit
      openstack
      pacemaker
    • I see. I was planning to use the 3 with Oracle Linux 7 or CentOS 7. Thank you for pointing out that the provider is not available. I suppose I have no choice but to use CentOS 6 then. Thank you very much!

    • No worries. There is Pacemaker’s interface (ocf:ocfs2:o2cb) to OCFS2 cluster manager available on CentOS 6, but I’m not sure on how you manage it on CentOS 7. Let me know in case you find out.

      According to the Cluster Labs, Pacemaker supports both OCFS2 and GFS2 filesystems. They say that you can use them on top of real disks or network block devices like DRBD. It might be worth dropping them a line and asking how such support works on CentOS 7.
