Friday, September 4, 2009

A two node HA cluster - mini howto

One of our goals this quarter has been to make our LDAP service more reliable. We tried using the Cisco ACE load balancer in front of two LDAP slaves, but that doesn't allow for custom application checks. Simple port checks aren't good enough for this and we needed a more thorough check to verify that our OpenLDAP instances were up and working correctly. So we decided to implement this in software using the linux HA stack. The linux HA stack allows you to combine a few servers into a cluster to provide highly available services(s). In HA terminology the services provided by the cluster are called resources.

The HA stack is made of multiple components that work together to make resources available. The first of these is the heartbeat daemon. It runs on every single node (server) in the cluster and is responsible for ensuring that the nodes are alive and talking to each other. It also provides a framework for the other layers in the stack. Although there are bunch of other options you could use, a basic configuration tells heartbeat about the members in the cluster, establishes a communication mechanism between the members, and sets up an (secret) auth key to make sure that only nodes that know that key can join the cluster. Here is a sample config file for heartbeat.

[root@server1 ha.d]# cat /etc/ha.d/ha.cf
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
deadtime 30
keepalive 1
warntime 10
initdead 120
udpport 694
bcast bond0
mcast bond0 239.0.0.1 694 1 0
auto_failback on
node server1
node server2
debug 0
crm on
[root@server1 ha.d]#
[root@server1 ha.d]# cat /etc/ha.d/authkeys
auth 2
2 sha1 4BWtvO7NOO6PPnFX
[root@server1 ha.d]#

With the above configuration, we are establishing two modes of communication between the cluster members (server1 and server2), broadcast or multicast over the bond0 interface. Other communication methods are possible as well (serial cable, etc). In this case, since both modes are going over the same interface, its probably redundant and not all that fool-proof. The authkeys file establishes the secret key that nodes need to know to join this cluster.

Heartbeat by itself can also be used to manage and make the cluster resources available. However, it is limited to only two nodes in this configuration. A newer implementation was developed to remove this limitation and was spun off to become the pacemaker project. The last line "crm on" tells heartbeat that we will use an external Cluster Resource Manager (pacemaker in this case) to handle resources. Please note that there is a new software layer called OpenAIS that provides services similar to heartbeat. It is being developed jointly by RedHat and Suse and attempts to be a OSI certified implementation of the Application Interface Specification (AIS). I found it pretty confusing and decided to stick with heartbeat for our needs.

Pacemaker can be used to provide a variety of services and is frequently used to provide resources that access shared data. A common example is an nfs server that exports data from a shared block level layer (like a iscsi disk). Scenarios like this require that only one host in the cluster accesses this shared disk at any time. Bad things happen when multiple hosts try to write to a single shared physical disk simultaneously. In certain situations, member nodes in a cluster fail to relinquish these shared resources and must be cut off from the resources. Heartbeat relies on a service called stonith (Shoot The Other Node In The Head), which basically turns misbehaving hosts off in such cases. This service is usually hooked up to some sort of remote power management facility for the nodes in the cluster. Our situation doesn't need that stuff, so my configuration does not cover stonith. Disable stonith with "crm_attribute --type crm_config -n stonith-enabled -v false".

The pacemaker project provides binaries for almost all linux distributions (using the openSUSE Build Service - thanks guys!). Configuring pacemaker can seem daunting at first but googling should give you plenty of pointers. Pacemaker itself is split into a bunch of daemons that work together to manage your resources. These are the crm, lrm, etc... I strongly suggest reading through at least the first 10 pages or so of this document before continuing.

Now that you have read the doc, all that remains is to configure the resources your cluster provides. As indicated in the configuration above, we have two servers (server1 - a physical box and server2 - a backup VM). Either of these servers are capable of handling all our traffic. Server1 however is a pretty robust machine, so I want all our traffic going to just that machine (as long as it's working correctly). However, if the LDAP (slapd) instance on it gets corrupt for some reason or if I need to reboot the box for maintenance etc, I would like server2 to kick in, take over the floating vip and field requests. Both servers have LDAP slaves on them, that are running all the time.

Pacemaker comes with a host of configuration, management and monitoring tools. To begin with, configure heartbeat as shown above and start it on both the servers. On our second server we don't have a bonded interface, so bond0 in the config file above changes to eth0. Once heartbeat is up and running, run the crm_mon tool and wait for it to tell you that the cluster is in quorum and that one of the nodes has been elected as the DC. At that point you can quit it (with a CTRL-C).

Pacemaker depends on Resource Agents to start/stop and monitor your resources. These RAs are usually just scripts that are very similar to the standard linux init scripts with a few modifications. These come in two flavours, the older style heartbeat scripts and the newer OCF style scripts that support more features. This page talks about these scripts and the differences between the two styles. If you use the older heartbeat style scripts, keep in mind that pacemaker will not be able to monitor your resources. It will just take care of starting, stopping and migrating them as directed (by an admin). We had one minor oddity in our situation that we didn't really need to start/stop our LDAP slaves on these servers, as these slaves were always running. I had to hack a RA script to make it work for us. I will detail that in another post as this one is already getting to be pretty long!

To configure the resources managed by your cluster, use the crm command (in its configure mode). You can run this tool interactively or feed it a preset configuration script. I used the interactive mode quite a bit as it allows you to validate your configuration, make changes on the fly and deploy them pretty easily. For the sake of brevity, I am just going to list our configuration. Feed these into crm with "crm configure ..."

primitive ldap_service ocf:heartbeat:ldap \
meta migration-threshold="2" failure-timeout="90s" \
op monitor interval="5s" timeout="15s" start_delay="15s" disabled="false" on_fail="standby"
primitive ldap_vip ocf:heartbeat:IPaddr2 \
params ip="10.7.36.142"
group ldap ldap_service ldap_vip \
meta target_role="started" collocated="true"
location prefer_server1 ldap 10: server1

The first line defines a ldap RA and tells pacemaker that it is a OCF style resource, and that script is called ldap. The op line defines a monitor and tells it to monitor the resource every 5s and enables the monitor. It also states that the node should be put in a standby mode upon resource failure. The meta parameters say that the resource should be failed over after two failures, and that after 90s, the service is allowed to fail back to the primary server if desired.

The second line defines the next resource (the vip).

The third line defines a group that combines the above two resources and that these two resources should live together.

The last line tells that I'd prefer this group to live on server1 as much as possible.

You can verify your configuration with "crm configure verify" and activate it with "crm configure commit". At this point, pacemaker should activate your vip and the service. crm_mon should show these two resources to be up and running. If you want to fail your service manually to your backup server use "crm_resource -M -r ldap -N server2".

Note that I probably have some redundant configuration options in our setup here. If you spot any of those, or if you find any glaring errors, I'd appreciate the feedback. The cluster configuration guide I linked to earlier is your bible for this stuff. It details every single option you can use with crm and is written very well. You can also refer to docs here for sample configurations and other helpful pointers. HTH someone out in the ether!