Friday, September 4, 2009

A two node HA cluster - mini howto

One of our goals this quarter has been to make our LDAP service more reliable. We tried using the Cisco ACE load balancer in front of two LDAP slaves, but that doesn't allow for custom application checks. Simple port checks aren't good enough for this and we needed a more thorough check to verify that our OpenLDAP instances were up and working correctly. So we decided to implement this in software using the linux HA stack. The linux HA stack allows you to combine a few servers into a cluster to provide highly available services(s). In HA terminology the services provided by the cluster are called resources.

The HA stack is made of multiple components that work together to make resources available. The first of these is the heartbeat daemon. It runs on every single node (server) in the cluster and is responsible for ensuring that the nodes are alive and talking to each other. It also provides a framework for the other layers in the stack. Although there are bunch of other options you could use, a basic configuration tells heartbeat about the members in the cluster, establishes a communication mechanism between the members, and sets up an (secret) auth key to make sure that only nodes that know that key can join the cluster. Here is a sample config file for heartbeat.

[root@server1 ha.d]# cat /etc/ha.d/ha.cf
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
deadtime 30
keepalive 1
warntime 10
initdead 120
udpport 694
bcast bond0
mcast bond0 239.0.0.1 694 1 0
auto_failback on
node server1
node server2
debug 0
crm on
[root@server1 ha.d]#
[root@server1 ha.d]# cat /etc/ha.d/authkeys
auth 2
2 sha1 4BWtvO7NOO6PPnFX
[root@server1 ha.d]#

With the above configuration, we are establishing two modes of communication between the cluster members (server1 and server2), broadcast or multicast over the bond0 interface. Other communication methods are possible as well (serial cable, etc). In this case, since both modes are going over the same interface, its probably redundant and not all that fool-proof. The authkeys file establishes the secret key that nodes need to know to join this cluster.

Heartbeat by itself can also be used to manage and make the cluster resources available. However, it is limited to only two nodes in this configuration. A newer implementation was developed to remove this limitation and was spun off to become the pacemaker project. The last line "crm on" tells heartbeat that we will use an external Cluster Resource Manager (pacemaker in this case) to handle resources. Please note that there is a new software layer called OpenAIS that provides services similar to heartbeat. It is being developed jointly by RedHat and Suse and attempts to be a OSI certified implementation of the Application Interface Specification (AIS). I found it pretty confusing and decided to stick with heartbeat for our needs.

Pacemaker can be used to provide a variety of services and is frequently used to provide resources that access shared data. A common example is an nfs server that exports data from a shared block level layer (like a iscsi disk). Scenarios like this require that only one host in the cluster accesses this shared disk at any time. Bad things happen when multiple hosts try to write to a single shared physical disk simultaneously. In certain situations, member nodes in a cluster fail to relinquish these shared resources and must be cut off from the resources. Heartbeat relies on a service called stonith (Shoot The Other Node In The Head), which basically turns misbehaving hosts off in such cases. This service is usually hooked up to some sort of remote power management facility for the nodes in the cluster. Our situation doesn't need that stuff, so my configuration does not cover stonith. Disable stonith with "crm_attribute --type crm_config -n stonith-enabled -v false".

The pacemaker project provides binaries for almost all linux distributions (using the openSUSE Build Service - thanks guys!). Configuring pacemaker can seem daunting at first but googling should give you plenty of pointers. Pacemaker itself is split into a bunch of daemons that work together to manage your resources. These are the crm, lrm, etc... I strongly suggest reading through at least the first 10 pages or so of this document before continuing.

Now that you have read the doc, all that remains is to configure the resources your cluster provides. As indicated in the configuration above, we have two servers (server1 - a physical box and server2 - a backup VM). Either of these servers are capable of handling all our traffic. Server1 however is a pretty robust machine, so I want all our traffic going to just that machine (as long as it's working correctly). However, if the LDAP (slapd) instance on it gets corrupt for some reason or if I need to reboot the box for maintenance etc, I would like server2 to kick in, take over the floating vip and field requests. Both servers have LDAP slaves on them, that are running all the time.

Pacemaker comes with a host of configuration, management and monitoring tools. To begin with, configure heartbeat as shown above and start it on both the servers. On our second server we don't have a bonded interface, so bond0 in the config file above changes to eth0. Once heartbeat is up and running, run the crm_mon tool and wait for it to tell you that the cluster is in quorum and that one of the nodes has been elected as the DC. At that point you can quit it (with a CTRL-C).

Pacemaker depends on Resource Agents to start/stop and monitor your resources. These RAs are usually just scripts that are very similar to the standard linux init scripts with a few modifications. These come in two flavours, the older style heartbeat scripts and the newer OCF style scripts that support more features. This page talks about these scripts and the differences between the two styles. If you use the older heartbeat style scripts, keep in mind that pacemaker will not be able to monitor your resources. It will just take care of starting, stopping and migrating them as directed (by an admin). We had one minor oddity in our situation that we didn't really need to start/stop our LDAP slaves on these servers, as these slaves were always running. I had to hack a RA script to make it work for us. I will detail that in another post as this one is already getting to be pretty long!

To configure the resources managed by your cluster, use the crm command (in its configure mode). You can run this tool interactively or feed it a preset configuration script. I used the interactive mode quite a bit as it allows you to validate your configuration, make changes on the fly and deploy them pretty easily. For the sake of brevity, I am just going to list our configuration. Feed these into crm with "crm configure ..."

primitive ldap_service ocf:heartbeat:ldap \
meta migration-threshold="2" failure-timeout="90s" \
op monitor interval="5s" timeout="15s" start_delay="15s" disabled="false" on_fail="standby"
primitive ldap_vip ocf:heartbeat:IPaddr2 \
params ip="10.7.36.142"
group ldap ldap_service ldap_vip \
meta target_role="started" collocated="true"
location prefer_server1 ldap 10: server1

The first line defines a ldap RA and tells pacemaker that it is a OCF style resource, and that script is called ldap. The op line defines a monitor and tells it to monitor the resource every 5s and enables the monitor. It also states that the node should be put in a standby mode upon resource failure. The meta parameters say that the resource should be failed over after two failures, and that after 90s, the service is allowed to fail back to the primary server if desired.

The second line defines the next resource (the vip).

The third line defines a group that combines the above two resources and that these two resources should live together.

The last line tells that I'd prefer this group to live on server1 as much as possible.

You can verify your configuration with "crm configure verify" and activate it with "crm configure commit". At this point, pacemaker should activate your vip and the service. crm_mon should show these two resources to be up and running. If you want to fail your service manually to your backup server use "crm_resource -M -r ldap -N server2".

Note that I probably have some redundant configuration options in our setup here. If you spot any of those, or if you find any glaring errors, I'd appreciate the feedback. The cluster configuration guide I linked to earlier is your bible for this stuff. It details every single option you can use with crm and is written very well. You can also refer to docs here for sample configurations and other helpful pointers. HTH someone out in the ether!

Saturday, May 30, 2009

LDAP password policy - how I learned to stop worrying..

I have had to live through implementing a controversial password rotation policy at Mozilla. We had some bitter battles along the lines of "you dimwits, it doesn't do anything for security", and "you are a Nazi for trying to make me change my password so often". While I am still firmly in the "it's a good thing" camp, we have found OpenLDAP's support for password policy somewhat lacking. In particular, it does not distinguish between a few brain dead applications failing multiple times with a single incorrect password and a crack attempt with different incorrect passwords. I tried bringing it up in their mailing lists, but that thread didn't get too far. We have an its request on file as well. At this point, I wimped out and outsourced the problem. Enter Zytrax and the awesome Ron Aitchison, I can't recommend his "OpenLDAP - here is what all this gobbledegook means" guide enough. Jeff Clowser, Ron and I fleshed out the details in the spec and I am now happy to announce that we have patches that work against 2.4.11 and 2.4.16. It's been running on our servers for a few days now and seems to be holding up okay.

Here is how it works. The patch introduces a new attribute - pwdMaxTotalAttempts. Quoting from the README, 'The attribute may take one of three values. If pwdMaxTotalAttempts is zero (0) or not defined then no repeat password checking is perfomed. If pwdMaxTotalAttempts is -1 repeat password checking is performed and an unlimited number of attempts with any number (up to the limit defined by pwdMaxFailure) of repeat passwords are allowed'.

To disable this new behavior you don't have to do anything (i.e. pwdMaxTotalAttempts is not even defined). Also, explicitly setting pwdMaxTotalAttempts to 0 disables it. If you set it to -1, the new policy is enabled and repeat password attempts are tracked. Setting it to a positive number enables the policy as well, but also gives you some limited DoS protection. There are some risks to enabling the new module - it keeps track of your failed passwords (as SSHA hashes). So, proceed with caution when you enable the module.

HTH someone out in the ether.

Thursday, April 23, 2009

ldap acls - locking accounts out

One of the common questions/situations folks face in an ldap implementation is implementing some sort of locking mechanism for old accounts. We use the employeeType attribute in the inetOrgPerson schema. You could probably use any similar attribute (or even a custom one). One way to implement this locking is to add checks in your application code to remove such accounts. This strategy is bound to leak stuff (either due to application problems, or coding errors), sometimes this data may simply have multiple access points - through a public address book, or through folks directly accessing the directory data. A better way (imo) is to filter these accounts within the LDAP server itself. OpenLDAP allows you to set acls on specific filters. This works beautifully for cases like this. As an example, we have this acl in our slapd.conf file.

access to dn.children="ou=People,dc=mozilla" filter=(!(|(employeeType=Contractor)(employeeType=Employee)))
by group="cn=admins,ou=ldapgroups,dc=mozilla" write
by * none

This blocks out accounts with any other employeeType (other than employee or contractor) from everyone, except the administrators. Of course, this depends on you setting the employeeType attribute to some appropriate value (like retired) on inactive accounts.

Tuesday, April 21, 2009

iSCSI - fast, cheap and reliable?

iSCSI is often touted as the new way forward for enterprise storage needs. This is mostly a rant/information about what I have found useful so far.

To begin with iSCSI support in linux is pretty good and you will find a lot of resources on how to set it up etc. Here are the quick steps.
  • yum install iscsi-initiator-utils(or whatever tool you use to install stuff)
  • chkconfig iscsid on; chkconfig iscsi on (or update-rc)
  • echo "iscsi initiator name" > /etc/iscsi/initiatorname.iscsi (but put the real name in, make one up thats unique to this host). Make sure that the iscsi volume you want to mount now allows this initiator name to connect to it.
  • service iscsid start
  • iscsiadm -m discovery -t sendtargets -p IPofIScsiHost
  • service iscsi start (now you should see a device sdx appear in dmesg)
  • when you put it in /etc/fstab, you need "_netdev" in the options so that it doesn't try to mount it before the network is available.
(you can get fancy with CHAP initiator names etc, but for most cases they aren't needed).

That should present an iSCSI block device to your system. Check your dmesg for the device name. Unfortunately, I was unable to find a simple iSCSI command that lists the mapped device name and the corresponding iSCSI lun information. In the old days (with the cisco? iSCSI tools), there used to be a nice little iscsi-ls command that would list all your iSCSI devices and where they came from etc. You now have to manually map /proc/scsi/scsi entries with the /sys/class/scsi_disk/lun-number/ and dig for the entries that way.

I have found that iSCSI in general is not all that fast. You can get decent performance out of it, but nothing close to dedicated local disk and a good hardware raid solution. I have seen asymmetric read-write performance on iscsi disks - with writes being a lot faster than reads. Also, don't waste your money on hardware iSCSI initiators. The don't help a whole lot with speed and are a pita to maintain (with out-of-kernel drivers in some cases). Most modern servers have plenty of horsepower to drive iscsi using the CPUs. I have heard that running iscsi on a 10Gbps network helps (we run it on a 1 Gbps network) with the performance.

iSCSI is in general cheaper than comparable full SAN solutions, but, performance does suffer. This is okay for most cases, except maybe for i/o heavy db applications etc. Choosing the right vendor for your NAS device matters a lot. Don't believe the specs you get from the vendors, try a few of them, perform some tests (bonnie, plain old dd, iozone) and decide how much you want to sink into it.

HTH someone out in the ether!