Friday, February 28, 2014

OpenTSDB proxy

We use OpenTSDB to store the majority of our time series server and application statistics at Tumblr. We recently began a project to migrate OpenTSDB from an existing HBase cluster running an older version of HBase to a new cluster with newer hardware running the latest stable version of HBase.

We wanted a way to have some historical data in the new cluster before we switched to it. Within Tumblr we have a variety of applications generating these metrics, and it was not practical for us to change all of them to double-write this data. Instead, we chose to replace the standard OpenTSDB listeners with a proxy that would do the double writing for us. While we could have used HBase's copy table or written our own tool to backfill historical data from the old cluster, double writing for an initial period allowed us to avoid adding load to our existing cluster. This strategy also allowed us to move queries for recent data to the new cluster earlier than the full cutover.

The tsd_proxy is written in Clojure and relies heavily on Lamina and Aleph, which in turn build on top of Netty. We have been using it in our production infrastructure for over two months now while sustaining writes at or above 175k/s (across the cluster), and it has been working well for us. We are open sourcing this proxy in the hope that others might find a use for it as well.

The tsd_proxy listens on a configurable port and can forward the incoming data stream to multiple end points. It can also filter the incoming stream, rejecting data points that don't match a (configurable) set of regular expressions, and it can queue the incoming stream and re-attempt delivery if one of the end points is down. The queue size can be capped so you don't blow through your heap. The README has more information on how to set this up.
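The filter-and-queue behaviour can be sketched in a few lines of Python (the real proxy is Clojure on Lamina/Aleph; the class name, patterns and metric lines here are made up for illustration):

```python
import re
from collections import deque

class TsdFilterQueue:
    """Sketch of tsd_proxy-style filtering and bounded queueing.
    Illustrative only -- the real proxy is written in Clojure."""

    def __init__(self, patterns, max_queue=10000):
        self.patterns = [re.compile(p) for p in patterns]
        # bounded queue so a dead end point can't blow through the heap
        self.queue = deque(maxlen=max_queue)

    def accept(self, line):
        """Queue the data point only if it matches an allowed pattern."""
        if any(p.search(line) for p in self.patterns):
            self.queue.append(line)
            return True
        return False

filt = TsdFilterQueue([r"^put sys\.", r"^put app\."], max_queue=1000)
filt.accept("put sys.cpu.user 1393545600 42 host=web01")  # matches -> queued
filt.accept("put junk.metric 1393545600 1 host=web01")    # rejected
```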

Friday, March 2, 2012

HBase ops automation

or.. how to not go crazy managing a large cluster.

This post is an expansion of my talk at a local HBase meetup. I am going to go into a little more detail on our HBase setup and cluster automation and will hopefully give you ideas on how to build/manage your HBase infrastructure.

Server Specifications

  • SuperMicro boxes
  • Ubuntu Lucid running backported kernels
  • 48 GB RAM (No swap)
  • Six 2 TB SATA drives - Hitachi Deskstar 7K, 64 MB cache
  • Two quad core - Intel Xeon CPU L5630 - 2.13GHz
  • Each of the machines uses a single gigabit uplink

Directory layout

All of our Hadoop/HBase processes run as the Hadoop user. The configs for Hadoop and HBase are maintained in git and are distributed to the servers via Puppet. These are synced to the ~hadoop/hadoop_conf and ~hadoop/hbase_conf directories on the servers. One of our goals is to stay as close to the upstream release as possible, so we use the bits from the packaged binary builds directly. When we get new builds, the packaged binaries are directly expanded into the corresponding <product>-<rel>-<ver> directories. At any given time, the active build is sym-linked to the corresponding product directory. We then symlink the <prod>/conf directories to the corresponding ~hadoop/<prod>_conf directories synced via puppet. This is how the directory listing looks.
~$ ls -l | sed -e 's/ \(.*[0-9]\) / /'
total 52
lrwxrwxrwx hadoop -> hadoop-0.20.2-cdh3u3
drwxr-xr-x hadoop-0.20.2-cdh3u3
drwxr-xr-x hadoop_conf
lrwxrwxrwx hbase -> /home/hadoop/hbase-0.90.5-a9d4c8d
drwxr-xr-x hbase-0.90.5-a9d4c8d
drwxr-xr-x hbase-0.90.5-fb2b8ca
drwxr-xr-x hbase_conf
drwxr-xr-x run
~$ ls -l hbase-0.90.5-a9d4c8d | sed -e 's/ \(.*[0-9]\) / /'
total 3784
drwxr-xr-x bin
-rw-r--r-- CHANGES.txt
lrwxrwxrwx conf -> /home/hadoop/hbase_conf/
-rwxr-xr-x hbase-0.90.5.jar
-rwxr-xr-x hbase-0.90.5-tests.jar
lrwxrwxrwx hbase.jar -> hbase-0.90.5.jar
drwxr-xr-x hbase-webapps
drwxr-xr-x lib
-rw-r--r-- LICENSE.txt
-rw-r--r-- NOTICE.txt
-rw-r--r-- pom.xml
-rw-r--r-- README.txt
drwxr-xr-x src
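The layout above boils down to two symlinks per product: the active build, and its conf directory pointing back at the puppet-synced configs. A minimal Python sketch of that step, using throwaway paths rather than our real home directory:

```python
import os
import tempfile

# Sketch of the release layout described above: expand the build into a
# versioned directory, then point the product symlink at the active build
# and the in-release conf symlink at the synced config directory.
home = tempfile.mkdtemp()
release = os.path.join(home, "hbase-0.90.5-a9d4c8d")
conf = os.path.join(home, "hbase_conf")
os.makedirs(release)
os.makedirs(conf)

# conf inside the release points back at the synced config directory
os.symlink(conf, os.path.join(release, "conf"))
# the active build is whatever the product symlink points at
os.symlink(release, os.path.join(home, "hbase"))

print(os.readlink(os.path.join(home, "hbase")))  # ends with hbase-0.90.5-a9d4c8d
```

Switching to a new build is then just repointing the `hbase` symlink, which is what makes rolling upgrades cheap.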

Deployment automation

One of our goals is to deploy new HBase releases with zero downtime. We use Fabric to automate almost all of this process, and it is currently mostly hands-off. Parts of it still require manual intervention, but it usually works pretty well. When we get a new HBase build to deploy, the deployment step looks like this.
fab prep-release:/home/stack/hbase-0.90.3-9fbaa99.tar.gz disable_balancer deploy-hbase:/home/stack/hbase-0.90.3-9fbaa99.tar.gz enable_balancer
This lays out the code (extracts it and makes the symlinks), pushes it to the regionserver machines, and does a graceful restart of each node. After this, a restart of the HBase master is required to make it use the new code. This last step is currently manual.

A rolling restart of the cluster for new configs to take effect looks like this.

fab -P -z 3 rolling-restart
With parallel Fabric (since release 1.3), these rolling restarts bounce multiple (the spread is controlled by the -z flag) nodes at once. The 0.92 HBase release introduces the notion of draining nodes - these are nodes that will not get any more new regions. A node is marked as a draining node by creating an entry in ZooKeeper under the hbase_root/draining znode with the format "name,port,startcode" just like the regionserver entries under hbase_root/rs znode. This makes it easier to gracefully drain multiple regionservers at the same time. This command puts regionservers "foo" and "bar" into the draining state.
fab add_rs_to_draining:hosts="foo,bar"
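The znode naming described above can be sketched as a tiny helper (hypothetical function name; the host and startcode values are made up for illustration):

```python
# Sketch of the draining-znode naming: the entry under hbase_root/draining
# mirrors the regionserver entry under hbase_root/rs, in the format
# "name,port,startcode".
def draining_znode(hbase_root, name, port, startcode):
    return "%s/draining/%s,%d,%d" % (hbase_root, name, port, startcode)

path = draining_znode("/hbase", "foo.example.com", 60020, 1330700000000)
print(path)  # /hbase/draining/foo.example.com,60020,1330700000000
```

Creating that znode (with any ZooKeeper client) is all it takes to mark the server as draining; deleting it puts the server back into rotation.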
Here is the list of tasks we can handle with our current Fabric setup.
~$ fab -l
Available commands:

    add_rs_to_draining        Put the regionserver into a draining state.
    assert_configs            Check that all the region servers have the sam...
    assert_regions            Check that all the regions have been vacated f...
    assert_release            Check the release running on the server.
    clear_all_draining_nodes  Remove all servers under the Zookeeper /draini...
    clear_rs_from_draining    Remove the regionserver from the draining stat...
    deploy_hbase              Deploy the new hbase release to the regionserv...
    disable_balancer          Disable the balancer.
    dist_hadoop               Rsyncs the hadoop release to the region server...
    dist_hbase                Rsyncs the hbase release to the region servers...
    dist_release              Rsyncs the release to the region servers.
    enable_balancer           Balance regions and enable the balancer.
    hadoop_start              Start hadoop.
    hadoop_stop               Stop hadoop.
    hbase_gstop               HBase graceful stop.
    hbase_start               Start hbase.
    hbase_stop                Stop hbase (WARNING: does not unload regions).
    jmx_kill                  Kill JMX collectors.
    list_draining_nodes       List all servers under the Zookeeper /draining...
    prep_release              Copies the tar file from face and extract it.
    reboot_server             Reboot the box.
    region_count              Returns a count of the number of regions in th...
    rolling_reboot            Rolling reboot of the whole cluster.
    rolling_restart           Rolling restart of the whole cluster.
    sync_puppet               Sync puppet on the box.
    thrift_restart            Re-start thrift.
    thrift_start              Start thrift.
    thrift_stop               Stop thrift.
    unload_regions            Un-load HBase regions on the server so it can ...

Additional Notes

The fabfile and other scripts to run all of this are on github.
  • The latest version of the fabfile is meant for the 0.92 release. For older HBase releases look at this commit.
  • The older fabfile is meant to be run on one node at a time in serial, so the -P flag (parallel mode) will not work correctly.
  • I stole from here and added command line arguments to make it do some simple tasks I needed for manipulating ZooKeeper nodes. You will need the python-zookeeper libraries to make it work. I could not get the ZooKeeper cli_mt client to work correctly, which would have made this unnecessary.
  • Ideally, I would have liked to use the python-zookeeper libraries directly from Fabric. However, the python-zookeeper libraries need threading support and that doesn't play nice with Fabric's parallel mode. It works fine in the serial mode.
  • It is important to drain HBase regions slowly when restarting regionservers. Otherwise, multiple regions go offline simultaneously as they are re-assigned to other nodes. Depending on your usage patterns, this might not be desirable.
  • The region_mover.rb script is an extension of the standard region_mover.rb that ships with stock HBase. I hacked it a little to add slow balancing support and automatic region balancing while unloading regions from a server. This version is also aware of draining servers and avoids them during region assignment and balancing. Again, look for the older commit if you want to use this with 0.90.x HBase releases. The latest version is for the 0.92 release.
  • We use Linux cgroups to contain the TaskTracker processes, so if you plan on using this to manage your Hadoop cluster, be aware of that (remove that stuff if you don't need cgroups).
  • We grant the hadoop user sudo permissions to run puppet on our cluster nodes; you will need to do something similar if you want to manage configuration through Puppet/Fabric. Your life will be a lot easier if you set up password-less ssh logins from the master (or wherever you run fab from) to your regionserver nodes.
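The slow-draining point above can be sketched as a throttled unload loop (hypothetical helper names; the real logic lives in region_mover.rb):

```python
import time

def unload_regions(regions, move_fn, batch=1, pause_s=0.0):
    """Move regions off a server a few at a time, sleeping between batches
    so only a small number of regions is ever offline at once.
    move_fn is a stand-in for the real region-move call."""
    moved = 0
    for i in range(0, len(regions), batch):
        for region in regions[i:i + batch]:
            move_fn(region)
            moved += 1
        # throttle: give re-assigned regions time to come back online
        time.sleep(pause_s)
    return moved

moved = unload_regions(["r1", "r2", "r3"], move_fn=lambda r: None, batch=2)
print(moved)  # 3
```

Tuning the batch size and pause is the trade-off between restart speed and how much data is briefly unavailable.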

Hope this helps other folks with their HBase deployments.

Tuesday, January 31, 2012

building mesos..

This took me a while to figure out; maybe others have faster ways, but I am posting it here so it might help others. This works for me as of Tue Jan 31 12:55:17 PST 2012 and commit 079614aea80cfc7282c2a80de6e84c896df776c0. These instructions are for 64-bit Ubuntu Lucid.
  • sudo apt-get install g++ git automake libtool libltdl-dev python-dev swig python-setuptools (not sure libltdl-dev is needed, but libtool suggested it, so I figured, why not!)
  • git clone git://
  • cd mesos
  • ./bootstrap
  • ./configure --with-webui \
    --with-java-home=/usr/lib/jvm/java-6-sun \
    --with-python-headers=/usr/include/python2.6 \
    --with-included-zookeeper
  • make -j 2 ... ... at some point this will fail while building zookeeper.
  • vi third_party/zookeeper-3.3.1/src/c/
  • comment out lines 25 to 44
  • make -j 2 ... ... This will again fail with some libtool version compatibility problems. I don't know the auto* tools well enough to understand why. Posts on the mesos mailing list suggest autoreconf. That worked for me.
  • autoreconf -fi
  • make -j 2
and finally.. success.. I am sure there are more elegant and less black-magic ways of doing this, but I lack the knowledge and patience to figure them out. I hope this helps someone else trying to get a basic mesos build working.

Tuesday, September 13, 2011

personal and social conflicts..

This is probably more a post for plus than a blog post, but it got too long. So here it goes..

I have been reading a parenting book by John Medina - Brain Rules for Baby (highly recommended), trying to morph into a good parent and everything... anyhoo.. I found this little nugget in the book. The author is describing some observations by two sociologists, Edward Jones and Richard Nisbett - "People view their own behaviors as originating from amendable, situational constraints, but they view others' behaviors as originating from inherent, immutable personality traits". Thinking back to a lot of my everyday work, home and family experiences - this explains things so well.. every time I feel like I am right, or every time I think things ought to be done differently, maybe it's just my asymmetric brain failing to see the other side of the problem.

After reading that.. I couldn't help but admire these sociologists - reducing everything down to predictable, simple facts and observations.

anyways.. thought I'd share. Next time, a little more putting yourself in the other person's shoes and a little less of the "jump to conclusions" mat!

Wednesday, February 16, 2011

Moving on from Mozilla..

After about five and a half years at MoCo, I have decided to move on to a new position at a startup in San Francisco. My last day at MoCo will be Feb 28th. It was a hard decision for me, especially looking at everything I had a chance to work on and help build over the years, but I feel it's the right one. Working at MoCo has been one of the most challenging and satisfying jobs I have ever had, and a huge thanks to all of the folks at Mozilla for making it that. I will still be living and working in the Bay Area, so I do hope to keep in touch with folks here.

In my new role, I will be working as an operations engineer with a focus on developing tools to better understand and monitor Hadoop environments (at least initially). The goal would be to open source these tools and contribute them to the community.

Hope to see you on the other side..

Wednesday, April 7, 2010

shelldap to the rescue!

I just discovered shelldap through my trusty dselect.. (I know, I am old and lazy and not in touch with the times, I really should be using apt-cache, but wth!). Anyhoo... shelldap is a pseudo shell on top of an LDAP DIT. You can cd to the different branches, grep within them for entries, and edit individual entries in LDIF format with your favorite editor!

Other tools like phpldapadmin and ldapsearch have their uses, but this is the most usable LDAP browsing and editing tool I have found so far. Figured someone else out there might find a use for it.. Thanks, Mahlon E. Smith, for shelldap!

Friday, September 4, 2009

A two node HA cluster - mini howto

One of our goals this quarter has been to make our LDAP service more reliable. We tried using the Cisco ACE load balancer in front of two LDAP slaves, but that doesn't allow for custom application checks. Simple port checks aren't good enough for this, and we needed a more thorough check to verify that our OpenLDAP instances were up and working correctly. So we decided to implement this in software using the Linux HA stack, which allows you to combine a few servers into a cluster to provide highly available service(s). In HA terminology, the services provided by the cluster are called resources.

The HA stack is made of multiple components that work together to make resources available. The first of these is the heartbeat daemon. It runs on every node (server) in the cluster and is responsible for ensuring that the nodes are alive and talking to each other. It also provides a framework for the other layers in the stack. Although there are a bunch of other options you could use, a basic configuration tells heartbeat about the members in the cluster, establishes a communication mechanism between the members, and sets up a (secret) auth key to make sure that only nodes that know that key can join the cluster. Here is a sample config file for heartbeat.

[root@server1 ha.d]# cat /etc/ha.d/
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
deadtime 30
keepalive 1
warntime 10
initdead 120
udpport 694
bcast bond0
mcast bond0 694 1 0
auto_failback on
node server1
node server2
debug 0
crm on
[root@server1 ha.d]#
[root@server1 ha.d]# cat /etc/ha.d/authkeys
auth 2
2 sha1 4BWtvO7NOO6PPnFX
[root@server1 ha.d]#

With the above configuration, we are establishing two modes of communication between the cluster members (server1 and server2): broadcast and multicast over the bond0 interface. Other communication methods are possible as well (serial cable, etc.). In this case, since both modes go over the same interface, it's probably redundant and not all that fool-proof. The authkeys file establishes the secret key that nodes need to know to join this cluster.

Heartbeat by itself can also be used to manage and make the cluster resources available. However, it is limited to only two nodes in this configuration. A newer implementation was developed to remove this limitation and was spun off to become the pacemaker project. The last line "crm on" tells heartbeat that we will use an external Cluster Resource Manager (pacemaker in this case) to handle resources. Please note that there is a newer software layer called OpenAIS that provides services similar to heartbeat. It is being developed jointly by Red Hat and SUSE and attempts to be an OSI-certified implementation of the Application Interface Specification (AIS). I found it pretty confusing and decided to stick with heartbeat for our needs.

Pacemaker can be used to provide a variety of services and is frequently used to provide resources that access shared data. A common example is an NFS server that exports data from a shared block-level layer (like an iSCSI disk). Scenarios like this require that only one host in the cluster accesses the shared disk at any time; bad things happen when multiple hosts try to write to a single shared physical disk simultaneously. In certain situations, member nodes in a cluster fail to relinquish these shared resources and must be cut off from them. Heartbeat relies on a service called STONITH (Shoot The Other Node In The Head), which basically turns misbehaving hosts off in such cases. This service is usually hooked up to some sort of remote power management facility for the nodes in the cluster. Our situation doesn't need any of that, so my configuration does not cover stonith. Disable stonith with "crm_attribute --type crm_config -n stonith-enabled -v false".

The pacemaker project provides binaries for almost all Linux distributions (using the openSUSE Build Service - thanks, guys!). Configuring pacemaker can seem daunting at first, but googling should give you plenty of pointers. Pacemaker itself is split into a bunch of daemons that work together to manage your resources (the crm, lrm, etc.). I strongly suggest reading through at least the first 10 pages or so of this document before continuing.

Now that you have read the doc, all that remains is to configure the resources your cluster provides. As indicated in the configuration above, we have two servers (server1, a physical box, and server2, a backup VM). Either of these servers is capable of handling all our traffic. Server1, however, is a pretty robust machine, so I want all our traffic going to just that machine (as long as it's working correctly). However, if the LDAP (slapd) instance on it gets corrupt for some reason, or if I need to reboot the box for maintenance, I would like server2 to kick in, take over the floating VIP and field requests. Both servers have LDAP slaves on them that are running all the time.

Pacemaker comes with a host of configuration, management and monitoring tools. To begin with, configure heartbeat as shown above and start it on both the servers. On our second server we don't have a bonded interface, so bond0 in the config file above changes to eth0. Once heartbeat is up and running, run the crm_mon tool and wait for it to tell you that the cluster is in quorum and that one of the nodes has been elected as the DC. At that point you can quit it (with a CTRL-C).

Pacemaker depends on Resource Agents to start/stop and monitor your resources. These RAs are usually just scripts that are very similar to the standard Linux init scripts, with a few modifications. They come in two flavours: the older-style heartbeat scripts, and the newer OCF-style scripts that support more features. This page talks about these scripts and the differences between the two styles. If you use the older heartbeat-style scripts, keep in mind that pacemaker will not be able to monitor your resources; it will just take care of starting, stopping and migrating them as directed (by an admin). We had one minor oddity in our situation: we didn't really need to start/stop our LDAP slaves on these servers, as they were always running. I had to hack an RA script to make it work for us. I will detail that in another post, as this one is already getting pretty long!
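For reference, the monitor action of an OCF-style RA boils down to mapping a health probe onto the standard OCF exit codes (0 for running, 7 for cleanly stopped). A minimal Python sketch with a placeholder probe (real RAs are shell scripts, and a real probe would be something like an LDAP search against the local slave):

```python
# Standard OCF exit codes used by the monitor action.
OCF_SUCCESS = 0       # resource is running and healthy
OCF_ERR_GENERIC = 1   # something went wrong
OCF_NOT_RUNNING = 7   # resource is cleanly stopped

def monitor(probe):
    """Map a health probe (callable returning True/False) to OCF codes.
    The probe is a placeholder for a real application-level check."""
    try:
        return OCF_SUCCESS if probe() else OCF_NOT_RUNNING
    except Exception:
        return OCF_ERR_GENERIC

print(monitor(lambda: True))   # 0
print(monitor(lambda: False))  # 7
```

Getting these exit codes right matters: pacemaker decides whether to restart, fail over, or leave a resource alone based on them.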

To configure the resources managed by your cluster, use the crm command (in its configure mode). You can run this tool interactively or feed it a preset configuration script. I used the interactive mode quite a bit as it allows you to validate your configuration, make changes on the fly and deploy them pretty easily. For the sake of brevity, I am just going to list our configuration. Feed these into crm with "crm configure ..."

primitive ldap_service ocf:heartbeat:ldap \
meta migration-threshold="2" failure-timeout="90s" \
op monitor interval="5s" timeout="15s" start_delay="15s" disabled="false" on_fail="standby"
primitive ldap_vip ocf:heartbeat:IPaddr2 \
params ip=""
group ldap ldap_service ldap_vip \
meta target_role="started" collocated="true"
location prefer_server1 ldap 10: server1

The first line defines an ldap RA and tells pacemaker that it is an OCF-style resource and that the script is called ldap. The op line defines a monitor, tells pacemaker to check the resource every 5s, and enables the monitor. It also states that the node should be put into standby mode upon resource failure. The meta parameters say that the resource should be failed over after two failures, and that after 90s the service is allowed to fail back to the primary server if desired.

The second line defines the next resource (the vip).

The third line defines a group that combines the above two resources and says that they should live together.

The last line tells pacemaker that I'd prefer this group to live on server1 as much as possible.

You can verify your configuration with "crm configure verify" and activate it with "crm configure commit". At this point, pacemaker should activate your vip and the service. crm_mon should show these two resources to be up and running. If you want to fail your service manually to your backup server use "crm_resource -M -r ldap -N server2".

Note that I probably have some redundant configuration options in our setup here. If you spot any of those, or if you find any glaring errors, I'd appreciate the feedback. The cluster configuration guide I linked to earlier is your bible for this stuff. It details every single option you can use with crm and is written very well. You can also refer to docs here for sample configurations and other helpful pointers. HTH someone out in the ether!