My own personal soapbox!: 2012

or.. how to not go crazy managing a large cluster.

This post is an expansion of my talk at a local HBase meetup. I am going to go into a little more detail on our HBase setup and cluster automation and will hopefully give you ideas on how to build/manage your HBase infrastructure.

Server Specifications

SuperMicro boxes
Ubuntu Lucid running backported kernels
48 GB RAM (no swap)
Six 2 TB SATA disks (Hitachi Deskstar 7K, 64 MB cache)
Two quad-core Intel Xeon L5630 CPUs at 2.13 GHz
A single gigabit uplink per machine

Directory layout

All of our Hadoop/HBase processes run as the hadoop user. The configs for Hadoop and HBase are maintained in git and distributed to the servers via Puppet, which syncs them to the ~hadoop/hadoop_conf and ~hadoop/hbase_conf directories. One of our goals is to stay as close to the upstream release as possible, so we use the bits from the packaged binary builds directly. When we get a new build, the packaged binaries are expanded into the corresponding <product>-<rel>-<ver> directory. At any given time, the <product> symlink points to the active build. We then symlink each build's <product>/conf directory to the corresponding ~hadoop/<product>_conf directory synced via Puppet. This is how the directory listing looks.

~$ ls -l | sed -e 's/ \(.*[0-9]\) / /'
total 52
lrwxrwxrwx hadoop -> hadoop-0.20.2-cdh3u3
drwxr-xr-x hadoop-0.20.2-cdh3u3
drwxr-xr-x hadoop_conf
lrwxrwxrwx hbase -> /home/hadoop/hbase-0.90.5-a9d4c8d
drwxr-xr-x hbase-0.90.5-a9d4c8d
drwxr-xr-x hbase-0.90.5-fb2b8ca
drwxr-xr-x hbase_conf
drwxr-xr-x run
~$
~$ ls -l hbase-0.90.5-a9d4c8d | sed -e 's/ \(.*[0-9]\) / /'
total 3784
drwxr-xr-x bin
-rw-r--r-- CHANGES.txt
lrwxrwxrwx conf -> /home/hadoop/hbase_conf/
-rwxr-xr-x hbase-0.90.5.jar
-rwxr-xr-x hbase-0.90.5-tests.jar
lrwxrwxrwx hbase.jar -> hbase-0.90.5.jar
drwxr-xr-x hbase-webapps
drwxr-xr-x lib
-rw-r--r-- LICENSE.txt
-rw-r--r-- NOTICE.txt
-rw-r--r-- pom.xml
-rw-r--r-- README.txt
drwxr-xr-x src
~$
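
The layout above boils down to two symlink operations per release. Here is a minimal Python sketch of them, not the code from the fabfile. It assumes the release has already been extracted under the home directory; the function name activate_release and the temporary link name are made up for illustration.

```python
import os
import shutil


def activate_release(home, product, release):
    """Point <home>/<product> at an already-extracted release directory.

    Hypothetical helper: assumes <home>/<release> and <home>/<product>_conf
    already exist (the latter synced by Puppet).
    """
    release_dir = os.path.join(home, release)
    conf_dir = os.path.join(home, product + "_conf")
    conf_link = os.path.join(release_dir, "conf")

    # Swap the conf directory shipped in the tarball for a symlink to the
    # Puppet-managed configs.
    if os.path.islink(conf_link):
        os.remove(conf_link)
    elif os.path.isdir(conf_link):
        shutil.rmtree(conf_link)
    os.symlink(conf_dir, conf_link)

    # Create the new "active" link under a temporary name, then rename it
    # over the old one so the product path never disappears mid-switch.
    tmp_link = os.path.join(home, "." + product + ".new")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release, tmp_link)
    os.replace(tmp_link, os.path.join(home, product))
    return release_dir
```

Calling activate_release("/home/hadoop", "hbase", "hbase-0.90.5-a9d4c8d") would reproduce the hbase and conf links shown in the listing, except that the hbase link would be relative rather than absolute. Rolling back is the same call with the older release directory.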

Deployment automation

One of our goals is to deploy new HBase releases with zero downtime. We use Fabric to automate almost all of this process, and it is currently mostly hands-off. Some parts still need manual intervention, but it usually works pretty well. When we get a new HBase build, the deployment step looks like this.

fab prep_release:/home/stack/hbase-0.90.3-9fbaa99.tar.gz disable_balancer deploy_hbase:/home/stack/hbase-0.90.3-9fbaa99.tar.gz enable_balancer

This lays out the code (extracts it and makes the symlinks), pushes it to the regionserver machine and does a graceful restart of the node. After this, the HBase master has to be restarted to pick up the new code. This last step is currently manual.

A rolling restart of the cluster for new configs to take effect looks like this.

fab -P -z 3 rolling_restart
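
In Fabric, -P turns on parallel execution and -z sets the pool size, so the command above keeps at most three restarts in flight. The idea can be sketched in plain Python as below. This is only an illustration, not the fabfile's code: restart_one is a hypothetical callback standing in for the graceful stop and start of one regionserver, and threads stand in for Fabric's worker processes.

```python
from concurrent.futures import ThreadPoolExecutor


def rolling_restart(hosts, restart_one, pool_size=3):
    """Restart every host with at most `pool_size` restarts in flight.

    `restart_one` is a hypothetical callable that takes a hostname and
    returns once that node is back up.
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() yields results in host order and re-raises the first
        # exception, so a failed restart surfaces to the caller.
        return list(pool.map(restart_one, hosts))
```

A smaller pool size takes longer but leaves fewer regionservers offline at any moment, which is the trade-off the -z value controls.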

With parallel Fabric (available since release 1.3), these rolling restarts bounce multiple nodes at once; the -z flag controls how many. The 0.92 HBase release introduces the notion of draining nodes: nodes that will not be assigned any new regions. A node is marked as draining by creating an entry in ZooKeeper under the hbase_root/draining znode with the format "name,port,startcode", just like the regionserver entries under the hbase_root/rs znode. This makes it easier to gracefully drain multiple regionservers at the same time. This command puts regionservers "foo" and "bar" into the draining state.

fab add_rs_to_draining:hosts="foo,bar"
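
Since a draining entry reuses the "name,port,startcode" string from the rs znode, marking a host as draining amounts to finding its entry under hbase_root/rs and creating a znode with the same name under hbase_root/draining. Here is a rough sketch of the path computation only, with no ZooKeeper client involved. The function name draining_paths is hypothetical, and it assumes the hostnames you pass in match the name field of the rs entries exactly.

```python
def draining_paths(hbase_root, rs_children, hosts):
    """Return the znode paths that would mark `hosts` as draining.

    `rs_children` is the list of child names under <hbase_root>/rs, each in
    the "name,port,startcode" format. Hosts with no live rs entry are
    skipped, since there is no startcode to copy for them.
    """
    wanted = set(hosts)
    root = hbase_root.rstrip("/")
    paths = []
    for child in rs_children:
        name = child.split(",")[0]
        if name in wanted:
            # Reuse the rs entry verbatim so port and startcode match.
            paths.append("%s/draining/%s" % (root, child))
    return paths
```

With an example root of /hbase and an rs entry such as "foo,60020,1330000000001" (made-up values), the resulting path is /hbase/draining/foo,60020,1330000000001. Creating that znode is left to whatever ZooKeeper client you use.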

Here is the list of tasks we can handle with our current Fabric setup.

~$ fab -l
Available commands:

    add_rs_to_draining        Put the regionserver into a draining state.
    assert_configs            Check that all the region servers have the sam...
    assert_regions            Check that all the regions have been vacated f...
    assert_release            Check the release running on the server.
    clear_all_draining_nodes  Remove all servers under the Zookeeper /draini...
    clear_rs_from_draining    Remove the regionserver from the draining stat...
    deploy_hbase              Deploy the new hbase release to the regionserv...
    disable_balancer          Disable the balancer.
    dist_hadoop               Rsyncs the hadoop release to the region server...
    dist_hbase                Rsyncs the hbase release to the region servers...
    dist_release              Rsyncs the release to the region servers.
    enable_balancer           Balance regions and enable the balancer.
    hadoop_start              Start hadoop.
    hadoop_stop               Start hadoop.
    hbase_gstop               HBase graceful stop.
    hbase_start               Start hbase.
    hbase_stop                Stop hbase (WARNING: does not unload regions).
    jmx_kill                  Kill JMX collectors.
    list_draining_nodes       List all servers under the Zookeeper /draining...
    prep_release              Copies the tar file from face and extract it.
    reboot_server             Reboot the box.
    region_count              Returns a count of the number of regions in th...
    rolling_reboot            Rolling reboot of the whole cluster.
    rolling_restart           Rolling restart of the whole cluster.
    sync_puppet               Sync puppet on the box.
    thrift_restart            Re-start thrift.
    thrift_start              Start thrift.
    thrift_stop               Stop thrift.
    unload_regions            Un-load HBase regions on the server so it can ...
~$

Additional Notes

The fabfile and other scripts to run all of this are on github.

The latest version of the fabfile is meant for the 0.92 release. For older HBase releases look at this commit.
The older fabfile is meant to be run on one node at a time in serial, so the -P flag (parallel mode) will not work correctly.
I stole zkclient.py from here and added command line arguments to make it do some simple tasks I needed for manipulating ZooKeeper nodes. You will need the python-zookeeper libraries to make it work. I could not get the ZooKeeper cli_mt client to work correctly, which would have made zkclient.py unnecessary.
Ideally, I would have liked to use the python-zookeeper libraries directly from Fabric. However, the python-zookeeper libraries need threading support and that doesn't play nice with Fabric's parallel mode. It works fine in the serial mode.
It is important to drain HBase regions slowly when restarting regionservers. Otherwise, multiple regions go offline simultaneously as they are re-assigned to other nodes. Depending on your usage patterns, this might not be desirable.
The region_mover.rb script is an extension of the standard region_mover.rb that ships with stock HBase. I hacked it a little to add slow balancing support and automatic region balancing while unloading regions from a server. This version is also aware of draining servers and avoids them during region assignment and balancing. Again, look for the older commit if you want to use this with 0.90.x HBase releases. The latest version is for the 0.92 release.
We use Linux cgroups to contain the TaskTracker processes, so be aware of that if you plan on using this to manage your Hadoop cluster (remove that stuff if you don't need cgroups).
We grant the hadoop user sudo permissions to run puppet on our cluster nodes. You will need to do something similar if you want to manage configuration through Puppet/Fabric. Your life will be a lot easier if you set up no-password ssh logins from the master (or wherever you run fab from) to your regionserver nodes.
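
The note above about draining regions slowly comes down to moving one region at a time and pausing between moves. Here is a minimal Python sketch of that loop. It is not the logic from region_mover.rb: move_region is a hypothetical callback that reassigns a single region, and the one-second pause is an arbitrary example value.

```python
import time


def drain_slowly(regions, move_region, pause_seconds=1.0, sleep=time.sleep):
    """Move regions off a server one at a time, pausing between moves.

    `move_region` is a hypothetical callable that reassigns one region and
    returns when the move is done. `sleep` is injectable so the pause can
    be skipped in tests.
    """
    regions = list(regions)
    for index, region in enumerate(regions):
        move_region(region)
        # Pause between moves, but not after the last one.
        if index < len(regions) - 1:
            sleep(pause_seconds)
    return len(regions)
```

Only one region is in transition at any time, so clients see a brief blip per region instead of many regions going offline together.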

Hope this helps other folks with their HBase deployments.

Friday, March 2, 2012

HBase ops automation
