Friday, February 28, 2014

OpenTSDB proxy

We use OpenTSDB to store the majority of our time series server and application statistics at Tumblr. We recently began a project to migrate OpenTSDB from an existing HBase cluster running an older version of HBase to a new cluster with newer hardware and running the latest stable version of Hbase.

We wanted a way to have some historical data in the new cluster before we switched to it. Within Tumblr we have a variety of applications generating these metrics and it was not very practical for us to change all of them to double write this data. Instead, we chose to replace the standard OpenTSDB listeners with a proxy that would do this double writing for us. While we could have used HBase copy table or written our own tool to backfill historical data from the old cluster, double writing for an initial period allowed us to avoid adding additional load on our existing cluster. This strategy also allowed us to move queries for recent data to new cluster earlier than the full cutover.

The tsd_proxy is written in Clojure and relies heavily on the Lamina and Aleph which in turn build on top of Netty. We have been using this in our production infrastructure for over two months now while sustaining writes at or above 175k/s (across the cluster) and it has been working well for us. We are open sourcing this proxy in the hope that others might find a use for this as well.

The tsd_proxy listens on a configurable port and can forward the incoming data stream to multiple end points. It also has the ability to filter the incoming stream and reject data points that don't match a (configurable) set of regular expressions. It also has the ability to queue the incoming stream and re-attempt delivery if one of the end points is down. It is also possible to limit the queue size so you don't blow through your heap. The README has some more information on how to set this up.

Friday, March 2, 2012

HBase ops automation

or.. how to not go crazy managing a large cluster.

This post is an expansion of my talk at a local HBase meetup. I am going to go into a little more detail on our HBase setup and cluster automation and will hopefully give you ideas on how to build/manage your HBase infrastructure.

Server Specifications

  • SuperMicro boxes
  • Ubuntu Lucid running backported kernels
  • 48 GB RAM (No swap)
  • Six SATA 2T - Hitachi Deskstar 7K, 64MB cache
  • Two quad core - Intel Xeon CPU L5630 - 2.13GHz
  • Each of the machines uses a single gigabit uplink

Directory layout

All of our Hadoop/HBase processes run as the Hadoop user. The configs for Hadoop and HBase are maintained in git and are distributed to the servers via Puppet. These are synced to the ~hadoop/hadoop_conf and ~hadoop/hbase_conf directories on the servers. One of our goals is to stay as close to the upstream release as possible, so we use the bits from the packaged binary builds directly. When we get new builds, the packaged binaries are directly expanded into the corresponding <product>-<rel>-<ver> directories. At any given time, the active build is sym-linked to the corresponding product directory. We then symlink the <prod>/conf directories to the corresponding ~hadoop/<prod>_conf directories synced via puppet. This is how the directory listing looks.
~$ ls -l | sed -e 's/ \(.*[0-9]\) / /'
total 52
lrwxrwxrwx hadoop -> hadoop-0.20.2-cdh3u3
drwxr-xr-x hadoop-0.20.2-cdh3u3
drwxr-xr-x hadoop_conf
lrwxrwxrwx hbase -> /home/hadoop/hbase-0.90.5-a9d4c8d
drwxr-xr-x hbase-0.90.5-a9d4c8d
drwxr-xr-x hbase-0.90.5-fb2b8ca
drwxr-xr-x hbase_conf
drwxr-xr-x run
~$ ls -l hbase-0.90.5-a9d4c8d | sed -e 's/ \(.*[0-9]\) / /'
total 3784
drwxr-xr-x bin
-rw-r--r-- CHANGES.txt
lrwxrwxrwx conf -> /home/hadoop/hbase_conf/
-rwxr-xr-x hbase-0.90.5.jar
-rwxr-xr-x hbase-0.90.5-tests.jar
lrwxrwxrwx hbase.jar -> hbase-0.90.5.jar
drwxr-xr-x hbase-webapps
drwxr-xr-x lib
-rw-r--r-- LICENSE.txt
-rw-r--r-- NOTICE.txt
-rw-r--r-- pom.xml
-rw-r--r-- README.txt
drwxr-xr-x src

Deployment automation

One of our goals is to deploy new HBase releases with zero downtime. We use Fabric to automate almost all of this process, and it is currently mostly hands-off. There are parts of this, that are still prone to manual intervention, but it usually works pretty well. When we get a new HBase build to be deployed, the deployment step looks like this.
fab prep-release:/home/stack/hbase-0.90.3-9fbaa99.tar.gz disable_balancer deploy-hbase:/home/stack/hbase-0.90.3-9fbaa99.tar.gz enable_balancer
This lays out (extracts code, makes symlinks) the code, pushes it to the regionserver machine and does a graceful restart of the node. After this, a restart of the HBase master is required to make it use the new code. This last step is currently manual.

A rolling restart of the cluster for new configs to take effect looks like this.

fab -P -z 3 rolling-restart
With parallel Fabric (since release 1.3), these rolling restarts bounce multiple (the spread is controlled by the -z flag) nodes at once. The 0.92 HBase release introduces the notion of draining nodes - these are nodes that will not get any more new regions. A node is marked as a draining node by creating an entry in ZooKeeper under the hbase_root/draining znode with the format "name,port,startcode" just like the regionserver entries under hbase_root/rs znode. This makes it easier to gracefully drain multiple regionservers at the same time. This command puts regionservers "foo" and "bar" into the draining state.
fab add_rs_to_draining:hosts="foo,bar"
Here is the list of tasks we can handle with our current Fabric setup.
~$ fab -l
Available commands:

    add_rs_to_draining        Put the regionserver into a draining state.
    assert_configs            Check that all the region servers have the sam...
    assert_regions            Check that all the regions have been vacated f...
    assert_release            Check the release running on the server.
    clear_all_draining_nodes  Remove all servers under the Zookeeper /draini...
    clear_rs_from_draining    Remove the regionserver from the draining stat...
    deploy_hbase              Deploy the new hbase release to the regionserv...
    disable_balancer          Disable the balancer.
    dist_hadoop               Rsyncs the hadoop release to the region server...
    dist_hbase                Rsyncs the hbase release to the region servers...
    dist_release              Rsyncs the release to the region servers.
    enable_balancer           Balance regions and enable the balancer.
    hadoop_start              Start hadoop.
    hadoop_stop               Start hadoop.
    hbase_gstop               HBase graceful stop.
    hbase_start               Start hbase.
    hbase_stop                Stop hbase (WARNING: does not unload regions).
    jmx_kill                  Kill JMX collectors.
    list_draining_nodes       List all servers under the Zookeeper /draining...
    prep_release              Copies the tar file from face and extract it.
    reboot_server             Reboot the box.
    region_count              Returns a count of the number of regions in th...
    rolling_reboot            Rolling reboot of the whole cluster.
    rolling_restart           Rolling restart of the whole cluster.
    sync_puppet               Sync puppet on the box.
    thrift_restart            Re-start thrift.
    thrift_start              Start thrift.
    thrift_stop               Stop thrift.
    unload_regions            Un-load HBase regions on the server so it can ...

Additional Notes

The fabfile and other scripts to run all of this are on github.
  • The latest version of the fabfile is meant for the 0.92 release. For older HBase releases look at this commit.
  • The older fabfile is meant to be run on one node at a time in serial, so the -P flag (parallel mode) will not work correctly.
  • I stole from here and added command line arguments to make it do some simple tasks I needed for manipulating ZooKeeper nodes. You will need the python-zookeeper libraries to make it work. I could not get the ZooKeeper cli_mt client to work correctly, which would have made un-necessary.
  • Ideally, I would have liked to use the python-zookeeper libraries directly from Fabric. However, the python-zookeeper libraries need threading support and that doesn't play nice with Fabric's parallel mode. It works fine in the serial mode.
  • It is important to drain HBase regions slowly when restarting regionservers. Otherwise, multiple regions go offline simultaneously as they are re-assigned to other nodes. Depending on your usage patterns, this might not be desirable.
  • The region_mover.rb script is an extension of the standard region_mover.rb that ships with stock HBase. I hacked it a little to add slow balancing support and automatic region balancing while unloading regions from a server. This version is also aware of draining servers and avoids them during region assignment and balancing. Again, look for the older commit if you want to use this with 0.90.x HBase releases. The latest version is for the 0.92 release.
  • We use linux cgroups to contain the TaskTracker processes, so if you plan on using this to manage your Hadoop cluster - be aware of that (remove that stuff, if you don't need cgroups).
  • We grant the hadoop user sudo permissions to run puppet on our cluster nodes, you will need to do something similar if you want to manage configuration through Puppet/Fabric. Your life will be a lot easier if you setup no-password ssh logins from the master (or wherever you run fab from) to your regionserver nodes.

Hope this helps other folks with their HBase deployments.

Tuesday, January 31, 2012

building mesos..

This took me a while to figure out, maybe others have faster ways.. but I am posting it here so it might help others. This wfm as of Tue Jan 31 12:55:17 PST 2012 and commit 079614aea80cfc7282c2a80de6e84c896df776c0. These instructions are for 64-bit Ubuntu Lucid.
  • sudo apt-get install g++ git automake libtool libltdl-dev python-dev swig python-setuptools (not sure libltdl-dev is needed, but libtool suggested it, so I figured, why not!)
  • git clone git://
  • cd mesos
  • ./bootstrap
  • ./configure --with-webui \
    --with-java-home=/usr/lib/jvm/java-6-sun \
    --with-python-headers=/usr/include/python2.6 \
    --with-included-zookeeper \
  • make -j 2 ... ... at some point this will fail while building zookeeper.
  • vi third_party/zookeeper-3.3.1/src/c/
  • comment out lines 25 to 44
  • make -j 2 ... ... This will again fail with some libtool version compatibility problems. I don't know the auto* tools well enough to understand why. Posts on the mesos mailing list suggest autoreconf. That worked for me.
  • autoreconf -fi
  • make -j 2
and finally.. sucess.. I am sure there are more elegant and less black-magic ways of doing this, but I lack the knowledge and patience to figure them out. I hope this helps someone else trying to get a basic mesos build working.

Tuesday, September 13, 2011

personal and social conflicts..

This is probably more a post on plus rather than a blog post, but it got too long. So here it goes..

I have been reading a parenting book by John Medina - Brain rules for baby(highly recommended), trying to morph into a good parent and everything... anyhoo.. I found this little nugget in the book. The author is describing some observations by two sociologists Edward Jones and Richard Nisbett - "People view their own behaviors as originating from amendable, situational constraints, but they view others behaviours as originating from inherent, immutable personality traits". Thinking back to a lot of my everyday work, home and family experiences - this explains things so well.. everytime I feel like I am right, or everytime I think things ought to be done differently, maybe it's just my assymmetric brain failing to see the other side of the problem.

After reading that.. I couldn't help but admire these sociologists - reducing everything down to predictable, simple facts and observations.

anyways.. thought I'd share. Next time, a little more putting yourself in the other persons shoes and a little less of the "jump to conclusions" mat!

Wednesday, February 16, 2011

Moving on from Mozilla..

After about five and a half years at MoCo, I have decided to move on to a new position at a startup in San Francisco. My last day at MoCo will be Feb 28'th. It was a hard decision for me, especially looking at everything I had a chance to work on and help build over the years, but I feel it's the right one. Working at MoCo has been one of the most challenging and satisfying jobs I have ever had and a huge thanks to all of the folks at Mozilla for making it that. I will still be living and working in the bay area, so I do hope to keep in touch with folks here.

In my new role, I will be working as an operations engineer with a focus on developing tools to better understand and monitor Hadoop environments (at least initially). The goal would be to open source these tools and contribute them to the community.

Hope to see you on the other side..

Wednesday, April 7, 2010

shelldap to the rescue!

I just discovered shelldap through my trusty dselect.. (I know, I am old and lazy and not in touch with the times, I really should be using apt-cache, but wth!). Anyhoo... shelldap is a pseudo shell on top of a LDAP DIT. You can cd to the different branches, grep within them for entries and edit individual entries in an LDIF format with your favorite editor!

Other tools like phpldapadmin, ldapsearch have their uses, but this is the most usable ldap browsing, editing tool I found so far. Figured someone else out there might find a use for it.. Thanks Mahlon E. Smith for shelldap!