Unit Testing HBase TableMapper with MRUnit

MRUnit is a great tool to unit test your MapReduce jobs. It speeds up development quite a lot, saving the time of running test jobs against your cluster or even a local pseudo-distributed setup. Testing TableMappers is not very well documented, so here goes:

Assuming your TableMapper is set up like this:

Testing it with MRUnit will work like this:

The design here is a mapper that parses a record and only emits to the reducer when an error occurred. When it does, it outputs a record in the format [recordkey,errormessage]. The test creates the input and the expected output, then runs the test. MRUnit’s internals compare the actual output with the expected output, so there is no need for extra asserts.
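MRUnit itself is a Java library, so the following is only a language-neutral illustration of the test's shape, written in Python. The names `map_record` and `run_map_test` and the integer parsing are hypothetical stand-ins, not the mapper or the MRUnit API from this post.

```python
def map_record(row_key, value):
    """Emit (row_key, error_message) only when the record fails to parse."""
    try:
        int(value)  # stand-in for the real record parsing
    except ValueError:
        yield (row_key, "not a number: " + value)


def run_map_test(inputs, expected):
    """Run the mapper over all inputs and compare actual with expected output."""
    actual = [out for key, value in inputs for out in map_record(key, value)]
    assert actual == expected, f"expected {expected}, got {actual}"
    return actual


# One good record and one bad record: only the bad one should be emitted.
run_map_test(
    inputs=[("row1", "42"), ("row2", "abc")],
    expected=[("row2", "not a number: abc")],
)
```

As with MRUnit, the comparison lives inside the driver, so the test itself only declares input and expected output.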

If your HBase version doesn’t come with KeyValueTestUtil, you can implement it yourself easily. See the KeyValueTestUtil class.


HBase key design: Best of both worlds

In every key-value store, your key design is key. It is derived from your data access pattern, which in turn is probably derived from your use case, since at large scale there is no such thing as a generic solution.

What if you want to write serious amounts of data, but still want to be able to retrieve all records since timestamp X without resorting to bucketing*?

Usually this is pretty much a no-go. You either use randomly generated keys and get optimal write performance, but lose the ability to get all records since yesterday 1:00am, or you retain the ability to read records sequentially with incrementing or decrementing keys, but enter the domain of hotspotting a region and losing the power of your cluster. Lars George’s excellent book contains a nice graph showing the trade-off between write and read performance when using a fully random key, a sequential key or anything in between.


I needed a way to have both, without any of the drawbacks of bucketing*. It turns out this is possible if you add a constraint: Read before a major compaction, the quicker the better.

TL;DR: HBase pure Timerange scans before a major compaction are quite efficient.

Assume you are writing at 30-50k puts per regionserver using UUIDs as keys. Another process periodically reads the new data and processes it. Since HBase stores a timestamp with each record, used for versioning, you can leverage this timestamp to scan for records since timestamp X using the HBase API call setTimeRange(long, long). The problem is, it is very inefficient to scan purely on timerange. Or is it?

First a little on storage: HBase uses a Buffer-Flush-Merge pattern to store data. On a put, HBase writes to the WAL (write-ahead log), then to the memstore (RAM). The memstore is flushed periodically to storefiles, which in turn are merged by scheduled minor compactions and again by a major compaction. A minor compaction just merges the flush files; it runs regularly, about once for every 3 flush files created, and is fast because lexicographic key ordering has already been taken care of in memory. Processing tombstone markers (deletes) and merging row-level updates is only done in a major compaction, which by default runs only every 24 hours. A major compaction has to merge all storefiles, tombstones and updates into a completely new HFile, which is a lot more expensive. Most large sites therefore postpone major compactions to once every 7-10 days and trigger them manually, so as not to disturb key data processes.

In each storefile, aside from the data itself, HBase adds some metadata**. This metadata also contains the overall timerange of all the records in that HFile. This is how HBase can determine whether an HFile has any data within the requested timerange. So even if you do a regular partial key scan, always add a timerange if you can provide one, to increase scan performance.
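The file-skipping idea can be sketched in a few lines of Python. This is a simplified model and not HBase code: the `StoreFile` class, the file names and the timestamps are made up for illustration. The only assumption taken from HBase is that a time range scan includes the minimum timestamp and excludes the maximum.

```python
from dataclasses import dataclass


@dataclass
class StoreFile:
    name: str
    min_ts: int  # oldest cell timestamp recorded in the file metadata
    max_ts: int  # newest cell timestamp recorded in the file metadata


def files_to_read(store_files, scan_min, scan_max):
    """Keep only files whose timerange overlaps the scan range [scan_min, scan_max)."""
    return [f for f in store_files if f.max_ts >= scan_min and f.min_ts < scan_max]


# Before a major compaction: old data sits in its own file and can be skipped.
before = [
    StoreFile("compacted", 0, 900),
    StoreFile("flush-1", 901, 950),
    StoreFile("flush-2", 951, 1000),
]
print([f.name for f in files_to_read(before, 940, 1001)])  # ['flush-1', 'flush-2']

# After a major compaction: one big file covers everything and must be read.
after = [StoreFile("major", 0, 1000)]
print([f.name for f in files_to_read(after, 940, 1001)])  # ['major']
```

The second call shows why reading before a major compaction matters: once all data is in one file, the timerange metadata can no longer rule anything out.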

In summary, the read path:

  • Extremely efficient: reading from the memstore
  • Very efficient: reading from selected flushfiles
  • Efficient: reading from selected minor-compacted storefiles

The more recent you make the timerange scan (last 30 min, last 15 min, etc.), the more efficient your scan will become. This is great news for polling for the latest records for further processing, whilst still having optimal write performance using random keys, without the logic overhead of bucketing.


* Bucketing is a way to use a key like “<bucket>+SeqKey”. Typically the bucket is a hash reduced to as many values as there are regionservers. This allows an efficient write path and a slightly less efficient, but still OK, read path. The major drawback is the need for logic that adds buckets on puts, and that reads all buckets and merges them on reads. If you have multiple readers and writers in your architecture and want to be able to smoothly scale to more regionservers in the future, that would mean updating all read and write logic throughout your architecture.
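To make the extra logic concrete, here is a minimal Python sketch of bucketing. It uses plain in-memory lists instead of HBase, and the bucket count, the MD5-based hash and the key layout are assumptions chosen for illustration.

```python
import hashlib
import heapq

NUM_BUCKETS = 4  # assumption: one bucket per regionserver


def bucketed_key(seq_key: str) -> str:
    """Write path: prefix the sequential key with a hash-derived bucket."""
    digest = hashlib.md5(seq_key.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket:02d}+{seq_key}"


def read_since(buckets, start_seq_key):
    """Read path: scan every bucket from start_seq_key and merge the results.

    `buckets` maps a bucket id to its sorted list of sequential keys.
    """
    per_bucket = ([k for k in keys if k >= start_seq_key] for keys in buckets.values())
    return list(heapq.merge(*per_bucket))


print(read_since({0: ["a", "d"], 1: ["b", "c"]}, "b"))  # ['b', 'c', 'd']
```

Every reader and writer needs both functions and the same `NUM_BUCKETS`, which is exactly the coupling that makes adding regionservers later painful.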

** To see HFile metadata use the HFile tool:


Bash video recoding function

I’ve been recoding videos from different cameras lately, such as a Panasonic Lumix and the iPhone 4S and 3GS. Recoding to a more compatible and efficient codec was important, but rotating video turned out to be an unexpected requirement as well. It seems that the iPhone always records video in landscape, but uses a transformation matrix to record the orientation of the iPhone at the moment the record button was hit. The problem is that only Apple’s QuickTime uses this information when viewing the video. Open the video in any other player, like VLC, and the video will show up rotated 180 degrees.

FFmpeg is a trusty video encoder/recoder and, with the proper modules compiled in, it truly is the Swiss Army knife of video encoding. HTML5 is supported by most modern browsers, and is compatible with iPad, iPhone, Android etc. HTML5 supports H.264 video and AAC audio, which are great formats for storing high quality video efficiently. On Ubuntu libx264 is not supported out of the box, so you need to compile manually, following this guide.

Once you have the right ffmpeg built, it’s just a matter of finding the right settings and defining the proper filter sequence. I’ve created a simple bash function which does all the proper recoding, with a switch selecting the rotation (clockwise, counter-clockwise and flip). Using the bash function ‘vrecode [cw|ccw|flip|(blank)] <filename>’ you recode the video to mp4 with a horizontal resolution of 800 pixels and a CRF (constant rate factor) of 28. CRF determines quality; lower means higher bitrates are used. Just to give an impression, these settings compress a 180MB iPhone 4S movie of 1920×1080 to an HTML5-compatible movie of 800×450 of just 6.2MB. I use qt-faststart to allow browsers to start playing before downloading the whole file.
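The author's bash function is not reproduced here. As a rough sketch of the settings described above, the Python below only builds the ffmpeg argument list. The mapping of `cw`, `ccw` and `flip` to filters, the `_recoded` output name and the codec flags are assumptions; in particular `flip` is assumed to mean a 180-degree rotation.

```python
import os

# Assumed mapping from the rotation switch to ffmpeg video filters.
ROTATION_FILTERS = {
    "cw": "transpose=1",    # rotate 90 degrees clockwise
    "ccw": "transpose=2",   # rotate 90 degrees counter-clockwise
    "flip": "hflip,vflip",  # mirror both axes, i.e. rotate 180 degrees
    "": None,               # blank: no rotation
}


def vrecode_command(src, rotation="", width=800, crf=28):
    """Build an ffmpeg command that recodes src to H.264/AAC in an mp4 file."""
    if rotation not in ROTATION_FILTERS:
        raise ValueError(f"unknown rotation: {rotation!r}")
    filters = []
    if ROTATION_FILTERS[rotation]:
        filters.append(ROTATION_FILTERS[rotation])
    # -2 lets ffmpeg pick a height that keeps the aspect ratio and is even.
    filters.append(f"scale={width}:-2")
    out = os.path.splitext(src)[0] + "_recoded.mp4"  # hypothetical naming
    return [
        "ffmpeg", "-i", src,
        "-vf", ",".join(filters),
        "-c:v", "libx264", "-crf", str(crf),
        "-c:a", "aac",
        out,
    ]


print(" ".join(vrecode_command("clip.mov", "cw")))
```

The list can be passed to `subprocess.run` as-is; running `qt-faststart` on the result would be a separate second step.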

I’ve added it to my ‘bash_functions’ in my config on github.


Pig snippets

I’ve been creating a lot of Pig jobs lately, and writing basically the same lines of code for every job goes against all things Zen and DRY. So I’ve created a pig.snippet for the Snipmate Vim plugin. I wanted to keep the commands as clean as possible and, like in vim, have them make sense when ‘pronouncing’ them, like ‘su’ for STORE ... USING. Typing su followed by Ctrl-r <Tab> shows all the results Snipmate can provide.

Choosing ‘sua’, as in ‘STORE , USING, AvroStorage()’, results in:

Of course you don’t need to hit Ctrl-r <Tab> every time; just ‘sua<Tab>’ will give the same results. Check ‘pig.snippet’ for all 51 snippets.

The extras:

Solved a couple of repetitive tasks with these extras:

Registering all your .jar files can be a pain, this will make it easier:

  • regloc : ‘register local’. Searches for *.jar in ./ and registers them
  • reglib : ‘register lib’. Searches for *.jar in /user/lib/pig and registers them
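What the register snippets expand to is not shown here. As a rough idea of the result, this Python sketch lists the jar files in a local directory and emits one Pig `REGISTER` statement per file. The function name is hypothetical, and it only covers the local case, not a lookup in `/user/lib/pig`.

```python
from pathlib import Path


def register_statements(directory="."):
    """Return one Pig REGISTER statement for every *.jar file in directory."""
    jars = sorted(Path(directory).glob("*.jar"))
    return [f"REGISTER {jar.as_posix()};" for jar in jars]


if __name__ == "__main__":
    # Print the statements for the current directory, ready to paste into a script.
    for statement in register_statements("."):
        print(statement)
```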

Pig Latin has several join types, depending on whether the join is inner or outer:

  • ji : ‘join inner’. Shows options 1-4 for normal, skewed, replicated or merge inner join
  • jo : ‘join outer’. Shows options 1-3 for normal, skewed, replicated outer join

Compression options require you to set multiple mapred options:

  • setsc : set snappy compression. Outputs all options for job output, map output, tmpfile compression


You can find the snippets on my GitHub: https://github.com/rverk/snipmate-pig


Strange Love; or How I Learned to Stop Worrying and Love the vim

Vim is like a great game: easy to learn, difficult to master. Once you get over the initial bump and don’t mind changing old habits, there are rewards for the one who perseveres.

The five stages of  vim usage:

1. Denial: Every time you enter a shell which only has vi as an editor, you just know how to go to insert mode and :wq as quickly as possible. All the while asking yourself how high anyone would have been designing an editor with different modes.

2. Anger: You have become a developer on a Linux platform and now you notice you’re hanging out in shells more often than a clam. When you hear someone preaching vim you declare them idiots or purists who waste precious time being ‘elite’.

3. Bargaining: Using gedit, nano or joe as an IDE is all fun and good, but you start wondering if there are ways to increase your productivity. You google for ‘efficient text editor’, and I’ll be darned, it’s that vim again! You decide to take a look at what those elitists are on about. You bump into Derek Wyatt’s vim videos and decide to stick with the old editor, but Derek’s enthusiasm makes you give it a shot anyway.

4. Depression: You only now get what vim is all about: increasing your productivity. It is a very well thought through editor in which every fiber of its code’s existence was designed to make the user edit as efficiently as possible, even down to how far and how often your fingers have to physically move on the keyboard. You only now realize you’ve wasted years of your coding life editing in inferior editors.

5. Acceptance: Finally you decide to add the alias vi=vim to your .bashrc, start using vim seriously, and bathe in the bliss of knowing that you can edit much faster than before.

Of course like with all processes you will go through these phases again and again. Thinking you know all there is to know, only to discover you know so little.

I’ve only been using vim as my main editor for two years now and I realize I’ve only just scratched the surface of the possibilities, but I know I’m a lot more productive than I used to be. Once you get the hang of the vim way, you’ll start missing it once you’re in a different editor or IDE. Luckily the vim community is great, and tools like Vrapper even add the vim editing scheme to Eclipse.

Like all vim blogs will recommend, start small, build your config over time. Write down annoyances and look for solutions. Having said that, as a starter I’d highly recommend moving your ESC key to an alternative key, like your Caps-Lock. This makes escaping a small pinky move, instead of having to lift your hand.


Best resources:

  • vim tips wiki: Great resource to squeeze more power from vim
  • vim.org: Having a problem with vim? Chances are someone solved it for you in a nice plugin.
  • Derek Wyatt’s Blog: Novice, Intermediate and Advanced videos with a vim enthusiast
  • VimCasts: Short video tutorials

The config I use can be found on my GitHub.

Windows Media Center recording to NAS with iSCSI

I’ve had a long-standing beef with WMC, the HTPC solution with the highest WAF in my house. It does not allow a UNC path (i.e. \\server\share) or a mapped network drive as its location to store recordings. To keep my HTPC case cool and quiet I installed a small 60GB SSD, just for the OS and some extras (virtual machines). That is not enough for tons of recordings and a large live stream buffer. All my recordings, music and movies are stored on a NAS, the Synology DS207+. The goal is to get WMC’s recordings on there too, without using MS’s suggested solution of WHS and the Recorded TV Manager to just move stuff over once in a while.

The solution is simple: just set up iSCSI. Here is how:

1. Set up an iSCSI target on your network. How-tos for: Synology, Netgear, QNAP, or Ubuntu.

2. Connect to the target on Windows, Mac or Linux.

Windows will see this iSCSI target as a normal drive and will allow TV recording and live buffering to this target.

What is iSCSI?

iSCSI is a protocol that ‘wraps’ SCSI commands, normally sent directly to a hard disk, in TCP/IP, effectively allowing the physical separation of the OS and the disk. It is mostly used in enterprise storage solutions, but is available in most consumer NAS solutions nowadays.


How to add Google Translate to WordPress on individual posts

For my family blog I wanted to give the couple of non-native readers easy access to inline translation of my posts. This allows me to write the posts in one language and gives them an easy translator interface to get the gist of the message, which most of the time just supports the cute pics anyway.

There are lots of WP plugins available, most of them broken because Google discontinued their Translator API.

Luckily, with a little help from their Translator Toolkit you can add per-post translation quickly and, more importantly, independent of WordPress versions and functions.


Dit is een voorbeeld blogpost die vertaald wordt in het Engels. (Dutch for: “This is an example blog post that gets translated into English.”)


How to implement:

There are multiple ways of doing this. You could adjust the the_content() function from the WordPress post-template.php, but in order to retain forward compatibility I prefer to just modify the theme. If anything breaks, just revert to the original theme and you’re set.

1. Edit your Theme’s index.php file.
You can use your favourite editor if you have direct access to the webserver, or just use the admin dash and go Appearance->Editor->index.php.

The structure of the index.php depends on your theme, but all of them just loop through all your posts and list them out. So generally you will see something like this:

Notice the “while have_posts() … endwhile;” loop and “the_content()”. The WordPress function the_content() is defined in “wp-includes/post-template.php” and outputs the contents of the post. This is what needs translating, so that is what needs to be captured in Google’s code.

2. Insert the Translator Toolkit JavaScript into your index.php
Using the Translator Toolkit you effectively get the JavaScript you need to embed at the very end of index.php. This section will look something like this:

Append this code at the very end of your index.php.

3. Wrap the “the_content()” function in the proper DIV class.
What is less obvious from the Translator Toolkit page is that you need to embed the_content() into specific tags for translation. Modify the section of your index.php which lists your posts and wrap it in the defined sections, like below.


  • The lang=”nl” in the div class goog-trans-section is the source language of this section. Modify according to your needs.
  • The “hl=en” at the end of the code snippet is the target language you want to translate to. Modify accordingly.
  • The div class “goog-trans-control” can be placed above the text by placing it before the_content()
  • You can do the exact same thing for page.php to add the translate control to pages.

