Tuesday 1 January 2013

How to run Hadoop on Windows

One can spend only so much time surfing in 30C+ weather ;). So while my body was recovering from too much sunshine, I decided to play with Hadoop to learn first-hand what it actually is.

The easiest way to start is to download a preconfigured VMware image from Cloudera. This is what I did, and it worked, but not well. The highest resolution I could set was 1024x768. I installed the VMware client tools but they did not seem to work with the Linux distribution picked by Cloudera. I managed to figure out how to use vim to edit text files, but a tiny window with a flaky UI (you can see what is happening inside Hadoop using a web browser) was more than I could handle. Then I thought about getting it working on Mac OS X, which is a very close cousin of Linux. The installation process is simple but the configuration process is not.

So I googled a bit more and came across Microsoft HDInsight, which is Microsoft's distribution of Hadoop that runs on Windows and Windows Azure. HDInsight worked great for me on Windows 8 and I was able to play with the 3 most often used query APIs: the native Hadoop Java-based map/reduce framework, Hive and Pig. I used word count as a problem to see what each of them is capable of. Below are links to sample implementations:
  • Java map/reduce framework – run c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd to get into the command line interface for Hadoop
  • Pig – run C:\Hadoop\pig-0.9.3-SNAPSHOT\bin\pig.cmd to get into Grunt, the shell that lets you use Pig
  • Hive – run C:\Hadoop\hive-0.9.0\bin\hive.cmd to get into the Hive command line interface
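For a feel of what word count looks like in map/reduce terms, here is a minimal sketch in plain Java. It deliberately does not use the Hadoop API, so it runs with nothing but a JDK; the class name WordCountSketch and its methods are made up for illustration, and the in-memory grouping step only stands in for the shuffle that Hadoop would do across machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: mimics the map/shuffle/reduce steps in memory.
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all the counts collected for a single word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run map on every line, group the pairs by word, then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog", "The end");
        // prints {brown=1, dog=1, end=1, fox=1, lazy=1, quick=1, the=3}
        System.out.println(wordCount(lines));
    }
}
```

In a real Hadoop job the map and reduce steps would live in Mapper and Reducer classes and the framework would handle the grouping, but the shape of the problem is the same.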

As far as I know, Microsoft is going to contribute their changes back to the Hadoop project, so at some stage we might get Hadoop running natively on Windows in the same way Node.js does.