MS BI Guy Does Hadoop (Part 1 – Getting Started)

With Big Data and Hadoop getting a lot of attention nowadays, I’ve decided it’s time to take a look, so I am starting a log of my Hadoop adventures. I hope it’ll benefit Windows users, especially BI pros. If not, at least I’ll keep track of my experience so I can recreate it if needed. Before I start: for an excellent introduction to Hadoop from a Microsoft perspective, watch the talk Big Data – What is the Big Deal? by David DeWitt.

Previously, I got my feet wet with Apache Hadoop-based Services for Windows Azure, which is the Microsoft implementation of Hadoop in the cloud, but I was thirsty for more and wanted to dig deeper. Microsoft is currently working on a CTP of Hadoop-based Services for Windows, which will provide a supported environment for installing and running Hadoop on Windows. While waiting, O’Reilly was kind enough to send me a review copy of Hadoop: The Definitive Guide, 3rd Edition, by Tom White. Since Hadoop is an open-source project, I had to rediscover and relearn something I thought I would never have to touch again after my university days: Unix, or to be more precise, the CentOS Linux variant that is installed on the Cloudera VM. So, part 1 is about setting up your environment.

From the book, I discovered that Cloudera has a virtual machine for VirtualBox. I already had VirtualBox on my Windows 7 laptop so I could run SharePoint 2010 (available in x64 only). VirtualBox is a great piece of software that was originally developed by Innotek, later acquired by Sun Microsystems, and is now owned by Oracle. So, I decided to take the VM shortcut since I don’t have much time to mess around with Cygwin, Java, etc. After downloading and double-extracting the Cloudera file, I created a new VirtualBox machine and made the following changes.

[Screenshot: 060412_0245_MSBIGuyDoes1]

On the next step, I increased the memory to 2 GB (recommended by Cloudera). In the Virtual Hard Disk step, I chose the “Use existing hard disk” option and pointed it to the vmdk file I extracted from the Cloudera download. Then, on the Settings page for the new VM, I changed the storage to use the IDE controller instead of SATA, because Cloudera notes that the VM might have issues with SATA.

[Screenshot: 060412_0245_MSBIGuyDoes2]
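
For those who prefer the command line to the wizard, the same VM settings can probably be scripted with VBoxManage, which ships with VirtualBox. The following is only a rough sketch and not something I ran for this post: it assumes a bash-like shell with VBoxManage on the PATH, the function name create_cloudera_vm is made up for illustration, and the RedHat_64 OS type, the controller name, and the VM name and disk path you pass in are all assumptions you should adjust.

```shell
#!/usr/bin/env bash
# Sketch only: create a VM similar to the one built in the wizard above.
# create_cloudera_vm is a hypothetical helper, not part of VirtualBox.

create_cloudera_vm() {
  local vm_name="$1"    # any name you like for the new VM (placeholder)
  local disk_path="$2"  # path to the virtual disk extracted from the download

  # Register a new guest. RedHat_64 is an assumed OS type for a 64-bit CentOS guest.
  VBoxManage createvm --name "$vm_name" --ostype RedHat_64 --register

  # 2 GB of RAM, as recommended by Cloudera.
  VBoxManage modifyvm "$vm_name" --memory 2048

  # Use an IDE controller instead of SATA, then attach the existing disk to it.
  VBoxManage storagectl "$vm_name" --name "IDE Controller" --add ide
  VBoxManage storageattach "$vm_name" --storagectl "IDE Controller" \
    --port 0 --device 0 --type hdd --medium "$disk_path"
}
```

You would call it with a VM name and the path to the extracted disk file, for example create_cloudera_vm "cloudera-demo" "/path/to/extracted/disk.vmdk" (both values are placeholders).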

Once this was done, I was able to start the VM, which automatically logged me into CentOS as user cloudera. The first challenge I had to overcome was installing the VirtualBox Guest Additions for Linux, which let you resize the window and move the mouse cursor in and out without having to press the right Ctrl key. This turned out to be more difficult than expected. The final solution took the following steps:

  1. Once you’ve started the guest OS, in the VM menu toolbar click Install Guest Additions to mount the disk.
  2. Open the File Manager and navigate to the /etc/yum.repos.d folder. Right-click the folder and click Open Terminal Here.

    In the terminal window, type the following command to elevate your privileges:

    $ su

    Enter the password (cloudera) when prompted.

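  3. Open the vi editor to edit the Cloudera-cdh3.repo file, as mentioned in the Cloudera VM demo note, by typing this command (the quotes are needed so that su treats the file name as part of the command rather than as a user name):

    $ su -c "vi Cloudera-cdh3.repo"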

  4. Change the baseurl line so that it reads as shown on the last line below:

    [Cloudera-cdh3]
    name=Cloudera’s Distribution for Hadoop, Version 3
    enabled=1
    gpgcheck=0
    baseurl=http://archive.cloudera.com/redhat/cdh/3u4/

  5. Press ESC to go to command mode and type :wq to save and exit vi.

    Tip: To edit files in a more civilized way, click the File Manager icon in the menu bar at the bottom of the screen. However, as a regular user you won’t have permission to save system files. As a workaround, launch the File Manager with elevated permissions as follows:

    $ su -c Thunar

  6. Enter the following command (as root) to install a few utilities and the kernel development packages:

    $ yum install dkms binutils gcc make patch libgomp glibc-headers glibc-devel kernel-headers kernel-devel

  7. Then navigate to the media folder and run the Guest Additions installer:
    $ cd /media
    $ cd VBOXADDITIONS_4.1.16_78094
    $ ./VBoxLinuxAdditions.run

    This should install the Guest Additions successfully. If you see any error messages, install the additional packages they ask for with yum and run the installer again.
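
If you would rather not fight with vi, the baseurl edit in step 4 can likely be done non-interactively, and you can check afterwards whether the Guest Additions kernel module actually loaded. Here is a minimal sketch, assuming bash and the standard sed, grep, and lsmod tools on the guest. The function names set_repo_baseurl and guest_additions_loaded are my own and purely illustrative, and vboxguest is the module name I would expect the Guest Additions to load.

```shell
#!/usr/bin/env bash
# Sketch only: script the repo edit and sanity-check the Guest Additions.
# Both function names are hypothetical helpers invented for this example.

# Replace the baseurl line of a yum .repo file with a new URL.
# Run as root when the file lives under /etc/yum.repos.d.
set_repo_baseurl() {
  local repo_file="$1" new_url="$2"
  # '|' is used as the sed delimiter because the URL contains '/'.
  # (This simple version assumes the URL contains no '|' or '&'.)
  sed "s|^baseurl=.*|baseurl=${new_url}|" "$repo_file" > "${repo_file}.tmp" &&
    mv "${repo_file}.tmp" "$repo_file"
}

# Return success if the VirtualBox guest kernel module is listed by lsmod.
guest_additions_loaded() {
  lsmod | grep -q '^vboxguest'
}
```

For example, as root: set_repo_baseurl /etc/yum.repos.d/Cloudera-cdh3.repo http://archive.cloudera.com/redhat/cdh/3u4/ and, after running the installer and rebooting, guest_additions_loaded && echo "Guest Additions look OK".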

Next, you can verify the Hadoop installation by following the steps in the “Starting Hadoop and Verifying it is Working Properly” section of the Hadoop Quick Start Guide.