MS Guy Does Hadoop (Part 3 – Hive)

Writing MapReduce jobs in Java works well enough for simple analytics or for distributing processing work, but it quickly becomes challenging for more involved scenarios, such as joining two datasets. This is where Hive comes in. Hive was originally developed by the Facebook data warehousing team after they concluded that “… developers ended up spending hours (if not days) to write programs for even simple analyses”. Instead, Hive offers a SQL-like language that auto-generates the MapReduce code.
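To get a sense of the difference, here is a minimal sketch of a HiveQL join. The readings and stations tables and their columns are hypothetical, purely to illustrate how little code a join takes compared to hand-written MapReduce:

-- Hypothetical tables, for illustration only

SELECT s.station_name, r.year, r.temperature

FROM readings r

JOIN stations s ON (r.station_id = s.station_id);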

The Hive Shell

Hive introduces the notion of a “table” on top of data. It has its own shell, which you invoke by typing “hive” at the command prompt. The following command lists the Hive tables; I have defined two: records and records_ex.

[cloudera@localhost book]$ hive

hive> show tables;

OK

records

records_ex

Time taken: 4.602 seconds

hive>
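A couple of other shell commands come in handy at this point: show tables accepts a wildcard pattern, and describe prints a table’s schema. A quick sketch (output omitted):

hive> show tables 'record*';

hive> describe records;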


Creating a Managed Table

Suppose you have a file with the following tab-delimited format:

1950    0    1

1950    22    1

1950    -11    1

1949    111    1

1949    78    1


The following Hive statement creates a records table with three columns.

hive> CREATE TABLE records (year STRING, temperature INT, quality INT)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t';

Next, we use the LOAD DATA statement to populate the records table with data from a file located on the local file system:

LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'

OVERWRITE INTO TABLE records; 

This causes Hive to move the file into its own warehouse directory (/user/hive/warehouse by default). In other words, by default Hive manages the table’s data: if you drop the table, Hive deletes the data along with it.
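To make the managed-table semantics concrete, here is a sketch of what dropping the table would do (don’t actually run this unless you’re finished with the data; the records table is used again below):

-- Dropping a managed table removes the metadata AND the data files

DROP TABLE records;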

Creating an External Table

What if the data is already in HDFS and you don’t want Hive to move the files? In this case, you can tell Hive that the table is external, meaning that you manage the data yourself. Suppose that you’ve already copied the sample.txt file to HDFS:

[cloudera@localhost ~]$ hadoop dfs -ls /user/cloudera/input/ncdc

Found 1 items

-rw-r--r-- 1 cloudera supergroup 529 2012-06-07 16:24 /user/cloudera/input/ncdc/sample.txt

Next, we tell Hive to create an external table:

CREATE EXTERNAL TABLE records_ex (year STRING, temperature INT, quality INT)

LOCATION '/user/cloudera/records_ex';

LOAD DATA INPATH '/user/cloudera/input/ncdc/sample.txt'

OVERWRITE INTO TABLE records_ex;

The EXTERNAL keyword tells Hive to leave the data where it is; at table-creation time it doesn’t even check that the location exists. The INPATH clause points to the source file; note that a non-LOCAL LOAD DATA moves the file into the table’s location rather than copying it. The OVERWRITE clause causes any existing data in the table to be removed first.
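The payoff comes at drop time. In contrast to the managed records table, dropping the external table would only remove Hive’s metadata, as this sketch illustrates:

-- Only the table definition goes away; the files under

-- /user/cloudera/records_ex stay in HDFS

DROP TABLE records_ex;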

Querying Data

Hive’s SQL dialect is called HiveQL. HiveQL does not support the full SQL-92 specification, as that was never a design goal. The following two examples show a simple scan that returns all rows and an aggregate query that finds the maximum temperature per year. Note that the first query needs no MapReduce job at all; Hive simply reads the file, which is why it comes back almost instantly.

hive> select * from records_ex;

OK

1950 0 1

1950 22 1

1950 -11 1

1949 111 1

1949 78 1

Time taken: 0.527 seconds 

hive> SELECT year, MAX(temperature)

> FROM records

> WHERE temperature != 9999

> AND quality in (1,2)

> GROUP BY year;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapred.reduce.tasks=<number>

Starting Job = job_201206241704_0001, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201206241704_0001

Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201206241704_0001

2012-06-24 18:21:15,022 Stage-1 map = 0%, reduce = 0%

2012-06-24 18:21:19,106 Stage-1 map = 100%, reduce = 0%

2012-06-24 18:21:30,212 Stage-1 map = 100%, reduce = 100%

Ended Job = job_201206241704_0001

OK

1949 111

1950 22

Time taken: 26.779 seconds 

As you can see from the second example, Hive generates a MapReduce job. Please don’t draw any conclusions from the fact that this simple query takes 26 seconds on my VM while it would execute in milliseconds on any modern relational database. Instantiating MapReduce jobs carries significant fixed overhead, end users probably won’t query Hadoop directly anyway, and the performance picture would likely look completely different with hundreds of terabytes of data.
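By the way, if you’re curious what Hive compiles a query into before it launches anything, you can prefix the query with EXPLAIN (the plan output is verbose, so I’m omitting it here):

hive> EXPLAIN

    > SELECT year, MAX(temperature)

    > FROM records

    > WHERE temperature != 9999

    > AND quality IN (1,2)

    > GROUP BY year;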

In a future blog post, I plan to summarize my research on Hadoop and recommend usage scenarios.