MS Guy Does Hadoop (Part 3 – Hive)
Writing MapReduce jobs in Java may be fine for simple analytical needs or for distributing processing work, but it becomes challenging for more involved scenarios, such as joining two datasets. This is where Hive comes in. Hive was originally developed by the Facebook data warehousing team after they concluded that “… developers ended up spending hours (if not days) to write programs for even simple analyses”. Instead, Hive offers a SQL-like language that auto-generates the MapReduce code.
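To appreciate the difference, consider a join, which takes pages of hand-written code in raw MapReduce but is a single statement in Hive. A minimal sketch, using hypothetical readings and stations tables that I made up purely for illustration:
-- Join weather readings to station metadata (hypothetical tables):
SELECT s.name, r.year, MAX(r.temperature)
FROM readings r
JOIN stations s ON (r.station_id = s.id)
GROUP BY s.name, r.year;
Hive compiles this into the appropriate map and reduce phases behind the scenes.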
The Hive Shell
Hive introduces the notion of a “table” on top of the data. It has its own shell, which can be invoked by typing “hive” at the command prompt. The following command lists the Hive tables; I have defined two: records and records_ex.
[cloudera@localhost book]$ hive
hive> show tables;
OK
records
records_ex
Time taken: 4.602 seconds
hive>
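Once inside the shell, you can also inspect a table’s schema with the standard HiveQL DESCRIBE commands (the output varies with your setup, so I’m omitting it here):
hive> DESCRIBE records;
hive> DESCRIBE EXTENDED records;
DESCRIBE lists the column names and types; DESCRIBE EXTENDED adds storage details such as the table’s location and input format.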
Creating a Managed Table
Suppose you have a file with the following tab-delimited format:
1950 0 1
1950 22 1
1950 -11 1
1949 111 1
1949 78 1
The following Hive statement creates a records table with three columns.
hive> CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Next, we use the LOAD DATA statement to populate the records table with data from a file located on the local file system:
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
OVERWRITE INTO TABLE records;
This causes Hive to copy the file into its warehouse directory in HDFS (/user/hive/warehouse by default). In other words, by default Hive manages the table: if you drop the table, Hive deletes the table’s data along with its metadata.
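For contrast, here is what dropping the managed table would do (I won’t actually run this, since I query records below):
-- Dropping a managed table removes the metadata AND the data files
-- that LOAD DATA placed in the warehouse directory:
DROP TABLE records;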
Creating an External Table
What if the data is already in HDFS and you don’t want to move the files? In this case, you can tell Hive that the table will be external to Hive and you’ll manage the data. Suppose that you’ve already copied the sample.txt file to HDFS:
[cloudera@localhost ~]$ hadoop dfs -ls /user/cloudera/input/ncdc
Found 1 items
-rw-r--r-- 1 cloudera supergroup 529 2012-06-07 16:24 /user/cloudera/input/ncdc/sample.txt
Next, we tell Hive to create an external table:
CREATE EXTERNAL TABLE records_ex (year STRING, temperature INT, quality INT)
LOCATION '/user/cloudera/records_ex';
LOAD DATA INPATH '/user/cloudera/input/ncdc/sample.txt'
OVERWRITE INTO TABLE records_ex;
The EXTERNAL clause causes Hive to leave the data at the given LOCATION; it doesn’t even check whether the location exists when the table is created. The INPATH clause points to the source file, and the OVERWRITE clause removes any data already in the table.
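Dropping an external table behaves differently from the managed case. A minimal sketch (again, not run here, since I query records_ex below):
-- Only the metastore entry goes away; the files under
-- /user/cloudera/records_ex stay in HDFS:
DROP TABLE records_ex;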
Querying Data
Hive’s SQL dialect is called HiveQL. HiveQL does not support the full SQL-92 specification; full compliance was never a design goal. The following two examples show how to query our tables.
hive> select * from records_ex;
OK
1950 0 1
1950 22 1
1950 -11 1
1949 111 1
1949 78 1
Time taken: 0.527 seconds
hive> SELECT year, MAX(temperature)
> FROM records
> WHERE temperature != 9999
> AND quality in (1,2)
> GROUP BY year;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201206241704_0001, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201206241704_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201206241704_0001
2012-06-24 18:21:15,022 Stage-1 map = 0%, reduce = 0%
2012-06-24 18:21:19,106 Stage-1 map = 100%, reduce = 0%
2012-06-24 18:21:30,212 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206241704_0001
OK
1949 111
1950 22
Time taken: 26.779 seconds
As you can see from the second example, Hive generates a MapReduce job. Please don’t draw any conclusions from the fact that this simple query takes 26 seconds on my VM while it would execute in milliseconds on any modern relational database. It takes quite a bit of time to instantiate MapReduce jobs, and end users probably won’t query Hadoop directly anyway. Besides, the performance picture will probably look completely different with hundreds of terabytes of data.
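If you’re curious about what Hive plans to run, the EXPLAIN statement prints the compiled execution plan, including the MapReduce stages, without executing the query:
EXPLAIN
SELECT year, MAX(temperature)
FROM records
WHERE temperature != 9999
AND quality IN (1, 2)
GROUP BY year;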
In a future blog on Hadoop, I plan to summarize my research on Hadoop and recommend usage scenarios.