Hive
and Hadoop for Data Analytics on Apache Log
In
this Apache Hive tutorial, we explain how to use Hive in a Hadoop distributed
processing environment to enable Web analytics on large datasets. The Web
analytics process will involve analyzing weblog details such as URLs accessed,
cookies, access dates with times, and IP addresses. This information will be
used to analyze visitors' website usage as well as their browsing patterns and
behavior. Armed with this information, site owners can predict what a
particular user likes on the site and personalize it accordingly. For their
part, developers can add extra tracking values in the weblog for additional
analytics.
Apache Hadoop and
Hive for Data Processing
Apache
Hadoop, the open source distributed computing framework for handling large
datasets, uses the HDFS file system for storing files and Map/Reduce model for
processing large datasets. Apache Hive, a sub-project of Hadoop, is a data
warehouse infrastructure used to query and analyze large datasets stored in
Hadoop files. Although Hadoop Hive is a non-SQL database, it will support some
SQL as well. With its Hive-SQL option, Hive users can query the Hive tables.
Hive works on top of Hadoop and ZooKeeper, a centralized Hadoop service for
maintaining configuration information, naming, providing distributed
synchronization, and providing group services.
Weblogs
can be processed by the Hadoop Map Reduce program and stored in HDFS.
Meanwhile, Hive supports fast reading of the data in the HDFS location, basic
SQL, joins, and batch data load to the Hive database.
Weblog Formats and
Processing
Both
Apache Web Server and Microsoft IIS record website requests into log files, but
the formats of those logs differ. Apache's preferred weblog format is the combined
log format, which logs all the details of Web usages. Here is an example of
the combined weblog format.
LogFormat "%h %v %u
%t \"%r\" %>s %b \"%{Referer}i\"
\"%{User-Agent}i\" \"%{Cookie}i\""
Here
is the IIS preferred format, Microsoft IIS W3C Extended Log File Format:
c-ip cs-username date
time sc-bytes sc-status cs-uri-stem cs[Referer] cs[User-Agent] cs[Cookie]
Large
retail applications are accessed by many users around the world, so their
weblog file sizes might be between 10 and 15 gigabytes. For example,
Amazon.com's weblog is more than 15 gigabytes. This weblog information is used
to predict customer interest and personalize the site.
Page
tagging is used for tagging the Web page and tracking usage of the page. Web
analytics providers such as WebTrends and Google Analytics use this option for
tracking page views and storing this view information in log files.
Hadoop MapReduce for
Parsing Weblogs
Below
is a sample Apache log where the log fields are terminated by a space. The OpenCSV
framework is used for parsing these logs. It is using field
terminated characters to parse the log and split the fields.
127.0.0.1 - -
[10/Apr/2007:10:39:11 +0300] "GET / HTTP/1.1" 500 606 "-"
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3) Gecko/20061201
Firefox/2.0.0.3 (Ubuntu-feisty)"
Here
are the steps for parsing a log file using Hadoop MapReduce:
1. Load log files into
the HDFS location using this Hadoop command:
2. hadoop fs -put <local
file path of weblogs> <hadoop HDFS
location>
3. The Opencsv2.3.jar
framework is used for parsing log records.
4.
Below
is the Mapper program for parsing the log file from the HDFS location.
public static class ParseMapper extends Mapper<Object, Text,
NullWritable,Text >{
private Text word = new
Text();
public void map(Object
key, Text value, Context context) throws IOException, InterruptedException
{
CSVParser parse = new CSVParser('
','\"');
String sp[]=parse.parseLine(value.toString());
int spSize=sp.length;
StringBuffer rec= new StringBuffer();
for(int i=0;i<spSize;i++){
rec.append(sp[i]);
if(i!=(spSize-1))
rec.append(",");
}
word.set(rec.toString());
context.write(NullWritable.get(), word);
}
}
5.
The
command below is the Hadoop-based log parse execution. The MapReduce program is
attached in this article. You can add extra parsing methods in the class. Be
sure to create a new JAR with any change and move it to the Hadoop distributed
job tracker system.
6. hadoop jar <path of
logparse jar> <hadoop HDFS logfile path> <output path of parsed log file>
7.
The
output file is stored in the HDFS location, and the output file name starts
with "part-".
Hadoop Hive Configuration
The
sections to follow explain how to configure Hive for weblog analytics.
Requirements
· Java 1.6
· Hadoop 0.20.x.
Installing Hadoop Hive from a Stable
Release
First,
download the latest stable release of Hive from one of the Apache download
mirrors.
Next,
unpack the tarball, which will create a subdirectory named hive-x.y.z:
$ tar -xzvf hive-x.y.z.tar.gz
Point
the environment variable HIVE_HOME to the installation
directory:
$ cd hive-x.y.z
$ export
HIVE_HOME={{pwd}}
Finally,
add $HIVE_HOME/bin to your PATH:
$ export
PATH=$HIVE_HOME/bin:$PATH
Running Hadoop Hive
Because
Hive uses Hadoop either:
· you must have Hadoop
in your path, OR
· export HADOOP_HOME=<hadoop-install-dir>
In
addition, you must create /tmp
and /user/hive/warehouse (aka hive.metastore.warehouse.dir)
and set them chmod
g+w
in HDFS before a table can be created in Hive.
Commands
to perform this setup are as follows:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
I
also find it useful, but not necessary, to set HIVE_HOME as follows:
$ export
HIVE_HOME=<hive-install-dir>
To
use the Hive command line interface (CLI) from the shell:
$ $HIVE_HOME/bin/hive
Hive Runtime Configuration
Hive queries are executed using MapReduce queries.
Therefore, the behavior of such queries can be controlled by the Hadoop
configuration variables.
- The CLI command SET can be used to
set any Hadoop (or Hive) configuration variable. For example:
- hive> SET
mapred.job.tracker=myhost.mycompany.com:50030;
·
hive>
SET -v;
Hive, MapReduce and Local-Mode
The
Hive compiler generates MapReduce jobs for most queries. These jobs are then
submitted to the Map-Reduce cluster indicated by this variable:
mapred.job.tracker
This
usually points to a MapReduce cluster with multiple nodes, but Hadoop also
provides an option to run MapReduce jobs locally on the user's PC. This can be very useful for running queries over
small data sets because in such cases local mode execution is usually much
faster than submitting jobs to a large cluster. Data is accessed transparently
from HDFS. Conversely, local mode runs with only one reducer and can be very
slow when processing larger data sets.
Starting
with version 0.7, Hive fully supports local mode execution, which you can
activate by enabling the following option:
Hive> SET
mapred.job.tracker=local;
In
addition, mapred.local.dir should point to a path that's valid on the
local machine (for example /tmp/<:username>/mapred/local). (Otherwise, the
user will get an exception allocating local disk space).
Starting
with version 0.7, Hive also supports a mode to run MapReduce jobs in local-mode
automatically. The relevant options are:
Hive> SET hive.exec.mode.local.auto=false;
Hadoop Hive Data Load
Hive
provides tools to enable easy data ETL, a mechanism to put structures on the
data, and defines a simple SQL-like query language, called QL, that enables
users familiar with SQL to query the data. At the same time, Hive QL also
allows programmers familiar with MapReduce to plug in their custom mappers and
reducers to perform more sophisticated analysis that may not be supported by
the built-in capabilities of the language.
Below
is the create table command for Hive.
create table weblogs
(
client_ip string,
full_request_date string,
day string,
month string,
month_num int,
year string,
hour string,
minute string,
second string,
timezone string,
http_verb string,
uri string,
http_status_code string,
bytes_returned string,
referrer string,
user_agent string
)
row format delimited
fields terminated by '\t' stored as textfile
The
below command is used for loading data from an HDFS location to a Hive table.
LOAD DATA INPATH
'<HDFS file path of parsed file>' INTO TABLE <table name>
After
loading the data into the table, the normal user can query Hive using Hive QL.
Below is an example query for getting user counts of each location.
SELECT client_ip , COUNT(client_ip) FROM weblogs GROUP BY
client_ip
Hadoop Hive JDBC Support
Hive
also supports JDBC connections. To connect Hive with JDBC, you need to start
the Hive Thrift Server as follows.
Export HIVE_PORT=9999
Hive –service hiveserver
Here
are the steps to establish a Hive JDBC connection:
1.
Add
hive-jdbc0.7.jar in the classpath; this is a type-4 driver.
2.
Use
the org.apache.hadoop.hive.jdbc.HiveDriver driver for the connection.
3.
Connection
String : jdbc:hive://<hive
IP Address>:<hive port>/<database name>
4.
Use
Hive QL to query the table in Hive, and it will return the result set.
5.
Using
the result set, you can project the output in graphs or charts easily.
Here
is a sample Hive JDBC program:
Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
Connection con =
DriverManager.getConnection("jdbc:hive://10.0.0.1:9999/default","",
"");
Statement stmt =
con.createStatement();
Resultset res =
stmt.executeQuery("SELECT client_ip
, COUNT(client_ip) FROM weblogs GROUP BY
client_ip");
while (res.next())
{
System.out.println(res.getInt(1) +
"\t" + res.getString(2));
}