Thursday, May 30, 2013

Examples: Processing Apache Logs

Hive and Hadoop for Data Analytics on Apache Logs
In this Apache Hive tutorial, we explain how to use Hive in a Hadoop distributed processing environment to enable Web analytics on large datasets. The Web analytics process will involve analyzing weblog details such as URLs accessed, cookies, access dates with times, and IP addresses. This information will be used to analyze visitors' website usage as well as their browsing patterns and behavior. Armed with this information, site owners can predict what a particular user likes on the site and personalize it accordingly. For their part, developers can add extra tracking values in the weblog for additional analytics.
Apache Hadoop and Hive for Data Processing
Apache Hadoop, the open source distributed computing framework for handling large datasets, uses the HDFS file system for storing files and the MapReduce model for processing them. Apache Hive, a sub-project of Hadoop, is a data warehouse infrastructure used to query and analyze large datasets stored in Hadoop. Although Hive is not a relational database, it supports an SQL-like language: with Hive QL, users can query the Hive tables. Hive works on top of Hadoop and ZooKeeper, a centralized Hadoop service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Weblogs can be processed by a Hadoop MapReduce program and stored in HDFS. Hive then supports fast reading of the data in that HDFS location, basic SQL, joins, and batch data loads into the Hive database.
Weblog Formats and Processing
Both Apache Web Server and Microsoft IIS record website requests in log files, but the formats of those logs differ. Apache's preferred weblog format is the combined log format, which logs all the details of Web usage. Here is an example of the combined weblog format:
 
LogFormat "%h %v %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""
 
Here is the IIS preferred format, Microsoft IIS W3C Extended Log File Format:
 
c-ip cs-username date time sc-bytes sc-status cs-uri-stem cs[Referer] cs[User-Agent] cs[Cookie]
Large retail applications are accessed by many users around the world, so their weblog files can reach 10 to 15 gigabytes or more; Amazon.com's weblog, for example, exceeds 15 gigabytes. This weblog information is used to predict customer interest and personalize the site.
Page tagging is used to tag Web pages and track their usage. Web analytics providers such as WebTrends and Google Analytics use this option to track page views and store the view information in log files.
Hadoop MapReduce for Parsing Weblogs
Below is a sample Apache log record, in which the fields are separated by spaces. The OpenCSV library is used to parse these logs: it splits each line into fields on the space separator while keeping quoted fields (such as the request and user agent) intact.
 
127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] "GET / HTTP/1.1" 500 606 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3) Gecko/20061201 Firefox/2.0.0.3 (Ubuntu-feisty)"
 
Here are the steps for parsing a log file using Hadoop MapReduce:
1.    Load the log files into an HDFS location using this Hadoop command:
      hadoop fs -put <local file path of weblogs>  <hadoop HDFS location>
2.    The opencsv-2.3.jar library is used for parsing the log records.
3.    Below is the Mapper program for parsing the log file from the HDFS location.
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import au.com.bytecode.opencsv.CSVParser;

// ParseMapper is declared as a static inner class of the job's driver class.
public static class ParseMapper extends Mapper<Object, Text, NullWritable, Text> {

    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the space-delimited log line; quoted fields (request, referrer, user agent) stay intact.
        CSVParser parse = new CSVParser(' ', '\"');
        String[] sp = parse.parseLine(value.toString());
        int spSize = sp.length;

        // Re-emit the record as a single comma-separated line with a null key.
        StringBuffer rec = new StringBuffer();
        for (int i = 0; i < spSize; i++) {
            rec.append(sp[i]);
            if (i != (spSize - 1)) {
                rec.append(",");
            }
        }
        word.set(rec.toString());
        context.write(NullWritable.get(), word);
    }
}
 
4.    The command below runs the Hadoop log-parse job. The complete MapReduce program is attached in this article, and you can add extra parsing methods to the class (a minimal driver sketch is also shown after this list). Be sure to build a new JAR after any change and copy it to the machine from which you submit Hadoop jobs.
      hadoop jar <path of logparse jar> <hadoop HDFS logfile path>  <output path of parsed log file>
5.    The output is stored in the HDFS output location, and the output file names start with "part-".
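A complete MapReduce program also needs a driver that configures and submits the job. Below is a rough, map-only driver sketch for the ParseMapper above; the class name LogParse and the job name are illustrative placeholders, not the program attached to this article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogParse {

    // The ParseMapper class shown above goes here as a static inner class.

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "weblog-parse");                 // Hadoop 0.20.x-style job setup
        job.setJarByClass(LogParse.class);
        job.setMapperClass(ParseMapper.class);
        job.setNumReduceTasks(0);                                // map-only: no reducer is needed for parsing
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS path of the raw weblog
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS path for the parsed "part-" files
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because there is no reduce phase, each mapper writes its parsed lines directly to the "part-" output files mentioned in step 5.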
Hadoop Hive Configuration
The sections to follow explain how to configure Hive for weblog analytics.
Requirements
· Java 1.6
· Hadoop 0.20.x.
Installing Hadoop Hive from a Stable Release
First, download the latest stable release of Hive from one of the Apache download mirrors.
Next, unpack the tarball, which will create a subdirectory named hive-x.y.z:
 
 $ tar -xzvf hive-x.y.z.tar.gz
 
Point the environment variable HIVE_HOME to the installation directory:
 
$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)
 
Finally, add $HIVE_HOME/bin to your PATH:
 
$ export PATH=$HIVE_HOME/bin:$PATH
Running Hadoop Hive
Because Hive uses Hadoop, you must either:
· have hadoop in your PATH, or
· export HADOOP_HOME=<hadoop-install-dir>
In addition, you must create /tmp and /user/hive/warehouse (a.k.a. hive.metastore.warehouse.dir) in HDFS and make them group-writable (chmod g+w) before you can create a table in Hive.
Commands to perform this setup are as follows:
 
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
I also find it useful, but not necessary, to set HIVE_HOME as follows:
 
$ export HIVE_HOME=<hive-install-dir>
To use the Hive command line interface (CLI) from the shell:
 
$ $HIVE_HOME/bin/hive
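You can also run a single statement without starting the interactive shell; for example:

$ $HIVE_HOME/bin/hive -e 'SHOW TABLES;'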
Hive Runtime Configuration
Hive queries are executed as MapReduce jobs. Therefore, the behavior of such queries can be controlled by Hadoop configuration variables.
  • The CLI command SET can be used to set any Hadoop (or Hive) configuration variable. For example:
    hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
    hive> SET -v;
  • The second command (SET -v) lists all of the current configuration settings.
Hive, MapReduce and Local-Mode
The Hive compiler generates MapReduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by this variable:
mapred.job.tracker
This usually points to a MapReduce cluster with multiple nodes, but Hadoop also provides an option to run MapReduce jobs locally on the user's PC. This can be very useful for running queries over small data sets because in such cases local mode execution is usually much faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode runs with only one reducer and can be very slow when processing larger data sets.
 
Starting with version 0.7, Hive fully supports local mode execution, which you can activate by enabling the following option:
 
hive> SET mapred.job.tracker=local;
 
In addition, mapred.local.dir should point to a path that is valid on the local machine (for example, /tmp/<username>/mapred/local); otherwise, the user will get an exception about allocating local disk space.
 
Starting with version 0.7, Hive also supports running MapReduce jobs in local mode automatically. The relevant option is disabled by default:
 
 hive> SET hive.exec.mode.local.auto=false;
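When this option is enabled, Hive decides per query whether the input is small enough to run locally, based on configurable thresholds. Here is an illustrative sketch, assuming the Hive 0.7 names of those threshold options:
 
 hive> SET hive.exec.mode.local.auto=true;
 hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
 hive> SET hive.exec.mode.local.auto.tasks.max=4;
 
With these (default) thresholds, a query runs locally only when its total input is below about 128 MB and it needs no more than a few map tasks.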
Hadoop Hive Data Load
Hive provides tools to enable easy data ETL, a mechanism to put structures on the data, and defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, Hive QL also allows programmers familiar with MapReduce to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Below is the create table command for Hive.
 
create table weblogs
(
client_ip    string,
full_request_date string,
day    string,
month    string,
month_num int,
year    string,
hour    string,
minute    string,
second    string,
timezone string,
http_verb    string,
uri    string,
http_status_code    string,
bytes_returned  string,
referrer string,
user_agent string
)
 
row format delimited fields terminated by '\t' stored as textfile
The command below loads data from an HDFS location into a Hive table. (Make sure the field delimiter declared in the CREATE TABLE statement matches the separator used in the parsed file.)
 
LOAD DATA INPATH '<HDFS file path of parsed file>' INTO TABLE <table name> 
After loading the data into the table, users can query Hive using Hive QL. Below is an example query that counts the requests made from each client IP address.
 
SELECT client_ip  , COUNT(client_ip) FROM weblogs GROUP BY client_ip
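Other analytics follow the same pattern. For example, assuming the weblogs table defined above, illustrative queries that count requests per HTTP status code and per requested URI look like this:
 
SELECT http_status_code, COUNT(*) FROM weblogs GROUP BY http_status_code;
SELECT uri, COUNT(*) FROM weblogs GROUP BY uri;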
Hadoop Hive JDBC Support
Hive also supports JDBC connections. To connect Hive with JDBC, you need to start the Hive Thrift Server as follows.
 
export HIVE_PORT=9999
$HIVE_HOME/bin/hive --service hiveserver
Here are the steps to establish a Hive JDBC connection:
1.    Add the Hive JDBC driver JAR (hive-jdbc-0.7.jar) to the classpath; it is a Type 4 driver.
2.    Use the org.apache.hadoop.hive.jdbc.HiveDriver driver for the connection.
3.    Use the connection string jdbc:hive://<hive IP address>:<hive port>/<database name>.
4.    Query the table with Hive QL; the query returns a result set.
5.    Using the result set, you can easily render the output as graphs or charts.
Here is a sample Hive JDBC program:
 
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
 
// Register the Hive Type 4 JDBC driver.
Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
 
// Connect to the Hive Thrift server (no user name or password is required by default).
Connection con = DriverManager.getConnection("jdbc:hive://10.0.0.1:9999/default", "", "");
 
Statement stmt = con.createStatement();
 
ResultSet res = stmt.executeQuery("SELECT client_ip, COUNT(client_ip) FROM weblogs GROUP BY client_ip");
 
// Column 1 is the client IP (string); column 2 is the request count.
while (res.next())
{
      System.out.println(res.getString(1) + "\t" + res.getLong(2));
}
 
res.close();
stmt.close();
con.close();
 

Wednesday, May 29, 2013

Why Hadoop?

I’m sure you've heard about Big Data, since it is a very hot and emerging technology nowadays. I came to know Hadoop through my company.
The best-known technology used for Big Data is Hadoop. It is used by Yahoo, eBay, LinkedIn, and Facebook, and it was inspired by Google's publications on MapReduce, GoogleFS, and BigTable. Because Hadoop can be hosted on commodity hardware (usually Intel PCs running Linux with one or two CPUs and a few TB of disk, without any RAID replication technology), it allows these companies to store huge quantities of data (petabytes or even more) at very low cost compared to SAN-based systems.
Hadoop is an open source suite under the Apache Foundation: http://hadoop.apache.org/.
The Hadoop “brand” contains many different tools. Two of them are core parts of Hadoop:
  • Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into many smaller blocks, each of which is replicated and stored on three servers by default (the replication factor can be customized) for fault tolerance.
  • Hadoop MapReduce is a way to split every request into smaller requests that are sent to many small servers, allowing truly scalable use of CPU power (describing MapReduce fully would be worth a dedicated post).
Some other components are often installed on Hadoop solutions:
  • HBase is inspired from Google’s BigTable. HBase is a non-relational, scalable, and fault-tolerant database that is layered on top of HDFS. HBase is written in Java. Each row is identified by a key and consists of an arbitrary number of columns that can be grouped into column families.
  • ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Zookeeper is used by HBase, and can be used by MapReduce programs.
  • Solr/Lucene as the search engine. This query-engine library has been developed by Apache for more than 10 years.
  • Languages. Two languages are identified as the original Hadoop languages: Pig and Hive. You can use them to develop MapReduce processes at a higher level than raw MapReduce procedures. Other languages may be used as well, like C, Java, or JAQL, and SQL can be used too, through JDBC or ODBC connectors (or directly in those languages).


Hadoop Architecture

Even if the best-known Hadoop distribution is provided by a very specialized vendor named Cloudera (others come from MapR, Hortonworks, and of course Apache), big vendors are positioning themselves on this technology:
  • IBM has BigInsights (the Cloudera distribution plus IBM's own GPFS file system) and has recently acquired several niche players in the analytics and big data market (like Platform Computing, which has a product enhancing the capabilities and performance of MapReduce).
  • Oracle has launched its Big Data Appliance. Also based on Cloudera, this machine is dedicated to the storage and use of unstructured content (while structured content stays on Exadata).
  • Informatica has a tool called HParser to complement PowerCenter. This tool is built to run Informatica processes in MapReduce mode, distributed across the Hadoop servers.
  • Microsoft has a dedicated Hadoop version, supported by Apache, for Microsoft Windows and for Azure, its cloud solution, along with deep native integration with SQL Server 2012.
  • Some very large database solutions like EMC Greenplum (partnering with MapR), HP Vertica (partnering with Cloudera), Teradata Aster Data (partnering with HortonWorks) or SAP Sybase IQ are able to connect directly to HDFS.