Saturday, December 20, 2014

Apache Impala vs Hive



There are some key features in Impala that make it fast.

It does not use MapReduce, which is expensive to fork in separate JVMs. Instead, an Impala daemon runs on each node; it splits the query into fragments, runs them in parallel, and merges the result sets at the end.

It does most of its operations in memory.

It uses HDFS for its storage, which is fast for large files. It caches as much as possible, from queries to results to data.

It supports newer file formats like Parquet, which is a columnar file format. If you use this format, queries that access only a few columns most of the time will be faster.
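As a rough illustration (the table and column names here are hypothetical), a Parquet-backed table is created and queried like any other, and a query such as the one below only needs to read the url column off disk:

CREATE TABLE logs_parquet (ip STRING, url STRING, ts BIGINT)
STORED AS PARQUET;

SELECT url, COUNT(*) AS hits FROM logs_parquet GROUP BY url;

(STORED AS PARQUET works in Impala and in Hive 0.13+; older Hive versions need the Parquet input/output format classes spelled out in the CREATE TABLE statement.)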

Impala doesn't use MapReduce at all. It simply has daemons running on all your nodes that cache some of the data in HDFS, so these daemons can return data quickly without having to go through a whole MapReduce job.

The reason for this is that there is a certain overhead involved in running a MapReduce job, so by short-circuiting MapReduce altogether you can get some pretty big gains in runtime.

That being said, Impala does not replace Hive; the two are good for very different use cases. Impala doesn't provide fault tolerance the way Hive does, so if something goes wrong during your query, the query is gone and must be rerun. For ETL-type jobs, where the failure of one job would be costly, I would definitely recommend Hive; but Impala can be awesome for small ad-hoc queries, for example for data scientists or business analysts who just want to take a look at and analyze some data without building robust jobs. Also, from my personal experience, Impala is still not very mature, and I've seen crashes when the amount of data is larger than the available memory.

Impala provides faster responses because it uses MPP (massively parallel processing), unlike Hive, which uses MapReduce under the hood and therefore incurs the initial overheads noted above. Massively parallel processing is a type of computing that uses many separate CPUs running in parallel to execute a single program, where each CPU has its own dedicated memory. Because Impala is MPP-based, it avoids the overheads of a MapReduce job, viz. job setup and creation, slot assignment, split creation, map generation, etc., which makes it blazingly fast.

But that doesn't mean Impala is the solution to all your problems. Being highly memory-intensive (MPP), it is not a good fit for tasks that require heavy data operations such as large joins, since you just can't fit everything into memory. This is where Hive is a better fit.

So, if you need real-time, ad-hoc queries over a subset of your data, go for Impala. If you have batch-processing needs over your big data, go for Hive.

Monday, October 13, 2014

Hive Interview Questions

What is Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
Hive was originally developed at Facebook. It's now a Hadoop subproject with many contributors. Users need to concentrate only on the top-level Hive language rather than on Java MapReduce programs. One of the main advantages of Hive is its SQL-like nature, which raises its usability considerably.
A Hive program is automatically compiled into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom map-reduce scripts that can be plugged into queries.
Hive example:
Selecting the names of employees whose salary is more than 100 dollars from a Hive table called tbl_employee:
SELECT employee_name FROM tbl_employee WHERE salary > 100;
Users are excited to use Hive since it is very similar to SQL.
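As for the custom map-reduce scripts mentioned above, HiveQL's TRANSFORM clause is how they get plugged into a query. A minimal sketch, reusing tbl_employee and assuming a hypothetical script my_script.py that reads tab-separated rows from stdin and writes transformed rows to stdout:
ADD FILE my_script.py;
SELECT TRANSFORM (employee_name, salary)
USING 'python my_script.py'
AS (employee_name, adjusted_salary)
FROM tbl_employee;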
What are the types of tables in Hive?
There are two types of tables.

1. Managed tables.
2. External tables.
Only the DROP TABLE command differentiates managed and external tables: dropping a managed table removes both the metadata and the data, while dropping an external table removes only the metadata. Otherwise, both types of tables are very similar, as the sketch below shows.
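A minimal sketch of the difference (the table names and HDFS path are hypothetical):
CREATE TABLE managed_t (col STRING);
CREATE EXTERNAL TABLE external_t (col STRING) LOCATION '/user/husr/ext';
DROP TABLE managed_t;   -- deletes the metadata and the files in the warehouse
DROP TABLE external_t;  -- deletes only the metadata; the files under /user/husr/ext remain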
Does Hive support record level Insert, delete or update?
Hive does not provide record-level update, insert, or delete; hence, Hive does not provide transactions either. However, users can use CASE statements and the built-in functions of Hive to emulate these DML operations, so an update that is one line in an RDBMS may need many lines of code in Hive.
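For example, a minimal sketch of an "update" expressed as a full rewrite of the table (the predicate and new value are hypothetical, and tbl_employee is the table from the earlier example):
INSERT OVERWRITE TABLE tbl_employee
SELECT employee_name,
       CASE WHEN employee_name = 'John' THEN 200 ELSE salary END AS salary
FROM tbl_employee;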
What kind of data warehouse application is suitable for Hive?
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.
Hive is most suited for data warehouse applications, where
1)   Relatively static data is analyzed,
2)   Fast response times are not required, and
3)   The data is not changing rapidly.
Hive doesn't provide crucial features required for OLTP (Online Transaction Processing). It's closer to being an OLAP (Online Analytic Processing) tool. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

How can the columns of a table in hive be written to a file?
The output of the HiveQL DESCRIBE command can be piped through awk in the shell to capture just the column names:
hive -S -e 'DESCRIBE table_name;' | awk '{print $1}' > ~/output.txt
CONCAT function in Hive with Example?
The CONCAT function concatenates the input strings. You can specify any number of strings, separated by commas.
Example:
CONCAT('Hive','-','performs','-','good','-','in','-','Hadoop');
Output:
Hive-performs-good-in-Hadoop
Notice that you have to repeat the delimiter '-' between every pair of strings. If the delimiter is common to all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first.
CONCAT_WS('-','Hive','performs','good','in','Hadoop');
Output: Hive-performs-good-in-Hadoop
REPEAT function in Hive with example?
The REPEAT function repeats the input string the specified number of times.
Example:
REPEAT('Hadoop',3);
Output:
HadoopHadoopHadoop
Note: You can add a space with the input string also.
TRIM function in Hive with example?
The TRIM function removes the leading and trailing spaces from a string.
Example:
TRIM('  Hadoop  ');
Output:
Hadoop
Note: If you want to remove only leading or trailing spaces, then you can use the commands below, respectively.
LTRIM('  Hadoop');
RTRIM('Hadoop  ');
REVERSE function in Hive with example?
REVERSE function will reverse the characters in a string.
Example:
REVERSE('Hadoop');
Output:
poodaH

LOWER or LCASE function in Hive with example?
LOWER or LCASE function will convert the input string to lower case characters.
Example:
LOWER('Hadoop');
LCASE('Hadoop');
Output:
hadoop
Note:
If the characters are already in lower case then they will be preserved.
UPPER or UCASE function in Hive with example?
UPPER or UCASE function will convert the input string to upper case characters.
Example:
UPPER('Hadoop');
UCASE('Hadoop');
Output:
HADOOP
Note:
If the characters are already in upper case then they will be preserved.
Double type in Hive – Important points?
It is important to know about the double type in Hive, because Hive presents double data differently from what an RDBMS does.
See the double type data below:
24624.0
32556.0
3.99893E5
4366.0
E5 represents 10^5 here, so the value 3.99893E5 represents 399893. Calculations on double values are performed in IEEE 754 floating point, so exact decimal results are not guaranteed. The maximum value of an IEEE 754 double is about 1.8E308.
This is crucial when exporting double-type data to an RDBMS, since the type may be interpreted wrongly. So it is advised to cast the double type to an appropriate type before exporting.
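A minimal sketch, reusing the hypothetical tbl_employee (DECIMAL with precision and scale assumes Hive 0.13+):
SELECT employee_name,
       CAST(salary AS DECIMAL(10,2)) AS salary
FROM tbl_employee;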
Rename a table in Hive – How to do it?
Using ALTER command, we can rename a table in Hive.
ALTER TABLE hive_table_name RENAME TO new_name;
There is another way to rename a table in Hive. ALTER may take more time if the underlying table has many partitions. In that case, the EXPORT and IMPORT options can be used: you save the Hive data to HDFS and import it back into a new table, like below.
EXPORT TABLE tbl_name TO 'HDFS_location';
IMPORT TABLE new_tbl_name FROM 'HDFS_location';
If you prefer to just preserve the data, you can create a new table from old table like below.
CREATE TABLE new_tbl_name AS SELECT * FROM old_tbl_name;
DROP TABLE old_tbl_name;
How to change a column data type in Hive?
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
Example: to change the data type of the id column from integer to bigint in a table called employee:
ALTER TABLE employee CHANGE id id BIGINT;
Difference between order by and sort by in hive?
SORT BY will sort the data within each reducer. You can use any number of reducers for SORT BY operation.
ORDER BY will sort all of the data together, which has to pass through one reducer; thus, ORDER BY in Hive uses a single reducer.
ORDER BY guarantees total order in the output, while SORT BY only guarantees ordering of the rows within each reducer. If there is more than one reducer, SORT BY may give a partially ordered final result.
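A minimal sketch, reusing the hypothetical tbl_employee (mapred.reduce.tasks is the reducer-count property of this Hadoop era):
SET mapred.reduce.tasks = 4;
SELECT employee_name, salary FROM tbl_employee SORT BY salary;  -- sorted within each of the 4 reducer outputs
SELECT employee_name, salary FROM tbl_employee ORDER BY salary; -- one reducer, totally ordered output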
RLIKE in Hive?
RLIKE is a regular-expression LIKE operator in Hive: A RLIKE B evaluates to true if any substring of A matches B, where B follows the Java regular-expression syntax. Users don't need to put a % symbol for a simple substring match with RLIKE.
Examples:
'Express' RLIKE 'Exp' -> True
'Express' RLIKE '^E.*' -> True (regular expression)
Moreover, RLIKE comes in handy when the string has extra spaces: it satisfies the required scenario without using the TRIM function. Suppose A has the value 'Express  ' (2 additional spaces) and B has the value 'Express'; RLIKE will match without TRIM.
'Express  ' RLIKE 'Express' -> True
Note:
RLIKE evaluates to NULL if A or B is NULL.
PIG INTERVIEW QUESTIONS:
Can you give us some examples how Hadoop is used in real time environment?
Let us assume that we have an exam consisting of 10 Multiple-choice questions and 20 students appear for that exam. Every student will attempt each question. For each question and each answer option, a key will be generated. So we have a set of key-value pairs for all the questions and all the answer options for every student. Based on the options that the students have selected, you have to analyze and find out how many students have answered correctly.
This isn't an easy task. Here Hadoop comes into the picture! Hadoop helps you solve these problems quickly and without much effort. You may also take the case of finding how many students wrongly attempted a particular question.
What is BloomMapFile used for?
The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile's.
BloomMapFile uses dynamic Bloom filters to provide a quick membership test for keys. It is used in the HBase table format.
What is PIG?
PIG is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. PIG's infrastructure layer consists of a compiler that produces sequences of MapReduce programs.
What is the difference between logical and physical plans?
Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing basic parsing and semantic checking, it produces a logical plan, which describes the logical operators that Pig has to execute. After this, Pig produces a physical plan, which describes the physical operators needed to execute the script.
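You can inspect these plans yourself with EXPLAIN. A minimal sketch, assuming a hypothetical input file:
A = LOAD 'input.txt' AS (f1:int, f2:chararray);
B = FILTER A BY f1 > 0;
EXPLAIN B;   -- prints the logical, physical, and MapReduce plans for B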
Does ‘ILLUSTRATE’ run MR job?
No, ILLUSTRATE does not run a MapReduce job; it works on a small internal sample of the data. It just shows the output of each stage of the script, not the final output.
Is the keyword ‘DEFINE’ like a function name?
Yes, the keyword 'DEFINE' is like a function name. Once you have registered your jar, you have to define the function. Whatever logic you have written in your Java program goes into an exported jar, which you register with Pig. The compiler will check for the function in the exported jar: when the function is not present in the built-in library, it looks into your jar.
Is the keyword ‘FUNCTIONAL’ a User Defined Function (UDF)?
No, the keyword 'FUNCTIONAL' is not a User Defined Function (UDF). When using a UDF, we have to override some functions, and certainly you have to do your job with the help of those functions. But the keyword 'FUNCTIONAL' is a built-in, i.e., a pre-defined function; therefore it does not work as a UDF.
Why do we need MapReduce during Pig programming?
Pig is a high-level platform that makes many Hadoop data analysis tasks easier to express. The language we use for this platform is Pig Latin. A program written in Pig Latin is like a query written in SQL: we need an execution engine to run it. So, when a program is written in Pig Latin, the Pig compiler converts the program into MapReduce jobs. Here, MapReduce acts as the execution engine.
Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In which kinds of scenarios will MR jobs be more useful than PIG?
Let us take a scenario where we want to count the population of two cities together. I have a data set and a sensor list for different cities, and I want to count the population of two cities, say Bangalore and Noida, in one MapReduce job. For that I need the key for Bangalore to be treated the same as the key for Noida, so that the population data of both cities is brought to one reducer. The idea is to instruct the MapReduce program: whenever you find a city named 'Bangalore' or a city named 'Noida', create a common alias key for the two cities, so that both are passed to the same reducer. For this, we have to write a custom partitioner.
In MapReduce, when you create a key for city, you have to consider 'city' as the key, so whenever the framework comes across a different city, it treats it as a different key. Hence, we need a customized partitioner. MapReduce provides a hook where you can write your own partitioner and say: if city = Bangalore or Noida, then return the same hash code. However, we cannot create a custom partitioner in Pig: as Pig is not a framework, we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.
Does Pig give any warning when there is a type mismatch or missing field?
No, Pig will not show any warning if there is a type mismatch or a missing field, and such problems are difficult to find in the log file. When a mismatch or missing field is found, Pig substitutes a null value.
What does co-group do in Pig?
Co-group joins data sets by grouping. It groups the elements by their common field and then returns a set of records containing two separate bags: the first bag consists of the records of the first data set that share the common field value, and the second bag consists of the matching records of the second data set.
Can we say cogroup is a group of more than 1 data set?
Cogroup can group a single data set, but in the case of more than one data set, cogroup groups all of them by the common field. Hence, we can say that cogroup is both a group of more than one data set and a join of those data sets on the grouping field.
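A minimal sketch with two hypothetical inputs:
A = LOAD 'students.txt' AS (name:chararray, dept:chararray);
B = LOAD 'depts.txt' AS (dept:chararray, building:chararray);
C = COGROUP A BY dept, B BY dept;
DUMP C;   -- each record is (dept, {bag of matching A tuples}, {bag of matching B tuples})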
What does FOREACH do?
FOREACH is used to apply transformations to the data and to generate new data items. The name itself indicates that the respective action is performed for each element of a data bag.
Syntax : FOREACH bagname GENERATE expression1, expression2, …..
The meaning of this statement is that the expressions mentioned after GENERATE will be applied to the current record of the data bag.
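A minimal sketch with a hypothetical input, counting students per department:
A = LOAD 'students.txt' AS (name:chararray, dept:chararray);
G = GROUP A BY dept;
X = FOREACH G GENERATE group, COUNT(A) AS num_students;
DUMP X;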
What is bag?
A bag is one of the data models present in Pig. It is an unordered collection of tuples, with possible duplicates. Bags are used to store collections while grouping. A bag is limited only by the size of the local disk: there is no requirement that the complete bag fit into memory. When a bag grows too large for memory, Pig spills it to local disk and keeps only part of the bag in memory. We represent bags with "{}".

Why Collection Data Types in Hive?

Most relational databases don't support such collection types, because using them tends to break normal form. For example, in traditional data models, structs might be captured in separate tables, with foreign-key relations between the tables, as appropriate.
A practical problem with breaking normal form is the greater risk of data duplication, leading to unnecessary disk space consumption and potential data inconsistencies, as duplicate copies can grow out of sync as changes are made.

However, in Big Data systems, a benefit of sacrificing normal form is higher processing throughput. Scanning data off hard disks with minimal "head seeks" is essential when processing terabytes to petabytes of data. Embedding collections in records makes retrieval faster with minimal seeks, whereas navigating each foreign-key relationship requires seeking across the disk, with significant performance overhead.
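A minimal sketch of the three collection types (the table and columns are hypothetical):

CREATE TABLE employees (
  name         STRING,
  salary       DOUBLE,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, zip:INT>
);

SELECT name, subordinates[0], deductions['tax'], address.city FROM employees;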




Friday, April 25, 2014

Complex Hadoop Interview Question

Is Hadoop designed for real-time systems?

No, Hadoop was initially designed for batch processing: take a large dataset as input all at once, process it, and write a large output. The very concept of MapReduce is geared towards batch, not real time. But to be honest, this was only the case at Hadoop's beginning, and now you have plenty of opportunities to use Hadoop in a more real-time way.
First, I think it's important to define what you mean by real-time. It could be that you're interested in stream processing, or it could be that you want to run queries on your data that return results in real time.
For stream processing on Hadoop, natively Hadoop won't provide you with this kind of capabilities, but you can integrate some other projects with Hadoop easily:
  • Storm-YARN allows you to use Storm on your Hadoop cluster via YARN.
  • Spark Streaming integrates with HDFS and allows you to process streaming data in near real time.
For real-time queries there are also several projects which use Hadoop:
  • Impala from Cloudera uses HDFS but bypasses MapReduce altogether because there's too much overhead otherwise.
  • Apache Drill is another project that integrates with Hadoop to provide real-time query capabilities.
  • The Stinger project aims to make Hive itself more real-time.
There are probably other projects that would fit into the list of "Making Hadoop real-time", but these are the most well-known ones.
So as you can see, Hadoop is going more and more towards the direction of real-time and, even if it wasn't designed for that, you have plenty of opportunities to extend it for real-time purposes.


Types of tables in Hive?

How can we optimize Hive tables?

How can we optimize a MapReduce job?

What kind of data will you have?

What is the size of the cluster?

What is the size of the data?

What is Distributed Cache in mapreduce framework?

The distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives, and jars, which applications can then use to improve performance. An application provides the details of the files to cache through the JobConf object. The MapReduce framework copies the specified files to the data nodes before the job runs; each file is copied only once per job, and archives are unpacked on the workers. The application needs to specify the file path via http:// or hdfs:// for it to be cached.

Hbase vs RDBMS
HBase is a database, but its implementation is totally different from an RDBMS. HBase is a distributed, column-oriented, versioned data storage system. It became a Hadoop ecosystem project and helps Hadoop overcome the challenges of random reads and writes. HDFS is the layer underneath HBase and provides fault tolerance and linear scalability. HBase saves data as key-value pairs, and it has built-in support for dynamically adding a column to the table schema of a pre-existing column family. HBase is not relational and does not support SQL.

An RDBMS follows Codd's 12 rules. RDBMSs are designed to follow a strictly fixed schema. They are row-oriented databases and are not natively designed for distributed scalability. An RDBMS welcomes secondary indexes and improves data retrieval through the SQL language, and it has very good, easy support for complex joins and aggregate functions.

What is map side join and reduce side join?
Two large data sets can also be joined in MapReduce programming. A join in the map phase is referred to as a map-side join, while a join on the reduce side is called a reduce-side join. Let's go into detail on why we would need to join data in MapReduce at all. Say data set A has master data and B has transactional data (A and B are just for reference); we need to join them on a common key to get a result. It is important to realize that we can share data with side-data-sharing techniques (passing key-value pairs in the job configuration, or the distributed cache) if the master data set is small. We use a map-reduce join only when both data sets are too big for those techniques.
Joins in raw MapReduce are not the recommended way; the same problem can usually be addressed through high-level frameworks like Hive or Cascading. Even so, if you are in that situation, you can use the method mentioned below (a Hive map-join sketch follows the list of prerequisites).

Map side Join
A map-side join performs the join before the data reaches the map function. It has strong prerequisites:

1. The data should be partitioned and sorted in a particular way.
2. Each input data set should be divided into the same number of partitions.
3. Both data sets must be sorted by the same key.
4. All the records for a particular key must reside in the same partition.
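In practice, Hive can perform the map-side join for you when one table is small enough to fit in memory. A minimal sketch, with hypothetical tables big_table and small_table:
SET hive.auto.convert.join = true;   -- optional: let Hive convert eligible joins automatically
SELECT /*+ MAPJOIN(b) */ a.id, a.val, b.val
FROM big_table a
JOIN small_table b ON (a.id = b.id);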


What is shuffleing in mapreduce?
Once map tasks start to complete, communication from the reducers begins: the map output is sent to the reducers that are waiting for data to process, while the nodes are still running other tasks. This transfer of the mappers' output to the reducers is known as shuffling.


What is partitioning?
Partitioning is the process of identifying which reducer instance will receive each piece of mapper output. Before the mapper emits a (key, value) pair, it identifies the reducer that will be the recipient. All values for a given key, no matter which mapper generated them, must end up at the same reducer.

Difference between Hive managed tables vs External tables
Hive managed tables are completely managed by Hive: Hive keeps a copy of the table's data in its own warehouse, and at drop time Hive itself is responsible for removing those files from the warehouse. In contrast, external tables are created with the EXTERNAL keyword at table-creation time, and Hive does not copy any data into the warehouse; when the table is dropped, the data remains intact.

External Tables: An external table refers to data that is outside of the warehouse directory.
CREATE EXTERNAL TABLE tbl_name (col STRING)
LOCATION '/user/husr/';
LOAD DATA INPATH '/user/husr/data.txt' INTO TABLE tbl_name;

In case of external tables, Hive does not move the data into its warehouse directory. If the external table is dropped, then the table metadata is deleted but not the data.
Note: Hive does not check whether the external table location exists or not at the time the external table is created. 

Normal Tables: Hive manages the normal (managed) tables it creates and moves their data into its warehouse directory.
As an example, consider the table creation and loading of data into the table.
CREATE TABLE tbl_name (col STRING);

LOAD DATA INPATH '/user/husr/data.txt' INTO TABLE tbl_name;