What
is Hive?
Hive is a data warehouse system for Hadoop that
facilitates easy data summarization, ad-hoc queries, and the analysis of large
datasets stored in Hadoop compatible file systems.
Hive was originally developed at Facebook. It’s now a Hadoop subproject with many contributors. Users need to
concentrate only on the top level hive language rather than java map reduce
programs. One of the main advantages of Hive is its SQLish nature. Thus it
leverages the usability to a higher extend.
A hive program will be
automatically compiled into map-reduce jobs executed on Hadoop. In addition,
HiveQL supports custom map-reduce scripts to be plugged into queries.
Hive example:
selecting the employee names
whose salary more than 100 dollars from a hive table called tbl_employee.
SELECT employee_name FROM
tbl_employee WHERE salary > 100;
Users are excited to use Hive since it is very similar
to SQL.
What are the types of tables in
Hive?
There are two types of tables.
1. Managed tables.
2.
External tables.
Only the drop table command
differentiates managed and external tables. Otherwise, both type of tables are
very similar.
Does Hive support record level
Insert, delete or update?
Hive does not provide record-level update, insert, or
delete. Henceforth, Hive does not provide transactions too. However, users can
go with CASE statements and built in functions of Hive to satisfy the above DML
operations. Thus, a complex update query in a RDBMS may need many lines of code
in Hive.
What kind of data warehouse
application is suitable for Hive?
Hive is not a full database. The design constraints
and limitations of Hadoop and HDFS impose limits on what Hive can do.
Hive is most suited for data warehouse applications,
where
1)
Relatively static data is
analyzed,
2)
Fast response times are not
required, and
3)
When the data is not changing
rapidly.
Hive doesn’t provide crucial features required for
OLTP, Online Transaction Processing. It’s closer to being an OLAP tool, Online
Analytic Processing.So, Hive is best suited for data
warehouse applications, where a large data set is maintained and mined for
insights, reports, etc.
How can the columns of a table
in hive be written to a file?
By using awk command in shell,
the output from HiveQL (Describe) can be written to a file.
hive -S -e “describe table_name;”
| awk -F” ” ’{print 1}’ > ~/output.
CONCAT function in Hive with
Example?
CONCAT function will concat the
input strings. You can specify any number of strings separated by comma.
Example:
CONCAT
(‘Hive’,’-’,’performs’,’-’,’good’,’-’,’in’,’-’,’Hadoop’);
Output:
Hive-performs-good-in-Hadoop
So, every time you delimit the
strings by ‘-’. If it is common for all the strings, then Hive provides another
command CONCAT_WS. Here you have to specify the delimit operator first.
CONCAT_WS
(‘-’,’Hive’,’performs’,’good’,’in’,’Hadoop’);
Output:
Hive-performs-good-in-Hadoop
REPEAT function in Hive with
example?
REPEAT function will repeat the
input string n times specified in the command.
Example:
REPEAT(‘Hadoop’,3);
Output:
HadoopHadoopHadoop.
Note: You can add a space with the
input string also.
TRIM function in Hive with
example?
TRIM function will remove the
spaces associated with a string.
Example:
TRIM(‘ Hadoop ‘);
Output:
Hadoop.
Note: If you want to remove only
leading or trialing spaces then you can specify the below commands
respectively.
LTRIM(‘ Hadoop’);
RTRIM(‘Hadoop ‘);
REVERSE function in Hive with
example?
REVERSE function will reverse the
characters in a string.
Example:
REVERSE(‘Hadoop’);
Output:
poodaH
LOWER or LCASE function in Hive
with example?
LOWER or LCASE function will
convert the input string to lower case characters.
Example:
LOWER(‘Hadoop’);
LCASE(‘Hadoop’);
Output:
hadoop
Note:
If the characters are already in
lower case then they will be preserved.
UPPER or UCASE function in Hive
with example?
UPPER or UCASE function will
convert the input string to upper case characters.
Example:
UPPER(‘Hadoop’);
UCASE(‘Hadoop’);
Output:
HADOOP
Note:
If the characters are already in
upper case then they will be preserved.
Double type in Hive – Important
points?
It is important to know about the
double type in Hive. Double type in Hive will present the data differently
unlike RDBMS.
See the double type data below:
24624.0
32556.0
3.99893E5
4366.0
E5 represents 10^5 here. So, the
value 3.99893E5 represents 399893. All the calculations will be accurately
performed using double type. The maximum value for a IEEE 754 double is about
2.22E308.
It is crucial while exporting the
double type data to any RDBMS since the type may be wrongly interpreted. So, it
is advised to cast the double type into appropriate type before exporting.
Rename a table in Hive – How to
do it?
Using ALTER command, we can
rename a table in Hive.
ALTER TABLE hive_table_name
RENAME TO new_name;
There is another way to rename a
table in Hive. Sometimes, ALTER may take more time if the underlying table has
more partitions/functions. In that case, Import and export options can be
utilized. Here you are saving the hive data into HDFS and importing back to new
table like below.
EXPORT TABLE tbl_name TO
‘HDFS_location’;
IMPORT TABLE new_tbl_name FROM
‘HDFS_location’;
If you prefer to just preserve
the data, you can create a new table from old table like below.
CREATE TABLE new_tbl_name AS
SELECT * FROM old_tbl_name;
DROP TABLE old_tbl_name;
How to change a column data
type in Hive?
ALTER TABLE table_name CHANGE
column_name column_name new_datatype;
Example: If you want to change
the data type of ID column from integer to bigint in a table called employee.
ALTER TABLE employee CHANGE id id
BIGINT;
Difference between order by and
sort by in hive?
SORT BY will sort the data within
each reducer. You can use any number of reducers for SORT BY operation.
ORDER BY will sort all of the
data together, which has to pass through one reducer. Thus, ORDER BY in hive
uses single reducer.
ORDER BY guarantees total order
in the output while SORT BY only guarantees ordering of the rows within a
reducer. If there is more than one reducer, SORT BY may give partially ordered
final results
RLIKE in Hive?
RLIKE (Right-Like) is a special
function in Hive where if any substring of A matches with B then it evaluates
to true. It also obeys Java regular expression pattern. Users don’t need to put
% symbol for a simple match in RLIKE.
Examples:
‘Express’ RLIKE ‘Exp’ –> True
‘Express’ RLIKE ‘^E.*’ –> True
(Regular expression)
Moreover, RLIKE will come handy
when the string has some spaces. Without using TRIM function, RLIKE satisfies
the required scenario. Suppose if A has value ‘Express ‘ (2 spaces
additionally) and B has value ‘Express’ RLIKE will work better without using
TRIM.
‘Express’ RLIKE ‘Express’ –>
True
Note:
RLIKE evaluates to NULL if A or B
is NULL.
PIG INTERVIEW QUESTIONS:
Can you give us some examples
how Hadoop is used in real time environment?
Let us assume that we have an
exam consisting of 10 Multiple-choice questions and 20 students appear for that
exam. Every student will attempt each question. For each question and each
answer option, a key will be generated. So we have a set of key-value pairs for
all the questions and all the answer options for every student. Based on the
options that the students have selected, you have to analyze and find out how
many students have answered correctly.
This isn’t an easy task. Here
Hadoop comes into picture! Hadoop helps you in solving these problems quickly
and without much effort. You may also take the case of how many students have
wrongly attempted a particular question.
What is BloomMapFile used for?
The BloomMapFile is a class that
extends MapFile. So its functionality is similar to MapFile.
BloomMapFile uses dynamic Bloom
filters to provide quick membership test for the keys. It is used in Hbase
table format.
What is PIG?
PIG is a platform for analyzing large data sets that
consist of high level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs. PIG’s infrastructure layer
consists of a compiler that produces sequence of MapReduce Programs.
What is the difference between
logical and physical plans?
Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After
performing the basic parsing and semantic checking, it produces a logical plan.
The logical plan describes the logical operators that have to be executed by
Pig during execution. After this, Pig produces a physical plan. The physical
plan describes the physical operators that are needed to execute the script.
Does ‘ILLUSTRATE’ run MR job?
No, illustrate will not pull any
MR, it will pull the internal data. On the console, illustrate will not do any
job. It just shows output of each stage and not the final output.
Is the keyword ‘DEFINE’ like a
function name?
Yes, the keyword ‘DEFINE’ is like
a function name. Once you have registered, you have to define it. Whatever
logic you have written in Java program, you have an exported jar and also a jar
registered by you. Now the compiler will check the function in exported jar.
When the function is not present in the library, it looks into your jar.
Is the keyword ‘FUNCTIONAL’ a
User Defined Function (UDF)?
No, the keyword ‘FUNCTIONAL’ is
not a User Defined Function (UDF). While using UDF, we have to override some
functions. Certainly you have to do your job with the help of these functions
only. But the keyword ‘FUNCTIONAL’ is a built-in function i.e a pre-defined
function, therefore it does not work as a UDF.
Why do we need MapReduce during
Pig programming?
Pig is a high-level platform that
makes many Hadoop data analysis issues easier to execute. The language we use
for this platform is: Pig Latin. A program written in Pig Latin is like a query
written in SQL, where we need an execution engine to execute the query. So,
when a program is written in Pig Latin, Pig compiler will convert the program
into MapReduce jobs. Here, MapReduce acts as the execution engine.
Are there any problems which
can only be solved by MapReduce and cannot be solved by PIG? In which kind of
scenarios MR jobs will be more useful than PIG?
Let us take a scenario where we
want to count the population in two cities. I have a data set and sensor list
of different cities. I want to count the population by using one mapreduce for
two cities. Let us assume that one is Bangalore and the other is Noida. So I
need to consider key of Bangalore city similar to Noida through which I can
bring the population data of these two cities to one reducer. The idea behind
this is somehow I have to instruct map reducer program – whenever you find city
with the name ‘Bangalore‘ and city with the name ‘Noida’, you create the alias
name which will be the common name for these two cities so that you create a
common key for both the cities and it get passed to the same reducer. For this,
we have to write custom partitioner.
In MapReduce when you create a
‘key’ for city, you have to consider ’city’ as the key. So, whenever the
framework comes across a different city, it considers it as a different key.
Hence, we need to use customized partitioner. There is a provision in mapreduce
only, where you can write your custom partitioner and mention if city =
bangalore or noida then pass similar hashcode. However, we cannot create custom
partitioner in Pig. As Pig is not a framework, we cannot direct execution engine
to customize the partitioner. In such scenarios, MapReduce works better than
Pig.
Does Pig give any warning when
there is a type mismatch or missing field?
No, Pig will not show any warning
if there is no matching field or a mismatch. If you assume that Pig gives such
a warning, then it is difficult to find in log file. If any mismatch is found,
it assumes a null value in Pig.
What co-group does in Pig?
Co-group joins the data set by
grouping one particular data set only. It groups the elements by their common
field and then returns a set of records containing two separate bags. The first
bag consists of the record of the first data set with the common data set and
the second bag consists of the records of the second data set with the common
data set.
Can we say cogroup is a group
of more than 1 data set?
Cogroup is a group of one data
set. But in the case of more than one data sets, cogroup will group all the
data sets and join them based on the common field. Hence, we can say that
cogroup is a group of more than one data set and join of that data set as well.
What does FOREACH do?
FOREACH is used to apply
transformations to the data and to generate new data items. The name itself is
indicating that for each element of a data bag, the respective action will be
performed.
Syntax : FOREACH bagname GENERATE
expression1, expression2, …..
The meaning of this statement is
that the expressions mentioned after GENERATE will be applied to the current
record of the data bag.
What is bag?
A bag is one of the data models
present in Pig. It is an unordered collection of tuples with possible
duplicates. Bags are used to store collections while grouping. The size of bag
is the size of the local disk, this means that the size of the bag is limited.
When the bag is full, then Pig will spill this bag into local disk and keep
only some parts of the bag in memory. There is no necessity that the complete
bag should fit into memory. We represent bags with “{}”.
Why Collection Data Types in hive ?
Most relational databases don’t support such collection types, because using them tends to break normal form. For example, in traditional data models,
structs
might be captured in separate tables, with foreign key relations between the tables, as appropriate.
A practical problem with breaking normal form is the greater risk of data duplication, leading to unnecessary disk space consumption and potential data inconsistencies, as duplicate copies can grow out of sync as changes are made.
However, in Big Data systems, a benefit of sacrificing normal form is higher processing throughput. Scanning data off hard disks with minimal “head seeks” is essential when processing terabytes to petabytes of data. Embedding collections in records makes retrieval faster with minimal seeks .Navigating each foreign key relationship requires seeking across the disk, with significant performance overhead.