Friday, April 25, 2014

Complex Hadoop Interview Questions

Is Hadoop designed for real-time systems?

No, Hadoop was initially designed for batch processing: take a large dataset as input all at once, process it, and write a large output. The very concept of MapReduce is geared towards batch, not real-time. But to be honest, this was only the case at Hadoop's beginning, and now you have plenty of opportunities to use Hadoop in a more real-time way.
First, I think it's important to define what you mean by real-time. It could be that you're interested in stream processing, or it could be that you want to run queries on your data that return results in real time.
For stream processing on Hadoop, Hadoop won't natively provide this kind of capability, but you can easily integrate other projects with it:
  • Storm-YARN allows you to use Storm on your Hadoop cluster via YARN.
  • Spark Streaming integrates with HDFS, allowing you to process streaming data in near real-time.
For real-time queries there are also several projects which use Hadoop:
  • Impala from Cloudera uses HDFS but bypasses MapReduce altogether because there's too much overhead otherwise.
  • Apache Drill is another project that integrates with Hadoop to provide real-time query capabilities.
  • The Stinger project aims to make Hive itself more real-time.
There are probably other projects that would fit into the list of "making Hadoop real-time", but these are the most well-known ones.
So as you can see, Hadoop is moving more and more toward real-time, and even though it wasn't designed for it, you have plenty of opportunities to extend it for real-time purposes.


Other questions you should be prepared for:

  • What are the types of tables in Hive?
  • How can we optimize Hive tables?
  • How can we optimize a MapReduce job?
  • What kind of data will you have?
  • What is the size of the cluster?
  • What is the size of the data?

What is the Distributed Cache in the MapReduce framework?

The distributed cache is an important feature provided by the MapReduce framework. It can cache read-only files (plain files, archives, JARs) that an application needs, to improve performance. The application provides the details of the files to cache to the JobConf object, and the MapReduce framework copies the specified files to the worker nodes before any task of the job runs there. Files are copied only once per job, and cached archives are unarchived on the workers. The application specifies each file to cache via an hdfs:// or http:// URL.
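With the newer MapReduce API, the same idea is expressed through Job.addCacheFile. Here is a minimal sketch; the HDFS path and the lookup file name are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The framework has already copied the cached file to this node;
            // the "#lookup.txt" fragment used below makes it available as a
            // symlink named lookup.txt in the task's working directory.
            BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
            // ... load the lookup data into an in-memory map here ...
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed cache example");
        job.setJarByClass(CacheExample.class);
        // Register the file with the distributed cache before submitting the job
        job.addCacheFile(new URI("hdfs:///user/husr/lookup.txt#lookup.txt"));
        job.setMapperClass(LookupMapper.class);
        // ... set input/output formats and paths, then job.waitForCompletion(true) ...
    }
}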

HBase vs RDBMS
HBase is a database, but its implementation is totally different from an RDBMS. HBase is a distributed, column-oriented, versioned data storage system. It became a Hadoop ecosystem project to help Hadoop overcome its challenges with random reads and writes. HDFS is the layer underneath HBase and provides it with fault tolerance and linear scalability. HBase saves data as key-value pairs and has built-in support for dynamically adding columns to a table within a pre-existing column family. HBase is not relational and does not support SQL.

An RDBMS follows Codd's 12 rules. RDBMSs are designed around a strictly fixed schema. They are row-oriented databases and are not natively designed for distributed scalability. An RDBMS supports secondary indexes and flexible data retrieval through the SQL language, and has very good, easy-to-use support for complex joins and aggregate functions.
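To make the dynamic-column point concrete, here is a minimal sketch using the classic HBase Java client API of that era; the users table and its info column family are hypothetical and assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumnExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");

        Put put = new Put(Bytes.toBytes("row1"));
        // A column the application has written before
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        // A brand-new column qualifier: no ALTER TABLE and no schema migration,
        // because only the column family has to exist in advance
        put.add(Bytes.toBytes("info"), Bytes.toBytes("nickname"), Bytes.toBytes("Al"));
        table.put(put);
        table.close();
    }
}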

What is map side join and reduce side join?
Two large datasets can also be joined in MapReduce programming. A join done in the map phase is referred to as a map-side join, while a join done at the reduce side is called a reduce-side join. Let's look in detail at why we would need to join data in MapReduce at all. Say dataset A has master data and dataset B has transactional data (A and B are just for reference); we need to join them on a common key to produce a result. It is important to realize that if the master dataset is small, we can share it with side-data techniques (passing key-value pairs in the job configuration, or the distributed cache). We use a MapReduce join only when both datasets are too big for those techniques.
Joins in raw MapReduce are not the recommended way; the same problem can usually be addressed through high-level frameworks like Hive or Cascading. But if you are in that situation, you can use the methods described below.
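
Reduce side Join
A reduce-side join tags each record with its source dataset in the map phase, and the actual join happens in the reducer once the shuffle has brought all records with the same key together. Here is a minimal sketch, assuming both datasets are comma-separated text with the join key in the first field; the field layout and the "A|"/"B|" tags are conventions invented for this example:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

    // Mapper for the master dataset (A): emit (join key, tagged record)
    public static class MasterMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 2);
            context.write(new Text(parts[0]), new Text("A|" + parts[1]));
        }
    }

    // A second, analogous mapper would tag the transactional dataset (B) with "B|".
    // After the shuffle, the reducer sees all tagged records for one key together.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> masters = new ArrayList<String>();
            List<String> transactions = new ArrayList<String>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("A|")) masters.add(s.substring(2));
                else transactions.add(s.substring(2));
            }
            // Cross product of the matching records: this is the join itself
            for (String m : masters)
                for (String t : transactions)
                    context.write(key, new Text(m + "," + t));
        }
    }
}

In the driver, you would attach MasterMapper and the mapper for dataset B to their respective input paths with MultipleInputs.addInputPath(...) and set JoinReducer as the reducer.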

Map side Join
A map-side join performs the join before the data reaches the map function. It has strong prerequisites (a typical driver setup is sketched after this list):

1. The data must be partitioned and sorted in a particular way.
2. Each input dataset must be divided into the same number of partitions.
3. Both datasets must be sorted on the same key.
4. All the records for a particular key must reside in the same partition.
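
When those conditions hold, Hadoop's CompositeInputFormat (from the older mapred API) can perform the join before the map function runs. A minimal driver sketch, with hypothetical input paths /data/A and /data/B:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapSideJoinDriver.class);
        conf.setJobName("map-side join");

        // Both inputs must already be partitioned and sorted identically,
        // e.g. each written by an earlier job with the same number of reducers.
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr",
                CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                        new Path("/data/A"), new Path("/data/B")));

        // The map function then receives a TupleWritable holding the matching
        // records from A and B for each key; configure the mapper, output types
        // and output path here, then submit with JobClient.runJob(conf).
    }
}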


What is shuffling in MapReduce?
Once map tasks start to complete, communication with the reducers begins: each reducer fetches the map output it needs to process, while the nodes may still be running multiple other tasks at the same time. This transfer of the mappers' output to the reducers is known as shuffling.


What is partitioning?
Partitioning is the process of identifying which reducer instance will receive a given piece of mapper output. Before the mapper emits a (key, value) pair, the partitioner determines which reducer will receive it. All occurrences of a key, no matter which mapper generated them, must end up at the same reducer.
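
As an illustration, here is a minimal custom partitioner, equivalent in spirit to Hadoop's default HashPartitioner; it guarantees that every occurrence of a key lands on the same reducer no matter which mapper emitted it:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is always a valid partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

The job registers it with job.setPartitionerClass(KeyHashPartitioner.class).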

Difference between Hive managed tables vs External tables
Hive managed tables are completely managed by Hive: Hive moves the table's data into its own warehouse directory, and at drop time Hive itself is responsible for removing that data from the warehouse. In contrast, external tables are created with the EXTERNAL keyword at table-creation time, and Hive does not copy any data into the warehouse; when the table is dropped, the data remains intact.

External Tables: An external table refers to data that lives outside of the warehouse directory. For example (ext_table is just a placeholder table name):
CREATE EXTERNAL TABLE ext_table (col STRING)
LOCATION '/user/husr/';
LOAD DATA INPATH '/user/husr/data.txt' INTO TABLE ext_table;

In the case of external tables, Hive does not move the data into its warehouse directory. If the external table is dropped, the table metadata is deleted but the data is not.
Note: Hive does not check whether the external table location exists at the time the external table is created.

Normal Tables: Hive manages the normal tables it creates and moves their data into its warehouse directory.
As an example, consider creating a table and loading data into it (again, managed_table is just a placeholder name):
CREATE TABLE managed_table (col STRING);

LOAD DATA INPATH '/user/husr/data.txt' INTO TABLE managed_table;