Friday, April 25, 2014

Complex Hadoop Interview Questions

Is Hadoop designed for real-time systems?

No, Hadoop was initially designed for batch processing: take a large dataset as input all at once, process it, and write a large output. The very concept of MapReduce is geared towards batch rather than real-time processing. To be honest, though, this was only true at Hadoop's beginning, and there are now plenty of ways to use Hadoop in a more real-time fashion.
First, I think it's important to define what you mean by real-time. You may be interested in stream processing, or you may want to run queries on your data that return results in real time.
For stream processing on Hadoop: Hadoop itself does not natively provide this capability, but you can easily integrate other projects with it:
  • Storm-YARN allows you to use Storm on your Hadoop cluster via YARN.
  • Spark (via Spark Streaming) integrates with HDFS to let you process streaming data in near real time.
For real-time queries there are also several projects which use Hadoop:
  • Impala from Cloudera uses HDFS but bypasses MapReduce altogether because there's too much overhead otherwise.
  • Apache Drill is another project that integrates with Hadoop to provide real-time query capabilities.
  • The Stinger project aims to make Hive itself more real-time.
There are probably other projects that would fit into the list of "making Hadoop real-time", but these are the best-known ones.
So, as you can see, Hadoop is moving more and more in the direction of real-time, and even though it wasn't designed for that, there are plenty of opportunities to extend it for real-time purposes.


Types of tables in Hive:

How can we optimize Hive tables?

How can we optimize a MapReduce job?

What kind of data will you have?

What is the size of the cluster?

What is the size of the data?

What is the Distributed Cache in the MapReduce framework?

The distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives, and JARs that the application needs, which can improve performance. The application provides the details of the files to cache through the JobConf object, and the framework copies the specified files to the worker nodes before any task of the job runs there. The framework copies each file only once per job, and can unarchive archives on the workers. Files to cache are specified via http:// or hdfs:// URLs.
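
As an illustration, here is a minimal sketch using the Hadoop 2 org.apache.hadoop.mapreduce API; the HDFS URI, the lookup.txt link name, the input/output paths, and the mapper logic are illustrative assumptions, not from the original question:

import java.io.File;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CacheExample {

    public static class LookupMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void setup(Context context)
                throws java.io.IOException, InterruptedException {
            // The cached file shows up in the task's working directory under
            // the link name given after the '#' in the cache URI.
            File lookup = new File("lookup.txt");
            // ... load lookup data into memory for use in map() ...
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Illustrative: emit each input line with a count of 1,
            // as if it had been enriched using the cached lookup data.
            context.write(new Text(value.toString()), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed cache example");
        job.setJarByClass(CacheExample.class);
        job.setMapperClass(LookupMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Illustrative HDFS URI; '#lookup.txt' sets the local link name.
        job.addCacheFile(new URI("hdfs://namenode:8020/data/lookup.txt#lookup.txt"));
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}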

HBase vs RDBMS
HBase is a database, but its implementation is totally different from an RDBMS. HBase is a distributed, column-oriented, versioned data storage system. It became a Hadoop ecosystem project and helps Hadoop overcome its challenges with random reads and writes. HDFS is the layer underneath HBase and provides fault tolerance and linear scalability. HBase saves data as key-value pairs and has built-in support for dynamically adding columns to a pre-existing column family. HBase is not relational and does not support SQL.

An RDBMS follows Codd's 12 rules and is designed around a strictly fixed schema. RDBMSs are row-oriented databases and were not natively designed for distributed scalability. An RDBMS supports secondary indexes and flexible data retrieval through the SQL language, and has very good, easy-to-use support for complex joins and aggregate functions.
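
To illustrate the dynamic-column point above, here is a minimal sketch using the HBase 1.x Java client API; the table name ('users'), column family ('info'), and qualifier ('new_attribute') are made up for the example, and only the column family needs to exist in the table schema:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumnExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Column qualifiers are not part of the table schema: any new
            // qualifier can be written under an existing column family
            // without altering the table definition.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("new_attribute"),
                          Bytes.toBytes("some value"));
            table.put(put);
        }
    }
}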

What is a map side join and a reduce side join?
Two large datasets can also be joined in MapReduce programming. A join performed in the map phase is referred to as a map side join, while a join performed on the reduce side is called a reduce side join. Let's go into detail on why we would need to join data in MapReduce at all. Say dataset A holds master data and dataset B holds transactional data (A and B are just for reference); we need to join them on a common key to produce a result. It is important to realize that if the master dataset is small, we can share it with side-data techniques (passing key-value pairs in the job configuration, or the distributed cache). We use a MapReduce join only when both datasets are too big for those techniques.
Joins in raw MapReduce are not the recommended approach; the same problem can be addressed through higher-level frameworks such as Hive or Cascading. But if you are in that situation, you can use the methods below.

Map side Join
A map side join performs the join before the data reaches the map function. It has strong prerequisites (a code sketch follows the list):

1. The data must be partitioned and sorted in a particular way.
2. Each input dataset must be divided into the same number of partitions.
3. Both datasets must be sorted on the same key.
4. All the records for a particular key must reside in the same partition.
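
Hadoop ships an input format for exactly this case. Here is a minimal, hedged sketch using CompositeInputFormat from org.apache.hadoop.mapreduce.lib.join; the input paths and the tab-separated output are illustrative, and both inputs must already satisfy the prerequisites above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapSideJoin {

    // The mapper receives the join key and a tuple with one value per input.
    public static class JoinMapper extends Mapper<Text, TupleWritable, Text, Text> {
        @Override
        protected void map(Text key, TupleWritable value, Context context)
                throws java.io.IOException, InterruptedException {
            // value.get(0) is from the first input, value.get(1) from the second.
            context.write(key, new Text(value.get(0) + "\t" + value.get(1)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "inner" requests an inner join of the two (pre-sorted, identically
        // partitioned) inputs.
        conf.set(CompositeInputFormat.JOIN_EXPR, CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path("/data/master"), new Path("/data/transactions")));
        Job job = Job.getInstance(conf, "map side join");
        job.setJarByClass(MapSideJoin.class);
        job.setInputFormatClass(CompositeInputFormat.class);
        job.setMapperClass(JoinMapper.class);
        job.setNumReduceTasks(0); // the join happens entirely on the map side
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/joined"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}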


What is shuffling in MapReduce?
Once map tasks start to complete, communication from the reducers begins: each reducer fetches the map output it is responsible for processing, while the data nodes are still running other tasks at the same time. This transfer of the mappers' output to the reducers is known as shuffling.
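
The point at which reducers start fetching is configurable. A small, hedged sketch of the relevant Hadoop 2 properties (the values shown are illustrative, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Reducers start fetching (shuffling) map output once this fraction
        // of the map tasks has completed; the usual default is 0.05.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
        // Number of parallel fetch threads each reducer uses during shuffle.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        Job job = Job.getInstance(conf, "shuffle tuning example");
        // ... the rest of the job setup goes here ...
    }
}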


What is partitioning?
Partitioning is the process of identifying the reducer instance that will receive a given piece of the mappers' output. Before a mapper emits a (key, value) pair, the partitioner identifies the reducer that will be its recipient. All values for a given key, no matter which mapper generated them, must end up at the same reducer.
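
The default is HashPartitioner, which assigns a key by hashing it modulo the number of reducers. A minimal sketch of a custom partitioner (the first-letter routing rule is made up for the example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers by their first character, so that all keys
// starting with the same letter land on the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Mask off the sign bit so the result is always non-negative.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

// Registered on the job with:
//   job.setPartitionerClass(FirstLetterPartitioner.class);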

Difference between Hive managed tables and external tables
Managed tables are completely managed by Hive: Hive moves the table's data into its own warehouse directory, and when the table is dropped, Hive itself removes the data from the warehouse. In contrast, an external table is created with the EXTERNAL keyword, and Hive does not copy any data into the warehouse; when the table is dropped, the data remains intact.

External Tables: An external table refers to the data that is outside of the warehouse directory.
CREATE EXTERNAL TABLE ext_table (col STRING)  -- 'ext_table' is an illustrative name
LOCATION '/user/husr/';
LOAD DATA INPATH '/user/husr/data.txt' INTO TABLE ext_table;

In case of external tables, Hive does not move the data into its warehouse directory. If the external table is dropped, then the table metadata is deleted but not the data.
Note: Hive does not check whether the external table location exists or not at the time the external table is created. 

Normal Tables: Hive manages the normal tables it creates and moves their data into its warehouse directory.
As an example, consider the table creation and loading of data into the table.
CREATE TABLE managed_table (col STRING);  -- 'managed_table' is an illustrative name

LOAD DATA INPATH '/user/husr/data.txt' INTO TABLE managed_table;




