Tuesday, May 21, 2013

HBase Interview Questions




What is HBase?

HBase is a column-oriented, open-source, multidimensional, distributed database. It runs on top of HDFS.

Why do we use HBase?
HBase provides random, real-time read and write access to large data sets, which is needed when an application has to perform thousands of operations per second on big data.

List the main components of HBase.
ZooKeeper
Catalog Tables
Master
RegionServer
Region

How many operational commands are there in HBase?
There are five main operational commands in HBase (a short Java usage sketch follows the list):
1. Get
2. Put
3. Delete
4. Scan
5. Increment
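A minimal sketch of these five commands with the Java client API of that era (HTable-based); the "users" table, the "info" column family, and the qualifiers are assumptions made for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCommandsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // table name assumed for the example

        // Put: write one cell
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // Get: read a single row by key
        Result row = table.get(new Get(Bytes.toBytes("row1")));

        // Scan: iterate over a range of rows
        ResultScanner scanner = table.getScanner(new Scan());
        for (Result r : scanner) {
            // process each row here
        }
        scanner.close();

        // Increment: atomically bump a counter column
        table.incrementColumnValue(Bytes.toBytes("row1"),
                Bytes.toBytes("info"), Bytes.toBytes("logins"), 1L);

        // Delete: remove the whole row (or individual columns)
        table.delete(new Delete(Bytes.toBytes("row1")));

        table.close();
    }
}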

How do you open a connection in HBase?
Using the Java client API (the HTable-based API current when this was written), a connection to a table is opened as follows:

// Load the HBase configuration (reads hbase-site.xml from the classpath)
Configuration myConf = HBaseConfiguration.create();
// Open a handle to the "users" table
HTableInterface usersTable = new HTable(myConf, "users");

What is MemStore in HBase?
The MemStore is the in-memory write buffer of a Store (one per column family per region). Writes go to the write-ahead log and then into the MemStore; when the MemStore reaches its configured flush size, its contents are written to disk as a new StoreFile (HFile).

What is Block Cache in HBase?
The BlockCache is the RegionServer's read cache. HFile blocks read from HDFS are kept in memory so that frequently accessed data can be served again without going back to disk.

What is data versioning in HBase?
Every value in HBase is stored with a timestamp, so a single {row, column family:qualifier} coordinate can hold several versions of the same cell. The number of versions kept is configured per column family, and reads return the newest version unless older versions are explicitly requested.
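As an illustration, a sketch of requesting several versions of a cell with the Java client API (table, family, and qualifier names are assumed):

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");

        Get get = new Get(Bytes.toBytes("row1"));
        get.setMaxVersions(3);      // return up to 3 versions per cell instead of only the newest
        Result result = table.get(get);

        // Each KeyValue carries its own timestamp (the version)
        List<KeyValue> versions = result.getColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
        for (KeyValue kv : versions) {
            System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}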

How can you define a cell in HBase?
A cell is the smallest unit of data in HBase: the value stored at the intersection of a row key, a column (column family plus qualifier), and a version (timestamp).

What is major and minor compaction in HBase?
A minor compaction merges a small number of adjacent StoreFiles into one larger StoreFile. A major compaction rewrites all StoreFiles of a Store into a single file and, in the process, removes deleted and expired cells.


    NoSQL?
    HBase is a type of NoSQL database. "NoSQL" is a general term meaning that the database isn't an RDBMS that supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a data store than a database, because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
    However, HBase has many features that support both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in storage and in processing capacity. An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best performance it requires specialized hardware and storage devices. HBase features of note are:
    Strongly consistent reads/writes: HBase is not an eventually consistent data store. This makes it very suitable for tasks such as high-speed counter aggregation.
    Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.



    When Should I Use HBase?
    HBase isn't suitable for every problem.
    First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) and the rest of the cluster may sit idle.
    Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be ported to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase a complete redesign rather than a port.
    Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.


    What Is The Difference Between HBase and Hadoop/HDFS?
    HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed StoreFiles that exist on HDFS for high-speed lookups.
    Does HBase support SQL?
    Not really. SQL-ish support for HBase via Hive is in development, but Hive is based on MapReduce, which is not generally suitable for low-latency requests. That said, newer frameworks such as Splice Machine are appearing that layer SQL on top of HBase, and CDH5 also supports SQL on HBase.
    Are there any schema design examples?
    There's a very big difference between the storage model of relational/row-oriented databases and that of column-oriented databases. For example, suppose I have a table of users and I need to store friendships between these users. In a relational database my design is something like:
    Table: users (pkey = userid); Table: friendships (userid, friendid, ...), which contains one (or maybe two, depending on how it's implemented) row for each friendship.
    To look up a given user's friends: SELECT * FROM friendships WHERE userid = myid;
    The cost of this relational query continues to increase as a user adds more friends. You also begin to hit practical limits. If I have millions of users, each with many thousands of potential friends, the size of these indexes grows quickly and things get nasty. Rather than friendships, imagine I'm storing activity logs of actions taken by users.
    In a column-oriented database these things scale continuously, with minimal difference between 10 users and 10,000,000 users, or 10 friendships and 10,000 friendships.
    Rather than a friendships table, you could just have a friendships column family in the users table. Each column in that family would contain the ID of a friend. The value could store anything else you would have stored in the friendships table in the relational model. Since column families are stored together/sequentially on a per-row basis, reading a user with 1 friend versus a user with 10,000 friends is virtually the same. The biggest difference is just in shipping this information across the network, which is unavoidable. In this system a user could have 10,000,000 friends. In a relational database the size of the friendship table would grow massively and the indexes would be out of control. A minimal sketch of this column-family design follows.
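    A small sketch (not from the original discussion) of maintaining such a friendships column family with the Java client API; the "users" table name and the value stored per friend are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FriendshipSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable users = new HTable(conf, "users");

            // Add a friendship: column qualifier = friend's user id, value = optional metadata
            Put put = new Put(Bytes.toBytes("user123"));
            put.add(Bytes.toBytes("friendships"), Bytes.toBytes("user456"), Bytes.toBytes("since 2013"));
            users.put(put);

            // Read every friendship for one user with a single-row Get
            Get get = new Get(Bytes.toBytes("user123"));
            get.addFamily(Bytes.toBytes("friendships"));
            Result result = users.get(get);
            for (KeyValue kv : result.raw()) {
                System.out.println("friend: " + Bytes.toString(kv.getQualifier()));
            }
            users.close();
        }
    }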
    Can you please provide an example of good de-normalization in HBase and how it is kept consistent (in your friends example in a relational DB, there would be a cascading delete)? As I think of the users table: if I delete a user with userid=123, do I have to walk through the friends column family of all the other users to guarantee consistency? Is de-normalization in HBase only used to avoid joins? Our webapp doesn't use joins at the moment anyway.

    You lose any concept of foreign keys. You have a primary key, and that's it. No secondary keys/indexes, no foreign keys.
    It's the responsibility of your application to handle something like deleting a friend and cascading the delete to the friendships. Typical small web apps are far simpler to write using SQL; with HBase you become responsible for some of the things that were once handled for you.
    Another example of good de-normalization would be storing a user's favorite pages. Suppose we want to query this data in two ways: for a given user, all of his favorites; or, for a given favorite, all of the users who have it as a favorite. A relational database would probably have tables for users, favorites, and userfavorites. Each link would be stored in one row in the userfavorites table. We would have indexes on both userid and favoriteid and could thus query it in both ways described above. In HBase we'd probably put a column in both the users table and the favorites table; there would be no link table.
    That would be a very efficient query in both architectures, with relational performing much better on small data sets but less well on a large data set.
    Now ask for the favorites of 10 given users. That starts to get tricky in HBase and will undoubtedly suffer from random reading. The flexibility of SQL allows us to just ask the database for the answer to that question. On a small data set it will come up with a decent plan and return the results in a matter of milliseconds. Now let's make that userfavorites table a few billion rows, and the number of users you're asking for a couple of thousand. The query planner will come up with something, but things will fall down and it will end up taking forever. The worst problem will be index bloat; insertions to this link table will start to take a very long time. HBase will perform virtually the same as it did on the small table, if not better, because of superior region distribution.
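    A minimal sketch of that favorites de-normalization, assuming "users" and "favorites" tables with mirrored column families (the names here are illustrative, not from the original answer):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FavoriteLinkSketch {
        // Write the link in both directions so each lookup is a single-row Get.
        // Keeping the two writes consistent is the application's job; HBase has no cross-table transactions.
        public static void addFavorite(Configuration conf, String userId, String favoriteId) throws Exception {
            HTable users = new HTable(conf, "users");
            HTable favorites = new HTable(conf, "favorites");

            // users row: one column per favorite page
            Put userPut = new Put(Bytes.toBytes(userId));
            userPut.add(Bytes.toBytes("favorites"), Bytes.toBytes(favoriteId), Bytes.toBytes(""));
            users.put(userPut);

            // favorites row: one column per user who favorited it
            Put favPut = new Put(Bytes.toBytes(favoriteId));
            favPut.add(Bytes.toBytes("users"), Bytes.toBytes(userId), Bytes.toBytes(""));
            favorites.put(favPut);

            users.close();
            favorites.close();
        }
    }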

    How would you design an HBase table for a many-to-many association between two entities, for example Student and Course?
    I would define two tables:
    Student: row key = student id; a data column family (name, address, ...); a courses column family (course ids as column qualifiers).
    Course: row key = course id; a data column family (name, syllabus, ...); a students column family (student ids as column qualifiers).
    Does it make sense?
    A [Jonathan Gray]: Your design does make sense.
    As you said, you'd probably have two column families in each of the Student and Course tables: one for the data, another with a column per course or student. For example, a student row might look like:
    Student: row key = 1001, data:name = Student Name, data:address = 123 ABC St, courses:2001 = (any extra information about this association, for example whether the student is on the waiting list), courses:2002 = ...
    This schema gives you fast access to both queries: all courses for a student (Student table, courses family) and all students for a course (Course table, students family).
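    As a rough illustration of writing such a student row with the Java client API (the family and qualifier names follow the example above; the code itself is a sketch, not part of the original answer):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StudentCourseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable students = new HTable(conf, "Student");

            // Row key = student id; the "data" family holds attributes,
            // the "courses" family holds one column per enrolled course.
            Put put = new Put(Bytes.toBytes("1001"));
            put.add(Bytes.toBytes("data"), Bytes.toBytes("name"), Bytes.toBytes("Student Name"));
            put.add(Bytes.toBytes("data"), Bytes.toBytes("address"), Bytes.toBytes("123 ABC St"));
            put.add(Bytes.toBytes("courses"), Bytes.toBytes("2001"), Bytes.toBytes("waiting list"));
            students.put(put);

            // The Course table would mirror this with a "students" column family.
            students.close();
        }
    }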

    What is the maximum recommended cell size?
    A rough rule of thumb, with little empirical validation, is to keep the data in HDFS and store pointers to the data in HBase if you expect the cell size to be consistently above 10 MB. If you do expect large cell values and you still plan to use HBase to store the cell contents, you'll want to increase the block size and the maximum region size for the table to keep the index size reasonable and the split frequency acceptable.
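    For reference, the block size and maximum region size can be set per table when it is created. The sketch below uses the older HBaseAdmin/HTableDescriptor API, with hypothetical table/family names and sizes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class LargeCellTableSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // "documents" table and "content" family are hypothetical names
            HColumnDescriptor family = new HColumnDescriptor("content");
            family.setBlocksize(1024 * 1024);               // 1 MB HFile blocks instead of the 64 KB default

            HTableDescriptor table = new HTableDescriptor("documents");
            table.addFamily(family);
            table.setMaxFileSize(20L * 1024 * 1024 * 1024); // let regions grow larger before splitting

            admin.createTable(table);
            admin.close();
        }
    }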

    Why cant I iterate through the rows of a table in reverse order?
    Because of the way HFile works: for efficiency, column values are put on disk with the length of the value written first and then the bytes of the actual value written second. To navigate through these values in reverse order, these length values would need to be stored twice (at the end as well) or in a side file. A robust secondary index implementation is the likely solution here to ensure the primary use case remains fast.

