Friday, June 28, 2013

Why HBase and Why Hive

 
HBase is not a replacement for Hadoop, and Hive is not a replacement for HBase; they serve totally different purposes. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Or you can write sequential programs using the HBase client API, in Java for example, to put or fetch the data. But we use Hadoop, HBase etc. to deal with gigantic amounts of data, so sequential access doesn't make much sense there. Normal sequential programs would be highly inefficient when your data is too huge.
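To make that concrete, here is a minimal sketch of the sequential approach with the HBase Java client (0.94-era API, which was current when this was written). The table name "users", the column family "info" and the row contents are just made-up examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSequentialExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users"); // hypothetical table

        // Put a single row: rowkey "row1", column info:name
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("John"));
        table.put(put);

        // Fetch the same row back
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println("name = " + Bytes.toString(value));

        table.close();
    }
}

This is fine for a handful of rows; for bulk loads you would reach for MapReduce instead.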
 
From the question's point of view, Hadoop has 2 main components:
 
  1. HDFS - a distributed file system
  2. MapReduce - a processing framework
Like any other file system, HDFS provides us storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable big data store, modelled after Google's BigTable. It stores data as key/value pairs.
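A short sketch of the HDFS side of that contrast, using the FileSystem API (the path /tmp/demo.txt is just an example): files are write-once streams, so you can seek while reading but you cannot modify bytes in place the way HBase lets you update a cell.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/demo.txt"); // example path

        // Write: HDFS files are written as append-only streams
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("hello hdfs");
        out.close();

        // Read: seeking is allowed, but there is no way to
        // overwrite a byte in the middle -- no random writes
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
        fs.close();
    }
}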
 
Now Hive. It provides us data warehousing facilities on top of an existing Hadoop cluster. Along with that it provides an SQL-like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. You can even map your existing HBase tables to Hive and operate on them.
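For illustration, a sketch of that HBase-to-Hive mapping through the HiveServer2 JDBC driver. It assumes a HiveServer2 running at localhost:10000, and reuses the hypothetical "users" table and info:name column from the HBase example above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveOverHBaseExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con =
            DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Map the existing HBase table "users" into Hive
        stmt.execute("CREATE EXTERNAL TABLE hbase_users(key string, name string) "
            + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
            + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name') "
            + "TBLPROPERTIES ('hbase.table.name' = 'users')");

        // Query it with plain SQL
        ResultSet rs = stmt.executeQuery("SELECT key, name FROM hbase_users");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        con.close();
    }
}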
 
Pig, on the other hand, is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig basically has 2 parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them using the Pig interpreter. Pig makes our life a lot easier; writing MapReduce by hand is not always easy, and in fact in some cases it can really become a pain.
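To give a taste, here is a tiny sketch that runs a Pig Latin word count through the embedded PigServer API; the input and output paths are made up. Compare these three lines of Pig Latin with the hand-written MapReduce job at the end of this post:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        // Embedded Pig interpreter, running jobs as MapReduce
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Word count in Pig Latin
        pig.registerQuery("lines = LOAD '/tmp/input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grpd = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grpd GENERATE group, COUNT(words);");
        pig.store("counts", "/tmp/wordcount_out");
    }
}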
 
HBase's internals allow fast reads and writes, which is crucial for real-time data handling, whereas Hadoop with MapReduce can be used to process large amounts of data in batch.
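And to show what "writing MapReduce" actually looks like on that batch side, here is a bare-bones word count sketch using the org.apache.hadoop.mapreduce API, the same job the Pig script above expresses in a few lines:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE); // emit (word, 1) for every token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}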

Thursday, June 6, 2013

Hive UDF UDAF UDTF

https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-12/ch12-section-08

https://cwiki.apache.org/Hive/genericudafcasestudy.html

http://dev.bizo.com/2010/07/extending-hive-with-custom-udtfs.html

Tuesday, June 4, 2013

Good Blogs on Hadoop

http://blog.guident.com/2013/02/import-noaa-data-into-apache-hbase/

http://souravgulati.webs.com/apps/forums/show/14108248-bigdata-learnings-hadoop-hbase-hive-and-other-bigdata-technologies-

http://www.coreservlets.com/hadoop-tutorial/

http://kickstarthadoop.blogspot.com/2011/09/joins-with-plain-map-reduce.html

http://kickstarthadoop.blogspot.in/2011/05/hadoop-for-dependent-data-splits-using.html


http://www.slideshare.net/cloudera/top-ten-tips-tricks-for-hadoop-success-r9

http://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/


http://snehalatastechnotes.wordpress.com/2013/02/07/complex-data-types-in-hive/


Also, here are a few links:




http://anilmeesala.wordpress.com/2013/05/19/in-this-post-i-will-show-how-to/   (see how it is written with a different logic: partitioner)



THIS IS A LINK FOR PIG: (we will be doing something similar; you can read this after I tell you about Pig)




http://www.rohitmenon.com/index.php/cloudera-certified-hadoop-developer-ccd-410/