| Feature                        | Hive              | Pig            |
|--------------------------------|-------------------|----------------|
| Language                       | SQL-like          | PigLatin       |
| Schemas/Types                  | Yes (explicit)    | Yes (implicit) |
| Partitions                     | Yes               | No             |
| Server                         | Optional (Thrift) | No             |
| User Defined Functions (UDF)   | Yes (Java)        | Yes (Java)     |
| Custom Serializer/Deserializer | Yes               | Yes            |
| DFS Direct Access              | Yes (implicit)    | Yes (explicit) |
| Join/Order/Sort                | Yes               | Yes            |
| Shell                          | Yes               | Yes            |
| Streaming                      | Yes               | Yes            |
| Web Interface                  | Yes               | No             |
| JDBC/ODBC                      | Yes (limited)     | No             |

Apache Pig and Hive are two projects that layer on top of Hadoop, and each
provides a higher-level language for using Hadoop's MapReduce library. Apache
Pig provides a scripting language for describing operations like reading,
filtering, transforming, joining, and writing data -- exactly the operations
that MapReduce was originally designed for. Rather than expressing these
operations in thousands of lines of Java code that uses MapReduce directly, Pig
lets users express them in a language not unlike a Bash or Perl script. Pig is
excellent for prototyping and rapidly developing MapReduce-based jobs, as
opposed to coding MapReduce jobs in Java itself.
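
To make the "scripting" style concrete, here is a rough sketch of a step-by-step dataflow of the kind Pig describes: read, filter, join, group, write. It is plain Python over small in-memory lists, not Pig itself; the data, the function name `clicks_per_adult`, and the file names in the comments are all made up for illustration. Each step carries a comment showing roughly what the matching Pig Latin statement could look like.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for files that would normally live in HDFS.
USERS = [(1, "ann", 34), (2, "bob", 17), (3, "cy", 52)]           # (id, name, age)
CLICKS = [(1, "/home"), (1, "/cart"), (2, "/home"), (3, "/home")]  # (user_id, url)


def clicks_per_adult(users, clicks):
    """Count clicks per adult user, one named step at a time."""
    # users  = LOAD 'users.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int);
    # adults = FILTER users BY age >= 18;
    adults = [u for u in users if u[2] >= 18]

    # clicks = LOAD 'clicks.csv' USING PigStorage(',') AS (user_id:int, url:chararray);
    # joined = JOIN adults BY id, clicks BY user_id;
    names_by_id = {uid: name for uid, name, _age in adults}
    joined = [(names_by_id[uid], url) for uid, url in clicks if uid in names_by_id]

    # grouped = GROUP joined BY name;
    # counts  = FOREACH grouped GENERATE group, COUNT(joined);
    counts = Counter(name for name, _url in joined)

    # STORE counts INTO 'clicks_per_adult';
    return sorted(counts.items())


if __name__ == "__main__":
    for name, n in clicks_per_adult(USERS, CLICKS):
        print(name, n)
```

The point of the sketch is the shape: every intermediate result gets a name and the author decides the order of the steps, which is what separates a dataflow script from a single declarative query.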

If Pig is "Scripting for Hadoop", then Hive is "SQL queries for Hadoop".
Apache Hive offers an even more specific, higher-level language for querying
data by running Hadoop jobs, rather than scripting the operation of several
MapReduce jobs on Hadoop step by step. The language is, by design, extremely
SQL-like. Hive is still intended as a tool for long-running, batch-oriented
queries over massive data; it's not "real-time" in any sense. Hive is an
excellent tool for analysts and business development types who are accustomed
to SQL-like queries and Business Intelligence systems; it will let them easily
leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate
report data from data stored in the storage systems mentioned above.
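
As a rough illustration of the "SQL queries" style, the sketch below states a whole computation (filter, join, group, sort) as one declarative query. It uses Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; it does not talk to Hive, and the tables, data, and function name are made up. A query of this shape is the kind of thing an analyst would hand to Hive, which would then plan and run the underlying Hadoop jobs.

```python
import sqlite3

# Hypothetical sample rows; on a real cluster these would be Hive tables over HDFS data.
USERS = [(1, "ann", 34), (2, "bob", 17), (3, "cy", 52)]           # (id, name, age)
CLICKS = [(1, "/home"), (1, "/cart"), (2, "/home"), (3, "/home")]  # (user_id, url)

# One declarative statement: no intermediate steps are named by the author.
QUERY = """
SELECT u.name, COUNT(*) AS clicks
FROM users u
JOIN clicks c ON c.user_id = u.id
WHERE u.age >= 18
GROUP BY u.name
ORDER BY u.name
"""


def clicks_per_adult_sql(users, clicks):
    """Run QUERY against an in-memory SQLite database (a stand-in, not Hive)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
        conn.execute("CREATE TABLE clicks (user_id INTEGER, url TEXT)")
        conn.executemany("INSERT INTO users VALUES (?, ?, ?)", users)
        conn.executemany("INSERT INTO clicks VALUES (?, ?)", clicks)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for name, n in clicks_per_adult_sql(USERS, CLICKS):
        print(name, n)
```

Unlike SQLite, which answers from memory almost instantly, Hive would turn such a query into batch jobs, so the familiar syntax should not be read as a promise of interactive response times.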