Tuesday, March 10, 2015

Hadoop vs. Spark


With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to MapReduce solutions.
The main use cases for Spark are iterative machine learning algorithms and interactive analytics.
The primary reason to use Spark is speed, and this comes from the fact that its execution engine can keep data in memory between stages rather than always persisting back to HDFS after each Map or Reduce. This advantage is very pronounced for iterative computations, which have tens of stages, each of which touches the same data. This is where things might be "100x" faster. For simple, one-pass ETL-like jobs, which are what MapReduce was designed for, Spark is not in general faster.
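To make the iterative case concrete, here is a minimal sketch in Spark's Scala RDD API. The input path, the toy center-refinement loop, and the iteration count are all hypothetical; the point is that the cached RDD is loaded from HDFS once and then re-read from memory on every pass.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch"))

        // Load once and cache in memory: every later pass reads from RAM,
        // where a chain of MapReduce jobs would round-trip through HDFS.
        val nums = sc.textFile("hdfs:///data/numbers.txt") // hypothetical path
          .map(_.trim.toDouble)
          .cache()

        // Toy refinement loop: re-estimate a center from the points that
        // lie within a shrinking radius of the previous estimate.
        var center = nums.mean()
        for (i <- 1 to 10) {
          val radius = 100.0 / i
          center = nums.filter(n => math.abs(n - center) <= radius).mean()
        }
        println(s"converged center: $center")
        sc.stop()
      }
    }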
Some points to consider:
  • Hadoop MapReduce persists to disk at every step, writing shuffle output to local disk and results back to HDFS between jobs, whereas Spark does all it can in RAM and only spills to disk when needed. This is the main reason Spark can be much faster.
  • Hadoop's MapReduce API is pretty burdensome and easy to get wrong, whereas Spark has a natural functional API that's comparatively easy to understand (see the word-count sketch after this list). Spark is also written in Scala and supports Scala natively, which is a far better language than Java for expressing the kinds of transformations it supports.
  • Spark supports any Hadoop InputFormat/OutputFormat, so you can leverage existing Hadoop connectors to get at your data (see the second sketch after this list).
  • Hadoop is much more mature, and there are more tools written on top of it. The Spark ecosystem is evolving rapidly (check out Databricks), but there's definitely more built on Hadoop than on Spark.
  • All the major Hadoop distributions are now on the Spark train.
  • Spark is infinitely easier to configure and run than vanilla Hadoop, although the various distributions make Hadoop setup simpler too.
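To illustrate the API point, here is the canonical word count in Spark's Scala API; the equivalent classic MapReduce job needs a Mapper class, a Reducer class, and driver boilerplate. The paths are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("word-count"))
        sc.textFile("hdfs:///input/docs")          // hypothetical input path
          .flatMap(_.split("""\s+"""))             // split lines into words
          .map(word => (word, 1))                  // pair each word with a 1
          .reduceByKey(_ + _)                      // sum the counts per word
          .saveAsTextFile("hdfs:///output/counts") // hypothetical output path
        sc.stop()
      }
    }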
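And to illustrate the InputFormat point, here is a sketch that reads data through a stock Hadoop InputFormat (the new-API TextInputFormat); any existing connector that exposes an InputFormat, such as the ones for HBase or sequence files, plugs in the same way. The path is hypothetical:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object InputFormatSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("inputformat-sketch"))

        // Read (key, value) records through a Hadoop InputFormat; for
        // TextInputFormat the keys are byte offsets and the values are lines.
        val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
          "hdfs:///data/logs") // hypothetical path

        records.map { case (_, line) => line.toString }
          .take(5)
          .foreach(println)
        sc.stop()
      }
    }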
In short, if you're interested in fast, in-memory computation using the latest technology that's likely to ultimately replace Hadoop, choose Spark. If stability and a mature ecosystem are more important, go with Hadoop. Or use both, as they can co-exist happily.