Apache Spark has been steadily gaining ground as a fast, general-purpose engine for large-scale data processing. The engine is written in Scala and is well suited to applications that reuse a working set of data across multiple parallel operations. It is designed to run as a standalone cluster or as part of a Hadoop YARN cluster, and it can read data from sources such as HDFS, Cassandra, and S3. Spark also offers many higher-level operators that ease the development of data-parallel applications. As a general data processing platform, it has enabled a family of higher-level tools such as interactive SQL (Spark SQL), real-time stream processing (Spark Streaming), a machine learning library (MLlib), and SparkR for running R on Spark.
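To make the point about higher-level operators concrete, here is a minimal word-count sketch in Scala using Spark's classic RDD API; the application name and the HDFS paths are hypothetical placeholders. Operators like flatMap, map, and reduceByKey replace the explicit mapper and reducer plumbing that classic MapReduce requires:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")  // hypothetical app name
    val sc = new SparkContext(conf)

    // Higher-level operators chain together into a single readable pipeline.
    val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical input path
      .flatMap(line => line.split("\\s+"))               // split lines into words
      .map(word => (word, 1))                            // pair each word with a count
      .reduceByKey(_ + _)                                // sum counts per word

    counts.saveAsTextFile("hdfs:///data/output")         // hypothetical output path
    sc.stop()
  }
}
```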
Because of its batch-oriented nature, Hadoop MapReduce does not work very well for iterative processing such as machine learning and interactive analysis: intermediate results are written back to disk between jobs. Spark extends the MapReduce model to iterative algorithms and interactive, low-latency data mining by keeping a working set of data in memory across operations, and it ships with MLlib, a machine learning library built on the same engine.
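The following sketch illustrates why that in-memory reuse matters for iterative workloads; it assumes a hypothetical HDFS path and a made-up thresholding loop purely for illustration. The dataset is loaded and cached once, and each subsequent pass operates on the cached copy rather than re-reading from disk as a chain of MapReduce jobs would:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch"))

    // Load once and cache in memory; every iteration below reuses this
    // working set instead of re-reading the input from disk.
    val points = sc.textFile("hdfs:///data/points.txt")  // hypothetical path
      .map(_.split(",").map(_.toDouble))
      .cache()

    var threshold = 10.0
    for (_ <- 1 to 10) {
      // Each pass is a parallel operation over the same cached data.
      val above = points.filter(p => p.sum > threshold).count()
      println(s"points above $threshold: $above")
      threshold *= 1.1
    }
    sc.stop()
  }
}
```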