Apache Spark is a platform for processing big data in memory, either in batches or as a stream. In-memory processing can be much faster than the disk-based MapReduce processing offered by traditional Hadoop installations. Here’s what Cloudera has to say about Spark.
Use cases: Apache Spark supports batch, streaming, and interactive analytics on all your data, enabling historical reporting, interactive analysis, data mining, and real-time insights.
Support: Cloudera offers commercial support for Spark with Cloudera Enterprise.
Performance: Spark is 10-100x faster than MapReduce for iterative algorithms that are often used by analysts and data scientists. Performance benefits materialize both in memory and on disk.
Language support: Spark supports Java, Scala, and Python. It is not necessary to write “map” and “reduce” operators.
Integration: Spark is integrated with CDH, can read any data in HDFS, and can be deployed through Cloudera Manager.
Streaming features: an API for working with streams, exactly-once semantics, fault tolerance, common code for batch and streaming, and joining streaming data to historical data.
Differences vs. Storm: Spark Streaming can recover lost work and deliver exactly-once semantics out of the box.
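To make "common code for batch and streaming" and "joining streaming data to historical data" more concrete, here is a minimal sketch in plain Python. It does not use Spark's API. It only imitates the micro-batch idea behind Spark Streaming, where a stream is cut into small batches and the same logic runs on each one. The region table, the event data, and the function names (`clicks_per_region`, `micro_batches`) are all hypothetical, made up for this illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (user_id, clicks); a hypothetical event shape

# Hypothetical historical data: user id -> region.
# It stands in for data at rest, such as a table stored in HDFS.
HISTORICAL_REGIONS: Dict[str, str] = {"u1": "EU", "u2": "US", "u3": "US"}


def clicks_per_region(events: Iterable[Event], regions: Dict[str, str]) -> Counter:
    """Join events to the region table and sum clicks per region.

    This one function is the shared logic for both batch and streaming mode.
    """
    totals: Counter = Counter()
    for user_id, clicks in events:
        region = regions.get(user_id, "unknown")  # join against historical data
        totals[region] += clicks
    return totals


def micro_batches(events: List[Event], size: int) -> Iterator[List[Event]]:
    """Cut an event sequence into fixed-size micro-batches.

    A fixed count is an assumption made for simplicity; a real streaming
    system would typically cut batches by time interval instead.
    """
    for start in range(0, len(events), size):
        yield events[start:start + size]


events: List[Event] = [("u1", 2), ("u2", 1), ("u3", 4), ("u1", 3), ("u9", 1)]

# Batch mode: process all events at once.
batch_result = clicks_per_region(events, HISTORICAL_REGIONS)

# Streaming mode: run the same function on each micro-batch and merge the counts.
stream_result: Counter = Counter()
for batch in micro_batches(events, size=2):
    stream_result.update(clicks_per_region(batch, HISTORICAL_REGIONS))

print(batch_result == stream_result)
```

Because both modes call the same function, the streaming totals match the batch totals once every micro-batch has been processed. The sketch leaves out fault tolerance and exactly-once delivery entirely; those are properties of the real engine that a few lines of Python cannot reproduce.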
For more details, see Cloudera’s discussion of Spark.