Spark on Cloudera

Apache Spark is a platform for processing big data in memory, including streaming data.  In-memory processing can be much faster than the disk-based MapReduce processing offered by traditional Hadoop installations.  Here’s what Cloudera has to say about Spark.

Use cases: Apache Spark supports batch, streaming, and interactive analytics on all your data, enabling historical reporting, interactive analysis, data mining, and real-time insights.

Support: Cloudera offers commercial support for Spark with Cloudera Enterprise.

Performance: Spark is 10-100x faster than MapReduce for iterative algorithms that are often used by analysts and data scientists.  Performance benefits materialize both in memory and on disk.

Language support: Spark supports Java, Scala, and Python.  It is not necessary to write “map” and “reduce” operators.

Integration: Spark is integrated with CDH, can read any data in HDFS, and can be deployed through Cloudera Manager.

Features: API for working with streams, exactly-once semantics, fault tolerance, common code for batch and streaming, and joining streaming data to historical data.

Differences vs. Storm: Spark Streaming can recover lost work and deliver exactly-once semantics out of the box.
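To make "recover lost work" and "exactly-once" more concrete, here is a small conceptual sketch in plain Python. It is not Spark's API or its actual implementation; the names Sink, count_words, and run are hypothetical, and the sketch assumes a replayable input log and an output store that records its progress together with its results.

```python
from dataclasses import dataclass, field


@dataclass
class Sink:
    """Hypothetical output store that commits results and progress together."""
    totals: dict = field(default_factory=dict)
    next_offset: int = 0  # offset of the first record not yet committed

    def commit(self, start, end, counts):
        """Apply the batch covering log[start:end] unless it was already committed."""
        if start < self.next_offset:
            return False  # replayed batch: skip it so nothing is counted twice
        for word, n in counts.items():
            self.totals[word] = self.totals.get(word, 0) + n
        self.next_offset = end
        return True


def count_words(records):
    """Per-batch computation; the same function would work on a full batch job."""
    counts = {}
    for word in records:
        counts[word] = counts.get(word, 0) + 1
    return counts


def run(log, sink, batch_size=2, crash_after=None):
    """Process the log in micro-batches, starting at the sink's committed offset.

    crash_after simulates a worker failure once that many batches have been
    committed during this call.
    """
    done = 0
    start = sink.next_offset
    while start < len(log):
        if crash_after is not None and done == crash_after:
            raise RuntimeError("simulated worker failure")
        end = min(start + batch_size, len(log))
        sink.commit(start, end, count_words(log[start:end]))
        start = end
        done += 1


log = ["a", "b", "a", "c", "b", "a"]
sink = Sink()
try:
    run(log, sink, crash_after=1)  # fails after the first batch is committed
except RuntimeError:
    pass
run(log, sink)  # the restart resumes from sink.next_offset
replayed = sink.commit(0, 2, count_words(log[0:2]))  # a late duplicate is ignored
print(sink.totals, replayed)
```

After the simulated failure and restart, each record has been counted once, and the replayed first batch is rejected. The design choice worth noting is that the offset and the totals change in the same commit step, which is what lets a restart pick up where it left off without double counting.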

For more details, see Cloudera’s discussion of Spark.
