Spark on Cloudera

Apache Spark is a platform for processing big data in memory, with support for streaming.  In-memory processing can be much faster than the disk-based processing offered by traditional Hadoop installations.  Here’s what Cloudera has to say about Spark.

Use cases: Apache Spark supports batch, streaming, and interactive analytics on all your data, enabling historical reporting, interactive analysis, data mining, and real-time insights.

Support: Cloudera offers commercial support for Spark with Cloudera Enterprise.

Performance: Spark is 10-100x faster than MapReduce for iterative algorithms that are often used by analysts and data scientists.  Performance benefits materialize both in memory and on disk.

Language support: Spark supports Java, Scala, and Python.  It is not necessary to write “map” and “reduce” operators.

Integration: Spark is integrated with CDH, can read any data in HDFS, and can be deployed through Cloudera Manager.

Features: API for working with streams, exactly-once semantics, fault tolerance, common code for batch and streaming, joining streaming data to historical data.

Differences vs. Storm: Spark Streaming can recover lost work and deliver exactly-once semantics out of the box.
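
One common way to get exactly-once results from replayable input is to pair deterministic recomputation with an idempotent commit keyed by batch ID. The sketch below is plain Python, not Spark's API, and is not a description of how Spark Streaming is implemented. The names count_words, IdempotentSink, and run_stream are hypothetical and exist only for illustration. It also shows the "common code for batch and streaming" idea: the same count_words function serves both paths.

```python
from collections import Counter


def count_words(lines):
    """Shared logic: works on a full batch or on a single micro-batch."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


class IdempotentSink:
    """Hypothetical output store that applies each batch ID at most once."""

    def __init__(self):
        self.totals = Counter()
        self.committed = set()

    def commit(self, batch_id, counts):
        if batch_id in self.committed:
            return False  # replayed batch: already applied, so ignore it
        self.totals.update(counts)
        self.committed.add(batch_id)
        return True


def run_stream(batches, sink, fail_on=None):
    """Process (batch_id, lines) pairs; optionally crash after one commit."""
    for batch_id, lines in batches:
        sink.commit(batch_id, count_words(lines))
        if batch_id == fail_on:
            raise RuntimeError("simulated worker failure")


batches = [(0, ["a b"]), (1, ["b c"]), (2, ["c c"])]
sink = IdempotentSink()
try:
    run_stream(batches, sink, fail_on=1)
except RuntimeError:
    # Recovery: replay the input from the start; committed batches are skipped.
    run_stream(batches, sink)

streaming_totals = sink.totals
# The same function applied to all of the data at once (the "batch" path).
batch_totals = count_words(line for _, lines in batches for line in lines)
```

After the simulated failure and replay, streaming_totals matches batch_totals, because batches 0 and 1 are not counted a second time.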

For more details, see Cloudera’s discussion of Spark.


Cloudera Summarizes 2014

In his letter from Cloudera, CEO Tom Reilly made a few interesting points.

$900 million round of funding

Cloudera secured a $900 million round of funding earlier this year, one of the largest ever in enterprise software. The majority of the investment came from Intel. Tom Reilly calls out security encryption at the chip level as an outcome of the Intel relationship.  Cloudera now has over 800 team members.

Acquisition of Gazzang and DataPad

Gazzang reportedly enables the industry’s first and only fully secure and regulation-compliant Hadoop platform. DataPad has created a Python-based framework that simplifies data processing and analysis with Cloudera Enterprise.


The Cloudera partnerships listed are with Microsoft Azure, MongoDB, EMC, and Teradata.

Hadoop and Enterprise Data Warehousing (EDW)

Cloudera has made two webinars on EDW with Ralph Kimball available: Hadoop 101 for EDW Professionals and EDW 101 for Hadoop Professionals.

Apache Spark

Apache Spark made huge strides, says Tom Reilly, and is well on its way to becoming the successor to MapReduce.


Cloudera believes Impala has continued to be the Hadoop SQL engine of choice.