Category Archives: Spark

Flink vs. Spark

Recently, the Flink project has been in the news as a relatively new big data stream processing framework in the Apache stack.  Since our team at Intuit IDEA is implementing Spark Streaming for big data streaming I was curious how Flink compares to Spark Streaming.

Similarities: Both Spark and Flink support big data processing in Scala, Java and Python.  For instance, here is the word count in Flink Scala:

For comparison, here is the word count problem in Spark Scala:

FlinkSimilarly, both Flink and Spark Streaming support processing in either of the two principal modes of big data processing, batch and streaming mode.  Like Spark, Flink shines with in-memory processing and query optimization.  Also, both frameworks also provide for graceful degradation from in-memory to out-of-core algorithms for very large distributed datasets.  Spark and Flink support machine learning, via Spark ML and FlinkML, respectively.  Lastly, both Spark and Flink support graph processing, via Spark GraphX and Flink Gelly, respectively.

spark-logoDifferences: While Spark Streaming processes data streams as “micro-batches“, which are windows as small as 500 milliseconds, Flink has a true streaming API that can process individual data records. However, Flink does not yet support SQL access, as Spark SQL does, which is a feature Flink plans to add soon. Also, at the time of this writing, in July 2015, Spark appears to have considerably more industry commitment, via companies such as Databricks and IBM, which should be important for community involvement and support of production implementations.

Coming Soon: one of Flink’s upcoming announcements is the graduation of Flink’s streaming from “beta” status. Other enhancements will expand the FlinkML Machine Learning library, as well as the Gelly graph processing library and introduce new algorithms to both libraries.

Volker Markl, one of Flink’s creators, shares more details in his interview with Roberto V. Zicari of ODBMS Industry Watch.


Advanced Data Partitioning in Spark

Screen Shot 2015-05-30 at 4.26.30 PMIn this post we take a look at advance data partitioning Spark.  Partitioning is the process that distributes your data across the Spark worker nodes.  If you process large amounts of data it is important that you chose your partitioning strategy carefully because communication is expensive in a distributed program.

Partitioning only applies to Spark key-value pair RDDs, and although you cannot force a record with a specific key to go to a specific node you can make sure that records with the same key are processed together on the same node.  This is important because you may need to join one data set against another dataset, and if the same keys reside on the same node for both datasets Spark does not need to communicate across nodes.

An example is a program that analyzes click events for a set of users.  In your code, you might define a `UserData` and `events` RDDs, and join them as follows:

5397024801_3d64b81c58_zHowever, this process is inefficient because if you perform the join in a loop, say once for each of a set of log files, then Spark will hash all the keys of both datasets and send elements with the same keys of both datasets across the network to perform the join.

To prevent this performance issue you should partition the user dataset before you persist it and then use it as the first dataset in the join operation.  If you do so you can leave the join statement from above unchanged and ensure that only the event data is sent across the network to pair up with the user data that is already persisted across worker nodes:

By using this technique of partitioning a larger reference dataset before persisting it, and then using it as the first dataset in a join, you can reduce network traffic and improve performance of your Spark code.  For more detailed information see “Learning Spark”, Chapter 4.

Scala vs. Java

cdmOn the Scala vs. Java discussion, “Is LinkedIn getting rid of Scala?” is a good read.  It is a little sad that Scala seems to be losing momentum at LinkedIn.

My thoughts on Scala vs. Java:

OK, I won’t say that Scala adoption builds our organization’s polyglot hack foo because developers are calling it quits on polyglot programming.

spark-logoWhat if you are about to build a new Big Data Analytics platform that leverages Spark for real-time processing?  Should you use Scala or Java?

Writing an entire platform in Scala seems like a possibility but not a very practical one.  As a compromise, you might build transforms, Spark code, and operations on RDDs in Scala, and pick the language is most appropriate for other situations, probably mostly Java.  Also, where you use java, consider making Java 8 a standard from the start.

Spark on Cloudera

Apache Spark is a platform for processing big data through streaming.  Streaming can be much faster than disk-based processing offered by traditional Hadoop installations.  Here’s what Cloudera has to say about Spark.

Use cases: Apache Spark supports batch, streaming, and interactive analytics on all your data, enabling historical reporting, interactive analysis, data mining, real-time insights.

Support: Cloudera offers commercial support for Spark with Cloudera Enterprise.

Performance: Spark is 10-100x faster than MapReduce analysts for iterative algorithms that are often used by analysts and data scientists.  Performance benefits materialize both in memory and on disk.

Language support: Spark supports Java, Scala, and Python.  It is not necessary to write “map” and “reduce” operators.

Integration: Spark is integrated with CDH and can read any data in HDFS and deployed through Cloudera Manager.

Features: API for working with streams, exactly-once semantics, fault tolerance, common code for batch and streaming, joining streaming data to historical data.

Differences vs. Storm: Spark Streaming can recover lost work and deliver exactly-once semantics out of the box.

For more details, see Cloudera’s discussion of Spark.

Cloudera Summarizes 2014

Screen Shot 2014-12-23 at 8.04.48 AM

In his letter from Cloudera, CEO Tom Reilly made a few interesting points.

$900 million round of funding

Cloudera secured a $900 million round of funding earlier this year, one of the largest ever in enterprise software. The majority of the investment came from Intel. Tom Reilly calls out security encryption at the chip level as an outcome of the Intel relationship.  Cloudera now has over 800 team members.

Acquisition of Gazzang and DataPad

Gazzang reportedly enables the industry’s first and only fully secure and regulation compliant Hadoop platform. DataPad has created a Python-based framework that simplifies data processing and analysis with Cloudera Enterprise.


The Cloudera partnerships listed are with Microsoft Azure, MongoDB, EMC, and Teradata.

Hadoop and Enterprise Data Warehousing (EDW)

Cloudera has made two webinars with Ralph Kimball on EDW available: Hadoop 101 for EDW Professionals
and EDW 101 for Hadoop Professionals.Screen Shot 2014-12-23 at 8.06.51 AM

Apache Spark

Apache Spark made huge strides, says Tom Reilly, and is well on its way to becoming the successor to MapReduce.


Cloudera believes Impala has continued to be the Hadoop SQL engine of choice.