Recently, the Flink project has been in the news as a relatively new big data stream processing framework in the Apache stack. Since our team at Intuit IDEA is implementing Spark Streaming for big data streaming, I was curious how Flink compares to it.
Similarities: Both Spark and Flink support big data processing in Scala, Java and Python. For instance, here is the word count in Flink Scala:
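A minimal sketch of a batch word count in Flink Scala, assuming the DataSet API from the `org.apache.flink.api.scala` package as it looked around Flink 0.9. The object name and the inline sample sentences are illustrative, not taken from any particular project.

```scala
import org.apache.flink.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    // Entry point for Flink batch (DataSet) programs
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Illustrative in-memory input; a real job would read from a file or HDFS
    val text = env.fromElements(
      "to be or not to be",
      "that is the question")

    val counts = text
      .flatMap { _.toLowerCase.split("\\W+").filter(_.nonEmpty) } // split lines into words
      .map { (_, 1) }                                             // pair each word with a count of 1
      .groupBy(0)                                                 // group by the word (tuple field 0)
      .sum(1)                                                     // sum the counts (tuple field 1)

    // As of Flink 0.9, print() on a DataSet triggers execution itself
    counts.print()
  }
}
```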
For comparison, here is the word count problem in Spark Scala:
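A minimal sketch of the same word count in Spark Scala, assuming the RDD API of Spark 1.3 or later, where pair-RDD operations such as `reduceByKey` need no extra import. The object name, the `local[*]` master setting and the inline sample sentences are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process using all available cores
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Illustrative in-memory input; a real job would use sc.textFile(...)
    val text = sc.parallelize(Seq(
      "to be or not to be",
      "that is the question"))

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+")) // split lines into words
      .filter(_.nonEmpty)                   // drop empty tokens
      .map((_, 1))                          // pair each word with a count of 1
      .reduceByKey(_ + _)                   // sum the counts per word

    // collect() brings the (small) result back to the driver for printing
    counts.collect().foreach(println)

    sc.stop()
  }
}
```

The two programs have nearly the same shape: split, pair, aggregate by key. The main visible difference is `groupBy(0).sum(1)` in Flink versus `reduceByKey(_ + _)` in Spark.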
Similarly, both Flink and Spark support the two principal modes of big data processing, batch and streaming. Like Spark, Flink shines with in-memory processing and query optimization. Both frameworks also degrade gracefully from in-memory to out-of-core algorithms for very large distributed datasets. Spark and Flink support machine learning, via Spark MLlib and FlinkML, respectively. Lastly, both support graph processing, via Spark GraphX and Flink Gelly, respectively.
Differences: While Spark Streaming processes data streams as “micro-batches”, which are windows as small as 500 milliseconds, Flink has a true streaming API that can process individual data records. However, Flink does not yet offer SQL access as Spark does with Spark SQL, although the Flink project plans to add this feature soon. Also, at the time of this writing, in July 2015, Spark appears to have considerably more industry commitment, via companies such as Databricks and IBM, which should be important for community involvement and support of production implementations.
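To make the micro-batch model concrete, here is a minimal sketch of a Spark Streaming word count with a 500 millisecond batch interval, assuming Spark 1.3 or later. The socket source on `localhost:9999`, the object name and the `local[2]` master are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing
    val conf = new SparkConf().setAppName("MicroBatchWordCount").setMaster("local[2]")

    // The batch interval is fixed up front: incoming records are collected
    // into a small batch every 500 ms and each batch is processed as a unit
    val ssc = new StreamingContext(conf, Milliseconds(500))

    // Hypothetical text source, e.g. started with `nc -lk 9999`
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines
      .flatMap(_.toLowerCase.split("\\W+")) // split lines into words
      .filter(_.nonEmpty)                   // drop empty tokens
      .map((_, 1))                          // pair each word with a count of 1
      .reduceByKey(_ + _)                   // counts are per micro-batch, not global

    counts.print()

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the job is stopped
  }
}
```

Note that the batch interval is a required constructor argument here, so every result is produced per 500 ms batch. A record-at-a-time engine such as Flink's streaming API has no equivalent mandatory setting.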
Coming Soon: One of Flink’s upcoming announcements is the graduation of Flink’s streaming API from “beta” status. Other enhancements will expand the FlinkML machine learning library and the Gelly graph processing library, introducing new algorithms to both.