Category Archives: Scala

Advanced Data Partitioning in Spark

Screen Shot 2015-05-30 at 4.26.30 PMIn this post we take a look at advance data partitioning Spark.  Partitioning is the process that distributes your data across the Spark worker nodes.  If you process large amounts of data it is important that you chose your partitioning strategy carefully because communication is expensive in a distributed program.

Partitioning only applies to Spark key-value pair RDDs, and although you cannot force a record with a specific key to go to a specific node you can make sure that records with the same key are processed together on the same node.  This is important because you may need to join one data set against another dataset, and if the same keys reside on the same node for both datasets Spark does not need to communicate across nodes.

An example is a program that analyzes click events for a set of users.  In your code, you might define a `UserData` and `events` RDDs, and join them as follows:

5397024801_3d64b81c58_zHowever, this process is inefficient because if you perform the join in a loop, say once for each of a set of log files, then Spark will hash all the keys of both datasets and send elements with the same keys of both datasets across the network to perform the join.

To prevent this performance issue you should partition the user dataset before you persist it and then use it as the first dataset in the join operation.  If you do so you can leave the join statement from above unchanged and ensure that only the event data is sent across the network to pair up with the user data that is already persisted across worker nodes:

By using this technique of partitioning a larger reference dataset before persisting it, and then using it as the first dataset in a join, you can reduce network traffic and improve performance of your Spark code.  For more detailed information see “Learning Spark”, Chapter 4.

Scala vs. Java

cdmOn the Scala vs. Java discussion, “Is LinkedIn getting rid of Scala?” is a good read.  It is a little sad that Scala seems to be losing momentum at LinkedIn.

My thoughts on Scala vs. Java:

OK, I won’t say that Scala adoption builds our organization’s polyglot hack foo because developers are calling it quits on polyglot programming.

spark-logoWhat if you are about to build a new Big Data Analytics platform that leverages Spark for real-time processing?  Should you use Scala or Java?

Writing an entire platform in Scala seems like a possibility but not a very practical one.  As a compromise, you might build transforms, Spark code, and operations on RDDs in Scala, and pick the language is most appropriate for other situations, probably mostly Java.  Also, where you use java, consider making Java 8 a standard from the start.

Scala Errors with JDK 1.8 in Intellij

I recently ran into problems with developing Scala in Intellij.

Screen Shot 2014-12-29 at 11.32.15 AMWhenever I tried to start a Scala console, evaluate a worksheet, or compile code, Intellij failed reporting an error relating to java.util.Comparator not taking type parameters:

Warning scalac ExtractAPI.scala:489 error java.util.Comparator does not take type parameters private[this] val sortClasses = new Comparator[Symbol] { ^…

Screen Shot 2014-12-29 at 11.28.47 AMTo resolve this issue, you need to configure Intellij to run Scala with JDK 1.7 instead of JDK 1.8.  Now, to accomplish this, It is not sufficient to set your Project SDK to Java 1.7, you will also have to remove Java 1.8 from your Platform Settings > SDKs. See the screen shots on the right.

You will then need to quit Intellij.  When you start Intellij back up it will tell you that I could not find Java 1.8.  Confirm the dialog to rebuild the Scala console and you will be able to use the Scala console in Intellij, evaluate worksheets, and compile Scala code.

Learning Scala

I was asked today, “let’s say I want to learn Scala.  What web references and/or books would you suggest?”  Here is what we found.

Screen Shot 2014-12-23 at 1.20.47 AM

Codecademy, Code School and Udacity: no Scala

I like Codecademy and Code School for learning languages, however, neither offers Scala.  Udacity also no Scala courses.

Coursera: not currently offered

Coursera has what looks to be a potentially awesome course but there are no courses currently available.  This course was apparently taught by Martin Odersky of École Polytechnique Fédérale de Lausanne who designed the Scala Programming language.  As of this writing, 12/23/2014, the last session ended a month ago, and no new sessions are listed.

Screen Shot 2014-12-23 at 1.15.23 AM

Scala’s Official Site: a starting point

While the docs on Scala’s web site have a tutorial, it may not be be ideal for learning Scala.  The first section after the introduction is “abstract types”, which creates something of a barrier to entry if you are familiar with other languages and want an easy in.  There is also a “hello world” example.  So it looks like Scala is not yet a mainstream language based on this evidence. Learning Scala is not as easy as learning Java, C-family languages, Python, JavaScript, or Ruby.

Screen Shot 2014-12-23 at 1.16.53 AM

Do it yourself

One solid approach to learning a new language is always to do some Project Euler problems.  You can solve some Euler problems as on-liners in Scala as Scala for Project Euler demonstrates.

Screen Shot 2014-12-23 at 1.05.19 AM

Another great site for exploring Scala, with problems and Scala solutions is S-99: Ninety-Nine Scala Problems.

Screen Shot 2014-12-23 at 1.01.05 AM

Twitter For The Win

I think Twitter’s Scala School may be the winner.  This site provides a step-by-step tutorial with commentary, problems, explanations, tips and solutions in Scala.

Screen Shot 2014-12-23 at 1.18.31 AM

Try this first

If you want to code “Hello World” in Scala to get started, you should check out the Blog post “From Zero to Hello World in Scala“.  In this article,  covers 

  • Simple Build Tools (SBT) – this is the tool that generates projects, dependencies, etc. for Scala
  • NetBeans with Scala Plugins – how to integrate Scala with NetBeans
  • Hello World – we’ll create our first Scala source file
  • Scalatest – the recommended unit testing framework for Scala

Let me know how these resources for learning Scala work for you.