Category Archives: HowTo

Advanced Data Partitioning in Spark

Screen Shot 2015-05-30 at 4.26.30 PMIn this post we take a look at advance data partitioning Spark.  Partitioning is the process that distributes your data across the Spark worker nodes.  If you process large amounts of data it is important that you chose your partitioning strategy carefully because communication is expensive in a distributed program.

Partitioning only applies to Spark key-value pair RDDs, and although you cannot force a record with a specific key to go to a specific node you can make sure that records with the same key are processed together on the same node.  This is important because you may need to join one data set against another dataset, and if the same keys reside on the same node for both datasets Spark does not need to communicate across nodes.

An example is a program that analyzes click events for a set of users.  In your code, you might define a `UserData` and `events` RDDs, and join them as follows:

5397024801_3d64b81c58_zHowever, this process is inefficient because if you perform the join in a loop, say once for each of a set of log files, then Spark will hash all the keys of both datasets and send elements with the same keys of both datasets across the network to perform the join.

To prevent this performance issue you should partition the user dataset before you persist it and then use it as the first dataset in the join operation.  If you do so you can leave the join statement from above unchanged and ensure that only the event data is sent across the network to pair up with the user data that is already persisted across worker nodes:

By using this technique of partitioning a larger reference dataset before persisting it, and then using it as the first dataset in a join, you can reduce network traffic and improve performance of your Spark code.  For more detailed information see “Learning Spark”, Chapter 4.

Advertisements

How to fix PyCharm @staticmethod Error

When you run PyCharm with Python 2.7.2, you may see errors for @staticmethod annotations.  PyCharm marks your @staticmethod annotation as an error:

Screen Shot 2015-01-07 at 12.43.49 AM

When you mouse over, you see a failing the unresolved reference inspection:

Screen Shot 2015-01-07 at 12.44.10 AM

You can fix this issue by configuring Python 2.7.6 as the Project Interpreter:

Screen Shot 2015-01-07 at 12.40.57 AM

Screen Shot 2015-01-07 at 12.44.51 AM

To configure your Python Interpreter, select: PyCharm > Preferences… > Project: your_project > Project Interpreter.  You may have to install 2.7.6 or a newer version, for example, by clicking “+” in the OS X version of PyCharm.  Once you configure 2.7.6 PyCharm will update your indexes and, once that process completes, you will see a clean @staticmethod annotation:

Screen Shot 2015-01-07 at 12.45.14 AM

Scala Errors with JDK 1.8 in Intellij

I recently ran into problems with developing Scala in Intellij.

Screen Shot 2014-12-29 at 11.32.15 AMWhenever I tried to start a Scala console, evaluate a worksheet, or compile code, Intellij failed reporting an error relating to java.util.Comparator not taking type parameters:

Warning scalac ExtractAPI.scala:489 error java.util.Comparator does not take type parameters private[this] val sortClasses = new Comparator[Symbol] { ^…

Screen Shot 2014-12-29 at 11.28.47 AMTo resolve this issue, you need to configure Intellij to run Scala with JDK 1.7 instead of JDK 1.8.  Now, to accomplish this, It is not sufficient to set your Project SDK to Java 1.7, you will also have to remove Java 1.8 from your Platform Settings > SDKs. See the screen shots on the right.

You will then need to quit Intellij.  When you start Intellij back up it will tell you that I could not find Java 1.8.  Confirm the dialog to rebuild the Scala console and you will be able to use the Scala console in Intellij, evaluate worksheets, and compile Scala code.

Lexicographic Enumeration

Screen Shot 2014-12-23 at 1.12.04 AM

I recently implemented Dijkstra’s algorithm, published in 1976, for lexicographically enumerating permutations in Python for Project Euler’s problem 32:

“Pandigital products
Problem 32
We shall say that an n-digit number is pandigital if it makes use of all the digits 1 to n exactly once; for example, the 5-digit number, 15234, is 1 through 5 pandigital.

The product 7254 is unusual, as the identity, 39 × 186 = 7254, containing multiplicand, multiplier, and product is 1 through 9 pandigital.

Find the sum of all products whose multiplicand / multiplier / product identity can be written as a 1 through 9 pandigital.”

Here is the Python code that implements Dijkstra’s algorithm for enumerating the pandigital numbers:

Given this code we can then check which numbers are unusual, as defined above.

By the way, friend me on Project Euler if you are also a fan of this site.

My friend code is: 704085_afc937d56367db90098e60535d34393a

Learning Scala

I was asked today, “let’s say I want to learn Scala.  What web references and/or books would you suggest?”  Here is what we found.

Screen Shot 2014-12-23 at 1.20.47 AM

Codecademy, Code School and Udacity: no Scala

I like Codecademy and Code School for learning languages, however, neither offers Scala.  Udacity also no Scala courses.

Coursera: not currently offered

Coursera has what looks to be a potentially awesome course but there are no courses currently available.  This course was apparently taught by Martin Odersky of École Polytechnique Fédérale de Lausanne who designed the Scala Programming language.  As of this writing, 12/23/2014, the last session ended a month ago, and no new sessions are listed.

Screen Shot 2014-12-23 at 1.15.23 AM

Scala’s Official Site: a starting point

While the docs on Scala’s web site have a tutorial, it may not be be ideal for learning Scala.  The first section after the introduction is “abstract types”, which creates something of a barrier to entry if you are familiar with other languages and want an easy in.  There is also a “hello world” example.  So it looks like Scala is not yet a mainstream language based on this evidence. Learning Scala is not as easy as learning Java, C-family languages, Python, JavaScript, or Ruby.

Screen Shot 2014-12-23 at 1.16.53 AM

Do it yourself

One solid approach to learning a new language is always to do some Project Euler problems.  You can solve some Euler problems as on-liners in Scala as Scala for Project Euler demonstrates.

Screen Shot 2014-12-23 at 1.05.19 AM

Another great site for exploring Scala, with problems and Scala solutions is S-99: Ninety-Nine Scala Problems.

Screen Shot 2014-12-23 at 1.01.05 AM

Twitter For The Win

I think Twitter’s Scala School may be the winner.  This site provides a step-by-step tutorial with commentary, problems, explanations, tips and solutions in Scala.

Screen Shot 2014-12-23 at 1.18.31 AM

Try this first

If you want to code “Hello World” in Scala to get started, you should check out the Blog post “From Zero to Hello World in Scala“.  In this article,  covers 

  • Simple Build Tools (SBT) – this is the tool that generates projects, dependencies, etc. for Scala
  • NetBeans with Scala Plugins – how to integrate Scala with NetBeans
  • Hello World – we’ll create our first Scala source file
  • Scalatest – the recommended unit testing framework for Scala

Let me know how these resources for learning Scala work for you.

A/B Testing Tools

When you run A/B tests, you need to answer three key questions:

  1. How long do I need to run the test?
  2. Is the observed difference statistically significant?
  3. Which treatment wins?

I have used spreadsheet templates from Visual Website Optimizer for this in the past.  Check out their tools and spreadsheets.

Screen Shot 2014-12-19 at 5.16.14 PMYou can plug in numbers and determine whether there is a winner, which treatment wins, and also estimate the required duration to call a winner, based on traffic and traffic routing to A and B.  You can probably use the spreadsheet to extract the math if you want to implement these tests yourself.

If you prefer the formal treatment, here is the statistical aspect of how we calculate winners.