Security Features in Apache Kafka 0.9

Apache Kafka is widely used as a central platform for streaming data. Yet, prior to 0.9, Kafka had no built-in security features. The 0.9 release of Apache Kafka adds the following:

  1. Administrators can require client authentication using either Kerberos or Transport Layer Security (TLS) client certificates, so that Kafka brokers know who is making each request.

  2. A Unix-like permissions system can be used to control which users can access which data.

  3. Network communication can be encrypted, allowing messages to be securely sent across untrusted networks (see the configuration sketch after this list).

  4. Administrators can require authentication for communication between Kafka brokers and ZooKeeper.
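As an illustration of the TLS-related features above, here is a minimal broker and client configuration sketch along the lines of the Kafka 0.9 documentation; the keystore paths, passwords, ports, principal, and topic name are placeholders:

```properties
# broker: server.properties (sketch)
listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
ssl.truststore.password=changeit
ssl.client.auth=required          # require TLS client certificates

# client: producer.properties / consumer.properties (sketch)
security.protocol=SSL
ssl.truststore.location=/var/private/ssl/kafka.client.truststore.jks
ssl.truststore.password=changeit
ssl.keystore.location=/var/private/ssl/kafka.client.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit

# permissions (item 2) are managed with the kafka-acls.sh tool, e.g.:
# bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
#   --add --allow-principal User:Bob --operation Read --topic clicks
```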


For more details, see Apache Kafka Security 101.

Flink vs. Spark

Recently, the Flink project has been in the news as a relatively new big data stream processing framework in the Apache stack. Since our team at Intuit IDEA is implementing Spark Streaming for big data streaming, I was curious how Flink compares to Spark Streaming.

Similarities: Both Spark and Flink support big data processing in Scala, Java and Python.  For instance, here is the word count in Flink Scala:
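The following is a minimal sketch using Flink's Scala DataSet API; the inline sample strings are placeholders for a real input source.

```scala
import org.apache.flink.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Inline sample data; in practice you would use env.readTextFile(...)
    val text = env.fromElements("to be or not to be", "that is the question")

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(0)   // group by the word
      .sum(1)       // sum the counts

    counts.print()
  }
}
```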

For comparison, here is the word count problem in Spark Scala:
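And a comparable sketch with the Spark core RDD API, again using inline sample data in place of a real input:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Inline sample data; in practice you would use sc.textFile(...)
    val text = sc.parallelize(Seq("to be or not to be", "that is the question"))

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```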

Similarly, both Flink and Spark Streaming support the two principal modes of big data processing, batch and streaming. Like Spark, Flink shines with in-memory processing and query optimization. Both frameworks also provide for graceful degradation from in-memory to out-of-core algorithms for very large distributed datasets. Spark and Flink support machine learning, via Spark ML and FlinkML, respectively. Lastly, both Spark and Flink support graph processing, via Spark GraphX and Flink Gelly, respectively.

Differences: While Spark Streaming processes data streams as “micro-batches”, which are windows as small as 500 milliseconds, Flink has a true streaming API that can process individual data records. However, Flink does not yet support SQL access, as Spark SQL does, a feature Flink plans to add soon. Also, at the time of this writing, in July 2015, Spark appears to have considerably more industry commitment, via companies such as Databricks and IBM, which should be important for community involvement and for support of production implementations.

Coming Soon: One of Flink’s upcoming announcements is the graduation of Flink’s streaming from “beta” status. Other enhancements will expand the FlinkML machine learning library and the Gelly graph processing library, introducing new algorithms to both.

Volker Markl, one of Flink’s creators, shares more details in his interview with Roberto V. Zicari of ODBMS Industry Watch.

Advanced Data Partitioning in Spark

In this post we take a look at advanced data partitioning in Spark. Partitioning is the process that distributes your data across the Spark worker nodes. If you process large amounts of data, it is important that you choose your partitioning strategy carefully, because communication is expensive in a distributed program.

Partitioning only applies to Spark key-value pair RDDs, and although you cannot force a record with a specific key to go to a specific node, you can make sure that records with the same key are processed together on the same node. This is important because you may need to join one dataset against another, and if the same keys reside on the same node for both datasets, Spark does not need to communicate across nodes.

An example is a program that analyzes click events for a set of users. In your code, you might define `userData` and `events` RDDs, and join them as follows:
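Here is a sketch of what that could look like, assuming an existing SparkContext `sc`; the `UserInfo` and `LinkInfo` case classes and the file paths are hypothetical placeholders:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical value types for this example; keys are user IDs (String)
case class UserInfo(topics: Seq[String])
case class LinkInfo(url: String, topic: String)

// Large, relatively static table of user profiles (placeholder path)
val userData: RDD[(String, UserInfo)] =
  sc.objectFile[(String, UserInfo)]("userData").persist()

// Called periodically, once per new log file of click events
def processNewLogs(logFileName: String): Unit = {
  val events: RDD[(String, LinkInfo)] =
    sc.objectFile[(String, LinkInfo)](logFileName)

  // Join on the user ID key; as written, both RDDs are shuffled on every call
  val joined: RDD[(String, (UserInfo, LinkInfo))] = userData.join(events)
  println(joined.count())
}
```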

However, this approach is inefficient: if you perform the join in a loop, say once for each of a set of log files, then Spark will hash all the keys of both datasets and send elements with the same keys across the network to perform each join.

To prevent this performance issue, you should partition the user dataset before you persist it and then use it as the first dataset in the join operation. If you do so, you can leave the join statement from above unchanged and ensure that only the event data is sent across the network to pair up with the user data that is already persisted across worker nodes:
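Continuing the sketch above, the only change is to hash-partition the user data by key before persisting it (the partition count of 100 is arbitrary):

```scala
import org.apache.spark.HashPartitioner

// Partition the user data by key once, then persist the partitioned RDD
val userData: RDD[(String, UserInfo)] =
  sc.objectFile[(String, UserInfo)]("userData")
    .partitionBy(new HashPartitioner(100)) // 100 partitions is an arbitrary choice
    .persist()

// The join in processNewLogs stays exactly the same; now only `events` is
// shuffled, because Spark knows userData is already hash-partitioned.
```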

By using this technique of partitioning a larger reference dataset before persisting it, and then using it as the first dataset in a join, you can reduce network traffic and improve performance of your Spark code.  For more detailed information see “Learning Spark”, Chapter 4.

Scala vs. Java

On the Scala vs. Java discussion, “Is LinkedIn getting rid of Scala?” is a good read.  It is a little sad that Scala seems to be losing momentum at LinkedIn.

My thoughts on Scala vs. Java:

OK, I won’t say that Scala adoption builds our organization’s polyglot hack foo because developers are calling it quits on polyglot programming.

What if you are about to build a new Big Data Analytics platform that leverages Spark for real-time processing?  Should you use Scala or Java?

Writing an entire platform in Scala seems like a possibility, but not a very practical one.  As a compromise, you might build transforms, Spark code, and operations on RDDs in Scala, and pick the language that is most appropriate for other situations, probably mostly Java.  Also, where you use Java, consider making Java 8 a standard from the start.

Folder-Level Access to S3 with AWS IAM

I recently had a need to grant users access to S3 but to limit access to specific folders within an S3 bucket.  While S3 does not support this out of the box the way a traditional filesystem does, you can set up IAM policies to achieve the same effect.

Background: The background for my folder-level access requirement is that we are building a system that processes data, and users should be able to write new data into S3 that the system will then process.  However, I don’t want users to access system data; they should only be able to drop input data into a specific input folder.

S3 Concepts: As mentioned above, S3 is not a file system, and there are no paths, such as “/home/bob/my-file.text”.  In S3, you can create such a file, but the “path” will be part of the file name, and the slashes “/” in the file name won’t have special meaning.  Therefore, you must set up IAM policies in which you define the slash “/” to be a delimiter.

S3 Buckets: You may have heard that S3 organizes data into “buckets”, and you could just give different users access to different buckets.  While this is true, and easier to set up, this approach won’t scale because S3 only allows up to 100 buckets in each AWS account.

IAM Policies: In order to implement the folder-level access permissions, you will need to create policies for listing buckets, getting bucket locations, listing a specific bucket, and allowing all S3 actions in a specific folder.  With these policies you will be able to grant the required Amazon S3 console permissions, allow listing of objects in the user’s folder, and allow all Amazon S3 actions in that user’s folder (AWS’s example uses a user named David).

Policy Variables: By setting up fixed IAM policies, you can get specific users set up easily.  However, if you have many users, you won’t want to create the required set of policies for each user individually.  Instead, you will want to use “policy variables”.  That is, instead of referring to a specific user such as “David”, you refer to the “username” variable: ${aws:username}.
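Here is a sketch of what such a policy can look like with the username variable; the bucket name my-data-bucket and the home/ prefix are placeholders, and the statements mirror the ones described above:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRequiredS3ConsolePermissions",
      "Effect": "Allow",
      "Action": ["s3:ListAllMyBuckets", "s3:GetBucketLocation"],
      "Resource": ["arn:aws:s3:::*"]
    },
    {
      "Sid": "AllowListingOfUserFolder",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-data-bucket"],
      "Condition": {
        "StringLike": { "s3:prefix": ["home/${aws:username}/*"] }
      }
    },
    {
      "Sid": "AllowAllS3ActionsInUserFolder",
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": ["arn:aws:s3:::my-data-bucket/home/${aws:username}/*"]
    }
  ]
}
```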

If you have more questions about folder-level access to S3 with AWS IAM, be sure to visit Jim Scharf’s post on “Writing IAM policies: Grant access to user-specific folders in an Amazon S3 bucket”.

Complex Event Processing with Esper

Esper is a component for complex event processing (CEP) and event series analysis.

Complex Event Processing (CEP): Event processing is a method of tracking and analyzing streams of information about events and deriving conclusions. Complex event processing, or CEP, combines data from multiple sources to infer events or patterns. The goal of complex event processing is to identify meaningful events such as opportunities or threats and to respond quickly [Wikipedia].

Overview: Esper can process historical data, real-time events, high-velocity data, and high-variety data.  Esper has been described as highly scalable, memory-efficient, in-memory computing, SQL-standards-based, minimal latency, real-time streaming-capable, and designed for Big Data.  SQL streaming analytics is a commonly used term for the technology.

Domain Specific Language: Esper offers a Domain Specific Language (DSL) for processing events. The Event Processing Language (EPL) is a declarative language for dealing with high frequency time-based event data. The designers of EPL created the language to emulate and extend SQL.
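As a concrete illustration, here is a small sketch that registers an EPL statement through Esper's Java API from Scala (Esper 5.x-era API); the StockTick event type, the 30-second window, and the sample prices are made up for this example:

```scala
import com.espertech.esper.client._
import scala.beans.BeanProperty

// Hypothetical event type; @BeanProperty exposes JavaBean getters for Esper
case class StockTick(@BeanProperty symbol: String, @BeanProperty price: Double)

object EsperSketch {
  def main(args: Array[String]): Unit = {
    val config = new Configuration()
    config.addEventType("StockTick", classOf[StockTick])
    val epService = EPServiceProviderManager.getDefaultProvider(config)

    // EPL: average price per symbol over a sliding 30-second window
    val epl = "select symbol, avg(price) as avgPrice " +
              "from StockTick.win:time(30 sec) group by symbol"
    val statement = epService.getEPAdministrator.createEPL(epl)

    // Print each update pushed by the engine
    statement.addListener(new UpdateListener {
      override def update(newEvents: Array[EventBean], oldEvents: Array[EventBean]): Unit =
        Option(newEvents).getOrElse(Array.empty[EventBean]).foreach { e =>
          println(s"${e.get("symbol")} avg=${e.get("avgPrice")}")
        }
    })

    epService.getEPRuntime.sendEvent(StockTick("AAPL", 125.0))
    epService.getEPRuntime.sendEvent(StockTick("AAPL", 127.0))
  }
}
```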

Use Cases: Use cases include business process management and automation, process monitoring, business activity monitoring (BAM), reporting exceptions, operational intelligence, algorithmic trading, fraud detection, risk management, network and application monitoring, intrusion detection, SLA monitoring, sensor network applications, RFID reading, scheduling and control of fabrication lines, and air traffic control.

Data Windows, Indexes, and Atomic Operations: Data windows support managing fine-grained event expiry, event retention periods, and conditions for discarding events.  Esper supports explicit hash and btree indexes, update-insert-delete (also known as merge or upsert), and select-and-delete as atomic operations.

Tables, Patterns, Operations, Contexts, and Enumerations: Tables provide aggregation state.  Patterns support specifying complex time-based and correlation-based relationships. Available operations include grouping, aggregation, rollup, cubing, sorting, filtering, transforming, merging, splitting or duplicating of event series or streams. Context declarations allow controlling detection lifetime and concurrency. Enumeration methods execute lambda-expressions to analyze collections of values or events.

Scripting Support: Scripting integration is available for JavaScript, MVEL and other JSR 223 scripts.  This integration allows you to specify code as part of EPL queries.

Approximation Algorithms: Approximation Algorithms support summarizing data in streams.  For instance, the Count-min sketch (or CM sketch) is a probabilistic, sub-linear space, streaming algorithm that can approximate data stream frequency and top-k, without retaining distinct values in memory.
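To make the idea concrete (this is a generic illustration, not Esper's implementation), here is a tiny Count-Min sketch in Scala; the depth and width values are arbitrary, whereas real implementations derive them from target error bounds:

```scala
import scala.util.hashing.MurmurHash3

/** Minimal Count-Min sketch: approximate frequencies in sub-linear space.
  * Estimates never undercount; they may overcount due to hash collisions. */
class CountMinSketch(depth: Int, width: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, derived by varying the seed
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width
  }

  def add(item: String, count: Long = 1L): Unit =
    (0 until depth).foreach(row => table(row)(bucket(item, row)) += count)

  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}

object CountMinSketchExample extends App {
  val cms = new CountMinSketch(depth = 5, width = 1000)
  Seq("spark", "flink", "spark", "kafka", "spark").foreach(w => cms.add(w))
  println(cms.estimate("spark")) // ~3
}
```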

Event Representation and Inheritance: Events can be represented as Java objects, Map interface implementations, Object-arrays, or XML documents, and do not require transformation among these representations.  Esper supports event-type inheritance and polymorphism for all event types including for Map and object-array representations.  Event properties can be simple, indexed, mapped or nested.

For more information, see the Codehaus page on Esper.

How Red Hat Describes OpenShift

Recently, several of my friends recommended I look into OpenShift.  Here is how Red Hat describes OpenShift.

What is OpenShift: Red Hat, known by developers for its Linux, storage, and cloud offerings and its support for Java, PHP, Python, and Ruby, positions OpenShift as a Platform as a Service (PaaS) offering.  That is, OpenShift is a platform in the cloud where application developers can build, test, deploy, and run their applications.  To do so, OpenShift provides infrastructure, middleware, and management tools.

Usage: Using OpenShift involves these steps: 1. Create an “Application” in OpenShift; this can be done with the command line or via an IDE.  2. Code the application with your favorite text editor or IDE.  3. Finally, push the application code to OpenShift, with the command line or from your IDE.

Supported languages: OpenShift supports Node.js, Ruby, Python, PHP, Perl, and Java. You can also use any other language via the “cartridge” functionality, and integrations have been developed for languages such as Clojure and COBOL.  Supported frameworks include Spring, Rails, and Play.

Elastic scaling: OpenShift provides automatic and manual scaling and clustering.

Selling points: Red Hat stresses leadership, stability, responsiveness, performance, security, and survivability.  Specifically, Red Hat emphasizes multi-tenancy, fine-grained security, and control over compute and storage resources. If desired, SELinux allows OpenShift to “firewall” one user’s application from another.  Red Hat believes that its “multi-tenant in the OS” approach can scale resources more quickly than a “multi-tenant hypervisor” approach.

Sample code:
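A rough sketch of the usage steps described above, using the v2-era rhc command-line client; the application name myapp and the php-5.4 cartridge are placeholders:

```bash
# 1. Create an application (this also clones a git repository locally)
rhc app create myapp php-5.4

# 2. Code the application in your editor or IDE
cd myapp
# ... edit files ...

# 3. Push the code to OpenShift, which builds and deploys it
git add .
git commit -m "First change"
git push
```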

For more details, see Red Hat’s OpenShift page.