How to fix PyCharm @staticmethod Error

When you run PyCharm with Python 2.7.2 as the interpreter, PyCharm may mark your @staticmethod decorator as an error:

Screen Shot 2015-01-07 at 12.43.49 AM

When you mouse over it, you see that it fails the unresolved reference inspection:

Screen Shot 2015-01-07 at 12.44.10 AM

You can fix this issue by configuring Python 2.7.6 as the Project Interpreter:

Screen Shot 2015-01-07 at 12.40.57 AM

Screen Shot 2015-01-07 at 12.44.51 AM

To configure your Python interpreter, select: PyCharm > Preferences… > Project: your_project > Project Interpreter.  You may have to install 2.7.6 or a newer version, for example, by clicking “+” in the OS X version of PyCharm.  Once you configure 2.7.6, PyCharm will update its indexes and, once that process completes, you will see a clean @staticmethod decorator:

Screen Shot 2015-01-07 at 12.45.14 AM
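
To check that the inspection is clean after re-indexing, a small class like the following can be pasted into a scratch file. This is only an illustrative sketch: the class name Temperature and its method are hypothetical and do not come from the project shown in the screenshots.

```python
class Temperature(object):
    """Hypothetical example class used only to exercise @staticmethod."""

    def __init__(self, celsius):
        self.celsius = celsius

    @staticmethod
    def to_fahrenheit(celsius):
        # A static method receives neither self nor cls.
        return celsius * 9.0 / 5.0 + 32.0


# A static method can be called on the class or on an instance.
boiling = Temperature.to_fahrenheit(100)
freezing = Temperature(0).to_fahrenheit(0)
```

If the interpreter is configured correctly, PyCharm should resolve @staticmethod here without a warning.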

The 25 Hour Clock

The 24-hour clock is the convention of time keeping in which the hours passed since midnight run from 0 to 23. It is popularly referred to as military time in the United States, Canada, and a handful of other countries where the 12-hour clock is still dominant.

I found this example of a railway timetable showing both 00:00 and 24:00 on Wikipedia.  This timetable has a pleasant retro style and the curious concept of what you might call a “25-hour clock”.  That is, the schedule uses the hours 1 through 24, plus zero.

Railtime (Source: Wikipedia)

Spark on Cloudera

Apache Spark is a platform for processing big data through streaming.  Streaming can be much faster than disk-based processing offered by traditional Hadoop installations.  Here’s what Cloudera has to say about Spark.

Use cases: Apache Spark supports batch, streaming, and interactive analytics on all your data, enabling historical reporting, interactive analysis, data mining, and real-time insights.

Support: Cloudera offers commercial support for Spark with Cloudera Enterprise.

Performance: Spark is 10-100x faster than MapReduce for iterative algorithms that are often used by analysts and data scientists.  Performance benefits materialize both in memory and on disk.

Language support: Spark supports Java, Scala, and Python.  It is not necessary to write “map” and “reduce” operators.

Integration: Spark is integrated with CDH, can read any data in HDFS, and can be deployed through Cloudera Manager.

Features: API for working with streams, exactly-once semantics, fault tolerance, common code for batch and streaming, joining streaming data to historical data.

Differences vs. Storm: Spark Streaming can recover lost work and deliver exactly-once semantics out of the box.

For more details, see Cloudera’s discussion of Spark.

The Chronos API

I recently blogged about Airbnb’s Chronos job scheduler.  Here I take a look at the Chronos API.

Launch command: Your system must run Mesos and Zookeeper.  Then, you launch chronos via java:

java -cp chronos.jar --master zk://127.0.0.1:2181/mesos --zk_hosts 127.0.0.1:2181

API Access: Chronos provides a RESTful JSON API over HTTP and listens on port 8080 for requests. For example, your Chronos leader may run at a URL such as chronos-node.airbnb.com:8080.

Leader node: Chronos can run on a cluster of multiple nodes, and these nodes automatically elect one node as the leader. Only the leader responds to API requests; requests to other nodes are automatically redirected to the leader.

Listing jobs: you can obtain a JSON-formatted list of jobs through curl. The response includes invocationCount (the number of times the job completed), executor (auto-determined by Chronos, but usually empty for non-async jobs), and parents (for dependent jobs, a list of jobs that must run before this job).  If there is a parents field there will be no schedule field, and vice versa:

curl -L -X GET chronos-node:8080/scheduler/jobs
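
Because a job carries either a schedule field or a parents field, the listing can be split into scheduled and dependent jobs. The sketch below assumes the response body is a JSON array of job objects that each have a name field; the function name split_jobs and the sample string are made up for illustration, and no request is actually sent.

```python
import json


def split_jobs(jobs_json):
    """Split a job listing into (scheduled, dependent) lists of job names."""
    scheduled, dependent = [], []
    for job in json.loads(jobs_json):
        # Assumption: dependent jobs carry "parents", all others carry "schedule".
        if "parents" in job:
            dependent.append(job["name"])
        else:
            scheduled.append(job["name"])
    return scheduled, dependent


# Hand-written sample standing in for the body returned by /scheduler/jobs.
sample = json.dumps([
    {"name": "SAMPLE_JOB1", "schedule": "R10/2012-10-01T05:52:00Z/PT2S"},
    {"name": "hostings_earnings_summary", "parents": ["db_export-airbed_hostings"]},
])
scheduled, dependent = split_jobs(sample)
```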

Deleting jobs: to delete job my_job use this request:

curl -L -X DELETE chronos-node:8080/scheduler/job/my_job

Deleting tasks: Deleting tasks for a job is useful if a job gets stuck. The job name corresponds to the information returned from the job listing request:

curl -L -X DELETE chronos-node:8080/scheduler/task/kill/my_job

Manual job start: You can manually start a job by issuing an HTTP request:

curl -L -X PUT chronos-node:8080/scheduler/job/my_job

Adding jobs: send a JSON hash with the fields name, command, and schedule (in ISO 8601 format).  The details of the JSON hash are explained next:

curl -L -H 'Content-Type: application/json' -X POST -d '{<json hash>}' chronos-node:8080/scheduler/iso8601

JSON hash: an example of a JSON hash is shown below, followed by a discussion of each component:

{
  "schedule": "R10/2012-10-01T05:52:00Z/PT2S",
  "name": "SAMPLE_JOB1",
  "epsilon": "PT15M",
  "command": "echo 'FOO' >> /tmp/JOB1_OUT",
  "owner": "bob@airbnb.com",
  "async": false
}

Job schedule: The schedule consists of 3 parts separated by ‘/’:

  1. number of times to repeat the job or ‘R’ to repeat forever
  2. start time of the job, such as “1997-07-16T19:20:30.45+01:00”; an empty start time means start immediately
  3. run interval, such as P1Y2M3DT4H5M6S, see examples below.

The run interval: the following examples illustrate how to specify run intervals:

  • P10M: 10 months
  • PT10M: 10 minutes
  • P1Y12M12D: 1 year plus 12 months plus 12 days
  • P12DT12M: 12 days plus 12 minutes
  • P1Y2M3DT4H5M6S: Period: 1 Year, 2 Months, 3 Days, Time: 4 Hours, 5 Minutes, 6 Seconds

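P is required. T separates the date part from the time part and is required whenever hours, minutes, or seconds are present; it is what distinguishes M for months (before T) from M for minutes (after T).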
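
To make the three-part schedule and the role of T concrete, here is a minimal parsing sketch. It is not how Chronos itself parses schedules; the names parse_duration and parse_schedule are hypothetical, and the pattern is deliberately simplified to whole-number components (no weeks, no fractional seconds).

```python
import re

# Date components come before "T", time components after it.
DURATION_RE = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Return the components present in a duration such as P1Y2M3DT4H5M6S."""
    match = DURATION_RE.match(text)
    # Reject a bare "P" and a dangling "T" with nothing after it.
    if match is None or text == "P" or text.endswith("T"):
        raise ValueError("not a supported duration: %r" % text)
    return {k: int(v) for k, v in match.groupdict().items() if v is not None}


def parse_schedule(schedule):
    """Split 'R<n>/<start>/<interval>' into (repeat count, start, interval)."""
    repeat, start, interval = schedule.split("/")
    if not repeat.startswith("R"):
        raise ValueError("repeat part must start with R: %r" % repeat)
    count = None if repeat == "R" else int(repeat[1:])  # None means forever
    return count, start, parse_duration(interval)
```

With this sketch, P10M yields months while PT10M yields minutes, which is exactly the distinction T exists to make.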

Available time zones: the scheduleTimeZone field holds the name of the time zone to use when scheduling the job.

Example time zone: to specify Pacific Standard Time use:

{ "schedule": "R/2014-10-10T18:32:00Z/PT60M", "scheduleTimeZone": "PST" }

Retry epsilon: If Chronos misses a scheduled run time for any reason, it will run the job later as long as the current time is within the specified epsilon interval. Epsilon must be formatted like an ISO 8601 Duration.

Job owner: the email address of the person responsible for the job.

Async: the async flag specifies whether the job will run in the background or in blocking mode in the foreground.

Add job example: with the hash constructed as described above, send the job schedule request to Chronos:

curl -L -H 'Content-Type: application/json' -X POST -d '{ "schedule": "R10/2012-10-01T05:52:00Z/PT2S",  "name": "SAMPLE_JOB1",  "epsilon": "PT15M",  "command": "echo '\''FOO'\'' >> /tmp/JOB1_OUT",  "owner": "bob@airbnb.com",  "async": false}' chronos-node:8080/scheduler/iso8601
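
Quoting a JSON hash inside a shell command is error-prone, so it can be easier to build the same request from a script. The following Python 3 sketch only constructs the request object; the helper name build_add_job_request is hypothetical, and the http:// scheme in the base URL is an assumption.

```python
import json
import urllib.request


def build_add_job_request(base_url, job):
    """Build (but do not send) the POST request that adds a scheduled job."""
    body = json.dumps(job).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/scheduler/iso8601",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


job = {
    "schedule": "R10/2012-10-01T05:52:00Z/PT2S",
    "name": "SAMPLE_JOB1",
    "epsilon": "PT15M",
    "command": "echo 'FOO' >> /tmp/JOB1_OUT",
    "owner": "bob@airbnb.com",
    "async": False,
}
request = build_add_job_request("http://chronos-node:8080", job)
# urllib.request.urlopen(request) would send it; that needs a live Chronos node.
```

Letting json.dumps serialize the hash keeps the single quotes in the command intact without any shell escaping.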

Adding dependent jobs: a dependent job takes the same JSON format as a scheduled job. However, instead of the schedule field, it accepts a parents field. The parents field lists other jobs which must run at least once before this job will run.

curl -L -X POST -H 'Content-Type: application/json' -d '{dependent hash}' chronos-node:8080/scheduler/dependency

Example dependency job hash: Here is a more elaborate example for a dependency job hash:

{
    "async": true,
    "command": "bash -x /srv/data-infra/jobs/hive_query.bash run_hive hostings-earnings-summary",
    "epsilon": "PT30M",
    "errorCount": 0,
    "lastError": "",
    "lastSuccess": "2013-03-15T13:02:14.243Z",
    "name": "hostings_earnings_summary",
    "owner": "bob@airbnb.com",
    "parents": [
        "db_export-airbed_hostings",
        "db_export-airbed_reservation2s"
    ],
    "retries": 2,
    "successCount": 100
}

Adding docker jobs: docker jobs take the same format as a scheduled job or a dependency job, with an additional container argument.  The container argument requires a type, an image, and optionally takes a network mode and volumes:

curl -L -H 'Content-Type: application/json' -X POST -d '{<json hash>}' chronos-node:8080/scheduler/iso8601

The <json hash> has the following format:

{
 "schedule": "R\/2014-09-25T17:22:00Z\/PT2M",
 "name": "my_docker_job",
 "container": {
  "type": "DOCKER",
  "image": "libmesos/ubuntu",
  "network": "BRIDGE"
 },
 "cpus": "0.5",
 "mem": "512",
 "uris": [],
 "command": "while sleep 10; do date -u +%T; done"
}

Dependency graph: Chronos has an endpoint for requesting the dependency graph in the form of a dot file:

curl -L -X GET chronos-node:8080/scheduler/graph/dot

Asynchronous jobs: long-running, synchronous jobs can tie up resources for an excessively long time.  To schedule jobs as asynchronous, set async: true and ensure your job reports its completion status to Chronos.  If your job does not report its completion status, Chronos will report your job as running irrespective of whether it completed or not.

Reporting completion: Reporting job completion to Chronos is accomplished via this API call:

curl -L -X PUT -H "Content-Type: application/json" -d '{"statusCode":0}' chronos-node:8080/scheduler/task/my_job_run_555_882083xkj302

The task id is auto-generated by Chronos. It will be available in your job’s environment as $mesos_task_id.  You need to URL-encode the Mesos task id so that it is not corrupted while your request is sent and processed.
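
As a rough illustration of the URL-encoding step, the Python 3 sketch below builds the completion URL from a task id. The function name completion_url and the http:// scheme are assumptions, and the second task id in the usage lines is invented purely to show characters that need encoding.

```python
from urllib.parse import quote


def completion_url(base_url, task_id):
    """Return the task-completion URL with the task id percent-encoded."""
    # safe="" also encodes "/" so the id stays a single path segment.
    return base_url + "/scheduler/task/" + quote(task_id, safe="")


plain = completion_url("http://chronos-node:8080", "my_job_run_555_882083xkj302")
# Invented id containing a colon and a space, to show the effect of encoding.
encoded = completion_url("http://chronos-node:8080", "ct:1420000000:0:my job")
```

An id made only of letters, digits, and underscores passes through unchanged, while reserved characters such as ":" and spaces are percent-encoded.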

Remote executables: There are two ways to specify commands: as the bash script url-runner.bash or as a URL.  To use the bash script, you need to deploy it to all slaves.  To use the URL, you need to compile Mesos with the cURL libraries.

Job configuration: The following table provides an overview of the job configuration fields:

Field | Description | Default
name | Name of job. |
command | Command to execute. |
arguments | Arguments to pass to the command. Ignored if shell is true. |
shell | If true, Mesos will execute command by running /bin/sh -c <command> and ignore arguments. If false, command will be treated as the filename of an executable and arguments will be the arguments passed. If this is a Docker job and shell is true, the entrypoint of the container will be overridden with /bin/sh -c. | true
epsilon | If, for any reason, a job can’t be started at the scheduled time, this is the window in which Chronos will attempt to run the job again. | PT60S or --task_epsilon
executor | Mesos executor. By default Chronos uses the Mesos command executor. |
executorFlags | Flags to pass to Mesos executor. |
retries | Number of retries to attempt if a command returns a non-zero status. | 2
owner | Email addresses to send job failure notifications. Use comma-separated list for multiple addresses. |
async | Execute using Async executor. | false
successCount | Number of successes since the job was last modified. |
errorCount | Number of errors since the job was last modified. |
lastSuccess | Date of last successful attempt. |
lastError | Date of last failed attempt. |
cpus | Amount of Mesos CPUs for this job. | 0.1 or --mesos_task_cpu
mem | Amount of Mesos Memory in MB for this job. | 128 or --mesos_task_mem
disk | Amount of Mesos disk in MB for this job. | 256 or --mesos_task_disk
disabled | If set to true, this job will not be run. | false
uris | An array of URIs which Mesos will download when the task is started. |
schedule | ISO8601 repeating schedule for this job. If specified, parents must not be specified. |
scheduleTimeZone | The time zone for the given schedule. |
parents | An array of parent jobs for a dependent job. If specified, schedule must not be specified. |
runAsUser | Mesos will run the job as this user, if specified. | --user
container | This contains the subfields for the container: type (req), image (req), network (optional) and volumes (optional). |
environmentVariables | An array of environment variables passed to the Mesos executor. For Docker containers, these are also passed to Docker using the -e flag. |

Sample job: here is a complete sample job configuration:

{
   "name":"camus_kafka2hdfs",
   "command":"/srv/data-infra/kafka/camus/kafka_hdfs_job.bash",
   "arguments": [
      "-verbose",
      "-debug"
   ],
   "shell":false,
   "epsilon":"PT30M",
   "executor":"",
   "executorFlags":"",
   "retries":2,
   "owner":"bofh@your-company.com",
   "async":false,
   "successCount":190,
   "errorCount":3,
   "lastSuccess":"2014-03-08T16:57:17.507Z",
   "lastError":"2014-03-01T00:10:15.957Z",
   "cpus":1.0,
   "disk":10240,
   "mem":1024,
   "disabled":false,
   "uris":[
   ],
   "schedule":"R/2014-03-08T20:00:00.000Z/PT2H",
   "environmentVariables": [
     {"name": "FOO", "value": "BAR"}
   ]
}
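
Before submitting a configuration like this, a few of the rules from the table can be checked locally. The sketch below is not part of Chronos: validate_job is a hypothetical helper, and treating name and command as required is an assumption based on the fields described for adding jobs.

```python
def validate_job(job):
    """Return a list of problems found in a job configuration dictionary."""
    problems = []
    # Assumption: every job needs a non-empty name and command.
    for field in ("name", "command"):
        if not job.get(field):
            problems.append("missing required field: " + field)
    # schedule and parents are mutually exclusive, and one must be present.
    if ("schedule" in job) == ("parents" in job):
        problems.append("specify exactly one of schedule or parents")
    # A container, if given, needs both type and image.
    container = job.get("container")
    if container is not None:
        for field in ("type", "image"):
            if field not in container:
                problems.append("container is missing: " + field)
    return problems


ok = validate_job({
    "name": "camus_kafka2hdfs",
    "command": "/srv/data-infra/kafka/camus/kafka_hdfs_job.bash",
    "schedule": "R/2014-03-08T20:00:00.000Z/PT2H",
})
bad = validate_job({"name": "broken", "schedule": "R//PT2H", "parents": ["a"]})
```

Here ok is an empty list, while bad reports the missing command and the conflict between schedule and parents.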

Job Management: for large installations it is impractical to manage jobs via the web UI. Instead, you can manage your job configurations in a git repository, make edits, and use it to configure Chronos.  You can use a script called chronos-sync.rb. You can also use a Chronos job to periodically check out your configuration and run chronos-sync.rb. 

Synchronizing jobs: there are 2 steps to loading your configuration.  First, initialize configuration data:

$ bin/chronos-sync.rb -u http://chronos/ -p /path/to/jobs/config -c

Then, synchronize jobs:

$ bin/chronos-sync.rb -u http://chronos/ -p /path/to/jobs/config

You can also force updating the configuration from disk by passing the -f or --force parameter. Here, configuration data is placed in /path/to/jobs/config. Running chronos-sync.rb will not delete jobs.

For more details, see the Airbnb Chronos Github page.

The airbnb/chronos scheduler

Chronos at Airbnb: At Airbnb, Chronos functions in an environment that includes AWS EMR, MySQL, Amazon Redshift, S3, Cascading, Cascalog, Hive, and Pig.  Challenges in this environment include variance in network latency, unpredictable I/O performance, and spurious web service timeouts.  These challenges prompted Airbnb to look for a lightweight scheduling solution that allows retries and provides high availability and an easy-to-use GUI.  In addition, Airbnb wanted the ability to schedule non-Hadoop jobs, such as bash scripts, and distribute work across multiple systems.  Thus Airbnb decided to build Chronos and to leverage Mesos, which provides the required primitives for storing state, distributing work, and adding new workers on the fly.

Chronos UI: The Chronos UI supports adding, deleting, listing, modifying and running jobs. It can show graphs of job dependencies.

Sample-chronos-ui

What people are asking: Alerting and notification are not well documented, but you can specify email addresses to receive job failure notifications; use a comma-separated list for multiple addresses.  It is not obvious whether you can integrate Chronos with other business applications, for example, Zenoss, JIRA, Logstash, etc.  Is it adaptable to multi-timezone calendars?  Does it have the ability to create incident reports?  What is the round-trip time between submitting a request and receiving a response?  Does it have reporting features? Does it have the ability to fail over and load balance? Are all features available via both the command line and the GUI?

smalin’s Graphical Score

Screen Shot 2015-01-02 at 2.26.18 AM

Smalin’s YouTube channel offers music with a graphical score, such as the sample seen on the right.

Making of the graphical score: Made the arrangements to use licensed audio, found a MIDI file, imported it into the notation program Sibelius, compared to a printed copy, fixed errors, compared to score, modified score to match orchestra, exported to MIDI, ran custom frame-rendering software, made “reduction” of score, colored it to taste, decided not to use the notation, put rendered frames, audio, and titles in Adobe Premiere, exported movie to QuickTime, converted with On2 Flix, generated Flash to prevent YouTube’s conversion changes, uploaded.