
Berkeley Astronomy’s Transients Classification Pipeline

• Project Overview

• The TCP is a time-series classification project that identifies flux-varying stellar sources in “real-time” data streams.

• Upon identification of scientifically interesting sources, the pipeline emits source information to robotic telescopes for immediate and automated follow-up.

• TCP sub-project for Yahoo! M45 cluster

• We want to implement our most computationally expensive classifier generation technique using Hadoop.

1. Start with an array of times at which an interesting astronomical source has been observed.

2. Re-sample the time-series of well-sampled, classified known sources to match the given time array.

3. Add noise to the re-sampled data that is characteristic of the observing telescope and of seasonal / local conditions.

4. Generate a time-series science classifier using this re-sampled data.

5. Classify the original interesting source using this classifier.
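Steps 1–3 can be sketched in a few lines of numpy. This is an illustrative sketch only: it assumes linear interpolation for the re-sampling and Gaussian noise for the noisification, and the function names, reference light curve, and noise level are all made up for the example, not taken from the TCP code.

```python
import numpy as np

def resample_to_times(known_times, known_flux, target_times):
    """Step 2: re-sample a well-sampled reference light curve onto the
    sparse time array of the source under study.  Linear interpolation
    is an assumption; the TCP may use a different scheme."""
    return np.interp(target_times, known_times, known_flux)

def noisify(flux, sigma, rng=None):
    """Step 3: add noise characteristic of the telescope and observing
    conditions.  Gaussian noise with a fixed sigma is an assumption."""
    rng = rng or np.random.default_rng(0)
    return flux + rng.normal(0.0, sigma, size=len(flux))

# Step 1: the sparse observation epochs of the interesting source
target_times = np.array([1.01, 1.15, 2.03, 3.72, 8.11, 8.25,
                         20.93, 21.03, 25.48])

# A dense, well-sampled reference source (synthetic sinusoid here,
# standing in for a classified known source from the archive)
ref_times = np.linspace(0.0, 30.0, 3000)
ref_flux = 10.0 + np.sin(2 * np.pi * ref_times / 5.0)

resampled = resample_to_times(ref_times, ref_flux, target_times)
noisy = noisify(resampled, sigma=0.05)
assert noisy.shape == target_times.shape
```

In the real pipeline this is repeated for many reference sources and several noise realizations per source, producing the training set used in steps 4–5.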

PI: Josh Bloom, Sw Eng: Dan Starr

Berkeley’s Transients Classification Pipeline

• Hadoop technologies used:

• Hadoop Streaming

• Used to wrap existing TCP Python algorithms

• Cascading (ver 1.1-86)

• Allows construction of a Hadoop dataflow using pipe, source, and sink objects.

• Other packages used:

• Python modules: numpy, scipy, xml.etree.ElementTree, pyephem

• WEKA (Java-based machine learning software)
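Hadoop Streaming is what lets the existing TCP Python algorithms be reused unchanged: any executable that reads lines on stdin and writes tab-separated key/value pairs to stdout can serve as a mapper or reducer. A minimal mapper sketch is below; the (ID, payload) field layout and the upper-casing stand-in transformation are illustrative, not the TCP's actual record format or feature code.

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming mapper sketch: read (ID, payload) lines
from stdin, emit tab-separated key/value pairs on stdout."""
import sys

def map_lines(lines):
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        source_id, payload = line.split("\t", 1)
        # A real TCP mapper would decompress the time-series XML and
        # run the existing Python feature-extraction code here; the
        # upper-casing below is just a placeholder transformation.
        yield "%s\t%s" % (source_id, payload.upper())

if __name__ == "__main__" and not sys.stdin.isatty():
    for out in map_lines(sys.stdin):
        print(out)
```

Streaming then shuffles the emitted keys so that a reducer script, written the same way, sees all values for a given ID together.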

Berkeley’s Transients Classification Pipeline

Generating a “noisified” classifier

[Dataflow diagram; recoverable content:]

• Inputs: the interesting source’s time array, e.g. [1.01, 1.15, 2.03, 3.72, 8.11, 8.25, 20.93, 21.03, 25.48], and reference, well-sampled sources as (ID, &lt;time-series in compressed XML&gt;) tuples.

• Join the time arrays with the well-sampled source XMLs, yielding ([1.01, ... 25.48], &lt;time-series in compressed XML&gt;) tuples.

• Generate several “noisified”, re-sampled time-series for each tuple.

• Generate time-series-characterizing attributes for each time-series tuple, producing (ID, &lt;python dictionary of time-series attributes&gt;) tuples. Only output re-sampled sources where a period could be found.

• Reduce the noisified source tuples into a single WEKA .arff-formatted string. This .arff file is then used to generate a WEKA classifier, which can then be applied to the original interesting source to obtain a science classification.
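The reduce step that collects the per-source attribute dictionaries into a single WEKA .arff string can be sketched as follows. The `to_arff` helper, the attribute names, and the numeric-only typing are illustrative assumptions, not the TCP's actual code.

```python
def to_arff(relation, attribute_names, rows):
    """Build a WEKA .arff string from per-source attribute dicts.
    All attributes are declared NUMERIC here for simplicity; a real
    classifier would also carry a nominal class attribute."""
    lines = ["@RELATION %s" % relation, ""]
    for name in attribute_names:
        lines.append("@ATTRIBUTE %s NUMERIC" % name)
    lines += ["", "@DATA"]
    for attrs in rows:
        lines.append(",".join(str(attrs[n]) for n in attribute_names))
    return "\n".join(lines)

# Hypothetical attribute dictionaries, as a reducer might receive them
arff = to_arff("noisified_sources",
               ["period", "amplitude"],
               [{"period": 5.02, "amplitude": 1.1},
                {"period": 4.97, "amplitude": 0.9}])
```

The resulting string is what gets handed to WEKA to train the classifier for the original interesting source.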

Berkeley’s Transients Classification Pipeline

• Metrics of the noisification pipeline:

• M45 Hadoop Pipeline

• 15 minutes

• Original TCP Python

• Using IPython parallelization across 8 cores

• 150 minutes

• Other work done as part of the Yahoo! Cloud initiative

• We’ve developed code which applies a WEKA classifier to TCP’s VOSource XML.

• We’ve tested our software on other Hadoop distributions (Cloudera).

• Future tasks to improve comparison metrics

• Distribute the workload more evenly across the map() and reduce() phases

• Make the Python code self-contained in a single distributable Python .egg

Berkeley’s Transients Classification Pipeline

• Issues we’ve had with the M45 cluster:

• M45’s system installation of Python is v2.4.3 (old).

• This required modifying some syntax used by our code.

• M45’s Python does not include scipy or numpy Python modules

• This required the non-ideal hack of packaging the numpy and scipy source code with certain map/reduce Hadoop Streaming jobs.

• This is not needed on Hadoop clusters that have numpy and scipy installed.
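The packaging workaround amounts to shipping the module source alongside the job (e.g. via Streaming's -file or -cacheArchive options, which unpack into the task's working directory) and prepending the unpacked directory to sys.path before importing. A sketch of that import-path shim, with an illustrative helper name and path:

```python
import os
import sys

def use_bundled_packages(bundle_dir):
    """If a bundled package directory (shipped with the Streaming job)
    exists in the task's working directory, prefer it over the
    cluster's (possibly missing) system-wide install."""
    if os.path.isdir(bundle_dir):
        sys.path.insert(0, bundle_dir)
    return sys.path[0]

# e.g. at the top of a mapper script, before "import numpy":
# use_bundled_packages(os.path.join(os.getcwd(), "bundled_libs"))
```

On clusters where numpy and scipy are already installed, the directory check simply fails and the system copies are used unchanged.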

• Future work using Hadoop:

• Generate classifiers for real astronomical sources using the noisification pipeline.

• We’ve currently used only a test case astronomical source.

• Apply our Hadoop based pipelines to the TCP’s real-time datastream.

• Break pipeline into a finer granularity of map(), reduce() algorithms.

• Make use of other Hadoop based machine learning projects (e.g.: Mahout)

• Port other TCP tasks to Hadoop.

Dan Starr, Justin Higgins, Adam Morgan

Josh Bloom (PI)

