
© 2016 MapR Technologies

Spark Application for Time Series Predictive Analysis

Dong Meng, Data Scientist


What is Apache Spark?

•  Originally developed in 2009 in UC Berkeley’s AMP Lab

•  Fully open-sourced in 2010, now a top-level project at the Apache Software Foundation

•  One of the most active open source projects in big data

spark.apache.org, github.com/apache/spark


Resilient Distributed Datasets (RDD)

•  RDD = read-only, partitioned collection of records

•  Create new RDDs via transformations: {map, filter, union, flatMap, distinct, groupByKey, cogroup, reduceByKey, … }

•  Operate on and analyze RDDs via actions: {reduce, collect, count, first, take, foreach, countByKey, takeSample, saveAsTextFile, … }

•  Users have the option to persist data in memory and to control partitioning

•  Fault tolerance comes from lineage: when a partition is lost, it is recomputed from the chain of transformations that produced it
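The lazy-transformation and lineage ideas on this slide can be illustrated with a small single-process toy. This is a minimal sketch in plain Python, not the Spark API: the class name `ToyRDD`, its attributes, and the sample data are all hypothetical and exist only to show how a lost partition can be rebuilt from recorded lineage.

```python
# Toy, single-process model of RDD lineage. This is NOT the Spark API;
# ToyRDD and everything in it are hypothetical names used for illustration.
class ToyRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions   # only set on the root dataset
        self.parent = parent        # lineage pointer to the parent dataset
        self.fn = fn                # per-partition function applied to the parent
        self.persisted = False
        self._cache = {}            # partition index -> materialized records

    @property
    def num_partitions(self):
        if self.parent is None:
            return len(self._source)
        return self.parent.num_partitions

    # Transformations are lazy: they only record lineage and return a new dataset.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def persist(self):
        self.persisted = True
        return self

    def compute(self, i):
        """Materialize partition i, walking the lineage back to the source if needed."""
        if i in self._cache:
            return self._cache[i]
        if self.parent is None:
            records = list(self._source[i])
        else:
            records = self.fn(self.parent.compute(i))
        if self.persisted:
            self._cache[i] = records
        return records

    # Actions trigger the actual computation.
    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def count(self):
        return len(self.collect())


numbers = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).persist()

print(evens_squared.collect())    # the action runs the whole chain
evens_squared._cache.pop(1)       # simulate losing one cached partition
print(evens_squared.compute(1))   # rebuilt from lineage, other partitions untouched
```

Only the lost partition is recomputed; nothing is replicated up front, which is the trade-off the lineage approach makes.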



•  How to generate an RDD

•  How to compute an RDD

•  What the relationship between RDDs is

–  Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD [map, filter, union]

–  Wide/shuffle dependency: multiple partitions of the child RDD may depend on a single partition of the parent RDD [groupByKey]

•  In this sense, MapReduce ≈ map + reduceByKey
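The difference between the two dependency types, and the "map + reduceByKey" analogy, can be sketched as a word count over hand-made partitions. This is a plain-Python illustration under stated assumptions, not Spark code: the helper names `map_partition`, `shuffle`, and `reduce_by_key` are hypothetical, and real Spark additionally combines values on the map side before shuffling.

```python
import zlib

def map_partition(partition):
    # Narrow dependency: each output partition is built from exactly one input partition.
    return [(word, 1) for line in partition for word in line.split()]

def shuffle(partitions, num_out):
    # Wide dependency: any output partition may receive records from any input partition.
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # Deterministic hash partitioning on the key (an assumption for this sketch).
            out[zlib.crc32(key.encode("utf-8")) % num_out].append((key, value))
    return out

def reduce_by_key(partition, fn):
    # After the shuffle, all values for a given key sit in the same partition.
    acc = {}
    for key, value in partition:
        acc[key] = fn(acc[key], value) if key in acc else value
    return sorted(acc.items())

lines = [["spark makes rdds", "rdds are partitioned"], ["spark tracks lineage"]]

mapped = [map_partition(p) for p in lines]     # map phase, no data movement
shuffled = shuffle(mapped, num_out=2)          # shuffle boundary
reduced = [reduce_by_key(p, lambda a, b: a + b) for p in shuffled]

counts = dict(kv for part in reduced for kv in part)
print(counts["spark"], counts["rdds"], counts["lineage"])
```

The map step never looks outside its own partition, while the shuffle step is the only place where records cross partitions.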


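Spark DataFrame

•  A distributed collection of rows organized into named columns

•  Select, filter, aggregate, join, and plot structured data (as in Python pandas or R data.frame / data.table)

•  Based on SchemaRDD (Spark < 1.3)

•  Much less code to write, plus the schema

RDD:

    # Parse each CSV line into fields; column 0 is the key, column 1 the numeric value.
    data = sc.textFile(…).map(lambda line: line.split(","))
    (data.map(lambda x: (x[0], [float(x[1]), 1]))               # (key, [value, 1])
         .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])  # per-key [sum, count]
         .map(lambda x: (x[0], x[1][0] / x[1][1]))              # (key, average)
         .collect())

DataFrame:

    from pyspark.sql.functions import avg
    # The grouping column "name" is included in the result automatically in recent Spark versions.
    df.groupBy("name").agg(avg("age")).collect()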
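The RDD version carries a (sum, count) pair per key rather than a running average because the reduce function must give the same answer regardless of how records are grouped across partitions. A minimal plain-Python sketch of that reasoning follows; the sample rows and the `combine` helper are made up for illustration.

```python
# Hypothetical sample data: (name, age) rows.
rows = [("alice", 30.0), ("bob", 20.0), ("alice", 40.0), ("bob", 22.0), ("bob", 24.0)]

def combine(a, b):
    # Element-wise addition of (sum, count) pairs; associative and commutative,
    # so partial results from different partitions can be merged in any order.
    return (a[0] + b[0], a[1] + b[1])

totals = {}
for name, age in rows:
    pair = (age, 1)
    totals[name] = combine(totals[name], pair) if name in totals else pair

averages = {name: total / count for name, (total, count) in totals.items()}
print(averages)

# Averaging partial averages is not safe: the result depends on the grouping.
naive = ((20.0 + 22.0) / 2 + 24.0) / 2
print(naive, averages["bob"])
```

Dividing only once, at the end, is what the final `map` in the RDD snippet does.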


Spark SQL

•  Add a schema to an RDD

•  Write SQL queries against an RDD

•  Interact with Hive

•  Flexibility in UDFs

•  Supports Parquet, JSON, CSV, etc.


Spark MLlib

•  Active community

•  Pipeline-style model building and tuning (similar to scikit-learn)

•  Core Spark

•  Distributed computing


Spark Execution

•  Job -> Stage -> Task

•  Flow: Directed Acyclic Graph (DAG)

•  Pipelining (lazy evaluation)

•  Cluster managers: Standalone (which does not necessarily mean a single machine) / YARN (yarn-client, yarn-cluster)
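The Job -> Stage -> Task breakdown can be sketched as cutting a chain of operations at every shuffle. This is a simplified plain-Python model, not Spark's scheduler: the function `split_into_stages` and the "narrow"/"wide" labels on each operation are assumptions made for the example.

```python
def split_into_stages(ops):
    """Group a linear chain of operations into stages.

    ops is a list of (name, dependency) pairs, where dependency is
    "narrow" or "wide". A wide (shuffle) dependency starts a new stage;
    consecutive narrow operations are pipelined within the same stage.
    """
    stages, current = [], []
    for name, dependency in ops:
        if dependency == "wide" and current:
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

# One job: the chain of operations behind a single action.
job = [
    ("textFile", "narrow"),
    ("map", "narrow"),
    ("filter", "narrow"),
    ("reduceByKey", "wide"),
    ("map", "narrow"),
]

stages = split_into_stages(job)
for index, stage in enumerate(stages):
    print(index, " -> ".join(stage))
```

Within each stage, one task runs per partition, and nothing executes until an action requests a result.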




Now let’s build the Spark application


Find my presentation and other related resources here:

http://events.mapr.com/DataScienceMD

Today’s Presentation

Whiteboard & demo videos

Free On-Demand Training

Free Cheat Sheet

Free Articles

And more…
