Spark Application for Time Series Analysis

Spark Application for Time Series Predictive Analysis

Dong Meng, Data Scientist

© 2016 MapR Technologies, MapR Confidential

Upload: mapr-technologies

Post on 07-Apr-2017

TRANSCRIPT

Page 1: Spark Application for Time Series Analysis

Spark Application for Time Series Predictive Analysis

Dong Meng, Data Scientist

Page 2: Spark Application for Time Series Analysis

What is Apache Spark?

•  Originally developed in 2009 in UC Berkeley’s AMP Lab

•  Fully open-sourced in 2010; now a Top Level Project at the Apache Software Foundation

•  The most active open source project in Big Data

spark.apache.org

github.com/apache/spark

Page 3: Spark Application for Time Series Analysis

Resilient Distributed Datasets (RDD)

•  RDD = read-only, partitioned collection of records

•  Create new RDDs via transformations: {map, filter, union, flatMap, distinct, groupByKey, cogroup, reduceByKey, … }

•  Operate on and analyze RDDs via actions: {reduce, collect, count, first, take, foreach, countByKey, takeSample, saveAsTextFile, … }

•  Users have the option to persist data in memory and to control partitioning

•  Fault tolerance comes from lineage: if a partition is lost, it can be recomputed from the recorded chain of transformations that produced it
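The split between transformations and actions, and the role of lineage, can be sketched in plain Python without a Spark installation. The `ToyRDD` class below is a hypothetical, single-machine toy and not Spark's API or implementation; it only illustrates that transformations record how to derive a dataset, while actions replay that record to produce a result.

```python
class ToyRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, op=None, name="source"):
        self._source = source  # records of the root dataset; None for derived datasets
        self._parent = parent  # lineage pointer to the dataset this one derives from
        self._op = op          # function turning the parent's records into ours
        self.name = name

    # Transformations only build a new node; no record is touched yet.
    def map(self, f):
        return ToyRDD(parent=self, op=lambda rows: (f(r) for r in rows), name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: (r for r in rows if pred(r)), name="filter")

    def lineage(self):
        """Step names from the source down to this dataset."""
        steps, node = [], self
        while node is not None:
            steps.append(node.name)
            node = node._parent
        return steps[::-1]

    def _compute(self):
        # Replay the lineage from the source. Re-running this is how lost
        # results can be rebuilt without having stored them.
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._compute())

    # Actions force the computation and return a plain Python value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


seen = []  # records every value the map function is actually applied to


def double(x):
    seen.append(x)
    return 2 * x


evens = ToyRDD(source=[1, 2, 3, 4]).map(double).filter(lambda x: x % 4 == 0)
assert seen == []        # transformations alone compute nothing
print(evens.lineage())   # ['source', 'map', 'filter']
print(evens.collect())   # [4, 8]
```

In this toy, every action recomputes from the source, which is roughly why persisting an intermediate dataset in memory pays off when it is reused.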

Page 4: Spark Application for Time Series Analysis

Resilient Distributed Datasets (RDD)

•  How an RDD is generated

•  How an RDD is computed

•  How RDDs relate to one another

–  Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD (map, filter, union)

–  Wide (shuffle) dependency: a partition of the parent RDD may be used by multiple partitions of the child RDD (groupByKey)

•  In these terms, MapReduce ~ map + reduceByKey
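The difference between the two dependency types, and the map + reduceByKey analogy, can be traced on plain Python lists standing in for partitions. This is a sketch under stated assumptions: `map_partitions` and `reduce_by_key` are hypothetical helpers written for illustration, not Spark functions, and hashing a key to pick an output partition is a simplification of a real shuffle.

```python
def map_partitions(partitions, f):
    # Narrow dependency: each output partition is built from exactly one
    # input partition, so no data moves between partitions.
    return [[f(rec) for rec in part] for part in partitions]


def reduce_by_key(partitions, combine, num_out):
    # Wide (shuffle) dependency: records with the same key can sit in any
    # input partition, so each output partition may read from all of them.
    out = [{} for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            bucket = out[hash(key) % num_out]  # choose the output partition
            bucket[key] = combine(bucket[key], value) if key in bucket else value
    return [sorted(bucket.items()) for bucket in out]


# Two input partitions of words (made-up data).
words = [["a", "b", "a"], ["c", "a", "b"]]

pairs = map_partitions(words, lambda w: (w, 1))               # "map" phase
counts = reduce_by_key(pairs, lambda x, y: x + y, num_out=2)  # "reduce" phase

# Which output partition holds which key depends on the hash, so flatten
# and sort before printing.
totals = sorted(kv for part in counts for kv in part)
print(totals)  # [('a', 3), ('b', 2), ('c', 1)]
```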

Page 5: Spark Application for Time Series Analysis

Resilient Distributed Datasets (RDD)

Page 6: Spark Application for Time Series Analysis

Spark DataFrame

•  A distributed collection of rows organized into named columns

•  Select, filter, aggregate, join, and plot structured data (comparable to Python pandas, R data.frame and data.table)

•  Based on SchemaRDD (Spark < 1.3)

•  Much less code to write, plus the schema

RDD:

    # Parse each line into fields, then average the second field per key.
    data = sc.textFile(…).map(lambda line: line.split(","))
    (data.map(lambda x: (x[0], [float(x[1]), 1]))
         .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
         .map(lambda x: (x[0], x[1][0] / x[1][1]))
         .collect())

DataFrame:

    from pyspark.sql.functions import avg

    # The grouping column is included in the result automatically.
    df.groupBy("name").agg(avg("age")).collect()
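The RDD version carries a [value, 1] pair for each record because averages cannot be merged pairwise, whereas (sum, count) pairs can. The plain-Python trace below follows the same three steps without Spark; the names and ages are made-up sample data, and `merge` is just the slide's reduceByKey lambda given a name.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (name, age) records standing in for the parsed text file.
rows = [("alice", 30.0), ("bob", 20.0), ("alice", 40.0),
        ("bob", 30.0), ("alice", 50.0)]

# Step 1 (map): attach a count of 1 to every value.
pairs = [(name, [age, 1]) for name, age in rows]

# Step 2 (reduceByKey): add sums and counts component-wise for each key.
def merge(x, y):
    return [x[0] + y[0], x[1] + y[1]]

grouped = defaultdict(list)
for name, sum_count in pairs:
    grouped[name].append(sum_count)
totals = {name: reduce(merge, values) for name, values in grouped.items()}

# Step 3 (map): divide the sum by the count.
averages = {name: s / n for name, (s, n) in totals.items()}
print(averages)  # {'alice': 40.0, 'bob': 25.0}
```

Because `merge` is associative and commutative, the partial results can be combined in any order, which is what lets the reduction run across partitions.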

Page 7: Spark Application for Time Series Analysis

SparkSQL

•  Add a schema to an RDD

•  Write SQL queries against RDDs

•  Interact with Hive

•  Flexibility in UDFs

•  Supports Parquet, JSON, CSV, etc.

Page 8: Spark Application for Time Series Analysis

Spark MLlib

•  Active community

•  Pipeline-style model building and tuning (similar to scikit-learn)

•  Core Spark

•  Distributed computing

Page 9: Spark Application for Time Series Analysis

Spark Execution

•  Job -> Stage -> Task

•  Flow: Directed Acyclic Graph (DAG)

•  Pipelining (lazy evaluation)

•  Cluster managers: Standalone (which does not necessarily mean a single machine) or YARN (yarn-client, yarn-cluster)
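The Job -> Stage -> Task breakdown can be illustrated with a small toy model. It rests on simplifying assumptions that are not from the slides: the job is a linear chain of operations rather than a general DAG, a new stage starts at every shuffle operation, and every stage runs one task per partition with the same partition count. `split_into_stages` and `count_tasks` are hypothetical helpers, not part of Spark.

```python
# Operations that introduce a wide (shuffle) dependency; a deliberately short list.
SHUFFLE_OPS = {"groupByKey", "reduceByKey"}


def split_into_stages(ops):
    """Cut a linear chain of operations into stages at shuffle boundaries."""
    stages = [[]]
    for op in ops:
        # Narrow operations are pipelined into the current stage;
        # a shuffle operation opens a new one.
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


def count_tasks(stages, num_partitions):
    """One task per partition per stage (toy assumption: fixed partition count)."""
    return len(stages) * num_partitions


# One job, triggered by the final action.
job = ["textFile", "map", "reduceByKey", "map", "collect"]
stages = split_into_stages(job)
print(stages)  # [['textFile', 'map'], ['reduceByKey', 'map', 'collect']]
print(count_tasks(stages, num_partitions=4))  # 8
```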

Page 10: Spark Application for Time Series Analysis

Spark Execution

Page 11: Spark Application for Time Series Analysis

Now let’s build the Spark application

Page 12: Spark Application for Time Series Analysis

Find my presentation and other related resources here:

http://events.mapr.com/DataScienceMD

Today’s Presentation

Whiteboard & demo videos

Free On-Demand Training

Free Cheat Sheet

Free Articles

And more…