spark application for time series analysis
TRANSCRIPT
![Page 1: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/1.jpg)
®© 2016 MapR Technologies 1 ®© 2016 MapR Technologies 1 MapR Confidential
© 2016 MapR Technologies
Spark Application for Time Series Predictive Analysis
Dong Meng, Data Scientist
![Page 2: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/2.jpg)
What is Apache Spark?
• Originally developed in 2009 in UC Berkeley’s AMP Lab
• Fully open-sourced in 2010 – now a Top Level Project at the Apache Software Foundation
• The most active open source project in Big Data
spark.apache.org github.com/apache/spark [email protected]
![Page 3: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/3.jpg)
Resilient Distributed Datasets (RDD)
• RDD = read-only, partitioned collection of records
• Create new RDDs via transformations: {map, filter, union, flatMap, distinct, groupByKey, cogroup, reduceByKey, … }
• Operate on & analyze RDDs via actions: {reduce, collect, count, first, take, foreach, countByKey, takeSample, saveAsTextFile, … }
• Users have the option to persist data in memory & control partitioning
• Fault tolerance comes from lineage: if a partition is lost, it can be recomputed from the transformations that produced it
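The split between lazy transformations and eager actions, and the role of lineage, can be sketched with a toy model. This is a plain-Python, single-process stand-in written for illustration only: `ToyRDD` and its fields are hypothetical names, not Spark's API, and there is no partitioning or distribution here.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self,
                 source: Optional[List[Any]] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[Iterable[Any]], Iterable[Any]]] = None,
                 name: str = "source") -> None:
        self._source = source   # only the root node holds data
        self._parent = parent   # lineage pointer to the parent node
        self._op = op           # how to derive this node from its parent
        self.name = name

    # Transformations: build a new node and record lineage; nothing runs yet.
    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda rows: (f(r) for r in rows), name="map")

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda rows: (r for r in rows if pred(r)), name="filter")

    def lineage(self) -> List[str]:
        """Names of the steps from the source down to this node."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node._parent
        return list(reversed(names))

    def _compute(self) -> Iterator[Any]:
        # Replaying the lineage is also how a lost result would be rebuilt.
        if self._parent is None:
            return iter(self._source)
        return iter(self._op(self._parent._compute()))

    # Actions: trigger the actual computation.
    def collect(self) -> List[Any]:
        return list(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._compute())


calls: List[int] = []

def square(x: int) -> int:
    calls.append(x)          # record every invocation to observe laziness
    return x * x

numbers = ToyRDD(source=[1, 2, 3, 4, 5])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: transformations are lazy
result = evens_squared.collect()   # the action runs the whole chain
```

Because nothing is cached in this sketch, every action replays the full lineage from the source; persisting data in memory is what avoids that recomputation.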
![Page 4: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/4.jpg)
Resilient Distributed Datasets (RDD)
• How to generate an RDD
• How to compute an RDD
• What is the relationship between RDDs
– Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD [map, filter, union]
– Wide/shuffle dependency: a partition of the parent RDD may be used by multiple partitions of the child RDD [groupByKey]
• In this sense, MapReduce ~ map + reduceByKey
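To make the two dependency types concrete, here is a small plain-Python word count over explicit partitions. It is a sketch under stated assumptions, not Spark code: the function names, the sample lines, and the CRC32-based partitioner are all made up for illustration, and the map-side combining that a real reduceByKey performs before shuffling is left out.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Partition = List[Tuple[str, int]]


def map_partition(lines: List[str]) -> Partition:
    """Narrow step: the output partition depends on exactly one input partition."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle(partitions: List[Partition], num_out: int) -> List[Partition]:
    """Wide step: each input partition may feed several output partitions."""
    out: List[Partition] = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # Deterministic partitioner (assumption): same key, same output partition.
            out[zlib.crc32(key.encode("utf-8")) % num_out].append((key, value))
    return out


def reduce_partition(part: Partition) -> Dict[str, int]:
    """Sum the values of each key inside one shuffled partition."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in part:
        totals[key] += value
    return dict(totals)


# Two input partitions of hypothetical text lines.
input_partitions = [
    ["spark makes rdds", "rdds are partitioned"],
    ["spark runs tasks"],
]

mapped = [map_partition(p) for p in input_partitions]   # map (narrow)
shuffled = shuffle(mapped, num_out=2)                    # shuffle (wide)
reduced = [reduce_partition(p) for p in shuffled]        # reduce per key
word_counts = {k: v for part in reduced for k, v in part.items()}
```

The map step never looks outside its own partition, while the shuffle has to move records between partitions so that all values for a key end up together; that data movement is what makes wide dependencies expensive.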
![Page 5: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/5.jpg)
Resilient Distributed Datasets (RDD)
![Page 6: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/6.jpg)
Spark DataFrame
• A distributed collection of rows organized into named columns
• Select, filter, aggregate, join, and plot structured data (similar to Python pandas, R data.frame / data.table)
• Based on SchemaRDD (Spark < 1.3)
• Much less code to write, plus the schema
RDD:
    # split each comma-separated line into fields
    data = sc.textFile(…).map(lambda line: line.split(","))
    # build (key, [value, 1]) pairs, sum values and counts per key, then divide
    data.map(lambda x: (x[0], [float(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: (x[0], x[1][0] / x[1][1])) \
        .collect()
DataFrame:
    from pyspark.sql.functions import avg
    df.groupBy("name").agg(avg("age")).collect()
![Page 7: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/7.jpg)
Spark SQL
• Add a schema to an RDD
• Write SQL queries against an RDD
• Interact with Hive
• Flexibility in UDFs
• Supports Parquet, JSON, CSV, etc.
![Page 8: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/8.jpg)
Spark MLlib
• Active community
• Pipeline-style model building and tuning (similar to scikit-learn)
• Core Spark
• Distributed computing
![Page 9: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/9.jpg)
Spark Execution
• Job -> Stage -> Task
• Flow: Directed Acyclic Graph (DAG)
• Pipelining (lazy evaluation)
• Deployment: Standalone (does not necessarily mean a single machine) / YARN (yarn-client, yarn-cluster)
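The Job -> Stage -> Task breakdown can be illustrated with a simplified sketch: a job's chain of operations is cut into stages at each shuffle, and each stage runs one task per partition. The code below is plain Python, not Spark's scheduler; the operation list, the `needs_shuffle` flags, and the fixed partition count are assumptions chosen for the example (in a real job the partition count can differ between stages).

```python
from typing import List, Tuple

# (operation name, needs_shuffle) in lineage order: a hypothetical job description.
Operation = Tuple[str, bool]


def split_into_stages(ops: List[Operation]) -> List[List[str]]:
    """Start a new stage whenever an operation requires a shuffle."""
    stages: List[List[str]] = [[]]
    for name, needs_shuffle in ops:
        if needs_shuffle and stages[-1]:
            stages.append([])          # shuffle boundary closes the current stage
        stages[-1].append(name)        # narrow operations are pipelined together
    return stages


def plan_tasks(stages: List[List[str]], num_partitions: int) -> List[Tuple[int, int]]:
    """One task per (stage, partition) pair; assumes the same partition count everywhere."""
    return [(stage_id, partition_id)
            for stage_id in range(len(stages))
            for partition_id in range(num_partitions)]


job: List[Operation] = [
    ("textFile", False),
    ("map", False),
    ("filter", False),
    ("reduceByKey", True),   # wide dependency: forces a stage boundary
    ("map", False),
    ("groupByKey", True),    # second shuffle, third stage
]

stages = split_into_stages(job)
tasks = plan_tasks(stages, num_partitions=4)
```

Operations inside one stage have only narrow dependencies, so a single task can apply them back to back on its partition without writing intermediate results; only at a shuffle does data have to be exchanged between tasks.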
![Page 10: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/10.jpg)
Spark Execution
![Page 11: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/11.jpg)
Now let’s build the Spark application
![Page 12: Spark Application for Time Series Analysis](https://reader031.vdocuments.us/reader031/viewer/2022030307/58e7569e1a28ab4a278b4663/html5/thumbnails/12.jpg)
Find my presentation and other related resources here:
http://events.mapr.com/DataScienceMD
Today’s Presentation
Whiteboard & demo videos
Free On-Demand Training
Free Cheat Sheet
Free Articles And more…