spark!
DESCRIPTION
Overview of Spark - a new paradigm for processing Big Data - with astounding performance and conciseness. Slides from DataKRK meetup.TRANSCRIPT
By @przemur from
HTTP://ABOUT.ME/PRZEMEK.MACIOLEK/• Data Scientist, Hadoop user since 2009
• Did research for Academia, data mined for oil&gas exploration industry, cofounded Data Science startup, built Big Data team in Base CRM, …
• A lot of different tools used meanwhile (Mahout, HBase, Cassandra, Redis, Pig, Storm, …)
• Dreaming about something powerful and concise for Big Data…
• AD 2014: Head of Analytics & Data @ Toptal - researching new ways of doing Big Data Analytics, rediscovered Storm.
P.S. Ever considered doing Analytics & Data Science for a very cool startup? Drop me a note at: [email protected]
HADOOP IS COOL…
HADOOP IS COOL (BUT SOMETIMES IT’S NOT)
• High latency (interactive, anyone?)
• Challenging expressibility of business logic
• Iterative algorithms? (think: PageRank)
SOLUTION?
MapReduce
Pregel
Impala
PigS4 Storm
Drill
Giraph
…
General batch processing
Specialized systems
Hive
Data
Data
Data
Data
Data
Data
Data
Map Reduce
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
MAYBE MAP REDUCE IS NOT ALWAYS THE BEST SOLUTION?
GENERALIZE FTW!
MapReduce …
Batch processing
Specialized systems
Task DAG and Data Sharing
Spark
RESILIENT DISTRIBUTED DATASET (RDD)
• A collection of elements that can be operated in parallel
• Parallel Collection, e.g. sc.paralellize(Array(1,2,3))
• Hadoop Dataset
• Lazily evaluated, able to rebuild lost data any time
• Can be stored in memory without replication
TRANSFORMATIONS
• Creates a new dataset from an existing one
• Lazily evaluated
• Recomputed each time an action runs on it, but might be persisted (in memory or disk)
• Broadcast Variables and Accumulators for cluster-level sharing
ACTIONS
• Return the value to the driver after computation finishes
• Runs all required transformations
Scala, Java, Python!
HOW TO USE IT?scala> val textFile = sc.textFile("README.md") textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 !scala> textFile.count() // Number of items in this RDD res0: Long = 74 !scala> textFile.first() // First item in this RDD res1: String = # Apache Spark !scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) // How many words are in the longest line res2: Int = 16 !scala> textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b).collect res3: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...
WHAT HAPPENS UNDERNEATH?
RDD Objects DAG Scheduler
Task SchedulerWorker
rdd.filter().map(…).groupBy(…).filter(…)
Split graph into stages of tasks. Submit each
one when ready.
Lunch tasks via cluster manager. Retry.
Execute tasks. Store and serve blocks.
TaskSet
Task
NARROW DEPENDENCIES
WIDE (SHUFFLE) DEPENDENCIES
map, filter
unionjoin (inputs not co-partitioned)
groupByKey
* http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
How much code is needed to implement Big Data Page Rank?
* http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
* http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
* http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
* http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
BERKELEY DATA ANALYTICS STACK
* https://amplab.cs.berkeley.edu/software/
SPARK LIVE
REFERENCES• http://spark.incubator.apache.org/
• https://amplab.cs.berkeley.edu/software/
• http://ampcamp.berkeley.edu/3/exercises/index.html
• http://www.mlbase.org/
• https://amplab.cs.berkeley.edu/benchmark/
• http://files.meetup.com/3138542/dev-meetup-dec-2012.pptx
• http://spark-summit.org/wp-content/uploads/2013/10/Tully-SparkSummit4.pdf
• http://spark-summit.org/wp-content/uploads/2013/10/Kay_Sparrow_Spark_Summit.pdf
• http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
• http://spark-summit.org/wp-content/uploads/2013/10/Wendell-Spark-Performance.pdf
• http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf