Why Functional Programming Is Important in Big Data Era
DESCRIPTION
The only thing that works for parallel programming is functional programming. --Carnegie Mellon Professor Bob Harper
TRANSCRIPT
What Is Big Data?
What Are The Steps?
Collect
Analyze
Act On
What Do We Need?
(Diagram: distributed computing, i.e. a cluster to process the data.)
What Do We Need?
• Spark for data processing on the cluster; originally written in Scala, which allows concise function syntax and interactive use
• Mesos as cluster manager
• ZooKeeper as highly reliable distributed coordinator
• HDFS as distributed storage
What Do We Need?
• Pure functions
• Atomic operations
• Parallel patterns or skeletons
• Lightweight algorithms
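The first of these points can be made concrete in a few lines of Scala (an illustrative sketch, not from the slides): a pure function depends only on its arguments, so independent calls can run in parallel, while a function that mutates shared state cannot.

```scala
// Pure: the result depends only on the argument, with no side effects,
// so calls on different elements can safely run in parallel.
def pureSquare(x: Int): Int = x * x

// Impure: mutates shared state, so parallel calls would race on `total`.
var total = 0
def impureSquare(x: Int): Int = { total += x * x; total }

val squares = List(1, 2, 3).map(pureSquare)  // List(1, 4, 9), order-independent
```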
The only thing that works for parallel programming is functional programming.
--Carnegie Mellon Professor Bob Harper
What Is Functional Programming?
FP Quick Tour In Scala
• Creating collections:
var array = new Array[Int](10)
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
• Indexing:
array(0) = 1
println(list(0))
• Anonymous functions:
val multiply = (x: Int, y: Int) => x * y
val procedure = { x: Int =>
  println("Hello, " + x)
  println(x * 10)
}
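A small usage sketch (added here, not on the slide) showing an anonymous function applied directly and then passed to a higher-order method:

```scala
// Bind an anonymous function to a name and apply it:
val multiply = (x: Int, y: Int) => x * y
println(multiply(6, 7))  // 42

// Anonymous functions are usually passed straight to higher-order methods:
val doubled = List(1, 2, 3).map(x => multiply(x, 2))  // List(2, 4, 6)
```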
FP Quick Tour In Scala
• Scala closure syntax:
(x: Int) => x * 10   // full version
x => x * 10          // type inference
_ * 10               // underscore syntax
x => {               // body is a block of code
  val y = 10
  x * y
}
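All four spellings denote the same function of type Int => Int; a quick check (an added sketch, not from the slides):

```scala
// The same closure, written four ways:
val full: Int => Int       = (x: Int) => x * 10
val inferred: Int => Int   = x => x * 10
val underscore: Int => Int = _ * 10
val block: Int => Int      = x => { val y = 10; x * y }

// Every variant maps 3 to 30:
assert(List(full, inferred, underscore, block).forall(f => f(3) == 30))
```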
FP Quick Tour In Scala
• Processing collections:
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
list.foreach(x => println(x))
list.map(_ * 10)
list.filter(x => x % 2 == 0)
list.reduce((x, y) => x + y)
list.reduce(_ + _)
def f(x: Int) = List(x - 1, x, x + 1)
list.map(x => f(x))
list.map(f(_))
list.flatMap(x => f(x))
list.map(x => f(x)).reduce(_ ++ _)
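The relationship between map and flatMap in the last line deserves one added step: flatMap is map followed by flattening one level, which is exactly what map-then-reduce with ++ computes.

```scala
val list = List(1, 2, 3)
def f(x: Int) = List(x - 1, x, x + 1)

val nested = list.map(f)      // List(List(0, 1, 2), List(1, 2, 3), List(2, 3, 4))
val flat   = list.flatMap(f)  // List(0, 1, 2, 1, 2, 3, 2, 3, 4)

// flatMap is map followed by concatenating the pieces:
assert(flat == list.map(f).reduce(_ ++ _))
```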
Spark Quick Tour
• Spark context:
  • Entry point to Spark functionality
  • In spark-shell, created as sc
  • In a standalone Spark program, we must create it ourselves
• Resilient distributed datasets (RDDs):
  • A distributed memory abstraction
  • A logically centralized entity, physically partitioned across the machines of a cluster based on some notion of key
  • Immutable
  • Automatically rebuilt on failure
  • Cached partitions evicted by an LRU (Least Recently Used) policy
Spark Quick Tour
Working with RDDs
Spark Quick Tour
Cached RDDs
Spark Quick Tour
• Transformations:
  • Lazy operations that build RDDs from other RDDs
  • Narrow transformations (no data shuffling): map, flatMap, filter
  • Wide transformations (involve data shuffling): sortByKey, reduceByKey, groupByKey
• Actions:
  • Return a result or write it to storage
  • collect, count, take(n)
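Seeing this on an RDD requires a running Spark cluster, but Scala's lazy views show the same transformation/action split locally (an analogy in plain Scala, not Spark code):

```scala
// Like RDD transformations, a view's map is lazy: nothing runs
// until something forces the result, as an action does in Spark.
var evaluated = 0
val transformed = (1 to 5).view.map { x => evaluated += 1; x * 10 }

assert(evaluated == 0)           // "transformation": nothing computed yet
val result = transformed.toList  // "action": forces evaluation
assert(evaluated == 5)
assert(result == List(10, 20, 30, 40, 50))
```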
Spark Quick Tour
Transformations
Spark Quick Tour
• Creating RDDs:
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
val textFile = sc.textFile("hdfs://localhost/test/tobe.txt")
val textFile = sc.textFile("hdfs://localhost/test/*.txt")
• Basic transformations:
val squares = numbers.map(x => x * x)
val even = squares.filter(_ % 2 == 0)
val mapto = numbers.flatMap(x => 1 to x)
val words = textFile.flatMap(_.split(" ")).cache()
(Diagram: turning a collection into a base RDD, then deriving transformed RDDs from it.)
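These RDD transformations mirror the ordinary Scala collection methods from the FP tour, which makes them easy to prototype locally; here is the same pipeline on a plain List, a stand-in for sc.parallelize that needs no cluster:

```scala
// A local List plays the role of the parallelized RDD:
val numbers = List(1, 2, 3, 4, 5)
val squares = numbers.map(x => x * x)       // List(1, 4, 9, 16, 25)
val even    = squares.filter(_ % 2 == 0)    // List(4, 16)
val mapto   = numbers.flatMap(x => 1 to x)  // 1, then 1..2, then 1..3, ...
```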
Spark Quick Tour
• Basic actions:
words.collect()
words.take(5)
words.count()
words.reduce(_ + _)
words.filter(_ == "be").count()
words.filter(_ == "or").count()
words.saveAsTextFile("hdfs://localhost/test/result")
(Figure: the influence of cache.)
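The same actions can be checked on a local collection; the words below are a hypothetical stand-in for the contents of tobe.txt:

```scala
val words = "to be or not to be".split(" ").toList

val firstFive = words.take(5)           // first five words
val beCount   = words.count(_ == "be")  // 2
val orCount   = words.count(_ == "or")  // 1
val joined    = words.reduce(_ + _)     // "+" on Strings concatenates
```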
Spark Quick Tour
• Pair syntax:
val pair = (a, b)
• Accessing pair elements:
pair._1
pair._2
• Key-value operations:
val pets = sc.parallelize(List(("cat", 1), ("dog", 2), ("cat", 3)))
pets.reduceByKey(_ + _)
pets.groupByKey()
pets.sortByKey()
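What reduceByKey computes can be sketched on a local collection: group the pairs by key, then reduce each group's values (Spark additionally performs partial reduction on each partition before shuffling, which this sketch does not show):

```scala
// Local stand-in for the parallelized pets RDD:
val pets = List(("cat", 1), ("dog", 2), ("cat", 3))

// groupBy + per-group reduce plays the role of reduceByKey(_ + _):
val reduced = pets.groupBy(_._1)
                  .map { case (k, kvs) => (k, kvs.map(_._2).reduce(_ + _)) }

assert(reduced == Map("cat" -> 4, "dog" -> 2))
```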
Hello World
val logFile = "hdfs://localhost/test/tobe.txt"
val logData = sc.textFile(logFile).cache()
val wordCount = logData.flatMap(_.split(" "))
                       .map((_, 1))
                       .reduceByKey(_ + _)
wordCount.saveAsTextFile("hdfs://localhost/wordcount/result")
sc.stop()
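The same word count expressed on an in-memory collection keeps the flatMap/map/reduce-by-key shape; the input line here is a made-up stand-in for tobe.txt:

```scala
val lines = List("to be or not to be")

val wordCount = lines.flatMap(_.split(" "))
                     .map((_, 1))
                     .groupBy(_._1)  // plays reduceByKey's role locally
                     .map { case (w, ps) => (w, ps.map(_._2).sum) }
```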
Execution
Software Components
(Diagram: the Application's Spark Context connects to the Mesos Master, with ZooKeeper providing coordination; each Mesos Slave runs a Spark Executor, and all nodes access HDFS or other storage.)
Literature
Parallel Programming With Spark
Spark: Low latency, massively parallel processing framework