Why Functional Programming Is Important in Big Data Era
DESCRIPTION
The only thing that works for parallel programming is functional programming. --Carnegie Mellon Professor Bob Harper
TRANSCRIPT
What Is Big Data?
What Are The Steps?
Collect
Analyze
Act On
What Do We Need?
(Diagram: distributed computing, i.e. a cluster to process the data.)
What Do We Need?
• Spark for data processing on the cluster; originally written in Scala, which allows concise function syntax and interactive use
• Mesos as cluster manager
• ZooKeeper as highly reliable distributed coordinator
• HDFS as distributed storage
What Do We Need?
• Pure functions
• Atomic operations
• Parallel patterns or skeletons
• Lightweight algorithms
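The first of these points can be made concrete in a few lines of Scala (an illustrative sketch, not from the slides): a pure function depends only on its arguments, so independent calls can run in parallel, while a function that mutates shared state cannot.

```scala
// Pure: the result depends only on the argument, with no side effects,
// so calls on different elements can safely run in parallel.
def pureSquare(x: Int): Int = x * x

// Impure: mutates shared state, so parallel calls would race on `total`.
var total = 0
def impureSquare(x: Int): Int = { total += x * x; total }

val squares = List(1, 2, 3).map(pureSquare)  // List(1, 4, 9), order-independent
```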
The only thing that works for parallel programming is functional programming.
--Carnegie Mellon Professor Bob Harper
What Is Functional Programming?
FP Quick Tour In Scala
• Creating collections:
var array = new Array[Int](10)
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
• Indexing:
array(0) = 1
println(list(0))
• Anonymous functions:
val multiply = (x: Int, y: Int) => x * y
val procedure = { x: Int =>
  println("Hello, " + x)
  println(x * 10)
}
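A small usage sketch (added here, not on the slide) showing an anonymous function applied directly and then passed to a higher-order method:

```scala
// Bind an anonymous function to a name and apply it:
val multiply = (x: Int, y: Int) => x * y
println(multiply(6, 7))  // 42

// Anonymous functions are usually passed straight to higher-order methods:
val doubled = List(1, 2, 3).map(x => multiply(x, 2))  // List(2, 4, 6)
```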
FP Quick Tour In Scala
• Scala closure syntax:
(x: Int) => x * 10   // full version
x => x * 10          // type inference
_ * 10               // underscore syntax
x => {               // body is a block of code
  val y = 10
  x * y
}
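All four spellings denote the same function of type Int => Int; a quick check (an added sketch, not from the slides):

```scala
// The same closure, written four ways:
val full: Int => Int       = (x: Int) => x * 10
val inferred: Int => Int   = x => x * 10
val underscore: Int => Int = _ * 10
val block: Int => Int      = x => { val y = 10; x * y }

// Every variant maps 3 to 30:
assert(List(full, inferred, underscore, block).forall(f => f(3) == 30))
```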
FP Quick Tour In Scala
• Processing collections:
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
list.foreach(x => println(x))
list.map(_ * 10)
list.filter(x => x % 2 == 0)
list.reduce((x, y) => x + y)
list.reduce(_ + _)
def f(x: Int) = List(x - 1, x, x + 1)
list.map(x => f(x))
list.map(f(_))
list.flatMap(x => f(x))
list.map(x => f(x)).reduce(_ ++ _)
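The relationship between map and flatMap in the last line deserves one added step: flatMap is map followed by flattening one level, which is exactly what map-then-reduce with ++ computes.

```scala
val list = List(1, 2, 3)
def f(x: Int) = List(x - 1, x, x + 1)

val nested = list.map(f)      // List(List(0, 1, 2), List(1, 2, 3), List(2, 3, 4))
val flat   = list.flatMap(f)  // List(0, 1, 2, 1, 2, 3, 2, 3, 4)

// flatMap is map followed by concatenating the pieces:
assert(flat == list.map(f).reduce(_ ++ _))
```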
Spark Quick Tour
• Spark context:
  • Entry point to Spark functionality
  • In spark-shell, created as sc
  • In a standalone Spark program, we must create it ourselves
• Resilient distributed datasets (RDDs):
  • A distributed memory abstraction
  • A logically centralized entity, physically partitioned across the machines of a cluster based on some notion of key
  • Immutable
  • Automatically rebuilt on failure
  • Cached partitions evicted by an LRU (Least Recently Used) policy
Spark Quick Tour
Working with RDDs
Spark Quick Tour
Cached RDDs
Spark Quick Tour
• Transformations:
  • Lazy operations that build RDDs from other RDDs
  • Narrow transformations (no data shuffling): map, flatMap, filter
  • Wide transformations (involve data shuffling): sortByKey, reduceByKey, groupByKey
• Actions:
  • Return a result or write it to storage
  • collect, count, take(n)
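Seeing this on an RDD requires a running Spark cluster, but Scala's lazy views show the same transformation/action split locally (an analogy in plain Scala, not Spark code):

```scala
// Like RDD transformations, a view's map is lazy: nothing runs
// until something forces the result, as an action does in Spark.
var evaluated = 0
val transformed = (1 to 5).view.map { x => evaluated += 1; x * 10 }

assert(evaluated == 0)           // "transformation": nothing computed yet
val result = transformed.toList  // "action": forces evaluation
assert(evaluated == 5)
assert(result == List(10, 20, 30, 40, 50))
```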
Spark Quick Tour
Transformations
Spark Quick Tour
• Creating RDDs:
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
val textFile = sc.textFile("hdfs://localhost/test/tobe.txt")
val textFile = sc.textFile("hdfs://localhost/test/*.txt")
• Basic transformations:
val squares = numbers.map(x => x * x)
val even = squares.filter(_ % 2 == 0)
val mapto = numbers.flatMap(x => 1 to x)
val words = textFile.flatMap(_.split(" ")).cache()
(Diagram: turning a collection into a base RDD, then deriving transformed RDDs from it.)
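These RDD transformations mirror the ordinary Scala collection methods from the FP tour, which makes them easy to prototype locally; here is the same pipeline on a plain List, a stand-in for sc.parallelize that needs no cluster:

```scala
// A local List plays the role of the parallelized RDD:
val numbers = List(1, 2, 3, 4, 5)
val squares = numbers.map(x => x * x)       // List(1, 4, 9, 16, 25)
val even    = squares.filter(_ % 2 == 0)    // List(4, 16)
val mapto   = numbers.flatMap(x => 1 to x)  // 1, then 1..2, then 1..3, ...
```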
Spark Quick Tour
• Basic actions:
words.collect()
words.take(5)
words.count()
words.reduce(_ + _)
words.filter(_ == "be").count()
words.filter(_ == "or").count()
words.saveAsTextFile("hdfs://localhost/test/result")
(Figure: the influence of cache.)
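The same actions can be checked on a local collection; the words below are a hypothetical stand-in for the contents of tobe.txt:

```scala
val words = "to be or not to be".split(" ").toList

val firstFive = words.take(5)           // first five words
val beCount   = words.count(_ == "be")  // 2
val orCount   = words.count(_ == "or")  // 1
val joined    = words.reduce(_ + _)     // "+" on Strings concatenates
```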
Spark Quick Tour
• Pair syntax:
val pair = (a, b)
• Accessing pair elements:
pair._1
pair._2
• Key-value operations:
val pets = sc.parallelize(List(("cat", 1), ("dog", 2), ("cat", 3)))
pets.reduceByKey(_ + _)
pets.groupByKey()
pets.sortByKey()
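What reduceByKey computes can be sketched on a local collection: group the pairs by key, then reduce each group's values (Spark additionally performs partial reduction on each partition before shuffling, which this sketch does not show):

```scala
// Local stand-in for the parallelized pets RDD:
val pets = List(("cat", 1), ("dog", 2), ("cat", 3))

// groupBy + per-group reduce plays the role of reduceByKey(_ + _):
val reduced = pets.groupBy(_._1)
                  .map { case (k, kvs) => (k, kvs.map(_._2).reduce(_ + _)) }

assert(reduced == Map("cat" -> 4, "dog" -> 2))
```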
Hello World
val logFile = "hdfs://localhost/test/tobe.txt"
val logData = sc.textFile(logFile).cache()
val wordCount = logData.flatMap(_.split(" "))
                       .map((_, 1))
                       .reduceByKey(_ + _)
wordCount.saveAsTextFile("hdfs://localhost/wordcount/result")
sc.stop()
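The same word count expressed on an in-memory collection keeps the flatMap/map/reduce-by-key shape; the input line here is a made-up stand-in for tobe.txt:

```scala
val lines = List("to be or not to be")

val wordCount = lines.flatMap(_.split(" "))
                     .map((_, 1))
                     .groupBy(_._1)  // plays reduceByKey's role locally
                     .map { case (w, ps) => (w, ps.map(_._2).sum) }
```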
Execution
Software Components
(Diagram: the Application's Spark Context connects to the Mesos Master, with ZooKeeper providing coordination; each Mesos Slave runs a Spark Executor, and all nodes access HDFS or other storage.)
Literature
Parallel Programming With Spark
Spark: Low latency, massively parallel processing framework