Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"


TRANSCRIPT

Page 1: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Page 2: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

About the speaker

Databricks:

• Company founded by the creators of Apache Spark.

• Remains the largest contributor to Spark and builds a platform that makes working with Spark easy.

• Raised over 100M USD in funding.

Myself:

• Software Engineer at Databricks.

• Previously interned at Facebook, LinkedIn, etc.

• Competitive programmer, red on TopCoder, 13th at the ACM ICPC finals.

Page 3: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Big Data - why you should care

• Data grows faster than computing power


Page 4: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Some big data use cases

• Log mining and processing.

• Recommendation systems.

• Palantir’s solution for small businesses.


Page 5: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

How it all started

• In 2004 Google published the MapReduce paper.

• In 2006 Hadoop was started, soon adopted by Yahoo.


Page 6: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

MapReduce


Page 7: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

MapReduce


Page 8: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Map

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emit (word, 1) for every token of the input line.
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

Page 9: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Reduce

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  // Sum all counts emitted for a given word and write the total.
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Page 10: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Recommender systems at LinkedIn

• Pipeline of nearly 80 individual jobs.

• Various data formats: JSON, binary JSON, Avro, etc.

• Entire pipeline took around 7 hours.

• LinkedIn used in-house solutions (e.g. Azkaban for scheduling, its own HDFS).

Page 11: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Problems

• Interactively checking data was inconvenient.

• Slow - not even close to real time.

• Problems working with some formats - as a result, an extra job was required to convert them.

• Some jobs were “one-liners” and could have been avoided.

Page 12: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

How it all started

• In 2012, Spark was created as a research project at UC Berkeley to address the shortcomings of Hadoop MapReduce.


Page 13: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

What’s Apache Spark?

Spark is a framework for doing distributed computations on a cluster.


Page 14: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Large Scale Usage

• Largest cluster: 8,000 nodes

• Largest single job: 1 petabyte

• Top streaming intake: 1 TB/hour

• 2014 on-disk 100 TB sort record: 23 minutes on 207 EC2 nodes

Page 15: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Writing Spark programs - RDD

• Resilient Distributed Dataset

• Basically a collection of data that is spread across many computers.

• Can be thought of as a list that doesn’t allow random access.

• RDDs are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (count, collect, save).

• RDDs are automatically rebuilt on machine failure.


Page 16: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Transformations (lazy):

map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), sample(), union(), intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), cartesian(), pipe(), coalesce(), repartition(), partitionBy(), ...

Page 17: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Actions:

reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByKey(), foreach(), saveToCassandra(), ...
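Transformations are lazy: calling them only records how a new RDD should be computed, and nothing runs until an action is invoked. A minimal spark-shell sketch combining a few of the operations listed above (the input strings are made up):

// Transformations: nothing is computed yet, Spark only records the lineage.
val lines = sc.parallelize(Seq("to be or not to be", "to do or not to do"))
val counts = lines
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair every word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts per word

// Actions: these trigger the actual parallel computation.
counts.count()     // number of distinct words
counts.collect()   // brings all (word, count) pairs back to the driver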

Page 18: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Writing Spark programs - RDD

scala> val rdd = sc.parallelize(List(1, 2, 3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:27

scala> rdd.count()
res1: Long = 3

Page 19: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Writing Spark programs - RDD

scala> rdd.collect()
res8: Array[Int] = Array(1, 2, 3)

scala> rdd.map(x => 2 * x).collect()
res2: Array[Int] = Array(2, 4, 6)

scala> rdd.filter(x => x % 2 == 0).collect()
res3: Array[Int] = Array(2)

Page 20: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

1) Create some input RDDs from external data or parallelize a collection in your driver program.

2) Lazily transform them to define new RDDs using transformations like filter() or map().

3) Ask Spark to cache() any intermediate RDDs that will need to be reused.

4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.

Lifecycle of a Spark Program
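A short Scala sketch of those four steps, assuming a spark-shell session (sc is the SparkContext) and a hypothetical log file path:

// 1) Create an input RDD from external data (the path is hypothetical).
val lines = sc.textFile("hdfs://.../access.log")

// 2) Lazily define new RDDs with transformations.
val errors = lines.filter(line => line.contains("ERROR"))

// 3) Cache an intermediate RDD that will be reused.
errors.cache()

// 4) Launch actions; only now is the computation optimized and executed.
val total = errors.count()
val firstTen = errors.take(10)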

Page 21: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Problem #1: Hadoop MR is verbose

text_file = sc.textFile("hdfs://...")

counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

counts.saveAsTextFile("hdfs://...")

Page 22: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Writing Spark programs: ML

from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])

# Set parameters for the algorithm.
lr = LogisticRegression(maxIter=10)

# Fit the model to the data.
model = lr.fit(df)

# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()

Page 23: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Problem #2: Hadoop MR is slow

• Spark is 10-100x faster than MR.

• Hadoop MR uses checkpointing to achieve resiliency; Spark uses lineage (see the sketch below).
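Every RDD exposes the lineage Spark tracks through toDebugString. A small sketch with made-up data:

val counts = sc.parallelize(1 to 1000)
  .map(x => (x % 10, 1))
  .reduceByKey(_ + _)

// Prints the chain of parent RDDs (the lineage). If a partition is lost,
// Spark replays just this chain rather than restoring a checkpoint.
println(counts.toDebugString)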


Page 24: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Page 25: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: DataFrame API

val users = spark.sql("select * from users")

val massUsers = users.filter(users("country") === "NL")

massUsers.count()

massUsers.groupBy("name").avg("age")

^ The filter condition users("country") === "NL" is an expression AST that Spark can analyze and optimize.

Page 26: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: DataFrame API

Page 27: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: DataFrame API

Page 28: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

• DataFrame operations are executed in Scala even if you run them in Python or R.

Page 29: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

• Project Tungsten

Page 30: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

• Project Tungsten (simple aggregation)

Page 31: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

• Query optimization (taking advantage of lazy computation)

Page 32: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Plan Optimization & Execution

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

[Diagram: logical plan - a filter on top of a join over scan(users) and scan(events)]

Page 33: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Plan Optimization & Execution

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

[Diagram: the same logical plan - the join over scan(users) and scan(events) is expensive because the filter only runs after it]

Page 34: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Plan Optimization & Execution

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

[Diagram: logical plan vs. optimized plan - the filter is pushed down below the join, directly onto scan(events)]
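You can inspect both plans yourself with explain. A hedged Scala sketch, assuming users and events DataFrames with the columns used above:

val joined = users.join(events, users("id") === events("uid"))
val filtered = joined.filter(events("date") >= "2015-01-01")

// Prints the logical and physical plans; in the optimized logical plan the
// date filter should appear below the join, as in the diagram above.
filtered.explain(true)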

Page 35: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

In Spark 1.3:

myRdd.toDF() or myDataFrame.rdd

These convert rows containing plain Scala types into rows with Catalyst-approved types (e.g. Seq for arrays) and back.
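A small spark-shell sketch of going back and forth (the Person case class and data are made up; toDF() needs the SQL implicits in scope):

// Spark 1.3-1.6 shells predefine sqlContext; in Spark 2.x use `import spark.implicits._`.
import sqlContext.implicits._

case class Person(name: String, age: Int)

val rdd = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

val df = rdd.toDF()   // RDD[Person] -> DataFrame: fields become Catalyst-typed columns
val rows = df.rdd     // DataFrame -> RDD[Row] with plain Scala values again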


Page 36: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd

Approach:

• Construct converter functions.

• Avoid using map() etc. for operations that will be executed for each row, when possible.


Page 37: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd


Page 38: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd


Page 39: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd


Page 40: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd


Page 41: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Hands-on Spark: Analyzing Brexit tweets

• Let’s do some simple tweet analysis with Spark on Databricks.

• Try the Databricks Community Edition at databricks.com/try-databricks


Page 42: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

What Spark is used for

• Interactive analysis

• Extract, Transform, Load (ETL)

• Machine Learning

• Streaming

Page 43: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Spark Caveats

• collect()-ing a large amount of data OOMs the driver (see the sketch below).

• Avoid cartesian products in SQL (join!).

• Don’t overuse cache.

• If you’re using S3, don’t use s3n:// (use s3a://).

• Don’t use spot instances for the driver node.

• Data format matters a lot.
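For the first caveat, a tiny sketch of alternatives that keep large results off the driver (the paths are hypothetical):

val big = sc.textFile("hdfs://.../big-dataset")

// big.collect() would pull the whole dataset into the driver and can OOM it.
big.take(100)                             // look at a small sample instead
big.count()                               // or compute aggregates on the cluster
big.saveAsTextFile("hdfs://.../output")   // or write results back to distributed storage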

Page 44: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Today you’ve learned

• What Apache Spark is and what it’s used for.

• How to write simple programs in Spark: what RDDs and DataFrames are.

• Optimizations in Apache Spark.


Page 45: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Thank you for your attention! (Дзякуй за ўвагу!)

Volodymyr Lyubinets, [email protected], April 2017