Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"


TRANSCRIPT

Page 1: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Page 2: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

About the speaker

Databricks:

• Company founded by the creators of Apache Spark.

• Remains the largest contributor to Spark and builds a platform that makes working with Spark easy.

• Raised over 100M USD in funding.

Myself:

• Software Engineer at Databricks.

• Previously interned at Facebook, LinkedIn, etc.

• Competitive programmer, red on TopCoder, 13th at the ACM ICPC finals.

Page 3: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Big Data - why you should care

• Data grows faster than computing power


Page 4: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Some big data use cases

• Log mining and processing.

• Recommendation systems.

• Palantir’s solution for small businesses.


Page 5: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

How it all started

• In 2004 Google published the MapReduce paper.

• In 2006 Hadoop was started, soon adopted by Yahoo.


Page 6: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

MapReduce


Page 7: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

MapReduce


Page 8: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Map

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emit (word, 1) for every token of the input line.
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

Page 9: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Reduce

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  // Sum all counts emitted for a given word and write the total.
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Page 10: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Recommender systems at LinkedIn

• Pipeline of nearly 80 individual jobs.

• Various data formats: JSON, binary JSON, Avro, etc.

• Entire pipeline took around 7 hours.

• LinkedIn used in-house solutions (e.g. Azkaban for scheduling, its own HDFS).

Page 11: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Problems

• Interactively checking data was inconvenient.

• Slow - not even close to real time.

• Problems working with some formats - as a result, an extra job was required to convert them.

• Some jobs were “one-liners” and could have been avoided.

Page 12: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

How it all started

• In 2012, Spark was created as a research project at UC Berkeley to address the shortcomings of Hadoop MapReduce.


Page 13: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

What’s Apache Spark?

Spark is a framework for doing distributed computations on a cluster.


Page 14: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Large Scale Usage

• Largest cluster: 8,000 nodes

• Largest single job: 1 petabyte

• Top streaming intake: 1 TB/hour

• 2014 on-disk 100 TB sort record: 23 minutes on 207 EC2 nodes

Page 15: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Writing Spark programs - RDD

• Resilient Distributed Dataset

• Basically a collection of data that is spread across many computers.

• Can be thought of as a list that doesn’t allow random access.

• RDDs are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (count, collect, save).

• RDDs are automatically rebuilt on machine failure.


Page 16: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Transformations (lazy):

map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), sample(), union(), intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), cartesian(), pipe(), coalesce(), repartition(), partitionBy(), ...

Page 17: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Actions:

reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByKey(), foreach(), saveToCassandra(), ...
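Transformations are lazy: calling them only records how a new RDD should be computed, and nothing runs until an action is invoked. A minimal spark-shell sketch combining a few of the operations listed above (the input strings are made up):

// Transformations: nothing is computed yet, Spark only records the lineage.
val lines = sc.parallelize(Seq("to be or not to be", "to do or not to do"))
val counts = lines
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair every word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts per word

// Actions: these trigger the actual parallel computation.
counts.count()     // number of distinct words
counts.collect()   // brings all (word, count) pairs back to the driver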

Page 18: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Writing Spark programs - RDD

scala> val rdd = sc.parallelize(List(1, 2, 3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:27

scala> rdd.count()
res1: Long = 3

Page 19: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Writing Spark programs - RDD

scala> rdd.collect()
res8: Array[Int] = Array(1, 2, 3)

scala> rdd.map(x => 2 * x).collect()
res2: Array[Int] = Array(2, 4, 6)

scala> rdd.filter(x => x % 2 == 0).collect()
res3: Array[Int] = Array(2)

Page 20: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

1) Create some input RDDs from external data or parallelize a collection in your driver program.

2) Lazily transform them to define new RDDs using transformations like filter() or map().

3) Ask Spark to cache() any intermediate RDDs that will need to be reused.

4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.

Lifecycle of a Spark Program
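A short Scala sketch of those four steps, assuming a spark-shell session (sc is the SparkContext) and a hypothetical log file path:

// 1) Create an input RDD from external data (the path is hypothetical).
val lines = sc.textFile("hdfs://.../access.log")

// 2) Lazily define new RDDs with transformations.
val errors = lines.filter(line => line.contains("ERROR"))

// 3) Cache an intermediate RDD that will be reused.
errors.cache()

// 4) Launch actions; only now is the computation optimized and executed.
val total = errors.count()
val firstTen = errors.take(10)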

Page 21: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Problem #1: Hadoop MR is verbose

text_file = sc.textFile("hdfs://...")

counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

counts.saveAsTextFile("hdfs://...")

Page 22: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Writing Spark programs: ML

from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])

# Set parameters for the algorithm.
lr = LogisticRegression(maxIter=10)

# Fit the model to the data.
model = lr.fit(df)

# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()

Page 23: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Problem #2: Hadoop MR is slow

• Spark is 10-100x faster than MR.

• Hadoop MR uses checkpointing to achieve resiliency; Spark uses lineage (see the sketch below).
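Every RDD exposes the lineage Spark tracks through toDebugString. A small sketch with made-up data:

val counts = sc.parallelize(1 to 1000)
  .map(x => (x % 10, 1))
  .reduceByKey(_ + _)

// Prints the chain of parent RDDs (the lineage). If a partition is lost,
// Spark replays just this chain rather than restoring a checkpoint.
println(counts.toDebugString)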


Page 24: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Page 25: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: DataFrame API

val users = spark.sql("select * from users")

val massUsers = users.filter(users("country") === "NL")

massUsers.count()

massUsers.groupBy("name").avg("age")

^ The filter condition users("country") === "NL" is an expression AST that Spark can analyze and optimize.

Page 26: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: DataFrame API

Page 27: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: DataFrame API

Page 28: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

• DataFrame operations are executed in Scala even if you run them in Python or R.

Page 29: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

• Project Tungsten

Page 30: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

• Project Tungsten (simple aggregation)

Page 31: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

• Query optimization (taking advantage of lazy computation)

Page 32: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Plan Optimization & Execution

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

[Diagram: logical plan - a filter on top of a join over scan(users) and scan(events)]

Page 33: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Plan Optimization & Execution

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

[Diagram: the same logical plan - the join over scan(users) and scan(events) is expensive because the filter only runs after it]

Page 34: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Plan Optimization & Execution

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

[Diagram: logical plan vs. optimized plan - the filter is pushed down below the join, directly onto scan(events)]
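You can inspect both plans yourself with explain. A hedged Scala sketch, assuming users and events DataFrames with the columns used above:

val joined = users.join(events, users("id") === events("uid"))
val filtered = joined.filter(events("date") >= "2015-01-01")

// Prints the logical and physical plans; in the optimized logical plan the
// date filter should appear below the join, as in the diagram above.
filtered.explain(true)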

Page 35: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations

In Spark 1.3:

myRdd.toDF() or myDataFrame.rdd

These convert rows containing plain Scala types into rows with Catalyst-approved types (e.g. Seq for arrays) and back.
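A small spark-shell sketch of going back and forth (the Person case class and data are made up; toDF() needs the SQL implicits in scope):

// Spark 1.3-1.6 shells predefine sqlContext; in Spark 2.x use `import spark.implicits._`.
import sqlContext.implicits._

case class Person(name: String, age: Int)

val rdd = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

val df = rdd.toDF()   // RDD[Person] -> DataFrame: fields become Catalyst-typed columns
val rows = df.rdd     // DataFrame -> RDD[Row] with plain Scala values again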


Page 36: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd

Approach:

• Construct converter functions.

• Avoid using map() etc. for operations that will be executed for each row, when possible.


Page 37: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd


Page 38: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd


Page 39: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd


Page 40: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Other Spark optimizations: toDF, rdd


Page 41: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Hands-on Spark: Analyzing Brexit tweets

• Let’s do some simple tweet analysis with Spark on Databricks.

• Try the Databricks Community Edition at databricks.com/try-databricks


Page 42: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

What Spark is used for

• Interactive analysis

• Extract, Transform, Load (ETL)

• Machine Learning

• Streaming

Page 43: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Spark Caveats

• collect()-ing a large amount of data OOMs the driver (see the sketch below).

• Avoid cartesian products in SQL (join!).

• Don’t overuse cache.

• If you’re using S3, don’t use s3n:// (use s3a://).

• Don’t use spot instances for the driver node.

• Data format matters a lot.
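For the first caveat, a tiny sketch of alternatives that keep large results off the driver (the paths are hypothetical):

val big = sc.textFile("hdfs://.../big-dataset")

// big.collect() would pull the whole dataset into the driver and can OOM it.
big.take(100)                             // look at a small sample instead
big.count()                               // or compute aggregates on the cluster
big.saveAsTextFile("hdfs://.../output")   // or write results back to distributed storage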

Page 44: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Today you’ve learned

• What Apache Spark is and what it’s used for.

• How to write simple programs in Spark: what RDDs and DataFrames are.

• Optimizations in Apache Spark.


Page 45: Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Thank you for your attention! (Дзякуй за ўвагу!)

Volodymyr Lyubinets, [email protected], April 2017