EFFECTIVE SPARK WITH ALLUXIO Strata + Hadoop World San Jose 2017 Calvin Jia, Alluxio Inc.

Upload: alluxio-inc

Post on 05-Apr-2017


Page 1: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017

EFFECTIVE SPARK WITH ALLUXIO

Strata + Hadoop World San Jose 2017

Calvin Jia, Alluxio Inc.

Page 2

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation

Page 3

BIG DATA ECOSYSTEM YESTERDAY

BIG DATA ECOSYSTEM TODAY

BIG DATA ECOSYSTEM ISSUES

BIG DATA ECOSYSTEM WITH ALLUXIO

Unifying Data at Memory Speed

[Diagram labels: FUSE Compatible File System, Hadoop Compatible File System, Native Key-Value Interface, Native File System]

[Diagram labels: GlusterFS Interface, Amazon S3 Interface, Swift Interface, HDFS Interface]

Page 4

FASTEST GROWING BIG DATA PROJECTS

• Fastest growing open-source project in the big data ecosystem

• 400+ contributors from 100+ organizations

• Running in large production clusters

• Community members are welcome!

[Chart: Popular Open Source Projects’ Growth; x-axis: Months; y-axis: Number of Contributors]

Page 5

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation

Page 6

ACCELERATE I/O TO/FROM REMOTE STORAGE

The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.

- Baidu

RESULTS

• Data queries are now 30x faster with Alluxio

• Alluxio cluster runs stably, providing over 50TB of RAM space

• By using Alluxio, batch queries that usually lasted over 15 minutes became interactive queries taking less than 30 seconds
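
The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch using the rounded bounds quoted above (15 minutes, 30 seconds, and the 100-150 s versus 10-15 s ranges in the Baidu quote); the variable names are illustrative and do not come from the talk.

```python
# Bounds quoted on the slide; real query times vary around these values.
batch_seconds = 15 * 60       # "over 15 minutes"
interactive_seconds = 30      # "less than 30 seconds"

# Ratio at the quoted bounds, which lines up with the "30x faster" bullet.
speedup_at_bounds = batch_seconds / interactive_seconds

# The quote gives ranges: 100-150 s with Spark SQL alone, 10-15 s with Alluxio.
# Pairing the low ends and the high ends gives the same ratio.
quote_speedup_low = 100 / 10
quote_speedup_high = 150 / 15

print(speedup_at_bounds, quote_speedup_low, quote_speedup_high)
```

At the quoted bounds the batch-to-interactive case works out to 30x, while the ranges in the quote correspond to roughly 10x, so the two figures describe different workloads rather than contradicting each other.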

Baidu’s PMs and analysts run interactive queries to gain insights into their products and business

• 200+ nodes deployment

• 2+ petabytes of storage

• Mix of memory + HDD

[Diagram labels: ALLUXIO, Baidu File System]

Page 7

GUARANTEE PERFORMANCE TO MEET SLAS

RESULTS

• Introducing Alluxio provides the system with smooth performance in an environment with highly variable load

• Without Alluxio, critical jobs could take more than 4x the expected time

• With Alluxio’s guaranteed performance, business logic could be reliably executed, leading to improved user experience

A leading online retailer uses Alluxio to compute conversion attribution

• Real-time machine learning

• 10+ TB of storage

• Mix of Memory + HDD

[Diagram label: ALLUXIO]

Page 8

SHARE DATA ACROSS JOBS @ MEMORY SPEED

Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.

- Barclays

RESULTS

• Barclays workflow iteration time decreased from hours to seconds

• Alluxio enabled workflows that were impossible before

• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds

Barclays uses query and machine learning to train models for risk management

• 6 node deployment

• 1TB of storage

• Memory only

[Diagram labels: ALLUXIO, Relational Database]

Page 9

TRANSPARENTLY MANAGE DATA ACROSS STORAGE SYSTEMS

We’ve been running Alluxio in production for over 9 months. Alluxio’s unified namespace enables different applications and frameworks to easily interact with data from different storage systems.

- Qunar

RESULTS

• Data is shared efficiently among Spark Streaming, Spark batch and Flink jobs

• Improved the performance of their system with 15x – 300x speedups

• Tiered storage feature manages storage resources including memory, SSD and disk

Qunar uses real-time machine learning for their website ads.

• 200+ nodes deployment

• 6 billion logs (4.5 TB) daily

• Mix of Memory + HDD

[Diagram label: ALLUXIO]

Page 10

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation

Page 11

CONSOLIDATING MEMORY

Storage Engine & Execution Engine: Same Process

• Two copies of data in memory – double the memory used

• Inter-process Sharing Slowed Down by Network / Disk I/O

[Diagram: two Spark processes, each with Spark Compute and Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holding blocks 1–4]

Page 12

CONSOLIDATING MEMORY

Storage Engine & Execution Engine: Different Process

• Half the memory used

• Inter-process Sharing Happens at Memory Speed

[Diagram: two Spark processes, each with Spark Compute and Spark Storage; Alluxio holding block 1, block 3 and block 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1–4]

Page 13

DATA RESILIENCE DURING CRASH

Storage Engine & Execution Engine: Same Process

[Diagram: Spark Compute and Spark Storage (block 1, block 3); HDFS / Amazon S3 holding blocks 1–4]

Page 14

DATA RESILIENCE DURING CRASH

Storage Engine & Execution Engine: Same Process

• Process Crash Requires Network and/or Disk I/O to Re-read Data

[Diagram: CRASH; Spark Storage (block 1, block 3); HDFS / Amazon S3 holding blocks 1–4]

Page 15

DATA RESILIENCE DURING CRASH

Storage Engine & Execution Engine: Same Process

• Process Crash Requires Network and/or Disk I/O to Re-read Data

[Diagram: CRASH; HDFS / Amazon S3 holding blocks 1–4]

Page 16

DATA RESILIENCE DURING CRASH

Storage Engine & Execution Engine: Different Process

[Diagram: Spark Compute and Spark Storage; Alluxio holding block 1, block 3 and block 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1–4]

Page 17

DATA RESILIENCE DURING CRASH

Storage Engine & Execution Engine: Different Process

• Process Crash – Data is Re-read at Memory Speed

[Diagram: CRASH; Alluxio holding block 1, block 3 and block 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1–4]

Page 18

ACCESSING ALLUXIO DATA FROM SPARK

Writing Data: write to an Alluxio file

Reading Data: read from an Alluxio file

Page 19

CODE EXAMPLE FOR SPARK RDDS

Writing RDD to Alluxio

rdd.saveAsTextFile(alluxioPath)

rdd.saveAsObjectFile(alluxioPath)

Reading RDD from Alluxio

rdd = sc.textFile(alluxioPath)

rdd = sc.objectFile(alluxioPath)
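
The RDD calls on this slide can be put together as a small PySpark sketch. It assumes an existing SparkContext passed in as `sc`, an Alluxio master reachable at a placeholder host on port 19998 (Alluxio's default master port), and the Alluxio client jar on Spark's classpath. The helper names `alluxio_uri`, `round_trip_text` and `round_trip_pickle` are hypothetical and not part of Spark or Alluxio. Note that `saveAsObjectFile` and `objectFile` belong to the Scala and Java RDD API; in PySpark the nearest counterparts are `saveAsPickleFile` and `sc.pickleFile`, which the sketch uses instead.

```python
def alluxio_uri(path, host="localhost", port=19998):
    """Build an alluxio:// URI. Host and port are placeholders for the Alluxio master."""
    return "alluxio://{}:{}/{}".format(host, port, path.lstrip("/"))


def round_trip_text(sc, lines, path):
    """Write lines to Alluxio as a text file, then read them back as a new RDD."""
    uri = alluxio_uri(path)
    sc.parallelize(lines).saveAsTextFile(uri)  # raises if the output path already exists
    return sc.textFile(uri)


def round_trip_pickle(sc, records, path):
    """Same round trip, storing serialized Python objects instead of text lines."""
    uri = alluxio_uri(path)
    sc.parallelize(records).saveAsPickleFile(uri)
    return sc.pickleFile(uri)
```

With Spark and Alluxio running, `round_trip_text(sc, ["a", "b"], "/demo/lines").collect()` would return the stored lines; the functions only need the URI scheme to change in order to target a different storage system.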

Page 20

CODE EXAMPLE FOR SPARK DATAFRAMES

Writing to Alluxio

df.write.parquet(alluxioPath)

Reading from Alluxio

df = spark.read.parquet(alluxioPath)
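
The DataFrame calls on this slide can be sketched in PySpark as follows. The sketch assumes a Spark 2.x `SparkSession` passed in as `spark` (the entry point that provides `read`), a placeholder Alluxio master address on the default port 19998, and the Alluxio client jar on Spark's classpath. The helper names are hypothetical, and the overwrite mode is a choice made for this example rather than something shown in the talk.

```python
def alluxio_uri(path, host="localhost", port=19998):
    """Build an alluxio:// URI. Host and port are placeholders for the Alluxio master."""
    return "alluxio://{}:{}/{}".format(host, port, path.lstrip("/"))


def save_parquet(df, path):
    """Write a DataFrame to Alluxio as Parquet, replacing any existing output."""
    uri = alluxio_uri(path)
    df.write.mode("overwrite").parquet(uri)
    return uri


def load_parquet(spark, path):
    """Read a Parquet dataset from Alluxio into a new DataFrame."""
    return spark.read.parquet(alluxio_uri(path))
```

A second job, or a fresh session after a restart, can call `load_parquet(spark, "/demo/table")` on the same path to pick up data written earlier by `save_parquet`.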

Page 21

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation

Page 22

ENVIRONMENT

Spark 2.0.0 + Alluxio 1.2.0

Single worker: Amazon r3.2xlarge

Comparisons:

• Alluxio

• Spark Storage Level: MEMORY_ONLY

• Spark Storage Level: MEMORY_ONLY_SER

• Spark Storage Level: DISK_ONLY

Page 23

READING CACHED RDD

[Chart: time in seconds (0–250) against RDD size in GB (0–50); series: Alluxio (textFile), Alluxio (objectFile), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_ONLY]

Page 24

NEW CONTEXT: READ 50 GB RDD (SSD)

[Bar chart: time in seconds (axis 0–250); bars: Alluxio (textFile), Alluxio (objectFile), No Alluxio]

Page 25

NEW CONTEXT: READ 50 GB RDD (S3)

[Bar chart: time in seconds (axis 0–800); bars: Alluxio (textFile), Alluxio (objectFile), No Alluxio; annotations: 7x speedup, 16x speedup]

Page 26

READING CACHED DATAFRAME (PARQUET)

[Chart: time in seconds (0–250) against DataFrame size in GB (0–50); series: Alluxio (textFile), MEMORY_ONLY_SER, MEMORY_ONLY]

Page 27

NEW CONTEXT: READ 50 GB DATAFRAME (SSD)

[Bar chart: time in seconds (axis 0–250); bars: Alluxio, No Alluxio]

Page 28

NEW CONTEXT: READ 50 GB DATAFRAME (S3)

[Bar chart: time in seconds (axis 0–1750); bars: Alluxio, No Alluxio; annotation: 10x average speedup, 17x peak speedup]

Page 29

CONCLUSION

• Easy to use with Spark

• Predictable and improved performance

• Easily connects to various storage systems

Page 30

Contact: [email protected]

Twitter: @jiacalvin

Websites: www.alluxio.com and www.alluxio.org

Thank you!


Page 31

BENEFITS

Unification: New workflows across any data in any storage system

Performance: Orders of magnitude improvement in run time

Flexibility: Choice in compute and storage – grow each independently, buy only what is needed