effective spark with alluxio at strata+hadoop world san jose 2017

EFFECTIV E SPAR K WITH ALLUXIO

Strata + Hadoop World San Jose 2017Calvin Jia, Alluxio Inc.

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation

BIG DATA ECOSYSTEM TODAYBIG DATA ECOSYSTEM WITH ALLUXIO

FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System

Unifying Data at Memory Speed

BIG DATA ECOSYSTEM ISSUES

GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface

BIG DATA ECOSYSTEM YESTERDAY

• Fastest growing open-source project in the big data ecosystem

• 400+ contributors from 100+ organizations

• Running in large production clusters

• Community members are welcome!

FASTEST GROWING BIG DATA PROJECTS

Popular Open Source Projects’ Growth

Months

OUTLINE

ACCELERATE I/O TO/FROM REMOTE STORAGE

The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.

- Baidu

RESULTS

• Data queries are now 30x faster with Alluxio

• Alluxio cluster runs stably, providing over 50TB of RAM space

• By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds

Baidu’s PMs and analysts run

interactive queries to gain insights

into their products and business

• 200+ nodes deployment

• 2+ petabytes of storage

• Mix of memory + HDD

ALLUXIO

Baidu File System

GUARANTEE PERFORMANCE TO MEET SLAS

RESULTS

• Introducing Alluxio provides the system with smooth performance in an environment with highly variable load

• Without Alluxio, the performance of critical jobs could take more than 4x the expected time

• With Alluxio’s guaranteed performance, business logic could be reliably be executed leading to improved user experience

• Real-time machine learning

• 10+ TB of storage

• Mix of Memory + HDD

A leading online retailer uses

Alluxio to compute conversion

attribution

ALLUXIO

SHARE DATA ACROSS JOBS @ MEMORY SPEED

Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.

- Barclays

RESULTS

• Barclays workflow iteration time decreased from hours to seconds

• Alluxio enabled workflows that were impossible before

• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds

Barclays uses query and machine

learning to train models for risk

management

• 6 node deployment

• 1TB of storage

• Memory only

ALLUXIO

Relational Database

TRANSPARENTLY MANAGE DATA ACROSS STORAGE SYSTEMS

We’ve been running Alluxio in production for over 9 months, Alluxio’s unified namespace enable different applications and frameworks to easily interact with data from different storage systems.

- Qunar

RESULTS

• Data sharing among Spark Streaming, Spark batch and Flink jobs provide efficient data sharing

• Improved the performance of their system with 15x – 300x speedups

• Tiered storage feature manages storage resources including memory, SSD and disk

• 200+ nodes deployment

• 6 billion logs (4.5 TB) daily

• Mix of Memory + HDD

ALLUXIO

Qunar uses real-time machine

learning for their website ads.

OUTLINE

CONSOLIDATING MEMORY

Storage Engine & Execution EngineSame Process

• Two copies of data in memory – double the memory used• Inter-process Sharing Slowed Down by Network / Disk I/O

Spark Compute

Spark Storage

block 1

block 3

HDFS / Amazon S3block 1

block 3

block 2

block 4

Spark Compute

Spark Storage

block 1

block 3

CONSOLIDATING MEMORY

Storage Engine & Execution EngineDifferent process

• Half the memory used• Inter-process Sharing Happens at Memory Speed

Spark Compute

Spark Storage

block 3

block 2

block 4

HDFSdisk

block 1

block 3

block 2

block 4Alluxio

block 1

block 3 block 4

Spark Compute

Spark Storage

DATA RESILIENCE DURING CRASH

Spark Compute

Spark Storageblock 1

block 3

block 2

block 4

Spark Storageblock 1

block 3

block 2

block 4

• Process Crash Requires Network and/or Disk I/O to Re-read Data

block 3

block 2

block 4

• Process Crash Requires Network and/or Disk I/O to Re-read Data

Spark Compute

Spark Storage

block 3

block 2

block 4

HDFSdisk

block 1

block 3

block 2

block 4Alluxio

block 1

block 3 block 4

Storage Engine & Execution EngineDifferent process

• Process Crash – Data is Re-read at Memory Speed

block 3

block 2

block 4

HDFSdisk

block 1

block 3

block 2

block 4Alluxio

block 1

block 3 block 4

CRASH Storage Engine & Execution EngineDifferent process

ACCESSING ALLUXIO DATA FROM SPARK

Writing Data Write to an Alluxio file

Reading Data Read from an Alluxio file

CODE EXAMPLE FOR SPARK RDDS

Writing RDD to Alluxiordd.saveAsTextFile(alluxioPath)rdd.saveAsObjectFile(alluxioPath)

Reading RDD from

Alluxiordd = sc.textFile(alluxioPath)rdd = sc.objectFile(alluxioPath)

CODE EXAMPLE FOR SPARK DATAFRAMES

Writing to Alluxio df.write.parquet(alluxioPath)

Reading from Alluxio df = sc.read.parquet(alluxioPath)

OUTLINE

ENVIRONMENT

Spark 2.0.0 + Alluxio 1.2.0

Single worker: Amazon r3.2xlarge

Comparisons:• Alluxio• Spark Storage Level: MEMORY_ONLY• Spark Storage Level: MEMORY_ONLY_SER• Spark Storage Level: DISK_ONLY

0 10 20 30 40 50

RDD Size [GB]

READING CACHED RDD

Alluxio (textFile) Alluxio (objectFile) DISK_ONLY

MEMORY_ONLY_SER MEMORY_ONLY

0 50 100 150 200 250

Alluxio(textFile)

Alluxio(objectFile)

No Alluxio

Time [seconds]

NEW CONTEXT: READ 50 GB RDD (SSD)

0 100 200 300 400 500 600 700 800

Alluxio(textFile)

Alluxio(objectFile)

No Alluxio

Time [seconds]

NEW CONTEXT: READ 50 GB RDD (S3)

7x speedup

16x speedup

0 10 20 30 40 50

DataFrame Size [GB]

READING CACHED DATAFRAME (PARQUET)

Alluxio (textFile) MEMORY_ONLY_SER MEMORY_ONLY

0 50 100 150 200 250

Alluxio

No Alluxio

Time [seconds]

NEW CONTEXT: READ 50 GB DATAFRAME (SSD)

0 250 500 750 1000 1250 1500 1750

Alluxio

No Alluxio

Time [seconds]

NEW CONTEXT: READ 50 GB DATAFRAME(S3)

10x average speedup, 17x peak speedup

CONCLUSION

• Easy to use with Spark

• Predictable and improved performance

• Easily connect to various storages

Contact: calvin@alluxio.com

Twitter: @jiacalvin

Websites: www.alluxio.com and www.alluxio.org

Thank you!

Unification

New workflows across any data in any storage system

Orders of magnitude improvement in run time

Choice in compute and storage – grow each independently, buy only what is needed

Performance Flexibility

BENEFITS

effective spark with alluxio at strata+hadoop world san jose 2017

Technology

apache flink@ strata & hadoop world london

strata hadoop world 2015 context computing - jonas keynote...

presto: distributed sql on anything - strata hadoop 2017...

strata+hadoop 2015 nyc end user panel on real-time data...

strata + hadoop world 2012: apache hbase features for the...

building better analytics workflows (strata-hadoop world...

apache eagle strata hadoop world london 2016

strata + hadoop world: jump into the data lake with...

accelerating machine learning pipelines with alluxio at...

forecasting space-time events - strata + hadoop world 2015...

adversarial analytics - 2013 strata & hadoop world talk

sqoop2 refactoring for generic data transfer - hadoop strata...

evaluation of the suitability of alluxio for hadoop...

marklogic and hadoop - strata + hadoop world 2014

alluxio keynote at strata+hadoop world beijing 2016

strata + hadoop world 2012: data science on hadoop: how...

behavioral analytics with smartphone data. talk at strata +...

strata+hadoop world ny 2016 - avinash ramineni

hadoop adventures at spotify (strata conference + hadoop...

hadoop strata talk - uber, your hadoop has arrived