Wisely Chen: Spark talk at Spark Gathering in Taiwan
TRANSCRIPT
SparkSQL and Parquet
Wisely Chen
Data Tech Lead at Appier
Agenda
• Introduce me and Appier
• How do we build our pipeline?
• Why do we use SparkSQL + HDFS?
• Why do we use Parquet?
Who am I?
• Data Team Lead at Appier
• Spark Code Contributor
• Personal Email: [email protected]
• Speaker at
• Spark Summit 2014 SF
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
What is Appier?
• AI and Data Company
• Mission: make advertising the preferred content that connects businesses and users
• Backed by Sequoia Capital
Data Team in Appier
• Deal with Perabyte per day
• Handle a 2K~3K-core cluster on AWS
• Build and maintain a robust data pipeline
• Data correctness is a must
• Part of the pipeline needs < 1 min latency
• Total infrastructure cost must stay low
How do we do that?
Architecture
Log, Kafka, Spark Streaming
ETL, S3
HDFS
Parquet, SparkSQL
ML
Heavy Spark User
• ML: Custom Spark Application (no MLlib)
• ETL: Spark Application
• SQL: SparkSQL + Parquet
• Streaming: Spark Streaming + Kafka
Why Spark?
• We love Spark and are familiar with it
• Appier contributed >10 commits in the last quarter
• Perfect for ML applications
• A general-purpose engine for every kind of usage
• You don’t have to learn a lot of big data terms
Why SQL is important?
Before SparkSQL
5 engineers coding Scala
After SparkSQL
All engineers can get involved in data projects
Data analysts can query data on their own
User Interface
SQL + TimeRange
File Util SQL Engine
File List
HDFS
Parquet
S3
Parquet
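The "File Util" box above turns a SQL query plus a time range into a list of Parquet files for the SQL engine to read. A minimal sketch of that idea follows, assuming one directory per hour; the base path, the `dt=`/`hour=` layout, and the function name `list_hourly_paths` are hypothetical and are not taken from the slides.

```python
from datetime import datetime, timedelta

# Hypothetical layout: one Parquet directory per hour.
# The real paths used at Appier are not given in the slides.
BASE = "hdfs:///data/events"

def list_hourly_paths(start, end, base=BASE):
    """Return one directory path per hour in the half-open range [start, end)."""
    # Round the start down to the beginning of its hour.
    current = start.replace(minute=0, second=0, microsecond=0)
    paths = []
    while current < end:
        paths.append(f"{base}/dt={current:%Y-%m-%d}/hour={current:%H}")
        current += timedelta(hours=1)
    return paths

# Three hours that cross a day boundary.
paths = list_hourly_paths(datetime(2015, 3, 1, 22), datetime(2015, 3, 2, 1))
for p in paths:
    print(p)
```

The resulting list could then be handed to whatever reads the Parquet files, so a query only touches the hours it asks for.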
Why SparkSQL?
• We know Spark
• Knowledge of tuning Spark applications can be reused in SparkSQL
• Any table/UDF defined in a SparkSQL application can be reused in an ML application
• SparkSQL and DataFrame will become more important in the Spark ecosystem
Which storage is best for SparkSQL in Appier?
We tried Cassandra
• Pros
• Easy to use and to build applications on
• Easy to scale up
• Hides all the heavy stuff inside the platform
• Cons
• Not so easy to maintain
• Not so easy to tune performance
• Hides all the heavy stuff inside the platform
We tried AeroSpike
• Pros
• Very good performance
• Easy to maintain
• Easy to scale
• Hides all the heavy stuff inside the platform, but better implemented
• Cons
• Expensive!!!!!!
HDFS + File
• Pros
• Low cost
• Good read and write performance on big data
• HDFS is very stable
• We know all the details
• Easy to scale up
• Cons
• We have to implement all the details
• We have to implement all the maintenance scripts
Why did we give up on AeroSpike?
• Cost is too high
• We prefer to spend money on people rather than machines
Why did we give up on Cassandra?
• We are not familiar with Cassandra (main reason)
• Very easy to implement a POC
• Saves a lot of effort in the starting phase
• We find it hard to maintain in later phases (again: we are not familiar with Cassandra)
Why do we use HDFS/File?
• Cost is low
• Implementation takes a lot of time
• A solid engineering team is not afraid of this
• We can control every detail
• We can build up a maintainable platform
The main reason
• We love Spark
• I have worked with HDFS before, but I have come to love HDFS these days
Why Parquet?
What is Parquet?
• Based on the Google Dremel paper
• Columnar storage format
• Supports nested data structures (List, Map, ...)
• Supports Protobuf/Thrift/JSON
Column Storage
Column Pruning
ID  Name     Age
1   Alice    23
2   Beverly  32
3   Cate     15
Select Name From xxx

Predicate Pushdown
ID  Name     Age
1   Alice    23
2   Beverly  32
3   Cate     15
Select * From xxx where Age > 20
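The two queries above can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not how Parquet or SparkSQL implement it: the split into two row groups and the use of `max()` in place of stored min/max statistics are assumptions made for illustration.

```python
# Columnar layout: one list per column, split into row groups.
# The grouping below is an assumption; the slide shows a single table.
row_groups = [
    {"ID": [1, 2], "Name": ["Alice", "Beverly"], "Age": [23, 32]},
    {"ID": [3], "Name": ["Cate"], "Age": [15]},
]

def select_column(groups, column):
    """Column pruning: read only the requested column, ignore the others."""
    return [value for group in groups for value in group[column]]

def select_where_greater(groups, column, threshold):
    """Predicate pushdown: skip whole row groups that cannot match."""
    rows, skipped = [], 0
    for group in groups:
        # max() stands in for the per-group statistics a real file would store.
        if max(group[column]) <= threshold:
            skipped += 1
            continue
        for i, value in enumerate(group[column]):
            if value > threshold:
                rows.append({name: values[i] for name, values in group.items()})
    return rows, skipped

names = select_column(row_groups, "Name")                 # Select Name From xxx
matches, skipped_groups = select_where_greater(row_groups, "Age", 20)
print(names)
print(matches, skipped_groups)
```

In this toy data the second row group is skipped entirely for `Age > 20`, which is the saving predicate pushdown is meant to provide.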
Different Encoding
Encoding Algo          Use Case
Run Length Encoding    Repeated Data
Delta Encoding         Sequence Data with order (Timestamp, auto create id…)
Dictionary Encoding    Small scale data set (IP…)
Prefix Encoding        Delta Encoding for strings
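To make the first three encodings concrete, here is a small sketch in plain Python. It shows the general techniques only; the function names are hypothetical and the actual on-disk formats used by Parquet are more involved (bit packing, block layout, and so on).

```python
from itertools import groupby

def run_length_encode(values):
    """Repeated data: store each run as (value, run length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def delta_encode(values):
    """Ordered sequences: store the first value, then the differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def dictionary_encode(values):
    """Few distinct values: store each distinct value once, plus small codes."""
    dictionary = {}
    codes = []
    for value in values:
        # Assign the next free code the first time a value is seen.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes

rle = run_length_encode(["a", "a", "a", "b", "b"])
deltas = delta_encode([1000, 1001, 1003, 1006])
ip_dictionary, ip_codes = dictionary_encode(["10.0.0.1", "10.0.0.2", "10.0.0.1"])
print(rle)
print(deltas)
print(ip_dictionary, ip_codes)
```

Because a column holds values of one type that often look alike, each column can use whichever of these encodings fits its data best.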
Storage
Language Independent
The real reason is
• SparkSQL treats Parquet/JSON as first-class citizens
• ORC and RCFile are not on their plan
• Parquet performs well in every aspect
Good lessons we learned
• File (Parquet) is better storage than any DB
• Easy to back up and replicate
• Easy to change storage solutions
• Easy to debug
• Easy to maintain
Conclusion
• Spark Spark Spark
• SparkSQL + Parquet is a very good combination
• Don’t trust any solution / service. Don’t put any critical service on a platform you don’t trust
• A solid team can do anything you want
We are hiring