Data Storage Tips for Optimal Spark Performance
Vida Ha, Spark Summit West 2015
Today’s Talk

About Me
• Vida Ha - Solutions Engineer at Databricks

Poor data file storage choices result in:
• Exceptions that are difficult to diagnose and fix.
• Slow Spark jobs.
• Inaccurate data analysis.
Goal: Understand the best practices for storing and working with data in files with Spark.
Agenda
• Basic Data Storage Decisions
  • File Sizes
  • Compression Formats
• Data Format Best Practices
  • Plain Text, Structured Text, Data Interchange, Columnar
• General Tips
Basic Data Storage Decisions
Choosing a File Size
• Too small: lots of time spent opening file handles.
• Too big: files need to be splittable.
• Optimize for your file system.
  • A good rule of thumb: 64 MB - 1 GB.
Choosing a Compression Format
• The obvious: minimize the compressed size of files.
  • Also consider decoding characteristics.
• Pay attention to splittable vs. non-splittable formats.
• Common compression formats for big data: gzip, Snappy, bzip2, LZO, and LZ4.
• Columnar formats for structured data: Parquet.
Data Format Best Practices
Plain Text
• sc.textFile() splits the file into lines.
  • So keep your lines a reasonable size.
  • Or use a different method to read the data in.
• Keep file sizes < 1 GB if compressed with a non-splittable format.
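A minimal PySpark sketch of line-oriented reading (the path here is hypothetical):

```python
# Each element of the RDD is one line of the file.
lines = sc.textFile("s3n://my-bucket/logs/*.txt")  # hypothetical path

# A classic line-based job: word count.
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
```

If individual lines are huge, line-based reading is a poor fit; sc.wholeTextFiles() (covered in the General Tips) reads each file as a single record instead.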
Basic Structured Text Files
• Includes CSV, JSON, and XML.
• Use Spark transformations to ETL the data.
• Optionally use Spark SQL for analyzing.
• Handle the inevitable malformed data with Spark:
  • Use a filter transformation to drop bad lines.
  • Or use a map function to fix bad lines (both are sketched below).
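A sketch of both approaches, assuming a comma-delimited file with a hypothetical five-field schema:

```python
raw = sc.textFile("s3n://my-bucket/data.csv")  # hypothetical path
expected_fields = 5  # assumed schema width

# Option 1: filter out lines with the wrong number of fields.
good = raw.filter(lambda line: len(line.split(",")) == expected_fields)

# Option 2: map over lines and repair short rows instead of dropping them.
def fix_line(line):
    fields = line.split(",")
    # Pad short rows with empty strings (an illustrative fix).
    return fields + [""] * (expected_fields - len(fields))

fixed = raw.map(fix_line)
```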
CSV
• Use the relatively new spark-csv package.
• Spark SQL malformed data tip:
  • Use a WHERE clause and sanity check the fields (sketched below).
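A sketch using the Databricks spark-csv package (assumes Spark 1.4+ with the package on the classpath; the path, table, and fields are hypothetical):

```python
# Launch with, e.g.: --packages com.databricks:spark-csv_2.10:1.0.3
df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .load("s3n://my-bucket/people.csv"))  # hypothetical path

df.registerTempTable("people")

# Sanity check fields with a WHERE clause; CAST of a non-numeric
# string yields NULL, so malformed rows are skipped.
clean = sqlContext.sql("""
    SELECT name, CAST(age AS INT) AS age FROM people
    WHERE CAST(age AS INT) IS NOT NULL
""")
```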
JSON
• Ideally, have one JSON object per line.
  • Otherwise, you'll need to DIY parse the JSON.
• Prefer specifying a schema over inferSchema.
  • Watch out for an arbitrary number of keys: inferring the schema there will result in an executor OOM error.
• Spark SQL malformed data tip:
  • Bad inputs get a column called _corrupt_record, and the other columns will be null (sketched below).
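A sketch of reading JSON with an explicit schema (Spark 1.4+ style; the field names are hypothetical). Including _corrupt_record in the schema keeps a column for unparseable rows:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    # Rows Spark can't parse land here; the other columns come back null.
    StructField("_corrupt_record", StringType(), True),
])

df = sqlContext.read.schema(schema).json("s3n://my-bucket/events.json")

# Separate the bad records for inspection.
bad = df.filter(df["_corrupt_record"].isNotNull())
good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
```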
XML
• Not an ideal big data format.
• Typically not one XML object per line, so you'll need to DIY parse.
• Not a lot of built-in library support for XML.
• Watch out for very large compressed files.
• May need to employ the General Tips (later in this talk) to parse; a sketch follows below.
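One way to DIY parse, sketched with sc.wholeTextFiles() from the General Tips (the file layout and tag names are hypothetical, and each file must fit in memory):

```python
import xml.etree.ElementTree as ET

# Each element is a (filename, whole file content) pair.
files = sc.wholeTextFiles("s3n://my-bucket/xml/*.xml")

def parse_records(file_pair):
    _, content = file_pair
    root = ET.fromstring(content)
    for record in root.iter("record"):  # hypothetical tag
        yield (record.findtext("id"), record.findtext("value"))

pairs = files.flatMap(parse_records)
```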
Data Interchange Formats with a Schema
• Good practice to enforce an API with backward compatibility.
• Avro, Parquet, and Thrift are common ones.
• Usually good compression.
• The data format itself won't be corrupt.
  • But the underlying records may still be.
Avro
• Use DataFileWriter to write multiple objects.
• Use the spark-avro package to read the data in.
• Don't transfer "AvroKey" across the driver and workers.
  • AvroKey is not serializable.
  • Pull out the fields of interest or convert to JSON (sketched below).
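A sketch of reading Avro with the Databricks spark-avro package (assumes the package is on the classpath; the path and field are hypothetical):

```python
# Launch with, e.g.: --packages com.databricks:spark-avro_2.10:1.0.0
df = (sqlContext.read.format("com.databricks.spark.avro")
      .load("s3n://my-bucket/episodes.avro"))  # hypothetical path

# Pull out only the fields of interest rather than shipping
# raw Avro objects between the driver and workers.
titles = df.select("title").collect()
```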
Protocol Buffers
• Need to figure out how to encode multiple messages in one file:
  • Encode in SequenceFiles or another similar file format.
  • Prepend the number of bytes before reading the next message.
  • Base64 encode one message per line.
• Currently, no built-in Spark package supports this.
  • An opportunity to contribute to the open source community.
  • For now, convert to JSON and read into Spark SQL (one-per-line decoding is sketched below).
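A sketch of the base64-one-per-line approach (the generated protobuf module, message class, and fields are all hypothetical, and the module must be available on every worker):

```python
import base64
from my_protos_pb2 import LogEntry  # hypothetical generated module

def decode(line):
    msg = LogEntry()
    msg.ParseFromString(base64.b64decode(line))
    return (msg.user_id, msg.url)  # hypothetical fields

records = sc.textFile("s3n://my-bucket/protos.b64").map(decode)
```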
Columnar Formats: Parquet & ORC
• Great for use with Spark SQL.
  • Parquet is actually the best practice for Spark SQL.
• Makes it easy to pull out only certain columns at a time.
• Worth the time to encode if you'll make multiple passes over the data.
• Does not support appending one record at a time.
  • So not good for collecting log records.
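A sketch of writing and querying Parquet (Spark 1.4+ DataFrame API; the paths and fields are hypothetical):

```python
# Encode once as Parquet...
df = sqlContext.read.json("s3n://my-bucket/events.json")
df.write.parquet("s3n://my-bucket/events.parquet")

# ...then every later pass reads only the columns it needs.
events = sqlContext.read.parquet("s3n://my-bucket/events.parquet")
events.registerTempTable("events")
top_users = sqlContext.sql(
    "SELECT user_id, COUNT(*) AS hits FROM events GROUP BY user_id")
```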
General Tips
Reuse Hadoop Libraries
• HadoopRDD & NewHadoopRDD
  • Reuse Hadoop file format libraries.
• Hive SerDes are supported in Spark SQL:
  • Load your SerDe jar onto your Spark cluster.
  • Issue a SHOW CREATE TABLE command.
  • Enter the CREATE TABLE command (with EXTERNAL), as sketched below.
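A sketch of the SerDe flow (assumes sqlContext is a HiveContext; the jar, SerDe class, table, and location are hypothetical):

```python
# Make the SerDe available to the cluster.
sqlContext.sql("ADD JAR /path/to/my-serde.jar")

# Recreate the table definition (EXTERNAL so Hive won't own the data).
sqlContext.sql("""
    CREATE EXTERNAL TABLE logs (ip STRING, ts STRING, url STRING)
    ROW FORMAT SERDE 'com.example.MyLogSerDe'
    LOCATION 's3n://my-bucket/logs/'
""")

sqlContext.sql("SELECT url, COUNT(*) FROM logs GROUP BY url")
```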
Controlling the Output File Size
• Spark generally writes one file per partition.
  • Use coalesce to write out fewer files.
  • Use repartition to write out more files; this will cause a shuffle of the data.
• If repartition is too expensive:
  • Call mapPartitions to get an iterator and write out as many (or few) files as you would like.
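A sketch of both options (the partition counts and paths are hypothetical):

```python
rdd = sc.textFile("s3n://my-bucket/input/*")

# Fewer, larger output files; coalesce avoids a full shuffle.
rdd.coalesce(8).saveAsTextFile("s3n://my-bucket/out-few")

# More, smaller output files; repartition shuffles the data.
rdd.repartition(200).saveAsTextFile("s3n://my-bucket/out-many")
```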
sc.wholeTextFiles()
• Each whole file should fit into memory.
• Good for file formats that aren't splittable by line, such as XML files.
• You will need to performance tune.
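A minimal sketch (the path is hypothetical):

```python
# Each element is a (filename, entire file content) pair, so every
# file must fit in an executor's memory.
files = sc.wholeTextFiles("s3n://my-bucket/xml/")

# A trivial whole-file operation: count <record> tags per file.
counts = files.map(lambda kv: (kv[0], kv[1].count("<record>")))
```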
Use “filename” as the RDD element
• filenames = sc.parallelize([“s3:/file1”, “s3:/file2”, …])
• Allows you to run a function on a single file:
  • filenames.map(…call_function_on_filename)
• Use it to decode non-standard compression formats.
• Use it to split up files that aren't separable by line (sketched below).
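A sketch of the pattern (the paths and the external decoder command are hypothetical):

```python
import subprocess

# Each task receives a path, fetches the file itself, and decodes it.
filenames = sc.parallelize(["s3://my-bucket/a.xz", "s3://my-bucket/b.xz"])

def decode_file(path):
    # E.g., shell out to a decoder that Hadoop's codecs don't support.
    data = subprocess.check_output(["my-decoder", path])
    return data.splitlines()

records = filenames.flatMap(decode_file)
```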
Thank you. Any questions?