![Page 2: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/2.jpg)
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who Am I?
Worked on Hadoop since Jan 2006
MapReduce, Security, Hive, and ORC
Worked on different file formats–Sequence File, RCFile, ORC File, T-File, and Avro
requirements
![Page 3: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/3.jpg)
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Goal
Seeking to discover unknowns–How do the different formats perform?–What could they do better?–Best part of open source is looking inside!
Use real & diverse data sets–Over-reliance on similar datasets leads to weakness
Open & reviewed benchmarks
![Page 4: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/4.jpg)
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The File Formats
![Page 5: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/5.jpg)
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Avro
Cross-language file format for Hadoop
Schema evolution was primary goal
Schema segregated from data–Unlike Protobuf and Thrift
Row major format
![Page 6: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/6.jpg)
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
JSON
Serialization format for HTTP & Javascript
Text-format with MANY parsers
Schema completely integrated with data
Row major format
Compression applied on top
![Page 7: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/7.jpg)
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
ORC
Originally part of Hive to replace RCFile–Now top-level project
Schema segregated into footer
Column major format with stripes
Rich type model, stored top-down
Integrated compression, indexes, & stats
![Page 8: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/8.jpg)
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parquet
Design based on Google’s Dremel paper
Schema segregated into footer
Column major format with stripes
Simpler type-model with logical types
All data pushed to leaves of the tree
![Page 9: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/9.jpg)
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Sets
![Page 10: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/10.jpg)
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NYC Taxi Data
Every taxi cab ride in NYC from 2009–Publically available –http://tinyurl.com/nyc-taxi-analysis
18 columns with no null values–Doubles, integers, decimals, & strings
2 months of data – 22.7 million rows
![Page 11: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/11.jpg)
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
![Page 12: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/12.jpg)
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Github Logs
All actions on Github public repositories –Publically available –https://www.githubarchive.org/
704 columns with a lot of structure & nulls –Pretty much the kitchen sink
1/2 month of data – 10.5 million rows
![Page 13: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/13.jpg)
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Finding the Github Schema
The data is all in JSON.
No schema for the data is published.
We wrote a JSON schema discoverer.–Scans the document and figures out the types
Available in ORC tool jar.
Schema is huge (12k)
![Page 14: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/14.jpg)
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sales
Generated data–Real schema from a production Hive deployment–Random data based on the data statistics
55 columns with lots of nulls–A little structure–Timestamps, strings, longs, booleans, list, & struct
25 million rows
![Page 15: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/15.jpg)
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Storage costs
![Page 16: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/16.jpg)
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Compression
Data size matters!–Hadoop stores all your data, but requires hardware–Is one factor in read speed
ORC and Parquet use RLE & Dictionaries
All the formats have general compression–ZLIB (GZip) – tight compression, slower–Snappy – some compression, faster
![Page 17: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/17.jpg)
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
![Page 18: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/18.jpg)
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Taxi Size Analysis
Don’t use JSON
Use either Snappy or Zlib compression
Avro’s small compression window hurts
Parquet Zlib is smaller than ORC–Group the column sizes by type
![Page 19: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/19.jpg)
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
![Page 20: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/20.jpg)
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
![Page 21: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/21.jpg)
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sales Size Analysis
ORC did better than expected–String columns have small cardinality–Lots of timestamp columns–No doubles
Need to revalidate results with original–Improve random data generator–Add non-smooth distributions
![Page 22: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/22.jpg)
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
![Page 23: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/23.jpg)
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Github Size Analysis
Surprising win for JSON and Avro–Worst when uncompressed–Best with zlib
Many partially shared strings–ORC and Parquet don’t compress across columns
Need to investigate Brotli
![Page 24: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/24.jpg)
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Use Cases
![Page 25: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/25.jpg)
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Full Table Scans
Read all columns & rows
All formats except JSON are splitable–Different workers do different parts of file
![Page 26: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/26.jpg)
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
![Page 27: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/27.jpg)
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Taxi Read Performance Analysis
JSON is very slow to read–Large storage size for this data set–Needs to do a LOT of string parsing
Tradeoff between space & time–Less compression is sometimes faster
![Page 28: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/28.jpg)
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
![Page 29: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/29.jpg)
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sales Read Performance Analysis
Read performance is dominated by format–Compression matters less for this data set–Straight ordering: ORC, Avro, Parquet, & JSON
Garbage collection is important–ORC 0.3 to 1.4% of time–Avro < 0.1% of time–Parquet 4 to 8% of time
![Page 30: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/30.jpg)
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
![Page 31: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/31.jpg)
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Github Read Performance Analysis
Garbage collection is critical–ORC 2.1 to 3.4% of time–Avro 0.1% of time–Parquet 11.4 to 12.8% of time
A lot of columns needs more space–We need bigger stripes–Rows/stripe - ORC: 18.6k, Parquet: 88.1k
![Page 32: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/32.jpg)
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Column Projection
Often just need a few columns–Only ORC & Parquet are columnar–Only read, decompress, & deserialize some columns
Dataset format compression us/row projection Percent time
github orc zlib 21.319 0.185 0.87%
github parquet zlib 72.494 0.585 0.81%
sales orc zlib 1.866 0.056 3.00%
sales parquet zlib 12.893 0.329 2.55%
taxi orc zlib 2.766 0.063 2.28%
taxi parquet zlib 3.496 0.718 20.54%
![Page 33: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/33.jpg)
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Predicate Pushdown
Query: – select first_name, last_name from employees where
hire_date between ‘01/01/2017’ and ‘01/03/2017’
Predicate:–hire_date between ‘01/01/2017’ and ‘01/03/2017’
Given to reader
![Page 34: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/34.jpg)
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Predicate Pushdown in ORC
ORC stores indexes with min & max
Reader filters out sections of file– Entire file– Stripe–Row group (10k rows)
Engine needs to apply row level filter
![Page 35: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/35.jpg)
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Projection & Predicate Pushdown
Parquet can do pushdown to the stripe
Improves data layout options–Better than partition pruning with sorting
ORC has optional bloom filters–Helps for non-sorted column–Only useful for equality
![Page 36: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/36.jpg)
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Metadata Access
ORC & Parquet store metadata–Stored in file footer–File schema–Number of records–Min, max, count of each column
Provides O(1) Access
![Page 37: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/37.jpg)
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Conclusions
![Page 38: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/38.jpg)
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Recommendations
Disclaimer – Everything changes!–Both these benchmarks and the formats will change.
For complex tables with common strings–Avro with Snappy is a good fit
For other tables–ORC with Zlib is a good fit
Experiment with the benchmarks
![Page 39: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/39.jpg)
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Fun Stuff
Built open benchmark suite for files
Built pieces of a tool to convert files–Avro, CSV, JSON, ORC, & Parquet
Built a random parameterized generator–Easy to model arbitrary tables–Can write to Avro, ORC, or Parquet
![Page 40: File Format Benchmark - Avro, JSON, ORC and Parquet](https://reader034.vdocuments.us/reader034/viewer/2022042600/5a6515d27f8b9aa2548b6ca9/html5/thumbnails/40.jpg)
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank you!Twitter: @owen_omalleyEmail: [email protected]