The columnar roadmap: Apache Parquet and Apache Arrow
![Page 1: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/1.jpg)
© 2017 Dremio Corporation @DremioHQ
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem, Principal Architect Dremio, VP Apache Parquet, Apache Arrow PMC
![Page 2: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/2.jpg)
• Architect at @DremioHQ
• Formerly Tech Lead at Twitter on Data Platforms.
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet
Julien Le Dem @J_ Julien
![Page 3: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/3.jpg)
Agenda
• Community Driven Standard
• Benefits of Columnar representation
• Vertical integration: Parquet to Arrow
• Arrow based communication
![Page 4: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/4.jpg)
Community Driven Standard
![Page 5: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/5.jpg)
An open source standard
• Parquet: Common need for on disk columnar.
• Arrow: Common need for in memory columnar.
• Arrow is building on the success of Parquet.
• Top-level Apache project
• Standard from the start:
– Members from 13+ major open source projects involved
• Benefits:
– Share the effort
– Create an ecosystem
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
![Page 6: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/6.jpg)
Interoperability and Ecosystem
Before:
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. Parquet-to-Arrow reader)
![Page 7: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/7.jpg)
Benefits of Columnar representation
![Page 8: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/8.jpg)
Columnar layout
[Figure: a logical table representation shown next to its row layout and its column layout. Image credit: @EmrgencyKittens]
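The difference between the two layouts can be sketched in a few lines. This is only an illustration: the three-column table and the function names `row_layout` and `column_layout` are assumptions made for the example, not part of the talk.

```python
# Hypothetical 3-column table, used only for illustration.
rows = [
    ("a1", "b1", "c1"),
    ("a2", "b2", "c2"),
    ("a3", "b3", "c3"),
]

def row_layout(table):
    """Store values record by record: a1 b1 c1 a2 b2 c2 ..."""
    return [value for record in table for value in record]

def column_layout(table):
    """Store values column by column: a1 a2 a3 b1 b2 b3 ..."""
    return [value for column in zip(*table) for value in column]

print(row_layout(rows))
print(column_layout(rows))
```

In the column layout all values of one column sit next to each other, which is what lets a reader skip whole columns and lets encodings exploit the similarity of neighboring values.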
![Page 9: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/9.jpg)
On Disk and in Memory
• Different trade-offs
– On disk: Storage.
  • Accessed by multiple queries.
  • Priority to I/O reduction (but still needs good CPU throughput).
  • Mostly Streaming access.
– In memory: Transient.
  • Specific to one query execution.
  • Priority to CPU throughput (but still needs good I/O).
  • Streaming and Random access.
![Page 10: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/10.jpg)
Parquet on disk columnar format
![Page 11: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/11.jpg)
Parquet on disk columnar format
• Nested data structures
• Compact format:
– type aware encodings
– better compression
• Optimized I/O:
– Projection push down (column pruning)
– Predicate push down (filters based on stats)
![Page 12: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/12.jpg)
Parquet nested representation
Document
– DocId
– Links
  – Backward
  – Forward
– Name
  – Language
    – Code
    – Country
  – Url
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
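A minimal sketch of how the column list follows from the nested schema: every leaf field becomes one column, named by its path from the root. The dict encoding of the schema and the helper `leaf_columns` are assumptions made for this example; they are not the Parquet API.

```python
# Schema from the slide written as a nested dict; a leaf is marked with None.
schema = {
    "docid": None,
    "links": {"backward": None, "forward": None},
    "name": {
        "language": {"code": None, "country": None},
        "url": None,
    },
}

def leaf_columns(node, prefix=()):
    """Yield one dotted path per leaf field, in schema order."""
    for field, child in node.items():
        path = prefix + (field,)
        if child is None:
            yield ".".join(path)
        else:
            yield from leaf_columns(child, path)

columns = list(leaf_columns(schema))
print(columns)
```

The sketch only covers naming. The repetition and definition levels that Parquet stores next to each column, as described in the Dremel paper, are left out.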
![Page 13: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/13.jpg)
Access only the data you need
[Figure: a table with columns a, b, c and rows 1 to 5, shown three times with different cells highlighted]
Columnar + Statistics = Read only the data you need!
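How the two ideas combine can be sketched as follows. This is a toy model under stated assumptions, not the Parquet reader: row groups are plain dicts, `write_row_group` and `scan` are hypothetical helpers, and the only predicate supported is `filter_column > lower_bound`.

```python
def write_row_group(columns):
    """Keep the column values together with per-column min/max statistics."""
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}

def scan(row_groups, projection, filter_column, lower_bound):
    """Return projected rows where filter_column > lower_bound."""
    out = []
    groups_read = 0
    for group in row_groups:
        _, col_max = group["stats"][filter_column]
        if col_max <= lower_bound:
            continue  # statistics prove that no row in this group can match
        groups_read += 1
        column = group["columns"][filter_column]
        keep = [i for i, v in enumerate(column) if v > lower_bound]
        # Only the filter column and the projected columns are touched.
        out.extend(tuple(group["columns"][c][i] for c in projection) for i in keep)
    return out, groups_read

row_groups = [
    write_row_group({"a": [1, 2, 3], "b": [10, 20, 30], "c": ["x", "y", "z"]}),
    write_row_group({"a": [7, 8, 9], "b": [70, 80, 90], "c": ["u", "v", "w"]}),
]
rows, groups_read = scan(row_groups, projection=["b"], filter_column="a", lower_bound=7)
print(rows, groups_read)
```

The first row group is skipped from its statistics alone, and column c is never read in either group.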
![Page 14: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/14.jpg)
Parquet file layout
![Page 15: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/15.jpg)
Arrow in memory columnar format
![Page 16: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/16.jpg)
Arrow goals
• Well-documented and cross language compatible
• Designed to take advantage of modern CPU
• Embeddable
– in execution engines, storage layers, etc.
• Interoperable
![Page 17: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/17.jpg)
Arrow in memory columnar format
• Nested Data Structures
• Maximize CPU throughput
– Pipelining
– SIMD
– cache locality
• Scatter/gather I/O
![Page 18: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/18.jpg)
CPU pipeline
![Page 19: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/19.jpg)
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: ['555-111-1111', '555-222-2222']
}, {
  name: 'Jack',
  age: 37,
  phones: ['555-333-3333']
}]
![Page 20: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/20.jpg)
Record Batch Construction
Stream: Schema Negotiation, Dictionary Batch, Record Batch, Record Batch, Record Batch
Record batch layout for {name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222']}:
• data header (describes offsets into data)
• name (bitmap)
• name (offset)
• name (data)
• age (bitmap)
• age (data)
• phones (bitmap)
• phones (list offset)
• phones (offset)
• phones (data)
Each box (vector) is contiguous memory.
The entire record batch is contiguous on wire.
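The vectors named on this slide can be derived from the persons example with a short sketch. It uses plain Python lists and bytes rather than the Arrow library, and the helpers `validity_bitmap` and `string_vector` are hypothetical names chosen for the illustration.

```python
persons = [
    {"name": "Joe", "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
    {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
]

def validity_bitmap(values):
    """One bit per slot, least significant bit first; 1 means the value is present."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def string_vector(strings):
    """Variable-width values: offsets into one contiguous data buffer."""
    offsets, data = [0], bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)

names = [p["name"] for p in persons]
name_bitmap = validity_bitmap(names)
name_offsets, name_data = string_vector(names)

ages = [p["age"] for p in persons]  # fixed-width values need no offsets

# A list column needs two levels: list offsets select a range of phone strings,
# and the phone strings themselves are again an offsets + data pair.
phone_lists = [p["phones"] for p in persons]
list_offsets = [0]
for phones in phone_lists:
    list_offsets.append(list_offsets[-1] + len(phones))
flat_phones = [phone for phones in phone_lists for phone in phones]
phone_offsets, phone_data = string_vector(flat_phones)

print(name_offsets, name_data, list_offsets, phone_offsets)
```

Value i of a variable-width vector is `data[offsets[i]:offsets[i + 1]]`, so random access needs no scanning, and each vector is one contiguous buffer.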
![Page 21: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/21.jpg)
Vertical integration: Parquet to Arrow
![Page 22: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/22.jpg)
Representation comparison for flat schema
![Page 23: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/23.jpg)
Naïve conversion
![Page 24: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/24.jpg)
Naïve conversion
Data dependent branch
![Page 25: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/25.jpg)
Peeling away abstraction layers
Vectorized read
(They have layers)
![Page 26: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/26.jpg)
Bit packing case
![Page 27: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/27.jpg)
Bit packing case
No branch
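The contrast between the naive conversion and the bit packing case can be sketched as below. The slide images are not part of this transcript, so this is a reconstruction under assumptions: a flat nullable column whose definition levels are 0 (null) or 1 (defined), packed one bit per value with the least significant bit first. All function names are hypothetical.

```python
def naive_validity(def_levels):
    """Build the validity bitmap one value at a time."""
    bitmap = bytearray((len(def_levels) + 7) // 8)
    for i, level in enumerate(def_levels):
        if level == 1:  # data dependent branch, hard for the CPU to predict
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def pack_levels(def_levels):
    """Pack 1-bit levels, least significant bit first (assumed on-disk form)."""
    packed = bytearray((len(def_levels) + 7) // 8)
    for i, level in enumerate(def_levels):
        packed[i // 8] |= level << (i % 8)
    return bytes(packed)

def packed_validity(packed_levels):
    """With 1-bit levels in this bit order the packed bytes already have the
    bitmap layout, so they are copied as a block without testing any value."""
    return bytes(packed_levels)

levels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
print(naive_validity(levels), packed_validity(pack_levels(levels)))
```

Both paths produce the same bitmap, but the second one never inspects an individual level.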
![Page 28: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/28.jpg)
Run length encoding case
![Page 29: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/29.jpg)
Run length encoding case
Block of defined values
![Page 30: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/30.jpg)
Run length encoding case
Block of null values
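For run length encoded definition levels the decision is made once per run instead of once per value. The following sketch assumes runs are given as `(definition_level, run_length)` pairs and that only non-null values are stored on the Parquet side; `expand_rle` and `null_slot` are hypothetical names.

```python
def expand_rle(runs, dense_values, null_slot=0):
    """Expand densely stored non-null values into one slot per row.

    runs: list of (definition_level, run_length) pairs
    dense_values: the non-null values only, in row order
    """
    validity, values, read = [], [], 0
    for level, count in runs:
        if level == 1:
            # Block of defined values: one bulk copy, no per-value test.
            values.extend(dense_values[read:read + count])
            read += count
        else:
            # Block of null values: placeholder slots are filled in one step.
            values.extend([null_slot] * count)
        validity.extend([level] * count)
    return validity, values

validity, values = expand_rle([(1, 3), (0, 2), (1, 1)], [10, 20, 30, 40])
print(validity, values)
```

The branch that remains is taken once per run, so long runs of defined or null values cost almost nothing per value.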
![Page 31: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/31.jpg)
Predicate push down
![Page 32: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/32.jpg)
Example: filter and projection
![Page 33: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/33.jpg)
Naive filter and projection implementation
![Page 34: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/34.jpg)
Peeling away abstractions
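The query on the slide images is not recoverable from this transcript, so the sketch below assumes a hypothetical one: project columns a and b where c is greater than a threshold. It contrasts a row-at-a-time implementation with one that works directly on the column vectors.

```python
# Hypothetical columnar table, used only for illustration.
table = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [5, 50, 7, 70]}

def naive(table, threshold):
    """Materialize every full row, then filter and project it."""
    out = []
    for i in range(len(table["c"])):
        row = {name: column[i] for name, column in table.items()}
        if row["c"] > threshold:
            out.append((row["a"], row["b"]))
    return out

def columnar(table, threshold):
    """Evaluate the filter on column c alone, then gather a and b by position."""
    selection = [i for i, value in enumerate(table["c"]) if value > threshold]
    return {name: [table[name][i] for i in selection] for name in ("a", "b")}

print(naive(table, 10))
print(columnar(table, 10))
```

The columnar version scans one column in a tight loop and touches a and b only at the selected positions, so rows that fail the filter are never assembled.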
![Page 35: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/35.jpg)
Arrow based communication
![Page 36: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/36.jpg)
Universal high performance UDFs
[Diagram: a SQL engine (SQL Operator 1, SQL Operator 2) and a Python process running a user defined function, connected by two "reads" arrows]
![Page 37: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/37.jpg)
Arrow RPC/REST API
• Generic way to retrieve data in Arrow format
• Generic way to serve data in Arrow format
• Simplify integrations across the ecosystem
• Arrow based pipe
![Page 38: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/38.jpg)
RPC: arrow based storage interchange
The memory representation is sent over the wire.
No serialization overhead.
[Diagram: SQL execution made of Scanner and Operator pairs; each Scanner applies projection/predicate push down against a Storage node (Mem, Disk) and receives Arrow batches; the pattern repeats across nodes]
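What "no serialization overhead" means can be illustrated with the standard library alone. This is not the Arrow wire format: the sketch only assumes a fixed-width integer column and shows that the bytes in memory can be shipped as they are and read back through a view, without decoding values one by one.

```python
from array import array

# Sender: a fixed-width column held in one contiguous buffer.
ages = array("i", [18, 37])
wire = ages.tobytes()  # the in-memory bytes are the message

# Receiver: reinterpret the received bytes as integers without parsing them.
received = memoryview(wire).cast("i")
print(list(received))
```

A format that needed per-value encoding would require a loop on both sides. Here the only work is moving the buffer, which is the property the slide attributes to sending the memory representation over the wire.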
![Page 39: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/39.jpg)
RPC: arrow based cache
The memory representation is sent over the wire.
No serialization overhead.
[Diagram: the Operators of a SQL execution read from an Arrow-based Cache with projection push down]
![Page 40: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/40.jpg)
RPC: Single system execution
The memory representation is sent over the wire.
No serialization overhead.
[Diagram: three Scanners read Parquet files with projection push down (read only a and b) and feed three Partial Agg operators; Arrow batches are shuffled to three Agg operators, which produce the Result]
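The scan, partial aggregation, shuffle, and final aggregation pipeline can be sketched as follows. The query is an assumption: the slide only says that columns a and b are read, so the example sums b grouped by a. All function names are hypothetical, and the shuffle is simulated in one process.

```python
from collections import defaultdict

def partial_agg(batch):
    """Sum column b per key of column a within one scanner's batch."""
    sums = defaultdict(int)
    for key, value in zip(batch["a"], batch["b"]):
        sums[key] += value
    return dict(sums)

def shuffle(partials, n_targets):
    """Route each (key, partial sum) so that one key always reaches one target."""
    targets = [[] for _ in range(n_targets)]
    for partial in partials:
        for key, value in partial.items():
            targets[hash(key) % n_targets].append((key, value))
    return targets

def final_agg(pairs):
    """Combine the partial sums that arrived at one target."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)

# One batch per scanner, already projected down to columns a and b.
batches = [
    {"a": ["x", "y", "x"], "b": [1, 2, 3]},
    {"a": ["y", "z"], "b": [4, 5]},
    {"a": ["x"], "b": [6]},
]
partials = [partial_agg(batch) for batch in batches]
result = {}
for pairs in shuffle(partials, 3):
    result.update(final_agg(pairs))
print(result)
```

Aggregating before the shuffle reduces the data that crosses the network to one entry per key per scanner.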
![Page 41: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/41.jpg)
Results
- PySpark Integration: 53x speedup (IBM spark work on SPARK-13534) http://s.apache.org/arrowresult1
- Streaming Arrow Performance: 7.75GB/s data movement http://s.apache.org/arrowresult2
- Arrow Parquet C++ Integration: 4GB/s reads http://s.apache.org/arrowresult3
- Pandas Integration: 9.71GB/s http://s.apache.org/arrowresult4
![Page 42: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/42.jpg)
Language Bindings
Parquet
• Target Languages
– Java
– CPP
– Python & Pandas
• Engines integration:
– Many!
Arrow
• Target Languages
– Java
– CPP, Python
– R (underway)
– C, Ruby, JavaScript
• Engines integration:
– Drill
– Pandas, R
– Spark (underway)
![Page 43: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/43.jpg)
Arrow Releases
Release  Date               Changes  Days
0.1.0    October 10, 2016   178      237
0.2.0    February 18, 2017  195      131
0.3.0    May 5, 2017        311      76
0.4.0    May 22, 2017       85       17
![Page 44: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/44.jpg)
Current activity:
• Spark Integration (SPARK-13534)
• Pages index in Parquet footer (PARQUET-922)
• Arrow REST API (ARROW-1077)
• Bindings:
– C, Ruby (ARROW-631)
– JavaScript (ARROW-541)
![Page 45: The columnar roadmap: Apache Parquet and Apache Arrow](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6499c57f8b9a27568b7405/html5/thumbnails/45.jpg)
Get Involved
• Join the community
– dev@{arrow,parquet}.apache.org
– Slack:
• Arrow: https://apachearrowslackin.herokuapp.com/
• Parquet: https://parquet-slack-invite.herokuapp.com/
– http://{arrow,parquet}.apache.org
– Follow @Apache{Parquet,Arrow}