How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
Julien Le Dem @J_
Analytics Data Pipeline tech lead, Data Platform
@ApacheParquet
Outline

- Instrumentation and data collection
- Storing data efficiently for analysis
- Openness and interoperability
Instrumentation and data collection
Typical data flow

[Diagram: Instrumented services serve happy users and apply mutations to mutable serving stores. Data collection: a streaming log (Kafka, Scribe, Chukwa, ...) is pulled for streaming analysis and for periodic consolidation; periodic snapshots of the serving stores, together with their schema, are pulled into storage (HDFS) in a query-efficient format: Parquet. Analysis on top of that storage: ad-hoc queries (Impala, Hive, Drill, ...), automated dashboards, batch computation (graph, machine learning, ...), and streaming computation (Storm, Samza, Spark Streaming, ...). The result: a happy data scientist.]
Storing data for analysis
Producing a lot of data is easy

Producing a lot of derived data is even easier. Solution: compress all the things!
Scanning a lot of data is easy

[Progress bar: 1% completed]

... but not necessarily fast. Waiting is not productive; we want faster turnaround. Compression, but not at the cost of reading speed.
Interoperability: not that easy

We need a storage format that is interoperable with all the tools we use and keeps our options open for the next big thing.
Enter Apache Parquet
Parquet design goals
- Interoperability
- Space efficiency
- Query efficiency
Efficiency
Columnar storage

Logical table representation (nested schema; columns a, b, c):

a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5

Row layout (one record at a time):
a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5

Column layout (one column at a time):
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5

Each column is then encoded: encoded chunk | encoded chunk | encoded chunk
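The two layouts above can be sketched in a few lines of Python (illustrative only; Parquet's on-disk representation involves pages, chunks, and encodings on top of this idea):

```python
# The table from the slide: 5 records with columns a, b, c.
rows = [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3"),
        ("a4", "b4", "c4"), ("a5", "b5", "c5")]

# Row layout: all values of one record stored together.
row_layout = [v for row in rows for v in row]

# Column layout: all values of one column stored together --
# the basis for per-column encoding and compression.
col_layout = [v for col in zip(*rows) for v in col]

print(row_layout[:6])  # ['a1', 'b1', 'c1', 'a2', 'b2', 'c2']
print(col_layout[:6])  # ['a1', 'a2', 'a3', 'a4', 'a5', 'b1']
```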
Parquet nested representation

Schema (borrowed from the Google Dremel paper):

Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url

Columns: docid, links.backward, links.forward, name.language.code, name.language.country, name.url

https://blog.twitter.com/2013/dremel-made-simple-with-parquet
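The column list falls out of the schema's leaf fields. In Parquet's message-type syntax, the Dremel paper's schema reads roughly as follows (a sketch; field repetition levels are as given in the paper):

```
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required binary Code;
      optional binary Country;
    }
    optional binary Url;
  }
}
```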
Statistics for filter and query optimization

Vertical partitioning (projection push down) + horizontal partitioning (predicate push down) = read only the data you need!

[Diagram: the same 5-row table (columns a, b, c) shown three times — once selecting a subset of columns, once a subset of rows, and once their intersection.]
Properties of efficient encodings

- Minimize CPU pipeline bubbles: highly predictable branching, reduced data dependencies
- Minimize CPU cache misses: reduce the size of the working set
The right encoding for the right job

- Delta encodings: for sorted datasets, or signals where the variation matters more than the absolute value (timestamps, auto-generated ids, metrics, ...). Focuses on avoiding branching.
- Prefix coding (delta encoding for strings): when dictionary encoding does not work.
- Dictionary encoding: small (60K) set of values (server IP, experiment id, ...).
- Run-length encoding: repetitive data.
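A toy run-length encoder illustrates the last item (a sketch only; Parquet's actual RLE/bit-packing hybrid is more involved):

```python
def rle_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

data = ["us", "us", "us", "fr", "fr", "us"]
encoded = rle_encode(data)
print(encoded)  # [('us', 3), ('fr', 2), ('us', 1)]
assert rle_decode(encoded) == data
```

On repetitive columns (country codes, flags, sorted keys) the (value, count) pairs are far smaller than the raw sequence, and decoding is a tight, predictable loop.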
Interoperability
Interoperable

[Diagram: layered architecture. Parquet is model agnostic and language agnostic. Object models (Avro, Thrift, Protocol Buffers, Pig Tuple, Hive SerDe, ...) plug in through converters (parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive, ...) that handle assembly/striping; query-execution engines (Impala, ...) read the format directly; column encoding and the Parquet file format itself are implemented in Java and C++.]
Frameworks and libraries integrated with Parquet

- Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto
- Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
- Data models: Avro, Thrift, Protocol Buffers, POJOs
Schema management
Schema in Hadoop

Hadoop does not define a standard notion of schema, but there are many available:

- Avro
- Thrift
- Protocol Buffers
- Pig
- Hive
- ...

And they are all different.
What they define

Schema: the structure of a record and constraints on the types.

Row-oriented binary format: how records are represented, one at a time.
What they *do not* define

Column-oriented binary format: Parquet reuses the schema definitions and provides a common column-oriented binary format.
Example: address book

[Diagram: an AddressBook contains a repeated field, addresses, of type Address: street, city, state, zip, comment.]
Protocol Buffers

```protobuf
message AddressBook {
  repeated group Addresses = 1 {
    required string street = 2;
    required string city = 3;
    required string state = 4;
    required string zip = 5;
    optional string comment = 6;
  }
}
```

Fields have ids and can be optional, required or repeated; lists are repeated fields.

- Allows recursive definition
- Types: group or primitive
- The binary format refers to field ids only, so renaming fields does not impact the binary format
- Requires installing a native compiler separate from your build
Thrift

```thrift
struct Address {
  1: required string street;
  2: required string city;
  3: required string state;
  4: required string zip;
  5: optional string comment;
}
struct AddressBook {
  1: required list<Address> addresses;
}
```

Fields have ids and can be optional or required; explicit collection types.

- No recursive definition
- Types: struct, map, list, set, union or primitive
- The binary format refers to field ids only, so renaming fields does not impact the binary format
- Requires installing a native compiler separately from the build
Avro

```json
{
  "type": "record",
  "name": "AddressBook",
  "fields": [{
    "name": "addresses",
    "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "Address",
        "fields": [
          {"name": "street", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "zip", "type": "string"},
          {"name": "comment", "type": ["null", "string"]}
        ]
      }
    }
  }]
}
```

Explicit collection types; null is a type, so an optional field is a union with null.

- Allows recursive definition
- Types: records, arrays, maps, unions or primitive
- The binary format requires knowing the write-time schema: more compact but not self-descriptive; renaming fields does not impact the binary format
- Code generator in Java (well integrated in the build)
Write to Parquet
Write to Parquet with MapReduce

Protocol Buffers:
```java
job.setOutputFormatClass(ProtoParquetOutputFormat.class);
ProtoParquetOutputFormat.setProtobufClass(job, AddressBook.class);
```

Thrift:
```java
job.setOutputFormatClass(ParquetThriftOutputFormat.class);
ParquetThriftOutputFormat.setThriftClass(job, AddressBook.class);
```

Avro:
```java
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, AddressBook.SCHEMA$);
```
Write to Parquet with Scalding

```scala
// define the Parquet source
case class AddressBookParquetSource(override implicit val dateRange: DateRange)
  extends HourlySuffixParquetThrift[AddressBook]("/my/data/address_book", dateRange)

// load and transform data
...
pipe.write(AddressBookParquetSource())
```
Write to Parquet with Pig

```pig
...
STORE mydata
  INTO 'my/data'
  USING parquet.pig.ParquetStorer();
```
Query engines
Scalding

Loading:
```scala
new FixedPathParquetThrift[AddressBook]("my", "data") {
  val city = StringColumn("city")
  override val withFilter: Option[FilterPredicate] =
    Some(city === "San Jose")
}
```

Operations:
```scala
p.map( (r) => r.a + r.b )
p.groupBy( (r) => r.c )
p.join
...
```
Pig

Loading:
```pig
mydata = LOAD 'my/data' USING parquet.pig.ParquetLoader();
```

Operations:
```pig
A = FOREACH mydata GENERATE a + b;
B = GROUP mydata BY c;
C = JOIN A BY a, B BY b;
```
Hive

Loading:
```sql
create table parquet_table_name (x INT, y STRING)
  ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT "parquet.hive.MapredParquetInputFormat"
    OUTPUTFORMAT "parquet.hive.MapredParquetOutputFormat";
```

Operations: SQL
Impala

Loading:
```sql
create table parquet_table (x int, y string) stored as parquetfile;
insert into parquet_table select x, y from some_other_table;
select y from parquet_table where x between 70 and 100;
```

Operations: SQL
Drill

```sql
SELECT * FROM dfs.`/my/data`
```
Spark SQL

Loading:
```scala
val address = sqlContext.parquetFile("/my/data/addresses")
```

Operations:
```scala
val result = sqlContext
  .sql("SELECT city FROM addresses WHERE zip == 94707")
result.map((r) => ...)
```
Community
Parquet timeline

- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats
- March 2013: OSS announcement; Criteo signs on for Hive integration
- July 2013: 1.0 release. 18 contributors from more than 5 organizations.
- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.
- Parquet 2.0 coming as an Apache release
Thank you to our contributors

[Chart: contributor activity over time, annotated with the open source announcement and the 1.0 release.]
Get involved

Mailing lists:
- [email protected]

Parquet sync ups:
- Regular meetings on Google Hangout
Questions
Questions.foreach( answer(_) )
@ApacheParquet