presto for the enterprise @ hadoop meetup
TRANSCRIPT
1
Warsaw Hadoop User Group
Wojciech Biela
Łukasz Osipiuk
www.teradata.com/presto
2
➔ History of Teradata Center for Hadoop
◆ Formerly Hadapt, founded in July 2010 by Justin Borgman, Kamil Bajda-Pawlikowski, and Daniel Abadi
◆ Pioneered SQL-on-Hadoop market
◆ Based on work done by database research group in Yale Computer Science Department
◆ Hybrid of Hadoop scalability and DBMS performance
➔ Today
◆ Acquired by Teradata in July 2014, renamed Teradata Center for Hadoop
◆ 20+ developers with deep Hadoop and database expertise
◆ Headquarters in Boston, MA
◆ Teams in US (MA, CA) and Poland (Warsaw)
◆ Contributors to open source project Presto
Who are we? - Teradata Center for Hadoop!
3
➔ What is Presto?
➔ What is Teradata doing?
➔ Can I see a Demo?
➔ How can I contribute?
Talk Agenda
4
➔ 100% open source distributed ANSI SQL engine for Big Data
◆ Modern code base
◆ Proven scalability
➔ Optimized for low latency, interactive querying
◆ Cross platform query capability, not only SQL on Hadoop
◆ Distributed under the Apache license, now supported by Teradata
◆ Used by a community of well known, well respected technology companies
What is Presto?
5
History of Presto
FALL 2008: Facebook open sources Hive
FALL 2012: 6 developers start Presto development
SPRING 2013: Presto rolled out within Facebook
FALL 2013: Facebook open sources Presto
FALL 2014: 88 Releases, 41 Contributors, 3943 Commits
SPRING 2015: 98 Releases, 65 Contributors, 4587 Commits
Teradata joins Presto community & offers support
6
Query Execution
[Diagram: a Client submits a query to the Coordinator (Parser/analyzer, Planner, Scheduler), which distributes work to Workers. Pluggable interfaces: Metadata API, Data location API, Data stream API.]
10
select shipdate,
       count(*) count,
       cast(sum(extendedprice) as bigint) price
from h_lineitem
where returnflag = 'R'
group by shipdate
order by count
limit 20
Logical and fragmented plan
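One way to picture the fragmented plan for this query is as a partial aggregation that runs per split, followed by a final stage that merges the partial results, sorts, and applies the limit. The sketch below is only an illustration under that assumption: the class, method, and field names are hypothetical and are not Presto classes, the tie-break on shipdate is added to keep the output deterministic, and Math.round stands in for the cast to bigint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: class and method names are illustrative, not Presto classes.
class FragmentedAggregationSketch {
    // One h_lineitem row, reduced to the columns the query touches.
    static final class LineItem {
        final String shipDate;
        final String returnFlag;
        final double extendedPrice;

        LineItem(String shipDate, String returnFlag, double extendedPrice) {
            this.shipDate = shipDate;
            this.returnFlag = returnFlag;
            this.extendedPrice = extendedPrice;
        }
    }

    // Running state for one shipdate group: count(*) and sum(extendedprice).
    static final class Acc {
        long count;
        double sum;
    }

    // Leaf fragment, run once per split: filter returnflag = 'R', then pre-aggregate.
    static Map<String, Acc> partial(List<LineItem> split) {
        Map<String, Acc> groups = new HashMap<>();
        for (LineItem row : split) {
            if (!"R".equals(row.returnFlag)) {
                continue;
            }
            Acc acc = groups.computeIfAbsent(row.shipDate, k -> new Acc());
            acc.count++;
            acc.sum += row.extendedPrice;
        }
        return groups;
    }

    // Final fragment: merge partial states, order by count, keep `limit` rows.
    static List<String> finalStage(List<Map<String, Acc>> partials, int limit) {
        Map<String, Acc> merged = new HashMap<>();
        for (Map<String, Acc> p : partials) {
            for (Map.Entry<String, Acc> e : p.entrySet()) {
                Acc m = merged.computeIfAbsent(e.getKey(), k -> new Acc());
                m.count += e.getValue().count;
                m.sum += e.getValue().sum;
            }
        }
        List<Map.Entry<String, Acc>> rows = new ArrayList<>(merged.entrySet());
        // Ties on count are broken by shipdate only to make this sketch deterministic.
        rows.sort((a, b) -> {
            int byCount = Long.compare(a.getValue().count, b.getValue().count);
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Acc> e : rows.subList(0, Math.min(limit, rows.size()))) {
            // Math.round stands in for the cast to bigint (an assumption about rounding).
            out.add(e.getKey() + " " + e.getValue().count + " " + Math.round(e.getValue().sum));
        }
        return out;
    }

    // Two small in-memory splits, as if scanned by two different workers.
    static List<String> demo() {
        List<LineItem> split1 = Arrays.asList(
                new LineItem("1995-01-01", "R", 10.0),
                new LineItem("1995-01-01", "N", 99.0),
                new LineItem("1995-01-02", "R", 5.25));
        List<LineItem> split2 = Arrays.asList(
                new LineItem("1995-01-01", "R", 20.0),
                new LineItem("1995-01-03", "R", 1.0),
                new LineItem("1995-01-01", "R", 2.0));
        return finalStage(Arrays.asList(partial(split1), partial(split2)), 2);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Only the small per-group states cross between the two stages, which is the point of splitting the aggregation.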
11
select *
from hive.default.h_nation, psql.public.p_region
where h_nation.regionkey = p_region.regionkey;
Logical and fragmented plan
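Once rows from both catalogs reach the engine, the equality condition on regionkey can be evaluated with an ordinary hash join, regardless of where each table lives. The following is a minimal sketch of that idea only. The Nation and Region classes and the sample rows are made up for illustration, and nothing here claims to be the join strategy Presto actually chooses for this query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a hash join; class and field names are illustrative only.
class CrossSourceJoinSketch {
    // Stand-in for a row read through the Hive connector.
    static final class Nation {
        final String name;
        final long regionKey;

        Nation(String name, long regionKey) {
            this.name = name;
            this.regionKey = regionKey;
        }
    }

    // Stand-in for a row read through the PostgreSQL connector.
    static final class Region {
        final long regionKey;
        final String name;

        Region(long regionKey, String name) {
            this.regionKey = regionKey;
            this.name = name;
        }
    }

    static List<String> join(List<Nation> nations, List<Region> regions) {
        // Build side: index the regions by join key.
        Map<Long, List<Region>> byKey = new HashMap<>();
        for (Region r : regions) {
            byKey.computeIfAbsent(r.regionKey, k -> new ArrayList<>()).add(r);
        }
        // Probe side: look up each nation's regionkey in the hash table.
        List<String> out = new ArrayList<>();
        for (Nation n : nations) {
            for (Region r : byKey.getOrDefault(n.regionKey, Collections.<Region>emptyList())) {
                out.add(n.name + " -> " + r.name);
            }
        }
        return out;
    }

    static List<String> demo() {
        List<Nation> nations = Arrays.asList(
                new Nation("FRANCE", 3), new Nation("JAPAN", 2), new Nation("ATLANTIS", 9));
        List<Region> regions = Arrays.asList(new Region(2, "ASIA"), new Region(3, "EUROPE"));
        return join(nations, regions);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```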
14
Query Execution
[Diagram: Client, Coordinator (Parser/analyzer, Planner, Scheduler), and Workers with pluggable Metadata, Data location, and Data stream APIs. The data stream is shown as a sequence of pages, each made up of blocks: page 1 (blockA, blockB), page (blockA, blockB), ...]
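The page and block labels suggest a columnar layout: a page is a batch of rows, and each block in it holds the values of one column. A rough sketch of that layout follows. LongBlock, Page, and sumColumn are hypothetical names used for illustration and are not Presto's actual types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these classes mimic the page/block idea, not Presto's real types.
class PageSketch {
    // A block holds the values of one column for a batch of rows.
    static final class LongBlock {
        final long[] values;

        LongBlock(long... values) {
            this.values = values;
        }
    }

    // A page groups equally sized blocks, one per column.
    static final class Page {
        final LongBlock[] blocks;
        final int positionCount;

        Page(LongBlock... blocks) {
            this.blocks = blocks;
            this.positionCount = blocks.length == 0 ? 0 : blocks[0].values.length;
            for (LongBlock b : blocks) {
                if (b.values.length != positionCount) {
                    throw new IllegalArgumentException("all blocks in a page need the same length");
                }
            }
        }
    }

    // Sum one column (channel) across a stream of pages, touching only that column's blocks.
    static long sumColumn(List<Page> pages, int channel) {
        long sum = 0;
        for (Page page : pages) {
            for (long v : page.blocks[channel].values) {
                sum += v;
            }
        }
        return sum;
    }

    // Two pages, each with blockA (channel 0) and blockB (channel 1).
    static long demoSum(int channel) {
        List<Page> pages = Arrays.asList(
                new Page(new LongBlock(1, 2, 3), new LongBlock(10, 20, 30)),
                new Page(new LongBlock(4, 5), new LongBlock(40, 50)));
        return sumColumn(pages, channel);
    }

    public static void main(String[] args) {
        System.out.println(demoSum(0));
        System.out.println(demoSum(1));
    }
}
```

Because each column sits in its own array, an operator that needs one column never reads the others.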
16
Plan execution
[Diagram: Hive vs. Presto plan execution. Hive side: map and reduce stages with repeated I/O steps. Presto side: pipelined tasks and I/O.]
17
Presto Extensibility – plugins
➔ Connectors
➔ Data types
➔ Extra functions
➔ (new) Security providers
18
Presto Extensibility – connector interfaces
[Diagram: Coordinator (Parser/analyzer, Planner, Scheduler) and Worker. Each connector (Hive, Cassandra, Kafka, MySQL, …) plugs into the Metadata API, the Data location API, and the Data stream API.]
19
Presto Extensibility – connector interfaces
public interface Connector
{
    ConnectorHandleResolver getHandleResolver();
    ConnectorMetadata getMetadata();
    ConnectorSplitManager getSplitManager();
    ConnectorPageSourceProvider getPageSourceProvider();
    ConnectorRecordSetProvider getRecordSetProvider();
    ConnectorPageSinkProvider getPageSinkProvider();
    ConnectorRecordSinkProvider getRecordSinkProvider();
    ConnectorIndexResolver getIndexResolver();
    Set<SystemTable> getSystemTables();
    List<PropertyMetadata<?>> getSessionProperties();
    List<PropertyMetadata<?>> getTableProperties();
    ConnectorAccessControl getAccessControl();
    default void shutdown() {}
}
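To show how the three pluggable APIs could cooperate during a table scan, here is a heavily simplified sketch. Every type in it (ToyConnector, ToyMetadata, ToySplitManager, ToyRecordSetProvider) is a hypothetical stand-in invented for this example. The real interfaces listed above have different signatures, so treat this as a picture of the division of labour, not as the SPI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, heavily simplified stand-ins for the connector interfaces listed above.
class ToyConnectorSketch {
    interface ToyMetadata {
        List<String> columns(String table);           // metadata API: what does the table look like?
    }

    interface ToySplitManager {
        List<int[]> splits(String table);             // data location API: each split is [startRow, endRow)
    }

    interface ToyRecordSetProvider {
        List<String[]> read(String table, int[] split); // data stream API: rows of one split
    }

    interface ToyConnector {
        ToyMetadata getMetadata();
        ToySplitManager getSplitManager();
        ToyRecordSetProvider getRecordSetProvider();
    }

    // In-memory connector serving a single three-row table in two splits.
    static final class InMemoryConnector implements ToyConnector {
        private final String[][] rows = {{"0", "AFRICA"}, {"1", "AMERICA"}, {"2", "ASIA"}};

        public ToyMetadata getMetadata() {
            return table -> Arrays.asList("regionkey", "name");
        }

        public ToySplitManager getSplitManager() {
            return table -> Arrays.asList(new int[] {0, 2}, new int[] {2, 3});
        }

        public ToyRecordSetProvider getRecordSetProvider() {
            return (table, split) -> new ArrayList<>(Arrays.asList(rows).subList(split[0], split[1]));
        }
    }

    // Engine side: ask for the schema, then the splits, then stream each split's rows.
    static List<String> scan(ToyConnector connector, String table) {
        List<String> columns = connector.getMetadata().columns(table);
        List<String> out = new ArrayList<>();
        for (int[] split : connector.getSplitManager().splits(table)) {
            for (String[] row : connector.getRecordSetProvider().read(table, split)) {
                StringBuilder line = new StringBuilder();
                for (int i = 0; i < columns.size(); i++) {
                    if (i > 0) {
                        line.append(", ");
                    }
                    line.append(columns.get(i)).append('=').append(row[i]);
                }
                out.add(line.toString());
            }
        }
        return out;
    }

    static List<String> demo() {
        return scan(new InMemoryConnector(), "region");
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In a real cluster the splits would be handed to different workers. The sketch reads them in a single loop to stay short.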
20
➔ Data stays in memory during execution and is pipelined across nodes MPP-style
➔ Vectorized columnar processing
➔ Presto is written in highly tuned Java
◆ Efficient in-memory data structures
◆ Very careful coding of inner loops
◆ Bytecode generation
➔ Optimized ORC reader
➔ Predicate push-down
➔ Query optimizer
Presto = Performance
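Predicate push-down can be sketched with per-chunk min/max statistics of the kind columnar formats such as ORC keep: if the statistics prove that no row in a chunk can satisfy the filter, the chunk is never read. The code below is a minimal illustration under that assumption. Stripe and countGreaterThan are hypothetical names, not the ORC reader's API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of statistics-based predicate push-down; not the real ORC reader.
class PredicatePushdownSketch {
    // A chunk of one column plus the min/max statistics stored alongside it.
    static final class Stripe {
        final long[] values;
        final long min;
        final long max;

        Stripe(long... values) {
            this.values = values;
            long lo = Long.MAX_VALUE;
            long hi = Long.MIN_VALUE;
            for (long v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }
    }

    // Returns {matching rows, stripes actually scanned} for the predicate value > threshold.
    static long[] countGreaterThan(List<Stripe> stripes, long threshold) {
        long matches = 0;
        long scanned = 0;
        for (Stripe s : stripes) {
            if (s.max <= threshold) {
                continue; // statistics prove no row can match, so skip the read entirely
            }
            scanned++;
            for (long v : s.values) {
                if (v > threshold) {
                    matches++;
                }
            }
        }
        return new long[] {matches, scanned};
    }

    // Three stripes; the first one has max 9 and is skipped for threshold 11.
    static long[] demo() {
        List<Stripe> stripes = Arrays.asList(
                new Stripe(1, 5, 9), new Stripe(10, 12, 30), new Stripe(40, 41));
        return countGreaterThan(stripes, 11);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```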
21
➔ Facebook
◆ Multiple production clusters (100s of nodes total)
● Including 300PB Hadoop data warehouse
◆ 1000s of internal daily active users
◆ Millions of queries each month
◆ Multiple PBs scanned every day
◆ Trillions of rows a day
➔ Netflix
◆ Over 200-node production cluster on EC2
◆ Over 15 PB in S3 (Parquet format)
◆ Over 300 users and 2.5K queries daily
Presto in Production
22
➔ 100% open source contributions to Presto to increase adoption in the enterprise
➔ A multi-year roadmap commitment to phased enhancements of the open source code
➔ The first ever commercial support offering for Presto
What is Teradata Doing?
Teradata Certified Presto
www.teradata.com/presto
23
➔ Hadoop Distro Agnostic
➔ Modern Code Base
◆ Presto is well-designed open source software with proper database architecture
➔ Strong Like-Minded Community
➔ Push down processing across multiple data platforms
➔ Leverage Teradata expertise to make SQL for Hadoop viable
Why is Teradata Contributing to Presto?
24
➔ Phase 1: Implement (June 8, 2015)
➔ Phase 2: Integrate (Q4 2015)
➔ Phase 3: Proliferate (2016)
➔ Installer, Documentation, Monitoring & Support Tools
➔ ODBC / JDBC Drivers, BI Certification, Security, Connectors
➔ Management Tools Integration, YARN Integration
➔ Commercial Support
➔ Expanding ANSI SQL Coverage
Teradata Contributions to Presto
25
➔ Ease of install and management via Presto-Admin tool
◆ www.github.com/prestodb/presto-admin
◆ Packaging Presto as an RPM
➔ Testing Framework for Presto
◆ www.github.com/prestodb/tempto
◆ Added large number of tests
➔ JDBC driver for Java 6
➔ Various SQL improvements
Teradata’s Contributions
26
➔ Continued SQL Improvements
➔ Security – Authentication & Authorization
➔ More Connectors – e.g. HBase
➔ ODBC & JDBC Drivers that actually work
➔ BI tool certifications – e.g. Tableau
➔ YARN Integration
➔ Ambari Integration
➔ Open Source our Docker based Dev Env - WIP
➔ Open our Continuous Integration platform to the community
Teradata’s Contribution Product Roadmap
27
Teradata Engineers Dedicated to Presto
28
“Presto is an integral part of the Airbnb data infrastructure stack with hundreds of employees running queries each day with the technology. We are excited to see Teradata joining the Presto open source community and are encouraged by the direction of their contributions” - James Mayfield, product lead, Airbnb.
“We are excited to see Teradata's commitment to Presto and adding capabilities in the open source domain. This will create interesting opportunities within our technical and business teams to open up more access options to our critical data. We think this is a positive for Teradata and for the community as a whole” - Steve Deasy, vice president of Engineering, Groupon.
Early Feedback is Extremely Positive
29
Demo Time!
30
www.github.com/facebook/presto
www.github.com/prestodb
Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto User’s Group: www.groups.google.com/group/presto-users
Facebook Page: www.facebook.com/prestodb
Twitter: #prestodb
How can I contribute?
32
Available for Download
➔ Presto 101t Server, CLI, JDBC
➔ Presto-Admin 0.1
➔ Documentation
➔ HDP w/ Presto VM Sandbox
➔ CDH w/ Presto VM Sandbox
www.teradata.com/presto
Presto 101t certified by Teradata