presto for the enterprise @ hadoop meetup

Warsaw Hadoop User Group
Wojciech Biela
Łukasz Osipiuk

www.teradata.com/presto

Who are we? - Teradata Center for Hadoop!

➔ History of Teradata Center for Hadoop
◆ Formerly Hadapt
◆ Founded in July 2010 by Justin Borgman, Kamil Bajda-Pawlikowski, and Daniel Abadi
◆ Pioneered the SQL-on-Hadoop market
◆ Based on work done by the database research group in the Yale Computer Science Department
◆ Hybrid of Hadoop scalability and DBMS performance

➔ Today
◆ Acquired by Teradata in July 2014, renamed Teradata Center for Hadoop
◆ 20+ developers with deep Hadoop and database expertise
◆ Headquarters in Boston, MA
◆ Teams in the US (MA, CA) and Poland (Warsaw)
◆ Contributors to the open source project Presto

Talk Agenda

➔ What is Presto?
➔ What is Teradata doing?
➔ Can I see a demo?
➔ How can I contribute?

What is Presto?

➔ 100% open source distributed ANSI SQL engine for Big Data
◆ Modern code base
◆ Proven scalability

➔ Optimized for low-latency, interactive querying
◆ Cross-platform query capability, not only SQL on Hadoop
◆ Distributed under the Apache license, now supported by Teradata
◆ Used by a community of well-known, well-respected technology companies

History of Presto

FALL 2008: Facebook open sources Hive
FALL 2012: 6 developers start Presto development
SPRING 2013: Presto rolled out within Facebook
FALL 2013: Facebook open sources Presto
FALL 2014: 88 releases, 41 contributors, 3943 commits
SPRING 2015: 98 releases, 65 contributors, 4587 commits; Teradata joins the Presto community and offers support

Query Execution

[Figure: Presto architecture diagram. Client; Coordinator containing the Parser/analyzer, Planner, and Scheduler; three Workers; pluggable Metadata API, Data location API, and Data stream API.]

Logical and fragmented plan

select shipdate,
       count(*) count,
       cast(sum(extendedprice) as bigint) price
from h_lineitem
where returnflag = 'R'
group by shipdate
order by count
limit 20
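One way to picture how a grouped query like the one on this slide can be split into fragments is a two-phase aggregation: each worker aggregates its own part of the data, and a later step merges the partial results. The sketch below is only an illustration under that assumption; the slide text does not show the actual plan, and the class and method names (`TwoPhaseCount`, `partial`, `merge`) are hypothetical, not Presto code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Presto code: a "group by ... count(*)" split into
// a partial step (run per worker) and a final step (merging partial results).
final class TwoPhaseCount {
    // Partial aggregation: one worker counts the ship dates in its own split.
    static Map<String, Long> partial(List<String> shipDates) {
        Map<String, Long> counts = new HashMap<>();
        for (String date : shipDates) {
            counts.merge(date, 1L, Long::sum);
        }
        return counts;
    }

    // Final aggregation: add up the partial counts received from all workers.
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> p : partials) {
            p.forEach((date, count) -> total.merge(date, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(partial(Arrays.asList("1995-01-01", "1995-01-02", "1995-01-01")));
        partials.add(partial(Arrays.asList("1995-01-02", "1995-01-03")));
        System.out.println(merge(partials));
    }
}
```

The ordering and limit steps of the query would then run on the merged result, which is much smaller than the input rows.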

Logical and fragmented plan

select *
from hive.default.h_nation,
     psql.public.p_region
where h_nation.regionkey = p_region.regionkey;

Query Execution

[Figure: Presto architecture diagram (Client, Coordinator with Parser/analyzer, Planner, and Scheduler, Workers, pluggable Metadata API, Data location API, and Data stream API), annotated with pages; each page is made up of blocks (blockA, blockB, ...).]
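The page and block labels in this figure suggest a columnar layout: a page carries a batch of rows, stored as one block per column. The following is a minimal sketch under that assumption. `PageSketch`, `Page`, and `LongBlock` are hypothetical names chosen for illustration and are not Presto's actual classes.

```java
// Hypothetical sketch, not Presto's actual classes: a page holds one block
// per column, and each block stores the values of that column contiguously.
final class PageSketch {
    // A block of 64-bit integer values for a single column.
    static final class LongBlock {
        final long[] values;

        LongBlock(long... values) {
            this.values = values;
        }
    }

    // A page groups equally sized blocks; row i is the i-th value of each block.
    static final class Page {
        final LongBlock[] blocks;
        final int positionCount;

        Page(LongBlock... blocks) {
            if (blocks.length == 0) {
                throw new IllegalArgumentException("a page needs at least one block");
            }
            for (LongBlock block : blocks) {
                if (block.values.length != blocks[0].values.length) {
                    throw new IllegalArgumentException("blocks must have equal length");
                }
            }
            this.blocks = blocks;
            this.positionCount = blocks[0].values.length;
        }
    }

    // Summing one column touches only that column's block.
    static long sumColumn(Page page, int channel) {
        long sum = 0;
        for (long value : page.blocks[channel].values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Page page = new Page(new LongBlock(1, 2, 3), new LongBlock(10, 20, 30));
        System.out.println(sumColumn(page, 1)); // prints 60
    }
}
```

Keeping each column in its own array means an operator that needs only one column can loop over a single primitive array instead of walking row objects.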

Plan execution

[Figure: plan execution in Hive and in Presto, side by side. Hive: map and reduce stages with I/O steps. Presto: tasks with I/O steps.]

Presto Extensibility – plugins

➔ Connectors

➔ Data types

➔ Extra functions

➔ (new) Security providers

Presto Extensibility – connector interfaces

[Figure: Coordinator (Parser/analyzer, Planner, Scheduler) and Worker, with connector implementations (Hive, Cassandra, Kafka, MySQL, …) plugged into each of the Metadata API, Data location API, and Data stream API.]

Presto Extensibility – connector interfaces

public interface Connector {
    ConnectorHandleResolver getHandleResolver();
    ConnectorMetadata getMetadata();
    ConnectorSplitManager getSplitManager();
    ConnectorPageSourceProvider getPageSourceProvider();
    ConnectorRecordSetProvider getRecordSetProvider();
    ConnectorPageSinkProvider getPageSinkProvider();
    ConnectorRecordSinkProvider getRecordSinkProvider();
    ConnectorIndexResolver getIndexResolver();
    Set<SystemTable> getSystemTables();
    List<PropertyMetadata<?>> getSessionProperties();
    List<PropertyMetadata<?>> getTableProperties();
    ConnectorAccessControl getAccessControl();
    // Optional hook; connectors need not override it.
    default void shutdown() {}
}
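To make the division of roles in the interface above more concrete, here is a toy sketch of how an engine might use a connector: ask it for metadata, ask it where the data is (splits), then read each split. Every type below (`ToyConnector`, `ToyMetadata`, `ToySplitManager`, `ToyRecordSetProvider`, `InMemoryConnector`) is a hypothetical, heavily simplified stand-in invented for this example. None of them is a real Presto SPI type or signature.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, heavily simplified stand-ins for the roles a connector
// plays. These are NOT the real Presto SPI types or signatures.
final class ToyConnectorSketch {
    interface ToyMetadata { List<String> listColumns(String table); }
    interface ToySplitManager { List<String> getSplits(String table); }
    interface ToyRecordSetProvider { List<List<Object>> readSplit(String split); }

    interface ToyConnector {
        ToyMetadata getMetadata();
        ToySplitManager getSplitManager();
        ToyRecordSetProvider getRecordSetProvider();
    }

    // An in-memory connector exposing one single-column table in two splits.
    static final class InMemoryConnector implements ToyConnector {
        public ToyMetadata getMetadata() {
            return table -> Collections.singletonList("n");
        }

        public ToySplitManager getSplitManager() {
            return table -> Arrays.asList(table + "#0", table + "#1");
        }

        public ToyRecordSetProvider getRecordSetProvider() {
            // Split "#0" holds the rows 1 and 2; split "#1" holds the row 3.
            return split -> split.endsWith("#0")
                    ? Arrays.asList(row(1), row(2))
                    : Collections.singletonList(row(3));
        }

        private static List<Object> row(Object value) {
            return Collections.singletonList(value);
        }
    }

    // What an engine would do: ask for the splits, then read each split.
    static int countRows(ToyConnector connector, String table) {
        int rows = 0;
        for (String split : connector.getSplitManager().getSplits(table)) {
            rows += connector.getRecordSetProvider().readSplit(split).size();
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(countRows(new InMemoryConnector(), "numbers")); // prints 3
    }
}
```

In this toy version the split manager plays the part of the data location API and the record set provider plays the part of the data stream API; a real connector would also have to supply handles, types, and the other providers listed in the interface.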

Presto = Performance

➔ Data stays in memory during execution and is pipelined across nodes MPP-style
➔ Vectorized columnar processing
➔ Presto is written in highly tuned Java
◆ Efficient in-memory data structures
◆ Very careful coding of inner loops
◆ Bytecode generation
➔ Optimized ORC reader
➔ Predicate push-down
➔ Query optimizer
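As an illustration of the predicate push-down item on this slide, the sketch below hands a filter to the reader so that it can skip data that cannot match, using per-chunk min/max statistics. This is similar in spirit to the statistics that columnar formats such as ORC store, but the code is only a sketch under that assumption: `PushDownSketch`, `Chunk`, and `ScanResult` are hypothetical names, and nothing here is Presto's reader code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of predicate push-down: the reader receives the filter
// "value >= threshold" and uses per-chunk statistics to skip chunks that
// cannot contain a matching row.
final class PushDownSketch {
    static final class Chunk {
        final long[] values;
        final long min;
        final long max;

        Chunk(long... values) {
            this.values = values;
            // Statistics computed once, when the chunk is written.
            this.min = Arrays.stream(values).min().orElse(Long.MAX_VALUE);
            this.max = Arrays.stream(values).max().orElse(Long.MIN_VALUE);
        }
    }

    static final class ScanResult {
        final List<Long> matches = new ArrayList<>();
        int chunksRead;
    }

    static ScanResult scan(List<Chunk> chunks, long threshold) {
        ScanResult result = new ScanResult();
        for (Chunk chunk : chunks) {
            if (chunk.max < threshold) {
                continue; // statistics prove that no row matches: skip the chunk
            }
            result.chunksRead++;
            for (long value : chunk.values) {
                if (value >= threshold) {
                    result.matches.add(value);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(new Chunk(1, 2, 3), new Chunk(10, 20, 30));
        ScanResult result = scan(chunks, 15);
        // prints "[20, 30] from 1 chunk(s)"
        System.out.println(result.matches + " from " + result.chunksRead + " chunk(s)");
    }
}
```

The first chunk is never decoded because its maximum (3) is below the threshold, which is the saving push-down aims for.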

Presto in Production

➔ Facebook
◆ Multiple production clusters (100s of nodes total)
  ● Including 300PB Hadoop data warehouse
◆ 1000s of internal daily active users
◆ Millions of queries each month
◆ Multiple PBs scanned every day
◆ Trillions of rows a day

➔ Netflix
◆ Over 200-node production cluster on EC2
◆ Over 15 PB in S3 (Parquet format)
◆ Over 300 users and 2.5K queries daily

What is Teradata Doing?

➔ 100% open source contributions to Presto to increase adoption in the enterprise
➔ A multi-year roadmap commitment to phased enhancements of the open source code
➔ The first ever commercial support offering for Presto

Teradata Certified Presto: www.teradata.com/presto

Why is Teradata Contributing to Presto?

➔ Hadoop Distro Agnostic
➔ Modern Code Base
◆ Presto is well-designed open source software with proper database architecture
➔ Strong Like-Minded Community
➔ Push down processing across multiple data platforms
➔ Leverage Teradata expertise to make SQL for Hadoop viable

Teradata Contributions to Presto

Phase 1: Implement (June 8, 2015)
Phase 2: Integrate (Q4 2015)
Phase 3: Proliferate (2016)

Items shown across the phases:
◆ Installer
◆ Documentation
◆ Monitoring & Support Tools
◆ Management Tools Integration
◆ YARN Integration
◆ Connectors
◆ ODBC / JDBC Drivers
◆ BI Certification
◆ Security
◆ Commercial Support
◆ Expanding ANSI SQL Coverage

Teradata’s Contributions

➔ Ease of install and management via the Presto-Admin tool
◆ www.github.com/prestodb/presto-admin
◆ Packaging Presto as an RPM
➔ Testing framework for Presto
◆ www.github.com/prestodb/tempto
◆ Added a large number of tests
➔ JDBC driver for Java 6
➔ Various SQL improvements

Teradata’s Contribution Product Roadmap

➔ Continued SQL Improvements
➔ Security – Authentication & Authorization
➔ More Connectors – e.g. HBase
➔ ODBC & JDBC Drivers that actually work
➔ BI tool certifications – e.g. Tableau
➔ YARN Integration
➔ Ambari Integration
➔ Open Source our Docker based Dev Env - WIP
➔ Open our Continuous Integration platform to the community

Teradata Engineers Dedicated to Presto

Early Feedback is Extremely Positive

“Presto is an integral part of the Airbnb data infrastructure stack with hundreds of employees running queries each day with the technology. We are excited to see Teradata joining the Presto open source community and are encouraged by the direction of their contributions.”
- James Mayfield, product lead, Airbnb

“We are excited to see Teradata's commitment to Presto and adding capabilities in the open source domain. This will create interesting opportunities within our technical and business teams to open up more access options to our critical data. We think this is a positive for Teradata and for the community as a whole.”
- Steve Deasy, vice president of Engineering, Groupon

Demo Time!

How can I contribute?

www.github.com/facebook/presto
www.github.com/prestodb

Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto User’s Group: www.groups.google.com/group/presto-users
Facebook Page: www.facebook.com/prestodb
Twitter: #prestodb

Join us!

We’re hiring!
➔ Warsaw
➔ Boston

Job Offer: bit.do/presto
Contact: [email protected]

Presto 101t certified by Teradata

Available for Download
➔ Presto 101t Server, CLI, JDBC
➔ Presto-Admin 0.1
➔ Documentation
➔ HDP w/ Presto VM Sandbox
➔ CDH w/ Presto VM Sandbox

www.teradata.com/presto

[email protected]

[email protected]
