Presto for the Enterprise @ Hadoop Meetup


Upload: wojciech-biela

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Presto for the Enterprise @ Hadoop Meetup

Warsaw Hadoop User Group
Wojciech Biela
Łukasz Osipiuk

www.teradata.com/presto

Page 2: Presto for the Enterprise @ Hadoop Meetup

Who are we? - Teradata Center for Hadoop!

➔ History of Teradata Center for Hadoop
  ◆ Formerly Hadapt, founded in July 2010 by Justin Borgman, Kamil Bajda-Pawlikowski, and Daniel Abadi
  ◆ Pioneered SQL-on-Hadoop market
  ◆ Based on work done by database research group in Yale Computer Science Department
  ◆ Hybrid of Hadoop scalability and DBMS performance

➔ Today
  ◆ Acquired by Teradata in July 2014, renamed Teradata Center for Hadoop
  ◆ 20+ developers with deep Hadoop and database expertise
  ◆ Headquarters in Boston, MA
  ◆ Teams in US (MA, CA) and Poland (Warsaw)
  ◆ Contributors to open source project Presto

Page 3: Presto for the Enterprise @ Hadoop Meetup

Talk Agenda

➔ What is Presto?
➔ What is Teradata doing?
➔ Can I see a demo?
➔ How can I contribute?

Page 4: Presto for the Enterprise @ Hadoop Meetup

What is Presto?

➔ 100% open source distributed ANSI SQL engine for Big Data
  ◆ Modern code base
  ◆ Proven scalability

➔ Optimized for low-latency, interactive querying
  ◆ Cross-platform query capability, not only SQL on Hadoop
  ◆ Distributed under the Apache license, now supported by Teradata
  ◆ Used by a community of well-known, well-respected technology companies

Page 5: Presto for the Enterprise @ Hadoop Meetup

History of Presto

➔ Fall 2008: Facebook open sources Hive
➔ Fall 2012: 6 developers start Presto development
➔ Spring 2013: Presto rolled out within Facebook
➔ Fall 2013: Facebook open sources Presto
➔ Fall 2014: 88 releases, 41 contributors, 3943 commits
➔ Spring 2015: 98 releases, 65 contributors, 4587 commits; Teradata joins Presto community & offers support

Page 6: Presto for the Enterprise @ Hadoop Meetup

Query Execution

[Architecture diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); three Workers; pluggable Metadata API, Data location API, and Data stream API.]

Page 7: Presto for the Enterprise @ Hadoop Meetup

Query Execution

[Architecture diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); three Workers; pluggable Metadata API, Data location API, and Data stream API.]

Page 8: Presto for the Enterprise @ Hadoop Meetup

Query Execution

[Architecture diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); three Workers; pluggable Metadata API, Data location API, and Data stream API.]

Page 9: Presto for the Enterprise @ Hadoop Meetup

Query Execution

[Architecture diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); three Workers; pluggable Metadata API, Data location API, and Data stream API.]

Page 10: Presto for the Enterprise @ Hadoop Meetup

Logical and fragmented plan

select shipdate,
       count(*) count,
       cast(sum(extendedprice) as bigint) price
from h_lineitem
where returnflag = 'R'
group by shipdate
order by count
limit 20
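To make the query on this slide concrete, here is a minimal Java sketch of what it computes over an in-memory list. It is not Presto code and says nothing about how Presto fragments the plan. The `LineItem` and `ResultRow` records, the sample column types, and the use of rounding for the `cast(... as bigint)` step are all assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory stand-in for the h_lineitem table; not Presto code.
final class LineitemQuerySketch {
    record LineItem(String shipdate, String returnflag, double extendedprice) {}
    record ResultRow(String shipdate, long count, long price) {}

    static List<ResultRow> run(List<LineItem> lineitem) {
        // where returnflag = 'R' ... group by shipdate
        // (TreeMap keeps groups ordered by shipdate, so ties on count are deterministic)
        Map<String, DoubleSummaryStatistics> groups = lineitem.stream()
                .filter(li -> li.returnflag().equals("R"))
                .collect(Collectors.groupingBy(
                        LineItem::shipdate,
                        TreeMap::new,
                        Collectors.summarizingDouble(LineItem::extendedprice)));

        // select shipdate, count(*), cast(sum(extendedprice) as bigint)
        // order by count limit 20
        return groups.entrySet().stream()
                .map(e -> new ResultRow(
                        e.getKey(),
                        e.getValue().getCount(),
                        Math.round(e.getValue().getSum()))) // assumption: the cast rounds
                .sorted(Comparator.comparingLong(ResultRow::count))
                .limit(20)
                .toList();
    }
}
```

In a distributed engine the filter and a partial aggregation would typically run close to the data, with the final aggregation, sort, and limit applied after an exchange; the sketch collapses all of that into one process.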

Page 11: Presto for the Enterprise @ Hadoop Meetup

Logical and fragmented plan

select *
from hive.default.h_nation,
     psql.public.p_region
where h_nation.regionkey = p_region.regionkey;
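The query above joins a table from a Hive catalog with one from a PostgreSQL catalog. As a rough illustration of the join itself, the sketch below runs a hash join over two in-memory lists that stand in for the two sources. The slide does not say which join algorithm Presto picks here, so the hash join, the record types, and their columns are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for hive.default.h_nation and psql.public.p_region.
final class FederatedJoinSketch {
    record Nation(String name, long regionkey) {}
    record Region(long regionkey, String name) {}
    record Joined(String nation, long regionkey, String region) {}

    static List<Joined> join(List<Nation> hiveNation, List<Region> psqlRegion) {
        // Build side: index the region rows by join key.
        Map<Long, List<Region>> byKey = new HashMap<>();
        for (Region r : psqlRegion) {
            byKey.computeIfAbsent(r.regionkey(), k -> new ArrayList<>()).add(r);
        }
        // Probe side: look up each nation row's regionkey in the index.
        List<Joined> out = new ArrayList<>();
        for (Nation n : hiveNation) {
            for (Region r : byKey.getOrDefault(n.regionkey(), List.of())) {
                out.add(new Joined(n.name(), n.regionkey(), r.name()));
            }
        }
        return out;
    }
}
```

The point of the example is only that, once both connectors deliver rows in a common in-memory form, the join logic does not need to know where each side came from.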

Page 12: Presto for the Enterprise @ Hadoop Meetup

Query Execution

[Architecture diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); three Workers; pluggable Metadata API, Data location API, and Data stream API.]

Page 13: Presto for the Enterprise @ Hadoop Meetup

Query Execution

[Architecture diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); three Workers; pluggable Metadata API, Data location API, and Data stream API.]

Page 14: Presto for the Enterprise @ Hadoop Meetup

Query Execution

[Architecture diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); three Workers; pluggable Metadata API, Data location API, and Data stream API.]

[Diagram detail: data is shown as a sequence of pages, each containing blocks (blockA, blockB, ...).]
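The page and block labels on this slide suggest a columnar layout: a page is a batch of rows, and each block holds one column of that batch. The sketch below is a simplified illustration of that idea. `LongBlock`, `Page`, and `sumColumn` are hypothetical names for this example, not Presto's own classes.

```java
import java.util.List;

// Simplified, hypothetical model of pages and blocks; not Presto's classes.
final class PageSketch {
    /** A block: the values of one column for a batch of rows. */
    record LongBlock(long[] values) {}

    /** A page: one block per column, all with the same number of rows. */
    record Page(List<LongBlock> blocks) {
        Page {
            for (LongBlock b : blocks) {
                if (b.values().length != blocks.get(0).values().length) {
                    throw new IllegalArgumentException("blocks must have the same row count");
                }
            }
        }

        int rowCount() {
            return blocks.isEmpty() ? 0 : blocks.get(0).values().length;
        }
    }

    /** Sums one column across a stream of pages, touching only that column's block. */
    static long sumColumn(Iterable<Page> pages, int channel) {
        long sum = 0;
        for (Page page : pages) {
            for (long v : page.blocks().get(channel).values()) {
                sum += v; // tight loop over a primitive array
            }
        }
        return sum;
    }
}
```

Because an aggregation over one column reads only that column's block, the inner loop runs over a primitive array and never touches the other columns.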

Page 15: Presto for the Enterprise @ Hadoop Meetup

Query Execution

[Architecture diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); three Workers; pluggable Metadata API, Data location API, and Data stream API.]

Page 16: Presto for the Enterprise @ Hadoop Meetup

Plan execution

[Diagram comparing plan execution in Hive (map, reduce, and repeated I/O steps) with Presto (tasks and I/O).]

Page 17: Presto for the Enterprise @ Hadoop Meetup


Presto Extensibility – plugins

➔ Connectors

➔ Data types

➔ Extra functions

➔ (new) Security providers

Page 18: Presto for the Enterprise @ Hadoop Meetup

Presto Extensibility – connector interfaces

[Diagram: Coordinator (Parser/analyzer, Planner, Scheduler) and Worker, with three connector interfaces (Metadata API, Data location API, Data stream API), each with implementations for Hive, Cassandra, Kafka, MySQL, …]

Page 19: Presto for the Enterprise @ Hadoop Meetup

Presto Extensibility – connector interfaces

public interface Connector
{
    ConnectorHandleResolver getHandleResolver();
    ConnectorMetadata getMetadata();
    ConnectorSplitManager getSplitManager();
    ConnectorPageSourceProvider getPageSourceProvider();
    ConnectorRecordSetProvider getRecordSetProvider();
    ConnectorPageSinkProvider getPageSinkProvider();
    ConnectorRecordSinkProvider getRecordSinkProvider();
    ConnectorIndexResolver getIndexResolver();
    Set<SystemTable> getSystemTables();
    List<PropertyMetadata<?>> getSessionProperties();
    List<PropertyMetadata<?>> getTableProperties();
    ConnectorAccessControl getAccessControl();
    default void shutdown() {}
}
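To give a feel for how a connector divides its work, here is a toy sketch with three roles: describing tables, splitting a table into units of work, and reading rows for one unit. The `Toy*` interfaces are hypothetical and heavily simplified; they are not the SPI types named in the interface above, and treating them as loose analogues of the Metadata, Data location, and Data stream APIs is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the Toy* interfaces are hypothetical, simplified stand-ins,
// not the real Presto SPI types.
final class ToyConnectorSketch {
    interface ToyMetadata { List<String> columnNames(String table); }
    interface ToySplitManager { List<int[]> splits(String table, int rowsPerSplit); }
    interface ToyRecordSetProvider { List<List<Object>> read(String table, int[] split); }

    /** One in-memory table served through all three toy interfaces. */
    static final class InMemoryConnector
            implements ToyMetadata, ToySplitManager, ToyRecordSetProvider {
        private final String table;
        private final List<String> columns;
        private final List<List<Object>> rows;

        InMemoryConnector(String table, List<String> columns, List<List<Object>> rows) {
            this.table = table;
            this.columns = columns;
            this.rows = rows;
        }

        private void check(String name) {
            if (!table.equals(name)) {
                throw new IllegalArgumentException("unknown table: " + name);
            }
        }

        @Override
        public List<String> columnNames(String name) {
            check(name);
            return columns;
        }

        /** Each split is a half-open row range {start, end}. */
        @Override
        public List<int[]> splits(String name, int rowsPerSplit) {
            check(name);
            if (rowsPerSplit <= 0) {
                throw new IllegalArgumentException("rowsPerSplit must be positive");
            }
            List<int[]> result = new ArrayList<>();
            for (int start = 0; start < rows.size(); start += rowsPerSplit) {
                result.add(new int[] {start, Math.min(start + rowsPerSplit, rows.size())});
            }
            return result;
        }

        @Override
        public List<List<Object>> read(String name, int[] split) {
            check(name);
            return rows.subList(split[0], split[1]);
        }
    }
}
```

In this toy model a scheduler would ask for the splits once and hand each one to a different worker, which then calls `read` for its own split only.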

Page 20: Presto for the Enterprise @ Hadoop Meetup

Presto = Performance

➔ Data stays in memory during execution and is pipelined across nodes MPP-style
➔ Vectorized columnar processing
➔ Presto is written in highly tuned Java
  ◆ Efficient in-memory data structures
  ◆ Very careful coding of inner loops
  ◆ Bytecode generation
➔ Optimized ORC reader
➔ Predicate push-down
➔ Query optimizer
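Of the items on this slide, predicate push-down is the easiest to show in a few lines. The sketch below compares filtering in the engine after a full scan with handing the predicate to the data source, and counts how many rows the source returns in each case. The `Source` class and its `rowsReturned` counter are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

// Hypothetical data source used to illustrate predicate push-down; not Presto code.
final class PushdownSketch {
    static final class Source {
        private final long[] values;
        long rowsReturned = 0; // rows handed back to the engine

        Source(long[] values) {
            this.values = values;
        }

        /** Scans the data and returns only rows matching the pushed-down predicate. */
        List<Long> scan(LongPredicate pushedDown) {
            List<Long> out = new ArrayList<>();
            for (long v : values) {
                if (pushedDown.test(v)) {
                    out.add(v);
                    rowsReturned++;
                }
            }
            return out;
        }
    }

    /** The engine pulls every row and filters afterwards. */
    static long countWithoutPushdown(Source source, LongPredicate predicate) {
        return source.scan(v -> true).stream().filter(predicate::test).count();
    }

    /** The predicate travels to the source, so only matching rows come back. */
    static long countWithPushdown(Source source, LongPredicate predicate) {
        return source.scan(predicate).size();
    }
}
```

Both methods give the same answer; the difference is how many rows cross the boundary between the source and the engine, which is where the savings come from when the source is remote or stored in a format that can skip data.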

Page 21: Presto for the Enterprise @ Hadoop Meetup

Presto in Production

➔ Facebook
  ◆ Multiple production clusters (100s of nodes total)
    ● Including 300 PB Hadoop data warehouse
  ◆ 1000s of internal daily active users
  ◆ Millions of queries each month
  ◆ Multiple PBs scanned every day
  ◆ Trillions of rows a day

➔ Netflix
  ◆ Over 200-node production cluster on EC2
  ◆ Over 15 PB in S3 (Parquet format)
  ◆ Over 300 users and 2.5K queries daily

Page 22: Presto for the Enterprise @ Hadoop Meetup

What is Teradata Doing?

➔ 100% open source contributions to Presto to increase adoption in the enterprise
➔ A multi-year roadmap commitment to phased enhancements of the open source code
➔ The first ever commercial support offering for Presto

Teradata Certified Presto
www.teradata.com/presto

Page 23: Presto for the Enterprise @ Hadoop Meetup

Why is Teradata Contributing to Presto?

➔ Hadoop Distro Agnostic
➔ Modern Code Base
  ◆ Presto is well-designed open source software with proper database architecture
➔ Strong Like-Minded Community
➔ Push down processing across multiple data platforms
➔ Leverage Teradata expertise to make SQL for Hadoop viable

Page 24: Presto for the Enterprise @ Hadoop Meetup

Teradata Contributions to Presto

[Roadmap diagram with three phases]
➔ Phase 1 (June 8, 2015): Implement
➔ Phase 2 (Q4 2015): Integrate
➔ Phase 3 (2016): Proliferate
➔ Items shown: Installer; Documentation; Monitoring & Support Tools; ODBC / JDBC Drivers; BI Certification; Security; Connectors; Management Tools Integration; YARN Integration; Commercial Support; Expanding ANSI SQL Coverage

Page 25: Presto for the Enterprise @ Hadoop Meetup

Teradata’s Contributions

➔ Ease of install and management via Presto-Admin tool
  ◆ www.github.com/prestodb/presto-admin
  ◆ Packaging Presto as an RPM
➔ Testing Framework for Presto
  ◆ www.github.com/prestodb/tempto
  ◆ Added large number of tests
➔ JDBC driver for Java 6
➔ Various SQL improvements

Page 26: Presto for the Enterprise @ Hadoop Meetup

Teradata’s Contribution Product Roadmap

➔ Continued SQL Improvements
➔ Security – Authentication & Authorization
➔ More Connectors – e.g. HBase
➔ ODBC & JDBC Drivers that actually work
➔ BI tool certifications – e.g. Tableau
➔ YARN Integration
➔ Ambari Integration
➔ Open Source our Docker based Dev Env - WIP
➔ Open our Continuous Integration platform to the community

Page 27: Presto for the Enterprise @ Hadoop Meetup


Teradata Engineers Dedicated to Presto

Page 28: Presto for the Enterprise @ Hadoop Meetup

Early Feedback is Extremely Positive

“Presto is an integral part of the Airbnb data infrastructure stack with hundreds of employees running queries each day with the technology. We are excited to see Teradata joining the Presto open source community and are encouraged by the direction of their contributions.” - James Mayfield, product lead, Airbnb

“We are excited to see Teradata's commitment to Presto and adding capabilities in the open source domain. This will create interesting opportunities within our technical and business teams to open up more access options to our critical data. We think this is a positive for Teradata and for the community as a whole.” - Steve Deasy, vice president of Engineering, Groupon

Page 29: Presto for the Enterprise @ Hadoop Meetup


Demo Time!

Page 30: Presto for the Enterprise @ Hadoop Meetup

How can I contribute?

➔ www.github.com/facebook/presto
➔ www.github.com/prestodb
➔ Certified Distro: www.teradata.com/presto
➔ Website: www.prestodb.io
➔ Presto User’s Group: www.groups.google.com/group/presto-users
➔ Facebook Page: www.facebook.com/prestodb
➔ Twitter: #prestodb

Page 31: Presto for the Enterprise @ Hadoop Meetup

Join us!

We’re hiring!
➔ Warsaw
➔ Boston

Job Offer: bit.do/presto

Contact: [email protected]

Page 32: Presto for the Enterprise @ Hadoop Meetup

Presto 101t certified by Teradata

Available for Download
➔ Presto 101t Server, CLI, JDBC
➔ Presto-Admin 0.1
➔ Documentation
➔ HDP w/ Presto VM Sandbox
➔ CDH w/ Presto VM Sandbox

www.teradata.com/presto