running fast, interactive queries on petabyte datasets using presto - aws july 2016 webinar series

28
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Keith Steward, Ph.D. Specialist (EMR) Solution Architect, AWS 2016-08-28 Running Fast, Interactive Queries on Petabyte Datasets using Presto on Amazon EMR

Upload: amazon-web-services

Post on 10-Jan-2017

504 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Keith Steward, Ph.D.Specialist (EMR) Solution Architect, AWS

2016-08-28

Running Fast, Interactive Queries on Petabyte Datasets using Presto on Amazon EMR

Page 2: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

What We’ll Cover:

The challenges of using Data Warehouses. Then a non-DW alternative.

High level steps for querying petabyte-scale data on S3.

Amazon S3

Amazon EMR

Apache Presto: History, Goals & Benefits, Architecture

Presto on EMR

Demo – Querying 29 years of U.S. Air Flights data on S3 using Presto on EMR.

Page 3: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Challenges in using Data WarehousesSc

hem

a-on

-Writ

e

Data

Data Warehouse extrac t load

schema

Significant “time to answer”

Sche

ma-

on-R

ead

Data

Shorter “time to answer”

s tored

$$$$

$$

Page 4: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

1. Store your petabyte-scale data in S3.

2. Configure & launch an EMR cluster with Presto.

3. Login to the EMR cluster

4. Expose S3 data as a Hive Table(s)

5. Issue SQL queries against the Hive table(s) using Presto.

6. Get query results.

How to Query Petabyte-Scale Datasets on S3?

Page 5: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Store anything (object storage)

Scalable / Elastic

99.999999999% durability

Effectively infinite inbound bandwidth

Extremely low cost: $0.03/GB-Mo; $30.72/TB-Mo

Data layer for virtually all AWS services

Amazon S3

Page 6: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Aggregate all Data in S3 as your Data LakeSurrounded by a collection of the right tools

EMR Kinesis

Redshift DynamoDB RDS

Data Pipeline

Spark Streaming Storm

Amazon S3

Import / Export Snowball

Page 7: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Exposing petabyte-scale datasets in S3 as Hive tables

hive>  CREATE EXTERNAL TABLE airdelays (yr INT,quarter INT,month INT,dayofmonth INT,dayofweek INT,flightdate STRING,uniquecarrier STRING,airlineid INT,. . .div5tailnum STRING)PARTITIONED BY (year STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n'LOCATION 's3://flightdelays-kls/csv’;

S3 bucket with data:

Ask Hive to expose S3

data as table:

hive> describe airdelays;OKyr int quarter int month int dayofmonth int dayofweek int flightdate string . . . div5wheelsoff string div5tailnum string year string

# Partition Information # col_name data_type comment

year string Time taken: 0.169 seconds, Fetched: 115 row(s)

Hive now knows about table:

Page 8: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Scalable Hadoop clusters as a service

Hadoop, Hive, Spark, Presto, Hbase, etc.

Easy to use; fully managed

On demand, reserved, spot pricing

HDFS, S3, and Amazon EBS filesystems

End to end security

Amazon EMR

Page 9: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

EMRFS makes it easier to leverage Amazon S3

Better performance and error handling options

Transparent to applications – just read/write to “s3://”

Support for Amazon S3 server-side and client-side encryption

Faster listing using EMRFS metadata

HDFS is still available via local instance storage or Amazon EBS

Page 10: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Amazon S3 as your cluster’s persistent data store

Amazon S3

Designed for 99.999999999% durability Separate compute and storage

Resize and shut down Amazon EMR clusters with no data loss

Point multiple Amazon EMR clusters at same data in Amazon S3 using the EMR File System (EMRFS)

Page 11: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Demo: Let’s spin up an EMR cluster (with Presto) …

Page 12: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

(History)

PB scale interactive query engine designed by Facebook in 2012

Originally designed for exploring existing Hive tables without triggering slow mapreduce jobs

Open Source in late 2013

Page 13: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

(Benefits)

In-memory distributed query engine

Support standard ANSI-SQL

Support rich analytical functions

Support wide range of data sources

Combine data from multiple sources in single query

Response time ranges from seconds to minutes

Page 14: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

ExtensibilityPluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, MySQL, SystemSchema, TPCH

JDBC, ODBC for commercial BI tools or Dashboards, like data visualization

Client Protocol: HTTP+JSON, support various languages (Python, Ruby, PHP, Node.js, Java(JDBC), C#,…)

ANSI SQLcomplex queries, joins, aggregations, various functions (Window functions)

High Performance: 10x faster than HiveE.g. Netflix: runs 3500+ Presto queries / day on 25+ PB dataset in S3 with 350 active platform users

(Features)

Page 15: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

High Level Architecture A distributed system that runs on a cluster of machines.

Components: a coordinator and multiple workers.

Queries are submitted by a client such as the Presto CLI to the coordinator.

The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers.

https://prestodb.io/overview.html

Page 16: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Presto Architecture

Page 17: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

In memory parallel queries

Pipeline task execution

Data local computation with multi-threading

Cache hot queries and data

Just-in-time compile-to-bye-code operator

SQL optimization

Other optimizations (e.g. Predicate Pushdown)

(Why is it so fast?)

Page 18: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Presto: In-memory processing and pipelining

Page 19: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Presto: Accessing Petabyte-Scale Datasets in S3

Any table known to Hive Metastore can be accessed / queried by Presto.

Including data in S3 exposed via CREATE EXTERNAL TABLE statements in Hive.

Page 20: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Embedded Mode

Uses Derby

not recommended for Production

Hive Metastore Deployment Modes

http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html

Page 21: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Embedded Mode

• Uses Derby

• not recommended for Production

Local Mode

• metastore service in same process as main HiveServer process

• Metastore db runs in separate process

http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html

Hive Metastore Deployment Modes

Page 22: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Embedded Mode

• Uses Derby

• not recommended for Production

Local Mode

• metastore service in same process as main HiveServer process

• Metastore db runs in separate process

Remote Mode

• metastore service runs in own JVM process

• processes communicate with it via Thrift network API

• metastore service communicates with metastore db over JDBC

http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html

Hive Metastore Deployment Modes

Page 23: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Supported Data Sources

Currently Presto provides connectors for the following Hive

Cassandra

MySQL

PostgreSQL

Kafka

Redis

Page 24: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Common Use Cases

When to use Presto? Need fast interactive query ability with high concurrency

Need ANSI SQL

When might you not want to use Presto? You focus on batch processing (ETL, enriching, aggregation, etc) for large data sets.

Hive or Spark recommended.

Need to compute (e.g. machine learning, graph algorithms) over the retrieved data. Spark recommended.

Star-schema organization of data. Amazon Redshift data warehouse recommended.

Page 25: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Airpal – a Presto GUI designed & open-sourced by Airbnb

Optional access controls for users

Search and find tables

See metadata, partitions, schemas & sample rows

Write queries in an easy-to-read editor

Submit queries through a web interface

Track query progress

Get the results back through the browser as a CSV

Create new Hive table based on the results of a query

Save queries once written

Searchable history of all queries run within the tool

Page 26: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Demo: Let’s now query 29 years worth of Air-traffic data in S3 using Presto on our EMR cluster…

Page 27: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Summary for Presto on EMR with Data in S3

Data in S3 is queryable using Presto on EMR

Presto is easy to deploy on Amazon EMR

Presto provides fast ad-hoc queries

Supports wide range of data sources

In-memory data processing with pipelining

Feature-rich

Increasing adoption & active community

Amazon S3

Amazon EMR

Page 28: Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Thank you!

Keith Steward,[email protected]

http://www.slideshare.net/GuorongLIANG/facebook-presto-presentation

https://prestodb.io https://github.com/airbnb/airpal#airpal https://github.com/treasure-data/prestogres

References:

If you want to run this demo later in your own AWS account, go to:

http://bit.ly/1Xg0111