presto: fast sql-on-anything · 2020. 12. 1. · presto: fast sql-on-anything including delta lake,...

Post on 30-Dec-2020


Presto: Fast SQL-on-Anything, including Delta Lake, Snowflake, Elasticsearch and more!

Kamil Bajda-Pawlikowski, Co-founder/CTO @ Starburst

Agenda

▪ Presto & Starburst

▪ Delta Lake Integration

▪ Data Platform Architecture

▪ Use Cases

Presto & Starburst

What is Presto?

High performance MPP SQL engine

• Interactive ANSI SQL queries

• Proven scalability

• High concurrency

Separation of compute & storage

• Scale storage & compute independently

• SQL-on-anything

• Federated queries

Community-driven open source project

Deploy Anywhere

• Kubernetes

• Cloud

• On premises

Presto Users

Facebook: 10,000+ nodes, 1,000s of users

Uber: 2,000+ nodes, 160K+ queries daily

LinkedIn: 500+ nodes, 200K+ queries daily

Lyft: 400+ nodes, 100K+ queries daily

Starburst


Enterprise Grade Security

On-Prem or Cloud

Rapid Time to Insights

Low Cost of Ownership

24x7 Expert Support

ANSI SQL MPP Query Engine

High Concurrency

Our Platform

Named Open Source Startup to Watch 2020

600% Growth YoY

100+ Enterprise Customers

NPS Score: 80+

Massive Scale

Starburst Enterprise Presto

Performance

• From petabytes to exabytes: query data from disparate sources using SQL, with high concurrency

• Control your price/performance with the latest cost-based optimizer

• Caching available for frequently accessed data

Connectivity

• 30+ supported enterprise connectors

• High performance parallel connectors for Oracle, Teradata, Snowflake and more

Security

• Kerberos & LDAP integration

• Global Security for fine-grained Access Control

• Data encryption

• Data masking

• Query auditing

Management

• Configuration

• Autoscaling

• High availability

• Monitoring

• Deploy anywhere

Support

• The largest team of Presto experts in the world

• Fully-tested, stable releases, curated by the Presto creators

• Hot fixes & security patches

• 24x7 support, 365 days a year: we’ve got your back


Starburst Customers

• Tech

• Retail

• Media & Telco

• Finance & Insurance

• Healthcare & Pharma

• Other

Delta Lake Integration

Why Delta Lake?

▪ ACID properties over data lake

▪ Open source table format

▪ Stored as Parquet files

▪ Object storage support

▪ Schema evolution

▪ Time travel feature

▪ Metadata & statistics

▪ Data skipping & z-ordering

Native Presto Delta Lake Reader

Supports data skipping & dynamic filtering

Optimizes query using file statistics

Supports reading the Delta transaction log

Native connector written from scratch
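
The data skipping idea on this slide can be illustrated with a small sketch. This is not the Starburst connector's code. It assumes a simplified, in-memory stand-in for the Delta transaction log in which each commit is a list of `add` and `remove` actions, and each `add` may carry already-parsed per-column `minValues`/`maxValues` statistics. The helper names (`active_files`, `files_to_scan`), the file paths and the `event_date` column are hypothetical.

```python
from typing import Any

# Simplified stand-in for the Delta transaction log: one list of actions per commit.
# Real logs are JSON files under _delta_log/; the shape is reduced here for illustration.
Log = list[list[dict[str, Any]]]


def active_files(log: Log) -> dict[str, dict[str, Any]]:
    """Replay commits in order and return the files in the current snapshot."""
    files: dict[str, dict[str, Any]] = {}
    for commit in log:
        for action in commit:
            if "add" in action:
                files[action["add"]["path"]] = action["add"]
            elif "remove" in action:
                files.pop(action["remove"]["path"], None)
    return files


def files_to_scan(log: Log, column: str, lower_bound: Any) -> list[str]:
    """Keep only files that may contain rows with `column >= lower_bound`."""
    selected = []
    for path, entry in active_files(log).items():
        stats = entry.get("stats")
        max_value = None if stats is None else stats.get("maxValues", {}).get(column)
        # Without statistics a file cannot be ruled out, so it must be read.
        if max_value is None or max_value >= lower_bound:
            selected.append(path)
    return sorted(selected)


def _stats(low: str, high: str) -> dict[str, Any]:
    """Build hypothetical min/max statistics for the event_date column."""
    return {"minValues": {"event_date": low}, "maxValues": {"event_date": high}}


# Hypothetical log: the second commit removes one file and adds two more.
example_log: Log = [
    [
        {"add": {"path": "part-0.parquet", "stats": _stats("2020-01-01", "2020-01-31")}},
        {"add": {"path": "part-1.parquet", "stats": _stats("2020-11-01", "2020-11-30")}},
    ],
    [
        {"remove": {"path": "part-1.parquet"}},
        {"add": {"path": "part-2.parquet"}},  # no statistics recorded
        {"add": {"path": "part-3.parquet", "stats": _stats("2020-11-01", "2020-12-15")}},
    ],
]

print(files_to_scan(example_log, "event_date", "2020-12-01"))
```

With a filter of `event_date >= '2020-12-01'`, `part-0.parquet` is skipped because its maximum value is too old, `part-1.parquet` is no longer part of the snapshot, and `part-2.parquet` is kept because it has no statistics to prune on.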

Native Delta Lake Reader Performance

Standard TPC-H benchmark:

▪ 2x average speedup across 22 queries

▪ 6x best query speedup

Feedback from customers:

▪ “What we have here is game changing for our industry. Especially now that the native Delta reader works as fast as it does. We have people lining up to now use this data”

▪ “We have queries that were running in 10 minutes that are now running in 47 seconds”

Data Platform Architecture

Starburst Platform

Consumers: Data Scientists, Data Analysts, Finance, Marketers

The Data Consumption Layer: existing analytics tools

Security: Data Masking, Global Security, Column + Row-level permissions, Query Auditing, Fine-grained access control, Data Encryption

Sources: Data Lakes, Relational Databases, NoSQL Stores, Publish/Subscribe (e.g. Azure Event Hub)

Different SQL Technologies In Your Toolbelt

Streaming Ingestion

Machine Learning

Data Investigation

Large Batch Jobs

Fast Federated Queries

High Concurrency SQL Engine

High Performance Ad Hoc Reporting/Analytics

Optionality

Cloud Data Warehouse

Rapid Ad Hoc Reporting/Analytics

Fast, but everything must live in Snowflake (ETL/ELT is required)

Vendor and data lock-in

Cloud Data Platform Ecosystem

Deployment Architecture

Use Cases

Data Flow Diagram

Using a combination of Databricks and Starburst Presto to bring a full data ingestion and analytical environment to life

Data Ingestion and Transformation

● Real-time ingestion of event data into Delta tables

● Customer and inventory data ingested every hour

● Modified customer information merged into Delta Lake table

● Data marts created using streaming and batch data

Query-time Data Federation

● Single point of access to numerous data sources

● Query Delta Lake and federate with legacy databases as well as many NoSQL data stores

● Enforce table, column and row level policies to ensure maximum data security

● Mask column data for different groups and users

Data Consumption & Analytics

BI Reporting Tools, SQL Query Tools

• Connect using a variety of BI and SQL tools including Looker, Tableau, Power BI and DBeaver

• JDBC, ODBC and many libraries including Python, R and Java

SELECT id, COUNT(*), SUM(active_seconds)
FROM delta.iot.events e
JOIN snowflake.sales.customer c ON (e.customer_id = c.id)
WHERE e.event_date >= current_date
  AND c.region = 'US'
  AND c.id IN
    (SELECT l.customer_id
     FROM elastic.web.logs l
     WHERE l.visit_date >= date '2020-01-01')
GROUP BY id;

Thank You!

Try Presto with Delta: www.starburstdata.com/delta-lake-reader

Feedback

Your feedback is important to us.

Don’t forget to rate and review the sessions.
