berlin hadoop get together apache drill

35
Apache Drill interactive, ad-hoc query for large-scale datasets Michael Hausenblas, Chief Data Engineer EMEA, MapR Hadoop get-together, Berlin, 2013-05-08

Upload: mapr-technologies

Post on 11-May-2015

360 views

Category:

Technology


1 download

DESCRIPTION

Michael Hausenblas talk on Apache Drill given in Berlin (2013 05-08)

TRANSCRIPT

Page 1: Berlin Hadoop Get Together Apache Drill

Apache Drillinteractive, ad-hoc query for large-scale datasets

Michael Hausenblas, Chief Data Engineer EMEA, MapRHadoop get-together, Berlin, 2013-05-08

Page 2: Berlin Hadoop Get Together Apache Drill

Which workloads do you encounter in your environment?

http://w

ww

.flickr.com/photos/kevinom

ara/2866648330/ licensed under CC BY-NC-N

D 2.0

Page 3: Berlin Hadoop Get Together Apache Drill

Batch processing

… for recurring tasks such as large-scale data mining, ETL offloading/data-warehousing for the batch layer in Lambda architecture

Page 4: Berlin Hadoop Get Together Apache Drill

OLTP

… user-facing eCommerce transactions, real-time messaging at scale (FB), time-series processing, etc. for the serving layer in Lambda architecture

Page 5: Berlin Hadoop Get Together Apache Drill

Stream processing

… in order to handle stream sources such as social media feeds or sensor data (mobile phones, RFID, weather stations, etc.) for the speed layer in Lambda architecture

Page 6: Berlin Hadoop Get Together Apache Drill

Search/Information Retrieval

… retrieval of items from unstructured documents (plain text, etc.), semi-structured data formats (JSON, etc.), as well as data stores (MongoDB, CouchDB, etc.)

Page 7: Berlin Hadoop Get Together Apache Drill

http://www.flickr.com/photos/9479603@N02/4144121838/ licensed under CC BY-NC-ND 2.0

But what about interactive,ad-hoc query at scale?

Page 8: Berlin Hadoop Get Together Apache Drill

Impala

Interactive Query (?)

low-latency

Page 9: Berlin Hadoop Get Together Apache Drill

Use Case I

• Jane, a marketing analyst• Determine target segments• Data from different sources

Page 10: Berlin Hadoop Get Together Apache Drill

Use Case II

• Logistics – supplier status• Queries– How many shipments from supplier X?– How many shipments in region Y?

SUPPLIER_ID NAME REGION

ACM ACME Corp US

GAL GotALot Inc US

BAP Bits and Pieces Ltd Europe

ZUP Zu Pli Asia

{ "shipment": 100123, "supplier": "ACM", “timestamp": "2013-02-01", "description": ”first delivery today”},{ "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it”}…

Page 11: Berlin Hadoop Get Together Apache Drill

Requirements

• Support for different data sources• Support for different query interfaces• Low-latency/real-time• Ad-hoc queries• Scalable, reliable

Page 12: Berlin Hadoop Get Together Apache Drill

And now for something completely different …

Page 13: Berlin Hadoop Get Together Apache Drill

Google’s Dremel

http://research.google.com/pubs/pub36632.html

Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.…

““Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.…

Page 14: Berlin Hadoop Get Together Apache Drill

Google’s Dremel

multi-level execution trees

columnar data layout

Page 15: Berlin Hadoop Get Together Apache Drill

Google’s Dremel

nested data + schema column-striped representation

map nested data to tables

Page 16: Berlin Hadoop Get Together Apache Drill

Google’s Dremel

experiments:datasets & query performance

Page 17: Berlin Hadoop Get Together Apache Drill

Back to Apache Drill …

Page 18: Berlin Hadoop Get Together Apache Drill

Apache Drill–key facts

• Inspired by Google’s Dremel• Standard SQL 2003 support• Plug-able data sources• Nested data is a first-class citizen• Schema is optional• Community driven, open, 100’s involved

Page 19: Berlin Hadoop Get Together Apache Drill

High-level Architecture

Page 20: Berlin Hadoop Get Together Apache Drill

Principled Query Execution

Source Query Parser

Logical Plan Optimizer

Physical Plan Execution

SQL 2003 DrQLMongoQLDSL

scanner APITopologyCFetc.

query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” },

parser API

Page 21: Berlin Hadoop Get Together Apache Drill

Wire-level Architecture

• Each node: Drillbit - maximize data locality• Co-ordination, query planning, execution, etc, are distributed• Any node can act as endpoint for a query—foreman

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Page 22: Berlin Hadoop Get Together Apache Drill

Wire-level Architecture

• Curator/Zookeeper for ephemeral cluster membership info• Distributed cache (Hazelcast) for metadata, locality

information, etc.

Curator/Zk

Distributed Cache

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Distributed Cache Distributed Cache Distributed Cache

Page 23: Berlin Hadoop Get Together Apache Drill

Wire-level Architecture

• Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc.

• Streaming data communication avoiding SerDe

Curator/Zk

Distributed Cache

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Storage Process

Drillbit

node

Distributed Cache Distributed Cache Distributed Cache

Page 24: Berlin Hadoop Get Together Apache Drill

Wire-level ArchitectureForeman turns into root of the multi-level execution tree, leafs activate their storage engine interface.

node

node node

Curator/Zk

Page 25: Berlin Hadoop Get Together Apache Drill

Drillbit Modules

DFS Engine

HBase Engine

RPC Endpoint

SQL

HiveQL

Pig

Parser

Distributed Cache

Logi

cal P

lan

Phys

ical

Pla

n

Optimizer

Stor

age

Engi

ne In

terf

aceScheduler

Foreman

OperatorsMongo

Page 26: Berlin Hadoop Get Together Apache Drill

Key features

• Full SQL – ANSI SQL 2003• Nested Data as first class citizen• Optional Schema• Extensibility Points …

Page 27: Berlin Hadoop Get Together Apache Drill

Extensibility Points

• Source query parser API• Custom operators, UDF logical plan• Serving tree, CF, topology physical plan/optimizer• Data sources &formats scanner API

Source Query Parser

Logical Plan Optimizer

Physical Plan Execution

Page 28: Berlin Hadoop Get Together Apache Drill

… and Hadoop?

• HDFS can be a data source

• Complementary use cases*

• … use Apache Drill– Find record with specified condition– Aggregation under dynamic conditions

• … use MapReduce– Data mining with multiple iterations– ETL

*) https://cloud.google.com

/files/BigQueryTechnicalW

P.pdf

Page 29: Berlin Hadoop Get Together Apache Drill

Basic Demo

https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

{ "id": "0001", "type": "donut", ”ppu": 0.55, "batters": { "batter”: [

{ "id": "1001", "type": "Regular" },

{ "id": "1002", "type": "Chocolate" },…

data source: donuts.json

query:[ { op:"sequence", do:[

{ op: "scan", ref: "donuts", source: "local-logs", selection: {data:

"activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" },

…logical plan: simple_plan.json

result: out.json

{ "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0} { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69} { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55}

Page 30: Berlin Hadoop Get Together Apache Drill

BE A PART OF IT!

Page 31: Berlin Hadoop Get Together Apache Drill

Status

• Heavy development by multiple organizations

• Available– Logical plan (ADSP)– Reference interpreter– Basic SQL parser – Basic demo

Page 32: Berlin Hadoop Get Together Apache Drill

Status

May 2013

• Full SQL support (+JDBC)• Physical plan• In-memory compressed data interfaces• Distributed execution• HBase and MySQL storage engine• WebUI client

Page 33: Berlin Hadoop Get Together Apache Drill

Contributing

Contributions appreciated (besides code drops)!

• Test data & test queries• Use case scenarios (textual/SQL queries)• Documentation• Further schedule– Alpha Q2– Beta Q3

Page 34: Berlin Hadoop Get Together Apache Drill

Kudos to …

• Julian Hyde, Pentaho • Lisen Mu, XingCloud• Tim Chen, Microsoft• Chris Merrick, RJMetrics • David Alves, UT Austin• Sree Vaadi, SSS/NGData• Jacques Nadeau, MapR• Ted Dunning, MapR

Page 35: Berlin Hadoop Get Together Apache Drill

Engage!

• Follow @ApacheDrill on Twitter

• Sign up at mailing lists (user | dev) http://incubator.apache.org/drill/mailing-lists.html

• Standing G+ hangouts every Tuesday at 6pm CEThttp://j.mp/apache-drill-hangouts

• Keep an eye on http://drill-user.org/