sql on hadoop: defining the new generation of analytic sql databases

28
Strata Conference, February 2013 SQL on Hadoop Defining the New Genera/on of Analy/c Databases

Upload: oreillystrata

Post on 09-May-2015

3.221 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Strata Conference, February 2013

SQL  on  Hadoop  Defining  the  New  Genera/on  of    Analy/c  Databases  

Page 2: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Speaker Bio: Carl Steinbach

2  

Currently:

Engineer at Citus Data PMC Chair, Committer -- Apache Hive Project @cwsteinbach on Twitter

Formerly:

Cloudera, Informatica, NetApp, Oracle

Page 3: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

This is going to sound strange, but…

3  

I used to think databases were boring

Page 4: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Why?

4  

Undergrad at MIT 1997-2001 Number of Database Classes: 0 Number of Database Faculty Members: 0 My Conclusion: Databases are a Dead Field

Page 5: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Things Changed Over the Next Couple of Years

5  

I got a job! Database Group Formed at MIT (2003)

- Mike Stonebraker - Sam Madden

New Class: 6.830 Database Systems (2005)

Page 6: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

What Changed?

6  

Web-scale Data New DB Research: Columnar Storage, NoSQL MPP Analytic Databases Gained Market Traction GFS (’03) and MapReduce (‘04) Papers Apache Hadoop – v0.1.0 released in 2006

Page 7: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

What’s Good About Hadoop?

7  

Commodity Storage Scale-out Flexibility

MapReduce Multi-structured Data

Page 8: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

What’s Bad About Hadoop?

8  

MapReduce! No Schemas! Missing Features

Optimizer, Indexes, Views Incompatibility with Existing Tools

BI, ETL, IDEs

Page 9: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Hive  MetaStore  

Apache Hive Solved Many of These Problems

9  

HiveServer2  

SQL  to  MapReduce  Compiler  

Rule  Based  Op/mizer  

MR  Plan  Execu/on  Coordinator  

Table  to  Files  

Table  to  Format  

datanode  

HDFS  

Map/Reduce    

Hive  Operators  

Hive  SerDes  

datanode  

HDFS  

Map/Reduce  

Hive  Operators  

Hive  SerDes  

datanode  

HDFS  

Map/Reduce  

Hive  Operators  

Hive  SerDes  

SQL  Queries   Catalog  Metadata  

User  Client  

Hive  ODBC/JDBC  

Hive  CLI  

ETL,  BI,  SQL  IDE  

Page 10: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

But Other Problems Remained

10  

MapReduce: Latency Overhead Many Missing Features: •  ANSI SQL •  Cost Based Optimizer •  UDFs •  Data Types •  Security •  …

Page 11: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

One Solution: Separate MPP DB Cluster

11  

MPP  Worker  Node  MPP  Worker  Node  MPP  Worker  Node  MPP  Worker  Node  

Global  Query  Executor  

MPP  Master  Node  

HDFS  datanode  

HDFS  datanode  

HDFS  datanode  

HDFS  datanode  

Local  Query  Executor  

Local  Query  Executor  

Local  Query  Executor  

Local  Query  Executor  

MPP  Database  Cluster  

Hadoop  Cluster  

Page 12: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

One Solution: Separate MPP DB Cluster

12  

MPP  Worker  Node  MPP  Worker  Node  MPP  Worker  Node  MPP  Worker  Node  

Global  Query  Executor  

MPP  Master  Node  

HDFS  datanode  

HDFS  datanode  

HDFS  datanode  

HDFS  datanode  

Local  Query  Executor  

Local  Query  Executor  

Local  Query  Executor  

Local  Query  Executor  

Pull  Data  to  Work   IO  Bo]leneck  

Page 13: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Better Solution: A New Architecture for SQL on Hadoop

13  

Global  Query  Executor  

MPP  Master  Node  

HDFS  datanode  

Local  Query  Executor  

HDFS  datanode  

Local  Query  Executor  

HDFS  datanode  

Local  Query  Executor  

HDFS  datanode  

Local  Query  Executor  

Maintain  Data  Locality  

Push  Work  to  

Data  

Page 14: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

The New Architecture in Detail: CitusDB

14  

CitusDB  Master  Node  

Metadata  

Distributed  Query  Planner  

Distributed  Query  Executor  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

PostgreSQL  Tools  

ODBC/JDBC  Clients  

Hadoop  Metadata  

HDFS  NameNode  

Page 15: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

The New Architecture in Detail: CitusDB

15  

CitusDB  Master  Node  

Metadata  

Distributed  Query  Planner  

Distributed  Query  Executor  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

PostgreSQL  Tools  

ODBC/JDBC  Clients  

Hadoop  Metadata  

HDFS  NameNode  

Metadata  Sync  

Step  1)  The  CitusDB  Master  Node  retrieves  file  system  metadata  from  the  Hadoop  NameNode.  

Page 16: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

The New Architecture in Detail: CitusDB

16  

CitusDB  Master  Node  

Metadata  

Distributed  Query  Planner  

Distributed  Query  Executor  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

PostgreSQL  Tools  

ODBC/JDBC  Clients  

Hadoop  Metadata  

HDFS  NameNode  

User  Query  

Step  2)  The  user  submits  a  SQL  query  to  the  CitusDB  master  node  using  the  PostgreSQL  CLI  or  a  JDBC/ODBC  app.  

Page 17: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

The New Architecture in Detail: CitusDB

17  

CitusDB  Master  Node  

Metadata  

Distributed  Query  Planner  

Distributed  Query  Executor  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

PostgreSQL  Tools  

ODBC/JDBC  Clients  

Hadoop  Metadata  

HDFS  NameNode  

Local  Queries  

Step  3)  The  Master  Node  generates  an  op/mized  global  query  plan  and  sends  fragment  queries  to  the  workers.  

Page 18: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

The New Architecture in Detail: CitusDB

18  

CitusDB  Master  Node  

Metadata  

Distributed  Query  Planner  

Distributed  Query  Executor  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

PostgreSQL  Tools  

ODBC/JDBC  Clients  

Hadoop  Metadata  

HDFS  NameNode  

Local  Results  

Step  4)  The  CitusDB  worker  processes  running  on  each  DataNode  process  the  fragment  queries                              and  send  par/al  result  sets  back  to  the  Master  Node.  

Page 19: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

The New Architecture in Detail: CitusDB

19  

CitusDB  Master  Node  

Metadata  

Distributed  Query  Planner  

Distributed  Query  Executor  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrappers  

PostgreSQL  Tools  

ODBC/JDBC  Clients  

Hadoop  Metadata  

HDFS  NameNode  

Query  Results  

Step  5)  The  Master  Node  merges  the  par/al  result  sets  and  returns  the  final  result  to  the  user.  

Page 20: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

CitusDB: Standing on the Shoulders of Giants

20  

Mature, Battle-tested Enterprise Class Features Has an Elephant Mascot

Proven Scalability Cost Effectiveness Has an Elephant Mascot

+

Page 21: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Leveraging PostgreSQL Performance

21  

Cost-based Query Optimizer

postgres=#  EXPLAIN  SELECT      customer.c_custkey,    sum((lineitem.l_extendedprice  *  (1::numeric  -­‐  lineitem.l_discount)))      ….  

 -­‐>    Sort    (cost=282459.19..282599.52  rows=56134  width=182)                    Sort  Key:  customer.c_custkey,  customer.c_name                    Sort  Method:  external  merge    Disk:  17192kB  

 ….      -­‐>    Hash  Join    (cost=39666.61..257246.25  rows=56134  width=16)                      Hash  Cond:  (lineitem.l_orderkey  =  orders.o_orderkey)                                      -­‐>    Seq  Scan  on  lineitem_102022  lineitem    (cost=0.00..190571.11)  

Page 22: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Leveraging PostgreSQL Features: More than 300 Built-in Functions

22  

REGR_SLOPE ENUM_RANGE NTH_VALUE SIN TRANSACTION_TIMESTAMP REGR_SYY IS FALSE BROADCAST ISCLOSED VAR_SAMP QUERY_TO_XML GET_BIT IS NOT UNKNOWN ENUM_LAST NUMNODE SPLIT_PART REPLACE EVERY IS TRUE RPAD NOW REGR_SXX PATH PERCENT_RANK IN CURRENT_DATE UPPER_INF XMLAGG GENERATE_SERIES REGR_INTERCEPT DIAMETER ROW_NUMBER ARRAY_NDIMS

QUOTE_LITERAL STRING_TO_ARRAY CONVERT_TO DATE_TRUNC TO_DATE LOWER_INC CHAR_LENGTH CURRENT_TIMESTAMP ARRAY_LENGTH REPEAT UPPER CASE BOOL_AND CORR XPATH_EXISTS RADIUS MOD AREA ATAN2 REGEXP_REPLACE DECODE BOOL_OR CUME_DIST OVERLAY POINT JUSTIFY_INTERVAL POWER IS NOT TRUE FORMAT REVERSE IS NOT FALSE ROW_TO_JSON SET_BIT

COS EXTRACT DIV BTRIM LOWER CONCAT ARRAY_FILL JUSTIFY_DAYS VAR_POP OCTET_LENGTH STATEMENT_TIMESTAMP TAN ARRAY_APPEND XMLCOMMENT REGEXP_MATCHES CONVERT_FROM XPATH AT TIME ZONE RANDOM CEILING WIDTH ROUND COVAR_SAMP HOSTMASK ARRAY_DIMS CURSOR_TO_XML GENERATE_SUBSCRIPTS PG_SLEEP STDDEV_SAMP SIMILAR TO NOTNULL SUBSTRING REGEXP_SPLIT_TO_ARRAY

GREATEST REGR_SXY OVERLAPS FLOOR SQRT RTRIM REGR_AVGY IS DISTINCT TIMEOFDAY LN TO_CHAR TRUNC ISNULL SCHEMA_TO_XML COALESCE ENUM_FIRST BIT_AND RADIANS SUM TRIM STDDEV_POP LSEG STRING_AGG PCLOSE MASKLEN LIKE POSITION VAR_POP DENSE_RANK DATABASE_TO_XML HOST SETSEED TO_TIMESTAMP

QUOTE_IDENT REGR_R2 LAG PI TRUNC STRIP XMLAGG UPPER COVAR_POP NETMASK FIRST_VALUE LOWER_INF REGR_COUNT SET_MASKLEN NOW ISOPEN REGR_COUNT NOW MIN TO_HEX GET_BYTE XML_IS_WELL_FORMED LASTVAL HEIGHT DENSE_RANK SETVAL LAST_VALUE STRPOS COT ARRAY_CAT TO_ASCII ISFINITE NOT

SET_BYTE XMLFOREST LAG FORMAT ARRAY_AGG LTRIM BETWEEN BOX CURRVAL LOCALTIME LPAD REGR_AVGX DATE_PART ARRAY_TO_STRING EXTRACT UPPER_INC TRANSLATE SQRT NOT LIKE LOG DATE_TRUNC VARIANCE UNNEST ANY LOCALTIMESTAMP LENGTH INITCAP SIGN SUBSTR STDDEV ABS SOME MD5

Page 23: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Leveraging PostgreSQL Features

23  

Extensible, Rich Type System Pluggable Format Handlers Security Internationalization Connectivity: ODBC, JDBC Ecosystem Add-Ons:

PostGIS, XML/JSON, Fuzzy Search, Language Bindings (.NET, Python, etc)

Page 24: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Where are We Headed? Distributed. SQL. Anywhere.

24  

CitusDB  Master  Node  

Metadata  

Distributed  Query  Planner  

Distributed  Query  Executor  

Hadoop  Datanode  

HDFS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrapper  

RDBMS  server  

RDBMS  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrapper  

MongoDB  Shard  

mongod  

Local  Query  Planner  

Local  Query  Executor  

Foreign  Data  Wrapper  

Page 25: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Defining the New Generation of Distributed Analytic Databases

25  

SQL à Ease of Use, Increased Productivity Real-time responsiveness à Faster Data Locality à Proven Scalability Schema-on-Read à Flexibility, Lower Cost

Page 26: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

Where Are We At?

26  

CitusDB SQL on Hadoop is in Open Beta Download our Binary Packages Or Use Our EC2 AMI

http://citusdata.com/docs/sql-on-hadoop

Page 27: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

We’re Hiring

27  

http://citusdata.com/job

Page 28: SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

For questions and more information: [email protected]

(650) 566-9010

28