

Help Build the Most Advanced SQL Database on Hadoop: Apache HAWQ

Lei Chang <lchang@pivotal.io>

ApacheCon Europe 2015, Budapest


Agenda

• Introduction

• Architecture

• Query processing

• Catalog service

• Virtual segments

• Interconnect

• Transaction management

• Resource management

• Storage

• How to contribute


Apache HAWQ: A Hadoop Native Parallel SQL Engine

(Feature overview)

• SQL & analytics: ANSI SQL standard (core compliance), OLAP extensions, cost-based optimizer, multi-language UDF support, built-in data science library, JDBC/ODBC connectivity
• MPP on Hadoop: MPP architecture, dynamic pipelining, HDFS native file formats, compression + partitioning, petabyte scale, online expansion
• Enterprise readiness: ACID + transactional, multi-level fault tolerance, granular authorization, resource pools, high multi-tenancy, extensible (PXF) to query external sources
• Hardened: 10+ years tested, production proven; accessibility + usability

What it enables:

● Discover new relationships
● Enable data science
● Analyze external sources
● Query all data types
● Manage multiple workloads
● Petabyte-scale analytics
● Sub-second performance
● Leverage existing skills & tools
● Easily integrate with other tools
● Well integrated with the Hadoop ecosystem


Design goals

• Interactive queries

• Scalability

• Consistency

• Extensibility

• Standard compliance

• Productivity


History

• 2011: Prototype, GoH (Greenplum Database on HDFS)

• 2012: HAWQ Alpha

• March 2013: HAWQ 1.0, with architecture changes for a Hadoop-native system

• 2013-2014: HAWQ 1.x (HAWQ 1.1, HAWQ 1.2, HAWQ 1.3, …)

• 2015: HAWQ 2.0 Beta & Apache incubation: http://hawq.incubator.apache.org


HAWQ Components and Architecture

(Architecture diagram) Clients connect to the masters, which host the parser/analyzer, optimizer, dispatcher, fault tolerance service, catalog service, an HDFS catalog cache, and the resource manager, whose resource broker talks to YARN through libYARN. Each worker node runs one physical segment alongside the HDFS DataNode and the YARN NodeManager, and every physical segment hosts multiple virtual segments. Segments exchange data over the interconnect; the HDFS NameNodes and external systems sit outside the segment nodes.


Basic query execution flow

(Diagram of the basic query execution flow, involving the Resource Manager.)


Plan

SELECT l_orderkey, count(l_quantity)

FROM lineitem, orders

WHERE l_orderkey=o_orderkey AND l_tax>0.01

GROUP BY l_orderkey;

Motion operators (which move tuples between the processes executing different plan slices):
• Redistribution
• Broadcast
• Gather
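For concreteness, here is a sketch of how those motions might show up for this query via EXPLAIN. The plan shape, slice numbers, and segment counts below are illustrative assumptions in the style of Greenplum-heritage plan output, not output captured from a HAWQ cluster.

EXPLAIN
SELECT l_orderkey, count(l_quantity)
FROM lineitem, orders
WHERE l_orderkey=o_orderkey AND l_tax>0.01
GROUP BY l_orderkey;

-- Illustrative plan shape (assumed, not captured from a real system):
-- Gather Motion N:1  (slice 3)               -- results gathered to the master
--   -> HashAggregate (group by l_orderkey)
--     -> Hash Join (l_orderkey = o_orderkey)
--       -> Redistribute Motion N:N (slice 1, hash key l_orderkey)
--         -> Seq Scan on lineitem (filter: l_tax > 0.01)
--       -> Hash
--         -> Seq Scan on orders (slice 2)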


Catalog service

• Currently lives on the master; planned to become an external service

• Query language: CaQL

• Supports basic single-table SELECT, COUNT(), multi-row DELETE, and single-row INSERT/UPDATE
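CaQL itself is an internal interface, so no CaQL syntax is shown here; the plain SQL below, run against standard PostgreSQL-style catalog tables, only illustrates the operation shapes the slide lists (single-table SELECT, COUNT(), multi-row DELETE, single-row UPDATE).

-- Single-table SELECT on a catalog table
SELECT oid, relname FROM pg_class WHERE relname = 'lineitem';

-- COUNT()
SELECT count(*) FROM pg_attribute WHERE attrelid = 'lineitem'::regclass;

-- Multi-row DELETE (e.g., dropping a table removes all of its attribute rows)
DELETE FROM pg_attribute WHERE attrelid = 'lineitem'::regclass;

-- Single-row UPDATE
UPDATE pg_class SET reltuples = 0 WHERE relname = 'lineitem';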


Virtual segments & elasticity

• Segments are stateless

• Metadata is kept in the Catalog Service

• Plans are self-described

• Compression

• Virtual segments
  ▪ Only one physical segment per node
    o Eases installation and management
    o Eases performance tuning
  ▪ The number of virtual segments depends on
    o The resources available at query run time
    o The cost of the query

• Node expansion & shrinking

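The number of virtual segments can also be bounded through server configuration; the GUC names below are recalled from the HAWQ 2.0 documentation and should be treated as assumptions rather than verified settings.

-- Assumed GUC names: cap virtual segments per query, and per physical segment
SET hawq_rm_nvseg_perquery_limit = 64;
SET hawq_rm_nvseg_perquery_perseg_limit = 8;
SHOW hawq_rm_nvseg_perquery_limit;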


Interconnect


Interconnect

• Two transports: TCP & UDP

• TCP interconnect limitations
  – Limited scale
  – High latency for connection setup

• Goals of the UDP interconnect
  – Reliability
  – Ordering
  – Flow control
  – Performance and scalability
  – Portability
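The transport in use is selected with a server configuration parameter; the GUC name and values below follow the Greenplum-heritage setting and are assumptions for HAWQ.

-- Assumed GUC: choose the interconnect transport
SET gp_interconnect_type = 'udpifc';   -- flow-controlled UDP; 'tcp' is the alternative
SHOW gp_interconnect_type;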


Interconnect state machine


The UDP interconnect design

• Multiple virtual connections share a common socket

• Multi-threaded

• Window-based flow control

• Handles packet loss and out-of-order packets

• Deadlock elimination


Transaction Management

• Snapshot isolation

• Catalog data: MVCC

• User data: logical end-of-file (EoF) & swimming lanes

• TRUNCATE helps remove garbage data
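A minimal sketch of how the logical-EoF scheme behaves, assuming a hypothetical append-only table t; the comments restate the mechanism described above rather than observed behavior.

CREATE TABLE t (id int) WITH (APPENDONLY = true);

BEGIN;
INSERT INTO t SELECT generate_series(1, 1000);
ROLLBACK;
-- The aborted rows were appended to the underlying HDFS file, but the table's
-- logical EoF was never advanced past them, so no snapshot ever sees them.

TRUNCATE t;
-- TRUNCATE drops the table's files outright, which also reclaims the space
-- occupied by garbage data left behind by aborted inserts.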


Resource manager

● Responsibilities
  o Acquires resources from YARN and returns them to YARN
  o Allocates resources among HAWQ users, queries, and operators

● Multi-level resource management
  o First level: integration with an external resource manager (such as YARN)
  o Second level: HAWQ's internal resource manager (user/query-level resource management)
  o Third level: operator-level resource management

● Hierarchical resource queues (a DDL sketch follows this list)

● CPU and memory management (allocation & enforcement)
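A hedged sketch of the hierarchical resource queue DDL; the attribute names (PARENT, ACTIVE_STATEMENTS, MEMORY_LIMIT_CLUSTER, CORE_LIMIT_CLUSTER) and value syntax are recalled from the HAWQ 2.0 documentation and may differ in detail.

-- A child queue under the root, limited to half of the cluster's memory and cores
CREATE RESOURCE QUEUE dept_etl WITH (
    PARENT = 'pg_root',
    ACTIVE_STATEMENTS = 20,
    MEMORY_LIMIT_CLUSTER = 50%,
    CORE_LIMIT_CLUSTER = 50%
);

-- Bind a role to the queue so its queries draw resources from this pool
ALTER ROLE etl_user RESOURCE QUEUE dept_etl;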


Resource manager components

(Diagram) Inside the HAWQ master, the parser & analyzer, optimizer, and dispatcher hand plans and resource requests (from the QD) to the resource manager leader, alongside the FTS (Fault Tolerance Service). The resource manager is built from a request handler, resource allocator, resource pool, and policy store, plus a resource broker that obtains resources from YARN through libYARN (or from Mesos through the Mesos API).


Storage

• Row-oriented: append-only (AO) format, compressed with QuickLZ or zlib

• PAX-like: Parquet, compressed with Snappy or gzip

• Other formats: accessible through PXF
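Hedged examples of how these formats are selected in the CREATE TABLE WITH clause; the table names are made up, and the PXF location string (host, port, profile) is illustrative rather than taken from a real deployment.

-- Row-oriented append-only (AO) table compressed with QuickLZ
CREATE TABLE sales_ao (id int, amount numeric)
WITH (APPENDONLY = true, ORIENTATION = row, COMPRESSTYPE = quicklz);

-- PAX-like Parquet table compressed with Snappy
CREATE TABLE sales_parquet (id int, amount numeric)
WITH (APPENDONLY = true, ORIENTATION = parquet, COMPRESSTYPE = snappy);

-- External data queried through PXF (illustrative location and profile)
CREATE EXTERNAL TABLE sales_ext (id int, amount numeric)
LOCATION ('pxf://namenode:51200/data/sales?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');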


Contributing to HAWQ

• Documentation
• Wiki
• Bug reports
• Bug fixes
• Features

• Website: http://hawq.incubator.apache.org/
• Wiki: https://cwiki.apache.org/confluence/display/HAWQ
• Repo: https://github.com/apache/incubator-hawq.git
• JIRA: https://issues.apache.org/jira/browse/HAWQ
• Mailing lists: dev/user@hawq.incubator.apache.org


Code contribution process

• Start a JIRA
• Fork the GitHub repo: https://github.com/apache/incubator-hawq.git
• Clone your repo to local
• Add the GitHub repo as "upstream"
• Create a feature branch and commit your code
• Start a pull request for code review

Details: https://cwiki.apache.org/confluence/display/HAWQ/Contributing+to+HAWQ


Build & Setup

• Build
  – Option 1: Use the pre-built Docker image: https://hub.docker.com/r/mayjojo/hawq-dev/
  – Option 2: Build the dependencies yourself
    • https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61320026

• Setup & Run
  – HDFS (required)
  – YARN (optional)
  – HAWQ: init/start/stop the cluster
  – Connect with psql -d postgres


References

• HAWQ website
  – http://hawq.incubator.apache.org
  – http://pivotal.io/big-data/pivotal-hawq

• HAWQ research papers
  – Lei Chang et al.: HAWQ: A Massively Parallel Processing SQL Engine in Hadoop. SIGMOD Conference 2014: 1223-1234
  – Mohamed A. Soliman et al.: Orca: A Modular Query Optimizer Architecture for Big Data. SIGMOD Conference 2014: 337-348
  – Lyublena Antova et al.: Optimizing Queries over Partitioned Tables in MPP Systems. SIGMOD Conference 2014: 373-384
  – Amr El-Helw et al.: Optimization of Common Table Expressions in MPP Database Systems. PVLDB 8(12): 1704-1715 (2015)


Summary

• HAWQ: A Hadoop Native Parallel SQL Engine

• How to contribute

