apache hawq and apache madlib: journey to apache

11Pivotal Confidential–Internal Use Only

Apache HAWQ andApache MADlibJourney to Apache

Pivotal San Francisco

Dec 3, 2015

Topics

• Journey to Apache

• HAWQ overview

• MADlib overview

Journey to Apache

1986 … 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014

1995 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015

Journey to Apache

Michael Stonebraker develops Postgres at UCB

Postgres adds support for SQL

Open Source PostgreSQL

PostgreSQL 7.0 released

PostgreSQL 8.0 released

Greenplum forks PostgreSQL

Hadoop 1.0 Released

HAWQ & MADlibgo Apache

HAWQ launched

Hadoop 2.0 Released

MADliblaunched

Greenplum open sourced

Pivotal is Committed to Open Source

Pivotal GemFire Apache Geode (April 2015)

Pivotal HDB Apache HAWQ (Sept 2015)

Pivotal Query Optimizer ComingSoon

MADlib OSS (BSD License)

Apache MADlib (Sept 2015)

Pivotal Greenplum Greenplum Database (Oct 2015)(Apache 2 License)

Collaborate on software in open and productive ways Need strong community for innovation MADlib and HAWQ are complementary technologies

Why Apache?

Apache HAWQ Overview

What is Apache HAWQ?

Key Featuresof

5 • Up to 30x SQL-on-Hadoop performance advantage

• Faster time to insight• Massive MPP scalability to petabytes

Benefits: Near real-time latency, complex queries and advanced analytics at scale

1. Advanced Analytics Performance

Key Featuresof

5 • ANSI SQL-92, -99, -2003• All 99 TPC-DS queries tested, no

modifications• Plus, OLAP extensions• Complete ACID integrity and reliability

Benefits: 100% SQL compliant No risk to SQL applications All native on HDP via HAWQ

2. 100% ANSI SQL Compliant

Key Featuresof

HAWQ Performance vs Impala

HAWQFaster

ImpalaFaster

2 28 46 66 73 76 79 80 88 90 96

HAWQ• Faster on 46 of 62

TPC-DS queries completed*

• 4.55x mean avg.• 12 hrs faster total

* Impala supported 74 of 99 queries, 12 crashed mid-run

HAWQ vs Apache Hive w/Tez

HAWQFaster

HiveFaster

3 7 15 25 27 34 46 48 76 79 89 90 96

HAWQ• Faster on 45 of 60

TPC-DS queries completed*

• 3.44x mean avg.• 9 hrs faster total

* Hive supported 65 of 99 queries, 5 crashed mid-run

5• Advanced machine learning for big data• Local, in-database operation• Exceptional MPP/parallel performance• Open source, Postgres-based

Benefits: Advanced, highly scalable, machine learning, directly on data in Hadoop

3. Integrated Machine Learning

Key Featuresof

5 • HDP, PHD, other ODPi-derived distros• Easily managed via Ambari• On premises, in cloud, or PaaS• HBase, Avro, Parquet and more• Connectors to make HAWQ data

available to other SQL query tools

Benefits: Flexibility Accessibility Portability

4. Flexible Deployment

Key Featuresof

Open Data PlatformA shared industry effort to advance the state of Apache Hadoop® and Big Data

technologies for the enterprise

17The open ecosystem of big data

September 25, 2015

http://odpi.org

5 • Cost-based query optimization • Robust query plan optimization • Complex big data management

Benefits: Optimize performance and costs Maximize Hadoop cluster resources Offload EDW w/o compromise

5. Query Optimization Options

Key Featuresof

Apache HAWQ

● Discover New Relationships● Enable Data Science ● Analyze External Sources● Query All Data Types!

Multi-level Fault Tolerance

Granular Authorization

Resource Mgmt (+ YARN)

high multi-tenancy

ANSI SQL Standard

OLAP Extensions

JDBC ODBCConnectivity

Parallel Processing

Online Expansion

Petabyte Scale

Cost Based Optimizer

Dynamic Pipelining

ACID + Transactional

Multi-Language

UDF Support

Built-in Data Science Library

Extensible (PXF)

Query External Sources

Hardened, 10+ Years Investment, Production Proven

Accessibility + Usability

HDFS Native File Formats

● Manage Multiple Workloads● Petabyte Scale Analytics● Security controls

● Leverage Existing SQL Skills & BI Tools

● Easily Integrate with Other Tools

● Sub-second Performance Compression

+ Partitioning

● Hadoop-Native● Supports Pivotal HD

and Hortonworks Data Platform

● Ambari-Integrated

Apache HAWQ 2.0 (in beta)Areas of Enhancement New Features

Elastic & Scalable Architecture

Hadoop-Native Integrations

Simplified External Data Access/Queries

Performance & Optimizations

On-Demand Virtual Segments

Flexible Query Dispatch on subset nodes

3 Tier RM: YARN level>User>Query-Operator

Dynamic Cluster Expansion (no redistribute)

New Fault Tolerance Service

HCatalog integration - Read Access

HDFS Catalog Cache

Per Table Directory storage (user friendly)

Single physical segment per node

Easier Administration/Usage

Cloud-ReadySimpler Management Commands

HAWQ Segments

HAWQ Masters

Physical Segment

Client

Parser/AnalyzerOptimizer

Dispatcher

DataNode

NodeManager

NameNodeNameNode

External Data Stores via Xtension Framework (Hive/HBase/etc)

Resource Manager

Fault Tolerance Service

CatalogService

VirtualSegment

Physical Segment

DataNode

NodeManager

VirtualSegment

Physical Segment

DataNode

NodeManager

VirtualSegment

Resource Broker

libYARN

HDFS Catalog Cache

Interconnect Interconnect

Apache HAWQ 2.0Architecture

Example Use CasesSmart/connected car• PHD, HAWQ• Ability to have numerous data

in Hadoop• Generate new business models• Predictive analytics

Network & Call Center Analysis• PHD, HAWQ• Store and maintain 2B records/day• Analyze drop and completed calls• Analyze networks, care-center

responsiveness• 5X capacity of EDW at half the cost

Revenue Prediction• PHD, HAWQ, GPDB• Predict ad revenue

to within 1%• Transform into data-driven

company that builds close relationships withcustomers

Archive Analytics, CustomerBehavior Analytics• PHD, HAWQ• Mainframe alternative• Archive analytics• Customer behavior

profiling and analytics

Apache MADlib Overview

Scalable, In-Database Machine Learning

• Open Source https://github.com/apache/incubator-madlib• Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL• Downloads and Docs: http://madlib.incubator.apache.org/

Apache (incubating)

History

MADlib project was initiated in 2011 by EMC/Greenplum architects and Joe Hellerstein from Univ. of California, Berkeley.

• MAD stands for:

• lib stands for SQL library of:• advanced (mathematical, statistical, machine learning)• parallel & scalable in-database functions

UrbanDictionary.com:mad (adj.): an adjective used to enhance a noun.

1- dude, you got skills.2- dude, you got mad skills.

Functions

Predictive Modeling Library

Linear Systems• Sparse and Dense Solvers• Linear Algebra

Matrix Factorization• Singular Value Decomposition (SVD)• Low Rank

Generalized Linear Models• Linear Regression• Logistic Regression• Multinomial Logistic Regression• Cox Proportional Hazards Regression• Elastic Net Regularization• Robust Variance (Huber-White), Clustered

Variance, Marginal Effects

Other Machine Learning Algorithms• Principal Component Analysis (PCA)• Association Rules (Apriori)• Topic Modeling (Parallel LDA)• Decision Trees• Random Forest• Support Vector Machines• Conditional Random Field (CRF)• Clustering (K-means) • Cross Validation• Naïve Bayes• Support Vector Machines (SVM)

Descriptive Statistics

Sketch-Based Estimators• CountMin (Cormode-Muth.)• FM (Flajolet-Martin)• MFV (Most Frequent Values)CorrelationSummary

Support Modules

Array OperationsSparse VectorsRandom SamplingProbability FunctionsData PreparationPMML ExportConjugate Gradient

Inferential Statistics

Hypothesis Tests

Time Series• ARIMA

Oct 2014

MADlib Advantages

Better parallelism– Algorithms designed to leverage MPP and

Hadoop architecture

Better scalability– Algorithms scale as your data set scales

Better predictive accuracy– Can use all data, not a sample

ASF open source (incubating)– Available for customization and optimization

Supported Platforms

GPDB PostgreSQLPHDHDP

Other ODPi distros

Scale-out machine learning now available on open source, MPP execution engines.

Now open source !

Has always been

open source

Example UsageTrain a model

Predict for new data

Linear Regression on 10 Million Rows in Seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Pivotal is very proud to deepenour relationship with the ASF to advance SQL-on-Hadoop and machine learning technologies.

Please join us!

Contributors Welcome!

• Web sites– http://hawq.incubator.apache.org/– http://madlib.incubator.apache.org/– https://cran.r-project.org/web/packages/PivotalR/index.html

• Github– https://github.com/apache/incubator-hawq– https://github.com/apache/incubator-madlib– https://github.com/pivotalsoftware/PivotalR

apache hawq and apache madlib: journey to apache

Data & Analytics

bob james v. jackson (madlib) complaint.pdf

pivotal hawq and hortonworks data platform: modern data...

hortonworks data platform for hdinsight · • apache pig...

big data systeme recommendations - haw...

apache phoenix + apache hbase

a new platform for a new era - john funk · 2014-08-25 ·...

help build the most advanced sql database on hadoop...

apache hawq resource management and integration

what is hawq?...apache hawq (incubating) is an elastic...

madlib-input, strings, and lists in scratch barb ericson...

madlib, quasimoto, madlib: an independent study

package ‘pivotalr’ - cran.r-project.org ‘pivotalr ’...

apache calcite for enabling sql access to nosql data ... ·...

apache madlib...

µ hawq ¾ madlib o n...

the madlib analytics library

sql and machine learning on hadoop using hawq

pivotal hawq - high availability (2014)

apache 之道 – 从孵化器到暪撡暫搖搆旅摖 ·...

madlib architecture and functional demo on how to use...