jethro data meetup index base sql on hadoop - oct-2014

24
JethroData Indexed Based SQL-on- Hadoop - An Architectural Comparison of Tools Simpler. Faster. Cheaper.

Upload: eli-singer

Post on 01-Jul-2015

833 views

Category:

Software


3 download

DESCRIPTION

JethroData Index based SQL on Hadoop engine. Architecture comparison of MPP / Full-Scan sql engines such as Impala and Hive to index-based access such as Jethro. SQL and NoSQL NYC meetup Oct 20 2014 Boaz Raufman

TRANSCRIPT

Page 1: Jethro data meetup    index base sql on hadoop - oct-2014

JethroData Indexed Based SQL-on-Hadoop - An Architectural Comparison of Tools

Simpler. Faster. Cheaper.

Page 2: Jethro data meetup    index base sql on hadoop - oct-2014

About the presenter

• Over 25 years experience in software design & mgmt• Expertise in database architecture, information

retrieval and search technologies• Led numerous information retrieval projects for

various Israeli intelligence agencies as well as for commercial companies

• Started JethroData in 2010 with the idea of integrating database and search technologies to accelerate big data analytics

• Bachelor's degree in Computer Science and Philosophy from the Tel-Aviv University

Boaz Raufman – Co-Founder / CTO

Page 3: Jethro data meetup    index base sql on hadoop - oct-2014

SQL-on-Hadoop

Hadoop uses the same parallel design pattern as the parallel databases from last decade

Frameworks

• MapReduce• Tez• Spark

Reborn on Hadoop

• Pivotal HAWQ

• IBM BigSQL• Teradata

Aster• Actian

New Comers

• Hive• Impala• Presto• Tajo• Drill• Spark SQL

Page 4: Jethro data meetup    index base sql on hadoop - oct-2014

Data Node

Full-Scan Execution

Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day

Data Node

Data Node

Data Node

Data Node

Query Executor

Query Executor

Query Executor

Query Executor

Query Executor

Query Planner

/Mgr

Query Planner/

Mgr

Query Planner/

Mgr

Query Planner/

Mgr

Query Planner/

Mgr

Performance and resources based on the size of the dataset

Page 5: Jethro data meetup    index base sql on hadoop - oct-2014

Shared-Nothing MPP Design Principles

Parallel Processing• Divide the work across

many nodes• Try to minimize inter-

node communication• Work should be evenly

distributed

Full Data Scanning• Full sequential scan -

massive I/O• Data locality and local

processing• Minimize amount of data

being read– Columnar data store– Partition by specific key– Block stats

Performance and resource requirement based on the dataset size

Page 6: Jethro data meetup    index base sql on hadoop - oct-2014

MPP Complex queries processing

LocalAggregation

GlobalAggregation

(join, distinct,group by, order by,

sub-query)

Result Merge

Example:SELECTDAY,COUNT( DISTINCT ITEM)FROM T1WHEREPRODUCT=‘abc’GROUP BY DAY

Page 7: Jethro data meetup    index base sql on hadoop - oct-2014

Index based SQL-on-Hadoop

Page 8: Jethro data meetup    index base sql on hadoop - oct-2014

Data Node

The Index-Access Design

Data Node

Data Node

Data Node

Data Node

Jethro Query Node

Query Node

Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day

1. Index Access 2. Read data only for require rows

Performance and resources based on the size of the result-set

Page 9: Jethro data meetup    index base sql on hadoop - oct-2014

Index-Based Design

• Surgical scan – minimum I/O• Performance and required resources based

on the result set size• Extremely efficient for Interactive SQL use

cases• Pay at load time

Page 10: Jethro data meetup    index base sql on hadoop - oct-2014

Architecture – Contrarian Concepts

• Index everything– Every column is indexed

• Colum oriented– Columnar or row-groups– Append only data model

• Everything is stored in HDFS– Can also work with S3 or Posix

• Shared Everything– Separate compute and storage, each

scales-out independently– Minimize cross-node operations– Stateless Query nodes

• Parallelized multi-threaded execution– multiple parallelization dimensions:

columns, row ranges, partitions, pipelining and bucketing

JethroNode

Jethro Node

Storage Layer

Processing Layer

HDFS/Posix/S3

Client

Page 11: Jethro data meetup    index base sql on hadoop - oct-2014

Jethro Indexes

Indexes map each column value to a list of rows

Jethro stores indexes as hierarchical compressed bitmaps Very fast query operations – AND / OR / NOT Processed the entire WHERE clause to a final list of rows Patent pending: http://

www.google.com/patents/WO2013001535A3?cl=en INSERT Performance

– Load is very fast: files are appended, no random read/write, no locks– Jethro Indexes are append-only. If needed, duplicate entries are allowed– Periodic background merge (non-blocking)– Compatible with HDFS

Value Rows

FR rows 5,9,10,11,14

IL rows 1,3,7,12,13

US rows 2,4,6,8,15

Page 12: Jethro data meetup    index base sql on hadoop - oct-2014

Built in optimizations

• Code is written in C++• Column store and true column processing• Vectorization for expression evaluation• Multi-threaded and parallelized execution• Planner using indexes meta data - index-based

queries• Server-side cache in memory and local disk

Page 13: Jethro data meetup    index base sql on hadoop - oct-2014

Full-Scan: Performance depends on size of datasetIndex-Access: Performance depends on size of result-set

Use-Case Analysis

Page 14: Jethro data meetup    index base sql on hadoop - oct-2014

Comparing Recent Benchmarks – Jun 2014

Impala Parquet Vs. Hive/Tez, Presto, Shark

Source

Jethro Vs. Impala/Parquet

Source

Impala

Using the same queries in Jun-2014 Impala benchmarks, we compared Impala with Jethro (TPC-DS, SF 1,000)

Page 15: Jethro data meetup    index base sql on hadoop - oct-2014

Benchmark – Jethro vs. Impala – Oct 2014

q19 q42 q52 q55 q63 q68 q73 q980

20

40

60

80

100

120

140

160

180

200

103

39.8 39.4 39.9

73.4

188.4

84.3 85.2

6.4 5 4.9 412.3 11.7 10.3 4.4

TPC-DS Interactive queriesOct 2014

Impala 1.4.2 Jethro 0.9

* Queries use original TPC-DS filter criteria

*

Page 16: Jethro data meetup    index base sql on hadoop - oct-2014

DEMO

Go to Tableau demo

1. Point browser at: http://54.245.114.83/ 2. Login as try-jethro/jethro1233. Edit workbooks:

1. Jethro: Jethro sd – save2. Impala: Impala sd save

Page 17: Jethro data meetup    index base sql on hadoop - oct-2014

Jethro Query Node

Side-By-Side Implementation

MapReduce / Impala

Jethro Indexer

DataStream

Jethro Query Nodes

▪▪

▪▪▪

▪▪ ▪ ▪▪▪

▪▪ ▪▪▪▪▪

▪▪

Hive / Pig

▪▪▪▪▪

▪▪

▪▪▪

BI Tools

SQL

Existing Hadoop tables are untouched

Indexes are added to select

tables. ~30% incremental

storage

Page 18: Jethro data meetup    index base sql on hadoop - oct-2014

1. Installing JethroData

• Existing Hadoop cluster– CDH 4.x, CDH 5.x, HDP 2.x, EMR 3.x

• Designated Jethro server– Can be inside or outside the cluster– HW: CPU: 16+ cores, Mem: 64GB+, Net: 1GB / 10GB, SSD for

cache• Install Jethro – download package

– rpm install– Install HDFS client (if needed) – Create /jethro dir in HDFS

• Start Jethro– service jethro start

Page 19: Jethro data meetup    index base sql on hadoop - oct-2014

2. Load Data into Jethro

• Run create instance script– JethroAdmin create-instance demo /Jethro/demo

• Define a new table– JethroClient demo localhost 9111

• Create table sales_demo (…);

• Run JethroLoader process– JethroLoader demo sales_demo.desc

sales_demo.csv &• Start Querying – ODBC, JDBC, JethroClient

That’s it!

Page 20: Jethro data meetup    index base sql on hadoop - oct-2014

Road Map

• Jethro S3• Analytic functions• UDF• Cascading optimizer• Function indexes• Light weight text search• Rows group format (Parquet/ORC)• Integration with YARN for resource management• Sync with Hive Metastore/HCatalog• Nested data• Materialized views• Distributes query

Page 21: Jethro data meetup    index base sql on hadoop - oct-2014

Functional Indexes

Solution• Function index created for commonly

used functions• Some function indexes are automatically

created for specific data types. Example:year function for timestamp

• Query optimizer will identify scenarios where functional indexes should be used

• Function index can also be user defined or created on the fly via adaptive optimization

Base Index Function YAER index

Value Rows Value Rows

02/04/2007 5,9,10,11,14 20071,2,3,4,5,6,7,

8,9,10,11,12,

13,14,15

03/05/2007 1,3,7,12,13

10/10/2007 2,4,6,8,15

01/02/2008 15,18 2008 15,16,17,18

05/03/2008 16,17

Query uses function index examples:year(c1)=2007 explicit use for year indexC1 between 2007-01-01 and 2007-12-31 implicit use for year indexC1 between 2007-01-01 and 2008-02-15 Mix: take year index for 2007 and rest from base index

ProblemHow to accelerate this query:Select count(*) from T where year(birthdate)=2007;

Page 22: Jethro data meetup    index base sql on hadoop - oct-2014

Jethro’s Benefits

Simple to use

• Implemented side by side with existing Hadoop system

• Access via SQL or your favorite BI tool

• Integrates with Hadoop eco system

10X Faster queries

• Interactive analysis with Sub second latency

• Access to data as it arrives

• Analyze granular, raw data

50% Cheaper to operate

• Significantly less computing resources

• No dual systems, costly ETL

• Elastically scalable on commodity hardware

Page 23: Jethro data meetup    index base sql on hadoop - oct-2014

• Point browser at: http://www.jethrodata.com/home

• Click • Register:

Try it Today

Page 24: Jethro data meetup    index base sql on hadoop - oct-2014

Jethro – Big Data Analytics. Real-Time.

Thank You!