Jethro for Tableau - Accelerating BI on Big Data

Accelerating BI on Big Data

Topics

• BI on Big Data Trade-Off

• SQL-on-Hadoop Performance Challenges

• Live Demo: Tableau on Hadoop – Impala / Redshift / Jethro

• Jethro Technology Overview

• What Does Jethro Do?
– Acceleration server for BI on Big Data

• How It Works?
– Full indexing and cube caching
– Combines columnar SQL DB design with search-indexing technology

• When to Use It?
– Reporting, dashboards, discovery, ad-hoc

• How to Get It
– Download & free evaluation

• Partnerships
– BI & Hadoop vendors

About Us

SQL

Data

• Typical usage based on extracting selective data from remote data sources

• Extracted data then dynamically loaded into memory for interactive analysis

• Challenges:
– Size: performance typically degrades at ~250M rows
– Refresh lag time

BI & Big Data: Extract (In-Memory)

Tableau & Big Data

Data

Extract

• For every user interaction Tableau issues SQL queries to the target DB

• DB retrieves requested data, processes SQL aggregations and returns to Tableau

• Challenges:
– DB performance is significantly slower than in-mem speed

BI & Big Data: Live-Connect (In-DB)

Tableau & Big Data

Queries

Live Access

SQL enables the change of data platform while keeping the analytic apps intact

Analytics: ETL, Predictive, Reporting, BI

SQL

10x-100x Data
1/10 HW $ cost
Open Platform

Big Data Platforms: Hadoop vs. EDW Appliances

SQL-on-Hadoop Performance Challenges

SQL

SQL-on-Hadoop

ETL Predictive Reporting

BI

Too SLOW on Hadoop

x

It’s unrealistic to expect the same performance when the data is much larger and highly optimized hardware is replaced with commodity boxes

The Hadoop Trade-Off: Scale & Cost vs. Performance


More Hardware
– Add nodes, RAM, CPU, SSD, network

Different SQL-on-Hadoop engines

– Hive, Impala, Drill, SparkSQL, HAWQ, Presto, Actian, etc.

Rigid Data Model
– Less granularity, more pre-aggregations
– Pre-defined OLAP cubes
– De-normalize into single large table
– Multiple partition keys (replication)

Replicate from Hadoop to EDW
– Traditional: Teradata, Vertica, Netezza, …
– Cloud: Redshift
– As-a-Svc: BigQuery, Snowflake, Qubole

No Hadoop, No EDW
– Search: Elastic + Kibana
– NoSQL: HBase, Cassandra, MongoDB

BI & Data Combined
– Full-stack Hadoop: Platfora, Arcadia
– As-a-Svc: DOMO, QuickSight, Power BI, …

BI on Big Data: Technology Alternatives

A Library Analogy: Billions of books, thousands of racks

Query: List books by author “Stephen King”

Process: Every librarian pulls books out of their rack one by one and checks the author

• Hive
• Impala
• Presto
• SparkSQL
• Drill
• Pivotal/HAWQ
• IBM/Big SQL
• Actian
• …

SQL-on-Hadoop: MPP/Full-Scan Architecture


Unsuitable for BI

Query: List books by author “Stephen King”

Process: Access Author index, entry of “Stephen King”, get list of books, fetch only these books

Result: Fast, minimal resources, scalable

SQL-on-Hadoop: Index-Access Architecture


Optimal for BI
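The library analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `books` list, the function names and the "touched" counters are hypothetical, and a real engine works on distributed storage rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical toy "library": (book_id, author) pairs.
books = [
    (0, "Stephen King"),
    (1, "Agatha Christie"),
    (2, "Stephen King"),
    (3, "Isaac Asimov"),
    (4, "Agatha Christie"),
    (5, "Stephen King"),
]


def full_scan(books, author):
    """Full-scan approach: pull every book and check its author."""
    touched = 0
    hits = []
    for book_id, book_author in books:
        touched += 1
        if book_author == author:
            hits.append(book_id)
    return hits, touched


def build_author_index(books):
    """Build the author index once: author -> list of book ids."""
    index = defaultdict(list)
    for book_id, author in books:
        index[author].append(book_id)
    return index


def index_lookup(index, author):
    """Index-access approach: go to the author's entry, fetch only those books."""
    hits = list(index.get(author, []))
    return hits, len(hits)


author_index = build_author_index(books)
scan_hits, scan_touched = full_scan(books, "Stephen King")
index_hits, index_touched = index_lookup(author_index, "Stephen King")
print(scan_hits, scan_touched)    # [0, 2, 5] 6
print(index_hits, index_touched)  # [0, 2, 5] 3
```

Both approaches return the same books; the difference is how many books are touched. The scan cost grows with the size of the library, while the lookup cost grows with the size of the answer.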

Hardware

          Data Format     Hadoop Cluster / Compute Cluster     Total RAM, CPU   AWS $ per hr.
Jethro    Jethro indexes  3x m1.xlarge / 2x r3.4xlarge (spot)  290GB, 44 cores  $0.75
Impala    Parquet         6x r3.2xlarge, 1x r3.xlarge          390GB, 52 cores  $4.25
Redshift  Redshift        6x dc1.large                         90GB, 12 cores   $1.50

• Point browser to: tableau.jethrodata.com
– Login: demo / demo

• Choose workbook: Jethro, Impala, Redshift

• Dashboard interaction: choose year, category or any other filters to drill-down

• Data
– Based on TPC-DS benchmark
– 1TB raw data (400GB fact)
– Fact table: ~2.9B rows
– 7 dimensions

LIVE Benchmark: Tableau on Hadoop (and Redshift)

Live Benchmark

http://tableau.jethrodata.com/

Indexing Data for Jethro Acceleration

• Identify BI-worthy datasets
– Not all data in Hadoop needs to be indexed by Jethro

• Jethro “loader” creates an indexed version
– Stores it back in the same HDFS
– If no Hadoop is used, it can also be stored in a local filesystem, network storage or cloud storage (e.g. S3)
– Highly efficient: ~1B rows/hour, 3x compression

• Incremental refresh
– As frequently as every min, hour, day, …
– Does not require a full rebuild of the index

Raw Indexed

[Diagram: Data Nodes and Jethro Query Nodes (QN)]

1. Index Access
2. Read data only for required rows

Performance and resources based on the size of the working-set

Storage
- HDFS
- Cloud (S3, EFS)
- NAS/SAN
- Local FS

SELECT date, SUM(sales)
FROM T1
WHERE product = 'Books' AND state = 'NY'
GROUP BY date

Index-Access: How it Works
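A minimal Python sketch of the two steps for the query on this slide (SUM of sales by date for product 'Books' in state 'NY'). The sample rows, the `build_index` helper and the use of Python sets are assumptions made for illustration; they are not Jethro's actual index format or API.

```python
from collections import defaultdict

# Hypothetical rows of table T1: (date, product, state, sales).
T1 = [
    ("2015-01-01", "Books", "NY", 10.0),
    ("2015-01-01", "Books", "CA", 7.0),
    ("2015-01-01", "Music", "NY", 3.0),
    ("2015-01-02", "Books", "NY", 5.0),
    ("2015-01-02", "Books", "NY", 2.5),
    ("2015-01-02", "Games", "TX", 9.0),
]
COLUMNS = {"date": 0, "product": 1, "state": 2, "sales": 3}


def build_index(rows, column):
    """Inverted list for one column: value -> set of row ids."""
    pos = COLUMNS[column]
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        index[row[pos]].add(row_id)
    return index


product_index = build_index(T1, "product")
state_index = build_index(T1, "state")


def sales_by_date(product, state):
    # Step 1: index access. Intersect the row-id sets of both predicates.
    row_ids = product_index.get(product, set()) & state_index.get(state, set())
    # Step 2: read data only for the required rows, then aggregate.
    totals = defaultdict(float)
    for row_id in sorted(row_ids):
        row = T1[row_id]
        totals[row[COLUMNS["date"]]] += row[COLUMNS["sales"]]
    return dict(totals), len(row_ids)


result, rows_read = sales_by_date("Books", "NY")
print(result)     # {'2015-01-01': 10.0, '2015-01-02': 7.5}
print(rows_read)  # 3
```

Only three of the six rows are read here, which is the sense in which performance depends on the size of the working set rather than the size of the table.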

Jethro Indexes – Superior Technology

Patent Pending: http://www.google.com/patents/WO2013001535A3?cl=en

Complete
– Every column is indexed

Simple
– Inverted-list indexes map each column value to a list of rows

Fast to read
– Index-of-index provides direct access to a value entry
– No need to scan entire index, or load index to memory

Scalable
– Distributed, highly hierarchical compressed bitmaps

Fast to write
– Appendable index structure for fast incremental refresh
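To illustrate what "appendable" means for an inverted-list index, here is a toy Python sketch. The `AppendableIndex` class is hypothetical and stores plain Python lists; the slide describes hierarchical compressed bitmaps, which this sketch does not attempt to reproduce.

```python
class AppendableIndex:
    """Toy inverted-list index for one column that supports appends."""

    def __init__(self):
        self.entries = {}   # column value -> ascending list of row ids
        self.row_count = 0  # number of rows indexed so far

    def append(self, values):
        """Index a new batch of column values.

        Existing entries are only extended; nothing is rebuilt.
        Returns the set of values whose entries were touched.
        """
        touched = set()
        for value in values:
            self.entries.setdefault(value, []).append(self.row_count)
            self.row_count += 1
            touched.add(value)
        return touched

    def rows_for(self, value):
        """Direct access to the row list of a single value."""
        return self.entries.get(value, [])


state_index = AppendableIndex()
state_index.append(["NY", "CA", "NY", "TX"])  # initial load, rows 0-3
touched = state_index.append(["CA", "NY"])    # incremental load, rows 4-5

print(state_index.rows_for("NY"))  # [0, 2, 5]
print(sorted(touched))             # ['CA', 'NY']
```

The incremental batch touches only the entries for 'CA' and 'NY'; the entry for 'TX' is left as it was.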


Automated Cube & Query Caching

• Every query is cached
– Based on result-set size vs. execution time

• Cubes generated automatically
– Identify repeat query patterns
– For example: adding the filter as a col to a GROUP BY

• All stored in HDFS
– 10,000’s of cached cubes and queries

• Incremental refresh
– Query executes ONLY on the incremental data and then merges with cached results
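The incremental-refresh idea can be sketched as follows. This is a simplified illustration, not Jethro's implementation: the row data and the `sum_sales_by_date` and `merge` helpers are hypothetical, and the example covers only a SUM aggregate.

```python
def sum_sales_by_date(rows):
    """SUM(sales) GROUP BY date over (date, sales) rows."""
    totals = {}
    for date, sales in rows:
        totals[date] = totals.get(date, 0) + sales
    return totals


def merge(cached, delta):
    """Merge a cached SUM result with the result computed on new rows only."""
    merged = dict(cached)
    for date, total in delta.items():
        merged[date] = merged.get(date, 0) + total
    return merged


# Result cached after the initial load.
loaded_rows = [("2015-01-01", 10), ("2015-01-01", 5), ("2015-01-02", 7)]
cached_result = sum_sales_by_date(loaded_rows)

# An incremental load arrives: run the query ONLY on the new rows, then merge.
new_rows = [("2015-01-02", 3), ("2015-01-03", 4)]
refreshed = merge(cached_result, sum_sales_by_date(new_rows))

print(refreshed)  # {'2015-01-01': 15, '2015-01-02': 10, '2015-01-03': 4}
```

Merging by addition works for SUM and COUNT. MIN and MAX merge with `min` and `max` instead, and AVG has to be kept as a (sum, count) pair so that it can be merged the same way.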

What Is Jethro for Tableau?
An indexing & caching server

1. Tableau uses live connect (ODBC) to send SQL queries

2. Jethro checks if query can be served from existing cubes
– Yes: reply to Tableau

3. Jethro uses indexed table to access only necessary data
– Auto create a cube based on this and similar queries
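Steps 2 and 3 can be sketched as a small routing function. Everything here is an assumption for illustration: the `ROWS` data, the `cubes` dictionary and the `stats` counters are hypothetical, the cube is built on the first miss rather than after a repeat pattern is detected, and the table read is a plain filter standing in for index access.

```python
ROWS = [  # hypothetical fact rows: (date, product, sales)
    ("2015-01-01", "Books", 10),
    ("2015-01-01", "Music", 3),
    ("2015-01-02", "Books", 5),
    ("2015-01-02", "Games", 9),
]
CUBE_KEY = ("date", "product")

cubes = {}  # cube key -> {(date, product): SUM(sales)}
stats = {"cube_hits": 0, "table_reads": 0}


def build_cube():
    """SUM(sales) GROUP BY date, product: the filter column joins the GROUP BY."""
    cube = {}
    for date, product, sales in ROWS:
        cube[(date, product)] = cube.get((date, product), 0) + sales
    return cube


def query_table(product):
    """Stand-in for index access: aggregate only rows matching the filter."""
    totals = {}
    for date, row_product, sales in ROWS:
        if row_product == product:
            totals[date] = totals.get(date, 0) + sales
    return totals


def sales_by_date(product):
    """SELECT date, SUM(sales) ... WHERE product = ? GROUP BY date."""
    if CUBE_KEY in cubes:
        # Step 2: the query can be served from an existing cube.
        stats["cube_hits"] += 1
        cube = cubes[CUBE_KEY]
        return {d: total for (d, p), total in cube.items() if p == product}
    # Step 3: answer from the table, then create a cube for similar queries.
    stats["table_reads"] += 1
    result = query_table(product)
    cubes[CUBE_KEY] = build_cube()
    return result


first = sales_by_date("Books")   # served from the table, cube gets created
second = sales_by_date("Music")  # different filter value, served from the cube
print(first, second, stats)
```

Because the cube groups by the filter column as well, one cube answers the same dashboard query for any product value.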

Live Connect

HDFS

BI Tools

Why Is Jethro the Right Technology for BI on Big Data?

Limitless BI on Big Data: Supporting the full range of BI use cases. Jethro’s technology is a unique and optimal fit.

1. Full indexing enables interactive discovery and fast drill-down
– Eliminates need to repeatedly read unnecessary data. The deeper you go the faster it gets!

2. Auto cubes & cache enable interactive dashboards and fast reports
– Optimize repeat query performance

3. Incremental refresh enables LIVE BI over streaming data
– Reduces maintenance and cuts lag time

Ready to Try Jethro?

1. Register: jethro.io
– Download and install on-prem or cloud

2. Schedule a 30min POC review with Jethro SA (free!)

3. Index BI-worthy datasets

4. Use Tableau

5. Train Jethro with BI apps
– Continuous performance improvement

That’s It!

http://jethro.io/
