fifth elephant-grill

41
Grill Unified analytics platform Amareshwari Sriramadasu

Upload: amarsri

Post on 24-Jun-2015

339 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Fifth elephant-grill

GrillUnified analytics platform

Amareshwari Sriramadasu

Page 2: Fifth elephant-grill

Amareshwari

Sriramadasu

•Architect in Data platform team at Inmobi

•Working in Hadoop and eco systems since 2007

•Apache Hadoop – PMC

•Apache Hive– Committer

•Worked with Yahoo! earlier

Page 3: Fifth elephant-grill

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

Page 4: Fifth elephant-grill

Digital advertising at Inmobi

Courtesy: http://www.liesdamnedlies.com/

Owns & Sells Real estate on digital inventory

Has reach to users

Wants to target Users

Brings money

Market place

Consumer

Page 5: Fifth elephant-grill

Analytics Use cases

• Understanding Trends & Inference

• Forecasting and Anamoly detection

Data scientists

• Feedback to improve Ad Relevance in Real Time

Engineering systems

• Troubleshooting of issues

Developers

• Publisher/Advertiser specific analytics(dashboards)

Advertisers and publishers

• Tracking specific accounts

Account managers

• Inventory sizing and estimation

Business/Product analysts

Page 6: Fifth elephant-grill

• Canned/Dashboard queries

• Adhoc queries

• Interactive/Batch queries

• Scheduled queries

• Infer insights through ML algorithms

Analytics use cases

Page 7: Fifth elephant-grill

Analytics Warehouses at Inmobi and Scale

• Billions of Ad Requests/Impressions per day

• 170 TB Hadoop Warehouse• 5 TB SQL Columnar Datawarehouse• 70 TB Hbase cluster

Page 8: Fifth elephant-grill

Why both Hadoop and SQL warehouse?

Canned

Adhoc

Response Times

IO (Input Records)

Adhoc

Canned Adhoc

Query Engine

Query

Dashboard queries are mostly canned and Interactive

Adhoc queries can be Interactive or batch depending on the data volumes and query complexity

Page 9: Fifth elephant-grill

•Disparate user experience

•Disparate data storage and execution engines

•Schema management across storages

•Not leveraging ‘SQL on Hadoop’ community

Problems

Page 10: Fifth elephant-grill

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

Page 11: Fifth elephant-grill

GRILL

Analytics As Service

Unify the Catalog and Query layer for Adhoc/CannedBatch/Interactive

Reports on single Interface

Page 12: Fifth elephant-grill

Grill Architecture

Page 13: Fifth elephant-grill

Data Layout – Fact data

Aggr Factk : measures (mak <= ma(k-1)), dim-cuts

(dak < da(k-1))

…..Aggr Fact2 : measures (ma2 <=

ma1), dim-cuts(da2 < da1)

Aggr Fact1: measures (ma1 <= mr), dim-cuts(da1 < dr)

Raw fact(mr), dim-cuts(dr)

TimeDim-cuts, Measures

Cost

Page 14: Fifth elephant-grill

Data Layout – Dimension data

Subsetm (am < am-1)

Subset2 (a2 < a1)

Subset1 (a1 < ax)

All attributes (ax)

Cost

Page 15: Fifth elephant-grill

Data Layout – Snowflake

Aggr Factk : measures (mak <= ma(k-1)), dims (dak

< da(k-1))

…..

Aggr Fact2 : measures (ma2 <= ma1), dimensions (da2 < da1)

Aggr Fact1 : measures (ma1 <= mr), dimensions (da1 < dr)

Raw Fact: measures (mr), dimensions(dr)

Dim2_1

Dim3

Dim2

Dim4_1

Dim4

Dim1

Page 16: Fifth elephant-grill

Associates structure to dataProvides Metastore and catalog service – HcatalogProvides pluggable storage interfaceAccepts SQL like queriesHQL is widely adopted language by systems like Shark, ImpalaHas strong apache community

Data warehouse features like cubes, facts, dimensionsLogical table associated with multiple physical storagesPluggable execution engineQuery lifecycle managementQuery quota managementScheduling queries

Wh

at

does

Hiv

e

pro

vid

e

Wh

at is m

issing in

H

ive

Apache Hive to the rescue

Page 17: Fifth elephant-grill

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

Page 18: Fifth elephant-grill

Data Model

Cube

Fact Table• Physical Fact

tables

Dimension

Dimension Table• Physical

Dimension tables

Storage

Page 19: Fifth elephant-grill

Data Model - Cube

Expressions• Any expression with

reachable fields

Dim-attributes• Simple dim-attribute• Referenced dim-attribute• Hierarchical dim-

attribute• Timed dim-attribute

Measure• Column Measure

Cube

MeasuresDim-

attributesExpression

s

Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/

Page 20: Fifth elephant-grill

Data Model - Dimension

Expressions• Any expression with

reachable fields

Attributes• Simple dim-attribute• Referenced dim-

attribute• Hierarchical dim-

attribute

Dimension

Attributes Expressions

Page 21: Fifth elephant-grill

Data Model - Relationships

Cube

Dimension

Dimension

Fact table

Cube

Fact table

Storage

Dimension Table

Dimension

Dimension table

Storage

Page 22: Fifth elephant-grill

Data Model

Aggrk Fact Table

…..

Aggr2 Fact Table

Aggr1 Fact Table

Raw FactTable

Dim2_table1

Dim2_table2

Dim3_table1

Dim2_table2

Dimtable1

Dimtable2

CUBE

Dimension

Dimension

Dimension

Page 23: Fifth elephant-grill

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

Page 24: Fifth elephant-grill

CUBE SELECT [DISTINCT] select_expr, select_expr, ...FROM cube_table_referenceWHERE [where_condition AND] TIME_RANGE_IN(colName , from, to)[GROUP BY col_list][HAVING having_expr][ORDER BY colList][LIMIT number]

Queries on OLAP cubes

Page 25: Fifth elephant-grill

• SELECT ( city.name ), ( city.stateid ) FROM c2_citytable city LIMIT 100 • SELECT ( city.name ), ( city.stateid ) FROM c1_citytable city WHERE

(city.dt = 'latest') LIMIT 100

cube select name, stateid from city limit 100

Example query

Page 26: Fifth elephant-grill

Example query

• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable city ON ((testcube.cityid)= (city.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (city.dt = 'latest')

• GROUP BY(city.name)

cube select city.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)

Page 27: Fifth elephant-grill

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

Page 28: Fifth elephant-grill

Implements an interface• explain• execute• executeAsynchronously• fetchResults• Specify all storages it can support

Pluggable execution engine

Page 29: Fifth elephant-grill

OLAP Cube QL query

Rewrite query for available execution engine’s supported storages

Get cost of the rewritten query from each execution engine

Pick up execution engine with least cost and fire the query

Cube query with multiple execution engines

Page 30: Fifth elephant-grill

Grill Server

Page 31: Fifth elephant-grill

Grill – current state

Server

Query ServiceMetastore ServiceMetricsQuery statistics(In progress)Scheduled queries(In progress)Query caching(In progress)

Client

Java ClientCLIJDBC Client

Execution EngineHive DriverJDBC Driver

Page 32: Fifth elephant-grill

• Normalize query cost• Load balancing across execution engines• Alter meta hooks in StorageHandler• Authentication and authorization• Machine learning through Grill• Query quota management

Grill roadmap

Page 33: Fifth elephant-grill

Github source for grill• https://github.com/InMobi/grill

Github source for Hive• https://github.com/InMobi/hive

Documentation• http://inmobi.github.io/grill

Mailing lists• [email protected][email protected]

References

Page 34: Fifth elephant-grill

Thank you!• amareshwari@apache.

org

Page 35: Fifth elephant-grill

Backup

Page 36: Fifth elephant-grill

Data Model – Storage

Sto

rag

e • Name• End point• Properties• Ex : ProdCluster, StagingCluster,

Postgres1, HBase1, HBase2

Page 37: Fifth elephant-grill

Data Model – Fact Table

Fact table

Cube

Fact table

Storage

Fact

Ta

ble• Columns

• Cube that it belongs

• Storages on which it is present and the associated update periods

Page 38: Fifth elephant-grill

Data Model – Dimension tableD

imen

sion

Ta

ble • Columns

• Dimension to which it belongs

• Storages on which it is present and associated snapshot dump period, if any.

Dimension Table

Dimension

Dimension table

Storage

Page 39: Fifth elephant-grill

Data Model – Storage tables and partitionsS

tora

ge

tab

le• Belongs to fact/dimension• Associated storage

descriptor• Partitioned by columns• Naming convention –

storage name followed by fact/dimension name

• Partition can override its storage descriptor

• Fact storage table

Fact table

• Dimension storage table

Dimension table

Page 40: Fifth elephant-grill

Resolve candidate tables and storages

Automatically resolve joins, aggregations

Allows SQL over Cube QL

Queries can span multiple storages

Accepts multiple time ranges in query

All Hive QL features

Query features

Page 41: Fifth elephant-grill

Adhoc querying system Internal Dashboards

Customer facing Dashboards and Reporting

Analytics systems at Inmobi