fifth elephant-grill
TRANSCRIPT
GrillUnified analytics platform
Amareshwari Sriramadasu
Amareshwari
Sriramadasu
•Architect in Data platform team at Inmobi
•Working in Hadoop and eco systems since 2007
•Apache Hadoop – PMC
•Apache Hive– Committer
•Worked with Yahoo! earlier
Analytics at Inmobi - Problem areas and Motivation
Introduction to Grill
OLAP Model
Query examples
Grill design
Agenda
Digital advertising at Inmobi
Courtesy: http://www.liesdamnedlies.com/
Owns & Sells Real estate on digital inventory
Has reach to users
Wants to target Users
Brings money
Market place
Consumer
Analytics Use cases
• Understanding Trends & Inference
• Forecasting and Anamoly detection
Data scientists
• Feedback to improve Ad Relevance in Real Time
Engineering systems
• Troubleshooting of issues
Developers
• Publisher/Advertiser specific analytics(dashboards)
Advertisers and publishers
• Tracking specific accounts
Account managers
• Inventory sizing and estimation
Business/Product analysts
• Canned/Dashboard queries
• Adhoc queries
• Interactive/Batch queries
• Scheduled queries
• Infer insights through ML algorithms
Analytics use cases
Analytics Warehouses at Inmobi and Scale
• Billions of Ad Requests/Impressions per day
• 170 TB Hadoop Warehouse• 5 TB SQL Columnar Datawarehouse• 70 TB Hbase cluster
Why both Hadoop and SQL warehouse?
Canned
Adhoc
Response Times
IO (Input Records)
Adhoc
Canned Adhoc
Query Engine
Query
Dashboard queries are mostly canned and Interactive
Adhoc queries can be Interactive or batch depending on the data volumes and query complexity
•Disparate user experience
•Disparate data storage and execution engines
•Schema management across storages
•Not leveraging ‘SQL on Hadoop’ community
Problems
Analytics at Inmobi - Problem areas and Motivation
Introduction to Grill
OLAP Model
Query examples
Grill design
Agenda
GRILL
Analytics As Service
Unify the Catalog and Query layer for Adhoc/CannedBatch/Interactive
Reports on single Interface
Grill Architecture
Data Layout – Fact data
Aggr Factk : measures (mak <= ma(k-1)), dim-cuts
(dak < da(k-1))
…..Aggr Fact2 : measures (ma2 <=
ma1), dim-cuts(da2 < da1)
Aggr Fact1: measures (ma1 <= mr), dim-cuts(da1 < dr)
Raw fact(mr), dim-cuts(dr)
TimeDim-cuts, Measures
Cost
Data Layout – Dimension data
…
Subsetm (am < am-1)
…
Subset2 (a2 < a1)
Subset1 (a1 < ax)
All attributes (ax)
Cost
Data Layout – Snowflake
Aggr Factk : measures (mak <= ma(k-1)), dims (dak
< da(k-1))
…..
Aggr Fact2 : measures (ma2 <= ma1), dimensions (da2 < da1)
Aggr Fact1 : measures (ma1 <= mr), dimensions (da1 < dr)
Raw Fact: measures (mr), dimensions(dr)
Dim2_1
Dim3
Dim2
Dim4_1
Dim4
Dim1
Associates structure to dataProvides Metastore and catalog service – HcatalogProvides pluggable storage interfaceAccepts SQL like queriesHQL is widely adopted language by systems like Shark, ImpalaHas strong apache community
Data warehouse features like cubes, facts, dimensionsLogical table associated with multiple physical storagesPluggable execution engineQuery lifecycle managementQuery quota managementScheduling queries
Wh
at
does
Hiv
e
pro
vid
e
Wh
at is m
issing in
H
ive
Apache Hive to the rescue
Analytics at Inmobi - Problem areas and Motivation
Introduction to Grill
OLAP Model
Query examples
Grill design
Agenda
Data Model
Cube
Fact Table• Physical Fact
tables
Dimension
Dimension Table• Physical
Dimension tables
Storage
Data Model - Cube
Expressions• Any expression with
reachable fields
Dim-attributes• Simple dim-attribute• Referenced dim-attribute• Hierarchical dim-
attribute• Timed dim-attribute
Measure• Column Measure
Cube
MeasuresDim-
attributesExpression
s
Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/
Data Model - Dimension
Expressions• Any expression with
reachable fields
Attributes• Simple dim-attribute• Referenced dim-
attribute• Hierarchical dim-
attribute
Dimension
Attributes Expressions
Data Model - Relationships
Cube
Dimension
Dimension
Fact table
Cube
Fact table
Storage
Dimension Table
Dimension
Dimension table
Storage
Data Model
Aggrk Fact Table
…..
Aggr2 Fact Table
Aggr1 Fact Table
Raw FactTable
Dim2_table1
Dim2_table2
Dim3_table1
Dim2_table2
Dimtable1
Dimtable2
CUBE
Dimension
Dimension
Dimension
Analytics at Inmobi - Problem areas and Motivation
Introduction to Grill
OLAP Model
Query examples
Grill design
Agenda
CUBE SELECT [DISTINCT] select_expr, select_expr, ...FROM cube_table_referenceWHERE [where_condition AND] TIME_RANGE_IN(colName , from, to)[GROUP BY col_list][HAVING having_expr][ORDER BY colList][LIMIT number]
Queries on OLAP cubes
• SELECT ( city.name ), ( city.stateid ) FROM c2_citytable city LIMIT 100 • SELECT ( city.name ), ( city.stateid ) FROM c1_citytable city WHERE
(city.dt = 'latest') LIMIT 100
cube select name, stateid from city limit 100
Example query
Example query
• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable city ON ((testcube.cityid)= (city.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (city.dt = 'latest')
• GROUP BY(city.name)
cube select city.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)
Analytics at Inmobi - Problem areas and Motivation
Introduction to Grill
OLAP Model
Query examples
Grill design
Agenda
Implements an interface• explain• execute• executeAsynchronously• fetchResults• Specify all storages it can support
Pluggable execution engine
OLAP Cube QL query
Rewrite query for available execution engine’s supported storages
Get cost of the rewritten query from each execution engine
Pick up execution engine with least cost and fire the query
Cube query with multiple execution engines
Grill Server
Grill – current state
Server
Query ServiceMetastore ServiceMetricsQuery statistics(In progress)Scheduled queries(In progress)Query caching(In progress)
Client
Java ClientCLIJDBC Client
Execution EngineHive DriverJDBC Driver
• Normalize query cost• Load balancing across execution engines• Alter meta hooks in StorageHandler• Authentication and authorization• Machine learning through Grill• Query quota management
Grill roadmap
Github source for grill• https://github.com/InMobi/grill
Github source for Hive• https://github.com/InMobi/hive
Documentation• http://inmobi.github.io/grill
Mailing lists• [email protected]• [email protected]
References
Backup
Data Model – Storage
Sto
rag
e • Name• End point• Properties• Ex : ProdCluster, StagingCluster,
Postgres1, HBase1, HBase2
Data Model – Fact Table
Fact table
Cube
Fact table
Storage
Fact
Ta
ble• Columns
• Cube that it belongs
• Storages on which it is present and the associated update periods
Data Model – Dimension tableD
imen
sion
Ta
ble • Columns
• Dimension to which it belongs
• Storages on which it is present and associated snapshot dump period, if any.
Dimension Table
Dimension
Dimension table
Storage
Data Model – Storage tables and partitionsS
tora
ge
tab
le• Belongs to fact/dimension• Associated storage
descriptor• Partitioned by columns• Naming convention –
storage name followed by fact/dimension name
• Partition can override its storage descriptor
• Fact storage table
Fact table
• Dimension storage table
Dimension table
Resolve candidate tables and storages
Automatically resolve joins, aggregations
Allows SQL over Cube QL
Queries can span multiple storages
Accepts multiple time ranges in query
All Hive QL features
Query features
Adhoc querying system Internal Dashboards
Customer facing Dashboards and Reporting
Analytics systems at Inmobi