Apache Lens at Hadoop meetup

Download Apache Lens at Hadoop meetup

Post on 14-Jul-2015




1 download



Apache LensUnified analytics platform

Sharad AgarwalAmareshwari Sriramadasu1Agenda2Reporting warehouse : Generation 1Reporting data in RDBMSAggregations/materialized views in DB~ 1.5 TBGeneration 1: RDBMSChallenges:Data Scale: Loading of data taking ~ 24 hrsAnalysis only upto 3 dimensionsHeavy queries stalling other user queriesUnable to move fast with new reporting requirments

Generation 1 : Reporting data in RDBMSReporting warehouse : Generation 2Small and summarized data in Columnar Database- Rich DashboardsGranular data in Hadoop- Adhoc analysis

Generation 2: Hadoop + Columnar DBGrown to:Hadoop ~ 250 TBColumnar DB ~ 8 TB

Generation 2: Hadoop + Columnar DBChallenges:Maintaining two lines of independent data warehousing systemsData discrepanciesSchema managementLearning curve for UsersDuplicate datasetsInefficient UtilizationProprietary OLAP and MR on Hadoop

Generation 2: Hadoop + Columnar DBReporting warehouse : Generation 3Platform to enable multi-dimensional queries in a unified way over datasets stored in multiple warehouses

- OLAP Cube abstraction- Data discovery by providing single metadata layer- Unified access to data by integrating Hive with other traditional warehouses

Generation 3: Apache Lens (earlier called Grill)- Queries get pushed to where data resides- Central Catalog management: All applications talk same languageQuery analytics for optimizing hot datasetsWorkload based experimentation with newer systems: AWS Redshift, Apache Spark, Apache TezGeneration 3: Apache LensMachine learning workflow in Apache Lens using Apache SparkGeneration 4 (Future) : Advanced AnalyticsAgenda14Lens Architecture

Data Layout Fact dataTime

Dim-cuts, Measures

Cost16Data Layout Dimension dataCost17Data Layout SnowflakeDim2_1Dim3 Dim2 Dim4_1Dim4Dim1DemoData ModelSAMPLE_CUBESAMPLE_DIM2SAMPLE_DIMSAMPLE_DB_DIMd3->idd2id->idd2id->idApache Lens : FeaturesBeta version in prod21ReferencesBackupExample queryExample queryData Model Storage27Data Model Fact TableData Model Dimension tableData Model Storage tables and partitionsQuery featuresResolve candidate dimension tables and the storage tables .Resolve the candidate fact tables which can answer the query, pick the ones from top of the pyramid.Resolve fact storage tables for the queried time range.Automatically resolve joins using the relationships between cubes and dimension.Automatically add aggregate functions to measures.Add expression to group by clause, if projected; and project group by clause, if it is not.

31Analytics systems at InmobiAdhoc querying systemAdhoc and Batch queriesScheduled queriesBased on Hadoop MapreduceProvides UI and custom apiData is stored in HDFS

Dashboard systemCanned reportsInteractive and adhoc queriesProvides UI and Custom apiData is stored in columnar DWH

Customer facing systemFace to the outside world (Advertisers and publishers)Interactive and adhoc queriesProvides UI and custom apiData is stored in relational DB, Postgres