How to create self-service analytics tool from activity logs garbage


2016 Sep 14, Wrike Tech Hub

Aleksei Smirnov, Data Analyst at Wrike Inc.

Aleksei Pupyshev, Data Scientist at Wrike Inc.

Wrike is ...

● Workspace (web application)

● iOS & Android apps

● Many integrations and a public API

We release new products and features, and change old ones, very quickly.

Wrike is a data-driven development company

Wrike Analytics Tools Evolution: What about logs?

We have implemented a log processing infrastructure based on Spark SQL (see the presentation from SPbDSM, Sep 2015).

Diagram labels: UI events, Web Requests, Backend Services, ETL, Thrift interface

More about the Parquet file structure: https://habrahabr.ru/company/wrike/blog/279797/

Wrike Analytics Tools Evolution: Problems

Spark-submit Python jobs

● More and more ETLs and PySpark jobs for specific tasks and dashboards

● No common standard or knowledge (code) base for metric extraction and computation

● Many different specific input and output sources, separate for each analysis

● It's hell to generate datasets for ML (predictions, lead scoring, personalization, etc.) or for ad hoc analyses

● No way to build a single monitoring and alerting system for Wrike events and KPIs

● Hundreds of dashboards for Wrike data stakeholders, which makes it difficult to get any insight into product and business development

● No metric naming convention

Wrike Analytics Tools Evolution: Solution

● Unification of log-format data: different event timestamp formats converted to one, different production tables converted to a log-structured format, user_id unified across all sources

● Unification of the grouping format: (in our case) user_id and day

● Standardisation of metric naming: a position-based naming schema: entity__event__source__path__measure__unit__details

● Unification of the process for creating auto-updatable metrics and features and for testing metrics: via Jupyter Notebook, using any of the following syntaxes: Python, Pandas, SQL (PandasSQL)

● Generation of a single data source that contains all user activity metrics and features, with an updatable schema: the Daily User Activity Data Mart (Vitrina)
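To make the position-based naming schema concrete, here is a minimal Python sketch that parses and builds names of the form entity__event__source__path__measure__unit__details. The helper names (MetricName, parse_metric_name, build_metric_name) are hypothetical, and reading "x" as a placeholder for an unused position is an assumption based on KPI names that appear later in the deck, such as ses__x__x__x__avg__mn__x.

```python
from typing import NamedTuple


class MetricName(NamedTuple):
    """One field per position of the naming schema."""
    entity: str
    event: str
    source: str
    path: str
    measure: str
    unit: str
    details: str


SEPARATOR = "__"
PLACEHOLDER = "x"  # assumption: "x" marks a position that does not apply


def parse_metric_name(name: str) -> MetricName:
    """Split a metric name into its seven positional fields."""
    parts = name.split(SEPARATOR)
    if len(parts) != len(MetricName._fields):
        raise ValueError(
            f"expected {len(MetricName._fields)} fields, got {len(parts)}: {name!r}"
        )
    return MetricName(*parts)


def build_metric_name(**fields: str) -> str:
    """Assemble a metric name, filling unspecified positions with the placeholder."""
    unknown = set(fields) - set(MetricName._fields)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return SEPARATOR.join(fields.get(f, PLACEHOLDER) for f in MetricName._fields)


parsed = parse_metric_name("act__x__ws__dashb__cnt__ev__x")
print(parsed.source, parsed.path, parsed.measure)
print(build_metric_name(entity="ses", measure="avg", unit="mn"))
```

Because every name has the same number of positions, a dashboard or alerting tool can filter metrics by any single position (for example, all metrics whose measure is cnt) without a separate metadata table.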

Wrike User Activity Data Mart: Tech Stack

Wrike User Activity Data Mart: Under the Hood

UADataMart Under the Hood: Concatenating logs

● Unification of log-format data: different event timestamp formats converted to one, different production tables converted to a log-structured format, user_id unified across all sources

Logs:

● Client log (UI)

● Web log (Requests)

● Email log

● Event log (Invitations, Registrations, etc.)

● Search log

● Mobile log

● ...

Production databases (from many shards):

● Delta table

● File attachments

● Task changes

● ...

Union of Spark data frames with schema merging. We should also rename columns by adding a source prefix (except user_id and timestamp).

This operation isn't expensive and is very useful!
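The concatenation step can be sketched in plain Python, with lists of dicts standing in for Spark data frames. This is an illustration under stated assumptions, not the authors' code: the function names, the single-underscore prefix separator, and the sample column names are all hypothetical. In current PySpark, DataFrame.unionByName with allowMissingColumns=True performs a comparable union of frames with different columns.

```python
from typing import Dict, List

KEY_COLUMNS = ("user_id", "timestamp")  # shared by every source, never prefixed


def prefix_columns(rows: List[dict], source: str) -> List[dict]:
    """Rename every non-key column to '<source>_<column>' (separator is an assumption)."""
    return [
        {(k if k in KEY_COLUMNS else f"{source}_{k}"): v for k, v in row.items()}
        for row in rows
    ]


def union_with_merged_schema(sources: Dict[str, List[dict]]) -> List[dict]:
    """Concatenate all sources; columns missing from a row are filled with None."""
    prefixed = [
        row
        for name, rows in sources.items()
        for row in prefix_columns(rows, name)
    ]
    # Merged schema: key columns first, then every other column in first-seen order.
    schema = list(KEY_COLUMNS)
    for row in prefixed:
        for col in row:
            if col not in schema:
                schema.append(col)
    return [{col: row.get(col) for col in schema} for row in prefixed]


# Hypothetical sample data for two of the sources listed on the slide.
sources = {
    "client": [{"user_id": 1, "timestamp": "2016-09-01T10:00:00", "event": "open_dashboard"}],
    "web": [{"user_id": 1, "timestamp": "2016-09-01T10:00:05", "path": "/tasks"}],
}
unified = union_with_merged_schema(sources)
for row in unified:
    print(row)
```

The source prefix keeps columns with the same name in different logs from colliding, while user_id and timestamp stay unprefixed so that later steps can group and order events across sources.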

UADataMart Under the Hood: Grouping by User

This is an expensive operation!

We then apply the “magic” map function to each group.
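A minimal sketch of the grouping step, again with plain Python structures in place of Spark. The grouping key (user_id, day) follows the unified grouping format named earlier in the deck; the function names and the ISO-8601 timestamp format are assumptions. In Spark this kind of grouping requires a shuffle that moves all rows with the same key to one place, which is a likely reason the slide calls it expensive.

```python
from collections import defaultdict


def group_by_user_and_day(rows):
    """Collect unified log rows under a (user_id, day) key."""
    groups = defaultdict(list)
    for row in rows:
        # Assumes ISO-8601 timestamp strings such as 2016-09-01T10:00:00.
        day = row["timestamp"][:10]
        groups[(row["user_id"], day)].append(row)
    return dict(groups)


def apply_map(groups, map_fn):
    """Call map_fn once per group, in a stable (sorted-by-key) order."""
    return [
        map_fn(user_id, day, events)
        for (user_id, day), events in sorted(groups.items())
    ]


# Hypothetical sample rows.
rows = [
    {"user_id": 1, "timestamp": "2016-09-01T10:00:00"},
    {"user_id": 1, "timestamp": "2016-09-01T18:30:00"},
    {"user_id": 1, "timestamp": "2016-09-02T09:00:00"},
    {"user_id": 2, "timestamp": "2016-09-01T11:00:00"},
]
groups = group_by_user_and_day(rows)
event_counts = apply_map(groups, lambda user_id, day, events: (user_id, day, len(events)))
print(event_counts)
```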

UADataMart Under the Hood: “magic” map function

● Create a Pandas DataFrame from the grouped Row object

● Apply each “Metrics Module Function” to a copy of the Pandas DataFrame; each function returns a dictionary of metric (KPI) names and values

● If an exception occurs (an error inside a module function), generate a dictionary with default KPI values instead

● Merge the returned dictionaries and convert the result to a Row
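The steps of the map function can be sketched as follows. To stay dependency-free, a list of dicts stands in for the Pandas DataFrame and a plain dict stands in for the Spark Row; the module functions, the event field client_event, and the pairing of each function with a dictionary of defaults are hypothetical choices made for illustration. The KPI names are taken from examples elsewhere in the deck.

```python
import copy

DASHBOARD_KPI = "act__x__ws__dashb__cnt__ev__x"
TASKLIST_KPI = "act__x__ws__tlist__cnt__ev__x"


def dashboard_events(events):
    """Hypothetical metrics module function: count dashboard events."""
    count = sum(1 for e in events if e.get("client_event") == "open_dashboard")
    return {DASHBOARD_KPI: count}


def broken_module(events):
    """Hypothetical module that fails on purpose to show the fallback path."""
    return {TASKLIST_KPI: len(events) / 0}


# Each module is registered together with the default values to use on failure.
METRIC_MODULES = [
    (dashboard_events, {DASHBOARD_KPI: 0}),
    (broken_module, {TASKLIST_KPI: 0}),
]


def magic_map(user_id, day, events, modules=METRIC_MODULES):
    """Run every metrics module on one (user_id, day) group and merge the results."""
    row = {"user_id": user_id, "day": day}
    for module_fn, defaults in modules:
        try:
            # Each module gets its own copy, so one module cannot corrupt
            # the input seen by the next.
            metrics = module_fn(copy.deepcopy(events))
        except Exception:
            # An error inside a module must not lose the whole row.
            metrics = dict(defaults)
        row.update(metrics)
    return row


events = [{"client_event": "open_dashboard"}, {"client_event": "open_task"}]
print(magic_map(1, "2016-09-01", events))
```

Isolating each module behind a try/except means a bug in one newly added metric degrades only that metric to its default instead of failing the whole daily job.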

UADataMart Under the Hood: Metrics Module Functions

Example: based on PandasSQL syntax

Note: we can use any syntax we like here: SQL, Python, or Pandas!
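The slide's own example did not survive in this transcript. As a stand-in, here is a hypothetical metrics module function written in SQL. pandasql queries use SQLite syntax, so this sketch calls Python's built-in sqlite3 module directly to stay dependency-free; the table layout, the client_event column, and the function name are assumptions.

```python
import sqlite3

DASHBOARD_KPI = "act__x__ws__dashb__cnt__ev__x"


def dashboard_event_count(events):
    """Hypothetical metrics module function: count dashboard events with SQL."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE events (user_id INTEGER, ts TEXT, client_event TEXT)"
        )
        con.executemany(
            "INSERT INTO events VALUES (?, ?, ?)",
            [(e["user_id"], e["timestamp"], e.get("client_event")) for e in events],
        )
        (count,) = con.execute(
            "SELECT COUNT(*) FROM events WHERE client_event = 'open_dashboard'"
        ).fetchone()
    finally:
        con.close()
    # The function returns a dictionary of KPI name to value, as the deck describes.
    return {DASHBOARD_KPI: count}


sample_events = [
    {"user_id": 1, "timestamp": "2016-09-01T10:00:00", "client_event": "open_dashboard"},
    {"user_id": 1, "timestamp": "2016-09-01T10:05:00", "client_event": "open_task"},
    {"user_id": 1, "timestamp": "2016-09-01T11:00:00", "client_event": "open_dashboard"},
]
print(dashboard_event_count(sample_events))
```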

UADataMart Under the Hood: Modules Structure

Wrike User Activity Data Mart: Under the Hood

Dimensions: apply UDFs (converting to a categorical value) to each dimension column.

Categorical dimensions: group by the categorical dimensions and aggregate (over all users) inside the grouped data.

Registration Period | Paid Details | Country | KPI Name | Sum of KPI | Day
From 1 year to 2 years | Paid | US | ses__x__x__x__avg__mn__x | 1000000 | 2016.09.01
From 6 months to 1 year | Free | BR | act__x__ws__dashb__cnt__ev__x | 20000 | 2016.09.01
From 2 weeks to 1 month | Free | GB | act__x__ws__tlist__cnt__ev__x | 100000 | 2016.09.02

~ 1 mln rows
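A sketch of the two steps on this slide: a UDF that turns a raw dimension into a categorical value, followed by grouping on the categorical dimensions and summing each KPI over all users. The bucket boundaries, the labels not shown in the table, the input field names, and the function names are hypothetical; plain Python stands in for Spark UDFs and aggregations.

```python
from collections import defaultdict
from datetime import date

# (upper bound in days, label); boundaries are assumptions for illustration.
PERIODS = [
    (14, "Less than 2 weeks"),
    (30, "From 2 weeks to 1 month"),
    (182, "From 1 month to 6 months"),
    (365, "From 6 months to 1 year"),
    (730, "From 1 year to 2 years"),
]


def registration_period(registered: date, day: date) -> str:
    """Hypothetical UDF: bucket account age into a categorical label."""
    age_days = (day - registered).days
    for limit, label in PERIODS:
        if age_days < limit:
            return label
    return "More than 2 years"


def aggregate(user_rows):
    """Sum every KPI over all users that share the same categorical dimensions."""
    totals = defaultdict(float)
    for row in user_rows:
        dims = (
            registration_period(row["registered"], row["day"]),
            row["paid"],
            row["country"],
            row["day"].isoformat(),
        )
        for kpi_name, value in row["kpis"].items():
            totals[dims + (kpi_name,)] += value
    # One output row per (dimensions, KPI name), matching the long table layout.
    return [
        {
            "registration_period": key[0],
            "paid": key[1],
            "country": key[2],
            "day": key[3],
            "kpi_name": key[4],
            "kpi_sum": total,
        }
        for key, total in sorted(totals.items())
    ]


# Hypothetical per-user daily rows.
user_rows = [
    {"registered": date(2015, 6, 1), "paid": "Paid", "country": "US",
     "day": date(2016, 9, 1), "kpis": {"act__x__ws__dashb__cnt__ev__x": 10}},
    {"registered": date(2015, 3, 1), "paid": "Paid", "country": "US",
     "day": date(2016, 9, 1), "kpis": {"act__x__ws__dashb__cnt__ev__x": 5}},
]
aggregated = aggregate(user_rows)
for row in aggregated:
    print(row)
```

Storing one row per KPI name keeps the output schema fixed when new metrics are added: a new metric adds rows, not columns.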

Wrike User Activity Data Mart: For Wrike Data Stakeholders

● entity__event__source__path__measure__unit__details

Demo!

Flow:

Wrike Analytics Tools Evolution: Problems

Spark-submit Python jobs

● More and more ETLs and PySpark jobs for specific tasks and dashboards

● No common standard or knowledge (code) base for metric extraction and computation

● Many different specific input and output sources, separate for each analysis

● It's hell to generate datasets for ML (predictions, lead scoring, personalization, etc.) or for ad hoc analyses

● No way to build a single monitoring and alerting system for Wrike events and KPIs

● Hundreds of dashboards for Wrike data stakeholders, which makes it difficult to get any insight into product and business development

● No metric naming convention

Other Applications:

● Alarm system (notifications when something goes wrong with metric values)

● Email personalization

● Recommendation system (e.g. Wrike feature recommendations, search quality improvements, user churn predictions, lead scoring, etc.)

Questions!

Thank you!