Leveraging Hadoop to mine customer insights in a developing market


DESCRIPTION

I was a speaker at the Big Data World Europe conference in London on 20 September 2012 (http://www.terrapinn.com/2012/big-data-world-europe/). The full text of the speech is at http://webkpis.com/2012/11/hadoop-implementation-in-wikimart/

• Incorporating Hadoop technology within your infrastructure to cut costs and increase the scale of your operations

• Understanding how Hadoop can provide insightful data analysis to the end user

• Combining Hadoop with existing enterprise systems to deepen your insight and discover previously hidden trends

• Will Hadoop replace the need for relational data warehousing systems?

TRANSCRIPT

Leveraging Hadoop in Wikimart

Roman Zykov, Head of Analytics, http://wikimart.ru

London, Big Data World Europe, 20 September 2012

Key problem

To be or not to be… Hadoop?

Introduction

Key tasks for Wikimart

What:

• BI tasks

• Web analytics (an in-house solution)

• Recommendations on the site

• Data services for marketing

Who:

• The core analytics team

• Analytics staff in other departments

• IT site operations

Problem

Too time-consuming or too expensive?

• Data volume

• # of data services

Map Reduce

[Diagram: DATA processed on a standalone server vs. with Map Reduce]

Our idea

A new platform for "Big Data" tasks only

• Start researching Map Reduce software

• First patient: the recommendation engine

Difficulties:

- No planned budget -> Hadoop is free

- No experts -> learn it

- No hardware -> use a virtual cluster

Requirements for Hadoop

• Easily scalable

• Easy deployment

• Easy integration

• Minimal low-level Java coding

• SQL-like queries (see the example below)
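As an illustration of the last two requirements, here is a minimal HiveQL sketch (the table name, columns, and HDFS path are hypothetical, not from the talk) that exposes a tab-separated clickstream file already sitting in HDFS as a queryable table, with no Java code at all:

    -- Map a raw TSV file in HDFS to a Hive table (no data is copied)
    CREATE EXTERNAL TABLE pageviews (
      user_id BIGINT,
      item_id BIGINT,
      view_ts STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/clickstream/pageviews';

    -- From here on, ordinary SQL-style queries work
    SELECT COUNT(*) FROM pageviews;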

Data flow

[Diagram: data flowing between the data feeds and the DWH]

Accomplishments

Recommendations

• Collaborative filtering (item-to-item on browsing history, Pig)

• Similar products (item attributes, Pig)

• Most popular items (browsing history + orders, HiveQL; see the sketch below)

• Internal and external search recommendations (HiveQL)
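A minimal sketch of what the "most popular items" job could look like in HiveQL, assuming the hypothetical pageviews table above plus an equally hypothetical orders(user_id, item_id) table; the actual Wikimart queries are not shown in the talk:

    -- Blend browsing and purchase counts, favouring purchases
    SELECT v.item_id, v.views, COALESCE(o.purchases, 0) AS purchases
    FROM (
      SELECT item_id, COUNT(*) AS views
      FROM pageviews
      GROUP BY item_id
    ) v
    LEFT OUTER JOIN (
      SELECT item_id, COUNT(*) AS purchases
      FROM orders
      GROUP BY item_id
    ) o ON v.item_id = o.item_id
    ORDER BY purchases DESC, views DESC
    LIMIT 100;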

Some statistics after 1 year

• >10% of revenue

• 3 months to launch

• Tens of gigabytes processed in a ~2-hour daily batch

• Only 1 crash (the cluster lost power)

Decision: invest in a hardware cluster

End user

Internal high-level languages

• HiveQL

• Pig

Reporting

• Pre-aggregated data for OLAP (see the sketch below)

• RDBMS as the front end

• OLAP and reporting software should support HiveQL
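The pre-aggregation step might look like the following HiveQL sketch; daily_item_stats is a hypothetical target table (assumed to be created beforehand), and the compact roll-up it produces is what the RDBMS front end serves to the OLAP tool:

    -- Daily roll-up: one row per (day, item) instead of millions of raw events
    INSERT OVERWRITE TABLE daily_item_stats
    SELECT to_date(view_ts) AS dt,
           item_id,
           COUNT(*) AS views,
           COUNT(DISTINCT user_id) AS visitors
    FROM pageviews
    GROUP BY to_date(view_ts), item_id;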

Data Integration

• Sqoop (see the example below)

• Parallel data exchange with RDBMSs (MS SQL, MySQL, Oracle, Teradata… )

• Incremental updates

• HDFS, Hive, HBase

• Talend Open Studio
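An incremental Sqoop import, and the matching export back to the RDBMS, could look roughly like this (host, database, table, and column names are hypothetical; the flags are standard Sqoop 1 command-line options):

    # Pull only order rows newer than the last imported order_id, in parallel
    sqoop import \
      --connect jdbc:mysql://dwh-host/shop \
      --table orders \
      --target-dir /data/orders \
      --incremental append \
      --check-column order_id \
      --last-value 1200000 \
      --num-mappers 4

    # Push Hive-side aggregates back into the RDBMS front end
    # (\001 is Hive's default field delimiter)
    sqoop export \
      --connect jdbc:mysql://dwh-host/shop \
      --table daily_item_stats \
      --export-dir /user/hive/warehouse/daily_item_stats \
      --input-fields-terminated-by '\001'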

Hadoop vs RDBMS

• Hadoop will never replace the RDBMS:

  • Latency

  • HiveQL's capabilities are weak compared to SQL

• Only certain tasks with offline processing:

  • Machine learning

  • Queries against big tables

  • …

• Real time: NoSQL

Hadoop myth

Terabytes?

Petabytes?

Big tasks!

Conclusion

• Hadoop is not rocket science

• Intermediate data can be Big Data

Starter kit

• Hadoop management system

• Virtual hardware (cloud, virtual servers, etc.)

• Offline data tasks

• Pig or HiveQL

• Sqoop: import data from existing data sources

Thank you!!!

rzykov@gmail.com

linkedin.com/in/romanzykov
