Leveraging Hadoop to Mine Customer Insights in a Developing Market

Leveraging Hadoop in Wikimart

Roman Zykov

Head of analytics, http://wikimart.ru

London, Big Data World Europe, 20th September 2012

Key problem

To be or not to be….

Hadoop

Introduction

Key tasks for Wikimart

What

• BI tasks

• Web analytics (in-house solution)

• Recommendations on site

• Data services for marketing

Who

• Core analytics team

• Analytics members in other departments

• IT site operations

Problem

Too time consuming or too expensive?

• Data volume

• # of data services

Map Reduce

[Diagram: DATA, Standalone vs. Map Reduce]
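For readers new to the model named on this slide, here is a minimal single-process sketch of the three Map Reduce stages (map, shuffle, reduce), counting views per item. The record layout `(user_id, item_id)` and all function names are assumptions made for illustration; a real Hadoop job distributes these stages across the cluster.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (item_id, 1) pair per page view."""
    for _user_id, item_id in records:
        yield item_id, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_views(records):
    """Run the three stages in order on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical browsing records: (user_id, item_id)
views = [("u1", "phone"), ("u2", "phone"), ("u1", "laptop")]
print(count_views(views))
```

Because each map call looks at one record and each reduce call at one key, both stages can be split across machines, which is what makes the model attractive when data volume grows.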

Our idea

New platform for “Big Data” tasks only

• Start research on Map Reduce software

• First patient - recommendation engine

Difficulties

- no planned budget -> Hadoop is free

- no experts -> learn it

- no hardware -> virtual cluster

Requirements for Hadoop

• Easily scalable

• Easy deployment

• Easy integration

• Less low-level Java coding

• SQL-like queries

Data flow

Data feeds DWH

Accomplishments

Recommendations

• Collaborative filtering (item-to-item on browsing history, PIG)

• Similar products (items attributes, PIG)

• Most popular items (browsing history + orders, HiveQL)

• Internal and external search recommendations (HiveQL)
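The item-to-item idea in the first bullet can be sketched as follows. This is a small in-memory illustration in Python, not the Pig script used at Wikimart: it assumes browsing history is available as a mapping from user to viewed items, and it assumes cosine-style normalisation of co-view counts, which the slides do not specify. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_to_item(history, top_n=3):
    """history: dict of user_id -> iterable of item ids viewed by that user.

    Returns a dict of item -> list of (other_item, score), best first.
    """
    item_users = defaultdict(int)  # distinct users who viewed each item
    co_views = defaultdict(int)    # users who viewed both items of a pair
    for items in history.values():
        unique = sorted(set(items))
        for item in unique:
            item_users[item] += 1
        for a, b in combinations(unique, 2):
            co_views[(a, b)] += 1

    similar = defaultdict(list)
    for (a, b), both in co_views.items():
        # Assumed normalisation: co-views divided by the geometric mean of views.
        score = both / sqrt(item_users[a] * item_users[b])
        similar[a].append((b, score))
        similar[b].append((a, score))

    # Sort by descending score, then by item id for a stable order.
    return {
        item: sorted(pairs, key=lambda p: (-p[1], p[0]))[:top_n]
        for item, pairs in similar.items()
    }


# Hypothetical browsing history
history = {
    "u1": ["phone", "case"],
    "u2": ["phone", "case", "charger"],
    "u3": ["phone", "charger"],
}
for item, recs in item_to_item(history).items():
    print(item, recs)
```

The pair-counting step is the expensive part: its intermediate output grows much faster than the input, which is why this kind of job suits an offline Map Reduce platform.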

Some statistics after 1 year

• >10% of revenue

• 3 months to launch

• Tens of gigabytes are processed in 2 hours daily

• 1 crash only (cluster lost power)

Decision: invest in a hardware cluster

End user

Internal high-level languages

• HiveQL

• Pig

Reporting

• Pre-aggregated data for OLAP

• RDBMS - front end

• OLAP and reporting software should support HiveQL
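A rough sketch of the pre-aggregation pattern from the bullets above: raw rows are reduced to a small summary table, which is then loaded into an RDBMS that serves reports. Python and an in-memory SQLite database stand in for the Hadoop job and the front-end database; the table name `daily_sales`, the row layout, and the function names are assumptions for illustration only.

```python
import sqlite3
from collections import defaultdict


def pre_aggregate(orders):
    """orders: iterable of (day, category, amount).

    Returns sorted rows of (day, category, n_orders, revenue).
    """
    totals = defaultdict(lambda: [0, 0.0])
    for day, category, amount in orders:
        cell = totals[(day, category)]
        cell[0] += 1
        cell[1] += amount
    return [(day, cat, n, rev) for (day, cat), (n, rev) in sorted(totals.items())]


def load_summary(conn, rows):
    """Store the aggregated rows in the reporting database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales "
        "(day TEXT, category TEXT, n_orders INTEGER, revenue REAL)"
    )
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# Hypothetical raw order rows
orders = [
    ("2012-09-01", "phones", 100.0),
    ("2012-09-01", "phones", 50.0),
    ("2012-09-02", "books", 10.0),
]
conn = sqlite3.connect(":memory:")
load_summary(conn, pre_aggregate(orders))
print(conn.execute("SELECT day, category, n_orders, revenue FROM daily_sales").fetchall())
```

Reports then query only the small summary table, so the latency of the batch job never reaches the end user.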

Data Integration

• Sqoop

• Parallel data exchange with RDBMS (MS SQL, MySQL, Oracle, Teradata…)

• Incremental updates

• HDFS, Hive, HBase

• Talend Open Studio
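The "incremental updates" bullet refers to transferring only rows added since the previous run. Sqoop exposes this through its `--incremental`, `--check-column` and `--last-value` import options; the Python below is only a sketch of the underlying watermark logic, with a hypothetical function name and row layout, and does not call Sqoop itself.

```python
def incremental_batch(rows, check_column, last_value):
    """Return the rows whose check_column exceeds last_value, plus the new watermark.

    rows: list of dicts read from the source table.
    """
    new_rows = [row for row in rows if row[check_column] > last_value]
    # If nothing is new, the watermark stays where it was.
    new_last = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last


# Hypothetical source table with a monotonically increasing id
source = [{"id": 1, "item": "phone"}, {"id": 2, "item": "case"}]

batch, watermark = incremental_batch(source, "id", last_value=0)
print(len(batch), watermark)

source.append({"id": 3, "item": "charger"})
batch, watermark = incremental_batch(source, "id", last_value=watermark)
print(len(batch), watermark)
```

Storing the watermark between runs is what keeps each daily transfer proportional to the new data rather than to the full table.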

Hadoop vs RDBMS

• Hadoop will never replace an RDBMS:

• Latency

• Weak capabilities of HiveQL vs SQL

• Only some tasks with offline processing:

• Machine learning

• Queries on big tables

• ….

• Real time: NoSQL

Hadoop myth

Terabytes?

Petabytes?

Big tasks!

Conclusion

• Hadoop is not Rocket Science

• Intermediate data can be Big Data

Starter kit

• Hadoop management system

• Virtual hardware (cloud, virtual servers, etc.)

• Offline data tasks

• Pig or HiveQL

• Sqoop: import data from existing data sources

Thank you!!!

[email protected]

linkedin.com/in/romanzykov

