Leveraging Hadoop to mine customer insights in a developing market


DESCRIPTION

I was a speaker at the Big Data World Europe conference in London on 20 September 2012 (http://www.terrapinn.com/2012/big-data-world-europe/). The full text of the speech is at http://webkpis.com/2012/11/hadoop-implementation-in-wikimart/

• Incorporating Hadoop technology within your infrastructure to cut costs and increase the scale of your operations

• Understanding how Hadoop can provide insightful data analysis to the end user

• Combining Hadoop with existing enterprise systems to deepen your insight and discover previously hidden trends

• Will Hadoop replace the need for relational data warehousing systems?

TRANSCRIPT

Leveraging Hadoop in Wikimart

Roman Zykov, Head of Analytics, http://wikimart.ru

London, Big Data World Europe, 20 September 2012

Key problem

To be or not to be… Hadoop?

Introduction

Key tasks for Wikimart

What:

• BI tasks

• Web analytics (an in-house solution)

• Recommendations on the site

• Data services for marketing

Who:

• The core analytics team

• Analytics staff in other departments

• IT site operations

Problem

Too time-consuming or too expensive?

• Data volume

• # of data services

Map Reduce

[Diagram: DATA processed on a standalone server vs. with Map Reduce]

Our idea

A new platform for "Big Data" tasks only

• Start researching Map Reduce software

• First patient: the recommendation engine

Difficulties:

- No planned budget -> Hadoop is free

- No experts -> learn it

- No hardware -> use a virtual cluster

Requirements for Hadoop

• Easily scalable

• Easy deployment

• Easy integration

• Minimal low-level Java coding

• SQL-like queries (see the example below)
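As an illustration of the last two requirements, here is a minimal HiveQL sketch (the table name, columns, and HDFS path are hypothetical, not from the talk) that exposes a tab-separated clickstream file already sitting in HDFS as a queryable table, with no Java code at all:

    -- Map a raw TSV file in HDFS to a Hive table (no data is copied)
    CREATE EXTERNAL TABLE pageviews (
      user_id BIGINT,
      item_id BIGINT,
      view_ts STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/clickstream/pageviews';

    -- From here on, ordinary SQL-style queries work
    SELECT COUNT(*) FROM pageviews;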

Data flow

[Diagram: data flowing between the data feeds and the DWH]

Accomplishments

Recommendations

• Collaborative filtering (item-to-item on browsing history, Pig)

• Similar products (item attributes, Pig)

• Most popular items (browsing history + orders, HiveQL; see the sketch below)

• Internal and external search recommendations (HiveQL)
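A minimal sketch of what the "most popular items" job could look like in HiveQL, assuming the hypothetical pageviews table above plus an equally hypothetical orders(user_id, item_id) table; the actual Wikimart queries are not shown in the talk:

    -- Blend browsing and purchase counts, favouring purchases
    SELECT v.item_id, v.views, COALESCE(o.purchases, 0) AS purchases
    FROM (
      SELECT item_id, COUNT(*) AS views
      FROM pageviews
      GROUP BY item_id
    ) v
    LEFT OUTER JOIN (
      SELECT item_id, COUNT(*) AS purchases
      FROM orders
      GROUP BY item_id
    ) o ON v.item_id = o.item_id
    ORDER BY purchases DESC, views DESC
    LIMIT 100;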

Some statistics after 1 year

• >10% of revenue

• 3 months to launch

• Tens of gigabytes processed in a ~2-hour daily batch

• Only 1 crash (the cluster lost power)

Decision: invest in a hardware cluster

End user

Internal high-level languages

• HiveQL

• Pig

Reporting

• Pre-aggregated data for OLAP (see the sketch below)

• RDBMS as the front end

• OLAP and reporting software should support HiveQL
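The pre-aggregation step might look like the following HiveQL sketch; daily_item_stats is a hypothetical target table (assumed to be created beforehand), and the compact roll-up it produces is what the RDBMS front end serves to the OLAP tool:

    -- Daily roll-up: one row per (day, item) instead of millions of raw events
    INSERT OVERWRITE TABLE daily_item_stats
    SELECT to_date(view_ts) AS dt,
           item_id,
           COUNT(*) AS views,
           COUNT(DISTINCT user_id) AS visitors
    FROM pageviews
    GROUP BY to_date(view_ts), item_id;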

Data Integration

• Sqoop (see the example below)

• Parallel data exchange with RDBMSs (MS SQL, MySQL, Oracle, Teradata… )

• Incremental updates

• HDFS, Hive, HBase

• Talend Open Studio
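An incremental Sqoop import, and the matching export back to the RDBMS, could look roughly like this (host, database, table, and column names are hypothetical; the flags are standard Sqoop 1 command-line options):

    # Pull only order rows newer than the last imported order_id, in parallel
    sqoop import \
      --connect jdbc:mysql://dwh-host/shop \
      --table orders \
      --target-dir /data/orders \
      --incremental append \
      --check-column order_id \
      --last-value 1200000 \
      --num-mappers 4

    # Push Hive-side aggregates back into the RDBMS front end
    # (\001 is Hive's default field delimiter)
    sqoop export \
      --connect jdbc:mysql://dwh-host/shop \
      --table daily_item_stats \
      --export-dir /user/hive/warehouse/daily_item_stats \
      --input-fields-terminated-by '\001'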

Hadoop vs RDBMS

• Hadoop will never replace the RDBMS:

  • Latency

  • HiveQL's capabilities are weak compared to SQL

• Only certain tasks with offline processing:

  • Machine learning

  • Queries against big tables

  • …

• Real time: NoSQL

Hadoop myth

Terabytes?

Petabytes?

Big tasks!

Conclusion

• Hadoop is not rocket science

• Intermediate data can be Big Data

Starter kit

• Hadoop management system

• Virtual hardware (cloud, virtual servers, etc.)

• Offline data tasks

• Pig or HiveQL

• Sqoop: import data from existing data sources

Thank you!!!

rzykov@gmail.com

linkedin.com/in/romanzykov
