Leveraging Hadoop to mine customer insights in a developing market
DESCRIPTION
I was a speaker at the Big Data World conference in London on the 18th September 2012. http://www.terrapinn.com/2012/big-data-world-europe/ See the full text of the speech at http://webkpis.com/2012/11/hadoop-implementation-in-wikimart/
• Incorporating Hadoop technology within your infrastructure to cut costs and increase the scale of your operations
• Understanding how Hadoop can provide insightful data analysis to the end user
• Combining Hadoop with existing enterprise systems to deepen your insight and discover previously hidden trends
• Will Hadoop replace the need for relational data warehousing systems?
TRANSCRIPT
Leveraging Hadoop in Wikimart Roman Zykov
Head of analytics http://wikimart.ru
London, Big Data World Europe, 20th September 2012
Key problem
To be or not to be….
Hadoop
Introduction
Key tasks for Wikimart
What
• BI tasks
• Web analytics (in-house solution)
• Recommendations on site
• Data services for marketing
Who
• Core analytics team
• Analytics members in other departments
• IT site operations
Problem
Too time consuming or too expensive?
• Data volume
• # of data services
Map Reduce
[Diagram: DATA, Standalone vs. Map Reduce]
Our idea
New platform for “Big Data” tasks only
• Start research on Map Reduce software
• First patient - recommendation engine
Difficulties
- no planned budget -> Hadoop is free
- no experts -> learn it
- no hardware -> virtual cluster
Requirements for Hadoop
• Easily scalable
• Easy deployment
• Easy integration
• Less low-level Java coding
• SQL-like queries
Data flow
Data feeds DWH
Accomplishments
Recommendations
• Collaborative filtering (item-to-item on browsing history, Pig)
• Similar products (item attributes, Pig)
• Most popular items (browsing history + orders, HiveQL)
• Internal and external search recommendations (HiveQL)
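The item-to-item approach on browsing history listed above can be pictured with a small Python sketch. This is not the Pig job from the talk, only an illustration under stated assumptions: the sessions, product ids, and the functions `co_view_counts` and `recommend` are hypothetical, and plain co-view counting stands in for whatever similarity measure was actually used.

```python
from collections import defaultdict
from itertools import combinations


def co_view_counts(sessions):
    """Count how often each pair of items appears in the same session."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in sessions:
        # Deduplicate within a session so repeated views count once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, item, top_n=3):
    """Return up to top_n items most often co-viewed with `item`."""
    neighbours = counts.get(item, {})
    # Sort by descending count, then by item id for a stable order.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical browsing sessions: each list holds product ids viewed together.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["laptop", "bag"],
    ["phone", "charger", "bag"],
]
counts = co_view_counts(sessions)
print(recommend(counts, "phone"))
```

On a cluster the pair counting is the part that would be distributed; the ranking step per item stays the same.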
Some statistics after 1 year
• >10% of revenue
• 3 months to launch
• Tens of gigabytes are processed daily in 2 hours
• Only 1 crash (cluster lost power)
Decision: invest in a hardware cluster
End user
Internal high-level languages
• HiveQL
• Pig
Reporting
• Pre-aggregated data for OLAP
• RDBMS - front end
• OLAP and reporting software should support HiveQL
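Pre-aggregation for OLAP, as mentioned in the reporting bullets, amounts to collapsing raw events into one row per reporting key before loading them into the RDBMS front end. A minimal Python sketch, assuming a hypothetical event layout of (date, category, revenue) and a hypothetical helper `pre_aggregate`; in the setup described in the talk this step would more likely be a HiveQL GROUP BY:

```python
from collections import Counter

# Hypothetical raw events: (date, category, revenue).
events = [
    ("2012-09-01", "electronics", 120.0),
    ("2012-09-01", "electronics", 80.0),
    ("2012-09-01", "toys", 15.0),
    ("2012-09-02", "toys", 30.0),
]


def pre_aggregate(rows):
    """Collapse raw events into one row per (date, category)."""
    revenue = Counter()
    orders = Counter()
    for date, category, amount in rows:
        revenue[(date, category)] += amount
        orders[(date, category)] += 1
    # One summary row per key: (date, category, order count, total revenue).
    return [(d, c, orders[(d, c)], revenue[(d, c)]) for d, c in sorted(revenue)]


summary = pre_aggregate(events)
for row in summary:
    print(row)
```

The summary table is small enough to serve from a relational front end, which is why the heavy scan can stay offline.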
Data Integration
• Sqoop
• Parallel data exchange with RDBMS (MS SQL, MySQL, Oracle, Teradata…)
• Incremental updates
• HDFS, Hive, HBase
• Talend Open Studio
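The incremental updates listed above follow a high-water-mark pattern: each run transfers only rows whose check column exceeds the last value seen. Sqoop exposes this through `--incremental append` together with `--check-column` and `--last-value`. The Python below is not Sqoop code, only a sketch of the idea; the `orders` table and the function `incremental_batch` are hypothetical.

```python
def incremental_batch(rows, check_column, last_value):
    """Return rows newer than last_value and the new high-water mark."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    # If nothing is new, the high-water mark stays where it was.
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last


# Hypothetical source table with a monotonically increasing id.
orders = [
    {"id": 1, "total": 10},
    {"id": 2, "total": 25},
    {"id": 3, "total": 7},
]

# First run: everything after id 1 is transferred.
batch, last = incremental_batch(orders, "id", 1)
print(len(batch), last)

# Second run with no new rows: nothing is transferred, the mark is unchanged.
batch2, last2 = incremental_batch(orders, "id", last)
print(len(batch2), last2)
```

This pattern only picks up appended rows; rows updated in place need a last-modified timestamp as the check column instead.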
Hadoop vs RDBMS
• Will never replace RDBMS:
• Latency
• Weak capabilities of HiveQL vs SQL
• Only some tasks with offline processing:
• Machine learning
• Queries on big tables
• ….
• Real time: NoSQL
Hadoop myth
Terabytes?
Petabytes?
Big tasks!
Conclusion
• Hadoop is not rocket science
• Intermediate data can be Big Data
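One way to read the point about intermediate data, offered here as an interpretation rather than something the slides spell out: an item-to-item job emits a record for every pair of items seen together, so the intermediate output grows roughly quadratically with session length even when the input is modest. A short Python sketch with hypothetical session sizes:

```python
def pair_count(n_items):
    """Number of unordered item pairs in a session with n_items distinct items."""
    return n_items * (n_items - 1) // 2


# Hypothetical session lengths (distinct items viewed per session).
session_sizes = [5, 20, 100]
for n in session_sizes:
    print(n, pair_count(n))
```

A session of 100 items yields 4950 pairs, so a few long sessions can dominate the intermediate volume.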
Starter kit
• Hadoop management system
• Virtual hardware (cloud, virtual servers, etc.)
• Offline data tasks
• Pig or HiveQL
• Sqoop: import data from existing data sources