Hadoop at Orange - SophiaConf2012
DESCRIPTION
Presentation of Hadoop at Orange, given during SophiaConf2012.
TRANSCRIPT
![Page 1: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/1.jpg)
SophiaConf2012 [email protected]
Distributed Computing with Hadoop MapReduce inside the Orange Search Engine
Tuesday 10 July 2012
![Page 3: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/3.jpg)
$5 billion (2012) to $50 billion (by 2017)
Forbes
![Page 4: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/4.jpg)
«Big Data is the new definitive source of competitive advantage across all industries»
Jeff Kelly
![Page 5: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/5.jpg)
Product success
«The days are over when you build a product once and it just works.
You have to take ideas, test them, iterate them, use data and analytics to understand what works and what doesn’t in order to be successful»
LivingSocial @PCWorld 2012
![Page 8: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/8.jpg)
Beliefs
« We believe that by 2015, more than half the world’s data will be processed by Apache Hadoop »
HortonWorks @HadoopSummit 2012
![Page 12: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/12.jpg)
Orange Search Engine
http://www.orange.fr
http://www.lemoteur.fr
http://www.voila.fr
![Page 13: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/13.jpg)
Search Engine Architecture
Internet
Web collection 24/7 (crawl)
Indexing
Score pre-computation
Search infrastructure
750 million French documents
5 billion documents
PageRank
![Page 14: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/14.jpg)
Main issue
• PageRank computation on billions of nodes and tens of billions of edges
• failed regularly! (hardware ...)
• 4 to 8 weeks of computation
• not scalable
• failure rate around 80%
• one person full time to supervise
![Page 16: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/16.jpg)
PageRank portable to Hadoop / MapReduce?
• Simple programming model:
Map(in_k, in_v) => list(out_k, intermed_v)
Reduce(out_k, list(intermed_v)) => list(out_v)
• Scalable
• Batch processing
• YES!
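To make the Map/Reduce signatures above concrete, here is a minimal in-memory sketch of one PageRank iteration written as a map step and a reduce step. It is an illustration under stated assumptions, not the Orange implementation: the damping factor of 0.85, the toy three-page graph, and the function names are all made up for the example, and pages without out-links are not handled.

```python
from collections import defaultdict

DAMPING = 0.85  # assumption: the slides do not give a damping factor


def pagerank_map(node, value):
    """Map: pass the link list through, then emit a rank share per out-link."""
    rank, out_links = value
    yield node, ("links", out_links)
    for target in out_links:
        yield target, ("share", rank / len(out_links))


def pagerank_reduce(node, values, n_nodes):
    """Reduce: sum the incoming shares and apply the damping formula."""
    out_links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            total += payload
    rank = (1 - DAMPING) / n_nodes + DAMPING * total
    return node, (rank, out_links)


def run_iteration(graph):
    """Simulate one MapReduce pass: map, group by key, reduce."""
    grouped = defaultdict(list)
    for node, value in graph.items():
        for key, item in pagerank_map(node, value):
            grouped[key].append(item)
    return dict(pagerank_reduce(k, v, len(graph)) for k, v in grouped.items())


# Toy graph (made up): node -> (initial rank, out-links).
graph = {
    "a": (1 / 3, ["b", "c"]),
    "b": (1 / 3, ["c"]),
    "c": (1 / 3, ["a"]),
}
for _ in range(30):
    graph = run_iteration(graph)
ranks = {node: rank for node, (rank, _) in graph.items()}
```

Each iteration is an independent batch job whose output feeds the next one, which is why the computation fits the batch-processing model listed above.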
![Page 17: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/17.jpg)
Hadoop Axioms
• System shall manage and heal itself
• Performance shall scale linearly
• Compute shall move to data
• Modular and extensible
![Page 19: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/19.jpg)
Our install
HIVE PIG
MapReduce
HDFS
ZooKeeper
Mahout
Oozie
Khiops
![Page 21: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/21.jpg)
Hadoop - HDFS
[Diagram: a Client, a Master, and five Data nodes. Labels: Read/Write, Read, replication, block operations]
COTS - replication - big blocks maximize throughput - metadata in RAM
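The ideas on this slide (large blocks, replicated across data nodes, with the block map held in the master's memory) can be sketched in a few lines. This is a toy model, not HDFS code: the 64 MB block size, the replication factor of 3, the node names, and the round-robin placement are assumptions chosen for the illustration, and real HDFS placement is more elaborate.

```python
import math
from itertools import cycle

BLOCK_SIZE = 64 * 1024 * 1024  # assumption: 64 MB blocks
REPLICATION = 3                # assumption: three copies of each block


def place_blocks(file_size, data_nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into blocks and assign each block to `replication` nodes.

    Naive round-robin placement, for illustration only.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    n_blocks = math.ceil(file_size / block_size)
    ring = cycle(data_nodes)
    # This dict plays the role of the master's in-memory block map.
    return {block: [next(ring) for _ in range(replication)] for block in range(n_blocks)}


def readable_after_failure(block_map, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(any(n != failed_node for n in nodes) for nodes in block_map.values())


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]         # hypothetical node names
block_map = place_blocks(200 * 1024 * 1024, nodes)  # a 200 MB file gives 4 blocks
```

With three replicas per block, losing any single data node in this model leaves every block readable, which is what makes commodity hardware acceptable.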
![Page 23: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/23.jpg)
MapReduce
cat | «your map» | sort -u | «your reduce»
Programming paradigm
![Page 24: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/24.jpg)
MapReduce
cat | «your map» | sort -u | «your reduce»
Framework
![Page 25: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/25.jpg)
MapReduce
cat | «your map» | sort -u | «your reduce»
your Job
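The pipeline analogy can be mimicked in memory to show which parts belong to the framework and which parts are your job. The sketch below is a simplification under assumptions: `run_job`, `word_mapper`, and `count_reducer` are hypothetical names, the input lines are made up, and the sort step keeps duplicate pairs (a plain sort rather than `sort -u`) because a counting reducer needs every pair.

```python
from itertools import groupby
from operator import itemgetter


def run_job(lines, mapper, reducer):
    """In-memory stand-in for: cat | «your map» | sort | «your reduce»."""
    mapped = [pair for line in lines for pair in mapper(line)]  # your map
    mapped.sort(key=itemgetter(0))                              # framework: sort by key
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):       # framework: group by key
        results.extend(reducer(key, [value for _, value in group]))  # your reduce
    return results


def word_mapper(line):
    """Emit (word, 1) for every word of the line."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


lines = ["Hadoop at Orange", "hadoop MapReduce", "Orange search"]  # made-up input
word_counts = run_job(lines, word_mapper, count_reducer)
```

Only the two small functions at the bottom are specific to the job; reading, sorting, and grouping are handled by `run_job`, which stands in for the framework.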
![Page 27: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/27.jpg)
Interfaces
• Java API
• Pipes
• Streaming (python, perl, C/C++, ...)
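With the Streaming interface, the mapper and the reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output. The sketch below shows that line protocol with plain Python generators so it can be run without a cluster. The log format (status code as the last field) and the function names are assumptions for the example.

```python
def map_lines(lines):
    """Mapper: emit 'status<TAB>1' for each log line (status is the last field)."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[-1]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"


log = ["GET /a 200", "GET /b 404", "GET /c 200"]  # made-up log lines
mapped = sorted(map_lines(log))                   # stands in for the framework's sort
status_counts = list(reduce_lines(mapped))
```

In a real Streaming job each function would live in its own script, looping over `sys.stdin` and printing its output lines.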
![Page 28: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/28.jpg)
PIG
• High-level data analysis scripting language
• Extensible via UDF (user-defined functions)
• Structure of a Pig script:
  • load
  • filter
  • foreach | group by | join | your functions
  • order
  • store
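The load, filter, group, order, store sequence is a dataflow, and its shape can be shown in plain Python. This is only an analogy written in Python rather than Pig Latin: the CSV data, the column names, and `run_script` are made up for the illustration.

```python
import csv
import io
from collections import Counter

# Made-up input: one search query per row, one row with an empty query.
RAW = "user,query\nu1,hadoop\nu2,orange\nu1,pig\nu3,\nu2,hadoop\n"


def run_script(raw_text):
    """Mimic the five stages of the script structure on an in-memory CSV."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))                # load
    rows = [row for row in rows if row["query"]]                      # filter
    counts = Counter(row["query"] for row in rows)                    # group by + count
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))  # order
    out = io.StringIO()                                               # store
    csv.writer(out, lineterminator="\n").writerows(ordered)
    return out.getvalue()


report = run_script(RAW)
```

Each stage consumes the previous stage's output, which is what lets a tool like Pig translate such a script into a chain of MapReduce jobs.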
![Page 29: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/29.jpg)
HIVE
• High-level SQL-like query and analysis language
• Extensible via UDF (user-defined functions)
• Structure of a Hive script:
  • create table
  • load data
  • select ... from ...
  • insert | group by | join
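The same four steps can be tried locally with any SQL engine. The sketch below uses Python's built-in `sqlite3` module as a stand-in, so it is SQLite SQL and not HiveQL: the table names and rows are made up, and rows are inserted directly because SQLite has no bulk "load data" statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# create table
conn.execute("CREATE TABLE searches (site TEXT, query TEXT)")
conn.execute("CREATE TABLE hits_per_site (site TEXT, hits INTEGER)")

# load data (made-up rows, inserted directly in this stand-in)
conn.executemany(
    "INSERT INTO searches VALUES (?, ?)",
    [("orange.fr", "hadoop"), ("voila.fr", "pig"), ("orange.fr", "hive")],
)

# insert ... select ... from ... group by
conn.execute(
    "INSERT INTO hits_per_site "
    "SELECT site, COUNT(*) FROM searches GROUP BY site"
)

summary = conn.execute(
    "SELECT site, hits FROM hits_per_site ORDER BY site"
).fetchall()
conn.close()
```

The point of the comparison is the script shape: declare tables, bring data in, then express the analysis as queries instead of hand-written map and reduce functions.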
![Page 31: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/31.jpg)
Projects
• Scoring
• User profiling
• Log analysis and statistics
• ... and many others to come
![Page 33: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/33.jpg)
ROI
• Lines Of Code
• Development Time
• IT cost
Fewer bugs, automatic, scalable ...
10X gain
2X gain
4X gain
![Page 34: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/34.jpg)
Perfect World?
★ YES
• Run cost
• Development cost
• Scalable
• Stable
• Heterogeneity
★ NO
• SPOF (almost solved)
• Tedious debugging
• Locally non-optimal
• Single-site
![Page 36: Hadoop at Orange - SophiaConf2012](https://reader037.vdocuments.us/reader037/viewer/2022110115/54c1d1404a795920178b4580/html5/thumbnails/36.jpg)
Thanks to
• Hadoop - Apache (http://hadoop.apache.org/)
• Khiops - Orange (http://www.khiops.com)
• Shaun Connolly - HortonWorks (http://youtu.be/yPfysFAGv8s)
• Forbes - article (http://www.forbes.com/sites/siliconangle/2012/02/17/big-data-is-big-market-big-business/)
• LivingSocial (quote)
• Teradata (volumetry graph)
• http://www.wordle.net/ (word cloud)