big data @ orange - dev day 2013 - part 2
Post on 10-May-2015
olivier.varene@orange.com
BigData @ Digital Factory
A short story still being written
Olivier Varene, DSIF/DFY, Orange DevDay 2013
Hadoop
Hadoop - Core
• MapReduce
• HDFS
Hadoop genealogy
Hadoop Time bar
0.2[0-2].x, 0.23.x, 1.x, 2.x
Hadoop Distribution
• Packaging
• Deployment
• Support
Main Distributions
Distribution | Licence | Business Model | Support
Apache | Apache 2.0 | Foundation | community only
HortonWorks | Apache 2.0, HortonWorks (add-on) | PS + Training + support | community + Professional
Cloudera | Apache 2.0, Closed Source (not core) | PS + Licensing + Training + support | community + Professional
MapR | Apache 2.0, Closed Source (FS) | PS + Licensing + support | community + Professional
WanDisco | Apache 2.0, Closed Source (DConE) | PS + Licensing + Training + support | community + Professional
PS: Professional Services
Big Name Distributions
• IBM InfoSphere BigInsights
• GreenPlum (EMC)
• Intel Distribution for Hadoop
• …
Commercial & Closed Source
Big Data Suite
Tooling
• Code generation
• Scheduling
• Integration
Tools (1st level)
Tool | Description | Licence
Apache Pig | Scripting Platform | Apache 2
Apache Hive | Data Access & Query | Apache 2
Apache HCatalog | Metadata Services | Apache 2
Apache HBase | NoSQL Database | Apache 2
Apache ZooKeeper | Cluster Coordination | Apache 2
Apache Tez | Query processing | Apache 2
Apache Oozie | Workflow Scheduler | Apache 2
Apache Sqoop | Data Integration Services | Apache 2
Tools (add-ons)
Tool | Description | Licence
Teradata connector | Connector | Teradata + Distribution
Hive ODBC | ODBC | Distribution
Mahout | Data Mining | Apache 2
Cascading | Fault Tolerant API / Framework | Apache 2
Cassandra Connector | Connector to Cassandra NoSQL | Apache 2
MongoDB Connector | Connector to MongoDB | Apache 2
…
Landscape
@ Digital Factory
DSIF / Digital Factory
Back in Time
• PageRank computation on billions of nodes and tens of billions of edges
• regularly failed! (hardware ...)
• 4 to 8 weeks per computation
• unscalable
• failure rate around 80%
• One person full time to supervise!
- 3 years
Answer?
Internal development: + full control, - long term, - €€
Open source: + €€, + short term, - support, - evolution
Success
In PRODUCTION since 2010!
How does it work?
Hadoop Axioms
• System shall manage and heal itself
• Performance shall scale linearly
• Compute shall move to data
• Modular and extensible
HDFS (Simple)
Self-healing High-Bandwidth Clustered Storage
MapReduce V1 (Simple)
cat <data> | <Mapper> | sort | <Reducer>
Framework: cat <data> | … | sort | …
Your program: … | <Mapper> | … | <Reducer>
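The pipeline analogy above can be sketched in a few lines of Python. This is only an illustration on a single machine: the word-count task and the names `mapper`, `reducer` and `run_pipeline` are assumptions, not something shown on the slides.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """User code: emit one (word, 1) pair per word of the input line."""
    for word in line.split():
        yield word, 1


def reducer(key, values):
    """User code: sum all the counts seen for one key."""
    return key, sum(values)


def run_pipeline(lines):
    """Framework part: feed the mapper, sort by key, group, call the reducer."""
    pairs = [kv for line in lines for kv in mapper(line)]  # cat <data> | <Mapper>
    pairs.sort(key=itemgetter(0))                          # | sort
    return [reducer(key, [v for _, v in group])            # | <Reducer>
            for key, group in groupby(pairs, key=itemgetter(0))]


if __name__ == "__main__":
    print(run_pipeline(["big data", "big cluster"]))
```

Only `mapper` and `reducer` would be "your program"; everything in `run_pipeline` stands in for what the framework does, there across many machines.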
YARN
Allows plugging in new paradigms
MapReduce V1
[Diagram: data on HDFS feeds five Map() tasks (Map phase); their output is sorted and partitioned, then passed to two Reduce() tasks (Reduce phase), each producing a partXX file.]
Before map()
[Diagram: data on HDFS is sliced / partitioned into blocks of data; each block feeds a Map() task that turns (Kin,Vin) pairs into (Kout,Vout) pairs. The JobTracker calculates locality for job assignment and input split data.]
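The locality idea ("compute shall move to data") can be illustrated with a toy scheduler. This is a greatly simplified sketch and not the actual JobTracker algorithm; the function name, the slot model and the data structures are all assumptions made for the example.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: dict block_id -> list of nodes holding a replica
    free_slots: dict node -> number of free map slots
    Returns dict block_id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy
    assignment = {}
    for block_id in sorted(block_locations):
        replicas = block_locations[block_id]
        # First choice: a node that already holds the data.
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fallback: any node with a free slot; the block travels over the network.
            remote = sorted(n for n, free in slots.items() if free > 0)
            if not remote:
                raise RuntimeError("no free map slot left")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment
```

With three blocks whose replicas mostly sit on one busy node, only the first task runs locally and the others fall back to remote reads, which is the situation the locality calculation tries to avoid.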
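Java (API): Mapper
import org.apache.hadoop.mapreduce.Mapper;

class YourMapper<Kin, Vin, Kout, Vout> extends Mapper<Kin, Vin, Kout, Vout> {
    // Optional hooks: setup(Context) runs once before the first record,
    // cleanup(Context) once after the last one.
    @Override
    protected void map(Kin key, Vin value, Context context) {
        // ... your program ...
    }
}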
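The call order of the three hooks can be mimicked in Python. This is a sketch of the lifecycle only, not the Hadoop API: `Context`, `LineLengthMapper` and `run_mapper` are hypothetical names, and the line-length task is an assumption chosen to keep the example short.

```python
class Context:
    """Collects the (key, value) pairs emitted by a task."""

    def __init__(self):
        self.output = []

    def write(self, key, value):
        self.output.append((key, value))


class LineLengthMapper:
    def setup(self, context):
        # Called once, before the first record.
        self.records = 0

    def map(self, key, value, context):
        # Called once per input record.
        self.records += 1
        context.write(len(value), 1)

    def cleanup(self, context):
        # Called once, after the last record.
        context.write("records", self.records)


def run_mapper(mapper, records):
    """Drive a mapper the way the framework does: setup, map per record, cleanup."""
    context = Context()
    mapper.setup(context)
    for key, value in records:
        mapper.map(key, value, context)
    mapper.cleanup(context)
    return context.output
```

Calling `run_mapper(LineLengthMapper(), [(0, "ab"), (3, "abc")])` emits one pair per record plus the summary pair written in `cleanup`.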
Before reduce()
[Diagram: Map() output (Kout,Vout) is sorted in RAM and written to disk as temporary intermediate files, sorted within each file. An OPTIONAL Combine() step runs 1 or more times over these files, again sorting in RAM and writing temporary intermediate files. Partition() then splits the key namespace into parts; JobTracker: distribution to reducers.]
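The combine and partition steps can be sketched as follows. This is a single-process illustration under stated assumptions: the combiner is a sum, and `crc32` stands in for the key hash that a hash-based partitioner would use; the function names are hypothetical.

```python
from collections import defaultdict
from zlib import crc32


def combine(pairs):
    """Optional local pre-aggregation of one mapper's output (here: a sum)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Map a key to a reducer index; the same key always gets the same index."""
    return crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Combine each mapper's output, then route the pairs to reducer buckets."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in combine(pairs):
            buckets[partition(key, num_reducers)].append((key, value))
    # Each reducer sees its pairs sorted by key.
    return [sorted(bucket) for bucket in buckets]
```

The combiner shrinks what has to be written to disk and sent over the network, which is why it appears in the optimization tips; it is only safe when the reduce operation can be applied to partial results, as a sum can.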
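Java (API): Reducer
import org.apache.hadoop.mapreduce.Reducer;

class YourReducer<Kin, Vin, Kout, Vout> extends Reducer<Kin, Vin, Kout, Vout> {
    // Optional hooks: setup(Context) runs once before the first key,
    // cleanup(Context) once after the last one.
    @Override
    protected void reduce(Kin key, Iterable<Vin> values, Context context) {
        // ... your program ...
    }
}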
Optimization tips
• JVM
• Algorithm in MapReduce paradigm
• Combiner
• Sort algorithm
• Partitioning
Streaming
… | <mapper> | … | <reducer> | …
• STDIN
• STDOUT
• Text as input and output by default
• '\t' as default separator
• Use your language: perl, python, shell, ruby, …
• (interpreter needed on all nodes)
hadoop jar $streamingJar -input <inputDir> -output <outputDir> -mapper <mapProg> -reducer <reduceProg> -file <files>
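As an illustration of the text-in, text-out contract, here is a sketch of a word-count mapper and reducer in Python. The word-count task and the function names are assumptions; the functions take any iterable of lines so they can be tried without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: turn raw text lines into 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input lines arrive sorted by key, one 'key<TAB>value' per line.

    Unlike the Java API, streaming does not group values per key, so the
    reducer has to detect key boundaries itself.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__":
    # Local dry run: sorted() plays the role of the framework's sort step.
    for out in reduce_lines(sorted(map_lines(["big data", "big cluster"]))):
        print(out)
```

To use these with streaming, each function would be wrapped in a small script that iterates over `sys.stdin` and prints to standard output, then passed as `<mapProg>` and `<reduceProg>`.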
Pipes - C++
… | <mapper> | … | <reducer> | …
• Socket communication
• Bytes as input and output
• C++ API
hadoop fs -put <binFile> <toHDFS…>
hadoop pipes -input <inputDir> -output <outputDir> -program <path/binfile> [-conf <confFile>]
class MyMap: public HadoopPipes::Mapper { … }
class MyReducer: public HadoopPipes::Reducer { … }
Too difficult
Fortunately, there are tools that can generate code for you or let you run SQL queries!
Tools, Algo / Libs
PIG
Scripting Language:
• Simple
• Parallel execution
• Data oriented
• Extensible via UDF
• Automatic performance enhancement via compiler
set job.name calculateGraphDegres
%default nbpigreducers 10
set default_parallel $nbpigreducers
-- outgoing degrees
A = load '$degout' using PigStorage() as (url:chararray,out_deg:int);
-- keep entries where out_deg > 1
A2 = filter A by (out_deg > 1);
B = order A2 by out_deg DESC;
store B into '$degoutOrdered';
-- distribution of outgoing degrees
C = foreach A generate out_deg,1 as deg_occ;
D = group C by out_deg;
E = foreach D generate FLATTEN(group) as out_deg,SUM(C.deg_occ) as deg_occ;
F = order E by out_deg ASC;
store F into '$degoutDistrib';
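For readers who do not know Pig, the two outputs of the script above correspond roughly to the following in-memory Python. This is a sketch for small data only; the function names are hypothetical and the input is assumed to be a list of `(url, out_deg)` tuples.

```python
from collections import Counter


def order_by_out_degree(rows):
    """Keep rows with out_deg > 1, sorted by out_deg descending (relations A2, B)."""
    kept = [(url, deg) for url, deg in rows if deg > 1]
    return sorted(kept, key=lambda row: row[1], reverse=True)


def out_degree_distribution(rows):
    """Count how many URLs have each out-degree, ascending (relations C to F)."""
    counts = Counter(deg for _, deg in rows)
    return sorted(counts.items())
```

Pig compiles the same logic into MapReduce jobs, so the filter, sort and group steps run in parallel over data that would not fit in one machine's memory.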
Hive
Querying Language:
• HiveQL (SQL-like)
• ETL Tool
• HDFS, HBase, Thrift …
• MapReduce interface (with streaming to python …)
• Extensible via UDF
CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 'b-file/input/';

CREATE EXTERNAL TABLE b_packet_out (overall bigint, tcp bigint, udp bigint, icmp bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 'b-file/output/1/';

INSERT INTO TABLE b_packet_out
SELECT count(*) AS overall,
  sum(if(protocol LIKE 'ip:tcp%', 1, 0)) AS tcp,
  sum(if(protocol LIKE 'ip:udp%', 1, 0)) AS udp,
  sum(if(protocol LIKE 'ip:icmp%', 1, 0)) AS icmp
FROM b_packet;
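The aggregation expressed by the query can be sketched in plain Python over the same pipe-delimited input. The function name and the sample rows in the usage note are assumptions; skipping malformed rows is also an assumption of this sketch, whereas Hive would keep such rows and fill the missing columns with NULL.

```python
def protocol_counts(lines, sep="|"):
    """Count all packets and those whose protocol starts with ip:tcp, ip:udp or ip:icmp."""
    counts = {"overall": 0, "tcp": 0, "udp": 0, "icmp": 0}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) != 3:
            continue  # assumption: ignore rows that do not have exactly 3 fields
        protocol = fields[2]
        counts["overall"] += 1
        for name in ("tcp", "udp", "icmp"):
            if protocol.startswith("ip:" + name):
                counts[name] += 1
    return counts
```

For example, `protocol_counts(["t1|60|ip:tcp:http", "t2|80|ip:udp:dns"])` counts two packets overall, one TCP and one UDP.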
R
RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
• rmr: functions providing MapReduce in R
• rhdfs: functions providing HDFS operations in R
• rhbase: functions providing HBase operations in R
library(rmr2)
library(rhdfs)
gdp <- read.csv("GDP_converted.csv")
hdfs.init()
gdp.values <- to.dfs(gdp)
# aaplRevenue is not defined on the slide; it must be set before running
gdp.map.fn <- function(k, v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}
count.reduce.fn <- function(k, v) {
  keyval(k, length(v))
}
count <- mapreduce(input = gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
from.dfs(count)
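The logic of the R job, stripped of the Hadoop plumbing, is a threshold count. A minimal Python sketch follows; the function name is hypothetical and `threshold` plays the role of `aaplRevenue`, whose value is not given on the slide.

```python
from collections import Counter


def count_by_threshold(values, threshold):
    """Label each value 'less' or 'greater' (map step), then count per label (reduce step)."""
    keys = ("less" if value < threshold else "greater" for value in values)
    return dict(Counter(keys))
```

As in the R version, a value equal to the threshold is counted as "greater".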
GUI Tools
PoC
• Time saver
• Prototyping
• Visualize complex processes
• Fast changes
But you need to know the internals for optimization
SQL
[Diagram: SQL engines (Phoenix, Hive, Tajo, Impala, Presto) over HBase and HDFS, with their interfaces and dialects: ODBC/JDBC, HiveQL, JDBC, SQL, HQL, ISO, PSQL.]
Prod / Beta & Alpha products
Sqoop
Transfers data between HDFS and structured storage, in both directions, via JDBC connectors: PostgreSQL, MySQL, Oracle, Teradata, …
[Diagram: RDBMS / NoSQL, Sqoop import, Hadoop process, Sqoop export.]
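Conceptually, an import reads rows through a database connection and writes them out as delimited text. The sketch below shows that idea with Python's built-in `sqlite3` module; it is not how Sqoop is implemented or invoked, and the function names, the `customers` table and its rows are made up for the example.

```python
import sqlite3


def import_table_as_text(connection, table, sep=","):
    """Read every row of `table` and render it as delimited text lines."""
    # The table name is interpolated directly: acceptable for a sketch only.
    cursor = connection.execute(f"SELECT * FROM {table} ORDER BY 1")
    return [sep.join(str(col) for col in row) for row in cursor]


def demo():
    """Build a throwaway in-memory table and import it."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    connection.executemany("INSERT INTO customers VALUES (?, ?)",
                           [(1, "alice"), (2, "bob")])
    return import_table_as_text(connection, "customers")
```

An export is the reverse direction: parse delimited lines and insert them into a table.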
Oozie
Nowadays @ Digital Factory?
In Production
• Since 2010
• Growth driven by internal project needs
• Recycled servers (€€ savings)
• We learned as we went:
  * tar -> cdh3 -> cdh4 …
  * optimizations
  * run processes …
Production « PFS »
• Shared among different teams
• xx nodes on COTS
• xxx TBytes
• >xxx jobs per day
• Monitoring: Xymon
• Graphing via NetStat (SNMP / RRD: x's oids/second)
• Automatic Configuration
Architecture
[Diagram of the platform stack. Components: HDFS, MapReduce, ZooKeeper, HIVE, HIVE Server, PIG, Mahout, Oozie, Khiops, Sqoop, R, Real Time Query Engine, Web Service, Flume, App Services, HCatalog, HBase, Cassandra, Cascading. Some components are marked "in POC".]
Benefits
• Infrastructure cost
• Development cost
• Robustness
• Scalability
• New development areas (Graph Mining, Logs statistics …)
€: -70% lines of code, -50% dev time, -75% run cost
A few of our use cases
Graph algorithms for http://www.lemoteur.fr/
Scoring - Search Engine
• xx TB compressed
• xx billion nodes
• >xxx billion edges
• xxx TB in RAM
xRank
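To give an idea of why graph scoring fits MapReduce, here is one iteration of textbook PageRank written as a map phase and a reduce phase. This is the standard algorithm, not the team's xRank; the function name and the damping value of 0.85 are assumptions, and nodes without out-links simply lose their rank in this simplified version.

```python
from collections import defaultdict


def pagerank_step(graph, ranks, damping=0.85):
    """One PageRank iteration.

    graph: dict node -> list of out-neighbours (every node must be a key)
    ranks: dict node -> current rank
    """
    # Map: each node sends an equal share of its rank to each out-neighbour.
    contributions = defaultdict(float)
    for node, neighbours in graph.items():
        for target in neighbours:
            contributions[target] += ranks[node] / len(neighbours)
    # Reduce: sum the shares received by each node and apply damping.
    n = len(graph)
    return {node: (1 - damping) / n + damping * contributions[node]
            for node in graph}
```

The step is repeated until the ranks stop changing; on a cluster, each repetition is one MapReduce job over the whole edge list.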
Customers' statistical behaviors, ads display optimization, …
Profiling
xxx GB daily + xxx GB monthly (customer DB)
Customer profile
Log Analysis
xx billion events daily
OJD certified measurements: Internet and Mobile, customers' journey analysis, …
KPIs
with NoSQL
Hadoop over Cassandra
(next session)
Benefits & Drawbacks
Benefits: Scalable, Stable, RUN cost, Development cost, Performance, Very fast evolution, New dev areas
Drawbacks: Learning curve, Debug, Algorithms, Complex, Very fast evolution
Future
• Enhance security and robustness
• Create Services & Functional Catalog
• Continue building our expertise: Fast Data, Cascading, MR2, …
• A thousand-node cluster!
• Help other teams go to production
CONTACT US: olivier.varene@orange.com
Thank you / Merci
Olivier Varene, DSIF/DFY, Orange DevDay 2013
My thanks to
• Apache http://www.apache.org/
• http://hadooper.blogspot.com/
• Cloudera http://www.cloudera.com/
• HortonWorks http://www.hortonworks.com/
• And all the community!