big data @ orange - dev day 2013 - part 2

olivier.varene@orange.com

BigData @ Digital Factory

A short story still being written

Olivier Varene, DSIF/DFY, Orange DevDay 2013

Hadoop

Hadoop - Core
•  MapReduce
•  HDFS

Hadoop Time bar
0.2[0-2].x, 0.23.x, 1.x

Hadoop Distribution
•  Packaging
•  Deployment
•  Support

Main Distributions (Licence / Business Model / Support)

•  Apache: Apache 2.0 / Foundation / community only
•  HortonWorks: Apache 2.0, HortonWorks (add-on) / PS + Training + support / community + Professional
•  Cloudera: Apache 2.0, Closed Source (not core) / PS + Licensing + Training + support
•  MapR: Apache 2.0, Closed Source (FS) / PS + Licensing + support
•  WanDisco: Apache 2.0, Closed Source (DConE) / PS + Licensing + Training + support

PS: Professional Services

Big Name Distributions

•  IBM InfoSphere BigInsights
•  GreenPlum (EMC)
•  Intel Distribution for Hadoop
•  …

Paid & Closed Source

Big Data Suite

Tooling
•  Code generation
•  Scheduling
•  Integration

Tools (1st level)

Tool: Description (Licence)
•  Apache Pig: Scripting Platform (Apache 2)
•  Apache Hive: Data Access & Query (Apache 2)
•  Apache HCatalog: Metadata Services (Apache 2)
•  Apache HBase: NoSQL Database (Apache 2)
•  Apache ZooKeeper: Cluster Coordination (Apache 2)
•  Apache Tez: Query processing (Apache 2)
•  Apache Oozie: Workflow Scheduler (Apache 2)
•  Apache Sqoop: Data Integration Services (Apache 2)

Tools (add-ons)

Tool: Description (Licence)
•  Teradata connector: Connector (Teradata + Distribution)
•  Hive ODBC: ODBC (Distribution)
•  Mahout: Data Mining (Apache 2)
•  Cascading: Fault Tolerant API / Framework (Apache 2)
•  Cassandra Connector: Connector to Cassandra NoSQL (Apache 2)
•  MongoDB Connector: Connector to MongoDB (Apache 2)

Landscape

@ Digital Factory
DSIF / Digital Factory

Back in Time (3 years ago)

•  PageRank computation on billions of nodes and tens of billions of edges
•  Regularly failed (hardware ...)
•  4 to 8 weeks of computation
•  Unscalable
•  Failure rate around 80%
•  One person full time to supervise

Answer?

Internal Development
+ full control
- long term
- €€

OpenSource
+ €€
+ short term
- support
- evolution

Success

In PRODUCTION since 2010

How does it work?

Hadoop Axioms
•  System shall manage and heal itself
•  Performance shall scale linearly
•  Compute shall move to data
•  Modular and extensible

HDFS (Simple)
Self-healing High-Bandwidth Clustered Storage

MapReduce V1 (Simple)

cat <data> | <Mapper> | sort | <Reducer>

Framework:    cat <data> | …………... | sort | …………….
Your program: ……………. <Mapper> ..…… <Reducer>
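The pipeline above can be mimicked in a single process to see which parts belong to the framework and which to your program. This is only a minimal sketch: `run_job`, `word_mapper` and `sum_reducer` are hypothetical names chosen for illustration, not part of Hadoop, and a real job distributes each stage across the cluster.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, mapper, reducer):
    """Mimic `cat <data> | <Mapper> | sort | <Reducer>` in one process."""
    # Map: each input record may emit any number of (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Sort: the framework orders pairs by key so equal keys become adjacent.
    mapped.sort(key=itemgetter(0))
    # Reduce: called once per distinct key with all of that key's values.
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def word_mapper(line):
    """Your program, map side: emit (word, 1) for every word of a line."""
    for word in line.split():
        yield word, 1


def sum_reducer(key, values):
    """Your program, reduce side: add up the counts of one word."""
    yield key, sum(values)


if __name__ == "__main__":
    print(run_job(["a b a", "b c"], word_mapper, sum_reducer))
```

Only the two small functions at the bottom are "your program"; everything inside `run_job` stands in for what the framework does.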

YARN
Allows plugging in new paradigms

MapReduce V1

[Figure: Data on HDFS → Map → Sort / Partition → Reduce() → partXX files (Data on HDFS)]

Before map()

[Figure: Data on HDFS is sliced / partitioned into blocks of data; the JobTracker calculates locality for job assignment and input split data; each map() turns (Kin,Vin) into (Kout,Vout)]
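To make "the JobTracker calculates locality for job assignment" more concrete, here is a toy sketch of locality-aware assignment. It is an assumption-laden illustration, not the JobTracker's actual algorithm: the function `assign_splits`, the slot model and the node names are all hypothetical.

```python
def assign_splits(split_locations, free_slots):
    """Greedy toy scheduler: prefer a node that already stores the split.

    split_locations: {split_id: [nodes holding a replica of that block]}
    free_slots: {node: number of map slots still available}
    Returns {split_id: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict intact
    assignment = {}
    for split_id, replicas in split_locations.items():
        local = [node for node in replicas if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has a free slot: fall back to any free node,
            # at the cost of reading the block over the network.
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free map slot left")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[split_id] = (node, is_local)
    return assignment
```

The point of the sketch is the preference order: a map task goes to a node holding the data when possible ("compute shall move to data"), and only otherwise to a remote node.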

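Java (API): Mapper

class YourMapper extends Mapper<Kin, Vin, Kout, Vout> {
    // Optional hooks, called once per task before and after the map() calls.
    [void setup(Context context);]
    [void cleanup(Context context);]

    // Called once for every (Kin, Vin) input pair.
    void map(Kin key, Vin value, Context context) {
        // ... your program ...
    }
}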

Before reduce()

[Figure: what happens between Map() and Reduce()]
•  Map(): (Kout,Vout) output is sorted in RAM, then written to disk as temporary intermediate files (sorted in each file)
•  Combine() (OPTIONAL, 1 or more times): (Kout,Vout) is sorted in RAM and written to disk as temporary intermediate files
•  Partition(): key namespace partitioning into parts
•  JobTracker: distribution to reducers
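The combine and partition steps can be sketched as follows. This is a simplified, in-memory illustration: the function names are hypothetical, `crc32` merely stands in for a stable hash of the key (an assumption for the sketch), and the combiner shown is only valid for operations such as sums, where partial results can be merged in any order.

```python
from collections import defaultdict
from zlib import crc32


def combine(pairs):
    """Optional combiner: pre-aggregate map output locally to shrink the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Assign a key to a reducer; a stable hash keeps equal keys together."""
    return crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route combined pairs from every map task to per-reducer buckets."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in combine(pairs):
            buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order, with all values grouped.
    return [sorted(bucket.items()) for bucket in buckets]
```

Because the partition depends only on the key, every value for a given key ends up at the same reducer, whichever map task produced it.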

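Java (API): Reducer

class YourReducer extends Reducer<Kin, Vin, Kout, Vout> {
    [void setup(Context context);]
    [void cleanup(Context context);]

    // Called once per key, with every value emitted for that key.
    void reduce(Kin key, Iterable<Vin> values, Context context) {
        // ... your program ...
    }
}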

Optimization tips
•  JVM
•  Algorithm in MapReduce paradigm
•  Combiner
•  Sort algorithm
•  Partitioning

Streaming

… | <mapper> | … | <reducer> | …

•  STDIN
•  STDOUT
•  Text as input and output by default
•  ‘\t’ as default separator
•  Use your language: perl, python, shell, ruby, …
•  (interpreter needed on all nodes)

hadoop jar $streamingJar -input <inputDir> -output <outputDir> \
    -mapper <mapProg> -reducer <reduceProg> -file <files>
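As an illustration of the text protocol above (lines on STDIN, tab-separated key and value on STDOUT), here is a minimal sketch of a mapper and reducer in one Python file. The input format (the first token of each line is the key, for instance a log level) and the `map` / `reduce` command-line argument are assumptions made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'first_token<TAB>1' for each non-empty input line."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive lines sharing the same key.

    The reducer receives the mapper output sorted by key, so one pass suffices.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: "script.py map" as mapper, "script.py reduce" as reducer.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The same file can be checked locally before going to the cluster, with a plain shell pipeline of the form cat, map step, sort, reduce step.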

Pipes - C++

… | <mapper> | … | <reducer> | …

•  Socket communication
•  Bytes as input and output
•  C++ API

hadoop fs -put <binFile> <toHDFS…>
hadoop pipes -input <inputDir> -output <outputDir> \
    -program <path/binfile> [-conf <confFile>]

class MyMap: public HadoopPipes::Mapper { … }

class MyReducer: public HadoopPipes::Reducer { … }

Too difficult?

Fortunately, there are tools that can generate code for you or let you run SQL queries!

Tools
Algo / Libs

Pig

Scripting Language:
•  Simple
•  Parallel execution
•  Data oriented
•  Extensible via UDF
•  Automatic performance enhancement via compiler

set job.name calculateGraphDegres

%default nbpigreducers 10
set default_parallel $nbpigreducers

-- out-degrees
A = load '$degout' using PigStorage() as (url:chararray,out_deg:int);
-- keep entries where out_deg > 1
A2 = filter A by (out_deg > 1);
B = order A2 by out_deg DESC;
store B into '$degoutOrdered';

-- distribution of out-degrees
C = foreach A generate out_deg,1 as deg_occ;
D = group C by out_deg;
E = foreach D generate FLATTEN(group) as out_deg,SUM(C.deg_occ) as deg_occ;
F = order E by out_deg ASC;
store F into '$degoutDistrib';
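For readers unfamiliar with Pig Latin, the two results stored by this script can be sketched in plain Python on a small in-memory list. The function names are hypothetical, and the sketch ignores everything Pig handles for you (loading from HDFS, parallelism, the number of reducers).

```python
from collections import Counter


def ordered_out_degrees(rows, min_exclusive=1):
    """Relations A2 and B: keep rows with out_deg > 1, sorted by out_deg descending."""
    kept = [(url, deg) for url, deg in rows if deg > min_exclusive]
    return sorted(kept, key=lambda row: row[1], reverse=True)


def out_degree_distribution(rows):
    """Relations C to F: how many URLs have each out-degree, ascending by degree."""
    return sorted(Counter(deg for _, deg in rows).items())
```

As in the script, the distribution is computed on all rows, not only on the filtered ones.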

Hive

Querying Language:
•  HiveQL (SQL-like)
•  ETL Tool
•  HDFS, HBase, Thrift …
•  MapReduce interface (with streaming to python …)
•  Extensible via UDF

CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 'b-file/input/';

-- one column per aggregate produced by the SELECT below
CREATE EXTERNAL TABLE b_packet_out (overall bigint, tcp bigint, udp bigint, icmp bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 'b-file/output/1/';

INSERT INTO TABLE b_packet_out
SELECT count(*) AS overall,
       sum(if(protocol like 'ip:tcp', 1, 0)) AS tcp,
       sum(if(protocol like 'ip:udp', 1, 0)) AS udp,
       sum(if(protocol like 'ip:icmp', 1, 0)) AS icmp
FROM b_packet;

R

RHadoop:
https://github.com/RevolutionAnalytics/RHadoop/wiki

•  rmr: functions providing mapreduce in R
•  rhdfs: functions providing hdfs operations in R
•  rhbase: functions providing hbase operations in R

library(rmr2)
library(rhdfs)

gdp <- read.csv("GDP_converted.csv")
hdfs.init()
gdp.values <- to.dfs(gdp)

# aaplRevenue: threshold assumed to be defined beforehand
gdp.map.fn <- function(k,v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}

count.reduce.fn <- function(k,v) {
  keyval(k, length(v))
}

count <- mapreduce(input=gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
from.dfs(count)

Tools

•  Time saver
•  Prototyping
•  Visualize complex processes
•  Fast changes

But you still need to know the internals to optimize

[Figure: SQL access on Hadoop. Engines: HBase, Phoenix, Hive, Tajo, Impala, Presto. Interfaces: ODBC/JDBC, HiveQL, JDBC. Query languages: SQL, HQL, ISO, PSQL. Legend: Prod / Beta & Alpha products]

Sqoop

Transfer from/to HDFS to/from structured storage via JDBC connectors: PostgreSQL, MySQL, Oracle, Teradata, …

[Figure: Sqoop import and Sqoop export between RDBMS / NoSQL and the Hadoop process]

Oozie

Nowadays @ Digital Factory?

In Production
•  Since 2010
•  Growth driven by internal project needs
•  Recycling Servers (€€ savings)
•  We learned as we walked:
   * tar -> cdh3 -> cdh4 …
   * optimizations
   * Run processes …

Production « PFS »
•  Shared among different teams
•  xx nodes on COTS
•  xxx TBytes
•  >xxx jobs per day
•  Monitoring: Xymon
•  Graphing via NetStat (SNMP / RRD: x’s oids/second)
•  Automatic Configuration

Architecture

[Figure: platform components: MapReduce, Mahout, Oozie, Khiops, Sqoop, Real Time Query Engine, R, HIVE Server, Web Service, Flume, App Services, HCatalog, HBase, Cassandra, Cascading. Legend: in POC]

Benefits
•  Infrastructure cost
•  Development cost
•  Robustness
•  Scalability
•  New development areas (Graph Mining, Logs statistics …)

€
•  -70% LOC
•  -50% dev time
•  -75% run cost

A few of our use cases

Graph algorithms for http://www.lemoteur.fr/
Scoring - Search Engine
•  xx TB compressed
•  xx billion nodes
•  >xxx billion edges
•  xxx TB in RAM
•  xRank

Customers’ statistical behaviors, ads display optimization, …
Profiling
•  xxx GB daily + xxx GB monthly (customer DB)
•  Customer profile

Log Analysis
•  xx billion events daily
•  OJD certified Measurements: Internet and Mobile, Customers’ journey analysis, …

with NoSQL
Hadoop over Cassandra (next session)

Benefits & Drawbacks

•  Scalable
•  Stable
•  RUN Cost
•  Development Cost
•  Performance
•  Very fast evolution
•  New Dev areas

•  Learning curve
•  Debug
•  Algorithms
•  Complex
•  Very fast evolution

Future
•  Enhance Security and robustness
•  Create Services & Functional Catalog
•  Continue building our expertise: Fast Data, Cascading, MR2, …
•  A thousand-node cluster!
•  Help other teams go into production

CONTACT US: olivier.varene@orange.com

Thank you

Olivier Varene, DSIF/DFY, Orange DevDay 2013

My Thanks to
•  Apache http://www.apache.org/
•  http://hadooper.blogspot.com/
•  Cloudera http://www.cloudera.com/
•  HortonWorks http://www.hortonworks.com/
•  And all the community!
