Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Scalable ETL with Talend and Hadoop Talend, Global Leader in Open Source Integration Solutions Cédric Carbone – Talend CTO Twitter : @carbone [email protected]

Upload: ow2-consortium

Post on 12-May-2015



DESCRIPTION

ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.

TRANSCRIPT

Page 1: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Scalable ETL with Talend and Hadoop
Talend, Global Leader in Open Source Integration Solutions

Cédric Carbone – Talend CTO Twitter : @carbone [email protected]

Page 2: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Why talk about ETL with Hadoop?

Hadoop is complex
BI consultants don't have the skills to work with Hadoop
➜ The biggest issue in a Hadoop project is finding skilled Hadoop engineers
➜ An ETL tool like Talend can help to democratize Hadoop

Page 3: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Trying  to  get  from  this…  

Page 4: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

to  this…  

Talend generates code that is executed within map reduce. This open approach removes the limitation of a proprietary “engine” to provide a truly unique and powerful set of tools for big data.

Why Talend…

Page 5: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

BIG Data Management
Turn Big Data into actionable information

[Diagram labels]
Stages: Big Data Production, Big Data Management, Big Data Consumption
Data sources: RDBMS, Analytical DB, NoSQL DB, ERP/CRM, SaaS, Social Media, Web Analytics, Log Files, RFID, Call Data Records, Sensors, Machine-Generated
Components: Big Data Integration, Big Data Quality
Other labels: Storage, Processing, Filtering, Parsing, Checking, Enrichment, Mining, Analytics, Search

Page 6: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Two methods for inserting data quality into a big data job

1. Pipelining: as part of the load process
2. Load the cluster, then implement and execute a data quality MapReduce job
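The two methods can be contrasted in a minimal Python sketch. This is only an illustration, not Talend's generated code: the `cleanse` rule, the record fields and the use of a plain list as a stand-in for the cluster are all assumptions made for the example.

```python
def cleanse(record):
    """Hypothetical data quality rule: trim whitespace, lower-case the email."""
    return {
        "name": record["name"].strip(),
        "email": record["email"].strip().lower(),
    }


def load_with_pipelining(source, cluster):
    # Method 1: the quality rule runs in the load path,
    # so only cleansed records ever reach the cluster.
    for record in source:
        cluster.append(cleanse(record))


def load_then_improve(source, cluster):
    # Method 2: load the raw records first...
    cluster.extend(source)
    # ...then run the quality rule as a separate pass over the stored data
    # (on a real cluster this pass would be a MapReduce job).
    cluster[:] = [cleanse(record) for record in cluster]
```

Both functions end with the same cleansed data; what differs is where the work happens: before the load, or inside the cluster after it.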

Page 7: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Extract - Transform - Load
E-T-L

Page 8: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Extract - Improve/Cleanse - Load
E-DQ-L

Page 9: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Pipelining: data quality with big data

• Use traditional data quality tools
• Once and done

[Diagram labels: CRM, ERP, Finance, Social Networking, Mobile Devices, each with a DQ step, feeding Big Data]

Page 10: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Big data alternative: Load and improve within the cluster

• Load first, improve later
• Complex matching cannot be done outside

[Diagram labels: CRM, ERP, Finance, Social Networking, Mobile Devices feeding Big Data, with DQ applied at the Big Data stage]

Page 11: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

One key DQ rule: Match

➜ Find duplicates within Hadoop
➜ Today's matching algorithms are processor-intensive
➜ Tomorrow's matching algorithms could be more precise, more intensive
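To see why matching is processor-intensive, here is a small Python sketch of duplicate detection arranged the way a MapReduce job would do it: group records by a blocking key, then compare every pair inside each group. The record fields, the blocking key and the 0.85 threshold are assumptions chosen for illustration; the slides do not describe Talend's actual matching algorithms.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: surname initial plus postcode. Only records that
    # share a key are compared, which limits the number of pairs.
    return (record["surname"][:1].upper(), record["postcode"])


def find_duplicates(records, threshold=0.85):
    # "Map" step: emit each record under its blocking key.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    # "Reduce" step: pairwise fuzzy comparison inside each block.
    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            score = SequenceMatcher(
                None, a["surname"].lower(), b["surname"].lower()
            ).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs
```

The pairwise comparison grows quadratically with the size of each block, which is the part that benefits from running inside the cluster.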

Page 12: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

What is Hadoop?  

Page 13: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

What's Hadoop

➜ The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

➜ Java framework for storing data and running data transformations on large clusters of commodity hardware

➜ Licensed under the Apache v2 license

➜ Created from Google's MapReduce, BigTable and Google File System (GFS) papers

Page 14: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Hadoop ecosystem

➜ Sqoop
➜ Pig
➜ Hive
➜ Oozie
➜ HCatalog
➜ Mahout
➜ HBase

Page 15: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Talend for Big Data: Hadoop Story

4.0: [April 2010] Put or get data into Hadoop through HDFS connectors

4.1: [Oct 2010] Hadoop Query (Hive). Bulk load & fast export to Hadoop (Sqoop)

4.2: [May 2011] Transformation (Pig)

5.0: [Nov 2011] HBase NoSQL. Extend our tPig*

HCatalog

5.1: [May 2012] Metadata (HCatalog). Deployment & Scheduling (Oozie). Embedded into HDP

5.2: [Oct 2012] Visual ELT mapping (Hive). Data Lineage & Impact Analysis

5.3: [June 2013] Visual Pig mapping. Machine Learning (Mahout). Native MapReduce Code Gen

 

Page 16: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Democratizing Integration with Data Integration tools for Big Data  

Page 17: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

WordCount

➜ WordCount
• Comes with Hadoop
• "First" demo that everyone tries!

➜ How-to in Talend Big Data
• Simple read, count, load results
• No coding, just drag-n-drop
• Runs remotely
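For readers who have not seen WordCount, the following Python sketch imitates its three MapReduce phases in a single process. It is an illustration of the pattern only: it is not the Java example that ships with Hadoop, and the function names are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    # Reduce: sum the counts collected for one word.
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
```

On a cluster, the map calls run on the data nodes that hold each block of input, and the framework performs the shuffle over the network before the reducers run.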

Page 18: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Map Reduce

Data  Node  1  

Data  Node  2  

Page 19: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Map Reduce

Page 20: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Map Reduce

Page 21: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Map Reduce

Page 22: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Map Reduce

Page 23: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

In Java

Page 24: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

WordCount: How-to with a Graphical ETL

Page 25: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Thank You!      

Page 26: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Let  us  show  you…  

Page 27: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Talend Open Studio

Page 28: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Generate Pure Map Reduce

Page 29: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Pig Latin Script generation

FOREACH tPigMap_1_out1_RESULT GENERATE $4 AS Revenu, $6 AS Label
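The generated statement is a projection: in Pig Latin, `$4` and `$6` are zero-based positional references, so it keeps the fifth and seventh fields of each tuple and names them `Revenu` and `Label`. A rough Python equivalent, with a made-up helper name and rows represented as plain tuples, would be:

```python
def project(rows):
    # Keep field 4 as "Revenu" and field 6 as "Label",
    # mirroring GENERATE $4 AS Revenu, $6 AS Label (zero-based positions).
    return [{"Revenu": row[4], "Label": row[6]} for row in rows]
```

The sample data used to exercise this sketch is invented; the slide does not show the input relation.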

Page 30: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

HiveQL generation

Page 31: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

HDFS Management and Sqoop

Page 32: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Apache Mahout

➜ Big Data can also be a blob of data to an organization

➜ Apache Mahout provides algorithms for understanding data (data mining)

➜ "You don't know what you don't know," and Mahout will tell you.

Page 33: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Metadata Management

➜ Centralized metadata repository for Hadoop Cluster, HDFS, Hive…
• Versioning
• Impact Analysis and Data Lineage

➜ HCatalog across
• HDFS
• Hive
• Pig

Page 34: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Thank You!      

Page 35: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Choose your Hadoop distro

➜ Widely adopted
➜ Management tooling is not OSS

➜ Fully open source
➜ Strong developer ecosystem

➜ More proprietary
➜ GTM partner with AWS

➜ Many more are coming

Page 36: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Choose your Hadoop distro

Provide tooling:
➜ For installation
➜ For server monitoring

But
➜ No GUI for parsing, transforming, easily loading. No data management

Page 37: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Parse and Standardize

➜ Big Data is not always structured
➜ Correct big data so that data conforms to the same rules
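As one concrete case of making data conform to the same rules, the Python sketch below parses dates written in several formats and standardizes them to ISO 8601. The list of accepted formats is an assumption for the example; a real job would take it from profiling the data.

```python
from datetime import datetime

# Assumed input formats, tried in order.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def standardize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    text = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None
```

Values that cannot be parsed come back as `None` rather than raising, so a job can route them to a reject flow instead of failing.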

Page 38: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Profiling & Monitor DQ

Page 39: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Implications for Integration

DATA

Challenge: Information explosion increases complexity of integration and requires governance to maintain data quality

Requirement: Information processing must scale

Page 40: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Implications for Integration

APPLICATION

Challenge: Brittle, point-to-point connections cannot adapt to evolving business requirements, new channels, and quickly changing topologies

Requirement: Application architecture must scale

Page 41: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Implications for Integration

PROCESS

Challenge: Competitive market forces drive frequent process changes and increased process complexity

Requirement: Business processes must scale

Page 42: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Implications for Integration

Challenge: Interdependencies across data, applications and processes require more resources and budget

Requirement: Resources and skillsets must scale

Page 43: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Integration at Any Scale

True scalability for
• Any integration challenge
• Any data volume
• Any project size

Enables integration convergence

Page 44: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Technology that Scales

CODE GENERATOR
No black-box engine means faster maintenance and deployment with improved quality
The "engine" for Big Data is Hadoop itself, so generated jobs scale with the cluster rather than with a proprietary engine.

Generated code: Java, SQL, Map Reduce, Camel, ……

STANDARDS-BASED
Easy to learn, flexible to adopt, reduces vendor lock-in

Page 45: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

   

   

   

       

ELT (SQL CodeGen) • Teradata • Netezza • Vertica

ETL • Java Code • Partitioning • Parallelization

• Visual Wizard • Hive/Pig/MR CodeGen • HDFS, Sqoop, Oozie…

Technology Continuum…

Google Big Query

NoSQL • MongoDB • Neo4J • Cassandra • HBase • Amazon Redshift

Page 46: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

At a glance

▶ Founded in 2005

▶ Offers highly scalable integration solutions addressing Data Integration, Data Quality, MDM, ESB and BPM

▶ Provides:
§ Subscriptions including 24/7 support and indemnification;
§ Worldwide training and services

▶ Recognized as the open source leader in each of its market categories

Talend today
➜ 400 employees in 7 countries with dual HQ in Los Altos, CA and Paris, France
➜ Over 4,000 paying customers across different industry verticals and company sizes
➜ Backed by Silver Lake Sumeru, Balderton Capital and Idinvest Partners

Talend Overview

High growth through a proven model
• Brand Awareness: 20 million downloads
• Adoption: 1,000,000 users
• Monetization: 4,000 customers
• Market Momentum: +50 new customers / month

Page 47: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Talend’s Unique Integration Solution

Talend Unified Platform

1 Studio: Comprehensive Eclipse-based user interface
2 Repository: Consolidated metadata & project information
3 Deployment: Web-based deployment & scheduling
4 Execution: Same container for batch processing, message routing & services
5 Monitoring: Single web-based monitoring console

Data Quality, Data Integration, MDM, ESB, BPM

➜ Reduce costs
➜ Eliminate risk
➜ Reuse skills
➜ Economies of scale
➜ Incremental adoption

Best-of-Breed Solutions + Talend Unified Platform = Unique Integration Solution

Recognized as the open source leader in each of its market categories by all industry analysts

Page 48: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Solutions that Scale

TALEND UNIFIED PLATFORM

Shared components: Studio, Execution, Monitoring, Deployment, Repository
Products: Data Integration, Big Data, Data Quality, ESB, MDM, BPM

UNIFIED PLATFORM
A shared foundation and toolset increases resource reuse

CONVERGED INTEGRATION
Use for any data, application and process project

Page 49: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

The 6 Dimensions of BIG Data

Primary challenges
• Volume
• Velocity
• Variety

And also
• Complexity
• Validation
• Lineage

Page 50: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

2015

What Is BIG Data?

"Big data" is information of extreme size, diversity, complexity and need for rapid processing.

Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011

2020

275 exabytes of data flowing over the Internet each day (275,000,000,000,000,000,000 bytes)

200 billion intelligent devices (200,000,000,000)

1,000,000 transactions per day at Walmart

3,500 tweets per second (June 2011)

Page 51: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

What is Big Data?  

Page 52: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

volume, variety, velocity

How to define Big data is….

Key  Takeaway  #1  

Hans  Rosling  –  uses  big  data  to  analyze  world  health  trends  

Page 53: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Traditional Data Flows

• Scheduled: daily or weekly, sometimes more frequently.
• Volumes rarely exceed terabytes

[Diagram labels: CRM, ERP, Finance; ETL, Data Quality; Normalized Data; Traditional Data Warehouse; Business Analyst, Business User, Executives, Warehouse Administrator]

Page 54: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

The new world of big data

[Diagram labels: CRM, ERP, Finance, Social Networking; Big Data]

Page 55: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

The new world of big data

[Diagram labels: CRM, ERP, Finance, Social Networking, Mobile Devices; Big Data]

Page 56: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

The new world of big data

[Diagram labels: CRM, ERP, Finance, Social Networking, Mobile Devices, Transactions, Network Devices, Sensors; Big Data]

Page 57: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Forces us to think differently

Key  Takeaway  #2  

Page 58: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Data driven business

[Diagram labels: data, information, decisions, your business, governance; relations: supports, drives, enables]

Information provides value to the business
If you can't rely on your information then the result can be missed opportunities, or higher costs.

Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).

Page 59: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

BIG data driven business

[Diagram labels: BIG data, BIG information, BIG decisions, BIG business, governance; relations: supports, drives, enables]

Information provides value to the business
If you can't rely on your information then the result can be missed opportunities, or higher costs.

Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).

Page 60: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

 

How is big data integration being used?

Use Cases
• Recommendation Engine
• Sentiment Analysis
• Risk Modeling
• Fraud Detection
• Behavior Analysis
• Marketing Campaign Analysis
• Customer Churn Analysis
• Social Graph Analysis
• Customer Experience Analytics
• Network Monitoring

BUT: to what level is DQ required for your use case?

 

 

 

Page 61: Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend

Define your use case

Key Takeaway #3