Scalable ETL with Talend and Hadoop (Cédric Carbone, Talend)

Post on 12-May-2015


DESCRIPTION

ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.

TRANSCRIPT

Scalable ETL with Talend and Hadoop. Talend, Global Leader in Open Source Integration Solutions

Cédric Carbone – Talend CTO Twitter : @carbone ccarbone@talend.com

Why speak about ETL with Hadoop?

Hadoop is complex, and BI consultants don't have the skills to manipulate it. ➜ The biggest issue in a Hadoop project is finding skilled Hadoop engineers.

➜ An ETL tool like Talend can help democratize Hadoop

Trying  to  get  from  this…  

to  this…  

Talend generates code that is executed within MapReduce. This open approach removes the limitation of a proprietary "engine" and provides a truly unique and powerful set of tools for big data.

Why Talend…

Big Data Production

Big Data Management

Big Data Consumption

Storage, Processing, Filtering

Mining

Analytics

Search

Enrichment

RDBMS, Analytical DB, NoSQL DB, ERP/CRM, SaaS, Social Media, Web Analytics, Log Files, RFID, Call Data Records, Sensors, Machine-Generated

Big Data Integration

Big Data Quality

BIG Data Management

Turn Big Data into actionable information

Parsing Checking

Two methods for inserting data quality into a big data job:

1. Pipelining: as part of the load process

2. Load the cluster, then implement and execute a data quality MapReduce job

Extract - Transform - Load (E-T-L)

Extract - Improve/Cleanse - Load (E-DQ-L)

CRM

ERP

Finance

Social Networking

Mobile Devices

Big Data

Pipelining: data quality with big data

• Use traditional data quality tools
• Once and done

[Diagram: a DQ step applied to each data source during the load]

Big data alternative: Load and improve within the cluster

• Load first, improve later
• Complex matching cannot be done outside

CRM

ERP

Finance

Social Networking

Mobile Devices

Big Data

[Diagram: DQ jobs running inside the Big Data cluster]
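The two placements of data quality can be sketched in plain Python. This is only an illustration of the pattern, not of Talend components; the `cleanse` rule here is a made-up example.

```python
# Illustrative sketch of the two DQ placements described above.
# Method 1: pipelining - cleanse each record inline as it is loaded.
# Method 2: load the raw data first, then run a batch cleansing pass
#           (what a DQ MapReduce job would do inside the cluster).

def cleanse(record):
    """A toy DQ rule: trim whitespace and normalize the case of a name field."""
    return {**record, "name": record["name"].strip().title()}

def pipelined_load(source, store):
    for rec in source:                          # DQ happens once, during the load
        store.append(cleanse(rec))

def load_then_improve(source, store):
    store.extend(source)                        # load raw data first...
    store[:] = [cleanse(r) for r in store]      # ...then a batch DQ pass over it

raw = [{"name": "  alice "}, {"name": "BOB"}]
a, b = [], []
pipelined_load(raw, a)
load_then_improve(raw, b)
assert a == b == [{"name": "Alice"}, {"name": "Bob"}]
```

Both methods end with the same clean data; the difference is where the work runs, which matters once volumes outgrow a traditional DQ tool.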

One key DQ rule: Match

➜ Find duplicates within Hadoop
➜ Today's matching algorithms are processor-intensive
➜ Tomorrow's matching algorithms could be more precise, more intensive
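A toy version of such a matching rule, to show why it is processor-intensive: naive duplicate detection compares every pair of records, which is O(n²). This is an illustration only; real DQ matchers use blocking keys, phonetic encodings, and tuned edit-distance measures, and the threshold here is arbitrary.

```python
# Hedged sketch of a "match" DQ rule: flag likely duplicate records by
# fuzzy string similarity. Pairwise comparison is O(n^2), which is why
# matching is a natural candidate for a distributed MapReduce job at scale.
from difflib import SequenceMatcher
from itertools import combinations

def similar(a, b, threshold=0.85):
    # Ratio of matching characters between the two lowercased strings
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def find_duplicates(names, threshold=0.85):
    # Compare every pair of records; keep the pairs that look alike
    return [(a, b) for a, b in combinations(names, 2) if similar(a, b, threshold)]

customers = ["Cédric Carbone", "Cedric Carbone", "Jane Smith"]
print(find_duplicates(customers))
```

The accented and unaccented spellings are caught as one probable duplicate, while the unrelated name is not.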

What is Hadoop?

➜ The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

➜ A Java framework for storing and running data transformations on large clusters of commodity hardware

➜  Licensed  under  the  Apache  v2  license  

➜  Created  from  Google's  MapReduce,  BigTable  and  Google  File  System  (GFS)  papers  

Hadoop ecosystem

➜ Sqoop ➜ Pig ➜ Hive ➜ Oozie ➜ HCatalog ➜ Mahout ➜ HBase

Talend for Big Data : Hadoop Story

   

4.0 [April 2010]: put or get data into Hadoop through HDFS connectors

4.1 [Oct 2010]: Hadoop query (Hive); bulk load & fast export to Hadoop (Sqoop)

4.2 [May 2011]: transformation (Pig)

5.0 [Nov 2011]: HBase NoSQL; extend our tPig*; HCatalog

5.1 [May 2012]: metadata (HCatalog); deployment & scheduling (Oozie); embedded into HDP

5.2 [Oct 2012]: visual ELT mapping (Hive); data lineage & impact analysis

5.3 [June 2013]: visual Pig mapping; machine learning (Mahout); native MapReduce code gen

 

Democratizing Integration with Data Integration tools for Big Data  

WordCount

➜ WordCount
• Comes with Hadoop
• The "first" demo that everyone tries!

➜ How-to in Talend Big Data
• Simple read, count, load results
• No coding, just drag-and-drop
• Runs remotely

[Diagram: Map and Reduce tasks distributed across Data Node 1 and Data Node 2]

In Java

WordCount: How-to with a Graphical ETL
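The same job, sketched in plain Python rather than Hadoop Java, to show the map → shuffle → reduce shape of what the graphical tool generates. This single-process simulation is only a teaching aid; real Hadoop splits the input across data nodes and runs the mappers in parallel.

```python
# Minimal single-process sketch of the classic MapReduce WordCount.
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in a line of input
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle groups the pairs by key; the reducer sums the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["talend loves hadoop", "hadoop loves scale"]
pairs = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'talend': 1, 'loves': 2, 'hadoop': 2, 'scale': 1}
```

The drag-and-drop job in the demo produces the same logic, but as generated Java running on the cluster.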

Thank You!      

Let  us  show  you…  

Talend Open Studio

Generate Pure Map Reduce

Pig Latin Script generation

FOREACH tPigMap_1_out1_RESULT GENERATE $4 AS Revenu, $6 AS Label;

HiveQL generation

HDFS Management and Sqoop

Apache Mahout

➜ Big data can also be a blob of data to an organization

➜ Apache Mahout provides algorithms for understanding data: data mining

➜ "You don't know what you don't know," and Mahout will tell you.

Metadata Management

➜ Centralized metadata repository for Hadoop cluster, HDFS, Hive…
• Versioning
• Impact analysis and data lineage

➜ HCatalog across
• HDFS
• Hive
• Pig

Thank You!      

Choose your Hadoop distro

➜ Widely adopted ➜ Management tooling is not OSS

➜ Fully open source ➜ Strong developer ecosystem

➜ More proprietary ➜ GTM partner with AWS

➜ Many more are coming

Choose your Hadoop distro

Provide tooling:
➜ For installation
➜ For server monitoring
But:
➜ No GUI for parsing, transforming, or easily loading. No data management.

Parse and Standardize

➜ Big data is not always structured
➜ Correct big data so that it conforms to the same rules
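As an illustration of such a conformance rule, here is a hypothetical phone-number standardizer. The rule and format are invented for this sketch and are not a Talend component; the point is that raw values arrive in many shapes and are rewritten to one canonical form before matching and analytics see them.

```python
# Hedged "parse and standardize" example: normalize phone strings to one
# canonical format, flagging values that cannot be made to conform.
import re

def standardize_phone(raw, country="+1"):
    digits = re.sub(r"\D", "", raw)           # strip everything but digits
    if len(digits) == 10:                     # assume a 10-digit national number
        return f"{country}-{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return None                               # non-conforming: route to review

assert standardize_phone("(415) 555-0123") == "+1-415-555-0123"
assert standardize_phone("415.555.0123") == "+1-415-555-0123"
assert standardize_phone("bad data") is None
```

After standardization, "(415) 555-0123" and "415.555.0123" are recognizably the same record, which is exactly what the matching step upstream depends on.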

Profiling & Monitoring DQ

Implications for Integration

DATA
Challenge: information explosion increases the complexity of integration and requires governance to maintain data quality.
Requirement: information processing must scale.

Implications for Integration

APPLICATION
Challenge: brittle, point-to-point connections cannot adapt to evolving business requirements, new channels, and quickly changing topologies.
Requirement: application architecture must scale.

Implications for Integration

PROCESS
Challenge: competitive market forces drive frequent process changes and increased process complexity.
Requirement: business processes must scale.

Implications for Integration

Challenge: interdependencies across data, applications and processes require more resources and budget.
Requirement: resources and skillsets must scale.

Integration at Any Scale

True scalability for:
• Any integration challenge
• Any data volume
• Any project size
Enables integration convergence

Technology that Scales

CODE GENERATOR
No black-box engine means faster maintenance and deployment with improved quality. The "engine" for Big Data is Hadoop itself, making it uniquely able to run at massive scale.

Java, SQL, MapReduce, Camel, …

STANDARDS-BASED
Easy to learn, flexible to adopt, reduces vendor lock-in

Technology Continuum…

ELT (SQL CodeGen): Teradata, Netezza, Vertica, Google BigQuery

ETL: Java code, partitioning, parallelization

Big Data: visual wizard, Hive/Pig/MR CodeGen, HDFS, Sqoop, Oozie…

NoSQL: MongoDB, Neo4j, Cassandra, HBase, Amazon Redshift

At a glance

▶  Founded  in  2005  

▶ Offers highly scalable integration solutions addressing Data Integration, Data Quality, MDM, ESB and BPM

▶ Provides:
§ Subscriptions including 24/7 support and indemnification
§ Worldwide training and services

▶  Recognized  as  the  open  source  leader  in  each  of  its  market  categories  

Talend today
➜ 400 employees in 7 countries with dual HQ in Los Altos, CA and Paris, France
➜ Over 4,000 paying customers across different industry verticals and company sizes
➜ Backed by Silver Lake Sumeru, Balderton Capital and Idinvest Partners

Talend Overview

Brand Awareness: 20 million downloads
Monetization: 4,000 customers
Adoption: 1,000,000 users
Market Momentum: +50 new customers / month

High growth through a proven model

Talend’s Unique Integration Solution

1. Studio: comprehensive Eclipse-based user interface

2. Repository: consolidated metadata & project information

3. Deployment: web-based deployment & scheduling

4. Execution: same container for batch processing, message routing & services

5. Monitoring: single web-based monitoring console

Talend Unified Platform

Data Quality

Data Integration

MDM

ESB

BPM

➜ Reduce costs ➜ Eliminate risk ➜ Reuse skills ➜ Economies of scale ➜ Incremental adoption

Best-of-Breed Solutions

+

Talend Unified Platform

=

Unique Integration Solution

Recognized as the open source leader in each of its market categories by all industry analysts

Solutions that Scale


TALEND UNIFIED PLATFORM

Data Integration, Big Data, Data Quality, ESB, MDM, BPM

Studio, Execution, Monitoring, Deployment, Repository

UNIFIED PLATFORM: a shared foundation and toolset increases resource reuse

CONVERGED INTEGRATION: use for any data, application and process project

The 6 Dimensions of BIG Data

Primary challenges:
• Volume
• Velocity
• Variety

And also:
• Complexity
• Validation
• Lineage

2015

What Is BIG Data?

"Big  data"    is  informa7on    of  extreme  size,  diversity,  complexity  and  need  for  rapid  processing.  

Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011

2020

275 exabytes of data flowing over the Internet each day (275,000,000,000,000,000,000)

200 billion intelligent devices (200,000,000,000)

1,000,000 transactions per day at Walmart

3,500 tweets per second (June 2011)

What is Big Data?  

volume, variety, velocity

How to define Big Data…

Key  Takeaway  #1  

Hans Rosling uses big data to analyze world health trends

CRM  

ERP  

Finance  

ETL  Data  Quality  

Normalized  Data  

Traditional Data Warehouse

Business  Analyst  

Business  User  

Warehouse  Administrator  

 

Traditional Data Flows

• Scheduled: daily or weekly, sometimes more frequently

• Volumes rarely exceed terabytes

Executives

The new world of big data

CRM, ERP, Finance, Social Networking, Mobile Devices, Transactions, Network Devices, and Sensors all feed into Big Data

Forces us to think differently

Key  Takeaway  #2  

Data driven business

[Diagram: data, information, decisions, and your business, linked by "supports", "enables", and "drives", with governance]

Information provides value to the business. If you can't rely on your information, then the result can be missed opportunities or higher costs.

Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).

BIG data driven business

[Diagram: the same cycle at scale: BIG data, BIG information, BIG decisions, and BIG business, linked by "supports", "enables", and "drives", with governance]

How is big data integration being used?

Use cases:
• Recommendation Engine
• Sentiment Analysis
• Risk Modeling
• Fraud Detection
• Behavior Analysis
• Marketing Campaign Analysis
• Customer Churn Analysis
• Social Graph Analysis
• Customer Experience Analytics
• Network Monitoring

BUT: to what level is DQ required for your use case?

Key Takeaway #3: Define your use case
