WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL your Data with Camel and Talend


Upload: kai-waehner

Post on 26-Jan-2015

DESCRIPTION

Big data represents a significant paradigm shift in enterprise technology. It radically changes the nature of the data management profession, because it introduces new concerns about the volume, velocity and variety of corporate data. Apache Hadoop is the open source de facto standard for implementing big data solutions on the Java platform. Hadoop consists of its kernel, MapReduce, and the Hadoop Distributed Filesystem (HDFS). A challenging task is to send all data to Hadoop for processing and storage (and then get it back to your application later), because in practice data comes from many different applications (SAP, Salesforce, Siebel, etc.) and data stores (files, SQL, NoSQL), uses different technologies and concepts for communication (e.g. HTTP, FTP, RMI, JMS), and comes in different data formats such as CSV, XML, binary data, or other alternatives. This session shows different open source frameworks and products that solve this challenging task. Learn how to use every conceivable kind of data with Hadoop, without a lot of complex or redundant boilerplate code.

TRANSCRIPT

Page 1: WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL your Data with Camel and Talend

Kai Wähner| Talend @KaiWaehner www.kai-waehner.de/blog Xing / LinkedIn

Big Data beyond Hadoop How to integrate ALL your Data

Page 2:

© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner

Consulting

Developing

Coaching

Speaking

Writing

Main Tasks

Requirements Engineering

Enterprise Architecture Management

Business Process Management

Architecture and Development of Applications

Service-oriented Architecture

Integration of Legacy Applications

Cloud Computing

Big Data

Contact

Blog: www.kai-waehner.de/blog

Twitter: @KaiWaehner

Social Networks: Xing, LinkedIn

Kai Wähner

Page 3:

Key messages

You have to care about big data to be competitive in the future!

You have to integrate different sources to get most value out of it!

Big data integration is no (longer) rocket science!

Page 4:

• Big data paradigm shift

• Challenges for integrating big data

• Choosing Hadoop for big data integration

• Integration with an open source framework

• Integration with an open source suite

• Custom big data connectors

Agenda

Page 5:

• Big data paradigm shift

• Challenges for integrating big data

• Choosing Hadoop for big data integration

• Integration with an open source framework

• Integration with an open source suite

• Custom big data connectors

Agenda

Page 6:

William Edwards Deming (1900–1993)

American statistician, professor, author, lecturer and consultant

“If you can't measure it, you can't manage it.”

Why should you care about big data?

Page 7:

"Silence the HiPPOs" (highest-paid person's opinion)

Once you can interpret unimaginably large data streams, relying on gut feeling is no longer justified!

Why should you care about big data?

Page 8:

What is big data? The Vs of big data

• Volume (terabytes, petabytes)

• Variety (social networks, blog posts, logs, sensors, etc.)

• Velocity (realtime or near-realtime)

• Value

Page 9:

Big data tasks to solve - before analysis

Big Data Integration

– Land data in a Big Data cluster

– Implement or generate parallel processes

Big Data Manipulation

– Simplify manipulation, such as sort and filter

– Computationally expensive functions

Big Data Quality & Governance

– Identify linkages and duplicates, validate big data

– Match connector, execute basic quality features

Big Data Project Management

– Place frameworks around big data projects

– Common repository, scheduling, monitoring

Page 10:

“The advantage of their new system is that they can now look at their data [from their log processing system] in any way they want:

➜ Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.

➜ When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.”

http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

Use case: Clickstream Analysis

Page 11:

http://hkotadia.com/archives/5021

Deduce Customer Defections

Use case: Risk management

Page 12:

➜ With revenue of almost USD 30 billion and a network of 800 locations, Macy's is considered the largest store operator in the USA

➜ Daily price check analysis of its 10,000 articles in less than two hours

➜ Whenever a neighboring competitor anywhere between New York and Los Angeles goes for aggressive price reductions, Macy's follows its example

➜ If there is no market competitor, the prices remain unchanged

http://www.t-systems.com/about-t-systems/examples-of-successes-companies-analyze-big-data-in-record-time-l-t-systems/1029702

Use case: Flexible pricing

Page 13:

➜ A lot of data must be stored „forever“

➜ Numbers increase exponentially

➜ Goal: As cheap as possible

➜ Problem: (Fast) queries must still be possible

➜ Solution: Commodity servers and „Hadoop querying“

Global Parcel Service

http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up

Storage: Compliance

Page 14:

• Big data paradigm shift

• Challenges for integrating big data

• Choosing Hadoop for big data integration

• Integration with an open source framework

• Integration with an open source suite

• Custom big data connectors

Agenda

Page 15:

Limited big data experts

This is your company

Big Data Geek

Page 16:

➜ Wanna buy a big data solution for your industry?

➜ Maybe a competitor has a big data solution which adds business value?

➜ The competitor will never publish it (rat-race)!

Big data tool selection (business perspective)

Page 17:

Looking for ‚your‘ required big data product?

Support your data from scratch?

Good luck!

Big data tool selection (technical perspective)

Page 18:

Big Data + Poor Data Quality = Big Problems

Data quality

Page 19:

How to solve these big data challenges?

Page 20:

What is your Big Data process?

Step 1: Do not begin with the data; think about business opportunities

Step 2: Choose the right data (combine different data sources)

Step 3: Use easy tooling

http://hbr.org/2012/10/making-advanced-analytics-work-for-you

Page 21:

• Big data paradigm shift

• Challenges for integrating big data

• Choosing Hadoop for big data integration

• Integration with an open source framework

• Integration with an open source suite

• Custom big data connectors

Agenda

Page 22:

Technology perspective

How to process big data?

Page 23:

The critical flaw in parallel ETL tools is the fact that the data is almost never local to the processing nodes. This means that every time a large job is run, the data has to first be read from the source, split N ways and then delivered to the individual nodes. Worse, if the partition key of the source doesn’t match the partition key of the target, data has to be constantly exchanged among the nodes. In essence, parallel ETL treats the network as if it were a physical I/O subsystem. The network, which is always the slowest part of the process, becomes the weakest link in the performance chain.

http://blog.syncsort.com/2012/08/parallel-etl-tools-are-dead

How to process big data?
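The point about mismatched partition keys can be made concrete with a small sketch. It assumes simple hash partitioning over a fixed number of nodes; the class name, method names and the two-key record layout are illustrative assumptions and do not come from the quoted article.

```java
import java.util.Arrays;
import java.util.List;

final class PartitionMismatch {
    // Assign a key to one of `nodes` nodes by hashing (assumed partitioning scheme).
    static int nodeFor(String key, int nodes) {
        return Math.floorMod(key.hashCode(), nodes);
    }

    // Each record is {sourcePartitionKey, targetPartitionKey}.
    // A record has to cross the network when the two keys map to different nodes.
    static long recordsToMove(List<String[]> records, int nodes) {
        return records.stream()
                .filter(r -> nodeFor(r[0], nodes) != nodeFor(r[1], nodes))
                .count();
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"a", "a"},   // same key on both sides: stays local
                new String[] {"a", "b"},   // keys differ: must be exchanged
                new String[] {"b", "b"});
        System.out.println(recordsToMove(records, 2) + " of " + records.size()
                + " records must cross the network");
    }
}
```

With two nodes, only the record whose source and target keys hash to different nodes has to be shipped. When the partition keys rarely agree, most records fall into that category, which is the network bottleneck the quote describes.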

Page 24:

Slides: http://www.slideshare.net/pavlobaron/100-big-data-0-hadoop-0-java Video: http://www.infoq.com/presentations/Big-Data-Hadoop-Java

How to process big data?

Page 25:

The de facto standard for big data processing

How to process big data?

Page 26:

How to process big data?

Even Microsoft (the .NET house) has relied on Hadoop since 2011

“A big part of [the company’s strategy] includes wiring SQL Server 2012 (formerly known by the codename “Denali”) to the Hadoop distributed computing platform, and bringing Hadoop to Windows Server and Azure”

Page 27:

Apache Hadoop, an open-source software library, is a framework that allows for the distributed processing of large data sets across clusters of commodity hardware using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

What is Hadoop?

Page 28:

Simple example

• Input: (very large) text files with lists of strings, such as: "318, 0043012650999991949032412004...0500001N9+01111+99999999999..."

• We are interested in just some of the content: year and temperature (highlighted in red on the original slide)

• The MapReduce job has to compute the maximum temperature for every year

Example from the book “Hadoop: The Definitive Guide, 3rd Edition”

Map (Shuffle) Reduce
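The map, shuffle and reduce steps can be sketched in plain Java without a cluster. This is not the Hadoop API and not the code from the book. Because the sample record on the slide is elided, the sketch assumes a simplified, hypothetical "year,temperature" line format, and the class and method names are made up for illustration.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class MaxTemperature {
    // Map step: turn one "year,temperature" line (assumed format) into a (year, temperature) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Shuffle + reduce: group the pairs by year and keep the maximum temperature per year.
    static Map<String, Integer> maxPerYear(List<String> lines) {
        return lines.stream()
                .map(MaxTemperature::map)
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        Integer::max,      // reduce function applied to values sharing a key
                        TreeMap::new));    // sorted by year for readable output
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1949,111", "1949,78", "1950,0", "1950,22", "1950,-11");
        System.out.println(maxPerYear(input)); // prints {1949=111, 1950=22}
    }
}
```

In a real Hadoop job the map and reduce functions run on many nodes and the framework performs the grouping by key; here a single in-memory collector stands in for all of that.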

Page 29:

How to process big data?

Page 30:

• Big data paradigm shift

• Challenges for integrating big data

• Choosing Hadoop for big data integration

• Integration with an open source framework

• Integration with an open source suite

• Custom big data connectors

Agenda

Page 31:

Alternatives for systems integration

[Figure: integration alternatives ordered by complexity of integration, from low to high]

• Integration Framework: INTEGRATION (Connectivity, Routing, Transformation)

• Enterprise Service Bus: + Tooling, Monitoring, Support

• Integration Suite: + BUSINESS PROCESS MGT., BIG DATA / MDM, REGISTRY / REPOSITORY, RULES ENGINE, "YOU NAME IT"

Page 32:

Alternatives for systems integration

[Figure: Integration Framework, Enterprise Service Bus and Integration Suite, ordered by complexity of integration from low to high]

Page 33:

More details about integration frameworks...

http://www.kai-waehner.de/blog/2012/12/20/showdown-integration-framework-spring-integration-apache-camel-vs-enterprise-service-bus-esb/

Page 34:

Enterprise Integration Patterns (EIP)

Apache Camel

Implements the EIPs

Page 35:

Enterprise Integration Patterns (EIP)

Page 36:

Enterprise Integration Patterns (EIP)

Page 37:

Architecture

http://java.dzone.com/articles/apache-camel-integration

Page 38:

Choose your required connectors

HTTP, FTP, File, XSLT, MQ, JDBC, Akka, TCP, SMTP, RSS, Quartz, Log, LDAP, JMS, EJB, AMQP, Atom, AWS-S3, Bean-Validation, CXF, IRC, Jetty, JMX, Lucene, Netty, RMI, SQL

Many, many more

Custom Components

Page 39:

Choose your favorite DSL

XML

(not production-ready yet)

Page 40:

Deploy it wherever you need

Standalone

OSGi

Application Server

Web Container

Spring Container

Cloud

Page 41:

Enterprise-ready

• Open Source

• Scalability

• Error Handling

• Transaction

• Monitoring

• Tooling

• Commercial Support

Page 42:

Example: Camel integration route

Page 43:

Hadoop Integration with Apache Camel

Page 44:

camel-hdfs connector (Java DSL)

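// Producer: pull files from an FTP server and append them to a file in HDFS
from("ftp://user@myServer?password=secret")
    .to("hdfs:///myDirectory/myFile.txt?append=true");

// Consumer: read an analysis result from HDFS and write it to the local file system
from("hdfs:///myDirectory/myBigDataAnalysis.csv")
    .to("file:target/reports/report.csv");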
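The endpoint strings in these routes follow a URI-like shape: a component name, a resource path and options after the question mark. The sketch below only illustrates that shape using the JDK's generic `java.net.URI` parser. It is an assumption-laden illustration, not how Camel itself resolves endpoints, and the class and method names are hypothetical.

```java
import java.net.URI;

final class EndpointUriAnatomy {
    // Split an endpoint string into the three parts a reader should recognize.
    // Uses the generic JDK URI parser, which is stricter than Camel's own endpoint handling.
    static String describe(String endpoint) {
        URI uri = URI.create(endpoint);
        return "component=" + uri.getScheme()
                + ", path=" + uri.getPath()
                + ", options=" + uri.getQuery();
    }

    public static void main(String[] args) {
        System.out.println(describe("hdfs:///myDirectory/myFile.txt?append=true"));
        // prints component=hdfs, path=/myDirectory/myFile.txt, options=append=true
    }
}
```

Reading the producer route this way: the `hdfs` component writes to `/myDirectory/myFile.txt`, and the `append=true` option asks it to append rather than overwrite.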

Page 45:

camel-hbase connector (XML DSL)

Page 46:

Camel and Hadoop File Formats

Choose the appropriate Hadoop file format, e.g. SequenceFile, MapFile, etc...

Page 47:

camel-avro connector

Camel is not just about connectors ... Camel supports a pluggable DataFormat to allow messages to be marshalled to and unmarshalled from binary or text formats, e.g. CSV, JSON, SOAP, EDI, ZIP, or ...

... Avro, a data serialization system used in Apache Hadoop. The camel-avro connector provides a data format for Avro, which allows serialization and deserialization of messages using Apache Avro's binary data format. Moreover, it supports Apache Avro's RPC by providing producer and consumer endpoints for using Avro over Netty or HTTP.
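The idea of a pluggable data format can be illustrated with a deliberately small sketch. The `SimpleDataFormat` interface below is hypothetical and much simpler than Camel's real `DataFormat` contract; it only shows why marshalling and unmarshalling behind one interface lets the surrounding logic stay unchanged when the wire format changes.

```java
import java.util.Arrays;
import java.util.List;

final class DataFormatSketch {
    // Hypothetical, simplified stand-in for a pluggable data format (not Camel's interface).
    interface SimpleDataFormat {
        String marshal(List<String> fields);    // object -> text
        List<String> unmarshal(String text);    // text -> object
    }

    // Naive CSV: no quoting or escaping, so fields must not contain commas.
    static final SimpleDataFormat CSV = new SimpleDataFormat() {
        public String marshal(List<String> fields) { return String.join(",", fields); }
        public List<String> unmarshal(String text) { return Arrays.asList(text.split(",", -1)); }
    };

    // A second format with the same contract: pipe-separated values.
    static final SimpleDataFormat PIPE = new SimpleDataFormat() {
        public String marshal(List<String> fields) { return String.join("|", fields); }
        public List<String> unmarshal(String text) { return Arrays.asList(text.split("\\|", -1)); }
    };

    // Convert a message between formats without knowing either format's details.
    static String convert(String message, SimpleDataFormat from, SimpleDataFormat to) {
        return to.marshal(from.unmarshal(message));
    }

    public static void main(String[] args) {
        System.out.println(convert("1949,111,ok", CSV, PIPE)); // prints 1949|111|ok
    }
}
```

Swapping `PIPE` for another implementation (for example a binary format such as Avro) would not change `convert`, which is the benefit the slide attributes to pluggable data formats.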

Page 48:

➜ A lot of data must be stored „forever“

➜ Numbers increase exponentially

➜ Goal: As cheap as possible

➜ Problem: (Fast) queries must still be possible

➜ Solution: Commodity servers and „Hadoop querying“

Global Parcel Service

http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up

Real World Use Case: Storage: Compliance

Page 49:

Real world use case: Storage: Compliance

[Figure: Orders (Server 1), Payments (Server 2), Log Files (Server 3), Log Files (Server 100); ETL; Query; Storage]

Page 50:

Live demo

Apache Camel in action...

Page 51:

camel-pig? camel-hive? camel-hcatalog?

Not available yet (current Camel version: 2.12)

Workarounds:

• Use Pig / Hive query scripts (via the camel-exec connector or any scripting language)

• Build your own connector (more details later ...)

• Use the Hive-HBase integration and store data in HBase ("ugly")
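The first workaround boils down to launching an external script. A minimal plain-Java sketch of that idea follows; it uses the JDK's `ProcessBuilder` rather than the camel-exec connector, and it assumes a `hive` command-line client on the PATH and a hypothetical script name.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class HiveScriptRunner {
    // Build the command line; "hive -f <file>" runs the statements in a HiveQL script file.
    static List<String> command(String scriptPath) {
        return Arrays.asList("hive", "-f", scriptPath);
    }

    // Start the external process and wait for its exit code.
    // Only works on a machine where the Hive CLI is installed (assumption).
    static int run(String scriptPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(scriptPath))
                .inheritIO()   // forward the script's output to this process's console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // "daily_report.hql" is a hypothetical script name; nothing is executed here.
        System.out.println(String.join(" ", command("daily_report.hql")));
    }
}
```

Within a Camel route the same command would be handed to an exec-style endpoint instead of being started by hand; the trade-off is that results come back as process output rather than as structured messages.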

Page 52:

• Big data paradigm shift

• Challenges for integrating big data

• Choosing Hadoop for big data integration

• Integration with an open source framework

• Integration with an open source suite

• Custom big data connectors

Agenda

Page 53:

Alternatives for systems integration

[Figure: integration alternatives ordered by complexity of integration, from low to high]

• Integration Framework: INTEGRATION (Connectivity, Routing, Transformation)

• Enterprise Service Bus: + Tooling, Monitoring, Support

• Integration Suite: + BUSINESS PROCESS MGT., BIG DATA / MDM, REGISTRY / REPOSITORY, RULES ENGINE, "YOU NAME IT"

Page 54:

Alternatives for systems integration

[Figure: Integration Framework, Enterprise Service Bus and Integration Suite, ordered by complexity of integration from low to high]

Page 55:

More details about ESBs and suites...

http://www.kai-waehner.de/blog/2013/01/23/spoilt-for-choice-how-to-choose-the-right-enterprise-service-bus-esb/

Page 56:

Hadoop Integration with Talend Open Studio

Page 57:

Vision: Democratize big data

Talend Open Studio for Big Data

• Improves efficiency of big data job design with a graphical interface

• Generates Hadoop code and runs transforms inside Hadoop

• Native support for HDFS, Pig, HBase, HCatalog, Sqoop and Hive

• 100% open source under an Apache License

• Standards based

…an open source ecosystem

Page 58:

Vision: Democratize big data

Talend Platform for Big Data

• Builds on Talend Open Studio for Big Data

• Adds data quality, advanced scalability and management functions

• MapReduce massively parallel data processing

• Shared Repository and remote deployment

• Data quality and profiling

• Data cleansing

• Reporting and dashboards

• Commercial support, warranty/IP indemnity under a subscription license

…an open source ecosystem

Page 59:

Talend Open Studio for Big Data

Page 60:

“The advantage of their new system is that they can now look at their data [from their log processing system] in any way they want:

➜ Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.

➜ When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.”

http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

Real world Use case: Clickstream Analysis

Page 61:

Example: A semi-structured log file

Page 62:

One of the original uses of Hadoop at Yahoo was to store and process their massive volume of clickstream data. Now enterprises of all types can use Hadoop to refine and analyze clickstream data. They can then answer business questions such as:

• What is the most efficient path for a site visitor to research a product, and then buy it?

• What products do visitors tend to buy together, and what are they most likely to buy in the future?

• Where should I spend resources on fixing or enhancing the user experience on my website?

Goal: Data visualization can help you optimize your website and convert more visits into sales and revenue.

Potential Uses of Clickstream Data

Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases” http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox

Page 63:

Real world use case: Clickstream Analysis

Log Files (Server 1)

Log Files (Server 2)

Log Files (Server 100)

ETL

Page 64:

Example: ETL Job

„... using Talend’s HDFS and Hive Components”

Page 65:

Example: ETL Job

„... using Talend’s Map Reduce Components*”

* Not available in open source version of Talend Studio

Page 66:

Example: Analysis with Microsoft Excel

We can see that the largest number of page hits in Florida were for clothing, followed by shoes.

Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases” http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox
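The kind of aggregation behind such a chart is a filter followed by a group-and-count. The sketch below shows it in plain Java on a handful of invented click records; the data, the {state, category} record layout and all names are assumptions for illustration and are not taken from the Hortonworks tutorial.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class ClickstreamHits {
    // Count page hits per product category for one state; each click is {state, category}.
    static Map<String, Long> hitsByCategory(List<String[]> clicks, String state) {
        return clicks.stream()
                .filter(c -> c[0].equals(state))
                .collect(Collectors.groupingBy(c -> c[1], TreeMap::new, Collectors.counting()));
    }

    // Categories ordered by descending hit count.
    static List<String> ranking(Map<String, Long> hits) {
        return hits.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented sample data, chosen only to mirror the shape of the example.
        List<String[]> clicks = Arrays.asList(
                new String[] {"FL", "clothing"}, new String[] {"FL", "shoes"},
                new String[] {"FL", "clothing"}, new String[] {"CA", "shoes"},
                new String[] {"FL", "clothing"}, new String[] {"FL", "shoes"},
                new String[] {"FL", "handbags"});
        System.out.println(ranking(hitsByCategory(clicks, "FL"))); // prints [clothing, shoes, handbags]
    }
}
```

On a Hadoop cluster the same question would typically be expressed as a Hive query or a MapReduce job over the refined log data, with the spreadsheet only visualizing the aggregated result.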

Page 67:

Example: Analysis with Microsoft Excel

The chart shows that the majority of men shopping for clothing on our website are between the ages of 22 and 30. With this information, we can optimize our content for this market segment.

Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases” http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox

Page 68:

Example: Analysis with Tableau

Spoilt for Choice Use your preferred BI or Analysis tool!

Page 69:

„Talend Open Studio for Big Data“ in action...

Live demo

Page 70:

• Big data paradigm shift

• Challenges for integrating big data

• Choosing Hadoop for big data integration

• Integration with an open source framework

• Integration with an open source suite

• Custom big data connectors

Agenda

Page 71:

Custom connectors

Easy to realize for all integration alternatives *

• Integration Framework

• Enterprise Service Bus

• Integration Suite

* At least for open source solutions

Page 72:

Custom connectors

You might need a ...

• ... Hive connector for Camel

• ... Impala connector for Talend

• ... custom connector for your internal data format

Page 73:

Live demo (Example: Apache Camel)

Custom connectors in action...

Page 74:

Did you get the key message?

Page 75:

Key messages

You have to care about big data to be competitive in the future!

You have to integrate different sources to get most value out of it!

Big data integration is no (longer) rocket science!

Page 76:

Did you get the key message?

Page 77:

Kai Wähner| Talend @KaiWaehner www.kai-waehner.de/blog Xing / LinkedIn

Thank you for your attention.

Questions?