Page 1: Talend Spark Meetup (files.meetup.com/14077672/Talend Spark Meetup updated.pdf)

Talend – Spark Meetup

Edward Ost

©2016 Talend

Page 2

Talend: A History of Innovation and Growth

Timeline, 2006 to 2016 (revenue growth), with these labels: Data Integration, Master Data Management, Data Quality, Big Data, Application Integration, Hadoop 2.0, Spark & Cloud, Data Preparation

Key Facts

• Founded in 2006

• 550+ employees worldwide

• 7 countries

• 1300+ customers

• 2M+ open source downloads

Page 3

Top Big Data Challenges

Talend directly addresses these challenges.

Source: Gartner, "Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind the Hype," 12 September 2013, G00255160

Page 4

Talend 6.0: Unleashing the Power of Spark with Real-time Big Data Integration

• Talend Real-time Big Data: the first data integration platform on Spark

• Internet of Things: delivers an end-to-end integration platform for IoT

• Continuous Delivery: provides Continuous Delivery data integration with unmatched productivity

• New Insight: easily access master data from Big Data, Mobile, and Cloud Apps using MDM REST APIs

• Smarter, More Secure Data: new data masking and semantic discovery capabilities

Page 5

Talend Remains Ahead of the Curve for Big Data

[Table: supported platform versions by Talend release, covering Talend 5.6.x (Dec 2014), Talend 6 (Sept 2015), Talend 6.1 (Dec 2015), and Talend 6.2 (Jun 2016); platforms are grouped as NoSQL, Hadoop Distros, and Hadoop Cloud, and include BigInsights; * = Tech Preview]

Page 7

Easily Convert MapReduce to Spark

• MapReduce Performance (runs on disk)

• One Click

• Spark Performance (runs in-memory & on disk)

• 5X Faster

Page 8

Information Supply Chain Drivers

Technical Concerns

• Decouple source systems

• Increase agility

• Reduce process latency

• Avoid re-engineering

• At scale

Business Drivers

• Evolving business network

• Data Broker ecosystem

• Transform Data into Information

• Onboarding data sources rapidly

• Accelerate insight

Page 9

Data Vault

Step 1: Establish the Business Keys (Hubs)

Step 2: Establish the relationships between the Business Keys (Links)

Step 3: Establish descriptions around the Business Keys (Satellites)

Step 4: Add standalone components such as Calendars and code/description tables for decoding in Data Marts

Step 5: Tune for query optimization by adding performance tables such as Bridge tables and Point-In-Time structures

Page 10

Simple Data Vault Design Flow - Relational

ACCOUNTS: Account_ID (Pkey), Company_NM, Address_LN1, Address_LN2, City, State, ZipCode, Status_CODE, Is_AUTHORIZED, Is_LOCKED, Created_DT, Modified_DT

USERS: User_ID (Pkey), Account_ID (Fkey), First_NM, Last_NM, Mobile_PH, Gender, Status_CODE, Is_ACTIVE, Created_DT, Modified_DT

• Identify Business Keys

• Identify Attributes

• Establish Linkages

• Control Lineage

• Control History

• De-Normalize
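
The design flow above can be sketched in a few lines of code. This is a minimal illustration, not Talend's implementation: it splits one ACCOUNTS row and one USERS row into hub, link, and satellite records. The MD5 hash keys, the `load_dt` and `record_source` columns, and the names `hash_key` and `to_vault` are all assumptions made for the example; the slides do not specify them.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(*business_keys):
    """Derive a deterministic surrogate key from business keys (assumed convention)."""
    joined = "|".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def to_vault(account, user, record_source="crm", load_dt=None):
    """Split one ACCOUNTS row and one USERS row into Data Vault records."""
    load_dt = load_dt or datetime.now(timezone.utc).isoformat()
    # Control Lineage: every record carries its load time and source (assumed columns).
    lineage = {"load_dt": load_dt, "record_source": record_source}

    account_hk = hash_key(account["Account_ID"])
    user_hk = hash_key(user["User_ID"])

    # Identify Business Keys: one hub row per business key.
    hub_account = {"account_hk": account_hk, "Account_ID": account["Account_ID"], **lineage}
    hub_user = {"user_hk": user_hk, "User_ID": user["User_ID"], **lineage}

    # Establish Linkages: the USERS.Account_ID foreign key becomes a link row.
    link_user_account = {
        "link_hk": hash_key(user["User_ID"], user["Account_ID"]),
        "user_hk": user_hk,
        "account_hk": hash_key(user["Account_ID"]),
        **lineage,
    }

    # Identify Attributes: descriptive columns move to satellites.
    # Control History: load_dt distinguishes successive versions of a satellite row.
    sat_account = {"account_hk": account_hk, **lineage,
                   **{k: v for k, v in account.items() if k != "Account_ID"}}
    sat_user = {"user_hk": user_hk, **lineage,
                **{k: v for k, v in user.items() if k not in ("User_ID", "Account_ID")}}

    return {
        "hub_account": hub_account,
        "hub_user": hub_user,
        "link_user_account": link_user_account,
        "sat_account": sat_account,
        "sat_user": sat_user,
    }
```

Because the keys are derived only from business keys, hubs, links, and satellites can be loaded independently of one another, without looking up surrogate keys generated elsewhere.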

Page 11

Simple Data Vault Design Flow – Big Data

ACCOUNTS: Account_ID (Pkey), Company_NM, Address_LN1, Address_LN2, City, State, ZipCode, Status_CODE, Is_AUTHORIZED, Is_LOCKED, Created_DT, Modified_DT

USERS: User_ID (Pkey), Account_ID (Fkey), First_NM, Last_NM, Mobile_PH, Gender, Status_CODE, Is_ACTIVE, Created_DT, Modified_DT

• Identify Business Keys

• Identify Attributes

• Establish Linkages

• Control Lineage

• Control History

• De-Normalize

• Create PIT & BRIDGE records
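
As a rough illustration of the point-in-time (PIT) step, the sketch below records, for each hub key and snapshot date, which satellite version was current at that date. It covers PIT records only, not bridge records. The function name `build_pit`, the field names, and the use of ISO date strings are assumptions for the example, not something the slides prescribe.

```python
from bisect import bisect_right


def build_pit(sat_load_dates, snapshot_dates):
    """Build point-in-time rows.

    sat_load_dates maps a hub key to the load dates of its satellite rows.
    For each snapshot date, pick the latest load date on or before it,
    or None if the key had no satellite row yet.
    Dates are ISO strings, so string order matches chronological order.
    """
    pit = []
    for hub_key, load_dates in sat_load_dates.items():
        ordered = sorted(load_dates)
        for snap in snapshot_dates:
            idx = bisect_right(ordered, snap)  # count of load dates <= snap
            pit.append({
                "hub_key": hub_key,
                "snapshot_dt": snap,
                "sat_load_dt": ordered[idx - 1] if idx else None,
            })
    return pit
```

A query for a given snapshot can then join to the satellite on an exact `(hub_key, sat_load_dt)` match instead of searching for the newest row each time, which is the tuning benefit a PIT structure is meant to provide.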

Page 12

Spark and Data Vault Design Notes

• Focus on business keys and simplicity in source extracts

• Autonomous extracts enable parallel processing

• Capture and preserve auditable data in raw data vault

• Defer more complex business rules to the business vault

• Consider point-in-time tables for operational data vault
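
The note on autonomous extracts can be illustrated with a small sketch. Because no extract depends on another's output, they can all be submitted at once. A thread pool stands in here for whatever engine actually runs the jobs (it is not Spark), and the extract functions and their sample rows are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_accounts():
    # Hypothetical placeholder: a real job would read from the source system.
    return [{"Account_ID": 1, "Company_NM": "Acme"}]


def extract_users():
    # Hypothetical placeholder, independent of extract_accounts.
    return [{"User_ID": 10, "Account_ID": 1, "First_NM": "Ann"}]


def run_extracts(extracts):
    """Run independent extracts concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(extracts)) as pool:
        futures = {name: pool.submit(fn) for name, fn in extracts.items()}
        return {name: fut.result() for name, fut in futures.items()}


raw = run_extracts({"accounts": extract_accounts, "users": extract_users})
```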

Page 13

Basic Ingest

Page 14

Data Vault – Relational Model

• Extract data and write it to DV-ready CSV files

• Push to S3/RDS

• Use ELT to De-Normalize into Columnar DataMart
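
The ELT de-normalization step can be sketched as follows. The point of ELT is that the join runs inside the target database after loading, rather than in a separate transformation engine. An in-memory SQLite database stands in for the real target here (SQLite is not columnar), and the table names, column names, and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_account (account_hk TEXT PRIMARY KEY, Account_ID INTEGER);
CREATE TABLE sat_account (account_hk TEXT, load_dt TEXT, Company_NM TEXT, Status_CODE TEXT);

INSERT INTO hub_account VALUES ('a1', 1);
INSERT INTO sat_account VALUES ('a1', '2016-01-01', 'Acme', 'NEW');
INSERT INTO sat_account VALUES ('a1', '2016-03-01', 'Acme Corp', 'ACTIVE');

-- ELT step: build one wide data mart row per account inside the database,
-- using the most recent satellite version for each hub key.
CREATE TABLE dm_account AS
SELECT h.Account_ID, s.Company_NM, s.Status_CODE, s.load_dt AS as_of_dt
FROM hub_account h
JOIN sat_account s ON s.account_hk = h.account_hk
WHERE s.load_dt = (
    SELECT MAX(load_dt) FROM sat_account WHERE account_hk = h.account_hk
);
""")

rows = conn.execute("SELECT * FROM dm_account").fetchall()
```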

Page 15

Data Vault – Big Data Analytics

• Sqoop data directly into S3/DV (Redshift)

• Use ELT to De-Normalize into Columnar DataMart

Page 16

Data Vault with Spark – Big Data Real Time

• Sqoop data directly into S3/DV (Hive)

• Transform to Data Vault with Spark Batch

• Operational Data Vault with Spark Streaming

• ELT to De-Normalize into Columnar DataMart

Page 17

From Data to Information

• OLTP

• Systems of Engagement

• Data Warehouse

• Analytics

• BI

• Supply Chain

• Collaboration

• Self-Service

• On-Demand

Page 18

Lambda Architecture

[Diagram labels. Sources: IOT, NoSQL, Web Logs, App. Events, Systems of records, ERP, DBMS. Processing: Extract, Load, Transform, Ingest, Update. Layers: Streaming layer, Batch layer. Consumers: Reporting, Data Mining, MDD/OLAP, Dashboarding, Data Discovery, API, Analytics, Applications. Loop: Learn, Act.]
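
To make the interplay of the batch and streaming layers concrete, here is a minimal sketch of the general Lambda pattern, not of any Talend component. The batch view is recomputed from the full event history, the streaming view covers only events that arrived since the last batch run, and a query combines the two. The function names and the `account_id` event field are hypothetical.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recompute per-account counts from the full event history."""
    return Counter(e["account_id"] for e in master_events)


def speed_view(recent_events):
    """Streaming layer: count only events not yet absorbed by a batch run."""
    return Counter(e["account_id"] for e in recent_events)


def query(account_id, batch, speed):
    """Serving step: add both views so the answer is complete and current."""
    return batch[account_id] + speed[account_id]
```

After each batch run, the events it absorbed are dropped from the streaming view, so the two views never count the same event twice.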

Page 19

Talend Big Data Resources

• Discover the Talend Big Data Jumpstart Sandbox

• Starting the Talend Big Data Sandbox

• Big Data Sandbox Forum

• Get It Right, in Real Time with SPARK

• Using AWS EMR, Redshift, and Spark to Power Your Analytics

• TalendForge Big Data Forum

• Data Vault Basics

• Data Vault Series – Agile Modeling not an Option Anymore

Page 20

Questions

Edward Ost

Channels Technical Director

[email protected]

301-666-1039