(BDT316) Offloading ETL to Amazon Elastic MapReduce


Upload: amazon-web-services

Post on 16-Apr-2017


TRANSCRIPT

Page 1: (BDT316) Offloading ETL to Amazon Elastic MapReduce

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Bharat Rangan, Sr. Manager Information Systems, Amgen

Kerby Johnson, Specialist IS Programmer Analyst, Amgen

October 2015

BDT316

Offloading ETL to Amazon Elastic MapReduce

Page 2: (BDT316) Offloading ETL to Amazon Elastic MapReduce

What to Expect from the Session

• Benefits of ETL offloading in Amazon EMR as an entry point into using big data technologies

• Benefits and challenges of using Amazon EMR vs. expanding on-premises ETL and reporting technologies

• How to architect an ETL offload solution using Amazon S3, Amazon EMR, and Impala

• Leveraging Amazon Redshift for a reporting database

• Next steps and future expansion of big data

• Prereqs: Basic knowledge of AWS: what Amazon Redshift, Amazon EC2, Amazon EMR, and Amazon S3 are; how VPCs work; basic big data terminology

Page 3: (BDT316) Offloading ETL to Amazon Elastic MapReduce

About Amgen

Amgen is committed to unlocking the potential of biology for patients suffering from serious illnesses by discovering, developing, manufacturing and delivering innovative human therapeutics.

A biotechnology pioneer since 1980, Amgen has grown to be one of the world's leading independent biotechnology companies, has reached millions of patients around the world and is developing a pipeline of medicines with breakaway potential.

Page 4: (BDT316) Offloading ETL to Amazon Elastic MapReduce

18 Months at a Glance

Timeline: 2014, 2015, 2016

• Successful Commercial Analytics for 5 Business Units

• MPP database / Informatica

• Many visualization tools

• Amgen IS evaluating Hadoop, AWS, big data tech. What is a good use case?

New business need:

• Amgen plans launch of Cardiovascular products

• 2 more Business Units

• Need to scale analytics platform by 7 – 10x

AWS / EMR / EC2 to the rescue:

• Evaluate options and decide to offload ETL to Amazon EMR

• Design, develop and deploy application in 8 months

• 50% less time and ~30% cheaper than MPP DB

• Go-Live

What next:

• Offloading ETL for other business units (almost done)

• Running reports on Amazon Redshift (successful pilot)

• Enterprise Data Lake

Page 5: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Business Context

Data Areas Integrated

• Physician / Hospital Sales

• Sales Force Activity

• Payer Coverage of claims

• Outbound sales and inventory

• Customer Master

• Channel Marketing Data

Business Deliverables

• 200+ Online Reports

• Mobile Reports

• 300+ Metrics

• 25+ data sources integrated

• Analytics Data Warehouse

Business Capabilities Supported

• Sales Force Reporting

• Customer Targeting

• Incentive Compensation

• Analytics

• Marketing Analytics

Thousands of sales reps across 15 sales forces use the commercial reporting platform every day for business critical analytics.

Page 6: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Architecture Before Amazon EMR

Technology

• Database: Teradata (72 AMPs / 15 TB)

• Processing: DB stored procedures

• Orchestration: Informatica and Unix

• Reporting: Cognos, Spotfire, others

[Diagram: Amgen internal data and external sales data feed a Teradata platform (Staging DB, Integration & Business Rules, Data Quality, Metrics Calculation, Core DB, Reporting DB), which serves Online Reports, iPad Reports, and Analytics Apps]

Process

• Frequency: Weekly reporting

• Volume (input data): 130 MM rows for 4 BUs

• Processing time: 38 hours for 4 business units

Page 7: (BDT316) Offloading ETL to Amazon Elastic MapReduce

ETL Off-Loaded to Amazon EMR for New Business Unit

[Diagram: the existing Teradata flow is shown unchanged (Amgen internal data and external sales data; Staging DB, Integration & Business Rules, Data Quality, Metrics Calculation, Core DB, Reporting DB; Online Reports, iPad Reports, Analytics Apps). New sales data is processed on Amazon S3 and Amazon EMR (Raw data; Data Quality; Integrations, Rules; Staging & Core DB; Metrics Calculation; Reporting DB)]

New Process

• Volume (input data): 790M rows for new BU (6x)

• Processing time: 40-node Amazon EMR cluster for 8 hours to process the data

• Time to reports: 50% reduction

• Other benefits: no change to business user; scalable and on-demand; no resource contention
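The volume and runtime figures on the before and after slides can be compared with a little arithmetic. The sketch below is only an illustration: the row counts, hours, and node count come from the slides, while the derived throughput numbers are rough, since the two workloads cover different business units and are not the same job.

```python
# Figures quoted on the slides; everything derived below is simple arithmetic.
OLD_ROWS = 130_000_000   # weekly input for 4 BUs on the MPP database
OLD_HOURS = 38           # processing time for those 4 BUs
NEW_ROWS = 790_000_000   # weekly input for the new BU on Amazon EMR
NEW_HOURS = 8            # runtime of the EMR cluster
NEW_NODES = 40           # size of the EMR cluster


def rows_per_hour(rows: int, hours: float) -> float:
    """Average rows processed per wall-clock hour."""
    return rows / hours


volume_ratio = NEW_ROWS / OLD_ROWS
old_rate = rows_per_hour(OLD_ROWS, OLD_HOURS)
new_rate = rows_per_hour(NEW_ROWS, NEW_HOURS)
node_hours = NEW_NODES * NEW_HOURS

print(f"Input volume ratio:   {volume_ratio:.1f}x")
print(f"Old rows per hour:    {old_rate:,.0f}")
print(f"New rows per hour:    {new_rate:,.0f}")
print(f"EMR node-hours / run: {node_hours}")
```

The node-hours figure is the quantity that drives on-demand cost, which is why the later slides on cluster sizing focus on it.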

Page 8: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Options Considered

Expand MPP

Pros:

• Known working solution for business critical project

• Well understood timelines

Cons:

• Significant capital expense

• Additional workload on busy MPP box

• Future roadmap concerns

Use AWS for Reporting and ETL

Pros:

• Most scalable solution

• Lower infrastructure cost

• Full cloud commitment

Cons:

• Longer timelines

• Serial project execution

• Full cloud commitment

• Lack of cloud/big data expertise

Use AWS for ETL Only

Pros:

• Critical to scale ETL

• Lower infrastructure cost

• Lower risk to timeline

Cons:

• Technology introduction for business critical project

• Lack of cloud/big data expertise

Page 9: (BDT316) Offloading ETL to Amazon Elastic MapReduce

AWS Account Overview

[Diagram: S3; Dev, Test, Prod]

Page 10: (BDT316) Offloading ETL to Amazon Elastic MapReduce

High Level Architecture Overview

[Diagram. Zones: Physical Amgen Network, AWS Direct Connected VPC, AWS Non-Direct Connected VPC. Components: Source Data, Amgen Controller, App Master, App Launcher, App Logger, Storage, On-demand Cluster, Reporting DB, Reports and Apps, Current Processing and DB. Group labels: Orchestration and Logging; On-demand S3 / Compute]

Page 11: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Control and Process Flow

[Diagram. Zones: Physical Amgen Network, AWS Direct Connect VPC, Unconnected VPC. Components: Source Data, Data Mount, Data Landing, Amgen Controller, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps]

1. Compresses data; copies to S3; begins orchestration

2. Launch EMR cluster; deploy code / schema from S3 to EMR

3. Load data into tables; execute ETL scripts; push final data to S3; shuts down cluster

4. Retrieve data from S3; load data to Reporting DB
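The four numbered steps on this slide can be sketched as a small driver. This is a minimal illustration, not the presenters' implementation: `run_pipeline`, `execute_etl`, and the log messages are hypothetical names, every step is a stand-in that only records what would happen, and wrapping the cluster work in `try`/`finally` is an assumption made here so that the cluster is shut down even when an ETL script fails.

```python
from typing import Callable, List


def run_pipeline(log: Callable[[str], None], execute_etl: Callable[[], None]) -> None:
    """Walk through the four steps; each log call stands in for real work."""
    # Step 1: on the on-premises controller
    log("compress data")
    log("copy to S3")
    # Step 2: bring up the on-demand cluster
    log("launch EMR cluster")
    try:
        log("deploy code / schema from S3 to EMR")
        # Step 3: run the ETL on the cluster
        log("load data into tables")
        execute_etl()
        log("push final data to S3")
    finally:
        # Assumption: always shut the cluster down to stop paying for it.
        log("shut down cluster")
    # Step 4: back on-premises
    log("retrieve data from S3")
    log("load data to Reporting DB")


def failing_etl() -> None:
    raise RuntimeError("ETL script failed")


ok_events: List[str] = []
run_pipeline(ok_events.append, lambda: None)

failed_events: List[str] = []
try:
    run_pipeline(failed_events.append, failing_etl)
except RuntimeError:
    pass  # the failure is expected in this demonstration

print("successful run ends with:", ok_events[-1])
print("failed run ends with:   ", failed_events[-1])
```

In the failing run the last recorded event is the cluster shutdown, and step 4 never starts, so no partial data reaches the reporting database.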

Page 12: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Data Flow

[Diagram. Zones: Physical Amgen Network, AWS Direct Connect VPC, Unconnected VPC. Components: Source Data, Data Mount, Data Landing, Amgen Controller, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps. Legend: Input Data, Processed Data. Data path labels: gzip from source, bzip2, S3put, Compressed Streaming, gzip, S3get, gzip, fastload]
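The data flow slide labels both bzip2 and gzip on the transfer path, and the lessons learned later in the deck caution against assuming that compressed transfer is always better. The sketch below, using only the Python standard library and synthetic sample rows (an assumption, not Amgen data), shows one way to measure the size trade-off between the two codecs before committing to one. A commonly cited reason for preferring bzip2 for Hadoop input is that bzip2 files can be split across mappers while gzip files cannot; the slides do not state why each codec was chosen.

```python
import bz2
import gzip


def make_sample_rows(n: int) -> bytes:
    """Build synthetic CSV rows; real sales data will compress differently."""
    lines = [f"{i},{i % 97},region_{i % 5},{(i * 37) % 1000}\n" for i in range(n)]
    return "".join(lines).encode("utf-8")


raw = make_sample_rows(50_000)
gz = gzip.compress(raw)
bz = bz2.compress(raw)

sizes = {"raw": len(raw), "gzip": len(gz), "bzip2": len(bz)}
for name, size in sizes.items():
    print(f"{name:>6}: {size:>9,d} bytes ({size / len(raw):.1%} of raw)")

# Both codecs are lossless: decompressing returns the original bytes.
restored_gz = gzip.decompress(gz)
restored_bz = bz2.decompress(bz)
```

Size is only half of the picture. Compression and decompression time also count against the end-to-end window, so timing both steps on representative files is worth doing as well.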

Page 13: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Technology Landscape

[Diagram: the architecture components (Source Data, Amgen Controller, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps) across the Physical Amgen Network, AWS Direct Connect VPC, and Unconnected VPC, annotated with technologies: Unix, PC, three EC2 instances, AWS CLI, EMR, S3, Hive, Pig, Impala]

Page 14: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Technology Usage - Type

Cloud: Orchestration, Processing, Hive Metastore, Logging, Persistent Storage

On-premises: Orchestration, Reporting, Persistent Storage

Technologies shown: Pig, Impala, Amazon S3

Page 15: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Cluster Optimization – Performance vs Cost

Component | Planned configuration | Planned processing time | Test configuration | Test runtime | Cost Savings from Baseline
Dataset #1 Reporting | r3.4xlarge 25 node | 7 hrs 21 mins | r3.2xlarge 40 node | 5 hrs 35 mins | 38%
Dataset #1 Reporting | | | r3.2xlarge 60 node | 4 hrs 42 mins | 19%
Dataset #1 Reporting | | | r3.2xlarge 80 node | 4 hrs 23 mins | -7%
Dataset #1 Reporting | | | r3.4xlarge 25 node | 5 hrs 13 mins | 21%
Dataset #2 Reporting | r3.2xlarge 40 node | 5 hrs 45 mins | r3.2xlarge 40 node | 4 hrs 52 mins | 11%
Dataset #2 Reporting | | | r3.2xlarge 60 node | 4 hrs 25 mins | -30%
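The pattern in the Dataset #1 rows, where adding nodes shortens the runtime but eventually costs more, can be illustrated by comparing normalized node-hours. The sketch below is an approximation under stated assumptions: an r3.4xlarge is weighted as two r3.2xlarge units because it has twice the vCPUs and memory, and no pricing is applied. It does not reproduce the cost savings column on the slide, which presumably reflects billing details that the deck does not show.

```python
from dataclasses import dataclass

# Assumption: weight instance types by relative size, not by actual price.
SIZE_UNITS = {"r3.2xlarge": 1, "r3.4xlarge": 2}


@dataclass(frozen=True)
class Run:
    instance_type: str
    nodes: int
    hours: int
    minutes: int

    @property
    def runtime_hours(self) -> float:
        return self.hours + self.minutes / 60

    @property
    def unit_hours(self) -> float:
        """Cluster size in r3.2xlarge-equivalents times runtime."""
        return SIZE_UNITS[self.instance_type] * self.nodes * self.runtime_hours


def reduction_pct(baseline: Run, run: Run) -> float:
    """Percent fewer unit-hours than the baseline (negative means more)."""
    return (baseline.unit_hours - run.unit_hours) / baseline.unit_hours * 100


# Dataset #1 figures as listed on the slide.
planned = Run("r3.4xlarge", 25, 7, 21)
tests = [
    Run("r3.2xlarge", 40, 5, 35),
    Run("r3.2xlarge", 60, 4, 42),
    Run("r3.2xlarge", 80, 4, 23),
    Run("r3.4xlarge", 25, 5, 13),
]

reductions = [reduction_pct(planned, t) for t in tests]
for t, r in zip(tests, reductions):
    print(f"{t.instance_type} x{t.nodes}: {t.unit_hours:6.1f} unit-hours, {r:5.1f}% below plan")
```

Going from 40 to 80 r3.2xlarge nodes cuts the runtime by a little over an hour while consuming noticeably more unit-hours, which is the diminishing return the slide's title points at.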

Page 16: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Volume vs Processing Time

Data Set | Volume | Runtime
Set #1 - Before | ~110M Records | 2 hrs 45 min
Set #1 - After | ~1.55B Records | 5 hrs 45 min
Set #1 - Delta | Increase by ~1,300% | Increase by ~110%
Set #2 - Before | ~900M Records | 3 hrs 45 min
Set #2 - After | ~1.05B Records | 4 hrs 25 min
Set #2 - Delta | Increase by ~16% | Increase by ~15%
Set #3 - Before | ~130M Records | 2 hrs 45 min
Set #3 - After | ~1.05B Records | 7 hrs 20 min
Set #3 - Delta | Increase by ~750% | Increase by ~160%
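The delta rows can be recomputed from the before and after figures. The sketch below takes the rounded record counts and runtimes from the table, so its results differ slightly from the approximate percentages on the slide. The Set #2 "after" volume is taken as ~1.05B records, the value consistent with the ~16% increase reported for that set.

```python
from typing import Dict, Tuple


def pct_increase(before: float, after: float) -> float:
    """Percentage increase from before to after."""
    return (after - before) / before * 100


def minutes(hours: int, mins: int) -> int:
    return hours * 60 + mins


# (records before, records after, runtime before, runtime after)
DATASETS: Dict[str, Tuple[float, float, int, int]] = {
    "Set #1": (110e6, 1.55e9, minutes(2, 45), minutes(5, 45)),
    "Set #2": (900e6, 1.05e9, minutes(3, 45), minutes(4, 25)),
    "Set #3": (130e6, 1.05e9, minutes(2, 45), minutes(7, 20)),
}

deltas = {
    name: (pct_increase(v0, v1), pct_increase(t0, t1))
    for name, (v0, v1, t0, t1) in DATASETS.items()
}

for name, (volume_pct, runtime_pct) in deltas.items():
    print(f"{name}: volume +{volume_pct:,.0f}%, runtime +{runtime_pct:.0f}%")
```

For Sets #1 and #3 the runtime grows far more slowly than the volume, which is the scaling behavior the slide is highlighting.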

Page 17: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Amazon EMR Lessons Learned

• Make EVERYTHING Configurable

• Design for an easy upgrade path to new AMIs

  • AMI from August was obsolete in December

• Check your Amazon EC2 account limits before production deployment

• Build restart points throughout process – avoid paying for rework

• Be wary of uneven data distribution during processing

• Maintain a systemic view of the ecosystem for optimization

• If transfer to Amazon S3 is slow, check for data loss prevention (DLP) proxies

• Don’t assume transferring compressed data is always better

• Build with cost in mind

• Develop big data expertise in a controlled project
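The advice to build restart points can be sketched as a stage runner that records each completed stage and skips it on the next attempt. This is a minimal illustration, not the presenters' design: the stage names, `run_with_restart_points`, and the JSON checkpoint file are all hypothetical, and with an on-demand cluster the checkpoint and intermediate data would need to live somewhere that outlives the cluster, such as Amazon S3.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_with_restart_points(
    stages: Dict[str, Callable[[], None]], checkpoint: Path
) -> List[str]:
    """Run stages in order, skipping any already recorded in the checkpoint."""
    done: List[str] = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    executed: List[str] = []
    for name, stage in stages.items():
        if name in done:
            continue  # finished in an earlier attempt; do not pay for it again
        stage()
        done.append(name)
        checkpoint.write_text(json.dumps(done))  # persist after every stage
        executed.append(name)
    return executed


attempts = {"metrics": 0}


def flaky_metrics() -> None:
    """Fails on the first attempt only, to simulate a mid-run failure."""
    attempts["metrics"] += 1
    if attempts["metrics"] == 1:
        raise RuntimeError("simulated failure")


stages: Dict[str, Callable[[], None]] = {
    "load": lambda: None,
    "integrate": lambda: None,
    "metrics": flaky_metrics,
    "export": lambda: None,
}

with tempfile.TemporaryDirectory() as tmp:
    checkpoint_file = Path(tmp) / "checkpoint.json"
    try:
        run_with_restart_points(stages, checkpoint_file)
    except RuntimeError:
        pass  # first attempt stops at the failing stage
    completed_after_failure = json.loads(checkpoint_file.read_text())
    rerun = run_with_restart_points(stages, checkpoint_file)

print("completed before failure:", completed_after_failure)
print("executed on rerun:       ", rerun)
```

The rerun executes only the failed stage and the ones after it, which is the "avoid paying for rework" point from the list above.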

Page 18: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Reporting on Amazon Redshift

Physical Amgen Network

Amgen

Controller

Source Data App

Master

Storage

On-

demand

cluster

Reporting

DB

Reports and Apps

Data

Mount

Data

Landing

App

Launcher

App

Logger

Input Data

Processed Data

Amazon

Redshift

Report DB

Reporting

Amazon EMR

AWS Direct

Connect VPC

Unconnected VPC

Page 19: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Amazon Redshift Lessons Learned

Design Principle: Cognos reports should work for both Amazon Redshift and MPP DB with minimal change

Performance: Report execution time dropped from 20 seconds to 3 seconds

Technical Differences:

• High performance out of the box with little tuning

• Designing on Amazon Redshift

  • Split large tables into multiple tables and union

  • Load into an empty table and then change view definition to union additional table

• Amazon Redshift uses ~3x space of table to update sort keys of indexes (vacuum)

• Amazon Redshift limited to 50 concurrent user-defined queries

• Moderate rewrite effort: ~6hrs/report due to syntax and function differences

  • Amazon Redshift case-sensitive for data

  • No NullifZero function

  • Rank function orders differently
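Two of the points above lend themselves to a small demonstration: exposing split tables through a view that unions them, and replacing NullifZero with the standard SQL expression `NULLIF(x, 0)`. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in so that it runs anywhere. It is not Amazon Redshift, the table and column names are hypothetical, and the view is redefined with `DROP VIEW` followed by `CREATE VIEW` because that is what SQLite supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two physical tables standing in for one large table split by period.
cur.execute("CREATE TABLE sales_2014 (rep TEXT, units INTEGER, calls INTEGER)")
cur.execute("CREATE TABLE sales_2015 (rep TEXT, units INTEGER, calls INTEGER)")
cur.executemany(
    "INSERT INTO sales_2014 VALUES (?, ?, ?)", [("a", 10, 5), ("b", 4, 0)]
)

# Reports query the view, which initially covers only the loaded table.
cur.execute("CREATE VIEW sales AS SELECT * FROM sales_2014")

# Load into the empty table, then change the view definition to union it in.
cur.executemany("INSERT INTO sales_2015 VALUES (?, ?, ?)", [("a", 12, 6)])
cur.execute("DROP VIEW sales")
cur.execute(
    "CREATE VIEW sales AS "
    "SELECT * FROM sales_2014 UNION ALL SELECT * FROM sales_2015"
)

# NULLIF(calls, 0) returns NULL when calls is 0, avoiding division by zero.
rows = cur.execute(
    "SELECT rep, units * 1.0 / NULLIF(calls, 0) AS units_per_call "
    "FROM sales ORDER BY rep, units"
).fetchall()
conn.close()

for rep, units_per_call in rows:
    print(rep, units_per_call)
```

Because reports read only the view, new data becomes visible in one step when the view is redefined, and the report SQL itself does not change.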

Page 20: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Considerations for Cloud and Big Data Tech

Hive Impala

Sizing

Security

Troubleshooting

Know Your Data!

Page 21: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Considerations for Cloud and Big Data Tech

• Plan to learn during the project

  • Vendors, partners, staff – all are learning new things each day

  • Managers need a strong understanding of how things work

• Manage technology risks by targeted POCs

  • We had ~25-30 different tech architecture options we wanted to solve before the project

• Partner with enterprise infrastructure – cloud does not mean no control

  • Integration with enterprise networks, security, VPN is not trivial

  • Billing, cost allocation, controlling who creates infrastructure – challenges better solved before you have 100-user groups

• New mindset for cost management

  • Daily incremental spend instead of large, periodic capital investment

  • Tools for visibility, forecasting, and tracking are helpful

  • Have targets to improve efficiency constantly

Page 22: (BDT316) Offloading ETL to Amazon Elastic MapReduce

What’s Next for Amgen

• Move remaining Business Unit ETL Processing from MPP DB to EMR

• Move reporting database from MPP DB to Amazon Redshift for all BUs

• Optimize costs

• Expand to enterprise data lake and future AWS projects

Enterprise data lake design:

• Hybrid on-premises/cloud model

• Started cluster development and architecture design in AWS

• Connected VPC: persistence and security

• Amazon EBS for HDFS storage: resizing and stopping nodes

• Seamless integration between on-premises and AWS clusters

Page 23: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Remember to complete your evaluations!

Page 24: (BDT316) Offloading ETL to Amazon Elastic MapReduce

Thank you!