(BDT316) Offloading ETL to Amazon Elastic MapReduce
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Bharat Rangan, Sr. Manager Information Systems, Amgen
Kerby Johnson, Specialist IS Programmer Analyst, Amgen
October 2015
BDT316
Offloading ETL to Amazon Elastic MapReduce
What to Expect from the Session
• Benefits of ETL offloading in Amazon EMR as an entry point into
using big data technologies
• Benefits and challenges of using Amazon EMR vs. expanding
on-premises ETL and reporting technologies
• How to architect an ETL offload solution using Amazon S3,
Amazon EMR, and Impala
• Leveraging Amazon Redshift for a reporting database
• Next steps and future expansion of big data
• Prereqs: Basic knowledge of AWS: What are Amazon Redshift,
Amazon EC2, Amazon EMR, Amazon S3; how VPCs work;
basic big data terminology
About Amgen
Amgen is committed to unlocking the potential of
biology for patients suffering from serious illnesses by
discovering, developing, manufacturing and delivering
innovative human therapeutics.
A biotechnology pioneer since 1980, Amgen has
grown to be one of the world's leading independent
biotechnology companies, has reached millions of
patients around the world and is developing a pipeline
of medicines with breakaway potential.
18 Months at a Glance
[Timeline: 2014, 2015, 2016]
• Successful Commercial Analytics for 5 business units: MPP database / Informatica, many visualization tools
• Amgen IS evaluating Hadoop, AWS, big data tech: what is a good use case?
• New business need: Amgen plans launch of cardiovascular products (2 more business units); need to scale analytics platform by 7–10x
• AWS / EMR / EC2 to the rescue: evaluate options and decide to off-load ETL to Amazon EMR
• Design, develop and deploy application in 8 months
• Go-live: 50% less time and ~30% cheaper than MPP DB
• What next:
  • Offloading ETL for other business units (almost done)
  • Running reports on Amazon Redshift (successful pilot)
  • Enterprise Data Lake
Business Context
Data Areas Integrated
• Physician / Hospital Sales
• Sales Force Activity
• Payer Coverage of claims
• Outbound sales and inventory
• Customer Master
• Channel Marketing Data
Business Deliverables
• 200+ Online Reports
• Mobile Reports
• 300+ Metrics
• 25+ data sources integrated
• Analytics Data Warehouse
Business Capabilities Supported
• Sales Force Reporting
• Customer Targeting
• Incentive Compensation
• Analytics
• Marketing Analytics
Thousands of sales reps across 15 sales forces use the
commercial reporting platform every day for business critical
analytics
Architecture Before Amazon EMR
Technology
• Database: Teradata (72 AMPs / 15 TB)
• Processing: DB stored procedures
• Orchestration: Informatica & Unix
• Reporting: Cognos, Spotfire, others
Process
• Frequency: weekly reporting
• Volume (input data): 130 MM rows for 4 BUs
• Processing time: 38 hours for 4 business units
[Diagram: Amgen internal data and external sales data feed Teradata (Staging DB, Integration & Business Rules, Data Quality, Core DB, Metrics Calculation, Reporting DB), which serves Online Reports, iPad Reports, and Analytics Apps]
ETL Off-Loaded to Amazon EMR for New Business Unit
[Diagram, existing flow: Amgen internal data and external sales data feed Teradata (Staging DB, Integration & Business Rules, Data Quality, Core DB, Metrics Calculation, Reporting DB), which serves Online Reports, iPad Reports, and Analytics Apps]
[Diagram, new flow on S3 and EMR: New sales data, Raw data, Data Quality, Integrations and Rules, Staging & Core DB, Metrics Calculation, Reporting DB]
New Process
• Volume (input data): 790M rows for new BU (6x)
• Processing time: 40-node Amazon EMR cluster for 8 hours to process the data
• Time to reports: 50% reduction
• Other benefits:
  • No change to business user
  • Scalable and on-demand
  • No resource contention
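As a rough back-of-the-envelope reading of the figures on the "before" and "new process" slides, the arithmetic can be sketched as below. The inputs are the slide figures; the comparison itself is a naive illustration, since the two runs cover different business units and workloads.

```python
# Figures quoted on the slides; the comparison is illustrative only.
old_rows, old_hours = 130e6, 38   # Teradata run for 4 business units
new_rows, new_hours = 790e6, 8    # EMR run for the new business unit
nodes = 40                        # EMR cluster size

volume_ratio = new_rows / old_rows   # how much more input data
old_rate = old_rows / old_hours      # rows per hour before
new_rate = new_rows / new_hours      # rows per hour on EMR
node_hours = nodes * new_hours       # cluster capacity used per run

print(f"volume ratio:     {volume_ratio:.1f}x")
print(f"rows/hour before: {old_rate / 1e6:.1f}M")
print(f"rows/hour on EMR: {new_rate / 1e6:.1f}M")
print(f"node-hours/run:   {node_hours}")
```

Node-hours per run is the quantity that drives on-demand cost, which is why the later cluster-optimization tests vary node count against runtime.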
Options Considered
Expand MPP
Pros:
• Known working solution for business critical project
• Well understood timelines
Cons:
• Significant capital expense
• Additional workload on busy MPP box
• Future roadmap concerns
Use AWS for Reporting and ETL
Pros:
• Most scalable solution
• Lower infrastructure cost
• Full cloud commitment
Cons:
• Longer timelines
• Serial project execution
• Full cloud commitment
• Lack of cloud/big data expertise
Use AWS for ETL Only
Pros:
• Critical to scale ETL
• Lower infrastructure cost
• Lower risk to timeline
Cons:
• Technology introduction for business critical project
• Lack of cloud/big data expertise
AWS Account Overview
[Diagram labels: S3; Dev, Test, Prod; On-demand S3 / Compute; Orchestration and Logging]
High Level Architecture Overview
[Diagram zones: Physical Amgen Network; AWS Direct Connected VPC; AWS Non-Direct Connected VPC]
[Diagram components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand Cluster, Reporting DB, Reports and Apps, Current Processing and DB]
Control and Process Flow
[Diagram zones: Physical Amgen Network; AWS Direct Connect VPC; Unconnected VPC]
[Diagram components: Amgen Controller, Source Data, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps, Data Mount, Data Landing]
1. Compresses data
   Copies to S3
   Begins orchestration
2. Launch EMR cluster
   Deploy code / schema from S3 to EMR
3. Load data into tables
   Execute ETL scripts
   Push final data to S3
   Shuts down cluster
4. Retrieve data from S3
   Load data to Reporting DB
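The four numbered stages above can be sketched as an ordered list of steps run by a small controller. This is only an illustration of the sequencing and logging idea: the slides do not show the actual controller code, and every function and step name here is hypothetical, with stubs standing in for the real S3, EMR, and database actions.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]

def run_flow(steps: List[Step], log: Callable[[str], None] = print) -> List[str]:
    """Run steps in order, logging each one; stop at the first failure."""
    completed: List[str] = []
    for name, action in steps:
        log(f"start {name}")
        try:
            action()
        except Exception as exc:
            log(f"failed {name}: {exc}")
            break
        completed.append(name)
        log(f"done {name}")
    return completed

# Hypothetical stand-ins for the real actions; each only records that it ran.
events: List[str] = []

def stub(label: str) -> Callable[[], None]:
    return lambda: events.append(label)

flow: List[Step] = [
    ("stage_input", stub("1: compress data, copy to S3, begin orchestration")),
    ("start_cluster", stub("2: launch EMR cluster, deploy code/schema from S3")),
    ("run_etl", stub("3: load tables, run ETL, push to S3, shut down cluster")),
    ("publish", stub("4: retrieve from S3, load reporting DB")),
]

done = run_flow(flow)
print(done)
```

Stopping at the first failed step keeps later stages from running on incomplete data; the returned list of completed steps is what a logger or a restart mechanism would build on.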
Data Flow
[Diagram zones: Physical Amgen Network; AWS Direct Connect VPC; Unconnected VPC]
[Diagram components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps, Data Mount, Data Landing]
[Legend: Input Data; Processed Data]
[Transfer labels: gzip from source; S3put bzip2; Compressed Streaming; S3get gzip; gzip; fastload]
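The data-flow diagram labels the upload path "S3put bzip2" and "Compressed Streaming". How this was implemented is not shown in the slides; the sketch below only illustrates the general idea of compressing in chunks while copying, using Python's standard bz2 module. The in-memory buffers are stand-ins for the source file and the upload stream, and the function name is hypothetical.

```python
import bz2
import io

def compress_stream(src, dst, chunk_size=64 * 1024):
    """Copy src to dst through bzip2 without holding the whole input in memory."""
    compressor = bz2.BZ2Compressor()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # compress() may return an empty bytes object while it buffers input
        dst.write(compressor.compress(chunk))
    dst.write(compressor.flush())  # emit whatever is still buffered

# Made-up sample data standing in for a delimited source extract.
raw = b"id|product|units\n" + b"1|example|10\n" * 50_000
src, dst = io.BytesIO(raw), io.BytesIO()
compress_stream(src, dst)
packed = dst.getvalue()

print(len(raw), "->", len(packed), "bytes")
```

Whether compression pays off depends on the data and on the network path, which is the point of the later lesson about not assuming compressed transfers are always better.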
Technology Landscape
[Diagram zones: Physical Amgen Network; AWS Direct Connect VPC; Unconnected VPC]
[Diagram components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand cluster, Reporting DB, Reports and Apps]
[Technology labels: Unix, PC, EC2 Instance (x3), EMR, S3, Pig, Impala, Hive, AWS CLI]
Technology Usage - Type
[Diagram columns: Cloud; On-premise]
[Diagram labels, in extracted order: Orchestration; Processing; Hive Metastore; Logging; Persistent Storage; Pig; Impala; Orchestration; Reporting; Persistent Storage; Amazon S3]
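Cluster Optimization – Performance vs Cost
Component | Planned configuration | Planned processing time | Test configuration | Test runtime | Cost savings from baseline
Dataset #1 Reporting | r3.4xlarge 25 node | 7 hrs 21 mins | r3.2xlarge 40 node | 5 hrs 35 mins | 38%
Dataset #1 Reporting | r3.4xlarge 25 node | 7 hrs 21 mins | r3.2xlarge 60 node | 4 hrs 42 mins | 19%
Dataset #1 Reporting | r3.4xlarge 25 node | 7 hrs 21 mins | r3.2xlarge 80 node | 4 hrs 23 mins | -7%
Dataset #1 Reporting | r3.4xlarge 25 node | 7 hrs 21 mins | r3.4xlarge 25 node | 5 hrs 13 mins | 21%
Dataset #2 Reporting | r3.2xlarge 40 node | 5 hrs 45 mins | r3.2xlarge 40 node | 4 hrs 52 mins | 11%
Dataset #2 Reporting | r3.2xlarge 40 node | 5 hrs 45 mins | r3.2xlarge 60 node | 4 hrs 25 mins | -30%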
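The trade-off in the Dataset #1 tests can be approximated as nodes x runtime x price per node-hour. The sketch below is a simplified model, not the calculation behind the slide: the 2x price ratio between r3.4xlarge and r3.2xlarge is an assumption, billing granularity and EMR fees are ignored, and the percentages it prints will not match the slide's exactly. It does show the same pattern of diminishing returns as nodes are added.

```python
# Assumed relative price per node-hour (r3.2xlarge = 1.0). The 2x ratio for
# r3.4xlarge is an assumption of this sketch, not a figure from the slides.
PRICE = {"r3.2xlarge": 1.0, "r3.4xlarge": 2.0}

def run_cost(instance, nodes, hours, minutes):
    """Relative cost of one run: nodes x runtime x relative price."""
    return nodes * (hours + minutes / 60) * PRICE[instance]

def savings(baseline, test):
    """Fractional saving of test versus baseline (negative = more expensive)."""
    return 1 - test / baseline

# Dataset #1: planned configuration and the four test runs from the slide.
baseline = run_cost("r3.4xlarge", 25, 7, 21)
tests = [
    ("r3.2xlarge", 40, 5, 35),
    ("r3.2xlarge", 60, 4, 42),
    ("r3.2xlarge", 80, 4, 23),
    ("r3.4xlarge", 25, 5, 13),
]
for cfg in tests:
    print(cfg[0], cfg[1], "nodes:", f"{savings(baseline, run_cost(*cfg)):+.0%}")
```

Doubling from 40 to 80 nodes shortens the run by only about an hour, so the extra node-hours cost more than the time saved is worth under this model.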
Volume vs Processing Time
Data Set | Volume | Runtime
Set #1 - Before | ~110M records | 2 hrs 45 min
Set #1 - After | ~1.55B records | 5 hrs 45 min
Set #1 - Delta | Increase by ~1,300% | Increase by ~110%
Set #2 - Before | ~900M records | 3 hrs 45 min
Set #2 - After | ~1.05B records | 4 hrs 25 min
Set #2 - Delta | Increase by ~16% | Increase by ~15%
Set #3 - Before | ~130M records | 2 hrs 45 min
Set #3 - After | ~1.05B records | 7 hrs 20 min
Set #3 - Delta | Increase by ~750% | Increase by ~160%
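The delta rows can be recomputed from the before/after figures as a sanity check. The sketch below reads the Set #2 "after" volume as ~1.05B records, which is what the slide's ~16% increase implies. Because the slide's inputs are themselves rounded, the results land near, not exactly on, the quoted deltas.

```python
def to_minutes(hours, minutes):
    """Convert an 'H hrs M min' runtime to minutes."""
    return hours * 60 + minutes

def pct_increase(before, after):
    """Percentage increase from before to after."""
    return (after - before) / before * 100

# (volume in records, runtime in minutes) before and after, from the slide.
sets = {
    "Set #1": ((110e6, to_minutes(2, 45)), (1.55e9, to_minutes(5, 45))),
    "Set #2": ((900e6, to_minutes(3, 45)), (1.05e9, to_minutes(4, 25))),
    "Set #3": ((130e6, to_minutes(2, 45)), (1.05e9, to_minutes(7, 20))),
}

for name, ((vol_a, run_a), (vol_b, run_b)) in sets.items():
    vol = pct_increase(vol_a, vol_b)
    run = pct_increase(run_a, run_b)
    print(f"{name}: volume +{vol:.0f}%, runtime +{run:.0f}%")
```

In Sets #1 and #3 runtime grows far more slowly than volume; in Set #2 the two grow at about the same rate.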
Amazon EMR Lessons Learned
• Make EVERYTHING Configurable
• Design for an easy upgrade path to new AMIs
  • AMI from August was obsolete in December
• Check your Amazon EC2 account limits before production deployment
• Build restart points throughout process – avoid paying for rework
• Be wary of uneven data distribution during processing
• Maintain a systemic view of the ecosystem for optimization
• If transfer to Amazon S3 is slow, check for data loss prevention (DLP) proxies
• Don’t assume transferring compressed data is always better
• Build with cost in mind
• Develop big data expertise in a controlled project
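One way to read the "build restart points" lesson is to checkpoint after each completed step so that a rerun skips work that already finished. The sketch below is a minimal local illustration of that idea, assuming a JSON state file; the slides do not describe the mechanism actually used, and all names here are hypothetical.

```python
import json
import os
import tempfile

def run_with_restart(steps, state_path):
    """Run (name, func) steps, skipping any already recorded in state_path."""
    done = []
    if os.path.exists(state_path):
        with open(state_path) as fh:
            done = json.load(fh)
    for name, func in steps:
        if name in done:
            continue              # finished in an earlier run; do not redo
        func()
        done.append(name)
        with open(state_path, "w") as fh:
            json.dump(done, fh)   # checkpoint after every completed step
    return done

calls = []
attempts = {"n": 0}

def flaky():
    """Fails on the first attempt only, to simulate a mid-run failure."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("simulated failure")
    calls.append("transform")

steps = [
    ("load", lambda: calls.append("load")),
    ("transform", flaky),
    ("export", lambda: calls.append("export")),
]

state = os.path.join(tempfile.mkdtemp(), "state.json")
try:
    run_with_restart(steps, state)
except RuntimeError:
    pass                          # first run stops at "transform"
run_with_restart(steps, state)    # second run resumes after "load"
print(calls)
```

The "load" step runs only once across both runs, which is the rework (and cluster time) the lesson is about avoiding.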
Reporting on Amazon Redshift
[Diagram zones: Physical Amgen Network; AWS Direct Connect VPC; Unconnected VPC]
[Diagram components: Amgen Controller, Source Data, App Master, App Launcher, App Logger, Storage, On-demand cluster (Amazon EMR), Reporting DB, Amazon Redshift Report DB, Reporting, Reports and Apps, Data Mount, Data Landing]
[Legend: Input Data; Processed Data]
Amazon Redshift Lessons Learned
Design Principle: Cognos reports should work for both Amazon Redshift and MPP DB
with minimal change
Performance: Report execution time dropped from 20 seconds to 3 seconds
Technical Differences:
• High performance out of the box with little tuning
• Designing on Amazon Redshift
• Split large tables into multiple tables and union
• Load into an empty table and then change view definition to union additional table
• Amazon Redshift uses ~3x space of table to update sort keys of indexes (vacuum)
• Amazon Redshift limited to 50 concurrent user-defined queries
• Moderate rewrite effort: ~6hrs/report due to syntax and function differences
• Amazon Redshift case-sensitive for data
• No NullifZero function
• Rank function orders differently
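Two of the points above lend themselves to a small illustration: loading a new slice into its own table and then redefining a view to union it in, and replacing the missing NullifZero function with standard SQL NULLIF(x, 0). The sketch below uses Python's built-in sqlite3 purely as a local stand-in so it runs anywhere; it is not Redshift, the table, view, and column names are hypothetical, and the choice of UNION ALL is an assumption (the slide says only "union").

```python
import sqlite3

# sqlite3 is only a stand-in database; all object names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_2014 (rep TEXT, units INTEGER, calls INTEGER);
    CREATE TABLE sales_2015 (rep TEXT, units INTEGER, calls INTEGER);
    INSERT INTO sales_2014 VALUES ('a', 10, 5), ('b', 4, 0);
    CREATE VIEW sales AS SELECT * FROM sales_2014;
""")

# Load the new slice into its own, previously empty table ...
con.execute("INSERT INTO sales_2015 VALUES ('a', 12, 6)")

# ... then redefine the view so reports see both tables.
con.executescript("""
    DROP VIEW sales;
    CREATE VIEW sales AS
        SELECT * FROM sales_2014
        UNION ALL
        SELECT * FROM sales_2015;
""")

# NULLIF(calls, 0) turns 0 into NULL, so the division yields NULL
# instead of a divide-by-zero error or a misleading value.
rows = con.execute(
    "SELECT rep, units * 1.0 / NULLIF(calls, 0) FROM sales ORDER BY rep, units"
).fetchall()
print(rows)
```

Because reports query the view and not the underlying tables, the load touches only the new table and the switch is a single view redefinition.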
Considerations for Cloud and Big Data Tech
Hive Impala
Sizing
Security
Troubleshooting
Know Your Data!
Considerations for Cloud and Big Data Tech
• Plan to learn during the project
  • Vendors, partners, staff – all are learning new things each day
  • Managers need a strong understanding of how things work
• Manage technology risks by targeted POCs
  • We had ~25-30 different tech architecture options we wanted to solve before the project
• Partner with enterprise infrastructure – cloud does not mean no control
  • Integration with enterprise networks, security, VPN is not trivial
  • Billing, cost allocation, controlling who creates infrastructure – challenges better solved before you have 100-user groups
• New mindset for cost management
  • Daily incremental spend instead of large, periodic capital investment
  • Tools for visibility, forecasting, and tracking are helpful
  • Have targets to improve efficiency constantly
What’s Next for Amgen
• Move remaining Business Unit ETL Processing from MPP DB to EMR
• Move reporting database from MPP DB to Amazon Redshift for all BUs
• Optimize costs
• Expand to enterprise data lake and future AWS projects
Enterprise data lake design:
• Hybrid on-premises/cloud model
• Started cluster development and architecture design in AWS
• Connected VPC: persistence and security
• Amazon EBS for HDFS storage: resizing and stopping nodes
• Seamless integration between on-premises and AWS clusters
Remember to complete your evaluations!
Thank you!