Building Scalable Big Data Solutions Durga Nemani – AOL Inc.

Post on 17-Feb-2018

BACKGROUND & ARCHITECTURE

Before

Data Processing in the Cloud

After

[Diagram: Amazon S3 with multiple Amazon EMR clusters]

UNIQUE FEATURES & ADVANTAGES

Separation of Compute and Storage

SEE, SPOT, SQUEEZE

• Just enough spot instances to finish the job in 59 minutes.
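The 59-minute target presumably reflects billing by the full instance-hour, so a cluster that finishes just inside the hour pays for no idle time. A minimal sketch of the sizing arithmetic follows; the function name, the `total_instance_minutes` input, and the assumption that the job scales linearly with instance count are illustrative and not taken from the slides.

```python
import math


def instances_for_deadline(total_instance_minutes: float,
                           deadline_minutes: float = 59.0) -> int:
    """Smallest instance count that finishes the work before the deadline.

    Assumes (hypothetically) that the job scales linearly: N instances
    finish total_instance_minutes of work in total_instance_minutes / N.
    """
    if deadline_minutes <= 0:
        raise ValueError("deadline_minutes must be positive")
    if total_instance_minutes <= 0:
        return 0
    # Round up: a fractional instance cannot be launched.
    return math.ceil(total_instance_minutes / deadline_minutes)
```

Real jobs rarely scale perfectly linearly, so an estimate like this would need headroom for startup time and skewed partitions.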

Key Features

• Separation of Compute and Storage: Amazon S3 and Amazon EMR

• Transient Clusters: No permanent cluster. Differently sized clusters for different datasets.

• Separation of duties: Independent jobs for processing, extracting, loading, and monitoring.

• Parallelism: Process the smallest chunk of data possible in parallel to reduce dependencies

• Scalability: Hundreds of Amazon EMR clusters in multiple regions and Availability Zones

• Cost optimized: All Spot instances. Launch in the Availability Zone with the lowest Spot price.
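The parallelism and transient-cluster points can be sketched as follows. This is an illustration only: the key layout `dataset/date/hour/file` and the `process_chunk` callback (a stand-in for launching one transient cluster per chunk) are assumptions, not details from the talk.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def plan_chunks(object_keys):
    """Group keys into the smallest independent units of work.

    Assumes a hypothetical key layout of dataset/date/hour/filename.
    """
    chunks = defaultdict(list)
    for key in object_keys:
        dataset, date, hour, _filename = key.split("/", 3)
        chunks[(dataset, date, hour)].append(key)
    return dict(chunks)


def run_in_parallel(chunks, process_chunk, max_workers=4):
    """Run process_chunk(chunk_id, keys) for every chunk concurrently.

    process_chunk is a placeholder for whatever starts the real job.
    Returns a dict mapping each chunk id to its result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            chunk_id: pool.submit(process_chunk, chunk_id, keys)
            for chunk_id, keys in chunks.items()
        }
        return {chunk_id: fut.result() for chunk_id, fut in futures.items()}
```

Because each chunk has no dependency on the others, a failed chunk can be rerun alone without touching the rest.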

[Diagram: Amazon EMR clusters and Amazon DynamoDB]

● export OPTIMAL_EC2_AVAIL_ZONE=`aws ec2 describe-spot-price-history --instance-types ${CORE_INSTANCE_TYPE} --product-descriptions Linux/UNIX --start-time 2100-01-01 --output text --region ${OPTIMAL_EC2_REGION} | sort -gk5 | head -1 | cut -f 2`

● export OPTIMAL_EC2_SUBNET=$(aws --region ${OPTIMAL_EC2_REGION} ec2 describe-subnets --filters "Name=tag-value",Values="dwaol-snet-emr-${OPTIMAL_EC2_AVAIL_ZONE}" "Name=tag-key",Values="Name" | /usr/bin/python2.6 -c 'import sys, json; X=json.load(sys.stdin); print X["Subnets"][0]["SubnetId"];')

● aws emr create-cluster .... --region $OPTIMAL_EC2_REGION --ec2-attributes SubnetId=$OPTIMAL_EC2_SUBNET

● emrfs-site fs.s3.enableServerSideEncryption true

● emrfs-site fs.s3.serverSideEncryptionAlgorithm AES256
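The zone-selection one-liner above sorts the text output by price and keeps the first row. The same selection can be sketched in Python over already-parsed records. The field names `AvailabilityZone`, `SpotPrice`, and `Timestamp` follow the JSON output of `describe-spot-price-history`; the function name and the step that keeps only the most recent record per zone are assumptions added for illustration.

```python
def cheapest_zone(price_records):
    """Return (availability_zone, price) for the lowest current Spot price.

    price_records: iterable of dicts with "AvailabilityZone", "SpotPrice"
    (a decimal string), and "Timestamp" (any consistently comparable value,
    such as ISO 8601 strings in the same format).
    """
    latest = {}
    for record in price_records:
        zone = record["AvailabilityZone"]
        # Keep only the most recent price seen for each zone.
        if zone not in latest or record["Timestamp"] > latest[zone]["Timestamp"]:
            latest[zone] = record
    if not latest:
        raise ValueError("no price records supplied")
    best = min(latest.values(), key=lambda r: float(r["SpotPrice"]))
    return best["AvailabilityZone"], float(best["SpotPrice"])
```

Comparing prices as floats rather than strings avoids mis-ordering values such as "0.1" and "0.09".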

DATA & INSIGHTS

CLOUD Facts

• Total compressed Amazon S3 data size: 150 TB

• Uncompressed raw data per day: 2-3 TB

• Amazon EMR clusters per day: 350

• Amazon S3 data retention period: 13-24 months

Restatement Use Case

• 150 terabytes of raw data

• 10 Availability Zones

• 550 EMR clusters

• 24,000 EC2 instances

Best Practices & Recommendations

• Open Source Data Formats: JSON, Avro, Parquet

• AWS Services: Lambda, EMR, S3

• Open Source Technologies: Apache Hive, Apache Pig, Presto

Tag all resources

Infrastructure as Code

Command Line Interface

JSON as configuration files

IAM roles and policies

Use of application ID

Enable CloudTrail

S3 lifecycle management

S3 versioning

Separate code/data/logs buckets

Keyless EMR clusters

Hybrid model

Enable debugging

Create multiple CLI profiles

Multi-factor authentication

CloudWatch billing alarms

EC2 Spot instances

SNS notifications for failures

Loosely coupled Apps

Scale horizontally
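As one way to picture "JSON as configuration files", the sketch below writes the two emrfs-site encryption settings shown earlier into a JSON document using Amazon EMR's configuration-classification layout (a list of objects with `Classification` and `Properties`). The function names and the file path are hypothetical.

```python
import json


def emrfs_encryption_config(algorithm="AES256"):
    """Build an EMR-style configuration list enabling S3 server-side encryption."""
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.enableServerSideEncryption": "true",
                "fs.s3.serverSideEncryptionAlgorithm": algorithm,
            },
        }
    ]


def write_config(path, config):
    """Write the configuration to disk as stable, human-readable JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2, sort_keys=True)
```

A file written this way could be kept under version control next to the launch scripts and passed to `aws emr create-cluster` with `--configurations file://...`.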

Next Steps

Database on cloud

• Database on AWS

• Options: Amazon RDS, Amazon Redshift, or others using Amazon EC2

Event-driven design

• Kick off code based on events

• Run downstream processes as soon as upstream completes

• Options: AWS Lambda, Amazon SQS, Amazon SWF, or AWS Data Pipeline

Data analytics

• Implement massively parallel processing technologies

• Options: Spark, Impala, or Presto

DevOps on cloud

• Rapidly and automatically deploy new code

• Continuous Integration/Continuous Deployment

• Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline
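The event-driven idea, running downstream processes as soon as upstream completes, can be illustrated with a small in-process sketch. It does not use AWS Lambda, Amazon SQS, or any other service named above; the `EventBus` class and the event names are hypothetical stand-ins for whatever mechanism delivers the events.

```python
from collections import defaultdict


class EventBus:
    """Minimal publish/subscribe dispatcher for chaining pipeline stages."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.log = []  # Names of published events, in publication order.

    def subscribe(self, event, handler):
        """Register handler(payload) to run whenever `event` is published."""
        self._handlers[event].append(handler)

    def publish(self, event, payload=None):
        """Record the event and immediately call every subscribed handler."""
        self.log.append(event)
        for handler in list(self._handlers[event]):
            handler(payload)
```

A stage that finishes publishes its own completion event, so the next stage starts without polling or fixed schedules:

```python
bus = EventBus()
steps = []


def extract(payload):
    steps.append("extract")
    bus.publish("extracted", payload)


def load(payload):
    steps.append("load")


bus.subscribe("processed", extract)
bus.subscribe("extracted", load)
bus.publish("processed", {"chunk": "example"})
```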