Building Scalable Big Data Solutions

Durga Nemani – AOL Inc.

BACKGROUND & ARCHITECTURE

Before

Data Processing in the Cloud

After

[Diagram: Amazon S3 and multiple Amazon EMR clusters]

UNIQUE FEATURES & ADVANTAGES

Separation of Compute and Storage


SEE, SPOT, SQUEEZE

• Just enough spot instances to finish the job in 59 minutes.
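The sizing idea in the bullet above can be sketched as a small calculation. This is only an illustration: it assumes the job scales linearly across instances and that the total work (in instance-minutes) is already known, neither of which the slide states. The function name `required_instances` is hypothetical. The 59-minute target presumably reflects per-instance-hour billing at the time.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of instances needed so a job fits in one
# time window (59 minutes on the slide).
# $1 = estimated work in instance-minutes (assumed known, linear scaling)
# $2 = window in minutes (defaults to 59)
required_instances() {
  local total_instance_minutes=$1
  local window_minutes=${2:-59}
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (total_instance_minutes + window_minutes - 1) / window_minutes ))
}

required_instances 600   # prints 11
```

With 600 instance-minutes of work, 10 instances would need just over an hour, so the sketch rounds up to 11.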


Key Features

• Separation of Compute and Storage: Amazon S3 and Amazon EMR

• Transient clusters: No permanent cluster. Differently sized clusters for different datasets

• Separation of duties: Independent jobs for processing, extracting, loading, and monitoring

• Parallelism: Process the smallest chunk of data possible in parallel to reduce dependencies

• Scalability: Hundreds of Amazon EMR clusters in multiple regions and Availability Zones

• Cost optimized: All Spot instances. Launch in the Availability Zone with the lowest Spot price.
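The "lowest Spot price" selection can be tried offline on canned data. The sketch below applies a sort/head/cut pipeline to made-up rows shaped like the tab-separated `--output text` result of `aws ec2 describe-spot-price-history`. The column layout (Availability Zone in field 2, price in field 5) is an assumption about that text output, the prices are invented, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Pick the Availability Zone with the lowest Spot price from
# tab-separated rows read on stdin.
# Assumed columns: 1=record label, 2=AZ, 3=instance type,
# 4=product description, 5=price, 6=timestamp.
cheapest_az() {
  sort -t $'\t' -g -k5,5 | head -n 1 | cut -f 2
}

# Made-up sample rows (not real prices).
sample_prices() {
  printf 'SPOTPRICEHISTORY\tus-east-1a\tm3.xlarge\tLinux/UNIX\t0.0512\t2015-10-01T00:00:00.000Z\n'
  printf 'SPOTPRICEHISTORY\tus-east-1c\tm3.xlarge\tLinux/UNIX\t0.0377\t2015-10-01T00:00:00.000Z\n'
  printf 'SPOTPRICEHISTORY\tus-east-1d\tm3.xlarge\tLinux/UNIX\t0.0431\t2015-10-01T00:00:00.000Z\n'
}

sample_prices | cheapest_az   # prints us-east-1c
```

Sorting numerically on the price column and taking the first row keeps the selection a one-liner that can run just before each cluster launch.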

[Diagram: Amazon S3, multiple Amazon EMR clusters, and Amazon DynamoDB]

● export OPTIMAL_EC2_AVAIL_ZONE=`aws ec2 describe-spot-price-history --instance-types ${CORE_INSTANCE_TYPE} --product-descriptions Linux/UNIX --start-time 2100-01-01 --output text --region ${OPTIMAL_EC2_REGION} | sort -gk5 | head -1 | cut -f 2`

● export OPTIMAL_EC2_SUBNET=$(aws --region ${OPTIMAL_EC2_REGION} ec2 describe-subnets --filters "Name=tag-value",Values="dwaol-snet-emr-${OPTIMAL_EC2_AVAIL_ZONE}" "Name=tag-key",Values="Name" | /usr/bin/python2.6 -c 'import sys, json; X=json.load(sys.stdin); print X["Subnets"][0]["SubnetId"];')

● aws emr create-cluster .... --region $OPTIMAL_REGION --ec2-attributes SubnetId=$OPTIMAL_EC2_SUBNET

● emrfs-site fs.s3.enableServerSideEncryption true

● emrfs-site fs.s3.serverSideEncryptionAlgorithm AES256
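One way the two `emrfs-site` settings above could be supplied is through a JSON configurations file, as used by EMR release 4.x and later. The sketch below writes such a file. The file name `emrfs-config.json` and the function name are hypothetical, and the deck does not show how the settings were actually passed.

```shell
#!/usr/bin/env bash
# Write an EMR configurations file that sets the two emrfs-site
# properties from the slide (S3 server-side encryption with AES256).
# $1 = output path (hypothetical file name chosen by the caller)
write_emrfs_config() {
  local out_file=$1
  cat > "$out_file" <<'EOF'
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.enableServerSideEncryption": "true",
      "fs.s3.serverSideEncryptionAlgorithm": "AES256"
    }
  }
]
EOF
}

write_emrfs_config emrfs-config.json
# The file could then be passed to create-cluster as:
#   --configurations file://emrfs-config.json
```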

DATA & INSIGHTS

CLOUD Facts


• Total compressed Amazon S3 data size: 150 TB

• Uncompressed raw data per day: 2-3 TB

• Amazon EMR clusters per day: 350

• Amazon S3 data retention period: 13-24 months

Restatement Use Case

• 150 terabytes of raw data

• 24,000 EC2 instances

• 550 EMR clusters

• 10 Availability Zones

Best Practices & Recommendations

• AWS Services: Lambda, EMR, S3

• Open Source Technologies: Apache Hive, Apache Pig, Presto

• Open Source Data Formats: JSON, Avro, Parquet

Tag all resources

Infrastructure as Code

Command Line Interface

JSON as configuration files

IAM roles and policies

Use of application ID

Enable CloudTrail

S3 lifecycle management

S3 versioning

Separate code/data/logs buckets

Keyless EMR clusters

Hybrid model

Enable debugging

Create multiple CLI profiles

Multi-factor authentication

CloudWatch billing alarms

EC2 Spot instances

SNS notifications for failures

Loosely coupled apps

Scale horizontally
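Several of the practices listed above (CLI profiles, tagging with an application ID, debugging with a log bucket, Spot instances, keyless and transient clusters) map onto flags of `aws emr create-cluster`. The dry-run sketch below only assembles and prints such a command without executing it. Every value in it (profile, bucket, instance types, bid price, release label, tag key) is a hypothetical placeholder, not the author's configuration.

```shell
#!/usr/bin/env bash
# Dry run: assemble (but do not execute) an `aws emr create-cluster` command.
# All names, sizes, prices, and S3 paths are hypothetical placeholders.
# $1 = CLI profile, $2 = region, $3 = application ID used for naming and tagging
build_create_cluster_cmd() {
  local profile=$1 region=$2 app_id=$3
  local cmd=(
    aws emr create-cluster
    --profile "$profile"                     # one of several CLI profiles
    --region "$region"
    --name "transient-${app_id}"
    --release-label emr-4.2.0
    --use-default-roles                      # IAM roles instead of access keys
    --auto-terminate                         # transient: terminates after its steps
    --enable-debugging
    --log-uri "s3://example-logs-bucket/${app_id}/"
    --tags "ApplicationId=${app_id}"         # tag all resources
    --instance-groups                        # BidPrice makes the groups Spot
    "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge,BidPrice=0.10"
    "InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge,BidPrice=0.10"
  )
  # No KeyName is set under --ec2-attributes, so the cluster has no SSH key.
  echo "${cmd[*]}"
}

build_create_cluster_cmd dev us-east-1 app-0001
```

Printing the command first makes it easy to review or log before a real launch.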

Next Steps

Database on cloud
• Database on AWS
• Options: Amazon RDS, Amazon Redshift, or others using Amazon EC2

Event-driven design
• Kick off code based on events
• Run downstream processes as soon as upstream completes
• Options: AWS Lambda, Amazon SQS, Amazon SWF, or AWS Data Pipeline

Data analytics
• Implement massively parallel processing technologies
• Options: Spark, Impala, or Presto

DevOps on cloud
• Rapidly and automatically deploy new code
• Continuous Integration/Continuous Deployment
• Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline

Data Analytics in the cloud

[Diagram: Amazon S3, Amazon EMR, AWS Lambda, Amazon Redshift, and Amazon QuickSight]

Q & A

THANK YOU

Reference:

AWS re:Invent 2015 | (BDT210) Building Scalable Big Data Solutions: Intel & AOL
https://www.youtube.com/watch?v=2yZginBYcEo

AWS re:Invent 2015 | (BDT208) A Technical Introduction to Amazon Elastic MapReduce
https://www.youtube.com/watch?v=WnFYoiRqEHw
