Building Scalable Big Data Solutions
Key Features
• Separation of Compute and Storage: Amazon S3 and Amazon EMR
• Transient Clusters: No permanent cluster. Different size clusters for different datasets
• Separation of duties: Independent jobs for processing, extracting, loading, and monitoring
• Parallelism: Process the smallest chunk of data possible in parallel to reduce dependencies
• Scalability: Hundreds of Amazon EMR clusters in multiple regions and Availability Zones
• Cost optimized: All Spot instances. Launch in the Availability Zone with the lowest Spot price.
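The cost-optimization rule can be sketched in Python as a small selection function. This is a minimal illustration, not the presenters' code: it assumes records shaped like the "SpotPriceHistory" entries in the JSON output of `aws ec2 describe-spot-price-history`, and the sample prices, zones, and the function name `cheapest_availability_zone` are made up for the example.

```python
from typing import Dict, Iterable, Mapping


def cheapest_availability_zone(records: Iterable[Mapping[str, str]]) -> str:
    """Return the Availability Zone whose most recent Spot price is lowest.

    Assumes each record has "AvailabilityZone", "SpotPrice" (a decimal
    string) and "Timestamp" (ISO 8601 strings in one consistent format,
    so plain string comparison orders them chronologically).
    """
    latest: Dict[str, Mapping[str, str]] = {}
    for rec in records:
        zone = rec["AvailabilityZone"]
        # Keep only the newest quote for each zone.
        if zone not in latest or rec["Timestamp"] > latest[zone]["Timestamp"]:
            latest[zone] = rec
    if not latest:
        raise ValueError("no Spot price records supplied")
    best = min(latest.values(), key=lambda r: float(r["SpotPrice"]))
    return best["AvailabilityZone"]


# Hypothetical sample data, for illustration only.
SAMPLE_RECORDS = [
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.0410",
     "Timestamp": "2015-10-01T12:00:00.000Z"},
    {"AvailabilityZone": "us-east-1b", "SpotPrice": "0.0325",
     "Timestamp": "2015-10-01T12:00:00.000Z"},
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.0300",
     "Timestamp": "2015-10-01T09:00:00.000Z"},  # older quote, ignored
]

if __name__ == "__main__":
    print(cheapest_availability_zone(SAMPLE_RECORDS))
```

Comparing only the newest quote per zone avoids picking a zone because of a stale low price.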
[Architecture diagram: Amazon S3, multiple Amazon EMR clusters, Amazon DynamoDB]
● export OPTIMAL_EC2_AVAIL_ZONE=`aws ec2 describe-spot-price-history --instance-types ${CORE_INSTANCE_TYPE} --product-descriptions Linux/UNIX --start-time 2100-01-01 --output text --region ${OPTIMAL_EC2_REGION} | sort -gk5 | head -1 | cut -f 2`
● export OPTIMAL_EC2_SUBNET=$(aws --region ${OPTIMAL_EC2_REGION} ec2 describe-subnets --filters "Name=tag-value",Values="dwaol-snet-emr-${OPTIMAL_EC2_AVAIL_ZONE}" "Name=tag-key",Values="Name" | /usr/bin/python2.6 -c 'import sys, json; X=json.load(sys.stdin); print X["Subnets"][0]["SubnetId"];')
● aws emr create-cluster .... --region $OPTIMAL_EC2_REGION --ec2-attributes SubnetId=$OPTIMAL_EC2_SUBNET
● emrfs-site fs.s3.enableServerSideEncryption true
● emrfs-site fs.s3.serverSideEncryptionAlgorithm AES256
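The two emrfs-site settings above can also be written as a JSON configuration file. The sketch below generates one in Python. It assumes an Amazon EMR release that accepts configuration classifications (a list of "Classification"/"Properties" objects); the function name `emrfs_sse_configuration` is hypothetical.

```python
import json
from typing import Dict, List


def emrfs_sse_configuration(algorithm: str = "AES256") -> List[Dict]:
    """Build an emrfs-site classification enabling S3 server-side encryption.

    The property names and values are the ones listed on the slide; the
    surrounding Classification/Properties layout is the assumed EMR
    configuration format.
    """
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.enableServerSideEncryption": "true",
                "fs.s3.serverSideEncryptionAlgorithm": algorithm,
            },
        }
    ]


if __name__ == "__main__":
    # Print JSON that could be saved to a file for cluster creation.
    print(json.dumps(emrfs_sse_configuration(), indent=2))
```

Saved to a file, the output could be passed to `aws emr create-cluster` with `--configurations file://<path>`, which keeps encryption settings in version-controlled JSON rather than in ad hoc command lines.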
CLOUD Facts
• Total compressed Amazon S3 data size: 150 TB
• Uncompressed raw data per day: 2-3 TB
• Amazon EMR clusters per day: 350
• Amazon S3 data retention period: 13-24 months
AWS Services: Lambda, EMR, S3
Open Source Technologies: Apache Hive, Apache Pig, Presto
Open Source Data Formats: JSON, Avro, Parquet
Tag all resources
Infrastructure as Code
Command Line Interface
JSON as configuration files
IAM roles and policies
Use of application ID
Enable CloudTrail
S3 lifecycle management
S3 versioning
Separate code/data/logs buckets
Keyless EMR clusters
Hybrid model
Enable debugging
Create multiple CLI profiles
Multi-factor authentication
CloudWatch billing alarms
EC2 Spot instances
SNS notifications for failures
Loosely coupled Apps
Scale horizontally
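As one illustration of the list above, "S3 lifecycle management" combined with "JSON as configuration files" might look like the following Python sketch. It builds a lifecycle document in the shape accepted by `aws s3api put-bucket-lifecycle-configuration`. The prefixes, rule IDs, and day counts are hypothetical; the day counts are only a rough translation of a 13-24 month retention range and are not taken from the talk.

```python
import json
from typing import Dict


def expiration_rule(rule_id: str, prefix: str, retention_days: int) -> Dict:
    """Return one lifecycle rule that expires objects under a prefix."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": retention_days},
    }


def lifecycle_configuration(retention_by_prefix: Dict[str, int]) -> Dict:
    """Build a lifecycle configuration with one expiration rule per prefix."""
    rules = []
    for prefix, days in sorted(retention_by_prefix.items()):
        # Derive a readable rule ID from the prefix, e.g. "raw/" -> "expire-raw".
        rule_id = "expire-" + prefix.strip("/").replace("/", "-")
        rules.append(expiration_rule(rule_id, prefix, days))
    return {"Rules": rules}


if __name__ == "__main__":
    # Hypothetical prefixes: about 13 months and 24 months of retention.
    config = lifecycle_configuration({"raw/": 396, "processed/": 730})
    print(json.dumps(config, indent=2))
```

Generating the document from a small mapping keeps retention periods reviewable in one place, in line with the infrastructure-as-code practice.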
Database on cloud
• Database on AWS
• Options: Amazon RDS, Amazon Redshift, or others using Amazon EC2
Event-driven design
• Kick off code based on events
• Run downstream processes as soon as upstream completes
• Options: AWS Lambda, Amazon SQS, Amazon SWF or AWS Data Pipeline
Data analytics
• Implement massive parallel processing technologies
• Options: Spark, Impala or Presto
DevOps on cloud
• Rapidly and automatically deploy new code
• Continuous Integration/Continuous Deployment
• Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline
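The event-driven design item (kick off code when an upstream step finishes) could be sketched as an AWS Lambda handler in Python that reacts to S3 object-created notifications. This is an assumption-laden illustration: `start_downstream` is a hypothetical placeholder for whatever launches the next job, and the bucket and key in the usage note are made up.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def new_objects(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if s3 is None:
            continue  # not an S3 record; ignore it
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def start_downstream(bucket: str, key: str) -> str:
    """Hypothetical placeholder: a real system might submit an EMR step here."""
    return f"would process s3://{bucket}/{key}"


def handler(event: Dict[str, Any], context: Any) -> List[str]:
    """Lambda entry point: trigger downstream work for each new object."""
    return [start_downstream(bucket, key) for bucket, key in new_objects(event)]
```

For example, calling `handler` with a notification for key `raw/2015-10-01/part+0001.json` in bucket `example-bucket` yields one message for `s3://example-bucket/raw/2015-10-01/part 0001.json`, since `+` decodes to a space.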
THANK YOU
Reference:
AWS re:Invent 2015 | (BDT210) Building Scalable Big Data Solutions: Intel & AOL
https://www.youtube.com/watch?v=2yZginBYcEo
AWS re:Invent 2015 | (BDT208) A Technical Introduction to Amazon Elastic MapReduce
https://www.youtube.com/watch?v=WnFYoiRqEHw