aws august webinar series - ec2 spot instances - 08192015

EC2 Spot

Save Up to 90% on your Amazon EC2 Bill with Spot Instances

Tipu Qureshi, Jafar Shameem

19th August 2015

Name your own price for EC2 Compute

• A market where price of compute changes based upon Supply and Demand

• When Bid Price exceeds Spot Market Price, instance is launched

• Instance is terminated (with 2 minute warning) if market price exceeds bid price

• Unused On-Demand Instances

What is Spot?

• Spot prices are determined via supply and demand
• There are hundreds of uncorrelated Spot markets
• Prices can, but often don’t, fluctuate wildly

About Spot…

Type:
• General-purpose: M1, M3, T2
• Compute-optimized: C1, CC2, C3, C4
• Memory-optimized: M2, CR1, R3, M4
• Dense-storage: HS1, D2
• I/O-optimized: HI1, I2
• GPU: CG1, G2
• Micro: T1, T2

Size: .micro, .medium, .large, .xlarge, .2xlarge, .4xlarge, .8xlarge

OS: Windows, Linux

AZ: -1a, -1b, -1c, …

Spot is not one market

Each instance family (r3) and size (4xlarge), in each Availability Zone (us-east-1b), is its own market

Uncorrelated pools of Spot Capacity

50% Bid

70% Bid

You pay the market price

Bid Price and Market Price

cc2.8xlarge: 32 cores, 60.5 GB memory

On-Demand Price:$2.00/hr

$0.00936/core/hr

On average, AWS adds enough new server capacity every day to support Amazon’s global infrastructure when it was a $7B business.

EC2 Spot - best practices

Check the Price History

Describe Spot Price History API:
• Provides historical prices on a per-pool basis
• Goes back 90 days (3 months)
• Popular instance types tend to have Spot prices that are somewhat more volatile
• Older generations (including c1.8xlarge, m1.small, cr1.8xlarge, and cc2.8xlarge) tend to be much more stable and have lower cost in general
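Per-pool price history can be summarized before deciding where to bid. The following is a minimal sketch, assuming records shaped like the entries the Describe Spot Price History API returns (`InstanceType`, `AvailabilityZone`, and a string `SpotPrice`). The `summarize_pools` helper and the sample prices are hypothetical, not real market data.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_pools(history):
    """Group price records by (instance type, AZ) and compute simple statistics."""
    pools = defaultdict(list)
    for record in history:
        key = (record["InstanceType"], record["AvailabilityZone"])
        pools[key].append(float(record["SpotPrice"]))
    return {
        key: {
            "mean": mean(prices),
            "min": min(prices),
            "max": max(prices),
            "stdev": pstdev(prices),  # population std dev as a rough volatility measure
        }
        for key, prices in pools.items()
    }


# Hypothetical sample records, for illustration only.
sample = [
    {"InstanceType": "c3.xlarge", "AvailabilityZone": "us-east-1a", "SpotPrice": "0.030"},
    {"InstanceType": "c3.xlarge", "AvailabilityZone": "us-east-1a", "SpotPrice": "0.050"},
    {"InstanceType": "c3.xlarge", "AvailabilityZone": "us-east-1b", "SpotPrice": "0.040"},
]

for pool, stats in sorted(summarize_pools(sample).items()):
    print(pool, stats)
```

A pool with a low standard deviation over the history window is one plausible signal of stability. Other measures (for example, how often the price crossed your bid) may suit a workload better.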

Capacity pools

Set of EC2 instances with the same properties:
• Availability Zone
• Product/Operating system (Linux/Unix or Windows)
• EC2 instance type

Each EC2 capacity pool has its own:
• Availability – number of Spot instances
• Price – based on supply and demand

Use Multiple Capacity Pools

• Run applications across multiple capacity pools to reduce your application’s sensitivity to price spikes that affect a pool

• In general, there is very little correlation between prices in different capacity pools.

• For example, if you run in five different pools your price swings and interruptions can be cut by 80%.
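The 80% figure follows from simple arithmetic. This sketch assumes capacity is split evenly across pools, the pools are uncorrelated, and a price spike hits only one pool at a time. The `exposure_reduction` name is hypothetical.

```python
def exposure_reduction(pool_count):
    """Fraction by which the impact of a single-pool price spike shrinks
    when capacity is spread evenly across pool_count pools."""
    if pool_count < 1:
        raise ValueError("pool_count must be at least 1")
    # Only 1/pool_count of the fleet sits in the affected pool.
    return 1 - 1 / pool_count


for n in (1, 2, 5, 10):
    print(n, f"{exposure_reduction(n):.0%}")
```

With five pools, only one fifth of the fleet is exposed to any single pool, which is the 80% reduction quoted above.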

Use Multiple Capacity Pools

Run across multiple Availability Zones in conjunction with:
• Auto Scaling
• Spot Fleet API

Run application across different sizes of instances within the same family

• Amazon EMR takes this approach

Your application could figure out how many vCPUs it is running on, and then launch enough worker threads to keep all of them occupied.
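One way to size the worker pool to the instance the application lands on is sketched below. It assumes the vCPU count reported by the operating system is the right target. `run_on_all_vcpus` and `square` are hypothetical placeholders for real application code.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_on_all_vcpus(task, items):
    """Run task over items using one worker thread per vCPU visible to the OS."""
    workers = os.cpu_count() or 1  # cpu_count() can return None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, items))


def square(n):
    """Placeholder for real per-item work."""
    return n * n


print(run_on_all_vcpus(square, range(8)))
```

Because the worker count is read at run time, the same code can run unchanged on an xlarge or an 8xlarge of the same family. For CPU-bound pure-Python work, a process pool is typically a better fit than threads.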

CPU and cores
• What kind of performance does your application require?
• How many cores does your application need?

Memory/core
• How much memory per core does your application need?

Networking
• Does your application need high, moderate, low network bandwidth?

Disk
• How much local disk does your application need?

Use Normalized pools of Compute

You only pay what the Market price is

But, bid what you are willing to pay

You are charged the market price in effect as you enter each hour

And you pay for it at the end of the hour

If you get interrupted, you don’t pay for that hour

Bid only what you are willing to pay.

(by default, bid limited to 10 * On Demand Price)
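The charging rules on this slide can be expressed as a small calculation. This is a sketch of the model as described here only: each hour entered is charged at the market price at the start of that hour, and an hour cut short by an interruption is not charged. The `spot_charge` name and the sample prices are hypothetical.

```python
def spot_charge(start_of_hour_prices, interrupted):
    """Total charge for a Spot instance.

    start_of_hour_prices: market price at the start of each hour the instance
    entered, oldest first. If interrupted is True, the last hour listed is
    assumed to be the one that was cut short, so it is not charged.
    """
    if not start_of_hour_prices:
        return 0.0
    if interrupted:
        billable = start_of_hour_prices[:-1]
    else:
        billable = start_of_hour_prices
    return sum(billable)


# Hypothetical prices for three hours entered.
prices = [0.10, 0.12, 0.30]
print(f"ran to completion: ${spot_charge(prices, interrupted=False):.2f}")
print(f"interrupted in hour 3: ${spot_charge(prices, interrupted=True):.2f}")
```

Note that the bid never appears in the calculation: it only determines whether the instance keeps running.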

What about Bidding Strategy?

AWS Spot Labs
• https://github.com/awslabs/aws-spot-labs

Helps to find capacity pools (defined as instance type and AZ) with lower price volatility by ordering these pools based on duration of time since the Spot price last exceeded the bid price. It uses AWS CLI to programmatically obtain Spot price history data.

Finding the best pools of Compute Capacity

python get_spot_duration.py \
  --region us-east-1 \
  --product-description 'Linux/UNIX' \
  --bids c3.xlarge:0.105,c3.2xlarge:0.21,c3.4xlarge:0.42,c3.8xlarge:0.84,c4.xlarge:0.110,c4.2xlarge:0.220,c4.4xlarge:0.440,c4.8xlarge:0.880,cc2.8xlarge:1.000,c1.xlarge:0.26 \
  --hours 168

Note:
• Price as of 8/15/2015
• AZ mappings may differ
• 168 hours = 1 week
• In this example, bidding the on-demand price
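The ordering idea (rank pools by how long it has been since the Spot price last exceeded the bid) can be sketched as follows. This is not the code of `get_spot_duration.py`, only an illustration under stated assumptions: each history is a list of `(timestamp, price)` change points, a price stays in effect until the next record, and a pool whose price never exceeded the bid is credited with the full length of its history. All names and sample values are hypothetical.

```python
from datetime import datetime, timedelta


def hours_since_last_exceed(history, bid, now):
    """Hours since the price in this pool was last above bid."""
    records = sorted(history)
    if not records:
        raise ValueError("history must not be empty")
    last_above = None
    for i, (timestamp, price) in enumerate(records):
        if price > bid:
            # The high price stays in effect until the next record (or until now).
            last_above = records[i + 1][0] if i + 1 < len(records) else now
    start = last_above if last_above is not None else records[0][0]
    return (now - start).total_seconds() / 3600


def rank_pools(pools, bids, now):
    """pools: {(instance_type, az): history}; bids: {instance_type: bid}.
    Returns (hours, instance_type, az) tuples, most stable pool first."""
    ranked = [
        (hours_since_last_exceed(history, bids[itype], now), itype, az)
        for (itype, az), history in pools.items()
    ]
    return sorted(ranked, reverse=True)


# Hypothetical change points, for illustration only.
now = datetime(2015, 8, 15, 12, 0, 0)
pools = {
    ("c3.xlarge", "us-east-1a"): [
        (now - timedelta(hours=48), 0.05),
        (now - timedelta(hours=24), 0.50),  # spike above the bid
        (now - timedelta(hours=20), 0.05),  # back below the bid
    ],
    ("c3.xlarge", "us-east-1b"): [(now - timedelta(hours=48), 0.04)],
}
for hours, itype, az in rank_pools(pools, {"c3.xlarge": 0.105}, now):
    print(f"{itype} {az}: {hours:.0f} h")
```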

Using the Spot Tools Lab

• Build stateless, distributed, scalable applications
• Choose which instance types fit your workload the best
• Ingest price feed data for AZs and regions
• Make run time decisions on which Spot pools to launch in based on price and volatility
• Manage interruptions
• Monitor and manage market prices across AZs and instance types
• Manage the capacity footprint in the fleet
• And all of this while you don’t know where the capacity is
• Serve your customers

Helping with the undifferentiated heavy lifting

UNDIFFERENTIATED HEAVY LIFTING

Instead of writing all that code to manage Spot Instances, simply specify:

• Target Capacity - The number of EC2 instances that you want in your fleet.
• Maximum Bid Price - The maximum bid price that you are willing to pay.
• Launch Specifications - # of and types of instances, AMI id, VPC, subnets or AZs, etc.
• IAM Fleet Role - The name of an IAM role. It must allow EC2 to launch and terminate instances on your behalf.

Introducing Spot Fleet

EC2 Spot - Use Cases

Stateless Web/App Server Fleets

Hadoop Workloads

Continuous Integration (CI)

High Performance Computing (HPC)

Grid Computing

Media Rendering / Transcoding

Spot Use Cases

EC2 Spot - Web Architecture

Considerations

High availability

Elasticity

Stateless Web tier

Parallelism

Stateless Web/App/API Architecture with Spot

Elastic Load Balancing

Stateless Web Servers

On Demand Auto Scaling group

Session State Data

Stateless Web Servers (Spot)

Spot Auto Scaling group

Availability Zone A

Availability Zone B

Web Application - Auto Scaling

Multiple Auto Scaling groups
• On-Demand instances for fallback.
• Multiple EC2 Spot instance Auto Scaling groups
• Each Spot Auto Scaling group using a different capacity pool (e.g. AZ, bid, instance size, instance type)

Auto Scaling groups behind the same Elastic Load Balancer.

Pick the right instance type for the job based on the price history.

Auto Scaling Policies

Aggressive scaling policies for Spot Auto Scaling groups (e.g. scale up at 75% CPU utilization and scale down at 25% CPU utilization, with a large capacity range)

More conservative scaling policies for On-Demand Auto Scaling groups.

Session state for the web application can be stored in DynamoDB.

• Data replicated across availability zones.

You can also choose other databases to maintain state in your architecture.

• Amazon RDS using Multi-AZ deployments
• Amazon ElastiCache

Where to store the state?

Spot termination considerations

Availability of Spot instances can vary based on supply and demand

Architect application to be resilient to instance termination

When the Spot price exceeds the price you named (i.e. the bid price), the instance will receive a two-minute warning that the instance will be terminated

Spot termination considerations

Check for the 2 minute Spot instance termination notification every 5 seconds, leveraging a script invoked at instance launch. Upon notification:
• Place any session information into DynamoDB
• Use IAM roles so that the Spot instances can de-register themselves from the ELB upon termination notification

Since the Auto Scaling groups span across multiple availability zones, we highly recommend enabling cross-zone load balancing for the load balancer.

To allow in-flight requests to complete when de-registering Spot instances that are about to be terminated, connection draining can be enabled on the load balancer with a timeout of 90 seconds.

Elastic Load Balancing

Sample script

#!/bin/bash
while true
do
  # The metadata endpoint returns a timestamp only once termination is scheduled.
  if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | \
      grep -q '.*T.*Z'; then
    instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws elb deregister-instances-from-load-balancer \
      --load-balancer-name my-load-balancer \
      --instances "$instance_id"
    /env/bin/flushsessiontoDBonterminationscript.sh
    # Stop polling once the termination handling has run.
    break
  else
    # Spot instance not yet marked for termination.
    sleep 5
  fi
done
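The same polling loop can be sketched in Python, which makes the notice check easy to test in isolation. The metadata URL is the one used in the sample script. The function names, the stricter timestamp pattern, and the `on_notice` callback (where de-registration and session flushing would go) are assumptions for illustration.

```python
import re
import time
import urllib.error
import urllib.request

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"
# Assumed format of the notice body, e.g. 2015-08-19T12:00:00Z.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")


def is_termination_notice(body):
    """True if the metadata response body looks like a termination timestamp."""
    return bool(TIMESTAMP.match(body.strip()))


def fetch_termination_time(url=TERMINATION_URL, timeout=2):
    """Return the response body, or an empty string if there is no notice yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, OSError):
        return ""  # not found or unreachable: treat as "not marked"


def wait_for_termination(on_notice, poll_seconds=5, fetch=fetch_termination_time):
    """Poll until a termination notice appears, then call on_notice once."""
    while True:
        if is_termination_notice(fetch()):
            on_notice()
            return
        time.sleep(poll_seconds)
```

On an instance this would be started at launch, for example `wait_for_termination(drain_and_flush)`, where `drain_and_flush` is a hypothetical function that de-registers the instance from the load balancer and writes session data to DynamoDB.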

Web Application Architecture with Spot

Elastic Load Balancing

Stateless Web Servers

On Demand Auto Scaling group

Session State Data

Availability Zone A

Availability Zone B

Studyplus Case Study

Batch Processing with Amazon EC2 Spot

Batch-oriented applications can leverage on-demand processing using EC2 Spot to save up to 90% cost:
• Claims processing
• Large scale transformation
• Media processing
• Multi-part data processing work

You can also leverage EMR with Spot instances.

• Multi-part job processing architecture
• Auto Scaling groups to set up a heterogeneous, scalable “grid” of EC2 Spot instances with multiple capacity pools as worker nodes
• Use S3 to invoke AWS Lambda upon object upload
• Use SQS for decoupling
• DynamoDB for tracking job status
• Complete large batch processing tasks in parallel

About Lambda and SQS

AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information.

Amazon Simple Queue Service (SQS) is a fast, reliable, scalable, fully managed message queuing service to decouple components.

Depending on the application’s needs, multiple SQS queues might be required for functions and priorities.

Diagram: an input S3 bucket, a Job SQS queue, DynamoDB, EFS, and an output S3 bucket around an EC2 instance worker fleet made up of an On-Demand Auto Scaling group and Spot Auto Scaling groups 1 and 2 across Availability Zones A and B.

• Upload object into input S3 bucket
• Uploads to S3 will trigger a Lambda function to put jobs in SQS and DynamoDB
• Workers will check for jobs in the queue
• Workers will update job status (start time, SLA end time, etc.) in DynamoDB
• Auto Scaling groups will scale up based on queue depth and scale down based on CPU utilization CW metrics

IAM Role for Lambda Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1438283855455",
      "Action": [
        "dynamodb:PutItem"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:dynamodb:us-east-1::table/demojobtable"
    },
    {
      "Sid": "Stmt1438283929844",
      "Action": [
        "sqs:SendMessage"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:sqs:us-east-1::demojobqueue"
    }
  ]
}

AWS Lambda function for SQS and DynamoDB updates

// dependencies
var AWS = require('aws-sdk');

// get reference to clients
var s3 = new AWS.S3();
var sqs = new AWS.SQS();
var dynamodb = new AWS.DynamoDB();

console.log('Loading function');

exports.handler = function(event, context) {
  // Read options from the event.
  var srcBucket = event.Records[0].s3.bucket.name;
  // Object key may have spaces or unicode non-ASCII characters.
  var srcKey = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));

  // prepare SQS message
  var params = {
    MessageBody: 'object ' + srcKey + ' ',
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com//demojobqueue',
    DelaySeconds: 0
  };

  // send SQS message
  sqs.sendMessage(params, function (err, data) {
    if (err) {
      // an error occurred
      console.error('Unable to put object' + srcKey + ' into SQS queue due to an error: ' + err);
      context.fail(srcKey, 'Unable to send message to SQS');
    } else {
      // define DynamoDB table variables
      var tableName = "demojobtable";
      var datetime = new Date().getTime().toString();

      // Put item into DynamoDB table where srcKey is the hash key and datetime is the range key
      dynamodb.putItem({
        "TableName": tableName,
        "Item": {
          "srcKey": {"S": srcKey },
          "datetime": {"S": datetime }
        }
      }, function(err, data) {
        if (err) {
          console.error('Unable to put object' + srcKey + ' into DynamoDB table due to an error: ' + err);
          context.fail(srcKey, 'Unable to put data to DynamoDB Table');
        } else {
          console.log('Successfully put object' + srcKey + ' into SQS queue and DynamoDB');
          context.succeed(srcKey, 'Data put into SQS and DynamoDB');
        }
      });
    }
  });
};

• Worker nodes get job parts from the SQS and perform single tasks based on the job task state in DynamoDB

• Store the input objects in a file system such as Amazon Elastic File System (Amazon EFS), local instance store or Amazon Elastic Block Store (EBS)

• Each job can be further split into multiple sub-parts if there is a mechanism to stitch the outputs together

• Once completed, the objects will be uploaded back to S3 using multi-part upload.
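The worker flow above can be sketched with the AWS calls abstracted away. In this sketch the five callables are hypothetical stand-ins (for an SQS receive, a DynamoDB read and write, the application's task, and an SQS delete). The status values and the in-memory demo data are also assumptions. The status check guards against a message being delivered more than once.

```python
from collections import deque


def process_jobs(receive, get_status, set_status, run_task, delete):
    """Drain the queue, running each job at most once per recorded status."""
    completed = []
    while True:
        message = receive()
        if message is None:  # queue is empty
            return completed
        key = message["body"]
        if get_status(key) != "DONE":
            set_status(key, "RUNNING")
            run_task(key)
            set_status(key, "DONE")
            completed.append(key)
        # Delete only after the work is recorded, so an interrupted worker
        # leaves the message in the queue for another worker to pick up.
        delete(message)


# In-memory stand-ins for the queue and the status table (illustration only).
queue = deque({"body": key} for key in ["video-1", "video-2", "video-1"])
table = {}
deleted = []

completed = process_jobs(
    receive=lambda: queue.popleft() if queue else None,
    get_status=table.get,
    set_status=table.__setitem__,
    run_task=lambda key: None,  # placeholder for the real processing step
    delete=deleted.append,
)
print(completed, table)
```

Deleting the message last is what makes this pattern a reasonable fit for Spot: a worker terminated mid-job never acknowledges the message, so the job is retried elsewhere.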


More automation?

Use a Lambda function to dynamically manage Auto Scaling groups based on the Spot market

• The Lambda function could periodically invoke the EC2 Spot APIs to assess market prices and availability and respond by creating new Auto Scaling launch configurations and groups automatically.

• This function could also delete any Spot Auto Scaling groups and launch configurations that have no instances.

AWS Data Pipeline can be used to invoke the Lambda function using the AWS CLI at regular intervals by scheduling pipelines
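The decision step such a function would make can be sketched as pure logic, leaving the EC2 and Auto Scaling API calls out. The assumptions here: a pool is identified by `(instance_type, az)`, bids are keyed by instance type, and an empty group is deleted only if its pool is no longer attractive (a design choice in this sketch, so that freshly created groups are not removed immediately). All names and prices are hypothetical.

```python
def plan_groups(current_prices, bids, existing_groups, instance_counts):
    """Decide which Spot Auto Scaling groups to create and which to delete.

    current_prices: {(instance_type, az): current Spot price}
    bids: {instance_type: maximum price we are willing to pay}
    existing_groups: pools that already have an Auto Scaling group
    instance_counts: {(instance_type, az): running instances in that group}
    """
    wanted = {
        pool
        for pool, price in current_prices.items()
        if price <= bids.get(pool[0], 0)
    }
    to_create = sorted(wanted - set(existing_groups))
    to_delete = sorted(
        pool
        for pool in existing_groups
        if instance_counts.get(pool, 0) == 0 and pool not in wanted
    )
    return to_create, to_delete


# Hypothetical market snapshot.
prices = {
    ("c3.xlarge", "us-east-1a"): 0.04,
    ("c3.xlarge", "us-east-1b"): 0.30,
    ("c3.2xlarge", "us-east-1a"): 0.08,
}
bids = {"c3.xlarge": 0.105, "c3.2xlarge": 0.21}
existing = [("c3.xlarge", "us-east-1b")]
counts = {("c3.xlarge", "us-east-1b"): 0}

create, delete = plan_groups(prices, bids, existing, counts)
print("create:", create)
print("delete:", delete)
```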

Automated Batch Architecture with Spot

Diagram: an input S3 bucket, a Job SQS queue, DynamoDB, EFS, and an output S3 bucket around a worker fleet made up of an On-Demand Auto Scaling group and Spot Auto Scaling groups 1 and 2 (Spot workers) across Availability Zones A and B.

• A Lambda function puts jobs in DynamoDB and SQS
• Workers check for jobs in the queue and update job status in DynamoDB
• Auto Scaling groups will scale up based on queue depth and scale down based on CPU utilization CW metrics
• Data Pipeline can invoke a Lambda function in a scheduled manner which can manage Auto Scaling groups based on the Spot market

Further cost optimization with Trusted Advisor

Save money on AWS by eliminating unused and idle resources

Cost Optimization TA Checks:
• Amazon EC2 Reserved Instances Optimization
• Low Utilization Amazon EC2 Instances
• Idle Load Balancers
• Underutilized Amazon EBS Volumes
• Unassociated Elastic IP Addresses
• Amazon RDS Idle DB Instances

AWS re:Invent 2015 – October 6-9

AWS re:Invent is the largest annual gathering of the global cloud community. Whether you are an existing customer or new to the cloud, AWS re:Invent will provide you with the knowledge and skills to refine your cloud strategy, improve developer productivity, increase application performance and security, and reduce infrastructure costs.

Though AWS re:Invent tickets are sold out, you can still register to view the Live Stream Broadcasts of the keynote addresses and select technical sessions on October 7 and October 8. Register now.

Details:

Wednesday, October 7
9:00am - 10:30am PT: Andrew Jassy, Sr. Vice President, AWS
11:00am - 5:15pm PT: 5 of the most popular breakout sessions (to be announced)

Thursday, October 8
9:00am - 10:30am PT: Dr. Werner Vogels, CTO, Amazon
11:00am - 6:15pm PT: 6 of the most popular breakout sessions (to be announced)

Register now for the Live Stream Broadcast by submitting your email where prompted on the AWS re:Invent home page.

Stay Connected: Follow event activities on Twitter @awsreinvent (#reinvent), or like us on Facebook.

Thank you!

Questions?

What have customers done with Spot?

Some case studies..

Submit jobs, orchestrate HPC clusters over VPC

Run 1 Million drive head designs = 70.75 core-years

90x throughput: Ran in 8 hours, not 30 days

3 days from idea to running

70,908 cores, 729 TFLOPS

c3, r3 with Intel E5-2670 v2

Cost: $5,594

Spot Instances

New Drive Head Design Workloads

World’s Largest F500 Cloud Run

Transforming drive design to store the world’s data

Encrypt, route data to AWS, return results

Cluster: 70,908 Cores with Spot Instances

AWS Delivered Unheard-of Processing

39 years of science

10,600 AWS Instances

Saved equivalent of $40M infrastructure

10 Million compounds screened

39 drug design years in 11 hours for a cost of… $4,232

3 promising compounds identified

Scaling Hadoop Jobs with Spot

http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/

Bloomreach launches 1,500 to 2,000 Amazon EMR clusters and runs 6,000 Hadoop jobs every day.

Continuous Integration & Testing with Spot

• Tapjoy - Premier Mobile Ad Network Across iOS & Android
• Global Network (435 Million Monthly Reach)
• Jenkins + Spot Instances
• https://github.com/bwall/ec2-plugin (thanks to an RIT senior project)
• Go wide during business hours, scale back in the evenings. Automatically kicks online at 06:00 ET
• Workers scale horizontally to support dozens of simultaneous regression tests spread out over dozens of workers
• Jenkins automatically guards against Spot termination

Ooyala
• Video technology platform that serves ESPN, Bloomberg, ...
• Uses combo of OD/RI/Spot to ensure it can cover predicted volumes while keeping costs low
• http://aws.amazon.com/solutions/case-studies/ooyala/

Vevo
• Library of over 75,000 HD videos
• Must be able to rapidly transcode library to a new screen format
• Can spin up 100s of Spot instances to transcode entire library in a matter of days (instead of weeks)

Queue-based media transcoding

Using Spot Fleet

An example..

Using Spot Fleet

Create EC2 Spot Fleet IAM Role

Requesting a fleet:
• aws ec2 request-spot-fleet --spot-fleet-request-config file://mySmallFleet.json

Describe fleet:
• aws ec2 describe-spot-fleet-requests
• aws ec2 describe-spot-fleet-requests --spot-fleet-request-ids <sfr-………..>

Describe instances within the fleet:
• aws ec2 describe-spot-fleet-instances --spot-fleet-request-id <sfr-…………>

Cancel Spot Fleet (with termination):
• aws ec2 cancel-spot-fleet-requests --spot-fleet-request-ids <sfr-…………..> --terminate-instances

mySpotFleet.json

{
  "TargetCapacity": 5,
  "SpotPrice": "1.00",
  "IamFleetRole": "arn:aws:iam::962872214910:role/fleetRole",
  "LaunchSpecifications": [
    {
      "ImageId": "ami-ff527ecf",
      "InstanceType": "m1.small"
    },
    {
      "ImageId": "ami-ff527ecf",
      "InstanceType": "m1.medium"
    },
    {
      "ImageId": "ami-ff527ecf",
      "InstanceType": "m1.large"
    }
  ]
}
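A request file with this shape can also be generated rather than written by hand. The sketch below only builds and prints the JSON; it does not call AWS. The `build_fleet_config` helper and the account ID `123456789012` in the role ARN are hypothetical placeholders, and only the four fields shown on the slide are produced.

```python
import json


def build_fleet_config(target_capacity, spot_price, iam_fleet_role, image_id, instance_types):
    """Build a Spot Fleet request config with one launch specification per instance type."""
    if target_capacity < 1:
        raise ValueError("target_capacity must be at least 1")
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "TargetCapacity": target_capacity,
        "SpotPrice": f"{spot_price:.2f}",  # the config carries the price as a string
        "IamFleetRole": iam_fleet_role,
        "LaunchSpecifications": [
            {"ImageId": image_id, "InstanceType": instance_type}
            for instance_type in instance_types
        ],
    }


config = build_fleet_config(
    target_capacity=5,
    spot_price=1.00,
    iam_fleet_role="arn:aws:iam::123456789012:role/fleetRole",  # placeholder account ID
    image_id="ami-ff527ecf",
    instance_types=["m1.small", "m1.medium", "m1.large"],
)
print(json.dumps(config, indent=2))
```

Listing several instance types gives the fleet more than one capacity pool to draw from, in line with the multiple-pool advice earlier in the deck.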
