Deploying Software in an Autoscaled AWS Environment
Jeff Horwitz
Director of Cloud Engineering
Presented at Philly DevOps January 20, 2015
• Members get 2-day free shipping and returns
• Benefits apply across a variety of retailers
• Extending our reach into China w/ Alipay
Applications
• Many different applications (10+)
• Each with its own repository and set of servers
• Multiple deployments per day
AWS @ ShopRunner
• Infrastructure is 100% in the cloud
• AWS + other services
• Heavy use of VPC, AutoScaling, CloudFormation
How We Launch
• Everything launched via a CloudFormation Stack
• Use nested stacks to stay DRY
• single_instance.json
• autoscaling_group.json
• CloudInit bootstraps each instance
• Puppet applies role-specific configurations
ASG in CloudFormation
...
"ContentGroup": {
  "Type": "AWS::CloudFormation::Stack",
  "Properties": {
    "Parameters": {
      "ServerEnvironment": "prd",
      "ServerRole": "content",
      "InstanceType": "m3.medium",
      "LoadBalancerNames": { "Ref": "ContentELB" },
      "AvailabilityZones": { "Fn::Join": [ ",", { "Ref": "AvailabilityZones" } ] },
      "VPCZoneIdentifier": { "Fn::Join": [ ",", { "Ref": "ASGroupSubnets" } ] },
      "SecurityGroupIds": { "Fn::Join": [ ",", { "Ref": "ContentSecurityGroup" } ] },
      "DesiredASCapacity": 3,
      "MinASSize": 3,
      "MaxASSize": 6
    },
    "TemplateURL": "https://s3.amazonaws.com/BUCKET/cloudformation/autoscaling_group.json",
    "TimeoutInMinutes": 30
  }
}
...
Waiting for Puppet
• Puppet can take some time to run
• Group shouldn't go live until Puppet is finished
• Use CloudFormation Wait Conditions
• Wait for stack status CREATE_COMPLETE
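Waiting for CREATE_COMPLETE can be sketched as a polling loop. This is a minimal sketch, not the tooling used in the talk: it assumes `cfn` is a boto3 CloudFormation client (or any object with the same `describe_stacks` call), and the function name, timeout, and polling interval are illustrative.

```python
import time

# Statuses from which a stack creation can no longer succeed.
FAILED_STATUSES = {
    "CREATE_FAILED",
    "ROLLBACK_IN_PROGRESS",
    "ROLLBACK_COMPLETE",
    "ROLLBACK_FAILED",
    "DELETE_IN_PROGRESS",
    "DELETE_COMPLETE",
    "DELETE_FAILED",
}


def wait_for_create_complete(cfn, stack_name, timeout=1800, interval=15,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll the stack until it is CREATE_COMPLETE; raise on failure or timeout.

    `cfn` is assumed to behave like boto3.client("cloudformation").
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout
    while True:
        stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
        status = stack["StackStatus"]
        if status == "CREATE_COMPLETE":
            return status
        if status in FAILED_STATUSES:
            raise RuntimeError(f"stack {stack_name} entered {status}")
        if clock() >= deadline:
            raise TimeoutError(f"stack {stack_name} still {status} after {timeout}s")
        sleep(interval)
```

Because the wait condition holds the stack in CREATE_IN_PROGRESS until every instance has signalled, a caller that blocks on this function only proceeds once Puppet has finished on the whole group.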
Puppet Wait Conditions
"PuppetWaitHandle": {
  "Type": "AWS::CloudFormation::WaitConditionHandle",
  "Properties": {}
},
"PuppetWaitCondition": {
  "Type": "AWS::CloudFormation::WaitCondition",
  "DependsOn": "AutoScalingGroup",
  "Properties": {
    "Handle": { "Ref": "PuppetWaitHandle" },
    "Timeout": "1800",
    "Count": { "Ref": "DesiredASCapacity" }
  }
},
Signal Wait Handler
...
"command": { "Fn::Join": [ "", [
  "/opt/aws/bin/cfn-signal -s $success ",
  "-r \"puppet agent exited with code $rc\" ",
  "-i \"puppet-signal-$EC2_INSTANCE_ID\" '",
  { "Ref": "PuppetWaitHandle" },
  "'"
...
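The slide does not show how `$success` and `$rc` are set. One possible mapping, as a sketch: it assumes the agent is run with `--detailed-exitcodes` (where 0 means no changes and 2 means changes were applied), and the function name and example handle URL are hypothetical. Only the `-s`, `-r`, and `-i` flags from the slide are used.

```python
def puppet_signal_args(rc, instance_id, wait_handle_url):
    """Build the cfn-signal argument list for a finished puppet run.

    Assumption: puppet agent was invoked with --detailed-exitcodes, so
    exit codes 0 and 2 are successful runs and anything else is a failure.
    """
    success = "true" if rc in (0, 2) else "false"
    return [
        "/opt/aws/bin/cfn-signal",
        "-s", success,
        "-r", f"puppet agent exited with code {rc}",
        "-i", f"puppet-signal-{instance_id}",
        wait_handle_url,
    ]


if __name__ == "__main__":
    # Hypothetical values; on an instance the URL comes from PuppetWaitHandle.
    print(" ".join(puppet_signal_args(2, "i-0abc123", "https://example.invalid/handle")))
```

The per-instance `-i` value matters because the wait condition's Count equals the desired capacity: each instance must send a distinct signal for the count to be reached.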
Legacy Deployments
• one long-lived AS group per application
• per-application scripts launch AS groups
• scripts pull code from git into an EBS volume
• create snapshot and upload ID to S3
• rsync volume to servers in existing AS group
• restart services as necessary
Problems
• scripts w/o CloudFormation diverge quickly
• can't easily launch multiple versions
• no association with a tag/branch/commit
• rsync changes code on running servers
• can't easily stage new code before deploying
• can't easily warm servers before deploying
• no clean or consistent rollback procedure
Solutions
• stop treating our infrastructure like it's static
• create new stacks for each deployment
• store state in etcd
• stop deploying code changes
• start deploying stacks
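Creating a new stack per deployment implies that each deployment gets its own stack name. The talk does not give a naming scheme, so the following is only a sketch under assumptions: the `app-environment-version` layout and the function name are hypothetical. The one firm constraint it encodes is CloudFormation's own rule that stack names start with a letter, contain only letters, digits, and hyphens, and are at most 128 characters.

```python
import re


def deployment_stack_name(app, environment, version):
    """Derive a CloudFormation-safe stack name for one deployment.

    The app-environment-version layout is an illustrative convention,
    not one taken from the talk.
    """
    raw = f"{app}-{environment}-{version}"
    # Replace anything CloudFormation does not allow in a stack name.
    name = re.sub(r"[^A-Za-z0-9-]", "-", raw)[:128]
    if not name or not name[0].isalpha():
        raise ValueError(f"stack name must start with a letter: {name!r}")
    return name


if __name__ == "__main__":
    print(deployment_stack_name("content", "prd", "v1.2.3"))
```

Tying the name to a tag or commit also addresses the earlier problem of deployments having no association with a tag, branch, or commit.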
Tenets of SR Deployments
• Unit of deployment is the stack
• Deployed servers are immutable
• Deployments are reproducible
• Fail back to old stacks, fail forward to new stacks
• DB migrations should be backwards compatible
• Test on the same configuration as production
ELB Catch-22
• new instances added to ELB once running
• autoscaling needs services to start automatically
• what if we're not ready?
• what if the service is actually broken?
• wait to associate ELB w/ ASG? can't do that!
Delay Service Start?
• configure instances not to start services on launch and only start services when ready to deploy
• manage with manual steps or custom code
• initial launch versus scale-out event
• feature flags (etcd, other orchestration)
Lifecycle Hooks FTW
• Register hooks for ASG lifecycle events
• Lifecycle halts until told to proceed
• Can launch our group but tell it not to go live
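Registering a launch hook and later releasing an instance could look like the sketch below. It assumes `autoscaling` is a boto3 Auto Scaling client (or an object with the same `put_lifecycle_hook` and `complete_lifecycle_action` calls); the hook name, timeout, and function names are hypothetical, and this is not the code used in the talk.

```python
HOOK_NAME = "hold-until-released"  # hypothetical hook name


def hold_new_instances(autoscaling, group_name, hook_name=HOOK_NAME, timeout=3600):
    """Make launching instances pause in the pending wait state."""
    autoscaling.put_lifecycle_hook(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        LifecycleTransition="autoscaling:EC2_INSTANCE_LAUNCHING",
        HeartbeatTimeout=timeout,
        # If nobody completes the action in time, do not go live.
        DefaultResult="ABANDON",
    )


def release_instance(autoscaling, group_name, instance_id,
                     hook_name=HOOK_NAME, go=True):
    """Complete the lifecycle action: CONTINUE to go live, ABANDON to bail out."""
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE" if go else "ABANDON",
    )
```

Choosing ABANDON as the default result is a design choice in this sketch: an instance that nobody explicitly releases is terminated rather than silently put into service.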
Autoscaling Lifecycle Hooks
Copied from AWS documentation at http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroupLifecycle.html
Pre-deployment: Lonely ELB
Deployment #1: Launch Autoscaling Group
ASG v1 (PENDING)
Cache warming, launch status check, "go/no-go"
OOPS I launched the wrong thing -- run away!
Deployment #1: Deploy Autoscaling Group
ASG v1 (GO LIVE)
Deployment #1: Deploy Autoscaling Group
ASG v1
Deployment #2: Launch Autoscaling Group
ASG v1, ASG v2 (PENDING)
Deployment #2: Deploy Autoscaling Group
ASG v1, ASG v2 (GO LIVE)
Deployment #2: Multiple ASG Backends
ASG v1, ASG v2
Deployment #2: REVERT!
ASG v1, ASG v2
Deployment #2: Multiple ASG Backends
ASG v1, ASG v2
Deployment #2: Remove ASG v1
ASG v1, ASG v2
Deployment #2: Suspend Processes on ASG v1
ASG v1 (no scaling, no ELB), ASG v2
Deployment #2: Delete ASG v1
ASG v2
Deployment #2: ASG v2 Deployed
ASG v2
Suspend/Resume
• Launch
• Terminate
• AddToLoadBalancer
• AlarmNotification
• AZRebalance
• HealthCheck
• ReplaceUnhealthy
• ScheduledActions
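Suspending and resuming these processes might be wrapped as follows. This is a sketch, assuming `autoscaling` is a boto3 Auto Scaling client or an object exposing the same `suspend_processes` and `resume_processes` calls; the helper names are hypothetical and the process list is the one from the slide.

```python
# Process names as listed on the slide.
KNOWN_PROCESSES = {
    "Launch", "Terminate", "AddToLoadBalancer", "AlarmNotification",
    "AZRebalance", "HealthCheck", "ReplaceUnhealthy", "ScheduledActions",
}


def _checked(processes):
    """Return the process names sorted, rejecting any not in the known list."""
    unknown = set(processes) - KNOWN_PROCESSES
    if unknown:
        raise ValueError(f"unknown scaling processes: {sorted(unknown)}")
    return sorted(processes)


def suspend(autoscaling, group_name, processes):
    """Suspend the named processes on one Auto Scaling group."""
    autoscaling.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=_checked(processes),
    )


def resume(autoscaling, group_name, processes):
    """Resume previously suspended processes on one Auto Scaling group."""
    autoscaling.resume_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=_checked(processes),
    )
```

For example, `suspend(client, "asg-v1", ["Launch", "AddToLoadBalancer"])` would stop an old group from launching new instances or registering them with the ELB, which is one reading of the "no scaling, no ELB" state shown for ASG v1.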
Standby State
• Removes instances from autoscaling group
• Resources are still managed by the group
• Option to maintain capacity while in standby
• Once ready, return the instance to service
• Great for debugging w/o affecting capacity
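The debugging use case can be sketched as a context manager that puts instances into standby and always returns them to service. It assumes `autoscaling` is a boto3 Auto Scaling client or an object with the same `enter_standby` and `exit_standby` calls; the `standby` helper and its `keep_capacity` parameter are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def standby(autoscaling, group_name, instance_ids, keep_capacity=True):
    """Hold instances in Standby for the duration of a `with` block.

    keep_capacity=True leaves the desired capacity unchanged, so the group
    launches replacements while the instances are out of service.
    """
    ids = list(instance_ids)
    autoscaling.enter_standby(
        InstanceIds=ids,
        AutoScalingGroupName=group_name,
        ShouldDecrementDesiredCapacity=not keep_capacity,
    )
    try:
        yield ids
    finally:
        # Return the instances to service even if debugging raised an error.
        autoscaling.exit_standby(InstanceIds=ids, AutoScalingGroupName=group_name)
```

Usage would look like `with standby(client, "asg-v1", ["i-0abc"]): ...`, with the debugging work done inside the block.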
Attach/Detach Instances
• Relatively new feature
• Use to attach to a pre-launch testing ASG/ELB
• Move instances to production ASG when ready
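Moving instances from a pre-launch testing group to the production group might be sketched like this. It assumes `autoscaling` is a boto3 Auto Scaling client or an object with the same `detach_instances` and `attach_instances` calls. The `move_instances` helper and its `wait_until_detached` callback are hypothetical; detaching is asynchronous, so real code would need some way to wait before attaching.

```python
def move_instances(autoscaling, instance_ids, from_group, to_group,
                   wait_until_detached=lambda ids: None):
    """Detach instances from one group and attach them to another.

    wait_until_detached is a caller-supplied hook (a no-op by default)
    that should block until the instances have left from_group.
    """
    ids = list(instance_ids)
    autoscaling.detach_instances(
        InstanceIds=ids,
        AutoScalingGroupName=from_group,
        # Shrink the testing group instead of launching replacements.
        ShouldDecrementDesiredCapacity=True,
    )
    wait_until_detached(ids)
    autoscaling.attach_instances(InstanceIds=ids, AutoScalingGroupName=to_group)
    return ids
```

Attaching raises the target group's desired capacity by the number of instances attached, so the production group's maximum size has to leave room for them.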
Deployment Procedure
1. Build the app.
2. Create snapshot and register it in etcd.
3. Launch a deployment with the build snapshot.
4. Perform pre-launch tasks (warming, etc.).
5. Release deployment (completes lifecycle).
6. Revert to or remove old deployment.
7. Delete old deployment.
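The procedure above can be read as a small state machine per deployment. The sketch below is only an illustration: the state names, the key layout, and the use of a plain dict as a stand-in for etcd are all assumptions, not details from the talk.

```python
# Allowed moves between hypothetical deployment states.
TRANSITIONS = {
    "registered": {"launched"},            # steps 1-2 done, step 3 next
    "launched": {"released", "deleted"},   # step 4 happens here; may abort
    "released": {"removed"},               # step 5 done; step 6 can remove it
    "removed": {"released", "deleted"},    # revert puts it back; step 7 deletes
    "deleted": set(),
}


def advance(store, app, version, new_state):
    """Record a state change for one deployment, rejecting invalid moves.

    `store` is any dict-like object; here it stands in for etcd.
    """
    key = f"/deployments/{app}/{version}"
    current = store.get(key)
    if current is None:
        if new_state != "registered":
            raise ValueError(f"{key} must be registered first")
    elif new_state not in TRANSITIONS[current]:
        raise ValueError(f"{key}: cannot go from {current} to {new_state}")
    store[key] = new_state
    return new_state
```

The `removed` to `released` transition is what makes failing back cheap: an old deployment that has been taken out of service but not yet deleted can be put back behind the ELB.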
Future Work
• Test instances with a pre-launch testing ELB
• Register Jenkins builds for deployment
• Support multiple environments
• UI/Dashboard