JAX DevOps 2017: Succeeding in the Cloud – The Guidebook of Fail

Upload: steve-poole

Post on 15-Apr-2017

Page 1: Jax Devops 2017  Succeeding in the Cloud – the guidebook of Fail

The Guidebook of Fail: Succeeding in the Cloud

Page 2

Steve Poole – IBM
Making Java Real Since Version 0.9
DevOps Practitioner @spoole167

Page 3

This talk
• Comes from personal and team experience as the leader of a DevOps team
• Comes from weekly consultancy with product teams and external customers

Page 4

Agenda of Fail
• Fail 0 – Believing Migration to Cloud is easy
• Fail 1 – No Clarity of Purpose
• Fail 2 – Lack of education
• Fail 3 – Not kicking the tires enough first
• Fail 4 – Ignoring unpleasant discoveries
• Fail 5 – Fudging the hard decisions
• Fail 6 – Lack of preparation
• Fail 7 – Not enough exercise
• Fail 8 – Too much excitement
• Fail 9 – Big bang deployment
• Fail A – A few other things

Page 5

Fail 0.0 : Believing Migration to Cloud is Easy

• ‘Cloud’ is not easy
  • It may be self-service but don’t be fooled
  • It may look like a nice walk into the forest to grandma’s house...
  • Get yourself together for a large and painful exercise.
  • Ever moved a Data Centre?
• Experience is key.
  • Staff. Who’s going to do this – are they qualified?
• Prepare to change your plans
  • Most migrations require architectural design changes within the first 6 months
  • Half of all projects fail
  • Half of all projects will need significant increases in budget
• Think it through
  • Projects fail later on when new objectives get added, so be clear on your ultimate goal

Emigration not Migration
(Migration suggests it’s something you want to do annually)

Page 6

Fail 1.0 : No Clarity of Purpose

• There are many reasons for moving applications to the ‘Cloud’
• There are many types of application
• There are many ‘Clouds’ to move to

• What’s the chance of you getting it right first time?
• What’s the consequence of failure?
• Will you even know it has failed in time to recover?
• Clarity of purpose reduces your risk
• Clarity of purpose gives you focus

Page 7

Fail 2.0 : Lack of Education or RTFM!

Not understanding the communications process

• How do they talk to you?
• What’s the ticketing system?
• How do you get told of a problem?
• How do you get told of planned outages?
• How much notice do you get for planned outages?
• How do you raise a problem?
• How do you ESCALATE?
• What is the communications SLA here?
• Know your rights

DOH! Ask me about passwords

Page 8

Fail 2.1 : Lack of Education or RTFM!

I was a single point of failure
And I didn’t even know it

I thought I was in control of my account until I needed my password reset.

I had no idea where the reset email was going to.
Cloud support could trigger the reset but wouldn’t/couldn’t tell me more.
They suggested I go to my Admin!! Which I thought was me.
Turns out there’s a corporate owner of the accounts. It took me days to resolve.

Page 9

Fail 2.2 : Lack of Education or RTFM!

The one thing you should remember from this talk

We’re techies. We get excited about APIs. We understand APIs.
Moving to the Cloud means giving your data, applications, security etc. to a 3rd party.

That means the ‘API’ extends into the human world. The contract and its SLA define what you can and cannot do when using Cloud services.

Cloud providers benefit from economies of scale and have large numbers of customers.
Just like the more usual service providers you use at home: Gas, Electricity, Broadband, Satellite TV.
You know how that can work at home. Cloud provisioning is much more complicated.

Page 10

Fail 2.3 : Lack of Education or RTFM!

Not understanding the Service Level Agreement

• Does it have location specific differences?
• How is the SLA measured?
• How well defined are the criteria?
• How are issues resolved?
• What are your responsibilities?

• If you don’t know your SLA you will fail

Example: Can you assess free capacity? If a location is at capacity what happens?

Page 11

Fail 2.4 : Lack of Education or RTFM!

Not understanding the Service Level Agreement (2)

• True story
• Go to a service provider SLA dashboard
• Service says SLA availability of 99.5%
• I think that means this much downtime:
  Daily: 43.2s
  Weekly: 5m 2.4s
  Monthly: 21m 54.9s
  Yearly: 4h 22m 58.5s
• Turns out that actual availability is 95.8%, which means:
  Daily: 1h 0m 28.8s
  Weekly: 7h 3m 21.6s
  Monthly: 1d 6h 40m 49.3s
  Yearly: 15d 8h 9m 52.0s

Downtime figures from https://uptime.is/

Page 12

Fail 2.5 : Lack of Education or RTFM!

Not understanding the Service Level Agreement (3)

• True story
• The difference is because the provider has a planned daily outage of 1hr
• They still claim 99.5%
• It gets worse.
• Outages beyond their control don’t ‘count’ either.

85%
• Daily: 3h 36m 0.0s
• Weekly: 1d 1h 12m 0.0s
• Monthly: 4d 13h 34m 21.9s
• Yearly: 54d 18h 52m 22.8s
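The downtime figures on these SLA slides are plain arithmetic, so they are easy to check for yourself. Below is a minimal sketch, assuming an average Gregorian year of 365.2425 days (an assumption that reproduces the quoted figures); the function names are illustrative. It also shows that the first set of figures (43.2s per day) matches 99.95% availability, whereas 99.5% would allow 7m 12s per day.

```python
from datetime import timedelta

# Period lengths in seconds. Month and year use the average Gregorian year
# (365.2425 days); this is an assumption chosen to match the slide's figures.
YEAR = 365.2425 * 86400
PERIODS = {
    "daily": 86400.0,
    "weekly": 7 * 86400.0,
    "monthly": YEAR / 12,
    "yearly": YEAR,
}


def allowed_downtime(availability_pct):
    """Return the downtime in seconds that each period permits at this availability."""
    fraction_down = 1.0 - availability_pct / 100.0
    return {name: seconds * fraction_down for name, seconds in PERIODS.items()}


def availability_with_daily_outage(outage_hours):
    """Best-case availability (percent) if the service is down this long every day."""
    return (24.0 - outage_hours) / 24.0 * 100.0


if __name__ == "__main__":
    for pct in (99.95, 99.5, 95.8, 85.0):
        summary = ", ".join(
            f"{name}: {timedelta(seconds=round(secs))}"
            for name, secs in allowed_downtime(pct).items()
        )
        print(f"{pct}% -> {summary}")
    print(f"1h planned outage per day -> {availability_with_daily_outage(1):.1f}%")
```

Running the same calculation against the SLA you are offered, with planned outages included, is a quick way to see what the headline percentage really permits.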

Page 13

Fail 2.6 : Lack of Education or RTFM!

Not understanding the cost model

• Units of cost.
• CPU / RAM / Network / Storage / IP Addresses ...
• Penalty costs if you overrun?
• When does the time start and end?
• Costs change by location?

DOH! Ask me about GPUs

Page 14

Fail 2.7 : Lack of Education or RTFM!

Not understanding the cost model

• I’m testing new GPU support in IBM’s JVM 8.0
• IBM has GPU support in SoftLayer
• Amazon has GPU support in AWS
• I want to do some scale performance testing
• Got my VirtualBox and Ansible config
• Point it at AWS. Deploy < 1hr x 2
• Costs me $39 ?
• Other charges included

p2.16xlarge: 16 GPU, 64 vCPU, 732 GB RAM, $14/hr
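One way such a bill can surprise you is billing granularity. The sketch below assumes every started hour is billed in full (an assumption for illustration; check your provider's actual rules) and uses the $14/hr rate from the slide. The run durations are hypothetical, since the slide only says each deployment lasted under an hour.

```python
import math

HOURLY_RATE_USD = 14.0  # p2.16xlarge rate quoted on the slide


def billed_cost(run_minutes, hourly_rate=HOURLY_RATE_USD):
    """Cost of a list of runs when every started hour is billed in full.

    Whole-hour billing is an assumption for illustration only.
    """
    billed_hours = sum(math.ceil(minutes / 60) for minutes in run_minutes)
    return billed_hours * hourly_rate


if __name__ == "__main__":
    runs = [45, 50]  # hypothetical durations: two deployments, each under an hour
    compute = billed_cost(runs)
    print(f"compute charge: ${compute:.2f}")
    print(f"left to explain from a $39 bill: ${39 - compute:.2f}")
```

Under these assumptions compute accounts for $28 of the $39; the rest comes from the "other charges" that are easy to overlook.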

Page 15

Fail 2.8 : Lack of Education or RTFM!

Not understanding how security and compliance are managed

• What are the security, compliance and image update policies?
• How did they handle the last pervasive vulnerability?
• Firewalls – do you get one for free? Can you configure it? What’s the default policy for firewalls with deployments?
• SSL certificates – do you own and manage them or do they offer a service?
• How do you access your VMs? (ssh, telnet, web?)
• Passwords vs keys?
• Where are the keys kept?
• Can you retrieve the keys in an emergency?

You do understand penetration attack vectors?

Page 16

Fail 2.9 : Lack of Education or RTFM!

Misunderstanding what APIs exist

• Are there APIs for all the actions you want to perform?
• Are they symmetrical?
• Do any need human interaction to complete?
• Are the APIs proprietary or standard?
• Are there plugins for IaC tools?

DOH! Ask me about VM termination APIs

Page 17

Fail 2.A : Lack of Education or RTFM!

Lack of a Community

• What do others think of this Cloud?
• Is there an active DevOps community?
• Do you see active participation from the Cloud provider?

Page 18

Fail 3.0 : Not Kicking the tires enough first

Poor assumptions about ‘how things work’

• For instance:
  • “I don’t need a public IP address for my VM as I have a private gateway”
  • “Now I can’t do apt-get update!”
  • “What do you mean I have to buy public IP addresses?”

Page 19

Fail 3.1 : Not Kicking the tires enough first

“The human touch”

• If you don’t start with IaC techniques from Day 1 you will fail.
  • Environments are all different
  • Is your memory good enough?
  • You must encode.
• Trying by hand and then encoding into IaC
  • Helps you learn about your target environments (APIs, anyone?)
  • Builds up an IaC asset base you’ll need in the future.

Page 20

Fail 3.2 : Not Kicking the tires enough first

• Get a buddy - “Extreme Deployment”
• Install VirtualBox and Vagrant
  • Build a Vagrantfile for an environment you care about
  • Provision locally: “vagrant up --provider=virtualbox”
• Pick a Cloud. (Use the ‘free tier’!)
  • Try to deploy a VM by hand.
  • Now do “vagrant up --provider=XXXXXXX”
• Examine the differences.
• Add more and repeat

Look for how IP addresses are allocated.
Look at the options for memory size, networking, disk space, disk types (IO speeds).
What CPUs can you get?
What OSs can you provision?
What architectures are available?
What’s the cost?
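"Examine the differences" is easier if you write down what each environment gave you and compare the records mechanically. The following is a minimal sketch; the attribute names and values are hypothetical and would come from whatever you observed locally and in your chosen Cloud.

```python
def diff_environments(local, cloud):
    """Return {setting: (local_value, cloud_value)} for every setting that differs."""
    keys = sorted(set(local) | set(cloud))
    return {
        key: (local.get(key), cloud.get(key))
        for key in keys
        if local.get(key) != cloud.get(key)
    }


if __name__ == "__main__":
    # Hypothetical observations from a local VirtualBox VM and a cloud VM.
    local_vm = {"memory_mb": 2048, "cpus": 2, "public_ip": False, "disk_type": "hdd"}
    cloud_vm = {"memory_mb": 2048, "cpus": 1, "public_ip": True, "disk_type": "ssd"}
    for setting, (local_value, cloud_value) in diff_environments(local_vm, cloud_vm).items():
        print(f"{setting}: local={local_value} cloud={cloud_value}")
```

Settings that exist in only one environment show up with `None` on the other side, which is often where the surprises are.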

Page 21

Fail 3.3 : Not Kicking the tires enough first

Try another Cloud
Try someone’s IaC pattern
Ansible script to deploy a docker swarm
Go wild:
Try to deploy OpenStack on your laptop (with 32GB)
https://www.rdoproject.org/

Now do it all again with Docker

Page 22

Fail 3.4 : Not Kicking the tires enough first

Not understanding that your initial deploys are the least secure

• How long until your newly deployed VM is attacked? 20 seconds -> 40 minutes
• So deploying and then adding vulnerability patches is not the right answer

• War story:
  • Customer deploys a VM to Cloud.
  • VM gets hacked immediately
  • Customer patches the VM.
  • Customer keeps the VM and uses it in production
  • Customer gets bill for $500,000 network traffic. VM is now being used to host warez

Page 23

Fail 3.5 : Not Kicking the tires enough first

• Time to think about security
  • If you don’t get your security posture defined before you deploy you’ll fail and possibly get some interesting bills
  • Maybe you’ll go out of business.
  • Worst case (maybe) is you have provided a gateway into your company network

• Regular vulnerability scanning & fixing.
• Keys not passwords
• Specific IP address access for VMs
• Whitelisted access to internal systems (inside your firewall)
• Whitelisted access to remote systems (on the internet) ...
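"Specific IP address access" and whitelisting come down to checking a source address against a short list of permitted networks. In practice the firewall or security group does this for you; the sketch below only illustrates the logic using Python's standard `ipaddress` module. The networks listed are documentation-range addresses chosen for illustration, not real hosts.

```python
import ipaddress

# Hypothetical allowlist using documentation-range addresses (RFC 5737).
ALLOWED_SOURCES = [
    ipaddress.ip_network("203.0.113.0/28"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_allowed(source_ip, allowed=ALLOWED_SOURCES):
    """Return True only if source_ip falls inside one of the allowed networks."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)


if __name__ == "__main__":
    for candidate in ("203.0.113.5", "203.0.113.16", "198.51.100.7"):
        print(candidate, "allowed" if is_allowed(candidate) else "denied")
```

The important design choice is default deny: anything not explicitly listed is rejected.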

Page 24

Fail 4.0 – Ignoring unpleasant discoveries

So now you know some of those ‘unexpected’ restrictions

• Not all the OSs you want are there
• Performance of the Cloud is less than you expected
  • Now you know what multi-tenancy means.
• Managing VMs in the Cloud is complicated
• Keeping systems secure and compliant is hard
• Deployment times vary (and deployments fail unexpectedly)
• Debugging problems remotely is difficult
• It costs more than you realized.
• Cost is your responsibility. (No one is going to help you save money!)
• Clouds fill up

Initial cloud deployments are juicy targets for the bad guys

Page 25

Fail 4.1 – Ignoring unpleasant discoveries

• Deploy anyway.
• Just run with a smaller JVM heap
• OK, I get it won’t scale – deploy anyway and we’ll fit scaling in later
• You’ll just have to deploy with a small budget for VMs
• Use the public multitenancy option – it’s cheaper.
• Can’t you add some sort of cache?

I’m impressed by the number of customers who can change the rules of physics

Page 26

Fail 5.0 – Fudging the hard decisions

Not realizing Clouds are sticky

Many of my consultancy discussions started with a company saying to itself:
“It’s ok. If Cloud XXX is too expensive we’ll just move over to YYY”

• You have to pick one. Changing your mind later is going to be expensive and complicated
• IaC is critical but it’s not magic.

Page 27

Fail 5.1 – Fudging the hard decisions

Compromising the architecture because of cost

Unexpected expensive items (such as network costs) can drive you to weird hybrid configurations that increase complexity and ultimately fail.

For instance:
• A large rich-client application used in-house in multiple locations.
• Plan was to consolidate into the Cloud.
• Network traffic between client and servers measured in TBs / day.
• To reduce costs, plan was to create special proxies/data caches on-prem.
• Consequence: increased complexity of design, poor performance, untried new system -> fail.
• Should have spent the money on replacing the rich client with a web-based one.

Page 28

Fail 5.2 – Fudging the hard decisions

Not understanding the cost projections

Offering                    RAM Cost (2015)    CPUs
IBM Bluemix (CF)            $24.15 GB/Month    4 vCPUs per instance
IBM Bluemix (Containers)    $ 9.94 GB/Month    4 vCPUs per GB
run.pivotal.io              $21.60 GB/Month    4 vCPUs per instance
Heroku (Hobby)              $14.00 GB/Month    1 "CPU share" per 512MB in an instance
Heroku (Professional)       $50.00 GB/Month    1 "CPU share" per 512MB in an instance
Amazon EC2 (SLES)           $16.56 GB/Month    1 vCPU per 4GB in an instance

Old data for example only
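A per-GB price only becomes meaningful once you multiply it by what you actually plan to run. The sketch below projects a monthly RAM bill from the 2015 prices in the table above; like the table, it is for illustration only, and the function name and the 8 GB example are assumptions.

```python
# Price per GB of RAM per month (USD, 2015), copied from the table above.
RAM_PRICE_PER_GB_MONTH = {
    "IBM Bluemix (CF)": 24.15,
    "IBM Bluemix (Containers)": 9.94,
    "run.pivotal.io": 21.60,
    "Heroku (Hobby)": 14.00,
    "Heroku (Professional)": 50.00,
    "Amazon EC2 (SLES)": 16.56,
}


def monthly_ram_cost(ram_gb, months=1):
    """Return {offering: projected RAM cost in USD}, cheapest first."""
    costs = {
        name: round(price * ram_gb * months, 2)
        for name, price in RAM_PRICE_PER_GB_MONTH.items()
    }
    return dict(sorted(costs.items(), key=lambda item: item[1]))


if __name__ == "__main__":
    for offering, cost in monthly_ram_cost(ram_gb=8).items():
        print(f"{offering:26s} ${cost:8.2f} per month")
```

RAM is only one unit of cost. A fuller projection would add CPU, network, storage and IP addresses in the same way.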

Page 29

Fail 6.0 – Lack of preparation

Driving straight into live deployment

Premature deployment based on the happy path will ultimately fail.

It is critical that you have exercised an end-to-end deployment and support model before you go live.

So many projects fail because of problems later.

Even simple applications need security, logging and monitoring.

Page 30

Fail 6.1 – Lack of preparation

Not having a solid monitoring and diagnostics solution

Most successful cloud applications consider their monitoring solution to be the most critical part of their system.

If your monitoring solution fails – you’re running blind.

Build the monitoring system and then exercise it.
Break things, scale things, build runaway jobs.

Figure out what is important and monitor it.
Now build dashboards.
Do you get the events you need when you need them?

Are you measuring end user response times?
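Measuring end user response times usually means watching a high percentile, not the average, because a few very slow requests are what users notice. Here is a minimal sketch using the nearest-rank percentile; the 500 ms threshold and the sample values are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def response_time_alert(samples_ms, threshold_ms=500.0, pct=95):
    """Return an alert message if the chosen percentile exceeds the threshold, else None."""
    observed = percentile(samples_ms, pct)
    if observed > threshold_ms:
        return f"p{pct} response time {observed:.0f} ms exceeds {threshold_ms:.0f} ms"
    return None


if __name__ == "__main__":
    # Hypothetical response times in milliseconds; one request was very slow.
    samples = [120, 130, 110, 140, 900, 125, 135, 115, 128, 132]
    print(response_time_alert(samples) or "response times OK")
```

A real monitoring system would feed this kind of check from live measurements and route the message to a dashboard or pager.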

Page 31

Fail 6.2 – Lack of preparation

Not having enough dashboards!

My team was a traditional IT one.
We responded to tickets – so customers always found the problem first.

We added dashboards and an objective: “First to Know”.
We moved from being the last to know to being the one to tell the customer.

Dashboards allowed my team to see issues clearly when there was a failure and when trends showed bad things were going to happen.

Dashboards changed my team’s attitudes. They make automation and monitoring more acceptable.

Page 32

Fail 6.3 – Lack of preparation

Not having a robust and automated deployment solution

After your application goes live things will go wrong.
It’s not just about having a robust application design. How quickly you can remediate issues is dependent on your ability to deliver those fixes.

Design for Failure. "Everything fails, all the time". Werner Vogels, CTO Amazon.com

Your deployment solution is your disaster recovery solution

Page 33

Fail 6.4 – Lack of preparation

If your deployment solution is not your disaster recovery solution

Cloud location goes off-line -> can you fail over to a new location?
What happens if your database gets corrupted?
Where is your data backed up to?
Can you get the data back into the Cloud fast enough?
Who does the backups? When was the last backup taken?
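"When was the last backup taken?" is a question a script can ask every hour so that you do not have to. Below is a minimal sketch, assuming you can obtain the timestamp of the last successful backup from your own tooling; the 24-hour limit and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def backup_is_fresh(last_backup, max_age, now=None):
    """Return True if the last backup is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup <= max_age


if __name__ == "__main__":
    # Hypothetical values: a backup taken 30 hours ago against a 24-hour limit.
    last = datetime.now(timezone.utc) - timedelta(hours=30)
    if not backup_is_fresh(last, max_age=timedelta(hours=24)):
        print("ALERT: last backup is older than 24 hours")
```

A fresh backup is necessary but not sufficient: only a test restore tells you whether you can get the data back fast enough.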

Page 34

Fail 7.0 – Not enough exercise

Not testing how your application scales

Scale testing reveals bottlenecks.
Even just running two instances can be revealing.
Break things too (chaos monkey).

Your aim is to understand how well your application can react to demand.

Scale across Cloud locations - Data costs increase? Response times get worse? Timeouts occur?

Scale testing reveals design issues in application and infrastructure. Things you want to know about before you go live. And it tells you if your monitoring is going to be any use.

Page 35

Fail 7.1 – Not enough exercise

Failing to scale appropriately costs money

[Chart: Demand vs Provisioned capacity over periods a to j, vertical axis 0 to 120]

Page 36

Fail 7.2 – Not enough exercise

Failing to scale appropriately costs money

[Chart: Demand vs Provisioned capacity over periods a to j, vertical axis 0 to 120]
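The gap between demand and provisioned capacity costs money in both directions: capacity you pay for but do not use, and demand you cannot serve. Here is a minimal sketch of that comparison. The chart's actual data is not available in this transcript, so the series below are hypothetical values for periods a to j.

```python
def provisioning_gaps(demand, provisioned):
    """Compare demand and provisioned capacity period by period.

    Returns (wasted, unmet): capacity units paid for but unused, and
    demand units that could not be served.
    """
    if len(demand) != len(provisioned):
        raise ValueError("series must have the same length")
    wasted = sum(max(p - d, 0) for d, p in zip(demand, provisioned))
    unmet = sum(max(d - p, 0) for d, p in zip(demand, provisioned))
    return wasted, unmet


if __name__ == "__main__":
    # Hypothetical demand for periods a to j, against a fixed allocation of 60.
    demand = [20, 40, 60, 100, 80, 40, 30, 60, 90, 50]
    fixed = [60] * len(demand)
    wasted, unmet = provisioning_gaps(demand, fixed)
    print(f"fixed capacity: {wasted} units wasted, {unmet} units unmet")
```

Provisioning that tracks demand exactly would return (0, 0); how close you can get to that is what scale testing tells you.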

Page 37

Fail 8.0 – Too much excitement

Projects can fail because of an excess of enthusiasm

“Let’s take the opportunity to rewrite the application”
“Let’s use this new tech”

Often fails due to a lack of situational awareness of the state of play in the industry

It’s easy to get carried away.

Page 38

Fail 9.0 – Staged deployment

Going from Lift and Shift to what?

You can lift and shift. It’s probably going to bite you. Unexpected dependencies on local items such as C:/ or local services and servers (authentication servers etc.)

Consider your options
The “strangler pattern” – staged conversion to microservices
Time for a rewrite?
Look at new options - “serverless”?

BTW – adding in sufficient debug capability can be just as expensive and increase risk

How far into the woods do you want to go?

Page 39

Fail A.0 - A few other things

• Cloud providers often offer additional services
  • Why build your own when you can use a provided one?
• Skill sets
  • We have lots of tech experts but not that many systems experts.
  • Take a look at your team. Do they have the skills and experience you need?
  • IaC & DevOps skills?
• Some parts of your process are going to become more critical than before
  • Who’s doing the data backups?
  • Who owns your build and test infrastructure?
• Deployment process
  • How long does it take to deploy a change?
  • Does your team understand the importance of the process?

Page 40

Wrap up

• Moving anything into a cloud environment is always a challenge
• Lack of clarity around why you want to do this will cost you money and sleep, and will probably doom the project
• Be sure your team is skilled and committed. It’s their sleep too
• Most of the projects that fail – fail because of the approach. Not the technology
• But not understanding the economic drivers on systems will also lead to fail

Page 41

Fail to adapt -> Fail

How you design, code, deploy, debug, support etc. will be affected by the metrics and limits imposed on you.

Financial metrics and limits always change behavior. They also create opportunity.

You will have to learn new techniques and tools

Applications have to get leaner and meaner

https://www.flickr.com/photos/beigephotos/

Page 42

Thank you