
Avoiding #CloudFail: Learning Lessons from Microsoft Azure

Mark Russinovich, Technical Fellow, Microsoft Azure

Session 3-615

Stuff Everyone Knows About Cloud Development

- Automate
- Scale out
- Test in production
- Deploy early, deploy often
- Always blame the compiler

But there are many more rules…

But First: Basic Azure Resource Management Architecture

[Diagram: the Portal and PowerShell drive the RDFE REST API; RDFE coordinates resource providers (Azure DB, Service Bus, Storage) and deploys compute VMs across multiple Fabric clusters.]

Agenda

- Minor #Fail
- Major #Fail
- Pink Poodle (NSFW)

Minor #Fail

Dude, Where are My Mobile Services?

Customers reported that the portal did not show their mobile or media resources.

Root cause: a change to the resource provider string comparison made it case-sensitive, so “mobileservice” no longer matched “MobileService”.

Dude, Where’s My Cluster?

The AM2PrdApp03 Fabric cluster did not show up in the RDFE inventory.

Root cause: a user entered a mixed-case label for the Fabric cluster, and RDFE uses a case-sensitive compare against the region map.

Be Sensitive About Case

Case has to be handled on a case-by-case basis.

Casing Rules

Case insensitive:
- DNS names
- GUIDs
- URL scheme (e.g. “https” vs. “HTTPS”)
- Certificate thumbprints
- Windows filenames

Case sensitive:
- Username
- Password
- XML elements
- JSON
- Email address
- HTTP headers and verbs
- User Agent
- Base64 encoded strings
- Linux filenames
- Azure Storage objects

Rules:
- Comply with industry standards/conventions
- Be compatible with external systems
- All user-friendly text should be case-sensitive, such as Display Name, Description, Label/tags, …
- All other data should be case-insensitive
- Preserve case of string parameters (a sketch follows below)
- Document casing at string entry points
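As an illustration of these rules (a minimal sketch, not code from the talk; the CloudResource type and its members are hypothetical), the comparison below is case-insensitive while the user’s original casing is preserved for display:

    using System;

    class CloudResource
    {
        // User-facing text: casing is preserved exactly as entered.
        public string DisplayName { get; }
        // Identifier: stored as-is, but always compared case-insensitively.
        public string ProviderId { get; }

        public CloudResource(string displayName, string providerId)
        {
            DisplayName = displayName;
            ProviderId = providerId;
        }

        public bool MatchesProvider(string otherId) =>
            string.Equals(ProviderId, otherId, StringComparison.OrdinalIgnoreCase);
    }

    class Program
    {
        static void Main()
        {
            var r = new CloudResource("My Mobile App", "MobileService");
            // "mobileservice" still matches, avoiding the portal bug above.
            Console.WriteLine(r.MatchesProvider("mobileservice"));  // True
            Console.WriteLine(r.DisplayName);                       // casing preserved
        }
    }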

Log As If That’s All You Have

A little more detail can go a long way…

An error log not reporting a name made correlation difficult:

    System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> Microsoft.ServiceModel.Web.WebProtocolException: Server Error: The service name is unknown (NotFound)

An error message in the test environment indicating that a beta feature was missing was ambiguous.

Intermittent failures caused by a header incompatibility in the test environment made troubleshooting painful:

    HTTP Status Code: 400. Service Management Error Code: MissingOrIncorrectVersionHeader. Message: The versioning header is not specified or was specified incorrectly.
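A minimal sketch of the rule, with hypothetical names (Catalog and ServiceNotFoundException are invented, not RDFE code): the failing log line carries the service name and a correlation id, so a single entry is enough to correlate:

    using System;
    using System.Collections.Generic;

    class ServiceNotFoundException : Exception
    {
        // Name the resource in the message instead of a bare "unknown".
        public ServiceNotFoundException(string name)
            : base($"The service name is unknown: '{name}'") { }
    }

    class Catalog
    {
        readonly Dictionary<string, string> services = new()
        {
            ["mediaservice"] = "Running"
        };

        public string GetStatus(string serviceName, Guid requestId)
        {
            if (!services.TryGetValue(serviceName, out var status))
            {
                // The log entry carries both the name and a correlation id,
                // so one line is enough to trace the failure across components.
                Console.Error.WriteLine(
                    $"Service lookup failed. ServiceName={serviceName} RequestId={requestId}");
                throw new ServiceNotFoundException(serviceName);
            }
            return status;
        }
    }

    class Program
    {
        static void Main()
        {
            var catalog = new Catalog();
            try { catalog.GetStatus("mobileservice", Guid.NewGuid()); }
            catch (ServiceNotFoundException e) { Console.WriteLine(e.Message); }
        }
    }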

Yes, Code Hygiene Matters, Even in the Cloud

A misplaced closing parenthesis turned this check into a comma expression whose value is always true (presumably StrUtils::StartsWith(id, prefixes[i], true) was intended):

    for (size_t i = 0; i < prefixes.size(); i++)
    {
        // Bug: the comma operator discards the StartsWith result,
        // so the condition is the constant true.
        if (StrUtils::StartsWith(id, prefixes[i]), true)
        {
            return true;
        }
    }

It broke log filtering in production, causing a flood.

Rules:
- Build with /warnaserror
- Don’t suppress compiler warnings. EVER.

Exceptional Coding

It’s difficult to have universal coverage for exception handling:
- Many layers of code are written by many developers
- Third-party code may throw exceptions

Should you have a catch-all or fail fast?
- Visual Studio guidance says not to catch all
- Catching an unrecoverable error can leave the component unstable
- But failing fast on exceptions while handling user-controlled data can expose the service to denial of service

Not All Exceptions Are Created Equal

- Call a final handler after all known exceptions are handled
- Crash if the exception is not user-initiated and is unrecoverable
- Otherwise, log an error and return an error
- Log once per hour per exception and report status 500

    public static bool ShouldCrashOnUnhandledException(Exception e)
    {
        if (ExceptionProcessor.IsCrashingException(e))
        {
            // Unrecoverable: alert, then let the process crash.
            Logger.Instance.Alert(
                "[UnhandledException] Process crashed due to fatal exception: {0}",
                e.ToString());
            return true;
        }
        else
        {
            // Recoverable: alert or log, but keep the process alive.
            AlertsManager.AlertOrLogException(e);
            return false;
        }
    }

Exception Handling Rules

Exceptions are expensive.
- Rule: do not throw an exception in services for expected errors (for example, when a resource is not found)

Walking the stack trace can be CPU intensive.
- Rule: control logging of the stack to just once per request, or on demand (for debugging unknown errors)

Re-throwing from an async EndXxx method does not capture the original stack trace, and copying a remote stack trace is a very expensive operation.
- Rule: recreate the same type of exception with the original exception as the innerException whenever possible (see the sketch below)
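A minimal sketch of the last rule, with a hypothetical exception type (not the actual RDFE code): recreate the exception and carry the original as InnerException rather than copying a remote stack trace:

    using System;

    class ResourceLookupException : Exception
    {
        public ResourceLookupException(string message, Exception inner)
            : base(message, inner) { }
    }

    class Program
    {
        // Stands in for the completion (EndXxx) side of an async operation.
        static void FailingOperation() =>
            throw new ResourceLookupException("resource 'vm42' not found", null);

        static void Main()
        {
            try
            {
                try
                {
                    FailingOperation();
                }
                catch (ResourceLookupException e)
                {
                    // Recreate the same exception type; the original, with its
                    // stack trace intact, travels along as InnerException.
                    throw new ResourceLookupException(e.Message, e);
                }
            }
            catch (ResourceLookupException outer)
            {
                // The original frames are preserved without an expensive copy.
                Console.WriteLine(outer.InnerException.StackTrace);
            }
        }
    }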

I’m Hooked on You

- RDFE serves as the front-end for Resource Providers (RPs)
- RPs include Websites, Traffic Manager, Azure DB
- Testing RDFE requires RP interaction
- There are many instances of test RDFE, so it is not practical to deploy RPs one-to-one with test RDFE

[Diagram: RDFE_1, RDFE_2, … RDFE_X, each running its own RDFE test cases against a dedicated SQL Azure_1, SQL Azure_2, … SQL Azure_X.]

Let’s Pretend You Only Love Me

The workaround used in test automation was RDFE test deployments sharing storage with a well-known test RDFE instance (management.rdfetest.dnsdemo4.com).

Problem: tests interfere with and corrupt each other.

[Diagram: the test cases for RDFE_1, RDFE_2, … RDFE_X all run against a primary RDFE backed by a single SQL Azure instance and shared storage.]

Monogamous Services are so 2000’s

Rule: make components capable of talking to multiple versions concurrently (see the sketch below).
- Updated RPs to be able to communicate with multiple RDFEs
- Use registration with a “return address” pattern
- Adheres to the layering extension model, where dependencies are one way

[Diagram: a test SQL Azure instance records each registration’s return address (137.116.176.40, 137.116.176.41, 137.116.176.42) and addresses its responses back to the matching RDFE_1, RDFE_2, … RDFE_X, each running its own RDFE test cases.]
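A minimal sketch of the return-address pattern under stated assumptions (the ResourceProvider type and its registry are invented, not the Azure implementation): each test RDFE registers a return address with the shared RP, and the RP replies to whichever address originated the request:

    using System;
    using System.Collections.Generic;

    // Shared resource provider that can serve many RDFE instances at once.
    class ResourceProvider
    {
        // requestId -> return address captured at registration time
        readonly Dictionary<Guid, string> returnAddresses = new();

        public Guid Register(string returnAddress)
        {
            var requestId = Guid.NewGuid();
            returnAddresses[requestId] = returnAddress;  // remember who asked
            return requestId;
        }

        public void CompleteOperation(Guid requestId)
        {
            // The reply goes to the registered return address, not a fixed
            // RDFE, so one RP instance serves RDFE_1 ... RDFE_X concurrently.
            Console.WriteLine(
                $"Notifying {returnAddresses[requestId]} that {requestId} is done");
        }
    }

    class Program
    {
        static void Main()
        {
            var rp = new ResourceProvider();
            var r1 = rp.Register("137.116.176.40");  // RDFE_1
            var r2 = rp.Register("137.116.176.42");  // RDFE_X
            rp.CompleteOperation(r1);
            rp.CompleteOperation(r2);
        }
    }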

Clean Up Your Mess When You’re Done

RDBug 1042990, “Testability: Expose DeleteExtensions API”
- Stale test VM extensions were time-consuming to clean up and caused bloat of the stores

RDBug 764505, “[AM/CSM Design] Need UnEntitleResourceForSubscription API”
- The only way to test subscription RP entitlement provisioning was to delete and recreate subscriptions

Rule: implement full CRUD for all resources (a sketch follows below).
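The rule expressed as a hypothetical interface (a sketch, not an Azure contract): when every resource type exposes all four operations, test cleanup comes for free:

    using System.Collections.Generic;

    // Hypothetical provider contract: every resource gets full CRUD, so tests
    // can clean up (Delete) instead of accumulating stale state.
    interface IResourceProvider<T>
    {
        string Create(T resource);
        T Read(string id);
        void Update(string id, T resource);
        void Delete(string id);   // the operation the RDBugs above were missing
    }

    // Minimal in-memory implementation for illustration.
    class InMemoryProvider<T> : IResourceProvider<T>
    {
        readonly Dictionary<string, T> store = new();
        int nextId;

        public string Create(T resource)
        {
            var id = (nextId++).ToString();
            store[id] = resource;
            return id;
        }

        public T Read(string id) => store[id];
        public void Update(string id, T resource) => store[id] = resource;
        public void Delete(string id) => store.Remove(id);
    }

    class Program
    {
        static void Main()
        {
            IResourceProvider<string> rp = new InMemoryProvider<string>();
            string id = rp.Create("vm-extension");
            rp.Update(id, "vm-extension-v2");
            System.Console.WriteLine(rp.Read(id));
            rp.Delete(id);  // test cleanup leaves no stale state behind
        }
    }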

Back to the Future

A developer wrote code in RDFE that failed when it encountered unknown capabilities from a new Fabric version, which caused issues when the upgrade order changed.

Rule: be deliberate about forward compatibility (see the sketch below).
- If a parameter can be ignored, ignore it
- If it has semantics that affect state, fail with a version error
- Don’t expose semantic features to down-level components
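A minimal sketch of these rules, assuming the sender tags each capability as must-understand or ignorable (the Capability type and names are invented, not Fabric’s protocol):

    using System;
    using System.Collections.Generic;

    // Hypothetical wire format: the newer sender marks each capability as
    // must-understand (affects state) or safely ignorable.
    record Capability(string Name, bool MustUnderstand);

    class Program
    {
        static readonly HashSet<string> Known = new() { "LiveMigration", "GpuPassthrough" };

        static void Apply(IEnumerable<Capability> capabilities)
        {
            foreach (var cap in capabilities)
            {
                if (Known.Contains(cap.Name))
                    Console.WriteLine($"Applying {cap.Name}");
                else if (cap.MustUnderstand)
                    // Semantics affect state: fail with a version error
                    // instead of guessing.
                    throw new NotSupportedException(
                        $"Capability '{cap.Name}' requires a newer version");
                else
                    // Forward compatibility: unknown but ignorable, so skip it.
                    Console.WriteLine($"Ignoring unknown capability {cap.Name}");
            }
        }

        static void Main() => Apply(new[]
        {
            new Capability("LiveMigration", true),
            new Capability("FutureTelemetry", false),  // unknown but ignorable
        });
    }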

Dude, Where’s My VM?

Customers began to complain that newly created IaaS VMs failed to provision.
- Logs showed that only VMs using one of two images failed
- Those images had just been uploaded
- Test provisioning consistently failed

A Case of Mistaken Identity

A check of the images revealed that they were corrupt.

Flow of image updates:
- Every month, new OS VHD images are produced
- After boot performance optimization, prefetch data is added to the VHD
- The image is uploaded to the platform image repository
- A manual test is performed by creating a new VM, RDP’ing in, and running tests in the VM

Root cause: the images were corrupted during upload. The corruption was not detected because the human test ran against the Stage environment, not Production.

Rule: assume data will be corrupted in transit; use CRC64 checksums (a sketch follows below).
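A minimal sketch of verify-after-transfer. The slide calls for CRC64; since .NET has no built-in CRC64, this illustration uses SHA-256 for the same end-to-end integrity check, and the upload step is simulated:

    using System;
    using System.Linq;
    using System.Security.Cryptography;

    class Program
    {
        static void Main()
        {
            byte[] image = { 1, 2, 3, 4 };  // stand-in for the VHD bytes
            using var sha = SHA256.Create();
            byte[] before = sha.ComputeHash(image);  // computed before upload

            image[2] ^= 0xFF;  // simulate corruption in transit

            byte[] after = sha.ComputeHash(image);   // recomputed after upload
            Console.WriteLine(before.SequenceEqual(after)
                ? "Image verified"
                : "Checksum mismatch: reject the image, do not publish");
        }
    }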

Major #Fail

VIP Swap: “I like your VIP better than mine”

Really? Isn’t that a Bit Much?

Users started complaining that, after a VIP swap, they could not perform operations on their cloud services.
- The problem was not detected by monitoring systems
- It affected only a small number of customers

What’s a VIP Swap?

You can deploy two versions of a cloud service:
- Production: has the DNS name and IP address of the cloud service you publish
- Stage: has a temporary DNS name and IP address

To promote the Stage version to Production, you “VIP Swap”.

[Diagram: two deployments, each with Role A and Role B instances exposing ports 80, 3389, and 3390. The Production VIP (VIP1) serves <dnsname>.cloudapp.net; the Staging VIP (VIP2) serves <guid>.cloudapp.net.]

VIP Swap Internals

- RDFE uses storage table rows to cache the state of cloud service deployments, including the state of role instances and deployment slots
- A row is updated by mutating operations like VIP Swap
- A row is also updated when the RDFE cache refreshes the status of roles
- Multiple roles are updated via conditional table update (opportunistic concurrency; see the sketch after the tables below)

Before the swap:

    Slot        VIP            Role A   Role B
    Production  168.133.1.22   Healthy  Healthy
    Stage       168.124.33.22  Healthy  Healthy

After the swap:

    Slot        VIP            Role A   Role B
    Stage       168.124.33.22  Healthy  Healthy
    Production  168.133.1.22   Healthy  Healthy
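A hedged sketch of the opportunistic concurrency in play, written against the current Azure.Data.Tables SDK rather than the 2014-era storage client (the table, keys, and column names are made up): every update passes the ETag that was read, so a concurrent writer produces a 412 instead of a silent overwrite:

    using System;
    using Azure;
    using Azure.Data.Tables;

    class Program
    {
        static void Main()
        {
            var table = new TableClient("<connection string>", "DeploymentState");

            // Read the cached deployment row, capturing its ETag.
            TableEntity row = table.GetEntity<TableEntity>("cloudservice1", "slots").Value;
            row["ProductionVip"] = "168.124.33.22";  // the mutating operation (VIP swap)

            try
            {
                // Conditional update: succeeds only if the row is unchanged
                // since the read. An unconditional write here is exactly the
                // kind of lost update the race condition produced.
                table.UpdateEntity(row, row.ETag, TableUpdateMode.Replace);
            }
            catch (RequestFailedException e) when (e.Status == 412)
            {
                // Precondition failed: re-read and retry, never overwrite.
                Console.WriteLine("Row changed underneath us; re-read and retry");
            }
        }
    }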

The VIP Swap Bug

- A bug in the RDFE update caused a race condition: a change could be overwritten, leaving inconsistent state
- RDFE does not allow update operations when it detects inconsistency
- Because of the race condition, the error rate was only marginally higher than normal, so the bug went undetected

Intended behavior, with the two rows exchanging slots:

    Slot        VIP            Role A   Role B
    Production  168.133.1.22   Healthy  Healthy
    Stage       168.124.33.22  Healthy  Healthy

    Slot        VIP            Role A   Role B
    Stage       168.124.33.22  Healthy  Healthy
    Production  168.133.1.22   Healthy  Healthy

When the race overwrote one of the updates, both rows ended up marked Stage:

    Slot   VIP            Role A   Role B
    Stage  168.124.33.22  Healthy  Healthy
    Stage  168.124.33.22  Healthy  Healthy

VIP Swap Learnings

Root cause: the developer claimed “unintuitive behavior of ADO.NET”.

Rule: direct a slice of traffic to an updated version for several days (see the sketch below).
- Increase traffic gradually
- Set alerts based on the difference in the failure rates of the two versions

[Diagram, animated over several slides: customer traffic is split between the existing RDFE instances and RDFE vNext, ramping from 5% to 30% to 50% and finally to vNext everywhere.]
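A toy sketch of the traffic-slicing idea (all names invented; real deployments split traffic at the load balancer): route a configurable fraction of requests to vNext and compare failure rates between the slices:

    using System;

    class CanaryRouter
    {
        readonly Random rng = new();
        public double VNextFraction { get; set; } = 0.05;  // start at 5%

        public string Route() =>
            rng.NextDouble() < VNextFraction ? "RDFE vNext" : "RDFE current";
    }

    class Program
    {
        static void Main()
        {
            var router = new CanaryRouter();
            int vnext = 0, total = 10_000;
            for (int i = 0; i < total; i++)
                if (router.Route() == "RDFE vNext") vnext++;

            // Alerting would compare failure rates between the two slices;
            // ramp VNextFraction 0.05 -> 0.30 -> 0.50 -> 1.0 over several days.
            Console.WriteLine($"vNext share: {100.0 * vnext / total:F1}%");
        }
    }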

Subscription Deletion: “Oh no you didn’t”

Oops…

- An operator was performing regular clean-up of internal (Microsoft) Azure subscriptions
- He wasn’t aware of the new process to go through front-line support, which had resulted in a build-up of delete requests
- The normal process was bypassed, and the operator was allowed to submit a batch delete
- Several subscriptions had originally been created internally on behalf of partners, and they got deleted; the front-line checks would have stopped the delete
- Customers graciously let us know what had happened

Let’s Not Do That Again

All data was fortunately recovered because of the data retention terms in http://www.windowsazure.com/en-us/support/legal/subscription-agreement. The subscription had to be manually recreated, however.

Rule: use “soft delete” (see the sketch below).
- Resources become inaccessible to the customer
- Non-data resources (e.g. VMs) are deleted
- A timer is set to delete everything else after 90 days
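A minimal soft-delete sketch under stated assumptions (the types and the purge job are illustrative, not the Azure implementation): deletion flips a flag and stamps a purge deadline, and a background job hard-deletes only after the 90-day window:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Subscription
    {
        public string Id { get; init; }
        public bool Deleted { get; private set; }
        public DateTime? PurgeAfter { get; private set; }

        public void SoftDelete()
        {
            // Customer loses access immediately, but data is retained.
            Deleted = true;
            PurgeAfter = DateTime.UtcNow.AddDays(90);
        }

        public void Restore()
        {
            // An accidental delete is reversible until the purge deadline.
            Deleted = false;
            PurgeAfter = null;
        }
    }

    class PurgeJob
    {
        // Background task: keep everything except entries whose retention
        // window has expired; those get hard-deleted.
        public static List<Subscription> Purge(List<Subscription> all) =>
            all.Where(s => !(s.Deleted && s.PurgeAfter < DateTime.UtcNow)).ToList();
    }

    class Program
    {
        static void Main()
        {
            var sub = new Subscription { Id = "partner-sub-1" };
            sub.SoftDelete();
            sub.Restore();  // recovery is one call, not a manual rebuild
            var remaining = PurgeJob.Purge(new List<Subscription> { sub });
            Console.WriteLine($"{sub.Id} deleted={sub.Deleted}, tracked={remaining.Count}");
        }
    }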

Storage Certificate Expiration: “Sorry I’m late, the alarm clock never rang”

It’s Not You, It’s Me

SSL connections to Azure Storage began failing at 12:29pm on February 22, 2013. Customers immediately noticed. We did, too.

Details: http://blogs.msdn.com/b/windowsazure/archive/2013/03/01/details-of-the-february-22nd-2013-windows-azure-storage-disruption.aspx

We Updated It, We Promise!

Certificates are managed by the “Secret Store”:
- Once a week, an automated system scans the store
- An alert is fired for certs within 180 days of expiration
- The team obtains a new cert and updates the Secret Store

That process was followed. The breakdown:
- On January 7, the storage team updated the three certs in question
- The team failed to flag that a storage deployment had a date deadline
- The deployment was delayed behind another, higher-priority update

Be Certain About Your Certs

The real breakdown was not monitoring production:
- We now scan all service endpoints, internal and external, on a weekly basis (a sketch follows below)
- At 90 days until expiration, a cert shows up on VP reports

Rule: service development requires thinking through the entire life-cycle of the software.

We are working on “managed service identities” to fully automate non-PKI certs.
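A minimal sketch of scanning a live endpoint for certificate expiry (an illustration, not the actual Azure scanner; the host name is a placeholder): open a TLS connection and read NotAfter from the certificate actually being served:

    using System;
    using System.Net.Security;
    using System.Net.Sockets;
    using System.Security.Cryptography.X509Certificates;

    class Program
    {
        static DateTime GetCertExpiry(string host)
        {
            using var tcp = new TcpClient(host, 443);
            using var tls = new SslStream(tcp.GetStream());
            tls.AuthenticateAsClient(host);  // performs the TLS handshake
            var cert = new X509Certificate2(tls.RemoteCertificate);
            return cert.NotAfter;  // expiry of the cert served in production
        }

        static void Main()
        {
            // Placeholder endpoint; the real scan covers every internal and
            // external service endpoint weekly.
            DateTime expires = GetCertExpiry("myaccount.blob.core.windows.net");
            double daysLeft = (expires - DateTime.UtcNow).TotalDays;
            if (daysLeft < 90)
                Console.WriteLine($"ALERT: cert expires in {daysLeft:F0} days");
        }
    }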

Summary

Minor #Fail:
- Be Sensitive About Case
- Log As If That’s All You Have
- Yes, Code Hygiene Matters, Even in the Cloud
- Exceptional Coding
- I’m Hooked on You
- Clean Up Your Mess When You’re Done
- Back to the Future
- A Case of Mistaken Identity

Major #Fail:
- VIP Swap
- Subscription Deletion
- Storage Certificate Expiration

We Made These Mistakes So You Don’t Have To

- Cloud development adds new rules and makes some of the old ones matter more
- Many rules are devops-oriented
- Operating at large scale with loosely coupled services gives rise to others

If you have hard-won rules to share, please email me: [email protected]

Good luck, and may the force of the cloud be with you!

Your Feedback is Important

Fill out an evaluation of this session and help shape future events.

Scan the QR code to evaluate this session on your mobile device.

You’ll also be entered into a daily prize drawing!

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.