Lessons Learned: Building Scalable Applications with the Windows Azure Platform

Simon Davies, Windows Azure TSP, Microsoft Corporation. Session SVC32.

Page 1: Lessons Learned: Building Scalable Applications with the Windows Azure Platform

Lessons Learned: Building Scalable Applications with the Windows Azure Platform
Simon Davies
Windows Azure TSP
Microsoft Corporation

SVC32

Page 2: Agenda

> Objectives of this session
> Thoughts on scalability in the cloud
> Real-world lessons learned
  > Thuzi
  > RiskMetrics
> Summary
> Questions and answers

Page 3: Scalability in the Cloud

> Scalability == work / resources
> Windows Azure makes adding AND REMOVING resources dynamic
> This, along with the business model, changes things
  > Capacity planning becomes dynamic
  > Utilisation levels are important
> The definition of scale differs depending on application type and workload arrival characteristics

Page 4: Scaling Facebook Apps in the Azure Cloud

Jim Zimmerman
CTO / Lead Developer, Thuzi.com
Partner

Page 5: Who is Thuzi?

> We develop customized viral marketing solutions, using a variety of technologies that engage users and measure results.
> We ensure maximum scalability by exploiting the latest virtual computing through Microsoft's Azure Platform and Tools.

Page 6: Facebook Viral Application Needs

> Support for thousands of users virtually overnight … our models predicted geometric adoption
> The success of one of our clients could not be the failure for others … requirement for distinct computing environments for each Thuzi customer
> Our job is to turn social media data into real business information … must have a robust back end for reporting detailed analytics

Page 7: Facebook Viral Application Needs

> Thuzi builds cool social media web apps and we don’t know much about running data centers … besides, we didn’t want to purchase extra servers “just in case”
> A consistent user experience was mandatory … social media users don’t like to wait

Page 8: Hosting Options

> Our own data center: too expensive and, with unpredictable growth, hard to plan for
> Google: didn’t have a familiar programming environment
> Amazon: could use Windows VMs, but did not have as many features as we wanted
> Azure: familiar Microsoft technologies

Page 9: Technology

Page 10: Outback DEMO

Page 11: The Results

Page 12: Fan Growth over Time

[Chart: number of fans over time. Vertical axis: 0 to 400,000 fans in steps of 50,000.]

Page 13: Lessons Learned

> Trace everything!
  > Errors, debug info
  > You will upgrade later as you start to ask questions about how your app is behaving
> Track perf counters
  > CPU usage, requests/sec, memory usage
> Use worker roles to move data from queues to table storage and SQL Azure
  > SQL is easier to report on
  > Table storage allows more scalability
> Deployment
  > Upgrade manually
  > When moving to production, use the VIP Swap feature
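The worker-role bullet above can be pictured as a small drain loop. The sketch below is written in Python for brevity rather than the C# the deck uses, and every name in it is an assumption made for illustration: `drain_queue` is hypothetical, a `deque` stands in for an Azure queue, and a dict and a list stand in for table storage and SQL Azure. It is not Thuzi's actual code.

```python
from collections import deque


def drain_queue(messages, table_rows, sql_rows, max_messages=32):
    """Move up to max_messages events from the queue into both stores."""
    moved = 0
    while messages and moved < max_messages:
        event = messages.popleft()  # stand-in for a queue read plus delete
        # Table storage stand-in: keyed by (partition key, row key).
        table_rows[(event["app"], event["id"])] = event
        # SQL stand-in: flat rows that are easy to report on.
        sql_rows.append((event["app"], event["id"], event["action"]))
        moved += 1
    return moved


# Hypothetical events, as a web role might have queued them.
queue_messages = deque(
    {"app": "demo", "id": str(i), "action": "became_fan"} for i in range(3)
)
table, sql = {}, []
moved_count = drain_queue(queue_messages, table, sql)
print(moved_count, "events moved")
```

Writing each event to both stores mirrors the trade-off on the slide: the relational copy serves reporting, while the key-addressed copy is the one expected to scale.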

Page 14: Tracing

    // Assumes `config` is a DiagnosticMonitorConfiguration, for example the one
    // returned by DiagnosticMonitor.GetDefaultInitialConfiguration().

    // Transfer infrastructure logs at Error level every 5 minutes.
    config.DiagnosticInfrastructureLogs.ScheduledTransferLogLevelFilter = LogLevel.Error;
    config.DiagnosticInfrastructureLogs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

    // Transfer application trace logs at Error level every 5 minutes.
    config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Error;
    config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

Page 15: Performance Monitoring

    // Sample total CPU usage every 5 seconds.
    var cpuUsage = new PerformanceCounterConfiguration();
    cpuUsage.CounterSpecifier = @"\Processor(_Total)\% Processor Time";
    cpuUsage.SampleRate = TimeSpan.FromSeconds(5);

    // Sample available memory every 5 seconds.
    var pccMemory = new PerformanceCounterConfiguration();
    pccMemory.CounterSpecifier = @"\Memory\Available MBytes";
    pccMemory.SampleRate = TimeSpan.FromSeconds(5);

    // Sample ASP.NET request throughput every 5 seconds.
    var requestsPerSec = new PerformanceCounterConfiguration();
    requestsPerSec.CounterSpecifier = @"\ASP.NET Applications(__Total__)\Requests/Sec";
    requestsPerSec.SampleRate = TimeSpan.FromSeconds(5);

    // Register the counters on the diagnostics `config` and transfer samples every 5 minutes.
    config.PerformanceCounters.DataSources.Add(cpuUsage);
    config.PerformanceCounters.DataSources.Add(pccMemory);
    config.PerformanceCounters.DataSources.Add(requestsPerSec);
    config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

Page 16: Deployment

> Upload new package to staging
  > Wait for all roles to be ready
> Use VIP Swap to upgrade and deploy to production
  > Rewrites the load balancer to swap staging with production
  > If anything is wrong, you can swap back

Page 17: Tools Needed

> Needed to be able to manage records in table storage for testing
> Needed to be able to download logs from table storage for tracing and perf counters
  > Azure Storage Explorer (CodePlex): free
  > Cloud Storage Studio: paid
> Do LINQ queries against table storage to get specific info when needed

Page 18: In Summary

> Azure provides Thuzi a competitive advantage … so please don’t tell the other social media marketing companies and let us enjoy our 15-minute advantage

Page 19: Building Scalable Applications using Windows Azure: RiskMetrics RiskBurst™

Rob Fraser and Phil Jacob
RiskMetrics Group
www.riskmetrics.com
Partner

Page 20: RiskMetrics RiskBurst™

RiskMetrics Group
> Offers industry-leading products and services in the disciplines of risk management, corporate governance and financial research & analysis

Scaling on-premise computation to the cloud
> Integration of RiskMetrics' extensive on-premise capability with Windows Azure
> We are running on 2,000 instances on Windows Azure
> We have plans to use 10,000+ instances in 2010

What are RiskMetrics doing with so much computing power?
> Calculation of financial risk
> Simulate scenarios for the movement of market factors over time and price financial assets in those scenarios
> Notoriously complex: can involve Monte Carlo² for complex asset classes of the kind that triggered the 'credit crunch'

Results in very high computational loads for RiskMetrics
> Daily risk analysis load equivalent to calculating risk on 4 trillion US stocks
> Computational loads are characterised by high demand peaks
> Strong growth trend in calculation complexity

Page 21: Peak Load Characteristics

Page 22: Growth trend in calculation complexity

[Chart: Maximum Complexity of Risk Analysis Processing Request, 1994 to 2008. Vertical axis: Relative Equity Equivalent Units (log scale, 0 to 10). Annotations: risk problem complexity has doubled every 6 months; Moore’s Law: processor power doubles every 2 years.]
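To make the two doubling rates on this chart concrete, here is a small worked calculation in Python. It assumes both trends hold steadily over the chart's 1994 to 2008 axis, which is a simplification made for illustration; `growth_factor` is a hypothetical helper, not something from the talk.

```python
import math


def growth_factor(years, doubling_period_years):
    """Multiplicative growth after `years` for a quantity that doubles every period."""
    return 2.0 ** (years / doubling_period_years)


YEARS = 2008 - 1994  # span of the chart's horizontal axis (assumed steady trends)

complexity = growth_factor(YEARS, 0.5)  # doubles every 6 months
processor = growth_factor(YEARS, 2.0)   # doubles every 2 years
gap = complexity / processor            # growth that faster chips do not cover

print(f"complexity grew by 2^{math.log2(complexity):.0f}")
print(f"processor power grew by 2^{math.log2(processor):.0f}")
print(f"remaining gap: 2^{math.log2(gap):.0f}")
```

Under these assumptions complexity grows by 2^28 while single-processor power grows by 2^7, leaving a factor of 2^21 (about two million) that has to come from adding more machines rather than waiting for faster ones.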

Page 23: Analytics Architecture: Large-Scale Data Dependent Processing vs. Distributable Work Packets

[Diagram components: Load Balancer; Market and Pricing Data; Velocity Scenario Cache; several RiskServer instances; a pool of Pricer instances.]

> Scenario generation and aggregation: these services are dependent on high-speed access to large-scale data stores and caches
> Scenario pricing: work packets are self-contained

Page 24: Work Packet Example: Pricing Request for a Mortgage Backed Security

Work packet
> Operation: request price for the asset in the specified market scenario
> Asset (size: 1 KB)
  • Asset id
  • Asset description
  • Collateral description
> Scenario (size: 5 KB)
  • Interest rate points
  • Swaption volatilities
  • Overrides for explicit stresses

Response packet
> Price for asset (size: 1 KB)
> Logging (size: large)
  • Optional diagnostics
  • Exceptions

Compute time: 150 ms to 30 s
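The packet layout on this slide maps naturally onto a few plain record types. The sketch below is one possible shape, written in Python for brevity; the class and field names are assumptions derived from the slide's labels, and the field types and sample values are invented for illustration. It is not the RiskMetrics wire format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Asset:
    asset_id: str
    description: str
    collateral_description: str


@dataclass(frozen=True)
class Scenario:
    interest_rate_points: Dict[str, float]
    swaption_volatilities: Dict[str, float]
    stress_overrides: Dict[str, float] = field(default_factory=dict)


@dataclass(frozen=True)
class WorkPacket:
    """Self-contained request: everything a pricer needs travels with it."""
    operation: str
    asset: Asset
    scenario: Scenario


@dataclass
class ResponsePacket:
    asset_id: str
    price: Optional[float] = None
    diagnostics: List[str] = field(default_factory=list)  # optional, can be large
    exceptions: List[str] = field(default_factory=list)

    @property
    def succeeded(self):
        return self.price is not None and not self.exceptions


# Hypothetical sample values, for illustration only.
packet = WorkPacket(
    operation="price",
    asset=Asset("MBS-001", "example pool", "30-year fixed-rate mortgages"),
    scenario=Scenario({"2Y": 0.012, "10Y": 0.034}, {"1Yx5Y": 0.25}),
)
response = ResponsePacket(asset_id=packet.asset.asset_id, price=101.25)
print(response.succeeded)
```

Because a work packet carries its own asset and scenario data, any pricer can process it without shared state, which is what makes the pricing tier easy to distribute.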

Page 25: Analytics Architecture: Integration of Cloud Resources?

[Diagram components: Load Balancer; Market and Pricing Data; Velocity Scenario Cache; several RiskServer instances; an enlarged pool of Pricer instances.]

> Scenario generation and aggregation: these services are dependent on high-speed access to large-scale data stores and caches
> Scenario pricing: work packets are self-contained

Page 26: RiskBurst™ Project Timeline

March - June
• Project conception
• Choice of platform

July - August
• Initial MSFT meetings
• RiskMetrics joins TAP
• TAP team actively involved in architecture decisions

September - October
• Engineering work on scaling proof of concept
• Deep-dive sessions
• Large-scale testing with test load (200-2000 nodes)
• ‘Industrialisation’ of architectural pattern

November - December
• Large-scale UAT using load application
• Complete work on operational integration

Q1 2010
• Run parallel with in-house solution
• Production

Page 27: RiskBurst™

An architectural pattern for large scale computational applications

Page 28: Architectural Pattern

Building large-scale computation requires careful design
> Problem: need to avoid the Von Neumann bottleneck
> Keywords: reason and instrument

No changes to the application
> Run on-premise on HPC Server or in the cloud on Azure

Pattern has end-to-end decoupling
> Horizontal scaling of decoupled components

[Diagram: three layers, each replicated horizontally: Workload Generation; Messaging & Storage; Computational Resources & Application.]

Page 29: RiskBurst™ Workflow: Windows Azure & HPC Server

[Diagram: the Scenario Generator exchanges WCF Requests, WCF Responses and WCF Error Responses with the RiskBurst™ Server. Server components: Workload Receiver; Batching and Sending; Outstanding Request Timeout Sweeper; Worker Output Monitoring. Input Messages go to the Windows Azure Input Queue(s); Output Messages come back from the Windows Azure Output Queue(s).]

Page 30: Windows Azure Storage Component Usage

[Diagram components: RiskBurst Server; Input Queues (to-do jobs) and Input Blob Storage; Worker Role instances with local storage and data; support files in Blob Storage; Output Queues (jobs done) and Output Blob Storage.]

Page 31: Mapping to the Azure Environment

Visual Studio 2008 Azure development SDK mimics the cloud
> Mix code running locally in dev with cloud resources such as Blob storage or queues
> Good for features, does not assist with scale

Existing 32-bit .NET C++/CLI application with 3 third-party libraries
> Initial idea: run directly in a web role, but 32-bit(!)
> Run within a worker role
> Preserve the WCF interface: no changes whatsoever to the analytics app

Only changes to the existing code base are:
> Retrieve cash-flow library support files from Blob storage on demand
> Some diagnostic information added

Page 32: Getting to Cloud Resources: Bandwidth & Latency

Problem: bandwidth to the Azure gateway is limited by the Internet
Solution: pass by reference & blobs
> Replace pass-by-value calls with pass-by-reference
> Create a key for each scenario
> Large, repeated objects (scenarios) pushed to blob storage
> WCF call contains only the key
> Each of 1000 scenarios is used for all assets

Problem: communications latency
> Within the data centre, 20 ms latency on a WCF call through the HPC SOA platform
> Queues and Blob storage are off-device; engineering must respect this!
> Work packet: 200 ms computation
Solution: batch requests within input queues
> But more simultaneous work requests (threads outstanding on input)
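The pass-by-reference idea above can be sketched in a few lines. This Python example is illustrative only: a dict stands in for blob storage, the content-hash key scheme is an assumption (the slide says only that a key is created per scenario), and all function names are hypothetical.

```python
import hashlib
import json


def scenario_key(scenario):
    """Derive a stable key from the scenario's content (assumed key scheme)."""
    encoded = json.dumps(scenario, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def publish_scenario(blob_store, scenario):
    """Upload the scenario once and return the key that refers to it."""
    key = scenario_key(scenario)
    blob_store.setdefault(key, json.dumps(scenario, sort_keys=True))
    return key


def build_request(asset_id, key):
    """The request carries only the asset id and the scenario key."""
    return json.dumps({"asset_id": asset_id, "scenario_key": key})


def resolve_scenario(blob_store, request, cache):
    """Worker side: fetch the scenario by key, keeping a copy on the node."""
    key = json.loads(request)["scenario_key"]
    if key not in cache:
        cache[key] = json.loads(blob_store[key])
    return cache[key]


blob_store = {}  # stand-in for blob storage
scenario = {"interest_rate_points": [0.01 * i for i in range(500)]}
key = publish_scenario(blob_store, scenario)
request_messages = [build_request(f"asset-{n}", key) for n in range(3)]
node_cache = {}
resolved = resolve_scenario(blob_store, request_messages[0], node_cache)
print(len(request_messages[0]), "bytes per request vs", len(blob_store[key]), "for the scenario")
```

The large scenario crosses the Internet link once, while each per-asset request stays small; since every scenario is reused for all assets, the saving grows with the number of assets.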

Page 33: Utilizing Cloud Resources: Generating Load

Page 34: Utilizing Cloud Resources: Generating Load

Problem: generating load for cloud resources
> Threading architecture
> Workload originally generated by synchronous calls in the client
> Number of outstanding pricing requests = nodes × batch size
> Implies a large number of threads in wait states in the scenario generators
> Work request made asynchronous

RiskBurst™ Server logic
> Creates a balanced workload, using a work item’s average run time
> Calls to the RiskBurst™ Server made asynchronous
> Incoming calls create a batch entry synchronously with the request
> Map created from message id to wait handles
> When a batch is full, it is sent on to the Azure input queue
> Sweeper thread gathers up output messages and uses the map to associate them with wait handles
> Scales well to over 1000 simultaneous requests per RiskBurst™ Server
> Horizontal scale of RiskBurst™ Servers: each creates its own input queue
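The server logic listed above (batch entries, a map from message id to a waiting caller, and a sweeper that matches output messages back to callers) can be sketched as follows. This is a simplified Python illustration, not the RiskBurst™ implementation: `BatchingServer` and `worker_step` are hypothetical names, `queue.Queue` stands in for the Azure input and output queues, and `concurrent.futures.Future` plays the role of the wait handle.

```python
import queue
import threading
from concurrent.futures import Future


class BatchingServer:
    def __init__(self, input_queue, output_queue, batch_size):
        self._input_queue = input_queue
        self._output_queue = output_queue
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self._batch = []
        self._pending = {}  # message id -> Future the caller waits on
        self._next_id = 0

    def submit(self, payload):
        """Add a request to the current batch; send the batch when it is full."""
        with self._lock:
            message_id = self._next_id
            self._next_id += 1
            future = Future()
            self._pending[message_id] = future
            self._batch.append((message_id, payload))
            if len(self._batch) >= self._batch_size:
                self._input_queue.put(self._batch)
                self._batch = []
        return future

    def sweep_once(self):
        """Drain available output messages and complete the matching futures."""
        matched = 0
        while True:
            try:
                message_id, result = self._output_queue.get_nowait()
            except queue.Empty:
                return matched
            with self._lock:
                future = self._pending.pop(message_id, None)
            if future is not None:
                future.set_result(result)
                matched += 1


def worker_step(input_queue, output_queue, price):
    """Stand-in for a worker role: process one batch from the input queue."""
    for message_id, payload in input_queue.get_nowait():
        output_queue.put((message_id, price(payload)))


inq, outq = queue.Queue(), queue.Queue()
server = BatchingServer(inq, outq, batch_size=3)
futures = [server.submit(notional) for notional in (1, 2, 3)]
worker_step(inq, outq, lambda notional: notional * 10)
server.sweep_once()
print([f.result(timeout=0) for f in futures])
```

In a real server `sweep_once` would run in a loop on a dedicated sweeper thread, and a timeout sweeper would fail futures that wait too long; both are left out here to keep the sketch deterministic.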

Page 35: Horizontal Scaling within the Cloud

Problem: saturation behaviour of queues
> Can create a situation where queues are saturated, made worse by retry logic
> Complexity due to varied processing time
> Controller will move busy queues to independent hardware
> Use an exponential back-off algorithm
> Batch work items for each queue read or write (using 10 work packets per queue item)

Amortizing the cost of IO against CPU time is key
> Batch compute sizes need to be big enough to occupy the CPU for long enough, without swamping the queues
> Also, more items contained in a queue item means fewer queue hits
> But larger batches imply more simultaneous outstanding connections on the client side
> Variable run-time of assets: from 150 ms to 30 seconds

Carry out processing concurrently with queue access
> Pushing IO onto background threads is critical (the writes and the deletes are independent background tasks)
> On-node caching within the worker role to avoid queue reads
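Two of the techniques above, exponential back-off on empty queue reads and packing several work packets into one queue item, are easy to show in isolation. The Python sketch below is a minimal illustration under assumed parameters: the function names, the doubling factor, and the delay cap are all choices made here, while the figure of 10 packets per queue item comes from the slide.

```python
import time


def poll_with_backoff(get_message, max_attempts=8, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Poll until a message arrives, doubling the wait after each empty read."""
    delay = base_delay
    for _ in range(max_attempts):
        message = get_message()
        if message is not None:
            return message
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return None


def batch_work_packets(packets, packets_per_item=10):
    """Group work packets so that each queue item carries several of them."""
    return [packets[i:i + packets_per_item]
            for i in range(0, len(packets), packets_per_item)]


# A fake queue that is empty for the first three reads.
reads = iter([None, None, None, "queue item"])
waits = []  # records the requested delays instead of actually sleeping
message = poll_with_backoff(lambda: next(reads), base_delay=1.0, sleep=waits.append)
queue_items = batch_work_packets(list(range(25)))
print(message, waits, [len(item) for item in queue_items])
```

Backing off keeps idle workers from hammering a queue that has nothing for them, and batching turns 25 queue operations into 3; the cost, as the slide notes, is more outstanding requests on the client side for each batch in flight.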

Page 36: Exception Management in Distributed Applications

Keep it simple
> A large distributed system implies a need to engineer robustness to failure
> Distinguish between events that are random and unpredictable and poison-message kinds of failure
> Do not over-engineer efficient handling of occasional exceptions

Return exceptions to the client application
> Client can track the number of attempts to process a work item
> Distinguish poison messages and give up
> Parallel handling on HPC Server SOA platform

Complexity from varying message processing times
> Time-outs can be caused by several long-running pricings in the same job
> Retry time-outs by sending all pricings in the batch independently
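A client-side retry policy along these lines might look like the sketch below. It is a Python illustration under assumptions: the exception types, `price_batch`, and the limit of three attempts are hypothetical, and `send` stands for whatever call submits work and returns a dict of prices.

```python
class BatchTimeout(Exception):
    """Raised by the transport when a batch does not come back in time."""


class PricingError(Exception):
    """Raised when a work item fails during pricing."""


def price_batch(batch, send, max_attempts=3):
    """Return (prices, poison_items) for a batch of work item ids."""
    attempts = {item: 1 for item in batch}  # the batch send counts as attempt 1
    try:
        return send(list(batch)), []
    except (BatchTimeout, PricingError):
        pass  # fall through and retry each item on its own

    results, poison = {}, []
    for item in batch:
        while attempts[item] < max_attempts:
            attempts[item] += 1
            try:
                results.update(send([item]))
                break
            except (BatchTimeout, PricingError):
                pass  # try again until the attempt budget is spent
        else:
            poison.append(item)  # attempts exhausted: treat as a poison message
    return results, poison


def fake_send(items):
    """Test double: batches time out, one asset can never be priced."""
    if len(items) > 1:
        raise BatchTimeout("batch took too long")
    if items[0] == "bad-asset":
        raise PricingError("cannot price")
    return {items[0]: 100.0}


results, poison = price_batch(["a", "bad-asset", "b"], fake_send)
print(results, poison)
```

Resending items independently separates the slow-but-valid pricings from the genuinely bad one, and the attempt counter gives the client a simple rule for giving up on poison messages without elaborate failure handling.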

Page 37: Diagnostics and Run-time Monitoring

A challenge for large-scale applications, even more so for the cloud
> Logging and monitoring must be switchable so as to reduce overhead
> Variable level of diagnostics and logging
> Requirement to filter information through the decoupled architecture (on node; centralized in Azure; returned to client)

Key data for the architectural pattern
> Request and result queue: successful/unsuccessful read, write and delete; time taken for all operations
> Empty request queue gets
> Count of successful/unsuccessful work packets
> % Processor Time performance counter
> Cache misses

We used a custom-built solution during TAP
> Nodes broadcast over the service bus
> Clients subscribe to trace messages

New diagnostic & monitoring package provides platform support

Page 38: Final Comments

Integrating on-premise and cloud applications

Page 39: Production Services across On-Premise and Cloud

Operational integration
> Fully integrate Windows Azure capabilities with RiskMetrics operational infrastructure
> Provisioning plus diagnostic & monitoring packages

“Outside-In” services
> Control and visibility of the services on the cloud consistent with on-premise services

Resource view
> Nodes
> Queues
> Blob stores

Process view
> Throughput & performance
> Traceability
> Problem identification
> Process linkage (intra- & inter-cloud)

Binding SLA commitments
Operational support escalation

Page 40: RiskBurst™ on Windows Azure

Effective architectural pattern delivers key business benefits
> Elastic scaling
> Enhanced services
> Empowered innovation
> High reliability
> Improved agility

Windows Azure was an obvious choice of cloud platform
> Minimize impedance mismatch between on-premise and off-premise
> .NET/WCF/HPC SOA in data center extended to cloud
> Configure to run in either environment
> Familiar development environment
> Massive scalability

View of Azure as extension of OS into the cloud
> Undertake work with HPC Server team in 2010
> Ability to target either Azure-hosted WCF services or HPC Server-hosted WCF services in a seamless manner
> Synchronization of on-premise Velocity instance with Azure instance

Page 41: Acknowledgements

Prototype development:
> Stuart Hartley (University of York, UK)
> Simon Davies (TAP programme)

Production development team:
> Rich Bower (Team Lead)
> Kelly Crawford (RiskBurst Server/Client)
> Simon Davies (TAP Programme)
> Jonathan Blair (Microsoft Consulting)

Supporting cast:
> Alistair Beagley (DPE / Azure)
> Patrick Butler Monterde (TAP Programme)
> Azure Product Group (Hoi Vo, Brad Calder, Tom Fahrig, Joe Chau)
> Hunter Cadzow & Analytics Development at RiskMetrics
> Tom Stockdale (RiskMetrics CTO)

Page 42: More Information

> SVC16 Developing Advanced Applications with Windows Azure
> SVC09 Windows Azure Tables and Queues Deep Dive
> SVC14 Windows Azure Blobs and Drives Deep Dive
> SVC08 Patterns for Building Scalable and Reliable Windows Azure Applications
> Windows Azure Platform lounge

Page 43: Your Feedback Is Important to Us!

Please fill out session evaluation forms online at MicrosoftPDC.com

Page 44: Learn More on Channel 9

> Expand your PDC experience through Channel 9
> Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses

channel9.msdn.com/learn
Built by Developers for Developers….

Page 45: Lessons Learned: Building Scalable Applications with the Windows Azure Platform

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
