The impact of cloud, NSBCon NY, by Yves Goeleven
The impact of cloud
Yves Goeleven
The cloudy Belgian
• Founder of MessageHandler.net
• Developer on NServiceBus
• Microsoft Azure MVP
• @YvesGoeleven
Agenda
The impact of cloud
• Understanding cloud
• Failure is normal
• Size matters
• ‘At your service’
• How to thrive
Understanding
Why are people interested?
Various reasons
• Automation
• Scalability (scale out)
• Elasticity (scale in again)
• Cost
• Globally available
What is Azure?
Global network of huge data centers operated by Microsoft
200 services running on top
Service categories: Storage, Big data, Caching, CDN, Database, Identity, Media, Networking, Traffic, Messaging, Cloud Services, Web Sites, Connectivity, Mobile, Virtual Machines
Datacenter Network Architecture
Quantum10v2 Architecture (Gen 3)
• 100K servers, 50,000 Gbps
[Diagram labels: TOR, Spine, DC Spine Set (DS), DC Routers (DCR)]
Older Architectures
DLA Architecture (Gen 1)
• 10K servers, 120 Gbps
[Diagram labels: DC Router, Access Routers, Aggregation + LB (AGG, LB), groups of 20 racks, each rack with 40 nodes, TOR, Digi, APC]
Quantum10 Architecture (Gen 2)
• 30K servers, 30,000 Gbps
[Diagram labels: TOR, Spine, BL, DC Routers (DCR)]
Datacenter Clusters
Datacenters are divided into “clusters”
• Approximately 1000 rack-mounted servers (called “nodes”)
• Provides a unit of fault isolation
• Each cluster is managed by a Fabric Controller (FC)
• FC is responsible for:
  • Blade provisioning
  • Blade management
  • Service deployment and lifecycle
[Diagram: Cluster 1, Cluster 2, … Cluster n, each with its own FC, connected to the datacenter network]
Fabric Controller
The “kernel” of the cloud operating system
• Manages datacenter hardware
• Manages Windows Azure services
• Four main responsibilities:
  • Datacenter resource allocation
  • Datacenter resource provisioning
  • Service lifecycle management
  • Service health management
• Inputs:
  • Description of the hardware and network resources it will control
  • Service model and binaries for cloud applications
[Diagram: on a server, the Windows Kernel runs processes (Word, SQL Server); in a datacenter, the Fabric Controller runs services (Exchange Online, SQL Azure)]
[Diagram labels: Deployment, Service Package]
Service Resource Allocation
Complicated stuff
• Goal: allocate service components to available resources while satisfying all hard constraints
  • HW requirements: CPU, Memory, Storage, Network
  • Fault domains
  • Update domains
• Secondary goal: satisfy soft constraints
  • Prefer allocations which will simplify servicing the host OS/hypervisor
  • Optimize network proximity: pack nodes
[Diagram: a service package placed on virtual machines in Server Rack 1 and Server Rack 2; steps: Provision Role Instances, Deploy App Code, Configure Network]
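To make the fault-domain and update-domain constraints a bit more concrete, here is a minimal sketch in Python. It assumes a naive round-robin placement; the function name `spread_instances` and the domain counts are hypothetical, and the real Fabric Controller allocator (which also weighs hardware requirements and soft constraints) is not described in these slides.

```python
from collections import Counter


def spread_instances(instance_count, fault_domains, update_domains):
    """Assign each role instance a (fault_domain, update_domain) pair.

    Hypothetical illustration only: instances are spread round-robin so
    that no single domain holds all of them. CPU, memory, storage and
    network requirements are ignored.
    """
    if instance_count < 1 or fault_domains < 1 or update_domains < 1:
        raise ValueError("all arguments must be positive")
    return [
        (i % fault_domains, i % update_domains)
        for i in range(instance_count)
    ]


placement = spread_instances(instance_count=4, fault_domains=2, update_domains=3)

# Count instances per fault domain: losing one domain (for example a rack)
# still leaves instances running in the other.
per_fault_domain = Counter(fd for fd, _ in placement)
```

With two fault domains and four instances, each fault domain ends up with two instances, so a single rack failure or host upgrade never takes the whole role down.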
Service Deployment
Provisioning a Node
• Power on node
• PXE-boot Maintenance OS
• Agent formats disk and downloads Host OS via Windows Deployment Services (WDS)
• Host OS boots, runs Sysprep /specialize, reboots
• FC connects with the “Host Agent”
[Diagram labels: Fabric Controller, Image Repository (Role Images), PXE Server, Windows Deployment Server, Node (Maintenance OS, Parent OS, Windows Azure OS, FC Host Agent, Windows Azure Hypervisor)]
[Diagram sequence in the Azure datacenter: Provision Role Instances, Deploy App Code, Configure Network; finally the network load balancer is configured for traffic]
Failure is normal
Implications
Of commodity hardware with self healing
• Machine failure is normal
  • Machines are small, low specs
  • Little to no redundancy
• Always partially broken state
• FC provisions ‘clean’ machines
• Can occur at any time
  • On failure
  • On host upgrades
  • On move
How to handle
Small machines & continuous failure
• Distribute & duplicate application across multiple machines
  • At least 2 of each (3 or 5 is better)
• Accept that target machine may be down
  • Ensure temporal decoupling
• Do not design ‘RPC-style’, use queueing instead
• Do not put anything on disk
  • You will lose data!
  • Except for Virtual Machines with persisted data disks
  • Use Azure storage services instead*
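The temporal decoupling point can be illustrated with a small Python sketch. It is only an illustration: the in-memory `queue.Queue` stands in for a durable cloud queue (an in-memory queue would itself be lost on machine failure), and the `Receiver` class and message names are hypothetical.

```python
import queue


class Receiver:
    """Hypothetical worker that may be down when messages are sent."""

    def __init__(self, inbox):
        self.inbox = inbox
        self.handled = []

    def drain(self):
        # Process whatever accumulated while this instance was unavailable.
        while True:
            try:
                message = self.inbox.get_nowait()
            except queue.Empty:
                return
            self.handled.append(message)


inbox = queue.Queue()       # stand-in for a durable queue service
inbox.put("order-1")        # the sender does not wait for the receiver
inbox.put("order-2")

receiver = Receiver(inbox)  # the receiver only comes up later
receiver.drain()
```

The sender completes its work even though no receiver exists yet, which is exactly what an RPC-style call cannot do.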
Size matters
Some numbers
Just to illustrate how huge Azure is
• 13 regions
• 321 IP ranges
• 250,000+ customers
• 2,000,000+ VMs
• 25+ trillion objects stored
Implications
Of such a huge network
• Latency is a given
  • Network IO is typically the bottleneck
• Network partitioning is normal
  • Distributed transactions are flaky or not supported
How to handle
Latency & lack of DTC
• Most operations in the cloud will be IO/network bound
  • Multi-threaded processing
  • Process messages, aka wait, in parallel
  • But don’t overdo it (12-24 threads per core)
• Lack of DTC
  • Keep operations atomic
  • Use compensation logic
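As a rough illustration of waiting in parallel, the Python sketch below sizes a thread pool from the number of cores. The `handle` function, the `THREADS_PER_CORE` constant and the simulated latency are assumptions made for the example, not anything prescribed by the talk beyond the 12-24 range.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

THREADS_PER_CORE = 12  # lower end of the 12-24 range mentioned above


def handle(message):
    """Hypothetical handler: the sleep stands in for a network call."""
    time.sleep(0.05)
    return message.upper()


def process_all(messages, cores=None):
    cores = cores or os.cpu_count() or 1
    workers = cores * THREADS_PER_CORE
    # The threads spend most of their time waiting on IO, so many of
    # them can share a single core.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle, messages))


results = process_all(["a", "b", "c", "d"], cores=1)
```

All four messages wait on their simulated network call at the same time, so the batch takes roughly one call's latency instead of four.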
‘At your service’
As A Service
Understanding
• Same capabilities as a product, but it’s not a product
• Operated by vendor
• Multitenant, aka shared hosting
• Low marginal profits
• ‘Capacity’ vs ‘provisioned’
Implications
Microsoft doesn’t want you to be in control!
• Individual resources are limited
  • Throttling
• Your resources are moved around: unpredictable resource performance
  • Transient errors
• No locks or very short locks
  • No local transactions!
  • 1 exception: SQL, as it is built into the protocol
How to handle
Throttling & lack of transactions
• Retry, Retry, Retry
  • On transient errors and throttles
  • With backoff algorithms
• Lack of local transactions
  • Keep operations atomic with retries
  • Use compensation logic
  • Take care of idempotency
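A retry with backoff could look roughly like the Python sketch below. It assumes exponential backoff with random jitter, which is one common choice among several; `TransientError`, `retry_with_backoff` and `flaky_operation` are hypothetical names, not part of any Azure SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for throttling or other transient failures."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Exponential backoff: 1x, 2x, 4x ... the base delay, plus jitter
            # so that many clients do not retry at the same instant.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


calls = {"count": 0}


def flaky_operation():
    # Fails twice (as if throttled), then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("throttled")
    return "stored"


result = retry_with_backoff(flaky_operation)
```

Because the operation may run more than once, it has to be idempotent, which is why retries and idempotency appear together in the list above.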
Thrive
How to thrive in the cloud
Use NServiceBus to deal with shortcomings
• Messaging provides distribution & temporal decoupling
• Multithreading model built in
  • Ideal for network bound operations
• Retry, retry, retry
  • Azure transports use retries instead of relying on transactions
  • First Level Retry
  • Second Level Retry
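The idea behind two levels of retries can be sketched in Python as follows: a few immediate attempts, then deferred rounds, then an error queue. This is an illustration of the concept only, not NServiceBus code; `process_with_retries`, the attempt counts and the delays are assumptions, and delays are recorded in a log rather than actually waited for.

```python
def process_with_retries(message, handler, first_level=3, second_level_delays=(10, 20, 30)):
    """Return a log of what a hypothetical two-level retry policy would do.

    First level: retry immediately. Second level: defer the message and
    try again later. After that the message goes to an error queue.
    """
    log = []
    # Round 0 is the initial delivery; each later round follows a deferral.
    for round_number, delay in enumerate((0,) + tuple(second_level_delays)):
        if delay:
            log.append(f"defer {delay}s")
        for attempt in range(first_level):
            try:
                handler(message)
                log.append("handled")
                return log
            except Exception:
                log.append(f"round {round_number} attempt {attempt + 1} failed")
    log.append("moved to error queue")
    return log


attempts = {"n": 0}


def handler(message):
    # Fails on the first four deliveries, succeeds on the fifth.
    attempts["n"] += 1
    if attempts["n"] <= 4:
        raise RuntimeError("transient failure")


log = process_with_retries("msg-1", handler)
```

Here the three immediate attempts fail, the message is deferred once, and it succeeds on the second attempt of the deferred round.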
Choosing the right transports
Both use retries and are built for reliability
• Azure ServiceBus
• Azure Storage
Azure Storage Queues
Queue construct in Azure Storage Services
• Extremely reliable
• Very cheap
• 200TB/500TB capacity limit
• HTTP(S) based
• Queue Peek Lock for retries
• Max 7 days TTL!
Azure ServiceBus
Broker service in Azure
• Highly reliable
• Supports queues, topics & subscriptions
• 5GB capacity limit
• No limit on TTL
• TCP based, lower latency
• Queue Peek Lock for retries
• Emulates local transactions
• Loads of additional features
• Relatively expensive*
Azure ServiceBus
Additional features & applicability
• Applicable:
  • Duplicate detection: time window
  • Partitioning: bundle of queues/topics
  • Message ordering
  • Deadlettering
  • Batched operations
• Not applicable:
  • Sessions: instance affinity for message set, used for large messages, use databus instead
How to thrive in the cloud
Deal with cost model
• Worker role translates to at least 2 VMs
• Endpoint per handler
  • Gets expensive very fast
• Shared endpoint hosting provided
How to thrive in the cloud
Do not trust your disk!
• Do not put anything on disk!
  • The machine will fail, the disk will be gone!
  • Anyone noticed there is no SLA for individual VMs?
• Put your stuff in Azure storage services
  • 99.99% SLA
  • Locally Redundant & Geo Redundant
How to thrive in the cloud
NServiceBus helps a lot, but you need to code to it as well
• You need to take care of idempotency
  • Atomic message handler implementations
  • Sagas too! Update saga state & nothing else!
  • Use sagas to coordinate compensation logic
  • Check for retries
  • Check side effects
See http://docs.particular.net/nservicebus/understanding-transactions-in-windows-azure for more options
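One simple way to make a handler idempotent is to remember which message ids were already processed. The Python sketch below assumes each message carries a unique id and keeps everything in memory; `AccountHandler` and its fields are hypothetical, and a real handler would have to persist the state change and the processed id in one atomic write.

```python
class AccountHandler:
    """Hypothetical handler that tolerates redelivery of the same message."""

    def __init__(self):
        self.balance = 0
        self.processed_ids = set()

    def handle(self, message_id, amount):
        # A retried or duplicated message carries the same id: skip it.
        if message_id in self.processed_ids:
            return False
        # In a real system these two updates must be stored atomically.
        self.balance += amount
        self.processed_ids.add(message_id)
        return True


handler = AccountHandler()
first = handler.handle("msg-42", 100)
second = handler.handle("msg-42", 100)  # redelivered after a retry
```

The second delivery is ignored, so the balance reflects the deposit once even though the message arrived twice.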
Wrapup
Want to know more?
• Overview: http://docs.particular.net/nservicebus/windows-azure-transport
• Hosting: http://docs.particular.net/nservicebus/hosting-nservicebus-in-windows-azure
• Cloud services: http://docs.particular.net/nservicebus/hosting-nservicebus-in-windows-azure-cloud-services
• Shared host: http://docs.particular.net/nservicebus/shared-hosting-nservicebus-in-windows-azure-cloud-services
• Azure servicebus: http://docs.particular.net/nservicebus/using-azure-servicebus-as-transport-in-nservicebus
• Azure storage queues: http://docs.particular.net/nservicebus/using-azure-storage-queues-as-transport-in-nservicebus
• Storage persistence: http://docs.particular.net/nservicebus/using-azure-storage-persistence-in-nservicebus
• Transactions: http://docs.particular.net/nservicebus/understanding-transactions-in-windows-azure
Resources
Or get your hands dirty?
• Samples: https://github.com/particular/nservicebus.azure.samples
Resources
Thanks