The impact of cloud, NSBCon NY, by Yves Goeleven

The impact of cloud

Yves Goeleven
The cloudy Belgian

• Founder of MessageHandler.net

• Developer on NServiceBus

• Microsoft Azure MVP

• @YvesGoeleven

Agenda
The impact of cloud

• Understanding cloud
• Failure is normal
• Size matters
• ‘At your service’
• How to thrive

Understanding

Why are people interested?
Various reasons

• Automation
• Scalability (scale out)
• Elasticity (scale in again)
• Cost
• Globally available

What is Azure?
Global network of huge data centers operated by Microsoft

200 services running on top

Storage, Big data, Caching, CDN, Database, Identity, Media, Networking, Traffic, Messaging, Cloud Services, Web Sites, Connectivity, Mobile, Virtual Machines

Datacenter Network Architecture
Quantum10v2 Architecture (Gen 3): 100K servers, 50,000 Gbps

[Diagram labels: TOR, Spine, DC Spine Set, DS, DC Routers (DCR)]

Older Architectures
DLA Architecture (Gen 1): 10K servers, 120 Gbps
Quantum10 Architecture (Gen 2): 30K servers, 30,000 Gbps

[Diagram labels, Gen 1: DC Router, Access Routers, Aggregation + LB (AGG, LB), groups of 20 racks, each rack with 40 nodes, TOR, Digi and APC]
[Diagram labels, Gen 2: TOR, Spine, BL, DC Routers (DCR)]

Datacenter Clusters
Datacenters are divided into “clusters”

• Approximately 1000 rack-mounted servers (called “nodes”)
• Provides a unit of fault isolation
• Each cluster is managed by a Fabric Controller (FC)
• FC is responsible for:
  • Blade provisioning
  • Blade management
  • Service deployment and lifecycle

[Diagram: Cluster 1, Cluster 2, … Cluster n on the datacenter network, each with its own FC]

Fabric Controller
The “kernel” of the cloud operating system

• Manages datacenter hardware
• Manages Windows Azure services
• Four main responsibilities:
  • Datacenter resource allocation
  • Datacenter resource provisioning
  • Service lifecycle management
  • Service health management
• Inputs:
  • Description of the hardware and network resources it will control
  • Service model and binaries for cloud applications

[Diagram: on a server, the Windows kernel runs processes such as Word and SQL Server; on a datacenter, the Fabric Controller runs services such as Exchange Online and SQL Azure]

Deployment

[Diagram: service package]

Service Resource Allocation
Complicated stuff

• Goal: allocate service components to available resources while satisfying all hard constraints
  • HW requirements: CPU, memory, storage, network
  • Fault domains
  • Update domains
• Secondary goal: satisfy soft constraints
  • Prefer allocations which will simplify servicing the host OS/hypervisor
  • Optimize network proximity: pack nodes

[Diagram: a service package is placed on virtual machines in Server Rack 1 and Server Rack 2. Steps: provision role instances, deploy app code, configure network]
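To make the fault domain and update domain constraints more concrete, here is a minimal Python sketch. It is not the Fabric Controller's allocation algorithm, which the slides only describe as "complicated stuff". The function names, the round-robin strategy, and the domain counts (2 fault domains, 5 update domains) are assumptions chosen for illustration.

```python
def place_instances(instance_count, fault_domains=2, update_domains=5):
    """Assign each instance a (fault_domain, update_domain) pair, round-robin.

    Hypothetical helper: the domain counts are illustrative values only.
    """
    return [(i % fault_domains, i % update_domains) for i in range(instance_count)]


def survivors(placement, failed_fault_domain):
    """Indexes of instances still running when one fault domain (e.g. a rack) fails."""
    return [i for i, (fd, _) in enumerate(placement) if fd != failed_fault_domain]


# Four instances of one role spread over two fault domains.
placement = place_instances(4)

# Losing fault domain 0 still leaves half of the instances running.
still_up = survivors(placement, failed_fault_domain=0)
```

The point of the sketch is that spreading instances over domains only helps when there are at least two instances of each role to spread.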

Service Deployment
Provisioning a Node

• Power on node
• PXE-boot Maintenance OS
• Agent formats disk and downloads Host OS via Windows Deployment Services (WDS)
• Host OS boots, runs Sysprep /specialize, reboots
• FC connects with the “Host Agent”

[Diagram labels: Fabric Controller, Image Repository with role images, PXE Server, Windows Deployment Server, and a node showing Maintenance OS, Parent OS, Windows Azure OS, FC Host Agent and Windows Azure Hypervisor]

[Diagram: a service package deployed in the Azure datacenter, with the network load balancer configured for traffic]

Failure is normal


Implications
Of commodity hardware with self healing

• Machine failure is normal
  • Machines are small, low specs
  • Little to no redundancy
• Always partially broken state
• FC provisions ‘clean’ machines
  • Can occur at any time
  • On failure
  • On host upgrades
  • On move

How to handle
Small machines & continuous failure

• Distribute & duplicate application across multiple machines
  • At least 2 of each (3 or 5 is better)
• Accept that target machine may be down
  • Ensure temporal decoupling
  • Do not design ‘RPC-style’, use queueing instead
• Do not put anything on disk
  • You will lose data!
  • Except for Virtual Machines with persisted data disks
  • Use Azure storage services instead*
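Temporal decoupling can be illustrated with a small Python sketch. The `InMemoryQueue` class is a hypothetical stand-in for a durable cloud queue and exists only for this example; a real queue lives outside both the sender and the receiver machines.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for a durable queue hosted outside both machines."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # The sender only needs the queue to be reachable, not the receiver.
        self._messages.append(message)

    def receive(self):
        # Returns the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None


def drain(queue, handler):
    """Process everything that accumulated while the receiver was away."""
    handled = []
    while (message := queue.receive()) is not None:
        handled.append(handler(message))
    return handled


queue = InMemoryQueue()

# The receiving machine is down: senders keep working regardless.
for order_id in (1, 2, 3):
    queue.send({"order_id": order_id})

# A freshly provisioned receiver comes up later and catches up.
results = drain(queue, lambda m: f"processed order {m['order_id']}")
```

With an RPC-style call, each of the three sends would have failed while the receiver was down; with a queue in between, the work simply waits.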

Size matters

Some numbers
Just to illustrate how huge Azure is

• 13 regions
• 321 IP ranges
• 250,000+ customers
• 2,000,000+ VMs
• 25+ trillion objects stored

Implications
Of such a huge network

• Latency is a given
  • Network IO is typically the bottleneck
• Network partitioning is normal
  • Distributed transactions are flaky or not supported

How to handle
Latency & lack of DTC

• Most operations in the cloud will be IO/network bound
  • Multi-threaded processing
  • Process messages, aka wait, in parallel
  • But don’t overdo it (12-24 threads per core)
• Lack of DTC
  • Keep operations atomic
  • Use compensation logic
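Because handlers mostly wait on the network, many of them can wait at the same time. The Python sketch below sizes a thread pool from the core count. The multiplier of 12 is the lower end of the 12-24 range mentioned above; `handle` is a hypothetical handler that simulates a network call with a short sleep.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

THREADS_PER_CORE = 12  # assumption: lower end of the 12-24 range


def worker_count(cores=None, per_core=THREADS_PER_CORE):
    """Number of worker threads for a machine with the given core count."""
    cores = cores or os.cpu_count() or 1
    return cores * per_core


def handle(message):
    # Simulated network call: the thread spends its time waiting, not computing.
    time.sleep(0.01)
    return message * 2


def process_all(messages, cores=None):
    with ThreadPoolExecutor(max_workers=worker_count(cores)) as pool:
        # map keeps the input order while the waits overlap.
        return list(pool.map(handle, messages))


doubled = process_all(range(48), cores=2)
```

Adding far more threads than this tends to cost more in context switching and memory than it gains in overlapped waiting, which is the "don't overdo it" part.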

‘At your service’

200 services running on top

Storage, Big data, Caching, CDN, Database, Identity, Media, Networking, Traffic, Messaging, Cloud Services, Web Sites, Connectivity, Mobile, Virtual Machines

As a Service
Understanding

• Same capabilities as a product, but it’s not a product
• Operated by vendor
• Multitenant, aka shared hosting
• Low marginal profits
• ‘Capacity’ vs ‘provisioned’

Implications
Microsoft doesn’t want you to be in control!

• Individual resources are limited
  • Throttling
• Your resources are moved around: unpredictable resource performance
  • Transient errors
• No locks or very short locks
  • No local transactions!
  • 1 exception: SQL, as it is built into the protocol

How to handle
Throttling & lack of transactions

• Retry, retry, retry
  • On transient errors and throttles
  • With backoff algorithms
• Lack of local transactions
  • Keep operations atomic with retries
  • Use compensation logic
  • Take care of idempotency
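A retry with a backoff algorithm could look like the following Python sketch. `TransientError`, the base delay of 0.5 seconds, and the 30 second cap are assumptions for illustration; real services signal throttling in their own ways.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a throttle or a temporary service failure."""


def backoff_delays(retries, base=0.5, cap=30.0):
    """Exponential schedule in seconds: base, 2*base, 4*base, ... limited to cap."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def retry(operation, attempts=5, sleep=time.sleep, jitter=random.random):
    """Run operation, retrying on TransientError with jittered exponential backoff."""
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller deal with it
            # Random jitter spreads out clients that were throttled together.
            sleep(delays[attempt] * jitter())


calls = {"count": 0}


def flaky_operation():
    # Fails twice with a transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("throttled")
    return "ok"


# Fake sleep and fixed jitter make the example deterministic and instant.
waits = []
result = retry(flaky_operation, sleep=waits.append, jitter=lambda: 1.0)
```

Retrying is only safe when the operation is idempotent: the first attempt may have succeeded on the service side even though the caller saw an error.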

Thrive

How to thrive in the cloud
Use NServiceBus to deal with shortcomings

• Messaging provides distribution & temporal decoupling
• Multithreading model built in
  • Ideal for network bound operations
• Retry, retry, retry
  • Azure transports use retries instead of relying on transactions
  • First Level Retry
  • Second Level Retry
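The idea behind two retry levels can be sketched in Python as follows. This is not NServiceBus code or its API: `process`, the retry counts, and the 10 second delay step are assumptions for illustration. The first level retries immediately, the second level defers the message for a growing delay before trying again, and a message that keeps failing ends up in an error queue.

```python
def process(handler, message, first_level=3, second_level=2, delay_step=10,
            defer=lambda seconds: None):
    """Return ("handled", result) or ("error_queue", last_error).

    Hypothetical sketch: all parameter values are illustrative only.
    """
    last_error = None
    for round_number in range(second_level + 1):
        if round_number > 0:
            # Second level: set the message aside for a growing delay.
            defer(round_number * delay_step)
        for _ in range(first_level):
            # First level: try again right away.
            try:
                return ("handled", handler(message))
            except Exception as error:
                last_error = error
    return ("error_queue", last_error)


attempts = {"count": 0}


def handler(message):
    # Fails four times (e.g. a throttled dependency), then succeeds.
    attempts["count"] += 1
    if attempts["count"] <= 4:
        raise RuntimeError("storage throttled")
    return f"handled {message}"


deferred = []
outcome = process(handler, "order 42", defer=deferred.append)
```

Immediate retries absorb short glitches, while the delayed round gives a throttled or relocated resource time to recover.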

Choosing the right transports
Both retry and are built for reliability

• Azure ServiceBus
• Azure Storage

Azure Storage Queues
Queue construct in Azure Storage Services

• Extremely reliable
• Very cheap
• 200TB/500TB capacity limit
• HTTP(S) based
• Queue Peek Lock for retries
• Max 7 days TTL!
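How a peek lock enables retries without transactions can be shown with a small in-memory simulation in Python. `PeekLockQueue` is a hypothetical class, not the Azure SDK; it only mimics the behaviour in which a received message becomes invisible for a timeout and reappears unless the handler deletes it.

```python
import itertools
import time


class PeekLockQueue:
    """Hypothetical in-memory model of receive-then-delete queue semantics."""

    def __init__(self, visibility_timeout=30, clock=time.monotonic):
        self._timeout = visibility_timeout
        self._clock = clock
        self._messages = {}  # message id -> [body, time at which it is visible]
        self._ids = itertools.count(1)

    def send(self, body):
        message_id = next(self._ids)
        self._messages[message_id] = [body, self._clock()]
        return message_id

    def receive(self):
        now = self._clock()
        for message_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self._timeout  # hide the message, do not delete it
                return message_id, entry[0]
        return None

    def complete(self, message_id):
        # Only a successful handler removes the message for good.
        self._messages.pop(message_id, None)


now = [0.0]  # fake clock so the example does not have to wait
queue = PeekLockQueue(visibility_timeout=30, clock=lambda: now[0])
queue.send("ship order 42")

first = queue.receive()    # a handler takes the message
hidden = queue.receive()   # nobody else can see it in the meantime
now[0] += 31               # the handler's machine failed; the lock expires
second = queue.receive()   # the message is delivered again
queue.complete(second[0])
empty = queue.receive()
```

A crashed handler therefore needs no cleanup: doing nothing is what triggers the retry.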

Azure ServiceBus
Broker service in Azure

• Highly reliable
• Supports queues, topics & subscriptions
• 5GB capacity limit
• No limit on TTL
• TCP based, lower latency
• Queue Peek Lock for retries
• Emulates local transactions
• Loads of additional features
• Relatively expensive*

Azure ServiceBus
Additional features & applicability

• Applicable:
  • Duplicate detection: time window
  • Partitioning: bundle of queues/topics
  • Message ordering
  • Deadlettering
  • Batched operations
• Not applicable:
  • Sessions: instance affinity for message set, used for large messages, use databus instead
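Duplicate detection over a time window can be sketched in Python as below. `DuplicateDetector` is a hypothetical class that models the idea only: message ids are remembered for a limited window, and a repeat inside that window is dropped. The 600 second window is an arbitrary value for the example.

```python
import time


class DuplicateDetector:
    """Hypothetical model: remembers message ids for a limited time window."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._seen = {}  # message id -> time it was first seen

    def is_duplicate(self, message_id):
        now = self._clock()
        # Forget ids that have fallen out of the window.
        self._seen = {mid: t for mid, t in self._seen.items()
                      if now - t < self._window}
        if message_id in self._seen:
            return True
        self._seen[message_id] = now
        return False


now = [0.0]  # fake clock so the example is deterministic
detector = DuplicateDetector(window_seconds=600, clock=lambda: now[0])

first_time = detector.is_duplicate("msg-1")    # new message
now[0] = 10
resent = detector.is_duplicate("msg-1")        # retry inside the window
now[0] = 700
after_window = detector.is_duplicate("msg-1")  # window has passed
```

The last call shows the limit of the approach: a duplicate that arrives after the window is treated as new, so handlers still need to be idempotent.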

How to thrive in the cloud
Deal with cost model

• Worker role translates to at least 2 VMs
• Endpoint per handler
  • Gets expensive very fast
• Shared endpoint hosting provided

How to thrive in the cloud
Do not trust your disk!

• Do not put anything on disk!
  • The machine will fail, the disk will be gone!
  • Anyone noticed there is no SLA for individual VMs?
• Put your stuff in Azure storage services
  • 99.99% SLA
  • Locally redundant & geo-redundant

How to thrive in the cloud
NServiceBus helps a lot, but you need to code to it as well

• You need to take care of idempotency
  • Atomic message handler implementations
  • Sagas too! Update saga state & nothing else!
  • Use sagas to coordinate compensation logic
  • Check for retries
  • Check side effects

See http://docs.particular.net/nservicebus/understanding-transactions-in-windows-azure for more options
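Compensation logic replaces the rollback that a transaction would have provided. The Python sketch below is an illustration of the idea, not an NServiceBus saga: `run_with_compensation` and the order steps are hypothetical names. Each step is one atomic operation paired with an action that undoes it.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs in order.

    When an action fails, the compensations of the steps that already
    completed are run in reverse order. Returns True on full success.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def book_shipping():
    # Hypothetical step that fails, e.g. because the carrier is unavailable.
    raise RuntimeError("carrier unavailable")


succeeded = run_with_compensation([
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (book_shipping, lambda: log.append("cancel shipping")),
])
```

In a message-based system each action and each compensation would be its own message, and the compensations themselves have to be idempotent because they can be retried too.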




Wrapup

Want to know more?

• Overview: http://docs.particular.net/nservicebus/windows-azure-transport
• Hosting: http://docs.particular.net/nservicebus/hosting-nservicebus-in-windows-azure
• Cloud services: http://docs.particular.net/nservicebus/hosting-nservicebus-in-windows-azure-cloud-services
• Shared host: http://docs.particular.net/nservicebus/shared-hosting-nservicebus-in-windows-azure-cloud-services
• Azure servicebus: http://docs.particular.net/nservicebus/using-azure-servicebus-as-transport-in-nservicebus
• Azure storage queues: http://docs.particular.net/nservicebus/using-azure-storage-queues-as-transport-in-nservicebus
• Storage persistence: http://docs.particular.net/nservicebus/using-azure-storage-persistence-in-nservicebus
• Transactions: http://docs.particular.net/nservicebus/understanding-transactions-in-windows-azure


Or get your hands dirty?

• Samples: https://github.com/particular/nservicebus.azure.samples


Thanks
