
Page 1: A Framework for Highly-Available Real-Time Internet Services Bhaskaran Raman, EECS, U.C.Berkeley

A Framework for Highly-Available Real-Time Internet Services

Bhaskaran Raman, EECS, U.C. Berkeley

Page 2

Problem Statement

• Real-time services with long-lived sessions
• Need to provide continued service in the face of failures

[Figure: real-time services on the Internet; a content server reaches clients through a video-smoothing proxy and a transcoding proxy]

Page 3

Problem Statement

• Goals:
  1. Quick recovery
  2. Scalability

[Figure: a video-on-demand server streaming to a client; failures shown: network path failure and service failure; recovery via a service replica, with client-side monitoring]

Page 4

Our Approach

• Service infrastructure
  – Computer clusters deployed at several points on the Internet
  – For service replication & path monitoring
  – (path = the path of the data stream in a client session)

Page 5

Summary of Related Work

• Fail-over within a single cluster
  – Active Services (e.g., video transcoding)
  – TACC (web proxy)
  – Only service failure is handled
• Web mirror selection
  – E.g., SPAND
  – Does not handle failure during a session
• Fail-over for network paths
  – Internet route recovery
    • Not quick enough
  – ATM, telephone networks, MPLS
    • No mechanism for the wide-area Internet

No architecture addresses network path failure during a real-time session

Page 6

Research Challenges

• Wide-area monitoring
  – Feasibility: how quickly and reliably can failures be detected?
  – Efficiency: per-session monitoring would impose significant overhead
• Architecture
  – Who monitors whom?
  – How are services replicated?
  – What is the mechanism for fail-over?

Page 7

Is Wide-Area Monitoring Feasible?

Monitoring path liveness using a keep-alive heartbeat

[Figure: heartbeat timelines; a failure is detected by timeout after the timeout period elapses; a false positive is a failure detected incorrectly, when heartbeats resume after the timeout period]

There’s a trade-off between time-to-detection and rate of false-positives
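This trade-off can be made concrete with a toy timeout detector. The sketch below is illustrative only (the function and values are mine, not the talk's implementation): heartbeats are represented by their arrival times, and any inter-arrival gap longer than the timeout is flagged as a suspected failure.

```python
# Timeout-based failure detection over a stream of heartbeat arrival times.
# A shorter timeout detects real failures faster, but also flags transient
# loss bursts, producing false positives.

def detect_failures(arrival_times, timeout):
    """Return indices of arrivals preceded by a gap longer than `timeout`."""
    suspected = []
    for i in range(1, len(arrival_times)):
        if arrival_times[i] - arrival_times[i - 1] > timeout:
            suspected.append(i)
    return suspected

# Heartbeats every 1 s, with a 3-heartbeat loss burst between t=4 and t=8.
arrivals = [0, 1, 2, 3, 4, 8, 9, 10]
print(detect_failures(arrivals, timeout=2.0))  # → [5]  (aggressive: flags the burst)
print(detect_failures(arrivals, timeout=5.0))  # → []   (conservative: misses it)
```

The same burst is a "failure" under the 2-second timeout and invisible under the 5-second one, which is exactly the detection-time vs. false-positive tension.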

Page 8

Is Wide-Area Monitoring Feasible?

• False positives due to:
  – Simultaneous losses
  – Sudden increases in RTT
• Related studies:
  – Internet RTT study, Acharya & Saltz, UMD 1996
    • RTT spikes are isolated
  – TCP RTO study, Allman & Paxson, SIGCOMM 1999
    • Significant RTT increases are quite transient
• Our experiments:
  – Ping data from ping servers
  – UDP heartbeats between Internet hosts

Page 9

Ping measurements

• Ping servers
  – 12 geographically distributed servers chosen
  – Approximation of a keep-alive stream
• Count the number of loss runs with > 4 simultaneous losses
  – Could be an actual failure or just intermittent losses
  – If we have 1-second HBs and time out after losing 4 HBs, this count gives the upper bound on the number of false positives
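The counting rule above can be sketched as follows. The encoding (one boolean per 1-second probe, True = reply received) and the function name are mine; this is a sketch of the rule, not the measurement code:

```python
# Count loss runs of more than `threshold` consecutive missed probes.
# With 1-second heartbeats and a timeout after 4 missed beats, each such
# run would have triggered a (possibly false) failure detection.

def count_long_loss_runs(trace, threshold=4):
    count = run = 0
    for ok in trace:
        if ok:
            if run > threshold:
                count += 1
            run = 0
        else:
            run += 1
    if run > threshold:  # the trace may end in the middle of a loss run
        count += 1
    return count

trace = [True] * 10 + [False] * 5 + [True] * 3 + [False] * 2 + [True] * 4
print(count_long_loss_runs(trace))  # → 1 (only the 5-loss run exceeds 4)
```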

[Figure: measurement setup; (1) a Berkeley host issues an HTTP request to a ping server, (2) the ping server sends ICMP pings to an Internet host, (3) results are returned over HTTP]

Page 10

Ping measurements

Ping server                  Ping host                     Total time   Runs > 4 misses
cgi.shellnet.co.uk           hnets.uia.ac.be               15:14:14     12
hnets.uia.ac.be              sites.inka.de                 20:55:58     29
cgi.shellnet.co.uk           hnets.uia.ac.be               14:53:14     0
www.hit.net                  d18183f47.rochester.rr.com    20:27:32     10
d18183f47.rochester.rr.com   www.his.com                   17:30:31     0
www.hit.net                  www.his.com                   20:28:05     2
www.csu.net                  zeus.lyceum.com               20:01:15     17
zeus.lyceum.com              www.atmos.albany.edu          20:01:17     62
www.csu.net                  www.atmos.albany.edu          20:01:18     7
www.interworld.net           www.interlog.com              14:19:44     6
www.interlog.com             inoc.iserv.net                20:32:07     24
www.interworld.net           inoc.iserv.net                14:39:56     0

Page 11

UDP-based keep-alive stream

• Geographically distributed hosts:
  – Berkeley, Stanford, UIUC, TU-Berlin, UNSW
• UDP heartbeat every 300 ms
• Measure gaps between receipt of successive heartbeats
• False positive:
  – No heartbeat received for > 2 seconds, but one received before 30 seconds
• Failure:
  – No HB for > 30 seconds
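The classification rule above can be sketched directly from inter-arrival gaps. A minimal sketch (function name mine; the boundary treatment at exactly 30 s is my assumption):

```python
# Classify gaps between successive heartbeat arrivals:
#   gap > 2 s but ending within 30 s  -> false positive
#   gap > 30 s                        -> failure

def classify_gaps(gaps_seconds):
    false_positives = sum(1 for g in gaps_seconds if 2 < g <= 30)
    failures = sum(1 for g in gaps_seconds if g > 30)
    return false_positives, failures

# Gaps (seconds) between arrivals; nominal spacing is 0.3 s.
print(classify_gaps([0.3, 0.3, 2.5, 0.3, 45.0, 0.3]))  # → (1, 1)
```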

Page 12

UDP-based keep-alive stream

HB destination   HB source   Total time   False positives
Berkeley         UNSW        130:48:45    135
UNSW             Berkeley    130:51:45    9
Berkeley         TU-Berlin   130:49:46    27
TU-Berlin        Berkeley    130:50:11    174
TU-Berlin        UNSW        130:48:11    218
UNSW             TU-Berlin   130:46:38    24
Berkeley         Stanford    124:21:55    258
Stanford         Berkeley    124:21:19    2
Stanford         UIUC        89:53:17     4
UIUC             Stanford    76:39:10     74
Berkeley         UIUC        89:54:11     6
UIUC             Berkeley    76:39:40     3

Page 13

What does this mean?

• If we have a failure-detection scheme:
  – Timeout of 2 sec
  – False positives can be as low as once a day
  – For many pairs of Internet hosts
• In comparison, BGP route recovery:
  – Takes > 30 seconds
  – Can take up to tens of minutes (Labovitz & Ahuja, SIGCOMM 2000)
  – Worse with multi-homing (an increasing trend)

Page 14

Architectural Requirements

• Efficiency:
  – Per-session monitoring: too much overhead
  – End-to-end monitoring: more latency, less effective
• Need aggregation
  – Client-side aggregation: using a SPAND-like server
  – Server-side aggregation: clusters
  – But not all clients have the same server, and vice versa
• Service infrastructure to address this
  – Several service clusters on the Internet

Page 15

Architecture

[Figure: service clusters deployed across the Internet; keep-alive streams run between peered clusters; a source and a client connect through the overlay]

Service cluster: a compute cluster capable of running services

Overlay topology: nodes = service clusters; links = monitoring channels between clusters

A source-to-client session is routed via monitored paths, and could go through an intermediate service; local recovery uses a backup path
Page 16

Architecture

Monitoring within a cluster for process/machine failure

Monitoring across clusters for network path failure

Peering of service clusters: to serve as backups for one another, or to monitor the path between them

Page 17

Architecture: Advantages

• 2-level monitoring
  – Process/machine failures still handled within the cluster
  – Common failure cases do not require wide-area mechanisms
• Aggregation of monitoring across clusters: efficiency
• Model works for cascaded services as well

[Figure: cascaded services; Source → S1 → S2 → Client]

Page 18

Architecture: Potential Criticism

• Potential criticism:
  – Does not handle resource reservation
• Response:
  – Related issue, but could be orthogonal
  – Aggregated/hierarchical reservation schemes exist (e.g., the Clearing House)
  – Even if reservation is solved (or is not needed), we still need to address failures

Page 19

Architecture: Issues (1) Routing

[Figure: overlay routing from a source to a client across the Internet]

Given the overlay topology, we need a routing algorithm to go from source to destination via the service(s)

Also needed: (a) WA-SDS, (b) finding the closest service cluster to a given host

Page 20

Architecture: Issues (2) Topology

How many service clusters? (nodes)

How many monitored paths? (links)

Page 21

Ideas on Routing

• BGP
  – Border router
  – Peering session
    • Heartbeats
  – IP route
    • Destination-based
• Service infrastructure
  – Service cluster
  – Peering session
    • Heartbeats
  – Route across clusters
    • Based on the destination and intermediate service(s)

Similarities with BGP

Page 22

Ideas on Routing

• How is routing in the overlay topology different from BGP?
• Overlay topology
  – More freedom than the physical topology
  – Constraints on the graph can be imposed more easily
  – For example, can have local recovery

[Figure: local recovery example; overlay node S (Berkeley) with a nearby backup S' (San Francisco)]

Page 23

Ideas on Routing

[Figure: BGP routers advertising address prefixes AP1, AP2, AP3; a source and a client at the edges]

BGP exchanges a lot of information, and it grows with the number of APs: O(100,000)

Service clusters need to exchange very little information

The problems of (a) service discovery and (b) knowing the nearest service cluster are decoupled from routing

Page 24

Ideas on Routing

• BGP routers do not have per-session state
• But service clusters can maintain per-session state
  – Local recovery can be pre-determined for each session
• Finally, we probably don’t need as many nodes as there are BGP routers
  – Can have a more aggressive routing algorithm

It is feasible to make fail-over in the overlay topology quicker than Internet route recovery with BGP
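The per-session state described above can be sketched as a table at each cluster mapping every session to its primary next-hop cluster and a pre-determined backup, so a detected failure is resolved locally with no wide-area recomputation. All names here are hypothetical illustrations:

```python
# Per-session state at one service cluster: primary next-hop plus a
# pre-determined backup, enabling immediate local fail-over.

sessions = {
    "sess-42": {"next_hop": "cluster-SF", "backup": "cluster-LA"},
    "sess-43": {"next_hop": "cluster-NY", "backup": "cluster-DC"},
}

def fail_over(sessions, dead_cluster):
    """Switch every session whose primary next-hop failed to its backup."""
    moved = []
    for sid, s in sessions.items():
        if s["next_hop"] == dead_cluster:
            s["next_hop"], s["backup"] = s["backup"], None
            moved.append(sid)
    return moved

print(fail_over(sessions, "cluster-SF"))  # → ['sess-42']
```

Because the backup is chosen ahead of time, the fail-over cost is a table lookup plus one message to the replica, rather than a routing-protocol convergence.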

Page 25

Ideas on Topology

• Decision criteria
  – Additional latency for the client session
    • Sparse topology ⇒ more latency
  – Monitoring overhead
    • Many monitored paths ⇒ more overhead, but the additional flexibility might reduce end-to-end latency

Page 26

Ideas on Topology

• Number of nodes:
  – Claim: the upper bound is one per AS
  – This will ensure that there is a service cluster “close” to every client
  – The topology will be close to the Internet backbone topology
• Number of monitoring channels:
  – # ASes: ~7000 as of 1999
  – Monitoring overhead: ~100 bytes/sec per channel
    • Can easily have ~1000 peering sessions per service cluster
  – A dense overlay topology is possible (to minimize additional latency)
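The arithmetic behind the "1000 peering sessions easily" claim is a quick back-of-the-envelope check, using the numbers from the slide:

```python
# Monitoring cost per cluster: keep-alive overhead per channel times the
# number of peering sessions. Numbers are the slide's estimates.

bytes_per_channel = 100      # ~100 bytes/sec of keep-alive traffic per channel
peers_per_cluster = 1000     # peering sessions per service cluster

per_cluster_bytes_per_sec = bytes_per_channel * peers_per_cluster
print(per_cluster_bytes_per_sec)  # → 100000, i.e. ~100 KB/s per cluster
```

At roughly 100 KB/s per cluster, even a dense overlay over the ~7000 ASes keeps monitoring traffic negligible compared to the media streams being carried.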

Page 27

Implementation + Performance

[Figure: a service cluster with a manager node; the manager exchanges session information and monitoring heartbeats with a peer cluster]

Page 28

Implementation + Performance

• PCM-to-GSM codec service
  – Overhead of a “hot” backup service: 1 process (idle)
  – Service re-instantiation time: ~100 ms
• End-to-end recovery over the wide area has three components:
  – Detection time: O(2 sec)
  – Communication with the replica: one RTT, O(100 ms)
  – Replica activation: ~100 ms
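Summing the three components gives the expected end-to-end recovery time; a trivial check in milliseconds (the figures are the slide's order-of-magnitude estimates, not measurements of mine):

```python
# End-to-end recovery time = detection + replica RTT + replica activation.

detection_ms = 2000      # O(2 sec) failure detection
replica_rtt_ms = 100     # one RTT to the replica, O(100 ms)
activation_ms = 100      # replica activation, ~100 ms

total_ms = detection_ms + replica_rtt_ms + activation_ms
print(total_ms)  # → 2200
```

Detection dominates: shaving the RTT or activation time changes little, so the timeout choice from the monitoring study is the main lever on recovery latency.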

Page 29

Implementation + Performance

• Overhead of monitoring
  – Keep-alive heartbeat
  – One per 300 ms in our implementation
  – O(100 bytes/sec)
• Overhead of a false positive in failure detection
  – Session transfer
  – One message exchange across adjacent clusters
    • A few hundred bytes

Page 30

Summary

• Real-time applications with long-lived sessions
  – No support exists for path-failure recovery (unlike, say, the PSTN)
• Service infrastructure to provide this support
  – Wide-area monitoring for path liveness:
    • O(2 sec) failure detection with a low rate of false positives
  – Peering and replication model for quick fail-over
• Interesting issues:
  – Nature of the overlay topology
  – Algorithms for routing and fail-over

Page 31

Questions

• What are the factors in deciding the topology?
• Strategies for handling failure near the client/source
  – Currently deployed mechanisms for RIP/OSPF and ATM/MPLS failure recovery
  – What is the time to recover?
• Applications: video/audio streaming
  – More?
  – Games? Proxies for games on hand-held devices?

http://www.cs.berkeley.edu/~bhaskar

(Presentation running under VMware on Linux)