
Rocky K. C. Chang, Edmond Chan, Waiting Fok, and Weichao Li

The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

APRICOT 2010


• Problem statement
• Measurement system
• Measurement methodology
• Interesting findings
• Conclusions


Source: http://www.jucc.edu.hk/jucc/harnet.html

• Wide area network linking up eight tertiary institutions in HK
• Managed by the Joint Universities Computer Centre (JUCC)
  − Coordinate IT service of mutual interest
• Provide high-speed optical backbone network and Internet connectivity
  − Bulk tendering and selection of Internet service provider (PCCW → Wharf)


• Collect reliable performance data for operation and planning purposes.
  − Justifications for service upgrade
• Evaluate the fairness of resource sharing among the eight institutions.
  − Achieve some kind of “fairness”.
• Improve the quality of network services.
  − Less optimal routes
  − Fault locations







• Operating since 1 Jan 2009
• Measurement side
  − OneProbe: provide around-the-clock path-quality monitoring
  − Planetopus: a measurement management platform
• User side
  − Web-based report on measurement results
  − Ad hoc performance diagnosis


[Figure: system architecture. Measurement side: OneProbe instances at HKU, CUHK, CityU, PolyU, BU, HKUST, HKIED, and LU probe 40+ web servers selected by the JUCC. User side: Planetopus, database, etc.]






• Continuous monitoring
• Configurable sampling rate and pattern
• Low overhead
• User-chosen websites
• TCP data-path measurement
• Middlebox friendly
• Multi-metric measurement
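A configurable sampling rate and pattern could be sketched as follows. The function name `probe_times`, the two pattern names, and the parameters are hypothetical and are not taken from OneProbe; the sketch only contrasts a periodic schedule with a Poisson (exponentially spaced) schedule at the same mean rate.

```python
import random
from typing import List, Optional


def probe_times(rate_hz: float, duration_s: float, pattern: str = "periodic",
                seed: Optional[int] = None) -> List[float]:
    """Return probe send times (seconds from start) within [0, duration_s).

    Hypothetical helper, not OneProbe's scheduler.
    pattern="periodic": fixed gap of 1/rate_hz seconds.
    pattern="poisson":  exponentially distributed gaps with mean 1/rate_hz.
    """
    if rate_hz <= 0 or duration_s <= 0:
        raise ValueError("rate_hz and duration_s must be positive")
    if pattern not in ("periodic", "poisson"):
        raise ValueError(f"unknown pattern: {pattern}")
    rng = random.Random(seed)  # seeded for reproducible schedules
    times: List[float] = []
    t = 0.0
    while t < duration_s:
        times.append(t)
        if pattern == "periodic":
            t += 1.0 / rate_hz
        else:
            t += rng.expovariate(rate_hz)
    return times


periodic = probe_times(2.0, 5.0, "periodic")
poisson = probe_times(2.0, 5.0, "poisson", seed=1)
print(len(periodic), len(poisson))
```

Both schedules send about the same number of probes on average, so the overhead is comparable; the Poisson pattern avoids locking onto periodic behaviour in the network.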

[Figure: metrics measured by OneProbe: RTT, RTT jitter, forward loss, reverse loss, forward re-ordering, reverse re-ordering, round-trip capacity.]
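As an illustration of multi-metric measurement, the sketch below summarizes a batch of probe outcomes into a few of the metrics named above. The record layout (`ProbeResult`) and the function `summarize` are hypothetical, not OneProbe's data format, and jitter is assumed here to mean the average absolute difference between consecutive RTT samples, which is only one of several definitions in use.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class ProbeResult:
    """One probe outcome (hypothetical record layout, not OneProbe's format)."""
    rtt_ms: Optional[float]  # None when no RTT sample was obtained
    fwd_lost: bool           # probe packet lost on the forward path
    rev_lost: bool           # response packet lost on the reverse path


def summarize(results: List[ProbeResult]) -> dict:
    """Aggregate a batch of probe results into path-quality metrics."""
    if not results:
        raise ValueError("no probe results")
    rtts = [r.rtt_ms for r in results if r.rtt_ms is not None]
    # Assumed jitter definition: mean absolute difference of consecutive RTTs.
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return {
        "rtt_mean_ms": mean(rtts) if rtts else None,
        "rtt_jitter_ms": mean(diffs) if diffs else None,
        "fwd_loss_rate": sum(r.fwd_lost for r in results) / len(results),
        "rev_loss_rate": sum(r.rev_lost for r in results) / len(results),
    }


# Illustrative values only, not measured data.
batch = [
    ProbeResult(40.0, False, False),
    ProbeResult(50.0, False, True),
    ProbeResult(None, True, False),
    ProbeResult(45.0, False, False),
]
print(summarize(batch))
```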

• Deploying measurement tasks
• Monitoring resource usage
• Secure measurement data collection
• Measurement data management







• Strong and diurnal correlation between RTT and reverse-path packet loss
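One way to quantify such a correlation is the Pearson coefficient between per-interval RTT and reverse-path loss rate. The sketch below is an assumption about how this could be computed, not the authors' analysis; the function name `pearson` and the sample values are illustrative only.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Illustrative hourly averages (made-up values, not measured data).
rtt_ms = [180.0, 185.0, 240.0, 320.0, 310.0, 200.0]
rev_loss = [0.00, 0.01, 0.05, 0.12, 0.10, 0.01]
print(f"r = {pearson(rtt_ms, rev_loss):.2f}")
```

A coefficient near 1 would correspond to the strong correlation described here, and a value near 0 to no linear correlation.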


• No correlation between RTT and reverse-path loss


• Good effect of a forward-route change

• The three fault events according to public information:
  − 9 Aug 1:37am (HKT) and 12 hours later: EAC
  − 12 Aug 10:50am (HKT): APCN2
  − 17 Aug 2:20pm (HKT): FNAL/RNAL


| Path | 9 Aug | 12 Aug | 17 Aug |
|---|---|---|---|
| Australasia - NLA | Diurnal RTT burst: 1200ms, up to 12 Aug; Loss burst: 50%, 8 hrs | X | X |
| Japan - Nissan | X | Rv Loss: 30%, 17 hrs | RTT burst: 1800ms, 7 hrs |
| Taiwan - TANET | RTT increase; Fw loss increase | RTT increase: 60ms; Diurnal Rv loss: 10~50%, 22 hrs | Diurnal Rv loss burst: 10~50%, 17+ hrs |
| US - Citibank | X | X | RTT burst: 1800ms, 7 hrs; Rv Loss: 30%, 13 hrs |
| Finland - Nokia | X | X | Connectivity lost: 12 hrs; Rv Loss: 50%, 1.5 days |
| Korea - KREONET | X | X | RTT increase to 400ms |
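The durations in the table can be checked against the start and end times quoted on the per-path slides. The sketch below does that arithmetic; the helper `hours_between` is hypothetical, and the year 2009 is an assumption (the slides give only day and month), which does not affect the result.

```python
from datetime import datetime

YEAR = 2009  # assumed; the slides give only day and month


def hours_between(start, end) -> float:
    """Hours between two (month, day, hour) tuples in the same time zone."""
    s = datetime(YEAR, *start)
    e = datetime(YEAR, *end)
    return (e - s).total_seconds() / 3600.0


# Intervals as quoted on the per-path slides (all times HKT).
intervals = {
    "Rv loss, 12 Aug 10am to 13 Aug 3am": ((8, 12, 10), (8, 13, 3)),
    "RTT burst, 17 Aug 2pm to 9pm": ((8, 17, 14), (8, 17, 21)),
    "Rv loss, 17 Aug 2pm to 18 Aug 3am": ((8, 17, 14), (8, 18, 3)),
    "Connectivity lost, 17 Aug 2pm to 18 Aug 2am": ((8, 17, 14), (8, 18, 2)),
}
for label, (start, end) in intervals.items():
    print(f"{label}: {hours_between(start, end):.0f} hrs")
```

The four results are 17, 7, 13, and 12 hours, matching the corresponding table entries.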

• Affected by the 9 Aug fault:
  − RTT peaks of 1200ms up to 12 Aug
  − Loss bursts of 50%+ from 2pm to 10pm on 9 Aug
• PCCW → Pacnet → TransactSDN(AU) → NLA
[Plot marker: 9 Aug 13:37 (HKT)]

• Affected by the 12 & 17 Aug faults:
  − Burst of Rv Loss (30%) from 12 Aug 10am to 13 Aug 3am
  − RTT burst of 1800ms on 17 Aug, 2-9pm
• PCCW → Equinix → NTT(US/JP) → OCN(JP)
[Plot markers: 12 Aug 10:50 (HKT), 17 Aug 14:20 (HKT)]

• Affected by the 12 & 17 Aug faults:
  − RTT increased by 60ms since 12 Aug 15:00
  − Diurnal Rv Loss (10~50%) for 22 hrs since 12 Aug 16:20 and for 17+ hrs since 17 Aug 21:40
• HKIX → ChungHwaTel → TANET
[Plot markers: 12 Aug 10:50 (HKT), 17 Aug 14:20 (HKT)]


• RTT burst of 1800ms
• Reverse-path loss up to 40%
• From 17 Aug 2pm to 18 Aug 3am
• PCCW → BNA → AT&T
[Plot marker: 17 Aug 14:20 (HKT)]


• Connectivity lost (OneProbe, TCPTraceroute)
  − From 17 Aug 2pm to 18 Aug 2am
• Rv Loss burst up to 50% until 20 Aug 4pm
• PCCW → BNA → GBLX(US) → Nokia(Finland)
[Plot markers: 17 Aug 14:20 (HKT); connection lost]

• Affected by the 17 Aug fault
  − RTT increased from 40ms to 400ms since 17 Aug 14:20
  − RTT burst of 400ms around 12 Aug 22:00 to 22:30
• HARNET → ASGC (TW) → KREONET
[Plot markers: 12 Aug 10:50 (HKT), 17 Aug 14:20 (HKT)]
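A sustained RTT increase such as the jump from 40ms to 400ms can be located automatically by comparing the mean RTT in adjacent windows. The sketch below is an illustrative assumption, not the method used in the talk; the function name `find_level_shift`, the window size, and the threshold are hypothetical, and only upward shifts are detected.

```python
from statistics import mean
from typing import List, Optional


def find_level_shift(rtts_ms: List[float], window: int = 5,
                     min_jump_ms: float = 100.0) -> Optional[int]:
    """Return the index where the mean RTT rises the most between the
    `window` samples before it and the `window` samples from it onward,
    or None if no rise reaches min_jump_ms."""
    best_index: Optional[int] = None
    best_jump = 0.0
    for i in range(window, len(rtts_ms) - window + 1):
        before = mean(rtts_ms[i - window:i])
        after = mean(rtts_ms[i:i + window])
        jump = after - before
        if jump >= min_jump_ms and jump > best_jump:
            best_index, best_jump = i, jump
    return best_index


# Illustrative series: ten samples at 40 ms followed by ten at 400 ms.
series = [40.0] * 10 + [400.0] * 10
print(find_level_shift(series))
```

Window means are sensitive to isolated spikes, so a short burst such as the one on 12 Aug could also trigger this check; a larger window or a median-based variant trades precision in locating the shift for robustness.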

• Deploying and managing a distributed measurement system is very challenging.
  − A reliable, non-cooperative measurement method
  − A measurement management platform
• But such a system, if deployed and managed correctly, is very useful.
  − Contrasting measurement results gives more information for performance and fault diagnosis
  − Currently monitoring the impact of switching to a new provider
