Rocky K. C. Chang, Edmond Chan, Waiting Fok, and Weichao Li
The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
APRICOT 2010

• Problem statement
• Measurement system
• Measurement methodology
• Interesting findings
• Conclusions

• Wide area network linking up eight tertiary institutions in HK
• Managed by the Joint Universities Computer Centre (JUCC)
  − Coordinates IT services of mutual interest
• Provides a high-speed optical backbone network and Internet connectivity
  − Bulk tendering and selection of Internet service provider (PCCW → Wharf)

• Collect reliable performance data for operation and planning purposes.
  − Justifications for service upgrade
• Evaluate the fairness of resource sharing among the eight institutions.
  − Achieve some kind of “fairness”.
• Improve the quality of network services.
  − Less optimal routes
  − Fault locations

• Operating since 1 Jan 2009
• Measurement side
  − OneProbe: provides around-the-clock path-quality monitoring
  − Planetopus: a measurement management platform
• User side
  − Web-based report on measurement results
  − Ad hoc performance diagnosis

[Figure: system architecture. Measurement side: OneProbe@HKU, OneProbe@CUHK, OneProbe@CityU, OneProbe@PolyU, OneProbe@BU, OneProbe@HKUST, OneProbe@HKIED, and OneProbe@LU; 40+ web servers selected by the JUCC. User side: Planetopus, database, etc.; HKU, CUHK, PolyU, CityU, BU, HKUST, LU, HKIED.]

• Continuous monitoring
• Configurable sampling rate and pattern
• Low overhead
• User-chosen websites
• TCP data-path measurement
• Middlebox friendly
• Multi-metric measurement

[Figure: OneProbe metrics: RTT, RTT jitter, forward loss, reverse loss, forward re-ordering, reverse re-ordering, round-trip capacity.]

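To illustrate how several of the metrics above could be summarized from one stream of probe results, here is a minimal sketch. It is not OneProbe's implementation: the `ProbeResult` record, its fields, and `summarize` are hypothetical names, and jitter is assumed here to mean the mean absolute difference between consecutive RTT samples, which is only one possible definition.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class ProbeResult:
    """One probe round (hypothetical record, not OneProbe's own format)."""
    rtt_ms: Optional[float]  # None when no RTT sample could be taken
    fwd_lost: bool           # probe lost on the forward path
    rev_lost: bool           # response lost on the reverse path


def summarize(results):
    """Return summary metrics for a non-empty list of ProbeResult."""
    if not results:
        raise ValueError("no probe results")
    n = len(results)
    rtts = [r.rtt_ms for r in results if r.rtt_ms is not None]
    # Assumed jitter definition: mean absolute difference of consecutive RTTs.
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return {
        "median_rtt_ms": median(rtts) if rtts else None,
        "rtt_jitter_ms": mean(diffs) if diffs else None,
        "fwd_loss_rate": sum(r.fwd_lost for r in results) / n,
        "rev_loss_rate": sum(r.rev_lost for r in results) / n,
    }


# Made-up samples for illustration only.
samples = [
    ProbeResult(40.0, False, False),
    ProbeResult(None, False, True),
    ProbeResult(None, True, False),
    ProbeResult(46.0, False, False),
    ProbeResult(42.0, False, False),
]
summary = summarize(samples)
print(summary)
```

Keeping forward and reverse loss as separate counters is what allows the per-direction findings reported later in the talk.
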
• Deploying measurement tasks
• Monitoring the resources usage
• Secure measurement data collection
• Measurement data management

• Strong and diurnal correlation between RTT and reverse-path packet loss

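A correlation like this can be quantified with the Pearson coefficient between the two time series. The sketch below is only an illustration: the talk does not say which statistic was used, `pearson` is a helper written for this example, and the sample values are made up rather than taken from the measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(vx * vy)


# Hypothetical samples over part of a day (illustrative values only).
rtt_ms = [200, 210, 400, 800, 1200, 900, 300, 220]
rev_loss_pct = [0, 1, 10, 30, 50, 35, 5, 1]

r = pearson(rtt_ms, rev_loss_pct)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 means the two series rise and fall together; a diurnal pattern would additionally show up as the same shape repeating day after day.
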
• The three fault events according to public information:
  • 9 Aug 1:37am (HKT) and 12 hours later
    − EAC
  • 12 Aug 10:50am (HKT)
    − APCN2
  • 17 Aug 2:20pm (HKT)
    − FNAL/RNAL

Path | 9 Aug | 12 Aug | 17 Aug
Australasia - NLA | Diurnal RTT burst: 1200ms, up to 12 Aug; loss burst: 50%, 8 hrs | X | X
Japan - Nissan | X | Rv loss: 30%, 17 hrs | RTT burst: 1800ms, 7 hrs
Taiwan - TANET | RTT increase; Fw loss increase | RTT increase: 60ms; diurnal Rv loss: 10~50%, 22 hrs | Diurnal Rv loss burst: 10~50%, 17+ hrs
US - Citibank | X | X | RTT burst: 1800ms, 7 hrs; Rv loss: 30%, 13 hrs
Finland - Nokia | X | X | Connectivity lost: 12 hrs; Rv loss: 50%, 1.5 days
Korea - KREONET | X | X | RTT increase to 400ms

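The durations in the table can be cross-checked against the time windows quoted on the per-path slides. The sketch below does that arithmetic; the helper names `hkt` and `hours` are introduced for this example, and the year is an assumption (the slides give only day and month), which does not affect the results.

```python
from datetime import datetime

YEAR = 2009  # assumption: the slides give day and month only


def hkt(day, hour, minute=0):
    """Build a naive timestamp in August of YEAR, read as HKT."""
    return datetime(YEAR, 8, day, hour, minute)


def hours(start, end):
    """Length of the interval from start to end, in hours."""
    return (end - start).total_seconds() / 3600


# Time windows as quoted on the per-path slides.
windows = {
    "loss burst, 9 Aug 2pm-10pm": (hkt(9, 14), hkt(9, 22)),
    "Rv loss burst, 12 Aug 10am to 13 Aug 3am": (hkt(12, 10), hkt(13, 3)),
    "RTT burst, 17 Aug 2-9pm": (hkt(17, 14), hkt(17, 21)),
    "Rv loss, 17 Aug 2pm to 18 Aug 3am": (hkt(17, 14), hkt(18, 3)),
    "connectivity lost, 17 Aug 2pm to 18 Aug 2am": (hkt(17, 14), hkt(18, 2)),
}

durations = {label: hours(s, e) for label, (s, e) in windows.items()}
for label, h in durations.items():
    print(f"{label}: {h:g} hrs")
```

The five windows come out to 8, 17, 7, 13, and 12 hours, matching the durations listed in the table.
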
• Affected by the 9 Aug fault:
  − RTT peaks of 1200ms up to 12 Aug
  − 50%+ burst of losses at 2pm-10pm on 9 Aug
• PCCW → Pacnet → TransactSDN(AU) → NLA
[Figure annotation: 9 Aug 13:37 (HKT)]

• Affected by the 12 & 17 Aug faults:
  − Burst of Rv loss (30%) from 12 Aug 10am to 13 Aug 3am
  − RTT burst of 1800ms on 17 Aug 2-9pm
• PCCW → Equinix → NTT(US/JP) → OCN(JP)
[Figure annotations: 12 Aug 10:50 (HKT); 17 Aug 14:20 (HKT)]

• Affected by the 12 & 17 Aug faults:
  − RTT increased by 60ms since 12 Aug 15:00
  − Diurnal Rv loss (10~50%) for 22 hrs since 12 Aug 16:20 and for 17+ hrs since 17 Aug 21:40
• HKIX → ChungHwaTel → TANET
[Figure annotations: 12 Aug 10:50 (HKT); 17 Aug 14:20 (HKT)]

• Affected by the 17 Aug fault:
  − RTT burst of 1800ms
  − Reverse-path loss up to 40%
  − From 17 Aug 2pm to 18 Aug 3am
• PCCW → BNA → AT&T
[Figure annotation: 17 Aug 14:20 (HKT)]

• Affected by the 17 Aug fault:
  − Connectivity lost (OneProbe, TCPTraceroute)
    − From 17 Aug 2pm to 18 Aug 2am
  − Rv loss burst up to 50% until 20 Aug 4pm
• PCCW → BNA → GBLX(US) → Nokia(Finland)
[Figure annotations: 17 Aug 14:20 (HKT); connection lost]

• Affected by the 17 Aug fault:
  − RTT increased from 40ms to 400ms since 17 Aug 14:20
  − RTT burst of 400ms around 12 Aug 22:00 to 22:30
• HARNET → ASGC (TW) → KREONET
[Figure annotations: 12 Aug 10:50 (HKT); 17 Aug 14:20 (HKT)]

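A sustained step in RTT, such as a jump from 40ms to 400ms, can be told apart from a short burst by requiring every sample in a window after the candidate point to sit well above the earlier baseline. The following is a minimal sketch of that idea, not the method used in this work; `find_level_shift`, the window size, the threshold, and the sample values are all hypothetical.

```python
from statistics import median


def find_level_shift(rtts_ms, window=3, min_jump_ms=100.0):
    """Return the first index i where all of the next `window` samples
    exceed the median of the previous `window` samples by at least
    `min_jump_ms`, or None if there is no such index."""
    for i in range(window, len(rtts_ms) - window + 1):
        baseline = median(rtts_ms[i - window:i])
        lowest_after = min(rtts_ms[i:i + window])
        if lowest_after - baseline >= min_jump_ms:
            return i
    return None


# Illustrative series: a sustained step versus a single spike.
step = [40, 41, 39, 40, 42, 400, 398, 405, 401]
spike = [40, 40, 40, 400, 40, 40, 40]

shift_at = find_level_shift(step)
spike_at = find_level_shift(spike)
print(shift_at, spike_at)
```

Using the minimum of the following window, rather than its median, keeps a short burst from being reported as a lasting increase.
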
• Deploying and managing a distributed measurement system is very challenging.
  − A reliable, non-cooperative measurement method
  − A measurement management platform
• But such a system, if deployed and managed correctly, is very useful.
  − More information obtained from contrasting for performance and fault diagnosis
  − Currently monitoring the impact of switching to a new provider