SLA Monitoring System (SLAM) Gustavo Lozano | ICANN DNS Symposium | 13 May 2017
SLA Monitoring System (SLAM) - Agenda
1. Contractual Provisions
2. SLAM
3. MoSAPI
4. Statistics
Contractual Provisions
Why is ICANN monitoring gTLDs?
• Specification 10 of the new gTLD Registry Agreement specifies the Service Level Requirements for Registry Operators.
• ICANN developed a monitoring system called SLAM (Service Level Agreement Monitoring) as a tool to measure compliance with these requirements.
What are the Service Level Requirements?
Service  Parameter                      SLR (monthly basis)
DNS      DNS service availability       0 min downtime = 100% availability
DNS      DNS name server availability   ≤ 432 min of downtime (≈ 99%)
DNS      TCP DNS resolution RTT         ≤ 1500 ms, for at least 95% of queries
DNS      UDP DNS resolution RTT         ≤ 500 ms, for at least 95% of queries
DNS      DNS update time                ≤ 60 min, for at least 95% of probes
RDDS     RDDS availability              ≤ 864 min of downtime (≈ 98%)
RDDS     RDDS query RTT                 ≤ 2000 ms, for at least 95% of queries
RDDS     RDDS update time               ≤ 60 min, for at least 95% of probes
EPP      EPP service availability       ≤ 864 min of downtime (≈ 98%)
EPP      EPP session-command RTT        ≤ 4000 ms, for at least 95% of commands
EPP      EPP query-command RTT          ≤ 2000 ms, for at least 95% of commands
EPP      EPP transform-command RTT      ≤ 4000 ms, for at least 95% of commands
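The approximate percentages next to the downtime budgets can be checked with simple arithmetic. The sketch below assumes a 30-day month (43,200 minutes); that period length is an assumption made here because it reproduces the rounded figures on the slide, and the function name is illustrative only.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day month = 43,200 minutes


def availability_percent(downtime_minutes: float,
                         period_minutes: int = MINUTES_PER_MONTH) -> float:
    """Convert a downtime budget in minutes into an availability percentage."""
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must lie between 0 and the period length")
    return 100.0 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    # Downtime budgets taken from the slide.
    for label, budget in [("DNS service", 0),
                          ("DNS name server", 432),
                          ("RDDS / EPP", 864)]:
        print(f"{label}: {availability_percent(budget):.1f}% availability")
```

Under this assumption, 432 minutes is 1% of the month and 864 minutes is 2%, which matches the ≈ 99% and ≈ 98% shown on the slide.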
What are the Emergency Thresholds?
• ICANN can designate an interim EBERO (Emergency Back-End Registry Operator) to take over the operation of a gTLD when an emergency threshold is reached.
• SLAM system alerts and Compliance notices are sent to Registry Operators when certain percentages of the specified Emergency Thresholds are met.
What are the Emergency Thresholds?
Critical Function                 Emergency Threshold
DNS Service (all servers)         4-hour total downtime / week
DNSSEC proper resolution          4-hour total downtime / week
EPP                               24-hour total downtime / week
RDDS (WHOIS / Web-based WHOIS)    24-hour total downtime / week
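To illustrate how weekly downtime relates to these thresholds, here is a minimal sketch that expresses accumulated downtime as a percentage of the emergency threshold. The slides only say that alerts go out at "certain percentages"; the alert levels passed in below are hypothetical values chosen for the example, and the function and dictionary names are not from the source.

```python
# Emergency thresholds from the slide, in hours of total downtime per week.
EMERGENCY_THRESHOLD_HOURS = {"DNS": 4, "DNSSEC": 4, "EPP": 24, "RDDS": 24}


def threshold_percent(function: str, downtime_minutes_this_week: float) -> float:
    """Return weekly downtime as a percentage of the emergency threshold."""
    limit_minutes = EMERGENCY_THRESHOLD_HOURS[function] * 60
    return 100.0 * downtime_minutes_this_week / limit_minutes


def levels_reached(percent: float, alert_levels: list[float]) -> list[float]:
    """Return the alert levels (in percent) that have been met or exceeded."""
    return [level for level in sorted(alert_levels) if percent >= level]


if __name__ == "__main__":
    hypothetical_levels = [10, 25, 50, 75, 100]  # illustrative only, not from the slides
    pct = threshold_percent("DNS", 60)  # one hour of DNS downtime this week
    print(pct, levels_reached(pct, hypothetical_levels))
```

How the week is delimited (calendar or rolling) is not stated on the slides, so the caller supplies the weekly total.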
SLAM
What is the SLAM?
• Zabbix as the primary monitoring platform. Custom plugins and code to support ICANN monitoring were developed by Zabbix.
• Probe node network
– Consists of 40 probe nodes covering all ICANN regions.
• A Network Operations Center operating 24/7
• ICANN staff is on call 24/7
Design principles of the system
• Avoid false positives
• Avoid false positives
• Avoid false positives
• Reach the affected Registry Operator as soon as possible
• Provide general guidance regarding the potential issue
How does it work?
[Diagram: three Probe Nodes, a Data Processor, and the Registry (Ry)]
DNS test
• One non-recursive DNS query is sent every minute from all probe nodes
– Query is sent to every (NS, IP address) pair
– Query is for the FQDN zz--icann-monitoring.<TLD>
• If DNSSEC is offered, NSEC/NSEC3 and the signatures are verified.
• The chain of trust is validated against the root zone KSK.
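As an illustration of the query described above, the following sketch builds a non-recursive query (RD flag cleared) for the monitoring name in DNS wire format, using only the standard library, and enumerates the (NS, IP address) pairs to probe. It is not the SLAM implementation: the query type (A) is an assumption because the slides do not state it, and all function names and the example addresses are hypothetical.

```python
import struct


def monitoring_fqdn(tld: str) -> str:
    """Return the test name from the slide for a TLD, as an absolute FQDN."""
    return f"zz--icann-monitoring.{tld.strip('.').lower()}."


def encode_qname(fqdn: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = bytearray()
    for label in fqdn.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= 63:
            raise ValueError(f"invalid label length: {label!r}")
        out.append(len(raw))
        out += raw
    out.append(0)
    return bytes(out)


def build_query(tld: str, query_id: int, qtype: int = 1) -> bytes:
    """Build a DNS query with RD=0. qtype=1 (A) is an assumption."""
    flags = 0x0000  # QR=0 (query), OPCODE=0, RD=0 (non-recursive)
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    question = encode_qname(monitoring_fqdn(tld)) + struct.pack("!HH", qtype, 1)
    return header + question


def targets(name_servers: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Expand {NS name: [IP, ...]} into the (NS, IP) pairs to query."""
    return [(ns, ip) for ns, ips in name_servers.items() for ip in ips]


if __name__ == "__main__":
    query = build_query("example", query_id=0x1234)
    # Documentation addresses, used purely as placeholders.
    for ns, ip in targets({"a.nic.example": ["192.0.2.1", "2001:db8::1"]}):
        print(f"would send {len(query)} bytes to {ns} at {ip}")
```

A DNSSEC-aware probe would also need an EDNS0 OPT record with the DO bit set; that is left out to keep the sketch short.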
DNS test
• Examples of failure criteria
– No reply
– Invalid reply (e.g., RCODE SERVFAIL)
– Malformed or invalid responses
– Broken chain of trust
– NSEC and NSEC3 errors
RDDS test
• One Whois (port 43) transaction is initiated every 5 minutes from all probe nodes.
• One HTTP (web-Whois) connection test every 5 minutes. The system will follow HTTP redirects.
RDDS test
• Examples of failure criteria
– DNS/DNSSEC failures when resolving whois.nic.<TLD>
– Malformed or invalid Whois (port 43) responses
– HTTP 500 error code in the case of web-Whois
Recursive DNS servers
• Recursive DNS servers are used during the tests (e.g., resolving whois.nic.<TLD>)
• DNSSEC is enabled in the recursive DNS servers
• If DNSSEC is failing when resolving whois.nic.<TLD>, the RDDS is considered to be failing
• The maximum TTL allowed in the cache and negative cache is 15 minutes
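The effect of the 15-minute cap can be shown in a few lines: whatever TTL a record (or negative answer) carries, the resolver keeps it for at most 900 seconds, so a fix on the registry side becomes visible to the probes within that window. This is only an illustration of the rule; the slides do not say which resolver software is used or how the cap is configured, and the names below are hypothetical.

```python
MAX_TTL_SECONDS = 15 * 60  # cap from the slide: 15 minutes, cache and negative cache


def capped_ttl(record_ttl_seconds: int) -> int:
    """Return how long an answer may stay cached under the 15-minute cap."""
    if record_ttl_seconds < 0:
        raise ValueError("TTL cannot be negative")
    return min(record_ttl_seconds, MAX_TTL_SECONDS)


if __name__ == "__main__":
    for ttl in (60, 900, 86400):
        print(f"record TTL {ttl:>5}s -> cached for {capped_ttl(ttl)}s")
```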
What happens when a failure is detected?
The SLA system continuously monitors all gTLDs (Ry). An issue is passed to the alerting machine when the following conditions are met.
DNS issues
• Three consecutive failing cycles
• 51% or more of the probe nodes detected the issue
• At least 20 probe nodes are online
RDDS issues
• Two consecutive failing cycles
• 51% or more of the probe nodes detected the issue
• At least 10 probe nodes are online
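The conditions above can be sketched as a small decision rule. How the three conditions combine is an interpretation made here, not something the slides spell out: a cycle counts as failing only if enough probe nodes are online and at least 51% of them saw the issue, and an alert fires once the required number of most recent cycles have all failed. The pairing of the thresholds with DNS and RDDS follows the reading of the diagram given above, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    consecutive_cycles: int   # failing cycles required in a row
    min_online_probes: int    # probe nodes that must be online
    failing_share: float = 0.51  # share of online probes that must see the issue


RULES = {
    "dns": AlertRule(consecutive_cycles=3, min_online_probes=20),
    "rdds": AlertRule(consecutive_cycles=2, min_online_probes=10),
}


def cycle_failed(rule: AlertRule, online: int, failing: int) -> bool:
    """A cycle fails if enough probes are online and 51% or more report the issue."""
    if online < rule.min_online_probes:
        return False
    return failing / online >= rule.failing_share


def should_alert(service: str, cycles: list[tuple[int, int]]) -> bool:
    """cycles is a chronological list of (online_probes, failing_probes)."""
    rule = RULES[service]
    recent = cycles[-rule.consecutive_cycles:]
    if len(recent) < rule.consecutive_cycles:
        return False
    return all(cycle_failed(rule, online, failing) for online, failing in recent)


if __name__ == "__main__":
    print(should_alert("dns", [(40, 21), (40, 25), (40, 30)]))
    print(should_alert("rdds", [(9, 9), (9, 9)]))
```

Requiring a minimum number of online probes and several consecutive cycles is consistent with the stated design principle of avoiding false positives.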
What happens when a failure is detected? – cont.
The alerting machine:
• Calls the Ry's Emergency Contacts
• Contacts ICANN Contractual Compliance
• Contacts ICANN IT, if the SLAM is failing
In addition:
• ICANN's NOC contacts the Ry's Emergency Contacts to verify reception of the alert
• ICANN Technical Services staff contacts the Ry to provide help
Monitoring the quality of IPv4 and IPv6
• Every probe node monitors the quality of its IPv4 and IPv6 connectivity.
• If the quality of its IPv4 and IPv6 connectivity is determined to be insufficient, the probe node goes offline automatically.
• In order to monitor the quality of IPv4 and IPv6 connectivity, the node:
– Sends a DNS query to every root server every minute
– If 5 or more responses are received per IP protocol within 250 ms, the quality of connectivity is considered to be sufficient
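The per-protocol check described above can be sketched as follows. The input is one round-trip time per root server for a single IP protocol, with None standing for no response; the function name and the example values are hypothetical. Whether a node goes offline when only one protocol is insufficient is not stated on the slide, so the sketch stops at the per-protocol verdict.

```python
from typing import Optional


def connectivity_sufficient(rtts_ms: list[Optional[float]],
                            min_responses: int = 5,
                            max_rtt_ms: float = 250.0) -> bool:
    """True if at least min_responses root servers answered within max_rtt_ms."""
    timely = [rtt for rtt in rtts_ms if rtt is not None and rtt <= max_rtt_ms]
    return len(timely) >= min_responses


if __name__ == "__main__":
    # Made-up round-trip times for one minute of probing, per IP protocol.
    ipv4 = [12.0, 30.5, 48.0, 110.0, 249.9, None, 400.0]
    ipv6 = [15.0, 20.0, None, None, 300.0, 310.0, 90.0]
    print("IPv4 sufficient:", connectivity_sufficient(ipv4))
    print("IPv6 sufficient:", connectivity_sufficient(ipv6))
```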
MoSAPI
MoSAPI
• The monitoring system API (MoSAPI) is in pilot mode at the moment.
• The API allows the Registry to access the information collected by the SLAM.
• The production version is going to support defining a maintenance window programmatically. At the moment, this is a manual process.
Statistics
Statistics – Interesting data points
• 11 out of 37 RSPs have had at least one TLD that reached the EBERO threshold in at least one service
• 27 (DNS or RDDS) service failures reached the EBERO threshold (we haven't declared an EBERO event yet)
• 1.7% (21 out of 1,211) of the new gTLDs have reached the EBERO threshold in at least one service (DNS or RDDS)
• 32 out of 37 RSPs have had at least one DNS service failure since 25-Sep-2014
Note: data as of 1-Jan-2017.
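For readers who want the counts above as shares, a short calculation using only the numbers on the slide; the helper name is illustrative.

```python
def share_percent(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


if __name__ == "__main__":
    print("gTLDs that reached the EBERO threshold:", share_percent(21, 1211), "%")
    print("RSPs with a TLD that reached the threshold:", share_percent(11, 37), "%")
    print("RSPs with at least one DNS service failure:", share_percent(32, 37), "%")
```

The first figure reproduces the 1.7% quoted on the slide; the other two work out to roughly 30% and 86% of the 37 RSPs.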
Statistics – Potential EBERO events
[Chart: failures that reached the EBERO threshold, by month and quarter, 2014 to 2016; vertical axis 0 to 14]
Statistics – Potential EBERO events
[Chart: failures that reached the EBERO threshold per RSP; vertical axis 0 to 8]
Statistics – DNS failures
[Chart: DNS failures, by month and quarter, Oct 2014 to Jan 2017; vertical axis 0 to 160]
Statistics – Unique-RSP DNS failures
[Chart: unique-RSP DNS failures, by month and quarter, Oct 2014 to Jan 2017; vertical axis 0 to 9]
Thank You and Questions
Reach us at: Email: [email protected] Website: icann.org