SLA Monitoring System (SLAM) Gustavo Lozano | ICANN DNS Symposium | 13 May 2017

SLA Monitoring System (SLAM) - Agenda

1. Contractual Provisions
2. SLAM
3. MoSAPI
4. Statistics

Contractual Provisions

Why is ICANN monitoring gTLDs?

•  Specification 10 of the new gTLD Registry Agreement specifies the Service Level Requirements for Registry Operators.

•  ICANN developed a monitoring system, the SLAM (Service Level Agreement Monitoring) System, as a tool to measure compliance with these requirements.

What are the Service Level Requirements?

Service  Parameter                      SLR (monthly basis)
DNS      DNS service availability       0 min downtime = 100% availability
DNS      DNS name server availability   ≤ 432 min of downtime (≈ 99%)
DNS      TCP DNS resolution RTT         ≤ 1500 ms, for at least 95% of queries
DNS      UDP DNS resolution RTT         ≤ 500 ms, for at least 95% of queries
DNS      DNS update time                ≤ 60 min, for at least 95% of probes
RDDS     RDDS availability              ≤ 864 min of downtime (≈ 98%)
RDDS     RDDS query RTT                 ≤ 2000 ms, for at least 95% of queries
RDDS     RDDS update time               ≤ 60 min, for at least 95% of probes
EPP      EPP service availability       ≤ 864 min of downtime (≈ 98%)
EPP      EPP session-command RTT        ≤ 4000 ms, for at least 95% of commands
EPP      EPP query-command RTT          ≤ 2000 ms, for at least 95% of commands
EPP      EPP transform-command RTT      ≤ 4000 ms, for at least 95% of commands
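The downtime budgets and the "at least 95%" RTT rules can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 30-day month (43,200 minutes), which is what makes 432 and 864 minutes line up with 99% and 98%, and the function names are hypothetical rather than part of the SLAM.

```python
from typing import Sequence

# Assumption: a 30-day month, i.e. 30 * 24 * 60 = 43,200 minutes.
MINUTES_PER_MONTH = 30 * 24 * 60


def availability_percent(downtime_minutes: int,
                         total_minutes: int = MINUTES_PER_MONTH) -> float:
    """Convert minutes of downtime in a month into an availability percentage."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def rtt_slr_met(rtts_ms: Sequence[float], limit_ms: float,
                required_percent: int = 95) -> bool:
    """True if at least `required_percent` of the measured RTTs are within the limit."""
    within = sum(1 for rtt in rtts_ms if rtt <= limit_ms)
    # Integer comparison avoids floating-point rounding at the boundary.
    return within * 100 >= required_percent * len(rtts_ms)


if __name__ == "__main__":
    print(availability_percent(432))   # DNS name server budget
    print(availability_percent(864))   # RDDS and EPP budget
    print(rtt_slr_met([120.0] * 19 + [900.0], limit_ms=500))  # UDP DNS RTT rule
```

How unanswered queries are counted toward the 95% is not stated on the slide, so the sketch only considers RTTs that were actually measured.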

What are the Emergency Thresholds?

•  ICANN can designate an interim EBERO (Emergency Backend Registry Operator) to take over the operation of a gTLD when an emergency threshold is reached.

•  SLAM system alerts and Compliance notices are sent to Registry Operators when certain percentages of the specified Emergency Thresholds are met.

Critical Function                 Emergency Threshold
DNS Service (all servers)         4-hour total downtime / week
DNSSEC proper resolution          4-hour total downtime / week
EPP                               24-hour total downtime / week
RDDS (WHOIS / Web-based WHOIS)    24-hour total downtime / week
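As a rough illustration of how weekly downtime relates to these thresholds, the sketch below expresses accumulated downtime as a percentage of the emergency threshold. The dictionary keys and function name are hypothetical, and the slides do not say at which percentages alerts and Compliance notices are sent, so no alert levels are hard-coded.

```python
# Emergency thresholds from the table above, in hours of total downtime per week.
EMERGENCY_THRESHOLD_HOURS = {
    "DNS": 4,
    "DNSSEC": 4,
    "EPP": 24,
    "RDDS": 24,
}


def threshold_percent(service: str, downtime_minutes: float) -> float:
    """Share of the emergency threshold consumed by this week's downtime."""
    threshold_minutes = EMERGENCY_THRESHOLD_HOURS[service] * 60
    return 100.0 * downtime_minutes / threshold_minutes


if __name__ == "__main__":
    # 24 minutes of DNS downtime against a 240-minute threshold.
    print(threshold_percent("DNS", 24))
    # 12 hours of RDDS downtime against a 24-hour threshold.
    print(threshold_percent("RDDS", 12 * 60))
```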

SLAM

What is the SLAM?

•  Zabbix as the primary monitoring platform. Custom plugins and code to support ICANN monitoring were developed by Zabbix.

•  Probe node network
  –  Consists of 40 probe nodes covering all ICANN regions.

•  A Network Operations Center operating 24/7

•  ICANN staff is on call 24/7

Design principles of the system

•  Avoid false positives

•  Avoid false positives

•  Avoid false positives

•  Reach the affected Registry Operator as soon as possible

•  Provide general guidance regarding the potential issue

How does it work?

[Diagram components: Data Processor; Probe Node (×3); Ry (Registry)]

DNS test

•  One non-recursive DNS query sent every minute from all probe nodes
  –  Query is sent to every (IP address, NS) pair
  –  Query is for the FQDN zz--icann-monitoring.<TLD>

•  If DNSSEC is offered, NSEC/NSEC3 and the signatures are verified.

•  The chain of trust is validated against the root zone KSK.
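To make "non-recursive" concrete: in the DNS wire format it means the RD (recursion desired) bit in the header is left clear. The following is a minimal sketch, using only the Python standard library, of what such a query for the monitoring name could look like on the wire. The query type is not given on the slide; type A is an assumption here, and the function names are hypothetical.

```python
import struct


def monitoring_fqdn(tld: str) -> str:
    """FQDN queried by the DNS test for a given TLD."""
    return f"zz--icann-monitoring.{tld.strip('.')}."


def build_query(fqdn: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a DNS query with the RD bit clear (non-recursive).

    qtype=1 (A) is an assumption; the slides do not name the record type.
    """
    flags = 0x0000  # QR=0 (query), OPCODE=0, RD bit (0x0100) not set
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in fqdn.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    return header + question


if __name__ == "__main__":
    packet = build_query(monitoring_fqdn("example"))
    print(packet.hex())
```

A probe would send this packet over UDP or TCP to each name server address of the TLD; DNSSEC validation of the answer is out of scope for this sketch.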

•  Examples of failure criteria
  –  No reply
  –  Invalid reply (e.g., RCODE/SERVFAIL)
  –  Malformed or invalid responses
  –  Broken chain of trust
  –  NSEC and NSEC3 errors
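The first three criteria can be approximated by inspecting only the 12-byte DNS header of the reply. The sketch below is an illustration under assumptions, not the SLAM's actual logic: treating an ID mismatch or a missing QR bit as "malformed" is a choice made here, and the DNSSEC-related criteria are not covered.

```python
import struct
from typing import Optional

SERVFAIL = 2  # RCODE value for "server failure"


def classify_reply(reply: Optional[bytes], query_id: int) -> str:
    """Classify a raw DNS reply using header fields only."""
    if reply is None:
        return "no reply"
    if len(reply) < 12:
        return "malformed response"  # shorter than a DNS header
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)  # QR bit
    if reply_id != query_id or not is_response:
        return "malformed response"
    if flags & 0x000F == SERVFAIL:  # RCODE is the low 4 bits of the flags
        return "invalid reply (SERVFAIL)"
    return "passed header checks"


if __name__ == "__main__":
    servfail = struct.pack("!HHHHHH", 0x1234, 0x8002, 1, 0, 0, 0)
    print(classify_reply(None, 0x1234))
    print(classify_reply(servfail, 0x1234))
```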

RDDS test

•  One Whois (port 43) transaction initiated every 5 minutes from all probe nodes.

•  One HTTP (web-Whois) connection test every 5 minutes. The system will follow HTTP redirects.

•  Examples of failure criteria
  –  DNS/DNSSEC failures when resolving whois.nic.<TLD>
  –  Malformed or invalid Whois (port 43) responses
  –  HTTP 500 error code in case of web-Whois

Recursive DNS servers

•  Recursive DNS servers are used during the tests (e.g. resolving whois.nic.<TLD>)

•  DNSSEC is enabled in the recursive DNS servers

•  If DNSSEC is failing when resolving whois.nic.<TLD>, the RDDS is considered to be failing

•  The maximum TTL allowed in the cache and negative cache is 15 minutes

What happens when a failure is detected?

The SLA system continuously monitors all gTLDs (Ry). An issue is passed to the alerting machine under the following conditions.

DNS issues

•  Three consecutive failing cycles
•  51% or more of the probe nodes detected the issue
•  At least 20 probe nodes are online

RDDS issues

•  Two consecutive failing cycles
•  51% or more of the probe nodes detected the issue
•  At least 10 probe nodes are online
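The alerting conditions above amount to a small decision rule. The sketch below encodes it with hypothetical names; one point the slides leave open is how a cycle with too few probes online is treated, and this sketch assumes such a cycle never counts as failing.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class AlertRule:
    consecutive_failing_cycles: int
    min_probes_online: int
    failing_percent: int = 51  # "51% or more of the probe nodes"


DNS_RULE = AlertRule(consecutive_failing_cycles=3, min_probes_online=20)
RDDS_RULE = AlertRule(consecutive_failing_cycles=2, min_probes_online=10)


def cycle_failed(online: int, failing: int, rule: AlertRule) -> bool:
    """A cycle fails if enough probes are online and 51% or more report the issue."""
    if online < rule.min_probes_online:
        return False  # assumption: too few probes online, result is not trusted
    return failing * 100 >= rule.failing_percent * online


def should_alert(cycles: Sequence[Tuple[int, int]], rule: AlertRule) -> bool:
    """cycles holds (probes_online, probes_failing) pairs, oldest first."""
    n = rule.consecutive_failing_cycles
    if len(cycles) < n:
        return False
    return all(cycle_failed(online, failing, rule) for online, failing in cycles[-n:])


if __name__ == "__main__":
    history = [(40, 21), (40, 25), (40, 30)]
    print(should_alert(history, DNS_RULE))
    print(should_alert(history[-2:], RDDS_RULE))
```

Requiring several consecutive cycles, a majority of probes, and a minimum number of online probes all serve the design principle of avoiding false positives.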

Alerting machine

•  Calls the Ry's Emergency Contacts
•  Contacts ICANN Contractual Compliance
•  Contacts ICANN IT, if the SLAM is failing

ICANN's NOC contacts the Ry's Emergency Contacts to verify reception of the alert.

ICANN Technical Services staff contacts the Ry to provide help.

Monitoring the quality of IPv4 and IPv6

•  Every probe node monitors the quality of its IPv4 and IPv6 connectivity.

•  If the quality of its IPv4 and IPv6 connectivity is determined to be insufficient, the probe node goes offline automatically.

•  In order to monitor the quality of IPv4 and IPv6 connectivity, the node:
  –  Sends a DNS query to every root server every minute
  –  If 5 or more responses are received per IP protocol within 250 ms, the quality of connectivity is considered to be sufficient
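The self-check described above reduces to counting timely responses per IP protocol. A minimal sketch follows, assuming the round-trip times to the root servers have already been measured (with `None` standing for no response). The names are hypothetical, and how the two per-protocol results combine into the decision to go offline is left to the caller, since the slide does not spell it out.

```python
from typing import Dict, Iterable, Optional

MIN_RESPONSES = 5     # "5 or more responses"
MAX_RTT_MS = 250.0    # "within 250 ms"


def protocol_ok(rtts_ms: Iterable[Optional[float]]) -> bool:
    """True if at least MIN_RESPONSES root-server replies arrived within MAX_RTT_MS."""
    timely = sum(1 for rtt in rtts_ms if rtt is not None and rtt <= MAX_RTT_MS)
    return timely >= MIN_RESPONSES


def connectivity_report(
    rtts_by_protocol: Dict[str, Iterable[Optional[float]]]
) -> Dict[str, bool]:
    """Evaluate each IP protocol separately, as the slide states 'per IP protocol'."""
    return {proto: protocol_ok(rtts) for proto, rtts in rtts_by_protocol.items()}


if __name__ == "__main__":
    sample = {
        "IPv4": [20.0, 35.0, 180.0, 240.0, 90.0, None, 400.0],
        "IPv6": [30.0, None, None, 300.0, 60.0, 45.0, None],
    }
    print(connectivity_report(sample))
```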

MoSAPI

•  The monitoring system API is in pilot mode at the moment.

•  The API allows the Registry to access the information collected by the SLAM.

•  The production version is going to support defining a maintenance window programmatically. At the moment, this is a manual process.

Statistics

Statistics – Interesting data points

•  11 out of 37 RSPs have had at least one TLD that reached the EBERO threshold in at least one service

•  27 (DNS or RDDS) service failures reached the EBERO threshold (we haven't declared one EBERO event yet)

•  1.7% (21 out of 1,211) of the new gTLDs have reached the EBERO threshold in at least one service (DNS or RDDS)

•  32 out of 37 RSPs have had at least one DNS service failure since 25-Sep-2014

Note: data as of 1-Jan-2017.

Statistics – Potential EBERO events

[Chart: failures that reached the EBERO threshold. x-axis: month and quarter, 2014 to 2016; y-axis: count, 0 to 14]

Statistics – Potential EBERO events

[Chart: failures that reached the EBERO threshold per RSP. y-axis: count, 0 to 8]

Statistics – DNS failures

[Chart: DNS failures. x-axis: month and quarter, Oct 2014 to Jan 2017; y-axis: count, 0 to 160]

Statistics – Unique-RSP DNS failures

[Chart: unique-RSP DNS failures. x-axis: month and quarter, Oct 2014 to Jan 2017; y-axis: count, 0 to 9]

Thank You and Questions

Reach us at:
Email: [email protected]
Website: icann.org

Engage with ICANN

linkedin.com/company/icann
twitter.com/icann
facebook.com/icannorg
weibo.com/ICANNorg
youtube.com/user/icannnews
slideshare.net/icannpresentations
flickr.com/photos/icann
soundcloud.com/icann