estimating network layer subnet characteristics via

14
Estimating Network Layer Subnet Characteristics via Statistical Sampling M. Engin Tozal and Kamil Sarac Department of Computer Science The University of Texas at Dallas, Richardson, TX 75080 USA engintozal,[email protected] Abstract. Network layer Internet topology consists of a set of routers connected to each other through subnets. Recently, there has been a sig- nificant interest in studying topological characteristics of subnets in ad- dition to routers in the Internet. However, given the size of the Internet, constructing complete subnet level topology maps is neither practical nor economical. A viable solution, then, is to sample subnets in the target do- main and estimate their global characteristics. In this study, we propose a sampling framework for subnets; derive proper estimators for various subnet characteristics including total number of subnets, subnet prefix length distribution, mean subnet degree, and IP address utilization; and analyze the theoretical and empirical aspects of these estimators. Keywords: network, topology, sampling 1 Introduction Understanding the structure of the Internet helps us build representative net- work models, craft efficient algorithms at the application layer, and optimize the networking infrastructure in terms of reliability, robustness, and performence [5, 12, 4]. At the network layer, Internet topology consists of a set of routers connected via point-to-point or multi-access links or subnets. Most of the existing efforts on capturing a network layer map of the Internet focus on router level maps [18, 13, 15]. These maps are then studied to understand various topological character- istics of the Internet including the total number of routers, degree distribution of routers, average router degree, degree assortativity, and betweenness as these features are considered important in characterizing the Internet topology. Router level maps typically do not consider the nature of the subnets (i.e., point-to-point or multi-access) connecting routers and simply use point-to-point links to represent the connections. On the other hand, subnets are also important building blocks of the Internet topology. An alternative graph representation of the Internet at the network layer may depict subnets as the main entities repre- sented as vertices and consider routers as links connecting those subnets/vertices to each other (see Figure 1). Studying features of subnet level maps would im- prove our understanding of the Internet topology. Among many practical uses of subnet level Internet maps are estimating the IP address space utilization in the Internet and developing more representative synthetic graphs of the Internet at the network layer. Despite the benefits of understanding topological properties and evolution of the Internet, drawing complete topology maps turns out to be expensive in terms

Upload: others

Post on 10-Feb-2022

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Estimating Network Layer Subnet Characteristics via

Estimating Network Layer SubnetCharacteristics via Statistical Sampling

M. Engin Tozal and Kamil Sarac

Department of Computer ScienceThe University of Texas at Dallas, Richardson, TX 75080 USA

engintozal,[email protected]

Abstract. Network layer Internet topology consists of a set of routersconnected to each other through subnets. Recently, there has been a sig-nificant interest in studying topological characteristics of subnets in ad-dition to routers in the Internet. However, given the size of the Internet,constructing complete subnet level topology maps is neither practical noreconomical. A viable solution, then, is to sample subnets in the target do-main and estimate their global characteristics. In this study, we proposea sampling framework for subnets; derive proper estimators for varioussubnet characteristics including total number of subnets, subnet prefixlength distribution, mean subnet degree, and IP address utilization; andanalyze the theoretical and empirical aspects of these estimators.

Keywords: network, topology, sampling

1 Introduction

Understanding the structure of the Internet helps us build representative net-work models, craft efficient algorithms at the application layer, and optimize thenetworking infrastructure in terms of reliability, robustness, and performence [5,12, 4].

At the network layer, Internet topology consists of a set of routers connectedvia point-to-point or multi-access links or subnets. Most of the existing efforts oncapturing a network layer map of the Internet focus on router level maps [18, 13,15]. These maps are then studied to understand various topological character-istics of the Internet including the total number of routers, degree distributionof routers, average router degree, degree assortativity, and betweenness as thesefeatures are considered important in characterizing the Internet topology.

Router level maps typically do not consider the nature of the subnets (i.e.,point-to-point or multi-access) connecting routers and simply use point-to-pointlinks to represent the connections. On the other hand, subnets are also importantbuilding blocks of the Internet topology. An alternative graph representation ofthe Internet at the network layer may depict subnets as the main entities repre-sented as vertices and consider routers as links connecting those subnets/verticesto each other (see Figure 1). Studying features of subnet level maps would im-prove our understanding of the Internet topology. Among many practical uses ofsubnet level Internet maps are estimating the IP address space utilization in theInternet and developing more representative synthetic graphs of the Internet atthe network layer.

Despite the benefits of understanding topological properties and evolution ofthe Internet, drawing complete topology maps turns out to be expensive in terms

Page 2: Estimating Network Layer Subnet Characteristics via

R2

R1

S1

S2

R3

R5 R

6R

4

R7

S3 S

4

(a) Example Topology

R1 R

2R

3

R4

R6

R5

R7

(b) Router Level Graph

1S 2

S

3S 4

S

/31

/30

/29

/29

(c) Subnet Level Graph

Fig. 1: A network (left) represented as a router (center) and subnet (right) graph.

of time, bandwidth, and computational resources [14, 7]. Therefore, a viable al-ternative for studying Internet topology is to employ statistical sampling andestimate the parameters of the entire population by analyzing a small unbiasedsample of the population. This would naturally require to discover individualsubnets and define proper estimators to estimate subnet characteristics. In thispaper we present a scheme to achieve this for studying subnet level topologicalcharacteristics of the Internet.

In general Internet topology sampling suffer from several practical limitationsdue to the operational characteristics of the Internet [3, 10]. As an example, inour context we want to sample subnets randomly which requires a list of subnetsto be available a priori. Unfortunately, this information is considered confidentialby ISPs and therefore, is not available to us, researchers. On the other hand,through active probing we can identify the list of alive IP addresses utilized inthese networks. Thus instead of randomly sampling subnets, we can randomlysample IP addresses. In statistical sampling this phenomenon is referred as thediscrepancy between selection units (IP addresses) and observation units (sub-nets) [16]. The problem induced by this discrepancy is that different subnetsmay utilize different number of IP addresses. As a result random sampling ofIP addresses may not indicate random sampling of subnets. In other words, thesubnets hosting more IP addresses are more likely to be sampled compared to theones accommodating less IP addresses. Finally, unresponsive units, i.e., subnetslocated behind firewalls, or located within rate limiting ISPs, might introduceestimation error depending on whether they are random or not.

Following the above discussion our sampling frame in this study is IP ad-dresses rather than subnets. Therefore, we developed proper estimators for vari-ous subnet features using IP addresses of an ISP network as our sampling frame.Specifically, we developed estimators for total number of subnets, subnet maskdistribution, mean subnet degree, and IP address space utilization. Our theoret-ical and empirical evaluations conducted by collecting population characteristicsand samples from six geographically disperse Internet service providers demon-strate that our estimators are fairly accurate and stable with small variances.Note that, the scope of this paper is limited to (i) developing an approach forsampling subnets in the Internet; (ii) defining proper estimators for various sub-net characteristics; and (iii) demonstrating that these estimators work well inestimating population parameters.

The rest of the paper is organized as follows. In Section 2, we discuss thechallenges of subnet level Internet topology sampling and develop a step-by-step sampling framework for estimating various features of subnets. Section 3demonstrates our evaluations on how well the estimators of various subnet fea-tures capture the population parameters. In Section 4 we introduce the relatedwork and finally, in Section 5, we articulate future works and conclude the paper.

Page 3: Estimating Network Layer Subnet Characteristics via

2 Subnet Sampling

In this section we first introduce exploreNET, a subnet inference tool that weused in our sampling process. Then, we discuss the challenges of subnet leveltopology sampling in the Internet and present how we handle these challenges inour sampling framework. Finally, we define a set of subnet characteristics thatwe are interested in and explain how we can achieve either unbiased or slightlybiased estimators with small variance for these characteristics.

2.1 Subnet Inference with ExploreNET

A subnet is a logically visible sub-section of Internet Protocol network (RFC950) where the connected hosts can directly communicate with each other. Atthe network layer, a subnet is independent from its implementation below layer-3which could be an Ethernet network or an ATM or MPLS based virtual network.For practical purposes we define a subnet S as a set of interfaces (or IP addresses)accommodated by S in addition to a subnet mask (or common IP address prefixlength) referring to the IP address range of S.

ExploreNET is a network layer subnet inference tool [19]. Given a targetIP address t as input, exploreNET uses active probing to discover the subnetS that contains t along with all other alive IP addresses on S. Furthermore,exploreNET labels S with its observed subnet mask which corresponds to theminimum of the subnet masks encompassing all observed IP addresses of S. Todiscover the boundaries of a subnet, exploreNET employs a set of heuristics basedon hop distance of the subnet being explored, common IP address assignmentpractices, and routing. Our empirical results show that the proposed subnetinference mechanism achieves 94.9% and 97.3% accuracy rates on Internet2 andGEANT, respectively [17]. Moreover, its success rate is 93% against a groundtruth dataset collected by mrinfo in the global public Internet [19]. On the otherhand, the probing cost of the tool changes between 2|S| and 7|S|+ 7 dependingon the configuration of subnet S and its neighboring subnets, where |S| denotesthe size of S. Considering that exploreNET is based on a set of heuristics itsaccuracy rate is subject to change in different network domains and achieving100% accuracy in every domain is difficult. Since the scope of this paper is limitedto statistical subnet sampling, we direct the readers who are interested in thedetails of subnet discovery to our previous studies [17, 19].

A sampling process has two types of errors, namely, sampling error andnonsampling error [11]. Sampling errors occur because of chance whereas non-sampling errors can be attributed to many sources including inability to obtaininformation, errors made in processing data, and respondents providing incorrectinformation. It is impossible to avoid sampling error, hence, the goal would beto minimize the sampling error by defining unbiased estimators with small vari-ance. On the other hand, nonsampling errors could be eliminated by amendingdata collection and processing phases. In this context, nonsampling error refersto the error introduced by exploreNET the subnet inference tool that we used inthis study and sampling error refers to the error that occurs in our estimations.

Note that our aim in this study is to define good estimators that minimizethe sampling error in estimating various subnet features rather than improvingsubnet inference to eliminate nonsampling error. On the other hand, the sam-pling approach and estimators proposed in this study do not necessarily dependon exploreNET. Any other future subnet inference tool that collects the subnetaccommodating a given IP address should work with our sampling framework.

Page 4: Estimating Network Layer Subnet Characteristics via

1S 2

S

3S

4S

Fig. 2: Subnets hosting more IP addresses are more likely to be sampled com-pared to the ones with less IP addresses.

2.2 Challenges in Subnet Sampling

In this sub-section, we briefly explain the challenges in subnet level Internettopology sampling including the discrepancy between selection and observationunits, non-uniform unit sampling, and unresponsive units in sampling.

One challenging issue with sampling is the discrepancy between the selectionunits and the observation units. Observation units are the main objects that wewant to sample and derive their statistical properties, whereas, selection units arethe objects in the sampling frame that we can draw from with some probabilitydistribution (preferably uniform). In most cases, selection units and observationunits are the same, in some other cases (including Internet topology sampling)however, observation units are not directly available for sampling. To illustrate,we do not have the entire list of the routers or subnets of an ISP to sample from.The only information available to us is the IP address range of ISPs. As a result,the only sampling frame available to us is the list of alive IP addresses.

A consequence of using IP addresses as sampling frame is non-uniform sam-pling of subnets. In other words we can uniformly sample from the IP addressrange of an ISP and give the selected IP addresses to exploreNET as input but thecollected subnets would not be sampled uniformly. Remember that exploreNETdiscovers a subnet as long as one of the IP addresses of the subnet is givenas input. As a result subnets with larger degree, i.e., large number of alive IPaddresses, appear more frequently in the sampling frame and observed more fre-quently compared to the ones having smaller degrees. Put another way, a subnetis drawn with a probability proportional to its degree. Note that we use the termdegree to denote the number of interfaces of a subnet rather than the number ofneighboring subnets of a subnet. Figure 2 shows an example target domain suchthat large circles depict the subnets and small circles filled in gray denote the IPaddresses hosted by the subnets. Considering subnets being sampled through IPaddresses, S3, in Figure 2, is twice as likely to be observed as compared to S1

because the size of S1 is half of the size of S3. As a result, off-the-shelf statisticalestimators fail to estimate the characteristics of the subnets due to the unequaldrawn probabilities of subnets.

Finally, unresponsive units [16] in sampling is another challenge. Most of thetools for collecting topology data in the Internet is based on active probing.However, autonomous systems, especially the ones at the Internet edge, are notcompletely open to active probing because of security and operational concerns.Some portions of their network are completely behind firewalls or they applyrate limiting in the sense that a router remains silent to probe packets if it isbusy or drops probe packets if it suspected a security threat. The interior parts(core) of an ISP network provides a reliable, robust, and efficient communica-tion infrastructure to the other networks. Hence, in this paper we study ISPcore networks rather than the edge networks. ISP core networks are more com-

Page 5: Estimating Network Layer Subnet Characteristics via

patible with active probing compared to the edge networks. We also introduceartificial delays and multiple probing in our experiments to minimize the ratelimiting factor. Assuming that the unresponsive units in the core network areunmethodical our results are not skewed by them.

2.3 Estimating Subnet Characteristics

In this sub-section we present a generic framework based on Hansen-Hurwitz(HH) Estimator [16] for estimating various subnet features. In the next sub-section, we derive proper estimators from the generic framework presented herefor estimating subnet characteristics of a network including number of subnets,prefix length distribution, mean subnet size and IP address utilization.

HH-estimator is an unbiased estimator of population total whenever the se-lection units are drawn with different probabilities and the sampling is donewith replacement. Let {y1, y2,. . . , yj ,. . . , yn} be a sample of n independent ob-servations from a population of size N . Let pi be the selection probability of the

ith unit in the population such that∑Ni=1 pi = 1. HH-estimator estimates the

population total τy =∑Ni=1 yi and it is defined as follows:

τy =1

n

n∑j=1

yjpj

(1)

In Equation (1) dividing observation yj by its selection probability pj giveshigher weight to the units that are less likely to be selected. An unbiased esti-mator of population mean is µy = τy/N .

In the context of subnet sampling, we define yi to be the response variablemeasured on the ith subnet, Si, of an ISP. Response variables are any char-acteristic of subnets that we are interested in such as degree, subnet mask, orutilization. Response variables could take discrete values like degree or categori-cal values like subnet masks. In the former case they are represented as discreterandom variables and in the latter case they are represented as indicator randomvariables. Additionally, the selection probability of a subnet is proportional toits degree, pj ∝ dj , where pj and dj are the drawn probability and the degree ofsubnet Sj , respectively. Then, the drawn probability, pj , of Sj can be defined interms of subnet degree as follows:

pj =dj∑Ni=1 di

(2)

where dj is the degree of the jth subnet in the sample, di is the degree of the ith

subnet in the population, and N is the number of subnets in the population. InEquation (2), we neither know the degree, di, of each subnet in the populationnor the number of subnets, N , in the population in advance. To address thisissue, we decided to build a set of alive IP addresses, Fa = {a1, a2, . . . , ak}, ofthe target ISP that we want to estimate its subnet characteristics and utilize theequality between the size of Fa, i.e. |Fa|, and the sum of degrees of the observablesubnets in the target ISP:

|Fa| =N∑i=1

di (3)

Page 6: Estimating Network Layer Subnet Characteristics via

Fa can be obtained as a dataset from LANDER project [7] at USC or be formedin linear probing overhead by pinging each IP address of the target ISP.

Note that, we also use Fa as our sampling frame from which we randomlyselect n IP addresses with replacement, give each IP address to exploreNET asinput and obtain the sample set of n subnets.

Rearranging the general estimator given in Equation (1) in the context ofsubnet sampling results in:

τy =|Fa|n

n∑j=1

yjdj

(4)

Equation (4) does not have any unknown term; sampling frame, Fa, is formedprior to sampling process and response variable, yj , and degree, dj , are obtainedby our subnet inference tool exploreNET.

As response variables of subnets, y, are independent from each other, sam-pling distribution of the point estimator τy is approximately normal by CentralLimit Theorem [8], i.e., τy ∼ Normal(τy, σ2

τy) where

σ2τy

=1

n|Fa|

N∑i=1

di

(|Fa|

yidi− τy

)2

(5)

Then we can define a confidence interval (CI) with confidence level (1 −α)100% as τy ± zα2

√σ2τy

such that zα2

is the upper α/2 point of the standard

normal distribution. Since τy is the parameter that we want to estimate andN is usually unknown, an unbiased estimator of the variance of the samplingdistribution of τy, i.e., σ2

τyis defined as follows:

s2τy =1

n− 1

|Fa|2n

n∑j=1

(yjdj

)2

− τ2y

(6)

Here, we can replace σ2τy

with s2τy in confidence interval construction.

2.4 Important Subnet Characteristics

Although we have derived a general formula for estimating population total overany subnet response variable y in Section 2.3, using Equation (4) requires furtherarrangements and insights for different subnet features that we want to estimate.In this part we define a set of most relevant subnet characteristics includingsubnet population size, total subnets with a certain subnet mask, average subnetdegree, and IP address utilization percentage and show how to deduce theirproper point estimators. Note that this set is not complete and one may suggestnew subnet characteristics in the future.

Subnet Population Size (N) Subnet Population Size, N , refers to the numberof distinct subnets in a particular ISP. Since τy corresponds to the populationtotal over the response variable y, setting yj = 1 in Equation (4) gives us an

unbiased estimator, N , of total number of subnets, N , in the population.

N =|Fa|n

n∑j=1

1

dj(7)

Page 7: Estimating Network Layer Subnet Characteristics via

Total Subnets with Prefix Length /p (τp) Observable prefix length (subnetmask) of a subnet refers to the the length of the initial block of bits commonto all IP addresses in the subnet. A subnet having prefix length p is said tobe a /p subnet. In order to estimate the total number of subnets of /p let thethe response variable measured on subnet Sj , i.e., yj be an indicator randomvariable such that

yj =

{1 if Sj is a /p subnet0 otherwise (8)

Then we can estimate the total number of subnets having /p prefix, τp, inan ISP as follows:

τp =|Fa|n

n∑j=1

yjdj

(9)

Note that, the variance of the estimator τp increases as the prefix gets larger(subnet degree gets smaller). The implication of this raise is wider confidenceintervals for larger prefix lengths.

In case we want to estimate the prefix length distribution of the subnets wecan resort to the fact that rp = τp/N where rp is the ratio of subnets havingprefix length /p. If N is known then rp = τp/N would be an unbiased estimator of

rp. On the other hand, if N is unknown rp = τp/N would be a biased estimatorof rp. Nevertheless our experimental results show that the bias in the lattercase is small. One should note that, a slightly biased estimator having a smallvariance of a population parameter is usually preferable compared to an unbiasedestimator having a large variance.

Mean Subnet Degree (µd) Degree of subnet Sj , dj , corresponds to the num-ber IP addresses accommodated by Sj . To calculate the average degree through-out an entire ISP we need to compute the total of all degrees divided by thetotal number of subnets. An estimator of mean degree, µd, is defined as follows:

µd =|Fa|N

(10)

Note that, although N is an unbiased estimator of N , µd is not an unbiased

estimator of µd because E[µd] 6= |Fa|/E[N ]. However, our experimental resultsshow that the bias and variance of µd is extremely small and we get very accurateand stable estimations of µd.

IP Address Space Utilization (U) ExploreNET annotates the subnets withtheir observed prefix lengths while discovering their in-use IP addresses. Prefixlength, p, of subnet Sj indicates the observable IP address capacity, cj , of Sjwhile its degree dj indicates the number of IP addresses that have been utilized.Capacity of subnet Sj is defined as

cj =

{2 if Sj is a /31 subnet

232−p − 2 if Sj is a /p subnetsuch that p ≤ 30

(11)

ISP utilization U is defined as the total number of alive IP addresses dividedby the total capacity. Again total number of alive IP addresses corresponds to

Page 8: Estimating Network Layer Subnet Characteristics via

our sampling frame size, |Fa|. On the other hand, we estimate total capacity τcas follows:

τc =∑∀p

cpτp (12)

where cp is the capacity of a subnet having prefix length /p and τp is the estima-tion of the total number of subnets having prefix length /p. Since E[τc] = τc, τcis an unbiased estimator of total capacity τc. Then we can define ISP utilizationas

U =|Fa|τc

(13)

Again, U is not an unbiased estimator of U but our experimental resultsdemonstrate that it is a very good estimator with small bias and variance.

3 Evaluations

In this section, we evaluate sampling errors introduced by our estimators. Forthis, we need a sampling frame consisting of alive IP addresses clustered in anumber of subnets in a network. Such a network could be generated synthet-ically; could be collected from the Internet using subnet inference techniques;or could be a genuine network available on its web page or obtained from itsnetwork operator. Synthetic networks may not be interesting and genuine topolo-gies are available for a few small sized research networks including Internet2 andGEANT which are not large enough for sampling [2]. Therefore, in our evalua-tions we decided to work with collected topologies. To the best of our knowledgethere is no subnet level topology data available for a large sized network. Hence,we use exploreNET to conduct a census of subnets over six geographically dis-persed medium sized ISPs [19] including PCCW Global (ISP-1), nLayer (ISP-2),France Telecom (ISP-3), Telecom Italia Sparkle (ISP-4), Interroute (ISP-5), andMZIMA (ISP-6). The collected population topology via census might posses non-sampling error introduced by the employed subnet inference tool, i.e., collectedtopology might slightly differ from the underlying genuine topology. In our sam-pling we use the same subnet inference tool to conduct our survey of subnets,thus, nonsampling error in the census propagates to the survey. Considering thatnonsampling error in census and survey cancel each out while calculating sam-pling error of point estimates, our evaluations in this section are not affected bynonsampling error. On the other hand, using our methodology for a survey inthe Internet will produce results that would contain both sampling and nonsam-pling errors. The magnitude of those nonsampling errors would depend on theaccuracy of the subnet inference tool used in sampling (see Section 2.1).

First, we identified the IP address space for each ISP. Then, using an AS re-lationship dataset provided by CAIDA[1], we removed the ranges of IP addressesthat are assigned to their customer domains. This pre-processing step enablesus to focus on the core of ISP networks excluding the topology information oftheir customer domains which are managed and operated by others. Next, weutilized active probing to identify alive IP addresses which form our samplingframe, Fa. Table 1 shows the sampling frame sizes, |Fa|, of our target ISPs.

When we collected the entire subnet population for six ISPs having 162, 866IP addresses in total, exploreNET successfully determined a subnet for 155, 309

Page 9: Estimating Network Layer Subnet Characteristics via

Table 1: Sampling Frame Sizes, |Fa|, for target ISPsISP-1 ISP-2 ISP-3 ISP-4 ISP-5 ISP-6 Σ

45,018 54,636 17,170 8,380 21,209 16,453 162,866

of them. On the other hand, it failed to explore a proper subnet for 7, 557 ofthese IP addresses and returned them as /32 subnets. In fact, none of the IPaddresses within /29 proximity of any of these 7, 557 IP addresses has respondedto the probes sent out by exploreNET.

We then randomly selected 10% of the IP addresses from the sampling frameof each ISP and estimated population parameters including population size, N ,prefix length distribution, τp, mean subnet degree, µd, and IP address spaceutilization, U .

Note that, correcting the affect of singular IP addresses for which exploreNETfailed to return a subnet requires special care in our estimation process. For ex-ample, including /32 samples into the estimation causes underestimated meansubnet degree, µd. On the other hand, excluding them causes underestimated

population size, N . Specifically, we include /32 subnets in population size esti-

mation, N , because each one represents a subnet and we exclude them from otherestimators because we assume that these /32 subnets distributed randomly, i.e.,they could be any size subnet in the underlying topology and their omissiondoes not skew the results in any way. The exclusion of /32 subnets is not sim-ply ignoring them in the estimation process but estimating the number of suchsubnets and subtracting it from the estimated total number of subnets.

Remember that, the scope of this paper is confined to deriving proper esti-mators for subnet properties in the Internet and validating them, hence, in therest of this section, we present the estimations, compare them with populationparameters, and interpret the discrepancy between them. We plan to conductanother study based on sampled data to analyze and interpret the motifs andpatterns of subnet characteristics in the Internet as a future work.

Finally, the tools used to collect evaluation dataset and the dataset itself arepublicly available on our project web site at http://itom.utdallas.edu.

3.1 Subnet Population Size (N)

Remember that population size refers to total number of subnets appearingin an ISP. Table 2 shows the population size, N , and the corresponding point

estimate, N , for six ISPs. Third column of the table shows sampling error (SE ),

N −N . The values in the table includes /32 singular IP addresses because theseIP addresses belong to some subnet that exploreNET could not discover.

Table 2: Population subnet sizes and estimations for all ISPsN N SE E

ISP-1 4171 4167.53 3.47 247.99ISP-2 1688 1625.14 62.86 156.38ISP-3 8809 8791.65 17.35 94.36ISP-4 3172 3272.01 100.01 116.55ISP-5 7687 7904.60 217.6 265.44ISP-6 3119 3180.68 61.68 181.02

Note that the point estimates, N , and sampling error, SE, would change foreach sample taken from the population. On the other hand, fourth column of

the table, E, shows the maximum error of the point estimate for N with 90%confidence level. E is calculated as z0.05sN where sN is the estimated standard

deviation of the sampling distribution of N and z0.05 is the z-score of the 0.05right tail probability of standard normal distribution and it demonstrates the

Page 10: Estimating Network Layer Subnet Characteristics via

fact that 90% of the time the error of the point estimate, N , would be less thanE. The high deviation of E for some ISPs is because of the variability of thefactor |Fa|/dj in Equation (6) where yj = 1. In other words, HH estimator willhave low variance as there is less variability among the values of yj/pj with theextreme case yj/pj values are exactly proportional in Equation (1) [16].

3.2 Total Subnets with Prefix Length /p (τp)

Prefix length (subnet mask) observed by exploreNET is the longest prefix lengthencompassing all alive IP addresses of a given subnet. Consequently, observedprefix length distribution is a natural way of grouping subnet degrees into binswhere the range of the bins grow from larger prefix lengths to the smaller. In thispart, we evaluate how well the sample prefix length distribution conforms to thepopulation prefix length distribution for six ISPs. Table 3 presents total numberof subnets having /p prefix length, τp, its estimated value, τp, the prefix lengthprobability mass function (p.m.f.), r, and its estimation, r, for PCCW Global(ISP-1). In the estimation process we omitted IP addresses that exploreNETfailed to return a subnet, i.e., /32 subnets. Assuming that these /32 subnets arerandomly distributed over the entire population, their exclusion does not affectthe accuracy of the estimated prefix length p.m.f, r.

In our data collection process, we encountered a small number of very largesubnets such as the ones having /20, /21, or /22 prefix lengths with at least onethousand IP addresses. A close examination of randomly selected IP addressesfrom these subnet ranges via DNS name resolution queries revealed the factthat almost all of these large subnets belong to Akamai Technologies, an onlinecontent distribution service provider with a global presence in the Internet. Aninteresting detail in this case is that according to DNS records, those IP addressesbelonged to Akamai Technologies while the IP-to-AS mapping dataset that weused mapped those IP addresses to their hosting ASes. Given that the main focusof this work is on building statistical estimations, we leave the close analysis ofthis type of interesting cases to our future study where we plan to study thedata to understand their implications.

Table 3: PCCW Global (ISP-1) prefix length distribution and its estimationτp τp rp rp

/20 3 2.94708 0.001266 0.001245/21 3 2.83139 0.001266 0.001196/22 7 7.3299 0.002954 0.003096/23 3 2.62722 0.001266 0.001110/24 24 24.8725 0.010127 0.010507/25 25 24.9367 0.010549 0.010534/26 123 130.435 0.051899 0.055101/27 152 145.53 0.064135 0.061478/28 262 284.222 0.110549 0.120066/29 440 426.242 0.185654 0.180061/30 899 930.165 0.379325 0.392938/31 429 385.068 0.181013 0.162668

Although Table 3 demonstrates that the estimated values are close to thepopulation parameters, we need to apply goodness-of-fit test [8] to validatewhether the estimated prefix length p.m.f. conforms to the population prefixlength p.m.f.. The test statistic for goodness-of-fit test is χ2 and it is defined as:

χ2 =∑p

(τp − Ep)2

Ep, ∀p (14)

Page 11: Estimating Network Layer Subnet Characteristics via

where p is prefix length, τp is estimated number of subnets having prefix length

/p, and Ep = rp∑31p=0 τp is the expected number of subnets having prefix length

/p under the population distribution. Null hypothesis, H0, is the assertion thatthe estimated and population prefix length distributions are the same, i.e., H0 :{∀p, rp = rp}. Whereas, alternative hypothesis, H1, is the opposite of H0, i.e.,H1 : {∃p, rp 6= rp}.

Computing the χ2 statistic for PCCW Global results in X2 = 8.6591 withdegrees of freedom (df) 9 and its related p-value is 0.4693. Since p-value isgreater than the significance level α = 0.01 we conclude that there is not enoughevidence to reject the null hypothesis, i.e., estimated prefix length distributionconforms to the population prefix length distribution.

Table 4: X2 scores and p-values for all ISPsdf X2 p-value

ISP-1 9 8.659119 0.469317ISP-2 8 11.53715 0.17308ISP-3 4 5.513502 0.238545ISP-4 6 2.1376 0.906618ISP-5 8 2.02614 0.9802ISP-6 7 4.97965 0.662447

Instead of tabulating population and estimated prefix length distributionsfor other five ISPs we present goodness-of-fit test results in Table 4. Since thep-values of all ISPs in Table 4 are greater than the significance level α = 0.01, weagain conclude that there is not enough evidence to reject the null hypothesis as-serting that the estimated prefix length distribution conforms to the populationprefix length distribution for all ISPs. As a result, prefix length ratio estimator,rp, estimates the population prefix length ratio rp well in all of our case studies.

3.3 Mean Subnet Degree (µd)

Remember that mean subnet degree estimator µd is a biased estimator of thepopulation mean subnet degree µd. Even tough the first two columns of Table 5show that the estimated value µd is pretty close to the population value µd, weneed to make sure that the maximum error of the point estimator, µd, is notmuch and we can get good estimates of the mean subnet degree most of the time.Since variance of the sampling distribution of µd has no closed form solution weresorted to Monte Carlo simulation in order to determine a confidence intervalfor µd.

Table 5: Mean subnet degree figures for all ISPsµd µd Min Max E NE

ISP-1 18.235 18.257 15.956 20.975 1.128 1.685ISP-2 44.904 46.112 36.025 57.12 4.536 1.716ISP-3 2.063 2.066 2.034 2.095 0.013 1.7ISP-4 2.849 2.797 2.647 3.079 0.098 1.672ISP-5 3.997 4.049 3.696 4.39 0.144 1.691ISP-6 6.742 6.224 5.846 7.986 0.437 1.74

We took 10000 independent samples from the population database and esti-mated mean subnet degree of the sample. Figure 3 shows the histograms demon-strating the approximate sampling distribution of µd. The third, Min, and fourth,Max, columns of Table 5 present the minimum and the maximum of µd pointestimations obtained over 10000 samples of the same size (10%), respectively.Although the minimum and maximum estimates fall a little bit off the popula-tion mean subnet degree, the bell shape of the distributions suggest that theyare outliers. To have a better idea we constructed a confidence interval with

Page 12: Estimating Network Layer Subnet Characteristics via

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

15 16 17 18 19 20 21

De

nsity

Mean Degree

(a) ISP-1

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

35 40 45 50 55 60

De

nsity

Mean Degree

(b) ISP-2

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

2.03 2.04 2.05 2.06 2.07 2.08 2.09 2.1

De

nsity

Mean Degree

(c) ISP-3

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

2.6 2.65 2.7 2.75 2.8 2.85 2.9 2.95 3 3.05 3.1

De

nsity

Mean Degree

(d) ISP-4

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4

De

nsity

Mean Degree

(e) ISP-5

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

5.5 6 6.5 7 7.5 8

De

nsity

Mean Degree

(f) ISP-6

Fig. 3: Mean subnet degree (µd) sampling distribution for all ISPs obtained via10000 instances of Monte Carlo simulation

confidence level 90% and computed maximum error of the point estimate. Col-umn five, E shows the maximum error that could be obtained 90% of the timewhile estimating the mean subnet degree. Since each ISP has a different meansubnet degree and variance we normalized the maximum error of estimate forµd, E, by dividing it with the estimated standard deviation of µd as shown inthe sixth column, NE. The NE values suggest that for an ISP, 90% of the timethe mean subnet degree estimation conveys a maximum error of approximately1.7 of standard deviation.

Note that, the ISPs having large mean subnet degree are those ones thathome very large network layer subnets belonging to Akamai Technologies.

3.4 IP Address Space Utilization (U)

Similar to mean subnet degree, IP address space utilization estimator, U , is abiased estimator of population IP address space utilization, U . The first and sec-

ond columns of Table 6 presents that the estimated utilization, U , is not far fromthe population utilization U . However, this is a single instance of sampling andto have a better idea on whether our slightly biased estimator works well mostof the time we employed Monte Carlo simulation. Again we repeated samplingfor 10000 times from our population database and estimated the value of U foreach sample.

Table 6: Utilization figures for all ISPsU U Min Max E NE

ISP-1 0.752 0.748 0.734 0.774 0.008 1.645ISP-2 0.942 0.944 0.933 0.951 0.004 1.654ISP-3 0.955 0.953 0.929 0.977 0.01 1.691ISP-4 0.767 0.775 0.722 0.811 0.019 1.653ISP-5 0.737 0.747 0.709 0.766 0.012 1.656ISP-6 0.894 0.89 0.865 0.919 0.012 1.67

Figure 4 shows the sampling distribution of U obtained over 10000 samplesof the same size (10%) for six ISPs. Third and fourth columns of Table 6 present

Page 13: Estimating Network Layer Subnet Characteristics via

the minimum and maximum utilization estimates obtained by Monte Carlo sim-ulation. However, the bell shape of the distributions in Figure 4 suggests thatthe minimum and maximum values are not likely to occur frequently. To havean idea on the error of the estimator we constructed a confidence interval with90% confidence level. Column six, E, demonstrates the maximum error of U atconfidence level 90%. To put in other words, it says that 90% of the time the

maximum error of sample utilization, U , is E. Again we normalized the max-

imum error of estimate for U , E, by dividing it to the standard deviation of

the sampling distribution of U . The last column in Table 6, NE, states that themaximum error of estimate for utilization is about 1.67 of standard deviationfor all ISPs.

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.73 0.74 0.75 0.76 0.77 0.78

De

nsity

Utilization

(a) ISP-1

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.93 0.935 0.94 0.945 0.95 0.955

De

nsity

Utilization

(b) ISP-2

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.92 0.93 0.94 0.95 0.96 0.97 0.98

De

nsity

Utilization

(c) ISP-3

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.7 0.72 0.74 0.76 0.78 0.8 0.82

De

nsity

Utilization

(d) ISP-4

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77

De

nsity

Utilization

(e) ISP-5

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.86 0.87 0.88 0.89 0.9 0.91 0.92

De

nsity

Utilization

(f) ISP-6

Fig. 4: Utilization (U) sampling distribution for all ISPs obtained via 10000 in-stances of Monte Carlo simulation

4 Related Work

Most of the studies in the Internet topology measurement field is based on col-lecting raw topology data from multiple vantage points by path sampling viatraceroute or traceNET. Then inferring routers and subnets using the rawdata [9, 15, 13]. Early studies in the area used path sampling based topologies toderive various conclusions about the topological characteristics of the Internet [5,6]. Later on researchers argued about the validity of those claims by pointingout the limitations of path sampling bias due to source dependency and routingdynamics [10, 3].

TraceNET [17] infers subnets on an end-to-end path. Although traceNETsubstantially improves existing Internet topology maps by discovering subnetstructures, it performs path sampling and hence suffers from above mentionedproblems. ExploreNET [19] used in this paper allows us to discover individualsubnets rather than subnets on an end-to-end path. In this study, we devised

Page 14: Estimating Network Layer Subnet Characteristics via

a sampling framework for studying network layer subnet characteristics in theInternet and introduced a set of unbiased and biased with small variance esti-mators for a set of subnet features. To the best of our knowledge, this is the firststudy investigating statistical sampling in the domain of subnet inference.

5 Conclusions and Future Work

Given its large scale, drawing a complete subnet level topology of the Internetis neither practical nor economical. Consequently, sampling becomes a viablesolution to derive the properties of subnets in the Internet. In this paper, wehave proposed a framework for sampling subnets in the Internet along witheither unbiased or slightly biased estimators with small variance for varioussubnet characteristics including total number of subnets, subnet prefix lengthdistribution, mean subnet degree, and subnet IP address range utilization. Ourtheoretical and empirical evaluations on the estimators show that their maximumerror at 90% confidence level is small and they work well in estimating populationparameters.

We plan to conduct another study which is based on sampled data to analyzeand interpret the motifs and patterns of subnet characteristics in the Internetas a future work.

References

1. CAIDA, UCSD, available at http://www.caida.org/data/2. The Internet Topology Zoo, available at http://www.topology-zoo.org/3. Achlioptas, D., Clauset, A., Kempe, D., Moore, C.: On the bias of traceroute

sampling. In: ACM STOC. Baltimore, MD, USA (May 2005)4. Donnet, B., Friedman, T.: Internet topology discovery: A survey. IEEE Commu-

nications Surveys 9(4) (2007)5. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the in-

ternet topology. In: SIGCOMM. New York, NY, USA (October 1999)6. Govindan, R., Tangmunarunkit, H.: Heuristics for Internet map discovery. In: IEEE

INFOCOM. Tel Aviv, Israel (Mar 2000)7. Heidemann, J., Govindan, R., Papadopoulos, C., Bartlett, G., Bannister, J.: Census

and survey of the visible internet. In: ACM IMC. Vouliag., Greece (Oct 2008)8. Irwin Miller, M.M.: John E. Freund’s Mathematical Statistics with Applications.

Prentice Hall (October 2003)9. K. Claffy, Tracie E. Monk, D.M.: Internet Tomography. Nature (Jan 1999)

10. Lakhina, A., Byers, J., Crovella, M., Xie, P.: Sampling biases in IP topology mea-surements. In: IEEE INFOCOM. San Francisco, CA, USA (Mar 2003)

11. Mann, P.S.: Introductory Statistics. Wiley (February 2010)12. Pastor-Satorras, R., Vespignani, A.: Evolution and Structure of the Internet. Cam-

bridge University Press (2004)13. Shavitt, Y., Shir, E.: DIMES: Distributed Internet measurements and simulations,

project page http://www.netdimes.org14. Sherwood, R., Bender, A., Spring, N.: DisCarte: A disjunctive internet cartogra-

pher. In: ACM SIGCOMM. Seattle, WA, USA (Aug 2008)15. Spring, N., Mahajan, R., Wetherall, D., Anderson, T.: Measuring ISP topologies

with Rocketfuel. IEEE/ACM Transactions On Networking 12(1), 2–16 (Feb 2004)16. Thompson, S.K.: Sampling. Wiley-Interscience (April 2002)17. Tozal, M.E., Sarac, K.: TraceNET: An internet topology data collector. In: ACM

Internet Measurement Conference. Melbourne, Australia (Nov 2010)18. Tozal, M.E., Sarac, K.: Palmtree: An ip alias resolution algorithm with linear

probing complexity. Computer Communications 34, 658–669 (April 2011)19. Tozal, M.E., Sarac, K.: Subnet level network topology mapping. In: IEEE IPCCC.

Orlando, FL, USA (November 2011)