estimating global internet penetration from...

Estimating Global Internet Penetration fromNetwork Measurements

Petros Gigis

University of Crete

School of Sciences and Engineering

Computer Science Department

Voutes University Campus Heraklion, GR-71003, Greece

Heraklion, September 2015

Acknowledgements

First of all I would like to thank my supervisor, Professor Xenofontas Dimitropoulos for

showing belief in me and giving me the opportunity to work with him, for all our constructive

meetings, and for his great advice and support.

I am also really grateful to Tasos Alexandridis for his continuous support and his valuable

help to this work.

Also I would like to thank Fotis Nikolaidis and Manos Lakiwtakis for their continuous support

and their useful advices.

I would like to acknowledge the Institute of Computer Science (FORTH-ICS) for providing

financial support and all the necessary equipment during this work.

I would also like to thank all my colleagues at the Telecommunications and Network Lab.

My warmest thanks Christos Liaskos, Dimitrios Gkounis and George Nomikos for all the helpful

discussions, the encouragement, and nice atmosphere.

A dedicated thank you to my closest friends Elina Tsiaka, Fotis Karnis, Eutuxia Zouni, Nicole

Tsampanaki for being always there to listen and for always providing their advice and support.

Last, but definitely not least, I would like to thank my family.

iii

Contents

List of tables vii

List of figures ix

1 Introduction 1

2 Methodology 3

2.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.3 Cleaning Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.3.1 Removing Private and reserved IPv4 subnets . . . . . . . . . . . . . . . . . 4

2.3.2 Persistent /24 address blocks . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.3.3 Route Leaks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.4 Geolocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Validation 9

4 Visualizations 11

4.1 National level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.2 Subnational level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

5 Conclusions 17

v

List of Tables

2.1 Reserved and private IPv4 address blocks excluded by our methodology.. . . . . . . 5

2.2 Number of /24 subnets before and after our cleaning process. . . . . . . . . . . . . 6

4.1 The 92 biggest countries based on their population we used for national visualization 12

vii

List of Figures

2.1 One /23 subnet is expanded to two /24 subnets. . . . . . . . . . . . . . . . . . . . 4

2.2 Workflow of the C++ script. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.3 Routed /24 IPv4 address blocks over time. . . . . . . . . . . . . . . . . . . . . . . . 7

3.1 Correlation over time between official statistics (ITU and OECD) and the number

of routed prefixes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4.1 Visualization and comparison with economic and political indicators at national

level (snapshot of 2012). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4.2 Visualization of the growth of the Internet and comparison with the population at

subnational level (snapchot 2012). . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.3 Visualization of the growth of the Internet and comparison with the presidential

election results at subnational level (snapchot 2012). . . . . . . . . . . . . . . . . . 15

ix

Chapter 1

Introduction

Internet nowadays has become a part of our daily life. Everyday million of people use Internet to

access news, weather and watch videos, to plan and book holidays and to pursue their personal

interests. People use chat, messaging and email to stay in touch with friends worldwide, some-

times in the same way as some previously had pen pals. Moreover social networking websites

introduced new ways to socialize and interact.

Overall Internet usage has seen tremendous growth. About 45% of the world’s population

uses the Internet [1]. According to the International Telecommunication Union about 3.2 billion

people, or almost half of the world’s population, will be online by the end of 2015. Among

them, about 2 billion will be from developing countries, including 89 million from least developed

countries. In total, about 80% of households in developed countries and 34% in developing

countries will have Internet access [2, 3].

A question that naturally arises is why measuring Internet penetration is so important? The

Internet reveals the impact of technology on political and societal systems. But how has that

growth affected economic and societal changes? How has the usage of the Internet technologies

increased in different countries? The International Telecommunications Union (ITU) and the

Organization for Economic Cooperation (OECD) provide Internet penetration statistics, which

are collected from official national sources worldwide. However, ITU and the OECD do not

measure Internet penetration directly, but they would rather collect and standardize information

provided by different governments and their regulatory agencies. Statistics from these two orga-

nizations are widely used to inform policymakers and researchers about the expansion of digital

technologies.

Nevertheless, statistics from ITU and OECD have limitations:

• First of all, they use unknown estimation methodologies and there is lack of coor-

dination across countries. Each national agency have its own protocol for collecting

2 Estimating Global Internet Penetration from Network Measurements

these numbers at the national level. Thus, the final statistics that are ultimately included

in the main datasets may be subject to error due to poor data collection standards, or

even systematic inflation due to some countries incentives to exaggerate economic progress

because of aid conditionality.

• These statistics become available with significant delay (often 6-12 months or more).

• They offer only national level statistics which makes analysis of variation in Internet

penetration within countries impossible.

In this work we measure the Internet penetration in terms of global routed /24 address blocks

per geographic region. We introduce a reproducible methodology that uses publicly available

data and circumvents the limitations of transparency, comparability, availability, and resolution.

In particular, we use a consistent methodology across countries that relies on public data from the

Route Views project [4]. Also, we provide data at national as well as subnational level analysis

for the first time. Moreover, we compare our data with both ITU and OECD statistics and find

very high correlations. We also observe that the level of consistency drops for less democratic

and non developed countries.

Finally, we have created a website with two available visualizations of our data at national

and subnational level. At the national level we visualize the growth of the Internet, measured in

terms of globally routed IPv4 addresses, versus the Gross Domestic Product (GDP), the income

per capita, the population, and the polity score of 92 countries. At the subnational level we

visualized the growth of the Internet for each state in the United States. We compared our data

with demographic and economic indicators, such as the population and the presidential election

results. The proposed methodology has been accepted for publication in [5].

The reminder of this thesis is organized as follows: Chapter 2 describes the methodology that

we used, and Chapter 3 presents our experimental results and validates our method with ITU and

OECD data. Data visualizations that we created are explained in Chapter 4. Finally, Chapter 5

concludes this thesis and discuss future work.

Chapter 2

Methodology

In this chapter we describe the data used and the processing methodology. The data comes from

the Route Views project [4] and contain data from Border Gateway Protocol (BGP) collectors.

To process the data we transformed IPv4 address blocks to /24s which is the longest unfiltered

IPv4 prefix and we removed BGP misconfigurations. Then, we removed private and reserved

address blocks. Finally, we geolocated /24 address blocks at national as well as subnational level.

2.1 Data Collection

All data derived from the Route Views project [4]. Route Views project started at the University

of Oregon to allow Internet users to view global BGP routing information from the perspective of

other locations around the Internet. The main purpose of the project is to help Internet Service

Providers (ISPs) and Autonomous Systems (ASs) to determine how their network prefixes are

viewed by others in order to debug and improve user access to their network. Route Views servers

are peering directly with BGP routers, that are typically located at large Internet Exchange Points

(IXPs).

For our analysis, we used a daily routing table snapshot from the Route Views. In particular,

we used the second MRT format RIB file of each day from quagga bgpd route-views2.oregon-

ix.net server. Time period was the first 16 days of each February and August between 2004 to

2012.

2.2 Data Processing

We first extracted the data from the Route Views project and converted them to human-readable

XML format. For the conversion process we had to edit the source files of the IRL BGP Parser [6]

to achieve the following format for the output.


Figure 2.1: One /23 subnet is expanded to two /24 subnets.

IPv4 Address — /X subnet (e.g. 192.168.0.1 — /24),

where /X denotes the address block size.

Afterwards, we removed duplicate address block entries in each file keeping only the unique

ones. We also expanded address blocks in the range of /8 to /23, to /24 subnets. An example of

this expansion process is depicted in Fig. 2.1. We removed subnets bigger than /8 (e.g., /7

subnet) and smaller than /24 (e.g., /25 subnet). The number of possible subnets is given by the

following equation:

N = 2(24−X)

where N is the number of subnets that X has been expanded.

2.3 Cleaning Process

In our cleaning process we intended to clean data from BGP misconfigurations and also from

possible route leaks. The cleaning methodology consists of the following steps.

2.3.1 Removing Private and reserved IPv4 subnets

The IPv4 address space contains private and reserved address blocks for multicast and local

networks. Moreover, there are reserved blocks for network protocols and for broadcast purposes.

The first step of our methodology was to exclude the reserved and private IPv4 address blocks

that are shown in Table 2.1. The address block list was taken from Wikipedia. [7]

2.3.2 Persistent /24 address blocks

The key element of our methodology was to eliminate the BGP misconfigurations and route leaks

from the data. To achieve that, we had to keep only the persistent /24 address blocks that

Chapter 2. Methodology 5

First IPv4 address of the subnet Subnet Size Number of IPv4 addresses

0.0.0.0 /8 16.777.21610.0.0.0 /8 16.777.216

100.64.0.0 /10 4.194.304127.0.0.0 /8 16.777.216

169.254.0.0 /16 65.536172.16.0.06 /12 1.048.576192.0.0.0 /24 256192.0.2.0 /24 256

192.88.99.0 /24 256192.168.0.0 /16 65.536198.18.0.0 /15 131,072

198.51.100.0 /24 256203.0.113.0 /24 256224.0.0.0 /4 268.435.4560240.0.0.0 /4 268.435.455

255.255.255.255 /32 1

Table 2.1: Reserved and private IPv4 address blocks excluded by our methodology.

Figure 2.2: Workflow of the C++ script.

appeared in every one of the 16 files for each month. By keeping the persistent ones we removed

changes that occurred due to BGP updates in that time period. We implemented a C++ script

that takes as input N files (where N is the number of files to parse) and produce as output three

different files in CSV format, as it shown in Fig. 2.2

In particular every output file is described below:

• The first file termed “all.csv” includes all /24 address blocks in the following format: the

first IPv4 address of the /24 subnet and the number of files that the entry exists, e.g.,

1.1.1.0,16

• The second file termed “persistent.csv” includes only the /24 subnets that exist in all the

N files. (where N is the number of input files)

• The third file termed “statistics.csv” includes a type of statistic showing how many subnets


Date /24 Before Cleaning /24 After Cleaning % removed subnets of total

Feb 04 5.039.254 5.003.248 0.7145Aug 04 5.202.900 5.084.807 2.2697Feb 05 5.397.104 5.335.860 1.1375Aug 05 9.932.780 5.446.131 45.1701Feb 06 9.917.356 5.795.405 41.5630Aug 06 6.117.422 6.059.425 0.9480Feb 07 6.626.721 6.500.911 1.8985Aug 07 6.852.740 6.789.840 0.9178Feb 08 7.176.812 7.117.426 0.8274Aug 08 7.410.382 7.352.548 0.7804Feb 09 7.842.377 7.773.388 0.8796Aug 09 8.273.903 8.121.765 1.8387Feb 10 8.557.815 8.508.562 0.5755Aug 10 9.086.480 8.920.520 1.8264Feb 11 9.473.345 9.163.223 3.2736Aug 11 9.752.520 9.698.805 0.5507Feb 12 9.843.380 9.784.378 0.5994Aug 12 10.134.296 10.024.013 1.0882

Table 2.2: Number of /24 subnets before and after our cleaning process.

we spotted in one day, in two days, etc.

Table 2.2 shows the /24 subnets before and after our cleaning process and the percentage of

/24 subnets that were removed. in total, the average non-persistent percent was at 1.257% of

the persistent ones for a time period between February and August of (2004 - 2012), excluding

August 2005 and February 2006. These two cases exhibit a very large amount of removed subnets

as it evident in Table 2.2. These are results of route leaks and will be explained in more detail

in Section 2.3.3.

2.3.3 Route Leaks

First of all, the term route leak is used when an Internet Service Provider (ISP) or an Autonomous

System (AS) denotes address blocks using BGP updates, that does not own in large scale. Route

leaks often are outcomes of misconfiguration from an AS due to an error of an administrator.

However, sometimes route leaks are happening deliberately to collect traffic e.g. network tele-

scope [8,9]. In the time period of 2004 - 2012 we observed two large route leaks: One in August

2005 and another in February 2006. These route leaks can be observed in Fig. 2.3 which shows

the number of /24 subnets for each month of our data before and after the cleaning procedure

(Section 2.3.2).

More specifically, the route leak at August 2005 was caused from the AS2905 which is assigned

to MTN SA. The AS2905 belongs to afrinic Regional Internet Registry (RIR) and the country of

Chapter 2. Methodology 7

Figure 2.3: Routed /24 IPv4 address blocks over time.

registration is South Africa. MTN SA is a GSM cellular network operator delivering service in

South Africa and other African countries.

The February 2006 route leak was caused from the AS 11686 which belongs to the “Education

Networks of America” organization also known as ENA. ENA is an Infrastructure as a Service

(IaaS) provider to K-12 primary and secondary education in the United States and the libraries.

The AS published 64 /8 subnets for one day which is about the 1/3 of the total IPv4 address

space.

The address blocks that went public from these two Autonomous Systems were unrouted in

the past which make someone speculate that it was part of a network telescope experiment to

collect traffic.

2.4 Geolocation

Our geolocation process was based on MaxMind GeoIP City database [10]. MaxMind offers a

time series of databases to determine the geographical location of website visitors based on their

IP address. We geolocated the /24 address blocks at national and subnational level using their

python API. For every /24 address block we geolocated the first IP of the block and we assumed

that whole block is geolocated in the same area. Max Mind GeoIP City provides 99% accuracy

at national level and around 80% accuracy at subnational level.

Chapter 3

Validation

To evaluate our methodology we compared our Internet Penetration estimates with the ITU

and the OECD number of subscribers. In particular, we correlated the number of geolocated

/24 routed prefixes with the number of subscribers for both datasets. We measured the linear

correlation between our method and those datasets using Pearson’s correlation coefficient. The

results our shown in Fig. 3.1 where the high correlation of our methodology with both datasets

(ITU and OECD) is evident. It is noteworthy to observer that correlation with OECD was higher

than ITU. We speculate that the explanation for this lies in the nature of the datasets: OECD

includes less countries than ITU and the majority of them are democratic and developed. On the

other hand, ITU includes developing countries, not all of them being democratic. These countries

may misreport data to exaggerate economic progress because of aid conditionality.

The OECD indicators are composed from data provided by the administrative bodies of the

member states and from the EU Community Survey on household use of ICT. They are available

for 34 countries starting in 2006 [20]. However, despite of similarity in the data collection method,

it does not mean that the values correspond to those in the ITU dataset. The correlation

coefficient between the ITU and OECD is only 0.705.


Figure 3.1: Correlation over time between official statistics (ITU and OECD) and the number ofrouted prefixes.

Chapter 4

Visualizations

To make our data more accessible to the community, we created visualizations of the growth of the

Internet between 2004 and 2012, measured in terms of globally routed IPv4 address blocks. We

created two visualizations, one at national and one at subnational level. Then we compared our

data with economic, demographic and political indicators. The two visualizations are available

at http://www.ics.forth.gr/tnl/ipen.

4.1 National level

For the visualization at national level we selected the 92 biggest countries based on their pop-

ulation. The selected countries are shown in Table 4.1. We colored the countries based on the

continent they belong. We selected red color for Asia, green for America, brown for Africa, blue

for Australia, and orange for Europe. Finally, in case of Russian Federation we selected the

purple color because the country belongs to two different continents.

We compared our data for each country with demographic and political indicators from the

World Bank. These include the gross domestic product, the income per capita and the population.

Also we used a political indicator, known as polity score, which was taken from the Center

of Systemic Peace [11]. The polity score is an indicator that shows the level of democracy

in a country. It uses a 21-point scale taking values from -10 (hereditary monarchy) to +10

(consolidated democracy). The Polity score can also be converted into regime categories in a

suggested three part categorization of “autocracies” (-10 to -6), “anocracies” (-5 to +5), and

“democracies” (+6 to +10).

The visualization was created using the D3.js JavaScript library [12]. An example of the

visualization is depicted in Fig 4.1 where we have visualize the Internet growth in comparison

with gross domestic product and the population for each country. The gross domestic product is

shown in the Y-Axis of the figure while the population is represented by the circle radius.


EuropeGreece France Germany SwedenSpain United Kingdom Netherlands BelgiumPortugal Romania Ukraine Czech RepublicPoland Serbia Belarus ItalySwitzerland Russian Federation Hungary BulgariaAustria

AmericaUnited States of America Canada Mexico BrazilColombia Bolivia Chile CubaHonduras Haiti Peru VenezuelaArgentina Dominican Republic Equador Guatemala

AsiaChina Japan Thailand IndiaVietnam Israel Russian Federation KazakhstanIraq Turkey Pakistan BangladeshYemen Indonesia Sri Lanka NepalTajikistan Malaysia Philippines Saudi ArabiaAfganistan Azerbaijan Camdodia Uzbekistan

AfricaMorocco Egypt Nigeria UgandaBurundi Angola Burkina Faso BeninCameroon Algeria Ethiopia GhanaMali Malawi Mozambique NigerTanzania Chad Rwanda ZimbabweKenya Zambia Congo Cote D’IvoireGuinea Madagascar Sudan TunisiaTogo

AustraliaAustralia Papua New Guinea

Table 4.1: The 92 biggest countries based on their population we used for national visualization

Chapter 4. Visualizations 13

Figure 4.1: Visualization and comparison with economic and political indicators at national level(snapshot of 2012).

In the website we offer an animation of the visualization where the user can change indicators

for Y-Axis and the circle radius. Also, the user has the ability to go back and forward in time by

mouse overing over the year. Moreover, if the user mouse over a circle it will display the specific

country name. Finally, the user can click the Start Again button to start the animation from the

initial year (2004).

Some interesting observations from the visualization are the following ones:

• The higher the number of /24 address blocks the higher the gross domestic product of the

country.

• Countries with high income per capita exhibits the largest Internet growth.

• The countries that exhibits the largest Internet growth are the democratic ones. With

China being an exception having a polity score lower than -5 although also exhibiting large

Internet growth.

• Non democratic countries do not necessary exhibit low Internet growth.

4.2 Subnational level

The second visualization was at subnational level. MaxMind database offers the highest accuracy

in subnational level for the United States of America. That is the main reason we selected to


Figure 4.2: Visualization of the growth of the Internet and comparison with the population atsubnational level (snapchot 2012).

visualize the Internet penetration for each state in the USA. We used a USA map and divided

the map into states. We chose to compare our data with demographic and political indicators.

More specifically, we chose the population and the presidential election results. The time period

starts from 2006 due to the fact that MaxMind offers subnational level data from 2006 and after.

Two examples of the visualization are shown in Figs. 4.2 and 4.2 where we plot the growth

of the Internet in terms of global routed /24 subnets in comparison with the population for

each state (Fig. 4.2) and the presidential election results for red (Republican States) and blue

(Democratic States) states (Fig. 4.3).

Similar to the national level visualization, the user can interact with the visualization and

choose between indicators or use the mouse to change the year. Finally, the user can mouse over

a state and view the state name and the /24 address blocks number.

Some interesting observations from the visualization are the following ones:

• The number of /24 subnets depends on the population of each state.

• The democratic states have higher Internet growth.

Chapter 4. Visualizations 15

Figure 4.3: Visualization of the growth of the Internet and comparison with the presidentialelection results at subnational level (snapchot 2012).

Chapter 5

Conclusions

In this work we developed a consistent and reproducible methodology to measure the Internet

penetration in terms of global routed /24 address blocks per geographic region. Our method

uses publicly available data and provides analysis at national as well as subnational level. We

compared our data with ITU and OECD statistics and found high correlations. Thus we can

estimate Internet penetration as good as ITU and OECD. We have also created a website of two

different visualization one at national and one at subnational level. Our visualizations shows the

growth of the Internet in comparison with various economic and political indicators.

Our future work, is to extend the visualization’s time period to present. We aim to create

a system that collects, processes and visualizes data automatically using our methodology. We

intent to include more economic and political indicators in comparison with our data. Finally,

we plan to extend our subnational level analysis to more countries.

Bibliography

[1] “The internet big picture world internet users and 2015 population stats,”

http://www.internetworldstats.com/stats.htm.

[2] International Telecommunication Union, “ICT facts and figures - the world in 2015,”

http://www.itu.int/en/ITU-D/Statistics/Documents/facts/ICTFactsFigures2015.pdf.

[3] BBC, “Internet used by 3.2 billion people in 2015,”

http://www.bbc.com/news/technology-32884867.

[4] University of Oregon, “Route views project,” www.routeviews.org.

[5] S. Baleato, N. Weidmann, P. Gigis, X. Dimitropoulos, E. Glatz, and B. Trammell,

“Transparent estimation of internet penetration from network observations,” in Springer

Passive and Active Measurement Conference (PAM), 2015.

[6] UCLA Internet Research Lab, “IRL BGP parser v0.3b2,”

http://irl.cs.ucla.edu/software/bgpparser.html.

[7] Wikipedia, “Reserved ip addresses,” https://en.wikipedia.org/wiki/Reserved IP addresses.

[8] “Ucsd network telescope,” .

[9] Alberto Dainotti, Karyn Benson, Alistair King, Eduard Glatz, Xenofontas Dimitropoulos,

Philipp Richter, Alessandro Finamore, Alex C Snoeren, et al., “Lost in space: Improving

inference of ipv4 address space utilization,” 2014.

[10] “Maxmind geoip2 city accuracy,” https://www.maxmind.com/en/geoip2-databases.

[11] “Polityproject - center of systemic peace,”

http://www.systemicpeace.org/polityproject.html.

[12] “Data driven documents,” http://www.d3js.org.