estimating global internet penetration from...
TRANSCRIPT
Estimating Global Internet Penetration fromNetwork Measurements
Petros Gigis
University of Crete
School of Sciences and Engineering
Computer Science Department
Voutes University Campus Heraklion, GR-71003, Greece
Heraklion, September 2015
Acknowledgements
First of all I would like to thank my supervisor, Professor Xenofontas Dimitropoulos for
showing belief in me and giving me the opportunity to work with him, for all our constructive
meetings, and for his great advice and support.
I am also really grateful to Tasos Alexandridis for his continuous support and his valuable
help to this work.
Also I would like to thank Fotis Nikolaidis and Manos Lakiwtakis for their continuous support
and their useful advices.
I would like to acknowledge the Institute of Computer Science (FORTH-ICS) for providing
financial support and all the necessary equipment during this work.
I would also like to thank all my colleagues at the Telecommunications and Network Lab.
My warmest thanks Christos Liaskos, Dimitrios Gkounis and George Nomikos for all the helpful
discussions, the encouragement, and nice atmosphere.
A dedicated thank you to my closest friends Elina Tsiaka, Fotis Karnis, Eutuxia Zouni, Nicole
Tsampanaki for being always there to listen and for always providing their advice and support.
Last, but definitely not least, I would like to thank my family.
iii
iv
Contents
List of tables vii
List of figures ix
1 Introduction 1
2 Methodology 3
2.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Cleaning Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.1 Removing Private and reserved IPv4 subnets . . . . . . . . . . . . . . . . . 4
2.3.2 Persistent /24 address blocks . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.3 Route Leaks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Geolocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Validation 9
4 Visualizations 11
4.1 National level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.2 Subnational level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5 Conclusions 17
v
List of Tables
2.1 Reserved and private IPv4 address blocks excluded by our methodology.. . . . . . . 5
2.2 Number of /24 subnets before and after our cleaning process. . . . . . . . . . . . . 6
4.1 The 92 biggest countries based on their population we used for national visualization 12
vii
List of Figures
2.1 One /23 subnet is expanded to two /24 subnets. . . . . . . . . . . . . . . . . . . . 4
2.2 Workflow of the C++ script. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Routed /24 IPv4 address blocks over time. . . . . . . . . . . . . . . . . . . . . . . . 7
3.1 Correlation over time between official statistics (ITU and OECD) and the number
of routed prefixes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.1 Visualization and comparison with economic and political indicators at national
level (snapshot of 2012). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 Visualization of the growth of the Internet and comparison with the population at
subnational level (snapchot 2012). . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Visualization of the growth of the Internet and comparison with the presidential
election results at subnational level (snapchot 2012). . . . . . . . . . . . . . . . . . 15
ix
Chapter 1
Introduction
Internet nowadays has become a part of our daily life. Everyday million of people use Internet to
access news, weather and watch videos, to plan and book holidays and to pursue their personal
interests. People use chat, messaging and email to stay in touch with friends worldwide, some-
times in the same way as some previously had pen pals. Moreover social networking websites
introduced new ways to socialize and interact.
Overall Internet usage has seen tremendous growth. About 45% of the world’s population
uses the Internet [1]. According to the International Telecommunication Union about 3.2 billion
people, or almost half of the world’s population, will be online by the end of 2015. Among
them, about 2 billion will be from developing countries, including 89 million from least developed
countries. In total, about 80% of households in developed countries and 34% in developing
countries will have Internet access [2, 3].
A question that naturally arises is why measuring Internet penetration is so important? The
Internet reveals the impact of technology on political and societal systems. But how has that
growth affected economic and societal changes? How has the usage of the Internet technologies
increased in different countries? The International Telecommunications Union (ITU) and the
Organization for Economic Cooperation (OECD) provide Internet penetration statistics, which
are collected from official national sources worldwide. However, ITU and the OECD do not
measure Internet penetration directly, but they would rather collect and standardize information
provided by different governments and their regulatory agencies. Statistics from these two orga-
nizations are widely used to inform policymakers and researchers about the expansion of digital
technologies.
Nevertheless, statistics from ITU and OECD have limitations:
• First of all, they use unknown estimation methodologies and there is lack of coor-
dination across countries. Each national agency have its own protocol for collecting
2 Estimating Global Internet Penetration from Network Measurements
these numbers at the national level. Thus, the final statistics that are ultimately included
in the main datasets may be subject to error due to poor data collection standards, or
even systematic inflation due to some countries incentives to exaggerate economic progress
because of aid conditionality.
• These statistics become available with significant delay (often 6-12 months or more).
• They offer only national level statistics which makes analysis of variation in Internet
penetration within countries impossible.
In this work we measure the Internet penetration in terms of global routed /24 address blocks
per geographic region. We introduce a reproducible methodology that uses publicly available
data and circumvents the limitations of transparency, comparability, availability, and resolution.
In particular, we use a consistent methodology across countries that relies on public data from the
Route Views project [4]. Also, we provide data at national as well as subnational level analysis
for the first time. Moreover, we compare our data with both ITU and OECD statistics and find
very high correlations. We also observe that the level of consistency drops for less democratic
and non developed countries.
Finally, we have created a website with two available visualizations of our data at national
and subnational level. At the national level we visualize the growth of the Internet, measured in
terms of globally routed IPv4 addresses, versus the Gross Domestic Product (GDP), the income
per capita, the population, and the polity score of 92 countries. At the subnational level we
visualized the growth of the Internet for each state in the United States. We compared our data
with demographic and economic indicators, such as the population and the presidential election
results. The proposed methodology has been accepted for publication in [5].
The reminder of this thesis is organized as follows: Chapter 2 describes the methodology that
we used, and Chapter 3 presents our experimental results and validates our method with ITU and
OECD data. Data visualizations that we created are explained in Chapter 4. Finally, Chapter 5
concludes this thesis and discuss future work.
Chapter 2
Methodology
In this chapter we describe the data used and the processing methodology. The data comes from
the Route Views project [4] and contain data from Border Gateway Protocol (BGP) collectors.
To process the data we transformed IPv4 address blocks to /24s which is the longest unfiltered
IPv4 prefix and we removed BGP misconfigurations. Then, we removed private and reserved
address blocks. Finally, we geolocated /24 address blocks at national as well as subnational level.
2.1 Data Collection
All data derived from the Route Views project [4]. Route Views project started at the University
of Oregon to allow Internet users to view global BGP routing information from the perspective of
other locations around the Internet. The main purpose of the project is to help Internet Service
Providers (ISPs) and Autonomous Systems (ASs) to determine how their network prefixes are
viewed by others in order to debug and improve user access to their network. Route Views servers
are peering directly with BGP routers, that are typically located at large Internet Exchange Points
(IXPs).
For our analysis, we used a daily routing table snapshot from the Route Views. In particular,
we used the second MRT format RIB file of each day from quagga bgpd route-views2.oregon-
ix.net server. Time period was the first 16 days of each February and August between 2004 to
2012.
2.2 Data Processing
We first extracted the data from the Route Views project and converted them to human-readable
XML format. For the conversion process we had to edit the source files of the IRL BGP Parser [6]
to achieve the following format for the output.
4 Estimating Global Internet Penetration from Network Measurements
Figure 2.1: One /23 subnet is expanded to two /24 subnets.
IPv4 Address — /X subnet (e.g. 192.168.0.1 — /24),
where /X denotes the address block size.
Afterwards, we removed duplicate address block entries in each file keeping only the unique
ones. We also expanded address blocks in the range of /8 to /23, to /24 subnets. An example of
this expansion process is depicted in Fig. 2.1. We removed subnets bigger than /8 (e.g., /7
subnet) and smaller than /24 (e.g., /25 subnet). The number of possible subnets is given by the
following equation:
N = 2(24−X)
where N is the number of subnets that X has been expanded.
2.3 Cleaning Process
In our cleaning process we intended to clean data from BGP misconfigurations and also from
possible route leaks. The cleaning methodology consists of the following steps.
2.3.1 Removing Private and reserved IPv4 subnets
The IPv4 address space contains private and reserved address blocks for multicast and local
networks. Moreover, there are reserved blocks for network protocols and for broadcast purposes.
The first step of our methodology was to exclude the reserved and private IPv4 address blocks
that are shown in Table 2.1. The address block list was taken from Wikipedia. [7]
2.3.2 Persistent /24 address blocks
The key element of our methodology was to eliminate the BGP misconfigurations and route leaks
from the data. To achieve that, we had to keep only the persistent /24 address blocks that
Chapter 2. Methodology 5
First IPv4 address of the subnet Subnet Size Number of IPv4 addresses
0.0.0.0 /8 16.777.21610.0.0.0 /8 16.777.216
100.64.0.0 /10 4.194.304127.0.0.0 /8 16.777.216
169.254.0.0 /16 65.536172.16.0.06 /12 1.048.576192.0.0.0 /24 256192.0.2.0 /24 256
192.88.99.0 /24 256192.168.0.0 /16 65.536198.18.0.0 /15 131,072
198.51.100.0 /24 256203.0.113.0 /24 256224.0.0.0 /4 268.435.4560240.0.0.0 /4 268.435.455
255.255.255.255 /32 1
Table 2.1: Reserved and private IPv4 address blocks excluded by our methodology.
Figure 2.2: Workflow of the C++ script.
appeared in every one of the 16 files for each month. By keeping the persistent ones we removed
changes that occurred due to BGP updates in that time period. We implemented a C++ script
that takes as input N files (where N is the number of files to parse) and produce as output three
different files in CSV format, as it shown in Fig. 2.2
In particular every output file is described below:
• The first file termed “all.csv” includes all /24 address blocks in the following format: the
first IPv4 address of the /24 subnet and the number of files that the entry exists, e.g.,
1.1.1.0,16
• The second file termed “persistent.csv” includes only the /24 subnets that exist in all the
N files. (where N is the number of input files)
• The third file termed “statistics.csv” includes a type of statistic showing how many subnets
6 Estimating Global Internet Penetration from Network Measurements
Date /24 Before Cleaning /24 After Cleaning % removed subnets of total
Feb 04 5.039.254 5.003.248 0.7145Aug 04 5.202.900 5.084.807 2.2697Feb 05 5.397.104 5.335.860 1.1375Aug 05 9.932.780 5.446.131 45.1701Feb 06 9.917.356 5.795.405 41.5630Aug 06 6.117.422 6.059.425 0.9480Feb 07 6.626.721 6.500.911 1.8985Aug 07 6.852.740 6.789.840 0.9178Feb 08 7.176.812 7.117.426 0.8274Aug 08 7.410.382 7.352.548 0.7804Feb 09 7.842.377 7.773.388 0.8796Aug 09 8.273.903 8.121.765 1.8387Feb 10 8.557.815 8.508.562 0.5755Aug 10 9.086.480 8.920.520 1.8264Feb 11 9.473.345 9.163.223 3.2736Aug 11 9.752.520 9.698.805 0.5507Feb 12 9.843.380 9.784.378 0.5994Aug 12 10.134.296 10.024.013 1.0882
Table 2.2: Number of /24 subnets before and after our cleaning process.
we spotted in one day, in two days, etc.
Table 2.2 shows the /24 subnets before and after our cleaning process and the percentage of
/24 subnets that were removed. in total, the average non-persistent percent was at 1.257% of
the persistent ones for a time period between February and August of (2004 - 2012), excluding
August 2005 and February 2006. These two cases exhibit a very large amount of removed subnets
as it evident in Table 2.2. These are results of route leaks and will be explained in more detail
in Section 2.3.3.
2.3.3 Route Leaks
First of all, the term route leak is used when an Internet Service Provider (ISP) or an Autonomous
System (AS) denotes address blocks using BGP updates, that does not own in large scale. Route
leaks often are outcomes of misconfiguration from an AS due to an error of an administrator.
However, sometimes route leaks are happening deliberately to collect traffic e.g. network tele-
scope [8,9]. In the time period of 2004 - 2012 we observed two large route leaks: One in August
2005 and another in February 2006. These route leaks can be observed in Fig. 2.3 which shows
the number of /24 subnets for each month of our data before and after the cleaning procedure
(Section 2.3.2).
More specifically, the route leak at August 2005 was caused from the AS2905 which is assigned
to MTN SA. The AS2905 belongs to afrinic Regional Internet Registry (RIR) and the country of
Chapter 2. Methodology 7
Figure 2.3: Routed /24 IPv4 address blocks over time.
registration is South Africa. MTN SA is a GSM cellular network operator delivering service in
South Africa and other African countries.
The February 2006 route leak was caused from the AS 11686 which belongs to the “Education
Networks of America” organization also known as ENA. ENA is an Infrastructure as a Service
(IaaS) provider to K-12 primary and secondary education in the United States and the libraries.
The AS published 64 /8 subnets for one day which is about the 1/3 of the total IPv4 address
space.
The address blocks that went public from these two Autonomous Systems were unrouted in
the past which make someone speculate that it was part of a network telescope experiment to
collect traffic.
2.4 Geolocation
Our geolocation process was based on MaxMind GeoIP City database [10]. MaxMind offers a
time series of databases to determine the geographical location of website visitors based on their
IP address. We geolocated the /24 address blocks at national and subnational level using their
python API. For every /24 address block we geolocated the first IP of the block and we assumed
that whole block is geolocated in the same area. Max Mind GeoIP City provides 99% accuracy
at national level and around 80% accuracy at subnational level.
8 Estimating Global Internet Penetration from Network Measurements
Chapter 3
Validation
To evaluate our methodology we compared our Internet Penetration estimates with the ITU
and the OECD number of subscribers. In particular, we correlated the number of geolocated
/24 routed prefixes with the number of subscribers for both datasets. We measured the linear
correlation between our method and those datasets using Pearson’s correlation coefficient. The
results our shown in Fig. 3.1 where the high correlation of our methodology with both datasets
(ITU and OECD) is evident. It is noteworthy to observer that correlation with OECD was higher
than ITU. We speculate that the explanation for this lies in the nature of the datasets: OECD
includes less countries than ITU and the majority of them are democratic and developed. On the
other hand, ITU includes developing countries, not all of them being democratic. These countries
may misreport data to exaggerate economic progress because of aid conditionality.
The OECD indicators are composed from data provided by the administrative bodies of the
member states and from the EU Community Survey on household use of ICT. They are available
for 34 countries starting in 2006 [20]. However, despite of similarity in the data collection method,
it does not mean that the values correspond to those in the ITU dataset. The correlation
coefficient between the ITU and OECD is only 0.705.
10 Estimating Global Internet Penetration from Network Measurements
Figure 3.1: Correlation over time between official statistics (ITU and OECD) and the number ofrouted prefixes.
Chapter 4
Visualizations
To make our data more accessible to the community, we created visualizations of the growth of the
Internet between 2004 and 2012, measured in terms of globally routed IPv4 address blocks. We
created two visualizations, one at national and one at subnational level. Then we compared our
data with economic, demographic and political indicators. The two visualizations are available
at http://www.ics.forth.gr/tnl/ipen.
4.1 National level
For the visualization at national level we selected the 92 biggest countries based on their pop-
ulation. The selected countries are shown in Table 4.1. We colored the countries based on the
continent they belong. We selected red color for Asia, green for America, brown for Africa, blue
for Australia, and orange for Europe. Finally, in case of Russian Federation we selected the
purple color because the country belongs to two different continents.
We compared our data for each country with demographic and political indicators from the
World Bank. These include the gross domestic product, the income per capita and the population.
Also we used a political indicator, known as polity score, which was taken from the Center
of Systemic Peace [11]. The polity score is an indicator that shows the level of democracy
in a country. It uses a 21-point scale taking values from -10 (hereditary monarchy) to +10
(consolidated democracy). The Polity score can also be converted into regime categories in a
suggested three part categorization of “autocracies” (-10 to -6), “anocracies” (-5 to +5), and
“democracies” (+6 to +10).
The visualization was created using the D3.js JavaScript library [12]. An example of the
visualization is depicted in Fig 4.1 where we have visualize the Internet growth in comparison
with gross domestic product and the population for each country. The gross domestic product is
shown in the Y-Axis of the figure while the population is represented by the circle radius.
12 Estimating Global Internet Penetration from Network Measurements
EuropeGreece France Germany SwedenSpain United Kingdom Netherlands BelgiumPortugal Romania Ukraine Czech RepublicPoland Serbia Belarus ItalySwitzerland Russian Federation Hungary BulgariaAustria
AmericaUnited States of America Canada Mexico BrazilColombia Bolivia Chile CubaHonduras Haiti Peru VenezuelaArgentina Dominican Republic Equador Guatemala
AsiaChina Japan Thailand IndiaVietnam Israel Russian Federation KazakhstanIraq Turkey Pakistan BangladeshYemen Indonesia Sri Lanka NepalTajikistan Malaysia Philippines Saudi ArabiaAfganistan Azerbaijan Camdodia Uzbekistan
AfricaMorocco Egypt Nigeria UgandaBurundi Angola Burkina Faso BeninCameroon Algeria Ethiopia GhanaMali Malawi Mozambique NigerTanzania Chad Rwanda ZimbabweKenya Zambia Congo Cote D’IvoireGuinea Madagascar Sudan TunisiaTogo
AustraliaAustralia Papua New Guinea
Table 4.1: The 92 biggest countries based on their population we used for national visualization
Chapter 4. Visualizations 13
Figure 4.1: Visualization and comparison with economic and political indicators at national level(snapshot of 2012).
In the website we offer an animation of the visualization where the user can change indicators
for Y-Axis and the circle radius. Also, the user has the ability to go back and forward in time by
mouse overing over the year. Moreover, if the user mouse over a circle it will display the specific
country name. Finally, the user can click the Start Again button to start the animation from the
initial year (2004).
Some interesting observations from the visualization are the following ones:
• The higher the number of /24 address blocks the higher the gross domestic product of the
country.
• Countries with high income per capita exhibits the largest Internet growth.
• The countries that exhibits the largest Internet growth are the democratic ones. With
China being an exception having a polity score lower than -5 although also exhibiting large
Internet growth.
• Non democratic countries do not necessary exhibit low Internet growth.
4.2 Subnational level
The second visualization was at subnational level. MaxMind database offers the highest accuracy
in subnational level for the United States of America. That is the main reason we selected to
14 Estimating Global Internet Penetration from Network Measurements
Figure 4.2: Visualization of the growth of the Internet and comparison with the population atsubnational level (snapchot 2012).
visualize the Internet penetration for each state in the USA. We used a USA map and divided
the map into states. We chose to compare our data with demographic and political indicators.
More specifically, we chose the population and the presidential election results. The time period
starts from 2006 due to the fact that MaxMind offers subnational level data from 2006 and after.
Two examples of the visualization are shown in Figs. 4.2 and 4.2 where we plot the growth
of the Internet in terms of global routed /24 subnets in comparison with the population for
each state (Fig. 4.2) and the presidential election results for red (Republican States) and blue
(Democratic States) states (Fig. 4.3).
Similar to the national level visualization, the user can interact with the visualization and
choose between indicators or use the mouse to change the year. Finally, the user can mouse over
a state and view the state name and the /24 address blocks number.
Some interesting observations from the visualization are the following ones:
• The number of /24 subnets depends on the population of each state.
• The democratic states have higher Internet growth.
Chapter 4. Visualizations 15
Figure 4.3: Visualization of the growth of the Internet and comparison with the presidentialelection results at subnational level (snapchot 2012).
16 Estimating Global Internet Penetration from Network Measurements
Chapter 5
Conclusions
In this work we developed a consistent and reproducible methodology to measure the Internet
penetration in terms of global routed /24 address blocks per geographic region. Our method
uses publicly available data and provides analysis at national as well as subnational level. We
compared our data with ITU and OECD statistics and found high correlations. Thus we can
estimate Internet penetration as good as ITU and OECD. We have also created a website of two
different visualization one at national and one at subnational level. Our visualizations shows the
growth of the Internet in comparison with various economic and political indicators.
Our future work, is to extend the visualization’s time period to present. We aim to create
a system that collects, processes and visualizes data automatically using our methodology. We
intent to include more economic and political indicators in comparison with our data. Finally,
we plan to extend our subnational level analysis to more countries.
18 Estimating Global Internet Penetration from Network Measurements
20 Estimating Global Internet Penetration from Network Measurements
Bibliography
[1] “The internet big picture world internet users and 2015 population stats,”
http://www.internetworldstats.com/stats.htm.
[2] International Telecommunication Union, “ICT facts and figures - the world in 2015,”
http://www.itu.int/en/ITU-D/Statistics/Documents/facts/ICTFactsFigures2015.pdf.
[3] BBC, “Internet used by 3.2 billion people in 2015,”
http://www.bbc.com/news/technology-32884867.
[4] University of Oregon, “Route views project,” www.routeviews.org.
[5] S. Baleato, N. Weidmann, P. Gigis, X. Dimitropoulos, E. Glatz, and B. Trammell,
“Transparent estimation of internet penetration from network observations,” in Springer
Passive and Active Measurement Conference (PAM), 2015.
[6] UCLA Internet Research Lab, “IRL BGP parser v0.3b2,”
http://irl.cs.ucla.edu/software/bgpparser.html.
[7] Wikipedia, “Reserved ip addresses,” https://en.wikipedia.org/wiki/Reserved IP addresses.
[8] “Ucsd network telescope,” .
[9] Alberto Dainotti, Karyn Benson, Alistair King, Eduard Glatz, Xenofontas Dimitropoulos,
Philipp Richter, Alessandro Finamore, Alex C Snoeren, et al., “Lost in space: Improving
inference of ipv4 address space utilization,” 2014.
[10] “Maxmind geoip2 city accuracy,” https://www.maxmind.com/en/geoip2-databases.
[11] “Polityproject - center of systemic peace,”
http://www.systemicpeace.org/polityproject.html.
[12] “Data driven documents,” http://www.d3js.org.