open data: a design for the provisioning of dutch government public and geo-spatial transport data
8/7/2019 Open Data: a design for the provisioning of Dutch government public and geo-spatial transport data.
University of Groningen
Industrial Engineering and Management
Bachelor Thesis
Supervisors: prof. dr. H.G. Sol (University of Groningen),
ir. drs. T.A. van den Broek (TNO)
Open Data: a design for the provisioning of
Dutch government public and geo-spatial
transport data.
J.P.S. van Grieken
Groningen, February 28, 2011
Abstract
Governments increasingly publish structured, machine-readable and free public sector
information for commercial and public re-use. They are moving from a closed model, in which
businesses pay a price that maximizes government profit or covers long-term cost, towards a
free model in which data is available at no cost. This form of public sector information
provisioning is also referred to as open data. In this paper a design and business model for
Dutch public and geo-spatial data is presented. Furthermore, the implications of a
governmental open data policy for the business cases of various stakeholders that work with
public and geo-spatial transport data are examined.
To establish a design for open data a literature review and interviews with specialists were
conducted. We found that the proliferation of the internet as a participatory and economic platform, the development of freedom of information and transparency policies and
the perceived economic benefits of free public sector information, have contributed to the
development of open data. We found that if government data were to be made available at
zero or marginal cost this could lead to significant increases in economic activity. Businesses
could use the different data sets to create services and therefore add value to the data. This
economic activity in its turn would lead to more revenue for the businesses and increase
overall welfare. The government would benefit from this activity through taxation of the
services.
A business model of open data in the public and geo-spatial transport sector was designed.
In this model barriers in legislation were removed, and accurate pricing strategies and a
technical implementation for open data were recommended. We found that this model causes
changes in the business cases of data-providing organizations and businesses. In particular,
the cost structure of these stakeholders has to change. Finally, a design for a data
warehouse for road and public transport data is presented. The design covers a warehouse
architecture, data model, interface design, hardware recommendations and qualitative as-
pects. In the final section of the paper we discuss some of the findings in relation to economic
activity, loss of intellectual property, licensing of open data and changes in government cost-
structure.
Keywords: public sector information, open data, design, business case, data-warehouse,
public transport, geo-data, economics, transparency, governments
Open Data: a design for the provisioning of Dutch government public and geo-spatial
transport data, by J.P.S. van Grieken, is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Contents
1 Introduction
1.1 Context
1.1.1 The Networked Society
1.1.2 Drivers of transparency
1.2 Open Data
1.3 Problem Definition
1.4 Objective
2 Theory
2.1 The economics of open data
2.2 Dutch government information architecture
2.3 Stakeholders
2.4 The business model
2.5 Research Question
3 Methods
3.1 Literature Review
3.2 Open Interviews
3.3 Stakeholder Identification
3.4 Structured interviews
3.5 Business case analysis
3.6 Requirements analysis
3.7 Data Warehouse design
4 Business Model Design
4.1 Effects of the model
4.2 Effects on the stakeholder business cases
5 Technology Design
5.1 Landscape
5.2 Warehouse Architecture
5.3 Data Model
5.4 Interface
5.5 Hardware
5.6 Qualitative Aspects
6 Discussion
6.1 Effects on businesses
6.2 Changes in government cost structures
6.3 Loss of intellectual property and market disturbance
6.4 Legal: insuring coverage, quality, privacy and neutrality of data
6.5 Data vs. Services
6.6 Risks of the design
7 Appendix
.1 Requirements Document
.2 Interview Protocol
.3 Final Presentation
.4 List of Interviews
.5 Acknowledgement
Chapter 1
Introduction
Political participation, civil society, and transparency are among the indispensable
elements that are the imperatives of democratization. As quoted from a speech at Harvard
University, Kennedy School of Government, by Recep Tayyip Erdogan, January 30th 2003
Long before the rise of computer technology, governments started to collect vast amounts of
structured data. As early as 1811 the cadastre started measuring and recording the ownership
of land1. In 1899 the Central Bureau for Statistics (CBS) kept detailed records and
statistics on the Dutch population in order to allow decision makers to construct effective
economic policies. Most of this data is used by different governmental organizations to
serve the public in their daily operations. For example, the cadastre uses the detailed maps
it has gathered to determine the boundaries of land when it is sold. Nowadays, this
structured data is stored in large data warehouses owned and maintained by different
branches of government. Estimates suggest that between 100 and 150 Dutch governmental
organizations possess data that could be relevant to the public or to businesses [1].
If this government data were to be made available at zero or marginal cost this could
lead to significant increases in economic activity[23]. Businesses could use the different data
sets to create services and therefore add value to the data. This economic activity in its turn
would lead to more revenue for the businesses and increase overall welfare. The government
would benefit from this activity through taxation of the services. For example, within
months of one such data release, innovative applications in public transport, crime,
parking, schools, tourism and dining were created2.
There are three main reasons that this business potential remains untapped in the Nether-
lands. First of all, governments often choose a pricing strategy that either maximizes profit
or returns the long-term average cost. This causes a barrier for businesses to re-use the data
because the cost to gather the information themselves is similar to buying it directly from
the government. Secondly, law and policy restrictions apply to most of the datasets the
government owns. For example, copyright and database law restrictions limit businesses in
the services that could be built on this data. Finally, most government bodies lack
the technical infrastructure to deliver high quality data to businesses at high speed.
1.1 Context
Before we begin the analysis of the economics and technical infrastructure needed for our
design, we first want to explain the developments in legislation and society that have led to
open data.
1.1.1 The Networked Society
The first important development that has made open data possible is the rise of the internet
within our society. The internet has created a market for information services and goods. It
has created possibilities for collaboration and trade of information goods and services and
is developing as a major distribution platform for these services.
Everywhere around the globe broadband access has been pushed into markets to connect people
to the internet. For several years now almost everybody in the Netherlands has had access to
the internet via a computer or mobile device: access rose from 77% in 2004 to 93% in
2009 [3]. These new forms of communication have enabled citizens to communicate in new ways
amongst themselves and with public institutions. Networks of people continue to form the
structures and organization of society, a phenomenon commonly referred to as the rise of the
network society [4]. These ways of interaction create new forms of collaboration among
citizens in terms of speed, scale, anonymity, interactivity and community building. The
internet provides a market for people to collaborate and is described by Antonijevic and
Gurak as follows:
[The internet] has brought easy to use content-creating applications such as
blogs, wikis, social networking sites, and file sharing platforms rooted in broad-
band access, affordable hardware and software solutions, and with the Internet
perceived and used as a new normal in contemporary way of life. [5].
The development of the internet as a network of collaborating individuals is recognized as a
new way of creating economic value. The OECD sees the web as one of the drivers of
creativity and economic development in the coming century [6]. In the field of software
construction this has led to collaborative software creation between programmers and other
specialists from all over the globe, which is referred to as open source software. Open
source software challenges the rules of economics, software development and IT management.
On development networks like sourceforge.net, large numbers of programmers work together on
software projects without any financial compensation [7].
These programmers engage in civil society and organize bar camps3 and online platforms
where they meet and try to construct software that helps governments and citizens in their
daily lives. A good example of a developed network is the Sunlight Labs in the United States
which counts around 2700 volunteering programmers 4 that work on various projects. In
Europe a large community of programmers can be found in the United Kingdom, Denmark
and Spain.
A study in the United Kingdom looked at the motivation of these communities of programmers
in relation to open data. Citizens showed a desire to engage with government in open data
initiatives. The survey indicated that 36% wanted to be actively involved and use the data,
versus 33% who were simply happy to get the data. Similar effects have been found in the
relation between citizens and the government in the Netherlands [8]. A study by TNO suggests
that the rise of the social web (web 2.0) causes citizens to create new platforms that they
use to organize, collaborate, share, trade and create [10]. These platforms are open in
nature, require visitors to collaborate and try to use the distributed knowledge of all the
participants. We have now described the developments that give open data its societal
context. The networked society has led to a collaboration platform and a potential market
for open data.
1.1.2 Drivers of transparency
In most countries that have adopted open data policies the development originated from
transparency and freedom of information laws. The term transparency has many different
definitions depending on specific use and context. In the field of politics and government
transparency is usually referred to as social transparency [10]. This form of transparency
is defined as follows: "Social Transparency allows citizens to be more informed and
encourages the disclosure as a regulation mechanism of centers of authority. It is based on
ethics and governance, where the interests and needs are focused in the citizens" [11].
Governments use Freedom of Information (FOI) laws to define the formal rights to, and the
degree of, transparency within a nation. The first freedom of information laws came into
effect after the Second World War, but in most countries these types of laws are still in
development. A study on freedom of information laws found that by 1985 only 11 countries had
adopted freedom of information laws, whereas by 2004 almost 59 countries had passed some
form of transparency law through parliament [12]. Transparency and the right to obtain
government information are seen as essential to corruption prevention, democratic
participation, trust in government, accountability, informed decision making, and
provisioning of information to the public [13]. As a tool, the internet allows for easy
publishing and rapid sharing of public sector information in relation to freedom of
information rights. The internet has made public sector organizations more transparent and
able to respond to citizen needs more rapidly [15].
The United States has a rich history of freedom of information and transparency
policies [16]. In 1997 it experimented with one of the first government transparency
websites, called Fedstats.com. This website publishes statistics from all federal government
agencies. Furthermore, in the last 20 years various transparency laws have been approved by
the Senate. In 2006 the Federal Funding Accountability and Transparency Act was adopted,
providing a high degree of budget transparency. A year later the Honest Leadership and Open
Government Act followed and provided accountability and openness to citizens. The final
chapter in freedom of information laws in the United States
was the Memorandum on Transparency and Open Government5. In this memorandum the Obama
administration calls on all federal agencies to achieve an unprecedented level of openness.
The memorandum declares that all departments should be transparent, participatory and
collaborative. With this memorandum the administration promotes accountability, public
engagement, public participation and crowdsourcing using internet technology. The most
important development is that the United States government considers all data it gathers to
be a national public asset that should therefore be available to all citizens in a
structured format.
In Europe similar policies have been adopted in the United Kingdom, Norway, Spain, Denmark,
Estonia and Greece6. Although most of the initiatives are still in a development phase, some
similarities can be pointed out. The Danish government launched an open government strategy
that included a public sector information initiative called Offentlige Data I Spil, aimed at
providing a portal website that offers structured data to citizens. Similar data portals
have been constructed in the United Kingdom7, the Catalan region of Spain (Aporta)8 and
Norway9. In terms of policy some developments at the level of the European Commission can be
pointed out. The first important piece of legislation on the use of public sector
information is Directive 2003/98/EC on the re-use of public sector information10. This
directive describes the development of a European data products market based on public
sector information. Its main goal is to make available, where possible through electronic
means, documents that are re-usable for commercial and non-commercial purposes. The member
states are allowed to charge for the cost of collection, production, reproduction and
dissemination together with a reasonable return on investment. Some European studies have
been carried out on the effects of public sector information. The report Commercial
exploitation of Europe's public sector information, issued by the European Commission,
estimates the total value of public sector information in Europe at between EUR 28 billion
and EUR 134 billion per annum, with a central estimate of EUR 68 billion [17]. The last
relevant European development was the eUnion program that ran under the Swedish presidency
of the European Union. In the Visby declaration11 the European member states declare that EU
member states and community institutions should seek to make data freely accessible in open
machine-readable formats, for the benefit of entrepreneurship, research and transparency.
This declaration has not yet been put into legislation.
Although the Netherlands scores high on the digital e-readiness ranking [18], there is no
clear open government program as can be found in other European member states. An open
government study found that the Dutch government lacks leadership, central coordination and
focus, has trouble distinguishing between open data and participation, and is wary of the
business case of open government [?]. The Dutch government has been experimenting with
participation subsidies and has supported some pilots in the field of open data. In terms of
legislation no far-reaching freedom of information laws have been adopted by the
government. Copyright, Freedom of Information and database laws still prohibit the
distribution of open data by central government. Also, no policy programs promoting open
government or open data have been announced. The government is however conducting some research
into the possibilities of open data in the Netherlands. In order to successfully implement
open data within a country a culture of freedom of information supported by legislation is
required.
1.2 Open Data
Before we can elaborate on the problem definition we need a consistent definition of open
data. Open data is defined as the publishing of structured, free, and machine-readable
public sector information [2], where public sector information (PSI) is information gathered
by governmental bodies and stored in some structured form. Open data should not be confused
with open source or open standards, which concern software and digital communication
protocols respectively. We have used this definition because it is used most often in the
literature. Furthermore, this definition lets us differentiate between publicly available
data (which is not by definition free or machine readable) and open data.
1.3 Problem Definition
In this section we will state the societal problem that underlies our research question. The
data governments collect in their daily operations represent an economic value, and therefore
economic potential. This economic value currently remains untapped in the Netherlands.
Therefore, the problem definition for this study is:
The business potential of open government data in the Netherlands remains untapped
which causes loss of economic activity.
It is still uncertain what consequences an open data model has for the different
stakeholders, and how the technical infrastructure changes under open data policies.
1.4 Objective
The objective of this study is to create a design for the provisioning of open public and
geo-spatial transport data. This study was conducted over a period of three months and is
part of a larger study into the cost-benefit relations of open data at the Netherlands
Organization for Applied Scientific Research (TNO). The study also serves as the bachelor
thesis in Industrial Engineering & Management of Mr. J.P.S. van Grieken at the University of
Groningen.
Before we start with the design we need to establish the basic premise of our problem
definition: open government data causes economic activity. Once we have established this, we
need to find the main causes of our problem. When we have found those causes, we will create
a design that covers both the societal problem and a technical implementation.
For scoping purposes we will be looking at two types of data: public and geo-spatial transport
data. We chose these data types because of their market popularity in foreign open data
initiatives12.
Chapter 2
Theory
In this chapter we use theory to try to identify the causes of our problem. We will start
with an elaboration of the economic case for open data. Then we will briefly introduce the
Dutch government information architecture, and describe how it acts as a barrier to open
data. After that we will describe the business model of open data. This will result in the
elaboration and justification of the research question.
2.1 The economics of open data
The main premise of this study is that open data causes a positive economic effect. This
section elaborates on the economic literature available on open data. We will start with an
introduction to the economic value of public sector information.
In their daily operations governments collect data in order to perform their primary tasks,
such as determining land ownership or running a public bus service. The data collected
represents both an economic value and an investment value. The investment value of this data
is what governments pay in order to collect, maintain and distribute it. The economic value
of this data represents the part of the national income which can be attributed to
businesses that create services using the data, or combine it with other data in order to
add value. Studies performed by the European Commission suggest that the total economic
value lies between EUR 28 billion and EUR 134 billion per annum, with a central estimate of
EUR 68 billion [17]. In 2000 the total investment of European member states in public sector
information was valued at EUR 9.5 billion [17].
Usually, public services that have been paid for by taxpayers can only be used once. The
nature of information and data, however, allows it to be copied and distributed at almost no
extra cost [19]. When governments decide to publish free and machine-readable data, value
can be created in the market in the same way. Businesses reusing public sector information
do not need to gather the data themselves, which lowers the investment and the time to
market. Furthermore, companies will use data that was previously not available to create new
services. Other economic effects of open data can be found within government itself.
Research has shown that these forms of openness reduce corruption [20], which in the end
leads to a more transparent and efficient government due to an effective
allocation of knowledge [13]. These specific effects, however, are out of scope for this study.
Before we go into the details of the economic effects of open data we describe the value
chain of information products in order to analyze the business case [17]. The value chain
for information products starts with the creation or collection of various forms of data.
After this process the data needs to be collected and stored in a form that allows for
structured retrieval. The next step is processing and packaging, which prepares the data for
delivery. The final delivery process brings the data to the client or end-user in the form
defined by the processing and packaging stage.
Figure 2.1: The data value chain
We will now give an example of how this value chain applies to the areas we have selected.
The Dutch railway network operator ProRail embedded sensors in the rail network that can
pinpoint the location of trains (creation). This data is collected and, together with other
metadata, stored in a database (collection & storage). The train operators in the
Netherlands require this data to be able to adjust train schedules. ProRail therefore
packages the data in such a way that the operators can use it to adjust their planning and
communicate with travelers about delays (processing & packaging). ProRail uses a computer
interface to deliver this data to the different train operators in the country (delivery).
The data that has been delivered to the train operators represents value because it allows
the operators to utilize their rolling stock in a more optimal way and provide service to
their customers. In the case of open data, governments will deliver the processed and
packaged data at no cost to businesses and the public.
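The four stages of the value chain can be sketched as a small pipeline. This is an illustrative sketch only: the record layout, the function names and the sample events are assumptions made for the example and do not describe the actual systems of the network operator.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TrainPosition:
    # Hypothetical record layout, chosen for illustration only.
    train_id: str
    track_section: str
    observed_at: datetime


def create(sensor_events):
    """Creation: turn raw sensor events into position observations."""
    return [TrainPosition(e["train"], e["section"], e["time"]) for e in sensor_events]


def collect_and_store(observations, store):
    """Collection & storage: keep the most recent observation per train."""
    for obs in observations:
        known = store.get(obs.train_id)
        if known is None or obs.observed_at > known.observed_at:
            store[obs.train_id] = obs
    return store


def process_and_package(store):
    """Processing & packaging: plain records, sorted by train, ready for delivery."""
    return [
        {
            "train_id": obs.train_id,
            "track_section": obs.track_section,
            "observed_at": obs.observed_at.isoformat(),
        }
        for obs in sorted(store.values(), key=lambda o: o.train_id)
    ]


def deliver(records):
    """Delivery: serialise the packaged records for train operators or, with open data, for anyone."""
    return json.dumps(records, indent=2)


# Made-up sample events; train names and sections are hypothetical.
events = [
    {"train": "IC-1", "section": "A", "time": datetime(2011, 2, 28, 9, 0, tzinfo=timezone.utc)},
    {"train": "IC-1", "section": "B", "time": datetime(2011, 2, 28, 9, 5, tzinfo=timezone.utc)},
    {"train": "SPR-2", "section": "C", "time": datetime(2011, 2, 28, 9, 1, tzinfo=timezone.utc)},
]
records = process_and_package(collect_and_store(create(events), {}))
print(deliver(records))
```

In this sketch, opening up the data changes only the last stage: the same packaged records are delivered to a wider audience, while the three earlier stages stay as they are.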
Different costing methods have been proposed for public sector information in order to
maximize the return on investment for governments. The return governments can get on public
sector information is a trade-off between charging directly for the data and providing the
data at marginal or no cost at all. In the latter case the return on investment is achieved
through regular taxation of the economic activities that businesses perform with the data.
Pollock describes three possible pricing policies governments could use for public sector
information distribution and investigates their returns [21]. In a profit-maximization
strategy governments set their prices to maximize the profit given the demand for the data.
An average-cost or cost-recovery strategy sets the price equal to the total cost of data
collection and distribution. In this case the users of the data pay for the entire value
chain of the data. The final policy is the marginal or zero cost strategy, in which the
prices are equal to the short-term marginal cost. In many cases this cost will be zero,
because agencies that have already created distribution channels for the data to other
government bodies will not have to charge for delivery of the data to the market. For
example, the cadastre already distributes geo-spatial data to local authorities and
therefore should not charge businesses to use this delivery infrastructure. In the
Netherlands different pricing strategies are used, depending on the specific government
organization. The most dominant strategies are profit maximization and average-cost
policies.
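The difference between the three pricing policies can be made concrete with a small numerical sketch. The linear demand curve and all figures below are assumptions chosen for illustration; they are not taken from [21] or from any Dutch dataset.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dataset:
    fixed_cost: float     # yearly cost of collecting and maintaining the data (EUR)
    marginal_cost: float  # cost of delivering one extra copy (EUR)
    max_demand: float     # copies requested per year at a price of zero (assumed)
    choke_price: float    # price at which nobody buys any more (assumed)

    def demand(self, price: float) -> float:
        """Hypothetical linear demand curve, an assumption of this sketch."""
        return max(0.0, self.max_demand * (1.0 - price / self.choke_price))

    def profit(self, price: float) -> float:
        return (price - self.marginal_cost) * self.demand(price) - self.fixed_cost


def profit_maximising_price(d: Dataset) -> float:
    # With linear demand the profit peak lies halfway between marginal cost and choke price.
    return (d.choke_price + d.marginal_cost) / 2.0


def average_cost_price(d: Dataset) -> Optional[float]:
    # Lowest price at which revenue covers all costs, i.e. the lower root of profit(price) == 0.
    slope = d.max_demand / d.choke_price
    b = d.max_demand + slope * d.marginal_cost
    disc = b * b - 4.0 * slope * (d.max_demand * d.marginal_cost + d.fixed_cost)
    if disc < 0:
        return None  # demand is too weak to ever recover the full cost
    return (b - math.sqrt(disc)) / (2.0 * slope)


def marginal_cost_price(d: Dataset) -> float:
    return d.marginal_cost


# Made-up figures for a single dataset.
example = Dataset(fixed_cost=100_000.0, marginal_cost=0.0, max_demand=1_000.0, choke_price=1_000.0)
for name, price in [
    ("profit maximization", profit_maximising_price(example)),
    ("average cost", average_cost_price(example)),
    ("marginal cost", marginal_cost_price(example)),
]:
    print(f"{name}: price {price:.2f}, copies used {example.demand(price):.0f}")
```

Under these assumptions the number of copies used rises as the policy moves from profit maximization through average cost to marginal cost. This is the re-use effect that the zero-cost argument relies on; the return to government then has to come from taxation of the resulting economic activity rather than from the price of the data.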
Several studies have shown that the case for a marginal or zero cost policy is strong. One
study on the economic effects of statistical data approaches the problem from the angle of
economic theory. It reasons that economic efficiency is maximized when the services that are
produced change hands in the most efficient manner, avoiding waste and fulfilling customer
needs. Charging for public sector information is therefore not economically efficient,
because the collection and distribution infrastructure has already been funded by taxpayers.
In this case strategies other than zero-cost will prevent the public from enjoying the
benefit of these goods through consumption [22]. Another study reaches the same conclusion.
The marginal cost of delivering data to users other than those primarily intended approaches
zero for many government datasets. Moreover, the business demand for this data is likely to
be high and to grow over time. Furthermore, it is likely that the distribution of free data
will generate new innovative services. It is certainly safe to assume that the market will
be better equipped to innovate on this data than public institutions facing heavy regulatory
and budget constraints [23].
When we look at the economics of open data for public and geo-spatial transport data we find
that similar effects occur. A study on the impact of public sector geographic information in
the Netherlands shows that a reduction in the price of the entire vector map of the
Netherlands from EUR 1 million to EUR 200,000 caused a significant increase in demand and in
revenue for the cadastre [24]. Furthermore, a case study of the new map of the Netherlands,
which contains planning information on housing and infrastructure projects and is maintained
by the Department of Housing and Spatial Planning, sheds an interesting light on the
increase of dataset usage. The department brought this dataset under a creative commons
license13, making it freely available for downloading. At first the dataset was bought on
average once every month, but after the data was released under a public license usage
increased to 200 downloads per month [24].
A similar study on the economic effects of cadastral information was performed in Spain. In
2004 the Catalan regional government launched a cadastral information system providing
topographical and geo-data in an open way. Using a survey, the cost-benefit effects of this
investment for government organizations (municipalities, regional and public authorities)
were investigated. The study showed that the information system significantly increases the
efficiency and workings of other governmental organizations. Although the investment in the
portal was high (EUR 1.2 million), the benefits within other government authorities in 2006
were EUR 2,371,000 [25]. We can conclude that in some cases internal governmental
organizations can benefit largely from open public sector information, because data becomes
available in a standardized way to both businesses and other branches of government.
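The Catalan figures can be turned into a rough benefit-to-cost ratio. The sketch below assumes that the investment and the 2006 benefits can be compared directly, which the source does not state; it ignores discounting, operating costs and benefits in other years, so the outcome is an indication only.

```python
def benefit_cost_ratio(benefit_eur: float, investment_eur: float) -> float:
    """Simple undiscounted ratio; an assumption of this sketch, not the method of [25]."""
    if investment_eur <= 0:
        raise ValueError("investment must be positive")
    return benefit_eur / investment_eur


portal_investment = 1_200_000  # EUR, investment in the portal as reported in [25]
benefits_2006 = 2_371_000      # EUR, benefits within other government authorities in 2006 [25]

ratio = benefit_cost_ratio(benefits_2006, portal_investment)
net_benefit = benefits_2006 - portal_investment

print(f"benefit/cost ratio: {ratio:.2f}")   # about 1.98
print(f"net benefit: EUR {net_benefit:,}")  # EUR 1,171,000
```

On this simplified reading, the benefits reported for 2006 alone were roughly twice the investment in the portal.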
Most of the research on open public sector information focuses on a macroeconomic
analysis of data provisioning. The Pira[17] study and most of the works of Pollock[19][21]
focus on macroeconomic descriptions of the market and estimates of the value of public
sector information. At the micro level, however, the literature lacks an analysis of the
business cases and economics.
2.2 Dutch government information architecture
In order to understand the context of the ICT landscape in this study we will briefly
introduce the information architecture of the Dutch government. The Dutch Ministry of the
Interior and Kingdom Relations is formally responsible for ICT within the government.
The basic architecture that the central government should follow is formulated in NORA
(Dutch Government Reference Architecture), a set of principles, guidelines and technologies
that branches of government can follow to organize their ICT. The goals of NORA are to guide
individual government bodies in the design of their information architecture and to support
policy making and deployment[27]. Within the architecture three kinds of principles are
defined: basic principles, collaboration principles and regulations. The basic principles
describe the relation between government, the public and businesses. The collaboration
principles describe interoperability constraints, and the regulations describe technical
constraints, standards and messages.
In the architecture different components can be identified:
1. Data Sources: (basisregistraties) the data sources or basis registries contain various
forms of data the government collects.
2. Service Bus: (servicebussen) the service bus is a data transportation facility that
can move pieces of information through a messaging system.
3. Transaction Gate: (transactiepoort) the transaction gate allows organizations to
interact with the government on a machine level, for example when applying for a
tax refund.
4. Security and Identity: security and identity management are organized on the level
of the individual datasets but can be accessed through one identification system called
DigiD.
5. Front Office: the front office systems are used by various organizations to interact
with citizens and businesses. This can be a government website, but also a civil servant
supporting a citizen.
6. Organizations: the model allows for different organizations using similar architec-
tures within their organization to interact with each other.
The following image describes the relation between the different components.
Figure 2.2: The Dutch Government Reference Architecture (NORA)
The NORA architecture can be classified as a service oriented architecture. In a service
oriented architecture various virtual information services are defined which can be requested
by a user. Furthermore, service oriented architectures use well defined standards for
messages and communication and are built up in a modular fashion. Technical implementations
of these service oriented architectures are usually web services or some other form of
information service bus. The Dutch government is still in the phase of constructing this
unified information service bus. In this phase the focus is on enabling interoperability by
providing basic technical standards and policies that allow information to flow between
different governmental organizations. In the coming years it can be expected that these
systems will evolve towards the alignment of administrative procedures and technical systems[28].
For the deployment of vast amounts of data in an open fashion it is important that both
the information service bus and the alignment of technical systems and administrative
procedures are well organized.
Reflecting on this architecture in relation to open data we can identify several
problems. First of all, the architecture does not include means to deliver raw data
(basisregistraties) to businesses. The current model includes a government transaction port
that allows for message transactions such as filing a tax return. Furthermore, the central
front office provides services such as requesting a new passport. No data interface is
provided in this architecture. Secondly, the current architecture only allows for security and
identity management at the front office or transaction port. The service bus that transports
the data is organized internally. This causes problems with open data because both public
and non-public data travel over the same bus. Finally, the architecture does not dictate
message or data standards, which would be useful when distributing open data. We can
conclude that the current architecture acts as a barrier to open data: no central technical
infrastructure is in place to deliver the data.
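To make the second problem concrete, the following minimal sketch shows one way a data interface could separate public from non-public records before anything leaves the internal bus. The `Record` type, the `is_public` flag and the `public_view` function are hypothetical names introduced for illustration; NORA does not define them.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Record:
    """A message travelling over the internal service bus (hypothetical)."""
    source: str      # originating registry, e.g. "cadastre"
    payload: dict    # the data itself
    is_public: bool  # assumed classification flag set by the data owner


def public_view(bus: Iterable[Record]) -> Iterator[Record]:
    """Yield only the records that are cleared for open publication."""
    for record in bus:
        if record.is_public:
            yield record


# Illustrative bus contents: one public and one non-public record.
bus = [
    Record("cadastre", {"parcel": "A-12", "area_m2": 450}, is_public=True),
    Record("tax", {"refund_eur": 120}, is_public=False),
]

open_records = list(public_view(bus))
print([r.source for r in open_records])
```

Filtering at a dedicated interface, rather than on the bus itself, would leave the internal message flow untouched while still keeping non-public data away from external users.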
2.3 Stakeholders
In this section we elaborate on our choice of stakeholders and how they relate to the
available literature. Most studies on open data consider only the government and
businesses as stakeholders. We will use more specific definitions of stakeholders based on
Rowley's e-government stakeholder definition[31].
1. Data provider: the data provider is a governmental organization delivering some form
of valuable public transport data. The data provider is dependent on central government
funding, but can be outside of direct democratic control. The stake of this organization is to
fulfill its lawful obligation at the lowest cost. An example of this stakeholder group in
the Netherlands is the Dutch cadastre.
2. Network Operator: the network operator is the owner of the physical
infrastructure of the transport network (i.e. roads, tracks) and can be either a
governmental or a non-governmental organization. An example is the rail network
operator Prorail. A network operator can also be a data provider if the law forces this
stakeholder group to deliver its data at zero cost. As an e-government stakeholder
the network operator can be classified as Governmental Organization.
3. Service Operators: Using these networks to provide travel services are the service
operators. These operators can also be a governmental or non-governmental organi-
zation. The stake of the service operator is to provide an efficient and high quality
travel service. An example of this stakeholder group in the Netherlands is the rail
operator NS. As an e-government stakeholder the service operators can be classified
as Businesses.
4. Businesses: the businesses are privately owned for-profit organizations that can use
data provided by the operators to create services for the traveler. The stake of this
group is to get the data at the lowest possible cost in a usable format. As an
e-government stakeholder the businesses can be classified as Businesses. An example
of this stakeholder group is the navigation company TomTom.
5. Traveler: The traveler is the end-user of the services from both the operators and the
businesses. As an e-government stakeholder the traveler can be classified as People
as service users. The stake of this group in this research is to maximize quality of
services and minimize cost.
6. Transport authorities: the transport authorities are the regulatory bodies involved
in public transport. As an e-government stakeholder the transport authorities can
be classified as Public Administrators. The stake of this group is to gain a good
understanding of the transport networks in order to control safety.
7. Civil Society: civil society consists of citizens and foundations that advocate various
causes. As an e-government stakeholder civil society can be classified as People
as citizens. They are interested in the way policies are organized and what their impact
on society is. The stake of this group in this research is to provide transparency and
accountability to decide on and evaluate policy.
These stakeholder definitions are used throughout the study.
2.4 The business model
In this section we describe the current business case of open data in the Netherlands.
Furthermore, we will elaborate on some blind spots in the literature and the effects on the
business cases of different stakeholders.
The current business case of government data starts at the different government organizations
that collect and store data. The data is then provided
under legal, financial and technical limitations. In the Netherlands, no central policy on
these limitations applies. A study on these limitations suggests that 31% of the databases
do not allow for commercial re-use. Furthermore, in 72% of the cases the data is available
for free but only for non-commercial use. Finally, only 22% of the databases provide access
through means other than a web interface (which offers no direct access to the data), and
only 4% of the databases are accessible through an API[1]. In the cases where data is not
freely available, profit maximization or cost-averaging pricing strategies apply. The data is
then sold to businesses that re-use the data in their applications. The businesses use some
of the data to improve their products. The limitations in this business model cause a lack
of economic activity around government data.
We found that a gap exists in the current literature on open data. Most of the research on
the distribution of public sector information at marginal cost has focused on macroeconomic,
policy or transparency effects. We put forward that, to study the case of open data more
precisely, the business cases of the different stakeholders should be analyzed more thoroughly.
In most of the studies conducted the stakeholders defined are government and businesses or
the public. These narrow definitions leave little room for the investigation of effects other
than those on the primary value chain and revenue models. In order to create a good design
for open data we need to gain more insight into the business cases of the different
stakeholders instead of only looking at the global business model.
2.5 Research Question
Based on our problem definition and the exploration of the subject of open data in the
Netherlands we are ready to introduce the research question. In the previous sections we
made the economic case for open data and identified the most important causes of our
problem. We now need to find out how we can solve these problems with our design. We will
focus on two causes of the problem:
1. Pricing: we will need to find a pricing strategy that maximizes net-value for both
businesses and government. We will design a business model that deals with this cause.
2. Technology: we will need to find a technical infrastructure to deliver the data.
From our theory section we expect that open data policies will cause changes in the
business cases of different stakeholders. We will need to investigate the effects of the design
of the new open data business model. Based on the theory and hypothesis about changes
in the business case we can introduce the primary research question.
What changes in the business model for public- and geospatial transport data could be
observed when open data would be made available?
The research question aims at finding the effects of an open data business model on various
stakeholders. We focus on public and geospatial transport data based on the statistics of
the American data portal data.gov. The statistics of this website show that geospatial and
transport data are among the most popular datasets that businesses reuse. Furthermore,
we focus on the Netherlands in order to be able to study the cases in detail within the
time available.
The secondary research question focuses on the design of our technical
infrastructure. If the government were to decide on an open data policy, this would
significantly change the information architecture of government organizations. In the
current closed model data is used primarily internally, and therefore interfaces to
information systems external to the organizations have not been realized. To be able to
deliver open data to businesses an interface should be designed. Therefore, the secondary
research question is:
What technical infrastructure should be provided in order to deliver open public- and
geospatial transport data to businesses?
Chapter 3
Methods
The goal of this study is to design a business case and technical infrastructure for open
data. The study is based on a literature review and on open and structured interviews with
various stakeholders and specialists. In addition, various design methods such as requirements
analysis, business model generation, ORM modeling and data warehouse modeling have been used.
Because open data is subject to many influences concerning the economy, privacy and civil
society, and is influenced by many different stakeholders such as citizens, businesses, civil
society and civil servants, we believe that a literature review and a stakeholder analysis are
appropriate methods to explore the depth of the subject.
Figure 3.1: The design process
3.1 Literature Review
The literature review serves to find out the theoretical underpinnings of open data. We used
the literature review to find the main causes of the problem, and provide context to the
topic of open data. Furthermore, we looked into the electronic government architectures,
specifically the Dutch government's information architecture NORA.
3.2 Open Interviews
In order to gain more insight into the specific case of open data in the Netherlands and
to outline the methods used to design a business case for open data, interviews with
various specialists were conducted. These specialists range from government officials and
business leaders to civil servants and activists. Based on these interviews and the literature
review the structured interviews for the analysis of the business case were constructed. A list
of the interview subjects can be found in the appendix.
3.3 Stakeholder Identification
Based on the open interviews and the literature review we made an analysis of the relevant
stakeholders. These stakeholders were used to select respondents for the structured
interviews. Furthermore, this identification served as a means to establish consistent
terminology throughout the design phase. The list of stakeholders and their descriptions can
be found in the previous chapter.
3.4 Structured interviews
Structured interviews were then performed in which the interviewer used a fixed set of
questions to gain insight into both the business case and the technical requirements. The
interviews were conducted with an interview protocol based on interview techniques by Emans[36].
We chose this interview form because it provides a good basis for comparing the different
answers that respondents give. We interviewed 2-3 respondents from organizations within
every stakeholder group that we defined. The interviews were performed in a special
interviewing room. Respondents could choose to remain anonymous. All of the conversations
were recorded for future reference. The interviews took between 1.5 and 2 hours and were
performed during the day. The interviews were conducted in the same order with
every respondent. The language of the interviews was Dutch. Depending on the respondent's
technological background the business case question set, the interface question set or both
sets were used. A list of the interview subjects can be found in the appendix together with
the interview protocol.
3.5 Business case analysis
To be able to gain insight into the low-level effects of open data an analysis of the business
cases of different stakeholders was performed. The business model generation method[26] was
used to analyze the business cases of these various stakeholders. Since the design proposes a
change in the business model of government data provisioning, an in-depth analysis of the
effects is required. We used Osterwalder's method to identify the effects on the business
case of all of the stakeholders within the value chain. This method provides us with a clear
overview of all the possible changes for these respective stakeholders. The business model
generation method uses nine areas to describe a stakeholder's business case, which we will
explain here:
1. Partners: describes the key partners, such as suppliers or government institutions,
and the motivation for each partnership.
2. Activities: describes what key activities are performed and how they contribute to
the revenue streams.
3. Value Proposition: describes what value is delivered to the customer and what
customer need is solved.
4. Customer Relations: describes what type of relationship the organization has with
its customers, how costly these relationships are and how they are established.
5. Customer Segments: describes in what markets the organization operates.
6. Distribution Channels: describes the distribution channel of the organization.
7. Resources: describes what resources are necessary in order to create the value propo-
sition.
8. Cost Structure: describes what the most important costs inherent in the business
model are.
9. Revenue Stream: describes the nature of the revenue streams and identifies what value
customers are really willing to pay for.
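The nine areas above can be written down as a simple record, which makes it easy to compare a stakeholder's current business case with the one under an open data model. The following is a minimal sketch, not part of the business model generation method itself; the class name, the `changed_areas` helper and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass
class BusinessCase:
    """One free-text description per area of the business model canvas."""
    partners: str
    activities: str
    value_proposition: str
    customer_relations: str
    customer_segments: str
    distribution_channels: str
    resources: str
    cost_structure: str
    revenue_stream: str


def changed_areas(current: BusinessCase, proposed: BusinessCase) -> List[str]:
    """Return the names of the areas whose description differs."""
    return [f.name for f in fields(current)
            if getattr(current, f.name) != getattr(proposed, f.name)]


# Hypothetical data provider; the descriptions are placeholders.
current = BusinessCase(
    partners="central government",
    activities="collect and maintain data",
    value_proposition="reliable registry data",
    customer_relations="contract based",
    customer_segments="government bodies",
    distribution_channels="sale on request",
    resources="registries and staff",
    cost_structure="data maintenance",
    revenue_stream="data sales",
)
proposed = replace(
    current,
    distribution_channels="open data warehouse",
    revenue_stream="government compensation",
)

print(changed_areas(current, proposed))
```

Areas that are absent from the result are the ones for which no change was recorded.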
The results of the business case analysis and proposed model are presented in the business
case design section.
3.6 Requirements analysis
For the data warehouse design we used van Lamsweerde's requirements engineering method[29].
Furthermore, Boehm's analysis of non-functional requirements was used to gain insight into
qualitative aspects of the warehouse design[30]. The requirements engineering method uses
a process of scoping, stakeholder analysis, user characteristics definition, product
perspective, use case analysis and requirements specification to create a software interface
design. In order to account for non-functional requirements that might be important for the
interface we looked for usability, safety, efficiency, performance, capacity and interoperability
constraints.
3.7 Data Warehouse design
We chose to design a data warehouse as a technical solution for delivering open data to
businesses. To design this data warehouse we used a UML based method [33]. However,
instead of using UML to describe the data model, we used Object Role Modeling (ORM) [34].
This specific method was used because we have more experience with this type of modeling,
and this method allows for detailed conceptual modeling in a compact schema. The results
of this design are presented in the technology design section.
Chapter 4
Business Model Design
In this chapter we propose a design for the business model of open data in the Netherlands.
Furthermore, we analyze the impact of this business model on the different stakeholders.
The current business model of public sector information works as follows. Government
bodies collect various forms of transport data and store it for internal use. When a business
wants to use this data for commercial purposes the data can be bought. The data is offered
under a competing or cost-averaging pricing strategy. Most government organizations do not
structure their data according to open standards. Furthermore, various types of license
limitations apply to the data. After the data has been sold, the business uses the data in an
existing product or service, which in turn is sold to an end user.
Figure 4.1: The business model of open data
We propose an open business model. The business model of open data for public and
geo-spatial transport data essentially works as follows. Government organizations like the
Ministry of Transportation, the cadaster and the public transport network operators pub-
lish structured, machine readable and free datasources in a data warehouse. Businesses then
download or link to this data and create new services. These services are then provided to
end-users. The government provides the data in a structured form based on available open
standards.
In this business model the situation for some of the stakeholders changes. The most
significant changes occur for the government organizations (i.e. data provider and network
operator stakeholder groups). In the designed business model these organizations will have
to change the following:
1. Pricing Strategy: the pricing strategy for re-use of public sector data has to change
from competing or cost-averaging strategies to a free or marginal cost strategy.
2. Legislation: copyright, intellectual property and database law are adjusted in such a
way the data can be easily used by the businesses.
3. Technical Infrastructure: the organizations provide a technical infrastructure to
deliver the data sets or web-services to businesses.
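The shift in pricing strategy can be illustrated with a small calculation. The sketch below assumes that the cost-averaging strategy corresponds to textbook average-cost pricing, where the fixed cost of producing a dataset is spread over the expected number of sales; the function names and all figures are hypothetical.

```python
def average_cost_price(fixed_cost: float, marginal_cost: float,
                       expected_sales: int) -> float:
    """Price per copy that recovers the fixed cost over the expected sales."""
    if expected_sales <= 0:
        raise ValueError("expected_sales must be positive")
    return fixed_cost / expected_sales + marginal_cost


def marginal_cost_price(marginal_cost: float) -> float:
    """Price per copy that covers only the cost of one extra delivery."""
    return marginal_cost


# Hypothetical dataset: expensive to produce, nearly free to copy.
fixed_cost = 500_000.0
marginal_cost = 0.0
expected_sales = 50

print(average_cost_price(fixed_cost, marginal_cost, expected_sales))
print(marginal_cost_price(marginal_cost))
```

Under the marginal cost strategy the fixed cost is no longer recovered from buyers, which is why compensation for the data provider becomes part of the design.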
4.1 Effects of the model
It can be expected that in this business model the economic activity of businesses around
this data increases significantly. All of the stakeholders that were interviewed expect a
significant increase in economic activity. For example, the developers behind the train iPhone
app (Trein) expect that such a development will cause fierce competition to create the best
travel app on a mobile device. The planning service OV9292 expects that not only
competition will increase, but also that the use of public transport will probably increase
when travel information is more widely available. Their own research has shown that OV9292
increases the use of public transport by 8%. We can thus expect that more businesses will
start to use open data to generate revenue.
Furthermore, it can be expected that new types of innovative services will emerge with
open data. In New York, San Francisco and other major cities that opened up their data,
various types of travel services emerged within months (footnote 14). The respondents from the
interviews also expect new and innovative services to emerge when government data is combined
with commercial data sets and services. One of the examples mentioned in the
interviews was a toilet finding service in Denmark. This service provides citizens with a
bladder condition with the location of toilets in their area, a service that could not have been
created without open data. With our business model we can expect that the business
potential currently untapped in the Netherlands could be opened up. The effects that this
business model has on the business cases of the various stakeholders will be explored in the
next section.
4.2 Effects on the stakeholder business cases
This section describes the effects of the business model on the specific business cases of the
stakeholders we interviewed. We use the definitions of the different aspects of the business
case introduced in the methods section. For every stakeholder the aspects of the business
case that change are described. If an aspect is not described in this section no relevant
changes were observed.
1. Data provider: for the data provider some significant changes to the business model
can be observed. The most significant change is the loss of income due to different
pricing strategies. The revenue streams of these data providers change because they
will have to compensate for the loss of income. For example, the cadastre expects that
open data will force it to provide topographic data and information on the legal
status of land for free. However, to maintain the quality required by law costs have to
be incurred, so the loss of income has to be compensated somehow. Organizations
like OV9292 also explained that providing the data for free would probably cause a loss of
income on, for example, the timetable services. They also pointed out that a certain level of
data quality requires maintenance and expertise, which cost money. At the business end,
stakeholders agree that the quality of data is one of the most important requirements
for them to re-use the data. We propose that this loss of income is compensated by
the national government, since it is a beneficiary of the effects of open data through
taxation. Furthermore, the distribution channels of the data providers will change.
Based on the interviews we can observe that both the cadastre and the providers of
transport data fear this loss of income. The cadastre furthermore fears that the national
government is not willing to compensate for the loss of income. In that case it will
either decrease the number of key activities or increase the price of other products
it currently delivers to the market.
Furthermore, some organizations will have to provide a technical infrastructure to
deliver vast amounts of data to businesses. This infrastructure will change the way
distribution channels are organized. This change in infrastructure will also require an
investment in technology for some of the organizations. Other areas of the business
case of these organizations, such as customer segments, resources and partners, will not
change in our business model.
2. Network Operator: for the network operator the most significant changes occur
when it is a provider of data. For example, in the railway sector Prorail
maintains the network and provides the data on locations of trains to the different service
operators on the network. In this case the change in pricing strategy will decrease
its overall income. However, the network operators in general are already obliged under
Dutch public transport law (Wet personenvervoer) to provide this data to their main
customers, the service operators. The travel information service OV9292 said that it
would make the data available if requested. However, this would be the raw data,
not the planning service it provides. OV9292 considers the planning software,
not the raw data, to be its core intellectual property. The most significant change for
the network operator is the change in customer segments. If open data were
introduced, a new group of customers for the data would emerge: businesses.
3. Service Operators: for the service operator changes in the cost structure will occur.
Data that was only commercially available can now be obtained at zero or marginal
cost. For some operators, such as NS, this could mean a significant decrease
in the cost of data collection. Furthermore, based on the interviews with OV9292, the
availability of free public transport data will increase the number of customers that
use their services. This increases the volume of the revenue stream obtained from
travel services.
4. Businesses: as for the data providers, the changes to the business model of businesses
are significant. In the old model businesses had to pay for the acquisition of data
from government bodies. In the proposed model this data is available for free, which
significantly lowers the cost of acquisition of data products. Furthermore, by enforcing
the use of open standards the cost of converting the data into appropriate formats will
decrease. We can therefore conclude that the cost structure of these businesses changes
in the business model.
Furthermore, based on the interviews we can conclude that competition will increase.
Respondents expect that the barrier to entering the market with a certain service will
be lowered. For example, one of the respondents expects that navigation products of
acceptable quality could be made with the map provided by the cadastre. The main cause for
lowering this barrier is that no significant investments in the acquisition of high quality
mapping data are required when the map can be downloaded for free from the cadastre.
Also, key activities of some businesses can change due to the change in the business
model. For example, commercial mapping organizations like Google, TomTom and
Navteq currently rely on land surveying and other mapping techniques for their
mapping products. At least 20 properties of these mapping products could be made available
for free through the cadastre. Different business organizations pointed out that it is
important that the data is license free and that coverage and quality of the data are
guaranteed.
5. Traveler: for travelers we cannot really speak of a business case. We will however state
the obvious changes this stakeholder experiences in our business model. The traveler will
experience an increase in the number of services available. Furthermore, due
to the increase in competition the quality and functionality of the services provided will
probably increase.
6. Transport authorities: since the transport authorities play no vital role in the
business model we consider them out of scope. One effect we might
expect for transport authorities is that the availability of more data will
give vital insight into the performance of the transport networks. This could lead to
better policies at the government level.
7. Civil Society: civil society organizations currently play no significant role in the
business model of open data. However, it can be expected that civil society
organizations will engage in the creation of social applications. These applications were
previously too expensive to develop because of the data acquisition efforts, but become viable
in our new model. One example of this type of application is Schoolscope in the
United Kingdom, a website that offers parents a benchmark of the quality of schools.
Another application reports on hazardous locations in the Manhattan area of New York
based on traffic data published by the government.
By using the business model generation method we found that the most significant
changes in our design are a change in cost structure of the providers and users of data.
Chapter 5
Technology Design
One of the causes of the problem is the lack of a technical infrastructure to deliver high
quality data to businesses at high speed. We performed a requirements analysis that has led
to a technical solution to our problem. In this chapter we propose a design of a data warehouse
for public and geo-spatial transport data.
A data warehouse is essentially a data storage and decision support system based on a
variety of different datasets. In business, data warehouses are frequently used as management
support tools. A data warehouse is always subject-oriented and records and interprets
attributes of these subjects over time. Some examples of subjects in our case are vehicles,
stops and travelers. We chose to design a data warehouse rather than a conventional database
system because a data warehouse allows for decision support (planning) and can cope with
multiple sources of different information. The scope of this design comprises an analysis of the
landscape in which the warehouse will operate, a draft architecture of the different data
warehouse layers, a data model for the storage of public and geospatial transport data, an
interface design and recommendations on standards and hardware. We will not look into
front-end applications, query structure, optimization, rollout or maintenance aspects of the
data warehouse. We used the UML-based data warehouse design method to create this
design[33].
5.1 Landscape
Before we can describe the interface design we need to define the context architecture in
relation to the value chain. The data warehouse collects data from different data providers and
network operators. This data is processed and packaged in the warehouse. We assume that
the European Committee for Standardization (CEN) standard Service Interface for Real Time
Information, CEN/TS 15531¹⁵, will be used. It includes data on timetables, network monitoring,
vehicle monitoring, connection monitoring and a general message service. Geographical data
can be distributed in various vector forms. In this study we assume that the Web Map Service,
Web Feature Service and Web Map Tile Service standards of the Open Geospatial Consortium
(OGC) are used. For the traffic and delay data we suggest using
the European Open Travel Data Access Protocol (OTAP) and the standards defined by the
National Database Road-traffic (NDW).
Figure 5.1: The data warehouse in its context
After the data is processed and packaged it can be delivered through the interface. Public
transport data can be defined as data regarding the physical infrastructure (stops, stations,
routes), the timetable (planning, platforms), and the status of the network (delays, outages).
Geo-spatial transport data can be defined as data regarding the main motorway network
(network, ramps) and the status of the network (traffic jams).
5.2 Warehouse Architecture
This section describes the general architecture of the data warehouse. A data warehouse
is generally built up out of four main components. First, there are multiple data sources
that provide different sorts of information to the data warehouse. In our example road, train,
network and mapping data feed into the data warehouse. After the data has been processed
through the different layers of the data warehouse it is offered to users in a data mart. This
data mart is a subset of the larger data store and is oriented to either public transport or
road network data. When a user requests certain data from the data mart through
the interface (API) it can be re-used in an application. In this model we also included a
planning layer that can interpret the different sorts of raw data and return routing and
planning information.
We explicitly place this layer outside the data processing part of the data warehouse
because we want to keep this planning capability of the data warehouse optional. We want
to keep this optional because these specific types of planning packages are also used in the
market and might introduce unfair competition to other vendors of planning software.
Figure 5.2: The data warehouse architecture
The source layer of the data warehouse is the physical infrastructure that gathers the
data from the different data sources. In our data warehouse the data sources either push
the data to the data warehouse at some predetermined interval, or a separate data scraper
is used to collect the data. In the extraction layer the scheduling of the data extraction from
the data sources is organized. For example, the vector map of the road network probably
won't require an update more often than once or twice a week, whereas the location of
a train will probably have to be updated every 30 seconds. Some data warehouses feature
a staging area that is used to normalize the data and check for quality, coverage and other
constraints. Such a staging area would be relevant if a large number of data sources were
used and if the quality of this data could not be trusted. Since the providers of the
data are all known, agreements can be made on these aspects of the data delivery and we
will not require data staging. In the ETL (Extraction, Transformation and Load) layer the
data from the extraction layer is transformed into the relevant data structure,
metadata is extracted and the data is loaded into the databases. In this process the data
is checked for integrity, cleaned and sometimes translated. The ETL stage does not
directly operate on the databases of the data warehouse but uses staging tables. Depending
on the requirements of the data and the update frequency the different steps used can vary.
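The ETL flow described above can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for the database of the warehouse. The table names staging_vehicle_position and vehicle_position, the column layout and the validation rules are hypothetical and only serve the example.

```python
import sqlite3


def run_etl(conn, raw_rows):
    """Load raw (vehicle_id, lat, lon, ts) rows via a staging table.

    Returns the number of rows that passed the checks and were loaded.
    All table and column names are hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_vehicle_position "
        "(vehicle_id TEXT, lat REAL, lon REAL, ts TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vehicle_position "
        "(vehicle_id TEXT NOT NULL, lat REAL NOT NULL, "
        "lon REAL NOT NULL, ts TEXT NOT NULL)"
    )
    # 'with conn' commits on success and rolls back on error, so the
    # warehouse table never sees a half-finished load.
    with conn:
        # Extract: copy the raw feed into the (emptied) staging table.
        conn.execute("DELETE FROM staging_vehicle_position")
        conn.executemany(
            "INSERT INTO staging_vehicle_position VALUES (?, ?, ?, ?)",
            raw_rows,
        )
        # Transform and load: clean the identifier and keep only rows
        # that pass the integrity checks.
        cursor = conn.execute(
            "INSERT INTO vehicle_position "
            "SELECT TRIM(vehicle_id), lat, lon, ts "
            "FROM staging_vehicle_position "
            "WHERE TRIM(vehicle_id) <> '' "
            "AND lat BETWEEN -90 AND 90 "
            "AND lon BETWEEN -180 AND 180 "
            "AND ts IS NOT NULL"
        )
        return cursor.rowcount
```

In this sketch rows with an empty identifier or an impossible coordinate stay behind in the staging table, where they could be inspected before the next extraction run.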
After the ETL layer the data is processed in the storage layer. This layer is basically the
database management system (DBMS) of the data warehouse. The primary task of this
layer is to store and retrieve data from the data warehouse. It relies on the ACID properties
(atomicity, consistency, isolation, durability) to guarantee that data warehouse transactions are
processed reliably. The storage layer pushes different types of data at set intervals to the
two data marts that we included in the design. Each data mart is a subset of the data
present in the data warehouse that is relevant to one user group. We use two separate data marts
for several redundancy reasons. First, the data marts can be hosted on different hardware
environments than the data warehouse. This will make sure that if the data warehouse for
some reason goes offline data can still be extracted. Furthermore, if these data marts did
not exist and the API were coupled to the data warehouse directly, a failure in the data
warehouse would cause both the vital road and public transport information infrastructure
to go offline together. This could lead to major delays on both the public transport and
road network. Finally, the data marts allow for a much cheaper failover environment than
the data warehouse. Because a data mart is essentially a big cache of a subset of the data
warehouse it could be mirrored onto different physical locations. The final layer in our data
warehouse design is the interface with the end-users. This interface design is defined
further on in this chapter.
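The failover argument for mirrored data marts can be sketched in a few lines of Python. This is only an illustration under our own assumptions: each mirror is modelled as a function over a cached snapshot, and the names DataMartUnavailable, make_mirror, offline_mirror and read_with_failover are hypothetical.

```python
class DataMartUnavailable(Exception):
    """Raised when a data mart mirror cannot be reached (hypothetical)."""


def make_mirror(snapshot):
    """Model a data mart mirror as a lookup over a cached snapshot (dict)."""
    def lookup(key):
        return snapshot[key]
    return lookup


def offline_mirror(key):
    """Model a mirror whose hardware environment is currently offline."""
    raise DataMartUnavailable("mirror is offline")


def read_with_failover(mirrors, key):
    """Ask each mirror in turn and return the first answer."""
    for mirror in mirrors:
        try:
            return mirror(key)
        except DataMartUnavailable:
            continue  # try the next physical location
    raise DataMartUnavailable(f"all {len(mirrors)} mirrors failed")
```

A request keeps being served as long as at least one mirror is reachable, for example read_with_failover([offline_mirror, backup], "stop:groningen") with backup created by make_mirror.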
5.3 Data Model
To be able to store data in our data warehouse we will have to model the data first. For the
geo-data and traffic data some good internationally accepted data models are already freely
available. We choose to adopt these standards in our design. For the geo-spatial
information the OpenGIS Map Service standard will be used[35]. The road data model will
be based on the model already used by the Dutch National Database Roadtraffic¹⁶. However,
such a well defined data model is missing for public transport data in the Netherlands. Some
effort has been put into the BISON standard. This standard, however, only models the
interfaces between various service providers in the public transport domain. For the public
transport data a draft version of the BISON standard and the interviews have been used to
derive a data model. We tried to combine the BISON standard with the already available
CEN/TS 15531 standard for public transport defined by the European Committee for Standardization.
Figure 5.3: Available data models
Based on the service interface requirements we used the Object Role Modeling (ORM)
technique[34] to generate the model for public transport. The model only describes the
conceptual data relations in the data warehouse. We have used nine elementary object types
to describe the domain of public transport.
Figure 5.4: The ORM data model for public transport
The vehicle object type is the physical means of transportation (e.g. train, bus, taxi)
and has various attributes such as a location, capacity and the availability of a toilet. A
vehicle is maintained by a certain service operator, which only has a name in our model. On
the infrastructure side of the spectrum we defined a stop, platform and connection. A stop
is a physical location where a vehicle can stop to drop off travelers. A stop can have multiple
platforms. The route between two stops or platforms can be defined as a connection, which
has a distance and can be available or unavailable. A connection is maintained by a network
operator. Furthermore, the unique combination of a connection, a vehicle and a planning
item results in a schedule. The planning item contains a departure and an arrival timestamp
(date & time) and may contain a note for the operator. Different planning items together
generate a route for a passenger. When the planning changes an exception can be created.
This exception is a message to the traveler and operators that a certain planning item has
changed. An exception can also be a single message that has no influence on the planning.
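To make the relations between these object types concrete, the sketch below expresses a part of the conceptual model as Python dataclasses. It is an illustration only: the class and attribute names are our own choices, not part of the BISON or CEN/TS 15531 standards, and the operators are reduced to a name as in the model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Stop:
    name: str
    platforms: List[str] = field(default_factory=list)  # a stop can have several


@dataclass
class Connection:
    origin: Stop
    destination: Stop
    distance_km: float
    network_operator: str   # the operator only has a name in the model
    available: bool = True


@dataclass
class Vehicle:
    vehicle_id: str
    kind: str               # e.g. "train", "bus", "taxi"
    capacity: int
    has_toilet: bool
    service_operator: str


@dataclass
class PlanningItem:
    departure: datetime
    arrival: datetime
    note: Optional[str] = None  # optional note for the operator


@dataclass
class Schedule:
    # The unique combination of a connection, a vehicle and a planning item.
    connection: Connection
    vehicle: Vehicle
    planning_item: PlanningItem


@dataclass
class ExceptionMessage:
    text: str
    # None means a single message without influence on the planning.
    planning_item: Optional[PlanningItem] = None
```

The vehicle location is left out of the sketch because it changes continuously and would more naturally be stored as a time series than as an attribute.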
5.4 Interface
To connect the data warehouse to the business users an Application Programming Interface
(API) will be constructed. The interface will act as a data provisioning system for public
transport and geo-spatial data. A separate API will be constructed for each of the two data
types: one for the public transport data and one for the geo-spatial transport data.
The interface will be run as a web service that allows for access through the HTTP protocol
(over the web). The interface will be constructed on a Representational State Transfer
(REST) architecture that uses messages formatted in Extensible Markup Language
(XML). The choice for REST is based on its focus on different system states that can be
retrieved through the interface using common HTTP methods (like GET, POST, PUT, DELETE).
This type of API provides scalability, safety, stability, generality in interfaces and latency
reduction, and is flexible enough to extend with more services in the future. For the messages
that are being sent through the interface the XML standard will be used. XML is a W3C
approved standard for machine readable document markup. It provides enough
freedom to define custom schemas for the purpose of geo and public transport data
provisioning without losing standardization.
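As an illustration of such an XML message, the sketch below builds a possible response body for a stop monitoring request with Python's standard xml.etree.ElementTree module. The element and attribute names (StopMonitoring, Departure, PlannedTime, DelayMinutes) are hypothetical and do not follow the CEN/TS 15531 schema. An actual implementation would use the schema agreed upon with the data providers.

```python
import xml.etree.ElementTree as ET


def stop_monitoring_response(stop_name, departures):
    """Build an XML message for one stop.

    departures: iterable of (line, planned_time, delay_minutes) tuples.
    All element and attribute names are hypothetical.
    """
    root = ET.Element("StopMonitoring", {"stop": stop_name})
    for line, planned_time, delay_minutes in departures:
        departure = ET.SubElement(root, "Departure", {"line": line})
        ET.SubElement(departure, "PlannedTime").text = planned_time
        ET.SubElement(departure, "DelayMinutes").text = str(delay_minutes)
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")
```

A third party developer could parse the returned text with ET.fromstring and read the delay of a departure via the path "Departure/DelayMinutes".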
A REST interface can be built on different programming languages, databases and services.
Since the systems that are being used by the different data providers are unknown to us
some assumptions have to be made. We assume that the data providers want high flexibility
and extensibility in the programming language. Furthermore, they want low implementation
and maintenance cost. Finally, they want the interface to be compatible with the wishes of
the third party developers.
Taking into account these requirements the interface will be built in Python. Python is
a multi-paradigm language allowing programmers to incorporate different styles of coding.
Python is a stable language that is provided natively in many Linux distributions and works
flawlessly with Oracle web servers. Many large organizations like Google, ABN-AMRO,
CERN and NASA use Python for their interfaces.
Depending on the relation with the data provider (either local caching or direct API) a
database is required. The interface will be built on an Oracle 11g
database. The database can be manipulated using Structured Query Language (SQL), which
is an international standard for interaction with relational databases.
The interface will deliver data through web services. Once a user has registered for an API key
the services can be used. We split the API for the rail and road network into two separate
APIs for redundancy. We believe this redundancy is required because if the system were
a single API, a failure would result in no transportation data whatsoever. For the
public transport data the following categories of service calls to the API can be defined:
1. Planning Services: the planning service category contains several planning and
decision services. These services are used to determine optimal routes based on various
parameters. The most important services are the Planned Timetable Service, which
returns the current timetable, and the Estimated Timetable Service, which also takes into
account the actual state of the network and adjusts the planning accordingly.
2. Monitoring Services: the monitoring services category contains several network
monitoring services. The goal of these services is to determine the current state of the
networks and vehicles. The exception monitoring service provides information on
network exceptions like the failure of turnpikes. The stop monitoring service provides
information on the stations and platforms. The vehicle monitoring service provides
information on the location of individual vehicles. Finally, the network and connection
monitoring service provides meta-information on the state of the network.
3. Other Services: the other services category contains services that relate to pricing,
messaging and interaction with the network operator.
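The difference between the Planned Timetable Service and the Estimated Timetable Service can be sketched as follows. This is a minimal illustration under our own assumptions: the timetable is reduced to one planned departure per trip, delays are reported in whole minutes, and the function name estimated_timetable is hypothetical.

```python
from datetime import datetime, timedelta


def estimated_timetable(planned, delays_min):
    """Shift each planned departure by the reported delay (in minutes).

    planned:    dict mapping a trip id to its planned departure (datetime)
    delays_min: dict mapping a trip id to its current delay; trips that
                are not listed are assumed to run on time
    """
    return {
        trip: departure + timedelta(minutes=delays_min.get(trip, 0))
        for trip, departure in planned.items()
    }
```

The planned timetable is returned unchanged when no delays are known, so a client can call the estimated service in all cases.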
For the road network data the following categories of service calls to the API can be
defined:
1. Planning Services: the planning service category contains two services that can
return the delays on specific sections of road. Furthermore, the estimated capacity
service returns the probability of a capacity shortage on a certain section of road based
on real time measurement and statistical data.
2. Monitoring Services: the monitoring services category contains several network
monitoring services. The goal of these services is to determine the current state of the
network and connections. Several different services report on planned maintenance,
incidents, connections etc.
3. Map and Network Services: the map and network category contains services
returning static data on the road network. Several services provide a download of the latest
version of the road vector map, static information on junctions and exits and static
information on road facilities and signs.
4. Other Services: the other services category contains services that relate to pricing,
messaging and interaction with the network operator. Furthermore, it provides video streams
and data from weather stations at the road side.
A more extensive analysis of the services and the design can be found in the appendix.
5.5 Hardware
The data warehouse will have to run on a solid physical infrastructure. We will present
some recommendations on the hardware of the data warehouse. We will have to take into
account the scalability, parallel processing capabilities, database management / hardware
combination and cost effectiveness of the hardware environment. Based on the expected
usage of the data warehouse we can expect that the system will sometimes require a high
peak capacity. For example, when major malfunctions of the public transport system occur
the expected number of API requests per minute can triple. We cannot plan for these types of
outages, so our hardware will have to be able to cope with these peak loads. Furthermore, since high
volumes of API requests are performed on the system, parallel processing support could
increase reliability and speed. Finally, it is important that the software and operating systems
used match the database management tool that we selected.
The goal of this recommendation is to find a solution that has a high reliability and
is cost-efficient. We recommend the use of cloud oriented hardware. In a cloud server
setup virtual server capacity is rented from a cloud infrastructure provider like Amazon.
The advantage of cloud operated services is that they can scale elastically with the
end-user demand. Furthermore, cloud infrastructure providers have preconfigured virtual servers
readily available for use. This will reduce the cost for maintenance personnel significantly.
A possible specification for this hardware could be:
Amazon Elastic Compute Cloud (Amazon EC2)¹⁷
Servers: High-Memory Double Extra Large Instance with 34.2 GB of memory, 13 EC2
Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of local
instance storage, 64-bit platform. This setup allows for high transaction volumes.
Operating System: Oracle Enterprise Linux
Database System: Oracle Database 11g
Application Server (running Python): Oracle WebLogic Server
Service Packages: Amazon Elastic Block Store, Elastic IP Addresses, Amazon
Virtual Private Cloud, Amazon CloudWatch, Auto Scaling, Elastic Load Balancing
5.6 Qualitative Aspects
The final design specifications for this data warehouse have a non-functional nature. We have
investigated the performance aspects of the database based on the interviews. For the geo-spatial
data we can expect 5,000-10,000 requests / min. For the public transport data we
expect 500 planning requests, which we estimate will cause 5,000 requests / min. We were
unable to retrieve the expected number of requests for the road network. We estimate this
number to be 5,000 requests / min. The total number of requests that should be handled
by the data warehouse is therefore up to 20,000 API requests per minute.
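The arithmetic behind this total can be written out as a short Python sketch. The per-type figures are the estimates given above. Applying the tripling of requests mentioned in Section 5.5 to the total load is our own assumption, made only to indicate the order of magnitude of a peak.

```python
# Estimated load per data type in API requests per minute (low, high),
# taken from the figures in this section.
ESTIMATES = {
    "geo-spatial": (5000, 10000),
    "public transport": (5000, 5000),
    "road network": (5000, 5000),
}

# Assumption: the tripling during major malfunctions (Section 5.5)
# applies to the total load, not only to public transport requests.
PEAK_FACTOR = 3


def total_load(estimates):
    """Sum the lower and upper estimates over all data types."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


low, high = total_load(ESTIMATES)
peak_per_minute = high * PEAK_FACTOR
peak_per_second = peak_per_minute / 60
```

Under these assumptions the normal load lies between 15,000 and 20,000 requests per minute, and a peak would correspond to 60,000 requests per minute or 1,000 requests per second.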
The update frequency of the data depends on the specific type of data. The vector map
has an update speed of twice a year, while the locat