Fine-resolution population mapping using OpenStreetMap points-of-interest

This article was downloaded by: [141.212.109.170] On: 16 December 2014, At: 08:37 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Click for updates International Journal of Geographical Information Science Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tgis20 Fine-resolution population mapping using OpenStreetMap points-of-interest Mohamed Bakillah ab , Steve Liang b , Amin Mobasheri a , Jamal Jokar Arsanjani a & Alexander Zipf a a GIScience Research Group, Institute of Geography, University of Heidelberg, Heidelberg, Germany b Department of Geomatics Engineering, University of Calgary, Calgary, Canada Published online: 24 Apr 2014. To cite this article: Mohamed Bakillah, Steve Liang, Amin Mobasheri, Jamal Jokar Arsanjani & Alexander Zipf (2014) Fine-resolution population mapping using OpenStreetMap points-of- interest, International Journal of Geographical Information Science, 28:9, 1940-1963, DOI: 10.1080/13658816.2014.909045 To link to this article: http://dx.doi.org/10.1080/13658816.2014.909045 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. 
Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms &



Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions


Fine-resolution population mapping using OpenStreetMap points-of-interest

Mohamed Bakillah a,b*, Steve Liang b, Amin Mobasheri a, Jamal Jokar Arsanjani a and Alexander Zipf a

a GIScience Research Group, Institute of Geography, University of Heidelberg, Heidelberg, Germany; b Department of Geomatics Engineering, University of Calgary, Calgary, Canada

(Received 31 October 2013; final version received 18 March 2014)

Data on population at building level is required for various purposes. However, to protect privacy, government population data is aggregated. Population estimates at finer scales can be obtained through areal interpolation, a process where data from a first spatial unit system is transferred to another system. Areal interpolation can be conducted with ancillary data that guide the redistribution of population. For population estimation at the building level, common ancillary data include three-dimensional data on buildings, obtained through costly processes such as LiDAR. Meanwhile, volunteered geographic information (VGI) is emerging as a new category of data and is already used for purposes related to urban management. The objective of this paper is to present an alternative approach for building level areal interpolation that uses VGI as ancillary data. The proposed method integrates existing interpolation techniques, i.e., multi-class dasymetric mapping and interpolation by surface volume integration; data on building footprints and points-of-interest (POIs) extracted from OpenStreetMap (OSM) are used to refine population estimates at building level. A case study was conducted for the city of Hamburg and the results were compared using different types of POIs. The results suggest that VGI can be used to accurately estimate population distribution, but that further research is needed to understand how POIs can reveal population distribution patterns.

Keywords: areal interpolation; OpenStreetMap; points-of-interest; population estimation; volunteered geographic information

1. Introduction

Numerous activities, such as business development, demographic studies, disaster prevention, and urban planning, require data on population at a fine resolution, including at building level. However, to protect privacy, population statistics provided by national census are aggregated and therefore not available at building level (Ural et al. 2011, Langford 2013, Sridharan and Qiu 2013). Estimates at the building level must be computed by disaggregating census population data with appropriate techniques.

Population estimates can be obtained through areal interpolation. Areal interpolation is the process of transferring data from a first spatial unit system (called source units, usually the census units) to another spatial unit system (called target units, e.g., cells of a raster grid) (Sridharan and Qiu 2013). Areal interpolation can be conducted using ancillary data to guide the redistribution of population, based on the observation that population distribution is often correlated with other spatial variables (such as land use/land cover, LULC). Still, the quality and the appropriateness of the ancillary data used influence the accuracy of the estimation. Common types of ancillary data used for areal interpolation include LULC data derived from satellite imagery, official cadastral or topographic databases, official vector street network data (considering that people commonly reside along residential streets), or Light Detection and Ranging (LiDAR) data.

*Corresponding author. Email: [email protected]

International Journal of Geographical Information Science, 2014, Vol. 28, No. 9, 1940–1963, http://dx.doi.org/10.1080/13658816.2014.909045

© 2014 Taylor & Francis

Methods employed to estimate population at building level commonly use LiDAR data to estimate building volume or height (Lu et al. 2010, Sridharan and Qiu 2013). Building surface can also be used to assess population distribution; however, the lack of information on the volume of buildings may generate underestimation or overestimation of population (Harvey 2002), unless the height of buildings is relatively homogeneous (Lu et al. 2010). However, LiDAR data is costly. Alternative methods that use affordable data are needed.

This paper proposes a method for areal interpolation at building level that uses VGI. To the best of our knowledge, this is the first attempt to use VGI for this task. VGI is produced by volunteers through Web 2.0 applications rather than by traditional data producers (Goodchild 2007). VGI is criticized due to potential data quality issues (Bishr and Mantelas 2008, Flanagin and Metzger 2008, Jackson et al. 2013), but it can also extend existing data sets with information that pertains to a level of detail that goes beyond the capacity of ‘official’ producers (Tulloch 2008). VGI applications allow collecting local knowledge, which usually cannot be gathered using traditional data collection processes (Goodchild 2007). Consequently, numerous researchers in the field of GIScience consider VGI as a valuable data source that can complement official and commercial data (Goodchild 2007, De Longueville et al. 2010, Haklay 2010, Zielstra and Zipf 2010, Mooney and Corcoran 2011). For instance, Song and Sun (2010) have reported that the usage of VGI in urban management has increased. Coleman et al. (2010) report that some state governments in Australia and Germany, as well as the US Geological Survey, have employed VGI in their mapping programs. Commercial data providers such as NAVTEQ and TomTom have used VGI to identify updates required for their databases (Coleman 2010). One of the most well-known examples of a VGI project is OpenStreetMap (OSM), a collaborative mapping project that enables volunteers to produce and share maps (Neis et al. 2011). OSM has already been used for various purposes, including the automatic derivation of three-dimensional CityGML models (Goetz and Zipf 2012), road-based travel recommendation systems (Sun et al. in press), and mapping land-use patterns (Jokar Arsanjani et al. 2013). In OSM, users can describe features – such as roads, water bodies, and points-of-interest (POIs) – using ‘tags’ (Goetz et al. 2012). A POI is a feature on the map that is given specific point coordinates (e.g., schools, restaurants, churches, etc.) and whose location may be useful to know depending on the context of the user (especially in tourism and for routing services). Because some types of place can be correlated with a higher density of population (as argued in Zhang and Qiu (2011)), POIs can be useful to refine population estimates provided that a correlation can be established.

The proposed method is an alternative solution to estimate population at the building level that relies solely on open data and VGI. The principle of the approach is to combine three existing methods: multi-class dasymetric mapping (conducted with Urban Atlas LULC data) is used to obtain population estimates for relatively large zones; then, interpolation by surface volume integration (using OSM POIs to determine points of high population density) is employed to uncover variations of population density within these zones; finally, binary dasymetric mapping (conducted with OSM building layer data) is used to redistribute population within buildings.


The originality of the proposed approach lies in investigating the use of VGI to obtain population estimates. The hypothesis behind this method is that OSM data will be of sufficient quality to support this task. Of note is that Strunck (2010), who has conducted experiments on the growth of different OSM POI categories in Germany, observed that OSM had about twice as many POIs as TomTom MultiNet data, a commercial data source. However, one can still expect that VGI quality issues may affect the accuracy of population estimates. Concerning POIs, specific concerns include spatial and thematic accuracy. In terms of spatial accuracy, VGI contributors are not formally required to generate data with an established level of spatial precision, and may have an inaccurate or incomplete perception of the geographic phenomenon they describe (Bakillah et al. 2013a). Jackson et al. (2013) note that OSM contributors are subjected to no constraints concerning the placement of a POI, and may rely on various devices to do so (GPS, smartphone, etc.); therefore, one can expect significant variations in positioning. In terms of thematic accuracy, Scheider et al. (2011) observed that the terms used by contributors to describe geographic features can be ambiguous. Still, in OSM, contributors are encouraged to use terms organized into a folksonomy, which contributes to the reduction of naming heterogeneities. Despite such issues, successful usages of VGI POIs have been reported, such as Rodrigues et al. (2013) who have relied on Yahoo POIs to estimate disaggregated employment size.

The reliability of the results obtained with the proposed approach was investigated with a case study in Hamburg, Germany, using different types of POIs. The proposed method was also compared with other typical areal interpolation approaches. The results show that while the choice and quality of POIs have an impact on estimates, using VGI to perform areal interpolation is an interesting avenue.

This paper is organized as follows: related work is provided in the next section. Section 3 presents the data used for the case study. The approach is explained in Section 4. Section 5 presents the experiments and a discussion of results. Section 6 concludes the paper and outlines future work.

2. Related work

Areal weighting is the simplest type of areal interpolation method and it requires no ancillary data (Markoff and Shapiro 1973, Goodchild and Lam 1980, Lam 1983). It is also referred to as area weighted interpolation (Lwin 2010) or overlay operation based on geometric properties (Wu et al. 2005). In this method, population estimates in target zones are computed based only on the proportion of each source zone that overlaps with the target zone (Goodchild and Lam 1980). It is based on the assumption that population is uniformly distributed within the source zone. However, source zones in census rarely have homogeneous population distributions. Therefore, population estimates based on areal weighting are likely to differ from the reality because of this assumption. Nevertheless, in the absence of ancillary data, the areal weighting method may be useful. An improved version of areal weighting, called target-density weighting (TDW), was proposed by Schroeder (2007). TDW is a method employed to interpolate population data from one set of census units to another (data from more than one census is required). TDW is based on the assumption that if people lived in a given region when a first census was conducted, these people are more likely to live there too (in a proportional manner) when a subsequent census is conducted. TDW was extended into the cascading density weighting (CDW) method to produce complete population time series; CDW repeats the TDW backwards through a series of census (Schroeder 2010). Another example of an areal interpolation method that does not require ancillary data is Tobler’s pycnophylactic interpolation (Tobler 1979). This approach creates a smooth population density distribution for any location (x, y) by minimizing the sum of squares of partial derivatives. The density distribution respects Tobler’s pycnophylactic property, i.e., an important property where the sum of people reported for each source zone is preserved during the interpolation process.
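As an illustration of the areal weighting rule described above, the following is a minimal sketch, not taken from the paper. Zones are simplified to axis-aligned rectangles (an assumption made only to keep the overlap computation short; real census zones are arbitrary polygons), and the names `Rect`, `overlap_area`, and `areal_weighting` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle standing in for a zone (illustrative assumption)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def area(r: Rect) -> float:
    return max(0.0, r.xmax - r.xmin) * max(0.0, r.ymax - r.ymin)


def overlap_area(a: Rect, b: Rect) -> float:
    # Width and height of the intersection; negative values mean no overlap.
    w = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    h = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return max(0.0, w) * max(0.0, h)


def areal_weighting(sources: List[Tuple[Rect, float]], target: Rect) -> float:
    """Each source zone contributes its population times the share of its
    area that falls inside the target zone (uniform-density assumption)."""
    return sum(pop * overlap_area(zone, target) / area(zone)
               for zone, pop in sources)


# Two source zones of 100 area units each; the target straddles both.
sources = [(Rect(0, 0, 10, 10), 1000.0), (Rect(10, 0, 20, 10), 400.0)]
target = Rect(5, 0, 15, 10)
estimate = areal_weighting(sources, target)  # half of each zone: 500 + 200
```

Because every source zone hands out exactly the fraction of its area that overlaps the target, summing the estimates over a set of targets that tile the source zones returns the original totals.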

Dasymetric mapping (also referred to as dasymetric modeling) is a technique where additional data (called ancillary data) is employed to guide the redistribution of population data at a finer level of resolution (Semenov-Tian-Shansky 1928, Wright 1936). Binary dasymetric mapping (also referred to as filtered areal weighting) is the simplest type of dasymetric mapping method (Eicher and Brewer 2001). It is called binary because source zones are subdivided into populated and unpopulated areas. The areal weighting method is then applied within the populated areas of source zones. Common sources used to identify populated areas include:

● LULC data derived from satellite imagery (Mennis 2003, Reibel and Agrawal 2007, Cromley et al. 2012);
● county address points and parcels (Tapp 2010);
● vector street network data (Xie 1995, Hawley and Moellering 2005, Zandbergen and Ignizio 2010);
● vector or raster maps provided by official mapping agencies (Langford 2013), including topographic databases (Wu et al. 2008, Lwin and Murayama 2010);
● Google Map images (Yang et al. 2012).

The main drawback of binary dasymetric mapping is that it relies on the assumption that the population is uniformly distributed within the populated areas (Maantay et al. 2007).
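The binary rule can be sketched in a few lines. This is an illustrative example rather than the authors' implementation: it assumes the source zone has already been rasterized into equal-area cells, and the function name `binary_dasymetric` is hypothetical.

```python
from typing import List


def binary_dasymetric(zone_population: float,
                      populated_mask: List[bool]) -> List[float]:
    """Spread a zone's population uniformly over its populated cells only.

    Assumes equal-area cells; unpopulated cells (e.g., water) receive zero.
    """
    populated = sum(populated_mask)
    if populated == 0:
        raise ValueError("zone has no populated cells")
    share = zone_population / populated
    return [share if is_populated else 0.0 for is_populated in populated_mask]


# A zone of five cells, two of which are masked out as unpopulated.
cells = binary_dasymetric(900.0, [True, False, True, True, False])
```

The uniform share inside the populated mask is exactly the assumption the paragraph above identifies as the method's main drawback.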

The multi-class dasymetric mapping method partly addresses this problem. The territory is partitioned into several residential classes (e.g., high density, medium density, low density) associated with different population densities (Eicher and Brewer 2001). To estimate the population density associated with each residential class, empirical sampling of population densities (Mennis 2003) and multivariate linear regression estimated with ordinary least squares (Langford 2006) were proposed. Data sources used for multi-class dasymetric mapping include:

● Cadastral databases (Maantay et al. 2007);
● Topographic databases, land-use zoning, and transportation network (Su et al. 2010);
● Remote-sensing data (Liu et al. 2008).

Multi-class dasymetric mapping may theoretically be more accurate than binary dasymetric mapping; however, it is similarly based on the assumption that population is evenly distributed within an area covered by the same residential class. Lo (2008) has demonstrated that the relation between population density and the type of LULC category varies (referred to as the ‘problem of related variables’). This finding suggests that linear regression alone is not appropriate to estimate population distribution, which was also emphasized by other authors (Griffith and Can 1996). Indeed, Wu and Murray (2005) explain that methods based on regression analysis assume that population density is spatially independent. Still, identifying a meaningful statistical relation between population density and other variables remains a challenge (Mennis 2009). To address this issue, methods that integrate spatial autocorrelation have been proposed. For example, Wu and Murray (2005) propose a co-kriging method for estimating population density using census data and impervious surface proportion as secondary variable to refine population estimates. Other authors suggest area-to-point kriging to disaggregate residual population density values obtained with regression (Kyriakidis 2004, Liu et al. 2008). Griffith (2013) pointed out that kriging and co-kriging involve variable transformation, which may introduce errors in final population estimates. He proposes a geographic areal unit imputation Poisson model specification that uses remotely sensed images as well as spatial and temporal autocorrelation to circumvent this issue. The Poisson model has also been used by Leyk et al. (2013b) to represent relations between variables. Another approach to address the issue of related variables is maximum entropy dasymetric modeling, where the maximum entropy principle is used to model the statistical relation between population density and other variables, such as attributes of households (Leyk et al. 2013a). The expectation–maximization (EM) algorithm, which was first introduced by Dempster et al. (1977), iteratively applies a feedback loop to refine population density estimates; it is also used to perform areal interpolation with related variables (Flowerdew and Green 1994). More recently, it was coupled with geographically weighted regression (GWR) to allow the densities of different categories of control zones (e.g., commercial, residential) to have non-constant ratios among the different source units (Schroeder and Van Riper 2013).

Interpolation by surface volume integration is another type of technique that relies on a set of points that are considered as places where population density is expected to be high, compared to the surroundings. These points are referred to as ‘control points’. Zhang and Qiu (2011) used schools as control points. It is assumed that the population decreases according to a decaying function of the radial distance from control points. Then, estimates are smoothed at intersections between two cells. The method proposed in this paper investigates the use of this technique with VGI.
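To make the idea of a radial decay around control points concrete, here is a rough sketch. The text above does not specify the decay function, so an exponential decay with an arbitrary `scale` parameter is assumed purely for illustration; the smoothing step at cell intersections is not reproduced, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def decay_weights(cell_centers: List[Point],
                  control_points: List[Point],
                  scale: float = 500.0) -> List[float]:
    """Weight of each cell from its distance to the nearest control point.

    Exponential decay is an assumption; any decreasing function of the
    radial distance could be substituted.
    """
    weights = []
    for cx, cy in cell_centers:
        nearest = min(math.hypot(cx - px, cy - py) for px, py in control_points)
        weights.append(math.exp(-nearest / scale))
    return weights


def distribute(zone_population: float, weights: List[float]) -> List[float]:
    """Rescale the weights so that the cells sum to the zone population."""
    total = sum(weights)
    return [zone_population * w / total for w in weights]


# Three cells at increasing distance (in metres) from one control point.
cell_centers = [(0.0, 0.0), (500.0, 0.0), (1000.0, 0.0)]
control_points = [(0.0, 0.0)]
pops = distribute(1000.0, decay_weights(cell_centers, control_points))
```

Rescaling by the sum of weights keeps the zone total unchanged, in line with the pycnophylactic property mentioned earlier in this section.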

Also worth mentioning, while not specifically addressed in this paper, is the more recent research on uncertainty in dasymetric mapping. Approaches that measure the uncertainty of population estimates include those of Schroeder (2007), Kyriakidis (2004), Wu and Murray (2005), and Liu et al. (2008). Nagle et al. (2014) explain that these approaches do quantify the uncertainty of population estimates, but do not incorporate uncertainty associated with ancillary data. To address this issue, they propose the penalized maximum entropy dasymetric model (P–MEDM), which is an extension of the Maximum Entropy (ME) technique that explicitly models the uncertainty of input data and propagates it through the population estimation process.

The above-mentioned methods provide estimates for target units that include several buildings, but not at building level. Some approaches have been proposed to estimate population using three-dimensional data on buildings. Lu et al. (2010) have used regression analysis to model the relation between population and the area or volume of the buildings obtained with LiDAR data. Sridharan and Qiu (2013) have developed a similar LiDAR-based approach. Ural et al. (2011) have studied population estimation at the building level using aerial images, digital terrain, and surface models to determine building footprints and heights. Then, city zoning maps were used to classify buildings as residential and non-residential. These approaches all rely on costly data to evaluate buildings’ volume. The method proposed in this paper circumvents this obstacle by using free data (VGI). To the best of our knowledge, VGI has not been used yet for population estimation. Langford (2013) recently claimed to have conducted the first experimentation of dasymetric mapping using open data (in this case, provided by UK government), but mentioned that while being an interesting alternative, OSM has not been used. Similarly, Lin et al.’s (2013) evaluation of publicly available data for areal interpolation focuses on remote-sensing LULC data produced by official sources. Aubrecht et al. (2011) have explored the potential of VGI for modeling the spatiotemporal characteristics of urban population, but not to produce population estimates.

3. Study area and data

The study area is the city of Hamburg, Germany. Hamburg is the second largest city in Germany. According to the 2011 census, this city had a total population of 1,746,813 and a total area of 75,525 ha at that time, with an average density of 23 people/ha. The census data is provided for 7 districts composed of 106 census blocks with varying sizes, ranging from about 71 ha to 14,758 ha, occupied by approximately 3–412,000 people. Population for each census block (subunit of districts) is also available. The Hamburg city region mainly comprises medium to dense urban fabric north of the River Elbe, with agriculture and forest areas at the periphery. The center of Hamburg is made of a mixture of urban fabric of varying density, green urban areas, sport and leisure activity areas, and an important portion occupied by port facilities.

Ancillary data used in this study include LULC data from Urban Atlas, the European LULC data for large urban zones that have more than 100,000 inhabitants (http://www.eea.europa.eu/data-and-maps/data/urban-atlas) (Figures 1 and 2).

Figure 1. Classified LULC of Hamburg urban zone (from Urban Atlas).


Urban Atlas data is also from 2011. The LULC categories used by Urban Atlas do not identify residential areas. Instead, areas that are more likely to contain the majority of residential buildings are in the continuous and discontinuous urban fabric categories, which also include commercial and industrial types of building. Meanwhile, other categories may also contain residential buildings. Therefore, it was decided to consider all categories of LULC as habitable areas, except for the water category.

To redistribute population to individual buildings, data on areas and location of buildings is needed. For this purpose, the OSM building layer was used to extract building footprints and coordinates. Because of the collaborative nature of OSM, the data on buildings has been created at different dates. While, in OSM, the type of building can be identified with attributes, contributors do not always provide a value for this attribute, so one cannot rely on this type of data to identify building use. POIs, which are used to identify high-density areas, were also extracted from OSM (and therefore created at different periods of time). OSM POIs describe geo-located features (point, line, or polygon). The type of place is specified through tags, which are made of a (key, value) pair, where the ‘key’ is a category of features and the ‘value’ is a subcategory of features. An example of a tag describing a POI is (amenity, postbox). In OSM, contributors are encouraged to use the vocabulary suggested in OSM Feature Wiki (http://wiki.openstreetmap.org/wiki/Map_Features) when adding POIs. POIs created by a contributor can be edited by other contributors.

4. Proposed method: areal interpolation at building level using VGI POIs

The flowchart of the method is illustrated in Figure 3. Ancillary data is indicated on the right side, while steps of the method appear on the left side. The principle of the method is that the population is gradually disaggregated from larger to smaller spatial units:

(1) Population/district (as given by census data)
(2) Population/area covered by a given LULC category
(3) Population/cell
(4) Population/building

Figure 2. Categories of LULC (from Urban Atlas).

Figure 3. Flowchart of the method.

To disaggregate the population/district to population/area covered by a given LULC category, multi-class dasymetric mapping based on LULC data from Urban Atlas was used. The main contribution of the method is in the next step, where population/area covered by a given LULC category is disaggregated to population/cell. Here, the cells are basic spatial units created by generating a grid over the territory. To obtain population/cell, a new interpolation by surface volume integration method is deployed; it estimates the population density per cell based on the spatial distribution of relevant POIs. The POIs retrieved from OSM are classified by category using the ID3 unpruned machine learning decision tree. The next steps are based on the hypothesis that some categories of POIs are associated with a higher density of population. These POI categories are selected based on the correlation between the density of a given category of POI and the density of population. Then, a quadtree procedure is applied to localize POIs and infer the presence of control points. Population in cells is computed with a decaying function centered on control points. Finally, the OSM building layer data is used to compute population per building based on their surface.
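The last step of the overview, computing population per building from building surface, can be sketched as a proportional split. This is a simplified illustration that assumes each building lies entirely within one cell and that footprint area is the only weight; the function name `allocate_to_buildings` is hypothetical.

```python
from typing import List


def allocate_to_buildings(cell_population: float,
                          footprint_areas: List[float]) -> List[float]:
    """Split a cell's population among its buildings by footprint area."""
    total_area = sum(footprint_areas)
    if total_area == 0:
        # No usable footprints in this cell: nothing can be allocated.
        return [0.0 for _ in footprint_areas]
    return [cell_population * a / total_area for a in footprint_areas]


# A cell holding 120 people and three buildings (footprints in square metres).
building_pops = allocate_to_buildings(120.0, [100.0, 300.0, 200.0])
```

Since only footprints are used, a tall building and a low building with the same footprint receive the same share, which is the limitation of surface-based estimates noted in the introduction.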

4.1. Multi-class dasymetric mapping with Urban Atlas LULC data

Let u be a given LULC category and d a district of the city. Consider that the territory is partitioned into a grid of cells. The objective of this phase is to estimate, for each district d, the average number of people that should be allotted to a cell covered by a given LULC category u. This amount is denoted with POP_ud. The value of POP_ud is unique to each district. This is because in a given district, a category of LULC does not necessarily have the same population density as in another district. For example, the areas classified as ‘discontinuous medium density urban fabric’ in the north of Hamburg have a lower population density than the same type of area in the center of Hamburg. The procedure to compute POP_ud using multi-class dasymetric mapping is described in Mennis (2003). The grid resolution (500 m²) was chosen so that the majority of cells are covered with only one category of LULC. The average population density values associated with each category of LULC were derived from samples obtained by overlapping LULC data with census data, using the census blocks in which a single LULC category covers at least 85% of the area (Mennis 2003). The estimates generated by the multi-class dasymetric mapping assume that the population distribution is uniform within the same LULC category in a district. In the following, extra steps are taken to further refine the population estimates.
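The sampling and redistribution logic of this phase can be illustrated as follows. This sketch only loosely follows the sampling idea attributed to Mennis (2003) and is not the exact procedure: it assumes equal-area cells, uses the 85% coverage threshold from the text, and relies on hypothetical names and simplified class labels (`continuous`, `discontinuous`, `water`).

```python
from collections import defaultdict
from typing import Dict, List


def sample_class_densities(blocks: List[dict],
                           threshold: float = 0.85) -> Dict[str, float]:
    """Density (people/ha) per LULC class, sampled from census blocks in
    which a single class covers at least `threshold` of the area."""
    pop: Dict[str, float] = defaultdict(float)
    area: Dict[str, float] = defaultdict(float)
    for block in blocks:
        for lulc, fraction in block["cover"].items():
            if fraction >= threshold:
                pop[lulc] += block["population"]
                area[lulc] += block["area_ha"]
    return {lulc: pop[lulc] / area[lulc] for lulc in pop}


def population_per_cell(district_population: float,
                        cell_classes: List[str],
                        densities: Dict[str, float]) -> List[float]:
    """Share a district's population among equal-area cells in proportion
    to the sampled density of each cell's LULC class."""
    weights = [densities.get(c, 0.0) for c in cell_classes]
    total = sum(weights)
    if total == 0:
        raise ValueError("no cell has a class with a sampled density")
    return [district_population * w / total for w in weights]


blocks = [
    {"population": 8000, "area_ha": 100, "cover": {"continuous": 0.9, "green": 0.1}},
    {"population": 2000, "area_ha": 100, "cover": {"discontinuous": 0.95, "green": 0.05}},
    # Mixed block: no class reaches 85%, so it is not used as a sample.
    {"population": 5000, "area_ha": 100, "cover": {"continuous": 0.5, "discontinuous": 0.5}},
]
densities = sample_class_densities(blocks)
cells = population_per_cell(1000.0, ["continuous", "discontinuous", "water"], densities)
```

Normalizing within each district is what makes the resulting per-cell value specific to the district, as the text requires for POP_ud.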

4.2. Using OSM POIs for interpolation by surface volume integration

The objective of this phase is to localize control points. Similar to other types of ancillary data such as LULC, control points are places whose spatial distribution is related to that of the population (Zhang and Qiu 2011). For example, there are some types of place, such as schools, supermarkets, and churches, where population density is usually high compared to the surroundings. This assumption can be tested by measuring the correlation between the density of a category of place and the density of population. Once control points are determined, it is assumed that the population decreases according to a given decaying function of the radial distance from control points. In this paper, a new method to identify control points is introduced. This method is based on POIs and assumes that some categories of POIs are correlated with population density. First, POIs must be classified to deal with naming heterogeneities. For this procedure, the ID3 machine learning decision tree algorithm is used. A quadtree procedure is used to deal with high volumes of POIs, and control points are identified from the spatial distribution of these POIs. The details of the procedure to derive control points from POIs are described below.

4.2.1. Retrieving POIs

First, OSM POIs in the city of Hamburg are gathered through OSM’s extended API. In OSM, contributors are encouraged to use the vocabulary suggested in OSM Feature Wiki (http://wiki.openstreetmap.org/wiki/Map_Features) when creating POIs, but they do not have to. The resulting naming heterogeneity complicates the retrieval of POIs that fall within a given category.

POIs can be associated with nodes, ways, or polygons (combinations of ways). Since POIs are used to identify control points, only those that can be reduced to points are used (excluding POIs associated with roads). Polygon features (e.g., amenities) were replaced by the centroid of the polygon. The initial data set obtained for the city of Hamburg contains 31,593 POIs. Only the subset of POIs that belong to a category that makes up a significant percentage of the whole set of POIs (0.01%) was retained. Otherwise, some ‘categories’ of POIs would be instantiated by only a few objects, too scarce to be used for identifying control points. The result is a set of 16,964 POIs. Table 1 shows the number of POIs categorized according to the value of the ‘key’.
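The two preprocessing steps above (reducing polygons to centroids and discarding scarce categories) can be sketched as follows. The function names and the `(category, location)` tuple layout are hypothetical, and the centroid formula assumes planar (projected) coordinates and a simple, non-self-intersecting polygon.

```python
from collections import Counter


def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross                  # twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def retain_frequent(pois, min_share):
    """Keep POIs whose category holds at least `min_share` of all POIs.

    pois is a list of (category, location) tuples; 0.01% corresponds to
    min_share = 0.0001.
    """
    counts = Counter(category for category, _ in pois)
    total = len(pois)
    keep = {c for c, n in counts.items() if n / total >= min_share}
    return [poi for poi in pois if poi[0] in keep]
```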

4.2.2. Classifying POIs

Owing to the above-mentioned naming heterogeneity, POIs need to be classified according to predefined categories. POIs that were not specified using the OSM Map Features vocabulary were classified using an unpruned ID3 decision tree (see Bishop 2006). The ID3 unpruned decision tree is a machine learning algorithm that relies on a set of training data to classify objects according to predefined categories. The ID3 unpruned decision tree was reported to perform well at classifying POIs in Rodrigues et al. (2012). For the decision tree, the categorization of OSM Map Features was used, since a large number of POIs already use this categorization. To determine whether a POI belongs to a given category, a lexical and syntactic matcher that compares the meaning of strings, which is described in Bakillah et al. (2013b), was employed. The training data set was built using the matcher and manually validating a random subset of the matches.

Table 1. Main categories of POIs retrieved for Hamburg.

| Key; value | Quantity | Key; value | Quantity |
|---|---|---|---|
| Public transport; station (includes subway entrances and bus stops) | 4705 | Amenity; recycling | 806 |
| Highway; traffic signals | 3882 | Power; switch | 383 |
| Historic; memorial | 3821 | Amenity; school, university, kindergarten | 338 |
| Amenity; restaurant, fast-food, & bars | 2683 | Amenity; fuel | 195 |
| Highway; crossing | 2177 | Highway; Motorway_link | 100 |
| Amenity; bench | 1910 | Man-made; elevator | 67 |
| Amenity; parking | 1363 | Amenity; Hospital, doctors | 58 |
| Amenity; telephone | 1273 | Tourism; museum | 41 |
| Amenity; Postbox | 1266 | Highway; speed_camera | 38 |
| Highway; Turning-circle | 1169 | Amenity; studio | 30 |
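ID3 grows its tree by choosing, at each node, the attribute with the highest information gain. The sketch below shows only that split criterion, not the authors' classifier; the attribute names in the usage example (such as `has_school`) are hypothetical stand-ins for whatever features the matcher produces.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute that ID3 would split on at this node."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

For example, with four training POIs where a hypothetical `has_school` flag separates the ‘education’ label perfectly and `has_name` does not, `best_attribute` selects `has_school`.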

4.2.3. Selecting ‘high-density indicator’ POIs

POIs that are expected to be correlated with high population density are called ‘high-density indicator’ (HDI) POIs. To identify HDI POIs, the correlation between the occurrences of a category of POI (e.g., amenity = bus station) and the population density in a census block was investigated with Spearman’s correlation coefficient, a coefficient that measures monotonic correlation, whether linear or nonlinear. The equation for Spearman’s correlation coefficient is as follows:

$$
\rho = \frac{\sum_{i=1}^{n} \left(\mathrm{PD}_i - \overline{\mathrm{PD}}\right)\left(\mathrm{nb\_POI}_i - \overline{\mathrm{nb\_POI}}\right)}{\sqrt{\sum_{i=1}^{n} \left(\mathrm{PD}_i - \overline{\mathrm{PD}}\right)^2 \sum_{i=1}^{n} \left(\mathrm{nb\_POI}_i - \overline{\mathrm{nb\_POI}}\right)^2}} \tag{1}
$$

In this equation, n is the number of census blocks, $\mathrm{PD}_i$ is the population density in census block i, $\overline{\mathrm{PD}}$ is the average population density, $\mathrm{nb\_POI}_i$ is the number of POIs (of a given category) in census block i, and $\overline{\mathrm{nb\_POI}}$ is the average number of POIs of the given category among all census blocks. Table 2 shows Spearman’s correlation coefficient between the occurrences of some categories of POI and population density.
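Equation (1), as printed, has the form of a product-moment correlation; Spearman’s coefficient is commonly obtained by applying that same formula to the ranks of the two variables rather than to their raw values. The sketch below takes that route, with tied values receiving their average rank. The function names are illustrative, and whether the authors ranked the data this way is not stated in the text.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def product_moment(x, y):
    """Right-hand side of Equation (1) for two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def spearman(pop_density, poi_counts):
    """Spearman's coefficient: Equation (1) evaluated on ranks."""
    return product_moment(average_ranks(pop_density), average_ranks(poi_counts))
```

A monotonic but nonlinear relation (for instance, counts that grow with the square of density) yields a coefficient of 1 on ranks while the same formula on raw values stays below 1.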

These results are used to select the types of POI that are considered in the determination of control points. The results of population disaggregation using different categories of HDI POI for determining control points are reported in the experimentation section.

4.2.4. Determining control points with quadtree

The quadtree is used to locate HDI POIs. This procedure is employed to reduce a large volume of POIs to a representative subset that mirrors their density distribution. If a quadtree procedure is not available, other techniques such as clustering algorithms could be used instead. The quadtree is a spatial tree data structure in which each internal node has four children, representing the recursive subdivision of space into nested quadrants (Samet 1984). The subdivision does not have to be even. Rather, it adapts to the density of the spatial objects we intend to capture: a quadrant is subdivided into four new quadrants (and four new nodes are created in the tree structure) as long as it contains more than one spatial object. If no spatial object is found in a quadrant, this quadrant is no longer divided (Figure 4).

Table 2. Correlation between selected categories of POIs and population density.

| Key; value | Correlation coefficient | Key; value | Correlation coefficient |
|---|---|---|---|
| Public transport; station (includes subway entrances and bus stops) | 0.89 | Amenity; recycling | 0.74 |
| Highway; traffic signals | 0.62 | Power; switch | 0.12 |
| Historic; memorial | 0.55 | Amenity; school, university, kindergarten | 0.82 |
| Amenity; restaurant, fast-food & bars | 0.28 | Amenity; fuel | 0.31 |
| Highway; crossing | 0.70 | Highway; Motorway_link | 0.08 |
| Amenity; bench | 0.68 | Man-made; elevator | 0.12 |
| Amenity; parking | 0.29 | Amenity; Hospital, doctors | 0.59 |
| Amenity; telephone | 0.25 | Tourism; museum | 0.20 |
| Amenity; Postbox | 0.80 | Highway; speed_camera | −0.20 |
| Highway; Turning-circle | 0.20 | Amenity; studio | −0.10 |

A variant of the quadtree that was used here is the bucket quadtree. Instead of continuing the subdivision until a quadrant contains only one spatial object, the subdivision continues until a quadrant contains no more than a predefined number of objects. This avoids leaf quadrants that are too small, which in this context would generate too many control points and thereby create an artificially uniform population distribution.
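A bucket quadtree of this kind can be sketched in a few lines. The version below returns only the non-empty leaf buckets, which is all that is needed to derive one representative point per quadrant. The function name, the `(xmin, ymin, xmax, ymax)` bounds layout, and the `max_depth` guard (which stops endless subdivision when many points coincide) are assumptions of this example.

```python
def bucket_quadtree(points, bounds, capacity, max_depth=16):
    """Return the non-empty leaf buckets of a bucket quadtree.

    points   -- list of (x, y) tuples lying inside `bounds`
    bounds   -- (xmin, ymin, xmax, ymax) of the current quadrant
    capacity -- maximum number of points allowed in a leaf
    """
    if not points:
        return []                      # empty quadrant: not divided, no leaf
    if len(points) <= capacity or max_depth == 0:
        return [points]                # small enough: this quadrant is a leaf
    xmin, ymin, xmax, ymax = bounds
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    # Keys are (is_east, is_north) flags for the four child quadrants.
    quadrants = {
        (False, False): (xmin, ymin, xmid, ymid),
        (True, False): (xmid, ymin, xmax, ymid),
        (False, True): (xmin, ymid, xmid, ymax),
        (True, True): (xmid, ymid, xmax, ymax),
    }
    children = {key: [] for key in quadrants}
    for x, y in points:
        children[(x >= xmid, y >= ymid)].append((x, y))
    leaves = []
    for key, subset in children.items():
        leaves.extend(bucket_quadtree(subset, quadrants[key], capacity, max_depth - 1))
    return leaves
```

A larger `capacity` yields fewer, larger leaves and therefore fewer control points; `capacity=1` reproduces the plain point quadtree described earlier.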

Figure 4. Simplified graphical representation of a quadtree.

In this approach, the spatial objects we intend to capture are HDI POIs. The bucket quadtree procedure is applied. The predefined maximum number of objects per quadrant ranged between 1 and 10, depending on the category of POI that was used. Then, to determine the location of control points, a generalization procedure is applied to identify a single point from all HDI POIs located in the quadrant. The type of generalization operation is a midpoint aggregation (Bereuter and Weibel 2010). In the midpoint aggregation operation, spatially and semantically close points (HDI POIs from the same category) are aggregated into a single point, which is the mean center point (midpoint) of the HDI POIs. To illustrate this, in Figure 5, midpoints are identified (circled in blue) in each quadrant, considering train stations as HDI POIs. To find the midpoint MP of a series of points <POI1, POI2, …, POIj, …, POIn> within a quadrant, the following distance is minimized:

$$
\mathrm{DISTANCE\_TO\_MIDPOINT} = \sum_{j=1}^{n} \mathrm{dist}\left(MP, POI_j\right) \tag{2}
$$

The midpoints constitute the control points that are used to compute population density in cells.
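The text describes the midpoint both as the mean center and as the minimizer of Equation (2). For Euclidean distance these are two slightly different points: the arithmetic mean minimizes the sum of squared distances, whereas the minimizer of the unsquared sum in Equation (2) is the geometric median. The sketch below computes both so that either reading can be tried; the geometric median uses Weiszfeld's iteration, which is a technique chosen for this example and not one named in the paper.

```python
from math import hypot


def mean_center(points):
    """Arithmetic mean of the coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def total_distance(mp, points):
    """Value of Equation (2) for a candidate midpoint mp."""
    return sum(hypot(mp[0] - x, mp[1] - y) for x, y in points)


def geometric_median(points, iterations=200, eps=1e-12):
    """Approximate minimizer of Equation (2) via Weiszfeld's iteration."""
    mx, my = mean_center(points)       # start from the mean center
    for _ in range(iterations):
        wsum = xsum = ysum = 0.0
        for x, y in points:
            # Inverse-distance weight; eps avoids division by zero when the
            # current estimate coincides with a data point.
            w = 1.0 / max(hypot(mx - x, my - y), eps)
            wsum += w
            xsum += w * x
            ysum += w * y
        mx, my = xsum / wsum, ysum / wsum
    return (mx, my)
```

For compact clusters of a handful of POIs the two points are usually close, so the choice mostly matters when a quadrant contains an outlying POI.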

4.2.5. Computing population density in grid cells

Phase 1 computed the average number of people that should be allotted to a cell covered by a LULC category u in district d, POPud. In this step, control points are used to estimate the variations of population density within an area covered by the same LULC category. The procedure is as follows:

First, the districts <d1, d2, …, dM> that overlap with the area of interest A are retrieved. Then, the polygons P1, P2, …, PN formed by the intersection of A and districts d1, d2, …, dM are computed. Then, for each polygon Pi, and for each LULC category u, the cells that are mainly covered by u in Pi (this number of cells is denoted NB_CELLudA) are retrieved. The total population that is contained in these cells (denoted TOTAL_POPudA) is given by the following:

$$
\mathrm{TOTAL\_POP}_{udA} = \mathrm{NB\_CELL}_{udA} \times \mathrm{POP}_{ud} \tag{3}
$$

Figure 5. Identification of control points from HDI POIs.


In this formula, POPud is the value obtained in phase 1. For example, if TOTAL_POPforest_HamburgNord_A = 120, it means that in the polygon formed by the intersection of the Hamburg Nord district with the area of interest, an estimated 120 people live in areas classified as forests. Then, as explained above, this number of people must be redistributed (not necessarily evenly) within areas covered by u according to the following formula. This formula (adapted from Zhang and Qiu (2011)) gives the estimated population density in a cell Cu mostly covered by u:

$$
\begin{aligned}
\mathrm{Density}_d(C_u) &= \mathrm{constant}_{ud} \times W(C_u) \\
W(C_u) &= \left(1 - \frac{\mathrm{dist}(C_u, \mathrm{nearest\_control\_point})}{\mathrm{dist\_max}}\right)^{q}
\end{aligned}
\tag{4}
$$

The formula is illustrated in Figure 6, where educational institutions are used as control points.

The population in cell C depends on the radial distance from the nearest control point. The formula depends on several parameters. First, constantud is a constant that represents the maximal density in areas covered by u in d. It can be estimated from the density of people per census block, as given in census data. Then, dist(Cu, nearest_control_point) is the distance between the centroid of the cell Cu and the nearest control point, and dist_max is the maximal value of dist(Cu, nearest_control_point). Finally, q is the population density decaying factor. One can see from the above formula that the density in a cell is maximal (densityd(Cu) = constantud) when this cell includes a control point, because dist(Cu, nearest_control_point) is minimal and W(Cu) is very close to 1. As we move away from a control point, the population decreases. The rate of decay is fixed by q. Therefore, q greatly influences the population estimate. This is why several values of q are tested in the experimentation section.
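Equation (4) translates almost directly into code. In the sketch below, cell centroids and control points are plain `(x, y)` tuples in a projected coordinate system, and the function names are illustrative. Clamping the distance at `dist_max` is a safety measure added for the example; by definition no cell should exceed it.

```python
from math import hypot


def weight(cell, control_points, dist_max, q):
    """W(C_u) of Equation (4) for one cell centroid."""
    nearest = min(hypot(cell[0] - cx, cell[1] - cy) for cx, cy in control_points)
    return (1.0 - min(nearest, dist_max) / dist_max) ** q


def density(cell, control_points, dist_max, q, constant_ud):
    """Density_d(C_u) of Equation (4)."""
    return constant_ud * weight(cell, control_points, dist_max, q)
```

Halfway to `dist_max`, the weight is 0.5 ** q, which is larger for q = 0.3 than for q = 0.7; a larger q therefore makes the density fall off faster near the control point. Note also that the farthest cell receives a weight of exactly 0.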

Figure 6. Population in cell C depends on the distance from the nearest control point (in this figure, educational institutions are considered as control points).


Still, another problem must be resolved. After computing the population in cells, the sum of the population in all cells covered by u must equal TOTAL_POPudA in order to respect Tobler’s pycnophylactic property. This is why the computation of population in cells with the above formula is reiterated, fitting constantud so that the sum of population in all cells covered by u is as close as possible to TOTAL_POPudA.
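Because Equation (4) is linear in constantud, one way to satisfy the pycnophylactic constraint is a direct rescaling instead of an iterative fit. The sketch below assumes equally sized cells and takes the precomputed weights W(Cu) as input; whether the authors fitted the constant in this way is not stated in the text.

```python
def calibrate(weights, total_pop, cell_area=1.0):
    """Return (constant_ud, per-cell populations) that sum to total_pop.

    weights   -- W(C_u) for every cell covered by u in the polygon
    total_pop -- TOTAL_POP_udA for that LULC category and polygon
    cell_area -- area of one grid cell (all cells assumed equal)
    """
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("all weights are zero; no cell can receive population")
    constant = total_pop / (weight_sum * cell_area)
    cell_pops = [constant * w * cell_area for w in weights]
    return constant, cell_pops
```

With weights 1.0, 0.5, and 0.5 and a total of 120 people, the cells receive 60, 30, and 30 people, so the category total is preserved exactly.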

4.2.6. Computing population in buildings

The population of each cell is assigned to buildings inside the cell. The number of people in a building b is assigned proportionally to the building’s surface, as follows:

$$
\mathrm{POP}(b) = \frac{A_b \, \mathrm{POP}(C)}{\sum_{i=1}^{k} A_i} \tag{5}
$$

In this formula, $A_b$ is the building’s surface, $A_i$ is the surface of building i contained in cell C, k is the number of buildings contained in cell C, and POP(C) is the population in cell C. If a building’s surface overlaps with more than one cell, this building is assigned to the cell that contains the largest proportion of the building’s surface.
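The two rules above (assign each building to the cell holding most of its footprint, then split the cell population by footprint area) can be sketched as follows. The dictionary layouts and function names are assumptions made for the example.

```python
def assign_buildings_to_cells(overlaps):
    """Map each building to the cell containing most of its footprint.

    overlaps -- {building_id: {cell_id: overlap_area}}
    """
    return {b: max(cells, key=cells.get) for b, cells in overlaps.items()}


def population_per_building(cell_pop, footprints, building_cell):
    """Equation (5): split each cell's population by building footprint.

    cell_pop      -- {cell_id: POP(C)}
    footprints    -- {building_id: A_b}, the full footprint area
    building_cell -- {building_id: cell_id}, e.g. from the function above
    """
    area_in_cell = {}
    for b, c in building_cell.items():
        area_in_cell[c] = area_in_cell.get(c, 0.0) + footprints[b]
    return {
        b: footprints[b] * cell_pop[c] / area_in_cell[c]
        for b, c in building_cell.items()
    }
```

Cells without any building keep no one under this rule, so in practice their population would need to be handled separately; the text does not say how.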

5. Experimentation

The experimentation has two objectives: (1) investigate the impact of two factors on the reliability of the estimates, namely the choice of POI to determine control points and the value of the decaying factor; and (2) compare the reliability of the population estimates with other state-of-the-art methods.

It is not possible to experimentally verify whether the population estimates at the building level are accurate, since no official data is available at that level, due to privacy concerns. However, it is possible to verify the reliability of the method at the census block level, which is the finest level of detail available. Since blocks are generally composed of a small number of buildings, the reliability assessments are still worth consideration. Therefore, in each experiment, estimates computed at the building level were aggregated to obtain the estimated population for each census block. The estimated population is then compared with the census population.

5.1. Results

Population estimates that were obtained using different categories of HDI POIs were compared. This experiment tests the hypothesis formulated in Section 4: there exist some categories of POI that are associated with areas where population density is higher. The categories of POI ‘amenity: educational institution’ (comprising schools, universities, and kindergartens) and ‘public transport: station’ were selected as HDI POIs, since they are the most strongly correlated with population density (see Section 4). Three values for the decaying factor were also tested: q = 0.3, q = 0.5, and q = 0.7. The results were assessed with a linear regression (Figures 7 and 8).

For the first category of POI, better results are achieved in terms of standard error (369.2), R² (0.9274), and regression coefficient (1.007) with a slower decrease of the population as we move away from control points (q = 0.3). Still, standard error and R² are somewhat high and show some lack of accuracy. Then, as q increases (the decay is faster as we move away from control points), the reliability of the estimate also decreases significantly and the regression coefficient moves away from 1.

Figure 7. Simple linear regressions for HDI POI = ‘amenity: educational institution’: (a) q = 0.3, (b) q = 0.5, (c) q = 0.7.

Figure 8. Simple linear regressions for HDI POI = ‘public transport: station’: (a) q = 0.3, (b) q = 0.5, (c) q = 0.7.

For the second category of POI, better results are achieved in terms of standard error (286.6), R² (0.9530), and regression coefficient (0.9849) with a value of q = 0.5, which contrasts with the previous case. This demonstrates that the value of q should be calibrated against the type of POI used to identify control points. It is possible that population decreases more quickly around some types of place (i.e., the population is more concentrated around these points) than others. However, in both cases, as q increases, the slope of the regression line diminishes below 1, showing slight underestimation of the population. The worst results are obtained with q = 0.7. Overall, since the best results are obtained with the second category of POI, it seems that in this experiment, the category ‘public transport: station’ was more appropriate to identify control points.
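The three quantities quoted above (regression coefficient, R², and standard error) can be obtained from an ordinary least-squares fit of one block-level series against the other. The sketch below is a generic implementation; which series the authors placed on which axis, and the exact definition of their standard error, are not stated, so treating x as census population, y as estimated population, and the standard error as the residual standard error with n − 2 degrees of freedom are assumptions.

```python
from math import sqrt


def simple_regression(x, y):
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared, residual_standard_error).
    Requires at least three points and non-constant x and y.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    r_squared = 1.0 - sse / sst
    std_err = sqrt(sse / (n - 2))
    return slope, intercept, r_squared, std_err
```

A slope below 1 with census population on the x-axis corresponds to the underestimation noted in the text.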

In the second part of the experiment, while reporting further on experiments with different categories of POI, the proposed method is compared with the following representative state-of-the-art approaches:

(1) Areal weighting: population is distributed to the block level based on area proportion only.

(2) Binary dasymetric mapping: population is distributed to the block level based on area proportion and distinction between habitable and non-habitable areas. LULC data from Urban Atlas was employed to determine habitable/non-habitable areas. Urban fabric categories were considered as habitable, while other categories were considered as non-habitable.

(3) Multi-class dasymetric mapping: the method described in Section 4.1 was applied. Population was distributed evenly within areas covered by the same LULC category.

(4) Interpolation by surface volume integration: population estimates are computed using Equation (4) with q = 0.3 and control points obtained from ‘amenity: educational institution’ (POIs that generated the most reliable results).

(5) Proposed method with binary instead of multi-class dasymetric mapping, with q = 0.3 and control points obtained from ‘amenity: educational institution’.

(6) Interpolation modeling related variables with co-kriging: population density is estimated for the 500 m² grid using a co-kriging method, where the primary variable is population density and the secondary variable is POI distribution (tested with ‘amenity: educational institution’). Population estimates were rescaled to satisfy Tobler’s pycnophylactic property.
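The two simplest baselines in the list, areal weighting and binary dasymetric mapping, differ only in which intersection areas count. The sketch below distributes the population of one source zone over the target zones it intersects; the `(target_id, area)` and `(target_id, area, habitable)` tuple layouts are assumptions made for the example.

```python
def areal_weighting(source_pop, pieces):
    """Distribute source_pop in proportion to intersection area.

    pieces -- list of (target_id, area) for one source zone
    """
    total = sum(area for _, area in pieces)
    result = {}
    for target, area in pieces:
        result[target] = result.get(target, 0.0) + source_pop * area / total
    return result


def binary_dasymetric(source_pop, pieces):
    """Distribute source_pop in proportion to habitable area only.

    pieces -- list of (target_id, area, habitable) for one source zone
    """
    habitable_total = sum(area for _, area, habitable in pieces if habitable)
    result = {target: 0.0 for target, _, _ in pieces}
    if habitable_total == 0:
        return result                  # nowhere to place the population
    for target, area, habitable in pieces:
        if habitable:
            result[target] += source_pop * area / habitable_total
    return result
```

In the usage below, marking 50 of target B's 70 area units as non-habitable shifts population from B to A compared with plain areal weighting.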

The root mean square error (RMSE) and the mean absolute error (MAE), which are well-known accuracy measures in the field of population density estimation (Liu et al. 2008, Langford 2013), were measured (Table 3):

$$
\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left(P_i^{\mathrm{estimated}} - P_i^{\mathrm{observed}}\right)^2}{n}} \tag{6}
$$


$$
\mathrm{MAE} = \frac{1}{P} \sum_{i=1}^{n} \left| P_i^{\mathrm{estimated}} - P_i^{\mathrm{observed}} \right| \tag{7}
$$

In these formulas, n is the number of census blocks, $P_i^{\mathrm{estimated}}$ is the estimated population in block i, $P_i^{\mathrm{observed}}$ is the official census population in block i, and P is the total population in Hamburg.

As expected, areal weighting performs poorly but is useful to illustrate differences. There is no great difference between binary and multi-class dasymetric mapping, showing that indeed few people live in non-urban fabric areas. However, binary dasymetric mapping slightly outperforms interpolation by surface volume integration, which was also reported (although with a more pronounced difference) in Langford (2013). The corroboration of these results suggests that modeling population distribution with a decaying function may be oversimplifying, which contrasts with results obtained in Zhang and Qiu (2011). It also questions the reliability of POIs as a unique source of ancillary data (while in the proposed approach, it is coupled with LULC data). The comparison of interpolation by surface volume integration with the proposed method (with the same category of POIs being used to ensure a fair comparison) also suggests that the multi-class dasymetric mapping method based on LULC contributes to increasing the reliability of the estimates. When comparing the results obtained with the proposed method using binary or multi-class dasymetric mapping, it can be observed that the multi-class approach does improve the estimates, but not by a large margin. Since binary is easier to implement than multi-class dasymetric mapping, the former might therefore be worth considering. Also, the purity of the samples used for estimating population densities associated with each LULC category (85%) impacts the reliability of estimates obtained with multi-class dasymetric mapping (although misclassification can also occur with binary dasymetric mapping). Finally, better results are obtained with co-kriging interpolation than with the proposed method with interpolation by surface volume integration. Here, POIs were used as the secondary variable, but preliminary estimates obtained for each category of LULC with multi-class dasymetric mapping (Section 4.1) were also used to rescale the results of co-kriging. The results could be interpreted as another indication that the interpolation by surface volume integration is too simplistic. Still, the difference for both RMSE and MAE is not large. This may be worth considering, since the co-kriging model is more complex to implement.

Table 3. Comparative reliability assessment results.

| Method | RMSE | MAE (%) |
|---|---|---|
| Areal weighting | 175.6 | 49.8 |
| Binary dasymetric mapping | 124.3 | 19.4 |
| Multi-class dasymetric mapping | 119.6 | 19.1 |
| Interpolation by surface volume integration | 137.3 | 26.5 |
| Proposed method with binary dasymetric mapping | 111.2 | 11.9 |
| Interpolation with co-kriging | 100.2 | 8.5 |
| Proposed method, ‘amenity: educational institution’, q = 0.3 | 104.1 | 9.2 |
| Proposed method, ‘public transport: station’, q = 0.5 | 103.7 | 8.9 |
| Proposed method, ‘amenity; Post-box’, q = 0.3 | 106.5 | 10.1 |
| Proposed method, ‘amenity; recycling’, q = 0.5 | 109.4 | 11.0 |
| Proposed method, ‘highway; crossing’, q = 0.5 | 112.2 | 11.9 |
| Proposed method, ‘amenity; bench’, q = 0.6 | 114.2 | 14.3 |
| Proposed method, ‘highway; traffic signals’, q = 0.5 | 118.6 | 17.3 |
| Proposed method, ‘amenity; hospital, doctors’, q = 0.7 | 119.7 | 19.8 |
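The error measures of Equations (6) and (7) can be computed from the block-level series as sketched below. Two assumptions are made here: P is taken as the sum of the observed block populations (which equals the city total only if the blocks cover the whole city), and the MAE is multiplied by 100 to match the percentage column of Table 3.

```python
from math import sqrt


def rmse(estimated, observed):
    """Equation (6): root mean square error over census blocks."""
    n = len(estimated)
    return sqrt(sum((e - o) ** 2 for e, o in zip(estimated, observed)) / n)


def mae_percent(estimated, observed):
    """Equation (7), expressed as a percentage of the total population."""
    total_pop = sum(observed)  # assumption: blocks cover the whole study area
    abs_error = sum(abs(e - o) for e, o in zip(estimated, observed))
    return 100.0 * abs_error / total_pop
```

Because Equation (7) divides by the total population and not by the number of blocks, it measures the share of people allocated to the wrong block, which makes it comparable across cities of different size.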

The results show that POIs are useful to identify control points and estimate variations within a given category of LULC. Table 3 lists some categories of POI that performed better compared to other POIs. It shows that for some categories of POI, the results are not necessarily better than simple multi-class dasymetric mapping (e.g., hospitals and doctors). It is nevertheless difficult to draw conclusions, since this could be because this type of place is not correlated with population density, or because data on this type of place is incomplete in OSM. Results are also likely to differ between cities, as spatial urban patterns vary.

5.2. Discussion

The comparison of results suggests possible areas for improvement. In addition, the margin of error could be greater at the level of buildings, although this cannot be verified. Although volume of buildings could be used to generate estimates at this level, it would still not be an absolute reference against which the proposed method could be compared.

An important cause of error is possibly the decaying function behind the interpolation by surface volume integration. Although results show it helps to better reflect population density variations within an area with uniform LULC classification, this model simplifies the population density distribution. The second issue with the model is that the decaying factor q controls the extent of local influence. The experiments show that the choice of the value of q should depend on the type of POI selected to determine control points, with substantial variation between 0.3 and 0.7. To verify whether the population decay is indeed correlated with the category of POI, further experiments must be conducted where estimates obtained with the same categories of POI but different decay values and in different regions would be compared. Such experiments would also have to take into account the differences between the urban fabric patterns across regions.

The choice of POI also affects the estimates. In addition, the completeness of the POI data set is an issue, as well as semantic accuracy. Since POI descriptions are provided by volunteers, inaccurate descriptions are likely to occur. Until further research is done to develop tools to frame the volunteer production of POIs (such as semantic annotation for POI, see Scioscia et al. 2013) and until such tools become common use, appropriate filtering of POI data is required. An alternative could be POIs from official or commercial data sets.

Errors in LULC classification are likely to impact population estimates (Maantay et al. 2007, Liu et al. 2008, Lu et al. 2010, Langford 2013), a source of error that can hardly be avoided. In addition, in different phases of the method, approximations were made, such as considering cells mostly covered by a category of LULC as entirely covered by this category, or considering buildings as entirely part of a cell while they were only partly included in this cell. Also, data on types of building could not be used (e.g., to discard industrial or commercial buildings) since building attributes are not always available in OSM. An additional parameter that affects the results is grid resolution. Further experiments are needed to assess its impact on the reliability of estimates. A coarser grid would affect dasymetric mapping, because it would negatively impact the purity of population density samples. A finer grid may produce more accurate results by allowing finer distinctions. However, a finer grid may also have a negative effect, with more buildings being split between two or more cells.


6. Conclusion and future work

Areal interpolation techniques using LULC data are commonly used to estimate population distribution, but numerous experiments, including those reported here, demonstrate that they cannot be used to conduct accurate estimation of population at a fine scale. A common alternative is to use three-dimensional data on buildings’ volume derived from LiDAR data. However, such data is not always available, due to its significant cost. Meanwhile, VGI is used for an increasing variety of tasks related to urban management, emergency response, routing services, etc. It was therefore worth investigating whether VGI could also prove useful to support population estimation.

This paper presents a method for areal interpolation at building level that uses VGI as ancillary data. After a first phase that uses multi-class dasymetric mapping to assess average population density within homogeneous LULC categories, OSM data on buildings and POIs are used to refine the population estimation at building level. An original method was developed to classify and determine POIs that can be used to estimate population distribution. A case study with the city of Hamburg, Germany, has been conducted. Population estimates obtained using different categories of POI were compared. The method was also compared with state-of-the-art methods. The results show that although there are areas for improvement, POIs can be considered as interesting ancillary data in the absence of three-dimensional data on buildings. Of note still is that VGI data, including OSM, is not yet widely available or well-developed in all regions of the world. Similarly, LULC data from Urban Atlas is available only in Europe and for urban areas with a population of more than 100,000 people. Although other LULC data sets are available for other countries, such data is not consistently available, e.g., in developing countries, most likely limiting the approach presented in this paper to urban regions in developed countries.

Further research is also required to deal with the VGI quality issues. Although the aim of the proposed approach is to improve accessibility by relying on free data, it cannot fully achieve this goal until VGI becomes more common. Still, this research opens an interesting avenue for other studies to be conducted in the future using VGI.

The possible causes of error that were identified include the incompleteness or spatial and thematic inaccuracy of data on POIs in OSM. The inappropriateness of some POIs to estimate population distribution (where some categories of POI may not be correlated with population density) may vary according to the city, since different cities are characterized by different urban patterns. In this regard, future work will be conducted to further investigate patterns of POIs that reflect spatial population distribution. A more comprehensive investigation of the issue also requires comparison of the results using commercially produced or official POIs and linking it to various indicators of POI quality. Another area of concern is the impact of the decaying function. Calibration is needed whenever the method needs to be applied to a different region or using different POIs. Additional ancillary data, such as distance from roads, may help determine how population density is decaying away from a POI. Additionally, further research on correlation between types of POI, structure of urban fabric, and population decay may be conducted to improve the approach. The surface volume integration may also be inappropriate, because it measures the density in a cell according to its distance from the nearest control point. Taking into account the distribution of control points surrounding a cell rather than a single control point may produce better results. Also, better estimates were obtained with co-kriging interpolation, where the problem of related variables is taken into account. A hybrid approach is worth consideration.


Acknowledgement

We wish to thank the anonymous reviewers for their valuable contribution.

Funding

This work was conducted for the GRIPS Project and supported by the German Federal Ministry of Education and Research (BMBF).

References

Aubrecht, C., Ungar, J., and Freire, S., 2011. Exploring the potential of volunteered geographic information for modeling spatio-temporal characteristics of urban population. In: Proceedings of 7VCT, 11–13 October, Lisbon.

Bakillah, M., et al., 2013a. Semantic interoperability of sensor data with volunteered geographic information: a unified model. ISPRS International Journal of Geo-Information, 2 (3), 766–796. doi:10.3390/ijgi2030766

Bakillah, M., et al., 2013b. A dynamic and context-aware semantic mediation service for discovering and fusion of heterogeneous sensor data. Journal of Spatial Information Science, 6, 155–185.

Bereuter, P. and Weibel, R., 2010. Generalisation of point data for mobile devices: a problem-oriented approach. In: Proceedings of the 13th workshop on progress in generalisation and multiple representation, September 12–13, Zurich. Bern: International Cartographic Association commission, 1–8.

Bishop, C.M., 2006. Pattern recognition and machine learning (information science and statistics), vol. 1. Secaucus, NJ: Springer-Verlag New York.

Bishr, M. and Mantelas, L., 2008. A trust and reputation model for filtering and classifying knowledge about urban growth. GeoJournal, 72 (3–4), 229–237. doi:10.1007/s10708-008-9182-4

Coleman, D.J., 2010. Volunteered geographic information in spatial data infrastructure: an early look at opportunities and constraints. In: Spatially enabling society: research, emerging trends and critical assessment. Leuven University Press.

Coleman, D.J., Sabone, B., and Nkhwanana, N., 2010. Volunteering geographic information to authoritative databases: linking contributor motivations to program effectiveness. Geomatica, 64 (1), 383–396.

Cromley, R.G., Hanink, D.M., and Bentley, G.C., 2012. A quantile regression approach to areal interpolation. Annals of the Association of American Geographers, 102 (4), 763–777. doi:10.1080/00045608.2011.627054

De Longueville, B., et al., 2010. Digital Earth’s nervous system for crisis events: real-time sensor web enablement of volunteered geographic information. International Journal of Digital Earth, 3 (3), 242–259. doi:10.1080/17538947.2010.484869

Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39 (1), 1–38.

Eicher, C.L. and Brewer, C.A., 2001. Dasymetric mapping and areal interpolation: implementation and evaluation. Cartography and Geographic Information Science, 28 (2), 125–138. doi:10.1559/152304001782173727

Flanagin, A. and Metzger, M., 2008. The credibility of volunteered geographic information.GeoJournal, 72 (3–4), 137–148. doi:10.1007/s10708-008-9188-y

Flowerdew, R. and Green, M., 1994. Areal interpolation and types of data. In: S. Fotheringham andP. Rogerson, eds. Spatial analysis and GIS. London: Taylor and Francis, 121–146.

Goetz, M., Lauer, J., and Auer, M., 2012. An algorithm-based methodology for the creation of aregularly updated global online map derived from volunteered geographic information. In:C.-P. Rückemann and B. Resch, eds. Proceedings of the fourth international conference onadvanced geographic information systems, applications, and services, January 30–February 4,Valencia, 50–58.

Goetz, M. and Zipf, A., 2012. Towards defining a framework for the automatic derivation of 3DCityGML models from volunteered geographic information. International Journal of 3DInformation Modeling, 1 (2), 1–16. doi:10.4018/ij3dim.2012040101

1960 M. Bakillah et al.

Dow

nloa

ded

by [

141.

212.

109.

170]

at 0

8:37

16

Dec

embe

r 20

14

Page 24: Fine-resolution population mapping using OpenStreetMap points-of-interest

Goodchild, M.F., 2007. Citizens as sensors: the world of volunteered geography. GeoJournal, 69(4), 211–221. doi:10.1007/s10708-007-9111-y

Goodchild, M.F. and Lam, N.S.-N., 1980. Areal interpolation: a variant of the traditional spatialproblem. Geo-Processing, 1, 297–312.

Griffith, D.A., 2013. Estimating missing data values for georeferenced Poisson counts.Geographical Analysis, 45 (3), 259–284. doi:10.1111/gean.12015

Griffith, D.A. and Can, A., 1996. Spatial statistical/econometric version of simple urban populationdensity models. In: S.L. Arlinghaus and D.A. Griffith, eds. Practical handbook of spatialstatistics. Boca-Raton, FL: CRC Press.

Haklay, M., 2010. How good is volunteered geographical information? A comparative study ofOpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning andDesign, 37 (4), 682–703. doi:10.1068/b35097

Harvey, J.T., 2002. Estimating census district populations from satellite imagery: some approachesand limitations. International Journal of Remote Sensing, 23 (10), 2071–2095. doi:10.1080/01431160110075901

Hawley, K. and Moellering, H., 2005. A comparative analysis of areal interpolation methods.Cartography and Geographic Information Science, 32 (4), 411–423. doi:10.1559/152304005775194818

Jackson, S.P., et al., 2013. Assessing completeness and spatial error of features in volunteeredgeographic information. ISPRS International Journal of Geo-Information, 2 (2), 507–530.doi:10.3390/ijgi2020507

Jokar Arsanjani, J., et al., 2013. Toward mapping land-use patterns from volunteered geographicinformation. International Journal of Geographical Information Science. doi:10.1080/13658816.2013.800871

Kyriakidis, P., 2004. A geostatistical framework for area-to-point spatial interpolation. GeographicalAnalysis, 36 (3), 259–289. doi:10.1111/j.1538-4632.2004.tb01135.x

Lam, N.S., 1983. Spatial interpolation methods: a review. Cartography and Geographic InformationScience, 10 (2), 129–150. doi:10.1559/152304083783914958

Langford, M., 2006. Obtaining population estimates in non-census reporting zones: an evaluation ofthe 3-class dasymetric method. Computers, Environment and Urban Systems, 30 (2), 161–180.doi:10.1016/j.compenvurbsys.2004.07.001

Langford, M., 2013. An evaluation of small area population estimation techniques using open accessancillary data. Geographical Analysis, 45 (3), 324–344. doi:10.1111/gean.12012

Leyk, S., Nagle, N.N., and Buttenfield, B.P., 2013a. Maximum entropy dasymetric modeling fordemographic small area estimation. Geographical Analysis, 45 (3), 285–306. doi:10.1111/gean.12011

Leyk, S., et al., 2013b. Establishing relationships between parcel data and land cover for demo-graphic small area estimation. Cartography and Geographic Information Science, 40 (4),305–315. doi:10.1080/15230406.2013.782682

Lin, J., et al., 2013. Evaluating the use of publicly available remotely sensed land cover data forareal interpolation. GIScience & Remote Sensing, 50 (2), 212–230.

Liu, X.H., Kyriakidis, P.C., and Goodchild, M.F., 2008. Population‐density estimation using regres-sion and area‐to‐point residual kriging. International Journal of Geographical InformationScience, 22 (4), 431–447. doi:10.1080/13658810701492225

Lo, C.P., 2008. Population estimation using geographically weighted regression. GIScience &Remote Sensing, 45 (2), 131–148. doi:10.2747/1548-1603.45.2.131

Lu, Z., et al., 2010. Population estimation based on multi-sensor data fusion. International Journalof Remote Sensing, 31 (21), 5587–5604. doi:10.1080/01431161.2010.496801

Lwin, K.K., 2010. Online micro-spatial analysis based on GIS estimated building population: acase of Tsukuba City. Thesis (PhD). Graduate School of Life and Environmental Sciences,University of Tsukuba.

Lwin, K.K. and Murayama, Y., 2010. Development of GIS tool for dasymetric mapping.International Journal of Geoinformatics, 6 (1), 8–11.

Maantay, J.A., Maroko, A.R., and Herrmann, C., 2007. Mapping population distribution in theurban environment: the cadastral-based expert dasymetric system (CEDS). Cartography andGeographic Information Science, 34 (2), 77–102. doi:10.1559/152304007781002190

Markoff, J. and Shapiro, G., 1973. The linkage of data describing overlapping geographical units.Historical Methods Newsletter, 7 (1), 34–46. doi:10.1080/00182494.1973.10112670

International Journal of Geographical Information Science 1961

Dow

nloa

ded

by [

141.

212.

109.

170]

at 0

8:37

16

Dec

embe

r 20

14

Page 25: Fine-resolution population mapping using OpenStreetMap points-of-interest

Mennis, J., 2003. Generating surface models of population using dasymetric mapping. TheProfessional Geographer, 55 (1), 31–42.

Mennis, J., 2009. Dasymetric mapping for estimating population in small areas. GeographyCompass, 3 (2), 727–745. doi:10.1111/j.1749-8198.2009.00220.x

Mooney, P. and Corcoran, P., 2011. Can volunteered geographic information be a participant ineEnvironment and SDI? In: J. Hrebicek, G. Schimak and R. Denzer, eds. EnvironmentalSoftware Systems. Frameworks of eEnvironment, 9th IFIP WG 5.11 International Symposium,ISESS 2011, IFIP Advances in Information and Communication Technology, vol. 359, 27–29June, Brno. Berlin: Springer, 115–122.

Nagle, N.N., et al., 2014. Dasymetric modeling and uncertainty. Annals of the Association ofAmerican Geographers, 104 (1), 80–95. doi:10.1080/00045608.2013.843439

Neis, P., Zielstra, D., and Zipf, A., 2011. The street network evolution of crowdsourced maps:OpenStreetMap in Germany 2007–2011. Future Internet, 4 (1), 1–21. doi:10.3390/fi4010001

Reibel, M. and Agrawal, A., 2007. Areal interpolation of population counts using pre-classified landcover data. Population Research and Policy Review, 26 (5), 619–633. doi:10.1007/s11113-007-9050-9

Rodrigues, F., et al., 2012. Automatic classification of points-of-interest for land use analysis. In:C.-P. Rückemann and B. Resch, eds. Proceedings of fourth International Conference onAdvanced Geographic Information Systems, Applications, and Services, January 30 –February 4, Valencia, 41–49.

Rodrigues, F., et al., 2013. Estimating disaggregated employment size from points-of-interest andcensus data: from mining the web to model implementation and visualization. InternationalJournal On Advances in Intelligent Systems, 6 (1&2), 41–52.

Samet, H., 1984. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2), 187–260. doi:10.1145/356924.356930

Scheider, S., et al., 2011. Semantic referencing of geosensor data and volunteered geographicinformation. In: N. Ashish and A.P. Sheth, eds. Geospatial semantics and the semantic web:foundations, algorithms and applications, vol. 12. Semantic Web and Beyond. New York:Springer, 27–59.

Schroeder, J.P., 2007. Target-density weighting interpolation and uncertainty evaluation for temporalanalysis of census data. Geographical Analysis, 39 (3), 311–335. doi:10.1111/j.1538-4632.2007.00706.x

Schroeder, J.P., 2010. Bicomponent trend maps: a multivariate approach to visualizing geographictime series. Cartography and Geographic Information Science, 37 (3), 169–187. doi:10.1559/152304010792194930

Schroeder, J.P. and Van Riper, D.C., 2013. Because Muncie’s densities are not Manhattan’s: usinggeographical weighting in the expectation–maximization Algorithm for areal interpolation.Geographical Analysis, 45 (3), 216–237. doi:10.1111/gean.12014

Scioscia, F., et al., 2013. A framework and a tool for semantic annotation of POIs inOpenStreetMap. Social and Behavioral Sciences, 111, 1092–1101.

Semenov-Tian-Shansky, B., 1928. Russia: territory and population: a perspective on the 1926census. Geographical Review, 18 (4), 616–640. doi:10.2307/207951

Song, W. and Sun, G., 2010. The role of mobile volunteered geographic information in urbanmanagement. In: Y. Liu and A. Chen, eds. 18th International Conference on Geoinformatics,18–20 June, Beijing, 1–5.

Sridharan, H. and Qiu, F., 2013. A spatially disaggregated areal interpolation model using lightdetection and ranging-derived building volumes. Geographical Analysis, 45 (3), 238–258.doi:10.1111/gean.12010

Strunck, A., 2010. Raumzeitliche qualitätsuntersuchung von OpenStreetMap. Thesis (master).Geographisches Institut, Rheinische Friedrich-Wilhelms-Universität Bonn.

Su, M., et al., 2010. Multi-layer multi-class dasymetric mapping to estimate population distribution.Science of the Total Environment, 408 (20), 4807–4816. doi:10.1016/j.scitotenv.2010.06.032

Sun, Y., et al., in press. Road-based travel recommendation using geo-tagged images. Computers,Environment and Urban Systems (CEUS). Elsevier.

Tapp, A.F., 2010. Areal interpolation and dasymetric mapping methods using local ancillary datasources. Cartography and Geographic Information Science, 37 (3), 215–228. doi:10.1559/152304010792194976

1962 M. Bakillah et al.

Dow

nloa

ded

by [

141.

212.

109.

170]

at 0

8:37

16

Dec

embe

r 20

14

Page 26: Fine-resolution population mapping using OpenStreetMap points-of-interest

Tobler, W., 1979. Smooth pycnophylactic interpolation for geographical regions. Journal of theAmerican Statistical Association, 74 (367), 519–530. doi:10.1080/01621459.1979.10481647

Tulloch, D., 2008. Is VGI participation? From vernal pools to video games. GeoJournal, 72 (3–4),161–171. doi:10.1007/s10708-008-9185-1

Ural, S., Hussain, E., and Shan, J., 2011. Building population mapping with aerial imagery and GISdata. International Journal of Applied Earth Observation and Geoinformation, 13 (6), 841–852.doi:10.1016/j.jag.2011.06.004

Wright, J.K., 1936. A method of mapping densities of population: with cape cod as an example. TheGeographical Review, 26 (1), 103–110. doi:10.2307/209467

Wu, C. and Murray, A.T., 2005. A cokriging method for estimating population density in urbanareas. Computers, Environment and Urban Systems, 29 (5), 558–579. doi:10.1016/j.compenvurbsys.2005.01.006

Wu, S., Wang, L., and Qiu, X., 2008. Incorporating GIS building data and census housing statisticsfor subblock-level population estimation. The Professional Geographer, 60 (1), 121–135.doi:10.1080/00330120701724251

Wu, S.-S., Qiu, X., and Wang, L., 2005. Population estimation methods in GIS and remote sensing:a review. GIScience and Remote Sensing, 42 (1), 80–96. doi:10.2747/1548-1603.42.1.80

Xie, Y., 1995. The overlaid network algorithms for areal interpolation problem. Computers,Environment and Urban Systems, 19 (4), 287–306. doi:10.1016/0198-9715(95)00028-3

Yang, X., et al., 2012. Preliminary mapping of high-resolution rural population distribution based onimagery from Google Earth: a case study in the Lake Tai basin, eastern China. AppliedGeography, 32 (2), 221–227. doi:10.1016/j.apgeog.2011.05.008

Zandbergen, P. and Ignizio, D., 2010. Comparison of dasymetric mapping techniques for small-areapopulation estimates. Cartography and Geographic Information Science, 37 (3), 199–214.doi:10.1559/152304010792194985

Zhang, C. and Qiu, F., 2011. A point-based intelligent approach to areal interpolation. TheProfessional Geographer, 63 (2), 262–276. doi:10.1080/00330124.2010.547792

Zielstra, D. and Zipf, A., 2010. A comparative study of proprietary geodata and volunteeredgeographic information for Germany. In: The 13th AGILE International Conference onGeographic Information Science (AGILE 2010), 11–14 May, Guimarães.

International Journal of Geographical Information Science 1963

Dow

nloa

ded

by [

141.

212.

109.

170]

at 0

8:37

16

Dec

embe

r 20

14