City Data Dating: emerging affinities between diverse urban datasets

Gloria Re Calegari, Irene Celino, Diego Peroni

Paper available at: http://www.sciencedirect.com/science/article/pii/S0306437915001362

Elsevier - Information Systems Journal

Digital information about cities

• Open data (large number of data sources available on the web):

• Urban planning (land cover, public registers)

• Demographics and statistics about the municipality

• Closed data sources produced and maintained by enterprises:

• Phone activity data (closed, but sometimes made open!)

• User generated information:

• Volunteered geographic information and crowdsourced information (OpenStreetMap)

• Location-based social networks (Foursquare check-ins and geo-located information)

• Real-time and streaming information:

• Sensors (e.g. temperature, energy consumption, ...)


Do diverse urban datasets provide the same “picture” of the city?

Short term goals

• Discovering “affinities” between heterogeneous datasets.

• Using a human relations metaphor, do diverse urban datasets “date each other” and show “natural affinities”?

• What is the influence of spatial resolution and data complexity on the strength of the dependence between heterogeneous urban sources?

Long term goals

• Would it be possible to use one or more “cheap” datasets as proxy for more “expensive” data sources?


Milano datasets

Demographics:

• Population density

• Spatial resolution: census areas (6,079; median census area size 12,000 m²)

• Source: Milano open data

Points of interest (POIs):

• Transport, schools, sports facilities, amenity places, shops ...

• Spatial resolution: lat-long points

• Source: Milano open data (official, 6,718) and OpenStreetMap (user generated, 44,351)


Milano datasets

Land use:

• Type of land use according to the CORINE taxonomy (3-level hierarchy, up to 40 types of land use defined)

• CORINE taxonomy: http://swa.cefriel.it/ontologies/corine.html#

• 12 types selected (those that best characterize a metropolitan area such as Milan)

[dense residential areas, scattered residential areas, industrial areas, parks and green areas, roads, railways, hospitals, sports centres, public services and offices, construction sites, agricultural areas and wild areas.]

• Spatial resolution: building level

• Source: Lombardy region open data


Milano datasets

Call data records:

• 5 phone activities:

• Incoming SMS

• Outgoing SMS

• Incoming calls

• Outgoing calls

• Internet

• Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013)

• Spatial resolution: grid of 3,538 square cells of 250 m

• Source: Telecom Italia – provided for their Big Data Challenge http://theodi.fbk.eu/openbigdata/


Pre-processing of data

Making spatial resolution uniform

Spatial resolutions used:

• District level with 88 official subdivisions

• Grid level with 3,538 square cells of 250 m

[Figure panels: Cells; Districts]

New datasets generated:

• Density of POIs in each cell/district

• Weighted sum of population density in each cell/district

• Percentage shares of each land use over each cell/district area
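The first of the derived datasets above (POI density per cell) can be illustrated with a minimal sketch. It assumes the POI coordinates have already been projected to metres and that the grid is a regular lattice anchored at a known corner; the function name, the grid origin and the toy points are hypothetical and are not taken from the slides.

```python
import numpy as np

def poi_density_per_cell(x, y, x0, y0, cell_size, n_cols, n_rows):
    """Return POIs per square metre for each cell of a regular square grid.

    x, y      : point coordinates in metres (assumed already projected)
    x0, y0    : lower-left corner of the grid (hypothetical)
    cell_size : side of a cell in metres (250 in the slides)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Column/row index of the cell containing each point
    col = np.floor((x - x0) / cell_size).astype(int)
    row = np.floor((y - y0) / cell_size).astype(int)
    # Discard points falling outside the grid
    inside = (col >= 0) & (col < n_cols) & (row >= 0) & (row < n_rows)
    flat = row[inside] * n_cols + col[inside]
    counts = np.bincount(flat, minlength=n_rows * n_cols)
    return counts.reshape(n_rows, n_cols) / cell_size**2

# Toy example: five points on a 2 x 2 grid of 250 m cells (the last one is outside)
x = [10.0, 20.0, 260.0, 490.0, 900.0]
y = [10.0, 30.0, 10.0, 300.0, 900.0]
density = poi_density_per_cell(x, y, 0.0, 0.0, 250.0, n_cols=2, n_rows=2)
```

Aggregating to districts would need a point-in-polygon test instead of integer division, which is outside the scope of this sketch.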


Pre-processing of data

Telecom data

Footprint/temporal signature for each cell/district (average activity over all the 60 days, distinguishing between weekdays and weekend days)

Temporal data compression (pre-processing large-scale time series to get a more manageable compressed representation)
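A possible reading of the footprint step is sketched below for one cell and one activity type: the 10-minute series is averaged slot by slot, separately over weekdays and weekend days. The array layout (days x 144 slots), the function name and the random toy data are assumptions made for illustration only.

```python
import numpy as np

def footprint(activity, is_weekend):
    """Average daily profile for weekdays and weekend days, concatenated.

    activity   : array of shape (n_days, 144), one value per 10-minute slot
    is_weekend : boolean array of shape (n_days,)
    """
    activity = np.asarray(activity, dtype=float)
    is_weekend = np.asarray(is_weekend, dtype=bool)
    week_profile = activity[~is_weekend].mean(axis=0)     # 144 values
    weekend_profile = activity[is_weekend].mean(axis=0)   # 144 values
    return np.concatenate([week_profile, weekend_profile])

# Toy data: two weeks of random activity, days 5 and 6 of each week are weekend
rng = np.random.default_rng(0)
n_days = 14
activity = rng.random((n_days, 144))
is_weekend = np.array([d % 7 in (5, 6) for d in range(n_days)])
fp = footprint(activity, is_weekend)   # 288 values for one activity type
```

Under this layout, five activity types would give 5 x 288 = 1440 values per cell, which is consistent with the 1440-value footprint mentioned in the clustering step, although the slides do not spell out the layout.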


Data exploration experiments

To discover possibly “natural” connections between heterogeneous datasets.

Three-step process

1) Correlation Analysis

2) Regression Analysis

3) Clustering Analysis

All analyses were performed at both district and cell level

[Diagram label: complexity of data]


1) Correlation analysis

Try to identify possible correspondences between different datasets.

Measure whether and how two variables change together using a correlation index -> Pearson’s correlation coefficient r

-1 ≤ r ≤ 1

r > 0: positive correlation

r < 0: negative correlation

Pairwise comparisons between 1-dimensional vectors:

• POIs: density

• Population: density

• Telecom: first Principal Component with 90% of explained variability

• Land use data: only residential and agricultural land use, used separately, each expressed as the percentage of the district/cell area it covers
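The pairwise comparison above can be sketched as follows: a multi-column Telecom matrix is reduced to its first principal component and then correlated with a 1-dimensional land use vector. This is only an illustrative sketch with synthetic data; the variable names and the way the toy data are generated are assumptions.

```python
import numpy as np

def first_principal_component(X):
    """Scores on the first principal component and its share of explained variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # centre each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, 0] * s[0]
    explained = s[0] ** 2 / np.sum(s ** 2)
    return scores, explained

def pearson_r(a, b):
    """Pearson's correlation coefficient between two 1-d vectors."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Synthetic data for 88 areas: three telecom columns driven by one latent factor
rng = np.random.default_rng(0)
base = rng.random(88)
telecom = np.column_stack([base * w for w in (1.0, 2.0, 3.0)])
telecom = telecom + 0.01 * rng.random((88, 3))
residential = 0.5 * base + 0.05 * rng.random(88)

pc1, explained = first_principal_component(telecom)
r = pearson_r(pc1, residential)
```

Note that the sign of a principal component is arbitrary, so the sign of `r` only becomes meaningful once the orientation of the component has been fixed (for example, so that it increases with total activity).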


Correlation analysis - district level

• Correlations between Telecom and residential land use, and between Telecom and POIs, can actually exist.

• Data fit quasi-linear models.

[Figure: pairwise correlations at district level; variables: tlc, resid, agric, POI mun, POI OSM, pop]

• Negative correlation between agricultural land use and other datasets -> human action inversely related to agricultural areas.


Correlation analysis - cell level

• All coefficients are lower than at district level

• The highest values are again between Telecom and residential land use and between Telecom and POIs

=> the choice of resolution level can have a significant impact on the correlation results.

[Figure: pairwise correlations at cell level; variables: tlc, resid, agric, POI mun, POI OSM, pop]

• Some phenomena causing the correlation are independent of the resolution level (e.g. 0.76 between residential land use and population).


Correlation analysis - phone calls and population

• Could the correlation change during the day according to the everyday human behaviour pattern (get up, go to work, come back home in the evening)?

• Call activity at 6 different times of day

• Week and weekend profiles are different -> mirroring people’s different habits

• Average correlation higher in the weekend (phone activity related to the actual presence of people at home)

• Weekday profile -> human behaviour pattern

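[Figure: correlation between call activity and population at different times of day; columns: DISTRICT, CELL; rows: WEEK, WEEKEND]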
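The time-of-day comparison can be sketched as below: the daily call profile of each area is averaged within six windows and each window is correlated with population density. The slides do not say how the six times of day were defined, so the equal 4-hour windows, the function name and the synthetic data are all assumptions.

```python
import numpy as np

def correlation_by_daytime(profiles, population, n_windows=6):
    """Pearson correlation between population and mean activity in each time window.

    profiles   : array of shape (n_areas, 144), average daily call profile per area
    population : array of shape (n_areas,), population density per area
    """
    profiles = np.asarray(profiles, dtype=float)
    population = np.asarray(population, dtype=float)
    # Split the 144 ten-minute slots into equal consecutive windows (assumption)
    windows = np.array_split(np.arange(profiles.shape[1]), n_windows)
    return np.array([
        np.corrcoef(profiles[:, w].mean(axis=1), population)[0, 1]
        for w in windows
    ])

# Synthetic data: night-time activity (first 24 slots) follows population,
# the rest of the day is unrelated noise
rng = np.random.default_rng(0)
population = rng.random(88)
profiles = rng.random((88, 144))
profiles[:, :24] += 5.0 * population[:, None]
r_by_window = correlation_by_daytime(profiles, population)
```

Running the same function separately on weekday and weekend profiles would give the two curves that the slide compares.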


2) Regression analysis

Fit multiple linear regression (MLR) models

(To take into consideration a larger portion of urban data complexity)

MLR models

Telecom (1-d vector for each communication type for week/weekend days with the average activity)

POIs (1-d vector for each POI category with the POI density)

Demographics (1-d vector with the population density)

Land use (1-d vector for each land use with the percentage of land covered. Only 5 main land use types are used.)

Cheap-to-produce

Expensive-to-maintain


Regression analysis - results

• The more predictors are selected (increasing k), the larger the part of the outcome variable’s variance that is explained (increasing adjusted R²).

• The strength of the correlation decreases from district level (coarse-grained) to cell level (fine-grained)

• The benefit of adding a higher number of predictors is weaker at cell level

• Population, dense residential and agricultural areas show a stronger correlation

[Figure labels: manual stepwise model selection; AIC criterion]

Heterogeneous datasets provide comparable or even similar pictures of the urban environment.
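The quantities mentioned in the regression slides (adjusted R² and AIC for a multiple linear regression with k predictors) can be computed as in the minimal sketch below. It assumes that the "cheap" datasets act as predictors and an "expensive" one as the outcome, which the slides suggest but do not state explicitly; the data are synthetic and the AIC is the usual Gaussian form up to an additive constant.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with intercept; returns coefficients, adjusted R2, AIC."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])           # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    rss = float(resid @ resid)
    tss = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    # Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2 * (k + 1)
    aic = n * np.log(rss / n) + 2 * (k + 1)
    return beta, r2_adj, float(aic)

# Synthetic example with 88 areas: two informative predictors, one irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(88, 3))                       # stand-ins for telecom/POI vectors
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=88)

beta_full, r2_adj_full, aic_full = fit_mlr(X, y)
beta_small, r2_adj_small, aic_small = fit_mlr(X[:, 2:], y)   # irrelevant predictor only
```

A stepwise procedure would repeat such fits while adding or dropping predictors, keeping the model with the better criterion (lower AIC or higher adjusted R²).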


Regression analysis – significant predictors

Significant predictors of the models selected with manual backwards elimination


3) Clustering analysis

Exploit the whole city information available (n-dimensional datasets), avoiding data compression.

Clustering technique to understand if diverse data naturally group together throughout the urban space.

Cluster each dataset and compare the clustering obtained for each pair of datasets.

k-means algorithm with 5 clusters (assessed with the Silhouette coefficient) applied on:

• CORINE: one vector, for each district (NIL)/cell, with the percentage of area belonging to each of the 12 categories

• Telecom: the whole footprint for each cell/NIL (a vector of 1440 values)
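The clustering step can be illustrated with a small NumPy-only sketch of k-means (Lloyd's algorithm) and the mean Silhouette coefficient. It is not the authors' implementation: the random initialisation, the Euclidean distance and the two-blob toy data are assumptions, and a library implementation would normally be preferred for real data.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initial centroids drawn from the data."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def mean_silhouette(X, labels):
    """Mean Silhouette coefficient (Euclidean distance); needs at least 2 clusters."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue                       # singleton cluster: score 0 by convention
        a = D[i, own].sum() / (own.sum() - 1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Toy data: two well-separated groups of 30 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(10.0, 0.3, (30, 2))])
scores = {k: mean_silhouette(X, kmeans(X, k)[0]) for k in range(2, 6)}
best_k = max(scores, key=scores.get)       # number of clusters with the highest score
```

In the setting of the slides, `X` would hold one row per district or cell, either the 12 CORINE percentages or the 1440-value Telecom footprint.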


Clustering analysis - Pairwise clusterings comparisons

Pairwise clustering comparisons using the Rand Index and Overall Accuracy to evaluate the correspondence between datasets
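The two agreement measures can be sketched as follows. The Rand Index is the fraction of item pairs on which two clusterings agree (both together or both apart). For Overall Accuracy, cluster labels from two independent clusterings first have to be matched; the sketch assumes the best label permutation is used, which is one plausible choice and not necessarily the one adopted by the authors.

```python
import numpy as np
from itertools import permutations

def rand_index(a, b):
    """Fraction of item pairs treated the same way by both clusterings."""
    a = np.asarray(a)
    b = np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)      # each unordered pair once
    return float(np.mean(same_a[iu] == same_b[iu]))

def overall_accuracy(a, b, k):
    """Share of items with the same label after the best relabelling of b.

    Labels are assumed to be integers in 0..k-1; all k! permutations are tried,
    which is affordable for small k (120 permutations for k = 5).
    """
    a = np.asarray(a)
    b = np.asarray(b)
    best = 0.0
    for perm in permutations(range(k)):
        mapped = np.array(perm)[b]
        best = max(best, float(np.mean(mapped == a)))
    return best

# Toy example: six areas clustered in two different ways
labels_corine = [0, 0, 1, 1, 2, 2]
labels_telecom = [1, 1, 0, 0, 2, 0]
ri = rand_index(labels_corine, labels_telecom)            # 12 of 15 pairs agree
oa = overall_accuracy(labels_corine, labels_telecom, 3)   # 5 of 6 areas match
```

The Rand Index does not depend on how clusters are numbered, whereas Overall Accuracy does, which is why the relabelling step is needed.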


Clustering analysis - Qualitative analysis at district level

[Figure: districts classified in the same way in the Telecom and CORINE datasets. Panels: CORINE and whole activity; CORINE and SMS in-out activity; CORINE and Internet activity]


Conclusions and future work

• Evaluation of the correlations between datasets at different levels

• different spatial resolution (coarse-grained district level vs. fine-grained cell level)

• different data complexity (very compressed information vs multi-dimensional data).

• Correlations between different sources exist and their strength depends on both spatial resolution and data complexity

• Diverse urban datasets can “date each other”, but their actual “affinity” can vary.

What is coming next?

• Extending our investigation toward a predictive approach (statistical and machine learning techniques).

