City Data Dating: emerging affinities between diverse urban datasets
Gloria Re Calegari, Irene Celino, Diego Peroni
Paper available at: http://www.sciencedirect.com/science/article/pii/S0306437915001362
Elsevier - Information Systems Journal 1
Digital information about cities
• Open data (large number of data sources available on the web):
  • Urban planning (land cover, public registers)
  • Demographics and statistics about the municipality
• Closed data sources produced and maintained by enterprises:
  • Phone activity data (sometimes made open!)
• User-generated information:
  • Volunteered geographic information and crowdsourced information (OpenStreetMap)
  • Location-based social networks (Foursquare check-ins and geolocated information)
• Real-time and streaming information:
  • Sensors (e.g. temperature, energy consumption, ...)
Do diverse urban datasets provide the same “picture” of the city?
Short term goals
• Discovering “affinities” between heterogeneous datasets.
• Using a human relations metaphor, do diverse urban datasets “date each other” and show “natural affinities”?
• What is the influence of spatial resolution and data complexity on the strength of the dependence between heterogeneous urban sources?
Long term goals
• Would it be possible to use one or more “cheap” datasets as proxy for more “expensive” data sources?
Milano datasets
Demographics:
• Population density
• Spatial resolution: census area (6,079 areas; median size 12,000 m²)
• Source: Milano open data
Points of interest (POIs):
• Transport, schools, sports facilities, amenities, shops, ...
• Spatial resolution: lat-long points
• Source: Milano open data (official, 6,718 POIs) and OpenStreetMap (user generated, 44,351 POIs)
Milano datasets
Land use: • type of land use according to CORINE taxonomy
(3-levels hierarchy, up to 40 types of land use defined)
• CORINE taxonomy http://swa.cefriel.it/ontologies/corine.html#
• 12 types selected (those that best characterize a metropolitan area such as Milan):
[dense residential areas, scattered residential areas, industrial areas, parks and green areas, roads, railways, hospitals, sports centres, public services and offices, construction sites, agricultural areas and wild areas.]
• Spatial resolution: building level
• Source: Lombardy region open data
Milano datasets
Call data records:
• 5 phone activities:
  • Incoming SMS
  • Outgoing SMS
  • Incoming calls
  • Outgoing calls
  • Internet
• Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013)
• Spatial resolution: grid of 3,538 square cells with a 250 m side
• Source: Telecom Italia – provided for their Big Data Challenge http://theodi.fbk.eu/openbigdata/
Pre-processing of data
Making spatial resolution uniform
Spatial resolutions used:
• District level with 88 official subdivisions
• Grid level with 3,538 square cells of 250 m
[Figure: the two spatial subdivisions, cells and districts]
New datasets generated:
• Density of POIs in each cell/district
• Weighted sum of population density in each cell/district
• Percentage share of each land use over each cell/district area
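The first of these derived datasets can be illustrated with a minimal sketch. It assumes the POI coordinates have already been projected from lat-long to a metric coordinate system and that the grid is regular and axis-aligned; the function name, the grid origin and the toy points are hypothetical and not taken from the paper.

```python
from collections import Counter

CELL_SIZE_M = 250.0  # cell side in metres, as in the grid described above


def poi_density_per_cell(points, origin_x, origin_y, cell_size=CELL_SIZE_M):
    """Count POIs per square cell and divide by the cell area (POIs per m^2).

    points: iterable of (x, y) in a projected metric system (assumption).
    Returns a dict {(col, row): density}; cells without POIs are omitted.
    """
    counts = Counter()
    for x, y in points:
        col = int((x - origin_x) // cell_size)
        row = int((y - origin_y) // cell_size)
        counts[(col, row)] += 1
    area = cell_size * cell_size
    return {cell: n / area for cell, n in counts.items()}


# Toy example: three POIs, two of which fall in the same cell.
pois = [(10.0, 20.0), (240.0, 100.0), (260.0, 30.0)]
density = poi_density_per_cell(pois, origin_x=0.0, origin_y=0.0)
```

The same counting scheme would work at district level if the cell lookup were replaced by a point-in-polygon test against the district boundaries.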
Pre-processing of data
Telecom data
• Footprint (temporal signature) for each cell/district: average activity over all the 60 days, distinguishing between weekdays and weekend days
• Temporal data compression: pre-processing large-scale time series to get a more manageable compressed representation
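A footprint of this kind could be computed roughly as follows. This is a sketch under assumptions: the input layout (one list of 144 values per day, for a single cell and a single activity), the function name and the toy data are hypothetical. The last line shows one reading that is consistent with the 1440-value footprint mentioned in the clustering step: 5 activities, 2 day types and 144 ten-minute slots.

```python
from datetime import date, timedelta

SLOTS_PER_DAY = 144  # one value every 10 minutes


def footprint(daily_series, first_day):
    """Average daily profiles separately over weekdays and weekend days.

    daily_series: list of per-day lists, each with SLOTS_PER_DAY values.
    first_day: datetime.date of the first entry in daily_series.
    Returns (weekday_profile, weekend_profile), each SLOTS_PER_DAY long.
    Assumes the input contains at least one day of each type.
    """
    sums = {False: [0.0] * SLOTS_PER_DAY, True: [0.0] * SLOTS_PER_DAY}
    days = {False: 0, True: 0}
    for offset, values in enumerate(daily_series):
        # weekday() returns 5 for Saturday and 6 for Sunday
        is_weekend = (first_day + timedelta(days=offset)).weekday() >= 5
        days[is_weekend] += 1
        for slot, value in enumerate(values):
            sums[is_weekend][slot] += value
    weekday = [s / days[False] for s in sums[False]]
    weekend = [s / days[True] for s in sums[True]]
    return weekday, weekend


# Toy data: 14 days, constant activity 10 on weekdays and 4 on weekend days.
start = date(2013, 11, 1)
series = []
for d in range(14):
    weekend_day = (start + timedelta(days=d)).weekday() >= 5
    series.append([4.0 if weekend_day else 10.0] * SLOTS_PER_DAY)

weekday_profile, weekend_profile = footprint(series, start)

# 5 activities x (week, weekend) x 144 slots per day
signature_length = 5 * 2 * SLOTS_PER_DAY
```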
Data exploration experiments
To discover possibly “natural” connections between heterogeneous datasets.
Three-step process
1) Correlation Analysis
2) Regression Analysis
3) Clustering Analysis
All analyses are performed at both district and cell level.
[Diagram label: complexity of data]
1) Correlation analysis
Try to identify possible correspondences between different datasets.
Measure whether and how two variables change together using a correlation index: Pearson’s correlation coefficient r
-1 ≤ r ≤ 1 (r > 0: positive correlation; r < 0: negative correlation)
Pairwise comparisons between 1-dimensional vectors:
• POIs: density
• Population: density
• Telecom: first principal component, which explains 90% of the variability
• Land use data: only residential and agricultural land use, each considered separately as the percentage of district/cell area it covers
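The pairwise comparison between two 1-dimensional vectors can be sketched with a plain implementation of Pearson's coefficient. The vectors below are toy values chosen for illustration only, not data from the Milano datasets, and the variable names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Toy vectors, one value per district (illustrative only).
telecom = [1.0, 2.0, 3.0, 4.0, 5.0]
residential_share = [2.0, 4.0, 6.0, 8.0, 10.0]  # grows with telecom activity
agricultural_share = [5.0, 4.0, 3.0, 2.0, 1.0]  # shrinks with telecom activity

r_resid = pearson_r(telecom, residential_share)
r_agric = pearson_r(telecom, agricultural_share)
```

In this toy case the first pair is perfectly positively correlated and the second perfectly negatively correlated; real urban data would fall somewhere in between.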
Correlation analysis - district level
• A correlation can actually exist between:
  • Telecom and residential land use
  • Telecom and POIs
• Data fit quasi-linear models.
[Figure: pairwise correlation matrix over tlc, resid, agric, POI mun, POI OSM, pop]
• Negative correlation between agricultural land use and other datasets -> human action inversely related to agricultural areas.
Correlation analysis - cell level
• All coefficients are lower than at district level
• The highest values are again between Telecom and residential land use, and between Telecom and POIs
=> the choice of resolution level can have a significant impact on the correlation results.
[Figure: pairwise correlation matrix over tlc, resid, agric, POI mun, POI OSM, pop]
• Some phenomena causing the correlation are independent of the resolution level (0.76 for residential-population).
Correlation analysis - phone calls and population
• Could the correlation change during the day according to the everyday human behaviour pattern (get up, go to work, come back home in the evening)?
• Call activity at 6 different day times
• Week and weekend profiles are different -> mirroring people’s different habits
• Average correlation higher in the weekend (phone activity related to the actual presence of people at home)
• Weekday profile -> human behaviour pattern
[Figure: correlation profiles at district and cell level, for week and weekend days]
2) Regression analysis
Fit multiple linear regression (MLR) models
(To take into consideration a larger portion of urban data complexity)
MLR models
Telecom (1-d vector for each communication type, for week/weekend days, with the average activity)
POIs (1-d vector for each POI category, with the POI density)
Demographics (1-d vector with the population density)
Land use (1-d vector for each land use, with the percentage of land covered; only the 5 main land use types are used)
[Diagram labels: cheap-to-produce, expensive-to-maintain]
Regression analysis - results
• The more predictors are selected (increasing k), the larger the part of the outcome variable’s variance that is explained (increasing adjusted R²).
• The strength of the correlation decreases from district level (coarse-grained) to cell level (fine-grained)
• The benefit of adding more predictors is weaker at cell level
• Population, dense residential and agricultural areas show a stronger correlation
Model selection: manual stepwise selection, AIC criterion
Heterogeneous datasets provide comparable or even similar pictures of the urban environment.
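The two quantities used to compare regression models, the adjusted R² and the AIC, can be sketched as follows. The formulas are the standard ones for least-squares linear models; the AIC is given up to an additive constant, which does not matter when comparing models fitted on the same observations. The numbers in the example (an R² of 0.80, a residual sum of squares of 100) are hypothetical, and only n = 88 is taken from the district-level setting above.

```python
import math


def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (intercept excluded)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def gaussian_aic(rss, n, k):
    """AIC of a least-squares fit, up to an additive constant.

    rss: residual sum of squares; k predictors plus one intercept.
    """
    return n * math.log(rss / n) + 2 * (k + 1)


# Hypothetical comparison at district level (n = 88 districts).
n_districts = 88
r2_adj_5 = adjusted_r2(0.80, n_districts, k=5)
r2_adj_6 = adjusted_r2(0.80, n_districts, k=6)  # extra predictor, same fit

aic_5 = gaussian_aic(100.0, n_districts, k=5)
aic_6 = gaussian_aic(100.0, n_districts, k=6)
```

Both criteria penalize a predictor that does not improve the fit: the adjusted R² drops and the AIC rises, which is the signal a stepwise selection would use to discard it.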
Regression analysis – significant predictors
Significant predictors of the models selected with manual backwards elimination
3) Clustering analysis
Exploit the whole city information available (n-dimensional datasets), avoiding data compression.
Clustering technique to understand if diverse data naturally group together throughout the urban space.
Cluster each dataset and compare the clustering obtained for each pair of datasets.
k-means algorithm with 5 clusters (Silhouette coefficient) applied to:
• CORINE: one vector for each NIL (district)/cell, with the percentage share of each of the 12 land use categories
• Telecom: the whole footprint for each NIL/cell (a vector of 1440 values)
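The clustering step can be sketched with a small, dependency-free version of k-means (Lloyd's algorithm) and the mean Silhouette coefficient. This is an illustration under assumptions, not the setup used in the paper: the deterministic farthest-point initialization, the function names and the toy land use shares are hypothetical, and the toy example uses 2 clusters rather than 5.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then take farthest points."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mean_silhouette(points, labels):
    """Mean Silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)  # dist(p, p) is 0
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy vectors: (residential share, agricultural share) for six areas.
areas = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
         [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
labels = kmeans(areas, k=2)
quality = mean_silhouette(areas, labels)
```

One common use of the Silhouette coefficient is to run the clustering for several values of k and keep the value with the highest mean score.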
Clustering analysis - Pairwise clusterings comparisons
Pairwise clustering comparisons using the Rand Index and Overall Accuracy to evaluate the correspondence between datasets
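Both comparison measures can be sketched in a few lines. The Rand Index is the fraction of pairs of areas on which two clusterings agree (both put the pair together, or both keep it apart). For overall accuracy the cluster labels of the two clusterings must first be aligned; the brute-force search over label permutations below is an assumption made for this sketch, not necessarily the matching used in the paper, and the toy labels are hypothetical.

```python
from itertools import combinations, permutations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by both clusterings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def overall_accuracy(labels_a, labels_b, k):
    """Share of items with the same class under the best label matching.

    Tries every permutation of the k labels (fine for small k such as 5).
    """
    best = 0
    for perm in permutations(range(k)):
        hits = sum(perm[a] == b for a, b in zip(labels_a, labels_b))
        best = max(best, hits)
    return best / len(labels_a)


# Toy clusterings of six areas into three classes (illustrative only).
corine_labels = [0, 0, 1, 1, 2, 2]
telecom_labels = [1, 1, 0, 0, 2, 0]

ri = rand_index(corine_labels, telecom_labels)
oa = overall_accuracy(corine_labels, telecom_labels, k=3)
```

Here the two clusterings agree on 12 of the 15 pairs (Rand Index 0.8) and, after matching labels, on 5 of the 6 areas.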
Clustering analysis - Qualitative analysis at district level
[Figure: districts classified in the same way in the Telecom and CORINE datasets; panels: CORINE and whole activity, CORINE and SMS in-out activity, CORINE and Internet activity]
Conclusions and future work
• Evaluation of the correlations between datasets at different levels:
  • different spatial resolution (coarse-grained district level vs. fine-grained cell level)
  • different data complexity (very compressed information vs. multi-dimensional data)
• Correlations between different sources exist, and their strength depends on both spatial resolution and data complexity
• Diverse urban datasets can “date each other”, but their actual “affinity” can vary.
What is coming next?
• Extending our investigation toward a predictive approach (statistical and machine learning techniques).