GRANDATA @ 2014 – All rights reserved.
Big Data for Understanding Human Dynamics
Carlos Sarraute, Grandata Labs
Hablemos de Big Data – 26 November 2014
GRANDATA
Brief presentation
Grandata
● Founded in 2012.
● Leverages advanced research in Human Dynamics (the application of “big data” to social relationships and human behavior)
● to identify market trends and predict customer actions
● integrating first-party and telco partner data.
Grandata Labs
Research team of 5 researchers, based in Buenos Aires, Argentina.
Research interests:
● Study mobility patterns, social interactions and their correlations in dynamic and mobile social networks.
● Integrate categorically different social networks to enhance our understanding of users' social behavior and interactions.
– e.g. a mobile phone social network and spending behavior networks built from bank transactions
Scientific collaborations
MIT Human Networks and Mobility
● Marta Gonzalez
MIT Human Dynamics
● Alex "Sandy" Pentland
INRIA
● Aline Viana, Eric Fleury
City College of New York
● Hernan Makse
Mobile Data Source
● Mobile phone company in Mexico with ~10% of market share (over 7 million users, logs include 90 million users).
● The raw data logs of calls between their clients and external users.
● Data collected over a three-month period.
● 2,185,852,564 calls
● 2,033,719,579 messages
● Each record contains:
– Hashed id of caller and callee
– Date, time and duration of call
– Geo-location of the caller (when the caller is a client)
● Age for a subset of users (ground truth)
Overview of this work
[Diagram: processing pipeline from the PHONE COMPANY to the GranData Servers]
● Transfer CDRs + ground truth to GRANDATA (hashed ids)
● Analyse ground truth
● Create graph from CDRs
● Reaction Diffusion Algorithm
● Topological metrics
● Selecting categories from probability vectors
Recognition
Observational Study
Analyzing the Ground Truth Data
We have over 500,000 users with known age and gender.
[Figure: age population pyramid of the ground truth users, showing a bimodal distribution]
Characterization Variables
● Number of Calls
– incoming calls / outgoing calls
– weekdays (Monday to Friday) / weekend
– “daylight” (from 7 a.m. to 7 p.m.) / “night” (before 7 a.m. and after 7 p.m.)
● Duration of Calls
● Number of SMS
● Number of Contact Days
● In/Out-degree of the Social network
● Degree of the Social network
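A minimal sketch of how these variables could be computed from call records. It assumes each record is a hypothetical tuple (caller, callee, start, duration_seconds); the function and field names are illustrative and not taken from the study. "Contact days" is interpreted here as the number of distinct days with at least one call, and SMS counts are left out for brevity.

```python
from collections import defaultdict
from datetime import datetime


def call_features(calls):
    """Aggregate per-user call variables.

    calls: iterable of (caller, callee, start: datetime, duration_seconds).
    Returns a dict user -> dict of feature name -> value.
    """
    feats = defaultdict(lambda: defaultdict(float))
    contacts_out = defaultdict(set)
    contacts_in = defaultdict(set)
    days = defaultdict(set)

    for caller, callee, start, duration in calls:
        # Monday..Friday have weekday() 0..4; Saturday and Sunday are 5 and 6.
        period = "weekend" if start.weekday() >= 5 else "weekday"
        # "Daylight" is 7 a.m. to 7 p.m.; everything else counts as "night".
        slot = "daylight" if 7 <= start.hour < 19 else "night"
        for user, direction in ((caller, "out"), (callee, "in")):
            f = feats[user]
            f[f"calls_{direction}"] += 1
            f[f"calls_{period}"] += 1
            f[f"calls_{slot}"] += 1
            f[f"duration_{direction}"] += duration
            days[user].add(start.date())
        contacts_out[caller].add(callee)
        contacts_in[callee].add(caller)

    result = {}
    for user, f in feats.items():
        row = dict(f)
        row["contact_days"] = len(days[user])
        row["out_degree"] = len(contacts_out[user])
        row["in_degree"] = len(contacts_in[user])
        row["degree"] = len(contacts_out[user] | contacts_in[user])
        result[user] = row
    return result
```

Features that never occur for a user (for example incoming calls for someone who only dials out) are simply absent from that user's row, so reading them with `row.get(name, 0)` is the safer choice.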
Differences between genders
Variable                  Female     Male
Total duration            10038.75   10663.17
Total duration outgoing   6359.96    7239.53
Total duration incoming   3678.78    3423.64
p(M|F) < p(M) = 0.5683 < p(M|M)
p(F|M) < p(F) = 0.4317 < p(F|F)
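The inequalities above compare conditional link probabilities with a baseline. A minimal sketch of one way to estimate them from links between users of known gender follows. It assumes p(X|Y) is the share of links started by a user of gender Y that reach a user of gender X, and it takes the baseline p(X) as the share of all labeled links that reach gender X; both definitions, and all names, are assumptions made for illustration. It also assumes both genders start at least one link.

```python
from collections import Counter


def gender_link_probabilities(links, gender):
    """Estimate p(X) and p(X|Y) from directed links.

    links: iterable of (caller, callee).
    gender: dict user -> "F" or "M"; links with an unlabeled end are skipped.
    """
    pair_counts = Counter()
    for caller, callee in links:
        if caller in gender and callee in gender:
            pair_counts[(gender[caller], gender[callee])] += 1

    total = sum(pair_counts.values())
    probs = {}
    for g in ("F", "M"):
        # Baseline: share of all labeled links whose callee has gender g.
        probs[f"p({g})"] = sum(pair_counts[(src, g)] for src in ("F", "M")) / total
        for src in ("F", "M"):
            # Conditional: among links started by gender src, share reaching g.
            from_src = pair_counts[(src, "F")] + pair_counts[(src, "M")]
            probs[f"p({g}|{src})"] = pair_counts[(src, g)] / from_src
    return probs
```

Gender homophily shows up when p(M|F) < p(M) < p(M|M), and likewise for F.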
Age homophily
[Figure: matrix (M) of age homophily in communications behavior]
Partly due to the double peak in the age histogram
Random links matrix (R)
Difference (M - R)
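A minimal sketch of the three matrices, stored as sparse counts keyed by (caller age, callee age). The null model used for R here, shuffling the known ages across users while keeping the links fixed, is an assumption chosen for illustration; the study may have built its random links differently. All function names are hypothetical.

```python
import random
from collections import Counter


def link_matrix(links, age):
    """M: number of links for each (caller age, callee age) pair."""
    counts = Counter()
    for u, v in links:
        if u in age and v in age:
            counts[(age[u], age[v])] += 1
    return counts


def random_link_matrix(links, age, seed=0):
    """R: same links, but with ages shuffled across users (assumed null model)."""
    rng = random.Random(seed)
    users = list(age)
    shuffled = [age[u] for u in users]
    rng.shuffle(shuffled)
    return link_matrix(links, dict(zip(users, shuffled)))


def difference(m, r):
    """M - R for every age pair present in either matrix."""
    return {pair: m[pair] - r[pair] for pair in set(m) | set(r)}
```

Because shuffling keeps the number of labeled links unchanged, the entries of M - R sum to zero; positive entries near the diagonal are what indicates homophily beyond what the age histogram alone explains.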
Age homophily – number of links
Inflection point
Prediction Results
Gender prediction
● Tried: Naive Bayes, Logistic Regression, Linear SVM, Linear Discriminant Analysis and Quadratic Discriminant Analysis.
● Best results: Linear SVM, Logistic Regression
● Precision obtained:
Population   1        1/2      1/4      1/8
Precision    66.3 %   72.9 %   77.1 %   81.4 %
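A minimal sketch of how precision on a fraction of the population could be measured. It assumes that the fractions 1, 1/2, 1/4 and 1/8 refer to keeping only the users whose predictions have the highest classifier confidence; that reading, and the function and argument names, are assumptions rather than something stated on the slide.

```python
def precision_at_fraction(predictions, truth, q):
    """Precision over the most confident fraction q of predicted users.

    predictions: dict user -> (predicted_label, confidence).
    truth: dict user -> true label.
    q: fraction of the population to keep, 0 < q <= 1.
    """
    ranked = sorted(predictions, key=lambda u: predictions[u][1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * q))]
    hits = sum(1 for u in kept if predictions[u][0] == truth[u])
    return hits / len(kept)
```

With a classifier such as Logistic Regression, the confidence could be the probability of the predicted class; restricting to smaller fractions trades coverage for precision.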
Age prediction
● Tried: Multinomial Logistic based on node features
● Problem: Doesn't harness the network topology
– In particular the strong age homophily
Building the Social Graph
[Table: sample of the edge list, with columns Caller id | Callee id]
~250 million edges
~70 million users
Raw data CDRs (adapted for presentation):
Hashed origin    Hashed target      DIRECTION  TIME  CITY     LATITUDE  LONGITUDE  OPERATOR  DATE
725BB5BFC026CB1  0CD8324BF87BC979   OUTGOING   36    Obregon  19.35     -99.21     TELCEL    15/04/2013 12:00:44 p.m.
CAAEBD085D13B86  82B005A384D23523E  OUTGOING   38    Obregon  19.35     -99.21     TELCEL    15/04/2013 08:35:32 p.m.
F49F7DE9DDECE07  304B6A2B8BC8BD6D   OUTGOING   206   Merida   21.01     -89.59     IUSACELL  15/04/2013 04:28:59 p.m.
Seeds (red)
- Symmetrized links
- Weights are 0 or 1
- Components with no seeds are removed
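The three rules above can be sketched as follows, assuming the edge list is an iterable of hypothetical (caller, callee) pairs and the seeds are the users with known age. Dropping self-loops is an extra choice made here, not something the slide states.

```python
from collections import defaultdict, deque


def build_seeded_graph(edges, seeds):
    """Build an undirected 0/1 graph and keep only components containing a seed.

    edges: iterable of (caller, callee) pairs.
    seeds: set of users with known labels.
    Returns a dict node -> set of neighbours.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            # Symmetrize; a set gives each link weight 1 regardless of repeats.
            adj[u].add(v)
            adj[v].add(u)

    keep = set()
    for seed in seeds:
        if seed in keep or seed not in adj:
            continue
        # Breadth-first search marks the whole component of this seed.
        queue = deque([seed])
        keep.add(seed)
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in keep:
                    keep.add(nb)
                    queue.append(nb)

    return {u: adj[u] for u in keep}
```

Components without seeds are discarded because no label information can ever reach their nodes through the links.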
Reaction Diffusion Algorithm
[Figure: category probability and information flow on the graph]
[Equation with annotated parts: graph Laplacian, reactive term, diffusion term, tuning parameter, age category]
Age categories: [10-24, 25-34, 35-50, 50+]
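A minimal sketch of a reaction-diffusion style label propagation, written under stated assumptions because the exact update rule is not reproduced in the text. Here each node holds a probability vector over the age categories; the reactive term pulls a node back to its initial state (a one-hot vector for seeds, uniform for other nodes), the diffusion term averages the neighbours' states (which corresponds to the random-walk normalized graph Laplacian on the 0/1 graph), and the hypothetical parameter `lam` balances the two. The function name, the initial states and the default values are illustrative choices.

```python
def reaction_diffusion(adj, seed_category, n_categories, lam=0.5, steps=30):
    """Propagate category probabilities over an undirected graph.

    adj: dict node -> set of neighbours (every neighbour is also a key).
    seed_category: dict seed node -> category index.
    lam: assumed tuning parameter in [0, 1]; 0 keeps the initial state,
         1 is pure diffusion.
    """
    uniform = [1.0 / n_categories] * n_categories
    initial = {}
    for node in adj:
        if node in seed_category:
            vec = [0.0] * n_categories
            vec[seed_category[node]] = 1.0
            initial[node] = vec
        else:
            initial[node] = list(uniform)

    state = {node: list(vec) for node, vec in initial.items()}
    for _ in range(steps):
        new_state = {}
        for node, neighbours in adj.items():
            if neighbours:
                # Diffusion term: average of the neighbours' current vectors.
                mean = [sum(state[nb][c] for nb in neighbours) / len(neighbours)
                        for c in range(n_categories)]
            else:
                mean = state[node]
            # Reactive term (initial state) mixed with the diffusion term.
            new_state[node] = [(1 - lam) * initial[node][c] + lam * mean[c]
                               for c in range(n_categories)]
        state = new_state
    return state
```

Each update is a convex combination of probability vectors, so every node's vector keeps summing to 1.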
Selecting age categories from the probability state
Maximum probability: For each node, select the category with the highest probability.
Pyramid scaling: Select categories by maximum probability, constrained so that the predicted population pyramid matches the one given by the seed nodes.
[Figure panels: maximum probability; pyramid scaling]
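A minimal sketch of the two selection rules. The greedy quota scheme used for pyramid scaling below is one possible way to enforce the constraint and is an assumption, not necessarily the procedure used in the study; `seed_shares` is a hypothetical list giving the share of seed nodes in each category.

```python
def max_probability(probs):
    """For each node, pick the index of its most probable category."""
    return {node: max(range(len(vec)), key=vec.__getitem__)
            for node, vec in probs.items()}


def pyramid_scaling(probs, seed_shares):
    """Assign categories so that their sizes follow the seed distribution.

    probs: dict node -> list of category probabilities.
    seed_shares: share of each category among the seeds (sums to 1).
    """
    n = len(probs)
    quota = [int(share * n) for share in seed_shares]
    # Visit (probability, node, category) triples from most to least confident.
    candidates = sorted(
        ((vec[c], node, c) for node, vec in probs.items() for c in range(len(vec))),
        key=lambda t: t[0],
        reverse=True,
    )
    chosen = {}
    for _, node, c in candidates:
        if node not in chosen and quota[c] > 0:
            chosen[node] = c
            quota[c] -= 1
    # Nodes left over by quota rounding fall back to their best category.
    fallback = max_probability(probs)
    for node in probs:
        chosen.setdefault(node, fallback[node])
    return chosen
```

Maximum probability can collapse most nodes into the dominant categories, while the quota forces the predicted age distribution to resemble the seeds' population pyramid.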
Precision obtained for age prediction
Population   Machine Learning   Reaction-Diffusion
q = 1        36.9 %             43.4 %
q = 1/2      42.9 %             47.2 %
q = 1/4      48.4 %             56.1 %
q = 1/8      52.7 %             62.3 %
Predictions are made among 4 age categories.
Conclusions
● First extensive study of social interactions in the country of Mexico focusing on gender and age, based on mobile phone usage.
● Gender homophily, and an asymmetry with respect to incoming and outgoing calls between men and women
● Strong age homophily
● Among standard Machine Learning tools, Logistic Regression and Linear SVM gave the best results
● A purely graph-based Reaction-Diffusion algorithm for age prediction
Future steps
● Analysis of the performance as a function of parameters and topological metrics
● Add mobility information
– Differences in mobility patterns between genders and age groups
● Apply this methodology to predict users' spending behavior