Page 1

GRANDATA @ 2014 – All rights reserved.

Big Data for Understanding Human Dynamics (Big Data para comprender la Dinámica Humana)

Carlos Sarraute, Grandata Labs

Hablemos de Big Data, 26 November 2014

Page 2

Brief presentation

Grandata

● Founded in 2012.

● Leverages advanced research in Human Dynamics (the application of "big data" to social relationships and human behavior)

● to identify market trends and predict customer actions,

● integrating first-party and telco partner data.

Grandata Labs

Research team of 5 researchers, based in Buenos Aires, Argentina.

Research interests:

● Studying mobility patterns, social interactions and their correlations in dynamic and mobile social networks.

● Integrating categorically different social networks to enhance our understanding of users' social behavior and interactions

  ● e.g. the mobile phone social network and bank transaction spending-behavior networks

Page 3

Scientific collaborations

● MIT Human Networks and Mobility: Marta Gonzalez

● MIT Human Dynamics: Alex "Sandy" Pentland

● INRIA: Aline Viana, Eric Fleury

● City College of New York: Hernan Makse

Page 4

Mobile Data Source

● Mobile phone company in Mexico with ~10% market share (over 7 million users; logs include 90 million users).

● Raw data logs of calls between its clients and external users.

● Data collected over a three-month period.

● 2,185,852,564 calls

● 2,033,719,579 messages

● Each record contains:

  ● Hashed id of caller and callee

  ● Date, time and duration of the call

  ● Geolocation of the caller, if the caller is a client

● Age for a subset of users (ground truth)

Page 5

Overview of this work

[Diagram] Phone company: transfers CDRs + ground truth to Grandata (hashed ids)

[Diagram] Grandata servers:

● Create graph from CDRs

● Analyse ground truth

● Reaction-Diffusion algorithm

● Topological metrics

● Selecting categories from probability vectors

Page 6

Recognition

Page 7

Observational Study

Page 8

Analyzing the Ground Truth Data

We have over 500,000 users with known age and gender.

[Figure: age population pyramid of the ground-truth users; bimodal distribution]

Page 9

Characterization Variables

Number of Calls

● incoming calls / outgoing calls

● weekdays (Monday to Friday) / weekend

● "daylight" (from 7 a.m. to 7 p.m.) / "night" (before 7 a.m. and after 7 p.m.)

Duration of Calls

Number of SMS

Number of Contact Days

In/Out-degree of the Social network

Degree of the Social network
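
A minimal sketch of how per-user variables like these could be counted from call records. The record layout (`Call` with `caller`, `callee`, `start`, `duration`), the feature names and the function name are assumptions made for illustration; they are not the format or code used in the study.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Call:
    caller: str        # hashed id of the caller (hypothetical field name)
    callee: str        # hashed id of the callee
    start: datetime    # date and time the call started
    duration: int      # call duration (unit assumed to be seconds)


def call_features(calls):
    """Count calls per user, split by direction, weekday/weekend and daylight/night."""
    feats = defaultdict(lambda: defaultdict(int))
    for c in calls:
        day = "weekday" if c.start.weekday() < 5 else "weekend"   # Monday=0 ... Friday=4
        hour = "daylight" if 7 <= c.start.hour < 19 else "night"  # 7 a.m. to 7 p.m.
        for user, direction in ((c.caller, "out"), (c.callee, "in")):
            f = feats[user]
            f[f"calls_{direction}"] += 1
            f[f"calls_{direction}_{day}"] += 1
            f[f"calls_{direction}_{hour}"] += 1
            f[f"duration_{direction}"] += c.duration
    return feats


# Toy records, not study data
calls = [
    Call("A", "B", datetime(2013, 4, 15, 12, 0, 44), 36),   # a Monday, daylight
    Call("A", "C", datetime(2013, 4, 20, 20, 35, 32), 38),  # a Saturday, night
]
features = call_features(calls)
```

The same loop could be extended with sets of distinct contacts per user to obtain the in-degree and out-degree.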

Page 10

Differences between genders

Variable                  Female      Male
Total duration            10038.75    10663.17
Total duration outgoing    6359.96     7239.53
Total duration incoming    3678.78     3423.64

p(M|F) < p(M) = 0.5683 < p(M|M)

p(F|M) < p(F) = 0.4317 < p(F|F)
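
A hedged sketch of how conditional probabilities of this kind could be estimated from labeled links. It reads p(M|F) as "probability that the callee is male given a female caller", which is an assumption about the slide's notation; the function name and the toy data are illustrative only.

```python
from collections import Counter


def callee_given_caller(links, gender):
    """Estimate p(callee gender | caller gender) from directed links between labeled users."""
    pair_counts = Counter()
    caller_counts = Counter()
    for caller, callee in links:
        if caller in gender and callee in gender:   # only links with both ends labeled
            pair_counts[(gender[caller], gender[callee])] += 1
            caller_counts[gender[caller]] += 1
    return {
        (g_caller, g_callee): n / caller_counts[g_caller]
        for (g_caller, g_callee), n in pair_counts.items()
    }


gender = {"a": "F", "b": "F", "c": "M", "d": "M"}   # toy labels, not study data
links = [("a", "b"), ("a", "b"), ("a", "c"), ("c", "d"), ("c", "d"), ("c", "a"), ("d", "c")]
p = callee_given_caller(links, gender)   # keys are (caller gender, callee gender)
```

Homophily shows up when `p[("M", "M")]` exceeds the overall share of male callees and `p[("F", "M")]` falls below it.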

Page 11

Age homophily

Age homophily in communications behavior (M)

Partly due to the double peak in the age histogram

Page 12

Random links matrix (R)

Page 13

Difference (M - R)
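
The three matrices M, R and M - R can be sketched as follows. The slides do not say how the random-links matrix was generated, so this example assumes a simple independence baseline (expected counts when the callee's age does not depend on the caller's age). All names and data are illustrative.

```python
from collections import Counter


def age_matrices(links, age):
    """Return observed counts M, independence baseline R, and M - R, keyed by (age, age)."""
    M = Counter()
    for u, v in links:
        if u in age and v in age:
            M[(age[u], age[v])] += 1
    total = sum(M.values())
    rows, cols = Counter(), Counter()
    for (a, b), n in M.items():
        rows[a] += n
        cols[b] += n
    # Assumed baseline: expected counts if callee age were independent of caller age
    R = {(a, b): rows[a] * cols[b] / total for a in rows for b in cols}
    diff = {key: M[key] - R[key] for key in R}
    return M, R, diff


age = {"a": 20, "b": 20, "c": 45, "d": 45}   # toy ages, not study data
links = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("a", "c"), ("d", "b")]
M, R, diff = age_matrices(links, age)
```

Positive entries of `diff` on the diagonal indicate more same-age links than the baseline predicts.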

Page 14

Age homophily – number of links

Inflection point

Page 15

Prediction Results

Page 16

Gender prediction

● Tried: Naive Bayes, Logistic Regression, Linear SVM, Linear Discriminant Analysis and Quadratic Discriminant Analysis.

● Best results: Linear SVM, Logistic Regression

● Precision obtained:

Population   1        1/2      1/4      1/8
Precision    66.3 %   72.9 %   77.1 %   81.4 %
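
One way to read the "Population" row is that precision is measured only over the fraction of users with the most confident predictions. That reading is an assumption, as are the function name and the toy data in this sketch.

```python
def precision_at_fraction(scored, q):
    """Precision over the fraction q of users with the most confident predictions.

    scored: list of (confidence, predicted_label, true_label) tuples.
    """
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * q))]
    correct = sum(1 for _, predicted, truth in kept if predicted == truth)
    return correct / len(kept)


# Toy predictions, not study data: (confidence, predicted, true)
scored = [
    (0.95, "M", "M"), (0.90, "F", "F"), (0.80, "M", "M"), (0.70, "F", "M"),
    (0.65, "M", "M"), (0.60, "F", "M"), (0.55, "M", "F"), (0.51, "F", "F"),
]
precisions = {q: precision_at_fraction(scored, q) for q in (1, 0.5, 0.25)}
```

Under this reading, precision rises as the population shrinks because the classifier is only asked about the users it is most sure of.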

Page 17

Age prediction

● Tried: Multinomial Logistic Regression based on node features

● Problem: Doesn't harness the network topology

– In particular the strong age homophily

Page 18

Building the Social Graph

Raw data CDRs (adapted for presentation):

Hashed origin     Hashed target       DIRECTION   TIME   CITY      LATITUDE   LONGITUDE   OPERATOR   DATE
725BB5BFC026CB1   0CD8324BF87BC979    OUTGOING    36     Obregon   19.35      -99.21      TELCEL     15/04/2013 12:00:44 p.m.
CAAEBD085D13B86   82B005A384D23523E   OUTGOING    38     Obregon   19.35      -99.21      TELCEL     15/04/2013 08:35:32 p.m.
F49F7DE9DDECE07   304B6A2B8BC8BD6D    OUTGOING    206    Merida    21.01      -89.59      IUSACELL   15/04/2013 04:28:59 p.m.

[Table: edge list with columns Caller id | Callee id]

~250 million edges

~70 million users

Seeds (shown in red)

- Symmetrized links

- Weights are 0 or 1

- Components with no seeds are removed
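
The three rules above can be sketched as follows: links are stored in both directions, repeated calls collapse to a single unweighted edge, and only nodes reachable from a seed are kept. The function name and the adjacency-set representation are assumptions for illustration, not the implementation used in the study.

```python
from collections import defaultdict, deque


def build_graph(edges, seeds):
    """Build an undirected, unweighted graph and keep only components containing a seed."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)   # symmetrize: store both directions;
            adj[v].add(u)   # sets collapse repeated calls to weight 1
    # Breadth-first search from every seed marks the nodes to keep
    queue = deque(s for s in seeds if s in adj)
    keep = set(queue)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in keep:
                keep.add(v)
                queue.append(v)
    return {u: adj[u] for u in keep}


edges = [("a", "b"), ("b", "a"), ("b", "c"), ("x", "y")]   # toy caller/callee pairs
graph = build_graph(edges, seeds={"a"})
```

In this toy input the component containing `x` and `y` has no seed, so it is dropped.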

Page 19

Reaction Diffusion Algorithm

[Equation, with the following annotations]

● Category probability

● Information flow

● Graph Laplacian

● Reactive term

● Diffusion term

● Tuning parameter

● Age category

Age categories: [10-24, 25-34, 35-50, 50+]
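
The slide's equation is not available in this transcript. The sketch below shows one common discrete form of a reaction-diffusion update on a graph, offered as an assumption rather than the authors' exact formula: a reactive term pulls each node back toward its initial probability vector, a diffusion term averages the vectors of its neighbors, and `lam` plays the role of the tuning parameter. Names, defaults and data are illustrative.

```python
def reaction_diffusion(graph, initial, lam=0.5, steps=30):
    """Iterate f_t(x) = (1 - lam) * f_0(x) + lam * mean of f_{t-1}(y) over neighbors y.

    graph:   dict node -> set of neighbors (symmetric, weights 0 or 1)
    initial: dict node -> probability vector over age categories
    """
    state = {x: list(initial[x]) for x in graph}
    for _ in range(steps):
        new_state = {}
        for x, neighbors in graph.items():
            k = len(initial[x])
            if neighbors:
                diffusion = [
                    sum(state[y][a] for y in neighbors) / len(neighbors)
                    for a in range(k)
                ]
            else:
                diffusion = state[x]   # isolated node: nothing to diffuse
            new_state[x] = [
                (1 - lam) * initial[x][a] + lam * diffusion[a] for a in range(k)
            ]
        state = new_state
    return state


CATEGORIES = ["10-24", "25-34", "35-50", "50+"]
uniform = [0.25] * 4
graph = {"s1": {"u"}, "u": {"s1", "s2"}, "s2": {"u"}}
# s1 and s2 are seeds known to be in the first category; u is unlabeled
initial = {"s1": [1, 0, 0, 0], "s2": [1, 0, 0, 0], "u": uniform}
final = reaction_diffusion(graph, initial)
```

Each update is a convex combination of probability vectors, so every node's vector keeps summing to 1, and the unlabeled node drifts toward the category of its seed neighbors.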

Page 20

Selecting age categories from the probability state

Maximum probability: for each node, select the category with the highest probability.

Pyramid scaling: select categories by maximum probability, constrained so that the resulting population pyramid matches the one given by the seed nodes.

[Figure panels: maximum probability; pyramid scaling]
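
A sketch of the two selection rules. Maximum probability is a plain argmax. The slides do not spell out the pyramid-scaling procedure, so the greedy quota-based assignment below is a hypothetical stand-in that only illustrates the idea of constraining predictions to target category shares.

```python
def max_probability(probs):
    """Pick, for each node, the index of the category with the highest probability."""
    return {x: max(range(len(p)), key=lambda a: p[a]) for x, p in probs.items()}


def pyramid_scaling(probs, shares):
    """Greedy assignment that respects target category shares (hypothetical procedure).

    Visits (node, category) pairs from most to least probable and assigns a node
    to a category only while that category's quota is not yet full.
    """
    n = len(probs)
    quota = [round(s * n) for s in shares]
    pairs = sorted(
        ((p[a], x, a) for x, p in probs.items() for a in range(len(p))),
        key=lambda t: t[0],
        reverse=True,
    )
    assigned = {}
    for _, x, a in pairs:
        if x not in assigned and quota[a] > 0:
            assigned[x] = a
            quota[a] -= 1
    # Rounding can leave nodes unassigned; fall back to their most probable category
    fallback = max_probability({x: probs[x] for x in probs if x not in assigned})
    return {**assigned, **fallback}


# Toy probability vectors over two categories, not study data
probs = {"u1": [0.6, 0.4], "u2": [0.7, 0.3], "u3": [0.55, 0.45], "u4": [0.9, 0.1]}
by_max = max_probability(probs)
by_pyramid = pyramid_scaling(probs, shares=[0.5, 0.5])
```

In this toy input, maximum probability puts every node in the first category, while the quota-constrained version splits them evenly as the target shares require.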

Page 21

Precision obtained for age prediction

Population   Machine Learning   Reaction-Diffusion
q = 1        36.9 %             43.4 %
q = 1/2      42.9 %             47.2 %
q = 1/4      48.4 %             56.1 %
q = 1/8      52.7 %             62.3 %

Predictions are made among 4 age categories.

Page 22

Conclusions

● First extensive study of social interactions in the country of Mexico focusing on gender and age, based on mobile phone usage.

● Gender homophily, and an asymmetry between men and women with respect to incoming and outgoing calls

● Strong age homophily

● Gender prediction with standard Machine Learning tools: Logistic Regression and Linear SVM gave the best results

● Age prediction with a purely graph-based Reaction-Diffusion algorithm

Page 23

Future steps

● Analysis of the performance as a function of parameters and topological metrics

● Add mobility information

– Differences in mobility patterns between genders and age groups

● Apply this methodology to predict users' spending behavior

Page 24

Thank you!