d-sieve : a novel data processing engine for crises related social messages

D-Sieve: A Novel Data Processing Engine for Crises Related Social

Messages

Soudip Roy Chowdhury (INRIA & UPSUD, France)Hemant Purohit (Wright State Uni., USA)

Muhammad Imran (QCRI, Doha)

SWDM 2015 May18, 2015

Social Media During Crises

Courtesy: http://worldmap.harvard.edu/maps/nepalquake

And it contains…

TimelyInformation

And it contains…

ActionableInformation

And it contains…

TacticalInformation

Social Data Processing

• Real-time analysis of crises related social data is necessary– Manual analysis • Standby task force [N. Morrow et al. 2011]

– System mediated analysis• Unsupervised learning [T. Sakaki et al. 2010]

• Supervised learning [S. Roy Chowdhury et al. 2013]

(not scalable)

(noisy / ambiguous data)

(scarce training data / non-uniform class distribution)

D-Sieve• Post-classification data processing engine in a

supervised classification setting

• Noise

Clas

s A

Clas

s B

• NoiseCl

ass

ACl

ass

B

Supervised Classifier

D-SieveRaw DataCorrectly Classified Data

Data Model

• A message contains

TEXT Date

Data Model

• A message contains

TEXT Date

hashtags

content

URL

Our Approach

• D-Sieve extracts two content features to classify messages correctly– Stable hashtag association– Stable named entity association

Stable Hashtag Association

Relative frequencies of stable hashtags for an event remain constant over time….

Stable Hashtag Association• A hashtag is stable, if the moving average score

of its frequency distribution (after k posts) becomes steady [X.S. Yang et al. 2013]

• Stable hashtag set contains all stable hashtags for an incident

• Stable hashtag association metric for an input message

• and are set of stable hashtags for true positive and true negative classes

Stable Named Entity Association• Named entities are extracted from the

message content + URL content• Stable named entity association metric for an

input message

• We calculate the final feature association metrics

• Feature weights ( ) are chosen such that ranges [-1,1]

• Rule of thumb weights are set as 0.5

Functional Architecture

In our experiments..CrisisLexT6

[A.Oleanu et al. 2014]

Hurricane Sandy (10k) , Queensland

Flood (10k)

Random Forest[A. Liaw et al.

2002]

UB = 0.8LB = 0.7

50% of the data used during model creation and rest used for testing

Few Observations

• Stable hashtag set contained – 24 and 21 tags for TP class of Sandy and

Queensland dataset respectively– 0 and 7 tags for TN class of Sandy and Queensland

dataset respectively– 68% and 70% of the whole data for Sandy and

Queensland data satisfy the UB and LB constraints

Experimental Results

precision recall

OC 96.09 81.42

OC+HT 96.23 84.51

OC+NE 96.31 86.49

OC++ 96.35 87.55

72.5

82.5

92.5

Sandy 2012

#per

cent

age

precesion recall

OC 99.53 36.75

OC+HT 99.8 58.47

OC+NE 99.75 68.9

OC++ 99.8 74.24

10

50

90

Queensland 2013

#per

cent

age

D-sieve improves recall values substantially (6% and 37% improvements for two datasets).

Future Work

• Investigate efficiency of D-sieve both for online and offline classification settings

• Address the latency issue for feature extraction (URL content crawling and entitiy spotting) using parallel computing

• Extend our experiments with Facebook and G+ data

References• A. Liaw and M. Wiener. Classification and regression by randomforest. R news,

2(3):18–22, 2002. • A. Olteanu, C. Castillo, F. Diaz, and S. Vieweg. “Crisislex: A lexicon for collecting

and filtering microblogged communications in crises”. In ICWSM, 2014. • N. Morrow, N. Mock, A. Papendieck, and N. Kocmich. “Independent evaluation of

the ushahidi haiti project”. Development Information Systems International, 2011.• S. Roy Chowdhury, M. Imran, M. R. Asghar, S. Amer-Yahia, and C. Castillo.

“Tweet4act: Using incident-specific profiles for classifying crisis-related messages”. In ISCRAM 2013.

• Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. “Earthquake shakes Twitter users: real-time event detection by social sensors”. In WWW 2010.

• X. S. Yang, D. W. Cheung, L. Mo, R. Cheng, and B. Kao. “On incentive-based tagging”. In ICDE 2013.

d-sieve : a novel data processing engine for crises related social messages

Education

data model

classified data

novel data processing

entity association metric

efficiency of d

queensland dataset

percentage queensland

percentage sandy