identifying relevant messages in a twitter-based citizen channel for natural disaster situations

Post on 27-Jul-2015

Identifying Relevant Messages in a Twitter-based Citizen Channel for Natural Disaster Situations

Alfredo Cobo ajcobo@uc.cl

Denis Parra dparra@ing.puc.cl

Jaime Navón jnavon@ing.puc.cl

Pontificia Universidad Católica de Chile, Departamento de Ciencia de la Computación

Av. Vicuña Mackenna 4860, Macul, Santiago, Chile

 

I (… and some other people in this room)

… come from Chile

Picture from http://www.quadrodemedalhas.com/images/mapas/mapa-chile.jpg

http://upload.wikimedia.org/wikipedia/commons/thumb/9/91/Chile_in_South_America_(-mini_map_-rivers).svg/409px-Chile_in_South_America_(-mini_map_-rivers).svg.png

Chile, well-known for its..

• Copper (Top Producer)

"Top 5 Copper Producers" by Plazak - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Top_5_Copper_Producers.png#/media/File:Top_5_Copper_Producers.png

https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0CAYQjB0&url=http%3A%2F%2Fcommons.wikimedia.org%2Fwiki%2FFile%3ANative_Copper_(mineral).jpg&ei=L31ZVbOsL4r1UrbRgKAB&bvm=bv.93564037,d.d24&psig=AFQjCNHr2zm5m4Jmim7AgkCwwSb0b5mGUA&ust=1432014509629311

Chile, well-known for its..

• Wine (Price + quality)

"Fiesta de Vendimia" by LuxoDresden - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Fiesta_de_Vendimia.JPG#/media/File:Fiesta_de_Vendimia.JPG

If you start typing in Google…

9 out of 10 disasters …

If you start typing in Google…

9 out of 10 disasters … prefer Chile

… and for Natural Disasters :(

• Largest earthquake ever recorded: Valdivia, Chile, May 22, 1960 (magnitude 9.5)

• We usually have 1 large earthquake every 30 years (magnitude ~8)

• The last one was in 2010, close to Concepción, but it also affected Santiago (the capital)

… so, at PUC Chile

• We created CIGIDEN, “National Research Center for the Integrated Administration of Natural Disasters”

CIGIDEN’s Goal in this project

• Help citizens stay informed during natural disasters by using social media.

• Build a mobile application (Carlos Molina)

• Automatically filter relevant messages from those not related to earthquakes (Alfredo Cobo) to feed the application

 

Our Task: Building a Twitter classifier

- Filter tweets related to natural disasters from those that are not.

Related Work

• Manual Classification: Vieweg et al. (2010), Imran et al. (2013), Mendoza et al. (2010)

• Data Post-processing: Mendoza et al. (2010), Castillo et al. (2011) (Information Credibility on Twitter)

• Feature Generation: Gimpel et al. (2011), Kouloumpis et al. (2011), Liu et al. (2012), Wu et al. (2011), Lee et al. (2014) (not necessarily for natural disasters)

• Tools for Disaster Management: Hiltz et al. (2013), Power et al. (2013), Caragea et al. (2011), Abel et al. (2012), Middleton et al. (2014), Morstatter et al. (2013), Imran et al. (2014)

Why would building this classifier be a contribution?

• Building and validating a ground truth for classifying tweets in Spanish.

• Building the classifier and dealing with:

• Class imbalance

• Number of latent dimensions (feature generation using LDA)

Workflow of Activities

[Workflow diagram. Labels: Chile’s Earthquake 2010; Castillo et al. (2010); Sampling, Cleaning & Filtering; Our ground truth; Non-relevant messages; 10% - 80%; Realistic dataset; Classifiers: Feature selection (LDA), Class Imbalance]

Building the ground truth

• Random sample of 5,000 tweets from the Castillo et al. (2010) dataset, which was used to study credibility around Chile’s 2010 earthquake.

• Dates: from February 27th until March 2nd (spanning 4 days in 2010)

• We kept only Spanish messages and removed messages that were too similar (Levenshtein distance): 2,187 messages left
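The near-duplicate filtering step could look roughly like the sketch below. It is only an illustration: the slides do not say which similarity threshold or comparison order was used, so `max_ratio` and the greedy "keep the first one seen" strategy are assumptions, and the sample tweets are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def drop_near_duplicates(messages, max_ratio=0.2):
    """Greedily keep messages that are not too close to one already kept.

    max_ratio is a hypothetical threshold (distance / longer length);
    the value used in the original work is not given in the slides.
    """
    kept = []
    for msg in messages:
        is_duplicate = any(
            levenshtein(msg, other) <= max_ratio * max(len(msg), len(other), 1)
            for other in kept
        )
        if not is_duplicate:
            kept.append(msg)
    return kept


# Made-up sample messages, for illustration only.
tweets = [
    "terremoto en concepcion",
    "terremoto en concepcion!",
    "se busca a mi hermano en talca",
]
unique = drop_near_duplicates(tweets)
```

Comparing every message against every kept message is quadratic, which is acceptable for a sample of 5,000 tweets but would need blocking or hashing on a larger stream.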

Validating the ground truth

• Fleiss’ Kappa: κ = 0.645, p < .001

• Intraclass correlation: ICC(2,1) = 0.646, p < .001

• Interpretation following Landis and Koch (1977): κ between 0.61 and 0.80 indicates substantial agreement

 

• Relevant messages were labeled based on the Imran et al. (2013) classification:

• Caution/Warning

• Casualties and Damage

• People (missing, found, etc.)

• Information source
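For readers unfamiliar with the agreement statistic reported on this slide, here is a minimal sketch of how Fleiss' kappa is computed from a table of rating counts. The toy ratings (3 raters, 2 categories) are invented for illustration; the slides do not state how many annotators labeled each tweet.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must have the same number of raters. The value is
    undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: columns are (relevant, not relevant), 3 raters per tweet.
toy_counts = [[3, 0], [2, 1], [1, 2], [0, 3]]
kappa = fleiss_kappa(toy_counts)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is the scale on which the reported κ = 0.645 is read.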

Classification Problem

Features

• User network: followers, followees

• Content (4,766 unique words): hashtags, words, user mentions

Class Imbalance

• The ground truth is not a realistic representation of Twitter

• We added “noise”: introduced tweets not relevant to the event (20% - 80%)

• Sampled non-relevant tweets from 5 months.

• Removed all tweets posted during days of seismic activity
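The noise-injection idea can be sketched as follows. This assumes that "noise proportion" means the share of the final dataset made up of added non-relevant tweets; the slides do not define it precisely. The function names and the label `0` for noise are hypothetical.

```python
import random


def noise_count(n_ground_truth: int, noise_proportion: float) -> int:
    """How many noise tweets to add so that they make up
    `noise_proportion` of the combined dataset (assumed definition)."""
    if not 0 <= noise_proportion < 1:
        raise ValueError("noise_proportion must be in [0, 1)")
    return round(n_ground_truth * noise_proportion / (1 - noise_proportion))


def add_noise(ground_truth, noise_pool, noise_proportion, seed=0):
    """Mix labeled ground-truth tweets with sampled non-relevant tweets.

    ground_truth: list of (text, label) pairs, label 1 = relevant.
    noise_pool:   list of texts from days without seismic activity.
    """
    k = noise_count(len(ground_truth), noise_proportion)
    rng = random.Random(seed)
    noise = rng.sample(noise_pool, k)  # raises ValueError if the pool is too small
    mixed = list(ground_truth) + [(text, 0) for text in noise]
    rng.shuffle(mixed)
    return mixed


# Under this definition, 80% noise on 2,187 messages needs 8,748 extra tweets.
needed = noise_count(2187, 0.8)
```

Under this reading, the higher noise levels multiply the dataset size several times over, which is why a large pool of non-relevant tweets (here, 5 months) is needed.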

Classification Results

Model                Precision  Recall  F1 score  Accuracy  AUC    Dimensions  Noise Proportion
Baseline             0.625      0.545   0.53      0.5       0.568  -           0
Bernoulli NB         0.831      0.226   0.355     0.594     0.605  2000        0
Logistic Regression  0.827      0.641   0.722     0.756     0.834  2000        0.6
Linear SVM           0.687      0.677   0.682     0.687     0.719  1000        0.6
Random Forest        0.807      0.673   0.734     0.758     0.844  1000        0.8
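As a quick check on how the F1 column relates to precision and recall, the sketch below applies the usual harmonic-mean formula to the four classifier rows. The function name is illustrative. The baseline row does not reproduce under this formula (it gives about 0.582 rather than 0.53), possibly because a different averaging was used there; the slides do not say.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the results table.
rows = {
    "Bernoulli NB": (0.831, 0.226),
    "Logistic Regression": (0.827, 0.641),
    "Linear SVM": (0.687, 0.677),
    "Random Forest": (0.807, 0.673),
}
f1_by_model = {name: round(f1_score(p, r), 3) for name, (p, r) in rows.items()}
```

The Bernoulli NB row shows why F1 is reported alongside precision: its precision is the highest in the table, but the low recall pulls the harmonic mean well below the other models.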

Analysis ~ LDA Dimensions and Noise

Conclusions & Future Work

• We built and validated a ground truth of tweets in Spanish relevant to disasters

• We implemented a classifier and analyzed its performance across several algorithms while handling the class imbalance problem

• Future Work: move the application from prototype to production and test online scalability

That’s all folks!

• Thanks! Send questions to corresponding author Alfredo Cobo: ajcobo@uc.cl or Denis Parra: dparra@uc.cl

Chile, small country, but well-known for its..

• Length (4,300 km)

[Map labels: ~4,300 km; ~8,000 km]

Model Features

• Newman et al. (2007)

• Biro et al. (2008)

• Wei et al. (2006)

• Wang et al. (2012)

• Han (2005)

Features: followers, friends

Corpora features: hashtags, words, user mentions

Results

• Amatriain et al. (2013)

Architecture

Plots of bootstrap

[Plot panels: Agreement Day 1, Agreement Day 2, Agreement Day 3, Agreement Day 4]

Word Frequencies

Just “Terremoto”: AUC

Related Work

Manual classification

• Vieweg et al. (2010)

• Imran et al. (2013)

Post Processing

• Castillo et al. (2011)

• Mendoza et al. (2010)

Feature Generation Approaches

• Gimpel et al. (2011)

• Kouloumpis et al. (2011)

• Liu et al. (2012)

• Wu et al. (2011)

• Lee et al. (2014)

Tools For Disaster Management

• Hiltz et al. (2013)

• Power et al. (2013)

• Caragea et al. (2011)

• Abel et al. (2012)

• Middleton et al. (2014)

• Morstatter et al. (2013)

• Imran et al. (2014)

Building the ground truth

• Mendoza et al. (2010)

• Imran et al. (2013)

Algorithms and evaluation procedure

• Castillo et al. (2011)

• Fawcett et al. (2004)

• Manning et al. (2008)

• Wen et al. (2014)
