Crisis Informatics (November 2013)
Crisis informatics: Finding relevant and credible information on social media during disasters
January 2010
How/when did it start for me?
Carlos Castillo – [email protected] – http://www.chato.cl/research/
Fertile grounds for applied research
✔ Problems of global significance
✔ Solved with labor-intensive methods
✔ Better solution provides a public good
✔ Large and noisy data sets available
✔ Engage volunteer communities
State of the art
At least 650 publications:
Crisis Analysis (52)
Crisis Management (280)
Situational Awareness (58)
Social Media (203)
Mobile Phones (64)
Crowdsourcing (109)
Software and Tools (90)
Human-Computer Interaction (28)
Natural Language Processing (33)
Trust and Security (31)
Geographical Analysis (45)
Source: http://humanitariancomp.referata.com/
Publication titles
Fertile grounds for applied research
✔ Problem of global significance
✔ Solved with labor-intensive methods
✔ Better solution provides a public good
✔ Large and noisy data sets available
✔ Engage volunteer communities
• Relevance to practitioners?
Patrick Meier, Social Innovation Director @ QCRI – http://irevolution.net/
“What can speed humanitarian
response to tsunami-ravaged
coasts? Expose human rights
atrocities? Launch helicopters to
rescue earthquake victims?
Outwit corrupt regimes?
A map.”
Collaborators
Muhammad Imran – QCRI
Hemant Purohit – Wright Univ.
Alexandra Olteanu – EPFL
Jakob Rogstadius – Univ. of Madeira
Ioanna Lykourentzou – INRIA
Shady Elbassuoni – Univ. of Beirut
Lalana Kagal et al. – CSAIL MIT
Fernando Diaz – Microsoft
Outline
• Motivation
• Handling crisis tweets
• Crowdsourced verification
• Ongoing work
  – Automatic classification
  – Resource matchmaking
Crisis Mapping
Hemant Purohit, Carlos Castillo, Patrick Meier and Amit Sheth: Crisis Mapping, Citizen Sensing and Social Media Analytics. Tutorial at ICWSM, May 2013.
I don't have time for social networks!
• We all have spare capacity
  – Television, TV series, Internet sites
• We overestimate ourselves in general
  – Don't underestimate social media users; it is a bad starting point
An earthquake hits a Twitter user
• When an earthquake strikes, the first tweets are posted 20-30 seconds later
• Damaging seismic waves travel at 3-5 km/s, while network communications travel at close to the speed of light over fiber/copper, plus latency
• After ~100km seismic waves may be overtaken by tweets about them
http://xkcd.com/723/
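The ~100 km figure can be checked with a back-of-the-envelope calculation. The sketch below assumes a signal speed of about 200,000 km/s in fiber (a round figure that is not from the slides) and treats the 20-30 second posting delay as the dominant cost; the function name and parameters are illustrative only.

```python
FIBER_SPEED_KM_S = 200_000.0  # assumed signal speed in fiber, roughly 2/3 of c


def crossover_distance_km(wave_speed_km_s, tweet_delay_s, latency_s=0.0):
    """Distance beyond which a tweet arrives before the seismic wave.

    The wave reaches distance d at time d / wave_speed_km_s.
    The tweet is posted tweet_delay_s after the quake and then needs
    latency_s + d / FIBER_SPEED_KM_S to arrive. Setting the two arrival
    times equal and solving for d gives the crossover distance.
    """
    total_delay = tweet_delay_s + latency_s
    return total_delay / (1.0 / wave_speed_km_s - 1.0 / FIBER_SPEED_KM_S)


# Slow wave with a fast tweet, and fast wave with a slow tweet.
for speed, delay in [(3.0, 20.0), (5.0, 30.0)]:
    d = crossover_distance_km(speed, delay)
    print(f"wave {speed} km/s, first tweet after {delay} s: crossover near {d:.0f} km")
```

With the ranges quoted on the slide, the crossover falls between roughly 60 km and 150 km, which is consistent with the ~100 km estimate. The network term barely matters: the posting delay dominates.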
Crisis Mappers Conference 2013: Next week!
Classifying and extracting information from tweets
Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier: Practical Extraction of Disaster-Relevant Information from Social Media. In SWDM. Rio de Janeiro, Brazil, 2013.
Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier: Extracting Information Nuggets from Disaster-Related Messages in Social Media. In ISCRAM. Baden-Baden, Germany, 2013. Best paper award.
Our approach
1. Filtering
2. Classification
3. Extraction
1. Filtering
Is disaster-related? (Yes / No)
Contributes to situational awareness? (Yes / No)
Labeling task
Classify the following tweet from Hurricane Sandy as:
● Personal: only of interest to author and immediate circle of friends
● Informative: interesting to other people
● Off-topic: not related to Hurricane Sandy
● Other/can't judge
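When several annotators label the same tweet, their answers have to be combined into one label. The sketch below uses a simple majority vote over the four options of the labeling task. The aggregation rule and the `min_agreement` threshold are assumptions for illustration; the slides do not say how labels were combined.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    PERSONAL = "personal"
    INFORMATIVE = "informative"
    OFF_TOPIC = "off-topic"
    OTHER = "other/can't judge"


def aggregate(labels, min_agreement=0.5):
    """Return the majority label for one tweet.

    Returns None when there are no labels or when the most frequent label
    does not exceed the required share of votes (hypothetical threshold).
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) > min_agreement else None


votes = [Label.INFORMATIVE, Label.INFORMATIVE, Label.PERSONAL]
print(aggregate(votes))
```

Tweets for which `aggregate` returns None could be sent to additional annotators instead of being forced into a class.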
Advice on labeling
• Your instructions will never be correct the first time you try
  – e.g. personal / eyewitness
  – Instructions must be re-written reactively
  – Perform small-scale labeling first
• Instructions must be concrete and brief
  – If you can't make them so, the task has to be divided
2. Classification
Filtered tweets →
● Caution & Advice
● Information Sources
● Damage & Casualties
● Donations
● Health
● Shelter
● Food
● Water
● Logistics
● ...
Distribution of tweet types
[Pie chart: Joplin Tornado (2011). Slice sizes: 50%, 18%, 16%, 10%, 6%. Legend: Caution/Advice, Info Source, Donations, Casualties/Damage, Unknown]
Classification results
Class AUC
Caution and advice 0.91
Information source 0.76
Donations 0.89
Casualties/damage 0.87
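AUC can be read as the probability that a randomly chosen positive tweet is scored above a randomly chosen negative one. The sketch below implements that pairwise definition in plain Python. The scores and labels are made-up toy values, not data from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve computed by pairwise comparison.

    labels are 1 (positive) or 0 (negative). A positive scored above a
    negative counts as a win; a tie counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy classifier scores for five tweets and whether each truly belongs to the class.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2]
toy_labels = [1, 1, 1, 0, 0]
print(auc(toy_scores, toy_labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, which gives a scale for reading the per-class figures in the table.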
3. Extraction
Classified tweets → ...
@JimFreund: Apparently we have no choice. There is a tornado watch in effect tonight.
Extraction
• #hashtags, @user mentions, URLs, etc.
  – Regular expressions
  – Text library from Twitter
• Temporal expressions
  – Part-of-speech tagger + heuristics
  – Natty library
• Supervised learning
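The regular-expression route for hashtags, mentions and URLs can be sketched in a few lines. The patterns below are deliberately simplified assumptions; they are not the rules implemented by Twitter's text library and will miss edge cases such as trailing punctuation on URLs.

```python
import re

# Simplified, illustrative patterns (not Twitter's official grammar).
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def extract_entities(tweet):
    """Return the URLs, hashtags and user mentions found in a tweet."""
    urls = URL.findall(tweet)
    # Remove URLs first so that fragments like ".../page#top" are not
    # mistaken for hashtags.
    text = URL.sub(" ", tweet)
    return {
        "urls": urls,
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
    }


example = ("RT @twc_hurricane: Wind gusts over 60 mph are being reported "
           "at Central Park and JFK airport in #NYC this hour. #Sandy")
print(extract_entities(example))
```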
Labels for extraction
• Type-dependent instruction
• Ask evaluators to copy-paste a word/phrase from each tweet
Learning: Conditional Random Fields
• Used extensively in NLP for part-of-speech tagging and information extraction
• Representation of observations is important (capitalization, position, etc.)
[Figure: graphical models of an HMM and a linear-chain CRF, showing hidden and observed variables]
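To make the point about representing observations concrete, the sketch below turns each token of a tweet into a feature dictionary of the kind a linear-chain CRF toolkit typically consumes. The feature names are hypothetical and chosen for illustration; they are not the features used by the authors or by any particular tool.

```python
def token_features(tokens, i):
    """Describe token i by its surface form, position and neighbours."""
    tok = tokens[i]
    last = len(tokens) - 1
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "is_hashtag": tok.startswith("#"),
        "is_mention": tok.startswith("@"),
        "position": i,
        "is_first": i == 0,
        "is_last": i == last,
        # Neighbouring words give the model local context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i < last else "<END>",
    }


def sentence_features(tokens):
    """One feature dictionary per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


for feats in sentence_features(["Tornado", "watch", "in", "effect", "tonight"]):
    print(feats)
```

Each dictionary is one observation in the sequence; the label sequence to be predicted (for example, which tokens belong to the extracted phrase) runs in parallel with it.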
Tool
• CMU ARK Twitter NLP
  – Tokenization
  – Feature extraction
  – CRF learning
• Very easy to use: simply replace the labels in the training set (part-of-speech tags) with any other label set, and re-train
Output examples
RT @weatherchannel: .@NYGovCuomo orders closing of NYC bridges. Only Staten Island bridges unaffected at this time. Bridges must close by 7pm. #Sandy #NYC
Wow what a mess #Sandy has made. Be sure to check on the elderly and homeless please! Thoughts and prayers to all affected
RT @twc_hurricane: Wind gusts over 60 mph are being reported at Central Park and JFK airport in #NYC this hour. #Sandy
RT @mitchellreports: Red Cross tells us grateful for Romney donation but prefer people send money or donate blood dont collect goods NOT best way to help #Sandy
Extractor evaluation
Setting Rec Prec
Train 2/3 Joplin, Test 1/3 Joplin 78% 90%
Train 2/3 Sandy, Test 1/3 Sandy 41% 79%
Train Joplin, Test Sandy 11% 78%
Train Joplin + 10% Sandy, Test 90% Sandy 21% 81%
• Precision counts an extraction as correct if it has one or more words in common with what humans extracted
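The lenient matching rule can be sketched as follows. The slide defines only precision this way; applying the same one-word-overlap rule to recall is an assumption made here for illustration, and the example pairs are invented.

```python
def shares_word(a, b):
    """True if the two strings have at least one word in common."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def lenient_precision_recall(pairs):
    """pairs: list of (human_text, system_text); "" means nothing extracted.

    Precision: share of system extractions that overlap the human one.
    Recall (assumed definition): share of human extractions that the
    system overlapped.
    """
    predicted = [(h, s) for h, s in pairs if s]
    expected = [(h, s) for h, s in pairs if h]
    hits_p = sum(1 for h, s in predicted if h and shares_word(h, s))
    hits_r = sum(1 for h, s in expected if s and shares_word(h, s))
    precision = hits_p / len(predicted) if predicted else 0.0
    recall = hits_r / len(expected) if expected else 0.0
    return precision, recall


toy_pairs = [
    ("tornado watch", "a tornado watch"),
    ("wind gusts", "gusts over 60 mph"),
    ("7pm", ""),
    ("blood", ""),
]
print(lenient_precision_recall(toy_pairs))
```

The toy data reproduces the pattern in the table: an extractor that stays silent when unsure keeps precision high while recall drops.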
Donations matching
• Identify and match requests/offers for donations
  – Money, clothing, food, shelter, volunteers, blood
Average precision = 0.21 (0.16 if only text similarity is used)
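A text-similarity baseline and its evaluation can be sketched as below. Jaccard word overlap stands in for whatever similarity measure was actually used, and average precision follows the standard information-retrieval definition; both choices are assumptions, and the requests and offers are invented.

```python
def jaccard(a, b):
    """Word-overlap similarity between two short texts, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_offers(request, offers):
    """Order offers from most to least similar to the request."""
    return sorted(offers, key=lambda offer: jaccard(request, offer), reverse=True)


def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


request = "need blankets and clothing"
offers = ["offering blood", "donating clothing and blankets", "free pizza"]
ranking = rank_offers(request, offers)
print(ranking)
print(average_precision(ranking, {"donating clothing and blankets"}))
```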
Crowdsourced stream processing systems
Muhammad Imran, Ioanna Lykourentzou and Carlos Castillo: Engineering Crowdsourced Stream Processing Systems (Submitted for publication)
Design objectives and principles
Design objective | Example metric | Design principle (automatic components) | Design principle (crowdsourced components)
Low latency | End-to-end time | Keep items moving | Trivial tasks
High throughput | Output items per unit of time | High-performance processing | Task automation
Load adaptability | Rate response function | Load shedding, load queueing | Task prioritization
Cost effectiveness | Cost vs. quality, throughput, etc. | N/A | Task frugality
High quality | Application-dependent | Redundancy, aggregation and quality control
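Load shedding and load queueing can be illustrated with a bounded buffer that drops the oldest item when it is full. This is a minimal sketch of one possible policy; the class name, the drop-oldest rule and the capacity are assumptions, not a description of the system in the paper.

```python
from collections import deque


class SheddingQueue:
    """Bounded FIFO queue that sheds the oldest item when full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0  # how many items were shed so far

    def push(self, item):
        if len(self.items) >= self.capacity:
            self.items.popleft()  # shed the oldest item to make room
            self.dropped += 1
        self.items.append(item)

    def pop(self):
        """Return the oldest queued item, or None if the queue is empty."""
        return self.items.popleft() if self.items else None


queue = SheddingQueue(capacity=3)
for tweet_id in range(5):
    queue.push(tweet_id)
print(queue.dropped, list(queue.items))
```

Dropping the oldest items favours fresh content when the input rate exceeds processing capacity, which fits the low-latency objective; other policies (dropping the newest, or sampling) trade that off differently.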
Design patterns
● QA loop
● Task assignment
● Process/verify
● Supervised learning
● Crowdwork sub-task chaining
● Humans are not a bottleneck
● Humans review every output element
Self-service for crisis-related classification
Unstructured text reports
Structured information
Report Classifier
Model Builder
Crowdsourced active learning
Library of training data
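In active learning, the system chooses which items to send to the crowd for labeling. The sketch below uses uncertainty sampling, which picks the items whose predicted probability is closest to 0.5. This is one common strategy, assumed here for illustration; the slides do not state which selection strategy is used.

```python
def select_for_labeling(scored_items, budget):
    """Pick the items the classifier is least sure about.

    scored_items: list of (item, probability of the positive class).
    budget: how many items can be sent to crowd workers.
    """
    by_uncertainty = sorted(scored_items, key=lambda pair: abs(pair[1] - 0.5))
    return [item for item, _ in by_uncertainty[:budget]]


# Hypothetical classifier outputs for four unlabeled tweets.
scored = [("tweet a", 0.95), ("tweet b", 0.52), ("tweet c", 0.10), ("tweet d", 0.45)]
print(select_for_labeling(scored, budget=2))
```

The newly labeled items would then be added to the training data and the model rebuilt, so that crowd effort goes where the classifier benefits most.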
Preliminary results: efficiency
Maximum documented input load during a natural disaster = 270 tweets/sec.
Preliminary results: effectiveness
Task: Informative vs. {Personal, Other}
Free software
• AIDR is free software
• The official launch date is November 20th during the Crisis Mappers conference in Nairobi, Kenya
Mobile applications
Fuming Shih, Oshani Seneviratne, Daniela Miao, Ilaria Liccardi, Lalana Kagal, Evan Patton, Patrick Meier, Carlos Castillo: Democratizing Mobile App Development for Disaster Management. To be presented at the IJCAI Workshop on Semantic Cities. Beijing, China, 2013.
Mobile components (AppInventor)
• Components useful for DIY emergency response apps
  – e.g. off-line tolerant photo uploads
• Aggregating/federating linked open data
Helping developers query linked data
Crowdsourced verification
Crowdsourced verification for crisis information
• Veri.ly
• Joint project between MASDAR and QCRI
• Iyad Rahwan, Abdulfatai Popoola, Dmytro Krasnoshtan, Attila Toth (MASDAR), Victor Naroditskiy (Univ. Southampton) + QCRI
Closing remarks
Computationally feasible
Supported by data
Useful
Good projects in this space
Temptation! Danger!
Poorly planned projects :-(
AI-complete problems
Some venues
• ISCRAM – International Conference on Information Systems for Crisis Response and Management
• SWDM – Workshop on Social Web for Disaster Management
• SMERST – Social Media and Semantic Technologies in Emergency Response
+ the usual suspects, depending on your area ;-)
Possibility of large impact by using computer science to support humanitarian work
= Applied computing at its best
Thank you!
Carlos Castillo · [email protected]
http://www.chato.cl/research/
With thanks to Patrick Meier for several slides