ON THE SPATIOTEMPORAL BURSTINESS OF TERMS
Theodoros Lappas (Boston University)
Marcos Vieira (IBM Research Lab - Brazil)
Dimitrios Gunopulos (University of Athens)
Vassilis Tsotras (UC Riverside)
Boston University Slideshow Title Goes Here
MOTIVATION
Thousands of new documents daily, recording real-life events
(online news sites, blogs, etc.)
Burstiness
Temporal
Spatiotemporal
Spatial
The deviation of the observed frequency from the expected one
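As a rough illustration of this definition for the purely temporal case, the sketch below scores each timestamp by how far the observed frequency of a term deviates from an expected frequency, then picks the contiguous interval with the highest cumulative deviation. Using the mean over the whole timeline as the expected frequency, and the function names themselves, are assumptions made for the example and are not taken from the slides.

```python
def deviations(freqs):
    """Observed minus expected frequency per timestamp.

    Assumption: the expected frequency is the mean over the whole timeline.
    """
    expected = sum(freqs) / len(freqs)
    return [f - expected for f in freqs]


def burstiest_interval(freqs):
    """Return (start, end, score) of the contiguous interval (inclusive
    indices) with the highest cumulative deviation (maximum-sum segment)."""
    dev = deviations(freqs)
    best = (0, 0, dev[0])
    cur_start, cur_sum = 0, 0.0
    for i, d in enumerate(dev):
        if cur_sum <= 0:
            # A non-positive running sum cannot help: restart here.
            cur_start, cur_sum = i, d
        else:
            cur_sum += d
        if cur_sum > best[2]:
            best = (cur_start, i, cur_sum)
    return best


if __name__ == "__main__":
    # Daily frequencies of one term in one stream (made-up numbers).
    print(burstiest_interval([1, 1, 6, 7, 1, 2]))
```

The input is expected to be a non-empty list of frequencies for one term in one stream.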
Applications
Event Detection
Trend Identification
Document Search*
SPATIOTEMPORAL BURSTINESS
Spatiotemporal Collection:
• Streams from different locations (cities, countries, etc).
• Record real life events in text.
During an event’s time in the spotlight, its characteristic terms exhibit atypically high frequencies in the affected locations.
Large Document Corpora
Streaming Data
Formalization &
Identification
Boston University Slideshow Title Goes Here
SPATIOTEMPORAL PATTERNS
Each captures a different type of bursty behavior
Group streams that simultaneously reported bursts for the same term during the same timeframe
Combinatorial Patterns
• Ignore geographical proximity among streams
• A combination of a temporal interval and a set of streams from arbitrary locations
• Encodes unusually high frequencies simultaneously observed for a term t in all the streams in some set C, during the same temporal interval I.
Regional Patterns
• Consider the geographical proximity among document streams.
• Defined as a combination of a temporal interval and a geographical region.
• Encode that unusually high frequencies were observed for term t in geographical region R during a temporal interval I.
COMBINATORIAL PATTERNS
• Process each stream separately
• Get temporal bursty intervals for the given term.
• An interval is defined by its burstiness and its endpoints
Create the interval graph of the entire interval collection
Maximum weight clique (MWC) → interval with the highest cumulative burstiness
The MWC problem is solvable in polynomial time for interval graphs!
Remove MWC nodes and re-apply to get the 2nd highest-scoring clique, etc.
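The steps on this slide can be sketched as follows. The sketch relies on a standard property of interval graphs: every clique shares a common point, so the maximum weight clique can be found by checking the intervals that cover each interval start. The `Interval` type, the function names, the assumption of positive burstiness scores, and reporting the intersection of the clique's intervals as the pattern timeframe are all choices made for this example, not details given in the slides.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    stream: str    # stream (location) that reported the burst
    start: int     # first timestamp of the bursty interval
    end: int       # last timestamp (inclusive)
    score: float   # burstiness of the interval (assumed positive)


def max_weight_clique(intervals):
    """Return (clique, weight): the set of mutually overlapping intervals
    with the highest total score. Some interval start always lies inside
    the best clique, so only start points need to be checked."""
    best, best_w = [], 0.0
    for s in sorted({iv.start for iv in intervals}):
        clique = [iv for iv in intervals if iv.start <= s <= iv.end]
        w = sum(iv.score for iv in clique)
        if w > best_w:
            best, best_w = clique, w
    return best, best_w


def top_patterns(intervals, k):
    """Repeatedly extract the MWC, remove its nodes, and re-apply."""
    remaining = list(intervals)
    patterns = []
    for _ in range(k):
        clique, w = max_weight_clique(remaining)
        if not clique:
            break
        # Assumption: the pattern timeframe is the common overlap.
        start = max(iv.start for iv in clique)
        end = min(iv.end for iv in clique)
        patterns.append((sorted(iv.stream for iv in clique), (start, end), w))
        remaining = [iv for iv in remaining if iv not in clique]
    return patterns


if __name__ == "__main__":
    bursts = [
        Interval("A", 1, 5, 2.0),
        Interval("B", 3, 7, 3.0),
        Interval("C", 4, 6, 1.5),
        Interval("D", 10, 12, 4.0),
    ]
    for pattern in top_patterns(bursts, 3):
        print(pattern)
```

This quadratic scan is kept for readability; a sweep over sorted endpoints would do the same work faster.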
REGIONAL PATTERNS
Get a new snapshot on every new timestamp
Identify the set of Bursty Rectangles within the snapshot (R-Bursty alg, using bichromatic discrepancy)
Aggregate consecutive rectangle-sets as new data arrives from the stream (STLocal alg for finding maximal windows)
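To give a feel for what a bursty rectangle in a single snapshot looks like, the sketch below finds the axis-aligned rectangle with the highest total burstiness on a grid of cells. This is not the R-Bursty algorithm named on the slide: it swaps in a plain maximum-sum-subrectangle search, and it assumes the map has been discretized into a grid whose cells hold observed-minus-expected frequencies. The grid layout and the function name are hypothetical.

```python
def max_bursty_rectangle(grid):
    """Return (score, (top, left, bottom, right)) for the axis-aligned
    rectangle of cells with the highest total burstiness.

    grid[r][c] is assumed to hold observed minus expected frequency for
    the cell at row r, column c (positive means bursty).
    """
    rows, cols = len(grid), len(grid[0])
    best_score, best_rect = float("-inf"), None
    for top in range(rows):
        col_sums = [0.0] * cols
        for bottom in range(top, rows):
            # Collapse rows top..bottom into one row of column sums.
            for c in range(cols):
                col_sums[c] += grid[bottom][c]
            # Maximum-sum contiguous run over the collapsed row.
            cur, left = 0.0, 0
            for c, v in enumerate(col_sums):
                if cur <= 0:
                    cur, left = v, c
                else:
                    cur += v
                if cur > best_score:
                    best_score, best_rect = cur, (top, left, bottom, c)
    return best_score, best_rect


if __name__ == "__main__":
    snapshot = [
        [-1, -1, -1, -1],
        [-1,  3,  2, -1],
        [-1,  1,  4, -1],
        [-1, -1, -1, -1],
    ]
    print(max_bursty_rectangle(snapshot))
```

A streaming version would re-run this on every new snapshot and then merge rectangles across consecutive timestamps, which is the part the slide attributes to STLocal.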
DOCUMENT SEARCH
Pt,d: the set of patterns of term t that overlap with the timestamp of document d
f(Pt,d): representative score (e.g. min, max, median, avg)
• Standard IR techniques focus on relevance to the given query of terms
• We enhance the search process by considering spatiotemporal burstiness
• Retrieve documents that are relevant to the query and also discuss events with a high spatiotemporal impact
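A minimal sketch of the two quantities defined on this slide, Pt,d and f(Pt,d), is given below. Patterns are represented as hypothetical `(start, end, score)` tuples for a single term, and returning 0.0 when no pattern overlaps the document is an assumption; the slides do not say how the score is combined with query relevance, so that step is left out.

```python
from statistics import mean, median

# Candidate choices for the representative score f.
AGGREGATORS = {"min": min, "max": max, "median": median, "avg": mean}


def overlapping_patterns(patterns, timestamp):
    """Pt,d: patterns of the term whose interval contains the document's
    timestamp. Each pattern is a (start, end, score) tuple."""
    return [p for p in patterns if p[0] <= timestamp <= p[1]]


def burstiness_score(patterns, timestamp, how="max"):
    """f(Pt,d): one representative score for the overlapping patterns.

    Assumption: a document with no overlapping pattern scores 0.0.
    """
    scores = [p[2] for p in overlapping_patterns(patterns, timestamp)]
    return AGGREGATORS[how](scores) if scores else 0.0


if __name__ == "__main__":
    patterns = [(1, 5, 2.0), (4, 9, 6.0), (10, 12, 4.0)]
    for how in AGGREGATORS:
        print(how, burstiness_score(patterns, timestamp=4, how=how))
```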
DATASETS
Topix Dataset
• 181 streams (countries), 305,641 articles, Sep/08 – Jul/09
Major Events List
• List of influential events from Wikipedia
• 3 types of events: Global impact, Extended impact, Localized impact
• 1 query for each event (chosen by human annotator)
Artificial Data
• Two different data generators: RandGen, DistGen
PATTERN SIZE
• Report #streams in top regional and top combinatorial pattern (and #streams in the MBR)
• STLocal: smaller patterns, focused around the event’s source
• STComb: streams from arbitrary locations, spanning very large regions
PATTERN RETRIEVAL (ARTIFICIAL)
Jaccard Sim: Jaccard between predicted and actually affected stream-sets
Start-Error, End-Error: Absolute difference between predicted and actual endpoints of the pattern
Report average over 1000 artificial patterns
Both approaches competitive for both types of patterns
Each approach better fit for one type
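The three evaluation measures on this slide are simple enough to state in code. The sketch below is one straightforward reading of them; the function names, the tuple representation of intervals, and scoring two empty stream-sets as 1.0 are assumptions made for the example.

```python
def jaccard(predicted, actual):
    """Jaccard similarity between predicted and actually affected stream-sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    # Assumption: two empty sets count as a perfect match.
    return len(predicted & actual) / len(union) if union else 1.0


def endpoint_errors(predicted, actual):
    """(Start-Error, End-Error): absolute differences between the predicted
    and actual endpoints, each interval given as a (start, end) tuple."""
    return abs(predicted[0] - actual[0]), abs(predicted[1] - actual[1])


if __name__ == "__main__":
    print(jaccard({"A", "B", "C"}, {"B", "C", "D"}))
    print(endpoint_errors((4, 9), (5, 12)))
```

Averaging these values over all generated patterns gives the numbers the slide reports.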
COMPUTATIONAL TIME (TOPIX)
• Emulate the streaming scenario, report the running time per timestamp (average over all terms)
• STLocal is unaffected: customized for streaming data
• STComb slower: repeats the MWC computation for every timestamp
• Both approaches competitive, a few ms are enough
DOCUMENT SEARCH
• Use STLocal, STComb, TB (simple temporal burstiness) to retrieve the top-10 docs for each event in the Major Events List
• Ask human annotator to tag each doc as relevant/non-relevant, report precision
• Avg % of common docs in the top-k: STComb-TB: 0.61, STComb-STLocal: 0.58, TB-STLocal: 0.67 → Complementarity: each approach captures a different facet of burstiness
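The two measurements used in this comparison, precision over the top-k results and the share of documents two methods have in common in their top-k, can be sketched as below. The function names and the list-of-document-ids representation are hypothetical, and dividing the overlap by k is one plausible reading of "common docs in the top-k".

```python
def precision_at_k(ranked_docs, relevant, k=10):
    """Fraction of the top-k retrieved docs that the annotator tagged relevant."""
    top = ranked_docs[:k]
    return sum(1 for d in top if d in relevant) / len(top) if top else 0.0


def topk_overlap(ranking_a, ranking_b, k=10):
    """Fraction of documents shared by the top-k of two rankings
    (assumption: normalized by k)."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


if __name__ == "__main__":
    stcomb = ["d1", "d2", "d3", "d4"]
    stlocal = ["d3", "d4", "d5", "d6"]
    print(precision_at_k(stcomb, relevant={"d1", "d3"}, k=4))
    print(topk_overlap(stcomb, stlocal, k=4))
```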
FUTURE WORK
Improve STLocal to handle geographical regions of arbitrary size (now only rectangular)
Improve STComb to handle streaming data (online maximum weight clique computation)
Compare patterns extracted from the same region under different granularities (e.g. individuals → neighborhoods → cities → countries)
Visualization
ACKNOWLEDGEMENTS
CAPES (Brazilian Federal Agency for Post-Graduate Education)
NSF (IIS:0910859, IIS:1144158)
MODAP EU Project
DISFER GGET Project
8/28/2012
SCALABILITY Vs. #STREAMS (ARTIFICIAL)
Both approaches scale almost linearly
STLocal consistently faster