leadline: interactive visual analysis of text data …wdou1/publications/2012/...leadline:...

10
LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen Dou *1 , Xiaoyu Wang 1 , Drew Skau 1 , William Ribarsky 1 , and Michelle X. Zhou 2 1 University of North Carolina at Charlotte 2 IBM Almaden Research Center Figure 1: Overview of Leadline. Top right: people and entities related to President Obama (selected) are shown in the graph. Bottom right: locations mentioned in news articles related to the president. Left view: highlighted bursts indicate events that are related to President Obama. ABSTRACT Text data such as online news and microblogs bear valuable insights regarding important events and responses to such events. Events are inherently temporal, evolving over time. Existing visual text anal- ysis systems have provided temporal views of changes based on topical themes extracted from text data. But few have associated topical themes with events that cause the changes. In this paper, we propose an interactive visual analytics system, LeadLine, to au- tomatically identify meaningful events in news and social media data and support exploration of the events. To characterize events, LeadLine integrates topic modeling, event detection, and named en- tity recognition techniques to automatically extract information re- garding the investigative 4 Ws: who, what, when, and where for each event. To further support analysis of the text corpora through events, LeadLine allows users to interactively examine meaning- ful events using the 4 Ws to develop an understanding of how and why. Through representing large-scale text corpora in the form of meaningful events, LeadLine provides a concise summary of the * [email protected] corpora. LeadLine also supports the construction of simple narra- tives through the exploration of events. To demonstrate the efficacy of LeadLine in identifying events and supporting exploration, two case studies were conducted using news and social media data. 1 I NTRODUCTION Text data such as online news stories and microblog messages contain rich real-time information about worldwide events and so- cial phenomena. In particular, news stories report ongoing devel- opment of events; microblogs capture people’s comments and reac- tions to these events especially from a social aspect. Overarching patterns are lost in the here and now of the constant feed. Valu- able information regarding major social and news events is hidden by the details. Therefore, methods to distill text data into not only meaningful overarching topics, but more importantly into trigger- ing events are of great help to assemble the details into summarized information. While summarizing large text corpora based on topical themes has received much attention, few have approached the problem from an event-driven perspective. Much work in the visualization community has been devoted to summarizing text data through rep- resenting topic evolution over time. Similar to these visual text analysis systems with a temporal focus such as ThemeRiver [22], TIARA [41, 38] and ParallelTopics [16], our approach organizes

Upload: others

Post on 08-Jun-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

LeadLine: Interactive Visual Analysis of Text Data through EventIdentification and Exploration

Wenwen Dou∗1, Xiaoyu Wang1, Drew Skau1, William Ribarsky1, and Michelle X. Zhou2

1University of North Carolina at Charlotte2IBM Almaden Research Center

Figure 1: Overview of Leadline. Top right: people and entities related to President Obama (selected) are shown in the graph. Bottom right:locations mentioned in news articles related to the president. Left view: highlighted bursts indicate events that are related to President Obama.

ABSTRACT

Text data such as online news and microblogs bear valuable insightsregarding important events and responses to such events. Events areinherently temporal, evolving over time. Existing visual text anal-ysis systems have provided temporal views of changes based ontopical themes extracted from text data. But few have associatedtopical themes with events that cause the changes. In this paper,we propose an interactive visual analytics system, LeadLine, to au-tomatically identify meaningful events in news and social mediadata and support exploration of the events. To characterize events,LeadLine integrates topic modeling, event detection, and named en-tity recognition techniques to automatically extract information re-garding the investigative 4 Ws: who, what, when, and where foreach event. To further support analysis of the text corpora throughevents, LeadLine allows users to interactively examine meaning-ful events using the 4 Ws to develop an understanding of how andwhy. Through representing large-scale text corpora in the form ofmeaningful events, LeadLine provides a concise summary of the

[email protected]

corpora. LeadLine also supports the construction of simple narra-tives through the exploration of events. To demonstrate the efficacyof LeadLine in identifying events and supporting exploration, twocase studies were conducted using news and social media data.

1 INTRODUCTION

Text data such as online news stories and microblog messagescontain rich real-time information about worldwide events and so-cial phenomena. In particular, news stories report ongoing devel-opment of events; microblogs capture people’s comments and reac-tions to these events especially from a social aspect. Overarchingpatterns are lost in the here and now of the constant feed. Valu-able information regarding major social and news events is hiddenby the details. Therefore, methods to distill text data into not onlymeaningful overarching topics, but more importantly into trigger-ing events are of great help to assemble the details into summarizedinformation.

While summarizing large text corpora based on topical themeshas received much attention, few have approached the problemfrom an event-driven perspective. Much work in the visualizationcommunity has been devoted to summarizing text data through rep-resenting topic evolution over time. Similar to these visual textanalysis systems with a temporal focus such as ThemeRiver [22],TIARA [41, 38] and ParallelTopics [16], our approach organizes

Page 2: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

text corpora based on meaningful overarching topics. Yet unlikethese tools, our focus is not visually presenting topical trends overtime, but rather revealing indicators of events that trigger the majorchanges in temporal trends.1.1 Formulating EventsSeveral questions are critical to identifying events from text cor-pora. How does one identify meaningful events given a text collec-tion? What are the attributes that characterize an event? How doesone automatically and systematically discover attributes that char-acterize an event from text collections? To address these questions,we first determine what comprises an event:

Merriam-Webster provides a general definition of an event as“a noteworthy happening and a social occasion or activity” [17].In the Topic Detection and Tracking (TDT) community and eventdetection related research [8, 28], an event is defined based on itsattributes as “something that has a specific topic, time, and locationassociated with it”. From a storytelling standpoint, McKee refers toa story event as something that “creates meaningful change in thelife situation of a character” [30].

Combining these definitions with our perspective on analyzingtext corpora, we define an event as:

“An occurrence causing change in the volume of text data thatdiscusses the associated topic at a specific time. This occurrence ischaracterized by topic and time, and often associated with entitiessuch as people and location.”

For simplicity, we refer to < Topic, Time, People, Location > asfour attributes of an event. These four attributes address commonquestions in investigative analysis: When did an event start and end?What is the event about? Who was involved? And finally where didthe event occur?

1.2 Introducing LeadLineGiven our notion of events, we identify techniques and modelsthat allow us to use computational methods to automatically ex-tract events from text corpora. To discover events, we extract infor-mation regarding who, what, when and where through integratingtopic modeling, event detection, and named entity recognition tech-niques. More specifically, we first organize text data such as newsstories and microblog messages based on topical themes using La-tent Dirichlet Allocation (LDA) [11]. This step provides topical in-formation for events. To identify the temporal scale for events, weapplied an Early Event Detection algorithm to automatically deter-mine the length and “bursty-ness” of events. This step provides abeginning and an end to each event in time. To discover people andlocations associated with each event, we first perform named en-tity recognition on the text corpora and associate the entities withevents. With all four attributes explicitly modeled, our approachsupports identification and exploration of events at the topical, tem-poral, and entity level.

To effectively communicate the event identification results, wedeveloped a visual interface. The interface allows users to interac-tively explore events and, more importantly, to steer the event iden-tification process to adjust the granularity of the detected events. Inaddition, organizing text corpora based on events provides basicsfor building narratives. This is a constructive format that describesa sequence of events [17] [30]. We extended LeadLine with the ca-pability to examine narratives, which allows users to easily accessand revisit their explorative findings.

Our approach provides an event-driven summarization of textdata with exploratory capability for investigating the events basedon who, what, when, and where. Specifically, our approach presentsthree contributions:

• A general process that couples topic modeling, named entityrecognition, and early event detection techniques to identifymeaningful events from text corpora.

• An interactive visual interface for exploring events based onthe aspects of who, what, when, where, and further allowinginteractive adjustment of event granularities.

• A narrative examination interface that allows users to reportand revisit their findings.

2 RELATED WORK

Three areas of research, namely event detection, topic analysis, andtext visualization techniques, are the main inspiration for the de-sign of LeadLine. Another thread of research on event structure incognition provides background for using events as a summary.

2.1 Event Structure in Perception and CognitionAs Zacks and Tversky noted, the world presents nothing but conti-nuity and flux, yet our mind has a gift to perceive activity as con-sisting of discrete events that have some orderly relations [43]. Anevent is a segment of time at a given location that is perceived byan observer to have a beginning and an end. People make senseof continuous streams of observed behavior in part by segmentingthem into events.

People seem to segment observed physical activities into eventseffortlessly and simultaneously at multiple timescales [25]. How-ever, little research indicates that the same skill applies to abstractcontinuous streams, such as topical streams derived from text cor-pora. Since an event is considered the unit for making sense ofcontinuous activities, we argue that accurately identified and con-veyed events may serve as a more natural representation for makingsense of the activities.

2.2 Event DetectionThe Biosurveillance community has long been investigating waysto improve clinical preparedness for bioterrorism [23]. Earlysurveillance tasks required continuous monitoring of massive quan-tities of multivariate data in order to identify emerging patterns [32].Considering a particular type of health data, such as over-the-counter (OTC) medication sales, as a source for detecting eventsindicating disease outbreaks, Goldenberg et al. described a statis-tical system designed for timely detection of anthrax epidemics.This approach falls into the category of univariate methods whichfocus on detecting events from time series [20]. As a more generalapproach, Guralnik et al. [21] presented algorithms to dynamicallydetermine the change points in time series data without prior knowl-edge of the temporal distributions.

Other disease surveillance systems take into account both tem-poral and spatial information. The system described by Sabhnaniet al. [35] monitors daily data feeds from over 20,000 hospitalsand pharmacies nationwide, including emergency department vis-its and OTC drug sales, to identify early events. Specifically, anexpectation-based scan statistic approach is proposed to search forspace-time regions for disease outbreaks. As an extension, Neilat al. further developed a “multivariate Bayesian scan statistic”(MBSS) [32] method for faster and more accurate event detection.

As efficient as the proposed event detection algorithms are, theylack the ability to handle text corpora, which may contain rich in-formation about when the symptoms emerge and how they evolveover time. In this paper, our formulation allows us to transform tex-tual data into multiple meaningful time series so that we can applyideas from the Biosurveillance community for early event detectionon text corpora.

More recently, researchers have shown interests in performingevent detection on Twitter data. Petrovic et al. [34] presented amethod to detect new events from a stream of tweets. The proposedalgorithm, which is based on locality-sensitive hashing, enablesfirst story detection on streaming data. However, the significanceof the resulting events is not measured in the proposed method.

Page 3: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

Figure 2: System Architecture of LeadLine.

To detect critical natural disaster events such as earthquakes in atimely fashion, Sakaki et al. [36] considered the event detection asa classification problem. One presumption of the approach is thatusers have to know what events to look for. In addition, to detecta new event, a new classifier needs to be built based on labeledexamples of the new event. Another thread of event detection uti-lizes signal-processing techniques such as Wavelet transformation[42] to analyze word-specific signals in the frequency domain. Al-though the results are compelling based on manual examination, thedetected events are describe using a few keywords (2 to 3) withoutany notion of “topics”, therefore the individual event is difficult tointerpret because of the lack of contextual information.

2.3 Event Extraction from News SourcesTopic Detection and Tracking (TDT) is a body of research andan evaluation paradigm that addresses event-based organization ofbroadcast news [8].The motivation for research in TDT is to ad-dress analysts’ need to monitor new events from the high volumeof broadcast news. The TDT has conducted several full-scale eval-uations with research groups around the world by creating newscorpora with the ground truth of events and topics. In his book thatsummarizes the TDT research program from 1997 to 2002, JamesAllan noted that much of the research on TDT has been tuningparameters to make certain information retrieval approaches moreappropriate for events [8]. However, since an event is consideredas something that happens at a particular time and location, Allanpointed out that existing approaches lack explicit modeling of tem-poral and geospatial dimension of the news streams. One exceptionis that Feng et al. [18] proposed incident threading as a news or-ganizational infrastructure. The incident threading which takes intoconsideration of where, when and who is implemented in two algo-rithms, with one aching higher accuracy in identifying incidents andthe other generating better links between incidents. Our approachalso takes into account temporal aspects of events, as well as ex-plicitly extracts named entities including people and location, andfurther allows end users to interactively make use of these aspectsfor inferring relationship between events.

2.4 Visualization of Events and Temporal TopicalTrends

Multiple visualizations have been developed to visualize eventsfrom news sources. For example, in the time-series visualizationCloudLines, Krstajic et al. proposed an incremental visualizationtechnique that allows detection of visual clusters in a compressedview of multiple time series [24]. Luo et al. described a visual ana-lytics approach which presents events in a river-like metaphor based

on event-based text analysis [26]. To explore events reflected in mi-croblogs, Marcus et al. presented TwitInfo [29], which is a systemfor summarizing events on Twitter. As opposed to a summariza-tion approach, TwitInfo requires users to specify a Twitter keywordquery to start exploring events related to the query. Different fromthe above systems, our approach provides investigative informationregarding not only what (the events are about) and when, but alsowho is involved and where the events possibly happened.

More recently, Cui et al. [15] presented TextFlow for analyzingtopical evolution patterns. In particular, TextFlow identifies criticalevents as topic birth/disappearance and topic merging/splitting. Ourapproach differs from their perspective of annotating topic evolu-tion with critical events by centering on the identification of events,with topics as one aspect to characterize events.

Since topic and time are two essential attributes that character-ize an event, previous work on providing temporal summaries oftopics provided inspiration for our approach. There have been mul-tiple examples of presenting topics along the time dimension. Weiet al. [41] introduced TIARA, a time-based interactive visualiza-tion system that presents topical themes in a time-sensitive manner.Similarly, in ParallelTopics [16], temporal evolvement of topicalthemes are also presented in a ThemeRiver visualization [22]. Inour approach, we use a topic-based summarization method (topicmodels) to organize events over time. In addition, we allow usersto interactively explore events based on who and where, as well asgenerate a narrative as a results of the exploration.

3 ANALYTICS ARCHITECTURE FOR EVENTS CHARACTERI-ZATION

To extract information regarding events from textcorpora, we integrate several techniques to identify< Topic, Time, People, Location > for each event. To ex-tract meaningful topics and timespan for events, we leverage topicmodels for topical themes and an Early Event Detection methodto identify a start and an end for each event (section). To extractinformation regarding who (people) and where (location), weperform named entity recognition and further analyze relationshipsbetween extracted entities (Section 6).

As shown in Figure 2, we categorize the identification of topicthemes and timespan as topic-based analytics, in which we first ex-tract topics from the input textual collection using Latent DirichletAllocation (LDA). We then apply a 1) topical-level event detectionalgorithm to automatically identify “bursts” as indicators of eventslabeled by a timespan; 2) topic ranking to facilitate the discoveryof event relationships by placing bursts with similar topics nearby;and finally 3) time-sensitive keyword extraction that provides infor-

Page 4: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

mation regarding an event with a set of succinct keywords.Complementing the topic-based analytics, our architecture also

focuses on entity-based analytics by applying named entity recog-nition technique to identify people and location related with eachevent. Specifically, this process is centered on extracting key enti-ties from the text data regarding who and where. More importantly,this process also reveals the relationship between the entities, lead-ing to further characterization and associations of events.

As illustrated in Figure 2, visual representations are designedand tailored to both stages of topic-based and entity-based analyt-ics processes. The interactive visual interface is an integral part ofboth analytics processes to bridge users and the complex analyticsresults. With the interactive visualization interface, LeadLine sup-ports interactive exploration of events from various aspects (who,what, when, where), as well as allows users to interact with theunderlying analytics algorithms to partially steer the process of de-tecting events from text streams.

In the following sections, we will detail the algorithms and pro-cesses used in both topic and entity analysis, alongside the visualdesigns that are tailored for identification and analysis of the events.

4 DATA ACQUISITION AND PREPARATION

To demonstrate the generalizability of our algorithms and their ap-plicable domains, we have applied our analytics architecture on twotypes of text data: CNN news and microblogs from Twitter. Whileboth sources contain rich information reflecting major real-worldevents, the primary reason for selecting the two data sources is be-cause of their different editorial styles and the latency for respond-ing to an event. In particular, content from news media (CNN andothers) are edited by professional journalists, usually centered ona specific topic with some background. Individual tweets, on thecontrary, contain mostly the unedited commentaries without muchcontextual information [9]. These different text sources providevarious benchmarks to help us evaluate our analytic architecture.

While both sources are in the public domain, there are nodatasets readily available or that can be shared due to privacy poli-cies. Therefore, we have extended our existing scalable data ar-chitecture to acquire news and tweets using customized crawlingstrategies. The details of our data architecture can be reviewed inprevious work [40].

News Data Acquisition: In particular, our current work ex-tended on the previous architecture by adding the news articlecrawler. Both historical news as well as up-to-the-hour news areacquired by our customized webpage crawlers and our RSS dae-mons, respectively. Both methods employ universal programs thatattempt to crawl an entire web domain, download all the webpages,extract all textual articles, parse article time information, and fi-nally normalize articles by removing HTML formatting and noise(e.g. advertisement and external website links). More specifically,our news crawler is built with python, and Apache Nutch [5] im-plementation. The data is stored into our HBase data structure forfast access and MapReduce [4] based data cleaning and process-ing. Using these crawlers, we could retrieve and clean news articlesdated back to 2004 (∼102573 articles) within several hours of thecrawling process.

Twitter Data Crawling: Microblogs from Twitter are also col-lected from dual crawling processes. The primary process uti-lizes our MapReduce parallel data crawler, which is interfaced withthe Internet through multiple independent crawlers. Each of thecrawlers constantly collects social media data from various publicdomains and dumps it into HBase. Specifically, we have createdcrawlers to connect to Twitter’s public “Garden-hose” API to col-lect 1% of tweets, which is a statistically significant sample of allcontent from Twitter [6]. In addition, we have also experimentedwith an Online Social Network (OSN) graph-based crawling mech-anism to extend our practice using Apache Nutch. Inspired by

Catanese et. al [13], the concept for this crawling process is ac-customed to the nature of OSN which can be represented as graphs,with nodes representing users and edges denoting connections. Weperform a breadth-first search using Nutch to acquire Twitter publicuser-graphs and capture the tweets through their webpage portal forbroader streams. As a result, we were able to collect over 5 billiontweets from all languages over the course of 3 months, providing areliable database for evaluation purposes.

5 TOPIC-BASED EVENT ANALYSIS AND VISUALIZATION

Topic-based analytics is crucial for event characterization in termsof revealing topics and time. In this section, we introduce algo-rithms to extract topical and temporal information with regard to anevent, as well as visual representations that communicate the topi-cal and temporal aspects.

5.1 Extracting Topics from Text DataWe start by organizing text streams based on topics. Given atext corpus, there are multiple ways to extract meaningful topicalthemes. Among them, probabilistic topic models [10] are con-sidered to be advantageous compared to traditional vector-basedtext processing techniques. In LeadLine, we first employ the mostwidely used topic model, LDA [11], to extract semantically mean-ingful topics from text corpora. LDA generates a set of latent topics,with each topic as a multinomial distribution over keywords, andassumes each document can be described as a probabilistic mixtureof these topics [11].

5.1.1 Topic StreamsIn addition to topics, another important attribute of an event is time.In order to highlight the temporal dimension, we organize the ex-tracted topics along the temporal axis. Considering each topic as astream that evolves over time, the calculation of each topic streamis performed by computing a spline based on the volume of textualinformation associated with the topic in each time frame. A timeframe is a unit that the texts are temporally aggregated upon. Thetime frame unit varies by datasets and tasks, ranging from minutesfor social media data to days for news stories.

5.1.2 Visualization of Topic StreamsTo visually represent topical themes over time, we adopt a flow-like visual metaphor in which each stream (row) represents a topicand the height of each topic changes as the amount of textual infor-mation related to the topic varies (Figure 1 left) We represent eachstream in a separate row for the later discovery of events withineach topic stream.

However, with each topic stream represented separately, the vi-sualization does not capture the overall trend of how all topicsevolve over time. In order to provide the overall context, a The-meRiver representation is placed in the background of the visu-alization. Therefore repetitive patterns exhibited by a text stream(such as a weekly pattern for news stories) are still portrayed.

5.2 Automatically Detecting Events in Topical StreamsA key technical contribution of this paper is automatically identi-fying temporal peaks in topics as indicators of events. To detectevents from topic streams, we consider each stream as a time se-ries. Each time series is computed by aggregating texts related tothe topic within every timespan.

To automatically discover events from these topical time series,we apply an early event detection (EED) method to look for burstsin topic streams. More specifically, we adopt the cumulative sumcontrol chart (CUSUM) widely used for change detection [31].CUSUM is good at detecting shifts from the mean in a time seriesby keeping a running sum of “surprises”. We implemented our ver-sion of CUSUM for detecting changes in topical volume. For each

Page 5: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

Algorithm 1: CUSUMInput: topic time series X collected at time i = 1, ...kSteps:

1. Calculate the mean µ and standard deviation σ of the timeseries;2. Calculate running sum S from the starting time frameS1 = max[0, x1 - µ]Si = max[0, Si−1 + xi - µ].3. When Sk exceeds a threshold H (in units of σ ), event alarmtriggers. The beginning and end of the event are determined bythe nearest positive Si to the triggering point.

topic stream, the algorithm keeps a cumulative sum of topic volumeat each timespan that are higher than the mean topic volume. Oncethe cumulative sum exceeds a threshold, the event alarm triggers.Our implementation of CUSUM is shown in Algorithm 1.

The result is a set of automatically detected events within alltopic streams, with each event labeled by a start and an end alongthe time dimension. Our approach of detecting events on time seriesis generalizable. It could be used as an extension on any time-seriesvisualization to highlight changes as indicators of events.

5.2.1 Visualizing Detected EventsTo visually represent the detected events, we use a combinationof contours and highlights to indicate events within each topicalstream. As shown in Figure 1 left, we provide an overview of eventsby drawing contours in topical streams to signal events. The timespan of the contours is determined by the event detection results.

In addition to using contours to provide an overview of all events,LeadLine supports highlighting events of interest through user in-teraction. For example, a user could interactively highlight eventsrelated to a specific named entity such as a person or location (theextraction of entities is introduced in Section 6). To provide detailson demand, LeadLine allows users to access documents (news ormicroblog messages) that discuss a highlighted event by clickingon the event.

5.2.2 Interactive Event Detection

Figure 3: Comparison of different granularities of events. K indicatesdetecting changes K*standard deviation above the mean.

Although one advantage of our event detection algorithm is toautomatically discover events that trigger topical bursts, we stillwant to provide end users the ability to specify how big a changewarrants an event. By adjusting the scale of the “change”, userswill be able to view either finer or coarser-grained events.

In step 3 of Algorithm 1 , the event alarm triggers when thedifference between the mean and running sum S is greater than athreshold. The threshold is usually measured by a fixed number ofstandard deviations (K) [33]. In LeadLine, we allow users to inter-actively adjust K between one and four standard deviations abovethe mean of one topic stream. When K is smaller, the algorithm isable to detect smaller shifts from the mean, leading to more eventsthat cause less drastic change to be discovered. On the other end,

when K is larger, the algorithm is only able to detect big shifts fromthe mean so only larger “bursts” are detected. Figure 3 shows threeevent detection results with varying thresholds. Once a user adjuststhe K value using the slider in the interface, LeadLine re-runs theevent detection algorithm to produce new event results. Throughallowing users to interactively manipulate the fineness of events,LeadLine provides overviews with more or less details for giventext corpora.

5.3 Topic RankingWhen placing the topics in our visualization, we want similar top-ics to be close so that the events later derived based on topics arealso posited in proximity. Since LDA does not explicitly model therelationship between topics, we rank the topics based on their sim-ilarity measured by Hellinger distance. Specifically, Hellinger dis-tance measures the distance between two probability distributionsof topics over the entire vocabulary in the text streams:

topic− similarityi, j =N

∑v=1

(√

βi,v −√

β j,v)2 (1)

β is the probability of the ith topic over term v in the vocabulary,and N denotes the vocabulary size.

5.3.1 Visualization of Topical Content

Figure 4: Topic Cloud with the topic related to mobile de-vice/technology highlighted. Keywords in blue are the time-sensitivekeywords in a certain time span. The topics are extracted from CNNnews corpora (Aug 15 - Nov 5, 2011).

To represent topics in the form of a set of keywords, we developthe Topic Cloud view based on tagCloud representation (Figure 4).In the Topic Cloud, the size of each keyword reflects its numberof occurrences in all topics. Since topics are ranked based on theirsimilarities, the Topic Cloud view starts with a random topic at thetop and places the topic with the highest similarity score next tothe one above. As a result, similar topics are placed next to eachother visually. To color the topics in Topic Cloud, we choose colorsfrom the HSV (hue, saturation, value) space by dividing the spacebased on similarities between the list of topics. As a result, similarcolors are assigned to topics with greater similarity. The same colorscheme is also applied to the Topic Stream View (Figure 1 Left).

5.4 Time-sensitive Keyword extractionTo accurately summarize what an individual event is about, we needto re-rank the keywords within a topic by considering which key-words are more prominent given the timespan of the event. In orderto do so, within each topic, we first extract the most prominent key-words for each time frame. We adopted the keyword-ranking algo-rithm that Wei et al. proposed [41] to re-rank terms in each topicbased on time factor. More specifically, we followed Algorithm 2to extract time-sensitive terms given a topic-term distribution ma-trix, and the number of desired terms for each time frame N. Theinput for this algorithm is a text corpus divided into sub-collectionsusing time frame and topics. Each sub-collection contains only thedocuments focusing on the specific topic and also published duringthe time frame. The algorithm follows a TFIDF heuristic to de-termine time-sensitive terms: (a) if a term occurs frequently in thesub-collection, it is important; (b) if the term also occurs in manyother sub-collections, the importance is discounted.

Page 6: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

Algorithm 2: extract time-sensitive termsInput: topic-term distribution matrix φ ; desired number ofkeywords per time frame NSteps:

1.for each topic i do

for each time frame t doIdentify a collection of documents Di,t focusing on topic i fromentire text stream;

end forend for2.for each term W in topic i from Di,t docalculate term frequencies TFend for3. Re-rank the TF scores with topic-term probabilities: for eachterm Wm, weight(Wm) = λ1

T Fi,t,m∑t T Fi,t,m

+λ2φi,m log φi,m

(∏Kk=1 φk,m)1/k

4. Within each topic and time frame, select the top N terms astime-sensitive terms.

The extracted time-sensitive terms could not only aid the explo-ration of the text collections temporally, but also provide accuratelabels for discovered events. The association of events with cor-responding time-sensitive keywords is accomplished through high-lighting time-sensitive keywords (in the topic view) when a topicalburst is examined by users. But for the clarity of the figures, wemanually annotated events with the time-sensitive keywords.

5.4.1 Interaction and View Coordination

The Topic Cloud view and Topic Streams are coordinated throughuser interaction. Hovering the mouse over one topic stream inthe Topic Streams would highlight the corresponding topic in theTopic Cloud. Hovering the mouse over a specific timespan within atopic stream would highlight the time-sensitive keywords. Double-clicking the same area further allows users to access the contentof the documents (as later briefly shown in figure 8). In addition,events from different topics can be highlighted by selecting namedentities. Details of extracting named entities from events are intro-duced in section 6. In addition to view coordination, one importantinteraction LeadLine provides is choosing the “fineness” of the de-tected events. By simply moving the slider, a user can determinethe scale of the examination.

In summary, section 5 presents algorithms that extract topicaland time information that characterizes events. Visual representa-tions of events are also designed based on the algorithmic results.The visualization not only allows interactive examination of the de-tected events, but also provides users the power to steer the eventdetection algorithm

6 ENTITY-BASED EVENT ANALYSIS AND VISUALIZATION

Complementing the topics and time attributes that are depicted intopic analysis, entity analysis is yet another crucial component. Thenamed entities further reveal relationships between events by con-necting them with information regarding people and location.

The primary goal for our name entity extraction is to extract peo-ple and locations from the textual streams and associate these en-tities with events. As illustrated in Figure 2, our named entity ex-traction identifies the named-entities once the data has been cleanedand prepared. The current entity extractor uses the LingPipe pack-age [3] with Statistical Chunking and customized Dictionary-basedExtraction. The three categories extracted are people, location, andorganizations. This information is then inserted into the HBase stor-age platform, and attached to each corresponding news article. Note

that location entities have also been enriched with geo-tags that areacquired alongside the contents.

While our current approach identified a sufficient amount ofmeaningful entities for news articles, the initial results from thisapproach on microblogs was fairly noisy. Similar findings havebeen identified in research in the AAAI community on the sameissue, even with more state-of-the-art part of speech (POS) tag-ging [19]. We believe improvement on entity-based analytics fortweets is much needed, and do appreciate other representative re-search by MacEachren et al. on extracting entities from tweets [27].Given the focus of this paper, we will focus on entities in news anddemonstrate the utility of associating entities and their relationshipsin characterizing events and constructing narratives.

6.1 Entity Graph and Geo-MappingEntities are primarily used in LeadLine to connect multiple eventsinto a meaningful narrative structure. Similar to the term used inJigsaw [39], our distributional similarity between entities is com-puted based on the commonality of their contexts of occurrence intext. In particular, the association between entities is determinedbased on the co-occurrence of entities within the correspondingtext. For example, two entities that co-occur in at least two newsarticles are considered connected. This gives us a contextual back-ground and a baseline to depict the interplay between entities. Sim-ilar to the interactive event detection process, LeadLine enablesusers to interactively adjust the granularity of entity correlationsbased on their co-occurrence.

As shown in Figure 1 top right, we have created an interactiveentity graph visualization to represent the connection and correla-tion between entities. A basic force-directed graph is utilized toallow users to dynamically explore the graph by showing and hid-ing links and entities. Instead of visualizing all the entities within alarge graph, LeadLine constructs the graph content based on user’sselections of topics and time factor as well as the types of entity.Different types of entities are color-coded. For example, people areshown in blue while organizations are shown in green.

Furthermore, location information is plotted on an interactivemap to reveal the spatial distribution of entities contained in thetext corpora (see Figure 1 bottom right). The geospatial view andentity graph are coordinated through user interactions. Selectionswithin one view is used to filter the associated entity in all views.Through such view coordination, different aspects of the entitiescan be examined simultaneously.

In summary, LeadLine is designed to utilize results from namedentity extraction to provide people and location information for ex-ploring and inferring relationships between events. Interactive vi-sualizations are further used to enable users to navigate throughimplicit connections between entities.

7 UNCOVERING NARRATIVES BY ASSOCIATING EVENTS

As a tightly coupled topic and entity analysis and interactive vi-sual interface, LeadLine enables the user to perform investigativeanalysis of events inside a text corpus with a guided exploratoryenvironment. While this environment is useful in interactively ex-amining the four attributes < Topic, Time, People and Location >that comprise an event, summarizing the events into an intuitivenarrative is more convenient for sharing and reporting.

Inspired by Segel and Heer’s narrative genres [37], we have de-veloped a Partitioned Poster style interface designed to assist insummarizing events into narratives (Figure 5). The interface is im-plemented using Bostock’s D3.js library [12] to increase accessibil-ity and to allow further exploration of the data by remote users.

The narrative poster can be considered as a simplified version ofLeadLine. The hierarchy of information in the narrative poster is:1) Entity, 2) Event, 3) Topic. The entity list can be very long fora large text corpus, so an autocompleted text entry field is used for

Page 7: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

Figure 5: A narrative poster constructed from news events related to President Obama. Top left: a tag cloud summarizing all events related tothe selected entity (president Obama); top right: locations mentioned in the events; bottom: event timeline.

entity selection. After selecting an entity, the user is presented withaggregate wordcloud and map views and a timeline view showingthe events. The wordcloud view displays time-sensitive keywordsof all shown events. The map view provides geolocations men-tioned in the events. The timeline view gives a scope of events’durations and overlap (simplified version of the event bursts in theTopic Stream view).

The three views are coordinated with the timeline view being themain point of interaction. Hovering over an event in the timelineview causes the map and wordcloud views to display only keywordsassociated with that event rather than the aggregate for the entireentity. The Partitioned Poster format provides an intuitive way fortelling stories. The simplified views and interactions provide usersa succinct temporal summary of a given corpus, as well as detailsabout events of interest.

8 CASE STUDIES

To evaluate the efficacy of LeadLine in automatically detectingevents from text data and facilitating exploration of events based ontopics, time and named entities, we conducted case studies usingreal-world textual data from CNN news [14] and Twitter [6]. Forthe case studies, we recruited 8 users to explore the events from theCNN and Twitter datasets. We provided training to our participantsregarding the LeadLine system before the exploration. During theexploration, we asked the participants to think-aloud and the ex-perimenter took notes of the findings. In the end, we conducted apost-study interview to collect feedback about the visualization.

8.1 Case Study 1: Exploring the Occupy Wall StreetMovement

In this section, we first describe the dataset we use for the study. Wealso characterize challenges in analyzing such a large-scale socialmovement. We then present how LeadLine assists users in discov-ering major events and understanding the OWS movement throughexploring these events from multiple aspects.

8.1.1 Data set and BackgroundThe Occupy movement is an on-going series of demonstrations andis known for using social media for publicity and organization. The

Occupy movement is long-lasting and wide-spread without cen-tral leadership. This creates challenges in understanding the di-rection of the movement. In addition, a wide range of goals werereported [7], including more jobs, more equal distribution of in-come, and bank reform. Given the prominent use of social media inthe Occupy movement, it makes sense to analyze the voices fromprotesters and citizens.

To provide an overview of the OWS movement, we filtered ourtweet collection described in Section 4 using the hashtag #occupy.The resulting dataset contains more than 100,000 tweets from Aug19 to Nov 01. Given the length of the dataset, the time unit in thevisualization is set to 6 hours, allowing users to see how an eventevolves within a day. 15 topics were extracted using LDA.

8.1.2 Exploring the OWS Movement Through Major Events

Figure 6: Major events discovered by the participants. Only eventsdiscussed in section 8.1.2 are highlighted.

Page 8: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

Before the studies, we asked our participants if they were famil-iar with the OWS movement. All of them indicated that they hadheard or read about the movement from news media, but none hada good understanding of the movement. In this task, we asked theparticipants to identify major events of the OWS movement and re-port what they have learned through exploring with LeadLine.

The participants usually started by quickly going through all top-ics in the Topic Cloud to gain an overview of the topics summariz-ing the tweets. Some participants then mainly focused on the TopicStream View as they explored major events, while glancing at theTopic Cloud for information regarding a topic or a specific event.To make sense of what event triggered a certain topic burst, the par-ticipants clicked on the burst and accessed all tweets contributingto the burst. As they performed explorations in LeadLine, the ex-perimenter took notes of their findings. For reporting purposes, wecollected major findings from our participants, shown in Figure 6.Here we describe a few common discoveries from the participants:

Topic bursts #1 and #2 clearly mark the beginning of the Occupymovement on Sep 17, 2011, with tweets noting the coverage fromnews media on the first day of OWS gathering (burst #1), and thetrend of tweets mentioning the OWS movement picking up steamon Twitter (burst #2).

Topic bursts #3 and #4 mark the events of two forces joiningthe OWS movement to support the protestors. On Sep 28th, theboard of the NYC Transport Workers Union of America voted tosupport OWS, which splashed a lot of celebratory tweets (burst #3).Similarly, on Oct 1st, the burst of tweets (#4) was triggered by themarines joining the movement to protect protestors from the police.

Topic bursts #5 and #6 indicate marches by the occupyprotesters. Topic burst #5 is triggered by protesters marching acrossthe Brooklyn Bridge on Oct 1st. Some tweeted live updates of themarch and of police arrests of the protestors, while others com-plained that there was not enough media coverage of the event. OnOct 5th, more than 10,000 protesters marched near Zuccotti Park.Tweets within initial time of the burst were about the march andthe massive number of people involved. However, tweets in thelater part of the burst were ranting about protesters being peppersprayed, beaten and arrested by police. The event clearly triggeredsome outrage on Twitter (burst #6).

The events mentioned above are just a sample of all events ex-plored and commented on by our participants. After the study, mostof the participants mentioned that they now had a much better un-derstanding of the OWS movement, including when the movementstarted, who was involved, and some major events within the large-scale movement.

8.1.3 Evaluating Detected Events

In addition to having participants identify and explore major eventsof the OWS movement using LeadLine, we compared our eventresults against timelines from Wikipedia [2] and the L.A. Times [1]regarding the same movement. Between Aug 19 and Nov 1, theWikipedia timeline provided 24 events built by online communities,while The L.A. Times provided around 20 new articles describingthe OWS movement. Depending on the level of details a user maywant, LeadLine could provide events indictors ranging from 20 tomore than 50.

We also examined the event indicators to match them with eventsfrom Wikipedia and The L.A. Times. Figure 7 shows our match-ing results. The results indicate that most of the automatically de-tected events by LeadLine match the timelines from Wikipedia andThe L.A. Times, as annotated in Figure 7 with both logos after themajority detected bursts. Wikipedia also published events beforeSeptember 17th that may have led to the OWS movement, whichour automated detection did not identify due to insufficient tweets.However, LeadLine is able to show pre-September 17 activities thatare worthy of further investigation. One interesting finding from our

Figure 7: Comparing events identified by LeadLine against timelinesfrom Wikipedia and L.A. Times.

comparison is that The L.A. Times as a media organization reportedmore “photo-op worthy” events related to celebrities and politi-cians, while Wikipedia covered events centered around the OWSmovement. As expected, LeadLine identified events that triggeredmore commentary on Twitter.

In summary, through interactively exploring the topical burstsLeadLine, the participants were able to identify interesting eventsand gain insights regarding the OWS movement. Through furthercomparing automatically detected events in LeadLine against time-lines provided by Wikipedia and L.A. Times, we validated that thetopical bursts accurately reflect true events.

8.2 Case Study 2: Constructing Narratives based onEvents from News Stories

In the second study, we asked participants to first explore topicalbursts and named entities related to events with LeadLine, and thenfocus on creating narratives based on their exploration.

8.2.1 Data preparationIn this study, the dataset was CNN news stories from Aug 15, 2011to Nov 5, 2011. The news articles were harvested and organized us-ing methods introduced in Section 4. 30 topics were modeled froma total number of 3,130 news articles. Named entities includingpeople and location were extracted from the news corpus.

8.2.2 Uncovering Narratives based on EventsThe exploration process of news events was similar to the first casestudy in that participants started by browsing all news topics to gaina quick overview of the corpus. Some participants then identifieda topic of interest and examined the bursts indicating events withinthat topic. Others turned to the entity graph and the map to see thelocations and people mentioned in the news corpus.

Here we report findings identified by our participants. Startingby exploring the entity graph, one participant was interested in gath-ering events related to former Apple CEO Steve Jobs. By clickingon the node of Steve Jobs in the entity graph, the participant filteredall events that mentioned Jobs’ name in the Topic Stream View(Figure 8). Jobs was consistently mentioned in the events in themobile-technology related topic (in blue), which made sense to theparticipant. He further browsed some of the events, such as Jobs’resignation from Apple in late August (blue burst in the middle),and the unveiling of iPhone 4S on the date of Oct 4th, followed by

Page 9: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

Figure 8: Events related to former Apple CEO Steve Jobs. The greenarrows link news articles with the events related to the blue topic,while the red arrow connects to the purple topic.

the news of Steve Jobs’ death (last burst in blue). The participantinvestigated further when he saw a topic burst in the health relatedtopic in purple also highlighted by LeadLine to indicate relevanceto Jobs. Upon inspection of the news articles related to the eventburst (shown as the call-outs in Figure 8), the participant discov-ered that one CNN program was trying to piece together a detailedhistory Job’s of health in light of his resignation from Apple. “Ididn’t know CNN reported on the health history of Steve Jobs rightafter he stepped down, but I guess this kind of program might raiseawareness for pancreatic cancer”, the participant commented.

One of the most examined entities was President Obama. Par-ticipants discovered that he was a busy man during the three monthperiod. Figure 1 shows topical bursts indicating events related tothe president. He appeared in events such as immigration issues,political debates, natural disasters and international relations. Af-ter exploring the events in LeadLine and gaining insights about thenews stories, multiple participants commented that they can nowtell a story about the entities they focused on. They also expressedtheir desire to have a tool to gather and report their findings.

Inspired by the comments from the participants, we further de-veloped a structure (described in Section 7) for building a narrativebased on selected events. The narrative presents all events-relatedinformation regarding one person or location in a concise manner.The interface is web-based, so it is easily accessible from anywherefor reporting and revisiting.

9 DISCUSSION

In this section, we discuss the limitations of LeadLine, and outlinefuture directions to extend the current research.

9.1 Limitations arise from the Use of Topic Modeling,Early Event Detection, and Named Entity Recogni-tion

Our approach relies on automated algorithms to discover informa-tion regarding who, what, when, and where in order to characterizean event. Inevitably, the final results are affected by the perfor-mance of each algorithm. For instance, the interpretability of each“burst” in the topical streams depends on the topic modeling results.The accuracy of detected entities relies on the performance of thenamed entity recognition (NER) algorithm. In addition, the sameNER algorithm performs differently on different datasets. Solvingthe issues in the short term is challenging, but we think it is use-ful to make users aware of these issues. In order to do so, we planto borrow methods from uncertainty visualization to annotate dif-ferent layers of uncertainty so that users can make more informeddecisions during investigation and analysis.

9.2 Identifying Inter-topic Events

In the scope of this paper, we assume that each event is associatedwith one major topic. However, certain events may have an impacton multiple topics. In order to identify inter-topic events, entitiesand timespan can contribute to grouping bursts from different topicsinto one triggering event. In other words, if bursts in several topicsoccur at the same time and share the same set of entities includingpeople and location, we can assume that they are triggered by thesame event. If one event encompasses multiple topics, the numberof topics may further be used to evaluate the impact of the events.

9.3 Future Improvements

There are several improvements we would like to pursue in the fu-ture. First, when visualizing events identified by our event detectionalgorithm, it is clear that the algorithm favors upward trends in thetopic stream. Although the current results include the starting timefor each event, we want to improve the event detection algorithm toinclude downward trends in the topic stream so that the event cycleis complete.

Second, an interesting challenge we face when cleaning the textdata is the duplication issue in both the news and microblog data.For news corpus, the issue arises because news websites keeps thearticle content up-to-date by reusing the same URL with a differenttimestamp. For Twitter, the issue lies in spam tweets and adver-tisements from different agencies. Our current implementation ofaddressing this duplication challenge is two-fold: first, we use theunique parameters for news and Twitter contents (e.g. URL, postedtimestamp, author, etc) to construct a unique identification (UUID).This unique identification is used as checksum in the first stage tofilter duplicates with exact match. The second stage, our imple-mentation utilizes the standard Longest Common Substring (LCS)algorithm to compare the content length between two articles thatshare similar contents. We believe this is of great use in reducingthe skewing effects that are introduced by the duplicated in the textstreams.

While this performs comparatively well on the news data, suchtwo stage process suffers from a performance penalty due to thelargely fragmented nature of tweets. This issue needs to be ad-dressed since in the event characterization process, both the namedentity extraction algorithm as well as the topic modeling relies onword frequency calculations. Such duplication will undoubtedlyskew the analysis results, producing inaccurate if not false eventinformation.

10 CONCLUSION

In this paper, we present an interactive visual analytics system,LeadLine, that identifies meaningful events and allows users to ex-amine the events that trigger changes in topical themes in news andsocial media. To discover events, LeadLine extracts information re-garding who, what, when and where by integrating topic modeling,event detection, and named entity recognition methods. LeadLinealso supports interactive exploration of events based on the 4Ws.Two case studies were conducted based on both news and socialmedia data. The results indicate that LeadLine can not only accu-rately identify meaningful events given a text collection, but canalso contribute to users’ understanding of the events through inter-active exploration.

ACKNOWLEDGEMENTS

This work was supported in part by a grant from the National Sci-ence Foundation under award number SBE-0915528. Any opin-ions, findings, and conclusions or recommendations expressed inthis material are those of the authors and do not necessarily reflectthe views of the National Science Foundation.

Page 10: LeadLine: Interactive Visual Analysis of Text Data …wdou1/publications/2012/...LeadLine: Interactive Visual Analysis of Text Data through Event Identification and Exploration Wenwen

REFERENCES

[1] L.a. times’ report on ows. http://timelines.latimes.com/occupy-wall-street-movement/, 2011.

[2] Wiki occupywallstreet. http://en.wikipedia.org/wiki/Timeline-of-Occupy-Wall-Street, 2011.

[3] Alias-i. lingpipe 4.1.0. http://alias-i.com/lingpipe, 2012.[4] Apache hadoop. http://hadoop.apache.org, 2012.[5] Apache nutch. http://nutch.apache.org, 2012.[6] Twitter, inc. http://www.Twitter.com, 2012.[7] Occupy wallstreet report [online], http://www.cnn.com/2011/10/05/

opinion/rushkoff- occupy- wall- street/index. html.[8] J. Allan, editor. Topic detection and tracking: event-based informa-

tion organization. Kluwer Academic Publishers, Norwell, MA, USA,2002.

[9] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi.Eddi: interactive topic-based browsing of social status streams. InProceedings of the 23nd annual ACM symposium on User interfacesoftware and technology, UIST ’10, pages 303–312, New York, NY,USA, 2010. ACM.

[10] D. Blei and J. Lafferty. Text Mining: Theory and Applications, chapterTopic Models. Taylor and Francis, 2009.

[11] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J.Mach. Learn. Res., 3:993–1022, March 2003.

[12] M. Bostock, V. Ogievetsky, and J. Heer. D¡sup¿3¡/sup¿ data-drivendocuments. IEEE Transactions on Visualization and ComputerGraphics, 17(12):2301–2309, Dec. 2011.

[13] S. A. Catanese, P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti.Crawling facebook for social network analysis purposes. In Proceed-ings of the International Conference on Web Intelligence, Mining andSemantics, WIMS ’11, pages 52:1–52:8, New York, NY, USA, 2011.ACM.

[14] CNN. Cnn online. http://www.cnn.com/.[15] W. Cui, S. Liu, L. Tan, C. Shi, Y. Song, Z. Gao, H. Qu, and

X. Tong. Textflow: Towards better understanding of evolving topicsin text. IEEE Transactions on Visualization and Computer Graphics,17(12):2412–2421, Dec. 2011.

[16] W. Dou, X. Wang, R. Chang, and W. Ribarsky. Paralleltopics: Aprobabilistic approach to exploring document collections. In VisualAnalytics Science and Technology (VAST), 2011 IEEE Conference on,pages 231 –240, oct. 2011.

[17] Event. Merriam Webster, http://en.wikipedia.org/wiki/Event.[18] A. Feng and J. Allan. Finding and linking incidents in news. In Pro-

ceedings of the sixteenth ACM conference on Conference on informa-tion and knowledge management, CIKM ’07, pages 821–830, NewYork, NY, USA, 2007. ACM.

[19] J. Foster, O. Cetinoglu, J. Wagner, J. L. Roux, S. Hogan, J. Nivre,D. Hogan, and J. van Genabith. hardtoparse: Pos tagging and parsingthe twitterverse. In Analyzing Microtext, volume WS-11-05 of AAAIWorkshops. AAAI, 2011.

[20] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Earlystatistical detection of anthrax outbreaks by tracking over-the-countermedication sales. Proceedings of the National Academy of Sciences ofthe United States of America, 99(8):pp. 5237–5240, 2002.

[21] V. Guralnik and J. Srivastava. Event detection from time series data.In Proceedings of the fifth ACM SIGKDD international conference onKnowledge discovery and data mining, KDD ’99, pages 33–42, NewYork, NY, USA, 1999. ACM.

[22] S. Havre, B. Hetzler, and L. Nowell. Themeriver: Visualizing themechanges over time. In Proceedings of the IEEE Symposium on Infor-mation Vizualization 2000, INFOVIS ’00, pages 115–, Washington,DC, USA, 2000. IEEE Computer Society.

[23] S. A. Khan. Handbook of biosurveillance, m.m. wagner, a.w. moore,r.m. aryel (eds.). elsevier inc. isbn-13: 978-0-12-369378-5. Journal ofBiomedical Informatics, 40(4):380–381, 2007.

[24] M. Krstajic, E. Bertini, and D. Keim. Cloudlines: Compact displayof event episodes in multiple time-series. Visualization and ComputerGraphics, IEEE Transactions on, 17(12):2432 –2439, dec. 2011.

[25] C. A. Kurby and J. M. Zacks. Segmentation in the perception andmemory of events. Trends in cognitive sciences, 12(2):72–79, Feb.

2008.[26] D. Luo, J. Yang, M. Krstajic, W. Ribarsky, and D. Keim. Eventriver:

Visually exploring text collections with temporal references. Visu-alization and Computer Graphics, IEEE Transactions on, 18(1):93–105, jan. 2012.

[27] A. M. MacEachren, A. R. Jaiswal, A. C. Robinson, S. Pezanowski,A. Savelyev, P. Mitra, X. Zhang, and J. Blanford. Senseplace2:Geotwitter analytics support for situational awareness. In IEEE VAST,pages 181–190, 2011.

[28] H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Discovery offrequent episodes in event sequences. Data Min. Knowl. Discov.,1(3):259–289, Jan. 1997.

[29] A. Marcus, M. S. Bernstein, O. Badar, D. R. Karger, S. Madden, andR. C. Miller. Twitinfo: aggregating and visualizing microblogs forevent exploration. In Proceedings of the 2011 annual conference onHuman factors in computing systems, CHI ’11, pages 227–236, NewYork, NY, USA, 2011. ACM.

[30] R. Mckee. Story - Substance, Structure, Style, and the Principles ofScreenwriting. Methuen, 1999.

[31] D. C. Montgomery. Statistical quality control. Wiley Hoboken, N.J.,2009.

[32] D. Neill and G. Cooper. A multivariate bayesian scan statistic for earlyevent detection and characterization. Machine Learning.

[33] D. B. Neill and W.-K. Wong. Tutorial on event detection tutorial.http://www.cs.cmu.edu/ neill/papers/eventdetection.pdf, 2009.

[34] S. Petrovic, M. Osborne, and V. Lavrenko. Streaming first story de-tection with application to twitter. In Human Language Technologies:The 2010 Annual Conference of the North American Chapter of theAssociation for Computational Linguistics, HLT ’10, pages 181–189,Stroudsburg, PA, USA, 2010. Association for Computational Linguis-tics.

[35] R. Sabhnani, D. Neill, and A. Moore. Detecting anomalous patternsin pharmacy retail data. Proceedings of the KDD 2005 Workshop onData Mining Methods for Anomaly Detection, Aug. 2005.

[36] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitterusers: real-time event detection by social sensors. In Proceedingsof the 19th international conference on World wide web, WWW ’10,pages 851–860, New York, NY, USA, 2010. ACM.

[37] E. Segel and J. Heer. Narrative visualization: Telling stories withdata. IEEE Transactions on Visualization and Computer Graphics,16(6):1139–1148, Nov. 2010.

[38] L. Shi, F. Wei, S. Liu, L. Tan, X. Lian, and M. Zhou. Understandingtext corpora with multiple facets. In Visual Analytics Science andTechnology (VAST), 2010 IEEE Symposium on, pages 99 –106, 2010.

[39] J. Stasko, C. Gorg, Z. Liu, and K. Singhal. Jigsaw: Supporting inves-tigative analysis through interactive visualization. In Visual Analyt-ics Science and Technology, 2007. VAST 2007. IEEE Symposium on,pages 131 –138, 30 2007-nov. 1 2007.

[40] X. Wang, W. Dou, Z. Ma, J. Villalobos, Y. Chen, T. Kraft, and W. Rib-arsky. I-SI: Scalable Architecture of Analyzing Latent Topical-LevelInformation From Social Media Data. Computer Graphics Forum,31(3):1275–1284, 2012.

[41] F. Wei, S. Liu, Y. Song, S. Pan, M. X. Zhou, W. Qian, L. Shi, L. Tan,and Q. Zhang. Tiara: a visual exploratory text analytic system. InProceedings of the 16th ACM SIGKDD international conference onKnowledge discovery and data mining, KDD ’10, pages 153–162,New York, NY, USA, 2010. ACM.

[42] J. Weng, Y. Yao, E. Leonardi, and F. Lee. Event Detection in Twitter.In Fifth International AAAI Conference on Weblogs and Social Media,2011.

[43] J. M. Zacks and B. Tversky. Event structure in perception and con-ception. Psychological Bulletin, 127:3, 2001.