breaking news detection and tracking in twitter (wi:iw'10)

WI: IWI August 31st, 2010 Swit Phuvipadawat, Tsuyoshi Murata Dept. of Computer Science Tokyo Institute of Technology, Japan Breaking News Detection and Tracking in Twitter


DESCRIPTION

Twitter has been used as one of the communication channels for spreading breaking news. We propose a method to collect, group, rank and track breaking news in Twitter. Since short messages make similarity comparison difficult, we boost scores on proper nouns to improve the grouping results. Each group is ranked based on popularity and reliability factors. The current detection method is limited to the factual part of messages. We developed an application called “Hotstream” based on the proposed method. Users can discover breaking news from the Twitter timeline. Each story is provided with information on the message originator, the story development and an activity chart. This provides a convenient way for people to follow breaking news and stay informed with real-time updates.

TRANSCRIPT


Outline

• Introduction

• Analysis

• Methodology

• Results and Application

• Challenges and Future Works

• Conclusion

Introduction

Twitter as a news channel

http://blog.marsdencartoons.com/2009/06/18/cartoon-iranian-election-demonstrations-and-twitter/marsden-iran-twitter72/

In June 2009, during the Iranian election, Twitter transformed the way people convey news.

Twitter as a news channel

[Figure: example breaking-news messages on Twitter, grouped by story]

• Earthquakes around the world (Earthquake in Taiwan, Earthquake in Chile, Earthquake in Haiti): "Earthquake with 6.4 magnitude hits Taiwan!", "Tsunami alert after Chilean earthquake."

• Iraq Election: "Early voting begin March, 7 Iraq Election", "US and UN hope Sunni participation help heal the wound."

• Apple announced iPad: "The Apple iPad starting $499", "Apple to launch iPad on March 26", "Steve Jobs demoed iPad"

• Obama Health Reform: "Health care explained."

Research Topic

“Breaking News Detection and Tracking in Twitter”

➡ Topic Detection and Tracking (TDT)

➡ Information Retrieval

➡ Social Network Analysis

Topic Detection and Tracking (TDT)

• To monitor broadcast news and alert an analyst to new and interesting events happening in the world. [Allan 2001]

• To search, organize and structure multilingual, news oriented textual materials from a variety of broadcast news media. [Fiscus & Doddington 2002]

• Focuses on 5 tasks:

❖ Story segmentation

❖ First story detection

❖ Cluster detection

❖ Tracking

❖ Story link detection

Recent Studies

• Topological characteristics of Twitter

What is Twitter, a Social Network or a News Media? H. Kwak, C. Lee, S. Moon [WWW2010]

➡ 85% of trending topics in Twitter appear in headline news

• Using Twitter data to improve web ranking

Time is of the Essence: Improving Recency Ranking Using Twitter Data A. Dong, R. Zhang et al. [WWW2010]

➡ Micro-blogging data reveals fresh URLs not yet indexed by search engines

• Event detection

Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors T. Sakaki, M. Okazaki, Y. Matsuo [WWW2010]

Recent Studies

• Detection of influential topics and users

❖ Characterizing Microblogs with Topic Model D. Ramage, S. Dumais, D. Liebling [ICWSM2010]

➡ Use Labeled LDA, a supervised learning model, to characterize the content of messages into substance, style, status and social characteristics.

❖ TwitterRank: Finding Topic-sensitive Influential Twitterers J. Weng, E.-P. Lim, J. Jiang, Q. He [WSDM2010]

➡ Use PageRank with a topic model (LDA) to measure the influence of users.

Analysis

Message Analysis

Single Message Aspect

Msg Attributes | Count | %

Tag a user (@) | 79,469 | 51.6%

Embed a link (http://) | 50,404 | 32.7%

Retweet (RT) | 29,935 | 19.4%

Use a hashtag (#) | 20,348 | 13.2%

Findings from a dataset of 154,000 messages, with 33,000 messages from news-engaging users. Data of March 2009.

Text Characteristics | E/F | Examples

Sensational adjectives | E | terrible, horrible, terrifying, shocking, terrific, amazing, ...

Sensational phrases | E | wow! oh my god! ...

Significant nouns | F | US President, Obama, Michael Jackson, Japan, Toyota, ...

Impactful verbs | F | kill, die, crash, reveal, discover, rescue, ...
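The four message attributes counted above can be detected with simple pattern matching. The sketch below is only an illustration: the slides do not say how the attributes were detected, so the regular expressions and the names `PATTERNS`, `message_attributes` and `attribute_shares` are assumptions.

```python
import re
from collections import Counter

# Hypothetical patterns, one per attribute in the table.
PATTERNS = {
    "tag_user": re.compile(r"@\w+"),        # "@John"
    "link": re.compile(r"https?://\S+"),    # "http://..."
    "retweet": re.compile(r"\bRT\b"),       # "RT" as a separate word
    "hashtag": re.compile(r"#\w+"),         # "#ipad"
}

def message_attributes(text):
    """Return the set of attribute names found in one message."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}

def attribute_shares(messages):
    """Return {attribute: (count, percent of all messages)}."""
    counts = Counter()
    for text in messages:
        counts.update(message_attributes(text))
    total = len(messages)
    return {
        name: (counts[name], 100.0 * counts[name] / total if total else 0.0)
        for name in PATTERNS
    }
```

A message can carry several attributes at once, which is why the percentages in the table do not need to sum to 100%.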

Network Analysis

Timeline Aspect

RT (retweet) means taking someone's Twitter message and rebroadcasting the same message, like retelling a story to your friends.

John (12:15): Earthquake in Tokyo!

Lisa (12:30): RT @John Earthquake in Tokyo!
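To make the retweet convention concrete, here is a minimal sketch that strips leading "RT @user" prefixes and recovers the users who were retweeted along with the original text. The function name `unwrap_retweet` and the regular expression are assumptions for illustration, not part of the presented system.

```python
import re

# Matches one leading "RT @user" prefix, with an optional colon after the name.
RT_PREFIX = re.compile(r"^\s*RT\s+@(\w+):?\s*(.*)$", re.DOTALL)

def unwrap_retweet(text):
    """Return (retweeted users, outermost first; original message text)."""
    chain = []
    while True:
        match = RT_PREFIX.match(text)
        if not match:
            return chain, text
        chain.append(match.group(1))
        text = match.group(2)
```

For Lisa's message in the example, the last user in the returned chain (John) is the message originator.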

Methodology

Method for Collecting, Indexing and Grouping

‣ Collecting

• Fetch messages using pre-defined search queries for breaking-news-related keywords and hashtags

‣ Indexing

• An index based on term vectors is constructed.

• Apache Lucene is used as an information retrieval library

‣ Grouping

• Similar messages are grouped together to form a news story

• Similarity comparison is based on the vector space model using TF-IDF with term boosting for proper nouns

Collecting

Grouping Method Explained

Conditions

• Messages in a group must be related to the first story

• Further messages can develop upon previous messages

A message is compared with the first message in a group and the top k terms in that group.
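The grouping rule can be sketched as a single pass over incoming messages. This is a simplified illustration, not the Hotstream implementation: the stand-in `overlap` similarity, the `threshold` and `k` values, and taking the maximum of the two comparisons (first message, top k terms) are all assumptions, since the slides do not specify them.

```python
from collections import Counter

def tokens(text):
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    stripped = (w.strip(".,!?:;\"'").lower() for w in text.split())
    return [w for w in stripped if w]

def overlap(a, b):
    """Stand-in similarity: fraction of a's distinct terms that occur in b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0

def top_k_terms(group, k):
    """The k most frequent terms over all messages in a group."""
    counts = Counter(t for msg in group for t in msg)
    return [t for t, _ in counts.most_common(k)]

def group_messages(messages, similarity=overlap, threshold=0.5, k=5):
    """Assign each message to a group index; group[0] is the first story."""
    groups, assignment = [], []
    for text in messages:
        msg = tokens(text)
        best, best_score = None, 0.0
        for i, group in enumerate(groups):
            # Compare with the first message and with the group's top k terms.
            score = max(similarity(msg, group[0]),
                        similarity(msg, top_k_terms(group, k)))
            if score >= threshold and (best is None or score > best_score):
                best, best_score = i, score
        if best is None:
            groups.append([msg])          # the message starts a new story
            best = len(groups) - 1
        else:
            groups[best].append(msg)      # the message develops a story
        assignment.append(best)
    return assignment
```

Comparing against the top k terms as well as the first message is what lets later messages develop a story without drifting away from it.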

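$$\mathrm{sim}(m_1, m_2) = \sum_{t \in m_1} \left[ \mathrm{tf}(t, m_2) \times \mathrm{idf}(t) \times \mathrm{boost}(t) \right]$$

$$\mathrm{tf}(t, m) = \frac{\mathrm{count}(t \text{ in } m)}{\mathrm{size}(m)}$$

$$\mathrm{idf}(t) = 1 + \log\left( \frac{N}{\mathrm{count}(m \text{ has } t)} \right)$$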

The boost is raised for proper nouns (e.g. China, Obama, Toyota) and hashtags. NER is used to detect the proper nouns.
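The similarity score can be written directly from its definition: a sum over the terms of the first message of tf × idf × boost. The following is a minimal sketch under stated assumptions: messages are lists of tokens, the sum runs over the distinct terms of `m1`, the logarithm is natural, and the set of proper nouns is passed in by hand rather than produced by NER. The default boost value `b=1.5` is one of the values tried in the results; the function names are illustrative.

```python
import math

def tf(term, msg):
    """count(t in m) / size(m), where msg is a list of tokens."""
    return msg.count(term) / len(msg) if msg else 0.0

def idf(term, corpus):
    """1 + log(N / count(m has t)); 0 if no message contains the term."""
    n_with = sum(1 for m in corpus if term in m)
    return 1.0 + math.log(len(corpus) / n_with) if n_with else 0.0

def boost(term, proper_nouns, b=1.5):
    """Raise the weight of proper nouns and hashtags; 1.0 otherwise."""
    return b if term in proper_nouns or term.startswith("#") else 1.0

def sim(m1, m2, corpus, proper_nouns, b=1.5):
    """Sum of tf(t, m2) * idf(t) * boost(t) over the distinct terms t of m1."""
    return sum(
        tf(t, m2) * idf(t, corpus) * boost(t, proper_nouns, b)
        for t in set(m1)
    )
```

With `b` above 1, two messages that share a proper noun or hashtag score higher than two messages that share only common words, which is the effect the boosting is meant to achieve.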

Named Entity Recognizer

• Stanford Named Entity Recognizer (NER) has been adopted for the following uses:

➡ To detect proper nouns used in the grouping algorithm

➡ To classify messages based on named entities (Person, Organization, Location, Misc.)

• NER is based on linear chain Conditional Random Field (CRF)

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-37

Method for Group Ranking

The score for each group is computed as follows:

• A group score is based on reliability, popularity and freshness factors.

‣ Reliability comes from the number of followers of the user who posted a message.

‣ Popularity comes from the number of retweets.

‣ Freshness is computed from the difference between the current time and the time at which a message was posted.
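The slides name the three ranking factors but do not give the scoring formula, so the sketch below is purely hypothetical. It assumes log-damped follower and retweet counts, an exponential freshness decay with a one-hour half-life, and a sum over the messages of a group. `Message`, `freshness` and `group_score` are illustrative names.

```python
import math
from dataclasses import dataclass

@dataclass
class Message:
    followers: int     # followers of the posting user (reliability)
    retweets: int      # number of retweets (popularity)
    posted_at: float   # posting time in Unix seconds

def freshness(age_seconds, half_life=3600.0):
    """Assumed decay: the weight halves every `half_life` seconds."""
    return 0.5 ** (max(age_seconds, 0.0) / half_life)

def group_score(messages, now, half_life=3600.0):
    """Hypothetical score: (reliability + popularity) * freshness, summed."""
    total = 0.0
    for m in messages:
        reliability = math.log1p(m.followers)
        popularity = math.log1p(m.retweets)
        total += (reliability + popularity) * freshness(now - m.posted_at, half_life)
    return total
```

Groups would then be sorted by this score in descending order, so that recent, widely retweeted stories from well-followed users appear first.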

Results and Application

Detection Effectiveness

Method: Search query

Measure | Rate

Precision | 90.0% (45/50)

Recall | -

Spam | 8% (4/50)

Avg. time to collect 100 new msg. | 72 sec

User generated | 11.1% (5/45)

Based on an experiment conducted in June 2010

Example Result of Grouping

(a) No boost: G0 = {M3, M4, M5}, G1 = {M7, M8}, G2 = {M0, M1}, G3 = {M2}, G4 = {M6}, G5 = {M9}

(b) b=1.5: G0 = {M2, M3, M4, M5}, G1 = {M7, M8}, G2 = {M0, M1}, G4 = {M6}, G5 = {M9}

(c) b=1.7: G0 = {M0, M1, M7, M8}, G1 = {M2, M3, M4, M5}, G2 = {M6}, G3 = {M9}

(d) b=2: G0 = {M2, M3, M4, M5, M9}, G1 = {M0, M1, M7, M8}, G2 = {M6}

Story labels in the figure: Toyota, MJ., Airline, US., Japan, Prisoner

Boosting improves the grouping result

Application

• A prototype application called Hotstream is developed.

• The goal is to create an automatic news portal based on Twitter data.

Challenges and Future Works

Challenges

• The length of messages is short

• Two similar stories may be expressed using different vocabulary terms

• The style of writing is unconventional, with slang and many spelling variants

Future Works

• Explore the community structures of named entities to find relationships among groups of messages

Grouped by TF-IDF with proper noun term boosting

Example Dataset

Top 18 stories and their keywords from Hotstream as of July 21st, 2010

Red nodes = keywords, Yellow nodes = message groups

Messages-Named Entities

Community Detection Experiment

Method Edge betweenness

No. Communities 68

Modularity 0.71

Purity 0.67
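The slides report purity without defining it. A common clustering definition, assumed here, is the fraction of nodes that carry the majority label of their community. The sketch below computes it from two hypothetical mappings, node to detected community and node to reference story label.

```python
from collections import Counter, defaultdict

def purity(communities, labels):
    """communities: node -> community id; labels: node -> reference label."""
    by_community = defaultdict(Counter)
    for node, community in communities.items():
        by_community[community][labels[node]] += 1
    # Each community contributes the size of its largest label class.
    majority = sum(c.most_common(1)[0][1] for c in by_community.values())
    return majority / len(communities) if communities else 0.0
```

A purity of 1.0 would mean that every detected community contains nodes from a single story only.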

Community Detection Results

Labelled communities in the figure: BP Oil leak, Australian Prime Minister, US. Military in Middle East

Network Characteristics

Network Type Edge betweenness

No. Vertices 453 (254, 200)

No. Edges 1280

Mean Degree 5.639

No. Clusters 40

Largest Component Fraction 0.781

Conclusion

• Introduced Twitter as a means of conveying news

• Described the message and network characteristics of Twitter

• Described the method to collect, index, group and rank messages

• Introduced Hotstream, an automatic news portal

• Proposed an extension study on the group-keyword network to improve the grouping result

Thank You