generating event storylines from microblogs

CIKM’12

Generating Event Storylines from Microblogs

ABSTRACT

We explore the problem of generating storylines from microblogs for user input queries.

Given a query about an ongoing event, we propose to sketch the real-time storyline of the event with a two-level solution:

1. Propose a language model with dynamic pseudo relevance feedback to obtain relevant tweets

2. Generate storylines via graph optimization

INTRODUCTION

Generating Event Storyline from Microblogs

(GESM)

Differences between GESM and prior studies:

1. Prior studies work on well-edited facts; GESM works on short, noisy text

2. GESM provides a personalized service

3. A two-level framework is necessary: at the low level, a retrieval model finds all relevant tweets along the timeline of the event; at the high level, the relevant tweets and their latent structure are summarized to produce a storyline.

INTRODUCTION

Challenges

1. The dynamic and sparse nature of microblogs: how to match the underlying event expressed by a vague event query to potentially relevant tweets that may not contain any query terms

2. Numerous duplicate tweets, plus direct and indirect retweets

INTRODUCTION

Contributions

1. The problem of generating event storylines from microblogs

2. A dynamic pseudo relevance feedback (DPRF) language model

3. Storyline generation formulated as a graph-based optimization problem and solved by approximation algorithms for the minimum-weight dominating set and directed Steiner tree problems

THE FRAMEWORK OVERVIEW

The generated storyline should be a graph structure

Each node is labeled by a summary

Each edge represents a causal relationship between two phases

Offline layer

Online layers

THE RETRIEVAL MODEL

Preliminaries

The original query is usually short and vague

Query expansion

In pseudo relevance feedback, the few top-ranked documents d+ retrieved by the initial query Q are assumed to be relevant and are used to build a relevance model θF. The new query is then set to a linear combination of the original query Q and the relevance model θF.
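The linear combination above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the whitespace tokenizer, the maximum-likelihood unigram models and the interpolation weight `lam` are all assumptions made for the example.

```python
from collections import Counter


def unigram_model(texts):
    """Maximum-likelihood unigram model over whitespace tokens (assumed tokenizer)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def expand_query(query, feedback_docs, lam=0.5):
    """Interpolate the query model with the feedback model theta_F.

    lam is a hypothetical interpolation weight: 0 keeps the original
    query unchanged, 1 uses only the feedback documents.
    """
    theta_q = unigram_model([query])
    theta_f = unigram_model(feedback_docs)
    vocab = set(theta_q) | set(theta_f)
    return {
        w: (1 - lam) * theta_q.get(w, 0.0) + lam * theta_f.get(w, 0.0)
        for w in vocab
    }


expanded = expand_query(
    "earthquake japan",
    ["earthquake tsunami", "tsunami warning japan"],
    lam=0.5,
)
```

Terms such as "tsunami" that never appear in the original query receive non-zero probability in the expanded model, which is what lets the retrieval step reach tweets that share no term with the query.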

THE RETRIEVAL MODEL

Dynamic Pseudo Relevance Feedback

K burst periods

Assume that the prior probability of a relevant document d+ depends on the distance from its timestamp t_d+ to the centroids of the burst periods, denoted Φ = {φ1, ..., φK}

Three probability functions model the effective range, decay coefficient and skewness of a burst period:

1. Mixture Gaussian Distribution

2. Local Power Distribution

3. Skewed Linear Distribution
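As an illustration of the first option, a temporal prior built from a Gaussian mixture might look like the sketch below. The exact parameterization used in the paper is not given on these slides, so the equal mixture weights and the single shared width `sigma` are assumptions.

```python
import math


def gaussian_mixture_prior(t, centroids, sigma=1.0):
    """Prior density of a document timestamp t given burst centroids.

    Assumptions: one Gaussian per burst centroid, equal mixture weights,
    and one shared standard deviation sigma for all bursts.
    """
    if not centroids:
        raise ValueError("at least one burst centroid is required")
    norm = sigma * math.sqrt(2 * math.pi)
    total = sum(
        math.exp(-((t - phi) ** 2) / (2 * sigma ** 2)) / norm
        for phi in centroids
    )
    return total / len(centroids)


# A tweet posted at a burst centroid gets a higher prior than one
# posted between two bursts.
near = gaussian_mixture_prior(5, [5, 20])
far = gaussian_mixture_prior(12, [5, 20])
```

With this shape, `sigma` plays the role of the effective range of a burst; the other two distributions would additionally control decay and skewness.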

THE RETRIEVAL MODEL

Mixture Gaussian Distribution

Local Power Distribution

Skewed Linear Distribution

THE RETRIEVAL MODEL

Burst Period Detection

A term is bursty at a time point if it

1. appears more frequently than usual

2. is continuously frequent around the time point.

Detect burst periods of the event by

1. for each query term, finding the time intervals of arbitrary length in which the query term appears constantly frequent;

2. picking the time points within these intervals with the largest sum of frequencies over all query terms.

THE RETRIEVAL MODEL

“Bursty score”

Find the time interval T_{w,j} = <st, et, LS, RS> with the maximal cumulative burst score B(w, T_{w,j})

Compute the score H(q, t) of every query term q at each time point t

Rank each time point t by ∑_{q∈Q} H(q, t) and choose the K highest-ranked time points φ_k.
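The final ranking step can be sketched as below, assuming the per-term scores H(q, t) have already been computed. The nested-dictionary input layout and the function name are choices made for this example only.

```python
def top_k_burst_points(scores, k):
    """Return the k time points with the largest sum of H(q, t) over query terms.

    scores: dict mapping each query term q to a dict {time point: H(q, t)}
            (an assumed input layout).
    Ties are broken by the earlier time point.
    """
    totals = {}
    for per_time in scores.values():
        for t, h in per_time.items():
            totals[t] = totals.get(t, 0.0) + h
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:k]


scores = {
    "quake": {1: 0.2, 2: 0.9, 3: 0.1},
    "japan": {1: 0.1, 2: 0.5, 3: 0.7},
}
centroids = top_k_burst_points(scores, k=2)
```

The returned time points serve as the burst centroids φ_k used by the temporal prior of the retrieval model.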

STORYLINE GENERATION

1. Representative tweets

2. Depict the evolving structure of the event

3. an optimistic connection

A multi-view tweet graph is constructed

A minimum-weight dominating set is found on the tweet graph

A minimum Steiner tree connects the dominating set


Three non-negative real parameters α, τ1, τ2, with τ1 < τ2.

Undirected edges E: text similarity > α

Directed arcs A: τ1 ≤ t_j − t_i ≤ τ2

Vertex weights: w(v_i) = 1 − score(Q, v_i).
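A minimal sketch of building such a graph from these three definitions is given below. Jaccard similarity over whitespace tokens stands in for the text similarity, which the slides do not specify, and the input layout (text, timestamp pairs plus a list of relevance scores) is an assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity (a stand-in for the unspecified text similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_multiview_graph(tweets, relevance, alpha, tau1, tau2):
    """tweets: list of (text, timestamp); relevance: score(Q, v_i) per tweet.

    Returns vertex weights, similarity edges (i < j) and temporal arcs (i -> j).
    """
    weights = [1.0 - s for s in relevance]
    edges = {
        (i, j)
        for i, j in combinations(range(len(tweets)), 2)
        if jaccard(tweets[i][0], tweets[j][0]) > alpha
    }
    arcs = {
        (i, j)
        for i in range(len(tweets))
        for j in range(len(tweets))
        if i != j and tau1 <= tweets[j][1] - tweets[i][1] <= tau2
    }
    return weights, edges, arcs


tweets = [
    ("quake hits japan", 0),
    ("quake hits japan coast", 1),
    ("tsunami warning issued", 5),
]
weights, edges, arcs = build_multiview_graph(
    tweets, [0.9, 0.8, 0.6], alpha=0.5, tau1=1, tau2=4
)
```

Here the first two tweets are linked by a similarity edge, while the arcs follow the time order: tweet 0 to tweet 1, and tweet 1 to tweet 2.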


A subset S of the vertex set of an undirected graph is a dominating set if, for each vertex u, either u is in S or u is adjacent to a vertex in S.


greedy algorithm
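One common greedy strategy for the minimum-weight dominating set is sketched below: repeatedly pick the vertex with the lowest weight per newly dominated vertex. Whether the paper uses exactly this selection rule is not stated on the slide, so treat the rule and the function name as assumptions.

```python
def greedy_dominating_set(n, edges, weights):
    """Greedy approximation of a minimum-weight dominating set.

    n: number of vertices, labeled 0..n-1
    edges: iterable of undirected edges (u, v)
    weights: non-negative weight per vertex
    """
    # Closed neighborhood: the vertex itself plus its neighbors.
    closed = {v: {v} for v in range(n)}
    for u, v in edges:
        closed[u].add(v)
        closed[v].add(u)

    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Lowest cost per newly dominated vertex; ties go to the smaller label.
        best = min(
            (v for v in range(n) if closed[v] & uncovered),
            key=lambda v: (weights[v] / len(closed[v] & uncovered), v),
        )
        chosen.append(best)
        uncovered -= closed[best]
    return chosen


star = [(0, 1), (0, 2), (0, 3)]
cheap_center = greedy_dominating_set(4, star, [1, 1, 1, 1])
costly_center = greedy_dominating_set(4, star, [10, 1, 1, 1])
```

With unit weights the center of the star dominates everything on its own; when the center is expensive, the greedy rule prefers the three cheap leaves instead.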


A Steiner tree of a graph G with respect to a vertex subset S is the edge-induced sub-tree of G that contains all the vertices of S and has the minimum total cost, where the cost is the total weight of the vertices.
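To make the definition concrete, the sketch below connects a set of terminal vertices with a simple shortest-path heuristic on a directed graph with vertex weights. It is not the approximation algorithm referred to in the slides and carries no optimality guarantee; the function name and the choice of a fixed root are assumptions.

```python
import heapq


def steiner_tree_heuristic(arcs, weights, root, terminals):
    """Union of node-weighted shortest paths from root to every terminal.

    arcs: iterable of directed arcs (u, v)
    weights: mapping vertex -> non-negative weight
    Returns the set of arcs of the resulting tree.
    """
    succ = {}
    for u, v in arcs:
        succ.setdefault(u, []).append(v)

    # Dijkstra where the cost of a path is the total weight of its vertices.
    dist = {root: weights[root]}
    parent = {}
    heap = [(weights[root], root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v in succ.get(u, []):
            nd = d + weights[v]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))

    tree_arcs = set()
    for t in terminals:
        if t not in dist:
            raise ValueError(f"terminal {t} is not reachable from the root")
        while t != root:
            tree_arcs.add((parent[t], t))
            t = parent[t]
    return tree_arcs


arcs = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
weights = {0: 0.1, 1: 0.9, 2: 0.2, 3: 0.1, 4: 0.3}
tree = steiner_tree_heuristic(arcs, weights, root=0, terminals=[3, 4])
```

Because low weight means high relevance, w(v) = 1 − score(Q, v), the heuristic routes through vertex 2 rather than the less relevant vertex 1.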

EXPERIMENTS

Data Set

EXPERIMENTS

Tweet Retrieval

49 queries

Evaluation metrics:

precision at top 30 tweets (P@30)

mean average precision (MAP)

precision at top 100 tweets (P@100)

R-precision (R-PREC)
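These metrics follow their standard information-retrieval definitions, sketched below. The function names and the input layout (a ranked list of tweet ids plus a set of relevant ids per query) are assumptions for the example.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant item retrieved."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant items."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
r_prec = r_precision(ranked, relevant)
```

P@30 and P@100 are `precision_at_k` with k = 30 and k = 100, and MAP averages `average_precision` over the 49 queries.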

EXPERIMENTS

Comparative Study

EXPERIMENTS

Parameter Tuning

EXPERIMENTS

Summarization Capability

EXPERIMENTS

Parameter Tuning

EXPERIMENTS

A User Study

CONCLUSION

The proposed dynamic pseudo relevance feedback model

A minimum-weight Steiner tree on a dominating set

Extensive experiments

OMG, I Have to Tweet That!

A Study of Factors that Influence Tweet Rates

Abstract

Key limitation of social media as a signal: it depends on people self-reporting their own behaviors and observations.

A large-scale quantitative analysis of some of the factors that influence self-reporting bias.

The daily variations in tweet rates about weather events

Introduction

Treating social media as a signal to measure the relative real-world occurrence of events

Critical challenge: the bias introduced by the self-reported nature of social media

What is it about an event that makes it more or less “tweetable”?

A first large-scale, quantitative analysis of some of the factors that influence self-reporting bias by comparing a year of tweets about weather events in cities across the United States and Canada to ground-truth knowledge about actual weather occurrences.

Introduction

Three potential factors:

1. How extreme is the weather?

2. How expected is the weather given the time of year?

3. How much did the weather change?

Data Preparation

Between Jun 1, 2010 and Jun 30, 2011

56 different metropolitan areas

Historical weather data provided by the National Oceanic and Atmospheric Administration of the United States.

Identifying Weather-related Tweets

Discovering the rate of weather-related tweets that occurred per day across metropolitan areas

1. Filter the full archive of tweets for tweets that contain at least one weather-related word from a list of 179 weather-related words and phrases

2. Build a classifier for weather-related tweets

A simple classifier estimates the probability of a tweet being weather-related
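The first step, keyword filtering, can be sketched as follows. The four terms in `WEATHER_TERMS` are a hypothetical stand-in for the 179-entry list, which is not reproduced on the slides, and the whole-word matching rule is an assumption.

```python
import re

# Hypothetical stand-in for the 179 weather-related words and phrases.
WEATHER_TERMS = ["rain", "snow", "thunderstorm", "heat wave"]


def make_weather_filter(terms):
    """Build a predicate that is True when a tweet contains at least one term.

    Matching is case-insensitive and restricted to whole words, so "rain"
    does not match inside "training".
    """
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return lambda tweet: pattern.search(tweet) is not None


is_candidate = make_weather_filter(WEATHER_TERMS)
candidates = [
    t
    for t in ["Snow again in Boston!", "training for a marathon"]
    if is_candidate(t)
]
```

Tweets that pass this filter are only candidates; the classifier in the second step decides which of them are actually about the weather.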

Identifying the Location of Tweets

geo-coded

The textual user-provided location field in a user’s Twitter profile

Normalize the arbitrary textual user-provided location information into concrete geo-coded coordinates

1. Build a mapping from user-provided location fields to latitude-longitude coordinates.

2. Merge location fields with similar geo-mappings to create clusters for roughly metropolitan-sized areas

Historical Weather Data

Calculate daily summaries

For each daily summary of weather data at a location:

Expectation: how normal the observed weather is at a location

Extremeness: how extreme the weather is on a particular day

Change: how different the observed weather data is from previous days’ weather
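The slides do not give formulas for these three measures. One plausible way to operationalize them, shown purely as an assumption, is to compare the day's value against seasonal history, against all history, and against the preceding days:

```python
from statistics import mean, pstdev


def zscore(x, sample):
    """Standard score of x relative to a sample; 0.0 if the sample has no spread."""
    s = pstdev(sample)
    return (x - mean(sample)) / s if s else 0.0


def weather_features(today, same_season_history, all_history, previous_days):
    """Hypothetical feature definitions for a single weather variable.

    expectation: deviation from the same time of year (larger = less expected)
    extremeness: deviation from the whole historical record
    change: difference from the mean of the preceding days
    """
    return {
        "expectation": abs(zscore(today, same_season_history)),
        "extremeness": abs(zscore(today, all_history)),
        "change": today - mean(previous_days),
    }


features = weather_features(
    today=30,
    same_season_history=[28, 30, 32],
    all_history=[0, 10, 20, 30],
    previous_days=[20, 22],
)
```

In this example a temperature of 30 is entirely normal for the season, fairly extreme over the whole year, and a jump of 9 from the previous two days.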

Analysis and Results

Tweet Rates and Weather Reports


Linear Regression

Models the relationship between a set of weather-derived features and the daily rate of weather-related tweets
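As a minimal illustration, the sketch below fits ordinary least squares with a single feature and reports R², the share of variation in tweet rates that the feature explains. The study uses sets of features; the single-feature restriction, the function name and the toy numbers are simplifications made for this example.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x, plus R^2.

    Assumes x contains at least two distinct values.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return slope, intercept, r2


# Toy data: a weather feature per day and the matching daily tweet rate.
feature = [0, 1, 2, 3]
tweet_rate = [1, 3, 5, 7]
slope, intercept, r2 = fit_simple_ols(feature, tweet_rate)
```

Comparing R² across models built from different feature sets is what allows statements such as one factor explaining more of the variation than another.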


Correlating Basic Weather Data and Tweet Rates


Correlating Expectation and Tweet Rates

The expectation measure adds little information about likely tweet rates beyond what is already contained in basic weather data

Correlating Extremeness and Tweet Rates

Extremeness can independently explain more of the variation in weather-related tweet rates than basic weather alone

Correlating Delta Change and Tweet Rates

There is little difference in the amount of information gained from building these delta-change models

Combining Extremeness, Expectation, and Delta Change Models


Per-Location Models

Discussion

Additional Factors Likely to Affect Tweet Rates

Sentiment

Privacy concerns, embarrassment and safety

Population segments

Mobile devices

Time of day, day of week, holidays, and other effects of time

Conclusions

The correlation between daily tweet rates and the expectation, extremeness, and change in observed weather.

Global models

Location-specific models

Extremeness > change > expectation
