Post on 22-Dec-2015

Automatic Blog Monitoring and Summarization

Ka Cheung “Richard” Sia

PhD Prospectus

With/without organized access

Inaccessible?

[Figure: % of feeds vs. # of subscribers. x-axis: # of subscribers (1+, 20+, 50+, 1000+, 5000+); y-axis: % of feeds (0% to 100%). By AskJeeves]

Introduction

Organized access to blogs
  Full coverage
  Reflect changes quickly
  Filtered and organized presentation

Intended contributions
  Efficient techniques to harvest blogs
  Algorithms to monitor frequently changing data sources
  Algorithms to reconstruct implicit networks and compose topic summaries

Modules

Monitoring
Collection (future work)
Topic detection and tracking (future work)
Conclusion

Monitoring

Preliminary results

Framework

A central server monitors data source changes and provides succinct summaries to users

Overview

New challenges
  Content changes more rapidly, with recurring patterns
  More time-sensitive requirements

Modeling of posting updates
Definition of delay
Strategies for allocation and scheduling

Characteristics

Homogeneous Poisson model
  λ(t) = λ at any t

Periodic inhomogeneous Poisson model
  λ(t) = λ(t-nT), n=1,2,…
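To make the periodic inhomogeneous model concrete, the sketch below samples posting times by thinning (a standard simulation technique, not something taken from the slides). The function name `simulate_periodic_posts`, the sinusoidal example rate, and the seed are illustrative assumptions.

```python
import math
import random

def simulate_periodic_posts(rate, period, horizon, rate_max, rng):
    """Sample posting times in [0, horizon) by thinning.

    rate(t) is the posting rate on one period and is extended periodically;
    rate_max must be an upper bound on rate(t).
    """
    times = []
    t = 0.0
    while True:
        # Candidate event from a homogeneous process with rate rate_max.
        t += rng.expovariate(rate_max)
        if t >= horizon:
            return times
        # Keep the candidate with probability rate(t) / rate_max.
        if rng.random() * rate_max <= rate(t % period):
            times.append(t)

def example_rate(t):
    # Hypothetical daily pattern with period 1 (assumption for illustration).
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)

posts = simulate_periodic_posts(example_rate, period=1.0, horizon=200.0,
                                rate_max=4.0, rng=random.Random(0))
```

With this rate, most sampled posts fall in the first half of each period, which is the kind of recurring pattern the periodic model is meant to capture.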

Definition of metrics

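Delay of a post: elapsed time between its posting time t_i and the next retrieval time τ_j
  $D(t_i) = \tau_j - t_i$

Delay of a data source: sum of elapsed time for every post
  $D(O) = \sum_{i=1}^{k} D(t_i)$

Delay experienced by the aggregator
  $D(A) = \sum_{i=1}^{n} w_i D(O_i)$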
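The delay metrics can be computed directly from observed posting and retrieval times. This is a minimal sketch; the function names and the choice to ignore posts that have not been retrieved yet are assumptions made for illustration.

```python
import bisect

def source_delay(post_times, retrieval_times):
    """D(O): for each post, time from posting until the next retrieval, summed.

    Posts published after the last retrieval are ignored here (assumption).
    """
    retrievals = sorted(retrieval_times)
    total = 0.0
    for t in post_times:
        j = bisect.bisect_left(retrievals, t)  # first retrieval at or after t
        if j < len(retrievals):
            total += retrievals[j] - t
    return total

def aggregator_delay(sources):
    """D(A): weighted sum of D(O_i); sources holds (weight, posts, retrievals)."""
    return sum(w * source_delay(posts, retrievals)
               for w, posts, retrievals in sources)

# Three posts, two retrievals: delays are 0.8, 0.1 and 0.5.
example_source_delay = source_delay([0.2, 0.9, 1.5], [1.0, 2.0])
example_aggregator_delay = aggregator_delay([
    (2.0, [0.2, 0.9, 1.5], [1.0, 2.0]),
    (1.0, [0.5], [1.0]),
])
```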

Definition of metrics

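τj – retrieval time
λ(t) – posting rate

Expected delay of the posts published between consecutive retrievals τ_{j-1} and τ_j

Homogeneous Poisson model
  $D(O) = \frac{\lambda (\tau_j - \tau_{j-1})^2}{2}$

Inhomogeneous Poisson model
  $D(O) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)(\tau_j - t)\,dt$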
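As a consistency check between the two models, setting λ(t) = λ in the inhomogeneous integral should recover the homogeneous expression. A short worked derivation, assuming retrievals at τ_{j-1} and τ_j:

```latex
\begin{aligned}
D(O) &= \int_{\tau_{j-1}}^{\tau_j} \lambda\,(\tau_j - t)\,dt
      = \lambda \left[ -\frac{(\tau_j - t)^2}{2} \right]_{\tau_{j-1}}^{\tau_j}
      = \frac{\lambda\,(\tau_j - \tau_{j-1})^2}{2}
\end{aligned}
```

For instance, with λ = 2 posts per day and retrievals half a day apart, the expected delay accumulated over that interval is 2 × 0.25 / 2 = 0.25 post-days.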

Problem formulation

Minimization of expected delay experienced by the aggregator under constraint of limited resources.

Schedule the τj’s such that

$D(A) = \sum_{i=1}^{n} w_i D(O_i)$

is minimized.

Approach

Resource allocation
  How often to contact data sources?
  O1 is more active than O2; how much more often should we contact O1 than O2?

Retrieval scheduling
  When to contact a data source?
  3 retrievals are allocated for O1; when should these 3 retrievals be scheduled?

Resource allocation

Consider n data sources O1, …, On

λi – posting rate of Oi

wi – weight of Oi

N – total number of retrievals per day

mi – number of retrievals per day allocated to Oi

Optimal allocation: $m_i \propto \sqrt{w_i \lambda_i}$
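A small sketch of the allocation step, assuming the square-root rule m_i ∝ √(w_i λ_i), which follows from the homogeneous Poisson model with evenly spaced retrievals (expected daily delay λ_i / (2 m_i) per source, minimized subject to Σ m_i = N). The function name is hypothetical, and the result is left fractional rather than rounded.

```python
import math

def allocate_retrievals(rates, weights, total):
    """Split `total` retrievals per day across sources, m_i ~ sqrt(w_i * lambda_i)."""
    roots = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    scale = total / sum(roots)
    return [scale * r for r in roots]

# O1 posts 9 times as often as O2; both have equal weight; N = 8 retrievals per day.
allocation = allocate_retrievals(rates=[9.0, 1.0], weights=[1.0, 1.0], total=8)
```

Under this rule a source that is 9 times more active receives only 3 times as many retrievals (6 versus 2 here), not 9 times as many.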

Retrieval scheduling

m retrievals per day are allocated to a data source O. How should these m retrievals be scheduled?

Two cases: m = 1 and m > 1

Single retrieval per period

λ(t) = 1 for t ∈ [0,1], λ(t) = 0 for t ∈ [1,2]; periodicity T = 2

τ = 0.5, expected delay = 1.0
τ = 1, expected delay = 0.5
τ = 2, expected delay = 1.5
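The example can be checked numerically. The sketch below assumes one retrieval at time τ in every period and integrates rate × waiting time with the midpoint rule; `expected_delay_single` and the candidate grid are illustrative choices, not part of the slides.

```python
def step_rate(t):
    """Rate from the example: 1 on [0, 1), 0 on [1, 2), with period 2."""
    return 1.0 if (t % 2.0) < 1.0 else 0.0

def expected_delay_single(rate, period, tau, steps=4000):
    """Expected delay per period with a single retrieval at tau in each period."""
    h = period / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        wait = (tau - t) % period  # time from t until the next retrieval
        total += rate(t) * wait * h
    return total

delay_at_1 = expected_delay_single(step_rate, 2.0, 1.0)  # about 0.5
delay_at_2 = expected_delay_single(step_rate, 2.0, 2.0)  # about 1.5

# Coarse grid search over retrieval times in [0, 2].
candidates = [k / 20 for k in range(41)]
best_tau = min(candidates,
               key=lambda tau: expected_delay_single(step_rate, 2.0, tau))
```

The grid search picks τ = 1, the moment the source stops posting, which matches the intuition that a single retrieval is best placed right after the active phase.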

Single retrieval per period

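For a data source with posting rate λ(t) and period T, the expected delay when retrieved at time τ is given by:

$D(\tau) = \int_0^{\tau} \lambda(t)(\tau - t)\,dt + \int_{\tau}^{T} \lambda(t)(T + \tau - t)\,dt$

Criteria for optimality:

$\lambda(\tau) = \frac{1}{T}\int_0^T \lambda(t)\,dt$ and $\frac{d\lambda(\tau)}{d\tau} < 0$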
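The optimality criteria follow from differentiating the expected delay with respect to τ. A brief derivation sketch, assuming λ(t) is differentiable at the optimum:

```latex
\begin{aligned}
\frac{dD(\tau)}{d\tau}
  &= \frac{d}{d\tau}\left[ \int_0^{\tau} \lambda(t)(\tau - t)\,dt
     + \int_{\tau}^{T} \lambda(t)(T + \tau - t)\,dt \right] \\
  &= \int_0^{\tau} \lambda(t)\,dt + \int_{\tau}^{T} \lambda(t)\,dt - T\,\lambda(\tau)
   = \int_0^{T} \lambda(t)\,dt - T\,\lambda(\tau), \\
\frac{d^2 D(\tau)}{d\tau^2} &= -T\,\frac{d\lambda(\tau)}{d\tau}.
\end{aligned}
```

Setting the first derivative to zero gives λ(τ) equal to the average posting rate over the period, and requiring a positive second derivative gives dλ(τ)/dτ < 0: the retrieval should be placed where the posting rate is falling through its average.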

Multiple retrievals per period

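When m retrievals per period are allocated and scheduled at times τ1, …, τm, the expected delay is given by:

$D(O) = \sum_{i=1}^{m} \int_{\tau_i}^{\tau_{i+1}} \lambda(t)(\tau_{i+1} - t)\,dt$, where $\tau_{m+1} = T + \tau_1$

Criteria for optimality:

$\lambda(\tau_j)(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$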

Example

6 retrievals for λ(t)=2+2sin(2πt)

Criteria for optimality:

$\lambda(\tau_j)(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$
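The 6-retrieval example can be explored numerically. The sketch below is not the solver behind the slides: it assumes period T = 1, evaluates the expected delay with Simpson's rule, and improves an evenly spaced schedule with a simple coordinate-wise pattern search. All function names are hypothetical.

```python
import math

def rate(t):
    """Posting rate from the example; periodic with T = 1."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)

def segment_delay(rate, a, b, n=100):
    """Simpson's rule for the integral of rate(t) * (b - t) over [a, b]."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = rate(a) * (b - a)  # integrand at t = a; it is zero at t = b
    for k in range(1, n):
        t = a + k * h
        total += (4 if k % 2 else 2) * rate(t) * (b - t)
    return total * h / 3.0

def expected_delay(rate, period, taus):
    """Expected delay per period; the last segment wraps to the next period."""
    ts = sorted(t % period for t in taus)
    bounds = ts + [ts[0] + period]
    return sum(segment_delay(rate, bounds[i], bounds[i + 1])
               for i in range(len(ts)))

def pattern_search(rate, period, taus, step=0.05, tol=1e-5):
    """Shift one retrieval at a time; halve the step when nothing improves."""
    taus = list(taus)
    best = expected_delay(rate, period, taus)
    while step > tol:
        improved = False
        for j in range(len(taus)):
            for delta in (step, -step):
                trial = taus[:]
                trial[j] += delta
                d = expected_delay(rate, period, trial)
                if d < best - 1e-12:
                    taus, best, improved = trial, d, True
                    break
        if not improved:
            step /= 2.0
    return sorted(t % period for t in taus), best

even = [k / 6 for k in range(6)]
d_even = expected_delay(rate, 1.0, even)
opt, d_opt = pattern_search(rate, 1.0, even)
```

Comparing `d_opt` with `d_even` shows how much is gained by moving retrievals toward the times when the posting rate is high; the search only finds a local optimum, so the schedule it returns is a sketch of the idea rather than a guaranteed global minimum.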

Experiment

Data – 10k RSS feeds over Oct – Dec 2004

Performance

CGM03 – optimize for “age”
Ours – both resource allocation and retrieval scheduling

Size of estimation window

Resource constraint: 4 retrievals per day per feed on average
An estimation window of 2 weeks is an appropriate choice

Predictability of posting rate

90% of the RSS feeds post consistently

Summaries and extensions

Resource allocation is more aggressive
Retrieval scheduling optimizes within individual data sources

Include user access pattern
Variable retrieval cost

Collection

Future work

Collection

Blog hosting websites
Central repository
  ~5.3M URLs from weblogs.com
  Limited and contaminated

Crawling
  Retrieve the maximum number of blogs while reducing the number of irrelevant pages downloaded

Domain               Count     Category
spaces.msn.com       839,663   Blog
blogspot.com         362,957   Blog
wretch.cc            116,161   Blog
search-net101.com     89,750   Spam/ads
abalty.com            86,329   Spam/ads
search-now854.com     80,109   Spam/ads
bigebiz.org           79,059   Spam/ads

Collection

Blogs are inter-connected (blogrolls)
Selectively following links, discovering hubs for blogs

[1] Chakrabarti et al., “Focused Crawling: A New Approach to Topic-specific Web Resource Discovery”, International WWW Conference, 1999

Relinquishment of blogs

Detection of abandoned blogs to save resources

[2] D.R. Cox, “Regression models and life-tables (with discussion)”, Journal of the Royal Statistical Society, B(34), 1972
[3] Gina Venolia, “A Matter of Life or Death: Modeling Blog Mortality”, Technical report, Microsoft Research

Topic detection and tracking

Future work

Overview

Characteristics
  Document stream
  Traces of information propagation among blogs

Challenges
  Modeling growth and death of a topic
  Ranking of blog articles
  Malicious content

Influence network in blogs

Information is “diffused” among blogs

Indicator of popularity
Social relationship among bloggers

Influence network in blogs

Four major patterns of propagation

Reconstruction of implicit network
Ranking (source authority)
Advertising campaign

Data characteristics

~97-98% of daily content is new

Data characteristics

The same content lasts for ~8 days

Topics

Topics with different lifespans
  Bursty
  Mid-range
  Sustaining

Evolution of topics

[4] J. Kleinberg, “Bursty and Hierarchical Structure in Streams”, in SIGKDD 2002
[5] J. Kleinberg, “Temporal Dynamics of On-Line Information Streams”, Data Stream Management: Processing High-Speed Data Streams, Springer, 2005

Document similarity

Sparse and diverse
~400 articles clustered into 21 clusters out of 10,000 daily articles (by DBSCAN)
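To illustrate why density-based clustering leaves most articles unclustered when documents are sparse and diverse, here is a minimal DBSCAN sketch on toy bag-of-words documents. The cosine distance, the `eps` and `min_pts` values, and the sample documents are assumptions for illustration; the slides do not state the settings actually used.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def dbscan(points, eps, min_pts, dist):
    """Return one label per point: cluster id 0, 1, ... or -1 for noise."""
    n = len(points)
    # Each neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

docs = [
    "blog feed monitor delay",
    "blog feed monitor schedule",
    "feed monitor delay schedule",
    "cat photo cute",
    "cat photo funny",
    "stock market crash",
]
vectors = [Counter(d.split()) for d in docs]
labels = dbscan(vectors, eps=0.4, min_pts=2, dist=cosine_distance)
```

In this toy run the first three documents form one cluster, the two cat documents form another, and the last document is labeled as noise, mirroring the situation where only a small fraction of daily articles ends up in any cluster.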

Framework

Document stream approach
  Filtering
  Aggregation

Problems

Selecting a representative subset of documents from a topic cluster
  Coverage
  Distinctiveness among subset

Ranking of documents
  Time
  Source authority

Conclusion

1. Efficient collection of blogs and modeling the relinquishment

2. Monitoring and retrieval scheduling of rapidly changing data sources

3. Composing topic summary
   1. Reconstruction of an implicit influence network
   2. Representative document selection problem

End

Questions?

More examples

Major posting patterns

K-means clustering