Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy
Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma (Microsoft Research, Asia); Chun-song Wang (University of Wisconsin-Madison); Hua Huang (Beijing University of Posts and Telecommunications)


Post on 31-Jan-2016


Page 1: Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy

Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy

Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma
Microsoft Research, Asia

Chun-song Wang
University of Wisconsin-Madison

Hua Huang
Beijing University of Posts and Telecommunications

Page 2: Web Forums

April 22, 2023 2

Web Search

Q & A

Social Network

Forums are a huge resource of human knowledge!

Page 3: Forum Data Crawl and Mining

Stages: Crawling, Data Parsing, Content Mining

• WWW 2008: iRobot: Sitemap Reconstruction
• SIGIR 2008: Exploring Traversal Strategy
• KDD 2009: Incremental Crawling
• WWW 2009: Automation Data Parsing
• SIGIR 2009: Expert Finding & Junk Detection
• KDD 2009: User Behavior in Forums

Page 4: Characteristics of Forums

Index Page: http://forums.asp.net/15.aspx (Page 1, Page 2, ..., Page 4294)

Post Page: http://forums.asp.net/t/956540.aspx (Page 1, Page 2, ..., Page 13)

Page 5: Incremental Crawling

• General Web Pages
– Treat each page independently, i.e., page-wise

• Forum Pages
– Take pagination into account, i.e., list-wise

[Figure: the same records (a)-(i) and new records (*) laid out over pages P1-P2 at time T1 and pages P1-P3 at time T2.]
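The difference between the page-wise and list-wise views can be sketched in a few lines. This is only an illustration: the page size of 5 and the assumption that new records enter at the front of the list are choices made for the example, not details taken from the slide.

```python
def paginate(records, page_size):
    """Split a flat record list into fixed-size pages."""
    return [records[i:i + page_size] for i in range(0, len(records), page_size)]


def changed_pages(old_pages, new_pages):
    """Page-wise view: count page slots whose content differs."""
    count = 0
    for i, page in enumerate(new_pages):
        if i >= len(old_pages) or old_pages[i] != page:
            count += 1
    return count


def new_records(old_pages, new_pages):
    """List-wise view: concatenate the pages, then diff the records."""
    seen = {record for page in old_pages for record in page}
    return [record for page in new_pages for record in page if record not in seen]


# Assumed layout: 5 records per page, new records (*) pushed to the front.
old = paginate(list("abcdefghi"), 5)                       # two pages at T1
new = paginate(["*1", "*2", "*3"] + list("abcdefghi"), 5)  # three pages at T2

print(changed_pages(old, new))  # 3
print(new_records(old, new))    # ['*1', '*2', '*3']
```

Under these assumptions every page looks modified to a page-wise crawler, although only three records are actually new.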

Page 6: Our Solution

• Incorporating Site-level Knowledge
– How many kinds of pages there are in a website
– How the various pages are linked with each other

• Purposes
– Distinguish index and post pages
– Concatenate pages into lists by following pagination links

Pipeline: Sitemap Construction, List Construction & Classification, Timestamp Extraction, Prediction Models, Bandwidth Control

Page 7: Pipeline: Sitemap Construction, List Construction & Classification, Timestamp Extraction, Prediction Models, Bandwidth Control

Page 8: Forum Sitemap

• A sitemap is a directed graph consisting of a set of vertices and links

[Figure: sitemap of http://forums.asp.net. Vertices: Home Page, List-of-Board, List-of-Thread, Post-of-Thread, Digest, Login Portal, Browse-by-Tag, Search Result. Legend: Vertex, Link.]

Page 9: Page Layout Clustering

• Forum pages are generated from a database and templates
• Layout is a robust descriptor of the template
– Layout can be characterized by the HTML elements in different DOM paths (e.g. repetitive patterns)
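One simple way to compare layouts by DOM paths is sketched below. It is not the iRobot algorithm: it swaps in a plain Jaccard similarity over the set of tag paths and ignores how often a path repeats. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating sloppy HTML.
            while self.stack and self.stack.pop() != tag:
                pass


def dom_paths(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.paths


def layout_similarity(html_a, html_b):
    """Jaccard similarity between the two pages' DOM path sets."""
    a, b = dom_paths(html_a), dom_paths(html_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


thread_a = "<html><body><table><tr><td>a</td></tr><tr><td>b</td></tr></table></body></html>"
thread_b = "<html><body><table><tr><td>c</td></tr></table></body></html>"
login = "<html><body><form><input></form></body></html>"

print(layout_similarity(thread_a, thread_b))  # same template
print(layout_similarity(thread_a, login))     # different template
```

Pages rendered from the same template share their path set even when the number of rows differs, which is why layout is a more stable clustering signal than content.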


Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference

Page 10: Link Analysis

1. Login

4. Thread List

5. Thread

A Link = URL Pattern + Location

Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference

Post-of-Thread

Paginations

Links to Other Threads

Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai and Lei Zhang. Exploring Traversal Strategy for Web Forum Crawling. In Proceedings of SIGIR 2008 Conference
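The definition "A Link = URL Pattern + Location" can be written down as a small record type. This is a sketch under assumptions: the regular expressions are modeled on the forums.asp.net URLs shown earlier, while the `PageIndex` query parameter and the location labels `thread-table` and `pager` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkType:
    url_pattern: str  # regular expression over the target URL
    location: str     # where on the source page the anchor sits

    def matches(self, url, location):
        """A link belongs to this type only if both parts agree."""
        return (
            location == self.location
            and re.fullmatch(self.url_pattern, url) is not None
        )


# Hypothetical link types for illustration.
THREAD_LINK = LinkType(r"http://forums\.asp\.net/t/\d+\.aspx", "thread-table")
PAGINATION = LinkType(r"http://forums\.asp\.net/t/\d+\.aspx\?PageIndex=\d+", "pager")

url = "http://forums.asp.net/t/956540.aspx"
print(THREAD_LINK.matches(url, "thread-table"))  # True
print(THREAD_LINK.matches(url, "sidebar"))       # False
```

Keeping the location in the key separates, for example, pagination links from links to other threads even when both point at thread-style URLs.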

Page 11: Pipeline: Sitemap Construction, List Construction & Classification (Index, Post), Timestamp Extraction, Prediction Models, Bandwidth Control

Page 12: Identify Index & Post Nodes

• An SVM-based classifier
– Site independent
– Features: node size, link structure, keywords

• Node-level classification is more robust than page-level classification
– Robust to noise on individual pages

[Figure: sitemap with nodes Home Page, List-of-Board, Index-of-Thread, Post-of-Thread, Digest, Login Portal, Search Result. Legend: Pages with the Same Layout, Link, Pagination Links.]

Page 13: List Reconstruction

• Given a new page
1. Classify it into a node
2. Detect pagination links
3. Determine the link order
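Steps 2 and 3 can be sketched as follows: starting from any page of a thread, follow its pagination links and concatenate the records in page order. The page dictionaries, the relative URLs, and the `fetch` callback are all hypothetical stand-ins for a real downloader and parser.

```python
def reconstruct_list(start_url, fetch):
    """Follow pagination links from one page; return records in page order."""
    pages = {}              # page number -> records on that page
    pending = [start_url]
    seen = set()
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        pages[page["number"]] = page["records"]
        pending.extend(page["pagination"])
    return [record for number in sorted(pages) for record in pages[number]]


# A toy three-page thread; in practice these would come from parsed HTML.
FAKE_SITE = {
    "t/1?p=1": {"number": 1, "records": ["post1", "post2"],
                "pagination": ["t/1?p=2", "t/1?p=3"]},
    "t/1?p=2": {"number": 2, "records": ["post3", "post4"],
                "pagination": ["t/1?p=1", "t/1?p=3"]},
    "t/1?p=3": {"number": 3, "records": ["post5"],
                "pagination": ["t/1?p=1", "t/1?p=2"]},
}

# Entering at page 2 still yields the whole list in order.
print(reconstruct_list("t/1?p=2", FAKE_SITE.__getitem__))
```

The entry page does not matter here, since the page number recovered from each page decides the final order.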

Page 14: Pipeline: Sitemap Construction, List Construction & Classification, Timestamp Extraction (YYYY/MM/DD), Prediction Models, Bandwidth Control

Page 15: Timestamp Extraction

• Distinguish real timestamps from noise
– The temporal order can help!
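One way the temporal order could be used is sketched below: among several date-like fields found in each record, prefer the one whose values follow the record order. The field names, the sample dates, and the scoring rule are assumptions for illustration; the slide does not spell out the actual algorithm.

```python
from datetime import datetime


def order_score(values):
    """Fraction of adjacent record pairs whose dates are non-decreasing."""
    stamps = [datetime.strptime(value, "%Y/%m/%d") for value in values]
    pairs = list(zip(stamps, stamps[1:]))
    if not pairs:
        return 0.0
    return sum(a <= b for a, b in pairs) / len(pairs)


def pick_timestamp_field(candidates):
    """Choose the candidate field that best follows the record order."""
    return max(candidates, key=lambda name: order_score(candidates[name]))


# Date-like strings per record, in the order the records appear on the page.
candidates = {
    "post_time": ["2008/06/01", "2008/06/01", "2008/06/03", "2008/06/07"],
    "join_date": ["2005/02/11", "2007/09/30", "2004/01/05", "2006/12/24"],
}

print(pick_timestamp_field(candidates))  # post_time
```

Noise such as user join dates or dates quoted inside a post rarely lines up with the order of the records, so it scores lower.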

Page 16: Pipeline: Sitemap Construction, List Construction & Classification, Timestamp Extraction, Prediction Models, Bandwidth Control

Page 17: Feature Extraction

• Features to describe update frequency
– List-dependent & list-independent (site-level statistics)
– Absolute & relative

Page 18: Regression Model

• Predict when the next new record arrives
– CT: current time
– LT: last (re-)visit time by crawler

• Linear regression
– Advantages: lightweight computational cost; efficient for online processing
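A minimal sketch of such a predictor is given below, assuming a single feature (the average gap between recent records of a list) and a target equal to the gap until the next record, both in minutes. The training pairs are toy numbers, and the real model's features and target are not given on the slide.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def should_revisit(ct, lt, predicted_gap):
    """Revisit once the time since the last visit exceeds the predicted gap."""
    return ct - lt >= predicted_gap


# Toy training data: (average past gap, observed next gap), in minutes.
past_gaps = [10, 20, 30, 40]
next_gaps = [12, 22, 32, 42]

slope, intercept = fit_line(past_gaps, next_gaps)
predicted = slope * 25 + intercept       # prediction for a list with gap 25

print(slope, intercept, predicted)       # 1.0 2.0 27.0
print(should_revisit(ct=100, lt=80, predicted_gap=predicted))  # False
print(should_revisit(ct=110, lt=80, predicted_gap=predicted))  # True
```

Prediction is one multiplication and one addition per list, which fits the stated advantage of low cost in an online setting.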

Page 19: Pipeline: Sitemap Construction, List Construction & Classification, Timestamp Extraction, Prediction Models, Bandwidth Control (Queue)

Page 20: Bandwidth Control

• Index and post pages are quite different

                       Index    Post
Quantity               < 10 %   > 90 %
Avg. Update Frequency  high     low
Num. Re-crawl Pages    small    large

• Post pages block the bandwidth
– New threads cannot be discovered in time
– A simple but practical solution
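The slide does not say what the simple solution is. One plausible reading, sketched here purely as an assumption, is to keep separate queues for index and post lists and reserve a fixed share of the budget for index lists. The `index_share` parameter, the queue contents, and the URLs are hypothetical.

```python
import heapq


def schedule(index_queue, post_queue, budget, index_share=0.5):
    """Pick pages to crawl, reserving part of the budget for index lists.

    Each queue holds (predicted_next_update, url) pairs; smaller is more urgent.
    """
    index_budget = min(len(index_queue), int(budget * index_share))
    chosen = heapq.nsmallest(index_budget, index_queue)
    # Whatever the index lists do not use goes to the post lists.
    chosen += heapq.nsmallest(budget - len(chosen), post_queue)
    return [url for _, url in chosen]


index_queue = [(5, "board/1"), (9, "board/2")]
post_queue = [(1, "t/1"), (2, "t/2"), (3, "t/3"), (4, "t/4")]

print(schedule(index_queue, post_queue, budget=4))
print(schedule(index_queue, post_queue, budget=4, index_share=0.0))
```

Without a reserved share, the far more numerous post pages take the whole budget in this toy case and no index page is revisited.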

Page 21: Experiment Setup

• 18 web forums in diverse categories
– March 1999 ~ June 2008
– 990,476 pages and 5,407,854 posts

• Simulation
– Repeatable and controllable

• Comparison
– List-wise strategy (LWS)
– LWS with bandwidth control (LWS + BC)
– Curve-fitting policy (CF)
– Bound-based policy (BB, WWW 2008)
– Oracle (the ideal case)

Page 22: Measurements

• Bandwidth Utilization
– I_new: #pages with new information
– I_B: #pages crawled

• Coverage
– I_crawl: #new posts crawled
– I_all: #new posts published on forums

• Timeliness
– ∆t_i: #minutes between publish and download
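The formulas themselves did not survive in the transcript. The sketch below assumes the natural reading of the definitions: the first two measures as ratios and timeliness as the mean delay over the crawled posts. The sample numbers are made up.

```python
def bandwidth_utilization(i_new, i_b):
    """Share of crawled pages that carried new information (assumed I_new / I_B)."""
    return i_new / i_b


def coverage(i_crawl, i_all):
    """Share of newly published posts that were crawled (assumed I_crawl / I_all)."""
    return i_crawl / i_all


def timeliness(delays_minutes):
    """Mean delay between publish and download (assumed average of the delta t_i)."""
    return sum(delays_minutes) / len(delays_minutes)


print(bandwidth_utilization(i_new=450, i_b=3000))
print(coverage(i_crawl=990, i_all=1000))
print(timeliness([30, 60, 90]))
```

Higher is better for the first two measures, while lower is better for timeliness.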

Page 23: Performance Comparison

• Warm-up Stage
– Bandwidth: 3000 pages / day

[Figure: three plots of Bandwidth Utilization, Coverage, and Timeliness against Days (0 to 350) for Oracle, LWS+BC, LWS, Bound-Based, and Curve-Fitting.]

Page 24: Performance Comparison (Cont.)

• Comparison with various bandwidths

[Figure: three plots of Bandwidth Utilization, Coverage, and Timeliness against Bandwidth (pages per day, 1000 to 6000) for Oracle, Curve-Fitting, Bound-Based, LWS, and LWS+BC.]

Page 25: Performance Comparison (Cont.)

• Detailed performance of Index and Post pages
– Bandwidth: 3000 pages / day

Page 26: Conclusions and Future Work

• Targeted web forums, a specific but interesting field

• Developed an effective solution for incremental forum crawling
– Integrating site-level knowledge
– Some practical engineering implementation

• Future work
– Improve the timestamp extraction algorithm
– A stronger prediction model than linear regression