Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy
Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma (Microsoft Research, Asia)
Chun-song Wang (University of Wisconsin-Madison)
Hua Huang (Beijing University of Posts and Telecommunications)
Web Forums
April 22, 2023 2
Web Search
Q & A
Social Network
Forums are a huge resource of human knowledge!
Forum Data Crawl and Mining
Crawling
Data Parsing
WWW 2009: Automation Data Parsing
Content Mining
SIGIR 2009: Expert Finding & Junk Detection
WWW 2008: iRobot: Sitemap Reconstruction
SIGIR 2008: Exploring Traversal Strategy
KDD 2009: Incremental Crawling
KDD 2009: User Behavior in Forums
Characteristics of Forums
Index Page: http://forums.asp.net/15.aspx (Page 1, Page 2, ..., Page 4294)
Post Page: http://forums.asp.net/t/956540.aspx (Page 1, Page 2, ..., Page 13)
Incremental Crawling
• General Web Pages
– Treating each page independently, i.e., page-wise
• Forum Pages
– Considering pagination, i.e., list-wise
[Figure: a paginated list of records (a)-(i) at time T1 (pages P1, P2) and at time T2 (pages P1, P2, P3) after three new records (*) arrive]
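The difference between the page-wise and list-wise views can be illustrated with a small sketch. It assumes a newest-first list and a page size of 5, neither of which comes from the slides; the function names are hypothetical. Three new records shift every later record, so a page-wise crawler sees all pages as changed, while a list-wise view finds only three new records.

```python
def paginate(records, page_size):
    """Split a newest-first record list into fixed-size pages."""
    return [records[i:i + page_size] for i in range(0, len(records), page_size)]


def changed_pages(before, after):
    """Return 1-based numbers of pages in `after` whose content differs from `before`."""
    changed = []
    for number, page in enumerate(after, start=1):
        old = before[number - 1] if number <= len(before) else None
        if page != old:
            changed.append(number)
    return changed


def new_records_in_list(before, after):
    """Concatenate the pages and return the records that precede the old head record."""
    flat_before = [r for page in before for r in page]
    flat_after = [r for page in after for r in page]
    if not flat_before or flat_before[0] not in flat_after:
        return flat_after
    return flat_after[:flat_after.index(flat_before[0])]


old_records = list("abcdefghi")               # records (a)..(i) at time T1
new_records = ["*", "*", "*"] + old_records   # three new records arrive by T2

pages_t1 = paginate(old_records, 5)   # assumed page size of 5 gives P1, P2
pages_t2 = paginate(new_records, 5)   # now P1, P2, P3
```

Here `changed_pages(pages_t1, pages_t2)` reports every page, whereas `new_records_in_list(pages_t1, pages_t2)` returns only the three new records.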
Our Solution
• Incorporating Site-level Knowledge
– How many kinds of pages there are in a website
– How the various pages link to each other
• Purposes
– Distinguish index and post pages
– Concatenate pages into lists by following paginations
Sitemap Construction
List Construction & Classification
Timestamp Extraction
Prediction Models
Bandwidth Control
Forum Sitemap
• A sitemap is a directed graph consisting of a set of vertices and links
[Figure: sitemap of http://forums.asp.net; vertices: Home Page, List-of-Board, List-of-Thread, Post-of-Thread, Digest, Login Portal, Browse-by-Tag, Search Result; legend: Vertex, Link]
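A sitemap of this kind can be held in a very small directed-graph structure. The sketch below is only an illustration: the class name `Sitemap` and the specific links added at the bottom are assumptions, not the links of the real forums.asp.net sitemap.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are page types, links connect page types."""

    def __init__(self):
        self.vertices = set()
        self.links = defaultdict(set)   # source vertex -> set of target vertices

    def add_link(self, source, target):
        self.vertices.update((source, target))
        self.links[source].add(target)

    def reachable(self, start):
        """Return every vertex that can be reached from `start` by following links."""
        seen, stack = set(), [start]
        while stack:
            vertex = stack.pop()
            if vertex in seen:
                continue
            seen.add(vertex)
            stack.extend(self.links.get(vertex, ()))
        return seen


# Hypothetical links, chosen only to exercise the structure.
sitemap = Sitemap()
sitemap.add_link("Home Page", "List-of-Board")
sitemap.add_link("List-of-Board", "List-of-Thread")
sitemap.add_link("List-of-Thread", "List-of-Thread")   # pagination self-link
sitemap.add_link("List-of-Thread", "Post-of-Thread")
sitemap.add_link("Home Page", "Login Portal")
```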
Page Layout Clustering
• Forum pages are generated from a database and templates
• Layout is a robust way to describe a template
– Layout can be characterized by the HTML elements in different DOM paths (e.g., repetitive patterns)
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
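As a rough illustration of describing layout by DOM paths, the sketch below collects the set of tag paths on a page and compares two pages with Jaccard similarity. This is a simplified stand-in, not the clustering method of the iRobot paper; the class and function names and the sample HTML are assumptions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class DomPathCollector(HTMLParser):
    """Collect the distinct tag paths (e.g. html/body/table/tr) seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack and self.stack.pop() != tag:
                pass


def layout_signature(html):
    collector = DomPathCollector()
    collector.feed(html)
    return collector.paths


def layout_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' DOM path sets."""
    a, b = layout_signature(html_a), layout_signature(html_b)
    return len(a & b) / len(a | b) if a | b else 1.0


row = "<tr><td><a href='/t/1.aspx'>title</a></td></tr>"
thread_list_a = "<html><body><table>" + row + "</table></body></html>"
thread_list_b = "<html><body><table>" + row * 3 + "</table></body></html>"
post_page = "<html><body><div><p>hello</p></div></body></html>"
```

Because repeated rows add no new paths, the two thread lists get the same signature even though they hold different numbers of records.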
Link Analysis
1. Login
4. Thread List
5. Thread
A Link = URL Pattern + Location
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
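The idea "A Link = URL Pattern + Location" can be sketched as a pair of values. The generalization rule used here (replace every digit run in the URL path with a placeholder) and the DOM-path strings are assumptions made for illustration; the second example URL is hypothetical.

```python
import re
from typing import NamedTuple
from urllib.parse import urlparse


class Link(NamedTuple):
    pattern: str    # generalized URL path
    location: str   # where the anchor sits in the page, here a DOM path


def url_pattern(url):
    """Generalize a URL by replacing digit runs in its path with <num>."""
    return re.sub(r"\d+", "<num>", urlparse(url).path)


def make_link(url, dom_path):
    return Link(url_pattern(url), dom_path)


thread_a = make_link("http://forums.asp.net/t/956540.aspx", "html/body/table/tr/td/a")
thread_b = make_link("http://forums.asp.net/t/123.aspx", "html/body/table/tr/td/a")
sidebar = make_link("http://forums.asp.net/t/123.aspx", "html/body/div/ul/li/a")
```

Two URLs with the same pattern count as the same link only when they also appear at the same location, so `thread_a == thread_b` while `sidebar` is a different link.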
Post-of-Thread
Paginations
Links to Other Threads
Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai and Lei Zhang. Exploring Traversal Strategy for Web Forum Crawling. In Proceedings of SIGIR 2008 Conference
Index Post
Identify Index & Post Nodes
• An SVM-based classifier
– Site independent
– Features: node size, link structure, keywords
• Classification at the node level is more robust than at the page level
– Robust to noise on individual pages
[Figure: sitemap with index and post nodes identified; vertices: Home Page, List-of-Board, Index-of-Thread, Post-of-Thread, Digest, Login Portal, Search Result; legend: Pages with the Same Layout, Link, Pagination Links]
List Reconstruction
• Given a new page
1. Classify it into a node
2. Detect pagination links
3. Determine the order of the links
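Steps 2 and 3 above can be sketched as follows, assuming pagination has already been reduced to one "next page" link per page (a simplification; real forums also expose numbered page links). The function name and the URLs are hypothetical, and step 1 (node classification) is not covered.

```python
def reconstruct_list(next_links):
    """Order the pages of one list.

    `next_links` maps each page URL to the URL its "next page" link points
    to, or None for the last page.
    """
    targets = {t for t in next_links.values() if t is not None}
    heads = [url for url in next_links if url not in targets]
    if len(heads) != 1:
        raise ValueError("expected exactly one first page")
    ordered, seen, current = [], set(), heads[0]
    while current is not None and current not in seen:
        ordered.append(current)
        seen.add(current)
        current = next_links.get(current)
    return ordered


# Pages discovered in arbitrary order (hypothetical URLs).
discovered = {
    "/t/1.aspx?page=2": "/t/1.aspx?page=3",
    "/t/1.aspx": "/t/1.aspx?page=2",
    "/t/1.aspx?page=3": None,
}
ordered_pages = reconstruct_list(discovered)
```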
Timestamp Extraction
• Distinguish real timestamps (e.g., YYYY/MM/DD) from noise
– The temporal order can help!
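One way temporal order could help is sketched below. Each record may contain several date strings (for example a join date and a post date); the candidate position whose dates are most consistently ordered across records is taken as the real timestamp. The scoring rule, the oldest-first assumption, and the sample records are all assumptions, not the method from the slides.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\b(\d{4})/(\d{2})/(\d{2})\b")


def candidate_dates(record_text):
    """All YYYY/MM/DD strings in a record, in order of appearance."""
    return [datetime(int(y), int(m), int(d)) for y, m, d in DATE_RE.findall(record_text)]


def order_score(dates):
    """Fraction of adjacent pairs that are in non-decreasing order."""
    pairs = list(zip(dates, dates[1:]))
    if not pairs:
        return 0.0
    return sum(a <= b for a, b in pairs) / len(pairs)


def pick_timestamp_slot(records):
    """Index of the candidate position most consistent with temporal order.

    Assumes the records are listed oldest first, as posts in a thread often are.
    """
    per_record = [candidate_dates(text) for text in records]
    slots = min((len(c) for c in per_record), default=0)
    if slots == 0:
        raise ValueError("no date candidate shared by all records")
    scores = [order_score([c[i] for c in per_record]) for i in range(slots)]
    return max(range(slots), key=scores.__getitem__)


posts = [
    "Joined 2005/03/01 Posted 2008/01/02 first post",
    "Joined 2007/06/15 Posted 2008/01/03 a reply",
    "Joined 2004/11/20 Posted 2008/01/05 another reply",
]
```

In this sample the join dates are unordered while the post dates increase, so the second candidate position is selected.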
Feature Extraction
• Features to describe update frequency
– List-dependent and list-independent (site-level statistics)
– Absolute and relative
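The slides do not list the concrete features, so the sketch below only illustrates the four categories with hypothetical ones: intervals between records of the list itself (list-dependent, absolute), a site-wide mean interval (list-independent), and their ratio (relative). Times are plain numbers in minutes.

```python
from statistics import mean


def intervals(timestamps):
    """Gaps between consecutive record timestamps (oldest first)."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def list_features(timestamps, now, site_mean_interval):
    """Hypothetical update-frequency features for one list.

    `timestamps` must hold at least one record time.
    """
    gaps = intervals(timestamps)
    mean_gap = mean(gaps) if gaps else float("inf")
    return {
        "mean_interval": mean_gap,                            # absolute, list-dependent
        "since_last_record": now - timestamps[-1],            # absolute, list-dependent
        "site_mean_interval": site_mean_interval,             # list-independent
        "relative_interval": mean_gap / site_mean_interval,   # relative
    }


features = list_features([0, 10, 30, 60], now=90, site_mean_interval=40)
```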
Regression Model
• Predict when the next new record arrives
– CT: current time
– LT: last (re-)visit time by crawler
• Linear regression
– Advantages: lightweight computational cost, efficient for online processing
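A minimal sketch of the idea, assuming a single feature (such as a list's mean past interval) and an ordinary least-squares fit; the actual feature set and regression target are not given in the slides. A list becomes due for a revisit once the current time CT passes the predicted arrival; LT is not used in this simplified version.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_next_arrival(last_record_time, feature, model):
    """Predicted arrival time of the next record, in the same unit as the inputs."""
    slope, intercept = model
    return last_record_time + slope * feature + intercept


def should_revisit(current_time, predicted_arrival):
    """True once the current time CT has reached the predicted arrival."""
    return current_time >= predicted_arrival


# Toy training data: feature value vs. observed gap until the next record (minutes).
model = fit_line([10, 20, 30], [12, 22, 32])
arrival = predict_next_arrival(last_record_time=100, feature=15, model=model)
```

Fitting and prediction are each a handful of arithmetic operations, which is the kind of lightweight cost the slide points to.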
Queue
Bandwidth Control
• Index and post pages are quite different
                        Index     Post
Quantity                < 10 %    > 90 %
Avg. Update Frequency   high      low
Num. Re-crawl Pages     small     large
• Post pages block the bandwidth
– New threads cannot be discovered in time
– A simple but practical solution
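The slides do not spell out the solution, so the following is only one plausible reading: keep separate queues for index and post pages and reserve a fixed share of the crawl budget for each. The function name, the `index_share` parameter, and its default of 0.5 are assumptions.

```python
import heapq


def schedule(index_queue, post_queue, budget, index_share=0.5):
    """Pick pages to crawl from two queues of (priority, url) pairs.

    A share of the budget is reserved for index pages so that post pages
    cannot use all of it; quota left unused by one queue goes to the other.
    """
    index_quota = min(len(index_queue), int(budget * index_share))
    post_quota = min(len(post_queue), budget - index_quota)
    index_quota = min(len(index_queue), budget - post_quota)
    top_index = heapq.nlargest(index_quota, index_queue)
    top_post = heapq.nlargest(post_quota, post_queue)
    return [url for _, url in top_index], [url for _, url in top_post]


index_queue = [(0.9, "i1"), (0.2, "i2")]
post_queue = [(0.95, "p1"), (0.8, "p2"), (0.7, "p3"), (0.6, "p4")]
index_pages, post_pages = schedule(index_queue, post_queue, budget=4)
```

With a single shared queue, the low-priority index page `i2` would lose its slot to a post page; with the reserved share it is still crawled.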
Experiment Setup
• 18 web forums in diverse categories
– March 1999 ~ June 2008
– 990,476 pages and 5,407,854 posts
• Simulation
– Repeatable and controllable
• Comparison
– List-wise strategy (LWS)
– LWS with bandwidth control (LWS + BC)
– Curve-fitting policy (CF)
– Bound-based policy (BB, WWW 2008)
– Oracle (the ideal case)
Measurements
• Bandwidth Utilization
– I_new: number of pages with new information
– I_B: number of pages crawled
• Coverage
– I_crawl: number of new posts crawled
– I_all: number of new posts published on forums
• Timeliness
– ∆t_i: number of minutes between publish and download
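The formulas themselves did not survive in the transcript. A natural reading of the symbols, taken here as an assumption, is that bandwidth utilization and coverage are ratios of the two counts and that timeliness is the mean delay over crawled posts. The numbers in the example are made up.

```python
def bandwidth_utilization(pages_with_new_info, pages_crawled):
    """Assumed form: I_new / I_B."""
    return pages_with_new_info / pages_crawled if pages_crawled else 0.0


def coverage(new_posts_crawled, new_posts_published):
    """Assumed form: I_crawl / I_all."""
    return new_posts_crawled / new_posts_published if new_posts_published else 0.0


def timeliness(delays_minutes):
    """Assumed form: mean of the per-post delays between publish and download."""
    return sum(delays_minutes) / len(delays_minutes) if delays_minutes else 0.0


# Made-up counts for one simulated day.
utilization = bandwidth_utilization(30, 3000)
covered = coverage(990, 1000)
average_delay = timeliness([10, 20, 60])
```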
Performance Comparison
• Warm-up Stage
– Bandwidth: 3000 pages / day
[Figure: three panels plotting Bandwidth Utilization, Coverage, and Timeliness against Days (0 to 350) for Oracle, LWS+BC, LWS, Bound-Based, and Curve-Fitting]
Performance Comparison (Cont.)
• Comparison with various bandwidths
[Figure: three panels plotting Bandwidth Utilization, Coverage, and Timeliness against Bandwidth (1000 to 6000 pages per day) for Oracle, Curve-Fitting, Bound-Based, LWS, and LWS+BC]
Performance Comparison (Cont.)
• Detailed performance of Index and Post pages
– Bandwidth: 3000 pages / day
Conclusions and Future Work
• Targets web forums, a specific but interesting domain
• Developed an effective solution for incremental forum crawling
– Integrating site-level knowledge
– Some practical engineering implementation
• Future work
– Improve the timestamp extraction algorithm
– Use a stronger prediction model than linear regression