1 matching dom trees to search logs for accurate webpage clustering deepayan chakrabarti rupesh...

1

Matching DOM Trees to Search Logs for Accurate Webpage Clustering

Deepayan Chakrabarti

Rupesh Mehta

2

Data extraction

Website-specific

wrappers

Webpages from a site

Structured DB

(product_name, price, rating)

3

Data ExtractionWrapper 1

Wrapper 2

Building wrappers[Muslea+/98, Crescenzi+/01, Cohen+/02,

Hogue+/05, Irmak+/06]

• Cluster pages from the website based on similarity of DOM structure

• Pick a few example pages per cluster

• Manually annotate the DOM nodes which contain the data

• Automatic wrapper induction using these annotations

4

Data Extraction

Clustering affects quality Too few clusters:

Heterogeneity of clusters Imperfect wrappers, or even inability to build wrappers

Too many clusters: Significant editorial effort required to build wrappers

We want to automatically get a good clustering, for any website

5

Main Idea

“Useful” info on a page

Wrappers extract it

Users search for it

html

h1

b

search

+click

search terms match page content

DOM paths repeatedly referenced by search terms are “key” paths

“html h1” and “html h1 b” are key paths

6

Main Idea

Clustering using key paths Pre-processing step (for each site)

Given a large sample of pages and search logs Identify key paths

Run-time (for that website) Given a new webpage Find which key paths exist on the page Map page to cluster using its key paths

7

Mapping pages to clusters

Pages in a cluster should have similar tree structure and hence, similar paths Represent a page by a shingle of its paths

[Buttler/04]

Using key paths: Shingle preferentially picks key paths in the page Requires a global ranking of key paths

8


One cluster per shingle

All pages in a cluster share the same k “key” paths

9

Main Idea




10

Identify key paths

For every (query, webpage) pair match query terms to text of a DOM path yields precision and recall for every path

Need to aggregate over all queries and webpages Expected precision and recall of a path

High if path appears on many queried pages, and has high precision/recall in most of them

html

h1

b

title

price

11

Identify key paths

How can we combine expected precision and recall into one ranking of key paths? F-measure, but

Precision typically more important than recall Precision and recall may be in completely different scales This scaling factor varies among websites

12

Identify key paths

How can we combine expected precision and recall into one ranking of key paths? Borda method [Borda/1781]

Create two rankings of paths, one by precision and one by recall

Combine rankings into one ranking, using relative importance of precision to recall

Immune to varying scales of precision/recall values among websites

13

Main Idea


Given a large sample of pages and search logs Identify key paths, but

Key paths can be dependent


14

Handling dependent paths

Consider the following two paths: html body div div table tr td h1 span (“product name”)

html body div div table tr td h1 If one is a key path, probably the other is too

Shingle can get “swamped” Shingle of a page becomes:

(product_name, product_name_parent, product_name_ancestor) instead of:

(product_name, buy_button, rating)

15


Several sources of dependence Multiple paths may have similar content

“product name” header and its parent product name mentioned in a header and in the text

Multiple paths may always co-occur “product name” header and “price”

16


Identify key independent paths Build a graph of dependencies between paths Pick an independent set of paths

i.e., a set of paths where no one is connected to another

Computation is weighted strongly towards the top-ranked paths Under our weighting scheme, greedily picking an

independent set is optimal

17

Main Idea




Several other optimizations (in paper)

18

Experiments

10 major websites Sampled ~20,000 pages each Built ground truth

Ran an existing clustering algorithm Manually checked results

Homogeneous clusters: merge when necessary Heterogeneous clusters: change parameters, repeat

Small sample of search logs ~5K unique queries per site Far fewer than the number of pages per site

19

Experiments

Compared to clustering using well-known tree-similarity metrics Path Shingles: Shingle of DOM paths without

using key paths [Buttler/04] pq-Grams: Shingle of sub-trees of DOM tree

[Augsten+/05] m/k Path Shingles: Like path shingles, except only

m out of k shingle elements need to match

20

Experiments

Compared clustering using Adjusted RAND index higher is better, 1.0 is perfect

Our algorithm [Buttler/04] [Augsten+/05]

Search logs give significant lift, with very low variance

21

Experiments

Comparison against paths actually used by manually-designed wrappers

Precision of IndepPaths

Key Paths correspond to paths used in wrappers

22

Experiments

Examples of top-ranked paths

23

Conclusions

Clusters affect both wrapper quality, and degree of editorial effort

We use search logs to automatically find good clusters

Current efforts: Combining search features with content features

to pick key paths

24


Given an ranked list of key paths Given a shingle-size k For any page P

Find KP = all key paths in P If |KP| < k

Shingle = KP plus randomly chosen paths from page else

Shingle = top-ranked k paths in KP

25

Experiments

26

Experiments

27

Experiments

Compared clustering using Adjusted RAND index higher is better, 1.0 is perfect

Our algorithm

Shingle of 8 paths; only 6

need to match

Shingles w/o key paths

[Buttler/04] [Augsten+/05]

Shingles of DOM subtrees

Search logs give significant lift, with very low variance

1 matching dom trees to search logs for accurate webpage clustering deepayan chakrabarti rupesh...

Documents

key paths runtime

key paths html h1

similar paths

rankings of paths

paths buttler04

global ranking of key

website slide

cluster pages