
SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content

Xianling Mao, Xiaobing Liu, Nan Di, Xiaoming Li, and Hongfei Yan

Department of Computer Science and Technology, Peking University
{mxl,lxb,dn,lxm,yhf}@net.pku.edu.cn

Abstract. Detecting whether two Web pages are near replicas, in terms of their contents rather than their files, is of great importance in many Web-information-based applications. As a result, many deduplication algorithms have been proposed. Nevertheless, analysis and experiments show that existing algorithms usually do not work well for short Web pages¹, due to the relatively large portion of noisy information, such as ads and website templates, contained in the corresponding files. In this paper, we analyze the critical issues in deduplicating short Web pages and present an algorithm (AF SpotSigs) that incorporates them, which works 15% better than the state-of-the-art method. Then we propose an algorithm (SizeSpotSigs), taking the size of page contents into account, which can handle both short and long Web pages. The contributions of SizeSpotSigs are three-fold: 1) we provide an analysis of the relation between noise-content ratio and similarity, and propose two rules for making such methods work better; 2) based on the analysis, for Chinese, we propose 3 new features to improve the effectiveness for short Web pages; 3) we present an algorithm named SizeSpotSigs for near duplicate detection that considers the size of the core content of a Web page. Experiments confirm that SizeSpotSigs works better than state-of-the-art approaches such as SpotSigs over a demonstrative Mixer collection of manually assessed near-duplicate news articles, which includes both short and long Web pages.

Keywords: Deduplicate, Near Duplicate Detection, AF SpotSigs, SizeSpotSigs, Information Retrieval.

1 Introduction

Detection of duplicate or near-duplicate Web pages is an important and difficult problem for Web search engines. Many algorithms have been proposed in recent years [6,8,20,13,18]. Most approaches can be characterized as different types of distance or overlap measures operating on the HTML strings. State-of-the-art algorithms, such as Broder et al.'s [2] and Charikar's [3], achieve reasonable precision or recall. In particular, SpotSigs [19] can avoid the step of removing noise from a Web page because of its smart feature selection.

¹ In this paper, Web pages are classified into long (Web) pages and short (Web) pages based on their core content size.



Existing deduplication algorithms do not take the size of the page core content into account. Essentially, these algorithms are more suitable for processing long Web pages because they use only surface features to represent documents. For short documents, however, this representation is not sufficient; it is even worse when documents contain noisy information, such as ads, within the Web page. Our experiments in Section 5.3 also show that the state-of-the-art deduplication algorithm performs relatively poorly on short Web pages, reaching only 0.62 (F1) against 0.92 (F1) for long Web pages.

In fact, there is a large number of short Web pages on the World Wide Web whose core content is duplicated. At the same time, they are also very important: for example, the central bank may announce some message, such as an interest rate adjustment. Fig. 1 shows a pair of Web pages with the same core content that differ only in framing, advertisements, and navigational banners. Both articles exhibit almost identical core content, reporting on the match review between Uruguay and the Netherlands.

Fig. 1. Near-duplicate Web pages: identical core content with different framing and banners (additional ads and related links removed); the core contents are short

So it is important and necessary to improve the effectiveness of deduplication for short Web pages.


1.1 Contribution

1. We analyze the relation between noise-content ratio and similarity, and propose two rules for making such methods work better;

2. Based on our analysis, for Chinese, we propose 3 new features to improve the effectiveness for short Web pages, which leads to the AF SpotSigs algorithm;

3. We present an algorithm named SizeSpotSigs for near duplicate detection that considers the size of the core content of a Web page.

2 Related Work

There are two families of methods for near duplicate detection: content-based methods and non-content-based methods. Content-based methods detect near duplicates by computing similarity between the contents of documents, while non-content-based methods make use of non-content features [10,1,17] (e.g., URL patterns) to detect near duplicates. Non-content-based methods can only be used to detect near-duplicate pages within one website, while content-based methods have no such limitation. Content-based algorithms can be further divided into two groups according to whether they require noise removal. Most existing content-based deduplication algorithms need a noise-removal step.

Broder et al. [6] proposed the DSC algorithm (also called Shingling), a method to detect near duplicates by computing similarity among the shingle sets of documents. The similarity between two documents is computed based on the common Jaccard overlap measure between their document shingle sets. In order to reduce the complexity of Shingling for processing large collections, DSC-SS (also called super shingles) was later proposed by Broder in [5]. DSC-SS makes use of meta-shingles, i.e., shingles of shingles, with only a small decrease in precision. A variety of methods for selecting good shingles are investigated by Hoad and Zobel [14]. Buttcher and Clarke [7] focus on Kullback-Leibler divergence in the more general context of search. A larger-scale evaluation was carried out by Henzinger [13] to compare the precision of the shingling and simhash algorithms, adjusting their parameters to maintain almost the same recall. The experiment shows that neither algorithm works well for finding near-duplicate pairs on the same site, because of the influence of templates, while both achieve a higher precision for near-duplicate pairs on different sites. [21] proposed that near-duplicate clustering should incorporate information about document attributes or the content structure.

Another widespread duplicate detection technique is to generate a document fingerprint, which is a compact description of the document, and then to compute the pair-wise similarity of document fingerprints. The assumption is that fingerprints can be compared much more quickly than complete documents. A common method of generating fingerprints is to select a set of character sequences from a document and to generate a fingerprint based on the hash values of these sequences. Similarity between two documents is measured by the Jaccard formula.


Different algorithms are characterized, and their computational costs determined, by the hash functions used and by how the character sequences are selected. Manber [18] started this line of research. The I-Match algorithm [9,16] uses external collection statistics and increases recall by using multiple fingerprints per document. Position-based schemes [4] select strings based on their offset in a document. Broder et al. [6] pick strings whose hash values are multiples of an integer. Indyk and Motwani [12,15] proposed Locality Sensitive Hashing (LSH), an approximate similarity search technique that scales to both large and high-dimensional data sets. There are many variants of LSH, such as LSH-Tree [3] or Hamming-LSH [11].
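
To make the fingerprinting idea above concrete, here is a minimal sketch combining word shingles with Broder-style selection of shingles whose hash values are multiples of an integer; the function names, the shingle width w, and the modulus p are our own illustrative choices, not parameters from any of the cited systems.

```python
import hashlib

def shingles(text, w=4):
    """All contiguous word w-shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def fingerprint(text, w=4, p=4):
    """Keep only the shingles whose hash value is a multiple of p
    (the selection rule attributed to Broder et al. [6] above)."""
    def h(s):
        return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
    return {h(s) for s in shingles(text, w) if h(s) % p == 0}

# Two fingerprints are then compared with the Jaccard formula,
# exactly as for full shingle sets, but over far fewer elements.
```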

Generally, noise removal is an expensive operation; if possible, a near-duplicate detection algorithm should avoid it. Martin Theobald et al. proposed the SpotSigs [19] algorithm, which uses word chains around stop words as features to construct the feature set. For example, consider the sentence: "On a street in Milton, in the city's inner-west, one woman wept as she toured her waterlogged home." Choosing the articles a, an, the and the verb is as antecedents with a uniform spot distance of 1 and chain length of 2, we obtain the set of spot signatures S = {a:street:Milton, the:city's:inner-west}. SpotSigs needs only a single pass over a corpus, which is much more efficient, easier to implement, and less error-prone because expensive layout analysis is omitted. Meanwhile, it remains largely independent of the input format. This method is taken as our baseline. In this paper, considering these merits, we focus on algorithms without noise removal, and we also take the Jaccard overlap measure as our similarity measure.
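
As an illustration of the spot-signature construction just described, the sketch below reproduces the example from the text; the tokenization and the small stopword list used for skipping are our simplifications, and the real SpotSigs implementation may differ in these details.

```python
def spot_signatures(text,
                    antecedents=frozenset({"a", "an", "the", "is"}),
                    stopwords=frozenset({"a", "an", "the", "is", "in",
                                         "of", "on", "as", "to"}),
                    spot_distance=1, chain_len=2):
    """For each antecedent occurrence, chain the following non-stopword
    words (at the given spot distance) into one signature."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.split()]
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok.lower() not in antecedents:
            continue
        chain, j = [], i
        while len(chain) < chain_len and j < len(tokens):
            j += spot_distance
            while j < len(tokens) and tokens[j].lower() in stopwords:
                j += 1                      # stopwords never enter a chain
            if j < len(tokens):
                chain.append(tokens[j])
        if len(chain) == chain_len:
            sigs.add(":".join([tok.lower()] + chain))
    return sigs

s = ("On a street in Milton, in the city's inner-west, "
     "one woman wept as she toured her waterlogged home.")
print(spot_signatures(s))   # {'a:street:Milton', "the:city's:inner-west"}
```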

3 Relation between Noise-Content Ratio and Similarity

3.1 Concepts and Notation

For calculating the similarity, we need to extract features from Web pages. We define all the features from one page as the page-feature set; we also split these features into a content-feature set and a noise-feature set. A feature that comes from the core content of a page is called a content feature (element) and belongs to the content-feature set; otherwise, the feature is called a noise feature (element) and belongs to the noise-feature set. The noise-content (feature) ratio is the ratio between the size of the noise-feature set and the size of the content-feature set.

3.2 Theoretical Analysis

Let sim(P1, P2) = |P1 ∩ P2|/|P1 ∪ P2| be the default Jaccard similarity defined over two sets P1 and P2, each consisting of a distinct page-feature set in our case. P1c and P2c are the content-feature sets; P1n and P2n are the noise-feature sets, which satisfy P1c ∪ P1n = P1 and P2c ∪ P2n = P2. The similarity between P1c and P2c is sim(P1c, P2c) = |P1c ∩ P2c|/|P1c ∪ P2c|, which is the real value we care about in near-duplicate detection.
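
These definitions translate directly into code; a small sketch under the notation above (the variable names and the toy feature sets are ours):

```python
def jaccard(p1, p2):
    """sim(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2| over two feature sets."""
    union = p1 | p2
    return len(p1 & p2) / len(union) if union else 1.0

def noise_content_ratio(p_noise, p_content):
    """|Pn| / |Pc|: size of the noise-feature set over the content-feature set."""
    return len(p_noise) / len(p_content)

# Toy example: the observable sim(P1, P2) approximates sim(P1c, P2c),
# the value we actually care about in near-duplicate detection.
p1c, p1n = {"f1", "f2", "f3"}, {"ad1"}
p2c, p2n = {"f1", "f2", "f4"}, {"ad1", "ad2"}
print(jaccard(p1c | p1n, p2c | p2n))   # sim(P1, P2)
print(jaccard(p1c, p2c))               # sim(P1c, P2c)
```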


As we know, near-duplicate detection is in fact a comparison of the similarity of the core contents of two pages, but Web pages contain a lot of noisy content, such as banners and ads. Most algorithms use sim(P1, P2) to approximate sim(P1c, P2c). If sim(P1, P2) is close to sim(P1c, P2c), the near-duplicate detection algorithm works well, and vice versa. In order to describe the difference between sim(P1, P2) and sim(P1c, P2c), we state Theorem 1 as follows:

Theorem 1. Given two sets P1 and P2, subject to P1c ⊂ P1, P1n ⊂ P1 and P1c ∪ P1n = P1; similarly, P2c ⊂ P2, P2n ⊂ P2 and P2c ∪ P2n = P2. At the same time, sim(P1, P2) = |P1 ∩ P2|/|P1 ∪ P2| and sim(P1c, P2c) = |P1c ∩ P2c|/|P1c ∪ P2c|. Let the noise-content ratios satisfy |P1n|/|P1c| ≤ ε and |P2n|/|P2c| ≤ ε, where ε is a small number. Then

\[ \frac{-2\varepsilon}{1+2\varepsilon} \le \mathrm{sim}(P_1,P_2) - \mathrm{sim}(P_{1c},P_{2c}) \le 2\varepsilon \tag{1} \]

Proof: Let A = |P1c ∩ P2c| and B = |P1c ∪ P2c|. Then

\[ A \le |(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})| \le A + 2\max\{|P_{1n}|,|P_{2n}|\} \tag{2} \]

\[ B \le |(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})| \le B + 2\max\{|P_{1n}|,|P_{2n}|\} \tag{3} \]

From (2) and (3), we get the following inequality:

\[ \frac{A}{B + 2\max\{|P_{1n}|,|P_{2n}|\}} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} \le \frac{A + 2\max\{|P_{1n}|,|P_{2n}|\}}{B} \tag{4} \]

From (4), we get the following inequality:

\[ \frac{-2A\max\{|P_{1n}|,|P_{2n}|\}}{B\,(B + 2\max\{|P_{1n}|,|P_{2n}|\})} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le \frac{2\max\{|P_{1n}|,|P_{2n}|\}}{B} \tag{5} \]

Obviously, A ≤ B and B ≥ max{|P1c|, |P2c|}. So we get:

\[ \frac{\max\{|P_{1n}|,|P_{2n}|\}}{B} \le \frac{\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\}} \le \varepsilon \tag{6} \]

Another inequality is:

\[ \frac{-2A\max\{|P_{1n}|,|P_{2n}|\}}{B\,(B + 2\max\{|P_{1n}|,|P_{2n}|\})} = \frac{-2A}{B}\cdot\frac{\max\{|P_{1n}|,|P_{2n}|\}}{B + 2\max\{|P_{1n}|,|P_{2n}|\}} \ge \frac{-2\max\{|P_{1n}|,|P_{2n}|\}}{B + 2\max\{|P_{1n}|,|P_{2n}|\}} \ge \frac{-2\max\{|P_{1n}|,|P_{2n}|\}}{\max\{|P_{1c}|,|P_{2c}|\} + 2\max\{|P_{1n}|,|P_{2n}|\}} \ge \frac{-2\,\max\{|P_{1n}|,|P_{2n}|\}/\max\{|P_{1c}|,|P_{2c}|\}}{1 + 2\,\max\{|P_{1n}|,|P_{2n}|\}/\max\{|P_{1c}|,|P_{2c}|\}} \ge \frac{-2\varepsilon}{1+2\varepsilon} \tag{7} \]


So, (5) could be reformed as:

\[ \frac{-2\varepsilon}{1+2\varepsilon} \le \frac{|(P_{1c} \cup P_{1n}) \cap (P_{2c} \cup P_{2n})|}{|(P_{1c} \cup P_{1n}) \cup (P_{2c} \cup P_{2n})|} - \frac{A}{B} \le 2\varepsilon \tag{8} \]

That is,

\[ \frac{-2\varepsilon}{1+2\varepsilon} \le \mathrm{sim}(P_1,P_2) - \mathrm{sim}(P_{1c},P_{2c}) \le 2\varepsilon \tag{9} \]

Theorem 1 shows: (1) when ε is small enough, the similarity sim(P1, P2) is close to the similarity sim(P1c, P2c); (2) once ε reaches a certain small value, the difference between the two similarities is already small and varies little even if ε keeps decreasing. That is, once the noise-content ratio reaches a certain small value, further reducing it yields little additional improvement in the effectiveness of a near-duplicate detection algorithm.
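
As a quick numerical sanity check of Theorem 1 (not part of the original paper), the following sketch generates random content and noise feature sets with noise-content ratio at most ε and verifies that the difference of similarities always stays inside the bound; the set sizes and the random construction are arbitrary illustrative choices.

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b)

random.seed(0)
eps = 0.1
for _ in range(1000):
    # shared core plus per-page extras, so sim(P1c, P2c) varies across trials
    core = set(range(200))
    p1c = core | set(random.sample(range(200, 400), random.randint(0, 100)))
    p2c = core | set(random.sample(range(400, 600), random.randint(0, 100)))
    # noise sets whose size keeps the noise-content ratio below eps
    p1n = set(random.sample(range(600, 800), int(eps * len(p1c))))
    p2n = set(random.sample(range(800, 1000), int(eps * len(p2c))))
    diff = jaccard(p1c | p1n, p2c | p2n) - jaccard(p1c, p2c)
    assert -2 * eps / (1 + 2 * eps) - 1e-12 <= diff <= 2 * eps + 1e-12
print("bound of Theorem 1 holds on all random trials")
```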

Without loss of generality, we assume |P2n|/|P2c| ≤ |P1n|/|P1c| = ε. Then Formula (9) could be reformed as:

\[ \frac{-2|P_{1n}|}{|P_{1c}| + 2|P_{1n}|} \le \mathrm{sim}(P_1,P_2) - \mathrm{sim}(P_{1c},P_{2c}) \le \frac{2|P_{1n}|}{|P_{1c}|} \tag{10} \]

Formula (10) shows that |P1c| should be large for robustness; otherwise, a slight change in |P1c| or |P1n| causes a sharp change in the upper and lower bounds, which means the algorithm is not robust. For example, suppose two feature sets each give an upper bound of 5/100. After combining the feature sets the upper bound becomes (5+5)/(100+100), which equals 5/100; but if one extra noise feature appears, (5+1)/100 > (5+5+1)/(100+100). Obviously, (5+5)/(100+100) is more robust than 5/100, even though they have the same value.
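
The arithmetic in the robustness example can be checked directly (a trivial verification sketch, not from the paper):

```python
# Same value before the perturbation...
assert 5 / 100 == (5 + 5) / (100 + 100)
# ...but one extra noise feature moves the small-set bound much more.
assert (5 + 1) / 100 > (5 + 5 + 1) / (100 + 100)   # 0.06 > 0.055
```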

In a word, when ε is relatively large, we can make the algorithm work better by following two rules: (a) select features that have a small noise-content ratio, to improve effectiveness; (b) when the noise-content ratios of two types of features are the same, select the feature type with the larger content-feature set, to make the algorithm robust. This implies that if the noise-content ratios of several feature types are very close, these features should be combined to increase robustness while effectiveness changes little.

4 AF SpotSigs and SizeSpotSigs Algorithm

SpotSigs [19] provided a stopword feature, which aims to filter natural-language text passages out of noisy Web page components; that is, its noise-content ratio is small. This gave us the intuition that we should choose features that tend to occur mostly in the core content of Web documents and skip over advertisements, banners, and navigational components. In this paper, based on the idea behind SpotSigs and our analysis in Section 3.2, we developed four features which all have a small noise-content ratio.


Fig. 2. Table of meanings of Chinese punctuation marks and table of markers of Chinese stopwords used in the paper

Details are as follows:

1) Stopword feature. It is similar to the feature in SpotSigs, i.e., a string consisting of a stopword and its neighboring words, except that the stopwords are different because the languages are different. Because stopwords occur less often in noisy content than in core content, these features lower the noise-content ratio compared with shingling features. The Chinese stopwords and their corresponding markers used in this paper are listed in Fig. 2.

2) Chinese punctuation feature. In English, many punctuation marks coincide with special characters of the HTML language, so punctuation cannot be used to extract features. In Chinese, however, this is not the case. As we know, Chinese punctuation marks rarely occur in noisy areas. We choose a string consisting of a punctuation mark and its neighboring words as the Chinese punctuation feature, which keeps the noise-content ratio small. The Chinese punctuation marks and their corresponding English punctuation marks used in this paper are also listed in Fig. 2.

3) Sentence feature. The string between two Chinese punctuation marks is treated as a sentence; since sentences delimited by punctuation are rare in noisy areas, sentence features lower the noise-content ratio notably.

4) Sentence shingling feature. Assuming the length of a sentence is n, all its 1-grams, 2-grams, ..., (n-1)-grams are taken as new features, aiming to enlarge the content-feature set for robustness and effectiveness; building on the sentence feature, this also keeps the noise-content ratio small. A simplified sketch of these four feature types is given below.
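
The concrete stopwords and punctuation marks live in Fig. 2, which is not reproduced in this text, so the small lists in the sketch (e.g. 的 for De1, 了 for Le, and the full-width marks ，。！？；：) are our own assumed examples; we also work at the character level instead of doing proper Chinese word segmentation, which is a deliberate simplification.

```python
import re

# Assumed stand-ins for the Fig. 2 lists; the paper's actual lists may differ.
STOPWORDS = ["的", "地", "得", "是", "把", "了"]          # e.g. De1, Di, De2, Shi, Ba, Le
PUNCTUATION = ["，", "。", "！", "？", "；", "："]

def neighbours(text, pos, width=2):
    """The `width` characters following position `pos` (crude stand-in for
    neighbouring words, which would need word segmentation)."""
    return text[pos + 1 : pos + 1 + width]

def af_spotsig_features(text):
    feats = set()
    # 1) Stopword feature and 2) Chinese punctuation feature:
    #    the marker character together with its neighbourhood.
    for i, ch in enumerate(text):
        if ch in STOPWORDS:
            feats.add("stop:" + ch + ":" + neighbours(text, i))
        elif ch in PUNCTUATION:
            feats.add("punct:" + ch + ":" + neighbours(text, i))
    # 3) Sentence feature: the strings between two Chinese punctuation marks.
    sentences = [s for s in re.split("[" + "".join(PUNCTUATION) + "]", text) if s]
    for s in sentences:
        feats.add("sent:" + s)
        # 4) Sentence shingling feature: every k-gram with k < n of a
        #    sentence of length n, enlarging the content-feature set.
        for k in range(1, len(s)):
            for j in range(len(s) - k + 1):
                feats.add("shingle:" + s[j : j + k])
    return feats
```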

The stopword feature is used by the state-of-the-art algorithm, SpotSigs [19]. Though the stopwords differ because the languages differ, we still call the algorithm SpotSigs. The experiments in Section 5.3 show that SpotSigs can reach 0.92 (F1) on long Web pages, but only 0.62 on short Web pages. Obviously, SpotSigs cannot process short Web pages well, and we need a new algorithm. If all four features are used to detect near duplicates, the algorithm is called AF SpotSigs. The experiments in Section 5.3 show that AF SpotSigs reaches 0.77 (F1) against 0.62 (F1) for SpotSigs on short Web pages, but gains only 0.04 (F1) at 28.8 times the time overhead on long Web pages. This means AF SpotSigs works much better than SpotSigs for short Web pages, while for long Web pages its effectiveness is only slightly better than that of SpotSigs at a much higher cost. Considering the balance between efficiency and effectiveness, we propose an algorithm called SizeSpotSigs that chooses only stopword features to judge near duplication for long Web pages (namely SpotSigs), while it chooses all four feature types mentioned above for short Web pages (namely AF SpotSigs).
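
The size-based dispatch of SizeSpotSigs is then straightforward; a sketch building on the `af_spotsig_features` function above, where `core_size` (the estimated size of the page's core content) and the short/long `threshold` are placeholders, since the paper only explores this boundary through the partition experiments in Section 5.4.

```python
def spotsigs_stopword_features(text):
    """The SpotSigs branch: keep only the stopword features (sketch)."""
    return {f for f in af_spotsig_features(text) if f.startswith("stop:")}

def sizespotsigs_features(text, core_size, threshold):
    """SizeSpotSigs: stopword features only for long pages (SpotSigs),
    all four feature types for short pages (AF_SpotSigs)."""
    if core_size >= threshold:
        return spotsigs_stopword_features(text)
    return af_spotsig_features(text)
```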


5 Experiment

5.1 Data Set

To verify our algorithms, AF SpotSigs and SizeSpotSigs, we construct 4 datasets. Details are as follows:

Collection Shorter / Collection Longer: we construct Collection Shorter and Collection Longer manually. Collection Shorter has 379 short Web pages in 48 clusters, and Collection Longer has 332 long Web pages in 40 clusters.

Collection Mixer / Collection Mixer Purity: Collection Shorter and Collection Longer are mixed into Collection Mixer, which includes 88 clusters and 711 Web pages in total. For each Web page in Collection Mixer, we extract its core content according to human judgment, which leads to Collection Mixer Purity.

5.2 Choice of Stopwords

Because the number of stopwords is large, e.g., more than 370 in Chinese, we need to select the most representative stopwords to improve performance. SpotSigs, however, only ran experiments on an English collection, so it is unclear how to choose the stopwords or the length of their neighboring word chains on a Chinese collection. At the same time, for AF SpotSigs, we also need to choose the stopwords and the chain length. We find that F1 varies only slightly, by about 1 absolute percent, from a chain length of 1 to 3 (figures omitted). So we choose two words as the chain-length parameter for both algorithms.

In this section, we seek the best combination of stopwords for AF SpotSigs and SpotSigs on Chinese. We consider variations in the choice of SpotSigs antecedents (stopwords and their neighboring words), aiming to find a good compromise between extracting characteristic signatures and avoiding over-fitting of these signatures to particular articles or sites.

For SpotSigs, which fits long Web pages, the best combination was searched for on the collection Longer Sample, obtained by sampling 1/3 of the clusters from Collection Longer. Moreover, for AF SpotSigs, which fits short Web pages, we tune the parameter on the collection Shorter Sample, obtained by sampling 1/3 of the clusters from Collection Shorter.

Fig. 3(a) shows that we obtain the best F1 result for SpotSigs with the combination of De1, Di, De2, Shi, Ba, Le, which mostly occur in core content and are less likely to occur in ads or navigational banners. Meanwhile, for AF SpotSigs, Fig. 3(b) shows that the best F1 result is obtained with the single stopword "De1". Using a full stopword list (here the 40 most frequent stopwords) already tends to yield overly generic signatures but still performs reasonably well.

5.3 AF SpotSigs vs. SpotSigs

After obtaining the parameters of AF SpotSigs and SpotSigs, we can compare the two algorithms in terms of F1 value and computational cost. To this end, the two algorithms are run on Collection Shorter and Collection Longer.


Fig. 3. (a) The effectiveness of SpotSigs with different stopword sets on the Longer collection; (b) the effectiveness of AF SpotSigs with different stopword sets on the Shorter collection. The F1 values read from the two panels are:

Stopword set                 SpotSigs on Longer (F1)   AF_SpotSigs on Shorter (F1)
{De1}                        0.874                     0.772
{Di}                         0.588                     0.768
{De2}                        0.811                     0.770
{De1,Di}                     0.875                     0.768
{De1,Di,De2}                 0.898                     0.768
{De1,Di,De2,Ba,Le}           0.913                     0.769
{De1,Di,De2,Shi,Ba,Le}       0.921                     0.769
{De1,Di,De2,Yu1,He,Mei}      0.887                     0.769
{Yi,Le,Ba,Suo,Dou,Yu2}       0.856                     0.767
Full stopword list           0.824                     0.757

Fig. 4 shows that the F1 scores of AF SpotSigs are better than those of SpotSigs on both Shorter and Longer. Moreover, the F1 score of SpotSigs is far worse than that of AF SpotSigs on Shorter, while the F1 scores of the two algorithms are very close on Longer. However, Table 1 shows that AF SpotSigs takes much more time than SpotSigs.

Considering the balance between effectiveness and efficiency, we can partition a collection into two parts, namely a short part and a long part. SpotSigs works on the long part while AF SpotSigs runs on the short part; this is the SizeSpotSigs algorithm.

Fig. 4. The effectiveness of SpotSigs and AF SpotSigs on Shorter and Longer (SpotSigs: 0.622 on Shorter, 0.921 on Longer; AF SpotSigs: 0.772 on Shorter, 0.960 on Longer)

Table 1. The F1 value and cost of the two algorithms

                          Shorter   Longer
SpotSigs      F1          0.6223    0.9214
              Time (Sec.) 1.743     1.812
AF SpotSigs   F1          0.7716    0.9597
              Time (Sec.) 21.17     52.31


Fig. 5. F1 values of SizeSpotSigs, AF SpotSigs and SpotSigs as a function of the cluster partition point, on Collection Mixer Purity (a) and Mixer (b)

5.4 SizeSpotSigs over SpotSigs and AF SpotSigs

To verify SizeSpotSigs, all clusters in Mixer are sorted from small to large by the average size of their core content. We select three partition points (22, 44, 66) to partition the set of clusters. For example, if the partition point is 22, the first 22 clusters in the sorted order are taken as the small part while the remaining clusters form the large part. Table 2 shows the nature of the two parts for each partition. In particular, 0/88 means that all clusters are put into the large part, which makes SizeSpotSigs degenerate to SpotSigs, while 88/0 means all clusters belong to the small part, which makes SizeSpotSigs degenerate to AF SpotSigs.

Fig. 5(b) shows that SizeSpotSigs works better than SpotSigs but worse than AF SpotSigs. Moreover, the F1 value of SizeSpotSigs increases as the partition point increases.

When the purified collection is used, the noise-content ratio is zero. So, based on Formula (9), sim(P1, P2) = sim(P1c, P2c), and the F1 value depends completely on sim(P1c, P2c). Fig. 5(a) shows that the F1 of SizeSpotSigs rises and falls in an irregular manner, but stays within a reasonable interval, always above 0.91. All details are listed in Table 3.

Table 2. The nature of the partitions (small part / large part)

Partition        0/88        22/66            44/44            66/22             88/0
Avg Size (Byte)  0/2189.41   607.65/2561.43   898.24/3247.73   1290.25/4421.20   2189.41/0
File Num         0/711       136/575          321/390          514/197           711/0


Table 3. The F1 value and time for the 3 algorithms on the partitions (s is Sec.)

                        SpotSigs  AF SpotSigs  SizeSpotSigs  SizeSpotSigs  SizeSpotSigs
                        (0/88)    (88/0)       (22/66)       (44/44)       (66/22)
Mixer         F1        0.6957    0.8216       0.7530        0.7793        0.8230
              Time (s)  3.6094    148.20       7.142         22.81         61.13
Mixer Purity  F1        0.9360    0.9122       0.9580        0.9306        0.9165
              Time (s)  2.2783    134.34       4.0118        15.99         47.00

6 Conclusions and Future Works

We analyzed the relation between noise-content ratio and similarity theoretically, which led to two rules that can make near-duplicate detection algorithms work better. Then, the paper proposed 3 new features to improve the effectiveness and robustness for short Web pages, which led to our AF SpotSigs method.

Experiments confirm that the 3 new features are effective, and that our AF SpotSigs works 15% better than the state-of-the-art method for short Web pages. Besides, SizeSpotSigs, which considers the size of the page core content, performs better than SpotSigs over different partition points.

Future work will focus on 1) how to decide the size of the core content of a Web page automatically or approximately; and 2) designing more features that fit short Web pages to improve effectiveness, as well as generalizing the bounding approach toward other metrics such as Cosine.

Acknowledgments

This work is supported by NSFC Grants No. 70903008, 60933004 and 61073082, and FSSP 2010 Grant No. 15. We also thank Jing He and Dongdong Shan for a quick review of our paper close to the submission deadline.

References

1. Agarwal, A., Koppula, H., Leela, K., Chitrapura, K., Garg, S., GM, P., Haty, C., Roy, A., Sasturkar, A.: URL normalization for de-duplication of web pages. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1987–1990. ACM, New York (2009)
2. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern information retrieval. Addison-Wesley, Reading (1999)
3. Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th International Conference on World Wide Web, p. 660. ACM, New York (2005)
4. Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital documents. ACM SIGMOD Record 24(2), 409 (1995)
5. Broder, A.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg (2000)
6. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the web. Computer Networks and ISDN Systems 29(8-13), 1157–1166 (1997)
7. Buttcher, S., Clarke, C.: A document-centric approach to static index pruning in text retrieval systems. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, p. 189. ACM, New York (2006)
8. Charikar, M.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, p. 388. ACM, New York (2002)
9. Chowdhury, A., Frieder, O., Grossman, D., McCabe, M.: Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS) 20(2), 191 (2002)
10. Dasgupta, A., Kumar, R., Sasturkar, A.: De-duping URLs via rewrite rules. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 186–194. ACM, New York (2008)
11. Datar, M., Gionis, A., Indyk, P., Motwani, R., Ullman, J., et al.: Finding Interesting Associations without Support Pruning. IEEE Transactions on Knowledge and Data Engineering 13(1) (2001)
12. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, pp. 518–529. Morgan Kaufmann Publishers Inc., San Francisco (1999)
13. Henzinger, M.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 284–291. ACM, New York (2006)
14. Hoad, T., Zobel, J.: Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology 54(3), 203–215 (2003)
15. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM, New York (1998)
16. Kołcz, A., Chowdhury, A., Alspector, J.: Improved robustness of signature-based near-replica detection via lexicon randomization. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 610. ACM, New York (2004)
17. Koppula, H., Leela, K., Agarwal, A., Chitrapura, K., Garg, S., Sasturkar, A.: Learning URL patterns for webpage de-duplication. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 381–390. ACM, New York (2010)
18. Manber, U.: Finding similar files in a large file system. In: Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, CA, USA, pp. 1–10 (1994)
19. Theobald, M., Siddharth, J., Paepcke, A.: Spotsigs: robust and efficient near duplicate detection in large web collections. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 563–570. ACM, New York (2008)
20. Whitten, A.: Scalable Document Fingerprinting. In: The USENIX Workshop on E-Commerce (1996)
21. Yang, H., Callan, J.: Near-duplicate detection by instance-level constrained clustering. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 428. ACM, New York (2006)