
2009 IEEE International Advance Computing Conference (IACC 2009), Patiala, India, 6-7 March 2009

A Novel and Efficient Approach for Near Duplicate Page Detection in Web Crawling

V.A. Narayana, Associate Professor & Head, CSE Department, CMR College of Engineering & Technology, JNTU, Hyderabad, India

P. Premchand, Professor, Department of Computer Science & Engineering, University College of Engineering, Osmania University, Hyderabad-500007, AP, India

Dr. A. Govardhan, Professor & Head of the Department, Department of Computer Science and Engineering, JNTU College of Engineering, Kukatpally, Hyderabad, India

[email protected] [email protected]

Abstract-The drastic development of the World Wide Web in recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to the web search engines, making their results less relevant to the users. The presence of duplicate and near duplicate web documents in abundance has created additional overheads for the search engines, critically affecting their performance and quality. The detection of duplicate and near duplicate web pages has long been recognized in the web crawling research community. It is an important requirement for search engines to provide users with the relevant results for their queries in the first page without duplicate and redundant results. In this paper, we have presented a novel and efficient approach for the detection of near duplicate web pages in web crawling. Detection of near duplicate web pages is carried out ahead of storing the crawled web pages into repositories. At first, the keywords are extracted from the crawled pages and the similarity score between two pages is calculated based on the extracted keywords. The documents having similarity scores greater than a threshold value are considered as near duplicates. The detection has resulted in reduced memory for repositories and improved search engine quality.

Keywords-Web Mining; Web Content Mining; Web Crawling; Web pages; Stemming; Common words; Near duplicate pages

I. INTRODUCTION

The employment of automated tools to locate the information resources of interest, and for tracking and analyzing the same, has become inevitable these days owing to the drastic development of the Web. This has made the development of server-side and client-side intelligent systems for efficient knowledge mining [10] of data mandatory. The area of data mining that deals with the analysis of the World Wide Web is known as Web Mining. Web Mining owes its origin to concepts from areas such as Data Mining, Internet technology, World Wide Web and, lately, Semantic Web [6]. Web mining includes the sub areas web content mining [30], web structure mining [17] and web usage mining [25], and can be defined as the procedure of determining hidden yet potentially beneficial knowledge from the data accessible in the web. The process of mining knowledge from the web pages besides other web objects is known as Web content mining. Web structure mining is the process of mining knowledge about the link structure linking web pages and some other web objects. The mining of usage patterns created by the users accessing the web pages is called Web usage mining [27].

The World Wide Web owes its development to the Search engine technology. The chief gateways for access of information in the web are Search engines. Businesses have turned beneficial and productive with the ability to locate contents of particular interest amidst a huge heap [34]. Web crawling, a process that populates an indexed repository of web pages, is utilized by the search engines in order to respond to the queries [32]. The programs that navigate the web graph and retrieve pages to construct a confined repository of the segment of the web that they visit were earlier known by diverse names such as wanderers, robots, spiders, fish and worms, in accordance with the web imagery [31].

Generic and Focused crawling are the two main types of crawling. Generic crawlers [1] differ from focused crawlers [28] in that the former crawl documents and links of diverse topics whereas the latter limit the number of pages crawled with the aid of some prior obtained specialized knowledge. Repositories of web pages are built by the web crawlers so as to present input for systems that index, mine and otherwise analyze pages (for instance, the search engines) [9]. The subsistence of near duplicate data is an issue that accompanies the drastic development of the Internet and the growing need to incorporate heterogeneous data [36]. Even though the near duplicate data are not bit-wise identical, they bear a striking similarity [36]. Web search engines face huge problems due to duplicate and near duplicate web pages. These pages either increase the index storage space or slow down or increase the serving costs, thereby irritating the users. Thus algorithms for detecting such pages are inevitable [24]. Web crawling issues such as freshness and efficient resource usage have been addressed previously [5], [14], [11]. Lately, the elimination of duplicate and near duplicate web documents has become a vital issue and has attracted significant research [20].


Identification of the near duplicates can be advantageous to many applications. Focused crawling, enhanced quality and diversity of the query results, and identification of spams can be facilitated by determining the near duplicate web pages [19, 24]. Numerous web mining applications depend on the accurate and proficient identification of near duplicates. Document clustering, detection of replicated web collections [12], detecting plagiarism [23], community mining in a social network site [35], collaborative filtering [7] and discovering large dense graphs [21] are a notable few among those applications. Reduction in storage costs and enhancement in quality of search indexes, besides considerable bandwidth conservation, can be achieved by eliminating the near duplicate pages [1]. Checksumming techniques can determine the documents that are precise duplicates (because of mirroring or plagiarism) of each other [13]. The recognition of near duplicates is a tedious problem.

Research on duplicate detection was initially done on databases, digital libraries and electronic publishing. Lately, duplicate detection has been extensively studied for the sake of numerous web search tasks such as web crawling, document ranking and document archiving. A huge number of duplicate detection techniques, ranging from manually coded rules to cutting edge machine learning techniques, have been put forth [36, 24, 18, 2, 3, 15, 38]. Recently a few authors have projected near duplicate detection techniques [37, 38, 8, 22]. A variety of issues, from providing high detection rates to minimizing the computational and storage resources, have been addressed by them. These techniques vary in their accuracy as well. Some of these techniques are computationally too pricey to be implemented completely on huge collections. Even though some of these algorithms prove to be efficient, they are fragile and so are susceptible to minute changes of the text.

The primary intent of our research is to develop a novel and efficient approach for detection of near duplicates in web documents. Initially the crawled web pages are preprocessed using document parsing, which removes the HTML tags and Java scripts present in the web documents. This is followed by the removal of common words or stop words from the crawled pages. Then the stemming algorithm is applied to filter the affixes (prefixes and suffixes) of the crawled documents in order to get the keywords. Finally, the similarity score between two documents is calculated on the basis of the extracted keywords. The documents with similarity scores greater than a predefined threshold value are considered as near duplicates. We have conducted an extensive experimental study using several real datasets, and have demonstrated that the proposed algorithms outperform previous ones.

The rest of the paper is organized as follows. Section 2 presents a brief review of some approaches available in the literature for duplicate and near duplicate detection. In Section 3, the novel approach for the detection of near duplicate documents is presented. The conclusions are summed up in Section 4.

II. RELATED WORK

Our work has been inspired by a number of previous works on duplicate and near duplicate document and web page detection.

Sergey Brin et al. [2] have proposed a system for registering documents and then detecting copies, either complete copies or partial copies. They described algorithms for detection, and metrics required for evaluating detection mechanisms covering accuracy, efficiency and security. They also described a prototype implementation of the service, COPS, and presented experimental results that suggest the service can indeed detect violations of interest.

Andrei Z. Broder et al. [3] have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using their mechanism, they have built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web pages, and identifying violations of intellectual property rights.

Jack G. Conrad et al. [15] have determined the extent and the types of duplication existing in large textual collections. Their research is divided into three parts. Initially they started with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. Then they examined the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, they investigated a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. Their method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.

Donald Metzler et al. [29] have explored mechanisms for measuring intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. They proposed a range of approaches to reuse detection at the sentence level, and a range of approaches for combining sentence-level evidence into document-level evidence. They considered both sentence-to-sentence and document-to-document comparison, and have incorporated the algorithms into RECAP, a prototype information flow analysis tool.

Hui Yang et al. [37] have explored the use of simple text clustering and retrieval algorithms for identifying near-duplicate public comments. They have focused on automating the process of near-duplicate detection, especially form letter detection. They gave a clear near-duplicate definition and explored simple and efficient methods of using feature-based document retrieval and similarity-based clustering to discover near-duplicates. The methods were evaluated in experiments with a subset of a large public comment database collected for an EPA rule.

Monika Henzinger [24] has compared two algorithms, namely the shingling algorithm [3] and the random projection based approach [13], on a very large scale set of 1.6B distinct web pages. The results showed that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. She has presented a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.


Hui Yang et al. [38] have presented DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. The results have demonstrated that statistical similarity measures and instance-level constrained clustering can be quite effective for efficiently identifying near-duplicates.

In the course of developing a near-duplicate detection system for a multi-billion page repository, Gurmeet Singh Manku et al. [22] have made two research contributions. First, they demonstrated that Charikar's [13] fingerprinting technique is appropriate for this goal. Second, they presented an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. This technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints).

Ziv Bar-Yossef et al. [8] have considered the problem of DUST: Different URLs with Similar Text. They proposed a novel algorithm, DustBuster, for uncovering dust; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines dust effectively from previous crawl logs or web server logs, without examining page contents.

Chuan Xiao et al. [36] have proposed exact similarity join algorithms with application to near duplicate detection and a positional filtering principle, which exploits the ordering of tokens in a record and leads to upper bound estimates of similarity scores. They demonstrated the superior performance of their algorithms over the existing prefix filtering-based algorithms on several real datasets under a wide range of parameter settings.

III. NOVEL APPROACH FOR NEAR DUPLICATE WEB PAGE DETECTION

A novel approach for the detection of near duplicate web pages is presented in this section. In web crawling, the crawled web pages are stored in a repository for further processes such as search engine formation, page validation, structural analysis and visualization, update notification, mirroring, personal web assistants or agents, and more. Duplicate and near duplicate web page detection is an important step in web crawling. In order to facilitate search engines to provide search results free of redundancy to users and to provide distinct and useful results on the first page, duplicate and near duplicate detection is essential. Numerous challenges are encountered by the systems that aid in the detection of near duplicate pages. First is the concern of scale, since the search engines index hundreds of millions of web pages, thereby amounting to a multi-terabyte database. Next is the issue of making the crawl engine crawl billions of web pages every day. Thus marking a page as a near duplicate should be done at a quicker pace. Furthermore, the system should utilize a minimal number of machines [22].

The near duplicate detection is performed on the keywords extracted from the web documents. First, the crawled web documents are parsed to extract the distinct keywords. Parsing includes removal of HTML tags, Java scripts and stop words/common words, and stemming of the remaining words. The extracted keywords and their counts are stored in a table to ease the process of near duplicates detection. The keywords are stored in the table in a way that the search space is reduced for the detection. The similarity score of the current web document against a document in the repository is calculated from the keywords of the pages. The documents with similarity score greater than a predefined threshold are considered as near duplicates.

A. Near Duplicate Web Documents

Even though the near duplicate documents are not bitwise identical, they bear striking similarities. The near duplicates are not considered as "exact duplicates" but are files with minute differences. Typographical errors; versioned, mirrored, or plagiarized documents; multiple representations of the same physical object; spam emails generated from the same template; and many such phenomena may result in near duplicate data. A considerable percentage of web pages have been identified as near-duplicates according to various studies [3, 19, 24]. These studies propose that near duplicates constitute almost 1.7% to 7% of the web pages traversed by crawlers. The steps involved in our approach are presented in the following subsections.

B. Web Crawling

The analysis of the structure and informatics of the web is facilitated by a data collection technique known as Web Crawling. The collection of as many beneficial web pages as possible, along with their interconnection links, in a speedy yet proficient manner is the prime intent of crawling. Automatic traversal of web sites, downloading documents and tracing links to other pages are some of the features of a web crawler program. Numerous search engines utilize web crawlers for gathering web pages of interest besides indexing them. Web crawling becomes a tedious process due to the following features of the web: the large volume, and the huge rate of change due to the voluminous number of pages being added or removed each day.

Seed URLs are a set of URLs that a crawler begins working with. These URLs are queued. A URL is obtained in some order from the queue by a crawler. Then the crawler downloads the page. This is followed by extracting the URLs from the downloaded page and enqueuing them. The process continues until the crawler settles to stop [33]. A crawling loop thus consists of obtaining a URL from the queue, downloading the corresponding file with the aid of HTTP, parsing it for new URLs, and adding the new URLs to the queue [31].
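A minimal sketch of such a crawling loop is given below, assuming Python's standard urllib for HTTP and a naive regular expression for link extraction; the seed URL, the page limit and the error handling are illustrative choices, not details taken from the paper.

import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=10):
    """Minimal crawling loop: dequeue a URL, fetch it, enqueue newly found links."""
    queue = deque(seed_urls)          # frontier of URLs to visit
    visited = set()                   # URLs already fetched
    pages = {}                        # url -> raw HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="ignore")
        except Exception:
            continue                  # skip pages that cannot be downloaded
        pages[url] = html
        # Naive link extraction: double-quoted absolute links only.
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in visited:
                queue.append(link)
    return pages

# Example usage with an illustrative seed URL.
# repository = crawl(["http://example.com/"], max_pages=5)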


C. Web Document Parsing

Information extracted from the crawled documents aids in determining the future path of a crawler. Parsing may either be as simple as hyperlink/URL extraction or as complex as analysis of HTML tags by cleaning the HTML content [31]. It is inevitable for a parser that has been designed to traverse the entire web to encounter numerous errors. The parser tends to obtain information from a web page by not considering a few common words like a, an, the and more, HTML tags, Java scripting and a range of other bad characters [16].
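A hedged sketch of this parsing step is shown below, assuming Python's standard html.parser module (the paper does not name a specific parser); it strips script and style blocks along with HTML tags and returns the remaining text for keyword extraction.

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text while skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = False
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def parse_document(html):
    """Return the plain text of a web page with tags and scripts removed."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)

# Example usage.
print(parse_document("<html><script>var x=1;</script><p>Near duplicate pages</p></html>"))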

1) Stop Words Removal: It is necessary and beneficial to remove the commonly utilized stop words such as "it", "can", "an", "and", "by", "for", "from", "of", "the", "to", "with" and more, either while parsing a document to obtain information about the content or while scoring fresh URLs that the page recommends. This procedure is termed stop listing [31]. Stop listing aids in the reduction of the size of the indexing file besides enhancing efficiency and value.
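A minimal sketch of stop listing follows; the stop word set is an illustrative assumption built around the examples quoted above, not a list taken from the paper.

# Illustrative stop word set (assumption; the paper quotes only a few examples).
STOP_WORDS = {"a", "an", "and", "by", "can", "for", "from", "it", "of", "the", "to", "with"}

def remove_stop_words(words):
    """Drop common words that carry little content before keyword extraction."""
    return [w for w in words if w.lower() not in STOP_WORDS]

# Example usage.
print(remove_stop_words(["detection", "of", "near", "duplicate", "pages", "from", "the", "web"]))
# -> ['detection', 'near', 'duplicate', 'pages', 'web']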

D. Stemming Algorithm

Variant word forms in Information Retrieval are restricted to a common root by Stemming. The postulation lying behind it is that two words possessing the same root represent identical concepts. Thus terms possessing identical meaning yet appearing morphologically dissimilar are identified in an IR system by matching query and document terms with the aid of Stemming [4]. Stemming facilitates the reduction of all words possessing an identical root to a single one. This is achieved by stripping each word of its derivational and inflectional suffixes [26]. For instance, "connect", "connected" and "connection" are all condensed to "connect".
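The paper does not name a particular stemming algorithm, so the following is only a simplified suffix-stripping sketch in the spirit of the description above; a real system would more likely use a full stemmer in the style of Lovins [26] or Porter.

# Illustrative suffix list (assumption; not a complete stemming rule set).
SUFFIXES = ["ingly", "edly", "ation", "ions", "ies", "ing", "ion", "ed", "es", "s"]

def stem(word):
    """Strip the first matching derivational/inflectional suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Example usage: "connect", "connected" and "connection" reduce to the same root.
print(stem("connect"), stem("connected"), stem("connection"))
# -> connect connect connect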

E. Keywords Representation

We possess the distinct keywords and their counts in each of the crawled web pages as a result of stemming. These keywords are then represented in a form that eases the process of near duplicates detection. This representation will reduce the search space for the near duplicate detection. Initially the keywords and their numbers of occurrences in a web page are sorted in descending order based on their counts. Afterwards, n keywords with the highest counts are stored in a table and the remaining keywords are indexed and stored in another table. In our approach the value of n is set to be 4. The similarity score between two documents is calculated if and only if the prime keywords of the two documents are similar. Thus the search space is reduced for near duplicates detection.
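A hedged sketch of this representation follows, using two plain Python dictionaries for the prime-keyword table and the remaining-keyword table; the exact storage layout used by the authors is not specified beyond this split, so the data structures here are assumptions.

from collections import Counter

def build_tables(words, n=4):
    """Split a page's keywords into n prime keywords (highest counts) and the rest."""
    counts = Counter(words)                                    # keyword -> occurrences
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    prime = dict(ranked[:n])                                   # table of prime keywords
    rest = dict(ranked[n:])                                    # indexed table of remaining keywords
    return prime, rest

# Example usage: matching prime keywords gate whether a full similarity comparison is made.
prime1, rest1 = build_tables(["web", "web", "crawl", "crawl", "page", "duplicate", "detect"])
prime2, rest2 = build_tables(["web", "web", "crawl", "page", "duplicate", "duplicate", "index"])
if set(prime1) == set(prime2):
    print("prime keywords match; compute the similarity score")
else:
    print("prime keywords differ; skip the comparison")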

F. Similarity Score Calculation

If the prime keywords of the new web page do not match the prime keywords of the pages in the table, then the new web page is added into the repository. If all the keywords of both pages are the same, then the new page is considered as a duplicate and thus is not included in the repository. If the prime keywords of the new page are the same as those of a page in the repository, then the similarity score between the two documents is calculated. The similarity score of two web documents is calculated as follows.

Let $T_1$ and $T_2$ be the tables containing the extracted keywords and their corresponding counts:

$T_1$: keywords $K_1, K_2, K_4, K_5, \ldots, K_n$ with counts $C_1, C_2, C_4, C_5, \ldots, C_n$
$T_2$: keywords $K_1, K_3, K_2, K_4, \ldots, K_n$ with counts $C_1, C_3, C_2, C_4, \ldots, C_n$

The keywords in the tables are considered individually for the similarity score calculation. If a keyword $K_i$ is present in both the tables, with $a = A[K_i]_{T_1}$ and $b = A[K_i]_{T_2}$, the similarity score of the keyword is calculated as

$SD_C = \log\left(\frac{count(a)}{count(b)}\right) \times \mathrm{Abs}\left(1 + (a - b)\right)$

Here $a$ and $b$ represent the index of the keyword in the two tables respectively.

If $T_1 \setminus T_2 \neq \varphi$, we use the following formula to calculate the similarity score, where the number of keywords present in $T_1$ but not in $T_2$ is taken as $N_{T_1}$:

$SD_{T_1} = \log(count(a)) \times \left(1 + \frac{|T_2|}{|T_1|}\right)$

If $T_2 \setminus T_1 \neq \varphi$, we use the formula below, where the number of keywords present in $T_2$ but not in $T_1$ is taken as $N_{T_2}$:

$SD_{T_2} = \log(count(b)) \times \left(1 + \frac{|T_1|}{|T_2|}\right)$

The similarity score (SSM) of a page against another page is calculated by using the following equation:

$SSM = \frac{\sum_{i=1}^{N_C} SD_C + \sum_{i=1}^{N_{T_1}} SD_{T_1} + \sum_{i=1}^{N_{T_2}} SD_{T_2}}{N}$

where $n = |T_1 \cup T_2|$ and $N = (|T_1| + |T_2|)/2$. The web documents with similarity score greater than a predefined threshold are near duplicates of documents already present in the repository. These near duplicates are not added into the repository for further processes such as search engine indexing.
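Since the formulas above are reconstructed from a badly garbled scan, the Python sketch below should be read as one plausible interpretation rather than the authors' exact computation; in particular the symmetric $|T_2|/|T_1|$ and $|T_1|/|T_2|$ factors, the 1-based table positions used for $a$ and $b$, and the treatment of counts are assumptions.

import math

def similarity_score(t1, t2):
    """Compute SSM for two non-empty keyword tables given as ordered dicts {keyword: count},
    already sorted in descending order of count.
    NOTE: reconstructed/assumed form of the paper's SD and SSM formulas."""
    order1 = {k: i + 1 for i, k in enumerate(t1)}   # a: 1-based index of keyword in T1
    order2 = {k: i + 1 for i, k in enumerate(t2)}   # b: 1-based index of keyword in T2
    scores = []
    for k in set(t1) | set(t2):
        if k in t1 and k in t2:                     # keyword present in both tables
            a, b = order1[k], order2[k]
            scores.append(math.log(t1[k] / t2[k]) * abs(1 + (a - b)))
        elif k in t1:                               # keyword only in T1
            scores.append(math.log(t1[k]) * (1 + len(t2) / len(t1)))
        else:                                       # keyword only in T2
            scores.append(math.log(t2[k]) * (1 + len(t1) / len(t2)))
    n_avg = (len(t1) + len(t2)) / 2                 # N = (|T1| + |T2|) / 2
    return sum(scores) / n_avg

# Example usage with two small keyword tables (keyword -> count).
t1 = {"web": 5, "crawl": 4, "duplicate": 3, "page": 2, "index": 1}
t2 = {"web": 5, "duplicate": 4, "crawl": 3, "page": 2, "engine": 1}
print(round(similarity_score(t1, t2), 3))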


IV. CONCLUSION

Though the web is a huge information store, various features such as the presence of a huge volume of unstructured or semi-structured data, their dynamic nature, and the existence of duplicate and near duplicate documents pose serious difficulties for Information Retrieval. The voluminous amounts of web documents swarming the web have posed a huge challenge to the web search engines, making them render results of less relevance to the users. The detection of duplicate and near duplicate web documents has gained more attention in recent years amidst the web mining researchers. In this paper, we have presented a novel and efficient approach for detection of near duplicate web documents in web crawling. The proposed approach has detected the duplicate and near duplicate web pages efficiently based on the keywords extracted from the web pages. Furthermore, reduced memory space for web repositories and improved search engine quality have been accomplished through the proposed duplicates detection approach.

REFERENCES

[1] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., and Raghavan, S., (2001) "Searching the web", ACM Transactions on Internet Technology, vol. 1, no. 1, pp. 2-43.
[2] Brin, S., Davis, J., and Garcia-Molina, H., (1995) "Copy detection mechanisms for digital documents", In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), ACM Press, pp. 398-409, May.
[3] Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G., (1997) "Syntactic clustering of the web", Computer Networks, vol. 29, no. 8-13, pp. 1157-1166.
[4] Bacchin, M., Ferro, N., and Melucci, M., (2002) "Experiments to evaluate a statistical stemming algorithm", Proceedings of the International Cross-Language Evaluation Forum Workshop, University of Padua at CLEF 2002.
[5] Broder, A. Z., Najork, M., and Wiener, J. L., (2003) "Efficient URL caching for World Wide Web crawling", Proceedings of the 12th International Conference on World Wide Web, pp. 679-689.
[6] Berendt, B., Hotho, A., Mladenic, D., Spiliopoulou, M., and Stumme, G., (2004) "A Roadmap for Web Mining: From Web to Semantic Web", Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin/Heidelberg/New York, Vol. 3209, pp. 1-22, ISBN 3-540-23258-3.
[7] Bayardo, R. J., Ma, Y., and Srikant, R., (2007) "Scaling up all pairs similarity search", In Proceedings of the 16th International Conference on World Wide Web, pp. 131-140.
[8] Bar-Yossef, Z., Keidar, I., and Schonfeld, U., (2007) "Do not crawl in the DUST: different URLs with similar text", Proceedings of the 16th International Conference on World Wide Web, pp. 111-120.
[9] Balamurugan, S., Newlin Rajkumar, and Preethi, J., (2008) "Design and Implementation of a New Model Web Crawler with Enhanced Reliability", Proceedings of World Academy of Science, Engineering and Technology, volume 32, August, ISSN 2070-3740.
[10] Cooley, R., Mobasher, B., and Srivastava, J., (1997) "Web mining: information and pattern discovery on the World Wide Web", Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, pp. 558-567, 3-8 November, DOI: 10.1109/TAI.1997.632303.
[11] Cho, J., Garcia-Molina, H., and Page, L., (1998) "Efficient crawling through URL ordering", Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 161-172.
[12] Cho, J., Shivakumar, N., and Garcia-Molina, H., (2000) "Finding replicated web collections", ACM SIGMOD Record, Vol. 29, no. 2, pp. 355-366.
[13] Charikar, M., (2002) "Similarity estimation techniques from rounding algorithms", In Proceedings of the 34th Annual Symposium on Theory of Computing (STOC 2002), pp. 380-388.
[14] Chakrabarti, S., (2002) "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan Kaufmann Publishers, pp. 352, ISBN 1-55860-754-4.
[15] Conrad, J. G., Guo, X. S., and Schriber, C. P., (2003) "Online duplicate document detection: signature reliability in a dynamic retrieval environment", Proceedings of the Twelfth International Conference on Information and Knowledge Management, New Orleans, LA, USA, pp. 443-452.
[16] Castillo, C., (2005) "Effective web crawling", SIGIR Forum, ACM Press, Volume 39, Number 1, pp. 55-56.
[17] David, G., Jon, K., and Prabhakar, R., (1998) "Inferring web communities from link topology", Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: links, objects, time and space-structure in hypermedia systems, Pittsburgh, Pennsylvania, United States, pp. 225-234.
[18] Deng, F., and Rafiei, D., (2006) "Approximately detecting duplicates for streaming data using stable bloom filters", Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 25-36.
[19] Fetterly, D., Manasse, M., and Najork, M., (2003) "On the evolution of clusters of near-duplicate web pages", Proceedings of the First Conference on Latin American Web Congress, pp. 37.
[20] Garcia-Molina, H., Ullman, J. D., and Widom, J., (2000) "Database System Implementation", Prentice Hall.
[21] Gibson, D., Kumar, R., and Tomkins, A., (2005) "Discovering large dense subgraphs in massive graphs", Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway.
[22] Gurmeet S. Manku, Arvind Jain, and Anish D. Sarma, (2007) "Detecting near-duplicates for web crawling", Proceedings of the 16th International Conference on World Wide Web, pp. 141-150.
[23] Hoad, T. C., and Zobel, J., (2003) "Methods for identifying versioned and plagiarized documents", JASIST, vol. 54, no. 3, pp. 203-215.
[24] Henzinger, M., (2006) "Finding near-duplicate web pages: a large-scale evaluation of algorithms", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 284-291.
[25] Kosala, R., and Blockeel, H., (2000) "Web Mining Research: A Survey", SIGKDD Explorations, vol. 2, issue 1.
[26] Lovins, J. B., (1968) "Development of a stemming algorithm", Mechanical Translation and Computational Linguistics, vol. 11, pp. 22-31.
[27] Lim, E. P., and Sun, A., (2006) "Web Mining - The Ontology Approach", International Advanced Digital Library Conference (IADLC'2005), Nagoya University, Nagoya, Japan.
[28] Menczer, F., Pant, G., Srinivasan, P., and Ruiz, M. E., (2001) "Evaluating topic-driven web crawlers", In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 241-249.
[29] Metzler, D., Bernstein, Y., and Bruce Croft, W., (2005) "Similarity Measures for Tracking Information Flow", Proceedings of the Fourteenth International Conference on Information and Knowledge Management, CIKM'05, October 31-November 5, Bremen, Germany.
[30] Pazzani, M., Nguyen, L., and Mantik, S., (1995) "Learning from hotlists and coldlists: Towards a WWW information filtering and seeking agent", Proceedings, Seventh International Conference on Tools with Artificial Intelligence, pp. 492-495.
[31] Pant, G., Srinivasan, P., and Menczer, F., (2004) "Crawling the Web", in Web Dynamics: Adapting to Change in Content, Size, Topology and Use, edited by M. Levene and A. Poulovassilis, Springer-Verlag, pp. 153-178, November.
[32] Pandey, S., and Olston, C., (2005) "User-centric Web crawling", Proceedings of the 14th International Conference on World Wide Web, pp. 401-411.
[33] Shoaib, M., and Arshad, S., "Design & Implementation of web information gathering system", NITS e-Proceedings, Internet Technology and Applications, King Saud University.
[34] Singh, A., Srivatsa, M., Liu, L., and Miller, T., (2003) "Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web", Proceedings of the SIGIR 2003 Workshop on Distributed Information Retrieval, Lecture Notes in Computer Science, Vol. 2924, pp. 126-142.
[35] Spertus, E., Sahami, M., and Buyukkokten, O., (2005) "Evaluating similarity measures: a large-scale study in the Orkut social network", In KDD 2005, pp. 678-684.
[36] Xiao, C., Wang, W., Lin, X., and Xu Yu, J., (2008) "Efficient Similarity Joins for Near Duplicate Detection", Proceedings of the 17th International Conference on World Wide Web, pp. 131-140.
[37] Yang, H., and Callan, J., (2005) "Near-duplicate detection for eRulemaking", Proceedings of the 2005 National Conference on Digital Government Research, pp. 78-86.
[38] Yang, H., Callan, J., and Shulman, S., (2006) "Next steps in near-duplicate detection for eRulemaking", Proceedings of the 2006 International Conference on Digital Government Research, Vol. 151, pp. 239-248.
