Similar Document Retrieval and Analysis in Information Retrieval System based on correlation method for full text indexing

Upload: lee-chase

Post on 01-Jan-2016


TRANSCRIPT

Page 1: Similar Document Retrieval and Analysis in Information Retrieval System based on correlation method for full text indexing


Page 2

Searching similar documents

Searching for similar documents, that is, documents whose content is similar to a query, is a new, forward-looking technology.

In the correlation method, correlations between words or ASCII symbols are taken into account when creating the full-text index of an archive of electronic documents.

This makes it possible to pick out automatically the terminology typical of the documents indexed in the archive.

When ASCII symbols are indexed, similar-document retrieval is language independent.
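As a rough illustration of why symbol-level indexing is language independent, the sketch below indexes plain character trigrams. This is a generic n-gram index, not the correlation method of the slides, whose details are not given here; the names `char_ngrams` and `build_index` and the sample pages are hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Yield every run of n consecutive characters in text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def build_index(pages, n=3):
    """Map each character n-gram to a Counter of {page_id: occurrences}."""
    index = defaultdict(Counter)
    for page_id, text in pages.items():
        for gram in char_ngrams(text.lower(), n):
            index[gram][page_id] += 1
    return index


# Hypothetical sample pages in two languages; no tokenizer or dictionary is needed.
pages = {
    "p1": "full text indexing of documents",
    "p2": "индексирование документов",
}
index = build_index(pages)
print(dict(index["tex"]))
print(dict(index["док"]))
```

Because the index is built from raw characters, the same code handles English and Russian text without any language-specific processing.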

Page 3

High relevance of the document retrieval

This technology increases the relevance of document retrieval, solves the problems of fuzzy informational content, consolidates information from various resources, and generates a report on the similarity of documents already stored in the database, that is, detects duplicate documents.
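A similarity report of this kind could look roughly like the sketch below. It swaps in a standard technique, Jaccard similarity over character shingles, because the slides do not specify how document similarity is computed; the function names, the threshold, and the sample documents are all assumptions.

```python
from itertools import combinations


def shingles(text, n=5):
    """Return the set of character n-grams after normalizing case and whitespace."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_report(docs, threshold=0.8, n=5):
    """List (doc, doc, score) for every pair whose similarity reaches the threshold."""
    sets = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    report = []
    for left, right in combinations(sorted(sets), 2):
        score = jaccard(sets[left], sets[right])
        if score >= threshold:
            report.append((left, right, score))
    return sorted(report, key=lambda row: -row[2])


# Hypothetical archive: "a" and "b" differ only in case and spacing.
docs = {
    "a": "Full text indexing of archives",
    "b": "full  text indexing of archives",
    "c": "Quantum magazine",
}
report = similarity_report(docs)
print(report)
```

The pairwise comparison is quadratic in the number of documents, so this form is only suitable for small archives.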

Page 4

Natural language, full page query

A sentence in natural language, a paragraph, or even a whole page of text can be submitted as the search query.

The search query passed to the similar-document search is encoded using the available expanded alphabet.
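One way to picture this encoding step is a greedy longest-match split of the query into symbols of an expanded alphabet. The slides do not say how the alphabet is built or how the coding works, so the `encode` function and the sample alphabet below are purely illustrative assumptions.

```python
def encode(text, alphabet):
    """Greedily split text into the longest symbols found in alphabet.

    A character not covered by any multi-character symbol is emitted on its own.
    """
    max_len = max((len(symbol) for symbol in alphabet), default=1)
    symbols = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in alphabet:
                symbols.append(piece)
                i += size
                break
    return symbols


# Hypothetical expanded alphabet: frequent character sequences plus the space.
alphabet = {"text", "index", "ing", " "}
encoded = encode("text indexing", alphabet)
print(encoded)  # ['text', ' ', 'index', 'ing']
```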

Page 5

Relevance criteria

From the list of symbols for each indexed page, the following sum is calculated:

$$P_i = \sum_{k=1}^{N} \mathrm{length}(k) \cdot \mathrm{count}(k)$$

The obtained P_i values are then sorted, and the pages with the highest values are returned to the user as search results.
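The ranking step can be sketched as follows. The slides do not define the terms of the sum, so this example assumes that k runs over the distinct symbols of the query, that length(k) is the symbol's length in characters, and that count(k) is the number of times it occurs on page i. The names `page_score` and `rank` and the sample data are hypothetical.

```python
from collections import Counter


def page_score(page_symbols, query_symbols):
    """Sum length(k) * count(k) over the distinct query symbols k (assumed reading)."""
    counts = Counter(page_symbols)
    return sum(len(k) * counts[k] for k in set(query_symbols))


def rank(pages, query_symbols, top=10):
    """Return up to `top` (page_id, score) pairs, highest score first."""
    scored = [(page_score(symbols, query_symbols), page_id)
              for page_id, symbols in pages.items()]
    # Sort by descending score; ties are broken by page id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(page_id, score) for score, page_id in scored[:top]]


# Hypothetical pages, already encoded as lists of symbols.
pages = {
    "page1": ["text", " ", "index", "ing"],
    "page2": ["index", " ", "index", "es"],
    "page3": ["chaos"],
}
ranking = rank(pages, ["text", "index"])
print(ranking)  # [('page2', 10), ('page1', 9), ('page3', 0)]
```

Weighting each match by symbol length means that long shared sequences, which are less likely to match by chance, contribute more to the score than short ones.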

Page 6

Software products of Controlling Chaos Technologies Ltd.

The text processing method described here is implemented and used in the software products of Controlling Chaos Technologies Ltd., namely CCT Archive and CCT Publisher.

These products are intended for creating electronic archives of unstructured documents with full-text search, and for creating and preparing electronic books, encyclopedias, and magazine archives for CD and DVD.

Examples of successful application are the electronic archives of the well-known Russian magazines "Chemistry and the Life", "Quantum", and "Znanie - Sila".

Page 7

Archive of the magazine "Quantum"

The next slide shows the results of the search system working with the electronic archive of the magazine "Quantum" as an example.

At the upper left is the natural-language query on which the search was carried out; below it is the ranked list of the documents found. To the right is the document page with the matches highlighted.

Page 8

Archive of the magazine "Quantum"

[Screenshot: search results for the electronic archive of the magazine "Quantum"]

Page 9

Basic time characteristics

Below are the basic time characteristics achieved with the current software implementation of the algorithms described.

All values were obtained on an ordinary personal computer. Text size means the number of ASCII symbols in the text, not the size of the files containing it.

Page 10

Basic time characteristics

The maximum size of the indexed text is about 1 Gb.

The text indexing rate is about 1 Mb per min.

The time to open an index is no more than 1 min.

The search time is about 1 sec.
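Taken together, these figures imply how long it would take to index an archive of the maximum size. The short calculation below assumes that "Gb" and "Mb" mean gigabytes and megabytes with 1 Gb = 1024 Mb, and that the indexing rate stays constant over the whole archive; neither assumption is stated on the slide.

```python
# Figures from the slide (units interpreted as megabytes and gigabytes).
index_rate_mb_per_min = 1.0
max_text_mb = 1024.0  # assumption: 1 Gb = 1024 Mb

# Time to index a maximum-size archive at a constant rate.
full_index_minutes = max_text_mb / index_rate_mb_per_min
full_index_hours = full_index_minutes / 60

print(f"{full_index_minutes:.0f} min, about {full_index_hours:.1f} hours")
```

Under these assumptions, building the largest supported index takes about 17 hours, while each search against it takes about a second.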

Page 11

Rubrication and text clusterization

It should be noted that the technology being developed is not language dependent and can be adapted to any language system.

Further development of the ideas behind similar-document search makes it possible to solve problems such as plagiarism detection, rubrication and text clusterization, and Internet content filtering.