A comparison of two web-based document management systems

Shaoxin Yu, Columbia University, March 31, 2009



Page 2: Index

I. Description of the problem

II. Google Scholar

III. CiteSeer

IV. Comparison of Google Scholar and CiteSeer

Page 3: Description of the problem

With the rapid growth in the quantity of online text, automatic text summarization plays an increasingly important role in the information industry.

Online resources often contain similar content but exist separately, so it is worthwhile to find efficient ways to manage this information.

Page 4: Description of the problem

Background of Multi-document Summarization Techniques

Four types:

1. Free-style summarization

2. Sentence-extraction summarization

3. Axis (type of main topic)

4. Table-style summary

Page 5: Description of the problem

How can documents about the same topic be summarized manually?

1. Use a marker to mark the important phrases or sentences

2. Identify the main topics in the marked sentences, or make a list to get an overview of the documents

3. Connect these main topics
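The three manual steps above can be imitated by a simple sentence-extraction summarizer. The sketch below is only an illustration under stated assumptions: the word-frequency scoring, the tiny stop-word list, and the function names are all hypothetical choices, not something taken from the slides.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "are", "that"}

def split_sentences(text):
    """Split text after sentence-ending punctuation (a naive rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase alphabetic words with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def summarize(documents, max_sentences=2):
    """Mark sentences, find the main topics, then connect the marked sentences."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    # Stand-in for step 2: frequent words across all documents act as the topics.
    topic_counts = Counter(w for s in sentences for w in tokenize(s))

    # Stand-in for step 1: a sentence is "marked" according to the topic words it holds.
    def score(sentence):
        words = tokenize(sentence)
        return sum(topic_counts[w] for w in words) / len(words) if words else 0.0

    marked = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Step 3: connect the marked sentences in their original order.
    marked.sort(key=sentences.index)
    return " ".join(marked)

if __name__ == "__main__":
    docs = [
        "CiteSeer indexes academic papers. The weather is nice.",
        "Google Scholar indexes academic papers too.",
    ]
    print(summarize(docs))
```

Averaging the topic counts per sentence keeps long sentences from winning purely by length; other scoring rules would fit the same three steps equally well.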

Page 6: Google Scholar

1. Released in November 2004

2. Search engine for scholarly literature

3. Wide range of subject areas

Page 7: Google Scholar

Unlike Google, Google Scholar does not search all publicly available Web pages. It gets its records from three sources:

1. It uses a proprietary algorithm to identify Web documents that “look scholarly”: full-text documents and citations with abstracts.

2. It adds content provided by its partners: journal publishers, scholarly societies, database vendors, and academic institutions.

3. It extracts citations from the reference lists of documents found through the first two methods.

Page 8: Google Scholar

Google File System Architecture

Page 9: Google Scholar

1. Chunk: a fixed-size fragment of a file. Each chunk is 64 MB, a size optimized by a statistical method.

2. Metadata (stored in the master):

a. file and chunk namespaces

b. mapping from files to chunks

c. locations of each chunk’s replicas

3. Master: a single process running on a machine that stores all metadata.

4. Communication between the master and the chunk servers: if a chunk is corrupted, the master also sends instructions to the chunk servers to delete existing chunks and create new ones.
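The metadata kept by the master can be pictured as two in-memory maps. The following is a minimal sketch, not the actual Google File System code: the class name `Master`, its methods, the round-robin replica placement, and the default of three replicas are assumptions made for illustration; only the 64 MB chunk size and the three kinds of metadata come from the slide.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk size, as stated on the slide

@dataclass
class Master:
    """Toy single-process master that holds all metadata in memory."""
    file_to_chunks: dict = field(default_factory=dict)   # file path -> list of chunk handles
    chunk_locations: dict = field(default_factory=dict)  # chunk handle -> set of server names
    next_handle: int = 0

    def create_file(self, path, size_bytes, servers, replicas=3):
        """Register a file, splitting it into fixed-size chunks with replicas."""
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))  # ceiling division
        handles = []
        for i in range(n_chunks):
            handle = self.next_handle
            self.next_handle += 1
            # Round-robin placement: a hypothetical policy, chosen only for illustration.
            count = min(replicas, len(servers))
            self.chunk_locations[handle] = {servers[(i + r) % len(servers)] for r in range(count)}
            handles.append(handle)
        self.file_to_chunks[path] = handles
        return handles

    def lookup(self, path, offset):
        """Translate (file, byte offset) into (chunk handle, replica locations)."""
        handle = self.file_to_chunks[path][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

    def report_corrupt(self, handle, bad_server, servers):
        """Drop a corrupted replica and pick a server that should host a new copy."""
        holders = self.chunk_locations[handle]
        holders.discard(bad_server)
        spare = next((s for s in servers if s not in holders and s != bad_server), None)
        if spare is not None:
            holders.add(spare)
        return spare

if __name__ == "__main__":
    servers = ["cs1", "cs2", "cs3", "cs4"]
    master = Master()
    master.create_file("/logs/a", 150 * 1024 * 1024, servers)
    print(master.lookup("/logs/a", 70 * 1024 * 1024))
    print(master.report_corrupt(1, "cs2", servers))
```

Keeping both maps in one process mirrors the single-master design described above: a client asks the master where a chunk lives and then talks to the chunk servers directly.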

Page 10: CiteSeer

1. Public search engine for academic papers

2. Created by Steve Lawrence, Kurt Bollacker and Lee Giles

3. Developed at the NEC Research Institute, Princeton, New Jersey, USA

4. Hosted by Pennsylvania State University

5. Over 700,000 documents, primarily in computer science and engineering

Page 11: CiteSeer

CiteSeer features

1. Autonomous citation indexing system

2. Indexes academic literature in PostScript or PDF files

3. Literature retrieval by following citation links

4. Evaluation and ranking of papers, authors and journals

5. Creates up-to-date databases that are not limited to preselected journals or restricted by journal publication delays

6. Autonomous operation with a corresponding reduction in cost

7. Powerful interactive browsing of the literature using the context of citations

Page 12: CiteSeer

Methods CiteSeer uses for computing similarity

1. Word vectors: use the top 20 components, since the truncation may not have a large effect on the distance measures

2. String distance: use the “LikeIt” string distance to measure the edit distance

3. Citations: use common citations to find the research papers most closely related to the document

4. Combination of methods: CiteSeer combines the document similarity methods above
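The similarity methods listed above can be sketched in a few functions. This is a rough illustration rather than CiteSeer's implementation: TF-IDF weighting with cosine similarity is assumed for the word vectors, plain Levenshtein distance stands in for the "LikeIt" distance, Jaccard overlap stands in for the common-citation measure, and the equal weights in `combined` are hypothetical.

```python
import math
from collections import Counter

def top_components(tokens, doc_freq, n_docs, k=20):
    """TF-IDF word vector truncated to its k largest components."""
    tf = Counter(tokens)
    weights = {w: c * math.log(n_docs / doc_freq[w]) for w, c in tf.items() if doc_freq.get(w)}
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k])

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in for the LikeIt distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def string_similarity(a, b):
    """Edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0

def citation_similarity(cites_a, cites_b):
    """Share of cited papers that two documents have in common (Jaccard overlap)."""
    union = cites_a | cites_b
    return len(cites_a & cites_b) / len(union) if union else 0.0

def combined(word_sim, string_sim, cite_sim, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of the three measures; the weights are hypothetical."""
    return weights[0] * word_sim + weights[1] * string_sim + weights[2] * cite_sim

if __name__ == "__main__":
    print(edit_distance("citeseer", "citeseerx"))
    print(citation_similarity({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))
```

Truncating each vector to 20 components keeps the per-document storage small, which is the trade-off the first item describes.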

Page 13: Comparison of Google Scholar & CiteSeer

Different positioning

The core purpose of CiteSeer is to let users search for complete academic papers with complete citations, free of hefty fees.

Google Scholar is Google’s product for offering a complete solution for searching and other academic needs; its strategy focuses on completeness, so that it can serve as a final solution.

Page 14: Comparison of Google Scholar & CiteSeer

Coverage and performance

Google Scholar uses only the first 100-120 KB of a document’s text for searching, and following its links generally requires payment.

CiteSeer lets us trace an informative paper within the system itself, and the contributions of all the citing papers are a great help in academic work.
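The effect of the indexing cutoff mentioned above for Google Scholar can be shown with a small sketch. It assumes a simple term-matching search over a byte-limited prefix; the constant `INDEX_LIMIT_BYTES` and both function names are hypothetical, and real search engines are far more involved.

```python
import re

INDEX_LIMIT_BYTES = 100 * 1024  # lower end of the 100-120K byte range on the slide

def indexable_terms(text, limit=INDEX_LIMIT_BYTES):
    """Return the set of terms found within the first `limit` bytes of the text."""
    # Cut on bytes, then drop any partial character left at the cut point.
    head = text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")
    return set(re.findall(r"[a-z0-9]+", head.lower()))

def matches(query, text, limit=INDEX_LIMIT_BYTES):
    """True if every query term falls inside the indexed prefix."""
    return set(query.lower().split()) <= indexable_terms(text, limit)

if __name__ == "__main__":
    paper = "citation indexing " + "filler " * 20000 + "appendix zebra"
    print(matches("citation indexing", paper))  # terms near the start of the text
    print(matches("zebra", paper))              # term beyond the cutoff
```

Under this assumption, anything that appears only late in a long document, such as an appendix, would not be found by a search.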

Page 15: Comparison of Google Scholar & CiteSeer

Clicking any of the informative links connects to a single link.

Page 16: Comparison of Google Scholar & CiteSeer

Results are provided only through topic extraction.

Page 17: Comparison of Google Scholar & CiteSeer

On the matter of staleness, Google Scholar appears to fare worse than CiteSeer.

This effect was more obvious in the early days of Google Scholar.

Nowadays, for the majority of uses, staleness is no longer a big problem for either of them.
