A comparison of two web-based
document management systems
Shaoxin Yu
Columbia University
March 31, 2009
Index
I. Description of the problem
II. Google Scholar
III. CiteSeer
IV. Comparison of Google Scholar and CiteSeer
Description of the problem
With the rapid growth in the quantity of online text
information, automatic text summarization plays an increasingly
important role in the information industry.
Online resources often contain similar content yet exist separately, so it is worthwhile to find efficient ways to manage this information.
Description of the problem
Background of Multi-document Summarization Techniques
Four types:
1. Free-style summarization
2. Sentence-extraction summarization
3. Axis (type of main topic)
4. Table-style summary
Description of the problem
How can documents about the same topic be summarized manually?
1. Use a marker to mark the important phrases or sentences
2. Figure out the main topics in the marked sentences, or make a list to get an overview of the documents
3. Connect these main topics
Google Scholar
1. Released in November 2004
2. Search engine for scholarly literature
3. Wide range of subject areas
Google Scholar
Unlike Google, Google Scholar does not search all publicly available Web pages. It gets its records from three sources:
1. It uses a proprietary algorithm to identify Web documents that “look scholarly”: full-text documents and citations with abstracts.
2. It adds content provided by its partners: journal publishers, scholarly societies, database vendors, and academic institutions.
3. It extracts citations from the reference lists of documents found through the first two methods.
Google Scholar
Google File System Architecture
Google Scholar
1. Chunk: a fragment of information, as used in multimedia formats; 64 MB, optimized by a statistical method
2. Metadata (stored in the master)
a. file and chunk namespaces
b. mapping from files to chunks
c. locations of each chunk’s replicas
3. Master: a single process, running on one machine, that stores all metadata
4. Communication between the master and the chunk servers: if a chunk is corrupted, the master also instructs the chunk servers to delete existing chunks and create new ones.
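The metadata kinds listed above can be pictured as a few in-memory maps held by the master. The following is a minimal Python sketch, not Google's implementation: the class name `MasterMetadata`, its methods, and the server names are hypothetical, and only the 64 MB chunk size and the kinds of metadata come from the slide.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk size, as stated on the slide


@dataclass
class MasterMetadata:
    """Hypothetical in-memory view of the metadata kept by the master."""
    # Mapping from each file path to its ordered list of chunk handles.
    # The keys of this map double as the file namespace.
    file_to_chunks: dict = field(default_factory=dict)
    # Locations of each chunk's replicas: chunk handle -> set of server names.
    chunk_locations: dict = field(default_factory=dict)
    next_handle: int = 0

    def create_file(self, path, size_bytes, servers):
        """Split a file into 64 MB chunks and record a replica on each server."""
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))  # ceiling division
        handles = []
        for _ in range(n_chunks):
            handle = self.next_handle
            self.next_handle += 1
            self.chunk_locations[handle] = set(servers)
            handles.append(handle)
        self.file_to_chunks[path] = handles
        return handles

    def locate(self, path, offset):
        """Return (chunk handle, replica servers) for a byte offset in a file."""
        handle = self.file_to_chunks[path][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

    def report_corrupted(self, handle, server):
        """Forget a corrupted replica so it is no longer handed out to clients."""
        self.chunk_locations[handle].discard(server)
```

In this sketch a client asks the master only for locations and would then read the data from a chunk server directly, which is why a single master process can hold all the metadata.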
CiteSeer
1. Public search engine for academic papers
2. Created by Steve Lawrence, Kurt Bollacker and Lee Giles
3. Developed at the NEC Research Institute, Princeton, New Jersey, USA
4. Hosted by Pennsylvania State University
5. Over 700,000 documents, primarily in computer science and engineering
CiteSeer
CiteSeer features
1. Autonomous citation indexing system
2. Indexes academic literature in PostScript or PDF files
3. Literature retrieval by following citation links
4. Evaluation and ranking of papers, authors and journals
5. Creates up-to-date databases that are not limited to preselected journals or restricted by journal publication delays
6. Autonomous operation with a corresponding reduction in cost
7. Powerful interactive browsing of the literature using the context of citations
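One way to picture citation indexing with citation context is an inverted index from each cited paper to the passages that cite it. The sketch below is illustrative Python only: the `[key]` citation marker format, the function name `build_citation_index`, and the sample papers and keys are all made up for the example and do not reflect CiteSeer's actual parser.

```python
import re
from collections import defaultdict

# Assumed marker format: citations appear in the text as [key], e.g. [Doe01].
CITATION = re.compile(r"\[([A-Za-z]+\d+)\]")


def build_citation_index(papers, window=40):
    """Map each cited key to a list of (citing paper id, context snippet)."""
    index = defaultdict(list)
    for paper_id, text in papers.items():
        for match in CITATION.finditer(text):
            # Keep up to `window` characters on each side as the context.
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            index[match.group(1)].append((paper_id, text[start:end].strip()))
    return index


# Made-up sample documents, used only to exercise the function.
papers = {
    "A": "Autonomous citation indexing [Doe01] removes manual effort.",
    "B": "We follow citation links as in [Doe01] and rank with [Roe02].",
}
index = build_citation_index(papers)
for citing_id, context in index["Doe01"]:
    print(citing_id, "->", context)
```

Looking up a key then returns every citing paper together with the surrounding text, which is the kind of browsing by citation context that the feature list describes.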
CiteSeer
Methods CiteSeer uses for computing similarity
1. Word vectors: uses the top 20 components, since the truncation may not have a large effect on the distance measures
2. String distance: uses the “LikeIt” string distance to measure the edit distance
3. Citations: uses common citations to find the research papers most closely related to the document
4. Combination of methods: CiteSeer combines the document similarity methods above
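The three individual measures can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not CiteSeer's code: raw term counts stand in for whatever term weighting CiteSeer applies, plain Levenshtein distance stands in for the “LikeIt” distance, and citation overlap is measured with a simple Jaccard ratio. All function names are hypothetical.

```python
import math
from collections import Counter


def top_components(text, k=20):
    """Word vector truncated to its k largest components (raw counts here)."""
    return dict(Counter(text.lower().split()).most_common(k))


def cosine(u, v):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def edit_distance(a, b):
    """Levenshtein distance (an assumed stand-in for the LikeIt distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def citation_overlap(cites_a, cites_b):
    """Share of citations two documents have in common (Jaccard ratio)."""
    a, b = set(cites_a), set(cites_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

A combined score could then be, for instance, a weighted sum of the three values; the slide does not say how CiteSeer weights them, so any such weights would be an assumption.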
Comparison of Google Scholar & CiteSeer
Different positioning
The core purpose of CiteSeer is to let users find complete academic papers with complete citations, free of hefty fees.
Google Scholar is Google’s product for offering a complete solution for search and other academic needs; its strategy focuses on completeness, so that it can serve as a final solution.
Comparison of Google Scholar & CiteSeer
Coverage and performance
Google Scholar uses only the first 100-120 KB of a document's text for searching, and the linked full texts often require payment.
With CiteSeer, we can trace an informative paper within CiteSeer itself, and the contributions of all the citing papers are a great help in academic work.
Comparison of Google Scholar & CiteSeer
Clicking any of the informative links leads to a single linked page
Comparison of Google Scholar & CiteSeer
Results are produced only by topic extraction
Comparison of Google Scholar & CiteSeer
On staleness, Google Scholar seems to lag behind CiteSeer.
This effect was more obvious in the early days of Google Scholar.
Nowadays, for the majority of uses, staleness is no longer a big problem for either of them.