Online Autonomous Citation Management for CiteSeer: CSE598B Course Project, by Huajing Li
Introduction to CiteSeer
Software package developed at NEC-Labs
Domain Independent Software for Automatic Citation Indexing (ACI)
Focus is on scholarly publications in electronic format (PS / PDF and variants)
Performs:
– Document Discovery / Retrieval / Parsing
– Automatic Citation Extraction
– Document & Citation Indexing / Search
[Figure: CiteSeer system architecture. Processing stages: Crawler, Retrieval, Conversion, Parsing & Meta-Data Extraction, Indexing, Web Server. Storage labels: Meta-Data Database, PDBM_File & Chunk Tables, Indexes, CD, Document Database, File System. Data items: Document (PDF/PS), Document URL, Document (Plain Text), Document Body Text, Document Meta-data Set (DID, Title, Authors, etc.), N Citation Texts, N Citation Meta-data Sets (CID, GID, Title, Authors, etc.).]
Submitting Documents
Output of Crawl / User Submission is URL of page linking to document.
These URLs are dumped in Paper Table
Paper Table maintains status for each document:
– Downloaded/undownloaded
– Processed/unprocessed
– Other processing errors (tooshort/noreference/etc.)
CiteSeer regularly scans this table to start download of new documents
Only documents meeting typical pattern of scholarly publications are eventually added to the collection
Document Structure Identification
– Title
– Subject (keywords)
– Description (abstract)
– Author names
– Author affiliations
– Author address, email, phone, Homepage URL
– Publication date, Publication number
– Archive date
– Contributor
– Type
– Format
– Identifier
– Source
– Publisher
– Journal/Conference
– Pages
– Relation
• References
• Is Referenced By
Field sources marked on the slide: From document header; System info; From citation graph
Citations Grouping
Citations to same document have common Group ID
– Each Group ID has a set of keys associated to it, based on citation information
– {authorkey1-titlekey; … authorkey2-titlekey}
• For every single word in the authors information there is an authorkey
• For a given citation, titlekey is unique and is concatenation of all title words
Citations Grouping
For newly discovered citation
– Extract
• Authors: C. Lee Giles, S. Lawrence
• Title: “Good Paper Title”
– Generate keys {giles-goodpapertitle; lee-goodpapertitle; lawrence-goodpapertitle}
– Try to match at least one of them with existing Group ID key
• If there is a match, add this citation (Citation ID) to the group
• Otherwise create a new Group ID for this citation
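The grouping steps above can be sketched in a few lines of Python. This is only an illustration, not CiteSeer's actual code: the names `title_key`, `author_keys`, `citation_keys`, and `CitationGrouper` are hypothetical, and skipping single-letter initials is an assumption inferred from the slide's example (which has no keys for "C." or "S.").

```python
import re
from itertools import count

def title_key(title):
    # titlekey: all title words concatenated, lowercased, punctuation dropped.
    return "".join(re.findall(r"[a-z0-9]+", title.lower()))

def author_keys(authors):
    # One authorkey per word in the author strings.
    # Assumption: single-letter tokens (initials) are skipped.
    words = []
    for name in authors:
        for word in re.findall(r"[a-z]+", name.lower()):
            if len(word) > 1 and word not in words:
                words.append(word)
    return words

def citation_keys(authors, title):
    tkey = title_key(title)
    return {f"{akey}-{tkey}" for akey in author_keys(authors)}

class CitationGrouper:
    """Hypothetical in-memory stand-in for the Group ID key tables."""

    def __init__(self):
        self.key_to_gid = {}   # authorkey-titlekey -> Group ID
        self.members = {}      # Group ID -> list of Citation IDs
        self._next_gid = count(1)

    def add(self, cid, authors, title):
        keys = citation_keys(authors, title)
        # Try to match at least one key against an existing group.
        # Keys are checked in sorted order so the result is deterministic.
        gid = next((self.key_to_gid[k] for k in sorted(keys)
                    if k in self.key_to_gid), None)
        if gid is None:
            # No match: create a new Group ID for this citation.
            gid = next(self._next_gid)
            self.members[gid] = []
        self.members[gid].append(cid)
        for k in keys:
            self.key_to_gid.setdefault(k, gid)
        return gid

grouper = CitationGrouper()
first = grouper.add(1, ["C. Lee Giles", "S. Lawrence"], "Good Paper Title")
second = grouper.add(2, ["S. Lawrence"], "Good paper title.")
third = grouper.add(3, ["A. Author"], "Some Other Title")
```

In this sketch citations 1 and 2 share the key lawrence-goodpapertitle and land in the same group, while citation 3 starts a new one.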
Linking Citations to Documents
Citation ID->Group ID
– We just saw that …
Document ID->Group ID
– Based on document’s metadata, generate authorkey-titlekey in the same way and try to match a Group ID key generated from the citations
– Document metadata can be erroneous, so successful mapping often happens AFTER correction by users
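A minimal sketch of the Document ID->Group ID step, assuming the same key scheme as for citations. The helper names (`make_keys`, `link_document`), the dictionary layout, and the sample Group ID are all hypothetical; the misspelled author name illustrates why a mapping may only succeed after a user corrects the metadata.

```python
import re

def make_keys(authors, title):
    # Same authorkey-titlekey scheme as for citations.
    # Assumption: single-letter initials do not produce keys.
    tkey = "".join(re.findall(r"[a-z0-9]+", title.lower()))
    words = {w for a in authors
             for w in re.findall(r"[a-z]+", a.lower()) if len(w) > 1}
    return {f"{w}-{tkey}" for w in words}

def link_document(doc_meta, key_to_gid):
    # Return the Group ID of the first matching key, or None if no key matches.
    for key in sorted(make_keys(doc_meta["authors"], doc_meta["title"])):
        if key in key_to_gid:
            return key_to_gid[key]
    return None

# Keys previously generated from citations (hypothetical Group ID 7).
key_to_gid = {"giles-goodpapertitle": 7, "lawrence-goodpapertitle": 7}

extracted = {"authors": ["C. L. Gilles"], "title": "Good Paper Title"}  # parse error
corrected = {"authors": ["C. L. Giles"], "title": "Good Paper Title"}   # user fix

before = link_document(extracted, key_to_gid)
after = link_document(corrected, key_to_gid)
```

Here the erroneous record fails to link because gilles-goodpapertitle matches no group key, and the corrected record links to group 7.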
Problems of the Current Approach
There is no guarantee that the most similar citation contains the best metadata
Building citation graph is a time-intensive, offline task
Due to batch clustering, the addition of a single citation requires rebuilding the entire citation graph to include the new instance
The so-called canonical metadata is fixed to the document record
Goals of the New Citation Management System
Provide better document metadata
Reduce the cost of maintenance
Use on-line citation matching such that the citation graph environment can be adjusted immediately based on a single new citation
Provide a fluid framework for building canonical metadata in which all evidence is always considered
Allow the development of flexible APIs into CiteSeer citation graph system
Maintain data security despite an open, wiki-like approach to user-contributed metadata changes
Provide better citation matching compared to the current system
Prototype Overview
[Figure: prototype architecture. Components: Query, QueryHandler, CitationResolver, Document Metadata Index, Citation Metadata Index, Document Metadata (XML), Citation Metadata (XML), Edge DB (SQL). Annotation on the slide: "May ultimately be located in separate service".]
Edge DB
One simple table containing one edge per row:
– Id: citation handle (equivalent to CID)
– citingDoc: citing document handle
– citedDoc: cited document handle
Row-level locking
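A sketch of what such an edge table might look like, using Python's built-in `sqlite3` module purely for illustration. The slides do not name the SQL engine; SQLite locks at the database level, so the row-level locking mentioned above would need a different engine. The table name `edges`, the helper names, the index, and the nullable `citedDoc` (for a citation not yet resolved to a document) are assumptions.

```python
import sqlite3

def open_edge_db(path=":memory:"):
    conn = sqlite3.connect(path)
    # One edge per row: citation handle, citing document, cited document.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS edges (
               id        TEXT PRIMARY KEY,  -- citation handle (equivalent to CID)
               citingDoc TEXT NOT NULL,     -- citing document handle
               citedDoc  TEXT               -- cited document handle (assumed nullable)
           )"""
    )
    # Assumed index to make "who cites this document?" lookups cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_cited ON edges (citedDoc)")
    return conn

def add_edge(conn, cid, citing_doc, cited_doc=None):
    # Adding one citation touches one row; nothing else is rebuilt.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO edges (id, citingDoc, citedDoc) VALUES (?, ?, ?)",
            (cid, citing_doc, cited_doc),
        )

def cited_by(conn, doc):
    rows = conn.execute(
        "SELECT citingDoc FROM edges WHERE citedDoc = ? ORDER BY citingDoc", (doc,)
    )
    return [row[0] for row in rows]

conn = open_edge_db()
add_edge(conn, "c1", "docA", "docB")
add_edge(conn, "c2", "docC", "docB")
citing = cited_by(conn, "docB")
```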
Matching citations and docs
Exact string match across disparate metadata fields is way too optimistic; need better matching criteria
Lucene provides two methods out of the box:
– Match based on Levenshtein distance
• Specify arbitrary distance cut-off per field
• Choose most similar match out of returned set
– Cut out the middleman: similarity-based matching
• Specify arbitrary similarity threshold
• Choose most similar match out of returned set over threshold
Criteria to be determined through empirical tests using prototype system.
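The two strategies can be illustrated in plain Python. This sketch does not use Lucene and does not reproduce its scoring: the function names, the record layout, the cut-off and threshold values, and the normalization of edit distance into a similarity score are all assumptions made for the example.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Assumed normalization: 1.0 for identical strings, 0.0 for fully different.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def best_by_distance(query, candidates, cutoffs):
    # Keep candidates within the per-field distance cut-off,
    # then choose the one with the smallest total distance.
    best = None
    for cand in candidates:
        dists = {f: levenshtein(query[f].lower(), cand[f].lower()) for f in cutoffs}
        if all(dists[f] <= cutoffs[f] for f in cutoffs):
            total = sum(dists.values())
            if best is None or total < best[0]:
                best = (total, cand)
    return None if best is None else best[1]

def best_by_similarity(query, candidates, field, threshold):
    # Choose the most similar candidate whose score is over the threshold.
    scored = [(similarity(query[field].lower(), c[field].lower()), c)
              for c in candidates]
    scored = [s for s in scored if s[0] >= threshold]
    return max(scored, key=lambda s: s[0])[1] if scored else None

docs = [
    {"title": "Good Paper Title", "authors": "C. Lee Giles"},
    {"title": "Another Paper", "authors": "S. Lawrence"},
]
query = {"title": "Good Papr Title", "authors": "C. L. Giles"}

by_distance = best_by_distance(query, docs, {"title": 3, "authors": 3})
by_similarity = best_by_similarity(query, docs, "title", 0.8)
```

Both calls select the first record despite the typo in the title and the abbreviated author name; tightening the cut-off or raising the threshold trades recall for precision, which is what the empirical tests would have to settle.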