arxiv.org: research and development directions
DESCRIPTION
his presentation describes the arXiv.org collection and users, development on authentication and access control as well as research projects in text classification and time series analysis. 16 slide presentation, Microsoft powerpoint, given at a November 2003 Information Science Open House.TRANSCRIPT
![Page 1: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/1.jpg)
ArXiv.org250,000 documents
47,000 registered users
1 million+ downloads per year
Cost Per Paper$10000 Commercial Journal
$1000 Non-Profit Journal
$10 arXiv
![Page 2: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/2.jpg)
Goal: Process increasing number of submissions at constant or declining cost
![Page 3: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/3.jpg)
arXiv has an active core of users: 10% of users are responsible for about 1/3 of all submissions, 50% of all users have logged in (to submit or update a paper) in the past 1.5 years
![Page 4: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/4.jpg)
Authentication and Access Control
Recently moved from an http authentication/Berkeley database system to a system based on cookies and a relational database.
Currently, all registered users (who haven’t been suspended) can submit to all subjects classes in all archives – the original submitter or somebody with the paper password can update the paper.
People are allowed to register depending on their E-mail address: [email protected] can register, but [email protected] can’t unless company=ibm,lucent,…; this list is hard to maintain (we have to block popular ISPs in every country), exceptions are dealt with manually at great cost (each case takes detective work), and there are many people in .edu (alumni, non-research staff) who shouldn’t be able to submit. Because registration and submission are linked, user database can’t be used to offer other services: e-mail notification, personalization.
![Page 5: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/5.jpg)
Endorsements and Trust Management
Administrators
Grandfathered Users
In new system, everyone will be able to register. Users who registered under the old system will still be able to upload to any archive or subject class, but new users will need to be endorsed by an author with a publication history in that category. Burden shifts from one senior staff person to 47,000 registered users. User database can be used
![Page 6: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/6.jpg)
Endorsee
Endorser
Endorsement
code
![Page 7: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/7.jpg)
Web-based interface for administrators:
• View user history and publications
• Monitor endorsement process
• Manage authority records
• Disable ability to submit or endorse
• Keep “institutional memory”
![Page 8: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/8.jpg)
Future Directions•Flexible Submission Queue (Currently submissions are published the following evening – we can’t easily delay a submission)
•Validating Metadata Form (Force users to clean up entry errors, so administrators don’t have to)
• Automatic Protection (Suspicious submissions and endorsements will be automatically delayed)
• New Search Engine based on Lucene
• Retrofit e-mail notification (current awareness) to use new user database.
![Page 9: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/9.jpg)
Classifying Articles with the Support Vector
MachinePaul Ginsparg
Paul Houle
Thorsten Joachims
Jae-Hoon Sul
Goal: identify papers in existing archives that are relevant to a new subject archive, q-bio (Quantitative Biology)
![Page 10: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/10.jpg)
Active Training of SVM
Training: q-bio
Training: not q-bio
Other far from margin
Other close to margin
SVM finds maximum-margin hyperplane. We do first training run on one year of data, then identify other papers that lie close to the dividing line. We iteratively classify these by hand to refine the classification
![Page 11: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/11.jpg)
![Page 12: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/12.jpg)
Classifer performance improves as the size of a category increases.
![Page 13: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/13.jpg)
![Page 14: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/14.jpg)
Time Series Analysis of Content and Usage
InformationPaul Ginsparg
Jon Kleinberg
![Page 15: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/15.jpg)
Kleinberg’s algorithm uses a hidden Markov model to detect bursts of word usage in arXiv titles, reveals intellectual trends in the last decade of high-energy physics theory.
![Page 16: Arxiv.org: Research And Development Directions](https://reader035.vdocuments.us/reader035/viewer/2022062615/54825ecfb079596a0c8b4799/html5/thumbnails/16.jpg)
Review papers have a distinctive pattern of use: an initial spike after announcement, followed by a long nearly-constant tail.
Announcement
Cited by other papers
Web Link Added