Topical Categorization of Large Collections of Electronic Theses and Dissertations. Venkat Srinivasan & Edward A. Fox, Virginia Tech, Blacksburg, VA, USA. ETD 2009, June 11, 2009


Page 1: Topical Categorization of Large Collections of Electronic Theses and Dissertations Venkat Srinivasan & Edward A. Fox Virginia Tech, Blacksburg, VA, USA

Topical Categorization of Large Collections of Electronic Theses and Dissertations

Venkat Srinivasan & Edward A. Fox

Virginia Tech, Blacksburg, VA, USA

ETD 2009, June 11, 2009

Page 2: Outline

- Introduction
- Goals
- Approach
- Results
- Future Work

Page 3: Introduction - great source

- Electronic submission of dissertations is increasingly preferred.
- ETDs are a great information source:
  - Substantial amount of research on a topic
  - Thorough literature review
  - Pointers to other resources (reference section)

Page 4: Introduction - under-utilized

- Yet ETDs are under-utilized.
  - Research papers, books, etc. are still the major (and in some cases the only) sources of information for most people.
  - Most people (except grad students trained in this) don't even think about reading a dissertation!
- Possible causes:
  - Access to ETDs is not streamlined.
  - Users don't know where to look for ETDs.
  - ETDs of interest could be buried in search engine results.
  - Some universities do not allow outside access to their ETD collection.

Page 5: Introduction - needs

- Efforts have been made to make ETDs more accessible.
  - NDLTD, VTLS, Scirus, etc. provide means of access to ETDs from different universities.
- These services are not very feature-rich or convenient:
  - Users search for ETDs based on keywords.
  - Users don't know what lies underneath (no idea about the size, topical coverage, etc. of ETD collections).
  - They are not very amenable to browsing (users have to sift through search results).

Page 6: Goals

- Provide a portal to the ETD collections of more universities
- Provide value-added services:
  - Categorize by topic
  - Support searching and browsing the collection using various criteria (by topic, keywords, date, author, etc.)

Page 7: Goals - priorities

- Set up infrastructure for crawling ETDs of various universities
- Come up with techniques for categorizing them into topical areas
- Set up a user-friendly search and browse interface

Page 8: Approach

- Crawl ETDs from various universities
- Develop a taxonomy
- Categorize ETDs into topics in the taxonomy tree
- Index the ETDs
- Develop a search and browse interface

Page 9: Approach - crawling

- NDLTD's Union Catalog used as the starting point
- Dublin Core metadata gathered
- URLs used to crawl ETDs and other data from the respective universities' websites
- Custom crawlers written
  - Technologies used: Perl and other open source Perl libraries (WWW, Mechanize, etc.)
- All metadata (Dublin Core metadata from the Union Catalog, and the metadata obtained from the respective universities) is stored in our MySQL backend database.
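
The metadata-gathering step could look roughly like the sketch below. It is an illustration only: the crawlers described above were written in Perl, Python is used here for brevity, the sample record and its URL are invented, and the helper names `parse_dublin_core` and `crawl_urls` are hypothetical. It assumes records arrive in the common `oai_dc` XML form.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record, invented for illustration only.
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Dissertation</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>Computer Science</dc:subject>
  <dc:subject>Digital Libraries</dc:subject>
  <dc:identifier>http://example.edu/etd/1234</dc:identifier>
</oai_dc:dc>
"""


def parse_dublin_core(xml_text):
    """Return {field name: [values]} for every Dublin Core element in a record."""
    root = ET.fromstring(xml_text.strip())
    prefix = "{" + DC_NS + "}"
    fields = {}
    for element in root.iter():
        # Keep only elements in the Dublin Core namespace that carry text.
        if element.tag.startswith(prefix) and element.text:
            name = element.tag[len(prefix):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


def crawl_urls(fields):
    """Pick out identifier values that look like URLs a crawler could follow."""
    return [value for value in fields.get("identifier", [])
            if value.startswith(("http://", "https://"))]


record = parse_dublin_core(SAMPLE_RECORD)
print(record["title"], crawl_urls(record))
```

Fields are kept as lists because Dublin Core elements such as `subject` may repeat within one record.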

Page 10: Approach - taxonomy

- Need medium generality and specificity, as opposed to the taxonomies from ProQuest, DMOZ, or Wikipedia
  - For example, DMOZ has more than 500,000 nodes!
- Solution: prune the DMOZ category tree, and then enhance it using the ProQuest categorization system.
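
One simple way to picture "prune, then enhance" is sketched below. The slides do not say how the pruning was actually done, so the depth cutoff, the nested-dict representation, the toy tree, and the function names `prune`, `enhance`, and `count_nodes` are all assumptions made for illustration.

```python
def prune(tree, max_depth):
    """Return a copy of a nested-dict category tree cut off below max_depth levels."""
    if max_depth <= 0:
        return {}
    return {label: prune(children, max_depth - 1)
            for label, children in tree.items()}


def enhance(tree, parent, extra_labels):
    """Add labels from a second scheme under an existing top-level node (in place)."""
    for label in extra_labels:
        tree[parent].setdefault(label, {})


def count_nodes(tree):
    """Count every node in the nested-dict tree."""
    return sum(1 + count_nodes(children) for children in tree.values())


# Toy tree, invented for illustration; a real web directory is far larger.
toy = {
    "Science": {"Physics": {"Optics": {"Lasers": {}}}, "Biology": {}},
    "Arts": {"Visual Arts": {}},
}

pruned = prune(toy, 2)                        # keep only the top two levels
enhance(pruned, "Arts", ["Art History"])      # graft a label from another scheme
print(count_nodes(toy), count_nodes(pruned))
```

A depth cutoff is only the crudest pruning rule; dropping sparsely populated or off-topic branches would be another option.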

Page 11: Approach - taxonomy levels

Category tree (only top 2 levels shown):

Level 1: TOP
Level 2: Arts, Business, Computers, Health, Science, Society

Page 12: Approach - categorize

- Supervised classification approach used
- Training set built by using topic labels as queries to Google
  - 50 webpages retrieved and used for training a Naïve Bayes classifier for each node (to distinguish between its children)
- ETD metadata used for categorization
- Level-wise categorization
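
For readers unfamiliar with the classifier, a minimal multinomial Naïve Bayes with add-one smoothing is sketched below. It is not the authors' implementation: the class name, the whitespace tokenization, and the toy training texts (standing in for retrieved webpages) are assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.word_counts = {}              # label -> Counter of word frequencies
        self.doc_counts = Counter(labels)  # label -> number of training documents
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training texts, invented for illustration.
training_texts = [
    "algorithm software programming network",
    "software compiler database algorithm",
    "disease patient clinical medicine",
    "patient nursing hospital medicine",
]
training_labels = ["Computers", "Computers", "Health", "Health"]

classifier = NaiveBayes().fit(training_texts, training_labels)
print(classifier.predict("clinical study of hospital patient outcomes"))
```

In the setting described above, one such classifier would be trained per tree node, with that node's children as the labels.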

Page 13: Approach (contd.) - algorithm pipeline

[Figure: algorithm pipeline]

Training:
- Category tree: the category label for each node is used as a query to Google
- Document sets: top 50 webpages (for each node in the tree)
- Cleanup (stemming, stopword removal, etc.) produces the training sets
- Training sets are used to train the Naïve Bayes classifiers

Categorization and browsing:
- ETD collection: ETD metadata used for categorization
- Naïve Bayes classifiers perform level-wise categorization
- Categorized ETDs: each ETD is categorized into a node of the category tree (after classification)
- Web interface for browsing
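
The cleanup and level-wise steps of the pipeline can be sketched as follows. This is a simplified illustration, not the system itself: a keyword-overlap scorer stands in for the per-node Naïve Bayes classifiers, the plural stripping is a crude stand-in for real stemming, and the stopword list, tree contents, and function names are hypothetical.

```python
STOPWORDS = {"a", "an", "and", "of", "the", "in", "for", "on", "to"}


def clean(text):
    """Lowercase, drop stopwords, and strip a trailing plural 's' (crude stemming)."""
    tokens = []
    for token in text.lower().split():
        token = token.strip(".,;:()\"'")
        if token and token not in STOPWORDS:
            if token.endswith("s") and len(token) > 3:
                token = token[:-1]
            tokens.append(token)
    return tokens


def choose_child(tokens, children):
    """Stand-in per-node classifier: pick the child whose vocabulary overlaps most."""
    return max(children, key=lambda name: len(set(tokens) & children[name]["vocab"]))


def categorize(text, node, max_levels):
    """Walk down the tree one level at a time, returning the path of chosen labels."""
    tokens, path = clean(text), []
    while node.get("children") and len(path) < max_levels:
        label = choose_child(tokens, node["children"])
        path.append(label)
        node = node["children"][label]
    return path


# Toy tree with invented vocabularies; only the category labels come from the slides.
TREE = {"children": {
    "Computers": {"vocab": {"software", "algorithm", "network"}, "children": {}},
    "Arts": {"vocab": {"painting", "theatre", "music"}, "children": {
        "Visual Arts": {"vocab": {"painting", "sculpture"}, "children": {}},
        "Performing Arts": {"vocab": {"theatre", "music", "dance"}, "children": {}},
    }},
}}

print(categorize("Music and dance in the theatres of Vienna", TREE, 2))
```

Because each step only chooses among the children of the current node, a mistake at an upper level cannot be corrected further down; that is the usual trade-off of top-down categorization.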

Page 14: Results

- Crawled metadata for all the ETDs from the NDLTD Union Catalog
  - ~800,000 ETDs in Union Catalog
  - 15 Dublin Core fields extracted and stored
- Crawled ~200,000 dissertations from the respective universities (where permissible); indexing is in progress
  - Technology used: Lucene search engine
- More dissertations being crawled

Page 15: Results (contd.)

- Enhanced taxonomy developed
  - Some subtrees are shown in the following few slides.
  - The taxonomy currently is 4 levels deep and has ~200 nodes.
  - It is being enhanced to be 5-6 levels deep.

Page 16: Results (contd.)

Enhanced taxonomy (some nodes from the "Arts" subtree shown):

Level 2: Arts
Level 3: Speech, Literature, Art History, Classical Studies, Visual Arts, Performing Arts, …

Page 17: Results (contd.)

Enhanced taxonomy (some nodes from the "Business" subtree shown):

Level 2: Business
Level 3: E-Commerce, Accounting, Human Resources, Investing, Banking, Management, …

Page 18: Results (contd.)

- Categorized >74K ETDs from 8 universities
  - MIT, Virginia Tech, Caltech, NCSU, Georgia Tech, Ohiolink, Rice, Texas A&M
- Categorized into 6 topical areas (Arts, Business, Computers, Health, Science, Society)
- Categorization into lower levels of the category tree (levels 3 and 4) is in progress

Page 19: Results (contd.)

University      Total ETDs   Arts   Business   Computers   Health   Science   Society
MIT                  29804    653       1847        6507      375      7141       555
Virginia Tech        11976    742        627        2665     1218      3317       340
Ohiolink              8020   1056        350        1267     1322      2887       345
Rice                  6685    937        235        1181      145      2412        62
NCSU                  5026    283        245        1419      512      2436       114
Texas A&M             4834    302        363        1363      566      2115       125
CalTech               4774     58         52        1392       29      3096        18
Georgia Tech          3582     32        133        1348       85      1233        23
TOTAL                74701   4063       3852       17142     4252     24637      1582

Page 20: Results (contd.)

- Algorithm is time efficient.
  - Training the classifier is done offline.
  - Classification is fast.
  - Classifying this collection of ~74,000 ETDs took <30 mins.
- Hopefully classifiers developed can be applied to other data and in other systems.

Page 21: Future Work

- Increase coverage
  - Crawl more ETDs
  - Collaborate with universities and consortia to gain access to ETD collections
- Better categorization approaches
  - Leverage query expansion techniques to build training set
- Web interface to facilitate browsing and search
- User studies to measure the efficacy of the system

Page 22: Questions?

[email protected]@[email protected]@vt.edu

Demo info available at http://fox.cs.vt.edu/etdbrowse/