topical categorization of large collections of electronic theses and dissertations
DESCRIPTION
Topical Categorization of Large Collections of Electronic Theses and Dissertations. Venkat Srinivasan & Edward A. Fox, Virginia Tech, Blacksburg, VA, USA. ETD 2009 – June 11, 2009. Outline: Introduction, Goals, Approach, Results, Future Work.
TRANSCRIPT
![Page 1: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/1.jpg)
Topical Categorization of Large Collections of Electronic Theses and Dissertations
Venkat Srinivasan & Edward A. Fox
Virginia Tech, Blacksburg, VA, USA
ETD 2009 – June 11, 2009
![Page 2: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/2.jpg)
Outline
- Introduction
- Goals
- Approach
- Results
- Future Work
![Page 3: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/3.jpg)
Introduction - great source
- Electronic submission of dissertations is increasingly preferred.
- ETDs are a great information source:
  - Substantial amount of research on a topic
  - Thorough literature review
  - Pointers to other resources (Reference section)
![Page 4: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/4.jpg)
Introduction - under-utilized
- Yet ETDs are under-utilized.
  - Research papers, books, etc. are still the major (and in some cases the only) sources of information for most people.
  - Most people (except grad students trained in this) don’t even think about reading a dissertation!
- Possible causes
  - Access to ETDs is not streamlined.
  - Users don’t know where to look for ETDs.
  - ETDs of interest could be buried in search engine results.
  - Some universities do not allow outside access to their ETD collection.
![Page 5: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/5.jpg)
Introduction - needs
- Efforts have been made to make ETDs more accessible.
  - NDLTD, VTLS, Scirus, etc. provide means of access to ETDs from different universities.
- These services are not very feature-rich or convenient:
  - Users search for ETDs based on keywords.
  - Users don’t know what lies underneath (no idea about the size, topical coverage, etc. of ETD collections).
  - Not very amenable to browsing (users have to sift through search results).
![Page 6: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/6.jpg)
Goals
- Provide a portal to the ETD collections of many different universities
- Provide value-added services
  - Categorize by topic
  - Support searching and browsing the collection using various criteria (by topic, keywords, date, author, etc.)
![Page 7: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/7.jpg)
Goals - priorities
- Set up infrastructure for crawling ETDs of various universities
- Come up with techniques for categorizing them into topical areas
- Set up a user-friendly search and browse interface
![Page 8: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/8.jpg)
Approach
- Crawl ETDs from various universities
- Develop a taxonomy
- Categorize ETDs into topics in the taxonomy tree
- Index the ETDs
- Develop a search and browse interface
![Page 9: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/9.jpg)
Approach - crawling
- NDLTD’s Union Catalog used as the starting point
  - Dublin Core metadata gathered
  - URLs used to crawl ETDs and other data from the respective universities’ websites
- Custom crawlers written
  - Technologies used: Perl and open source Perl libraries (WWW::Mechanize, etc.)
- All metadata (Dublin Core metadata from the Union Catalog, and the metadata obtained from the respective universities) is stored in our MySQL backend database.
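The first crawling step (read a Dublin Core record, then pull out the URLs to crawl) can be sketched as follows. This is a minimal illustration in Python rather than the authors' Perl code. The sample record, its `oai_dc` wrapper, the example URL, and the helper names `parse_dublin_core` and `crawl_targets` are all assumptions made for the sketch; the slides do not show a record or the crawler's structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical record for illustration; real Union Catalog records may differ.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Dissertation Title</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>Computer Science</dc:subject>
  <dc:date>2008-05-01</dc:date>
  <dc:identifier>http://etd.example.edu/theses/12345/</dc:identifier>
</oai_dc:dc>
"""

DC_NS = "http://purl.org/dc/elements/1.1/"


def parse_dublin_core(xml_text):
    """Return a dict mapping each Dublin Core field name to a list of values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # ElementTree spells qualified tags as "{namespace}localname".
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


def crawl_targets(fields):
    """Keep identifier values that look like URLs, as candidates for crawling."""
    return [
        value
        for value in fields.get("identifier", [])
        if value.startswith(("http://", "https://"))
    ]


record = parse_dublin_core(SAMPLE_RECORD)
urls = crawl_targets(record)
print(record["title"], urls)
```

Each field maps to a list because a Dublin Core element may repeat within one record (several subjects or identifiers, for example).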
![Page 10: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/10.jpg)
Approach - taxonomy
- Need a taxonomy of medium generality and specificity, unlike those from ProQuest, DMOZ, or Wikipedia
  - For example, DMOZ has more than 500,000 nodes!
- Solution?
  - Prune the DMOZ category tree, and then enhance it using the ProQuest categorization system.
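A rough sketch of the prune-then-enhance idea, assuming the tree is a nested dictionary and that pruning simply cuts the tree at a fixed depth. The slides do not say which pruning criterion was used, so the depth cut, the toy tree, and the helper names `prune`, `add_category`, and `count_nodes` are illustrative assumptions only.

```python
def prune(children, levels):
    """Return a copy of the tree that keeps only `levels` levels of nodes."""
    if levels <= 0:
        return {}
    return {label: prune(sub, levels - 1) for label, sub in children.items()}


def add_category(children, path, label):
    """Add `label` under the node reached by following `path` (enhancement step)."""
    node = children
    for step in path:
        node = node[step]
    node.setdefault(label, {})


def count_nodes(children):
    """Count every node in the nested-dictionary tree."""
    return sum(1 + count_nodes(sub) for sub in children.values())


# Toy stand-in for a large directory tree (labels are hypothetical).
toy_tree = {
    "Computers": {
        "Artificial Intelligence": {"Machine Learning": {}},
        "Software": {},
    },
    "Science": {"Biology": {"Genetics": {"Human": {}}}},
}

pruned = prune(toy_tree, 2)                       # keep the top two levels
add_category(pruned, ["Science"], "Agriculture")  # label taken from a second scheme
print(count_nodes(toy_tree), count_nodes(pruned))
```

Because `prune` builds new dictionaries, enhancing the pruned tree leaves the original tree unchanged.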
![Page 11: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/11.jpg)
Approach - taxonomy levels
Category tree (only top 2 levels shown)
- Level 1: TOP
- Level 2: Arts, Business, Computers, Health, Science, Society
![Page 12: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/12.jpg)
Approach - categorize
- Supervised classification approach used
  - Training set built by using topic labels as queries to Google
  - 50 webpages retrieved and used for training a Naïve Bayes classifier for each node (to distinguish between its children)
- ETD metadata used for categorization
- Level-wise categorization
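To make the per-node classifier concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing, written in plain Python. It is not the authors' implementation: the class name, the tiny hand-written training snippets (standing in for the 50 retrieved webpages per label), and the choice of add-one smoothing are all assumptions for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}         # label -> Counter of token frequencies
        self.doc_counts = Counter()   # label -> number of training documents
        self.vocab = set()

    def train(self, label, tokens):
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training text; in the slides this comes from web search results.
nb = NaiveBayes()
nb.train("Computers", "algorithm software network compiler".split())
nb.train("Computers", "database software algorithm".split())
nb.train("Health", "patient clinical nursing disease".split())
nb.train("Health", "disease therapy patient".split())

print(nb.classify("a distributed algorithm for network routing".split()))
print(nb.classify("clinical therapy outcomes for disease".split()))
```

One such classifier would be trained per internal node, with that node's children as the labels, so each classifier only has to separate a handful of sibling topics.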
![Page 13: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/13.jpg)
Approach (contd.) - Algorithm Pipeline
- Training
  - Category Tree: the category label for each node is used as a query to Google.
  - Document Sets: top 50 webpages (for each node in the tree).
  - Training Sets: produced by cleanup (stemming, stopword removal, etc.).
  - Naïve Bayes Classifiers: trained on the training sets.
- Categorization
  - ETD Collection: ETD metadata used for categorization.
  - Level-wise categorization with the Naïve Bayes classifiers.
  - Categorized ETDs: each ETD is categorized into a node of the category tree (after classification).
- Browsing
  - Web Interface over the categorized ETDs.
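The cleanup and level-wise steps of the pipeline can be sketched together. In this hedged example the per-node classifier is replaced by a simple token-overlap scorer so the block stays short; that scorer is a stand-in, not Naïve Bayes. The stopword list, the crude suffix-stripping stemmer, the toy tree, and the training text are all assumptions.

```python
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def cleanup(text):
    """Lowercase, drop stopwords, and strip a few common suffixes (crude stemming)."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?()\"'")
        if not word or word in STOPWORDS:
            continue
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


# Hypothetical two-level tree: node -> list of child labels.
TREE = {
    "TOP": ["Computers", "Health"],
    "Computers": ["Databases", "Networking"],
    "Health": [],
    "Databases": [],
    "Networking": [],
}

# Hypothetical training text per label (stands in for retrieved webpages).
TRAINING_TEXT = {
    "Computers": "software algorithms programming networks databases computing",
    "Health": "patients clinical medicine nursing diseases",
    "Databases": "databases query indexing transactions sql",
    "Networking": "networks routing protocols wireless packets",
}


def pick_child(children, tokens):
    """Stand-in for a per-node classifier: pick the child sharing the most tokens."""
    def overlap(child):
        return len(set(tokens) & set(cleanup(TRAINING_TEXT[child])))
    return max(children, key=overlap)


def categorize(metadata_text):
    """Descend from TOP one level at a time and return the chosen path."""
    tokens = cleanup(metadata_text)
    node, path = "TOP", []
    while TREE[node]:
        node = pick_child(TREE[node], tokens)
        path.append(node)
    return path


print(categorize("Routing protocols for wireless sensor networks"))
print(categorize("Clinical nursing outcomes for patients"))
```

Descending level by level means a document is only ever compared against the children of the node it has already been assigned to, which keeps each decision small even when the full tree is large.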
![Page 14: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/14.jpg)
Results
- Crawled metadata for all the ETDs from the NDLTD Union Catalog
  - ~800,000 ETDs in Union Catalog
  - 15 Dublin Core fields extracted and stored
- Crawled ~200,000 dissertations from the respective universities (where permissible); indexing is in progress
  - Technology used: Lucene search engine
  - More dissertations being crawled
![Page 15: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/15.jpg)
Results (contd.)
- Enhanced taxonomy developed
  - Some subtrees are shown in the following few slides.
  - The taxonomy is currently 4 levels deep and has ~200 nodes.
  - It is being enhanced to be 5-6 levels deep.
![Page 16: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/16.jpg)
Results (contd.)
Enhanced taxonomy (some nodes from the “Arts” subtree shown)
- Level 2: Arts
- Level 3: Speech, Literature, Art History, Classical Studies, Visual Arts, Performing Arts, ……
![Page 17: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/17.jpg)
Results (contd.)
Enhanced taxonomy (some nodes from the “Business” subtree shown)
- Level 2: Business
- Level 3: E-Commerce, Accounting, Human Resources, Investing, Banking, Management, ……
![Page 18: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/18.jpg)
Results (contd.)
- Categorized >74K ETDs from 8 universities
  - MIT, Virginia Tech, Caltech, NCSU, Georgia Tech, OhioLINK, Rice, Texas A&M
- Categorized into 6 topical areas (Arts, Business, Computers, Health, Science, Society)
- Categorization into lower levels of the category tree (levels 3 and 4, that is) is in progress
![Page 19: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/19.jpg)
Results (contd.)

| Name of the University | Total No. of ETDs | Arts | Business | Computers | Health | Science | Society |
|---|---|---|---|---|---|---|---|
| MIT | 29804 | 653 | 1847 | 6507 | 375 | 7141 | 555 |
| Virginia Tech | 11976 | 742 | 627 | 2665 | 1218 | 3317 | 340 |
| OhioLINK | 8020 | 1056 | 350 | 1267 | 1322 | 2887 | 345 |
| Rice | 6685 | 937 | 235 | 1181 | 145 | 2412 | 62 |
| NCSU | 5026 | 283 | 245 | 1419 | 512 | 2436 | 114 |
| Texas A&M | 4834 | 302 | 363 | 1363 | 566 | 2115 | 125 |
| Caltech | 4774 | 58 | 52 | 1392 | 29 | 3096 | 18 |
| Georgia Tech | 3582 | 32 | 133 | 1348 | 85 | 1233 | 23 |
| TOTAL | 74701 | 4063 | 3852 | 17142 | 4252 | 24637 | 1582 |
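As a quick consistency check, the TOTAL row can be recomputed from the per-university rows. The short Python sketch below uses only the figures from the table; the variable names are illustrative. Note that the six category counts add up to fewer ETDs than the total column; the slides do not explain the difference.

```python
CATEGORIES = ["Arts", "Business", "Computers", "Health", "Science", "Society"]

# (total ETDs, per-category counts in CATEGORIES order), copied from the table.
ROWS = {
    "MIT": (29804, [653, 1847, 6507, 375, 7141, 555]),
    "Virginia Tech": (11976, [742, 627, 2665, 1218, 3317, 340]),
    "OhioLINK": (8020, [1056, 350, 1267, 1322, 2887, 345]),
    "Rice": (6685, [937, 235, 1181, 145, 2412, 62]),
    "NCSU": (5026, [283, 245, 1419, 512, 2436, 114]),
    "Texas A&M": (4834, [302, 363, 1363, 566, 2115, 125]),
    "Caltech": (4774, [58, 52, 1392, 29, 3096, 18]),
    "Georgia Tech": (3582, [32, 133, 1348, 85, 1233, 23]),
}

# Sum each category column and the total column across universities.
column_totals = [
    sum(counts[i] for _, counts in ROWS.values()) for i in range(len(CATEGORIES))
]
total_etds = sum(total for total, _ in ROWS.values())
categorized = sum(column_totals)

for name, value in zip(CATEGORIES, column_totals):
    print(f"{name}: {value}")
print("Total ETDs:", total_etds, "Sum of category counts:", categorized)
```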
![Page 20: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/20.jpg)
Results (contd.)
- Algorithm is time efficient.
  - Training the classifier is done offline.
  - Classification is fast.
  - Classifying this collection of ~74,000 ETDs took <30 mins.
- Hopefully the classifiers developed can be applied to other data and in other systems.
![Page 21: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/21.jpg)
Future Work
- Increase coverage
  - Crawl more ETDs
  - Collaborate with universities and consortia to gain access to ETD collections
- Better categorization approaches
  - Leverage query expansion techniques to build the training set
- Web interface to facilitate browsing and search
- User studies to measure the efficacy of the system
![Page 22: Topical Categorization of Large Collections of Electronic Theses and Dissertations](https://reader035.vdocuments.us/reader035/viewer/2022081514/568137dc550346895d9f7c4e/html5/thumbnails/22.jpg)
Questions?
[email protected]@[email protected]@vt.edu
Demo info available at http://fox.cs.vt.edu/etdbrowse/