Presentation at Code4Lib-North at Queen's University, Ontario, May 7 2010
Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale
Article Digital Library
Glen [email protected]
Biology Dept, Carleton University
http://zzzoot.blogspot.com/
Code4Lib-North
Queen's University, Kingston, Ontario
Friday May 7 2010
Based on VLDL2009 Workshop presentation at ECDL2009
Outline
• Maps of Science
• Broad Research Interests
• Research Goals
• Process
• Scalability issues
• Open Source Tools
• Environment
• Results
• Conclusions
• Future Work
[Figure: map of science, from Bollen et al 2009, PLoS ONE]
[Figures: maps of science, from Leydesdorff & Rafols 2006]
Broad Research Interests
• Search results visualization & refinement
• Domain-specific discovery, with a particular interest in genomics and drug discovery
• Improved discovery in STM domains through results visualization and contextualization, browse/explore/refine
• Use of Open Source tools in complex research problem spaces
Research Goals
• Use Open Source tools to support large scale semantic text analysis and visualization
• Find a way to extract a journal (& article) semantic vector space (semantics represent natural language much better than keyword or tf-idf-based representations)
• Latent Semantic Analysis (LSA) works for small/medium sized corpora, but does not scale to large numbers of items and/or terms
• New alternative: Semantic Vectors (SV): uses random vectors & avoids expensive singular value decomposition (SVD)
• Can SV scale & generate sensible semantic vector space of journals on corpus of this size?
• Can the visualization produced be useful for results query visualization, refinement, discovery?
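The idea of building a semantic space from random vectors instead of an SVD can be illustrated with a small random-indexing sketch. This is one simple variant, written under assumptions: the function names, the sparse +1/-1 vectors, the count weighting, and the toy documents are all hypothetical, and it is not a reproduction of the Semantic Vectors package. Each document gets a sparse random "elemental" vector, each term vector is the sum of the elemental vectors of the documents it occurs in, and each document vector is the sum of its term vectors.

```python
import math
import random
from collections import Counter


def elemental_vector(dim, nonzero, rng):
    """Sparse random vector: `nonzero` entries are +1 or -1, the rest are 0."""
    vec = [0.0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1.0 if k % 2 == 0 else -1.0
    return vec


def add_scaled(target, source, weight):
    """In place: target += weight * source."""
    for i, value in enumerate(source):
        target[i] += weight * value


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_vectors(docs, dim=512, nonzero=10, seed=0):
    """Return {doc name: semantic vector} without any matrix factorization."""
    rng = random.Random(seed)
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    elemental = {name: elemental_vector(dim, nonzero, rng) for name in docs}

    # Term vector = count-weighted sum of the elemental vectors of the
    # documents that contain the term.
    term_vecs = {}
    for name, term_counts in counts.items():
        for term, n in term_counts.items():
            add_scaled(term_vecs.setdefault(term, [0.0] * dim), elemental[name], n)

    # Document vector = count-weighted sum of its term vectors.
    doc_vecs = {}
    for name, term_counts in counts.items():
        vec = [0.0] * dim
        for term, n in term_counts.items():
            add_scaled(vec, term_vecs[term], n)
        doc_vecs[name] = vec
    return doc_vecs


# Hypothetical toy "journals"; the real input was full text from a Lucene index.
DOCS = {
    "chem_a": "polymer synthesis catalyst reaction polymer",
    "chem_b": "catalyst reaction solvent synthesis",
    "econ_a": "market inflation trade policy market",
}
vectors = build_vectors(DOCS)
print("chem_a vs chem_b:", cosine(vectors["chem_a"], vectors["chem_b"]))
print("chem_a vs econ_a:", cosine(vectors["chem_a"], vectors["econ_a"]))
```

The two chemistry documents share terms, so their vectors end up close, while the economics document stays nearly orthogonal to both. The cost is only additions over fixed-length vectors, which is why this kind of approach can avoid the SVD step.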
Corpus
• Licensed journal articles from STM publishers: Elsevier, Springer, etc
• ~4100 journal titles, classified into 23 categories (by publishers)
• ~8.4m journal articles
• Selection of articles/journals:
– Only those with authors, abstract (no notices, obituaries, etc)
– Only English language articles
– Only journals with >50 articles in corpus
– Resulting corpus: 5,733,721 articles from 2231 journals
– Categories overlapping: 1.53 categories per journal
Corpus
Category # Journals per category
Agriculture & Biological Sciences 358
Arts and Humanities 70
Biochemistry, Genetics and Molecular Biology 240
Business, Management and Accounting 106
Chemical Engineering 126
Chemistry 226
Civil Engineering 64
Computer Science 218
Decision Science 50
Earth and Planetary Science 146
Economics, Econometrics and Finance 112
Energy and Power 73
Engineering and Technology 328
Environmental Science 138
Immunology and Microbiology 104
Materials Science 160
Mathematics 205
Medicine 671
Neuroscience 103
Pharmacology, Toxicology and Pharmaceutics 73
Physics and Astronomy 210
Psychology 126
Social Science 222
Process
• Index full-text (only) with Lucene 2.4 using the LuSql tool, with an aggressive stopword list and Porter stemming
• Build Semantic Vectors (v1.18, parallelized) index from Lucene index, with 512 semantic dimensions
• Find item x item distance matrix from SV index of 512-dimensional vectors
• Using R, use multidimensional scaling (MDS) to reduce from 512-D to 2-D
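The last two steps (distance matrix, then MDS down to 2-D) can be sketched as follows. The slides name neither the distance measure nor the R function used, so this sketch assumes Euclidean distance between length-normalized vectors and classical (metric) MDS, the technique implemented by R's `cmdscale`. The 3-D toy vectors stand in for the 512-D journal vectors, and all names are hypothetical. Eigenvectors are found with plain power iteration to keep the example dependency-free; a real run would use a linear algebra library.

```python
import math
import random


def unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(vectors):
    """Item x item Euclidean distances between length-normalized vectors."""
    units = [unit(v) for v in vectors]
    return [[euclidean(u, v) for v in units] for u in units]


def top_eigenpairs(matrix, k, iterations=1000, seed=0):
    """Largest k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(matrix)
    rng = random.Random(seed)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        pairs.append((value, vec))
        # Deflate: remove the component just found before looking for the next.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def classical_mds(dist, k=2):
    """Classical MDS: double-center the squared distances, keep the top-k axes."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * k for _ in range(n)]
    for axis, (value, vec) in enumerate(top_eigenpairs(b, k)):
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][axis] = scale * vec[i]
    return coords


# Hypothetical 3-D stand-ins for the 512-D journal vectors.
JOURNALS = {
    "Chemistry A": [0.9, 0.1, 0.0],
    "Chemistry B": [0.8, 0.2, 0.1],
    "Economics A": [0.0, 0.1, 0.9],
    "Economics B": [0.1, 0.0, 0.8],
}
names = list(JOURNALS)
layout = classical_mds(distance_matrix([JOURNALS[n] for n in names]))
for name, (x, y) in zip(names, layout):
    print(f"{name}: ({x:.3f}, {y:.3f})")
```

Journals with similar vectors land near each other in the 2-D layout, which is what the validation/visualization step relies on.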
Scalability Issues
• #items, #unique terms
– #unique terms: SV handles very well
– #items: SV handles fairly well
– #items: impacts size of distance matrix (#items x #items)
– R cannot handle huge article distance matrix in MDS (i.e. millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make single large full-text document from concatenation of all articles of particular journal & index these
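A back-of-envelope sketch shows why the distance matrix forces the move from articles to journals, and how the per-journal documents might be assembled. It assumes a dense matrix of 8-byte doubles (the slides do not say how R stored it), and the helper names and sample records are hypothetical.

```python
from collections import defaultdict


def matrix_bytes(n_items, bytes_per_cell=8):
    """Memory for a dense n_items x n_items matrix (8-byte cells assumed)."""
    return n_items * n_items * bytes_per_cell


def journal_documents(articles):
    """Concatenate article texts into one large document per journal.

    `articles` is an iterable of (journal name, article text) pairs.
    """
    grouped = defaultdict(list)
    for journal, text in articles:
        grouped[journal].append(text)
    return {journal: "\n".join(texts) for journal, texts in grouped.items()}


N_ARTICLES = 5_733_721  # articles in the filtered corpus
N_JOURNALS = 2_231      # journals in the filtered corpus

print(f"articles: {matrix_bytes(N_ARTICLES) / 1e12:.0f} TB")
print(f"journals: {matrix_bytes(N_JOURNALS) / 1e6:.1f} MB")

# Hypothetical sample records, only to show the concatenation step.
sample = [
    ("Journal X", "first article text"),
    ("Journal Y", "another article"),
    ("Journal X", "second article text"),
]
print(journal_documents(sample))
```

Under these assumptions the article-level matrix runs to hundreds of terabytes, while the journal-level matrix is about 40 MB, which fits comfortably in memory.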
Open Source Tools
• Lucene
• LuSql (High performance Lucene index building tool)
• Semantic Vectors
• R
• Processing
• Linux
Environment
• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM, attached to a Dell EMC AX150 storage array via SilkWorm 200E Series 16-Port Capable 4Gb Fabric Switch.
• Operating system: Linux openSUSE 10.2 (64-bit X86-64), kernel 2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit Server VM (build 10.0-b23, mixed mode).
• Processing 1.0 (processing.org)
Results: Scalability
• Corpus: ~600GB full-text
• Lucene index: 43GB
– LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
– Distance matrix: 6 minutes
Results: Visualization
• Using Processing environment, built simple validation/visualization tool
Harder sciences and engineering categories
[Figure: 2-D map of journals; highlighted categories: Chemistry; Materials Science; Physics and Astronomy; Engineering and Technology; Mathematics; Computer Science; Civil Engineering; Chemical Engineering]
Agriculture and biomedical categories
[Figure: 2-D map of journals; highlighted categories: Agriculture and Biological Sciences; Biochemistry, Genetics and Molecular Biology; Immunology and Microbiology; Pharmacology; Neuroscience; Medicine; Psychology]
Interdisciplinary and non-science categories
[Figure: 2-D map of journals; highlighted categories: Environmental Science; Earth and Planetary Science; Energy and Power; Decision Science; Economics, Econometrics and Finance; Social Sciences; Business, Management and Accounting; Arts and Humanities]
Examination of outliers, extrema and cataloging errors
[Figure: Environmental Science; labeled journals: Ecotoxicology and Environmental Safety; Corporate Environmental Strategy; Organic Geochemistry]
[Figure: Medicine; labeled journals: Journal of X-Ray Science and Technology; Journal of Biomolecular NMR]
[Figure: Medicine; labeled journals: Annales Henri Poincare; Colloidal and Polymer Science]
[Figure: Medicine; labeled cluster: French language Medical & Psychology Journals]
[Figure: Mathematics; labeled journals: Journal of Medical Ultrasonics; Bulletin of Mathematical Biology]
Conclusions
• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size
• Open Source tools supported a complex research process and were easy to modify to deal with scalability issues
Future Work
• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive discovery interface & evaluate
– Index journal 'documents' and journal articles
– SV on all
– Distance matrix only on journals
– Do MDS
– Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
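The planned article projection (fit the 2-D space on journals only, then transform article vectors with the eigenvectors) could look roughly like the sketch below. It rests on an assumption not stated in the slides: that the journal layout comes from classical MDS on Euclidean distances, in which case it matches a PCA of the journal vectors up to rotation and reflection, so the principal axes of the journal vectors can be reused to place any article. The function names and 3-D toy vectors are hypothetical, and power iteration is used only to keep the example dependency-free.

```python
import math
import random


def principal_axes(vectors, k=2, iterations=1000, seed=0):
    """Mean and top-k eigenvectors of the scatter matrix of `vectors`."""
    n_dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(n_dims)]
    centered = [[x - m for x, m in zip(v, mean)] for v in vectors]
    scatter = [[sum(row[i] * row[j] for row in centered) for j in range(n_dims)]
               for i in range(n_dims)]
    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(n_dims)]
        for _ in range(iterations):
            nxt = [sum(scatter[i][j] * vec[j] for j in range(n_dims))
                   for i in range(n_dims)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * scatter[i][j] * vec[j]
                    for i in range(n_dims) for j in range(n_dims))
        axes.append(vec)
        # Deflate so the next pass finds the next-largest axis.
        for i in range(n_dims):
            for j in range(n_dims):
                scatter[i][j] -= value * vec[i] * vec[j]
    return mean, axes


def project(vector, mean, axes):
    """Map an N-d vector to k-d using axes fitted on the journals only."""
    centered = [x - m for x, m in zip(vector, mean)]
    return [sum(c * a for c, a in zip(centered, axis)) for axis in axes]


# Hypothetical 3-D stand-ins for N-d journal and article vectors.
JOURNALS = {
    "chem_a": [0.9, 0.1, 0.0],
    "chem_b": [0.8, 0.2, 0.1],
    "econ_a": [0.0, 0.1, 0.9],
    "econ_b": [0.1, 0.0, 0.8],
}
ARTICLE = [0.85, 0.15, 0.05]  # a chemistry-like article

mean, axes = principal_axes(list(JOURNALS.values()))
journal_xy = {name: project(vec, mean, axes) for name, vec in JOURNALS.items()}
article_xy = project(ARTICLE, mean, axes)
print("article:", article_xy)
print("journals:", journal_xy)
```

Only the journals are used to fit the axes, so the expensive step stays at journal scale, and each article costs one small matrix-vector product.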
Acknowledgements
• Collaborators: Michel Dumontier, Alison Callahan @ Carleton
• Support: Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
Demo
• Link to project demo page
License
Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 Canada License