
Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale Article Digital Library

Glen [email protected]

Biology Dept, Carleton University

http://zzzoot.blogspot.com/

Code4Lib-North

Queen's University, Kingston, Ontario

Friday May 7 2010

Based on VLDL2009 Workshop Presentation at ECDL2009


Outline

• Maps of Science
• Broad Research Interests
• Research Goals
• Process
• Scalability issues
• Open Source Tools
• Environment
• Results
• Conclusions
• Future Work

From Bollen et al 2009, PLoS ONE

From Leydesdorff & Rafols 2006

From Leydesdorff & Rafols 2006

Broad Research Interests

• Search results visualization & refinement
• Domain-specific discovery, with a particular interest in genomics and drug discovery
• Improved discovery in STM domains through results visualization and contextualization, browse/explore/refine
• Use of Open Source tools in complex research problem spaces

Research Goals

• Use Open Source tools to support large scale semantic text analysis and visualization

• Find a way to extract a journal (& article) semantic vector space (semantics represent natural language much better than keyword or tf-idf-based representations)

• Latent Semantic Analysis (LSA) works for small and medium-sized corpora, but does not scale to large numbers of items and/or terms

• New alternative: Semantic Vectors (SV): uses random vectors & avoids expensive singular value decomposition (SVD)

• Can SV scale & generate a sensible semantic vector space of journals on a corpus of this size?

• Can the visualization produced be useful for query results visualization, refinement, discovery?
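
The random-vector idea behind SV can be illustrated with a small sketch. This is not the Semantic Vectors package's code: the function names, the `seed_length` parameter, and the three-step scheme below are assumptions chosen to show how document vectors can be accumulated from sparse random vectors without any SVD.

```python
import numpy as np

def elemental_vector(rng, dim, seed_length):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=seed_length, replace=False)
    vec[positions[: seed_length // 2]] = 1.0
    vec[positions[seed_length // 2:]] = -1.0
    return vec

def build_vectors(docs, dim=512, seed_length=10, seed=0):
    """Return (term_vectors, doc_vectors) for a list of tokenized documents."""
    rng = np.random.default_rng(seed)
    # Step 1: each document gets a random elemental vector.
    elemental = [elemental_vector(rng, dim, seed_length) for _ in docs]
    # Step 2: a term vector is the sum of the elemental vectors of the
    # documents in which the term occurs (once per occurrence).
    term_vectors = {}
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            term_vectors.setdefault(tok, np.zeros(dim))
            term_vectors[tok] += elemental[doc_id]
    # Step 3: a document's semantic vector is the normalized sum of the
    # vectors of its terms.
    doc_vectors = []
    for tokens in docs:
        v = sum((term_vectors[t] for t in tokens), np.zeros(dim))
        norm = np.linalg.norm(v)
        doc_vectors.append(v / norm if norm else v)
    return term_vectors, np.array(doc_vectors)

if __name__ == "__main__":
    toy_docs = [
        ["gene", "protein", "cell"],
        ["gene", "cell", "dna"],
        ["market", "price", "trade"],
    ]
    _, vectors = build_vectors(toy_docs)
    # Cosine similarities (rows are unit length, so a dot product suffices).
    print(vectors @ vectors.T)
```

The cost grows with the number of term occurrences rather than with a matrix factorization, which is the property the goals above rely on.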

Corpus

• Licensed journal articles from STM publishers: Elsevier, Springer, etc

• ~4100 journal titles, classified into 23 categories (by publishers)
• ~8.4m journal articles
• Selection of articles/journals:
– Only those with authors, abstract (no notices, obituaries, etc)
– Only English language articles
– Only journals with >50 articles in corpus
– Resulting corpus: 5,733,721 articles from 2231 journals
– Categories overlapping: 1.53 categories per journal

Corpus

Category # Journals per category

Agriculture & Biological Sciences 358

Arts and Humanities 70

Biochemistry, Genetics and Molecular Biology 240

Business, Management and Accounting 106

Chemical Engineering 126

Chemistry 226

Civil Engineering 64

Computer Science 218

Decision Science 50

Earth and Planetary Science 146

Economics, Econometrics and Finance 112


Energy and Power 73

Engineering and Technology 328

Environmental Science 138

Immunology and Microbiology 104

Materials Science 160

Mathematics 205

Medicine 671

Neuroscience 103

Pharmacology, Toxicology and Pharmaceutics 73

Physics and Astronomy 210

Psychology 126

Social Science 222

Process

• Index full-text (only) with Lucene 2.4 using the LuSql tool, with an aggressive stopword list and Porter stemming

• Build Semantic Vectors (v1.18, parallelized) index from Lucene index, with 512 semantic dimensions

• Compute item x item distance matrix from SV index of 512-dimensional vectors

• In R, use multidimensional scaling (MDS) to reduce from 512-D to 2-D
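
The last two steps can be sketched as follows. The slides used R for MDS; this sketch is in Python with NumPy instead, and it assumes Euclidean distance and classical (metric) MDS, neither of which is stated in the slides. The function names are illustrative.

```python
import numpy as np

def pairwise_distances(vectors):
    """Euclidean item x item distance matrix for row vectors."""
    sq = np.sum(vectors ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * vectors @ vectors.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(d2, 0.0))

def classical_mds(dist, k=2):
    """Classical MDS: embed items in k dimensions from a distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to get a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]
    scale = np.sqrt(np.maximum(eigvals[top], 0.0))
    return eigvecs[:, top] * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    journal_vectors = rng.normal(size=(100, 512))  # stand-in for SV output
    coords = classical_mds(pairwise_distances(journal_vectors), k=2)
    print(coords.shape)
```

The distance matrix is #items x #items, so this step is where the number of items matters most.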

Scalability Issues

• #items, #unique terms
– #unique terms: SV handles very well
– #items: SV handles fairly well
– #items: impacts size of distance matrix (#items x #items)
– R cannot handle a huge article distance matrix in MDS (i.e. millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make a single large full-text document from the concatenation of all articles of a particular journal & index these
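
A rough calculation shows why the item count matters. The sketch below assumes a dense matrix of 8-byte floating-point entries, which is an assumption and not a detail from the slides; the item counts are the corpus figures given earlier.

```python
def dense_matrix_bytes(n_items, bytes_per_entry=8):
    """Memory for a dense n_items x n_items matrix (assumed 8-byte entries)."""
    return n_items * n_items * bytes_per_entry

articles = 5_733_721  # articles in the selected corpus
journals = 2_231      # journals in the selected corpus

print(f"articles as items: {dense_matrix_bytes(articles) / 1e12:.0f} TB")
print(f"journals as items: {dense_matrix_bytes(journals) / 1e6:.1f} MB")
```

Under that assumption the article-level matrix needs about 263 TB, while the journal-level matrix needs about 39.8 MB.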

Open Source Tools

• Lucene
• LuSql (High performance Lucene index building tool)
• Semantic Vectors
• R
• Processing
• Linux

Environment

• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM, attached to a Dell EMC AX150 storage array via a SilkWorm 200E Series 16-Port Capable 4Gb Fabric Switch.

• Operating system: Linux openSUSE 10.2 (64-bit X86-64), kernel 2.6.18.8-0.10-default #1 SMP

• Java version 1.6.0_07 (build 1.6.0_07-b06) Java HotSpot 64-Bit Server VM (build 10.0-b23, mixed mode).

• Processing 1.0 (processing.org)

Results: Scalability

• Corpus: ~600GB full-text
• Lucene index: 43GB
– LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
– Distance matrix: 6 minutes

Results: Visualization

• Using Processing environment, built simple validation/visualization tool

Harder sciences and engineering categories

Figure labels: Chemistry; Materials Science; Physics and Astronomy; Engineering and Technology; Mathematics; Computer Science; Civil Engineering; Chemical Engineering

Agriculture and biomedical categories

Figure labels: Agriculture and Biological Sciences; Biochemistry, Genetics and Molecular Biology; Immunology and Microbiology; Pharmacology; Neuroscience; Medicine; Psychology

Interdisciplinary and non-science categories

Figure labels: Environmental Science; Earth and Planetary Science; Energy and Power; Decision Science; Economics, Econometrics and Finance; Social Sciences; Business, Management and Accounting; Arts and Humanities

Examination of outliers, extrema and cataloging errors

Environmental Science

Ecotoxicology and Environmental Safety

Corporate Environmental Strategy

Organic Geochemistry

Medicine

Journal of X-Ray Science and Technology

Journal of Biomolecular NMR

Medicine

Annales Henri Poincare

Colloidal and Polymer Science

Medicine

Medicine

French language Medical & Psychology Journals

Mathematics

Journal of Medical Ultrasonics

Bulletin of Mathematical Biology

Conclusions

• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size
• Open Source tools supported a complex research process and were easy to modify to deal with scalability issues

Future Work

• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive discovery interface & evaluate
– Index journal 'documents' and journal articles
– SV on all
– Distance matrix only on journals
– Do MDS
– Use eigenvectors to transform N-d article vector to 2-D

• Explore 3-D interface (MDS N-d → 3D)
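
One possible reading of the "use eigenvectors" step is sketched below. It assumes Euclidean distances, in which case classical MDS coordinates coincide (up to sign) with a projection onto the top eigenvectors of the centered journal vectors, so the same axes can be applied to article vectors. The function names are hypothetical and this is not a method confirmed by the slides.

```python
import numpy as np

def fit_projection(journal_vectors, k=2):
    """Derive k projection axes from the journal vectors only."""
    mean = journal_vectors.mean(axis=0)
    centered = journal_vectors - mean
    # Eigenvectors of the (unnormalized) covariance of journal vectors.
    eigvals, eigvecs = np.linalg.eigh(centered.T @ centered)
    top = np.argsort(eigvals)[::-1][:k]
    return mean, eigvecs[:, top]

def project(vectors, mean, axes):
    """Map N-d vectors (journals or articles) into the k-d journal space."""
    return (vectors - mean) @ axes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    journals = rng.normal(size=(50, 512))   # stand-in journal vectors
    articles = rng.normal(size=(1000, 512)) # stand-in article vectors
    mean, axes = fit_projection(journals, k=2)
    print(project(articles, mean, axes).shape)
```

Only the journal vectors enter the eigendecomposition, so the expensive step stays at the journal scale while any number of articles can be projected afterwards. Passing `k=3` would give the 3-D variant.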

Acknowledgements

• Collaborators: Michel Dumontier, Alison Callahan @ Carleton
• Support: Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI

Demo

• Link to project demo page

http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html

License

Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 Canada License

http://creativecommons.org/licenses/by-nc-nd/2.5/ca/
