Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale Article Digital Library Glen Newton [email protected] Biology Dept, Carleton University http://zzzoot.blogspot.com/ Code4Lib-North Queen's University, Kingston, Ontario Friday May 7 2010 Based on VLDL2009 Workshop Presentation at ECDL2009

DESCRIPTION

Presentation at Code4Lib-North at Queen's University, Ontario May 7 2010

TRANSCRIPT

Page 1: Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale Article Digital Library

Using Open Source Tools for Visualization and Semantic Mapping in a Large Scale Article Digital Library

Glen Newton [email protected]

Biology Dept, Carleton University
http://zzzoot.blogspot.com/

Code4Lib-North
Queen's University, Kingston, Ontario
Friday May 7 2010

Based on VLDL2009 Workshop Presentation at ECDL2009

Outline

• Maps of Science
• Broad Research Interests
• Research Goals
• Process
• Scalability issues
• Open Source Tools
• Environment
• Results
• Conclusions
• Future Work

Maps of Science

From Bollen et al. 2009, PLoS ONE

From Leydesdorff & Rafols 2006 (two slides)

Broad Research Interests

• Search results visualization & refinement
• Domain-specific discovery, with a particular interest in genomics and drug discovery
• Improved discovery in STM domains through results visualization and contextualization, browse/explore/refine
• Use of Open Source tools in complex research problem spaces

Research Goals

• Use Open Source tools to support large scale semantic text analysis and visualization
• Find a way to extract a journal (& article) semantic vector space (semantics represent natural language much better than keyword- or tf-idf-based representations)
• Latent Semantic Analysis (LSA) works for small/medium sized corpora, but does not scale to large numbers of items and/or terms
• New alternative: Semantic Vectors (SV), which uses random vectors & avoids the expensive singular value decomposition (SVD)
• Can SV scale & generate a sensible semantic vector space of journals on a corpus of this size?
• Can the visualization produced be useful for query results visualization, refinement, discovery?
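The random-vector idea behind SV can be illustrated with a small random indexing sketch. This is not the Semantic Vectors package's code or API: the function names, the sparse ±1 "elemental" vectors and the three-step scheme shown here are assumptions chosen to show, in general terms, how a vector space can be built without an SVD.

```python
import math
import random
from collections import defaultdict


def elemental_vector(dim, seeds, rng):
    """Sparse random vector: `seeds` entries set to +1 or -1, the rest 0."""
    vec = [0.0] * dim
    for i, pos in enumerate(rng.sample(range(dim), seeds)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec


def random_indexing(docs, dim=512, seeds=10, rng_seed=42):
    """Return (term_vectors, doc_vectors) for docs given as {doc_id: [tokens]}."""
    rng = random.Random(rng_seed)
    # Step 1: one random elemental vector per document.
    elemental = {d: elemental_vector(dim, seeds, rng) for d in sorted(docs)}
    # Step 2: a term vector is the sum of the elemental vectors of the
    # documents the term occurs in (added once per occurrence).
    terms = defaultdict(lambda: [0.0] * dim)
    for d, tokens in docs.items():
        for t in tokens:
            tv = terms[t]
            for i, x in enumerate(elemental[d]):
                tv[i] += x
    # Step 3: a document vector is the sum of the vectors of its terms.
    doc_vecs = {}
    for d, tokens in docs.items():
        dv = [0.0] * dim
        for t in tokens:
            for i, x in enumerate(terms[t]):
                dv[i] += x
        doc_vecs[d] = dv
    return dict(terms), doc_vecs


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Documents that share vocabulary end up with similar vectors, because their term vectors are built from overlapping sets of elemental vectors. The cost grows with the number of term occurrences, and no matrix factorization is involved.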

Corpus

• Licensed journal articles from STM publishers: Elsevier, Springer, etc.
• ~4100 journal titles, classified into 23 categories (by publishers)
• ~8.4m journal articles
• Selection of articles/journals:
  – Only those with authors, abstract (no notices, obituaries, etc.)
  – Only English language articles
  – Only journals with >50 articles in corpus
  – Resulting corpus: 5,733,721 articles from 2231 journals
  – Categories overlapping: 1.53 categories per journal
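The selection rules above can be sketched as a small filter. The record layout (dicts with the keys `journal`, `authors`, `abstract` and `language`) and the function name are hypothetical. The sketch also assumes the >50 threshold is applied to the articles that survive the article-level filters, which the slides do not specify.

```python
from collections import Counter


def select_corpus(articles, min_articles=50):
    """Keep articles that pass the selection rules listed on the slide.

    `articles` is an iterable of dicts with the hypothetical keys
    'journal', 'authors', 'abstract' and 'language'.
    """
    # Article-level filters: has authors, has an abstract, is in English.
    kept = [
        a for a in articles
        if a.get("authors") and a.get("abstract") and a.get("language") == "en"
    ]
    # Journal-level filter: more than `min_articles` surviving articles.
    per_journal = Counter(a["journal"] for a in kept)
    return [a for a in kept if per_journal[a["journal"]] > min_articles]
```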

Corpus (continued)

Category                                      # Journals per category
Agriculture & Biological Sciences             358
Arts and Humanities                            70
Biochemistry, Genetics and Molecular Biology  240
Business, Management and Accounting           106
Chemical Engineering                          126
Chemistry                                     226
Civil Engineering                              64
Computer Science                              218
Decision Science                               50
Earth and Planetary Science                   146
Economics, Econometrics and Finance           112
Energy and Power                               73
Engineering and Technology                    328
Environmental Science                         138
Immunology and Microbiology                   104
Materials Science                             160
Mathematics                                   205
Medicine                                      671
Neuroscience                                  103
Pharmacology, Toxicology and Pharmaceutics     73
Physics and Astronomy                         210
Psychology                                    126
Social Science                                222

Process

• Index full-text (only) with Lucene 2.4 using the LuSql tool, with an aggressive stopword list and Porter stemming
• Build a Semantic Vectors (v1.18, parallelized) index from the Lucene index, with 512 semantic dimensions
• Compute the item x item distance matrix from the SV index of 512-dimensional vectors
• Using R, apply multidimensional scaling (MDS) to reduce from 512-D to 2-D
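The last two steps above (distance matrix, then MDS) can be sketched in a few lines. The slides do not say which distance measure or which MDS variant was used in R, so cosine distance and classical (Torgerson) MDS are assumptions here, and the power-iteration eigen-solver is only a stand-in for a proper numerical library.

```python
import math


def cosine_distance_matrix(vectors):
    """Pairwise cosine distance (1 - cosine similarity) between vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                d = 1.0  # treat an all-zero vector as unrelated to everything
            else:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                d = max(1.0 - dot / (norms[i] * norms[j]), 0.0)
            dist[i][j] = dist[j][i] = d
    return dist


def classical_mds(dist, k=2, iters=200):
    """Classical MDS: embed a symmetric n x n distance matrix in k dimensions."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * k for _ in range(n)]
    for axis in range(k):
        # Power iteration for the dominant eigenvector of the current B.
        v = [math.sin(i + 1.0 + axis) for i in range(n)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):
            w = [sum(bij * vj for bij, vj in zip(row, v)) for row in b]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(bij * vj for bij, vj in zip(row, v)) for row in b]
        lam = sum(x * y for x, y in zip(v, bv))  # Rayleigh quotient
        if lam <= 0:
            break  # no positive eigenvalue left to embed
        for i in range(n):
            coords[i][axis] = math.sqrt(lam) * v[i]
        # Deflate B so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords
```

For a few thousand journals a pure-Python loop like this would be slow. R's `cmdscale` function implements classical MDS, though the slides do not state which R function was used.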

Scalability Issues

• #items, #unique terms
  – #unique terms: SV handles very well
  – #items: SV handles fairly well
  – #items: impacts size of distance matrix (#items x #items)
  – R cannot handle a huge article distance matrix in MDS (i.e. millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make a single large full-text document from the concatenation of all articles of a particular journal & index these
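A rough size estimate shows why the items were switched from articles to journals. The sketch below assumes a dense matrix of 8-byte doubles, which is an assumption about storage and not something the slides state. The helper names and the `(journal, full_text)` input layout are hypothetical.

```python
from collections import defaultdict


def dense_matrix_bytes(n_items, bytes_per_entry=8):
    """Bytes needed for a full n x n matrix (8-byte doubles assumed)."""
    return n_items * n_items * bytes_per_entry


def concatenate_by_journal(articles):
    """Merge (journal, full_text) pairs into one large document per journal."""
    parts = defaultdict(list)
    for journal, text in articles:
        parts[journal].append(text)
    return {journal: "\n".join(texts) for journal, texts in parts.items()}


article_bytes = dense_matrix_bytes(5_733_721)  # articles as items
journal_bytes = dense_matrix_bytes(2_231)      # journals as items
print(f"articles: {article_bytes / 1e12:.0f} TB")
print(f"journals: {journal_bytes / 1e6:.1f} MB")
```

Under these assumptions the article-level matrix would need roughly 263 TB, while the journal-level matrix needs about 40 MB.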

Open Source Tools

• Lucene
• LuSql (High performance Lucene index building tool)
• Semantic Vectors
• R
• Processing
• Linux

Environment

• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM, attached to a Dell EMC AX150 storage array via a SilkWorm 200E Series 16-Port Capable 4Gb Fabric Switch
• Operating system: Linux openSUSE 10.2 (64-bit x86-64), kernel 2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit Server VM (build 10.0-b23, mixed mode)
• Processing 1.0 (processing.org)

Results: Scalability

• Corpus: ~600GB full-text
• Lucene index: 43GB
  – LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
  – Distance matrix: 6 minutes
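The figures above imply an average indexing rate and an index-to-corpus size ratio, worked out below. The corpus size is only given as approximately 600GB, so the results are rough. Decimal units (1 GB = 1000 MB) and the function name are assumptions.

```python
def throughput_mb_per_s(total_gb, hours, minutes):
    """Average throughput in MB/s, using decimal units (1 GB = 1000 MB)."""
    seconds = hours * 3600 + minutes * 60
    return total_gb * 1000 / seconds


# ~600 GB of full text indexed by LuSql in 13 hours 51 minutes.
rate = throughput_mb_per_s(600, 13, 51)
# 43 GB Lucene index built from ~600 GB of text.
index_ratio = 43 / 600
print(f"indexing rate: about {rate:.0f} MB/s")
print(f"index size: about {index_ratio:.0%} of the corpus")
```

That works out to roughly 12 MB/s sustained over the run, with the Lucene index at about 7% of the size of the source text.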

Results: Visualization

• Using the Processing environment, built a simple validation/visualization tool

Harder sciences and engineering categories

• Chemistry
• Materials Science
• Physics and Astronomy
• Engineering and Technology
• Mathematics
• Computer Science
• Civil Engineering
• Chemical Engineering

Agriculture and biomedical categories

• Agriculture and Biological Sciences
• Biochemistry, Genetics and Molecular Biology
• Immunology and Microbiology
• Pharmacology
• Neuroscience
• Medicine
• Psychology

Interdisciplinary and non-science categories

• Environmental Science
• Earth and Planetary Science
• Energy and Power
• Decision Science
• Economics, Econometrics and Finance
• Social Sciences
• Business, Management and Accounting
• Arts and Humanities

Examination of outliers, extrema and cataloging errors

Environmental Science
• Ecotoxicology and Environmental Safety
• Corporate Environmental Strategy
• Organic Geochemistry

Medicine
• Journal of X-Ray Science and Technology
• Journal of Biomolecular NMR

Medicine
• Annales Henri Poincare
• Colloidal and Polymer Science

Medicine
• French language Medical & Psychology Journals

Mathematics
• Journal of Medical Ultrasonics
• Bulletin of Mathematical Biology

Conclusions

• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size
• Open Source tools supported a complex research process and were easy to modify to deal with scalability issues

Future Work

• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive discovery interface & evaluate
  – Index journal 'documents' and journal articles
  – SV on all
  – Distance matrix only on journals
  – Do MDS
  – Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
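One way the article-projection step could work is Gower's out-of-sample formula for classical MDS, sketched below. This is a swapped-in technique and may differ from what the author had in mind. It places a new item using its squared distances to the already-mapped journals, it is exact only for Euclidean distances, and the function and parameter names are hypothetical.

```python
def project_new_item(train_coords, train_sq_dist, new_sq_dist):
    """Place a new item in an existing classical-MDS map (Gower's formula).

    train_coords : n x k MDS coordinates of the mapped items (centred,
                   one column per principal axis)
    train_sq_dist: n x n squared distances among the mapped items
    new_sq_dist  : n squared distances from the new item to each mapped item
    """
    n = len(train_coords)
    k = len(train_coords[0])
    row_mean = [sum(row) / n for row in train_sq_dist]
    diff = [row_mean[i] - new_sq_dist[i] for i in range(n)]
    point = []
    for axis in range(k):
        col = [train_coords[i][axis] for i in range(n)]
        # For classical MDS the eigenvalue equals the squared column norm.
        eigenvalue = sum(c * c for c in col)
        point.append(0.5 * sum(c * d for c, d in zip(col, diff)) / eigenvalue)
    return point
```

Only distances from each article to the journals are needed, so the large article x article matrix is never built. An axis with zero variance (eigenvalue 0) would need to be dropped before calling this function.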

Acknowledgements

• Collaborators: Michel Dumontier, Alison Callahan @ Carleton
• Support: Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI


License

Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 Canada License