Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction
![Page 1: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/1.jpg)
Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction
Riza Batista-Navarro, William Ulate, Jennifer Hammock, Georgios Kontonatsios, Trish Rose-Sandler and Sophia Ananiadou
![Page 2: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/2.jpg)
Structured Data
? Text Mining
![Page 4: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/4.jpg)
The partners
Social Media Lab
10/9/2015 Mining Biodiversity
![Page 5: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/5.jpg)
Mining Biodiversity
• Transform BHL into a next-generation social digital library
• A multi-disciplinary approach
– Text Mining
– Machine learning
– History of Science
– Environmental History & Studies
– Library and Information Science
– Social Media
![Page 6: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/6.jpg)
What do we want to do?
Social Media
Visualisation
Semantic
Metadata
![Page 7: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/7.jpg)
Biodiversity Heritage Library
• a consortium of botanical and natural history libraries
• stores digitised legacy literature on biodiversity
• currently holds 160,000 volumes = millions of pages (PDFs and OCR-generated text)
• open-access
![Page 8: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/8.jpg)
Current features
• supports keyword-based search
• species names annotated and linked to the Encyclopedia of Life
• integrates automatic taxonomic name finding tools (uBio Taxonfinder)
• data access through export functionalities and Web services
![Page 9: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/9.jpg)
Keyword-based search and Browsing
![Page 10: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/10.jpg)
Advanced search (also keyword-based)
![Page 11: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/11.jpg)
What’s wrong with keyword-based search?
• Ambiguity!
– Boxwood: historic place in Alabama? North American term for plants in the Buxaceae family?
– Box: container? Term for boxwood in other English-speaking countries?
![Page 12: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/12.jpg)
What’s wrong with keyword-based search?
• Ambiguity!
– California bay: hardwood tree? location?
– Drum: musical instrument? fish?
![Page 13: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/13.jpg)
What’s wrong with keyword-based search?
• Ambiguity!
– Emperor: fish? person?
– Scrambled eggs: food? plant?
![Page 14: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/14.jpg)
Semantic metadata generation
• Entity types
– species
– location
– habitat
– anatomical parts
– qualities
– persons
– temporal expressions
• Association types
– observation
– habitation
– nutrition
– trait
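The entity and association types listed above could be represented along the following lines. This is only a minimal sketch: the class names, the character-offset fields, the `annotate_span` helper and the example sentence are all assumptions made for illustration, not the project's actual annotation schema.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    SPECIES = "species"
    LOCATION = "location"
    HABITAT = "habitat"
    ANATOMICAL_PART = "anatomical part"
    QUALITY = "quality"
    PERSON = "person"
    TEMPORAL_EXPRESSION = "temporal expression"


class AssociationType(Enum):
    OBSERVATION = "observation"
    HABITATION = "habitation"
    NUTRITION = "nutrition"
    TRAIT = "trait"


@dataclass(frozen=True)
class Entity:
    type: EntityType
    text: str
    start: int  # character offset of the first character (assumed convention)
    end: int    # character offset one past the last character


@dataclass(frozen=True)
class Association:
    type: AssociationType
    arguments: tuple  # the entities taking part in the association


def annotate_span(text, mention, entity_type):
    """Hypothetical helper: annotate the first occurrence of `mention` in `text`."""
    start = text.index(mention)
    return Entity(entity_type, mention, start, start + len(mention))


# Invented example sentence, used only to exercise the data model.
sentence = "Black drum inhabit estuaries."
species = annotate_span(sentence, "Black drum", EntityType.SPECIES)
habitat = annotate_span(sentence, "estuaries", EntityType.HABITAT)
habitation = Association(AssociationType.HABITATION, (species, habitat))
```

Storing offsets alongside the type keeps each annotation tied to a specific span of the page text, which is what later allows a mention to be highlighted or validated in context.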
10/9/2015 Mining Biodiversity 14
![Page 15: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/15.jpg)
Examples of semantic metadata (annotations)
• Observation
• Habitation
![Page 16: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/16.jpg)
Examples of semantic metadata (annotations)
• Nutrition
• Trait
![Page 17: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/17.jpg)
How does semantic information help?
SPECIES: California bay (hardwood tree)
LOCATION: California bay (location)
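A rough sketch of why typed annotations help: if each mention is indexed together with its entity type, a query can ask for "California bay" as a species or as a location instead of matching the bare string. The index layout, the function names and the document identifiers below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def build_index(annotations):
    """Map (entity type, lower-cased mention) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, entity_type, mention in annotations:
        index[(entity_type, mention.lower())].add(doc_id)
    return index


def search(index, mention, entity_type=None):
    """Typed search; with no entity type it behaves like plain keyword search."""
    key = mention.lower()
    if entity_type is not None:
        return set(index.get((entity_type, key), set()))
    hits = set()
    for (_, text), doc_ids in index.items():
        if text == key:
            hits |= doc_ids
    return hits


# Invented annotations: (document id, entity type, mention).
annotations = [
    ("doc-1", "SPECIES", "California bay"),
    ("doc-2", "LOCATION", "California bay"),
    ("doc-3", "SPECIES", "California bay"),
]
index = build_index(annotations)
tree_docs = search(index, "California bay", "SPECIES")
place_docs = search(index, "California bay", "LOCATION")
all_docs = search(index, "California bay")
```

The untyped call returns all three documents, which is the ambiguity problem of keyword search; the typed calls separate the tree from the place.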
![Page 18: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/18.jpg)
Text mining-based approach
Seed documents
Unlabelled documents
Learn semantics
Annotator/Curator
Validate
Feedback
Annotate
Search index
Store
Annotate
![Page 19: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/19.jpg)
Automatic annotation by text mining (TM)
– Web-based, graphical TM workbench
– conforms with the Unstructured Information Management Architecture (UIMA) standard
– facilitates the straightforward integration of various analytics into workflows
– allows for the validation of annotations
![Page 20: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/20.jpg)
interface
![Page 21: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/21.jpg)
Learning semantics
• Training of models using machine learning
– conditional random fields (CRFs) for sequence labelling
– learning the features of mentions and relations of interest based on labelled documents
• contextual features: surrounding, co-occurring words
• dictionary matches: presence of certain words in controlled vocabularies, e.g., Catalogue of Life, Phenotype and Trait Ontology, Gazetteer
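The two feature families above (context words and dictionary matches) might be computed per token roughly as follows, producing one feature dictionary per token of the kind that CRF sequence labellers commonly accept. This is a sketch under stated assumptions: the feature names are invented, the tiny term sets merely stand in for resources such as the Catalogue of Life or a gazetteer, and matching is simplified to single lower-cased tokens.

```python
# Hypothetical stand-ins for controlled vocabularies (not real resource extracts).
SPECIES_TERMS = {"buxus", "sempervirens"}
LOCATION_TERMS = {"alabama"}


def token_features(tokens, i, window=1):
    """Features for the token at position i: the word, its context, dictionary hits."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        # Dictionary matches: is the token listed in a controlled vocabulary?
        "in_species_dict": word.lower() in SPECIES_TERMS,
        "in_location_dict": word.lower() in LOCATION_TERMS,
    }
    # Contextual features: surrounding words within the window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats


def sentence_features(tokens, window=1):
    return [token_features(tokens, i, window) for i in range(len(tokens))]


# Invented example sentence.
tokens = ["Buxus", "sempervirens", "grows", "in", "Alabama"]
features = sentence_features(tokens)
```

A sequence labeller trained on such features, paired with per-token labels from the seed documents, can then weigh dictionary evidence against context, which helps when a word such as "drum" or "emperor" appears in a vocabulary but is used in another sense.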
![Page 22: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/22.jpg)
interface
![Page 23: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/23.jpg)
Annotation workflow
Pre-processing
Dictionary lookup
Machine learning-based recognition
Relation extraction
Saving
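The five workflow stages could be chained as plain functions, as in the sketch below. It does not reproduce the workbench's components: every function name is hypothetical, the machine learning step is a pass-through placeholder where a trained model would go, and relation extraction is reduced to naive co-occurrence of a species and a habitat in the same text.

```python
import json
import re


def preprocess(text):
    """Pre-processing: split text into (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def dictionary_lookup(tokens, dictionary):
    """Dictionary lookup: tag tokens found in a term-to-type dictionary."""
    return [
        {"type": dictionary[tok.lower()], "text": tok, "start": start, "end": end}
        for tok, start, end in tokens
        if tok.lower() in dictionary
    ]


def ml_recognition(tokens, entities):
    """Placeholder: a trained sequence labeller would add or correct entities here."""
    return entities


def extract_relations(entities):
    """Naive relation extraction: pair each species with each habitat mention."""
    species = [e for e in entities if e["type"] == "species"]
    habitats = [e for e in entities if e["type"] == "habitat"]
    return [
        {"type": "habitation", "species": s["text"], "habitat": h["text"]}
        for s in species
        for h in habitats
    ]


def save(document):
    """Saving: serialise the annotated document as JSON."""
    return json.dumps(document, sort_keys=True)


def run_workflow(text, dictionary):
    tokens = preprocess(text)
    entities = ml_recognition(tokens, dictionary_lookup(tokens, dictionary))
    relations = extract_relations(entities)
    return save({"text": text, "entities": entities, "relations": relations})


# Invented input text and dictionary.
output = run_workflow(
    "Drum are common in estuaries.",
    {"drum": "species", "estuaries": "habitat"},
)
```

Keeping each stage as a separate function mirrors the idea of a workflow built from interchangeable components: any stage can be replaced without touching the others.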
![Page 24: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/24.jpg)
Validation interface
![Page 25: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/25.jpg)
Enhanced searching of BHL content
Faceted search
Automatically generated questions
Time-sensitive search
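Faceted and time-sensitive search over annotated documents can be approximated with a filter and a counter, as sketched here. The document records, field names and year values are invented for illustration; the actual search index is not described in the slides.

```python
from collections import Counter


def facet_counts(docs, facet):
    """Count how many times each value of a facet (e.g. 'species') occurs."""
    return Counter(value for doc in docs for value in doc.get(facet, []))


def faceted_search(docs, facets=None, year_range=None):
    """Return ids of documents matching every facet value and, optionally, a year range."""
    facets = facets or {}
    hits = []
    for doc in docs:
        if any(value not in doc.get(name, []) for name, value in facets.items()):
            continue
        if year_range and not (year_range[0] <= doc["year"] <= year_range[1]):
            continue
        hits.append(doc["id"])
    return hits


# Invented records standing in for annotated volumes.
docs = [
    {"id": "vol-1", "year": 1850, "species": ["drum"], "location": ["Alabama"]},
    {"id": "vol-2", "year": 1905, "species": ["drum", "emperor"], "location": []},
    {"id": "vol-3", "year": 1910, "species": ["emperor"], "location": ["Alabama"]},
]
species_facet = facet_counts(docs, "species")
drum_after_1900 = faceted_search(docs, {"species": "drum"}, year_range=(1900, 1950))
```

The facet counts are what a search interface would show beside the results, and the year range corresponds to restricting a query to a period of publication.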
![Page 26: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/26.jpg)
Enhanced document viewing
Page in PDF/image format
OCR-corrected text with colour-coded annotations
![Page 27: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction](https://reader031.vdocuments.us/reader031/viewer/2022021923/58eea5ac1a28abb21c8b4631/html5/thumbnails/27.jpg)
Conclusions
• Literature is a rich source of information but difficult to search
• Keyword-based search is not enough to address ambiguity
• Semantic metadata allows for more accurate searching
• Semantic metadata can be extracted using text mining tools
• The Argo text mining workbench facilitates the construction of custom semantic metadata generation workflows