Navigating the semantic web for publishers
innovation. quality. service
“Enabling clients to realize the full potential of their content and increase efficiency throughout their enterprise.”
Engineering technology to deliver the revolution
Presentation to Online Publishers’ forum
November 29, 2011
Priya Parvatikar, Technical Architect
About this talk
• Features of the GSE Research website
• Overview of how the features have been achieved
• ‘Under the hood’ look at the technology
Improved search - Enhancing auto-suggest
Using taxonomy information for “did you mean”
Boosting relevant results
Guiding the user through facets
Guiding the user through suggestions
Concept homepages
Showing concepts on item homepages
Suggest related items
GSE Research – How?
• Built using the pub2web platform
• MetaStore used for metadata storage
• Apache Solr used for search indexing
• Semantic enrichment of content
• Apache UIMA used for entity extraction
MetaStore
• RDF triplestore for storing metadata
• Agnostic to the type of data being stored
• Able to store rich and very granular data
• Flexible to cater for future data enhancements
For the GSE Research site:
• Content
• Authors
• Taxonomy concepts and relations
• Federation of data from external datasets
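To illustrate what "agnostic to the type of data" can mean for a triplestore, here is a minimal in-memory sketch in Python. It is not MetaStore's implementation or API; the `TripleStore` class, the identifiers (`article:1`, `author:42`) and the choice of predicates are all hypothetical, picked only to show content, authors and taxonomy concepts living side by side as triples.

```python
class TripleStore:
    """Toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        # Any kind of statement fits the same three-part shape.
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples, sorted; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
# Hypothetical sample data: an article, its author, and a taxonomy concept.
store.add("article:1", "dc:title", "Measuring estuarine pollution")
store.add("article:1", "dc:creator", "author:42")
store.add("article:1", "dc:subject", "concept:water-pollution")
store.add("author:42", "foaf:name", "A. Example")
store.add("concept:water-pollution", "skos:broader",
          "concept:climate-change-pollution")

# Everything tagged with the water pollution concept.
tagged = store.match(predicate="dc:subject", obj="concept:water-pollution")
```

New kinds of data (for example, links to an external dataset) would need only new predicates, not a schema change, which is the flexibility the bullets above describe.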
Search
• Uses enterprise-grade Apache Solr
• Inbuilt support for rich features:
  • Faceted searching
  • Synonyms
  • Stemming
  • Boosting
  • ‘More like this’
  • ‘Did you mean’
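As a rough illustration of how several of these features are switched on per request, the sketch below builds a Solr query string in Python. The parameter names (`q`, `defType`, `facet`, `facet.field`, `bq`, `spellcheck`, `wt`) are standard Solr request parameters, but the field names `concept` and `author`, the boost value, and the helper `build_solr_query` are assumptions for illustration, not the GSE Research configuration. Spellcheck also assumes the spellcheck component is configured on the request handler.

```python
from urllib.parse import urlencode


def build_solr_query(user_query, facet_fields, boost_query=None):
    """Build a query string for a Solr select handler (illustrative only)."""
    params = [
        ("q", user_query),
        ("defType", "edismax"),   # query parser that supports boost queries
        ("facet", "true"),        # faceted searching
        ("spellcheck", "true"),   # basis for 'did you mean' suggestions
        ("wt", "json"),
    ]
    # One facet.field parameter per field to facet on.
    params += [("facet.field", name) for name in facet_fields]
    if boost_query:
        params.append(("bq", boost_query))  # boost matching documents
    return urlencode(params)


query_string = build_solr_query(
    "water pollution",
    ["concept", "author"],                      # hypothetical field names
    boost_query='concept:"water pollution"^5',  # hypothetical boost
)
```

The resulting string would be appended to a select URL on the Solr server; synonyms and stemming are configured in the index schema rather than per request.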
Content for GSE Research website
Provided by GSE
• Content XML
• Taxonomy prepared by GSE
Taxonomy enhancement
• Concepts mapped to Library of Congress classifications
• Taxonomy automatically enhanced with terms from this classification
GSE Research taxonomy - example
For example, the GSE taxonomy contains:
Climate change, pollution & environmental impacts
  Water pollution
  Air pollution
After enhancing with Library of Congress classification:
Climate change, pollution & environmental impacts
  Water pollution – variants: aquatic pollution, water contamination
    Marine pollution – variants: ocean pollution, sea pollution
    Oil pollution of water – variants: petroleum pollution of water
    Estuarine pollution – variants: estuary pollution
  Air pollution
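One way the variant terms above could feed search is sketched below in Python: a lookup from any variant to its preferred concept, plus comma-separated lines in the style of a Solr synonyms file. The data is taken from the example above; the function names and the idea of driving both lookup and synonym export from one dictionary are assumptions, not a description of how GSE Research does it.

```python
# Preferred concept label -> variant labels, from the example above.
TAXONOMY_VARIANTS = {
    "water pollution": ["aquatic pollution", "water contamination"],
    "marine pollution": ["ocean pollution", "sea pollution"],
    "oil pollution of water": ["petroleum pollution of water"],
    "estuarine pollution": ["estuary pollution"],
}


def build_variant_index(taxonomy):
    """Map every label (preferred or variant) to its preferred label."""
    index = {}
    for preferred, variants in taxonomy.items():
        index[preferred.lower()] = preferred
        for variant in variants:
            index[variant.lower()] = preferred
    return index


def preferred_term(query, index):
    """Return the preferred concept for a query, or None if unknown."""
    return index.get(query.strip().lower())


def synonym_lines(taxonomy):
    """Render each concept and its variants as one comma-separated line."""
    return [", ".join([preferred] + variants)
            for preferred, variants in taxonomy.items()]


index = build_variant_index(TAXONOMY_VARIANTS)
concept = preferred_term("Ocean Pollution", index)
lines = synonym_lines(TAXONOMY_VARIANTS)
```

With this index, a search for "ocean pollution" can be routed to the "marine pollution" concept even though the phrase never appears in the original GSE taxonomy.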
Content workflow in GSE Research
[Workflow diagram. Components shown: Content, Images, Tables, Authors, Concepts, Additional concepts, External datasets, Text mining pipelines, MetaStore Loader, MetaStore, Search Index]
Entity extraction for GSE Research content
Apache UIMA
• Architectural framework to manage unstructured data
• Open-source project under the Apache License
• OASIS standard
Provides
• Framework
• Annotators – multiple annotators can be applied in a pipeline
• Ability to plug in external text-mining services as annotators
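The idea of applying several annotators in a pipeline can be sketched in plain Python. This is not the UIMA API (UIMA itself is a Java and C++ framework); the `Document`, `Annotation`, `concept_annotator` and `year_annotator` names are hypothetical stand-ins that only mimic the pattern of each annotator adding typed spans to a shared document.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    begin: int   # start offset in the document text
    end: int     # end offset (exclusive)
    kind: str    # annotation type, e.g. "concept"
    value: str   # normalised value of the matched text


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def concept_annotator(terms):
    """Build an annotator that tags occurrences of known concept terms."""
    # Longest terms first so longer phrases win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered),
                         re.IGNORECASE)

    def annotate(doc):
        for m in pattern.finditer(doc.text):
            doc.annotations.append(
                Annotation(m.start(), m.end(), "concept", m.group().lower()))
    return annotate


def year_annotator(doc):
    """Tag four-digit years in the 1900s and 2000s."""
    for m in re.finditer(r"\b(?:19|20)\d{2}\b", doc.text):
        doc.annotations.append(
            Annotation(m.start(), m.end(), "year", m.group()))


def run_pipeline(doc, annotators):
    """Apply each annotator in order; all write to the same document."""
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(
    Document("Marine pollution rose after the 2010 spill; "
             "water pollution followed."),
    [concept_annotator(["water pollution", "marine pollution"]),
     year_annotator],
)
found = [(a.kind, a.value) for a in doc.annotations]
```

An external text-mining service would slot in as one more function in the list, which is the plug-in point the last bullet refers to.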
Example of entity extraction
Editorial curation
Future possibilities for GSE Research
• Extraction of geographical concepts
• Federation of data from other external datasets, e.g. government datasets
• Semantic analysis of search queries to deliver better results
Summary
• Tagging drives discovery
• Provide multiple routes to content
• Provide external context to content
• Start simple and experiment
• Flexibility of underlying systems is key
Thank you!