text mining tools for semantically enriching scientific literature

Text mining tools for semantically enriching the scientific literature. Sophia Ananiadou, Director, National Centre for Text Mining, School of Computer Science, University of Manchester

Upload: duncan-hull

Post on 06-Sep-2014


DESCRIPTION

Presentation by Sophia Ananiadou at the Cheminformatics workshop, 4th March 2008

TRANSCRIPT

Page 1: Text mining tools for semantically enriching scientific literature

Text mining tools for semantically enriching the scientific literature

Sophia Ananiadou

Director

National Centre for Text Mining

School of Computer Science

University of Manchester

Page 2: Text mining tools for semantically enriching scientific literature

Need for enriching the literature

• Need for semantic search, i.e. search that goes beyond keywords

• Need for technologies enabling focused semantic search via the creation of semantic metadata from literature

“The current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science”

Peter Murray-Rust, Data-driven science: A Scientist’s view. NSF/JISC Repositories Workshop, 2007

Page 3: Text mining tools for semantically enriching scientific literature

Impact of text mining

• Extraction of named entities (genes, proteins, metabolites, etc.)

• Discovery of concepts allows semantic annotation of documents

– Improves information access by going beyond index terms, enabling semantic querying

– Improves clustering and classification of documents

– Visualisation based on semantic metadata derived from text mining results

Page 4: Text mining tools for semantically enriching scientific literature

Beyond named entities: facts

• Extraction of relationships, events (facts) for knowledge discovery

– Information extraction, more sophisticated annotation of texts (fact annotation)

– Enables even more advanced semantic querying

Page 5: Text mining tools for semantically enriching scientific literature

Enriched annotation

• Text mining provides enriched annotation layers

– The user will be able to carry out an easily expressed semantic query, which will deliver facts matching that query rather than just sets of documents that they then have to read

• Information extraction and not just information retrieval

• Fact extraction and not just sentence extraction

Page 6: Text mining tools for semantically enriching scientific literature

Annotations derived from Text Mining

[Figure: text processing pipeline. Raw (unstructured) text → part-of-speech tagging → named entity recognition → deep syntactic parsing → annotated (structured) text, supported by a lexicon and an ontology.]

Example text: "... Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells. ..."

Multi-layered annotations

– Tokens: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .

– Part-of-speech tags: NN IN NN VBZ VBN IN NN IN JJ NN NNS .

– Named entities: protein_molecule, organic_compound, cell_line

– Syntactic structure: parse tree with PP, NP and VP constituents under S

– Event: negative regulation
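The layers on this slide can be pictured as parallel annotations over one tokenised sentence. The following is a minimal sketch of such a structure, not the representation used by the NaCTeM tools. The class names `Span` and `AnnotatedSentence` are hypothetical, and the alignment of the three entity labels to TNF, BHA and "U937 cells" is an assumption read off the order in which they appear on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A labelled token span; `end` is exclusive."""
    start: int
    end: int
    label: str


@dataclass
class AnnotatedSentence:
    """One sentence with several annotation layers stacked on its tokens."""
    tokens: list[str]
    pos_tags: list[str]
    entities: list[Span] = field(default_factory=list)
    events: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The POS layer must align one-to-one with the token layer.
        if len(self.tokens) != len(self.pos_tags):
            raise ValueError("each token needs exactly one POS tag")

    def entity_text(self, span: Span) -> str:
        return " ".join(self.tokens[span.start:span.end])


# Token and tag sequences exactly as they appear on the slide.
tokens = "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .".split()
pos_tags = "NN IN NN VBZ VBN IN NN IN JJ NN NNS .".split()

sentence = AnnotatedSentence(
    tokens=tokens,
    pos_tags=pos_tags,
    # Assumed alignment of the slide's three entity labels to tokens.
    entities=[
        Span(2, 3, "protein_molecule"),
        Span(6, 7, "organic_compound"),
        Span(9, 11, "cell_line"),
    ],
    events=["negative regulation"],
)

for span in sentence.entities:
    print(span.label, "->", sentence.entity_text(span))
```

Keeping each layer as a separate field means a new layer, such as the parse tree, can be added without touching the existing ones.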

Page 7: Text mining tools for semantically enriching scientific literature

Mining associations from MEDLINE

• FACTA: Finding Associated Concepts with Text Analysis

– What diseases are related to a particular chemical?

– What proteins are related to a particular disease?

– etc.

• EBIMed http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp

• PubMatrix http://pubmatrix.grc.nia.nih.gov/

• FACTA http://text0.mib.man.ac.uk/software/facta/

– Quick and interactive
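Questions such as "what diseases are related to a particular chemical?" can be approximated by counting how often concepts are mentioned in the same abstract. The sketch below illustrates that general idea only. The slides do not say how FACTA scores associations, so the document-level co-occurrence count, the function name `associated_concepts` and the toy concept identifiers are all assumptions.

```python
from collections import Counter


def associated_concepts(documents, query, top_n=5):
    """Rank concepts by the number of documents in which they co-occur with `query`."""
    counts = Counter()
    for concepts in documents:
        if query in concepts:
            counts.update(c for c in concepts if c != query)
    return counts.most_common(top_n)


# Hypothetical toy input: each set holds the concept identifiers
# that an entity recogniser found in one abstract.
documents = [
    {"chemical:A", "disease:X", "protein:P1"},
    {"chemical:A", "disease:X"},
    {"chemical:A", "disease:Y"},
    {"chemical:B", "disease:Y"},
]

ranked = associated_concepts(documents, "chemical:A")
print(ranked)
```

A raw count favours concepts that are frequent everywhere, so a real system would probably normalise it with a statistical association measure.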

Page 8: Text mining tools for semantically enriching scientific literature

Query

Page 9: Text mining tools for semantically enriching scientific literature

Click!

Page 10: Text mining tools for semantically enriching scientific literature
Page 11: Text mining tools for semantically enriching scientific literature

Innovative technologies applied to:

• Term recognition

• Named entity recognition

• Fact extraction

Semantic mark-up

→ semantic mark-up improves search

→ classifying, linking documents

→ knowledge discovery, hidden links, associations, hypothesis generation

Page 12: Text mining tools for semantically enriching scientific literature

Natural Language Processing technologies

• Part-of-speech tagging: GENIA

– Tuned to biomedical text: 97-99% precision

• Dictionary-based named-entity recognition

• Deep parsing

– Predicate argument relations (90%)

• Protein-protein interaction extraction

• Event / fact extraction
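Dictionary-based named-entity recognition, listed above, can be sketched as a greedy longest-match lookup of token sequences in a term dictionary. This is a simplified illustration and not the NaCTeM implementation. The function `dictionary_ner`, its `max_len` parameter and the four dictionary entries are hypothetical.

```python
def dictionary_ner(tokens, dictionary, max_len=4):
    """Greedy longest-match lookup of token n-grams in a term dictionary.

    Returns (start, end, label) triples with `end` exclusive.
    """
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so "U937 cells" beats "U937".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                match = (i, i + n, dictionary[candidate])
                break
        if match:
            entities.append(match)
            i = match[1]  # continue after the matched term
        else:
            i += 1
    return entities


# Hypothetical dictionary; the entity types reuse labels shown in the slides.
dictionary = {
    "TNF": "protein_molecule",
    "BHA": "organic_compound",
    "U937": "cell_line",
    "U937 cells": "cell_line",
}

tokens = "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .".split()
entities = dictionary_ner(tokens, dictionary)
print(entities)
```

Exact string lookup misses spelling variants such as "IkappaB" and "Ikappa B", which is one reason term normalisation and large lexical resources matter.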

Page 13: Text mining tools for semantically enriching scientific literature

Automatic Term Recognition

http://www.nactem.ac.uk/software/termine/

Page 14: Text mining tools for semantically enriching scientific literature
Page 15: Text mining tools for semantically enriching scientific literature
Page 16: Text mining tools for semantically enriching scientific literature

Recognising and Disambiguating Acronyms in Biomedical Literature

http://www.nactem.ac.uk/software/acromine

Page 17: Text mining tools for semantically enriching scientific literature

Example: "The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …"

Named-entity recognition

• Entity types (defined by ontologies)

– Genes/protein names

– Enzymes, substances, metabolites, etc.

– GO ontology, KEGG, ChEBI, etc.

Entity labels shown on the example: DNA, virus, cell_type

Page 18: Text mining tools for semantically enriching scientific literature
Page 19: Text mining tools for semantically enriching scientific literature

Leveraging resources

• Annotated texts (GENIA corpus, GENIA event corpus)

• Resources for bio-text mining

– resource-building NLP tools for text-based knowledge harvesting (NaCTeM)

– BioLexicon: over 1.5M lexical entries for bio-text mining, and growing, containing rich linguistic information for bio-text mining

Page 20: Text mining tools for semantically enriching scientific literature

Population Process

[Figure: BioLexicon population process. Elements: Bio-Lexicon, existing repositories, Medline abstracts. Steps: subclustering of term variants, manual curation, named entity recognition, term mapping by normalization, verb subcategorization. Data labels: gene/protein names; chemical, disease, enzyme, species names; terminological verbs; new gene/protein names; verb subcategorization frames. Status label: on-going.]

Page 21: Text mining tools for semantically enriching scientific literature

Semantic search based on facts

• MEDIE: an interactive advanced IR system retrieving facts

• Performs a semantic search

• Core technology annotates texts

– GENIA tagger → syntactic structures

– Enju (deep parser) → facts

– Dictionary-based named entity recognition

J. Tsujii

Page 22: Text mining tools for semantically enriching scientific literature

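MEDIE system overview

[Figure: MEDIE architecture. Off-line: the input textbase is processed by the deep parser and the entity recognizer to produce a semantically annotated textbase. On-line: a query is run by the region algebra search engine over the annotated textbase and returns search results.]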
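The off-line/on-line split can be sketched with annotations stored as labelled character regions and a query that combines regions by containment, which is the basic idea behind a region algebra. This is a toy illustration under stated assumptions and not MEDIE's search engine. The names `Region`, `annotate`, `select` and `containing` are hypothetical, and so are the small entity dictionary and the entity types it assigns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str


def annotate(text, sentences, entity_dict):
    """Off-line step: record sentence and entity regions over the text."""
    regions = []
    for sentence in sentences:
        start = text.index(sentence)
        regions.append(Region(start, start + len(sentence), "sentence"))
    for term, label in entity_dict.items():
        start = text.find(term)
        while start != -1:
            regions.append(Region(start, start + len(term), label))
            start = text.find(term, start + 1)
    return regions


def select(regions, label):
    return [r for r in regions if r.label == label]


def containing(outer, inner):
    """Region-algebra style operator: regions of `outer` that contain some region of `inner`."""
    return [o for o in outer
            if any(o.start <= i.start and i.end <= o.end for i in inner)]


# Two example sentences taken from earlier slides.
sentences = [
    "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells.",
    "The peri-kappa B site mediates human immunodeficiency virus type 2 "
    "enhancer activation in monocytes.",
]
text = " ".join(sentences)

# Hypothetical entity dictionary standing in for the entity recognizer.
regions = annotate(text, sentences, {"TNF": "protein_molecule", "monocytes": "cell_type"})

# On-line step: "sentences that mention a cell type".
hits = containing(select(regions, "sentence"), select(regions, "cell_type"))
for hit in hits:
    print(text[hit.start:hit.end])
```

Because the expensive analysis is stored as regions ahead of time, the on-line query only compares offsets, which is what makes interactive search feasible.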

Page 23: Text mining tools for semantically enriching scientific literature

Sentence Retrieval System Using Semantic Representation

MEDIE

Page 24: Text mining tools for semantically enriching scientific literature
Page 25: Text mining tools for semantically enriching scientific literature

InfoPubMed

• An interactive information extraction system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them.

• System components

– Deep parsing technology

– Extraction of protein-protein interactions

– Multi-window interface on a browser

Page 26: Text mining tools for semantically enriching scientific literature

InfoPubMed

Interactions and not just co-occurrences. Calculated using ML and deep semantics.

Page 27: Text mining tools for semantically enriching scientific literature

Semantic Information Retrieval

• KLEIO: a semantically enriched information retrieval system for biology

• Offers textual and metadata searches across MEDLINE

• Leverages terminology technologies

• Named entity recognition: gene, protein, metabolite, organ, disease, symptom

http://nactem4.mc.man.ac.uk:8080/Kleio/

Page 28: Text mining tools for semantically enriching scientific literature

KLEIO architecture

Page 29: Text mining tools for semantically enriching scientific literature
Page 30: Text mining tools for semantically enriching scientific literature

Fewer documents with a more precise query

Page 31: Text mining tools for semantically enriching scientific literature

Linking and enriching pathways with text

– REFINE (BBSRC): MCISB and NaCTeM (Kell, Ananiadou, Tsujii)

– to integrate text mining techniques with visualisation technologies for better understanding of the evidence for biochemical and signalling pathways

– to enrich pathway models encoded in the Systems Biology Markup Language (SBML) with evidence derived from text mining

Page 32: Text mining tools for semantically enriching scientific literature

2 Steps for linking text with pathways

[Figure: three levels linked by two steps. Literature: "… IkappaB is phosphorylated …", "… Ikappa B ubiquitination …", "… degradation of IkB …". Step 1, event extraction, yields biological events involving IkB (marked P, U and degradation). Step 2, pathway construction, chains these events into a pathway.]

Tsujii-lab, Tokyo
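The two steps can be sketched with simple pattern matching over the three literature snippets on this slide. The real event extraction relies on deep parsing and annotated corpora, so the regular expressions here are a deliberately simplified stand-in. The trigger patterns, the `ORDER` list (taken from the order in which the slide draws the events) and both function names are assumptions.

```python
import re

# Hypothetical trigger patterns, one per event type.
TRIGGERS = {
    "phosphorylation": re.compile(r"phosphorylat", re.IGNORECASE),
    "ubiquitination": re.compile(r"ubiquitinat", re.IGNORECASE),
    "degradation": re.compile(r"degrad", re.IGNORECASE),
}

# Matches the surface variants on the slide: IkappaB, Ikappa B, IkB.
PROTEIN = re.compile(r"\bI\s?(?:kappa|k)\s?B\b", re.IGNORECASE)

# Assumed event order, following the slide's drawing.
ORDER = ["phosphorylation", "ubiquitination", "degradation"]


def extract_events(snippets):
    """Step 1: turn text snippets into (event_type, theme) pairs."""
    events = []
    for snippet in snippets:
        if not PROTEIN.search(snippet):
            continue
        for event_type, pattern in TRIGGERS.items():
            if pattern.search(snippet):
                events.append((event_type, "IkB"))  # variants normalised to one name
    return events


def build_pathway(events):
    """Step 2: chain the extracted events into an ordered pathway."""
    present = {event_type for event_type, _ in events}
    return ["IkB"] + [step for step in ORDER if step in present]


snippets = [
    "IkappaB is phosphorylated",
    "Ikappa B ubiquitination",
    "degradation of IkB",
]

events = extract_events(snippets)
pathway = build_pathway(events)
print(" -> ".join(pathway))
```

Each pathway step keeps a link back to the snippet that produced it, which is the kind of textual evidence the REFINE project aims to attach to pathway models.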

Page 33: Text mining tools for semantically enriching scientific literature

Event Annotation - Example

Page 34: Text mining tools for semantically enriching scientific literature

Statistics & References

• Statistics

– 36,114 events have been identified in, and annotated on, 1,000 Medline abstracts, which contain 9,372 sentences

– Kim, Jin-Dong, Tomoko Ohta and Jun'ichi Tsujii (2008) Corpus annotation for mining biomedical events from literature. BMC Bioinformatics

– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA

Page 35: Text mining tools for semantically enriching scientific literature

Acknowledgements

• Junichi Tsujii and his lab (University of Tokyo): MEDIE, InfoPubMed, event annotation

• Yoshimasa Tsuruoka (NER, FACTA, KLEIO, REFINE)

• Naoaki Okazaki (TerMine, AcroMine)

• Yutaka Sasaki (BioLexicon, NER, KLEIO)

• John McNaught (BioLexicon, BOOTStrep project)

• Chikashi Nobata (KLEIO)

• Douglas Kell (REFINE)