dictionary-based named entity recognition

Post on 11-Feb-2017

Lars Juhl Jensen

Dictionary-based named entity recognition

>10 km

too much to read

computer

as smart as a dog

teach it specific tricks

named entity recognition

comprehensive dictionary

synonyms

cyclin dependent kinase 1

CDC2

normalization

CDK1_HUMAN
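
A minimal sketch of normalization, assuming a toy dictionary that holds only the two synonyms shown on the slides. The names `SYNONYMS` and `normalize` are hypothetical and not taken from any tool mentioned in the talk.

```python
# Toy dictionary (illustrative only): every synonym maps to one identifier.
SYNONYMS = {
    "cyclin dependent kinase 1": "CDK1_HUMAN",
    "cdc2": "CDK1_HUMAN",
}

def normalize(name):
    """Return the identifier for a name, or None if the name is unknown."""
    return SYNONYMS.get(name.lower())

print(normalize("CDC2"))
print(normalize("Cyclin dependent kinase 1"))
```

Both calls should resolve to the same identifier, which is the point of normalization: different surface forms, one entity.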

dictionary compilation

genes/proteins

UniProtKB

Ensembl

RefSeq

chemical compounds

PubChem

species/organisms

NCBI Taxonomy

functions

pathways

compartments

Gene Ontology

tissues

Brenda Tissue Ontology

diseases

Disease Ontology

phenotypes

Human Phenotype Ontology

environments

Environment Ontology

filters

redundant terms

insulin

broad synonyms

CDK holoenzyme

related synonyms

polyubiquitin

wrong synonyms

BRCA1

dictionary expansion

shortened forms

protein kinase activity

protein kinase

Wnt signaling pathway

Wnt signaling

synonymous forms

metabolic disease

metabolic disorder

plural forms

protein kinase

protein kinases

mitochondrion

mitochondria

cancer

cancers

adjective forms

mitochondrion

mitochondrial

abbreviated forms

Escherichia coli

E. coli

prefixes and suffixes

CDC2

hCDC2

mCDC2

Cdc28

Cdc28p

huge dictionary

additional ambiguity
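
The expansion rules above could be sketched roughly as follows. This is a deliberately naive illustration: the rule lists and the function names `shorten`, `pluralize` and `affix_variants` are assumptions made for the example, not the rules used to build the actual dictionary.

```python
# Illustrative rule tables (assumptions, not the real rule set).
REDUNDANT_TAILS = (" activity", " pathway")
IRREGULAR_PLURALS = {"mitochondrion": "mitochondria"}

def shorten(term):
    """Drop a redundant trailing word; return None if no rule applies."""
    for tail in REDUNDANT_TAILS:
        if term.endswith(tail):
            return term[: -len(tail)]
    return None

def pluralize(term):
    """Naive plural: irregular lookup first, otherwise append 's'."""
    return IRREGULAR_PLURALS.get(term, term + "s")

def affix_variants(gene):
    """Add species prefixes (h, m) and the protein suffix (p) to a gene name."""
    return {"h" + gene, "m" + gene, gene + "p"}

print(shorten("Wnt signaling pathway"))
print(pluralize("mitochondrion"))
print(sorted(affix_variants("CDC2")))
```

Applying every rule to every name blindly also generates forms nobody writes, which is one way the expanded dictionary picks up the additional ambiguity noted above.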

handling ambiguity

three options

allow

disallow

disambiguate

acceptable ambiguity

orthologous genes

overlapping ontologies

disease

phenotype

unacceptable ambiguity

unrelated entities

APC

adenomatous polyposis coli

anaphase promoting complex

disambiguation

ranking of name sources

remove unlikely meanings

acronym definitions

three letter acronym (TLA)

other names mentioned

C. sativa

Camelina sativa

Cannabis sativa

Castanea sativa

marijuana
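
A small sketch of disambiguation by other names mentioned in the same text, using the C. sativa example. The tables `CANDIDATES` and `OTHER_NAMES` and the function `disambiguate` are hypothetical; a real system would draw these names from the dictionary itself.

```python
# Toy data (illustrative only): possible meanings of the abbreviated name
# and other names that would support each meaning.
CANDIDATES = {
    "C. sativa": ["Camelina sativa", "Cannabis sativa", "Castanea sativa"],
}
OTHER_NAMES = {
    "Camelina sativa": ["camelina sativa"],
    "Cannabis sativa": ["cannabis sativa", "marijuana"],
    "Castanea sativa": ["castanea sativa"],
}

def disambiguate(mention, text):
    """Pick the one candidate supported by another name in the text, else None."""
    lowered = text.lower()
    supported = [
        candidate
        for candidate in CANDIDATES.get(mention, [])
        if any(name in lowered for name in OTHER_NAMES[candidate])
    ]
    return supported[0] if len(supported) == 1 else None

print(disambiguate("C. sativa", "Marijuana (C. sativa) use was recorded."))
```

Returning None when zero or several candidates are supported leaves the choice between allowing and disallowing the ambiguous name to the caller.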

species autodetection

two rounds of NER

species/organisms

genes/proteins
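
One way to picture the two rounds, assuming a toy setup in which gene names are keyed by species. The tables and the function `tag` are illustrative assumptions, not the behaviour of a specific tagger.

```python
# Toy dictionaries (illustrative only).
SPECIES = {"human": "Homo sapiens", "mouse": "Mus musculus"}
GENES = {
    ("cdc2", "Homo sapiens"): "CDK1_HUMAN",
    ("cdc2", "Mus musculus"): "CDK1_MOUSE",
}

def tag(text):
    """Round 1 finds species; round 2 accepts gene names only for those species."""
    words = [word.strip(".,;()").lower() for word in text.split()]
    species = {SPECIES[word] for word in words if word in SPECIES}  # round 1
    found = {
        GENES[(word, organism)]
        for word in words
        for organism in species
        if (word, organism) in GENES
    }  # round 2
    return sorted(found)

print(tag("Mouse CDC2 is phosphorylated."))
```

In this sketch a text that mentions no species yields no gene identifiers at all; a real system would more likely fall back to a default set of organisms.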

text matching

uppercase / lowercase

spaces and hyphens

punctuation

too many variants

flexible matching

finite state automaton

LINNAEUS

custom hash function

C++ tagger

efficiency

Pafilis et al., PLOS ONE, 2013
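
The idea of flexible matching can be sketched in a few lines: rather than storing every spelling variant, dictionary names and text spans are reduced to the same key before lookup. This is only an illustration of the idea in Python; it is not the custom hash function of the C++ tagger, and `match_key`, `build_index` and `find_mentions` are hypothetical names.

```python
import re

def match_key(name):
    """Collapse case, spaces and hyphens so spelling variants share one key."""
    return re.sub(r"[\s\-]+", "", name).lower()

def build_index(dictionary):
    """Map the flexible key of each dictionary name to its identifier."""
    return {match_key(name): ident for name, ident in dictionary.items()}

def find_mentions(text, index, max_words=4):
    """Return (span, identifier) for the longest match starting at each token."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", text)
    hits = []
    for start in range(len(tokens)):
        longest_end = min(len(tokens), start + max_words)
        for end in range(longest_end, start, -1):
            span = " ".join(tokens[start:end])
            ident = index.get(match_key(span))
            if ident:
                hits.append((span, ident))
                break
    return hits

index = build_index({"cyclin dependent kinase 1": "CDK1_HUMAN"})
print(find_mentions("Cyclin-dependent kinase 1 is active", index))
```

A single dictionary entry then covers the hyphenated, capitalized and space-free spellings, so the variants never need to be enumerated.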

performance

~85% precision

~75% recall

“black list”

bad names

SDS

a

an
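
A black list can be applied as a simple filter on the tagger output. The sketch below assumes matches arrive as (name, identifier) pairs and compares names case-insensitively; both choices, the identifiers `ID_1` and `ID_2`, and the name `drop_blacklisted` are assumptions made for illustration.

```python
# Bad names taken from the slides, stored in lowercase.
BLACK_LIST = {"sds", "a", "an"}

def drop_blacklisted(matches):
    """Keep only matches whose name is not on the black list."""
    return [(name, ident) for name, ident in matches if name.lower() not in BLACK_LIST]

matches = [("SDS", "ID_1"), ("CDC2", "CDK1_HUMAN"), ("an", "ID_2")]
print(drop_blacklisted(matches))
```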

web resources

indexing of literature

term co-occurrence

iHOP

Hoffmann & Valencia, Nature Genetics, 2004

www.ihop-net.org

STRING

Szklarczyk et al., Nucleic Acids Research, 2015

string-db.org
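
Term co-occurrence in its simplest form is a count of how many documents mention two entities together. The sketch below shows only that bare count; it is not the scoring scheme of iHOP or STRING, and the function name and toy identifiers are assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(documents):
    """Count, for every pair of entities, the documents that mention both."""
    counts = Counter()
    for entities in documents:
        # Each pair counts once per document, however often it is mentioned.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

# Each inner list stands for the normalized entities tagged in one abstract.
tagged = [
    ["CDK1_HUMAN", "CCNB1_HUMAN"],
    ["CCNB1_HUMAN", "CDK1_HUMAN", "CDK1_HUMAN"],
    ["CDK1_HUMAN"],
]
print(cooccurrence(tagged))
```

Counting over normalized identifiers rather than raw names means that synonyms such as CDC2 and cyclin dependent kinase 1 contribute to the same pair.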

real-time text mining

Reflect

augmented browsing

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

reflect.ws

EXTRACT

interactive annotation

Pafilis et al., Proceedings of BioCreative V, 2015

extract.hcmr.gr
