
Upload: kerry-preston

Post on 03-Jan-2016


Page 1: Relevance Detection Approach to Gene Annotation Aid to automatic annotation of databases Annotation flow –Extraction of molecular function of a gene from

Relevance Detection Approach to Gene Annotation

• Aid to automatic annotation of databases

• Annotation flow

– Extraction of the molecular function of a gene from the literature

– Then annotation of this function with a term in a controlled vocabulary

• Premise

– If the document sets retrieved by a GeneRIF and a GO concept are similar, then a link can be made between them


Data

• GeneRIF/GO term pairs

– Paired if they reference the same MEDLINE article

– Manually filtered for obvious errors

– 550 pairs from 335 distinct genes

• GO concept = GO term + definition

• GeneRIFs and GO concepts are too short for simple keyword matching

• Treated as an IR problem

– Similar to the TREC novelty track

– Compute relevance and similarity of 2 sentences


• Document set - TREC Genomics 2003 docs

• Each sentence within GeneRIF/GO concept pair treated as IR query

• Similarity between the 2 computed based on top 200 docs retrieved by each query

• Best Recall = 78.2% (prec = 22.1%)

• Best Precision = 66.2% (rec = 46.9%)
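
The slides do not say which measure was used to compare the two retrieved document sets. As a minimal sketch, assuming a plain Jaccard overlap of the top-200 document IDs and a hypothetical decision threshold, the comparison could look like this (the function names and the `threshold` value are illustrative, not taken from the study):

```python
def top_k(ranked_doc_ids, k=200):
    """Return the set of the first k document IDs from a ranked result list."""
    return set(ranked_doc_ids[:k])


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def linked(generif_ranking, go_ranking, k=200, threshold=0.1):
    """Link a GeneRIF and a GO concept when their top-k result sets overlap enough.

    The threshold is an assumed tuning knob, not a value from the slides.
    """
    overlap = jaccard(top_k(generif_ranking, k), top_k(go_ranking, k))
    return overlap >= threshold
```

Under this reading, a low threshold links more pairs (favouring recall) and a high threshold links fewer (favouring precision), which is one way to arrive at two operating points like those reported above.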


GO Dependence Relations

• Previous work (PSB)

– Using substring matching between GO codes

– Derived from annotation databases, using vector space models, co-occurrence, association rule mining

• ChEBI: www.ebi.ac.uk/chebi/

– Chemical Entities of Biological Interest

– Preferred names + synonyms

– IS_A (poly)hierarchy


Methods

• String matching

• If the same ChEBI entity is used within 2 GO codes, they are in a dependence relationship

– First order relationship

– ChEBI term must be a whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity

• Also in a dependence relationship with the ancestors

– Second order relationship
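
The matching rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the "whole word or surrounded by punctuation" test is approximated by a regular expression that forbids a letter or digit on either side of the entity name, and "second order" is read as one GO name mentioning an IS_A ancestor of an entity mentioned in the other. The `is_a` mapping (child to list of parents) and all function names are hypothetical.

```python
import re
from itertools import combinations


def mentions(entity, go_name):
    """True if entity occurs in go_name with no letter or digit directly beside it."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_name, flags=re.IGNORECASE) is not None


def ancestors(entity, is_a):
    """All ancestors of entity in a (poly)hierarchy given as child -> list of parents."""
    seen = set()
    stack = list(is_a.get(entity, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, ()))
    return seen


def dependent_pairs(go_names, chebi_names, is_a):
    """Return (first_order, second_order) sets of GO name pairs."""
    direct = {
        name: {e for e in chebi_names if mentions(e, name)} for name in go_names
    }
    inherited = {
        name: {a for e in ents for a in ancestors(e, is_a)}
        for name, ents in direct.items()
    }
    first, second = set(), set()
    for a, b in combinations(sorted(go_names), 2):
        if direct[a] & direct[b]:
            first.add((a, b))  # same entity appears in both names
        elif direct[a] & inherited[b] or inherited[a] & direct[b]:
            second.add((a, b))  # one name mentions an ancestor of the other's entity
    return first, second
```

With this rule, `mentions("carbon", "carbonic anhydrase activity")` is false while `mentions("carbon", "carbon-oxygen lyase activity")` is true, matching the example on the slide.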


Results

• 55% of GO terms contain a ChEBI entity

• 56% of dependent pairs with a ChEBI term found in the PSB study were identified in this study

• Less than 1% of GO term pairs found in this study were identified by the PSB study

• Issues

– How to validate potential relationships?

– Usual naming/synonym ambiguity!

– Substrings not used: imidazolonepropionase


Disease Text Classification

• Task: Classification of text into one of 26 disease classes

• Used full text and weighted sections according to information distribution published by other groups


Data Preparation

• HTML full-text documents, semi-automatic section division

• Tokenisation, Stemming, Stop word filtering, Part of speech tagging

• Dataset: 21*25 positive full text articles, 33 negative full text articles

• 10-fold cross-validation

• Nearest centroid classifier
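
For readers unfamiliar with the nearest centroid classifier, here is a minimal sketch. It assumes unit-length term-frequency vectors compared by cosine similarity; the slides do not state the term weighting or distance measure actually used, and the function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def normalise(vec):
    """Scale a sparse term vector to unit length (empty dict if all-zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def centroid(vectors):
    """Average a list of sparse vectors, then renormalise to unit length."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return normalise({t: w / len(vectors) for t, w in total.items()})


def train(labelled_docs):
    """labelled_docs: iterable of (label, token list). Returns label -> centroid."""
    by_label = defaultdict(list)
    for label, tokens in labelled_docs:
        by_label[label].append(normalise(Counter(tokens)))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(tokens, centroids):
    """Assign the label whose centroid has the highest cosine similarity."""
    doc = normalise(Counter(tokens))

    def cosine(label):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(w * centroids[label].get(t, 0.0) for t, w in doc.items())

    return max(centroids, key=cosine)
```

Each class is summarised by one averaged vector, so classifying a document costs one similarity computation per class (26 here) rather than one per training article.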


Results

• Baseline: 56% F-score

• Additional preprocessing: 67%

– 10,000 stopword filter

– Only nouns

• Section weighting: 74%

– Abstract and Introduction weighted highest
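
Section weighting can be pictured as scaling each token's contribution by the section it came from before the document vector is built. The sketch below assumes this simple multiplicative scheme; the weight values are hypothetical placeholders, since the slides only say that Abstract and Introduction were weighted highest.

```python
from collections import Counter

# Hypothetical weights for illustration only; the slides give no numbers.
EXAMPLE_WEIGHTS = {"abstract": 3.0, "introduction": 3.0, "methods": 1.0}


def weighted_term_counts(sections, weights, default=1.0):
    """sections: section name -> token list. Returns term -> weighted count."""
    totals = Counter()
    for name, tokens in sections.items():
        weight = weights.get(name, default)
        for token in tokens:
            totals[token] += weight
    return totals
```

The resulting counts can be fed to any vector-based classifier in place of raw term frequencies.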


From Nonsense to Sense in Healthcare Questions

• Diagnosis, Prognosis, Therapy, Prevention

• Medicine finds disease mechanisms by first finding cures

– Currently by trial and error

• Try drug then test

– Future - test then try drug

• Biomarkers

– Normality -> dysfunction -> disease

– There are prognostic markers before any diagnostic markers


Integrative Genomics

• Looking for hidden connections over a wide field, e.g.

– Immune system works too hard = rheumatoid arthritis

– Immune system doesn’t work hard enough = infectious diseases


Term Disambiguation

• 40% of genes have a homonym problem

• For 300 genes = 1 million MEDLINE articles

• After disambiguation = 60,000 articles

• 93% accuracy in assigning the correct ID to ambiguous genes

• Use contextual fingerprints:

– Experts choose 5 abstracts about a concept

– Fingerprint then created for that concept
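
The slides do not describe how a fingerprint is built or compared. One plausible reading, shown here purely as a sketch, treats a fingerprint as the pooled term counts of the expert-chosen abstracts and assigns an ambiguous mention to the concept whose fingerprint is most similar (by cosine) to the abstract in which it appears. The tokeniser, the similarity measure, and the concept IDs are all assumptions.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokeniser."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fingerprint(abstracts):
    """Pool term counts over the abstracts chosen for one concept."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(tokens(abstract))
    return counts


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(abstract, fingerprints):
    """Return the concept ID whose fingerprint best matches the abstract."""
    doc = Counter(tokens(abstract))
    return max(fingerprints, key=lambda cid: cosine(doc, fingerprints[cid]))
```

In use, `fingerprints` would map each candidate gene ID for an ambiguous symbol to the fingerprint built from its expert-chosen abstracts.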