
Upload: polly-greer

Post on 16-Jan-2016


PattArAn – From Annotation Triplets to Sentence Fingerprints

Motivation

Scientific concepts are annotated with controlled vocabulary (CV) terms from ontologies such as Gene Ontology (GO) and Plant Ontology (PO). Our Arabidopsis-specific tool, Patterns in Arabidopsis Annotation (PattArAn), will focus on creating patterns from the annotation knowledge in (gene, GO, PO) triplets and on validating triplets against the scientific literature. PattArAn will help scientists scour the literature, understand how it connects to the annotation evidence and biological knowledge, and develop hypotheses.

Goals: (1) Explore new research ideas in three areas of interest using PattArAn. (2) Build a gold standard dataset using manual annotation of triplet fingerprints.

The PattArAn Team at the University of Maryland, the University of Iowa, and St. Bonaventure University

Gene-GO-PO Triplets

Document Annotation Guidelines

Observations

• Check inter-annotator agreement.

• Extract gene interaction sentences in the context of our annotation triplets.

• Develop algorithms to rank sentences by importance with this gold standard data.

GO and PO combinations centered on a gene. Documents supporting annotations identified and collected.
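One way to read "GO and PO combinations centered on a gene" is as a per-gene cross product of that gene's GO terms and PO terms. The sketch below illustrates that reading only; it is not the PattArAn implementation. The function name `build_triplets`, the dictionary layout, and the gene and term labels are all hypothetical placeholders.

```python
from itertools import product


def build_triplets(go_annotations, po_annotations):
    """Pair every GO term of a gene with every PO term of the same gene.

    Both arguments map a gene name to a set of term labels (assumed layout).
    Genes that lack either GO or PO annotations produce no triplets.
    """
    triplets = []
    shared_genes = sorted(go_annotations.keys() & po_annotations.keys())
    for gene in shared_genes:
        go_terms = sorted(go_annotations[gene])
        po_terms = sorted(po_annotations[gene])
        for go_term, po_term in product(go_terms, po_terms):
            triplets.append((gene, go_term, po_term))
    return triplets


# Placeholder data, not real annotations.
go_annotations = {"geneA": {"GO:a", "GO:b"}, "geneB": {"GO:c"}}
po_annotations = {"geneA": {"PO:x"}}

for triplet in build_triplets(go_annotations, po_annotations):
    print(triplet)
```

Sorting the genes and terms keeps the output order stable, which makes it easier to compare triplet lists between annotators.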

                                                            Area 1  Area 2  Area 3
# triplets in document set (8 documents)                        32      14      14

Found in full text:
# triplets w/ at least 1 sentence                                1      11       6
# triplets w/ all 3 doublets in at least 1 sentence each         0       1       0
# triplets w/ only 2 doublets in at least 1 sentence            24      57       5
# triplets w/ only 1 doublet in at least 1 sentence             51      58      54

Found in supplementary data:
# triplets found                                                31       3       8
# doublets found                                                 8      34      69
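The row categories above can be illustrated with a small tally. This is a minimal sketch under stated assumptions, not the procedure used to produce the counts: it assumes a doublet is any pair of elements taken from a triplet, and that a term counts as mentioned when its label appears in a sentence as a case-insensitive substring. The helper names and the example sentences are hypothetical.

```python
from itertools import combinations


def mentions(sentence, term):
    """Assumed matching rule: case-insensitive substring search."""
    return term.lower() in sentence.lower()


def coverage(triplet, sentences):
    """Return (triplet_in_one_sentence, doublets_found).

    triplet_in_one_sentence: True if a single sentence mentions all three terms.
    doublets_found: how many of the three doublets co-occur in at least
    one sentence each (0 to 3).
    """
    full = any(all(mentions(s, t) for t in triplet) for s in sentences)
    found = sum(
        any(mentions(s, a) and mentions(s, b) for s in sentences)
        for a, b in combinations(triplet, 2)
    )
    return full, found


# Placeholder triplet and sentences, not taken from the document set.
triplet = ("geneA", "process b", "organ c")
sentences = [
    "geneA regulates process b.",
    "geneA is expressed in organ c.",
]
print(coverage(triplet, sentences))  # (False, 2)
```

In this example the gene-process and gene-organ doublets each appear in a sentence, but no sentence links the process to the organ, so the triplet would fall into the "only 2 doublets" row.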

Using our triplets, we could identify connections between a specific area and other fields in biology in under four weeks. It was also interesting to see how biologists’ genes of interest may function in concert to influence different bioprocesses. This serves well as the beginning of an exploration that may eventually lead to new hypotheses and discoveries.

Annotations: Triplets represented by sentences to varying degrees. Supplementary material quite rich. Doublets have most potential.

Knowledge Underlying Triplets: The annotations of document 16399800 explain a biological process of Arabidopsis thaliana well. The TSO2 gene relates to cell division by controlling dNTP balance. All annotating GO terms are linked through the function of TSO2. TSO2 is also expressed in the organs named by the PO terms. Thus, this paper nicely links the PO terms and GO terms.

Cross-document inference: Document 9880378 indicates that the redox gene AtCB5-D is expressed at varying levels across plant tissues. Document 17028151 indicates that upon infection with Pseudomonas syringae, its expression levels drop significantly in Arabidopsis leaves. This process is one aspect of a complex, genome-wide response to bacterial infection involving many genes.

Inferred Triplet: Using doublets in document 18305484, we may infer that: “The plasma membrane protein SLAC1 is essential for stomatal closure in response to CO2, abscisic acid, ozone, light/dark transitions, humidity change, calcium ions, hydrogen peroxide and nitric oxide.” This is interesting because it describes a single protein that is involved in responses to many different environmental signals.

Area 1: regulation of flower and fruit development by genes and signal pathways (e.g., genes TSO1, TSO2, MSI1).

Area 2: signal transduction of the plant hormone ethylene (e.g., genes ETR1, ERS1, ETR2).

Area 3: integration of metabolite transporters with plant growth, development and survival (e.g., genes AtCHX17, AtNHX1, AtKEA2).

Future Work

Summary