Mining the Biomedical Literature for Genic Information

Presenter: Catalina O. Tudor

BioNLP ’08, June 19, 2008
Catalina O. Tudor, K. Vijay-Shanker, Carl J. Schmidt
University of Delaware

Groucho? I want to know more about this gene…

User Scenario – Groucho and PubMed

PubMed Search Engine

270 abstracts retrieved

User Scenario – Groucho and eGIFT

eGIFT

Key Terms for Groucho

Processes:
• segmentation
• neurogenesis
• embryonic development
• ...

Descriptors:
• enhancer
• corepressor
• ...

Domains:
• WD40
• eh1
• WRPW
• basic helix-loop-helix
• ...

Genes:
• Hairy
• AES

Web Application

All sentences for Groucho containing segmentation

1. The Groucho protein interacts with Hairy-related transcription factors to regulate segmentation, neurogenesis and sex determination. (PMID 8892234)

2. The Drosophila protein Groucho is involved in embryonic segmentation and neural development, and is implicated in the Notch signal transduction pathway. (PMID 8713081)

...

PubMed

most relevant terms associated with the given gene

What does eGIFT provide?

• Two types of users
  • Scientists trying to quickly find information about a gene
  • Annotators trying to quickly locate textual evidence describing gene functions

• Key Terms provide an overall picture about a given gene

• eGIFT allows users to identify the set of documents for a topic relevant to the gene of interest

Overall Approach of eGIFT

1. Retrieve abstracts from PubMed
   1. Background Set: all abstracts mentioning “gene” or “protein”
   2. Query Set: all abstracts mentioning a given gene

2. Refine Query Set

3. Group morphologically related words

4. Calculate term scores and identify key terms

5. Categorize key terms using controlled vocabularies

6. Link sentences and abstracts to a specific key term

Retrieve abstracts

Background Set
• all abstracts mentioning “gene” or “protein”
• (gene[ti] OR genes[ti] OR protein[ti] OR proteins[ti]) AND hasabstract[text]
• 639,211 abstracts retrieved

Query Set
• all abstracts mentioning a given gene name, symbol, synonyms

• Compare information from Query Set against general information from Background Set and determine the most specific information in the Query Set

• Compare background and query frequencies of terms to identify statistically interesting cases
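A minimal sketch of how the two retrieval queries could be assembled. The Background Set query is copied from the slide; the Query Set query, the `[tiab]` field tag, the `retmax` value and the synonym list are assumptions for illustration, not details given in the talk. The URL targets NCBI's ESearch endpoint, and no request is actually sent.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Background Set query, as shown on the slide.
BACKGROUND_QUERY = (
    "(gene[ti] OR genes[ti] OR protein[ti] OR proteins[ti]) "
    "AND hasabstract[text]"
)


def query_set_term(names):
    """OR together a gene's name, symbol and synonyms.

    Searching title/abstract ([tiab]) is an assumption; the slides only say
    "all abstracts mentioning a given gene name, symbol, synonyms".
    """
    quoted = [f'"{name}"[tiab]' for name in names]
    return "(" + " OR ".join(quoted) + ") AND hasabstract[text]"


def esearch_url(term, retmax=100):
    """Build an ESearch URL for the term; nothing is fetched here."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Hypothetical name list for the running example.
groucho_query = query_set_term(["Groucho", "gro"])
background_url = esearch_url(BACKGROUND_QUERY)
query_url = esearch_url(groucho_query)
```

Keeping the query strings separate from the URL builder makes it easy to swap in a different gene or a different field tag without touching the request code.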

[Diagram: PubMed, Background Set, Query Set]

Query Set = all abstracts mentioning given gene

Query Set contains two types of abstracts

1. About Set

• abstracts which focus on the given gene

2. Extra Set

• abstracts which focus on other topics but happen to mention the gene

Heuristics for identifying an About abstract

• if the given gene name occurs in the title, or in the first or last sentence of the abstract

• if the given gene name occurs 3+ times in the abstract
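The two heuristics above can be sketched as follows. This is an illustration under stated assumptions, not the eGIFT implementation: the heuristics are combined with OR, sentences are split with a naive regular expression, and gene names are matched case-insensitively on word boundaries. All function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def is_about(title, abstract, gene_names, min_mentions=3):
    """True if either slide heuristic fires (OR-combination is an assumption)."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in gene_names) + r")\b",
        re.IGNORECASE,
    )
    sentences = split_sentences(abstract)
    edge_text = [title]
    if sentences:
        edge_text += [sentences[0], sentences[-1]]
    # Heuristic 1: name appears in the title, first or last sentence.
    in_edges = any(pattern.search(part) for part in edge_text)
    # Heuristic 2: name appears at least `min_mentions` times in the abstract.
    mentions = len(pattern.findall(abstract))
    return in_edges or mentions >= min_mentions


def refine(query_set, gene_names):
    """Split (title, abstract) pairs into an About Set and an Extra Set."""
    about, extra = [], []
    for title, abstract in query_set:
        target = about if is_about(title, abstract, gene_names) else extra
        target.append((title, abstract))
    return about, extra
```

For example, a title such as "Multiple RTK pathways downregulate Groucho-mediated repression in Drosophila embryogenesis." satisfies the first heuristic on its own, while an abstract that mentions Groucho once in a middle sentence lands in the Extra Set.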

Refine Query Set

[Diagram: Query Set, About Set, Extra Set]

Refine Query Set – About Set example

Multiple RTK pathways downregulate Groucho-mediated repression in Drosophila embryogenesis.

RTK pathways establish cell fates in a wide range of developmental processes. However, how the pathway effector MAPK coordinately regulates the expression of multiple target genes is not fully understood. We have previously shown that the EGFR RTK pathway causes phosphorylation and downregulation of Groucho, a global co-repressor that is widely used by many developmentally important repressors for silencing their various targets. Here, we use specific antibodies that reveal the dynamics of Groucho phosphorylation by MAPK, and show that Groucho is phosphorylated in response to several RTK pathways during Drosophila embryogenesis. Focusing on the regulation of terminal patterning by the Torso RTK pathway, we demonstrate that attenuation of Groucho's repressor function via phosphorylation is essential for the transcriptional output of the pathway and for terminal cell specification. Importantly, Groucho is phosphorylated by an efficient mechanism that does not alter its subcellular localisation or decrease its stability; rather, modified Groucho endures long after MAPK activation has terminated. We propose that phosphorylation of Groucho provides a widespread, long-term mechanism by which RTK signals control target gene expression.

PMID - 18216172

Refine Query Set – Extra Set example

Engrailed defines the position of dorsal di-mesencephalic boundary by repressing diencephalic fate.

Regionalization of a simple neural tube is a fundamental event during the development of central nervous system. To analyze in vivo the molecular mechanisms underlying the development of mesencephalon, we ectopically expressed Engrailed, which is expressed in developing mesencephalon, in the brain of chick embryos by in ovo electroporation. Misexpression of Engrailed caused a rostral shift of the di-mesencephalic boundary, and caused transformation of dorsal diencephalon into tectum, a derivative of dorsal mesencephalon. Ectopic Engrailed rapidly repressed Pax-6, a marker for diencephalon, which preceded the induction of mesencephalon-related genes such as Pax-2, Pax-5, Fgf8, Wnt-1 and EphrinA2. In contrast, a mutant Engrailed, En-2(F51rE), bearing mutation in EH1 domain, which has been shown to interact with a co-repressor, Groucho, did not show the phenotype induced by wild-type Engrailed. Furthermore, VP16-Engrailed chimeric protein, the dominant positive form of Engrailed, caused caudal shift of di-mesencephalic boundary and ectopic Pax-6 expression in mesencephalon. These data suggest that (1) Engrailed defines the position of dorsal di-mesencephalic boundary by directly repressing diencephalic fate, and (2) Engrailed positively regulates the expression of mesencephalon-related genes by repressing the expression of their negative regulator(s).

PMID - 10529429

Group morphologically related words - example

• The Drosophila Groucho transcriptional corepressor protein has been shown to interact with the DNA-binding bHLH domain of Enhancer of split, Hairy and Deadpan proteins.

• Groucho acts as a co-repressor for several Drosophila DNA binding transcriptional repressors.

• Dorsal represses transcription by recruiting the co-repressor Groucho.

• The results indicate that FoxD3 recruitment of Groucho corepressors is essential for the transcriptional repression of target genes and induction of mesoderm in Xenopus.

corepressor = {corepressor, corepressors, co-repressor, …}

transcription repress = {transcriptional repressors, transcriptional repression, …}

Unigram example

recruit = {recruit, recruits, recruited, recruitment, recruiting, recruitments}

Bigram example

transcript repress = {transcriptional repressor, transcriptional repressors, transcriptional repression, transcriptional repressions, transcription repression, transcription repressions}

Reasons for grouping morphologically related words
• textual variants, counted independently of each other, are scattered in text
• grouping helps the whole word family stand out
• grouping prevents a very infrequent variant from becoming a key term on its own
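A minimal sketch of the grouping idea for unigrams. The suffix list and the `crude_stem` helper are hypothetical stand-ins chosen to reproduce the "recruit" and "corepressor" families shown above; the talk does not say which normalizer eGIFT uses, and a real system would need a fuller stemmer to handle cases such as "transcriptional".

```python
from collections import defaultdict

# Longest suffixes first, so "recruitments" loses "ments" rather than just "s".
SUFFIXES = ("ments", "ment", "ings", "ing", "ed", "s")


def crude_stem(word, min_stem=4):
    """Lowercase, drop hyphens, strip one known suffix (illustrative only)."""
    word = word.lower().replace("-", "")
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_variants(words):
    """Map each crude stem to the set of surface forms that share it."""
    families = defaultdict(set)
    for word in words:
        families[crude_stem(word)].add(word)
    return dict(families)


recruit_family = group_variants(
    ["recruit", "recruits", "recruited", "recruitment", "recruiting", "recruitments"]
)
corepressor_family = group_variants(["corepressor", "corepressors", "co-repressor"])
```

Counting documents per family rather than per surface form is what lets a concept like "corepressor" accumulate evidence from all of its spellings.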

Group morphologically related words

Calculate term scores

dctq = document count of term t in Query Set
Nq = total number of abstracts in Query Set
dctb = document count of term t in Background Set
Nb = total number of abstracts in Background Set

• Calculate Normalized Frequencies
  ftq = dctq / Nq
  ftb = dctb / Nb

  segmentation: ftb = 0.0012, ftq = 0.13
  these: ftb = 0.47, ftq = 0.60

• Calculate Score
  st = score of term t
  ft = frequency of term t
  st = (ftq - ftb) · ln(1 / ftb)

  term           ftq - ftb   st
  segmentation   0.13        0.874
  these          0.13        0.098
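A minimal sketch of the scoring step, assuming the score has the form (ftq - ftb) · ln(1 / ftb). That form is an inference: it reproduces the two example values on this slide (0.874 and 0.098) when the frequency difference is rounded to 0.13. The function names are hypothetical.

```python
import math


def normalized_frequency(doc_count, total_docs):
    """Fraction of abstracts in a set that contain the term."""
    return doc_count / total_docs


def term_score(f_tq, f_tb):
    """Assumed score: frequency gain in the Query Set, weighted by how rare
    the term is in the Background Set."""
    if f_tb <= 0:
        raise ValueError("term must occur in the Background Set")
    return (f_tq - f_tb) * math.log(1.0 / f_tb)


# Slide values: both terms gain about 0.13 in frequency, but "these" is
# common in the background, so its score stays low.
segmentation_score = term_score(0.13, 0.0012)  # roughly 0.87
these_score = term_score(0.60, 0.47)           # roughly 0.098
```

The logarithmic factor is what separates the two terms: the raw frequency difference alone would rank "segmentation" and "these" equally.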

Other scoring methods

• Pearson’s Chi-Square
  • Prefers only highly infrequent terms (bigrams are ranked high)
  • Drops very frequent terms, even when they are much more frequent in the Query Set

• Z-score
  • Performance is highly dependent on the way the Background Set is grouped

• Others considered
  • Ratio of frequencies
  • Tf-Idf
  • Mutual Information
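For comparison, two of the alternatives listed here can be sketched from the same four counts. This is a generic illustration, not the configuration evaluated in the talk: the chi-square uses the standard 2x2 formula without continuity correction and treats the Query Set and Background Set as two separate samples, which is an assumption.

```python
def chi_square_2x2(dc_tq, n_q, dc_tb, n_b):
    """Pearson's chi-square for one term over a 2x2 contingency table."""
    a, b = dc_tq, n_q - dc_tq  # Query Set: abstracts with / without the term
    c, d = dc_tb, n_b - dc_tb  # Background Set: abstracts with / without the term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def frequency_ratio(dc_tq, n_q, dc_tb, n_b):
    """Ratio of normalized frequencies, Query Set over Background Set."""
    return (dc_tq / n_q) / (dc_tb / n_b)
```

Both functions take document counts rather than frequencies, so they can be applied to the same tallies used for any other score.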

Categorize Key Terms

Link sentences to key terms

• eGIFT allows users to see every sentence mentioning a particular key term in the gene’s Query Set

• By reading in context, the user gets a better appreciation of the relationship between the key term and the gene

• From sentences, users can choose which abstracts to read

• Sentences can be saved in gene-specific files (e.g. for annotation)
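A minimal sketch of sentence retrieval for one key term. It assumes abstracts are held in a dict keyed by PMID and that a key term is represented by its set of textual variants; both choices, the naive sentence splitter and the function name are hypothetical.

```python
import re


def sentences_for_term(abstracts, variants):
    """Return (pmid, sentence) pairs for sentences containing any variant."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for pmid, text in abstracts.items():
        # Naive split: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if pattern.search(sentence):
                hits.append((pmid, sentence))
    return hits


# The two example sentences shown earlier in the talk.
example_abstracts = {
    "8892234": "The Groucho protein interacts with Hairy-related transcription "
               "factors to regulate segmentation, neurogenesis and sex determination.",
    "8713081": "The Drosophila protein Groucho is involved in embryonic segmentation "
               "and neural development, and is implicated in the Notch signal "
               "transduction pathway.",
}
segmentation_hits = sentences_for_term(example_abstracts, {"segmentation"})
```

Returning the PMID with each sentence keeps the link back to the abstract, which is what lets a user jump from a sentence to the full text.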

eGIFT Screenshots – Key Terms for Groucho

eGIFT Screenshots – Sentences

Related Work

• Andrade and Valencia (1998)

• Liu et al. (2004)

• e-LiSe (Gladki et al., 2008)

• MedEvi (Kim et al., 2008)

• Anne O’Tate (Smalheiser et al., 2008)

• XplorMed (Perez-Iratxeta et al., 2003)

• Shatkay and Wilbur (2000)

Keywords for a protein family
Z-score
Background divided by literature for individual families

Keyword detection (not necessarily genes)
Z-score
More general background set than ours, grouped randomly

Keyword detection (some just nouns)
More general background set than ours

From kernel document to Query Set of on-topic documents
Background Set contains off-topic documents
Score is ratio of normalized frequencies

Distinguishing Features of eGIFT

• Background Set is specific for genes

• About Set yields better results than the entire Query Set

• Bigrams in addition to unigrams

• Morphological grouping gives “textual concepts”

• New scoring mechanism

• Going beyond key terms
  • Categories of key terms (for interface purposes)
  • Retrieval of sentences containing a specific key term

Future Work

Evaluation

• comparison with other systems

Named Entity Recognition

• extend unigrams and bigrams to full length names

Using other subsets of Query Set

• currently, eGIFT uses the About Set to compute key terms

• different kinds of information can be obtained from variants of Extra Set and other subsets

The End

http://dinah.cis.udel.edu/tudor/eGIFT