Mining the Biomedical Literature for Genic Information
Presenter: Catalina O. Tudor
BioNLP ’08, June 19, 2008
Catalina O. Tudor, K. Vijay-Shanker, Carl J. Schmidt
University of Delaware
Groucho? I want to know more about this gene…
User Scenario – Groucho and PubMed
PubMed Search Engine
270 abstracts retrieved
User Scenario – Groucho and eGIFT
eGIFT
Key Terms for Groucho
Processes:
• segmentation
• neurogenesis
• embryonic development
...
Descriptors:
• enhancer
• corepressor
...
Domains:
• WD40
• eh1
• WRPW
• basic helix-loop-helix
...
Genes:
• Hairy
• AES
Web Application
All sentences for Groucho containing segmentation
1. The Groucho protein interacts with Hairy-related transcription factors to regulate segmentation, neurogenesis and sex determination. (PMID 8892234)
2. The Drosophila protein Groucho is involved in embryonic segmentation and neural development, and is implicated in the Notch signal transduction pathway. (PMID 8713081)
...
PubMed
most relevant terms associated with the given gene
What does eGIFT provide?
• Two types of users
  • Scientists trying to quickly find information about a gene
  • Annotators trying to quickly locate textual evidence describing gene functions
• Key Terms provide an overall picture about a given gene
• eGIFT allows users to identify the set of documents for a topic relevant to the gene of interest
Overall Approach of eGIFT
1. Retrieve abstracts from PubMed
   • Background Set: all abstracts mentioning “gene” or “protein”
   • Query Set: all abstracts mentioning a given gene
2. Refine Query Set
3. Group morphologically related words
4. Calculate term scores and identify key terms
5. Categorize key terms using controlled vocabularies
6. Link sentences and abstracts to a specific key term
Retrieve abstracts
Background Set
• all abstracts mentioning “gene” or “protein”
• (gene[ti] OR genes[ti] OR protein[ti] OR proteins[ti]) AND hasabstract[text]
• 639,211 abstracts retrieved
Query Set
• all abstracts mentioning a given gene's name, symbol, or synonyms
• Compare information from the Query Set against general information from the Background Set and determine the most specific information in the Query Set
• Compare background and query frequencies of terms to identify statistically interesting cases
[Diagram: PubMed, Background Set, Query Set]
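The two retrieval queries can be sketched as follows. The Background Set query string is the one on the slide; the Query Set query is only an assumption about how a gene's name, symbol and synonyms might be combined (the `[tiab]` field restriction, the helper names `gene_query` and `esearch_url`, and the synonym "Gro" are illustrative, not taken from the talk). The sketch only builds NCBI E-utilities ESearch URLs and does not contact the network.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Background Set query, copied from the slide.
BACKGROUND_QUERY = (
    "(gene[ti] OR genes[ti] OR protein[ti] OR proteins[ti]) "
    "AND hasabstract[text]"
)


def gene_query(names):
    """Build a hypothetical Query Set query from a gene's name, symbol and synonyms.

    Assumption: each name is searched in title/abstract ([tiab]) and the
    names are OR-ed together; the slides do not give the actual query.
    """
    alternatives = " OR ".join(f'"{name}"[tiab]' for name in names)
    return f"({alternatives}) AND hasabstract[text]"


def esearch_url(term, retmax=0):
    """Return an ESearch URL for PubMed; retmax=0 asks only for the hit count."""
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})


if __name__ == "__main__":
    print(esearch_url(BACKGROUND_QUERY))
    print(esearch_url(gene_query(["Groucho", "Gro"])))  # "Gro" is an illustrative synonym
```

Keeping query construction separate from fetching makes it easy to inspect the exact search string before any abstracts are downloaded.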
Refine Query Set
Query Set = all abstracts mentioning the given gene
Query Set contains two types of abstracts
1. About Set
   • abstracts which focus on the given gene
2. Extra Set
   • abstracts which focus on other topics but happen to mention the gene
Heuristics for identifying an About abstract
• the given gene name occurs in the title, or in the first or last sentence of the abstract
• the given gene name occurs 3+ times in the abstract
[Diagram: Query Set split into About Set and Extra Set]
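A minimal sketch of the About/Extra heuristics, under stated assumptions: the two heuristics are combined with OR, sentences are split naively on end punctuation, and gene names are matched case-insensitively as whole words. None of these details are given on the slides, and the function names are illustrative.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption): break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def is_about(title, abstract, gene_names, min_mentions=3):
    """Return True if the abstract looks like it focuses on the gene.

    Heuristics from the slide, combined with OR (an assumption):
      1. a gene name occurs in the title, or in the first or last sentence;
      2. gene names occur at least `min_mentions` times in the abstract.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in gene_names) + r")\b",
        re.IGNORECASE,
    )
    if pattern.search(title):
        return True
    sentences = split_sentences(abstract)
    if sentences and (pattern.search(sentences[0]) or pattern.search(sentences[-1])):
        return True
    return len(pattern.findall(abstract)) >= min_mentions


def refine_query_set(query_set, gene_names):
    """Split (title, abstract) pairs into an About Set and an Extra Set."""
    about, extra = [], []
    for title, abstract in query_set:
        (about if is_about(title, abstract, gene_names) else extra).append((title, abstract))
    return about, extra
```

With these rules, an abstract whose title contains "Groucho-mediated" lands in the About Set, while one that mentions Groucho once in a middle sentence lands in the Extra Set.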
Refine Query Set – About Set example
Multiple RTK pathways downregulate Groucho-mediated repression in Drosophila embryogenesis.
RTK pathways establish cell fates in a wide range of developmental processes. However, how the pathway effector MAPK coordinately regulates the expression of multiple target genes is not fully understood. We have previously shown that the EGFR RTK pathway causes phosphorylation and downregulation of Groucho, a global co-repressor that is widely used by many developmentally important repressors for silencing their various targets. Here, we use specific antibodies that reveal the dynamics of Groucho phosphorylation by MAPK, and show that Groucho is phosphorylated in response to several RTK pathways during Drosophila embryogenesis. Focusing on the regulation of terminal patterning by the Torso RTK pathway, we demonstrate that attenuation of Groucho's repressor function via phosphorylation is essential for the transcriptional output of the pathway and for terminal cell specification. Importantly, Groucho is phosphorylated by an efficient mechanism that does not alter its subcellular localisation or decrease its stability; rather, modified Groucho endures long after MAPK activation has terminated. We propose that phosphorylation of Groucho provides a widespread, long-term mechanism by which RTK signals control target gene expression.
PMID - 18216172
Refine Query Set – Extra Set example
Engrailed defines the position of dorsal di-mesencephalic boundary by repressing diencephalic fate.
Regionalization of a simple neural tube is a fundamental event during the development of central nervous system. To analyze in vivo the molecular mechanisms underlying the development of mesencephalon, we ectopically expressed Engrailed, which is expressed in developing mesencephalon, in the brain of chick embryos by in ovo electroporation. Misexpression of Engrailed caused a rostral shift of the di-mesencephalic boundary, and caused transformation of dorsal diencephalon into tectum, a derivative of dorsal mesencephalon. Ectopic Engrailed rapidly repressed Pax-6, a marker for diencephalon, which preceded the induction of mesencephalon-related genes such as Pax-2, Pax-5, Fgf8, Wnt-1 and EphrinA2. In contrast, a mutant Engrailed, En-2(F51→E), bearing mutation in EH1 domain, which has been shown to interact with a co-repressor, Groucho, did not show the phenotype induced by wild-type Engrailed. Furthermore, VP16-Engrailed chimeric protein, the dominant positive form of Engrailed, caused caudal shift of di-mesencephalic boundary and ectopic Pax-6 expression in mesencephalon. These data suggest that (1) Engrailed defines the position of dorsal di-mesencephalic boundary by directly repressing diencephalic fate, and (2) Engrailed positively regulates the expression of mesencephalon-related genes by repressing the expression of their negative regulator(s).
PMID - 10529429
Group morphologically related words - example
• The Drosophila Groucho transcriptional corepressor protein has been shown to interact with the DNA-binding bHLH domain of Enhancer of split , Hairy and Deadpan proteins.
• Groucho acts as a co-repressor for several Drosophila DNA binding transcriptional repressors.
• Dorsal represses transcription by recruiting the co-repressor Groucho
• The results indicate that FoxD3 recruitment of Groucho corepressors is essential for the transcriptional repression of target genes and induction of mesoderm in Xenopus.
corepressor = {corepressor, corepressors, co-repressor, …}
transcription repress = {transcriptional repressors, transcriptional repression, …}
Group morphologically related words
Unigram example
recruit = {recruit, recruits, recruited, recruitment, recruiting, recruitments}
Bigram example
transcript repress = {transcriptional repressor, transcriptional repressors, transcriptional repression, transcriptional repressions, transcription repression, transcription repressions}
Reasons for grouping morphologically related words
• textual variants, counted independently of each other, are scattered in text
• grouping helps the word family stand out
• grouping prevents a very infrequent variant from becoming a key term
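The grouping idea can be illustrated with a deliberately crude suffix stripper. This is not the method used by eGIFT (the slides do not say how variants are grouped); the suffix list and the names `crude_stem`, `stem_term` and `group_variants` are assumptions chosen so that the slide's unigram and bigram examples collapse into one family each.

```python
from collections import defaultdict

# Illustrative suffix list (assumption), checked in this order.
SUFFIXES = ("ional", "ions", "ion", "ments", "ment", "ors", "or", "ing", "ed", "s")


def crude_stem(word):
    """Lowercase, drop hyphens, and strip one known suffix if at least 4 letters remain."""
    w = word.lower().replace("-", "")
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[: -len(suffix)]
    return w


def stem_term(term):
    """Stem every word of a unigram or bigram."""
    return " ".join(crude_stem(word) for word in term.split())


def group_variants(terms):
    """Map each stemmed key to the set of surface variants that share it."""
    groups = defaultdict(set)
    for term in terms:
        groups[stem_term(term)].add(term)
    return dict(groups)


if __name__ == "__main__":
    variants = ["recruit", "recruits", "recruited", "recruitment",
                "transcriptional repressor", "transcription repression"]
    for key, members in group_variants(variants).items():
        print(key, "=", sorted(members))
```

Counting documents per family rather than per variant is what lets a concept such as "transcript repress" accumulate enough evidence to be scored as a single term.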
Calculate term scores
• Calculate Normalized Frequencies
$f_{tq} = dc_{tq} / N_q$
$f_{tb} = dc_{tb} / N_b$
$dc_{tq}$ = document count of term t in Query Set
$N_q$ = total number of abstracts in Query Set
$dc_{tb}$ = document count of term t in Background Set
$N_b$ = total number of abstracts in Background Set
• Calculate Score
$s_t = (f_{tq} - f_{tb}) \cdot \ln(1 / f_{tb})$
$s_t$ = score of term t
$f_t$ = frequency of term t
term            f_tb      f_tq    f_tq - f_tb    s_t
segmentation    0.0012    0.13    0.13           0.874
these           0.47      0.60    0.13           0.098
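The following sketch computes normalized frequencies and a term score. The score formula used here, (ftq − ftb) × ln(1 / ftb), is an assumption: it is chosen because it reproduces the two scores shown on the slide from the listed frequencies (exactly for "these", and to within rounding of the inputs for "segmentation"). Function names are illustrative.

```python
import math


def normalized_frequency(doc_count, total_docs):
    """Fraction of abstracts in a set that contain the term."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return doc_count / total_docs


def term_score(f_tq, f_tb):
    """Assumed score: frequency difference weighted by rarity in the Background Set."""
    if f_tb <= 0:
        raise ValueError("background frequency must be positive")
    return (f_tq - f_tb) * math.log(1.0 / f_tb)


if __name__ == "__main__":
    # Frequencies taken from the slide.
    print(round(term_score(0.13, 0.0012), 3))  # segmentation: 0.866 (slide shows 0.874)
    print(round(term_score(0.60, 0.47), 3))    # these: 0.098
```

Both terms have the same frequency difference of about 0.13, so the logarithmic factor is what separates a gene-specific term like "segmentation" from a common word like "these".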
Other scoring methods
• Pearson’s Chi-Square
  • Prefers only highly infrequent terms (bigrams are ranked high)
  • Drops very frequent terms, even when they are much more frequent in the Query Set
• Z-score
  • Performance is highly dependent on the way the Background Set is grouped
• Others considered
  • Ratio of frequencies
  • Tf-Idf
  • Mutual Information
Link sentences to key terms
• eGIFT allows users to see every sentence mentioning a particular key term in the gene’s Query Set
  • by reading in context, the user gets a better appreciation of the relationship between the key term and the gene
• From sentences users can choose which abstracts to read
• Sentences can be saved in gene-specific files (e.g. for annotation)
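Sentence linking can be sketched as a search for any surface variant of a key term within each abstract of the Query Set. The data layout (a dict from PMID to abstract text), the naive sentence splitter, and the function names are assumptions for illustration; the two example sentences are the ones shown earlier for Groucho and "segmentation".

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption); abbreviations such as 'e.g.' will confuse it."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def sentences_for_term(abstracts, variants):
    """Return (pmid, sentence) pairs for sentences containing any variant of a key term.

    abstracts: dict mapping PMID -> abstract text
    variants:  surface forms belonging to one key term family
    """
    ordered = sorted(variants, key=len, reverse=True)  # try longer variants first
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b", re.IGNORECASE
    )
    hits = []
    for pmid, text in abstracts.items():
        for sentence in split_sentences(text):
            if pattern.search(sentence):
                hits.append((pmid, sentence))
    return hits


if __name__ == "__main__":
    query_set = {
        "8892234": "The Groucho protein interacts with Hairy-related transcription "
                   "factors to regulate segmentation, neurogenesis and sex determination.",
        "8713081": "The Drosophila protein Groucho is involved in embryonic segmentation "
                   "and neural development, and is implicated in the Notch signal "
                   "transduction pathway.",
    }
    for pmid, sentence in sentences_for_term(query_set, {"segmentation"}):
        print(f"{sentence} (PMID {pmid})")
```

Returning the PMID with each sentence keeps the link back to the abstract, so a user can go from a sentence to the full abstract or save it for annotation.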
Related Work
• Andrade and Valencia (1998)
• Liu et al. (2004)
• e-LiSe (Gladki et al., 2008)
• MedEvi (Kim et al., 2008)
• Anne O’Tate (Smalheiser et al., 2008)
• XplorMed (Perez-Iratxeta et al., 2003)
• Shatkay and Wilbur (2000)
Keywords for a protein family
Z-score
Background divided by literature for individual families
Keyword detection (not necessarily genes)
Z-score
More general background set than us, grouped randomly
Keyword detection (some just nouns)
More general background set than us
From kernel document to Query Set of on-topic documents
Background Set contains off-topic documents
Score is ratio of normalized frequencies
Distinguishing Features of eGIFT
• Background Set is specific for genes
• About Set yields better results than the entire Query Set
• Bigrams in addition to unigrams
• Morphological grouping gives “textual concepts”
• New scoring mechanism
• Going beyond key terms
• Categories of key terms (for interface purposes)
• Retrieval of sentences containing a specific key term
Future Work
Evaluation
• comparison with other systems
Named Entity Recognition
• extend unigrams and bigrams to full length names
Using other subsets of Query Set
• currently, eGIFT uses the About Set to compute key terms
• different kinds of information can be obtained from variants of Extra Set and other subsets
The End
http://dinah.cis.udel.edu/tudor/eGIFT