Finding Informative Sentences in Full-text Journal Articles
Zhiyong Lu*, William A. Baumgartner Jr., Gregory Caporaso, K. Bretonnel Cohen, Lawrence Hunter
Computational Bioscience Program
University of Colorado School of Medicine
[email protected]
compbio.uchsc.edu/Hunter_lab/Zhiyong
Introduction
•“Informative” sentences make assertions about a gene’s function
•Examples:
–Positive: The in vivo interaction between CIPK23 and CBL1 or CBL9 was confirmed using BiFC assays as shown in Figure 6F. [PMID: 16814720]
–Negative: We do not yet know how these protein complexes activate or inhibit the kaiBC promoter. [PMID: 12441347]
Motivation
•Information Overload
–Double-exponential growth of peer-reviewed literature
–Breakdown of disciplinary boundaries
•Identifying informative sentences can:
–Provide a simple mechanism for aggregating gene function information
–Provide evidence sentences for database annotation
–Provide a basis for generating gene summaries
[Figure: Medline Growth. Horizontal axis: Publication date (1986 to 2005). Vertical axes: New Entries (thousands), 0 to 650, and Total Entries (millions), 6 to 17. Fitted curves: y = ~e^(0.0418x), R² = 0.99, and y = ~e^(0.031x), R² = 0.95.]
[Hunter and Cohen, Mol Cell. Mar 2006]
Related Work
•Gene References Into Function (GeneRIFs) in the Entrez Gene database
•Two Problems
–Many Entrez genes have no GeneRIFs
–GeneRIFs were mostly pulled from abstracts rather than the body of the article
System and Method
I. HTML Parsing: Stripping off HTML tags
II. Document Zoning: Filtering certain sections, e.g. materials and methods
III. Sentence Selection: Scoring each sentence according to its:
1. keywords of interest [user specific]
2. location
3. mentions of gene/protein names
4. summary-indicative cue words
5. mentions of experimental methods
6. relation with figures/tables
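The zoning and scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not give the word lists, the weights, or how "location" is measured, so `CUE_WORDS`, `METHOD_TERMS`, `EXCLUDED_SECTIONS`, `WEIGHTS`, and the first-or-last-sentence location rule are all assumptions, and gene names are matched against a caller-supplied set rather than a real gene/protein tagger. Input is assumed to be already stripped of HTML and split into sections and sentences.

```python
import re

# Illustrative placeholders (assumptions, not values from the slides).
CUE_WORDS = {"confirmed", "demonstrate", "demonstrated", "conclude", "suggest", "suggests"}
METHOD_TERMS = {"bifc", "assay", "assays", "two-hybrid", "immunoprecipitation"}
EXCLUDED_SECTIONS = {"materials and methods", "methods", "references"}
FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.|Table)\s*\d+", re.IGNORECASE)
WEIGHTS = {"keyword": 1.0, "location": 1.0, "gene": 1.0,
           "cue": 1.0, "method": 1.0, "figure": 1.0}


def tokenize(sentence):
    """Split a sentence into alphanumeric tokens (hyphens kept inside tokens)."""
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", sentence)


def keep_section(title):
    """Document zoning: drop sections such as materials and methods."""
    return title.strip().lower() not in EXCLUDED_SECTIONS


def score_sentence(sentence, keywords, gene_names, position, n_sentences):
    """Combine the six features from the slide into one weighted score."""
    tokens = tokenize(sentence)
    lowered = [t.lower() for t in tokens]
    features = {
        # 1. user-specific keywords of interest (expected in lowercase)
        "keyword": sum(t in keywords for t in lowered),
        # 2. location: assumed here to favour first and last sentences
        "location": 1 if position in (0, n_sentences - 1) else 0,
        # 3. gene/protein mentions, matched case-sensitively against a given set
        "gene": sum(t in gene_names for t in tokens),
        # 4. summary-indicative cue words
        "cue": sum(t in CUE_WORDS for t in lowered),
        # 5. mentions of experimental methods
        "method": sum(t in METHOD_TERMS for t in lowered),
        # 6. reference to a figure or table
        "figure": 1 if FIGURE_REF.search(sentence) else 0,
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())


def rank_sentences(sections, keywords, gene_names):
    """sections: list of (title, [sentences]); returns (score, sentence) pairs, best first."""
    scored = []
    for title, sentences in sections:
        if not keep_section(title):
            continue
        for i, sentence in enumerate(sentences):
            score = score_sentence(sentence, keywords, gene_names, i, len(sentences))
            scored.append((score, sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_sentences` with the two example sentences from the Introduction as a "Results" section, `keywords={"interaction"}`, and `gene_names={"CIPK23", "CBL1", "CBL9"}` places the positive example above the negative one, because it mentions genes, a method, a cue word, and a figure.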
Biomedical Full Text Articles
The in vivo interaction between CIPK23 and CBL1 or CBL9 was confirmed using BiFC assays as shown in Figure 6F. [PMID: 16814720]
Two Applications
•Finding More GeneRIFs for Entrez Genes (Lu et al., Pac Symp Biocomput, 2006)
–20% more accurate than other methods
–Predicted GeneRIFs for over 8,000 human genes
•Finding Sentences about Protein-Protein Interaction (BioCreative, 2006)
–An international competition with 11 participating teams
–Finding key sentences for IntAct and MINT database curators