Grounding Gene Mentions with Respect to Gene Database Identifiers
Ben Hachey
BioNLP Reading Group
18.07.2005
Overview of BioCreAtIvE Task 1B
18/07/2005 BioCreative Task 1B 2
Outline
• BioCreAtIvE Task 1B
• Approaches to Task 1B
• Related Work
• Conclusions
BioCreAtIvE
• Critical Assessment of Information Extraction Systems in Biology
  – Task 1A Named Entity Recognition
    • Given a single sentence from an abstract, to identify all mentions of genes
    • “(or proteins where there is ambiguity)”
  – Task 1B Entity Normalisation
    • Given NER’d abstract, associate list of unique identifiers
  – Task 2 Automatic GO code annotation
Example Abstract from Fly
Input:
Since Dpp and Gbb levels are not detectably higher in the early phases of cross vein development, other factors apparently account for this localized activity. Our evidence suggests that the product of the crossveinless 2 gene is a novel member of the BMP-like signaling pathway required to potentiate Gbb of Dpp signaling in the cross veins. crossveinless 2 is expressed at higher levels in the developing cross veins and is necessary for local BMP-like activity.
Output:
FBgn0000395 crossveinless 2
FBgn0000490 Dpp
FBgn0024234 Gbb
Analogous Tasks
• Can be seen as…
  – Grounding:
    Tying a textual mention of an entity to its identifier in a gene database/ontology
    Provides a list, without repetition, of the entities referred to in the sentence (Information Extraction)
  – Coreference:
    Identifying which textual mentions refer to the same entity
  – Lexical Entailment:
    Whether a term is substitutable in a given context
Resources
• Synonym database provided for each organism:
  – Fly: Drosophila melanogaster
  – Yeast: Saccharomyces cerevisiae
  – Mouse: Mus musculus
• These list a number of different textual realisations for each unique gene identifier
Fly Synonym DB Examples
ID Synonyms
FBgn0000395 CG15671, CT35855, crossveinless 2, cv 2, cv-2
FBgn0000490 CG9885, DPP, DPP C, DPP-C, Dpp, Haplo insufficient, Haplo-insufficient, Hin d: Haplo insufficient, Hind: Haplo-insufficient, M(2)23AB, M(2)LS1, Tegula, Tg: Tegula, blink, blk: blink, decapentaplegic, dpp, heldout, ho, ho: heldout, l(2)10638, l(2)22Fa, l(2)k17036, shortvein, Shv
FBgn0001105 CG10545, G beta, G betab, G protein &bgr; subunit, G protein &bgr;-subunit 13F, G protein beta 13F, G protein beta subunit, G protein beta subunit 13F, G protein beta-subunit 13F, G&bgr;, G&bgr;13F, G-&bgr;b, Gbeta, G-betab, G-protein &bgr; 13F, G-protein beta 13F, G<down>&bgr;</down> brain, Gb13F, Gbb, Gbeta, Gbeta brain, Gbeta13F, anon EST:Liang 1.22, anon-EST:Liang-1.22, clone 1.22, dg&bgr;, dgbeta
FBgn0017531 Spal\crossveinless 2, Spal\crossveinless-2, crossveinless 2, crossveinless-2
FBgn0018552 Dpse\cv2, crossveinless, crossveinless 2, crossveinless-2, cv
FBgn0024200 CG9936 Pap/Trap, Scad78, Suppressor of constitutively activated Dpp signaling 78, TRAP240, bli, blind spot, bls, dTRAP240, flytrap, l(3)L7062, l(3)rK760, pap, pap/dTRAP240, poils aux pattes
FBgn0024234 60A, CG5562, Gbb, Gbb 60A, Gbb-60A, SixtyA, TGF&bgr;-60A, TGFbeta 60A, TGFbeta-60A, Tgf&bgr;-60A, Tgfb 60, Tgfb-60, Tgfbeta 60A, Tgfbeta-60A, Transforming growth factor &bgr; at 60A, Transforming growth factor beta at 60A, gbb, gbb 60A, gbb-60A, gcn, gcn: gonial cell neoplasm, gcn: gonial-cell-neoplasm, glass bottom boat, glass bottom boat 60A, glass bottom boat-60A, l(2)60A J, l(2)60A-J, tgfb 60A, tgfb-60A, vgr/60A
FBgn0044017 Scad67, Suppressor of constitutively activated Dpp signaling 67
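The table above already shows why grounding is harder than lookup: one surface form can belong to several identifiers. A minimal sketch, assuming the synonym database is loaded into a Python dict (the entries are abridged from the rows above; `build_index` and `ambiguous` are illustrative names, not part of the task materials):

```python
from collections import defaultdict

# Abridged copies of rows from the fly synonym table above.
SYNONYMS = {
    "FBgn0000395": ["CG15671", "CT35855", "crossveinless 2", "cv 2", "cv-2"],
    "FBgn0017531": ["crossveinless 2", "crossveinless-2"],
    "FBgn0018552": ["crossveinless", "crossveinless 2", "crossveinless-2", "cv"],
    "FBgn0001105": ["CG10545", "Gbb", "Gbeta", "Gbeta13F"],
    "FBgn0024234": ["60A", "CG5562", "Gbb", "gbb", "glass bottom boat"],
}


def build_index(synonyms):
    """Invert the database: synonym string -> set of gene IDs that list it."""
    index = defaultdict(set)
    for gene_id, names in synonyms.items():
        for name in names:
            index[name].add(gene_id)
    return index


def ambiguous(index):
    """Return only the synonyms that are listed under more than one gene ID."""
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}


index = build_index(SYNONYMS)
for name, ids in sorted(ambiguous(index).items()):
    print(name, "->", ", ".join(ids))
```

In this abridged sample, "crossveinless 2" is listed under three identifiers and "Gbb" under two, so a matched mention still needs a disambiguation step before an ID can be reported.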
Data Preparation
• (Start with documents whose full text has been manually curated)
• Noisy Training Data
  1. Automatically eliminate gene IDs not found in abstract
     – Fly: 0.83, Mouse: 0.71, Yeast: 0.92 (quality)
• Testing Gold Standard
  2. Hand check for over-zealous elimination
  3. Add genes mentioned “in passing” (so task is same across organisms)
     – Fly: 0.93, Mouse: 0.87, Yeast: 0.96 (agreement)
     – 250 abstracts/organism
Evaluation
• Precision, recall, and balanced f-score automatically calculated with respect to gold standard gene ID lists
• 8 teams total
  – Various numbers of submissions (0-3) on each organism
• Number of submissions
  – Fly: 11
  – Mouse: 16
  – Yeast: 15
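The scoring described above can be sketched as a set comparison between a predicted gene ID list and the gold list. This is a minimal sketch for a single abstract; how the official scorer aggregates over abstracts is not stated on the slide, and the function names `prf` and `balanced_f` are illustrative assumptions.

```python
def balanced_f(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf(predicted_ids, gold_ids):
    """Precision, recall and balanced F for one predicted gene ID list."""
    predicted, gold = set(predicted_ids), set(gold_ids)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, balanced_f(precision, recall)


# Gold list from the fly example abstract; the prediction confuses the two
# genes that both list "Gbb" as a synonym.
gold = ["FBgn0000395", "FBgn0000490", "FBgn0024234"]
predicted = ["FBgn0000395", "FBgn0000490", "FBgn0001105"]
print(prf(predicted, gold))
```

Here two of the three predicted IDs are correct, so precision, recall and F all come out at two thirds.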
Top Systems Performance (P/R/F)
Fly                         Mouse                       Yeast                       Focus
83.1/80.0/81.5 (f rank: 1)  76.5/81.9/79.1 (f rank: 1)  96.6/84.0/89.9 (f rank: 3)  16 – Hanisch et al. (Fraunhofer, LMU Munich)
69.2/76.5/72.6 (f rank: 2)  82.8/67.6/74.4 (f rank: 4)  95.0/89.4/92.1 (f rank: 1)  8 – Crim et al. (UPenn)
– – –                       76.4/78.7/77.6 (f rank: 2)  91.7/87.8/89.7 (f rank: 4)  24 – Fundel et al. (LMU Munich)
46.3/38.0/41.7 (f rank: 5)  72.8/64.8/68.6 (f rank: 6)  94.0/87.1/90.4 (f rank: 2)  18 – No paper (???)
59.2/74.8/66.1 (f rank: 3)  81.1/67.6/73.7 (f rank: 5)  91.5/79.0/84.8 (f rank: 6)  5 – Hachey et al. (Edinburgh, Stanford)
– – –                       78.5/70.9/74.5 (f rank: 3)  90.7/81.4/85.8 (f rank: 5)  6 – Tamames (BioAlma)
Approaches
1. Use synonyms for simple matching against text
   – Difficult to ID false positives
     • Especially mouse and fly where synonyms include e.g. common words (with, at, yellow, …)
2. ID gene text, then ground
   – Leverage NER system from Task 1A
   – Limited by performance of NER
     • 78.8% precision, 73.5% recall, 76.1 balanced F
     • 37% of FPs and 39% of FNs due to boundary problems
Synonym lists not exhaustive
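Approach 1 and its false-positive problem can be sketched as follows. This is a minimal illustration, not any team's system: the identifier `HYPOTHETICAL_ID` is a placeholder for a gene whose synonym list contains a common word, and `match_synonyms` is an assumed name.

```python
import re

# Two real rows (abridged) plus a placeholder entry standing in for a gene
# whose synonym list contains a common English word such as "at".
SYNONYMS = {
    "FBgn0000490": ["Dpp", "dpp", "decapentaplegic"],
    "FBgn0024234": ["Gbb", "gbb", "glass bottom boat"],
    "HYPOTHETICAL_ID": ["at"],
}


def match_synonyms(text, synonyms):
    """Case-sensitive whole-token matching of every synonym against the text."""
    found = {}
    for gene_id, names in synonyms.items():
        for name in names:
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, text):
                found.setdefault(gene_id, []).append(name)
    return found


text = (
    "Since Dpp and Gbb levels are not detectably higher in the early phases "
    "of cross vein development, other factors apparently account for this "
    "localized activity. crossveinless 2 is expressed at higher levels in "
    "the developing cross veins."
)
print(match_synonyms(text, SYNONYMS))
```

The two genuine mentions are found, but the preposition "at" also fires on the placeholder entry, which is the kind of false positive that plain matching cannot tell apart from a real mention.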
Information Sources
• Edit synonym list?
  – Add other specific and frequently used synonyms
  – Remove problematic synonyms
• String similarity
  – Matching against synonym list
  – Fuzzy matching (spelling variations, abbreviations, …)
• Coreference
  – Synonym in same text
• Other contextual evidence
  – Gene co-occurrence in same text
  – Word context around entity…
• Probabilistic/Statistical models
  – Pr(geneID), Pr(geneID|synonym)
Top Systems Overview
Approach Inf. Sources
Team 1 2 EdSyn StrSim Coref OthrC Prob
user16 (h)
user8 (a)
user24 (h)
user5
user6 (?)
Approach: 1 – match syns to text; 2 – NER, match ent to syns
Inf. Sources: EdSyn – Edited Syn List; StrSim – String Sim/Fuzzy Match; Coref – Coreference; OthrC – Other Contextual; Prob – Prob/Stat Models
Systems
• Team: user24 (mouse, yeast)
  – Katrin Fundel, Daniel Güttler, Ralf Zimmer, and Joannis Apostolakis
  – Ludwig-Maximilians-Universität München
• Approach: No NER, match synonyms to text
  – Rule-based generation and curation of synonym lists
    • Remove unspecific and inappropriate synonyms
    • Expanded to include additional, frequently used synonyms
  – Automatic rule-based edit system
  – Human curation to assure quality
    • Tuned using training data
  – Select all matches
  – Post-filter: Remove matches with non-gene context (e.g. ‘cells’, ‘domains’, ‘cell type’, ‘DNA binding site’)
Semi-automatic syn list curation, could be used for gazetteers!
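The post-filter step can be sketched roughly as below. This is only an illustration under assumptions: it checks the words immediately following a match, uses just the four context terms quoted on the slide, and the example sentence is invented; the actual rules of the user24 system are not given here.

```python
import re

# Context terms quoted on the slide; the real filter list may be longer.
NON_GENE_CONTEXT = ("cells", "domains", "cell type", "DNA binding site")

# Matches optional whitespace followed by one of the context terms.
_CONTEXT_RE = re.compile(
    r"\s*(?:" + "|".join(re.escape(term) for term in NON_GENE_CONTEXT) + r")\b",
    re.IGNORECASE,
)


def find_matches(text, synonym):
    """Return (start, end) spans of whole-token occurrences of a synonym."""
    pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
    return [m.span() for m in re.finditer(pattern, text)]


def post_filter(text, spans):
    """Drop spans that are directly followed by a non-gene context term."""
    return [(s, e) for s, e in spans if not _CONTEXT_RE.match(text, e)]


# Invented sentence for illustration only.
text = "Dpp is expressed early, whereas Dpp domains were not examined."
spans = find_matches(text, "Dpp")
print(post_filter(text, spans))
```

Only the first occurrence survives, since the second is followed by "domains" and is therefore treated as not referring to the gene itself.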
Systems
• Team: user16 (fly, mouse, yeast)
  – Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck
  – Fraunhofer Institute & Ludwig-Maximilians-Universität München
• Approach: No NER, match synonyms to text
  – Synonym list expanded (offline)
    • Automatic rule-based edit system w/ human curation to assure quality
  – Rule-based classification of synonyms
    • Class I: Case-insensitive near-synonyms
    • Class II: Case-sensitive near-synonyms
    • Class III: Questionable synonyms (high frequency, inexact match)
  – Select n highest scoring matches (Hanisch et al., 2003)
    • Focus on matching multi-word terms
Syn list curation, multi-word term matching!
Systems
• Team: user8 (fly, mouse, yeast)
  – Jeremiah Crim, Ryan McDonald, and Fernando Pereira
  – University of Pennsylvania
• Approach: No NER, match synonyms to text
  – Pattern Matching
    • Synonym list pruned by threshold on conditional probability of a gene ID (g) being a label for a document given that a synonym (s) matches
    • List of candidate gene IDs compiled by selecting 1000 training documents with highest token-level cosine similarity
  – Match Classification
    • Binary maximum entropy classifier trained to predict whether gene IDs selected by pattern matching should be kept
    • Fly: +7.7, Mouse: +1.5, Yeast: -0.4
Prob models (pruning, disambiguation), no human curation!
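The pruning idea, thresholding the probability that gene ID g labels a document given that synonym s matches it, can be sketched with simple counts over training documents. The toy documents, the placeholder identifier `GENE_AT`, the 0.5 threshold and the function names are all assumptions for illustration; the slide does not give the estimator or the threshold actually used.

```python
import re


def matches(synonym, text):
    """Whole-token, case-sensitive match of a synonym in a text."""
    return re.search(r"(?<!\w)" + re.escape(synonym) + r"(?!\w)", text) is not None


def conditional_probs(docs, synonyms):
    """Estimate Pr(g labels doc | s matches doc) for every (s, g) pair."""
    probs = {}
    for gene_id, names in synonyms.items():
        for name in names:
            hit_docs = [gold for text, gold in docs if matches(name, text)]
            if hit_docs:
                labelled = sum(1 for gold in hit_docs if gene_id in gold)
                probs[(name, gene_id)] = labelled / len(hit_docs)
    return probs


def prune(synonyms, probs, threshold):
    """Keep only synonyms whose estimated probability reaches the threshold."""
    return {
        gene_id: [n for n in names if probs.get((n, gene_id), 0.0) >= threshold]
        for gene_id, names in synonyms.items()
    }


# Toy training set: (abstract text, gold gene ID list). Invented for illustration.
DOCS = [
    ("Dpp is expressed at high levels", {"FBgn0000490"}),
    ("Dpp acts at the margin", {"FBgn0000490"}),
    ("expression peaks at stage 5", set()),
    ("the at locus was mapped", {"GENE_AT"}),
]
SYNONYMS = {"FBgn0000490": ["Dpp"], "GENE_AT": ["at"]}

probs = conditional_probs(DOCS, SYNONYMS)
print(probs)
print(prune(SYNONYMS, probs, threshold=0.5))
```

In the toy data "at" matches all four documents but is a correct label in only one, so it falls below the threshold and is pruned, while "Dpp" is kept.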
Systems
• Team: user5 (fly, mouse, yeast)
  – Ben Hachey, Huy Nguyen, Malvina Nissim, Bea Alex, and Claire Grover
  – University of Edinburgh & Stanford
• Approach: NER, match entities to synonyms
  1. Build organism-specific named entity recognition
     • Noisy training data obtained from Task 1B materials
  2. Match gene entities to synonym lists (fuzzy)
     • Incorporates various edit operations (e.g. case folding, optional dashes and other punctuation, British/American spellings)
     • Tuned per-organism to select and order edit operations
  3. Disambiguate each entity to a single gene ID
     • Various heuristic and statistical approaches (e.g. gene ID co-occurrence, IR query term weighting, repetition in synonym list)
     • Again, optimised per-organism
Bootstrapping NE data, IR term weighting, no human curation!
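Step 2 above, fuzzy matching through an ordered sequence of edit operations, can be sketched as a back-off: try the strictest comparison first and relax it one operation at a time. The three operations, their order and the name `ground` are illustrative assumptions; the slide only says that the operations were selected and ordered per organism.

```python
import re


def exact(s):
    """No change: compare strings as they are."""
    return s


def fold_case(s):
    """Ignore capitalisation."""
    return s.lower()


def drop_punct(s):
    """Treat dashes, other punctuation and spaces as optional."""
    return re.sub(r"[-_/.,()\s]+", "", s)


# Assumed ordering; each level applies all operations up to and including it.
EDIT_OPS = [exact, fold_case, drop_punct]

SYNONYMS = {
    "FBgn0000395": ["crossveinless 2", "cv 2", "cv-2"],
    "FBgn0000490": ["Dpp", "decapentaplegic"],
    "FBgn0024234": ["Gbb", "glass bottom boat"],
}


def ground(mention, synonyms, ops):
    """Return (candidate gene IDs, name of the operation that first matched)."""
    for level in range(len(ops)):
        def normalise(s, active=ops[: level + 1]):
            for op in active:
                s = op(s)
            return s

        key = normalise(mention)
        ids = sorted(
            gene_id
            for gene_id, names in synonyms.items()
            if any(normalise(name) == key for name in names)
        )
        if ids:
            return ids, ops[level].__name__
    return [], None


for mention in ["Dpp", "Crossveinless 2", "CV2", "Glass-bottom boat", "BMP"]:
    print(mention, "->", ground(mention, SYNONYMS, EDIT_OPS))
```

Stopping at the first level that yields candidates keeps looser operations from adding spurious IDs when a stricter match already exists; a mention that still maps to several IDs would then go on to the disambiguation step.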
Systems
• Team: user6 (fly, mouse, yeast)
  – Javier Tamames
  – BioAlma SL
• Approach: NER, match entities to synonyms
  1. NER for various bio ents (e.g. genes, proteins, compounds)
     • Also bio-medical semantic tagging of words, e.g. core terms (receptor, kinase, …) and types (alpha, a1, …)
  2. Match gene entities to synonym lists (fuzzy)
     • Use BioCreAtIvE lists and other relevant databases
     • Match and weighting based on semantic labels
  3. Disambiguate each entity to a single gene ID
     • Use of key words extracted from databases (e.g. HUGO, MGI, SGD)
Semantic tagging module, Key word context from org DBs!
Related Work
• Ben Wellner (2005). Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. In: Proceedings of BioLink-2005.
• …
Conclusions
• Model that can be automatically tuned to e.g. domain, organism
• Proper modelling of:
  – Abbreviations
  – Spelling variants
  – Coreference in abstracts
  – Textual context, key words
  – Entity co-occurrence
  – Entity and term distributions
  – Token semantic roles
Thank you
References
Lynette Hirschman, Marc Colosimo, Alexander Morgan, Jeffrey Colombe, and Alexander Yeh (2004). Task 1B: Gene list task. In: Proceedings BioCreAtIvE Workshop.
Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck (2004). ProMiner: Organism-specific protein name detection using approximate string matching. In: Proceedings BioCreAtIvE Workshop. [user16]
Katrin Fundel, Daniel Güttler, Ralf Zimmer, and Joannis Apostolakis (2004). Exact versus approximate string matching for protein name identification. In: Proceedings BioCreAtIvE Workshop. [user24]
Jeremiah Crim, Ryan McDonald, and Fernando Pereira (2004). Automatically annotating documents with normalized gene lists. In: Proceedings BioCreAtIvE Workshop. [user8]
Ben Hachey, Huy Nguyen, Malvina Nissim, Bea Alex, and Claire Grover (2004). Grounding gene mentions with respect to gene database identifiers. In: Proceedings BioCreAtIvE Workshop. [user5]
Javier Tamames (2004). Text detective: BioAlma’s gene annotation tool. In: Proceedings BioCreAtIvE Workshop. [user6]
Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen, and Ralf Zimmer (2003). Playing biology’s name game: Identifying protein names in scientific text.
The SEER Project Team
Edinburgh: Bea Alex, Shipra Dingare, Claire Grover, Ben Hachey, Ewan Klein, Yuval Krymolowski, Malvina Nissim
Stanford: Jenny Finkel, Chris Manning, Huy Nguyen
Top Systems Performance (P/R/F)
Fly                  Mouse                Yeast                Focus
83.1/80.0/81.5 (16)  76.5/81.9/79.1 (16)  95.0/89.4/92.1 (8)   16 – Hanisch et al. (Fraunhofer, LMU Munich)
69.2/76.5/72.6 (8)   76.4/78.7/77.6 (24)  94.0/87.1/90.4 (18)  8 – Crim et al. (UPenn)
59.2/74.8/66.1 (5)   78.5/70.9/74.5 (6)   96.6/84.0/89.9 (16)  24 – Fundel et al. (LMU Munich)
31.5/73.2/44.0 (23)  82.8/67.6/74.4 (8)   91.7/87.8/89.7 (24)  18 – No paper (???)
46.3/38.0/41.7 (18)  81.1/67.6/73.7 (5)   90.7/81.4/85.8 (6)   5 – Hachey et al. (Edinburgh, Stanford)
22.4/38.9/28.4 (19)  72.8/64.8/68.6 (18)  91.5/79.0/84.8 (5)   6 – Tamames (BioAlma)
Top Systems Rank
Fly  Mouse  Yeast  Focus
16   16     8      16 – Hanisch et al. (Fraunhofer, LMU Munich)
8    24     18     8 – Crim et al. (UPenn)
5    6      16     24 – Fundel et al. (LMU Munich)
23   8      24     18 – ???
18   5      6      5 – Hachey et al. (Edinburgh, Stanford)
19   18     5      6 – Tamames (BioAlma)