The BIOCREATIVE Task in SEER

Upload: carter-ford

Post on 28-Mar-2015


Page 1: The BIOCREATIVE Task in SEER. Outline Background for biomedical information extraction and BIOCREATIVE BIOCREATIVE NER Task Stanford-Edinburgh System

The BIOCREATIVE Task in SEER

Page 2

Outline

• Background for biomedical information extraction and BIOCREATIVE

• BIOCREATIVE NER Task

• Stanford-Edinburgh System

• Problems

Page 3

Terms and Resources

Gene: An ordered sequence of nucleotides that encodes a product such as a protein.

Protein: Gene products composed of chains of amino acids, with sophisticated structures; kinases, enzymes, etc. are types of proteins.

Nucleotide: Thousands of nucleotides link to form a DNA/RNA molecule.

Molecular Biology: The branch of biology studying all of the above.

MEDLINE: The primary research database of the biomedical community, from nursing to drugs to genetics.

Gene Databases: FlyBase, MGI (mouse), Saccharomyces Genome Database (yeast).

Other Databases: Swiss-Prot (amino acid sequences of proteins), GenBank (nucleotide sequences of genes).

Page 4

Biotechnology Information Explosion

David Landsman NCBI Presentation

Page 5

NER in the Biomedical Domain

• Many types of entities can be studied in the biomedical domain (drug names, chemicals)

• Much research has focused on molecular biological entities, particularly genes and proteins

Page 6

Gene Names

• Genes and gene products are constantly being discovered and new names invented

• Nomenclatures exist but vary from organism to organism

• Diverse:

– ‘bride of frizzled disco’, ‘cheap date’, ‘broken heart’

– ‘REP2’, ‘RFM’

• Ambiguous:

– With other genes

– Acronyms

– With proteins, where genes and their products are often referred to by the same name (the first gene in LocusLink is officially alpha-1-B-glycoprotein)

Page 7

Varying Tasks, Results and Evaluation Methods

• Proux et al 1998. F-Score: 0.92 (Gene). Evaluation corpus: 750 sentences from FlyBase in which each gene is referred to by its official, single-word name; only sentences containing at least 2 gene mentions that appear in the dictionary were kept, and all the articles concern Drosophila melanogaster.

• Fukuda et al 1998 (KeX). F-Score: 0.97 (Protein). Evaluation corpus: 30 abstracts on SH3 protein.

• Hanisch et al 2003. F-Score: 0.92 (Protein). Evaluation corpus: SWISSPROT annotations on Transpath database.

• Nobata et al 1999. F-Score: 0.15 (DNA), 0.72 (Protein). Evaluation corpus: 100 MEDLINE abstracts.

• Eriksson et al 2002 (Yapex). F-Score: 0.64 (Protein). Evaluation corpus: 99 MEDLINE abstracts.

• Collier et al 2000. F-Score: 0.76 (Protein), 0.03 (RNA). Evaluation corpus: 100 MEDLINE abstracts.

• Kazama et al 2002. F-Score: 0.56 (24 classes). Evaluation corpus: GENIA corpus.

• Yamamoto et al 2003. F-Score: 0.70 (Protein Molecule). Evaluation corpus: GENIA corpus.

Page 8

BIOCREATIVE Motivations

• Seeking to be the MUC of the biomedical information extraction field

Page 9

The BIOCREATIVE NER Task

• Given a single sentence from an abstract, identify all mentions of genes

• “(or proteins where there is ambiguity)”

• In November the task was changed to identifying all mentions of genes and proteins (without distinguishing between them)

Page 10

The BIOCREATIVE NER Data

Data Set      Sentences   Words     Genes
Training      7500        200,000   9000
Development   2500        70,000    3000
Evaluation    5000        130,000   6000

Data consisted of MEDLINE abstracts annotated for the single NE “GENE”

Page 11

The BIOCREATIVE NER Evaluation Method

• Only exact matches to the gold standard (which includes alternate correct boundaries for several cases) are counted as correct.

• Genes detected with incorrect boundaries are doubly penalized as false negatives and false positives.

Gold: chloramphenicol acetyl transferase reporter gene (FN)

System: transferase reporter gene (FP)
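The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the official BIOCREATIVE scorer: the span representation (token offsets as `(start, end)` pairs), the function names `evaluate` and `f_score`, and the way alternate boundaries are stored as a set per gold mention are all assumptions made for the example.

```python
def evaluate(gold, predicted):
    """Exact-match scoring for one sentence.

    gold: list of sets; each set holds the acceptable (start, end) token
          spans for one gold mention (primary span plus any alternates).
    predicted: list of (start, end) token spans from the tagger.
    Returns (true_positives, false_positives, false_negatives).
    """
    matched = set()  # indices of gold mentions already credited
    tp = fp = 0
    for span in predicted:
        hit = next(
            (i for i, alternates in enumerate(gold)
             if i not in matched and span in alternates),
            None,
        )
        if hit is None:
            fp += 1  # no exact match: counted as a false positive
        else:
            matched.add(hit)
            tp += 1
    fn = len(gold) - len(matched)  # gold mentions never matched exactly
    return tp, fp, fn


def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "chloramphenicol acetyl transferase reporter gene" is tokens 0..4;
# the system found only "transferase reporter gene" (tokens 2..4).
gold = [{(0, 5)}]
predicted = [(2, 5)]
print(evaluate(gold, predicted))  # (0, 1, 1): one FP and one FN
```

A single boundary mistake therefore lowers both precision and recall, which is why boundary detection matters so much under this metric.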

Page 12

Outline

• Background for BIOCREATIVE and biomedical information extraction

• BIOCREATIVE NER Task

• Stanford-Edinburgh System

• Problems

Page 13

Baseline System

• Maximum Entropy Tagger in Java

• Based on Klein et al (2003) CoNLL submission

• Baseline Performance: Precision 0.79, Recall 0.74, F-Score 0.76

• Efforts were mostly in trying different features, including different POS taggers, NP-chunking, Parsing, Gazetteers, Web, Abbreviations, Word Shapes, Tokenization…

Page 14

Feature Set

Word Features: w_i, w_i-1, w_i+1, last “real” word, next “real” word, any of the 4 previous words, any of the 4 next words (all times, e.g. Monday, April, are mapped to lower case)

Bigrams: w_i + w_i-1, w_i + w_i+1

TnT POS (trained on GENIA POS): POS_i, POS_i-1, POS_i+1

Character Substrings: up to a length of 6

Abbreviations: abbr_i, abbr_i-1 + abbr_i, abbr_i + abbr_i+1, abbr_i-1 + abbr_i + abbr_i+1

Word + POS: w_i + POS_i, w_i-1 + POS_i, w_i+1 + POS_i

Word Shape: shape_i, shape_i-1, shape_i+1, shape_i-1 + shape_i, shape_i + shape_i+1, shape_i-1 + shape_i + shape_i+1

Word Shape + Word: w_i-1 + shape_i, w_i+1 + shape_i

Previous NE: NE_i-1, NE_i-2 + NE_i-1, NE_i-1 + w_i

Previous NE + POS: NE_i-1 + POS_i-1 + POS_i, NE_i-2 + NE_i-1 + POS_i-2 + POS_i-1 + POS_i

Previous NE + Word Shape: NE_i-1 + shape_i, NE_i-1 + shape_i+1, NE_i-1 + shape_i-1 + shape_i, NE_i-2 + NE_i-1 + shape_i-2 + shape_i-1 + shape_i

Parentheses: Paren-Matching, a feature that signals when one parenthesis in a pair has been assigned a different tag than the other in a window of 4 words
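To make a few of these feature templates concrete, here is a small Python sketch covering the word, bigram, word-shape and character-substring features. It is illustrative only: the actual system was written in Java, and the shape mapping (upper case to `X`, lower case to `x`, digits to `d`, repeated symbols collapsed), the `<PAD>` token and the feature-string format are assumptions, not details taken from the slides.

```python
def word_shape(token):
    """Map a token to a coarse shape, collapsing repeated symbols.

    Assumed mapping: upper case -> X, lower case -> x, digit -> d,
    anything else is kept as-is.
    """
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


def char_substrings(token, max_len=6):
    """All character substrings of the token up to max_len characters."""
    return {
        token[start:start + n]
        for n in range(1, max_len + 1)
        for start in range(len(token) - n + 1)
    }


def extract_features(tokens, i):
    """Build a set of string-valued features for the token at position i."""
    def at(j):
        return tokens[j] if 0 <= j < len(tokens) else "<PAD>"

    w, prev, nxt = at(i), at(i - 1), at(i + 1)
    feats = {
        f"w_i={w}", f"w_i-1={prev}", f"w_i+1={nxt}",
        f"w_i+w_i-1={w}|{prev}", f"w_i+w_i+1={w}|{nxt}",
        f"shape_i={word_shape(w)}",
        f"shape_i-1+shape_i={word_shape(prev)}|{word_shape(w)}",
        f"shape_i+shape_i+1={word_shape(w)}|{word_shape(nxt)}",
        f"w_i-1+shape_i={prev}|{word_shape(w)}",
        f"w_i+1+shape_i={nxt}|{word_shape(w)}",
    }
    feats.update(f"sub={s}" for s in char_substrings(w))
    # "Any of the 4 previous/next words": position within the window is ignored.
    feats.update(f"prev4={tokens[j]}" for j in range(max(0, i - 4), i))
    feats.update(f"next4={tokens[j]}"
                 for j in range(i + 1, min(len(tokens), i + 5)))
    return feats


tokens = "the wild type rat LH / CG receptor ( rLHR )".split()
print(word_shape("rLHR-t653"))  # xX-xd
print(sorted(extract_features(tokens, 4))[:5])
```

Features of this kind are binary indicators, so the large feature count reported for the final system is plausible once every distinct word, substring and shape combination gets its own indicator.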

Page 15

Features – External

Gazetteers: 1,731,581 entries, adapted from LocusLink, Gene Ontology and BIOCREATIVE data

ABGENE: A transformation-based NE tagger based on gazetteers and pattern matching

GENIA: Biomedical corpus using a different tag set consisting of 37 Named Entities

Web Test: Initial tagger output submitted to the Web in patterns such as “X gene”

Page 16

Postprocessing

• Discarded results with mismatched parentheses

• Different boundaries were detected when searching the sentence forwards versus backwards

• Unioned the results of both; in cases where boundary disagreements meant that one detected gene was contained in the other, we kept the shorter gene
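The three postprocessing steps can be sketched as follows. This is a hypothetical reconstruction in Python, not the system's code: spans are assumed to be `(start, end)` token offsets, the function names are invented for the example, and the order of operations (parenthesis filter first, then union) is an assumption based on the order of the bullets.

```python
def balanced_parens(text):
    """True if every parenthesis in the text has a matching partner."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def merge_runs(forward, backward):
    """Union two sets of spans; where one span contains another, keep the shorter."""
    union = set(forward) | set(backward)

    def contains(outer, inner):
        return (outer != inner
                and outer[0] <= inner[0]
                and inner[1] <= outer[1])

    return sorted(s for s in union
                  if not any(contains(s, other) for other in union))


def postprocess(tokens, forward, backward):
    """Drop spans with mismatched parentheses, then merge both runs."""
    def keep(span):
        start, end = span
        return balanced_parens(" ".join(tokens[start:end]))

    return merge_runs(filter(keep, forward), filter(keep, backward))


tokens = "estrogen receptor ligand binds ( ER".split()
forward = [(0, 3), (4, 6)]   # "estrogen receptor ligand", "( ER"
backward = [(0, 2)]          # "estrogen receptor"
print(postprocess(tokens, forward, backward))  # [(0, 2)]
```

Preferring the shorter span is a design choice that fits the exact-match metric when the tagger tends to over-extend mentions; it would hurt if the typical error were truncation instead.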

Page 17

Final System and Results

                          Precision   Recall   F-Score
Closed                    0.791       0.854    0.821
Open                      0.828       0.836    0.832
Preliminary Best-Closed   0.855       0.854    0.825
Preliminary Best-Open     0.863       0.836    0.832

• Trained on training+development data (10,000 sentences)

• 1,247,775 features

Page 18

Outline

• Background for BIOCREATIVE and biomedical information extraction

• BIOCREATIVE NER Task

• Stanford-Edinburgh System

• Problems

Page 19

Performance Discrepancy

C&C           Precision   Recall   F-Score
CoNLL-2003    84.3        85.5     84.9
BIOCREATIVE   77.6        75.9     76.8

Klein et al   Precision   Recall   F-Score
CoNLL-2003    86.1        86.5     86.3
BIOCREATIVE   78.8        73.5     76.1

Page 20

Gene Entity Pitfalls

• Language is complex

Stably transfected human kidney 293 cells expressing the wild type rat LH / CG receptor ( rLHR ) or receptors with C-terminal tails truncated at residues 653 , 631 , or 628 (designated rLHR-t653 , rLHR-t631 , and rLHR-t628 ) were used to probe the importance of this region on the regulation of hormonal responsiveness.

• Gene names are frequently uncapitalized

The chick axon-associated surface glycoprotein neurofascin is implicated in axonal growth and fasciculation as revealed by antibody perturbation experiments .

• Looking “weird” is not indicative of a gene name

A newly synthesized anti-inflammatory agent , Y-8004 demonstrated a greater inhibition than did indomethacin ( IM ) . on inflammatory response such as ultraviolet erythema in guinea pigs , carrageenin edema , evans blue and carrageenin-induced pleuritis and acetic acid-induced peritonitis in rats .

Page 21

Boundary Problems

• Gene names can be long and complex

• 37% of our false positives and 39% of false negatives were boundary problems

• Gold: chloramphenicol acetyl transferase reporter gene

chloramphenicol acetyl transferase reporter gene deletion

Gold: estrogen receptor

estrogen receptor ligand
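One way such boundary-error percentages could be computed is sketched below in Python. The slides do not define "boundary problem" precisely, so this assumes an error counts as a boundary problem when the span overlaps a span on the other side without matching it exactly; it also assumes the unlabelled second line of each pair above is the system output. The function names and the `(start, end)` token-offset representation are hypothetical.

```python
def overlaps(a, b):
    """True if two half-open (start, end) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def boundary_errors(gold, predicted):
    """Split errors into all FPs/FNs and the subset that are boundary errors.

    A false positive is a boundary error if it overlaps some gold span;
    a false negative is a boundary error if it overlaps some predicted span.
    Returns (fps, fns, fp_boundary, fn_boundary) as sets of spans.
    """
    gold, predicted = set(gold), set(predicted)
    fps = predicted - gold
    fns = gold - predicted
    fp_boundary = {p for p in fps if any(overlaps(p, g) for g in gold)}
    fn_boundary = {g for g in fns if any(overlaps(g, p) for p in predicted)}
    return fps, fns, fp_boundary, fn_boundary


# Gold "chloramphenicol acetyl transferase reporter gene" = tokens 0..4;
# the system added "deletion" (tokens 0..5) and also invented a span,
# while missing an unrelated gold mention at tokens 8..9.
gold = [(0, 5), (8, 10)]
predicted = [(0, 6), (12, 13)]
fps, fns, fp_b, fn_b = boundary_errors(gold, predicted)
print(len(fp_b) / len(fps), len(fn_b) / len(fns))  # 0.5 0.5
```

Running this over a whole evaluation set and dividing the boundary counts by the total false positives and false negatives would give figures comparable in kind to the percentages quoted above.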

Page 22

Interannotator Agreement

• MUC-7 interannotator agreement was measured at 97% F-Score

• Demetriou and Gaizauskas:

Interannotator agreement for biomedical terms at 89% F-Score

• Hirschman measured interannotator agreement for gene names at 87% F-Score