Analysing Entity Type Variation across Biomedical Subdomains
Post on 22-Apr-2015
Claudiu Mihăilă, Riza Theresa Batista-Navarro, Sophia Ananiadou
Claudiu Mihăilă
Analysing Entity Type Variation across Biomedical Subdomains
National Centre for Text Mining
School of Computer Science
University of Manchester
26 May 2012
BioTxtM 2012
Introduction
• Named entities
o Atomic elements, classified into various categories (protein, gene, disease, treatment, metabolite etc.)
[Figure: annotation labels over the example sentence below: Organism, Pro, +Reg, Transcription, Theme]
In contrast to the phenotype of the pta ackA double mutant, pbgP transcription was reduced in the pmrD mutant.
Introduction
• Corpora
Methodology
• Full-text open-access journal articles from UKPMC
• 20 subdomains, 400 single broad-subject-termed articles
o Allergy & Immunology
o Biology
o Cell Biology
o Communicable Diseases
o Critical Care
o Environmental Health
o Genetics
o Health Services Research
o Medical Informatics
o Medicine
o Microbiology
o Neoplasms
o Neurology
o Pharmacology
o Physiology
o Public Health
o Pulmonary Medicine
o Rheumatology
o Tropical Medicine
o Virology
Methodology
• NE source: $A_{Silver} = A_{UKPMC} \cup A_{Oscar} \cup A_{NeMine}$
[Diagram: articles from each subdomain (for example Critical Care, Medicine, Physiology, Virology) are annotated by UKPMC, NeMine and OSCAR to produce the corpus annotation]
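The combination of the three named-entity sources can be sketched in a few lines. The operator joining the sources is not legible in this transcript, so the sketch assumes a plain set union; the `Annotation` tuple, the `silver_annotations` helper and the sample offsets are all hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation record: character offsets plus an entity type."""
    start: int
    end: int
    ne_type: str


def silver_annotations(*sources):
    """Merge annotation lists by set union (an assumption, see lead-in).

    Annotations that are identical in offsets and type collapse into one.
    """
    merged = set()
    for source in sources:
        merged |= set(source)
    return sorted(merged)


# Made-up sample annotations for one document.
ukpmc = [Annotation(0, 4, "Gene|Protein")]
oscar = [Annotation(10, 17, "Chemical molecule")]
nemine = [Annotation(0, 4, "Gene|Protein"), Annotation(20, 27, "Disease")]

silver = silver_annotations(ukpmc, oscar, nemine)
for annotation in silver:
    print(annotation)
```

A union keeps every annotation proposed by any tool, which favours recall; overlapping spans with different types would need an extra resolution step that this sketch does not attempt.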
Methodology
NeMine
Gene
Protein
Disease
Drug
Metabolite
Bacteria
Diagnostic process
General phenomenon
Indicator
Natural phenomenon
Organ
Pathologic function
Symptom
Therapeutic process
OSCAR
Chemical molecule
Chemical adjective
Enzyme
Reaction
UKPMC
Gene
Protein
Disease
Drug
Metabolite
Gene|Protein
Silver Annotation
Methodology
• Feature vectors
Document d
Entity type         Count   Relative frequency
Enzyme                  2    0.41%
Chemical molecule      71   14.85%
Disease                 8    1.67%
Drug                   12    2.51%
Gene                   15    3.13%
Gene|Protein          155   32.42%
Metabolite              3    0.62%
Protein               188   39.33%
Reaction               24    5.02%
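The step from raw counts to a feature vector can be sketched as follows, using the counts listed for document d. The function name `relative_frequencies` is illustrative, and the sketch assumes each feature is simply the type's count divided by the document's total number of entity mentions.

```python
# Entity-type counts for document d, as listed on the slide.
counts = {
    "Enzyme": 2,
    "Chemical molecule": 71,
    "Disease": 8,
    "Drug": 12,
    "Gene": 15,
    "Gene|Protein": 155,
    "Metabolite": 3,
    "Protein": 188,
    "Reaction": 24,
}


def relative_frequencies(type_counts):
    """Divide each count by the total so the features sum to 1."""
    total = sum(type_counts.values())
    if total == 0:
        # A document without entities gets an all-zero vector.
        return {ne_type: 0.0 for ne_type in type_counts}
    return {ne_type: n / total for ne_type, n in type_counts.items()}


freqs = relative_frequencies(counts)
for ne_type, share in freqs.items():
    print(f"{ne_type:<18} {share:7.2%}")
```

Normalising by the document total makes long and short articles comparable, since only the mix of entity types matters and not the article length.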
Methodology
• Chi-squared statistics
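The slide names the chi-squared statistic without giving a formula. As a reminder of what it measures, here is a minimal sketch of the standard Pearson statistic on a contingency table; the 2×2 layout (one entity type versus all other types, in two subdomains) and the numbers are assumptions made for illustration, not the authors' exact computation.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        return 0.0
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two subdomains, columns are
# "mentions of one entity type" and "mentions of all other types".
table = [[10, 20],
         [30, 40]]
print(chi_squared(table))
```

A value near zero means the entity type is distributed similarly in both subdomains; larger values indicate that the type helps tell the two apart.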
Methodology
• Frobenius norm
1247.0725
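For reference, the Frobenius norm is the square root of the sum of squared entries; for a single vector it coincides with the Euclidean norm. The sketch below uses made-up input and does not try to reproduce the value shown on the slide, whose inputs are not given in this transcript.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a matrix (list of rows)."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


# A vector is treated as a one-row matrix.
print(frobenius_norm([[3.0, 4.0]]))
print(frobenius_norm([[1.0, 2.0], [2.0, 4.0]]))
```

Collapsing a whole vector of per-feature scores into one number gives a single distance-like value per subdomain pair, which is convenient for comparing pairs.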
Feature evaluation
Frobenius norm of χ² vectors for each pair.
• Good features for
o Cell Biology
o Pharmacology
o Health Sciences
o Public Health
• Not-so-good features for
o Medical Informatics
o Medicine
o Microbiology
o Neoplasms
o Neurology
Feature evaluation
• Mean Chi-Squared for every feature over all pairs
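Averaging a per-feature score over all subdomain pairs can be sketched as below. The three subdomains and the chi-squared values are made up for illustration, and `mean_per_feature` is a hypothetical helper; with the full set of 20 subdomains, `itertools.combinations` yields 190 unordered pairs.

```python
from collections import defaultdict
from itertools import combinations


def mean_per_feature(scores_by_pair):
    """Average each feature's score over every pair in which it appears."""
    totals = defaultdict(float)
    seen = defaultdict(int)
    for scores in scores_by_pair.values():
        for feature, value in scores.items():
            totals[feature] += value
            seen[feature] += 1
    return {feature: totals[feature] / seen[feature] for feature in totals}


subdomains = ["Biology", "Genetics", "Virology"]
pairs = list(combinations(subdomains, 2))

# Made-up chi-squared values, one dictionary per subdomain pair.
chi2_by_pair = {
    ("Biology", "Genetics"): {"Gene": 4.0, "Drug": 1.0},
    ("Biology", "Virology"): {"Gene": 2.0, "Drug": 3.0},
    ("Genetics", "Virology"): {"Gene": 6.0, "Drug": 2.0},
}

means = mean_per_feature(chi2_by_pair)
print(pairs)
print(means)
```

Ranking features by this mean shows which entity types are informative across the board rather than for one particular pair.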
Classifier selection
Random Forest F-score for each pair.
Classifier                  Top result count   Share of pairs
J48                          0                  0%
JRip                         4                  2.10%
Logistic                     2                  1.05%
Random Tree                  0                  0%
Random Forest               86                 45.26%
SMO                          0                  0%
AdaBoost + J48               6                  3.15%
AdaBoost + JRip              7                  3.68%
AdaBoost + Decision Stump   16                  8.42%
AdaBoost + Logistic          0                  0%
AdaBoost + Random Tree       0                  0%
AdaBoost + Random Forest    68                 35.78%
AdaBoost + SMO               1                  0.52%
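The counts in this table sum to 190, which is the number of unordered pairs of 20 subdomains. A tally of this kind could be produced roughly as follows; the F-scores below are invented, the helper names are hypothetical, and ties are broken by whichever classifier is listed first, which may differ from what the authors did.

```python
from collections import Counter


def top_result_counts(f_scores_by_pair):
    """Count, per classifier, the pairs on which it has the best F-score."""
    winners = Counter()
    for scores in f_scores_by_pair.values():
        best = max(scores, key=scores.get)  # first maximum wins on ties
        winners[best] += 1
    return winners


def shares(winners):
    """Convert winner counts into fractions of all pairs."""
    total = sum(winners.values())
    return {name: count / total for name, count in winners.items()}


# Invented F-scores for three subdomain pairs.
f_scores_by_pair = {
    ("Biology", "Genetics"): {"Random Forest": 0.91, "JRip": 0.88, "SMO": 0.80},
    ("Biology", "Virology"): {"Random Forest": 0.93, "JRip": 0.95, "SMO": 0.79},
    ("Genetics", "Virology"): {"Random Forest": 0.97, "JRip": 0.90, "SMO": 0.85},
}

winners = top_result_counts(f_scores_by_pair)
print(winners)
print(shares(winners))
```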
Classifier evaluation
Random Forest F-score for each pair.
• Dissimilar subdomains
o Cell Biology
o Pharmacology
o Health Sciences
o Public Health
• Similar subdomains
o Medical Informatics
o Medicine
o Microbiology
o Neoplasms
o Neurology
Conclusions
• To remember
o Significant semantic variation of biomedical sublanguages
o Distinguishable bio-subdomains using only NE types
o Caution needed when adapting NLP tools to subdomains
• To do
o Extension to bio-events
o Combination with lexical, syntactical, discourse features
o Extension to other domains
Thank you!
http://misteringo.deviantart.com/art/Bunnies-Scream-Again-79745974