biomedical text mining: automatic processing of unstructured text
Lars Juhl Jensen
Biomedical text mining: Automatic processing of unstructured text
>10 km
1 paper / 40 seconds
patent literature
grant proposals
FDA product labels
electronic medical records
too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
comprehensive dictionary
genes/proteins
cyclin dependent kinase 1
CDC2
chemical compounds
diseases
adverse drug reactions
cellular components
tissues
organisms
environments
orthographic variation
flexible matching
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
expansion rules
prefixes and suffixes
CDC2
hCdc2
plural/adjective forms
mitochondrion
mitochondria
mitochondrial
abbreviated forms
Saccharomyces cerevisiae
S. cerevisiae
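The expansion rules listed above can be sketched as a small variant generator. This is only an illustration built from the slide examples: the function name `expand_name`, the `kind` labels, and the three rules themselves are assumptions, not the rules used by any actual tagger.

```python
def expand_name(name, kind):
    """Return surface variants of a dictionary name (illustrative rules only)."""
    variants = {name}
    if kind == "gene":
        # Prefix rule (assumed): lowercase species letter + capitalised symbol,
        # e.g. CDC2 -> hCdc2 for the human protein.
        variants.add("h" + name.capitalize())
    elif kind == "organism":
        # Abbreviated form: genus initial plus species epithet.
        genus, _, species = name.partition(" ")
        if species:
            variants.add(f"{genus[0]}. {species}")
    elif kind == "component":
        # Plural/adjective rule (assumed) for Greek-derived nouns ending in -on.
        if name.endswith("on"):
            stem = name[:-2]
            variants.add(stem + "a")   # plural form
            variants.add(stem + "al")  # adjective form
    return variants


print(expand_name("CDC2", "gene"))
print(expand_name("Saccharomyces cerevisiae", "organism"))
print(expand_name("mitochondrion", "component"))
```

A real dictionary would need many more rules and exceptions; the point is only that variants are generated from the dictionary rather than curated one by one.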
“black list”
SDS
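A minimal sketch of dictionary-based tagging with flexible matching and a black list, assuming a toy three-entry dictionary. The names `DICTIONARY`, `BLACK_LIST`, `name_to_pattern`, and `tag` are hypothetical, and the identifiers are placeholders rather than real database accessions.

```python
import re

# Toy dictionary: surface name -> identifier (placeholder identifiers).
DICTIONARY = {
    "cyclin dependent kinase 1": "CDK1",
    "CDC2": "CDK1",
    "SDS": "SDS",
}

# Names that are too ambiguous to trust, e.g. SDS is usually the detergent.
BLACK_LIST = {"SDS"}


def name_to_pattern(name):
    """Compile a name so that spaces and hyphens are interchangeable."""
    tokens = re.split(r"[\s-]+", name)
    body = r"[\s-]+".join(re.escape(token) for token in tokens)
    # Lookarounds keep the match from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)" + body + r"(?!\w)", re.IGNORECASE)


def tag(text):
    """Return sorted (start, end, matched text, identifier) tuples."""
    hits = []
    for name, identifier in DICTIONARY.items():
        if name in BLACK_LIST:
            continue  # skip black-listed names entirely
        for match in name_to_pattern(name).finditer(text):
            hits.append((match.start(), match.end(), match.group(0), identifier))
    return sorted(hits)


text = "SDS gels showed that cyclin-dependent kinase 1 (CDC2) was present."
for hit in tag(text):
    print(hit)
```

Compiling one pattern per name is fine for a sketch but would not scale to a comprehensive dictionary, where a hash- or trie-based lookup over tokens is the more usual choice.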
use cases
assess studiedness
TIN-X
Cannon et al., Bioinformatics, 2017 (newdrugtargets.org)
interactive annotation
EXTRACT
Pafilis et al., Database, 2016 (extract.hcmr.gr)
implicit relations
Encyclopedia of Life
habitats
Pafilis et al., Bioinformatics, 2016 (environments.hcmr.gr / eol.org)
SIDER
adverse drug reactions
Kuhn et al., Nucleic Acids Research, 2016 (sideeffects.embl.de)
relation extraction
two approaches
natural language processing
part-of-speech tagging
what you learned in school
pronoun pronoun verb preposition noun
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
Saric et al., Proceedings of ACL, 2004
manually crafted rules
machine learning
manually annotated corpus
association type
direction
high precision
poor recall
manual work
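To illustrate why manually crafted rules give typed, directed relations with high precision but poor recall, here is a deliberately simple sketch. It uses a single regular expression as a stand-in for real part-of-speech tagging and sentence parsing; the pattern, the crude gene-symbol heuristic, and the function name `extract_regulation` are all assumptions made for this example.

```python
import re

# One hand-written rule: "expression of <targets> is <verb> by <regulator>".
RULE = re.compile(
    r"expression of (?P<targets>.+?) is "
    r"(?P<verb>controlled|regulated|activated|repressed) by (?P<regulator>\w+)"
)

# Crude stand-in for named entity recognition: capital letters followed by digits.
GENE = re.compile(r"\b[A-Z]{2,}[0-9]+\b")


def extract_regulation(sentence):
    """Return (regulator, verb, target) triples matched by the single rule."""
    match = RULE.search(sentence)
    if match is None:
        return []
    targets = GENE.findall(match["targets"])
    return [(match["regulator"], match["verb"], target) for target in targets]


sentence = ("The expression of the cytochrome genes CYC1 and CYC7 "
            "is controlled by HAP1")
print(extract_regulation(sentence))       # typed and directed relations
print(extract_regulation("HAP1 controls CYC1"))  # same fact, rule does not fire
```

The second call returns nothing even though the sentence states the same relation, which is the recall problem: every new phrasing needs another rule.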
co-mentioning
counting
within documents
within paragraphs
within sentences
scoring scheme
weighted counts
normalization
easy
high recall
high precision
undirected associations
unknown type
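The co-mentioning approach can be sketched as weighted counting followed by normalization. The weights, the exponent `ALPHA`, and the observed-over-expected normalization below are assumptions chosen for illustration, not values taken from the slides. The corpus layout (document = list of paragraphs, paragraph = list of sentences, sentence = set of entity names) is also hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical weights for co-occurrence at each level of granularity.
W_DOC, W_PAR, W_SENT = 1.0, 2.0, 0.2
ALPHA = 0.6  # assumed balance between raw count and observed/expected ratio


def weighted_counts(corpus):
    """Sum a weight per document for every pair of co-mentioned entities."""
    counts = defaultdict(float)
    for doc in corpus:
        doc_entities = set().union(*(sent for par in doc for sent in par))
        for a, b in combinations(sorted(doc_entities), 2):
            weight = W_DOC  # the pair shares this document
            if any({a, b} <= set().union(*par) for par in doc):
                weight += W_PAR  # the pair also shares a paragraph
            if any({a, b} <= sent for par in doc for sent in par):
                weight += W_SENT  # the pair also shares a sentence
            counts[(a, b)] += weight
    return counts


def scores(counts):
    """Normalize each count by what the entities' overall frequencies predict."""
    total = sum(counts.values())
    marginal = defaultdict(float)
    for (a, b), count in counts.items():
        marginal[a] += count
        marginal[b] += count
    return {
        (a, b): count ** ALPHA
        * (count * total / (marginal[a] * marginal[b])) ** (1 - ALPHA)
        for (a, b), count in counts.items()
    }


corpus = [
    [[{"CDK1", "cancer"}]],            # same sentence
    [[{"CDK1"}, {"cancer"}]],          # same paragraph, different sentences
    [[{"CDK1"}], [{"TP53"}]],          # same document only
]
counts = weighted_counts(corpus)
for pair, score in sorted(scores(counts).items()):
    print(pair, round(score, 3))
```

Note that the output pairs are unordered and carry no relation type, which matches the limitations listed above: the score only says that two entities are mentioned together more than expected.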
use cases
natural language processing
transcription factor targets
kinase substrates
protein–protein interactions
co-mentioning
drug targets
protein function
subcellular localization
Binder et al., Database, 2014 (compartments.jensenlab.org)
tissue expression
Santos et al., PeerJ, 2015 (tissues.jensenlab.org)
disease genes
Frankild et al., Methods, 2015 (diseases.jensenlab.org)
disease mutations