TRANSCRIPT
Linguistic techniques for Text Mining
NaCTeM team
www.nactem.ac.uk
Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki, Yoshimasa Tsuruoka
2
Natural Language Processing
[Diagram: raw (unstructured) text is converted into annotated (structured) text
through part-of-speech tagging, named entity recognition, and deep syntactic
parsing, supported by a lexicon and an ontology.]
[Annotated example: "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells."
POS tags: NN IN NN VBZ VBN IN NN IN JJ NN NNS .
Named entities: TNF = protein_molecule, BHA = organic_compound, U937 cells = cell_line
Syntactic structure (S, NP, VP, PP nodes) and the extracted event: negative regulation.]
3
Basic Steps of Natural Language Processing
• Sentence splitting
• Tokenization
• Part-of-speech tagging
• Shallow parsing
• Named entity recognition
• Syntactic parsing
• (Semantic Role Labeling)
4
Sentence splitting
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
5
A heuristic rule for sentence splitting
sentence boundary = period + space(s) + capital letter
Regular expression in Perl:
s/\. +([A-Z])/.\n$1/g;
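The same heuristic rule can be mirrored in Python; a minimal sketch (the function name and example text below are ours, not from the slides):

```python
import re

def split_sentences(text):
    """Heuristic splitter: a period followed by space(s) and a capital
    letter is treated as a sentence boundary, mirroring the Perl rule."""
    # Insert a newline between the period and the capital letter,
    # then split on newlines.
    marked = re.sub(r'\. +([A-Z])', r'.\n\1', text)
    return marked.split('\n')

for s in split_sentences("Current protocols reduce Th1 cytokines. "
                         "However, cytokine production is important."):
    print(s)
```

Note that, exactly as on the "Errors" slide, this rule also splits after abbreviations such as "e.g." when a capitalized token follows.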
6
Errors
• Two solutions:
  – Add more rules to handle exceptions
  – Machine learning
IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).

Wrong output of the heuristic rule:
IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).
7
Tools for sentence splitting
• JASMINE
  – Rule-based
  – http://uvdb3.hgc.jp/ALICE/program_download.html
• Scott Piao’s splitter
  – Rule-based
  – http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
• OpenNLP
  – Maximum-entropy learning
  – https://sourceforge.net/projects/opennlp/
  – Needs training data
8
Tokenization
• Convert a sentence into a sequence of tokens
• Why do we tokenize?
• Because we do not want to treat a sentence as a sequence of characters!

The protein is activated by IL2.
The protein is activated by IL2 .
9
Tokenization
• Tokenizing general English sentences is relatively straightforward.
• Use spaces as the boundaries
• Use some heuristics to handle exceptions

The protein is activated by IL2.
The protein is activated by IL2 .
10
Tokenisation issues
• Separate possessive endings or abbreviated forms from preceding words:
  – Mary’s → Mary ‘s
  – Mary’s → Mary is
  – Mary’s → Mary has
• Separate punctuation marks and quotes from words:
  – Mary. → Mary .
  – “new” → “ new “
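These two rules can be sketched as a toy tokenizer; this is a minimal illustration (the rules and function are ours), not a replacement for a full tokenizer such as tokenizer.sed:

```python
import re

def tokenize(sentence):
    """Toy rule-based tokenizer: split on whitespace, then detach
    possessive/contracted endings ('s) and trailing punctuation."""
    tokens = []
    for word in sentence.split():
        # Detach a possessive or contracted ending ('s).
        m = re.match(r"(.+)('s)$", word)
        if m:
            tokens.extend([m.group(1), m.group(2)])
            continue
        # Detach trailing punctuation such as '.' or ','.
        m = re.match(r"(.+?)([.,!?]+)$", word)
        if m:
            tokens.extend([m.group(1), m.group(2)])
        else:
            tokens.append(word)
    return tokens

print(tokenize("The protein is activated by IL2."))
```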
11
Tokenization
• Tokenizer.sed: a simple script in sed
  – http://www.cis.upenn.edu/~treebank/tokenization.html
• Undesirable tokenization
  – original: “1,25(OH)2D3”
  – tokenized: “1 , 25 ( OH ) 2D3”
• Tokenization for biomedical text
  – Not straightforward
  – Needs dictionary? Machine learning?
12
Tokenisation problems in Bio-text
• Commas
  – 2,6-diaminohexanoic acid
  – tricyclo(3.3.1.13,7)decanone
• Four kinds of hyphens
  – “Syntactic”: Calcium-dependent, Hsp-60
  – Knocked-out gene: lush-- flies
  – Negation: -fever
  – Electric charge: Cl-
K. Cohen NAACL-2007
13
Tokenisation
• Tokenization: divides the text into the smallest units (usually words), removing punctuation. Challenge: what should be done with punctuation that has linguistic meaning?
• Negative charge (Cl-)
• Absence of symptom (-fever)
• Knocked-out gene (Ski-/-)
• Gene name (IL-2 –mediated)
• Plus, “syntactic” uses (insulin-dependent)
K. Cohen NAACL-2007
14
Part-of-speech tagging
• Assign a part-of-speech tag to each token in a sentence.
The peri-kappa B site mediates human immunodeficiency
DT  NN         NN NN  VBZ      JJ    NN
virus type 2 enhancer activation in monocytes …
NN    NN   CD NN       NN        IN NNS
15
Part-of-speech tags
• The Penn Treebank tagset– http://www.cis.upenn.edu/~treebank/– 45 tags
NN   Noun, singular or mass
NNS  Noun, plural
NNP  Proper noun, singular
NNPS Proper noun, plural
VB   Verb, base form
VBD  Verb, past tense
VBG  Verb, gerund or present participle
VBN  Verb, past participle
VBZ  Verb, 3rd person singular present
JJ   Adjective
JJR  Adjective, comparative
JJS  Adjective, superlative
DT   Determiner
CD   Cardinal number
CC   Coordinating conjunction
IN   Preposition or subordinating conjunction
FW   Foreign word
 :    :
16
Part-of-speech tagging is not easy
• Parts-of-speech are often ambiguous
• We need to look at the context
• But how?

I have to go to school.   (go = verb)
I had a go at skiing.     (go = noun)
17
Writing rules for part-of-speech tagging
• If the previous word is “to”, then it’s a verb.
• If the previous word is “a”, then it’s a noun.
• If the next word is … :

Writing rules manually is impossible

I have to go to school.   (go = verb)
I had a go at skiing.     (go = noun)
18
Learning from examples
The involvement of ion channels in B and T lymphocyte activation is
DT  NN          IN NN  NNS      IN NN CC NN NN        NN         VBZ
supported by many reports of changes in ion fluxes and membrane
VBN       IN JJ   NNS     IN NNS     IN NN  NNS    CC  NN
Machine Learning Algorithm
training
Unseen text:
We demonstrate that …

Tagged:
We  demonstrate that …
PRP VBP         IN
19
Part-of-speech tagging with Hidden Markov Models
The tag sequence probability, decomposed with the Markov assumption:

  P(t_1 … t_n | w_1 … w_n)
    ∝ P(w_1 … w_n | t_1 … t_n) P(t_1 … t_n)
    ≈ ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

(w: words, t: tags; P(w_i | t_i) is the output probability,
P(t_i | t_{i-1}) the transition probability)
20
First-order Hidden Markov Models
• Training
  – Estimate P(word_j | tag_x) and P(tag_y | tag_z)
  – Counting (+ smoothing)
• Using the tagger
  – Find the best tag sequence:

    argmax_{t_1 … t_n} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
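The argmax above is found with the Viterbi algorithm. A minimal sketch, assuming toy hand-set probabilities (the two-tag set and the `trans`/`emit` values below are invented for illustration, not a trained model):

```python
def viterbi(words, tags, trans, emit):
    """First-order HMM decoding: argmax over tag sequences of
    prod_i P(w_i|t_i) * P(t_i|t_{i-1}).  trans[(prev, t)] and
    emit[(t, w)] are probabilities estimated by counting;
    '<s>' is the start state."""
    # best[t] = (score of best path ending in tag t, that path)
    best = {t: (trans.get(('<s>', t), 0.0) * emit.get((t, words[0]), 0.0), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max(
                (best[p][0] * trans.get((p, t), 0.0), best[p][1])
                for p in tags)
            new[t] = (score * emit.get((t, w), 0.0), path + [t])
        best = new
    return max(best.values())[1]

tags = ['N', 'V']
trans = {('<s>', 'N'): 0.7, ('<s>', 'V'): 0.3,
         ('N', 'N'): 0.4, ('N', 'V'): 0.6,
         ('V', 'N'): 0.7, ('V', 'V'): 0.3}
emit = {('N', 'time'): 0.5, ('V', 'time'): 0.1,
        ('N', 'flies'): 0.4, ('V', 'flies'): 0.2}
print(viterbi(['time', 'flies'], tags, trans, emit))
```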
21
Machine learning using diverse features
• We want to use diverse types of information when predicting the tag.
He opened it → “opened” is a Verb

Many clues:
  The word is “opened”
  The suffix is “ed”
  The previous word is “He”
   :
22
Machine learning with log-linear models
p(y | x) = (1 / Z(x)) exp( Σ_i λ_i f_i(x, y) )

Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )

(f_i: feature function, λ_i: feature weight)
23
Machine learning with log-linear models
• Maximum likelihood estimation
  – Find the parameters that maximize the conditional log-likelihood of the training data:

    LL(Λ) = Σ_{x,y} ~p(x) ~p(y | x) log p(y | x)

• Gradient

    ∂LL(Λ)/∂λ_i = E_{~p}[f_i] − E_p[f_i]

(~p: empirical distribution of the training data)
24
Computing likelihood and model expectation
• Example
  – Two possible tags: “Noun” and “Verb”
  – Two types of features: “word” and “suffix”

He opened it   (correct tag for “opened”: Verb)

Active features for tag = noun:  word=opened & tag=noun,  suffix=ed & tag=noun
Active features for tag = verb:  word=opened & tag=verb,  suffix=ed & tag=verb

p(verb | x) = exp(λ_{word=opened, verb} + λ_{suffix=ed, verb}) / Z(x)
Z(x) = exp(λ_{word=opened, noun} + λ_{suffix=ed, noun})
     + exp(λ_{word=opened, verb} + λ_{suffix=ed, verb})
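This computation can be sketched directly; the feature names and weight values below are made up for illustration:

```python
import math

def p_tag(tag, weights, active):
    """p(tag|x) = exp(sum of weights of active features) / Z(x),
    following the log-linear model above; the two possible tags
    are 'noun' and 'verb' as in the example."""
    def score(t):
        return math.exp(sum(weights.get(f, 0.0) for f in active(t)))
    z = score('noun') + score('verb')   # Z(x): sum over both tags
    return score(tag) / z

# Invented weights: the suffix 'ed' and the word 'opened' favour verb.
weights = {('word=opened', 'verb'): 1.5, ('suffix=ed', 'verb'): 0.8,
           ('word=opened', 'noun'): 0.2, ('suffix=ed', 'noun'): -0.3}
active = lambda t: [('word=opened', t), ('suffix=ed', t)]
print(p_tag('verb', weights, active))
```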
25
Conditional Random Fields (CRFs)
• A single log-linear model on the whole sentence
• The number of classes is HUGE, so it is impossible to do the estimation in a naive way.
P(y_1 … y_n | x) = (1 / Z(x)) exp( Σ_{t=1}^{n} Σ_i λ_i f_i(y_{t-1}, y_t, x, t) )
26
Conditional Random Fields (CRFs)
• Solution
  – Let’s restrict the types of features
  – You can then use a dynamic programming algorithm that drastically reduces the amount of computation
• Features you can use (in first-order CRFs)
  – Features defined on the tag
  – Features defined on the adjacent pair of tags
27
Features
• Feature weights are associated with states and edges
[Lattice of Noun/Verb states over “He has opened it”.
State feature example: W0=He & Tag=Noun
Edge feature example: Tagleft=Noun & Tagright=Noun]
28
A naive way of calculating Z(x)
Noun Noun Noun Noun = 7.2
Noun Noun Noun Verb = 1.3
Noun Noun Verb Noun = 4.5
Noun Noun Verb Verb = 0.9
Noun Verb Noun Noun = 2.3
Noun Verb Noun Verb = 11.2
Noun Verb Verb Noun = 3.4
Noun Verb Verb Verb = 2.5
Verb Noun Noun Noun = 4.1
Verb Noun Noun Verb = 0.8
Verb Noun Verb Noun = 9.7
Verb Noun Verb Verb = 5.5
Verb Verb Noun Noun = 5.7
Verb Verb Noun Verb = 4.3
Verb Verb Verb Noun = 2.2
Verb Verb Verb Verb = 1.9

Sum = 67.5
29
Dynamic programming
• Results of intermediate computation can be reused.
[Lattice of Noun/Verb states over “He has opened it”; forward computation]
30
Dynamic programming
• Results of intermediate computation can be reused.
[Lattice of Noun/Verb states over “He has opened it”; backward computation]
31
Dynamic programming
• Computing marginal distribution
[Lattice of Noun/Verb states over “He has opened it”; the marginal probability of a state combines the forward and backward scores]
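The saving can be checked numerically: with toy state and edge scores (the values below are invented for illustration), naive enumeration over all tag sequences and a forward-style dynamic program give the same Z(x):

```python
import itertools

# Toy potentials over tags for a 4-word sentence ("He has opened it").
# state[i][t]: score of tag t at position i; edge[(a, b)]: score of
# the adjacent tag pair (a, b).
TAGS = ['Noun', 'Verb']
state = [{'Noun': 2.0, 'Verb': 1.0},
         {'Noun': 0.5, 'Verb': 3.0},
         {'Noun': 1.0, 'Verb': 2.0},
         {'Noun': 2.0, 'Verb': 0.5}]
edge = {('Noun', 'Noun'): 1.0, ('Noun', 'Verb'): 2.0,
        ('Verb', 'Noun'): 1.5, ('Verb', 'Verb'): 0.5}

def z_naive():
    """Enumerate all |TAGS|^n sequences, as on the 'naive' slide."""
    total = 0.0
    for seq in itertools.product(TAGS, repeat=len(state)):
        s = state[0][seq[0]]
        for i in range(1, len(seq)):
            s *= edge[(seq[i - 1], seq[i])] * state[i][seq[i]]
        total += s
    return total

def z_forward():
    """Forward algorithm: reuse intermediate sums, O(n * |TAGS|^2)."""
    alpha = dict(state[0])                     # alpha[t] at position 0
    for i in range(1, len(state)):
        alpha = {t: state[i][t] * sum(alpha[p] * edge[(p, t)] for p in TAGS)
                 for t in TAGS}
    return sum(alpha.values())

print(z_naive(), z_forward())   # the two totals agree
```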
32
Maximum entropy learning and Conditional Random Fields
• Maximum entropy learning
  – Log-linear modeling + MLE
  – Parameter estimation requires:
    • Likelihood of each sample
    • Model expectation of each feature
• Conditional Random Fields
  – Log-linear modeling on the whole sentence
  – Features are defined on states and edges
  – Dynamic programming
33
POS tagging algorithms
• Performance on the Wall Street Journal corpus
Method                           Training cost   Speed      Accuracy
Dependency Net (2003)            Low             Low        97.2
Conditional Random Fields        High            High       97.1
Support vector machines (2003)                              97.1
Bidirectional MEMM (2005)        Low                        97.1
Brill’s tagger (1995)            Low                        96.6
HMM (2000)                       Very low        High       96.7
34
POS taggers
• Brill’s tagger– http://www.cs.jhu.edu/~brill/
• TnT tagger– http://www.coli.uni-saarland.de/~thorsten/tnt/
• Stanford tagger– http://nlp.stanford.edu/software/tagger.shtml
• SVMTool– http://www.lsi.upc.es/~nlp/SVMTool/
• GENIA tagger– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
35
Tagging errors made by a WSJ-trained POS tagger
… and membrane potential after mitogen binding .
   CC  NN       NN        IN    NN      JJ
… two factors , which bind to the same kappa B enhancers …
  CD  NNS      WDT   NN   TO DT  JJ    NN    NN NNS
… by analysing the Ag amino acid sequence .
  IN VBG       DT  VBG JJ   NN  NN
… to contain more T-cell determinants than …
  TO VB      RBR  JJ     NNS         IN
Stimulation of interferon beta gene transcription in vitro by …
NN          IN JJ         JJ   NN   NN            IN NN    IN
36
Taggers for general text do not work well on biomedical text
Matching criterion         Accuracy
Exact                      84.4%
NNP = NN, NNPS = NNS       90.0%
LS = NN                    91.3%
JJ = NN                    94.9%
Accuracies of a WSJ-trained POS tagger evaluated on the GENIA corpus (Tsuruoka et al., 2005)
Performance of the Brill tagger evaluated on randomly selected 1000 MEDLINE sentences: 86.8% (Smith et al., 2004)
37
MedPost(Smith et al., 2004)
• Hidden Markov Models (HMMs)
• Training data
  – 5,700 sentences randomly selected from various thematic subsets
• Accuracy
  – 97.43% (native tagset), 96.9% (Penn tagset)
  – Evaluated on 1,000 sentences
• Available from
  – ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz
38
Training POS taggers with bio-corpora
(Tsuruoka and Tsujii, 2005)
training                   WSJ    GENIA   PennBioIE
WSJ                        97.2   91.6    90.5
GENIA                      85.3   98.6    92.2
PennBioIE                  87.4   93.4    97.9
WSJ + GENIA                97.2   98.5    93.6
WSJ + PennBioIE            97.2   94.0    98.0
GENIA + PennBioIE          88.3   98.4    97.8
WSJ + GENIA + PennBioIE    97.2   98.4    97.9
39
Performance on new data
training                   NAR    NMED   JCI    Total (Acc.)
WSJ 109 47 102 258 (70.9%)
GENIA 121 74 132 327 (89.8%)
PennBioIE 129 65 122 316 (86.6%)
WSJ + GENIA 125 74 135 334 (91.8%)
WSJ + PennBioIE 133 71 133 337 (92.6%)
GENIA + PennBioIE 128 75 135 338 (92.9%)
WSJ + GENIA + PennBioIE 133 74 139 346 (95.1%)
Relative performance evaluated on recent abstracts selected from three journals: - Nucleic Acid Research (NAR) - Nature Medicine (NMED) - Journal of Clinical Investigation (JCI)
40
Chunking (shallow parsing)
• A chunker (shallow parser) segments a sentence into non-recursive phrases.
[He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only # 1.8 billion]NP [in]PP [September]NP .
41
Extracting noun phrases from MEDLINE(Bennett, 1999)
• Rule-based noun phrase extraction
  – Tokenization
  – Part-of-speech tagging
  – Pattern matching
            FastNPE   NPtool   Chopper   AZ Phraser
Recall      50%       95%      97%       92%
Precision   80%       96%      90%       86%

Noun phrase extraction accuracies evaluated on 40 abstracts
42
Chunking with Machine learning
• Chunking performance on Penn Treebank
Method                                        Recall   Precision   F-score
Winnow (with basic features) (Zhang, 2002)    93.60    93.54       93.57
Perceptron (Carreras, 2003)                   93.29    94.19       93.74
SVM + voting (Kudoh, 2003)                    93.92    93.89       93.91
SVM (Kudo, 2000)                              93.51    93.45       93.48
Bidirectional MEMM (Tsuruoka, 2005)           93.70    93.70       93.70
43
Machine learning-based chunking
• Convert a treebank into sentences that are annotated with chunk information.
  – CoNLL-2000 data set
    • http://www.cnts.ua.ac.be/conll2000/chunking/
    • The conversion script is available
• Apply a sequence tagging algorithm such as HMM, MEMM, CRF, or Semi-CRF.
• YamCha: an SVM-based chunker
  – http://www.chasen.org/~taku/software/yamcha/
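The treebank-to-chunk conversion produces per-token B/I/O tags, the encoding used in the CoNLL-2000 data. A minimal sketch of that encoding step (the `to_bio` helper and the chunk list are ours):

```python
def to_bio(chunks):
    """Convert chunks like ('NP', ['He']) into per-token BIO tags.
    Tokens outside any chunk are passed as (None, [token]) and
    tagged 'O'; the first token of a chunk gets 'B-', the rest 'I-'."""
    tokens, tags = [], []
    for label, words in chunks:
        for i, w in enumerate(words):
            tokens.append(w)
            if label is None:
                tags.append('O')
            else:
                tags.append(('B-' if i == 0 else 'I-') + label)
    return tokens, tags

chunks = [('NP', ['He']), ('VP', ['reckons']),
          ('NP', ['the', 'current', 'account', 'deficit']),
          ('VP', ['will', 'narrow']), ('PP', ['to']), (None, ['.'])]
tokens, tags = to_bio(chunks)
print(list(zip(tokens, tags)))
```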
44
GENIA tagger
• Algorithm: Bidirectional MEMM
• POS tagging
  – Trained on WSJ, GENIA and Penn BioIE
  – Accuracy: 97-98%
• Shallow parsing
  – Trained on WSJ and GENIA
  – Accuracy: 90-94%
• Can output base forms
• Available from
  – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
45
Named-Entity Recognition
• Recognize named-entities in a sentence.
  – Gene/protein names
  – Protein, DNA, RNA, cell_line, cell_type

We have shown that [interleukin-1]protein ([IL-1]protein) and [IL-2]protein control [IL-2 receptor alpha (IL-2R alpha) gene]DNA transcription in [CD4-CD8- murine T lymphocyte precursors]cell_line.
46
Performance of biomedical NE recognition
Method                         Recall   Precision   F-score
SVM+HMM (Zhou, 2004)           76.0     69.4        72.6
Semi-Markov CRFs (in prep.)    72.7     70.4        71.5
Two-Phase (Kim, 2005)          72.8     69.7        71.2
Sliding Window (in prep.)      71.5     70.2        70.8
CRF (Settles, 2005)            72.0     69.1        70.5
MEMM (Finkel, 2004)            71.6     68.6        70.1
 :                              :        :           :
• Shared task data for Coling 2004 BioNLP workshop - entity types: protein, DNA, RNA, cell_type, and cell_line
47
Features
[Table: classification models and main features used in NLPBA (Kim, 2004).
Columns: CM, lx, af, or, sh, gn, gz, po, np, sy, tr, ab, ca, do, pa, pr, ext.
Rows: Zho (SH), Fin (M), Set (C), Son (SC), Zha (H), each marking its
features with x.]

Classification Model (CM): S: SVM; H: HMM; M: MEMM; C: CRF
Features: lx: lexical features; af: affix information (character n-grams); or: orthographic information; sh: word shapes; gn: gene sequences; gz: gazetteers; po: part-of-speech tags; np: noun phrase tags; sy: syntactic tags; tr: word triggers; ab: abbreviations; ca: cascaded entities; do: global document information; pa: parentheses handling; pr: previously predicted entity tags
External resources (ext.): B: British National Corpus; W: WWW; V: virtually generated corpus; M: MEDLINE
48
Estimated volume was a light 2.4 million ounces .
VBN       NN     VBD DT JJ   CD  CD      NNS    .
[Phrase-structure tree: S covering NP and VP; the QP “2.4 million” inside the object NP]
CFG parsing
49
Estimated volume was a light 2.4 million ounces .
VBN       NN     VBD DT JJ   CD  CD      NNS    .
[The same phrase-structure tree, with the head of each phrase marked]
Phrase structure + head information
50
Estimated volume was a light 2.4 million ounces .
VBN       NN     VBD DT JJ   CD  CD      NNS    .
[Arrows between words showing head-dependent links]
Dependency relations
51
CFG parsing algorithms
Method                                        LR     LP     F-score
Generative model (Collins, 1999)              88.1   88.3   88.2
Maxent-inspired (Charniak, 2000)              89.6   89.5   89.5
Simple Synchrony Networks (Henderson, 2004)   89.8   90.4   90.1
Data Oriented Parsing (Bod, 2003)             90.8   90.7   90.7
Re-ranking (Johnson, 2005)                                  91.0
• Performance on the Penn Treebank
52
CFG parsers
• Collins parser
  – http://people.csail.mit.edu/mcollins/code.html
• Bikel’s parser
  – http://www.cis.upenn.edu/~dbikel/software.html#stat-parser
• Charniak parser
  – http://www.cs.brown.edu/people/ec/
• Re-ranking parser
  – http://www.cog.brown.edu:16080/~mj/Software.htm
• SSN parser
  – http://homepages.inf.ed.ac.uk/jhender6/parser/ssn_parser.html
53
Parsing biomedical documents
• CFG parsing accuracies on the GENIA treebank (Clegg, 2005)
• In order to improve performance:
  – Unsupervised parse combination (Clegg, 2005)
  – Use lexical information (Lease, 2005): 14.2% reduction in error
Parser            LR      LP      F-score
Bikel 0.9.8       77.43   81.33   79.33
Charniak          76.05   77.12   76.58
Collins model 2   74.49   81.30   77.75
54
Parse tree
[Parse tree for “A normal serum CRP measurement does not exclude deep vein thrombosis .”, with phrase labels such as NP, VP, AJ, AV and DT on the nodes]
55
Semantic structure
[The same parse tree for “A normal serum CRP measurement does not exclude deep vein thrombosis .”, now annotated with predicate-argument relations (ARG1, ARG2, MOD) between the nodes]
Predicate argument relations
56
Abstraction of surface expressions
57
HPSG parsing
• HPSG
  – A few schemas
  – Many lexical entries
  – Deep syntactic analysis
• Grammar
  – Corpus-based grammar construction (Miyao et al., 2004)
• Parser
  – Beam search (Tsuruoka et al.)

[Derivation of “Mary walked slowly”:
  Mary   — lexical entry: HEAD: noun, SUBJ: <>,     COMPS: <>
  walked — lexical entry: HEAD: verb, SUBJ: <noun>, COMPS: <>
  slowly — lexical entry: HEAD: adv,  MOD: verb
“Mary” and “walked” combine by the subject-head schema; the result and
“slowly” combine by the head-modifier schema, giving HEAD: verb, SUBJ: <>, COMPS: <>.]
58
Experimental results
• Training set: Penn Treebank Section 02-21 (39,832 sentences)
• Test set: Penn Treebank Section 23 (< 40 words, 2,164 sentences)
• Accuracy of predicate-argument relations (the ARG1/ARG2 arrows) is measured

Precision   Recall   F-score
87.9%       86.9%    87.4%
59
Parsing MEDLINE with HPSG
• Enju– A wide-coverage HPSG parser– http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
60
Extraction of Protein-protein Interactions:Predicate-argument relations + SVM (1)
• (Yakushiji, 2005)
CD4 protein interacts with non-polymorphic regions of MHCII .
ENTITY1                                               ENTITY2

[Extraction pattern based on predicate-argument relations: ENTITY1 is
arg1 of “interact”, whose arg2 “with” leads through “non-polymorphic
region of” to ENTITY2]
Extraction patterns based on predicate-argument relations
SVM learning with predicate-argument patterns
61
Text Mining for Biology
• MEDIE: an interactive intelligent IR system retrieving events
  – Performs a semantic search
• Info-PubMed: an interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them.
62
Medie system overview
Off-line: Input Textbase → (Deep parser, Entity Recognizer) → Semantically-annotated Textbase
On-line: Query → Region Algebra Search engine (over the annotated textbase) → Search results
65
Service: extracting interactions
• Info-PubMed: interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them.
• System components
  – MEDIE
  – Extraction of protein-protein interactions
  – Multi-window interface on a browser
• UTokyo: NaCTeM self-funded partner
• https://www-tsujii.is.s.u-tokyo.ac.jp/info-pubmed/
66
Info-PubMed
• helps biologists to search for their interests
  – genes, proteins, their interactions, and evidence sentences
  – extracted from MEDLINE (about 15 million abstracts of biomedical papers)
• uses many of the NLP techniques explained
  – in order to achieve high precision of retrieval
67
Flow Chart
Input: token “TNF” → gene/protein keywords
  → gene/protein entities (Gene: “TNF”)
  → interactions around the given gene (Interaction: “TNF” and “IL6”)
  → Output: interaction evidence sentences describing the given interaction
68
Techniques (1/2)

• Biomedical entity recognition in abstract sentences
  – prepare a gene dictionary
  – string match
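A minimal sketch of the dictionary + string-match step (the `dictionary_ner` helper and the toy gene list are ours; a real system would use a curated dictionary with normalization of case, hyphens, and synonyms):

```python
import re

def dictionary_ner(sentence, gene_dict):
    """Find every dictionary entry occurring in the sentence, matching
    longest names first so that 'IL-2 receptor alpha' wins over 'IL-2'.
    Returns (start, end, name) spans sorted by position."""
    entities = []
    taken = [False] * len(sentence)       # characters already covered
    for name in sorted(gene_dict, key=len, reverse=True):
        for m in re.finditer(re.escape(name), sentence):
            if not any(taken[m.start():m.end()]):
                entities.append((m.start(), m.end(), name))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(entities)

genes = ['TNF', 'IL-2', 'IL-2 receptor alpha']
s = 'Secretion of TNF was abolished; IL-2 receptor alpha was unaffected.'
print(dictionary_ner(s, genes))
```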
69
Techniques (2/2)

• Extract sentences describing protein-protein interactions
  – deep parser based on HPSG syntax
    • can detect semantic relations between phrases
  – domain-dependent pattern recognition
    • can learn and expand source patterns
    • by using the result of the deep parser, it can extract semantically true patterns, not affected by syntactic variations
70
Info-PubMed