Statistical Relational Learning for NLP
Ray Mooney & Razvan Bunescu
Presented by Michele Banko
Outline
Quick intro to NLP
– Problems, history
– Why SRL?
SRL for two NLP tasks
– Information Extraction using RMNs
– Semantic Parsing using ILP
Quick Tour of NLP
Systems
– Translation, Question-Answering/NL Search, Reading Comprehension
Sub-tasks
– Part-of-Speech Tagging, Phrase Finding, Parsing [syntax, semantics], Named-Entity Recognition
Quick Tour of NLP [2]
Knowledge engineering/linguistics vs. statistics
Recently, many NLP tasks treated as sequence labeling problems
– POS-tagging: The/DT cat/NN can/MD walk/VB
– NP-finding: Stanford/B-NP University/I-NP is/O in/O California/B-NP
– SRL with 1 relation (adjacency): HMMs, CRFs to model, Viterbi to label
  Find most probable assignment of labels
  States = part-of-speech tags
  Compute P(w|t): emit word in a particular state
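The bullets above describe the standard HMM decoding recipe: states are POS tags, each state emits a word with probability P(w|t), and Viterbi finds the most probable tag assignment. A minimal sketch; the toy transition/emission numbers below are invented for illustration, not estimated from any corpus:

```python
# Hedged sketch: Viterbi decoding for a toy HMM POS tagger.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`."""
    # best[t] = (prob of best path ending in tag t, that path)
    best = {t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            p, path = max(
                (best[s][0] * trans_p[s][t] * emit_p[t].get(w, 0.0), best[s][1])
                for s in tags
            )
            new_best[t] = (p, path + [t])
        best = new_best
    return max(best.values())[1]

# Invented toy model covering the slide's example sentence.
tags = ["DT", "NN", "MD", "VB"]
start_p = {"DT": 0.7, "NN": 0.1, "MD": 0.1, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.8, "MD": 0.1, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.2, "MD": 0.5, "VB": 0.2},
    "MD": {"DT": 0.1, "NN": 0.1, "MD": 0.05, "VB": 0.75},
    "VB": {"DT": 0.4, "NN": 0.2, "MD": 0.2, "VB": 0.2},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"cat": 0.6, "walk": 0.1},
    "MD": {"can": 0.9},
    "VB": {"walk": 0.6, "can": 0.1},
}
print(viterbi("the cat can walk".split(), tags, start_p, trans_p, emit_p))
# → ['DT', 'NN', 'MD', 'VB']
```

Dynamic programming keeps one best path per tag at each position, so decoding is linear in sentence length rather than exponential.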
Why SRL?
NLP involves...
Complex reasoning about entities and relationships between them
– Predicate logic
Resolving & integrating ambiguities on multiple levels (morphology, syntax, semantics...)
– More than just adjacency! Bayesian methods, graphical models
Intro to Information Extraction
Early 90s, DARPA Message Understanding Conference
– Identify references to named entities (people, companies, locations...)
– Multi-lingual, multi-document
– Attributes, relationships, events
Attributes
NAME        Fletcher Maddox, Dr. Maddox
DESCRIPTORS former Dean of the UCSD Business School, his father, the firm's CEO
CATEGORY    PERSON
Facts
PERSON           Employee_Of
Fletcher Maddox  UCSD Business School; La Jolla Genomatics
Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons.
Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo.
IE for Protein Identification
Medline DB: 750K abstracts, 3700 proteins
"Production of nitric oxide (NO) in endothelial cells is regulated by direct interactions of endothelial nitric oxide synthase (eNOS) with effector proteins such as Ca2+-calmodulin ... which avidly binds to the carboxyl terminal region of the eNOS oxygenase domain."
Approaches: rule-based, HMM, SVM, MaxEnt
– CRFs outperform the rest (Ramani et al., 2005)
– May fail to capture long-distance dependencies
Collective IE using RMNs (Bunescu & Mooney, 2004)
Typical IE: extractions made in isolation
Want to consider influences between extractions
– If the context surrounding one occurrence strongly indicates a protein, this should affect the tagging of other occurrences in different contexts
– Acronyms & their long forms
Use a sequence-labeling treatment to get all substrings that make up protein names
– Start, End, Continue, Unique, Other
– Classification of sequence types, not solo tokens
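A minimal sketch of how the five-tag scheme might be decoded back into protein-name spans. The single-letter codes and the Start / Continue* / End convention for multi-token names (Unique for single-token names) are my reading of the tag names, not spelled out in the slides:

```python
# Hedged sketch: recover protein-name spans from per-token tags
# S=Start, C=Continue, E=End, U=Unique, O=Other (assumed encoding).

def decode(tokens, tags):
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "U":                      # single-token name
            spans.append((i, i))
        elif tag == "S":                    # open a multi-token name
            start = i
        elif tag == "E" and start is not None:
            spans.append((start, i))        # close it
            start = None
    return [" ".join(tokens[a:b + 1]) for a, b in spans]

names = decode("the antioxidant superoxide dismutase - 1 ( SOD1 )".split(),
               ["O", "O", "S", "C", "C", "E", "O", "U", "O"])
# → ['superoxide dismutase - 1', 'SOD1']
```

Tagging spans rather than single tokens lets one sentence yield both a long form and its acronym as separate extractions.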
RMNs (Relational Markov Networks)
RMNs (Taskar et al., 2002)
For each document d in the collection
– Associate d with a set of candidate entities d.E
  Entities = token sequences: too many possible phrases, so either constrain length or form (base NPs)
– Characterize each entity e in d.E with a predefined set of boolean features e.F
  e.label = 1 if e is a valid extraction
  e.HeadWord, e.POS_Left, e.BigramRight, e.Prefix
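The feature templates named above might look like the following sketch. The exact definitions (taking the last token of the phrase as the head word, a two-character prefix) are assumptions for illustration:

```python
# Hedged sketch of per-entity boolean feature extraction
# (HeadWord, POS_Left, BigramRight, Prefix from the slide).

def entity_features(tokens, pos, start, end):
    """Features for the candidate entity tokens[start:end]."""
    feats = {
        "HeadWord=" + tokens[end - 1],       # assume last token is the head
        "Prefix=" + tokens[start][:2],       # assume short character prefix
    }
    if start > 0:
        feats.add("POS_Left=" + pos[start - 1])
    if end + 1 < len(tokens):
        feats.add("BigramRight=" + tokens[end] + "_" + tokens[end + 1])
    return feats

feats = entity_features(["the", "eNOS", "oxygenase", "domain"],
                        ["DT", "NN", "NN", "NN"], 1, 2)
# feats includes "HeadWord=eNOS", "POS_Left=DT", "BigramRight=oxygenase_domain"
```

Each boolean feature becomes one observed node in the network, tied to the entity's hidden label through the local clique templates.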
Clique Templates
Find all subsets of entities satisfying a given constraint; for each subset, form a clique c
Local templates
– Number of hidden labels = 1
– Model correlations between an entity's observed features and its label
Global templates
– Number of hidden labels > 1
– Model influences among multiple entities
– Overlap Template, Repeat Template, Acronym Template
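A toy illustration of clique-template matching, assuming candidate entities are represented as (start, end) token spans with exclusive end; the overlap test is my reading of the Overlap Template:

```python
# Hedged sketch: a local template yields one clique per candidate entity;
# a global "overlap" template yields a clique per pair of candidates
# whose token spans share at least one token.
from itertools import combinations

def local_cliques(candidates):
    return [(c,) for c in candidates]

def overlap_cliques(candidates):
    """candidates: list of (start, end) token spans, end exclusive."""
    return [
        (a, b) for a, b in combinations(candidates, 2)
        if a[0] < b[1] and b[0] < a[1]       # spans intersect
    ]

cliques = overlap_cliques([(0, 2), (1, 3), (5, 6)])
# → [((0, 2), (1, 3))]  — only the first two spans overlap
```

An overlap clique lets the model enforce that at most one of two conflicting candidates is labeled as a valid extraction.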
Using Local Templates
Variable nodes: labels of all candidate entities in the document
Potential nodes: represent correlations between one or more entity attributes by linking them to variable nodes
– Edges: created by matching clique templates against d.E
[Figure: RMN and equivalent factor graph built by local templates for the candidate entity "antioxidant superoxide dismutase-1"; feature nodes such as φ.HEAD=enzyme, φ.POSLEFT=NOUN, φ.WORDLEFT=the, φ.PREFIX=A0 link to the entity's e.label variable node]
Using Global Templates
Connect label nodes of two or more entities
[Figure: acronym factor graph for "The antioxidant superoxide dismutase-1 (SOD1)"; the acronym potential φAT connects the label of the entity "SOD1" through an OR node u_OR (potential φOR) to the labels of the overlapping long-form candidates "superoxide dismutase-1" and "dismutase-1"]
Inference & Learning in RMNs
Inference
– Labels are the only hidden features
– Probability distribution over hidden entity labels computed using the Gibbs distribution
– Find the most probable assignment of values to labels using the max-product algorithm
Learning
– Structure defined by clique templates
– Find clique potentials that maximize likelihood over the training data, using the Voted Perceptron
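The perceptron-style learning step can be approximated by an averaged perceptron over feature sets. This toy binary version drops the graphical-model inference entirely (features, examples, and the coarse per-epoch averaging are all invented) and just shows the mistake-driven update:

```python
# Hedged sketch: averaged perceptron on binary-labelled feature sets,
# a stand-in for learning log-linear clique potentials.
from collections import defaultdict

def averaged_perceptron(examples, epochs=5):
    """examples: list of (feature_set, label in {+1, -1})."""
    w, total = defaultdict(float), defaultdict(float)
    for _ in range(epochs):
        for feats, y in examples:
            score = sum(w[f] for f in feats)
            if y * score <= 0:               # mistake-driven update
                for f in feats:
                    w[f] += y
        for f in w:                          # accumulate for (coarse) averaging
            total[f] += w[f]
    return {f: v / epochs for f, v in total.items()}

w = averaged_perceptron([({"HeadWord=kinase"}, +1),
                         ({"HeadWord=the"}, -1)])
# w["HeadWord=kinase"] ends up positive, w["HeadWord=the"] negative
```

Averaging the weight vectors dampens the oscillation of the plain perceptron, which is the practical motivation behind the voted variant.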
Experiments
Datasets
– Yapex, AImed: ~200 Medline abstracts, ~4000 protein references each
Systems
– LT-RMN: RMN with local + overlap templates
– GLT-RMN: RMN with local + global templates
– CRF: McCallum 2002
Evaluation
– Position-based precision, recall
Results
Yapex
Method Precision Recall F-Measure
LT-RMN 70.79 53.81 61.14
GLT-RMN 69.71 65.76 67.68
CRF 72.45 58.64 64.81
AImed
Method Precision Recall F-Measure
LT-RMN 81.33 72.79 76.82
GLT-RMN 82.79 80.04 81.39
CRF 85.37 75.90 80.36
Intro to Semantic Parsing
NL input -> logical form
Assignment of semantic roles
– "John gives the book to Mary"
  gives1(subj: John2, dobj: book3, iobj: Mary4)
Ambiguities: "Every man loves a woman"
– Is there one woman loved by all, or...
  ∃Y∀X man(X) ^ woman(Y) -> loves(X,Y)
  ∀X∃Y man(X) ^ woman(Y) -> loves(X,Y)
– LF just says
  loves1(every m1:man(m1), a w1:woman(w1))
Previous Work in Semantic Parsing
Hand-built/linguistic systems
– Grammar development
– NLPwin (MSR)
Data-driven approaches
– Tagging problem: for each NP, tag its role
  Gildea & Jurafsky (2002): NB classifier combination using lexical and syntactic features, previous labels
  Jurafsky et al. (2004): SVMs
– Sentence + LF -> parser
  CHILL: ILP to learn a generalized Prolog parser
ILP in CHILL
Parser induction = learn control rules of a shift-reduce parser
– Shift symbols from the input string onto a stack
– Reduce items on the stack by applying a matching grammar rule
Can be encoded as a Prolog program:
– parse(S,Parse) :- parse([],S,[Parse],[]).
Start with generic operators
– INTRODUCE pushes a predicate onto the stack based on word input
  Lexicon: 'capital' -> capital(_,_)
– COREF_VARS unifies two variables under consideration
– DROP_CONJ embeds one predicate as an argument of another
– SHIFT pushes word input onto the stack
What is the capital of Texas?

Stack                                                                  Input Buffer                       Action
[answer(_,_):[]]                                                       [what,is,the,capital,of,texas,?]   (start)
[answer(_,_):[the,is,what]]                                            [capital,of,texas,?]               SHIFT, SHIFT, SHIFT
[capital(_,_):[], answer(_,_):[the,is,what]]                           [capital,of,texas,?]               INTRODUCE
[capital(C,_):[], answer(C,_):[the,is,what]]                           [capital,of,texas,?]               COREF_VARS
[capital(C,_):[of,capital], answer(C,_):[the,is,what]]                 [texas,?]                          SHIFT, SHIFT
[const(S,stateid(texas)):[], capital(C,S):[of,capital],
 answer(C,_):[the,is,what]]                                            [texas,?]                          INTRODUCE, COREF_VARS
[answer(C,(capital(C,S),const(S,stateid(texas))))
 :[?,texas,of,capital,the,is,what]]                                    []                                 DROP_CONJ, SHIFT, SHIFT
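A highly simplified sketch of the state mechanics behind this trace: stack frames pair a predicate string with the words shifted under it, and INTRODUCE consults a one-entry lexicon taken from the slide's example. Real CHILL performs Prolog unification for COREF_VARS and DROP_CONJ, which is omitted here:

```python
# Hedged sketch: SHIFT and INTRODUCE on (stack, buffer) parse states.
LEXICON = {"capital": "capital(_,_)"}        # word -> predicate template

def shift(stack, buffer):
    """Push the next input word under the top stack frame."""
    pred, words = stack[0]
    return [(pred, [buffer[0]] + words)] + stack[1:], buffer[1:]

def introduce(stack, buffer):
    """Push a fresh predicate frame keyed off the next input word."""
    return [(LEXICON[buffer[0]], [])] + stack, buffer

# Replay the first four steps of the trace above.
state = ([("answer(_,_)", [])], "what is the capital of texas ?".split())
for op in (shift, shift, shift, introduce):
    state = op(*state)
# stack: [('capital(_,_)', []), ('answer(_,_)', ['the', 'is', 'what'])]
# buffer: ['capital', 'of', 'texas', '?']
```

Note that INTRODUCE does not consume the triggering word; it stays in the buffer to be shifted later, matching the trace.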
Learning Control Rules
Operator generation
– Initial parser is too general; it will produce spurious parses
– Use training data to extend the program
Example analysis
– Use the general parser, recording parse states
– Positive examples
  Parse states to which an operator should be applied
  Find the first correct parse of each training pair; the ops used to achieve its subgoals become positive examples
– Negative examples (for single-parse systems)
  S is a negative example for the current action if S is a positive example for a previous action
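The negative-example rule can be sketched as follows, assuming the trace of the first correct parse is available as (state, operator) pairs; the representation is invented for illustration:

```python
# Hedged sketch: derive training examples from a recorded parse trace.
# Each state is a positive example for the operator applied there, and
# (single-parse setting) a negative example for any later action whose
# positive states precede it in the trace.

def make_examples(trace):
    """trace: ordered list of (state, operator) pairs."""
    positives = list(trace)
    negatives = []
    for i, (_, op) in enumerate(trace):
        for state, _ in trace[:i]:          # states positive for earlier ops
            negatives.append((state, op))
    return positives, negatives

pos, neg = make_examples([("s0", "SHIFT"), ("s1", "INTRODUCE")])
# pos: both trace pairs; neg: [("s0", "INTRODUCE")]
```

This gives the ILP learner labelled parse states without requiring any hand annotation beyond the sentence/logical-form pairs.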
Induction in CHILL
Control-rule induction
– Cover all positive examples, none of the negative
– Bottom-up: compact the rule set by forming Least-General Generalizations of clause pairs
– Top-down: overly-general rules specialized by addition of literals
– Invention of new predicates

  op([X,[Y,det:the]], [the|Z],A,B) :- animate(Y).
  animate(man). animate(boy). animate(girl). ...

Fold constraints back into the general parser
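The core of bottom-up induction is the least-general generalization (anti-unification) of clause pairs. A sketch of LGG over simple terms, represented as nested (functor, args...) tuples; the variable-naming scheme is invented:

```python
# Hedged sketch: least-general generalization of two terms. Matching
# structure is kept; mismatched subterms become variables, with repeated
# mismatches of the same pair sharing one variable.

def lgg(t1, t2, table=None):
    table = {} if table is None else table
    if t1 == t2:
        return t1
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and len(t1) == len(t2) and t1[0] == t2[0]):
        return tuple([t1[0]] +
                     [lgg(a, b, table) for a, b in zip(t1[1:], t2[1:])])
    return table.setdefault((t1, t2), "V%d" % len(table))

lgg(("f", "a", "b"), ("f", "a", "c"))   # → ('f', 'a', 'V0')
lgg(("f", "b", "b"), ("f", "c", "c"))   # → ('f', 'V0', 'V0')
```

Sharing one variable for a repeated mismatch is what lets LGG preserve co-reference constraints, the analogue of shared Prolog variables across literals.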
ProbCHILL: Parsing via ILP Ensembles
ILP + statistical parsing
– ILP: not forced to make decisions based on predetermined features
– SP: handles multiple sources of uncertainty
Ensemble classifier
– Combine the outputs of more than one classifier
– Bagging, boosting
TABULATE: generate several ILP hypotheses
– Use SP to estimate a probability for each potential operator
– Find the most probable semantic parse (beam search)
  P(parse state) = product of the probabilities of the operators used to reach that state
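A generic beam search using the product-of-operator-probabilities score described above. The toy state space and probabilities below are invented; in ProbCHILL the (operator, probability) pairs would come from the statistical model over ILP-generated hypotheses:

```python
# Hedged sketch: beam search where a state's score is the product of the
# probabilities of the operators used to reach it.

def beam_search(start, next_ops, is_final, width=2, max_steps=10):
    beam = [(1.0, start, [])]                    # (prob, state, ops so far)
    for _ in range(max_steps):
        candidates = []
        for prob, state, ops in beam:
            if is_final(state):
                candidates.append((prob, state, ops))
                continue
            for op, p in next_ops(state):
                candidates.append((prob * p, op(state), ops + [op]))
        beam = sorted(candidates, key=lambda c: -c[0])[:width]
        if all(is_final(s) for _, s, _ in beam):
            break
    return beam[0]

# Toy usage: states are integers, operators add 1 or 2, goal is state 3.
def inc1(s): return s + 1
def inc2(s): return s + 2
probs = {0: [(inc1, 0.4), (inc2, 0.6)], 1: [(inc2, 1.0)], 2: [(inc1, 1.0)]}
best = beam_search(0, lambda s: probs.get(s, []), lambda s: s == 3)
# → (0.6, 3, [inc2, inc1])
```

Keeping a beam of partial parses is what lets ProbCHILL recover from a locally sub-optimal operator choice that deterministic CHILL would commit to.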
Experiments
U.S. Geography DB with 800 Prolog facts
250 questions from 50 humans, annotated with Prolog queries
  What is the capital of the state with the highest population?
  answer(C, (capital(S,C), largest(P, (state(S), population(S,P)))))
10-fold CV, measuring
  Recall    = # correct queries produced / # test sentences
  Precision = # correct queries produced / # complete parses produced
Results
System Recall Precision F-measure
Geobase 56.00 96.40 70.85
CHILL 68.50 97.65 80.52
ProbCHILL 80.40 88.16 84.10
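The reported F-measures are the harmonic mean of precision and recall, which can be checked directly against the table:

```python
# F-measure as the harmonic mean of precision and recall.

def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(68.50, 97.65), 2))   # CHILL row → 80.52
```

The same check reproduces the Geobase (70.85) and ProbCHILL (84.10) rows: ProbCHILL trades some precision for a large recall gain, which raises the harmonic mean.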