Constrained Conditional Models for Natural Language Processing
Page 1
March 2009
EACL
Constrained Conditional Models for
Natural Language Processing
Ming-Wei Chang, Lev Ratinov, Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign
Page 2
Nice to Meet You
Informally: Everything that has to do with constraints (and learning
models)
Formally: We typically make decisions based on models such as:
With CCMs we make decisions based on models such as:
We do not define the learning method but we’ll discuss it and make suggestions
CCMs make predictions in the presence/guided by constraints
Page 3
Constrained Conditional Models (CCMs)

Standard linear model:
  y* = argmax_y  w^T f(x, y)

Constrained Conditional Model:
  y* = argmax_y  w^T f(x, y)  −  Σ_{c ∈ C} ρ_c d(y, 1_{C(x)})
Constraint-Driven Learning and Decision Making: Why Constraints?

The goal: building good NLP systems easily. We have prior knowledge at hand; how can we use it? We suggest that knowledge can often be injected directly:
  Can use it to guide learning
  Can use it to improve decision making
  Can use it to simplify the models we need to learn
How useful are constraints?
  Useful for supervised learning
  Useful for semi-supervised learning
  Sometimes more efficient than labeling data directly
Page 5
Inference
Page 6
Comprehension
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an Inference Problem
Page 7
This Tutorial: Constrained Conditional Models
Part 1: Introduction to CCMs
  Examples: NE + Relations; Information Extraction (correcting models with CCMs)
  First summary: why CCMs are important
  Problem setting
  Features and constraints; some hints about training issues
Part 2: Introduction to Integer Linear Programming
  What is ILP; use of ILP in NLP
Part 3: Detailed examples of using CCMs
  Semantic Role Labeling in detail
  Coreference Resolution
  Sentence Compression

BREAK
Page 8
This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference
  Other inference algorithms: Search (SRL); Dynamic Programming (Transliteration); Cutting Planes
  Using hard and soft constraints
Part 5: Training issues when working with CCMs
  Formalism (again)
  Choices of training paradigms -- tradeoffs
  Examples in supervised learning
  Examples in semi-supervised learning
Part 6: Conclusion
  Building CCMs: features and constraints; objective functions; different learners
  Mixed models vs. joint models; where is the knowledge coming from?
THE END
Page 9
Learning and Inference Global decisions in which several local decisions play a
role but there are mutual dependencies on their outcome. E.g. Structured Output Problems – multiple dependent output
variables
(Learned) models/classifiers for different sub-problems In some cases, not all local models can be learned simultaneously Key examples in NLP are Textual Entailment and QA In these cases, constraints may appear only at evaluation time
Incorporate models’ information, along with prior knowledge/constraints, in making coherent decisions decisions that respect the local models as well as domain &
context specific knowledge/constraints.
Page 12
Inference with General Constraint Structure [Roth & Yih'04]: Recognizing Entities and Relations

Dole 's wife, Elizabeth , is a native of N.C.
  E1 = "Dole", E2 = "Elizabeth", E3 = "N.C."; R12, R23, ... are the pairwise relations.

[Figure: the entity and relation nodes, each annotated with classifier score distributions -- entities over {per, loc, other}, relations over {spouse_of, born_in, irrelevant}.]
Improvement over no inference: 2-5%
Some Questions: How to guide the global inference? Why not learn Jointly?
Models could be learned separately; constraints may come up only at decision time.
Note: Non Sequential Model
Page 13
Tasks of Interest: Structured Output

For each instance, assign values to a set of variables; the output variables depend on each other.
Common tasks in natural language processing:
  Parsing; semantic parsing; summarization; transliteration; co-reference resolution, ...
Information extraction:
  Entities, relations, ...
Many pure machine learning approaches exist:
  Hidden Markov Models (HMMs); CRFs; structured Perceptrons and SVMs ...
However, ...
Page 14
Information Extraction via Hidden Markov Models
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
Prediction result of a trained HMM:
Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]
Unsatisfactory results !
Motivation II
Page 15
Strategies for Improving the Results
(Pure) Machine Learning Approaches Higher Order HMM/CRF? Increasing the window size? Adding a lot of new features
Requires a lot of labeled examples
What if we only have a few labeled examples?
Any other options? Humans can immediately tell bad outputs The output does not make sense
Increasing the model complexity
Can we keep the learned model simple and still make expressive decisions?
Page 16
Information extraction without Prior Knowledge
Prediction result of a trained HMM:
Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]
Violates lots of natural constraints!
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
Page 17
Examples of Constraints
Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words pp., pages correspond to PAGE. Four digits starting with 20xx and 19xx are DATE. Quotations can appear only in TITLE ……. Easy to express pieces of “knowledge”
Non Propositional; May use Quantifiers
Page 18
Adding constraints, we get correct results! Without changing the model
[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May 1994 .
Information Extraction with Constraints
Page 19
Problem Setting

Random variables Y: y1, ..., y8
Conditional distributions P (learned by models/classifiers)
Constraints C -- any Boolean function defined over partial assignments (possibly with weights W)
Goal: find the "best" assignment -- the assignment that achieves the highest global performance.
This is an Integer Programming problem.

[Figure: variables y1 ... y8 over the observations, with constraints C(y1, y4) and C(y2, y3, y6, y7, y8).]

Y* = argmax_Y P(Y)  subject to constraints C  (+ weights W_C)
Page 20
Formal Model

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)     (subject to constraints)

  w^T f(x, y): weight vector for "local" models -- a collection of classifiers; log-linear models (HMM, CRF), or a combination
  ρ_i: penalty for violating the constraint
  d_{C_i}(x, y): how far y is from a "legal" assignment -- the (soft) constraints component

How to solve? This is an Integer Linear Program. Solving it with ILP packages gives an exact solution; search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
Page 21
Features Versus Constraints

  φ_i : X × Y → R;   C_i : X × Y → {0,1};   d : X × Y → R

In principle, constraints and features can encode the same properties; in practice, they are very different.
Features:
  Local, short-distance properties -- to allow tractable inference
  Propositional (grounded): e.g., true if "the" followed by a noun occurs in the sentence
Constraints:
  Global properties
  Quantified, first-order logic expressions: e.g., true if all y_i in the sequence y are assigned different values
Indeed, used differently
Page 22
Encoding Prior Knowledge Consider encoding the knowledge that:
Entities of type A and B cannot occur simultaneously in a sentence The “Feature” Way
Results in higher order HMM, CRF May require designing a model tailored to
knowledge/constraints Large number of new features: might require more labeled
data Wastes parameters to learn indirectly knowledge we have.
The Constraints Way Keeps the model simple; add expressive constraints
directly A small set of constraints Allows for decision time incorporation of constraints
A form of supervision
Need more training data
Page 23
Constrained Conditional Models – 1st Summary Everything that has to do with Constraints and Learning
models In both examples, we started with learning models
Either for components of the problem Classifiers for Relations and Entities
Or the whole problem Citations
We then included constraints on the output As a way to “correct” the output of the model
In both cases this allows us to Learn simpler models than we would otherwise
As presented, global constraints did not take part in training Global constraints were used only at the output.
We will later call this training paradigm L+I
Page 26
Formal Model

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  w^T f(x, y): weight vector for "local" models (a collection of classifiers; log-linear models such as HMM/CRF, or a combination)
  ρ_i: penalty for violating the constraint;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
Inference with constraints.
We start with adding constraints to existing models.
It’s a good place to start, because conceptually all you do is to add constraints to what you were doing before and the performance improves.
Page 27
Constraints and Integer Linear Programming (ILP)
ILP is powerful (NP-complete).
ILP is popular -- inference for many models, such as Viterbi for CRFs, has already been implemented as ILP.
Powerful off-the-shelf solvers exist. All we need is to write down the objective function and the constraints; there is no need to write optimization code.
Page 28
Linear Programming
Key contributors: Leonid Kantorovich, George B. Dantzig, John von Neumann.
Optimization technique with linear objective function and linear constraints.
Note that the word “Integer” is absent.
Page 29
Example (Thanks James Clarke)
Telfa Co. produces tables and chairs Each table makes 8$ profit, each chair makes 5$ profit.
We want to maximize the profit.
Page 30
Telfa Co. produces tables and chairs. Each table makes $8 profit, each chair makes $5 profit. A table requires 1 hour of labor and 9 sq. feet of wood. A chair requires 1 hour of labor and 5 sq. feet of wood. We have only 6 hours of work and 45 sq. feet of wood.
We want to maximize the profit.
Example (Thanks James Clarke)
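A minimal sketch of this problem using the open-source PuLP package (PuLP is an assumption here; the tutorial itself used solvers such as Xpress-MP). Solving the LP relaxation and the integer program side by side shows why the word "Integer" matters, which is the point of the following slides.

```python
# pip install pulp
import pulp

def telfa(integer: bool):
    cat = "Integer" if integer else "Continuous"
    tables = pulp.LpVariable("tables", lowBound=0, cat=cat)
    chairs = pulp.LpVariable("chairs", lowBound=0, cat=cat)
    prob = pulp.LpProblem("telfa", pulp.LpMaximize)
    prob += 8 * tables + 5 * chairs           # profit objective
    prob += tables + chairs <= 6              # labor: 6 hours
    prob += 9 * tables + 5 * chairs <= 45     # wood: 45 sq. feet
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(tables), pulp.value(chairs), pulp.value(prob.objective)

print("LP relaxation:", telfa(integer=False))  # fractional optimum (3.75 tables, 2.25 chairs)
print("ILP          :", telfa(integer=True))   # integer optimum is not obtained by rounding the LP
```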
Page 31
Solving LP problems.

[Figure sequence: the graphical solution of the LP.]
Page 37
Integer Linear Programming- integer solutions.
Page 38
Integer Linear Programming.
In NLP, we are dealing with discrete outputs, therefore we’re almost always interested in integer solutions.
ILP is NP-complete, but often efficient for large NLP problems.
In some cases, the solutions to LP are integral (e.g totally unimodular constraint matrix.). Next, we show an example for using ILP with
constraints. The matrix is not totally unimodular, but LP gives integral solutions.
Page 39
Page 40
Back to Example: Recognizing Entities and Relations
Dole 's wife, Elizabeth , is a native of N.C.
  E1 = "Dole", E2 = "Elizabeth", E3 = "N.C."; relations R12, R23, ...

[Figure: the entity/relation graph with classifier score distributions, as on Page 12.]
Page 41
Back to Example: Recognizing Entities and Relations
Dole 's wife, Elizabeth , is a native of N.C.
  E1 = "Dole", E2 = "Elizabeth", E3 = "N.C."; relations R12, R23, ...

[Figure: the entity/relation graph with classifier score distributions.]

NLP with ILP -- key issues:
  1) Write down the objective function.
  2) Write down the constraints as linear inequalities.
Page 43
Back to Example: cost function

[Figure: the entity/relation graph with classifier score distributions.]

Introduce a Boolean decision variable for every (variable, label) pair:
  x{E1 = per}, x{E1 = loc}, …, x{R12 = spouse_of}, x{R12 = born_in}, …, x{R12 = } , …  ∈ {0,1}

Entities: E1 = Dole, E2 = Elizabeth, E3 = N.C.; relations: R12, R21, R23, R32, R13, R31.
Page 44
Back to Example: cost function

[Figure: the entity/relation graph with classifier score distributions.]

Cost function:
  c{E1 = per}·x{E1 = per} + c{E1 = loc}·x{E1 = loc} + … + c{R12 = spouse_of}·x{R12 = spouse_of} + … + c{R12 = }·x{R12 = } + …

Entities: E1 = Dole, E2 = Elizabeth, E3 = N.C.; relations: R12, R21, R23, R32, R13, R31.
Adding Constraints
Each entity is either a person, organization, or location (or none):
  x{E1 = per} + x{E1 = loc} + x{E1 = org} + x{E1 = } = 1

(R12 = spouse_of) ⇒ (E1 = per) ∧ (E2 = per), written as linear inequalities:
  x{R12 = spouse_of} ≤ x{E1 = per}
  x{R12 = spouse_of} ≤ x{E2 = per}

We need more consistency constraints. Any Boolean constraint can be written as a set of linear inequalities, and an efficient algorithm exists [Rizzolo 2007].
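A minimal sketch of the entities-and-relations ILP in PuLP (an assumed solver; any ILP package works). The scores are illustrative numbers loosely modeled on the figure, and the constraint is the spouse_of ⇒ person implication written exactly as the inequalities above.

```python
import pulp

entity_labels = ["per", "loc", "other"]
relation_labels = ["spouse_of", "born_in", "irrelevant"]
# Illustrative classifier scores per variable and label (in a real system these are learned).
ent_scores = {"E1": {"per": 0.85, "loc": 0.10, "other": 0.05},
              "E2": {"per": 0.60, "loc": 0.30, "other": 0.10},
              "E3": {"per": 0.50, "loc": 0.45, "other": 0.05}}
rel_scores = {("E1", "E2"): {"spouse_of": 0.45, "born_in": 0.50, "irrelevant": 0.05},
              ("E2", "E3"): {"spouse_of": 0.05, "born_in": 0.85, "irrelevant": 0.10}}

prob = pulp.LpProblem("entities_relations", pulp.LpMaximize)
xe = {(e, l): pulp.LpVariable(f"x_{e}_{l}", cat="Binary") for e in ent_scores for l in entity_labels}
xr = {(r, l): pulp.LpVariable(f"x_{r[0]}{r[1]}_{l}", cat="Binary") for r in rel_scores for l in relation_labels}

# Objective: sum of the classifier scores of the chosen labels.
prob += (pulp.lpSum(ent_scores[e][l] * xe[e, l] for e in ent_scores for l in entity_labels)
         + pulp.lpSum(rel_scores[r][l] * xr[r, l] for r in rel_scores for l in relation_labels))

# Each entity / relation gets exactly one label.
for e in ent_scores:
    prob += pulp.lpSum(xe[e, l] for l in entity_labels) == 1
for r in rel_scores:
    prob += pulp.lpSum(xr[r, l] for l in relation_labels) == 1

# (R = spouse_of) => both arguments are persons.
for (a, b) in rel_scores:
    prob += xr[(a, b), "spouse_of"] <= xe[a, "per"]
    prob += xr[(a, b), "spouse_of"] <= xe[b, "per"]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (e, l), v in xe.items():
    if v.value() == 1:
        print(e, "->", l)
```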
Page 45
CCM for NE-Relations
We showed a CCM formulation for the NE-relation problem.
In this case, the cost of each variable was learned independently, using trained classifiers.
Other expressive problems can also be formulated as Integer Linear Programs -- for example, HMM/CRF inference and the Viterbi algorithm.
Page 46
TSP as an Integer Linear Program
Page 47
Dantzig et al. were the first to suggest a practical solution to the problem using ILP.
Page 48
Reduction from Traveling Salesman to ILP

[Figure: a small graph with nodes 1, 2, 3; binary edge variables x_ij with costs c_ij (e.g., x12 with cost c12, x32 with cost c32).]

  Every node has exactly ONE outgoing edge.
  Every node has exactly ONE incoming edge.
  The solutions are binary (integer).
  No subgraph contains a cycle.
Viterbi as an Integer Linear Program

[Figure sequence: an HMM with hidden states y0, y1, y2, ..., yN and observations x0, x1, x2, ..., xN is unrolled into a trellis graph. Each position t contributes M nodes y_t^1, ..., y_t^M (one per possible state), plus a source node S before position 0 and a sink node T after position N.]

  The edge from S to y_0^k has weight -log{ P(x0 | y0 = k) · P(y0 = k) }.
  The edge from y_{t-1}^j to y_t^k has weight -log{ P(x_t | y_t = k) · P(y_t = k | y_{t-1} = j) }.

Viterbi decoding is then exactly the shortest S-to-T path in this trellis.
Viterbi = shortest path (S → T), solvable with dynamic programming. We saw the TSP as an ILP; now we show the shortest-path problem as an ILP.
Shortest Path with ILP

For each edge (u, v), x_uv = 1 if (u, v) is on the shortest path and 0 otherwise; c_uv is the weight of the edge.

  Minimize Σ_{(u,v)} c_uv · x_uv, subject to:
  All nodes except S and T have an equal number of selected incoming and outgoing edges.
  The source node has one more selected outgoing edge than incoming edges.
  The target node has one more selected incoming edge than outgoing edges.
Interestingly, in this formulation the solution to the LP relaxation is already integral. Thus we have a reasonably fast LP/ILP alternative to Viterbi.
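A minimal sketch (pure Python, toy probabilities) of Viterbi phrased as the shortest-path computation over the trellis described above: every node keeps the cost of the cheapest S-path reaching it, with edge weight -log P(x_t | y_t) · P(y_t | y_{t-1}). The states and probabilities below are illustrative, not from the tutorial.

```python
import math

# Toy HMM -- in the tutorial these probabilities come from a trained model.
states = ["AUTHOR", "TITLE"]
prior  = {"AUTHOR": 0.8, "TITLE": 0.2}
trans  = {("AUTHOR", "AUTHOR"): 0.7, ("AUTHOR", "TITLE"): 0.3,
          ("TITLE", "AUTHOR"): 0.1, ("TITLE", "TITLE"): 0.9}
emit   = {("AUTHOR", "Lars"): 0.6, ("AUTHOR", "analysis"): 0.1,
          ("TITLE", "Lars"): 0.1, ("TITLE", "analysis"): 0.7}

def viterbi_shortest_path(xs):
    # cost[k] = weight of the cheapest S-path ending in state k (edge weight = -log prob)
    cost = {k: -math.log(prior[k] * emit[(k, xs[0])]) for k in states}
    back = []
    for x in xs[1:]:
        new_cost, pointers = {}, {}
        for k in states:
            j = min(states, key=lambda jj: cost[jj] - math.log(trans[(jj, k)]))
            new_cost[k] = cost[j] - math.log(trans[(j, k)] * emit[(k, x)])
            pointers[k] = j
        back.append(pointers)
        cost = new_cost
    last = min(states, key=lambda k: cost[k])   # cheapest path into the sink T
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi_shortest_path(["Lars", "analysis"]))   # -> ['AUTHOR', 'TITLE']
```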
Who cares that Viterbi is an ILP?
Assume you want to learn an HMM/CRF model (e.g., for extracting fields from citations (IE)).
But you also want to add expressive constraints:
  No field can appear twice in a citation.
  The fields <journal> and <tech-report> are mutually exclusive.
  The field <author> must appear at least once.
Do: learn the HMM/CRF; convert its inference to an LP; modify the LP constraint matrix with the extra constraints.
A concrete example of adding constraints over a CRF will be shown.
CCM Examples
Many works in NLP make use of constrained conditional models, implicitly or explicitly.
Next we describe three examples in detail:
  Example 2: Semantic Role Labeling -- the use of inference with constraints to improve semantic parsing.
  Example 3: Coreference -- combining classifiers through the objective function outperforms a pipeline.
  Example 4: Sentence Compression -- a simple language model with constraints outperforms complex models.
Page 69
Example 2: Semantic Role Labeling

Demo: http://L2R.cs.uiuc.edu/~cogcomp
Who did what to whom, when, where, why, ...

Approach:
  1) Reveals several relations.
  2) Produces a very good semantic parser (F1 ≈ 90%).
  3) Easy and fast: ~7 sentences/sec (using Xpress-MP).
Top-ranked system in the CoNLL'05 shared task; the key difference is the inference.

Simple sentence:
  I left my pearls to my daughter in my will .
  [I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
  A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location
Page 72
SRL Dataset
PropBank [Palmer et. al. 05]
Core arguments: A0-A5 and AA different semantics for each verb specified in the PropBank Frame files
13 types of adjuncts labeled as AM-arg where arg specifies the adjunct type
In this problem, all the information is given, but conceptually, we could train different components on different resources.
Algorithmic Approach
Identify argument candidates
  Pruning [Xue & Palmer, EMNLP'04]
  Argument identifier: binary classification (SNoW)
Classify argument candidates
  Argument classifier: multi-class classification (SNoW)
Inference
  Use the estimated probability distribution given by the argument classifier
  Use structural and linguistic constraints
  Infer the optimal global output

[Figure: candidate argument spans bracketed over the sentence "I left my nice pearls to her"; identifying candidates is well understood, and inference is performed over the (old and new) candidate arguments.]
Page 74
Learning
Both argument identifier and argument classifier are trained using SNoW. Sparse network of linear functions Multiclass classifier that produces a probability
distribution over output values.
Features (some examples) Voice, phrase type, head word, parse tree path
from predicate, chunk sequence, syntactic frame, …
Conjunction of features
Page 75
Inference
The output of the argument classifier often violates some constraints, especially when the sentence is long.
Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming (ILP).
Input: The scores from the argument classifier Structural and linguistic constraints
ILP allows incorporating expressive (non-sequential) constraints on the variables (the arguments types).
Page 76
Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Figure: candidate argument spans for the predicate "left", each with a score distribution over argument labels.]
Page 78
Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .
One inference problem for each verb predicate.
Page 79
Integer Linear Programming Inference
For each argument a_i:
  Set up a Boolean variable a_{i,t} indicating whether a_i is classified as type t.
Goal: maximize Σ_i Σ_t score(a_i = t) · a_{i,t}
  subject to the (linear) constraints.
If score(a_i = t) = P(a_i = t), the objective finds the assignment that maximizes the expected number of correct arguments while satisfying the constraints.
Page 80
Constraints

No duplicate argument classes:
  Σ_{a ∈ POTARG} x{a = A0} ≤ 1
R-ARG (if there is an R-ARG phrase, there is an ARG phrase):
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG} x{a = A0} ≥ x{a2 = R-A0}
C-ARG (if there is a C-ARG phrase, there is an ARG phrase before it):
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG, a before a2} x{a = A0} ≥ x{a2 = C-A0}
Many other possible constraints:
  Unique labels
  No overlapping or embedding
  Relations between the number of arguments; order constraints
  If the verb is of type A, no argument of type B
Any Boolean rule can be encoded as a set of linear inequalities.
Joint inference can also be used to combine different SRL systems.
Universally quantified rules -- LBJ allows a developer to encode constraints in FOL; these are compiled into linear inequalities automatically.
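A minimal PuLP sketch (assumed solver; illustrative scores and a tiny label set) of the SRL ILP above for a handful of candidate arguments, showing the no-duplicate and R-A0 constraints as linear inequalities.

```python
import pulp

args = ["a1", "a2", "a3"]                      # hypothetical candidate argument spans
labels = ["A0", "A1", "R-A0", "null"]
score = {a: {"A0": 0.4, "A1": 0.3, "R-A0": 0.2, "null": 0.1} for a in args}  # illustrative

prob = pulp.LpProblem("srl", pulp.LpMaximize)
x = {(a, t): pulp.LpVariable(f"x_{a}_{t}", cat="Binary") for a in args for t in labels}

prob += pulp.lpSum(score[a][t] * x[a, t] for a in args for t in labels)

for a in args:                                  # each candidate gets exactly one label
    prob += pulp.lpSum(x[a, t] for t in labels) == 1
prob += pulp.lpSum(x[a, "A0"] for a in args) <= 1           # no duplicate A0
for a2 in args:                                 # an R-A0 is only allowed if some A0 exists
    prob += pulp.lpSum(x[a, "A0"] for a in args) >= x[a2, "R-A0"]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({a: next(t for t in labels if x[a, t].value() == 1) for a in args})
```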
Page 81
Summary: Semantic Role Labeling
Demo:http://L2R.cs.uiuc.edu/~cogcomp
Who did what to whom, when, where, why,…
Example 3: co-ref (thanks Pascal Denis)
Page 82
Example 3: co-ref (thanks Pascal Denis)
Page 83
Traditional approach: train a classifier which predicts the probability that two named entities co-refer:
  Pco-ref(Clinton, NPR) = 0.2   (order sensitive!)
  Pco-ref(Lewinsky, her) = 0.6
Decision process: e.g., if Pco-ref(s1, s2) > 0.5, label the two entities as co-referent; or, clustering.
Example 3: co-ref (thanks Pascal Denis)
Page 84
Traditional approach: train a classifier which predicts the probability that two named entities co-refer:
  Pco-ref(Clinton, NPR) = 0.2
  Pco-ref(Lewinsky, her) = 0.6
Decision process: e.g., if Pco-ref(s1, s2) > 0.5, label the two entities as co-referent; or, clustering.
Evaluation: cluster all the entities based on the co-ref classifier predictions; the evaluation is based on cluster overlap.
Example 3: co-ref (thanks Pascal Denis)
Page 85
Two types of entities: "base entities" and "anaphors" (pointers).

[Figure: a document graph with mentions (e.g., NPR) and co-reference links.]
Example 3: co-ref (thanks Pascal Denis)
Page 86
Error analysis: 1) "base entities" that "point" to anaphors; 2) anaphors that don't "point" to anything.
Proposed solution.
Page 87
Pipelined approach: identify anaphors; forbid links of certain types.
Proposed solution.
Page 88
This doesn't work, despite 80% accuracy of the anaphoricity classifier.
The reason: error propagation in the pipeline.
Joint Solution with CCMs
Page 89
Joint solution using CCMs
Page 90
Joint solution using CCMs
Page 91
This performs considerably better than the pipelined approach, and actually improves (though marginally) over the baseline without the anaphoricity classifier.
Joint solution using CCMs
Page 92
Note: if we have reason to believe one of the classifiers significantly more than the other, we can scale the two terms with tuned parameters α and β.
Page 93
Example 4 - Sentence Compression (thanks James Clarke).
Page 94
Example
Example:
     0    1    2    3     4    5  6   7     8
     Big fish eat small fish in a small pond
  Compression: Big fish in a pond

[Figure: the selected word indices and the trigram decision variables used to score the compression.]
Page 95
Language model-based compression
Page 96
Example 4: Sentence Compression (thanks James Clarke)
This formulation requires some additional constraints.
  Big fish eat small fish in a small pond
No selection of the decision variables should make these trigrams appear consecutively in the output. We skip these constraints here.
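A deliberately simplified sketch (PuLP assumed; unigram word-retention scores instead of the tutorial's trigram language model) of compression as an ILP: pick a subset of words maximizing a score, with a length bound and a modifier-implies-head constraint of the kind discussed on the following slides. The scores and dependency pairs are illustrative.

```python
import pulp

words = ["Big", "fish", "eat", "small", "fish", "in", "a", "small", "pond"]
score = [0.8, 0.9, 0.3, 0.2, 0.4, 0.7, 0.6, 0.1, 0.9]       # illustrative retention scores
mod_head = [(0, 1), (3, 4), (7, 8)]                           # (modifier, head) pairs, e.g. Big -> fish

prob = pulp.LpProblem("compression", pulp.LpMaximize)
keep = [pulp.LpVariable(f"keep_{i}", cat="Binary") for i in range(len(words))]

prob += pulp.lpSum(score[i] * keep[i] for i in range(len(words)))
prob += pulp.lpSum(keep) <= 5                                 # length constraint
for m, h in mod_head:
    prob += keep[m] <= keep[h]                                # keep a modifier only with its head

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(" ".join(w for w, k in zip(words, keep) if k.value() == 1))   # -> "Big fish in a pond"
```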
Page 97
Trigram model in action
Page 98
Modifier Constraints
Page 99
Example
Page 100
Example
Page 101
Sentential Constraints
Page 102
Example
Page 103
Example
Page 104
More constraints
Other CCM Examples: Opinion Recognition
Y. Choi, E. Breck, and C. Cardie. Joint Extraction of Entities and Relations for Opinion Recognition EMNLP-2006
Semantic-parsing variation: agent = entity, relation = opinion.
Constraints:
  An agent can have at most two opinions.
  An opinion should be linked to only one agent.
  The usual non-overlap constraints.
Page 105
Other CCM Examples: Temporal Ordering
N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008.
Page 106
Other CCM Examples: Temporal Ordering
N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008.
Page 107
Three types of edges:
  1) Annotation relations (before/after)
  2) Transitive closure constraints
  3) Time normalization constraints
Related Work: Language generation.
Regina Barzilay and Mirella Lapata. Aggregation via Set Partitioning for Natural Language Generation.HLT-NAACL-2006.
Constraints:
  Transitivity: if (e_i, e_j) were aggregated, and (e_j, e_k) were too, then (e_i, e_k) get aggregated.
  Maximum number of facts aggregated; maximum sentence length.
Page 108
MT & Alignment
Ulrich Germann, Mike Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. ACL 2001.
John DeNero and Dan Klein. The Complexity of Phrase Alignment Problems. ACL-HLT-2008.
Page 109
Summary of Examples
We have shown several different NLP solutions that make use of CCMs.
The examples vary in the way the models are learned.
In all cases, constraints can be expressed in a high-level language and then transformed into linear inequalities.
Learning Based Java (LBJ) [Rizzolo & Roth '07] describes an automatic way to compile high-level descriptions of constraints into linear inequalities.
Page 110
Page 111
Solvers
All the applications presented so far used ILP for inference.
People have used different solvers: Xpress-MP, GLPK, lpsolve, R, Mosek, CPLEX.
Next we discuss other inference approaches to CCMs.
113
Where Are We ?
We hope we have already convinced you that Using constraints is a good idea for addressing NLP problems Constrained conditional models provide a good platform
We were talking about using expressive constraints To improve existing models Learning + Inference The problem: inference
A powerful inference tool: Integer Linear Programming SRL, co-ref, summarization, entity-and-relation… Easy to inject domain knowledge
114
Constrained Conditional Model: Inference

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  w^T f(x, y): weight vector for "local" models (a collection of classifiers; log-linear models such as HMM/CRF, or a combination)
  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
115
Advantages of ILP Solvers: Review
ILP is Expressive: We can solve many inference problems Converting inference problems into ILP is easy
ILP is Easy to Use: Many available packages (Open Source Packages): LPSolve, GLPK, … (Commercial Packages): XPressMP, Cplex No need to write optimization code!
Why should we consider other inference options?
116
ILP: Speed Can Be an Issue
Inference problems in NLP Sometimes large problems are actually easy for ILP
E.g. Entities-Relations Many of them are not “difficult”
When ILP isn’t fast enough, and one needs to resort to approximate solutions.
The Problem: General Solvers vs. Specific Solvers ILP is a very, very general solver But, sometimes the structure of the problem allows for
simpler inference algorithms. Next we give examples for both cases.
117
Example 1: Search based Inference for SRL
The objective function: maximize the summation of the scores subject to linguistic constraints

  max Σ_{i,j} c_{ij} x_{ij}

  x_{ij}: indicator variable that assigns the j-th class to the i-th token;  c_{ij}: classification confidence

Constraints: unique labels; no overlapping or embedding; if the verb is of type A, no argument of type B; ...
Intuition: check constraint violations on partial assignments.
118
Inference using Beam Search
At each step, discard partial assignments that violate constraints, and rank the rest according to classification confidence.

[Figure: beam search over candidate arguments -- shape: argument, color: label; beam size = 2; constraint: only one Red.]
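A minimal sketch (pure Python; the scoring and constraint functions are hypothetical) of beam search with constraint filtering as described above: extend partial assignments one variable at a time, drop the ones that violate constraints, and keep only the top-k by accumulated score.

```python
def beam_search(num_vars, labels, score, violates, beam_size=2):
    """score(i, label) -> float; violates(partial) -> bool for a partial assignment (tuple of labels)."""
    beam = [((), 0.0)]                                   # (partial assignment, accumulated score)
    for i in range(num_vars):
        candidates = []
        for partial, s in beam:
            for lab in labels:
                extended = partial + (lab,)
                if violates(extended):                   # constraint check on the partial assignment
                    continue
                candidates.append((extended, s + score(i, lab)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0] if beam else None                     # may be empty if constraints are too strict

# Example: 3 tokens, labels {"Red", "Blue"}, constraint "at most one Red".
best = beam_search(
    num_vars=3, labels=["Red", "Blue"],
    score=lambda i, lab: {"Red": 0.6, "Blue": 0.5}[lab],
    violates=lambda p: p.count("Red") > 1,
)
print(best)   # e.g. (('Red', 'Blue', 'Blue'), 1.6)
```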
119
Heuristic Inference
Problems of heuristic inference:
  Problem 1: possibly a sub-optimal solution
  Problem 2: it may not find a feasible solution -- drop some constraints and solve it again
Using search for SRL gives results comparable to ILP, but is much faster.
120
Example 2: Exploiting Structure in Inference -- Transliteration

How to get a score for a (source, target) pair?
Previous approaches: extract features for each source and target entity pair.
The CCM approach: introduce an internal structure (characters) and constrain the character mappings to "make sense".
121
Transliteration Discovery with CCM
The problem now: inference. How do we find the best mapping that satisfies the constraints?
  A weight is assigned to each edge (candidate character mapping); whether to include it is a binary decision.
  Score = sum of the included mappings' weights, s.t. the mapping satisfies the constraints.
  Assume the weights are given; more on this later.
Natural constraints: pronunciation constraints, one-to-one, non-crossing, ...
122
Finding the Best Character Mappings

An Integer Linear Programming problem:

  max Σ_{i ∈ S, j ∈ T} c_{ij} x_{ij}              (maximize the mapping score)
  x_{ij} ∈ {0, 1}
  x_{ij} = 0 for (i, j) ∉ B                        (pronunciation constraint)
  Σ_j x_{ij} ≤ 1 for every i                       (one-to-one constraint)
  x_{ij} + x_{km} ≤ 1 whenever (i, j) and (k, m) cross   (non-crossing constraint)

Is this the best inference algorithm? ...
123
Finding the Best Character Mappings

A dynamic programming algorithm: exact and fast!
  Maximize the mapping score
  Restricted (pronunciation) mapping constraints
  One-to-one constraint
  Non-crossing constraint
Take-home message: although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler.
We can decompose the inference problem into two parts.
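A minimal sketch (pure Python, illustrative weights) of such a dynamic program: because the one-to-one and non-crossing constraints make legal mappings monotone, the best mapping can be found with an edit-distance-style recurrence instead of a general ILP.

```python
def best_mapping_score(src, tgt, weight):
    """weight(s_char, t_char) -> score of mapping the two characters.
    Returns the max total score of a one-to-one, non-crossing character mapping."""
    n, m = len(src), len(tgt)
    # dp[i][j] = best score using the first i source chars and first j target chars
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],                                         # leave src[i-1] unmapped
                dp[i][j - 1],                                         # leave tgt[j-1] unmapped
                dp[i - 1][j - 1] + weight(src[i - 1], tgt[j - 1]),    # map the two characters
            )
    return dp[n][m]

# Toy weight: +1 for identical characters, 0 otherwise (a real system learns these weights).
print(best_mapping_score("london", "landan", lambda a, b: 1.0 if a == b else 0.0))
```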
124
Other Inference Options
Constraint Relaxation Strategies Try Linear Programming
[Roth and Yih, ICML 2005] Cutting plane algorithms do not use all constraints at first
Dependency Parsing: Exponential number of constraints [Riedel and Clarke, EMNLP 2006]
Other search algorithms A-star, Hill Climbing… Gibbs Sampling Inference [Finkel et. al, ACL 2005]
Named Entity Recognition: enforce long distance constraints
Can be considered as : Learning + Inference One type of constraints only
125
Inference Methods – Summary
Why ILP? A powerful way to formalize the problems However, not necessarily the best algorithmic
solution
Heuristic inference algorithms are useful sometimes! Beam search Other approaches: annealing …
Sometimes, a specific inference algorithm can be designed According to your constraints
127
Constrained Conditional Model: Training

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  w^T f(x, y): weight vector for "local" models (a collection of classifiers; log-linear models such as HMM/CRF, or a combination)
  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
128
Where are we?
Algorithmic Approach: Incorporating general constraints Showed that CCMs allow for formalizing many problems Showed several algorithmic ways to incorporate global
constraints in the decision.
Training: Coupling vs. Decoupling Training and Inference. Incorporating global constraints is important but Should it be done only at evaluation time or also at training
time? How to decompose the objective function and train in parts? Issues related to:
Modularity, efficiency and performance, availability of training data
Problem specific considerations
129
Training in the presence of Constraints
General Training Paradigm: First Term: Learning from data (could be further
decomposed) Second Term: Guiding the model by constraints Can choose if constraints’ weights trained, when
and how, or taken into account only in evaluation.
Decompose Model (SRL case) Decompose Model from constraints
130
Comparing Training Methods
Option 1: Learning + Inference (with Constraints) Ignore constraints during training
Option 2: Inference (with Constraints) Based Training Consider constraints during training
In both cases: Decision Making with Constraints
Question: Isn’t Option 2 always better?
Not so simple… Next, the “Local model story”
131
Training Methods

[Figure: a structured problem with inputs x1, ..., x7 (X), outputs y1, ..., y5 (Y), and local models f1(x), ..., f5(x).]

Learning + Inference (L+I): learn the models independently.
Inference Based Training (IBT): learn all models together!
Intuition: learning with constraints may make learning more difficult.
132
Training with Constraints -- Example: Perceptron-based Global Learning

[Figure: the same structured problem, with local models f1(x), ..., f5(x).]

  Y   (true global labeling):          -1  1  1 -1 -1
  Y'  (local predictions):             -1  1  1  1  1
  Y'  (after applying constraints):    -1  1  1  1 -1

Which one is better? When and why?
133
Claims [Punyakanok et al., IJCAI 2005]

The theory applies to the case of local models (no Y in the features).
When the local models are "easy" to learn, L+I outperforms IBT. In many applications the components are identifiable and easy to learn (e.g., argument, open-close, PER).
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a larger number of training examples.
Other training paradigms are possible:
  Pipeline-like sequential models [Roth, Small, Titov: AI&Stats'09]: identify a preferred ordering among components and learn the k-th model jointly with the previously learned models.
L+I: cheaper computationally; modular. IBT is better in the limit, and in other extreme cases.
134
Bound Prediction

  Local  ≤  opt + ( (d log m + log 1/δ) / m )^{1/2}
  Global ≤  0 + ( (cd log m + c²d + log 1/δ) / m )^{1/2}

[Figure: bounds on simulated data for opt = 0, 0.1, 0.2; opt is an indication of the hardness of the problem.]

L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
135
Relative Merits: SRL

[Figure: performance as a function of the difficulty of the learning problem (# features), from easy to hard.]

When the problem is easy, L+I is better. When the problem is artificially made harder, the tradeoff is clearer.
In some cases problems are hard due to lack of training data -- this motivates semi-supervised learning.
136
L+I & IBT: General View -- Structured Perceptron

  For each iteration:
    For each (X, Y_GOLD) in the training data:
      Y_PRED = inference(X; λ)
      If Y_PRED != Y_GOLD:
        λ = λ + F(X, Y_GOLD) - F(X, Y_PRED)

The theory applies when F(x, y) = F(x).
The difference between L+I and IBT is whether the inference used to compute Y_PRED applies the constraints during training.
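A minimal sketch (pure Python; `features` and `inference` are assumed to be supplied) of this update rule. Passing an inference routine that enforces the constraints gives IBT; training with unconstrained inference and applying the constraints only at test time gives L+I.

```python
def perceptron_train(data, features, inference, num_features, iterations=10):
    """data: list of (x, y_gold); features(x, y) -> list of floats (length num_features);
    inference(x, weights) -> predicted structure y (with or without constraints)."""
    weights = [0.0] * num_features
    for _ in range(iterations):
        for x, y_gold in data:
            y_pred = inference(x, weights)
            if y_pred != y_gold:
                f_gold, f_pred = features(x, y_gold), features(x, y_pred)
                # additive update: move weights toward the gold structure, away from the prediction
                weights = [w + fg - fp for w, fg, fp in zip(weights, f_gold, f_pred)]
    return weights
```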
137
Comparing Training Methods (Cont.)
Local Models (train independently) vs. Structured Models In many cases, structured models might be better due to expressivity
But, what if we use constraints? Local Models+Constraints vs. Structured Models
+Constraints Hard to tell: Constraints are expressive For tractability reasons, structured models have less expressivity than
the use of constraints. Local can be better, because local models are easier to learn
Decompose Model (SRL case) Decompose Model from constraints
138
Example: Semantic Role Labeling Revisited

[Figure: a lattice from s to t, with candidate labels A, B, C at each of five positions.]

Sequential models (Conditional Random Field; global perceptron):
  Training: sentence based. Testing: find the shortest path, with constraints.
Local models (logistic regression; local averaged perceptron):
  Training: token based. Testing: find the best assignment locally, with constraints.
139
Which Model is Better? Semantic Role Labeling
Experiments on SRL [Roth and Yih, ICML 2005]: inject constraints into conditional random field models.

  Model           CRF     CRF-D   CRF-IBT   Avg. P
  Baseline        66.46   69.14   58.15
  + Constraints   71.94   73.91   69.82     74.49
  Training Time   48      38      145       0.8

Sequential models (CRF columns) vs. local models (Avg. P); L+I vs. IBT.
Without constraints, sequential models are better than local models.
With constraints, local models are now better than sequential models!
140
Summary: Training Methods
Many choices for training a CCM Learning + Inference (Training without constraints) Inference based Learning (Training with constraints)
Based on this, what kind of models you want to use?
Advantages of L+I Require fewer training examples More efficient; most of the time, better performance Modularity; easier to incorporate already learned models.
141
Constrained Conditional Model: Soft Constraints

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)     (subject to constraints)

  (Soft) constraints component: ρ_i is the constraint violation penalty; d_{C_i}(x, y) measures how far y is from a "legal" assignment.

(1) Why use soft constraints?
(2) How to model the "degree of violation"?
(3) How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
(4) How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
142
(1) Why Are Soft Constraints Important?
Some constraints may be violated by the data.
Even when the gold data violates no constraints, the model may prefer illegal solutions. We want a way to rank solutions based on the
level of constraints’ violation. Important when beam search is used
Working with soft constraints Need to define the degree of violation Need to assign penalties for constraints
143
Example: Information extraction
Prediction result of a trained HMM:
Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]
Violates lots of natural constraints!
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
144
Examples of Constraints
Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words pp., pages correspond to PAGE. Four digits starting with 20xx and 19xx are DATE. Quotations can appear only in TITLE …….
145
(2) Modeling the Constraints' Degree of Violation

Constraint: state transitions must occur on punctuation marks.
  Φ_c(y_i) = 1 if assigning y_i to x_i violates the constraint C with respect to the assignment (x_1,...,x_{i-1}; y_1,...,y_{i-1}), and 0 otherwise.
Count how many times the constraint is violated:
  d(y, 1_C(X)) = Σ_i Φ_c(y_i)

  Lars   Ole    Andersen   .
  AUTH   AUTH   EDITOR     EDITOR      Φ_c: 0, 0, 1, 0   →  Σ Φ_c(y_i) = 1
  AUTH   BOOK   EDITOR     EDITOR      Φ_c: 0, 1, 1, 0   →  Σ Φ_c(y_i) = 2
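A minimal sketch (pure Python) of counting violations of the "state transitions must occur on punctuation marks" constraint, i.e., computing d(y, 1_C(x)) as defined above.

```python
PUNCT = {".", ",", ";", ":"}

def violation_count(tokens, labels):
    """d(y, 1_C(x)): positions where the label changes but the previous token is not punctuation."""
    return sum(
        1
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1] and tokens[i - 1] not in PUNCT
    )

tokens = ["Lars", "Ole", "Andersen", "."]
print(violation_count(tokens, ["AUTH", "AUTH", "EDITOR", "EDITOR"]))  # 1
print(violation_count(tokens, ["AUTH", "BOOK", "EDITOR", "EDITOR"]))  # 2
```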
146
Example: Degree of Violation

It might be the case that all "good" assignments violate some constraints; with hard constraints we lose the ability to judge which assignment is better.
A better option: choose the one with fewer violations!

  Lars   Ole    Andersen   .
  AUTH   BOOK   EDITOR     EDITOR      Φ_c: 0, 1, 1, 0
  AUTH   AUTH   EDITOR     EDITOR      Φ_c: 0, 0, 1, 0

The assignment AUTH AUTH EDITOR EDITOR is better because of d(y, 1_C(X))!
147
Hard Constraints vs. Weighted Constraints
  Hard constraints: the constraints are close to perfect.
  Weighted constraints: the labeled data might not follow the constraints.
148
(3) Constrained Conditional Model with Soft Constraints

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
Inference with Beam Search (& Soft Constraints)

[Figure: beam search over the citation "L. Shari et. al. 1998. Building semantic Concordances. In C. Fellbaum, ed. WordNet: and its applications."]

Run beam search inference as before; rather than eliminating illegal assignments, re-rank them. (ILP is another option!)
149
Assume that We Have Made a Mistake...

[Figure sequence: candidate labelings of the citation "L. Shari et. al. 1998. Building semantic Concordances. In C. Fellbaum, ed. WordNet: and its applications.", starting from the best assignment according to the learned models.]

The top choice from the learned models is not good: ΣΦ_c1(y_i) = 1; ΣΦ_c2(y_i) = 1 + 1.
The second choice from the learned models is not good either: it violates "each field must be a consecutive list of words and can appear at most once in a citation" and "state transitions must occur on punctuation marks" (two violations; ΣΦ_c2(y_i) = 1 + 1).
Use constraints to find the best assignment: an alternative labeling has only one violation of "state transitions must occur on punctuation marks" (ΣΦ_c2(y_i) = 1).
Soft constraints help us find a better assignment.
We can do this because we use soft constraints. If we used hard constraints, the score of every assignment given this prefix would be negative infinity.
155
(4) Constrained Conditional Model with Soft Constraints

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
156
Compare Training Methods (with Soft Constraints)
Need to figure out the penalty as well…
Option 1: Learning + Inference (with Constraints) Learn the weights and penalties separately
Option 2: Inference (with Constraints) Based Training Learn the weights and penalties together
The tradeoff between L+I and IBT is similar to what we have seen earlier.
157
Inference Based Training with Soft Constraints -- example: Perceptron; update the penalties as well!

  For each iteration:
    For each (X, Y_GOLD) in the training data:
      Y_PRED = inference(X; λ, ρ)
      If Y_PRED != Y_GOLD:
        λ = λ + F(X, Y_GOLD) - F(X, Y_PRED)
        ρ_I = ρ_I + d(Y_GOLD, 1_{C_I}(X)) - d(Y_PRED, 1_{C_I}(X)),  for I = 1, ...
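A minimal sketch (pure Python) of the extra step for the constraint penalties; it extends the structured perceptron sketch shown earlier, with `violation_counts(x, y)` assumed to return the vector of d(y, 1_{C_I}(x)) values, and it mirrors the update rule on the slide.

```python
def ccm_perceptron_update(weights, penalties, x, y_gold, y_pred, features, violation_counts):
    """One update step: adjust both the model weights and the constraint penalties."""
    if y_pred != y_gold:
        f_gold, f_pred = features(x, y_gold), features(x, y_pred)
        weights = [w + fg - fp for w, fg, fp in zip(weights, f_gold, f_pred)]
        d_gold, d_pred = violation_counts(x, y_gold), violation_counts(x, y_pred)
        penalties = [p + dg - dp for p, dg, dp in zip(penalties, d_gold, d_pred)]
    return weights, penalties
```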
158
L+I v.s IBT for Soft Constraints
Test on citation recognition:
  L+I: HMM + weighted constraints
  IBT: Perceptron + weighted constraints
  Same feature set
With constraints: the factored model (L+I) is better, and more significantly so with a small number of examples.
Without constraints: with few labeled examples, HMM > Perceptron; with many labeled examples, Perceptron > HMM.
This agrees with earlier results in the supervised setting [ICML'05, IJCAI'05].
Summary – Soft Constraints
Using soft constraints is important sometimes Some constraints might be violated in the gold data We want to have the notion of degree of violation
Degree of violation One approximation: count how many time the
constraints was violated
How to solve? Beam search, ILP, …
How to learn? L+I v.s. IBT
159
160
Constrained Conditional Model: Injecting Knowledge

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  w^T f(x, y): weight vector for "local" models (a collection of classifiers; log-linear models such as HMM/CRF, or a combination)
  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?

Inject Prior Knowledge via Constraints
  a.k.a. semi/unsupervised learning with Constrained Conditional Models
161
Constraints As a Way To Encode Prior Knowledge
Consider encoding the knowledge that: Entities of type A and B cannot occur simultaneously in a
sentence The “Feature” Way
Requires larger models
The Constraints Way Keeps the model simple; add expressive
constraints directly A small set of constraints Allows for decision time incorporation of
constraints
A effective way to inject knowledge
Need more training data
We can use constraints as a way to replace training data
162162
Constraint-Driven Semi/Unsupervised Learning (CoDL)

Use constraints to generate better training samples in semi/unsupervised learning.

[Figure: the standard self-training loop -- seed examples train a model, the model labels unlabeled data (prediction), and the model learns from the newly labeled data (feedback). With CoDL, prediction uses the model plus constraints, giving better feedback and a better model; a resource can also be used.]

In traditional semi/unsupervised learning, models can drift away from the correct model.
163
Semi-Supervised Learning with Soft Constraints (L+I)

Learning the model weights -- example: HMM.
Constraint penalties:
  Hard constraints: infinity
  Weighted constraints: ρ_i = -log P{constraint C_i is violated in the training data}
Only 10 constraints. Learn the weights and penalties separately.
164
Semi-Supervised Learning with Constraints  [Chang, Ratinov, Roth, ACL'07]

  λ = learn(T)                                  // supervised learning algorithm parameterized by λ
  For N iterations do:
    T' = {}
    For each x in the unlabeled dataset:
      {y1, ..., yK} = InferenceWithConstraints(x, C, λ)
      T' = T' ∪ {(x, yi)}, i = 1...K
    λ = γ·λ + (1 - γ)·learn(T')                 // weigh the supervised and unsupervised models

  Inference-based augmentation of the training set (feedback), via inference with constraints; then learn from the new training data.
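A minimal sketch (pure Python; `learn`, `inference_with_constraints`, and the datasets are assumed to be supplied) of the CoDL loop above, under the assumption that the supervised seed model is interpolated with the model retrained on constrained self-labeled data.

```python
def codl(labeled, unlabeled, learn, inference_with_constraints, constraints,
         gamma=0.9, iterations=5, top_k=2):
    """learn(dataset) -> model parameters (a list of floats);
    inference_with_constraints(x, constraints, params) -> list of top-K label assignments."""
    params0 = learn(labeled)                    # supervised seed model
    params = params0
    for _ in range(iterations):
        self_labeled = []
        for x in unlabeled:
            for y in inference_with_constraints(x, constraints, params)[:top_k]:
                self_labeled.append((x, y))     # constraints make these "better training samples"
        retrained = learn(self_labeled)
        # interpolate the seed model with the newly learned model (assumed CoDL-style smoothing)
        params = [gamma * p0 + (1 - gamma) * pr for p0, pr in zip(params0, retrained)]
    return params
```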
165
Objective function: the Value of Constraints in Semi-Supervised Learning

[Figure: performance vs. number of available labeled examples (up to 300). Learning with 10 constraints: constraints are used to bootstrap a semi-supervised learner -- a poor model + constraints is used to annotate unlabeled data, which in turn is used to keep training the model. For comparison: learning without constraints with 300 examples.]
166
Unsupervised Constraint Driven Learning
Semi-supervised Constraint Driven Learning In [Chang et.al. ACL 2007], they use a small labeled
dataset Reason: We need a good starting point!
Unsupervised Constraint Driven Learning Sometimes, good resources exist
We can use the resource to initialize the model We do not need labeled instances at all!
Application: transliteration discovery
167
Why Adding Constraints?
Before talking about transliteration discovery, let's think again about why we add constraints. Reason: we want to capture the dependencies between different outputs.
A new question: what happens if you are trying to solve a single-output classification problem -- can you still inject prior knowledge?
168
Why Add Constraints?

[Figure: a structured output problem -- inputs x1, ..., x7 (X) and outputs y1, ..., y5 (Y), with dependencies between the different outputs.]
169
Why Add Constraints?

[Figure: a single-output problem -- inputs x1, ..., x7 (X) and only one output y1 (Y).]
170
Adding Constraints Through Hidden Variables
[Figure: a single-output problem with hidden variables f1, ..., f5 between the input X and the output y1.]
171
Example: Transliteration Discovery

Hidden variables: character mappings.
The character mappings are not our final goal, but they provide an intermediate representation that satisfies natural constraints.
The score for each pair depends on the character mappings.
172
Transliteration Discovery with Hidden Variables
For each source NE, find the best target candidate.
For each NE pair, the score is equal to the score of the best hidden set of character mappings (edges). Add constraints when predicting the hidden variables.

  g(H, N_s, N_t) = F(H, N_s, N_t) − Σ_{c ∈ C} ρ_c d(H, 1_c)
  score(N_s, N_t) = max_H g(H, N_s, N_t) / |N_t|

A CCM formulation -- use constraints to bias the prediction! Alternative view: finding the best feature representation.
173
Natural Constraints

A dynamic programming algorithm: exact and fast!
  Maximize the mapping score
  Pronunciation constraints
  One-to-one constraint
  Non-crossing constraint
Take-home message: although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler.
We can decompose the problem into two parts.
174
Algorithm: High-Level View

Initialize the model with a resource (a Romanization table). Until convergence:
  Use the model + constraints to get assignments for both the hidden variables (F) and the labels (Y).
  Update the model with the newly labeled F and Y:
    D = D ∪ { (F(N_s, N_t | W), y) } over the newly labeled pairs
    W = train(D)
  Get feedback from both the hidden variables and the labels.
175
Results -- Russian (higher is better)

[Figure: comparison of "no constraints + 20 labeled examples" vs. unsupervised constraint-driven learning (no labeled data, no temporal information, no NER on Russian).]
176
Results -- Hebrew

[Figure: comparison of "no constraints + 250 labeled examples" vs. unsupervised constraint-driven learning (no labeled data, no temporal information).]
Summary
Adding constraints is an effective way to inject knowledge; constraints can correct our predictions on unlabeled data.
Injecting constraints is sometimes more effective than annotating data: 300 labeled examples vs. 20 labeled examples + constraints.
For single-output problems, use hidden variables to capture the constraints.
Other approaches are possible [Daume, EMNLP 2008]: only get feedback from the examples where the constraints are satisfied.
177
179
Conclusion Constrained Conditional Models combine
Learning conditional models with using declarative expressive constraints
Within a constrained optimization framework
Our goal was to: Introduce a clean way of incorporating constraints to
bias and improve decisions of learned models A clean way to use (declarative) prior knowledge to
guide semi-supervised learning Provide examples for the diverse usage CCMs have
already found in NLP Significant success on several NLP and IE tasks (often,
with ILP)
180
Technical Conclusions Presented and discussed inference algorithms
How to use constraints to make global decisions The formulation is an Integer Linear Programming formulation,
but algorithmic solutions can employ a variety of algorithms Present and discussed learning models to be used along with
constraints Training protocol matters We distinguished two extreme cases – training with/without
constraints. Issues include performance, as well as modularity of the
solution and the ability to use previously learned models. We did not attend to the question of “how to find constraints”
In order to emphasize the idea that we think background knowledge is important, it exists, and to focus on using it.
But, we talked about learning weights for constraints, and it’s clearly possible to learn constraints.
181
Summary: Constrained Conditional Models

  y* = argmax_y  Σ_i w_i φ_i(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  Left term: a linear objective function; typically φ(x, y) will be local functions, or φ(x, y) = φ(x).  (Conditional Markov Random Field)
  Right term: expressive constraints over the output variables -- soft, weighted constraints, specified declaratively as FOL formulae.  (Constraints network)

[Figure: a conditional Markov random field over y1, ..., y8 next to a constraints network over the same variables.]

Clearly, there is a joint probability distribution that represents this mixed model. We would like to: learn a simple model (or several simple models), and make decisions with respect to a complex model.
Key difference from MLNs, which provide a concise definition of a model, but of the whole joint one.
182
Designing CCMs

  Decide what variables are of interest -- learn model(s).
  Think about constraints among the variables of interest.
  Design your objective function:

    y* = argmax_y  Σ_i w_i φ_i(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  Linear objective functions; typically φ(x, y) will be local functions, or φ(x, y) = φ(x); expressive, soft, weighted constraints over the output variables, specified declaratively as FOL formulae.

LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp
A modeling language for Constrained Conditional Models. Supports programming along with building learned models, high-level specification of constraints, and inference with constraints.
183
Questions? Thank you!