Constrained Conditional Models for Natural Language Processing
Page 1
March 2009
EACL
Constrained Conditional Models for
Natural Language Processing
Ming-Wei Chang, Lev Ratinov, Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign
Page 2
Nice to Meet You
Informally: Everything that has to do with constraints (and learning
models)
Formally: We typically make decisions based on models such as:
With CCMs we make decisions based on models such as:
We do not define the learning method but we’ll discuss it and make suggestions
CCMs make predictions in the presence/guided by constraints
Page 3
Constrained Conditional Models (CCMs)

Standard linear model:
  y* = argmax_y  w^T f(x, y)

Constrained Conditional Model:
  y* = argmax_y  w^T f(x, y)  −  Σ_{c ∈ C} ρ_c d(y, 1_{C(x)})
Constraint-Driven Learning and Decision Making: Why Constraints?

The goal: building good NLP systems easily. We have prior knowledge at hand; how can we use it? We suggest that knowledge can often be injected directly:
  Can use it to guide learning
  Can use it to improve decision making
  Can use it to simplify the models we need to learn
How useful are constraints?
  Useful for supervised learning
  Useful for semi-supervised learning
  Sometimes more efficient than labeling data directly
Page 5
Inference
Page 6
Comprehension
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an Inference Problem
Page 7
This Tutorial: Constrained Conditional Models
Part 1: Introduction to CCMs
  Examples: NE + Relations; Information Extraction (correcting models with CCMs)
  First summary: why CCMs are important
  Problem setting
  Features and constraints; some hints about training issues
Part 2: Introduction to Integer Linear Programming
  What is ILP; use of ILP in NLP
Part 3: Detailed examples of using CCMs
  Semantic Role Labeling in detail
  Coreference Resolution
  Sentence Compression

BREAK
Page 8
This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference
  Other inference algorithms: Search (SRL); Dynamic Programming (Transliteration); Cutting Planes
  Using hard and soft constraints
Part 5: Training issues when working with CCMs
  Formalism (again)
  Choices of training paradigms -- tradeoffs
  Examples in supervised learning
  Examples in semi-supervised learning
Part 6: Conclusion
  Building CCMs: features and constraints; objective functions; different learners
  Mixed models vs. joint models; where is the knowledge coming from?
THE END
Page 9
Learning and Inference Global decisions in which several local decisions play a
role but there are mutual dependencies on their outcome. E.g. Structured Output Problems – multiple dependent output
variables
(Learned) models/classifiers for different sub-problems In some cases, not all local models can be learned simultaneously Key examples in NLP are Textual Entailment and QA In these cases, constraints may appear only at evaluation time
Incorporate models’ information, along with prior knowledge/constraints, in making coherent decisions decisions that respect the local models as well as domain &
context specific knowledge/constraints.
Page 12
Inference with General Constraint Structure [Roth & Yih'04]: Recognizing Entities and Relations

Dole 's wife, Elizabeth , is a native of N.C.
  E1 = "Dole", E2 = "Elizabeth", E3 = "N.C."; R12, R23, ... are the pairwise relations.

[Figure: the entity and relation nodes, each annotated with classifier score distributions -- entities over {per, loc, other}, relations over {spouse_of, born_in, irrelevant}.]
Improvement over no inference: 2-5%
Some Questions: How to guide the global inference? Why not learn Jointly?
Models could be learned separately; constraints may come up only at decision time.
Note: Non Sequential Model
Page 13
Tasks of Interest: Structured Output

For each instance, assign values to a set of variables; the output variables depend on each other.
Common tasks in natural language processing:
  Parsing; semantic parsing; summarization; transliteration; co-reference resolution, ...
Information extraction:
  Entities, relations, ...
Many pure machine learning approaches exist:
  Hidden Markov Models (HMMs); CRFs; structured Perceptrons and SVMs ...
However, ...
Page 14
Information Extraction via Hidden Markov Models
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
Prediction result of a trained HMM:
Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]
Unsatisfactory results !
Motivation II
Page 15
Strategies for Improving the Results
(Pure) Machine Learning Approaches Higher Order HMM/CRF? Increasing the window size? Adding a lot of new features
Requires a lot of labeled examples
What if we only have a few labeled examples?
Any other options? Humans can immediately tell bad outputs The output does not make sense
Increasing the model complexity
Can we keep the learned model simple and still make expressive decisions?
Page 16
Information extraction without Prior Knowledge
Prediction result of a trained HMM:
Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]
Violates lots of natural constraints!
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
Page 17
Examples of Constraints
Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words pp., pages correspond to PAGE. Four digits starting with 20xx and 19xx are DATE. Quotations can appear only in TITLE ……. Easy to express pieces of “knowledge”
Non Propositional; May use Quantifiers
Page 18
Adding constraints, we get correct results! Without changing the model
[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May 1994 .
Information Extraction with Constraints
Page 19
Problem Setting

Random variables Y: y1, ..., y8
Conditional distributions P (learned by models/classifiers)
Constraints C -- any Boolean function defined over partial assignments (possibly with weights W)
Goal: find the "best" assignment -- the assignment that achieves the highest global performance.
This is an Integer Programming problem.

[Figure: variables y1 ... y8 over the observations, with constraints C(y1, y4) and C(y2, y3, y6, y7, y8).]

Y* = argmax_Y P(Y)  subject to constraints C  (+ weights W_C)
Page 20
Formal Model

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)     (subject to constraints)

  w^T f(x, y): weight vector for "local" models -- a collection of classifiers; log-linear models (HMM, CRF), or a combination
  ρ_i: penalty for violating the constraint
  d_{C_i}(x, y): how far y is from a "legal" assignment -- the (soft) constraints component

How to solve? This is an Integer Linear Program. Solving it with ILP packages gives an exact solution; search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
Page 21
Features Versus Constraints

  φ_i : X × Y → R;   C_i : X × Y → {0,1};   d : X × Y → R

In principle, constraints and features can encode the same properties; in practice, they are very different.
Features:
  Local, short-distance properties -- to allow tractable inference
  Propositional (grounded): e.g., true if "the" followed by a noun occurs in the sentence
Constraints:
  Global properties
  Quantified, first-order logic expressions: e.g., true if all y_i in the sequence y are assigned different values
Indeed, used differently
Page 22
Encoding Prior Knowledge Consider encoding the knowledge that:
Entities of type A and B cannot occur simultaneously in a sentence The “Feature” Way
Results in higher order HMM, CRF May require designing a model tailored to
knowledge/constraints Large number of new features: might require more labeled
data Wastes parameters to learn indirectly knowledge we have.
The Constraints Way Keeps the model simple; add expressive constraints
directly A small set of constraints Allows for decision time incorporation of constraints
A form of supervision
Need more training data
Page 23
Constrained Conditional Models – 1st Summary Everything that has to do with Constraints and Learning
models In both examples, we started with learning models
Either for components of the problem Classifiers for Relations and Entities
Or the whole problem Citations
We then included constraints on the output As a way to “correct” the output of the model
In both cases this allows us to Learn simpler models than we would otherwise
As presented, global constraints did not take part in training Global constraints were used only at the output.
We will later call this training paradigm L+I
Page 26
Formal Model

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  w^T f(x, y): weight vector for "local" models (a collection of classifiers; log-linear models such as HMM/CRF, or a combination)
  ρ_i: penalty for violating the constraint;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
Inference with constraints.
We start with adding constraints to existing models.
It’s a good place to start, because conceptually all you do is to add constraints to what you were doing before and the performance improves.
Page 27
Constraints and Integer Linear Programming (ILP)
ILP is powerful (NP-complete).
ILP is popular -- inference for many models, such as Viterbi for CRFs, has already been implemented as ILP.
Powerful off-the-shelf solvers exist. All we need is to write down the objective function and the constraints; there is no need to write optimization code.
Page 28
Linear Programming
Key contributors: Leonid Kantorovich, George B. Dantzig, John von Neumann.
Optimization technique with linear objective function and linear constraints.
Note that the word “Integer” is absent.
Page 29
Example (Thanks James Clarke)
Telfa Co. produces tables and chairs Each table makes 8$ profit, each chair makes 5$ profit.
We want to maximize the profit.
Page 30
Telfa Co. produces tables and chairs. Each table makes $8 profit, each chair makes $5 profit. A table requires 1 hour of labor and 9 sq. feet of wood. A chair requires 1 hour of labor and 5 sq. feet of wood. We have only 6 hours of work and 45 sq. feet of wood.
We want to maximize the profit.
Example (Thanks James Clarke)
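A minimal sketch of this problem using the open-source PuLP package (PuLP is an assumption here; the tutorial itself used solvers such as Xpress-MP). Solving the LP relaxation and the integer program side by side shows why the word "Integer" matters, which is the point of the following slides.

```python
# pip install pulp
import pulp

def telfa(integer: bool):
    cat = "Integer" if integer else "Continuous"
    tables = pulp.LpVariable("tables", lowBound=0, cat=cat)
    chairs = pulp.LpVariable("chairs", lowBound=0, cat=cat)
    prob = pulp.LpProblem("telfa", pulp.LpMaximize)
    prob += 8 * tables + 5 * chairs           # profit objective
    prob += tables + chairs <= 6              # labor: 6 hours
    prob += 9 * tables + 5 * chairs <= 45     # wood: 45 sq. feet
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(tables), pulp.value(chairs), pulp.value(prob.objective)

print("LP relaxation:", telfa(integer=False))  # fractional optimum (3.75 tables, 2.25 chairs)
print("ILP          :", telfa(integer=True))   # integer optimum is not obtained by rounding the LP
```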
Page 31
Solving LP problems.

[Figure sequence: the graphical solution of the LP.]
Page 37
Integer Linear Programming- integer solutions.
Page 38
Integer Linear Programming.
In NLP, we are dealing with discrete outputs, therefore we’re almost always interested in integer solutions.
ILP is NP-complete, but often efficient for large NLP problems.
In some cases, the solutions to LP are integral (e.g totally unimodular constraint matrix.). Next, we show an example for using ILP with
constraints. The matrix is not totally unimodular, but LP gives integral solutions.
Page 39
Page 40
Back to Example: Recognizing Entities and Relations
Dole 's wife, Elizabeth , is a native of N.C.
  E1 = "Dole", E2 = "Elizabeth", E3 = "N.C."; relations R12, R23, ...

[Figure: the entity/relation graph with classifier score distributions, as on Page 12.]
Page 41
Back to Example: Recognizing Entities and Relations
Dole 's wife, Elizabeth , is a native of N.C.
  E1 = "Dole", E2 = "Elizabeth", E3 = "N.C."; relations R12, R23, ...

[Figure: the entity/relation graph with classifier score distributions.]

NLP with ILP -- key issues:
  1) Write down the objective function.
  2) Write down the constraints as linear inequalities.
Page 43
Back to Example: cost function

[Figure: the entity/relation graph with classifier score distributions.]

Introduce a Boolean decision variable for every (variable, label) pair:
  x{E1 = per}, x{E1 = loc}, …, x{R12 = spouse_of}, x{R12 = born_in}, …, x{R12 = } , …  ∈ {0,1}

Entities: E1 = Dole, E2 = Elizabeth, E3 = N.C.; relations: R12, R21, R23, R32, R13, R31.
Page 44
Back to Example: cost function

[Figure: the entity/relation graph with classifier score distributions.]

Cost function:
  c{E1 = per}·x{E1 = per} + c{E1 = loc}·x{E1 = loc} + … + c{R12 = spouse_of}·x{R12 = spouse_of} + … + c{R12 = }·x{R12 = } + …

Entities: E1 = Dole, E2 = Elizabeth, E3 = N.C.; relations: R12, R21, R23, R32, R13, R31.
Adding Constraints
Each entity is either a person, organization, or location (or none):
  x{E1 = per} + x{E1 = loc} + x{E1 = org} + x{E1 = } = 1

(R12 = spouse_of) ⇒ (E1 = per) ∧ (E2 = per), written as linear inequalities:
  x{R12 = spouse_of} ≤ x{E1 = per}
  x{R12 = spouse_of} ≤ x{E2 = per}

We need more consistency constraints. Any Boolean constraint can be written as a set of linear inequalities, and an efficient algorithm exists [Rizzolo 2007].
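A minimal sketch of the entities-and-relations ILP in PuLP (an assumed solver; any ILP package works). The scores are illustrative numbers loosely modeled on the figure, and the constraint is the spouse_of ⇒ person implication written exactly as the inequalities above.

```python
import pulp

entity_labels = ["per", "loc", "other"]
relation_labels = ["spouse_of", "born_in", "irrelevant"]
# Illustrative classifier scores per variable and label (in a real system these are learned).
ent_scores = {"E1": {"per": 0.85, "loc": 0.10, "other": 0.05},
              "E2": {"per": 0.60, "loc": 0.30, "other": 0.10},
              "E3": {"per": 0.50, "loc": 0.45, "other": 0.05}}
rel_scores = {("E1", "E2"): {"spouse_of": 0.45, "born_in": 0.50, "irrelevant": 0.05},
              ("E2", "E3"): {"spouse_of": 0.05, "born_in": 0.85, "irrelevant": 0.10}}

prob = pulp.LpProblem("entities_relations", pulp.LpMaximize)
xe = {(e, l): pulp.LpVariable(f"x_{e}_{l}", cat="Binary") for e in ent_scores for l in entity_labels}
xr = {(r, l): pulp.LpVariable(f"x_{r[0]}{r[1]}_{l}", cat="Binary") for r in rel_scores for l in relation_labels}

# Objective: sum of the classifier scores of the chosen labels.
prob += (pulp.lpSum(ent_scores[e][l] * xe[e, l] for e in ent_scores for l in entity_labels)
         + pulp.lpSum(rel_scores[r][l] * xr[r, l] for r in rel_scores for l in relation_labels))

# Each entity / relation gets exactly one label.
for e in ent_scores:
    prob += pulp.lpSum(xe[e, l] for l in entity_labels) == 1
for r in rel_scores:
    prob += pulp.lpSum(xr[r, l] for l in relation_labels) == 1

# (R = spouse_of) => both arguments are persons.
for (a, b) in rel_scores:
    prob += xr[(a, b), "spouse_of"] <= xe[a, "per"]
    prob += xr[(a, b), "spouse_of"] <= xe[b, "per"]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (e, l), v in xe.items():
    if v.value() == 1:
        print(e, "->", l)
```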
Page 45
CCM for NE-Relations
We showed a CCM formulation for the NE-relation problem.
In this case, the cost of each variable was learned independently, using trained classifiers.
Other expressive problems can also be formulated as Integer Linear Programs -- for example, HMM/CRF inference and the Viterbi algorithm.
Page 46
TSP as an Integer Linear Program
Page 47
Dantzig et al. were the first to suggest a practical solution to the problem using ILP.
Page 48
Reduction from Traveling Salesman to ILP

[Figure: a small graph with nodes 1, 2, 3; binary edge variables x_ij with costs c_ij (e.g., x12 with cost c12, x32 with cost c32).]

  Every node has exactly ONE outgoing edge.
  Every node has exactly ONE incoming edge.
  The solutions are binary (integer).
  No subgraph contains a cycle.
Viterbi as an Integer Linear Program

[Figure sequence: an HMM with hidden states y0, y1, y2, ..., yN and observations x0, x1, x2, ..., xN is unrolled into a trellis graph. Each position t contributes M nodes y_t^1, ..., y_t^M (one per possible state), plus a source node S before position 0 and a sink node T after position N.]

  The edge from S to y_0^k has weight -log{ P(x0 | y0 = k) · P(y0 = k) }.
  The edge from y_{t-1}^j to y_t^k has weight -log{ P(x_t | y_t = k) · P(y_t = k | y_{t-1} = j) }.

Viterbi decoding is then exactly the shortest S-to-T path in this trellis.
Viterbi = shortest path (S → T), solvable with dynamic programming. We saw the TSP as an ILP; now we show the shortest-path problem as an ILP.
Shortest Path with ILP

For each edge (u, v), x_uv = 1 if (u, v) is on the shortest path and 0 otherwise; c_uv is the weight of the edge.

  Minimize Σ_{(u,v)} c_uv · x_uv, subject to:
  All nodes except S and T have an equal number of selected incoming and outgoing edges.
  The source node has one more selected outgoing edge than incoming edges.
  The target node has one more selected incoming edge than outgoing edges.
Interestingly, in this formulation the solution to the LP relaxation is already integral. Thus we have a reasonably fast LP/ILP alternative to Viterbi.
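A minimal sketch (pure Python, toy probabilities) of Viterbi phrased as the shortest-path computation over the trellis described above: every node keeps the cost of the cheapest S-path reaching it, with edge weight -log P(x_t | y_t) · P(y_t | y_{t-1}). The states and probabilities below are illustrative, not from the tutorial.

```python
import math

# Toy HMM -- in the tutorial these probabilities come from a trained model.
states = ["AUTHOR", "TITLE"]
prior  = {"AUTHOR": 0.8, "TITLE": 0.2}
trans  = {("AUTHOR", "AUTHOR"): 0.7, ("AUTHOR", "TITLE"): 0.3,
          ("TITLE", "AUTHOR"): 0.1, ("TITLE", "TITLE"): 0.9}
emit   = {("AUTHOR", "Lars"): 0.6, ("AUTHOR", "analysis"): 0.1,
          ("TITLE", "Lars"): 0.1, ("TITLE", "analysis"): 0.7}

def viterbi_shortest_path(xs):
    # cost[k] = weight of the cheapest S-path ending in state k (edge weight = -log prob)
    cost = {k: -math.log(prior[k] * emit[(k, xs[0])]) for k in states}
    back = []
    for x in xs[1:]:
        new_cost, pointers = {}, {}
        for k in states:
            j = min(states, key=lambda jj: cost[jj] - math.log(trans[(jj, k)]))
            new_cost[k] = cost[j] - math.log(trans[(j, k)] * emit[(k, x)])
            pointers[k] = j
        back.append(pointers)
        cost = new_cost
    last = min(states, key=lambda k: cost[k])   # cheapest path into the sink T
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi_shortest_path(["Lars", "analysis"]))   # -> ['AUTHOR', 'TITLE']
```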
Who cares that Viterbi is an ILP?
Assume you want to learn an HMM/CRF model (e.g., for extracting fields from citations (IE)).
But you also want to add expressive constraints:
  No field can appear twice in a citation.
  The fields <journal> and <tech-report> are mutually exclusive.
  The field <author> must appear at least once.
Do: learn the HMM/CRF; convert its inference to an LP; modify the LP constraint matrix with the extra constraints.
A concrete example of adding constraints over a CRF will be shown.
CCM Examples
Many works in NLP make use of constrained conditional models, implicitly or explicitly.
Next we describe three examples in detail:
  Example 2: Semantic Role Labeling -- the use of inference with constraints to improve semantic parsing.
  Example 3: Coreference -- combining classifiers through the objective function outperforms a pipeline.
  Example 4: Sentence Compression -- a simple language model with constraints outperforms complex models.
Page 69
Example 2: Semantic Role Labeling

Demo: http://L2R.cs.uiuc.edu/~cogcomp
Who did what to whom, when, where, why, ...

Approach:
  1) Reveals several relations.
  2) Produces a very good semantic parser (F1 ≈ 90%).
  3) Easy and fast: ~7 sentences/sec (using Xpress-MP).
Top-ranked system in the CoNLL'05 shared task; the key difference is the inference.

Simple sentence:
  I left my pearls to my daughter in my will .
  [I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
  A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location
Page 72
SRL Dataset
PropBank [Palmer et. al. 05]
Core arguments: A0-A5 and AA different semantics for each verb specified in the PropBank Frame files
13 types of adjuncts labeled as AM-arg where arg specifies the adjunct type
In this problem, all the information is given, but conceptually, we could train different components on different resources.
Algorithmic Approach
Identify argument candidates
  Pruning [Xue & Palmer, EMNLP'04]
  Argument identifier: binary classification (SNoW)
Classify argument candidates
  Argument classifier: multi-class classification (SNoW)
Inference
  Use the estimated probability distribution given by the argument classifier
  Use structural and linguistic constraints
  Infer the optimal global output

[Figure: candidate argument spans bracketed over the sentence "I left my nice pearls to her"; identifying candidates is well understood, and inference is performed over the (old and new) candidate arguments.]
Page 74
Learning
Both argument identifier and argument classifier are trained using SNoW. Sparse network of linear functions Multiclass classifier that produces a probability
distribution over output values.
Features (some examples) Voice, phrase type, head word, parse tree path
from predicate, chunk sequence, syntactic frame, …
Conjunction of features
Page 75
Inference
The output of the argument classifier often violates some constraints, especially when the sentence is long.
Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming (ILP).
Input: The scores from the argument classifier Structural and linguistic constraints
ILP allows incorporating expressive (non-sequential) constraints on the variables (the arguments types).
Page 76
Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Figure: candidate argument spans for the predicate "left", each with a score distribution over argument labels.]
Page 78
Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .
One inference problem for each verb predicate.
Page 79
Integer Linear Programming Inference
For each argument a_i:
  Set up a Boolean variable a_{i,t} indicating whether a_i is classified as type t.
Goal: maximize Σ_i Σ_t score(a_i = t) · a_{i,t}
  subject to the (linear) constraints.
If score(a_i = t) = P(a_i = t), the objective finds the assignment that maximizes the expected number of correct arguments while satisfying the constraints.
Page 80
Constraints

No duplicate argument classes:
  Σ_{a ∈ POTARG} x{a = A0} ≤ 1
R-ARG (if there is an R-ARG phrase, there is an ARG phrase):
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG} x{a = A0} ≥ x{a2 = R-A0}
C-ARG (if there is a C-ARG phrase, there is an ARG phrase before it):
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG, a before a2} x{a = A0} ≥ x{a2 = C-A0}
Many other possible constraints:
  Unique labels
  No overlapping or embedding
  Relations between the number of arguments; order constraints
  If the verb is of type A, no argument of type B
Any Boolean rule can be encoded as a set of linear inequalities.
Joint inference can also be used to combine different SRL systems.
Universally quantified rules -- LBJ allows a developer to encode constraints in FOL; these are compiled into linear inequalities automatically.
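A minimal PuLP sketch (assumed solver; illustrative scores and a tiny label set) of the SRL ILP above for a handful of candidate arguments, showing the no-duplicate and R-A0 constraints as linear inequalities.

```python
import pulp

args = ["a1", "a2", "a3"]                      # hypothetical candidate argument spans
labels = ["A0", "A1", "R-A0", "null"]
score = {a: {"A0": 0.4, "A1": 0.3, "R-A0": 0.2, "null": 0.1} for a in args}  # illustrative

prob = pulp.LpProblem("srl", pulp.LpMaximize)
x = {(a, t): pulp.LpVariable(f"x_{a}_{t}", cat="Binary") for a in args for t in labels}

prob += pulp.lpSum(score[a][t] * x[a, t] for a in args for t in labels)

for a in args:                                  # each candidate gets exactly one label
    prob += pulp.lpSum(x[a, t] for t in labels) == 1
prob += pulp.lpSum(x[a, "A0"] for a in args) <= 1           # no duplicate A0
for a2 in args:                                 # an R-A0 is only allowed if some A0 exists
    prob += pulp.lpSum(x[a, "A0"] for a in args) >= x[a2, "R-A0"]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({a: next(t for t in labels if x[a, t].value() == 1) for a in args})
```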
Page 81
Summary: Semantic Role Labeling
Demo:http://L2R.cs.uiuc.edu/~cogcomp
Who did what to whom, when, where, why,…
Example 3: co-ref (thanks Pascal Denis)
Page 82
Example 3: co-ref (thanks Pascal Denis)
Page 83
Traditional approach: train a classifier which predicts the probability that two named entities co-refer:
  Pco-ref(Clinton, NPR) = 0.2   (order sensitive!)
  Pco-ref(Lewinsky, her) = 0.6
Decision process: e.g., if Pco-ref(s1, s2) > 0.5, label the two entities as co-referent; or, clustering.
Example 3: co-ref (thanks Pascal Denis)
Page 84
Traditional approach: train a classifier which predicts the probability that two named entities co-refer:
  Pco-ref(Clinton, NPR) = 0.2
  Pco-ref(Lewinsky, her) = 0.6
Decision process: e.g., if Pco-ref(s1, s2) > 0.5, label the two entities as co-referent; or, clustering.
Evaluation: cluster all the entities based on the co-ref classifier predictions; the evaluation is based on cluster overlap.
Example 3: co-ref (thanks Pascal Denis)
Page 85
Two types of entities: "base entities" and "anaphors" (pointers).

[Figure: a document graph with mentions (e.g., NPR) and co-reference links.]
Example 3: co-ref (thanks Pascal Denis)
Page 86
Error analysis: 1) "base entities" that "point" to anaphors; 2) anaphors that don't "point" to anything.
Proposed solution.
Page 87
Pipelined approach: identify anaphors; forbid links of certain types.
Proposed solution.
Page 88
This doesn't work, despite 80% accuracy of the anaphoricity classifier.
The reason: error propagation in the pipeline.
Joint Solution with CCMs
Page 89
Joint solution using CCMs
Page 90
Joint solution using CCMs
Page 91
This performs considerably better than the pipelined approach, and actually improves (though marginally) over the baseline without the anaphoricity classifier.
Joint solution using CCMs
Page 92
Note: if we have reason to believe one of the classifiers significantly more than the other, we can scale the two terms with tuned parameters α and β.
Page 93
Example 4 - Sentence Compression (thanks James Clarke).
Page 94
Example
Example:
     0    1    2    3     4    5  6   7     8
     Big fish eat small fish in a small pond
  Compression: Big fish in a pond

[Figure: the selected word indices and the trigram decision variables used to score the compression.]
Page 95
Language model-based compression
Page 96
Example 4: Sentence Compression (thanks James Clarke)
This formulation requires some additional constraints.
  Big fish eat small fish in a small pond
No selection of the decision variables should make these trigrams appear consecutively in the output. We skip these constraints here.
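A deliberately simplified sketch (PuLP assumed; unigram word-retention scores instead of the tutorial's trigram language model) of compression as an ILP: pick a subset of words maximizing a score, with a length bound and a modifier-implies-head constraint of the kind discussed on the following slides. The scores and dependency pairs are illustrative.

```python
import pulp

words = ["Big", "fish", "eat", "small", "fish", "in", "a", "small", "pond"]
score = [0.8, 0.9, 0.3, 0.2, 0.4, 0.7, 0.6, 0.1, 0.9]       # illustrative retention scores
mod_head = [(0, 1), (3, 4), (7, 8)]                           # (modifier, head) pairs, e.g. Big -> fish

prob = pulp.LpProblem("compression", pulp.LpMaximize)
keep = [pulp.LpVariable(f"keep_{i}", cat="Binary") for i in range(len(words))]

prob += pulp.lpSum(score[i] * keep[i] for i in range(len(words)))
prob += pulp.lpSum(keep) <= 5                                 # length constraint
for m, h in mod_head:
    prob += keep[m] <= keep[h]                                # keep a modifier only with its head

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(" ".join(w for w, k in zip(words, keep) if k.value() == 1))   # -> "Big fish in a pond"
```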
Page 97
Trigram model in action
Page 98
Modifier Constraints
Page 99
Example
Page 100
Example
Page 101
Sentential Constraints
Page 102
Example
Page 103
Example
Page 104
More constraints
Other CCM Examples: Opinion Recognition
Y. Choi, E. Breck, and C. Cardie. Joint Extraction of Entities and Relations for Opinion Recognition EMNLP-2006
Semantic-parsing variation: agent = entity, relation = opinion.
Constraints:
  An agent can have at most two opinions.
  An opinion should be linked to only one agent.
  The usual non-overlap constraints.
Page 105
Other CCM Examples: Temporal Ordering
N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008.
Page 106
Other CCM Examples: Temporal Ordering
N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008.
Page 107
Three types of edges:
  1) Annotation relations (before/after)
  2) Transitive closure constraints
  3) Time normalization constraints
Related Work: Language generation.
Regina Barzilay and Mirella Lapata. Aggregation via Set Partitioning for Natural Language Generation.HLT-NAACL-2006.
Constraints:
  Transitivity: if (e_i, e_j) were aggregated, and (e_j, e_k) were too, then (e_i, e_k) get aggregated.
  Maximum number of facts aggregated; maximum sentence length.
Page 108
MT & Alignment
Ulrich Germann, Mike Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. ACL 2001.
John DeNero and Dan Klein. The Complexity of Phrase Alignment Problems. ACL-HLT-2008.
Page 109
Summary of Examples
We have shown several different NLP solutions that make use of CCMs.
The examples vary in the way the models are learned.
In all cases, constraints can be expressed in a high-level language and then transformed into linear inequalities.
Learning Based Java (LBJ) [Rizzolo & Roth '07] describes an automatic way to compile high-level descriptions of constraints into linear inequalities.
Page 110
Page 111
Solvers
All the applications presented so far used ILP for inference.
People have used different solvers: Xpress-MP, GLPK, lpsolve, R, Mosek, CPLEX.
Next we discuss other inference approaches to CCMs.
113
Where Are We ?
We hope we have already convinced you that Using constraints is a good idea for addressing NLP problems Constrained conditional models provide a good platform
We were talking about using expressive constraints To improve existing models Learning + Inference The problem: inference
A powerful inference tool: Integer Linear Programming SRL, co-ref, summarization, entity-and-relation… Easy to inject domain knowledge
114
Constrained Conditional Model: Inference

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  w^T f(x, y): weight vector for "local" models (a collection of classifiers; log-linear models such as HMM/CRF, or a combination)
  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
115
Advantages of ILP Solvers: Review
ILP is Expressive: We can solve many inference problems Converting inference problems into ILP is easy
ILP is Easy to Use: Many available packages (Open Source Packages): LPSolve, GLPK, … (Commercial Packages): XPressMP, Cplex No need to write optimization code!
Why should we consider other inference options?
116
ILP: Speed Can Be an Issue
Inference problems in NLP Sometimes large problems are actually easy for ILP
E.g. Entities-Relations Many of them are not “difficult”
When ILP isn’t fast enough, and one needs to resort to approximate solutions.
The Problem: General Solvers vs. Specific Solvers ILP is a very, very general solver But, sometimes the structure of the problem allows for
simpler inference algorithms. Next we give examples for both cases.
117
Example 1: Search based Inference for SRL
The objective function: maximize the summation of the scores subject to linguistic constraints

  max Σ_{i,j} c_{ij} x_{ij}

  x_{ij}: indicator variable that assigns the j-th class to the i-th token;  c_{ij}: classification confidence

Constraints: unique labels; no overlapping or embedding; if the verb is of type A, no argument of type B; ...
Intuition: check constraint violations on partial assignments.
118
Inference using Beam Search
At each step, discard partial assignments that violate constraints, and rank the rest according to classification confidence.

[Figure: beam search over candidate arguments -- shape: argument, color: label; beam size = 2; constraint: only one Red.]
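A minimal sketch (pure Python; the scoring and constraint functions are hypothetical) of beam search with constraint filtering as described above: extend partial assignments one variable at a time, drop the ones that violate constraints, and keep only the top-k by accumulated score.

```python
def beam_search(num_vars, labels, score, violates, beam_size=2):
    """score(i, label) -> float; violates(partial) -> bool for a partial assignment (tuple of labels)."""
    beam = [((), 0.0)]                                   # (partial assignment, accumulated score)
    for i in range(num_vars):
        candidates = []
        for partial, s in beam:
            for lab in labels:
                extended = partial + (lab,)
                if violates(extended):                   # constraint check on the partial assignment
                    continue
                candidates.append((extended, s + score(i, lab)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0] if beam else None                     # may be empty if constraints are too strict

# Example: 3 tokens, labels {"Red", "Blue"}, constraint "at most one Red".
best = beam_search(
    num_vars=3, labels=["Red", "Blue"],
    score=lambda i, lab: {"Red": 0.6, "Blue": 0.5}[lab],
    violates=lambda p: p.count("Red") > 1,
)
print(best)   # e.g. (('Red', 'Blue', 'Blue'), 1.6)
```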
119
Heuristic Inference
Problems of heuristic inference:
  Problem 1: possibly a sub-optimal solution
  Problem 2: it may not find a feasible solution -- drop some constraints and solve it again
Using search for SRL gives results comparable to ILP, but is much faster.
120
Example 2: Exploiting Structure in Inference -- Transliteration

How to get a score for a (source, target) pair?
Previous approaches: extract features for each source and target entity pair.
The CCM approach: introduce an internal structure (characters) and constrain the character mappings to "make sense".
121
Transliteration Discovery with CCM
The problem now: inference. How do we find the best mapping that satisfies the constraints?
  A weight is assigned to each edge (candidate character mapping); whether to include it is a binary decision.
  Score = sum of the included mappings' weights, s.t. the mapping satisfies the constraints.
  Assume the weights are given; more on this later.
Natural constraints: pronunciation constraints, one-to-one, non-crossing, ...
122
Finding the Best Character Mappings

An Integer Linear Programming problem:

  max Σ_{i ∈ S, j ∈ T} c_{ij} x_{ij}              (maximize the mapping score)
  x_{ij} ∈ {0, 1}
  x_{ij} = 0 for (i, j) ∉ B                        (pronunciation constraint)
  Σ_j x_{ij} ≤ 1 for every i                       (one-to-one constraint)
  x_{ij} + x_{km} ≤ 1 whenever (i, j) and (k, m) cross   (non-crossing constraint)

Is this the best inference algorithm? ...
123
Finding the Best Character Mappings

A dynamic programming algorithm: exact and fast!
  Maximize the mapping score
  Restricted (pronunciation) mapping constraints
  One-to-one constraint
  Non-crossing constraint
Take-home message: although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler.
We can decompose the inference problem into two parts.
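A minimal sketch (pure Python, illustrative weights) of such a dynamic program: because the one-to-one and non-crossing constraints make legal mappings monotone, the best mapping can be found with an edit-distance-style recurrence instead of a general ILP.

```python
def best_mapping_score(src, tgt, weight):
    """weight(s_char, t_char) -> score of mapping the two characters.
    Returns the max total score of a one-to-one, non-crossing character mapping."""
    n, m = len(src), len(tgt)
    # dp[i][j] = best score using the first i source chars and first j target chars
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],                                         # leave src[i-1] unmapped
                dp[i][j - 1],                                         # leave tgt[j-1] unmapped
                dp[i - 1][j - 1] + weight(src[i - 1], tgt[j - 1]),    # map the two characters
            )
    return dp[n][m]

# Toy weight: +1 for identical characters, 0 otherwise (a real system learns these weights).
print(best_mapping_score("london", "landan", lambda a, b: 1.0 if a == b else 0.0))
```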
124
Other Inference Options
Constraint Relaxation Strategies Try Linear Programming
[Roth and Yih, ICML 2005] Cutting plane algorithms do not use all constraints at first
Dependency Parsing: Exponential number of constraints [Riedel and Clarke, EMNLP 2006]
Other search algorithms A-star, Hill Climbing… Gibbs Sampling Inference [Finkel et. al, ACL 2005]
Named Entity Recognition: enforce long distance constraints
Can be considered as : Learning + Inference One type of constraints only
125
Inference Methods – Summary
Why ILP? A powerful way to formalize the problems However, not necessarily the best algorithmic
solution
Heuristic inference algorithms are useful sometimes! Beam search Other approaches: annealing …
Sometimes, a specific inference algorithm can be designed According to your constraints
127
Constrained Conditional Model: Training

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  w^T f(x, y): weight vector for "local" models (a collection of classifiers; log-linear models such as HMM/CRF, or a combination)
  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
128
Where are we?
Algorithmic Approach: Incorporating general constraints Showed that CCMs allow for formalizing many problems Showed several algorithmic ways to incorporate global
constraints in the decision.
Training: Coupling vs. Decoupling Training and Inference. Incorporating global constraints is important but Should it be done only at evaluation time or also at training
time? How to decompose the objective function and train in parts? Issues related to:
Modularity, efficiency and performance, availability of training data
Problem specific considerations
129
Training in the presence of Constraints
General Training Paradigm: First Term: Learning from data (could be further
decomposed) Second Term: Guiding the model by constraints Can choose if constraints’ weights trained, when
and how, or taken into account only in evaluation.
Decompose Model (SRL case) Decompose Model from constraints
130
Comparing Training Methods
Option 1: Learning + Inference (with Constraints) Ignore constraints during training
Option 2: Inference (with Constraints) Based Training Consider constraints during training
In both cases: Decision Making with Constraints
Question: Isn’t Option 2 always better?
Not so simple… Next, the “Local model story”
131
Training Methods

[Figure: a structured problem with inputs x1, ..., x7 (X), outputs y1, ..., y5 (Y), and local models f1(x), ..., f5(x).]

Learning + Inference (L+I): learn the models independently.
Inference Based Training (IBT): learn all models together!
Intuition: learning with constraints may make learning more difficult.
132
Training with Constraints -- Example: Perceptron-based Global Learning

[Figure: the same structured problem, with local models f1(x), ..., f5(x).]

  Y   (true global labeling):          -1  1  1 -1 -1
  Y'  (local predictions):             -1  1  1  1  1
  Y'  (after applying constraints):    -1  1  1  1 -1

Which one is better? When and why?
133
Claims [Punyakanok et al., IJCAI 2005]

The theory applies to the case of local models (no Y in the features).
When the local models are "easy" to learn, L+I outperforms IBT. In many applications the components are identifiable and easy to learn (e.g., argument, open-close, PER).
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a larger number of training examples.
Other training paradigms are possible:
  Pipeline-like sequential models [Roth, Small, Titov: AI&Stats'09]: identify a preferred ordering among components and learn the k-th model jointly with the previously learned models.
L+I: cheaper computationally; modular. IBT is better in the limit, and in other extreme cases.
134
Bound Prediction

  Local  ≤  opt + ( (d log m + log 1/δ) / m )^{1/2}
  Global ≤  0 + ( (cd log m + c²d + log 1/δ) / m )^{1/2}

[Figure: bounds on simulated data for opt = 0, 0.1, 0.2; opt is an indication of the hardness of the problem.]

L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
135
Relative Merits: SRL

[Figure: performance as a function of the difficulty of the learning problem (# features), from easy to hard.]

When the problem is easy, L+I is better. When the problem is artificially made harder, the tradeoff is clearer.
In some cases problems are hard due to lack of training data -- this motivates semi-supervised learning.
136
L+I & IBT: General View -- Structured Perceptron

  For each iteration:
    For each (X, Y_GOLD) in the training data:
      Y_PRED = inference(X; λ)
      If Y_PRED != Y_GOLD:
        λ = λ + F(X, Y_GOLD) - F(X, Y_PRED)

The theory applies when F(x, y) = F(x).
The difference between L+I and IBT is whether the inference used to compute Y_PRED applies the constraints during training.
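A minimal sketch (pure Python; `features` and `inference` are assumed to be supplied) of this update rule. Passing an inference routine that enforces the constraints gives IBT; training with unconstrained inference and applying the constraints only at test time gives L+I.

```python
def perceptron_train(data, features, inference, num_features, iterations=10):
    """data: list of (x, y_gold); features(x, y) -> list of floats (length num_features);
    inference(x, weights) -> predicted structure y (with or without constraints)."""
    weights = [0.0] * num_features
    for _ in range(iterations):
        for x, y_gold in data:
            y_pred = inference(x, weights)
            if y_pred != y_gold:
                f_gold, f_pred = features(x, y_gold), features(x, y_pred)
                # additive update: move weights toward the gold structure, away from the prediction
                weights = [w + fg - fp for w, fg, fp in zip(weights, f_gold, f_pred)]
    return weights
```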
137
Comparing Training Methods (Cont.)
Local Models (train independently) vs. Structured Models In many cases, structured models might be better due to expressivity
But, what if we use constraints? Local Models+Constraints vs. Structured Models
+Constraints Hard to tell: Constraints are expressive For tractability reasons, structured models have less expressivity than
the use of constraints. Local can be better, because local models are easier to learn
Decompose Model (SRL case) Decompose Model from constraints
138
Example: Semantic Role Labeling Revisited

[Figure: a lattice from s to t, with candidate labels A, B, C at each of five positions.]

Sequential models (Conditional Random Field; global perceptron):
  Training: sentence based. Testing: find the shortest path, with constraints.
Local models (logistic regression; local averaged perceptron):
  Training: token based. Testing: find the best assignment locally, with constraints.
139
Which Model is Better? Semantic Role Labeling
Experiments on SRL [Roth and Yih, ICML 2005]: inject constraints into conditional random field models.

  Model           CRF     CRF-D   CRF-IBT   Avg. P
  Baseline        66.46   69.14   58.15
  + Constraints   71.94   73.91   69.82     74.49
  Training Time   48      38      145       0.8

Sequential models (CRF columns) vs. local models (Avg. P); L+I vs. IBT.
Without constraints, sequential models are better than local models.
With constraints, local models are now better than sequential models!
140
Summary: Training Methods
Many choices for training a CCM Learning + Inference (Training without constraints) Inference based Learning (Training with constraints)
Based on this, what kind of models you want to use?
Advantages of L+I Require fewer training examples More efficient; most of the time, better performance Modularity; easier to incorporate already learned models.
141
Constrained Conditional Model: Soft Constraints

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)     (subject to constraints)

  (Soft) constraints component: ρ_i is the constraint violation penalty; d_{C_i}(x, y) measures how far y is from a "legal" assignment.

(1) Why use soft constraints?
(2) How to model the "degree of violation"?
(3) How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
(4) How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
142
(1) Why Are Soft Constraints Important?
Some constraints may be violated by the data.
Even when the gold data violates no constraints, the model may prefer illegal solutions. We want a way to rank solutions based on the
level of constraints’ violation. Important when beam search is used
Working with soft constraints Need to define the degree of violation Need to assign penalties for constraints
143
Example: Information extraction
Prediction result of a trained HMM:
Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]
Violates lots of natural constraints!
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
144
Examples of Constraints
Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words pp., pages correspond to PAGE. Four digits starting with 20xx and 19xx are DATE. Quotations can appear only in TITLE …….
145
(2) Modeling the Constraints' Degree of Violation

Constraint: state transitions must occur on punctuation marks.
  Φ_c(y_i) = 1 if assigning y_i to x_i violates the constraint C with respect to the assignment (x_1,...,x_{i-1}; y_1,...,y_{i-1}), and 0 otherwise.
Count how many times the constraint is violated:
  d(y, 1_C(X)) = Σ_i Φ_c(y_i)

  Lars   Ole    Andersen   .
  AUTH   AUTH   EDITOR     EDITOR      Φ_c: 0, 0, 1, 0   →  Σ Φ_c(y_i) = 1
  AUTH   BOOK   EDITOR     EDITOR      Φ_c: 0, 1, 1, 0   →  Σ Φ_c(y_i) = 2
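A minimal sketch (pure Python) of counting violations of the "state transitions must occur on punctuation marks" constraint, i.e., computing d(y, 1_C(x)) as defined above.

```python
PUNCT = {".", ",", ";", ":"}

def violation_count(tokens, labels):
    """d(y, 1_C(x)): positions where the label changes but the previous token is not punctuation."""
    return sum(
        1
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1] and tokens[i - 1] not in PUNCT
    )

tokens = ["Lars", "Ole", "Andersen", "."]
print(violation_count(tokens, ["AUTH", "AUTH", "EDITOR", "EDITOR"]))  # 1
print(violation_count(tokens, ["AUTH", "BOOK", "EDITOR", "EDITOR"]))  # 2
```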
146
Example: Degree of Violation

It might be the case that all "good" assignments violate some constraints; with hard constraints we lose the ability to judge which assignment is better.
A better option: choose the one with fewer violations!

  Lars   Ole    Andersen   .
  AUTH   BOOK   EDITOR     EDITOR      Φ_c: 0, 1, 1, 0
  AUTH   AUTH   EDITOR     EDITOR      Φ_c: 0, 0, 1, 0

The assignment AUTH AUTH EDITOR EDITOR is better because of d(y, 1_C(X))!
147
Hard Constraints vs. Weighted Constraints
  Hard constraints: the constraints are close to perfect.
  Weighted constraints: the labeled data might not follow the constraints.
148
(3) Constrained Conditional Model with Soft Constraints

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
Inference with Beam Search (& Soft Constraints)

[Figure: beam search over the citation "L. Shari et. al. 1998. Building semantic Concordances. In C. Fellbaum, ed. WordNet: and its applications."]

Run beam search inference as before; rather than eliminating illegal assignments, re-rank them. (ILP is another option!)
149
Assume that We Have Made a Mistake...

[Figure sequence: candidate labelings of the citation "L. Shari et. al. 1998. Building semantic Concordances. In C. Fellbaum, ed. WordNet: and its applications.", starting from the best assignment according to the learned models.]

The top choice from the learned models is not good: ΣΦ_c1(y_i) = 1; ΣΦ_c2(y_i) = 1 + 1.
The second choice from the learned models is not good either: it violates "each field must be a consecutive list of words and can appear at most once in a citation" and "state transitions must occur on punctuation marks" (two violations; ΣΦ_c2(y_i) = 1 + 1).
Use constraints to find the best assignment: an alternative labeling has only one violation of "state transitions must occur on punctuation marks" (ΣΦ_c2(y_i) = 1).
Soft constraints help us find a better assignment.
We can do this because we use soft constraints. If we used hard constraints, the score of every assignment given this prefix would be negative infinity.
155
(4) Constrained Conditional Model with Soft Constraints

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
156
Compare Training Methods (with Soft Constraints)
Need to figure out the penalty as well…
Option 1: Learning + Inference (with Constraints) Learn the weights and penalties separately
Option 2: Inference (with Constraints) Based Training Learn the weights and penalties together
The tradeoff between L+I and IBT is similar to what we have seen earlier.
157
Inference Based Training with Soft Constraints -- example: Perceptron; update the penalties as well!

  For each iteration:
    For each (X, Y_GOLD) in the training data:
      Y_PRED = inference(X; λ, ρ)
      If Y_PRED != Y_GOLD:
        λ = λ + F(X, Y_GOLD) - F(X, Y_PRED)
        ρ_I = ρ_I + d(Y_GOLD, 1_{C_I}(X)) - d(Y_PRED, 1_{C_I}(X)),  for I = 1, ...
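A minimal sketch (pure Python) of the extra step for the constraint penalties; it extends the structured perceptron sketch shown earlier, with `violation_counts(x, y)` assumed to return the vector of d(y, 1_{C_I}(x)) values, and it mirrors the update rule on the slide.

```python
def ccm_perceptron_update(weights, penalties, x, y_gold, y_pred, features, violation_counts):
    """One update step: adjust both the model weights and the constraint penalties."""
    if y_pred != y_gold:
        f_gold, f_pred = features(x, y_gold), features(x, y_pred)
        weights = [w + fg - fp for w, fg, fp in zip(weights, f_gold, f_pred)]
        d_gold, d_pred = violation_counts(x, y_gold), violation_counts(x, y_pred)
        penalties = [p + dg - dp for p, dg, dp in zip(penalties, d_gold, d_pred)]
    return weights, penalties
```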
158
L+I v.s IBT for Soft Constraints
Test on citation recognition:
  L+I: HMM + weighted constraints
  IBT: Perceptron + weighted constraints
  Same feature set
With constraints: the factored model (L+I) is better, and more significantly so with a small number of examples.
Without constraints: with few labeled examples, HMM > Perceptron; with many labeled examples, Perceptron > HMM.
This agrees with earlier results in the supervised setting [ICML'05, IJCAI'05].
Summary – Soft Constraints
Using soft constraints is important sometimes Some constraints might be violated in the gold data We want to have the notion of degree of violation
Degree of violation One approximation: count how many time the
constraints was violated
How to solve? Beam search, ILP, …
How to learn? L+I v.s. IBT
159
160
Constrained Conditional Model: Injecting Knowledge

  y* = argmax_y  w^T f(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  w^T f(x, y): weight vector for "local" models (a collection of classifiers; log-linear models such as HMM/CRF, or a combination)
  ρ_i: constraint violation penalty;  d_{C_i}(x, y): how far y is from a "legal" assignment (the (soft) constraints component)

How to solve? This is an Integer Linear Program; ILP packages give an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?

Inject Prior Knowledge via Constraints
  a.k.a. semi/unsupervised learning with Constrained Conditional Models
161
Constraints As a Way To Encode Prior Knowledge
Consider encoding the knowledge that: Entities of type A and B cannot occur simultaneously in a
sentence The “Feature” Way
Requires larger models
The Constraints Way Keeps the model simple; add expressive
constraints directly A small set of constraints Allows for decision time incorporation of
constraints
A effective way to inject knowledge
Need more training data
We can use constraints as a way to replace training data
162162
Constraint-Driven Semi/Unsupervised Learning (CoDL)

Use constraints to generate better training samples in semi/unsupervised learning.

[Figure: the standard self-training loop -- seed examples train a model, the model labels unlabeled data (prediction), and the model learns from the newly labeled data (feedback). With CoDL, prediction uses the model plus constraints, giving better feedback and a better model; a resource can also be used.]

In traditional semi/unsupervised learning, models can drift away from the correct model.
163
Semi-Supervised Learning with Soft Constraints (L+I)

Learning the model weights -- example: HMM.
Constraint penalties:
  Hard constraints: infinity
  Weighted constraints: ρ_i = -log P{constraint C_i is violated in the training data}
Only 10 constraints. Learn the weights and penalties separately.
164
Semi-Supervised Learning with Constraints  [Chang, Ratinov, Roth, ACL'07]

  λ = learn(T)                                  // supervised learning algorithm parameterized by λ
  For N iterations do:
    T' = {}
    For each x in the unlabeled dataset:
      {y1, ..., yK} = InferenceWithConstraints(x, C, λ)
      T' = T' ∪ {(x, yi)}, i = 1...K
    λ = γ·λ + (1 - γ)·learn(T')                 // weigh the supervised and unsupervised models

  Inference-based augmentation of the training set (feedback), via inference with constraints; then learn from the new training data.
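A minimal sketch (pure Python; `learn`, `inference_with_constraints`, and the datasets are assumed to be supplied) of the CoDL loop above, under the assumption that the supervised seed model is interpolated with the model retrained on constrained self-labeled data.

```python
def codl(labeled, unlabeled, learn, inference_with_constraints, constraints,
         gamma=0.9, iterations=5, top_k=2):
    """learn(dataset) -> model parameters (a list of floats);
    inference_with_constraints(x, constraints, params) -> list of top-K label assignments."""
    params0 = learn(labeled)                    # supervised seed model
    params = params0
    for _ in range(iterations):
        self_labeled = []
        for x in unlabeled:
            for y in inference_with_constraints(x, constraints, params)[:top_k]:
                self_labeled.append((x, y))     # constraints make these "better training samples"
        retrained = learn(self_labeled)
        # interpolate the seed model with the newly learned model (assumed CoDL-style smoothing)
        params = [gamma * p0 + (1 - gamma) * pr for p0, pr in zip(params0, retrained)]
    return params
```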
165
Objective function: the Value of Constraints in Semi-Supervised Learning

[Figure: performance vs. number of available labeled examples (up to 300). Learning with 10 constraints: constraints are used to bootstrap a semi-supervised learner -- a poor model + constraints is used to annotate unlabeled data, which in turn is used to keep training the model. For comparison: learning without constraints with 300 examples.]
166
Unsupervised Constraint Driven Learning
Semi-supervised Constraint Driven Learning In [Chang et.al. ACL 2007], they use a small labeled
dataset Reason: We need a good starting point!
Unsupervised Constraint Driven Learning Sometimes, good resources exist
We can use the resource to initialize the model We do not need labeled instances at all!
Application: transliteration discovery
167
Why Adding Constraints?
Before talking about transliteration discovery, let's think again about why we add constraints. Reason: we want to capture the dependencies between different outputs.
A new question: what happens if you are trying to solve a single-output classification problem -- can you still inject prior knowledge?
168
Why Add Constraints?

[Figure: a structured output problem -- inputs x1, ..., x7 (X) and outputs y1, ..., y5 (Y), with dependencies between the different outputs.]
169
Why Add Constraints?

[Figure: a single-output problem -- inputs x1, ..., x7 (X) and only one output y1 (Y).]
170
Adding Constraints Through Hidden Variables
[Figure: a single-output problem with hidden variables f1, ..., f5 between the input X and the output y1.]
171
Example: Transliteration Discovery

Hidden variables: character mappings.
The character mappings are not our final goal, but they provide an intermediate representation that satisfies natural constraints.
The score for each pair depends on the character mappings.
172
Transliteration Discovery with Hidden Variables
For each source NE, find the best target candidate.
For each NE pair, the score is equal to the score of the best hidden set of character mappings (edges). Add constraints when predicting the hidden variables.

  g(H, N_s, N_t) = F(H, N_s, N_t) − Σ_{c ∈ C} ρ_c d(H, 1_c)
  score(N_s, N_t) = max_H g(H, N_s, N_t) / |N_t|

A CCM formulation -- use constraints to bias the prediction! Alternative view: finding the best feature representation.
173
Natural Constraints

A dynamic programming algorithm: exact and fast!
  Maximize the mapping score
  Pronunciation constraints
  One-to-one constraint
  Non-crossing constraint
Take-home message: although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler.
We can decompose the problem into two parts.
174
Algorithm: High-Level View

Initialize the model with a resource (a Romanization table). Until convergence:
  Use the model + constraints to get assignments for both the hidden variables (F) and the labels (Y).
  Update the model with the newly labeled F and Y:
    D = D ∪ { (F(N_s, N_t | W), y) } over the newly labeled pairs
    W = train(D)
  Get feedback from both the hidden variables and the labels.
175
Results -- Russian (higher is better)

[Figure: comparison of "no constraints + 20 labeled examples" vs. unsupervised constraint-driven learning (no labeled data, no temporal information, no NER on Russian).]
176
Results -- Hebrew

[Figure: comparison of "no constraints + 250 labeled examples" vs. unsupervised constraint-driven learning (no labeled data, no temporal information).]
Summary
Adding constraints is an effective way to inject knowledge; constraints can correct our predictions on unlabeled data.
Injecting constraints is sometimes more effective than annotating data: 300 labeled examples vs. 20 labeled examples + constraints.
For single-output problems, use hidden variables to capture the constraints.
Other approaches are possible [Daume, EMNLP 2008]: only get feedback from the examples where the constraints are satisfied.
177
179
Conclusion Constrained Conditional Models combine
Learning conditional models with using declarative expressive constraints
Within a constrained optimization framework
Our goal was to: Introduce a clean way of incorporating constraints to
bias and improve decisions of learned models A clean way to use (declarative) prior knowledge to
guide semi-supervised learning Provide examples for the diverse usage CCMs have
already found in NLP Significant success on several NLP and IE tasks (often,
with ILP)
180
Technical Conclusions Presented and discussed inference algorithms
How to use constraints to make global decisions The formulation is an Integer Linear Programming formulation,
but algorithmic solutions can employ a variety of algorithms Present and discussed learning models to be used along with
constraints Training protocol matters We distinguished two extreme cases – training with/without
constraints. Issues include performance, as well as modularity of the
solution and the ability to use previously learned models. We did not attend to the question of “how to find constraints”
In order to emphasize the idea that we think background knowledge is important, it exists, and to focus on using it.
But, we talked about learning weights for constraints, and it’s clearly possible to learn constraints.
181
Summary: Constrained Conditional Models

  y* = argmax_y  Σ_i w_i φ_i(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  Left term: a linear objective function; typically φ(x, y) will be local functions, or φ(x, y) = φ(x).  (Conditional Markov Random Field)
  Right term: expressive constraints over the output variables -- soft, weighted constraints, specified declaratively as FOL formulae.  (Constraints network)

[Figure: a conditional Markov random field over y1, ..., y8 next to a constraints network over the same variables.]

Clearly, there is a joint probability distribution that represents this mixed model. We would like to: learn a simple model (or several simple models), and make decisions with respect to a complex model.
Key difference from MLNs, which provide a concise definition of a model, but of the whole joint one.
182
Designing CCMs

  Decide what variables are of interest -- learn model(s).
  Think about constraints among the variables of interest.
  Design your objective function:

    y* = argmax_y  Σ_i w_i φ_i(x, y)  −  Σ_i ρ_i d_{C_i}(x, y)

  Linear objective functions; typically φ(x, y) will be local functions, or φ(x, y) = φ(x); expressive, soft, weighted constraints over the output variables, specified declaratively as FOL formulae.

LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp
A modeling language for Constrained Conditional Models. Supports programming along with building learned models, high-level specification of constraints, and inference with constraints.
183
Questions? Thank you!