TRANSCRIPT
Montague meets Markov:Combining Logical and Distributional Semantics
Raymond J. Mooney
Katrin Erk
Islam Beltagy
University of Texas at Austin
Logical AI Paradigm
• Represents knowledge and data in a binary symbolic logic such as FOPC.
+ Rich representation that handles arbitrary sets of objects, with properties, relations, quantifiers, etc.
− Unable to handle uncertain knowledge and probabilistic reasoning.
Probabilistic AI Paradigm
• Represents knowledge and data as a fixed set of random variables with a joint probability distribution.
+ Handles uncertain knowledge and probabilistic reasoning.
− Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.
Statistical Relational Learning (SRL)
• SRL methods attempt to integrate methods from predicate logic (or relational databases) and probabilistic graphical models to handle structured, multi-relational data.
SRL Approaches (A Taste of the “Alphabet Soup”)
• Stochastic Logic Programs (SLPs) (Muggleton, 1996)
• Probabilistic Relational Models (PRMs) (Koller, 1999)
• Bayesian Logic Programs (BLPs) (Kersting & De Raedt, 2001)
• Markov Logic Networks (MLNs) (Richardson & Domingos, 2006)
• Probabilistic Soft Logic (PSL) (Kimmig et al., 2012)
SRL Methods Based on Probabilistic Graphical Models
• BLPs use definite-clause logic (Prolog programs) to define abstract templates for large, complex Bayesian networks (i.e. directed graphical models).
• MLNs use full first order logic to define abstract templates for large, complex Markov networks (i.e. undirected graphical models).
• PSL uses logical rules to define templates for Markov nets with real-valued propositions to support efficient inference.
• McCallum’s FACTORIE uses an object-oriented programming language to define large, complex factor graphs.
• Goodman & Tenenbaum’s CHURCH uses a functional programming language to define large, complex generative models.
Markov Logic Networks [Richardson & Domingos, 2006]
• Set of weighted clauses in first-order predicate logic.
• Larger weight indicates stronger belief that the clause should hold.
• MLNs are templates for constructing Markov networks for a given set of constants.
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
MLN Example: Friends & Smokers
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

Grounding yields a Markov network over the atoms:
Smokes(A), Smokes(B), Cancer(A), Cancer(B),
Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
Probability of a possible world

P(X = x) = (1/Z) exp( Σᵢ wᵢ nᵢ(x) )

where wᵢ is the weight of formula i, nᵢ(x) is the number of true groundings of formula i in world x, and Z = Σₓ exp( Σᵢ wᵢ nᵢ(x) ) normalizes over all possible worlds.

A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
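The world probability above can be made concrete for the Friends & Smokers example. The sketch below is my own mini-implementation (not Alchemy); the atom names and the weights 1.5 and 1.1 come from the slides, everything else is illustrative.

```python
import itertools
import math

CONSTANTS = ["A", "B"]

def n_cancer_rule(world):
    # groundings of "1.5: Smokes(x) => Cancer(x)" that are true in `world`
    return sum(1 for x in CONSTANTS
               if (not world[f"Smokes({x})"]) or world[f"Cancer({x})"])

def n_friends_rule(world):
    # groundings of "1.1: Friends(x,y) => (Smokes(x) <=> Smokes(y))"
    return sum(1 for x in CONSTANTS for y in CONSTANTS
               if (not world[f"Friends({x},{y})"])
               or (world[f"Smokes({x})"] == world[f"Smokes({y})"]))

def unnormalized_prob(world):
    # exp(sum_i w_i * n_i(x))
    return math.exp(1.5 * n_cancer_rule(world) + 1.1 * n_friends_rule(world))

ATOMS = ([f"Smokes({x})" for x in CONSTANTS]
         + [f"Cancer({x})" for x in CONSTANTS]
         + [f"Friends({x},{y})" for x in CONSTANTS for y in CONSTANTS])

def all_worlds():
    for bits in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, bits))

# Z sums the unnormalized weights of all 2^8 = 256 possible worlds
Z = sum(unnormalized_prob(w) for w in all_worlds())

# e.g. the world where every atom is false satisfies all groundings
empty = {a: False for a in ATOMS}
print(unnormalized_prob(empty) / Z)
```

With 8 ground atoms the 256-world enumeration is instant; the same computation is intractable for realistic constant sets, which is why MLN systems rely on approximate inference.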
MLN Inference
• Infer the probability of a particular query given a set of evidence facts, e.g. P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob)).
• Use standard algorithms for inference in graphical models, such as Gibbs sampling or belief propagation.
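At Friends & Smokers scale the slide's query can even be answered exactly. The sketch below (my own code, standing in for the Gibbs sampling or belief propagation one would use on real networks) enumerates all worlds consistent with the evidence:

```python
import itertools
import math

CONSTANTS = ["A", "B"]
ATOMS = ([f"Smokes({x})" for x in CONSTANTS]
         + [f"Cancer({x})" for x in CONSTANTS]
         + [f"Friends({x},{y})" for x in CONSTANTS for y in CONSTANTS])

def weight(world):
    # exp of the total weight of satisfied groundings of the two rules
    s = sum(1.5 for x in CONSTANTS
            if (not world[f"Smokes({x})"]) or world[f"Cancer({x})"])
    s += sum(1.1 for x in CONSTANTS for y in CONSTANTS
             if (not world[f"Friends({x},{y})"])
             or (world[f"Smokes({x})"] == world[f"Smokes({y})"]))
    return math.exp(s)

def query(q, evidence):
    # exact conditional by enumeration: P(q | evidence)
    num = den = 0.0
    for bits in itertools.product([False, True], repeat=len(ATOMS)):
        world = dict(zip(ATOMS, bits))
        if any(world[a] != v for a, v in evidence.items()):
            continue  # world contradicts the evidence
        w = weight(world)
        den += w
        if world[q]:
            num += w
    return num / den

p = query("Cancer(A)", {"Friends(A,B)": True, "Smokes(B)": True})
print(p)  # P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))
```

Since the evidence raises the probability that Anna smokes, and smoking raises the probability of cancer, the query comes out well above chance.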
MLN Learning
• Learning weights for an existing set of clauses: EM, max-margin, and on-line methods.
• Learning logical clauses (a.k.a. structure learning): Inductive Logic Programming methods; top-down and bottom-up MLN clause learning; on-line MLN clause learning.
Strengths of MLNs
• Fully subsumes first-order predicate logic
– Just give infinite weight to all clauses
• Fully subsumes probabilistic graphical models
– Can represent any joint distribution over an arbitrary set of discrete random variables
• Can utilize prior knowledge in both symbolic and probabilistic forms
• Large existing base of open-source software (Alchemy)
Weaknesses of MLNs
• Inherits the computational intractability of general methods for both logical and probabilistic inference and learning.
– Inference in FOPC is semi-decidable
– Inference in general graphical models is PSPACE-complete
• Just producing the “ground” Markov net can cause a combinatorial explosion.
– Current “lifted” inference methods do not help reasoning with many kinds of nested quantifiers.
PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
● Probabilistic logic framework designed with efficient inference in mind.
● Input: a set of weighted first-order logic rules and a set of evidence, just as in BLP or MLN.
● MPE inference is a linear-programming problem that can efficiently draw probabilistic conclusions.
PSL vs. MLN

PSL:
● Atoms have continuous truth values in the interval [0,1].
● Inference finds the truth values of all atoms that best satisfy the rules and evidence.
● MPE inference: Most Probable Explanation.
● A linear optimization problem.

MLN:
● Atoms have boolean truth values {0, 1}.
● Inference finds the probability of atoms given the rules and evidence.
● Calculates the conditional probability of a query atom given evidence.
● A combinatorial counting problem.
PSL Example
● First-order logic weighted rules (shown on the slide)
● Evidence:
I(friend(John,Alex)) = 1
I(spouse(John,Mary)) = 1
I(votesFor(Alex,Romney)) = 1
I(votesFor(Mary,Obama)) = 1
● Inference:
– I(votesFor(John,Obama)) = 1
– I(votesFor(John,Romney)) = 0
PSL’s Interpretation of Logical Connectives
● Łukasiewicz relaxation of AND, OR, NOT:
– I(ℓ1 ∧ ℓ2) = max {0, I(ℓ1) + I(ℓ2) − 1}
– I(ℓ1 ∨ ℓ2) = min {1, I(ℓ1) + I(ℓ2)}
– I(¬ℓ1) = 1 − I(ℓ1)
● Distance to satisfaction:
– An implication ℓ1 → ℓ2 is satisfied iff I(ℓ1) ≤ I(ℓ2)
– d = max {0, I(ℓ1) − I(ℓ2)}
● Example:
– I(ℓ1) = 0.3, I(ℓ2) = 0.9 ⇒ d = 0
– I(ℓ1) = 0.9, I(ℓ2) = 0.3 ⇒ d = 0.6
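The Łukasiewicz operators are simple enough to state directly in code. A minimal sketch (function names are mine; the formulas are the ones on the slide):

```python
def l_and(a, b):
    # Łukasiewicz AND: max{0, a + b - 1}
    return max(0.0, a + b - 1.0)

def l_or(a, b):
    # Łukasiewicz OR: min{1, a + b}
    return min(1.0, a + b)

def l_not(a):
    # Łukasiewicz NOT: 1 - a
    return 1.0 - a

def distance(antecedent, consequent):
    # distance to satisfaction of antecedent -> consequent:
    # 0 when I(antecedent) <= I(consequent), otherwise the gap
    return max(0.0, antecedent - consequent)

# The slide's example:
print(distance(0.3, 0.9))  # 0.0 (rule satisfied)
print(distance(0.9, 0.3))  # 0.6 up to float rounding
```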
PSL Probability Distribution
● PDF over interpretations I (continuous truth assignments):

f(I) = (1/Z) exp( − Σ_{r ∈ R} w_r · d_r(I) )

where w_r is the weight of rule r, d_r(I) is the distance to satisfaction of rule r under I, Z is the normalization constant, and the sum runs over all rules R.
PSL Inference
● MPE inference (Most Probable Explanation):
– Find the interpretation that maximizes the PDF
– Equivalently, find the interpretation that minimizes Σ_r w_r · d_r(I)
– Distance to satisfaction is a linear function, so this is a linear optimization problem
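A toy MPE problem makes the minimization concrete. The two rules, their weights, and the evidence value below are invented for illustration, and a grid search over one unknown atom stands in for the linear program a real PSL system would solve:

```python
I_a = 0.9  # evidence: atom a is 0.9 true

def total_distance(I_b):
    # weighted sum of distances to satisfaction, for one unknown atom b
    d_rule1 = max(0.0, I_a - I_b)  # rule "a => b", weight 2.0
    d_rule2 = max(0.0, I_b)        # rule "b => false" (a prior against b), weight 1.0
    return 2.0 * d_rule1 + 1.0 * d_rule2

# MPE: the truth value of b minimizing the total weighted distance
best_b = min((i / 1000 for i in range(1001)), key=total_distance)
print(best_b)  # 0.9: satisfying the stronger rule wins
```

Because both distance terms are piecewise linear in I(b), the objective is piecewise linear and its minimum sits at a breakpoint (here I(b) = 0.9), which is exactly why PSL's MPE inference reduces to linear programming.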
Semantic Representations
• Formal semantics
– Uses first-order logic
– Deep
– Brittle
• Distributional semantics
– Statistical method
– Robust
– Shallow
System Architecture [Garrette et al. 2011, 2012; Beltagy et al., 2013]
• Combining both logical and distributional semantics
– Represent meaning using a probabilistic logic
• Markov Logic Networks (MLN)
• Probabilistic Soft Logic (PSL)
– Generate soft inference rules
• From distributional semantics
[Architecture diagram: Sent1 and Sent2 → BOXER → LF1 and LF2; Vector Space → Dist. Rule Constructor → Rule Base; LF1, LF2, and the Rule Base feed MLN/PSL Inference → result]
• BOXER [Bos, et al. 2004]: maps sentences to logical form
• Distributional Rule constructor: generates relevant soft inference rules based on distributional similarity
• MLN/PSL: probabilistic inference
• Result: degree of entailment or semantic similarity score (depending on the task)
Markov Logic Networks [Richardson & Domingos, 2006]

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

• Two constants: Anna (A) and Bob (B)
• Query: P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))
Recognizing Textual Entailment (RTE)
• Premise: “A man is cutting pickles”
∃x,y,z. man(x) ∧ cut(y) ∧ agent(y, x) ∧ pickles(z) ∧ patient(y, z)
• Hypothesis: “A guy is slicing cucumber”
∃x,y,z. guy(x) ∧ slice(y) ∧ agent(y, x) ∧ cucumber(z) ∧ patient(y, z)
• Inference: Pr(Hypothesis | Premise)
– Degree of entailment
Distributional Lexical Rules
• For all pairs of words (a, b) where a is in S1 and b is in S2, add a soft rule relating the two:
– ∀x a(x) → b(x) | wt(a, b)
– wt(a, b) = f(cos(a, b))
• Premise: “A man is cutting pickles”
• Hypothesis: “A guy is slicing cucumber”
– ∀x man(x) → guy(x) | wt(man, guy)
– ∀x cut(x) → slice(x) | wt(cut, slice)
– ∀x pickle(x) → cucumber(x) | wt(pickle, cucumber)
– ∀x man(x) → cucumber(x) | wt(man, cucumber)
– ∀x pickle(x) → guy(x) | wt(pickle, guy)
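The rule constructor above can be sketched in a few lines. The toy 3-d vectors below are invented for illustration (real systems use learned distributional vectors), and f is taken to be the identity:

```python
import math

# hypothetical toy word vectors; any real system would use learned embeddings
vectors = {
    "man":      [0.9, 0.1, 0.2],
    "cut":      [0.1, 0.8, 0.3],
    "pickle":   [0.2, 0.3, 0.9],
    "guy":      [0.8, 0.2, 0.2],
    "slice":    [0.2, 0.9, 0.2],
    "cucumber": [0.3, 0.2, 0.8],
}

def cos(u, v):
    # cosine similarity of two vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def lexical_rules(premise_words, hypothesis_words):
    # one soft rule per (a, b) pair, weighted by wt(a, b) = f(cos(a, b))
    rules = []
    for a in premise_words:
        for b in hypothesis_words:
            rules.append((f"forall x. {a}(x) -> {b}(x)",
                          cos(vectors[a], vectors[b])))
    return rules

for rule, wt in lexical_rules(["man", "cut", "pickle"],
                              ["guy", "slice", "cucumber"]):
    print(f"{rule} | {wt:.2f}")
```

Related pairs like (man, guy) get higher weights than unrelated ones like (man, cucumber), so the soft rules bias inference toward the plausible substitutions.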
Distributional Phrase Rules
• Premise: “A boy is playing”
• Hypothesis: “A little kid is playing”
• Need rules for phrases:
– ∀x boy(x) → little(x) ∧ kid(x) | wt(boy, "little kid")
• Compute vectors for phrases using vector addition [Mitchell & Lapata, 2010]:
– "little kid" = little + kid
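Phrase composition by addition is a one-liner; the sketch below uses invented 3-d vectors to show how the phrase-rule weight would be computed:

```python
import math

# hypothetical toy vectors
little = [0.7, 0.2, 0.1]
kid    = [0.1, 0.8, 0.3]
boy    = [0.3, 0.9, 0.3]

# "little kid" = little + kid  [Mitchell & Lapata, 2010]
little_kid = [a + b for a, b in zip(little, kid)]  # ~[0.8, 1.0, 0.4]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

# weight of the phrase rule: forall x. boy(x) -> little(x) & kid(x)
print(f"wt(boy, 'little kid') = {cos(boy, little_kid):.2f}")
```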
Paraphrase Rules [by: Cuong Chau]
• Generate inference rules from pre-compiled paraphrase collections like Berant et al. [2012]
• e.g., “X solves Y” ⇒ “X finds a solution to Y” | w
Evaluation (RTE using MLNs)
• Datasets: RTE-1, RTE-2, RTE-3
– Each dataset has 800 training pairs and 800 testing pairs
• Use multiple parses to reduce the impact of misparses
Evaluation (RTE using MLNs) [by: Cuong Chau]

System                RTE-1  RTE-2  RTE-3
Bos & Markert [2005]  0.52   –      –
MLN                   0.57   0.58   0.55
MLN-multi-parse       0.56   0.58   0.57
MLN-paraphrases       0.60   0.60   0.60

Bos & Markert [2005] is a logic-only baseline whose KB is WordNet.
Semantic Textual Similarity (STS)
• Rate the semantic similarity of two sentences on a 0 to 5 scale
• Gold standards are averaged over multiple human judgments
• Evaluate by measuring correlation to human ratings

S1                           S2                             score
A man is slicing a cucumber  A guy is cutting a cucumber    5
A man is slicing a cucumber  A guy is cutting a zucchini    4
A man is slicing a cucumber  A woman is cooking a zucchini  3
A man is slicing a cucumber  A monkey is riding a bicycle   1
Softening Conjunction for STS
• Premise: “A man is driving”
∃x,y. man(x) ∧ drive(y) ∧ agent(y, x)
• Hypothesis: “A man is driving a bus”
∃x,y,z. man(x) ∧ drive(y) ∧ agent(y, x) ∧ bus(z) ∧ patient(y, z)
• Break the sentence into “mini-clauses”, then combine their evidence using an “averaging combiner” [Natarajan et al., 2010]
• The hypothesis becomes:
– ∀x,y,z. man(x) ∧ agent(y, x) → result()
– ∀x,y,z. drive(y) ∧ agent(y, x) → result()
– ∀x,y,z. drive(y) ∧ patient(y, z) → result()
– ∀x,y,z. bus(z) ∧ patient(y, z) → result()
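The averaging combiner can be illustrated numerically. The 0/1 evidence scores below are hand-set to match the slide's example, where the premise supports the first two mini-clauses but says nothing about a patient or a bus:

```python
# evidence each mini-clause receives from the premise "A man is driving"
mini_clause_evidence = {
    "man(x) & agent(y,x)":     1.0,  # the premise has a man as the agent
    "drive(y) & agent(y,x)":   1.0,  # and he is driving
    "drive(y) & patient(y,z)": 0.0,  # the premise's drive has no patient
    "bus(z) & patient(y,z)":   0.0,  # and no bus at all
}

# averaging combiner: the result is the mean of the mini-clause evidence
result = sum(mini_clause_evidence.values()) / len(mini_clause_evidence)
print(result)  # 0.5: a partial match instead of all-or-nothing entailment
```

A strict conjunction would score this pair 0; averaging yields the graded similarity judgment STS needs.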
Evaluation (STS using MLN)
• Microsoft video description corpus (SemEval 2012)
– Short video descriptions

System                                                Pearson r
Our system with no distributional rules [logic only]  0.52
Our system with lexical rules                         0.60
Our system with lexical and phrase rules              0.63
PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
● MLN inference is very slow
● PSL is a probabilistic logic framework designed with efficient inference in mind
● Inference is a linear program
STS using PSL – Conjunction
● The Łukasiewicz relaxation of AND is very restrictive:
– I(ℓ1 ∧ ℓ2) = max {0, I(ℓ1) + I(ℓ2) − 1}
● Replace AND with a weighted average:
– I(ℓ1 ∧ … ∧ ℓn) = w_avg(I(ℓ1), …, I(ℓn))
– Learning the weights is future work; for now they are equal
● Inference:
– The “weighted average” is a linear function
– No changes to the optimization problem
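A quick numerical check shows why the Łukasiewicz AND is too restrictive for long conjunctions: with enough conjuncts it collapses to 0 even when every literal is nearly true, while the equal-weight average stays high. A minimal sketch:

```python
from functools import reduce

def l_and(a, b):
    # Łukasiewicz AND: max{0, a + b - 1}
    return max(0.0, a + b - 1.0)

truths = [0.9] * 11  # eleven literals, each 0.9 true

lukasiewicz = reduce(l_and, truths)        # folds the binary AND
average = sum(truths) / len(truths)        # equal-weight w_avg

print(lukasiewicz)  # 0.0: the conjunction has collapsed
print(average)      # stays at 0.9
```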
Evaluation (STS using PSL)

System                   msr-vid  msr-par  SICK
vec-add (dist. only)     0.78     0.24     0.65
vec-mul (dist. only)     0.76     0.12     0.62
MLN (logic + dist.)      0.63     0.16     0.47
PSL-no-DIR (logic only)  0.74     0.46     0.68
PSL (logic + dist.)      0.79     0.53     0.70
PSL+vec-add (ensemble)   0.83     0.49     0.71

msr-vid: Microsoft video description corpus (SemEval 2012); short video description sentences
msr-par: Microsoft paraphrase corpus (SemEval 2012); long news sentences
SICK: SemEval 2014
Evaluation (STS using PSL)

                       msr-vid   msr-par    SICK
PSL time/pair          8 s       30 s       10 s
MLN time/pair          1 m 31 s  11 m 49 s  4 m 24 s
MLN timeouts (10 min)  9%        97%        36%