Slide 1
Self-supervised Probabilistic Methods for Extracting Facts from Text
Doug Downey
Slide 2
Q: Who did IBM acquire in 2002?
A:“IBM acquired * in 2002”
Q: Who has won a best actor Oscar for playing a villain?
A: “won best actor for playing a villain” – 0 hits!
The answer isn’t on just one Web page
Web Search: Answering Questions
Slide 3
Solution: Synthesizing Across Pages
Q: Who has won a best actor Oscar for playing a villain?
A: Find all $X where the following appear:
"$X won best actor for $Y"
"$X, who played $Z in $Y"
"the villain, $Z"
"Forest Whitaker won best actor for The Last King of Scotland" – 210 hits
"Forest Whitaker, who played Idi Amin in The Last King of Scotland" – 4 hits
"the villain, Idi Amin" – 1 hit
Answer: Forest Whitaker
Slide 4
Given: One or more contexts indicating a semantic class C, e.g., "$X starred in $Y" => StarredIn($X, $Y)
– User-specified (TextRunner [Banko et al., IJCAI 2007])
– Automatically generated (KnowItAll [Etzioni et al., AIJ 2005])
– Bootstrapped from resources [Snow et al., NIPS 2004]
Output: instances of C
But extraction from contexts is highly imperfect!
=> Output P(x ∈ C) for each term x
Self-supervised – no hand-tagged examples
Self-supervised Information Extraction
Slide 5
Given: One or more contexts suggestive of a semantic class C, and a corpus of text
Output: P(x ∈ C) for each term x
KnowItAll Hypothesis – Terms x which occur in the suggestive contexts more frequently are more likely to be instances of C.
Distributional Hypothesis – Terms in the same class tend to appear in similar contexts.
My task: formalizing these heuristics into statements about P(x ∈ C) given a corpus.
Self-supervised Information Extraction
Slide 6
Who cares about Probabilities?
Why not use rankings (e.g., the precision/recall metric)?
P( WonBestActorFor(Forest Whitaker, The Last King of Scotland) )
and P( PlayedVillainIn(Forest Whitaker, The Last King of Scotland) )
=> Our goal: an estimate of the probability that Forest Whitaker won best actor for playing a villain.
Not possible with rankings! In fact, combining even perfect rankings can yield accuracy < ½.
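Calibrated probabilities compose across facts in a way rankings cannot; a minimal sketch, with hypothetical probability values and assuming the two component facts are independent:

```python
# Hypothetical, illustrative probabilities for the two component facts.
p_won_best_actor = 0.9   # P( WonBestActorFor(Whitaker, The Last King of Scotland) )
p_played_villain = 0.8   # P( PlayedVillainIn(Whitaker, The Last King of Scotland) )

# With calibrated probabilities, a joint estimate is a simple product
# (under an independence assumption); a ranking offers no analogous operation.
p_joint = p_won_best_actor * p_played_villain
print(round(p_joint, 2))
```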
Slide 7
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for DH
5) Chez KnowItAll
Outline
Slide 8
Term-Context Matrix
Rows: terms, e.g., Miami, (Robert De Niro, Raging Bull) – potential elements of C
Columns: contexts
Example counts:
. . . 98 0 2 25 1 513 . . .
. . . 2 0 930 0 0 1 . . .
. . . 1 0 10 0 0 1 . . .
Slide 9
Term-Context Matrix
Rows: terms
Columns: contexts, e.g., "cities such as $X", "$X said $Y offered to"; also: parse trees, bag of words, containing Web domain, etc.
Example counts:
. . . 98 0 2 25 1 513 . . .
. . . 2 0 930 0 0 1 . . .
. . . 1 0 10 0 0 1 . . .
Slide 10
Terms: Miami, Twisp, Star Wars
Contexts: "X soundtrack", "he visited X and", "cities such as X", "X and other cities", "X lodging"
Miami: . . . 98 0 20 250 30 513 . . .
Twisp: . . . 5 0 1 2 1 1 . . .
Star Wars: . . . 1 1000 0 2 1 1 . . .
KnowItAll Hypothesis
Distributional Hypothesis
Slide 11
Two Research Questions
Given:
– M, the term-context matrix
– the columns of M for contexts suggesting C
– a prior estimate that x ∈ C
Formalizing the KnowItAll hypothesis: what is an expression for P(x ∈ C) given the counts in the suggestive columns?
Formalizing the distributional hypothesis: what is an expression for P(x ∈ C) given the full matrix M?
Slide 12
Key Requirements for Models
1) Produce probabilities
2) Execute at “interactive” speed
3) No hand-tagged data
Slide 13
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for DH
5) Chez KnowItAll
Outline
Slide 14
Terms: Miami, Twisp, Star Wars
Contexts: "X soundtrack", "he visited X and", "cities such as X", "X and other cities", "X lodging"
Miami: . . . 98 0 20 250 30 513 . . .
Twisp: . . . 5 0 1 2 1 1 . . .
Star Wars: . . . 1 1000 0 2 1 1 . . .
KnowItAll Hypothesis
Distributional Hypothesis
Slide 16
1. Modeling Redundancy – The Problem
Consider a single context, e.g.: "cities such as x"
If an extraction x appears k times in a set of n sentences containing this pattern, what is the probability that x C?
Slide 17
Modeling with k
“…countries such as Saudi Arabia…”
“…countries such as the United States…”
“…countries such as Saudi Arabia…”
“…countries such as Japan…”
“…countries such as Africa…”
“…countries such as Japan…”
“…countries such as the United Kingdom…”
“…countries such as Iraq…”
“…countries such as Afghanistan…”
“…countries such as Australia…”
Country(x) extractions, n = 10
Slide 18
Modeling with k
Noisy-Or Model: P_noisy-or(x ∈ C | x appears k times) = 1 − (1 − p)^k
where p is the probability that a single sentence is true, i.e., p = 0.9
Country(x) extractions, n = 10:
Saudi Arabia: k = 2, P_noisy-or = 0.99
Japan: k = 2, P_noisy-or = 0.99
United States: k = 1, P_noisy-or = 0.9
Africa: k = 1, P_noisy-or = 0.9
United Kingdom: k = 1, P_noisy-or = 0.9
Iraq: k = 1, P_noisy-or = 0.9
Afghanistan: k = 1, P_noisy-or = 0.9
Australia: k = 1, P_noisy-or = 0.9
Important – noisy-or ignores:
– Sample size (n)
– Distribution of C
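The noisy-or model above can be sketched in a few lines (a minimal illustration; `noisy_or` is my own helper name):

```python
def noisy_or(k: int, p: float = 0.9) -> float:
    """Noisy-or: P(x in C | x appears k times) = 1 - (1 - p)^k,
    where p is the probability that a single sentence is true."""
    return 1.0 - (1.0 - p) ** k

# Values from the slide (p = 0.9):
print(round(noisy_or(2), 2))  # Saudi Arabia, k = 2
print(round(noisy_or(1), 2))  # United States, k = 1
```

Note that k is the only input: the sample size n and the distribution of C never enter the computation, which is exactly the weakness the slide points out.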
Slide 19
Needed in Model: Sample Size
Country(x) extractions, n ≈ 50,000:
Japan: k = 1723, P_noisy-or = 0.9999…
Norway: k = 295, P_noisy-or = 0.9999…
Israil: k = 1, P_noisy-or = 0.9
OilWatch Africa: k = 1, P_noisy-or = 0.9
Religion Paraguay: k = 1, P_noisy-or = 0.9
Chicken Mole: k = 1, P_noisy-or = 0.9
Republics of Kenya: k = 1, P_noisy-or = 0.9
Atlantic Ocean: k = 1, P_noisy-or = 0.9
New Zeland: k = 1, P_noisy-or = 0.9
Country(x) extractions, n = 10:
Saudi Arabia: k = 2, P_noisy-or = 0.99
Japan: k = 2, P_noisy-or = 0.99
United States: k = 1, P_noisy-or = 0.9
Africa: k = 1, P_noisy-or = 0.9
United Kingdom: k = 1, P_noisy-or = 0.9
Iraq: k = 1, P_noisy-or = 0.9
Afghanistan: k = 1, P_noisy-or = 0.9
Australia: k = 1, P_noisy-or = 0.9
As sample size increases, noisy-or becomes inaccurate.
Slide 20
Needed in Model: Distribution of C
P_freq(x ∈ C | x appears k times) = 1 − (1 − p)^(1000k/n)
Country(x) extractions, n ≈ 50,000:
Japan: k = 1723, P_noisy-or = 0.9999…
Norway: k = 295, P_noisy-or = 0.9999…
Israil: k = 1, P_noisy-or = 0.9
OilWatch Africa: k = 1, P_noisy-or = 0.9
Religion Paraguay: k = 1, P_noisy-or = 0.9
Chicken Mole: k = 1, P_noisy-or = 0.9
Republics of Kenya: k = 1, P_noisy-or = 0.9
Atlantic Ocean: k = 1, P_noisy-or = 0.9
New Zeland: k = 1, P_noisy-or = 0.9
Slide 21
Needed in Model: Distribution of C
P_freq(x ∈ C | x appears k times) = 1 − (1 − p)^(1000k/n)
Country(x) extractions, n ≈ 50,000:
Japan: k = 1723, P_freq = 0.9999…
Norway: k = 295, P_freq = 0.9999…
Israil: k = 1, P_freq = 0.05
OilWatch Africa: k = 1, P_freq = 0.05
Religion Paraguay: k = 1, P_freq = 0.05
Chicken Mole: k = 1, P_freq = 0.05
Republics of Kenya: k = 1, P_freq = 0.05
Atlantic Ocean: k = 1, P_freq = 0.05
New Zeland: k = 1, P_freq = 0.05
Slide 22
Needed in Model: Distribution of C
City(x) extractions, n ≈ 50,000:
Toronto: k = 274, P_freq = 0.9999…
Belgrade: k = 81, P_freq = 0.98
Lacombe: k = 1, P_freq = 0.05
Kent County: k = 1, P_freq = 0.05
Nikki: k = 1, P_freq = 0.05
Ragaz: k = 1, P_freq = 0.05
Villegas: k = 1, P_freq = 0.05
Cres: k = 1, P_freq = 0.05
Northeastwards: k = 1, P_freq = 0.05
Probability that x ∈ C depends on the distribution of C.
Country(x) extractions, n ≈ 50,000:
Japan: k = 1723, P_freq = 0.9999…
Norway: k = 295, P_freq = 0.9999…
Israil: k = 1, P_freq = 0.05
OilWatch Africa: k = 1, P_freq = 0.05
Religion Paraguay: k = 1, P_freq = 0.05
Chicken Mole: k = 1, P_freq = 0.05
Republics of Kenya: k = 1, P_freq = 0.05
Atlantic Ocean: k = 1, P_freq = 0.05
New Zeland: k = 1, P_freq = 0.05
Slide 23
The URNS Model – Single Urn
Slide 24
The URNS Model – Single Urn
U.K.
Sydney
Urn for City(x)
Cairo
Tokyo
Tokyo
Atlanta
Atlanta
Yakima
Utah
U.K.
Slide 25
Tokyo
The URNS Model – Single Urn
U.K.
Sydney
Urn for City(x)
Cairo
Tokyo
Tokyo
Atlanta
Atlanta
Yakima
Utah
U.K.
…cities such as Tokyo…
Slide 26
Single Urn – Formal Definition
C – set of unique target labels
E – set of unique error labels
num(b) – number of balls labeled by b ∈ C ∪ E
num(B) – distribution giving the number of balls for each label b ∈ B
Slide 27
Single Urn Example
num(“Atlanta”) = 2
num(C) = {2, 2, 1, 1, 1}
num(E) = {2, 1}
Estimated from data
U.K.
Sydney
Urn for City(x)
Cairo
Tokyo
Tokyo
Atlanta
Atlanta
Yakima
Utah
U.K.
Slide 28
Single Urn: Computing Probabilities
If an extraction x appears k times in a set of n sentences containing a pattern, what is the probability that x ∈ C?
Slide 29
Single Urn: Computing Probabilities
Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?
P(x ∈ C | x appears k times in n draws) = [ Σ_{c ∈ C} (num(c)/s)^k (1 − num(c)/s)^(n−k) ] / [ Σ_{b ∈ C ∪ E} (num(b)/s)^k (1 − num(b)/s)^(n−k) ]
where s is the total number of balls in the urn
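The single-urn estimate can be implemented directly (a sketch; `urns_prob` is my own helper name, and the form relies on the binomial coefficients canceling between numerator and denominator):

```python
def urns_prob(k: int, n: int, num_C: list[int], num_E: list[int]) -> float:
    """Single-urn estimate of P(x in C | x appears k times in n draws).
    num_C / num_E give the number of balls for each target / error label."""
    s = sum(num_C) + sum(num_E)  # total balls in the urn

    def weight(r: int) -> float:
        # Likelihood (up to a common binomial coefficient) of drawing a ball
        # with r copies exactly k times out of n draws with replacement.
        return (r / s) ** k * (1 - r / s) ** (n - k)

    target = sum(weight(r) for r in num_C)
    total = target + sum(weight(r) for r in num_E)
    return target / total

# Urn from the City(x) example: num(C) = {2, 2, 1, 1, 1}, num(E) = {2, 1}
p = urns_prob(1, 1, [2, 2, 1, 1, 1], [2, 1])
print(round(p, 2))  # with a single draw, just the fraction of target balls
```

Unlike noisy-or, the estimate depends on the sample size n and on the label distributions num(C), num(E).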
Slide 30
Uniform Special Case
Consider the case where num(c_i) = R_C and num(e_j) = R_E for all c_i ∈ C, e_j ∈ E.
Then, using a Poisson approximation:
Odds(x ∈ C | x appears k times in n draws) ≈ (|C| / |E|) · (R_C / R_E)^k · e^(−n(R_C − R_E)/s)
Odds increase exponentially with k, but decrease exponentially with n.
Slide 31
The URNS Model – Multiple Urns
Correlation across contexts is higher for elements of C than for elements of E.
Slide 32
Unsupervised Performance
[Bar chart: deviation from ideal log likelihood (scale 0–5) on the City, Film, Country, and MayorOf classes, comparing urns, noisy-or, and pmi.]
Slide 33
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for DH
5) Chez KnowItAll
Outline
Slide 34
Redundancy fails on "sparse" facts
[Plot: number of times the extraction appears in the pattern (0–500) vs. frequency rank of the extraction (0–100,000).]
Frequent extractions tend to be correct, e.g., (Michael Bloomberg, New York City)
Sparse extractions are a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)
Slide 35
Terms: Miami, Twisp, Star Wars
Contexts: "X soundtrack", "he visited X and", "cities such as X", "X and other cities", "X lodging"
Miami: . . . 98 0 20 250 30 513 . . .
Twisp: . . . 5 0 1 2 1 1 . . .
Star Wars: . . . 1 1000 0 2 1 1 . . .
KnowItAll Hypothesis
Distributional Hypothesis
Slide 37
Assessing Sparse Extractions
Task: Identify which sparse extractions are correct.
Strategy:
1. Build a model of how common extractions occur in text
2. Rank sparse extractions by fit to model
• The distributional hypothesis!
Our contribution: Unsupervised language models.
– Methods for mitigating sparsity
– Precomputed, so greatly improved scalability
Slide 38
The REALM Architecture
RElation Assessment using Language Models
Input: set of extractions for relation R:
E_R = {(arg1_1, arg2_1), …, (arg1_M, arg2_M)}
1) Seeds: S_R = the s most frequent pairs in E_R (assume these are correct)
2) Output: a ranking of each (arg1, arg2) ∈ E_R by distributional similarity to the seeds (seed1, seed2) ∈ S_R
Slide 39
Distributional Similarity (1)
N-gram Language Model:
Estimate P(w_i | w_{i−1}, …, w_{i−k})
Number of parameters scales with (Vocab. Size)^(k+1)
w_{i−k} … w_{i−1} w_i
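To make that scaling concrete, a back-of-the-envelope computation (the vocabulary size is an assumed, illustrative number):

```python
# One parameter per possible (k+1)-gram: the table size is |V|^(k+1).
vocab_size = 100_000  # assumed vocabulary size, for illustration only

for k in (1, 2, 3):
    params = vocab_size ** (k + 1)
    print(f"k={k}: about {params:.0e} parameters")
```

Even a bigram model (k = 1) over this vocabulary already implies on the order of 10^10 parameters, which is why conditioning on long contexts quickly becomes infeasible.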
Slide 40
Distributional Similarity (2)
Naïve Approach: compare context distributions:
P(w_g, …, w_j | seed1, seed2)
P(w_g, …, w_j | arg1, arg2)
But j − g can be large: many parameters, sparse data => inaccuracy
w_g … w_h seed1 w_{h+2} … w_i seed2 w_{i+2} … w_j
w_g … w_h arg1 w_{h+2} … w_i arg2 w_{i+2} … w_j
Slide 41
The REALM Architecture
Two steps for assessing R(arg1, arg2):
• Typechecking
– e.g., for AuthorOf(arg1, arg2): arg1 must be an author, arg2 a written work
– Valuable, but allows errors like AuthorOf(Danielle Steele, Hamlet)
• Relation Assessment
– Ensure R actually holds between arg1 and arg2
Both steps use small, pre-computed language models => scalable
Slide 42
Typechecking and HMM-T
Task: For each extraction (arg1, arg2) ∈ E_R, determine if arg1 and arg2 are the proper type for R.
Solution: Assume the seeds seed_j ∈ S_R are of the proper type, and rank each arg_j by distributional similarity to the seeds.
Computing Distributional Similarity:
1) Offline, train a Hidden Markov Model (HMM) of the corpus
2) Measure distance between arg_j and seed_j in the HMM's N-dimensional latent state space
Slide 43
HMM Language Model
k = 1 case:
t_i → t_{i+1} → t_{i+2} → t_{i+3}  (hidden states, t_i ∈ {1, …, N})
w_i   w_{i+1}  w_{i+2}  w_{i+3}  (words)
"cities such as Seattle"
Offline Training: Learn P(w | t) and P(t_i | t_{i−1}, …, t_{i−k}) to maximize the probability of the corpus (using EM).
Slide 44
HMM-T
The trained HMM gives a "distributional summary" of each w: the N-dimensional state distribution P(t | w).
Typecheck each arg by comparing state distributions:
f(arg) = (1 / |seeds|) Σ_i KL( P(t | seed_i) || P(t | arg) )
Rank extractions in ascending order of f(arg) summed over arguments.
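A minimal sketch of this KL-based scoring, with tiny hypothetical 3-state distributions (the real model uses N-dimensional distributions P(t | w) learned by the HMM):

```python
import math

def kl(p: list[float], q: list[float]) -> float:
    """KL divergence KL(p || q) between discrete state distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def f(arg_dist: list[float], seed_dists: list[list[float]]) -> float:
    """HMM-T score: mean KL from each seed's P(t | seed) to P(t | arg).
    Lower f(arg) means the argument looks more like the seeds."""
    return sum(kl(s, arg_dist) for s in seed_dists) / len(seed_dists)

# Hypothetical state distributions: two city-like seeds vs. two candidates.
seeds = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
city_like = [0.65, 0.25, 0.10]   # should typecheck as a city
film_like = [0.05, 0.15, 0.80]   # should not

print(f(city_like, seeds) < f(film_like, seeds))  # True
```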
Slide 45
Why not use context vectors?
Contexts: "when he visited X", "he visited X and", "visited X and other", "X and other cities"
Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . . 0 0 0 1 . . . >
Problems:
– Vectors are large
– Intersections are sparse
Slide 46
HMM-T Advantages (1)
Miami (context vector): < . . . 71 25 1 513 . . . >
P(t | Miami): 0.14 0.01 … 0.06, for t = 1, 2, …, N
The latent state distribution P(t | w) is:
– Compact (efficient – 10-50x less data retrieved)
– Dense (accurate)
Slide 47
HMM-T Advantages (2)
Is Pickerington of the same type as Chicago?
"Chicago , Illinois"   "Pickerington , Ohio"
Context-vector counts over "<x> , Illinois" and "<x> , Ohio":
Chicago: < 291, 0, … >
Pickerington: < 0, 1, … >
=> N-grams says no – the dot product is 0!
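The contrast can be sketched numerically; the context counts come from the slide, while the state distributions are hypothetical stand-ins for what an HMM might learn:

```python
# Context-vector counts over the contexts "<x> , Illinois" and "<x> , Ohio":
chicago = [291, 0]
pickerington = [0, 1]

# N-gram / context-vector view: no shared context, so the dot product is 0
# and the two terms look completely unrelated.
dot = sum(a * b for a, b in zip(chicago, pickerington))
print(dot)  # 0

# HMM view: hypothetical state distributions P(t | w); both terms put most
# of their mass on the same "city-like" latent state, so they compare as similar.
p_t_chicago = [0.8, 0.1, 0.1]
p_t_pickerington = [0.7, 0.2, 0.1]
overlap = sum(a * b for a, b in zip(p_t_chicago, p_t_pickerington))
print(overlap > 0)  # True
```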
Slide 48
HMM Generalizes:
Chicago , Illinois
Pickerington , Ohio
HMM-T Advantages (3)
Slide 49
HMM-T Limitations
Learning time is proportional to corpus size × T^(k+1)
T = number of latent states
k = HMM order
We use limited values: T = 20, k = 3
– Sufficient for typechecking (Santa Clara is a city)
– Too coarse for relation assessment (Santa Clara is where Intel is headquartered)
Slide 50
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for Formalizing the DH
5) Chez KnowItAll
Outline
Slide 51
Formalizing the Distributional Hypothesis
How is this not just semi-supervised or transductive learning?
– Starts with a prior estimate of P(x ∈ C), not hand-labeled examples.
– Features are counts.
Two alternative formalizations:
– Context Counts
– Distance Function
Don't yet have an expression for P(x ∈ C) – instead: basic formalizations, preliminary results.
Slide 52
Context Counts
Terms (rows); contexts (columns), ranging from reliable to unreliable:
. . . 920 600 293 20 2 1 . . .
. . . 20 110 930 3 0 1 . . .
. . . 43 30 0 1 0 2 . . .
As the corpus increases in size, the number of reliable contexts increases.
Slide 53
Context Counts
Terms (rows); contexts (columns), ranging from reliable to unreliable:
. . . 920 600 293 20 2 1 . . .
. . . 20 110 930 3 0 1 . . .
. . . 43 30 0 1 0 2 . . .
Basic idea: model each reliable context as a "single urn."
Slide 54
Context Counts – Assumptions
1) Only a term's reliable contexts are useful.
• Reliable contexts occur at least r times with the term.
2) Contexts are conditionally independent given C.
3) Terms and contexts are Zipf distributed.
Key question: how many reliable contexts co-occur with a given term in a corpus of n total tokens?
Can be computed in closed form given the above assumptions.
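The key question can be illustrated numerically under the slide's assumptions (Zipf-distributed contexts, independence of term and context occurrences); this is not the proposal's closed form, and `expected_reliable_contexts` and all numbers are hypothetical:

```python
def expected_reliable_contexts(n: int, p_term: float,
                               num_contexts: int, r: int) -> int:
    """Count contexts whose expected co-occurrence with a term of
    probability p_term reaches the reliability threshold r, in a
    corpus of n tokens, assuming Zipf-distributed contexts and
    independence of term and context occurrences."""
    # Zipf: the i-th most frequent context has probability proportional to 1/i.
    norm = sum(1.0 / i for i in range(1, num_contexts + 1))
    count = 0
    for i in range(1, num_contexts + 1):
        p_context = (1.0 / i) / norm
        if n * p_term * p_context >= r:  # expected co-occurrences >= r
            count += 1
    return count

# A larger corpus yields more reliable contexts (all numbers illustrative):
small = expected_reliable_contexts(n=10**6, p_term=1e-4, num_contexts=1000, r=3)
large = expected_reliable_contexts(n=10**8, p_term=1e-4, num_contexts=1000, r=3)
print(small, large)
```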
Slide 55
Preliminary Result (1)
Assume that the Bayes risk for a classifier using just one context is at least β. Then, for a corpus of n tokens over a vocabulary V and context set Π, the expected accuracy of the resulting classifier can be bounded above.
Slide 56
Preliminary Result (2)
Provides non-trivial bounds.
Google n-grams data set (roughly):
n = 1,000,000,000,000
|V| = 15,000,000
|Π| = 1,000,000,000
Setting β = 0.45, we get E[accuracy] ≤ 0.85.
Slide 57
Alternate Formalization: Distance Functions
[Plot: P(x, y same class | distance(x, y)), on a 0–1 scale, as a function of distance(x, y).]
Slide 58
Distance Functions
Key Formal Problem:
Given a distance function d(x, y) and a prior over P(x ∈ C), what is P(x ∈ C | prior, d(x_i, y_j) for i, j ∈ V)?
Straightforward to compute, but:
Requires (naively) summing over the power set of V.
Slide 59
Empirical Investigation
Either formalization is governed by parameters, some specific to C, others more global.
Proposed Experiments – with a variety of classes, measure empirically:
Context Counts:
– Urn parameters for contexts
– Dependence between contexts
Distance Functions:
– Observed distance functions, as a function of term frequency, corpus size, and class prevalence.
Slide 60
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for DH
5) Chez KnowItAll
Outline
Slide 61
Theoretical Questions:
– Entrée: DH Formalisms (Distance Functions, Context Counts, something else?)
– Sides: relationship between KH and DH; generative textual models yielding the hypotheses.
Empirical Questions:
– Improving REALM's language modeling techniques
– Modeling polysemy
– Language modeling accuracy vs. IE accuracy
– Applying HMM-T to NER
Slide 62
Entrée: DH Formalisms
Context Counts advantages:
– Explicitly models counts
– Leverages the Urns model
– Likely tractable
Distance Function advantages:
– Applicable to semi-supervised learning
– More "pure" instantiation of the DH
Slide 63
Theoretical Sides (1)
Relationship between KH and DH
Terms (rows); contexts (columns) ranging from DH-style generic, e.g., "(in $X)", to KH-style specific, e.g., "(cities such as $X)":
. . . 920 400 293 … 2 1 . . .
. . . 200 170 30 … 0 1 . . .
. . . 43 30 50 … 0 2 . . .
Slide 64
Theoretical Sides (2)
Is there a generative model of text that leads to the KH and DH?
E.g., if text is generated by an HMM…
Slide 65
Empirical Questions (1)
Improving REALM with language modeling enhancements: character-level models, syntax, PCFGs, etc.
Modeling polysemy: P(t | Chicago) is the same for Chicago the city and Chicago the musical.
Idea: an HMM that selectively bifurcates words into senses when this improves LM accuracy.
Slide 66
Empirical Questions (2)
Language modeling accuracy vs. information extraction accuracy: is the relationship monotonic?
Applying HMM-T to Named Entity Recognition
Slide 67
Thanks!