Administrivia

What: LTI Seminar
When: Friday Oct 30, 2009, 2:00pm - 3:00pm
Where: 1305 NSH
Faculty Host: Noah Smith
Title: SeerSuite: Enterprise Search and Cyberinfrastructure for Science and Academia
Speaker: Dr. C. Lee Giles, Pennsylvania State University

Cyberinfrastructure or e-science has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation, and support of cyberinfrastructure. However, there exists no open source integrated system for building an integrated search engine and digital library that focuses on all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, table indexing, etc. ...
Counts for two writeups if you attend!
Two-Page Status Report on Project - due Wed 11/2 at 9am

This is a chance to tell me how your project is progressing - what you've accomplished, and what you plan to do in the future. There's no fixed format, but here are some things you might discuss.
What dataset will you be using? What does it look like (e.g., how many entities are there, how many tokens, etc.)? Looking over the data is always a good first step before you start working with it - what did you do to get acquainted with the data?
Do you plan on looking at the same problem, or have you changed your plans?
If you plan on writing code, what have you written so far, in what languages, and what do you still need to do?
If you plan on using off-the-shelf code, what have you installed, and what experiences have you had with it?
If you've run a baseline system on the data and gotten some results, what are they? Are they consistent with what you expected?
Brin’s 1998 paper
Poon and Domingos – continued!
plus Bellare & McCallum
10-28-2009
Mostly pilfered from Pedro's slides
Idea: exploit “pattern/relation duality”:
1. Start with some seed instances of (author,title) pairs (e.g., “Isaac Asimov”, “The Robots of Dawn”)
2. Look for occurrences of these pairs on the web.
3. Generate patterns that match heuristically chosen subsets of the occurrences
- order, URLprefix, prefix, middle, suffix
4. Extract new (author, title) pairs that match the patterns.
5. Go to 2.
[some workshop, 1998]
Result: 24M web pages + 5 books → 199 occurrences → 3 patterns → 4047 occurrences + 5M pages → 3947 occurrences → 105 patterns → ... → 15,257 books
But:
- mostly learned "science fiction books", at least in early rounds
- some manual intervention
- special regexes for author/title used
[Figure: the pattern/relation duality cycle - Relation → Patterns → Relation → ...]
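The bootstrapping loop above can be sketched in a few lines of Python. This is a toy reconstruction, not Brin's implementation: the corpus, the seed pair, the fixed-length prefix/suffix contexts, and the author/title regexes are all illustrative assumptions.

```python
import re

# Toy corpus standing in for web pages; all pages and pairs are illustrative.
PAGES = [
    "books: Isaac Asimov, The Robots of Dawn. more text",
    "books: Arthur C. Clarke, Rendezvous with Rama. more text",
    "books: Frank Herbert, Dune. unrelated",
]
seeds = {("Isaac Asimov", "The Robots of Dawn")}

def find_occurrences(pairs, pages):
    """Steps 2-3: locate (prefix, middle, suffix) contexts around known pairs."""
    occs = []
    for page in pages:
        for author, title in pairs:
            for m in re.finditer(re.escape(author) + "(.*?)" + re.escape(title), page):
                pre = page[max(0, m.start() - 7):m.start()]  # short context window
                suf = page[m.end():m.end() + 1]
                occs.append((pre, m.group(1), suf))
    return occs

def extract(patterns, pages):
    """Step 4: apply each induced pattern to harvest new (author, title) pairs."""
    found = set()
    for pre, mid, suf in patterns:
        rx = (re.escape(pre) + r"([A-Z][\w. ]+?)" + re.escape(mid)
              + r"([A-Z][\w ]+?)" + re.escape(suf))
        for page in pages:
            for m in re.finditer(rx, page):
                found.add((m.group(1), m.group(2)))
    return found

pairs = set(seeds)
for _ in range(2):                                    # step 5: iterate
    patterns = set(find_occurrences(pairs, PAGES))
    pairs |= extract(patterns, PAGES)
print(sorted(pairs))
```

Each round turns the known pairs into (prefix, middle, suffix) patterns and then harvests whatever new pairs those patterns match, mirroring steps 1-5 above.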
Markov Networks: [Review] Undirected graphical models
[Figure: undirected graph over variables Smoking, Cough, Asthma, Cancer]
Potential functions defined over cliques
Smoking Cancer Ф(S,C)
False False 4.5
False True 4.5
True False 2.7
True True 4.5
P(x) = (1/Z) ∏_c Φ_c(x_c),   Z = ∑_x ∏_c Φ_c(x_c)

Log-linear form:   P(x) = (1/Z) exp( ∑_i w_i f_i(x) )
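To make the definition concrete, here is a brute-force sketch using the Smoking/Cancer potential table above (a single clique, so the product over cliques has one factor):

```python
from itertools import product

# Clique potential Φ(Smoking, Cancer) from the table above
phi = {(False, False): 4.5, (False, True): 4.5,
       (True, False): 2.7, (True, True): 4.5}

# P(x) = (1/Z) * Π_c Φ_c(x_c); here there is a single clique {Smoking, Cancer}
Z = sum(phi[x] for x in product([False, True], repeat=2))
P = {x: phi[x] / Z for x in phi}

print(round(Z, 2))                 # 16.2
print(round(P[(True, False)], 4))  # 2.7 / 16.2 ≈ 0.1667
```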
First-Order Logic
Constants, variables, functions, predicates
E.g.: Anna, x, MotherOf(x), Friends(x, y)
Literal: Predicate or its negation
Clause: Disjunction of literals
Grounding: Replace all variables by constants
E.g.: Friends(Anna, Bob)
World (model, interpretation): Assignment of truth values to all ground predicates
Markov Logic: Intuition
A logical KB is a set of hard constraints on the set of possible worlds
Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible
Give each formula a weight (higher weight ⇒ stronger constraint)

P(world) ∝ exp( ∑ weights of formulas it satisfies )
Example: Friends & Smokers

1.5   ∀x  Smokes(x) ⇒ Cancer(x)
1.1   ∀x,y  Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

[Figure: ground network over Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]
Example: Friends & Smokers

1.5   ∀x  Smokes(x) ⇒ Cancer(x)
1.1   ∀x,y  Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Log-potential for the grounding Smokes(Anna) ⇒ Cancer(Anna):

Smokes(Anna)  Cancer(Anna)  W(edge: s(a)->c(a))
F             F             1.5
F             T             1.5
T             F             0
T             T             1.5
Example: Friends & Smokers

1.5   ∀x  Smokes(x) ⇒ Cancer(x)
1.1   ∀x,y  Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

Log-potential for the grounding Friends(A,B) ⇒ (Smokes(A) ⇔ Smokes(B)):

Friends(A,B)  Smokes(A)  Smokes(B)  W(f(a,b),s(a),s(b))
F             F          F          1.1
F             F          T          1.1
F             T          F          1.1
F             T          T          1.1
T             F          F          1.1
T             F          T          0
T             T          F          0
T             T          T          1.1
Markov Logic Networks

An MLN is a template for ground Markov nets. Probability of a world x:

P(x) = (1/Z) exp( ∑_i w_i n_i(x) )

where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x.
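A brute-force sketch of this definition for the two-constant Friends & Smokers example (formula weights 1.5 and 1.1). Enumerating all 2^8 = 256 worlds is only feasible because the domain is tiny; real MLN inference exists precisely to avoid this.

```python
from itertools import product
from math import exp

# World-probability sketch for the two-constant Friends & Smokers MLN.
consts = ["A", "B"]

def n_true_groundings(world):
    """world maps ground atoms like ('Smokes','A') to bools.
    Returns (n1, n2): counts of true groundings of the two formulas."""
    n1 = sum((not world[("Smokes", x)]) or world[("Cancer", x)] for x in consts)
    n2 = sum((not world[("Friends", x, y)]) or
             (world[("Smokes", x)] == world[("Smokes", y)])
             for x in consts for y in consts)
    return n1, n2

atoms = ([("Smokes", x) for x in consts] + [("Cancer", x) for x in consts] +
         [("Friends", x, y) for x in consts for y in consts])

def weight(world):
    n1, n2 = n_true_groundings(world)
    return exp(1.5 * n1 + 1.1 * n2)   # exp(Σ_i w_i n_i(x)), unnormalized

Z = 0.0
for vals in product([False, True], repeat=len(atoms)):
    Z += weight(dict(zip(atoms, vals)))

# Probability of one concrete world: nobody smokes, no cancer, no friendships
w0 = {a: False for a in atoms}
print(weight(w0) / Z)   # P(x) = (1/Z) exp(Σ_i w_i n_i(x))
```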
Parameter tying: Groundings of same clause
Generative learning: Pseudo-likelihood
Discriminative learning: Conditional likelihood [like CRFs - but we need to do inference. They use a Collins-like method that computes expectations near a MAP solution. -WC]
Weight Learning

∂/∂w_i log P_w(x) = n_i(x) − E_w[n_i(x)]

n_i(x): no. of times clause i is true in the data
E_w[n_i(x)]: expected no. of times clause i is true according to the MLN
MAP/MPE Inference

Problem: find the most likely state of the world given evidence

argmax_y ∑_i w_i n_i(x,y)

This is just the weighted MaxSAT problem: use a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997]).
The MaxWalkSAT Algorithm

for i ← 1 to max-tries do
    solution = random truth assignment
    for j ← 1 to max-flips do
        if ∑ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes ∑ weights(sat. clauses)
return failure, best solution found
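A runnable sketch of the loop above. The clause encoding ((weight, literal-list) pairs, a clause being satisfied if any literal matches) and the toy weighted instance are my own illustration, not from the slides.

```python
import random

# Toy weighted SAT instance (illustrative):
CLAUSES = [
    (1.5, [("S", False), ("C", True)]),   # Smokes => Cancer
    (2.0, [("S", True)]),                 # soft evidence: Smokes
    (1.0, [("C", False)]),                # soft prior: not Cancer
]
VARS = ["S", "C"]

def sat_weight(assign):
    """Total weight of satisfied clauses under this truth assignment."""
    return sum(w for w, lits in CLAUSES
               if any(assign[v] == val for v, val in lits))

def maxwalksat(max_tries=10, max_flips=100, p=0.5, threshold=3.4, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(max_tries):
        assign = {v: rng.random() < 0.5 for v in VARS}   # random restart
        for _ in range(max_flips):
            if sat_weight(assign) > threshold:
                return assign
            unsat = [lits for w, lits in CLAUSES
                     if not any(assign[v] == val for v, val in lits)]
            if not unsat:
                break
            lits = rng.choice(unsat)                     # random unsat clause
            if rng.random() < p:
                var = rng.choice(lits)[0]                # random-walk move
            else:                                        # greedy move
                var = max((v for v, _ in lits),
                          key=lambda v: sat_weight({**assign, v: not assign[v]}))
            assign[var] = not assign[var]
        if best is None or sat_weight(assign) > sat_weight(best):
            best = dict(assign)
    return best

print(maxwalksat())
```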
MAP=WalkSat, Expectations=????
MCMC? Deterministic dependencies break MCMC; near-deterministic ones make it very slow
Solution: combine MCMC and WalkSAT
→ MC-SAT algorithm [Poon & Domingos, 2006]
Slice Sampling [Damien et al. 1999]

[Figure: slice sampling - given x(k), draw u(k) uniformly from [0, P(x(k))], then draw x(k+1) uniformly from the slice {x : P(x) ≥ u(k)}]
The MC-SAT Algorithm

X(0) ← a random solution satisfying all hard clauses
for k ← 1 to num_samples
    M ← Ø
    forall Ci satisfied by X(k-1)
        with prob. 1 − exp(−wi) add Ci to M
    endfor
    X(k) ← a uniformly random solution satisfying M
endfor
The "uniformly random solution" step is approximated with "SampleSat": MaxWalkSat + Simulated Annealing
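A runnable sketch of MC-SAT on a toy two-variable domain. The clause encoding ((weight, literal-list), satisfied if any literal matches) and both clauses are my own illustration. Because the domain is tiny, the "uniformly random solution satisfying M" step is done by exact enumeration rather than SampleSat. The printed value estimates the marginal P(C = True).

```python
import random
from math import exp
from itertools import product

CLAUSES = [
    (1.5, [("S", False), ("C", True)]),  # Smokes => Cancer
    (1.1, [("S", True), ("C", False)]),  # another soft clause (illustrative)
]
VARS = ["S", "C"]

def satisfied(lits, assign):
    return any(assign[v] == val for v, val in lits)

def mc_sat(num_samples=5000, seed=0):
    rng = random.Random(seed)
    x = {v: rng.random() < 0.5 for v in VARS}   # no hard clauses here
    count_c = 0
    for _ in range(num_samples):
        # Keep each currently-satisfied clause with prob 1 - exp(-w)
        M = [lits for w, lits in CLAUSES
             if satisfied(lits, x) and rng.random() < 1 - exp(-w)]
        # Uniform over assignments satisfying M (exact, since only 4 worlds)
        sols = [dict(zip(VARS, vals))
                for vals in product([False, True], repeat=len(VARS))
                if all(satisfied(lits, dict(zip(VARS, vals))) for lits in M)]
        x = rng.choice(sols)
        count_c += x["C"]
    return count_c / num_samples     # estimated marginal P(C = True)

print(round(mc_sat(), 3))
```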
What can you do with MLNs?
Problem: Given database, find duplicate records
HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(+t,+f,r) ^ HasToken(+t,+f,r') => SameField(f,r,r')
SameField(f,r,r') => SameRecord(r,r')
SameRecord(r,r') ^ SameRecord(r',r") => SameRecord(r,r")
Entity Resolution
Can also resolve fields:
HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(+t,+f,r) ^ HasToken(+t,+f,r') => SameField(f,r,r')
SameField(f,r,r') <=> SameRecord(r,r')
SameRecord(r,r') ^ SameRecord(r',r") => SameRecord(r,r")
SameField(f,r,r') ^ SameField(f,r',r") => SameField(f,r,r")
P. Singla & P. Domingos, "Entity Resolution with Markov Logic", in Proc. ICDM-2006.
Entity Resolution
Hidden Markov Models
obs = { Obs1, ..., ObsN }
state = { St1, ..., StM }
time = { 0, ..., T }

State(state!,time)
Obs(obs!,time)

State(+s,0)
State(+s,t) => State(+s',t+1)
State(+s,t) => State(+s,t+1) [variant we'll use -WC]
Obs(+o,t) => State(+s,t)
What did P&D do with MLNs?
Information Extraction (simplified)
Problem: Extract a database from text or semi-structured sources

Example: extract a database of publications from citation list(s) (the "CiteSeer problem")

Two steps:
Segmentation: use an HMM to assign tokens to fields
Entity resolution: use logistic regression and transitivity
Motivation for joint extraction and matching
Token(token, position, citation)
InField(position, field, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) <=> InField(i+1,+f,c)
f != f' => (!InField(i,+f,c) v !InField(i,+f',c))

Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i',c') ^ InField(i',+f,c') => SameField(+f,c,c')
SameField(+f,c,c') <=> SameCit(c,c')
SameField(f,c,c') ^ SameField(f,c',c") => SameField(f,c,c")
SameCit(c,c') ^ SameCit(c',c") => SameCit(c,c")
Information Extraction (simplified)
Token(token, position, citation)
InField(position, field, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(".",i,c) <=> InField(i+1,+f,c)
f != f' => (!InField(i,+f,c) v !InField(i,+f',c))

Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i',c') ^ InField(i',+f,c') => SameField(+f,c,c')
SameField(+f,c,c') <=> SameCit(c,c')
SameField(f,c,c') ^ SameField(f,c',c") => SameField(f,c,c")
SameCit(c,c') ^ SameCit(c',c") => SameCit(c,c")
More: H. Poon & P. Domingos, "Joint Inference in Information Extraction", in Proc. AAAI-2007.
Information Extraction (less simplified)
Token(+t,i,c) => InField(i,+f,c)
!Token("aardvark",i,c) v InField(i,"author",c)
...
!Token("zymurgy",i,c) v InField(i,"author",c)
...
!Token("zymurgy",i,c) v InField(i,"venue",c)
Information Extraction (less simplified)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,"author",c)
=> InField(2,"author",c)
=> InField(midpointOfC, "title", c) [computed off-line -WC]
Information Extraction (less simplified)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,"author",c)
=> InField(2,"author",c)
Center(c,i) => InField(i, "title", c)
Information Extraction (less simplified)
Token(w,i,c) ^ IsAlphaChar(w) ^ FollowBy(c,i,".") => InField(c,"author",i) v InField(c,"venue",i)
LastInitial(c,i) ^ LessThan(j,i) => !InField(c,"title",j) ^ !InField(c,"venue",j)
FirstInitial(c,i) ^ LessThan(i,j) => InField(c,"author",j)
FirstVenueKeyword(c,i) ^ LessThan(i,j) => !InField(c,"author",j) ^ !InField(c,"title",j)

Initials tend to appear in the author or venue field.
Positions before the last non-venue initial are usually not title or venue.
Positions after the first "venue keyword" are usually not author or title.
Information Extraction (less simplified)

SimilarTitle(c,i,j,c',i',j'): true if
- c[i..j] and c'[i'..j'] are both "title-like", i.e., no punctuation and doesn't violate the rules above
- c[i..j] and c'[i'..j'] are "similar", i.e., start with the same trigram and end with the same token

SimilarVenue(c,c'): true if c and c' don't contain conflicting venue keywords (e.g., journal vs. proceedings)
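These predicates can be sketched in code. The token-level details below (trigram = first three tokens of the span, the punctuation test, and the keyword list) are my reading of the slide, not the paper's exact definitions.

```python
import string

# Hedged sketch of SimilarTitle / SimilarVenue; details are assumptions.
VENUE_KEYWORDS = {"journal", "proceedings", "conference", "workshop"}

def title_like(tokens):
    """No punctuation tokens inside the candidate span."""
    return all(t not in string.punctuation for t in tokens)

def similar_title(c, i, j, c2, i2, j2):
    a, b = c[i:j + 1], c2[i2:j2 + 1]
    return (title_like(a) and title_like(b)
            and a[:3] == b[:3]          # start with the same trigram
            and a[-1] == b[-1])         # end with the same token

def similar_venue(c, c2):
    """True unless the two citations contain conflicting venue keywords."""
    k1 = {t.lower() for t in c} & VENUE_KEYWORDS
    k2 = {t.lower() for t in c2} & VENUE_KEYWORDS
    return not (k1 and k2 and k1 != k2)

c1 = "H Poon Joint inference in information extraction AAAI".split()
c2 = "Poon H Joint inference in information extraction Proc AAAI".split()
print(similar_title(c1, 2, 6, c2, 2, 6))  # True
```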
Information Extraction (less simplified)

SimilarTitle(c,i,j,c',i',j'): ...
SimilarVenue(c,c'): ...
JointInferenceCandidate(c,i,c'):
- trigram starting at i in c also appears in c'
- and the trigram is a possible title
- and there is punctuation before the trigram in c' but not in c
Information Extraction (less simplified)

SimilarTitle(c,i,j,c',i',j'): ...
SimilarVenue(c,c'): ...
JointInferenceCandidate(c,i,c'): ...

[was: InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)]
InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c': JointInferenceCandidate(c,i,c')) => InField(i+1,+f,c)
Jnt-Seg

Why is this joint? Recall we also have:
Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i',c') ^ InField(i',+f,c') => SameField(+f,c,c')
Information Extraction (less simplified)

SimilarTitle(c,i,j,c',i',j'): ...
SimilarVenue(c,c'): ...
JointInferenceCandidate(c,i,c'): ...

InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c': JointInferenceCandidate(c,i,c') ^ SameCitation(c,c')) => InField(i+1,+f,c)

Jnt-Seg-ER
Results: segmentation
Percent error reduction for best joint model
Results: matching
Cora F-S: 0.87 F1
Cora TFIDF: 0.84 max F1
Fraction of clusters correctly constructed using transitive closure of pairwise decisions
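Building clusters as the transitive closure of pairwise same-citation decisions is a union-find computation; a minimal sketch (the pair decisions below are invented for illustration):

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Pairwise "same citation" decisions among 5 citation records (made up)
pairs = [(0, 1), (1, 2), (3, 4)]
uf = UnionFind(5)
for a, b in pairs:
    uf.union(a, b)

clusters = {}
for i in range(5):
    clusters.setdefault(uf.find(i), []).append(i)
print(sorted(clusters.values()))  # [[0, 1, 2], [3, 4]]
```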
William's summary

MLNs are a compact, elegant way of describing a Markov network
- Standard learning methods work
- The network may be very, very large
- Inference may be expensive
- Doesn't eliminate feature engineering, e.g., complicated "feature" predicates

Experimental results for joint matching/NER are not that strong overall
- Cascading segmentation and then matching improves segmentation, maybe not matching
- But it needs to be carefully restricted (efficiency?)
Bellare & McCallum
Outline
Goal: Given (DBLP record, citation-text) pairs that do match, learn to segment citations.
Methods:
- Learn a CRF to align the record and text (sort of like learning an edit distance)
- Generate alignments, and use them as training data for a linear-chain CRF that does segmentation (aka extraction). This CRF does not need records to work.
Alignment...

Notation:
- x1: list of tokens from DB fields
- y1: list of DB field names
- x2: list of tokens from the text
- a: a pair (i,j) where x1[i] ~ x2[j]
- f(a, x1, y1, x2): a feature, e.g.
  editDist(x1[i], x2[j]) ^ y1[i] = "booktitle" ^ x2[j] = "EMNLP"
Alignment feature: depends on a and x’s
Extraction feature: depends on a, y1 and x2
Learning for alignment... Generalized expectation criterion: rather than minimizing Edata[f] − Emodel[f] plus a penalty term on the weights, minimize a weighted squared difference between Emodel[f] and p, where p is the user's prior on the value of the feature.

(Emodel[f]: sum of marginal probabilities divided by the size of the variable set)
“We simulate user-specified expectation criteria [i.e. p’s] with statistics on manually labeled citation texts.” … top 10 features by MI, p in 11 bins, w=10
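A numeric toy illustrating the generalized-expectation-style objective: penalize the squared gap between a model expectation of a feature and a user-supplied target p. The one-variable logistic "model" and all numbers here are illustrative, not from the paper.

```python
from math import exp

w_penalty = 10.0
p_target = 0.8           # user's prior on E[f]

def model_expectation(theta):
    """E_model[f] for one binary variable with f = 1[x = 1],
    under a logistic model P(x=1) = sigmoid(theta)."""
    return 1.0 / (1.0 + exp(-theta))

def ge_objective(theta):
    gap = model_expectation(theta) - p_target
    return w_penalty * gap * gap

# The objective is minimized when E_model[f] matches the prior p:
print(round(ge_objective(0.0), 3))    # E=0.5, gap=-0.3 → 10*0.09 = 0.9
print(round(ge_objective(1.386), 3))  # E≈0.8, gap≈0 → ≈0.0
```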
Results

On 260 records, 522 record-text pairs
Results
Systems compared:
- "Gold standard": hand-labeled extraction data
- Trained on DBLP records
- Trained on records partially aligned with high-precision rules
- ...and also using partial match to DB records at test time
- CRF trained with extraction criteria derived from labeled data
Alignments and expectations
Simplified version of the idea: from Learning String Edit Distance, Ristad and Yianilos, PAMI 1998
HMM Example
Notation:
- λ: parameters of the HMM
- δ(l → l'): transition probability, parameter Pr(s_{t+1} = l' | s_t = l)
- ε(l, x): emission probability, parameter Pr(x_t = x | s_t = l)
- x^T = x_1 x_2 ... x_T: input string, aka x
- x^t = x_1 ... x_t: a prefix of x
- x_u ... x_t: a substring of x
- s_t: state the HMM is in after emitting x_1 ... x_t, s_t ∈ {l_1, ..., l_K}
[Figure: two states 1 and 2 with transition probabilities Pr(1->1), Pr(1->2), Pr(2->1), Pr(2->2)]

Pr(1->x):  d 0.3,  h 0.5,  b 0.2
Pr(2->x):  a 0.3,  e 0.5,  o 0.2

Sample output: x^T = heehahaha, s^T = 122121212
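Sampling from the two-state HMM above can be sketched directly. The emission tables are from the slide; the transition probabilities and start state are assumptions, since the figure only names them.

```python
import random

TRANS = {1: {1: 0.5, 2: 0.5}, 2: {1: 0.5, 2: 0.5}}   # Pr(l -> l'), assumed
EMIT = {1: {"d": 0.3, "h": 0.5, "b": 0.2},           # Pr(1 -> x), from slide
        2: {"a": 0.3, "e": 0.5, "o": 0.2}}           # Pr(2 -> x), from slide

def sample(T, start=1, seed=4):
    """Generate (x^T, s^T) by alternating emissions and transitions."""
    rng = random.Random(seed)
    s, x, state = [], [], start
    for _ in range(T):
        x.append(rng.choices(list(EMIT[state]),
                             weights=list(EMIT[state].values()))[0])
        s.append(state)
        state = rng.choices(list(TRANS[state]),
                            weights=list(TRANS[state].values()))[0]
    return "".join(x), "".join(map(str, s))

xs, ss = sample(9)
print(xs, ss)
```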
HMM Inference
[Figure: K×T trellis - states l = 1..K (rows) against positions t = 1..T (columns), with observations x_1 x_2 x_3 ... x_T along the bottom]

Key point: Pr(s_i = l) depends only on Pr(l' -> l) and s_{i-1}, so you can propagate probabilities forward...
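The forward propagation can be sketched as a few lines over the trellis. Transitions and the start state are again assumed (uniform transitions, start in state 1); emissions are from the slide.

```python
TRANS = {1: {1: 0.5, 2: 0.5}, 2: {1: 0.5, 2: 0.5}}
EMIT = {1: {"d": 0.3, "h": 0.5, "b": 0.2},
        2: {"a": 0.3, "e": 0.5, "o": 0.2}}
START = {1: 1.0, 2: 0.0}   # assume we always start in state 1

def forward(x):
    """Return Pr(x) by propagating Pr(s_t = l, x^t) left to right
    and summing the last trellis column."""
    alpha = {l: START[l] * EMIT[l].get(x[0], 0.0) for l in EMIT}
    for ch in x[1:]:
        alpha = {l: EMIT[l].get(ch, 0.0) *
                    sum(alpha[lp] * TRANS[lp][l] for lp in alpha)
                 for l in EMIT}
    return sum(alpha.values())

print(forward("he"))  # 0.5 * (0.5 * 0.5) = 0.125: h from state 1, e from state 2
```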
Pair HMM Notation
- λ: parameters of a "pair HMM"
- E = ({a} × {b}) ∪ ({a} × {−}) ∪ ({−} × {b}) ∪ {END}: emissions/edits
- ε(e): emission probability, also written ε(a,b) for e = <a,b>
- x, y: input strings, aka x^T, y^V
- z = z_1 ... z_n: string of edits, z_k ∈ E
- r: hidden property associated with {1..T} × {1..V}
Andrew used “null”
Pair HMM Example
A single state 1, with emission table:

e      Pr(e)
<a,a>  0.10
<e,e>  0.10
<h,h>  0.10
<a,->  0.05
<h,t>  0.05
<-,h>  0.01
...    ...
Pair HMM Example

e      Pr(e)
<a,a>  0.10
<e,e>  0.10
<h,h>  0.10
<e,->  0.05
<h,t>  0.05
<-,h>  0.01
...    ...
Sample run: z^T = <h,t>,<e,e>,<e,e>,<h,h>,<e,->,<e,e>
Strings x,y produced by z^T: x = heehee, y = teehe
Notice that x,y is also produced by z^4 + <e,e>,<e,-> and many other edit strings
Distances based on pair HMMs
Pr(x^T, y^V | λ) = ∑_n ∑_{z ∈ EDIT^n : z → (x^T, y^V)} Pr(z | λ)

d_stochastic(x^T, y^V) = −log Pr(x^T, y^V | λ)

d_viterbi(x^T, y^V) = −log max_{n, z ∈ EDIT^n : z → (x^T, y^V)} Pr(z | λ)
Pair HMM Inference
α(t,v) = Pr(x^t, y^v)
       = ε(x_t, y_v) · α(t−1, v−1) + ε(x_t, −) · α(t−1, v) + ε(−, y_v) · α(t, v−1)

[Figure: cell (t,v) computed from cells (t−1, v−1), (t−1, v), and (t, v−1)]

Dynamic programming is possible: fill out the matrix left-to-right, top-down
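The α(t,v) recurrence can be sketched as a dynamic program. The edit-probability table below is a small illustrative subset, with None playing the role of the "−" gap symbol.

```python
EPS = {("h", "t"): 0.05, ("e", "e"): 0.10, ("h", "h"): 0.10,
       ("e", None): 0.05, (None, "h"): 0.01}   # illustrative subset

def pr_edit(a, b):
    return EPS.get((a, b), 0.0)

def alpha(x, y):
    """a[t][v] = Pr(x[:t], y[:v]); fill the matrix left-to-right, top-down."""
    T, V = len(x), len(y)
    a = [[0.0] * (V + 1) for _ in range(T + 1)]
    a[0][0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            if t and v:
                a[t][v] += pr_edit(x[t-1], y[v-1]) * a[t-1][v-1]  # substitution
            if t:
                a[t][v] += pr_edit(x[t-1], None) * a[t-1][v]      # emit in x only
            if v:
                a[t][v] += pr_edit(None, y[v-1]) * a[t][v-1]      # emit in y only
    return a[T][V]

print(round(alpha("he", "te"), 6))  # <h,t> then <e,e>: 0.05 * 0.10 = 0.005
```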
Pair HMM Inference

α(t,v) = Pr(x^t, y^v) = ε(x_t, y_v) · α(t−1, v−1) + ε(x_t, −) · α(t−1, v) + ε(−, y_v) · α(t, v−1)

[Figure: the recurrence drawn on a trellis with rows v = 1..K and columns t = 1..T]
Pair HMM Inference
[Figure: the same trellis, with cells labeled by the number of emissions i = 1, 2, 3, ... - after i emissions the reachable cells are scattered across columns]

One difference: after i emissions of the pair HMM, we do not know the column position
Multiple states

SUB state:
e      Pr(e)
<a,a>  0.10
<e,e>  0.10
<h,h>  0.10
<a,->  0.05
<h,t>  0.01
<-,h>  0.01
...    ...

IX state:
e      Pr(e)
<a,->  0.11
<e,->  0.21
<h,->  0.11
...    ...

IY state: ...

[Figure: the K×T trellis, one layer per state (l = 2 shown)]
An extension: multiple states

[Figure: another trellis layer, for state l = 1, with states SUB and IX marked]

Conceptually, add a "state" dimension to the model.
EM methods generalize easily to this setting.
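Adding the state dimension to the pair-HMM forward computation gives a three-dimensional table α[s][t][v]. The transition numbers, start distribution, and the tiny emission tables below are illustrative, not from the slides.

```python
STATES = ["SUB", "IX", "IY"]
# Self-loop 0.8, switch 0.1 each: rows sum to 1 (assumed numbers)
TRANS = {s: {s2: (0.8 if s == s2 else 0.1) for s2 in STATES} for s in STATES}
EMIT = {"SUB": {("h", "t"): 0.05, ("e", "e"): 0.10},
        "IX":  {("e", None): 0.21},     # emits in x only
        "IY":  {(None, "h"): 0.01}}     # emits in y only
START = {"SUB": 1.0, "IX": 0.0, "IY": 0.0}

def forward(x, y):
    """a[s][t][v] = total probability of producing x[:t], y[:v], ending in s."""
    T, V = len(x), len(y)
    a = {s: [[0.0] * (V + 1) for _ in range(T + 1)] for s in STATES}
    for s in STATES:
        a[s][0][0] = START[s]
    for t in range(T + 1):
        for v in range(V + 1):
            if (t, v) == (0, 0):
                continue
            for s in STATES:
                tot = 0.0
                for (cx, cy), p in EMIT[s].items():
                    dt, dv = (cx is not None), (cy is not None)
                    if t < dt or v < dv:
                        continue
                    if (dt and x[t-1] != cx) or (dv and y[v-1] != cy):
                        continue
                    tot += p * sum(TRANS[s2][s] * a[s2][t-dt][v-dv]
                                   for s2 in STATES)
                a[s][t][v] = tot
    return sum(a[s][T][V] for s in STATES)

# Only surviving path: SUB<h,t> then SUB<e,e>: 0.8*0.05 * 0.8*0.10 = 0.0032
print(round(forward("he", "te"), 6))
```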