
Page 1: Administrivia

Administrivia

What: LTI Seminar
When: Friday Oct 30, 2009, 2:00pm - 3:00pm
Where: 1305 NSH
Faculty Host: Noah Smith

Title: SeerSuite: Enterprise Search and Cyberinfrastructure for Science and Academia

Speaker: Dr. C. Lee Giles, Pennsylvania State University

Cyberinfrastructure or e-science has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design and implementation of supporting cyberinfrastructure. However, there exists no open source integrated system for building an integrated search engine and digital library that focuses on all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, table indexing, etc. …

Counts for two writeups if you attend!

Page 2: Administrivia

Two Page Status Report on Project – due Wed 11/2 at 9am

This is a chance to tell me how your project is progressing: what you’ve accomplished, and what you plan to do in the future. There’s no fixed format, but here are some things you might discuss.

What dataset will you be using? What does it look like (e.g., how many entities are there, how many tokens, etc.)? Looking over the data is always a good first step before you start working with it; what did you do to get acquainted with the data?

Do you plan on looking at the same problem, or have you changed your plans?

If you plan on writing code, what have you written so far, in what languages, and what do you still need to do?

If you plan on using off-the-shelf code, what have you installed, and what experiences have you had with it?

If you've run a baseline system on the data and gotten some results, what are they? Are they consistent with what you expected?

Page 3: Administrivia

Brin’s 1998 paper

Page 4: Administrivia

Poon and Domingos – continued!

plus Bellare & McCallum

10-28-2009. Mostly pilfered from Pedro’s slides.

Page 5: Administrivia

Idea: exploit “pattern/relation duality”:

1. Start with some seed instances of (author,title) pairs (e.g., “Isaac Asimov”, “The Robots of Dawn”)

2. Look for occurrences of these pairs on the web.

3. Generate patterns that match heuristically chosen subsets of the occurrences

- order, URLprefix, prefix, middle, suffix

4. Extract new (author, title) pairs that match the patterns.

5. Go to 2.

[some workshop, 1998]

Result: 24M web pages + 5 books → 199 occurrences → 3 patterns → 4047 occurrences → + 5M pages → 3947 occurrences → 105 patterns → … → 15,257 books

But:
• mostly learned “science fiction books”, at least in early rounds
• some manual intervention
• special regexes for author/title used

[Diagram: Relation → Patterns → Relation cycle]
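A minimal sketch of the bootstrapping loop (steps 1-5 above); the pattern representation is simplified to (prefix, middle, suffix) context strings, and the occurrence-finding and pattern "induction" here are crude stand-ins for Brin's heuristics:

import re

# Toy sketch of Brin-style pattern/relation bootstrapping. Patterns are simplified
# to (prefix, middle, suffix) context strings; Brin's patterns also used order and
# a URL prefix, plus heuristics for choosing which occurrences to generalize.
def find_contexts(pairs, corpus):
    contexts = []
    for author, title in pairs:
        for doc in corpus:
            a, t = doc.find(author), doc.find(title)
            if a >= 0 and t >= 0 and a < t:                 # step 2: the pair occurs here
                contexts.append((doc[max(0, a - 10):a],     # prefix
                                 doc[a + len(author):t],    # middle
                                 doc[t + len(title):t + len(title) + 10]))  # suffix
    return contexts

def bootstrap(seed_pairs, corpus, rounds=2):
    pairs = set(seed_pairs)                                 # step 1: seed (author, title) pairs
    for _ in range(rounds):
        patterns = set(find_contexts(pairs, corpus))        # step 3 (no real generalization here)
        for prefix, middle, suffix in patterns:             # step 4: extract new matching pairs
            rx = re.escape(prefix) + r"(.+?)" + re.escape(middle) + r"(.+?)" + re.escape(suffix)
            for doc in corpus:
                pairs.update(re.findall(rx, doc))
    return pairs                                            # step 5: go to 2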

Page 6: Administrivia

Markov Networks: [Review] Undirected graphical models

[Graph: nodes Smoking, Cancer, Asthma, Cough]

Potential functions defined over cliques

Smoking Cancer Ф(S,C)

False False 4.5

False True 4.5

True False 2.7

True True 4.5

P(x) = (1/Z) ∏_c Φ_c(x_c),   where Z = Σ_x ∏_c Φ_c(x_c)

Log-linear form: P(x) = (1/Z) exp( Σ_i w_i f_i(x) )
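A tiny sketch of what "probability = normalized product of clique potentials" means, using the Φ(S,C) table above; the second clique's potential is an invented placeholder just so there are two cliques:

from itertools import product

# (Smoking, Cancer) values come from the table above; the (Cancer, Cough)
# potential is an arbitrary placeholder for illustration.
phi_sc = {(False, False): 4.5, (False, True): 4.5, (True, False): 2.7, (True, True): 4.5}
phi_cc = {(False, False): 2.0, (False, True): 1.0, (True, False): 1.0, (True, True): 3.0}

def unnormalized(smokes, cancer, cough):
    return phi_sc[(smokes, cancer)] * phi_cc[(cancer, cough)]   # product over cliques of Phi_c(x_c)

# Z sums the product of potentials over every joint assignment x.
Z = sum(unnormalized(*x) for x in product([False, True], repeat=3))
print(unnormalized(True, True, True) / Z)                       # P(x) for one world x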

Page 7: Administrivia

First-Order Logic

Constants, variables, functions, predicates
E.g.: Anna, x, MotherOf(x), Friends(x, y)

Literal: Predicate or its negation
Clause: Disjunction of literals
Grounding: Replace all variables by constants
E.g.: Friends(Anna, Bob)

World (model, interpretation): Assignment of truth values to all ground predicates

Page 8: Administrivia

Markov Logic: Intuition

A logical KB is a set of hard constraints on the set of possible worlds

Let’s make them soft constraints: when a world violates a formula, it becomes less probable, not impossible

Give each formula a weight (higher weight → stronger constraint)

P(world) ∝ exp( Σ weights of formulas it satisfies )

Page 9: Administrivia

Example: Friends & Smokers

1.5  ∀x Smokes(x) => Cancer(x)
1.1  ∀x,y Friends(x,y) => (Smokes(x) <=> Smokes(y))

Two constants: Anna (A) and Bob (B)

[Ground network over: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]

Page 10: Administrivia

Example: Friends & Smokers

1.5  ∀x Smokes(x) => Cancer(x)
1.1  ∀x,y Friends(x,y) => (Smokes(x) <=> Smokes(y))

Two constants: Anna (A) and Bob (B)

[Ground network over: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]

Page 11: Administrivia

Example: Friends & Smokers

1.5  ∀x Smokes(x) => Cancer(x)
1.1  ∀x,y Friends(x,y) => (Smokes(x) <=> Smokes(y))

[Ground network over: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]

Smokes(Anna) Cancer(Anna) W(edge:s(a)->c(a))

F F 1.5

F T 1.5

T F 0

T T 1.5

Page 12: Administrivia

Example: Friends & Smokers

1.5  ∀x Smokes(x) => Cancer(x)
1.1  ∀x,y Friends(x,y) => (Smokes(x) <=> Smokes(y))

Two constants: Anna (A) and Bob (B)

[Ground network over: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]

Page 13: Administrivia

Example: Friends & Smokers

1.5  ∀x Smokes(x) => Cancer(x)
1.1  ∀x,y Friends(x,y) => (Smokes(x) <=> Smokes(y))

Two constants: Anna (A) and Bob (B)

[Ground network over: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]

Friends(A,B) Smokes(A) Smokes(B) W(f(a,b),s(a),s(b))

F F F 1.1

F F T 1.1

F T F 1.1

F T T 1.1

T F F 1.1

T F T 0

T T F 0

T T T 1.1

Page 14: Administrivia

Markov Logic Networks

MLN is a template for ground Markov nets. Probability of a world x:

P(x) = (1/Z) exp( Σ_i w_i n_i(x) )

where w_i = weight of formula i and n_i(x) = no. of true groundings of formula i in x
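A minimal sketch of this formula for the two-constant Friends & Smokers example above: count the true groundings n_i(x) of each formula in a world x and exponentiate the weighted sum (computing Z would require summing over all worlds and is omitted):

from itertools import product
from math import exp

consts = ["A", "B"]                     # Anna and Bob
w1, w2 = 1.5, 1.1                       # weights of the two formulas from the example

def n_true_groundings(world):
    # world maps ground atoms such as ("Smokes","A") or ("Friends","A","B") to True/False
    implies = lambda p, q: (not p) or q
    n1 = sum(implies(world[("Smokes", x)], world[("Cancer", x)]) for x in consts)
    n2 = sum(implies(world[("Friends", x, y)],
                     world[("Smokes", x)] == world[("Smokes", y)])
             for x, y in product(consts, repeat=2))
    return n1, n2                       # n_i(x): number of true groundings of formula i

def unnormalized_prob(world):
    n1, n2 = n_true_groundings(world)
    return exp(w1 * n1 + w2 * n2)       # exp(sum_i w_i n_i(x)); dividing by Z gives P(x)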

Page 15: Administrivia

Weight Learning

Parameter tying: groundings of the same clause share one weight

Generative learning: pseudo-likelihood
Discriminative learning: cond. likelihood [like CRFs – but we need to do inference. They use a Collins-like method that computes expectations near a MAP soln. WC]

∂/∂w_i log P_w(x) = n_i(x) − E_w[ n_i(x) ]

(no. of times clause i is true in data) − (expected no. of times clause i is true according to the MLN)
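In code the gradient is just observed minus expected counts; a trivial sketch (the expected counts would have to come from approximate inference over the MLN, e.g. MC-SAT, which is not shown):

def loglik_gradient(observed_counts, expected_counts):
    # d/dw_i log P_w(x) = n_i(x) - E_w[n_i(x)], one component per clause i.
    return [n - e for n, e in zip(observed_counts, expected_counts)]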

Page 16: Administrivia

MAP/MPE Inference

Problem: Find most likely state of world given evidence

This is just the weighted MaxSAT problem. Use a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])

argmax_y Σ_i w_i n_i(x,y)

Page 17: Administrivia

The MaxWalkSAT Algorithm

for i ← 1 to max-tries do
    solution = random truth assignment
    for j ← 1 to max-flips do
        if ∑ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes ∑ weights(sat. clauses)
return failure, best solution found
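A rough Python transcription of the pseudocode above; the clause representation (a weight plus a list of (variable, wanted-value) literals) is my own simplification:

import random

# A clause is (weight, literals), where literals is a list of (variable, wanted_value);
# the clause is satisfied if any literal matches the assignment.
def sat_weight(clauses, assign):
    return sum(w for w, lits in clauses
               if any(assign[v] == val for v, val in lits))

def maxwalksat(variables, clauses, max_tries=10, max_flips=1000, threshold=None, p=0.5):
    threshold = sum(w for w, _ in clauses) if threshold is None else threshold
    best = None
    for _ in range(max_tries):
        assign = {v: random.random() < 0.5 for v in variables}     # random truth assignment
        for _ in range(max_flips):
            if sat_weight(clauses, assign) >= threshold:
                return assign
            unsat = [lits for w, lits in clauses
                     if not any(assign[v] == val for v, val in lits)]
            if not unsat:
                return assign
            lits = random.choice(unsat)                            # a random unsatisfied clause
            if random.random() < p:
                var = random.choice(lits)[0]                       # flip a random variable in it
            else:                                                  # otherwise flip the best variable
                def gain(v):
                    assign[v] = not assign[v]
                    s = sat_weight(clauses, assign)
                    assign[v] = not assign[v]
                    return s
                var = max((v for v, _ in lits), key=gain)
            assign[var] = not assign[var]
            if best is None or sat_weight(clauses, assign) > sat_weight(clauses, best):
                best = dict(assign)
    return best                                                    # "failure": best solution found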

Page 18: Administrivia

MAP=WalkSat, Expectations=????

MCMC????: Deterministic dependencies break MCMC; near-deterministic ones make it very slow

Solution: Combine MCMC and WalkSAT → MC-SAT algorithm [Poon & Domingos, 2006]

Page 19: Administrivia

Slice Sampling [Damien et al. 1999]

[Figure: slice sampling – given x(k), draw u(k) uniformly from [0, P(x(k))], then draw x(k+1) uniformly from the slice {x : P(x) ≥ u(k)}]

Page 20: Administrivia

The MC-SAT Algorithm

X(0) ← a random solution satisfying all hard clauses
for k ← 1 to num_samples
    M ← Ø
    forall Ci satisfied by X(k−1)
        with prob. 1 − exp(−wi) add Ci to M
    endfor
    X(k) ← a uniformly random solution satisfying M
endfor

MaxWalkSat

“SampleSat”: MaxWalkSat + Simulated Annealing
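A hedged sketch of the same loop; the key step "a uniformly random solution satisfying M" is what SampleSat approximates, and appears here only as an abstract solve(M) callback, not a faithful uniform sampler:

import math, random

# A clause is (weight, test) where test(state) -> bool; hard clauses get weight float("inf").
# solve(must_satisfy) should return some state satisfying every clause in must_satisfy.
def mc_sat(clauses, num_samples, solve):
    hard = [c for c in clauses if math.isinf(c[0])]
    x = solve(hard)                                     # X(0): satisfy all hard clauses
    samples = []
    for _ in range(num_samples):
        m = [c for c in clauses                         # keep each satisfied Ci w.p. 1 - exp(-wi)
             if c[1](x) and random.random() < 1 - math.exp(-c[0])]
        x = solve(m)                                    # X(k): a (near-)uniform solution satisfying M
        samples.append(x)
    return samples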

Page 21: Administrivia

What can you do with MLNs?

Page 22: Administrivia

Problem: Given database, find duplicate records

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
SameField(f,r,r’) => SameRecord(r,r’)
SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)

Entity Resolution

Page 23: Administrivia

Can also resolve fields:

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
SameField(f,r,r’) <=> SameRecord(r,r’)
SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)
SameField(f,r,r’) ^ SameField(f,r’,r”) => SameField(f,r,r”)

P. Singla & P. Domingos, “Entity Resolution with Markov Logic”, in Proc. ICDM-2006.

Entity Resolution

Page 24: Administrivia

Hidden Markov Models

obs = { Obs1, … , ObsN }
state = { St1, … , StM }
time = { 0, … , T }

State(state!,time)
Obs(obs!,time)

State(+s,0)
State(+s,t) => State(+s',t+1)
State(+s,t) => State(+s,t+1) [variant we’ll use -WC]
Obs(+o,t) => State(+s,t)

Page 25: Administrivia

What did P&D do with MLNs?

Page 26: Administrivia

Information Extraction (simplified)

Problem: Extract database from text or semi-structured sources

Example: Extract database of publications from citation list(s) (the “CiteSeer problem”)

Two steps:

Segmentation: use an HMM to assign tokens to fields

Entity resolution: use logistic regression and transitivity

Page 27: Administrivia

Motivation for joint extraction and matching

Page 28: Administrivia

Token(token, position, citation)
InField(position, field, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))

Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
SameField(+f,c,c’) <=> SameCit(c,c’)
SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)

Information Extraction (simplified)

Page 29: Administrivia

Token(token, position, citation)
InField(position, field, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))

Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
SameField(+f,c,c’) <=> SameCit(c,c’)
SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)

More: H. Poon & P. Domingos, “Joint Inference in Information Extraction”, in Proc. AAAI-2007.

Information Extraction (simplified)

Page 30: Administrivia

Token(token, position, citation)
InField(position, field, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))

Information Extraction (less simplified)

Token(+t,i,c) => InField(i,+f,c)
!Token("aardvark",i,c) v InField(i,”author”,c)
…
!Token("zymurgy",i,c) v InField(i,"author",c)
…
!Token("zymurgy",i,c) v InField(i,"venue",c)

Page 31: Administrivia

Token(token, position, citation)
InField(position, field, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))

Information Extraction (less simplified)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,”author”,c)
=> InField(2,”author”,c)
=> InField(midpointOfC, "title", c) [computed off-line –WC]

Page 32: Administrivia

Token(token, position, citation)
InField(position, field, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))

Information Extraction (less simplified)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,”author”,c)
=> InField(2,”author”,c)
Center(c,i) => InField(i, "title", c)

Page 33: Administrivia

Token(+t,i,c) => InField(i,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))

Information Extraction (less simplified)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,”author”,c)
=> InField(2,”author”,c)
=> InField(midpointOfC, "title", c)

Token(w,i,c) ^ IsAlphaChar(w) ^ FollowBy(c,i,”.”) => InField(c,”author”,i) v InField(c,”venue”,i)

LastInitial(c,i) ^ LessThan(j,i) => !InField(c,”title”,j) ^ !InField(c,”venue”,j)
FirstInitial(c,i) ^ LessThan(i,j) => InField(c,”author”,j)
FirstVenueKeyword(c,i) ^ LessThan(i,j) => !InField(c,”author”,j) ^ !InField(c,”title”,j)

Initials tend to appear in author or venue field.

Positions before the last non-venue initial are usually not title or venue.

Positions after first “venue keyword” are usually not author or title.

Page 34: Administrivia

Information Extraction (less simplified)

SimilarTitle(c,i,j,c’,i’,j’): true if
• c[i..j] and c’[i’…j’] are both “titlelike”
  • i.e., no punctuation, doesn’t violate rules above
• c[i..j] and c’[i’…j’] are “similar”
  • i.e., start with same trigram and end with same token

SimilarVenue(c,c’): true if c and c’ don’t contain conflicting venue keywords (e.g., journal vs proceedings)

Page 35: Administrivia

Information Extraction (less simplified)

SimilarTitle(c,i,j,c’,i’,j’): …
SimilarVenue(c,c’): …
JointInferenceCandidate(c,i,c’):
• trigram starting at i in c also appears in c’
• and trigram is a possible title
• and punct before trigram in c’ but not c

Page 36: Administrivia

Information Extraction (less simplified)

SimilarTitle(c,i,j,c’,i’,j’): …
SimilarVenue(c,c’): …
JointInferenceCandidate(c,i,c’):

[InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)]
InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’)) => InField(i+1,+f,c)

Jnt-Seg

Why is this joint? Recall we also have:
Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)

Page 37: Administrivia

Information Extraction (less simplified)

SimilarTitle(c,i,j,c’,i’,j’): …
SimilarVenue(c,c’): …
JointInferenceCandidate(c,i,c’): …

InField(i,+f,c) ^ ~HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’) ^ SameCitation(c,c’)) => InField(i+1,+f,c)

Jnt-Seg-ER

Page 38: Administrivia

Results: segmentation

Percent error reduction for best joint model

Page 39: Administrivia

Results: matching

Cora F-S: 0.87 F1
Cora TFIDF: 0.84 max F1

Fraction of clusters correctly constructed using transitive closure of pairwise decisions
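The matching score above is computed over clusters formed by transitively closing the pairwise SameCit decisions; a small union-find sketch of that closure step (my own code, not the authors'):

def clusters_from_pairs(items, matched_pairs):
    # union-find: transitive closure of pairwise "same" decisions
    parent = {x: x for x in items}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x
    for a, b in matched_pairs:
        parent[find(a)] = find(b)              # merge the two clusters
    clusters = {}
    for x in items:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# clusters_from_pairs(["c1", "c2", "c3"], [("c1", "c2")]) -> [{"c1", "c2"}, {"c3"}]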

Page 40: Administrivia

William’s summary

MLNs are a compact, elegant way of describing a Markov network
Standard learning methods work
Network may be very, very large
Inference may be expensive
Doesn’t eliminate feature engineering
  E.g., complicated “feature” predicates
Experimental results for joint matching/NER are not that strong overall
Cascading segmentation and then matching improves segmentation, maybe not matching
  But it needs to be carefully restricted (efficiency?)

Page 41: Administrivia

Bellare & McCallum

Page 42: Administrivia

Outline

Goal: Given (DBLP record, citation-text) pairs that do match, learn to segment citations.

Methods:
Learn a CRF to align the record and text (sort of like learning an edit distance)
Generate alignments, and use them as training data for a linear-chain CRF that does segmentation (aka extraction)
This CRF does not need records to work

Page 43: Administrivia

Alignment….

Notation:

x1: list of tokens from DB fields
y1: list of DB field names
x2: list of tokens from the text
a: an alignment – pairs (i,j) where x1[i] ~ x2[j]
f(a, x1, y1, x2): features, e.g. editDist(x1[i], x2[j]), x2[j] = “EMNLP”, y1[i] = “booktitle”

Alignment feature: depends on a and x’s

Extraction feature: depends on a, y1 and x2

Page 44: Administrivia

Learning for alignment… Generalized expectation criterion: rather than minimizing Edata[f] − Emodel[f] … plus a penalty term for the weights … minimize a weighted squared difference between Emodel[f] and p, where p is the user’s prior on the value of the feature.

Sum of marginal probabilities divided by size of variable set

“We simulate user-specified expectation criteria [i.e. p’s] with statistics on manually labeled citation texts.” … top 10 features by MI, p in 11 bins, w=10
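Written out (in my own notation, only as a sketch of the idea described above), the objective trades a squared penalty on the gap between each model expectation E_θ[f_k] and its user-supplied target p_k against the usual weight penalty:

\min_{\theta} \;\; \sum_{k} \lambda \left( p_k - E_{\theta}[f_k] \right)^2 \;+\; \frac{\lVert \theta \rVert^2}{2\sigma^2}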

Page 45: Administrivia

Results

On 260 records, 522 record-text pairs

Page 46: Administrivia

Results

[Chart comparing: “gold standard” – hand-labeled extraction data; trained on DBLP records; trained on records partially aligned with high-precision rules; … and also using partial match to DB records at test time; CRF trained with extraction criteria derived from labeled data]

Page 47: Administrivia

Alignments and expectations

Simplified version of the idea: from Learning String Edit Distance, Ristad and Yianilos, PAMI 1998

Page 48: Administrivia

HMM Example

θ: parameters of the HMM
Pr(l -> l'): transition probability (a parameter)
Pr(l -> x): emission probability (a parameter)
x^T = x_1 x_2 … x_T: input string, aka x
x^t = x_1 … x_t: a prefix of x^T
x_u … x_t: a substring of x^T
s_t: the state the HMM is in after emitting x_t, s_t ∈ {l_1, …, l_K}

[Figure: two states, 1 and 2, with transitions Pr(1->1), Pr(1->2), Pr(2->1), Pr(2->2)]

Pr(1->x):  d 0.3,  h 0.5,  b 0.2
Pr(2->x):  a 0.3,  e 0.5,  o 0.2

Sample output: xT=heehahaha, sT=122121212

Page 49: Administrivia

HMM Inference

[Trellis: rows are states l = 1…K, columns are positions t = 1…T, with observations x1 x2 x3 … xT along the bottom]

Key point: Pr(si=l) depends only on Pr(l’->l) and si-1, so you can propagate probabilities forward...
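A minimal sketch of that forward propagation, using the two-state example from the previous slide; the emission tables are the ones shown there, while the transition and start probabilities are made-up values (the slide only drew the transitions as arrows):

# Emissions are from the example slide; transition/start probabilities are
# assumed values for illustration only.
emit = {1: {"d": 0.3, "h": 0.5, "b": 0.2},
        2: {"a": 0.3, "e": 0.5, "o": 0.2}}
trans = {1: {1: 0.6, 2: 0.4}, 2: {1: 0.3, 2: 0.7}}
start = {1: 0.5, 2: 0.5}

def forward(x):
    # alpha[t][l] = Pr(x_1..x_{t+1}, state l); each column depends only on the previous one
    alpha = [{l: start[l] * emit[l].get(x[0], 0.0) for l in emit}]
    for t in range(1, len(x)):
        prev = alpha[-1]
        alpha.append({l: emit[l].get(x[t], 0.0) *
                         sum(prev[lp] * trans[lp][l] for lp in emit)
                      for l in emit})
    return alpha

print(sum(forward("heehahaha")[-1].values()))   # Pr(x): sum over states in the last column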

Page 50: Administrivia

Pair HMM Notation

θ: parameters of a “pair HMM”
E = {<a,b>} ∪ {<a,->} ∪ {<-,b>} ∪ {END}: emissions/edits
Pr(e): emission probability, also written Pr(<a,b>)
x, y: input strings, aka x^T, y^V
z^n = z_1, …, z_n ∈ E*: string of edits
r_k: hidden property associated with z_k, r_k ∈ {1..T} × {1..V}

Andrew used “null”

Page 51: Administrivia

Pair HMM Example

(notation as on the previous slide)

[Figure: a single state, labeled 1, with the edit probabilities below]

e Pr(e)

<a,a> 0.10

<e,e> 0.10

<h,h> 0.10

<a,-> 0.05

<h,t> 0.05

<-,h> 0.01

... ..

Page 52: Administrivia

Pair HMM Example

[Figure: a single state, labeled 1, with the edit probabilities below]

e Pr(e)

<a,a> 0.10

<e,e> 0.10

<h,h> 0.10

<e,-> 0.05

<h,t> 0.05

<-,h> 0.01

... ..

Sample run: z^T = <h,t>, <e,e>, <e,e>, <h,h>, <e,->, <e,e>
Strings x, y produced by z^T: x = heehee, y = teehe

Notice that x, y is also produced by z^4 + <e,e>, <e,-> and many other edit strings

Page 53: Administrivia

Distances based on pair HMMs

Pr(x^T, y^V | θ) = Σ_{z^n ∈ EDIT(x^T, y^V)} Pr(z^n),  where Pr(z^n) = Π_i Pr(z_i)

d_stochastic(x^T, y^V) = − log Pr(x^T, y^V | θ)

d_viterbi(x^T, y^V) = − log max_{z^n ∈ EDIT(x^T, y^V)} Pr(z^n | θ)

Page 54: Administrivia

Pair HMM Inference

),()1,(),(),1(),()1,1(

),(),Pr(),(),Pr(),(),Pr(

),Pr(),(1111

vtvt

vvt

tvt

vtvt

vt

yvtxvtyxvt

yyxxyxyxyx

yxvt

)1,1( vt ),1( vt

)1,( vt ),( vt

Dynamic programming is possible: fill out matrix left-to-right, top-down
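A sketch of that DP for a one-state pair HMM; edit probabilities are a dict keyed by (x-char, y-char) with None standing for “-”, the values are illustrative only, and the END probability is ignored:

# Forward DP: alpha[t][v] = Pr(x[:t], y[:v]).
def pair_forward(x, y, pr):
    alpha = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    alpha[0][0] = 1.0
    for t in range(len(x) + 1):
        for v in range(len(y) + 1):
            if t and v:
                alpha[t][v] += pr.get((x[t-1], y[v-1]), 0.0) * alpha[t-1][v-1]  # emit <x_t, y_v>
            if t:
                alpha[t][v] += pr.get((x[t-1], None), 0.0) * alpha[t-1][v]      # emit <x_t, ->
            if v:
                alpha[t][v] += pr.get((None, y[v-1]), 0.0) * alpha[t][v-1]      # emit <-, y_v>
    return alpha[len(x)][len(y)]          # Pr(x, y); its -log is the "stochastic" distance

pr = {("h", "h"): 0.1, ("e", "e"): 0.1, ("h", "t"): 0.05, ("e", None): 0.05, (None, "h"): 0.01}
print(pair_forward("hee", "te", pr))      # sums over all edit strings producing ("hee", "te")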

Page 55: Administrivia

Pair HMM Inference

α(t,v) = Pr(x_1…x_t, y_1…y_v)   (computed with the recurrence above)

[Matrix: rows indexed by v, columns indexed by t = 1…T]

Page 56: Administrivia

Pair HMM Inference

[Matrix as before: rows indexed by v, columns indexed by t = 1…T]

One difference: after i emissions of the pair HMM, we do not know the column position

[Figure: the same matrix with cells labeled by which emission counts (e.g., i=1, i=2, i=3) could reach them]

Page 57: Administrivia

Multiple states

SUB

e Pr(e)

<a,a> 0.10

<e,e> 0.10

<h,h> 0.10

<a,-> 0.05

<h,t> 0.01

<-,h> 0.01

... ..

IX

e Pr(e)

<a,-> 0.11

<e,-> 0.21

<h,-> 0.11

… …

IY

Page 58: Administrivia

An extension: multiple states

[Figure: the DP matrix gains a “state” dimension – one layer (rows v, columns t = 1…T) per state, e.g. SUB and IX]

Conceptually, add a “state” dimension to the model

EM methods generalize easily to this setting