Post on 04-Jan-2016

Distance functions and IE – 5

William W. Cohen

CALD

Announcements

• Current statistics:
  – days with unscheduled student talks: 5
  – students with unscheduled student talks: 3
  – Projects are due: 4/28 (last day of class)
  – Additional requirement: draft (for comments) no later than 4/21

String distance metrics so far...

• Term-based (e.g. TF/IDF as in WHIRL)
  – Distance depends on the set of words contained in both s and t, so it is sensitive to spelling errors.
  – Usually weight words to account for “importance”
  – Fast comparison: O(n log n) for |s|+|t|=n
• Edit-distance metrics
  – Distance is the cost of the shortest sequence of edit commands that transforms s to t.
  – No notion of word importance
  – More expensive: O(n^2)
• Other metrics
  – Jaro metric & variants
  – Monge-Elkan’s recursive string matching
  – etc?
• Which metrics work best, for which problems?
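
To make the contrast between the first two families concrete, here is a minimal sketch of each. The smoothed IDF formula and the function names are assumptions for illustration, not WHIRL's actual weighting.

```python
import math
from collections import Counter


def levenshtein(s, t):
    """Unit-cost edit distance via the O(|s|*|t|) dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # copy or substitute
        prev = curr
    return prev[-1]


def tfidf_cosine(s, t, corpus):
    """Cosine similarity of TF-IDF word vectors; IDF is estimated from `corpus`."""
    docs = [set(d.lower().split()) for d in corpus]

    def idf(word):
        df = sum(word in d for d in docs)
        # Smoothed IDF: an assumption, chosen so unseen words get a finite weight.
        return math.log((1 + len(docs)) / (1 + df)) + 1.0

    def vec(text):
        return {w: c * idf(w) for w, c in Counter(text.lower().split()).items()}

    u, v = vec(s), vec(t)
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

With these definitions, a one-word misspelling costs only a couple of edits under `levenshtein`, but removes the misspelled word from the shared vocabulary entirely under `tfidf_cosine`.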

Results - Overall

Combining Information Extraction and Similarity Computations

Krauthammer et al

Background

• Common task in proteomics/genomics:
  – look for (soft) matches to a query sequence in a large “database” of sequences.
  – want to find subsequences (genes) that are highly similar (and hence probably related)
  – want to ignore “accidental” matches
  – possible technique is Smith-Waterman (local alignment)
    • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance

Smith-Waterman distance

       c  o  h  e  n  d  o  r  f
   m   0  0  0  0  0  0  0  0  0
   c   1  0  0  0  0  0  0  0  0
   c   0  0  0  0  0  0  0  0  0
   o   0  2  1  0  0  0  2  1  0
   h   0  1  4  3  2  1  1  1  0
   n   0  0  3  3  5  4  3  2  1
   s   0  0  2  2  4  4  3  2  1
   k   0  0  1  1  3  3  3  2  1
   i   0  0  0  0  2  2  2  2  1

dist=5
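
A minimal sketch of the local-alignment recurrence behind such a matrix. The match, mismatch, and gap scores are assumptions chosen for illustration, so the numbers it produces need not agree with the matrix above.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Return (best local alignment score, full score matrix H).

    H[i][j] is the best score of any alignment ending at s[i-1], t[j-1];
    clamping at 0 lets an alignment restart anywhere, which is what makes
    the method local rather than global.
    """
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            H[i][j] = max(0,                  # start a fresh alignment here
                          diag,               # align s[i-1] with t[j-1]
                          H[i - 1][j] + gap,  # skip a character of s
                          H[i][j - 1] + gap)  # skip a character of t
            best = max(best, H[i][j])
    return best, H
```

The returned `best` is the height of the tallest peak in `H`; tracing back from that cell until a 0 is reached would recover the matching substrings.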

In general “peaks” in the matrix scores indicate highly similar substrings.

Background

• Common task in proteomics/genomics:
  – look for (soft) matches to a query sequence in a large “database” of sequences.
  – possible technique is Smith-Waterman (local alignment)
    • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
    • based on substitutability theory/stats for amino acids
  – doesn’t scale well
    • BLAST and FASTA: fast approximate S-W

BLAST/FASTA ideas

• Find all char n-grams (“words”) in the query string.
• FASTA:
  – Use inverted indices to find out where these words appear in the DB sequence
  – Use S-W only near DB sections that contain some of these words
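
The FASTA idea can be sketched as an n-gram inverted index plus a step that merges nearby hits into windows worth aligning. The function names and the `pad` parameter are hypothetical; the real FASTA program uses more elaborate hit scoring.

```python
from collections import defaultdict


def ngrams(s, n):
    """All character n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(db, n):
    """Inverted index: n-gram -> list of start positions in the DB sequence."""
    index = defaultdict(list)
    for pos, gram in enumerate(ngrams(db, n)):
        index[gram].append(pos)
    return index


def candidate_windows(query, index, n, pad):
    """Merge padded hit positions into (start, end) windows to hand to S-W."""
    hits = sorted(p for g in set(ngrams(query, n)) for p in index.get(g, []))
    windows = []
    for p in hits:
        start, end = max(0, p - pad), p + n + pad
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows
```

Only the slices `db[start:end]` then need the quadratic alignment, instead of the whole database.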

BLAST/FASTA ideas

• Find all char n-grams (“words”) in the query string.
• BLAST:
  – Generate variations of these words by looking for changes that would lead to strong similarities
  – Discard “low IDF” words (where accidental matches are likely)
  – Use expanded set of n-grams to focus search
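
A rough sketch of the word-expansion step: for each query word, keep every same-length word that still scores above a threshold against it. The flat match/mismatch scoring is an assumption (BLAST itself uses a substitution matrix), and the “low IDF” filter is omitted.

```python
from itertools import product


def word_score(a, b, match=2, mismatch=-1):
    """Position-by-position score of two equal-length words."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(word, alphabet, threshold):
    """All words over `alphabet`, same length as `word`, scoring >= threshold."""
    return {"".join(c) for c in product(alphabet, repeat=len(word))
            if word_score(word, "".join(c)) >= threshold}


def expanded_words(query, n, alphabet, threshold):
    """Union of the neighborhoods of every n-gram in the query."""
    out = set()
    for i in range(len(query) - n + 1):
        out |= neighborhood(query[i:i + n], alphabet, threshold)
    return out
```

Enumerating the full alphabet is exponential in the word length, which is tolerable only because the words are short.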

[Figure: a query string with its words and expansions]

BLAST/FASTA ideas

• Find all char n-grams (“words”) in the query string.
• BLAST:
  – Generate variations of these words by looking for changes that would lead to strong similarities
  – Discard “low IDF” words (where accidental matches are likely)
  – Use expanded set of n-grams to focus search
• The BLAST program:
  – Widely used
  – Fast implementation
  – Supports asking multiple queries against a database at once...
  – Can one use it to find soft matches of protein names (from a dictionary) in text?

Basic idea:

• Protein database
• Query strings
• Proposed alignment (query->database)
• Query algorithm: BLAST

• Biomedical paper
• Protein name dictionary
• Extracted protein name (dict. entry->text)
• IE system: dictionaries+BLAST (optimized for this problem)

1) Mapping text to DNA sequences

(Q: what sort of char similarity is this?)
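
One way such a mapping could work is to give every text character its own fixed-length nucleotide codeword. The 4-nucleotide codeword length, the character set, and the codeword assignment below are all assumptions for illustration; the slide does not specify the actual table.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# Hypothetical character set; anything outside it is dropped by encode().
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 -"
# 4^4 = 256 distinct codewords, more than enough for the alphabet above.
CODEWORDS = ["".join(p) for p in product(NUCLEOTIDES, repeat=4)]
CHAR_TO_CODE = dict(zip(ALPHABET, CODEWORDS))
CODE_TO_CHAR = {code: ch for ch, code in CHAR_TO_CODE.items()}


def encode(text):
    """Map text to a nucleotide string, 4 nucleotides per known character."""
    return "".join(CHAR_TO_CODE[c] for c in text.lower() if c in CHAR_TO_CODE)


def decode(seq):
    """Inverse of encode() for sequences made of whole codewords."""
    return "".join(CODE_TO_CHAR[seq[i:i + 4]] for i in range(0, len(seq), 4))
```

Under a code like this, two different characters can still share some nucleotides, so any partial character-level similarity seen by the aligner is a side effect of how the codewords were assigned.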

2) Optimizing BLAST

• Split protein-name database into several parts (for short, medium-length, long protein names)
  – Scoring depends on length of matched string
• Require space chars before and after “short” protein names.
• Manually search (grid search?) for better settings for certain key parameters for each protein-name subdatabase
  – With what data?
• Evaluate on one review article, 1162 protein names
  – inter-annotator agreement not great (70-85%)
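
The first two optimizations can be sketched as follows. The length cutoffs are hypothetical (the slide gives none), and treating the start or end of the text as a valid boundary is an added assumption.

```python
def split_by_length(names, short_max=5, medium_max=12):
    """Partition a name dictionary into short / medium / long sub-dictionaries."""
    parts = {"short": [], "medium": [], "long": []}
    for name in names:
        if len(name) <= short_max:
            parts["short"].append(name)
        elif len(name) <= medium_max:
            parts["medium"].append(name)
        else:
            parts["long"].append(name)
    return parts


def has_space_boundaries(text, start, end):
    """True if the match text[start:end] is delimited by spaces or text edges."""
    before = start == 0 or text[start - 1] == " "
    after = end == len(text) or text[end] == " "
    return before and after
```

Each sub-dictionary can then be searched with its own parameter settings, with `has_space_boundaries` applied only to hits from the short one.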

Results

Overall: precision 71.1%, recall 78.8% (optimized)
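
For reference, precision and recall over extracted name spans can be computed as below. The spans in the usage note are toy values, not data from the evaluation.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of (start, end) spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, with three predicted spans of which two are among four gold spans, precision is 2/3 and recall is 1/2.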

IE with Dictionaries

Cohen & Sarawagi

Finding names you know about

• Problem: given dictionary of names, find them in email text
  – Important task beyond email (biology, link analysis, ...)
  – Exact match is unlikely to work perfectly, due to nicknames (Will Cohen), abbreviations (William C), misspellings (Willaim Chen), polysemous words (June, Bill), etc
  – In informal text it sometimes works very poorly
  – Problem is similar to record linkage (aka data cleaning, de-duping, merge-purge, ...): the problem of finding duplicate database records in heterogeneous databases.
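
A minimal sketch of soft dictionary lookup, using Python's standard `difflib.SequenceMatcher` as a stand-in similarity. That choice and the 0.8 threshold are assumptions; any of the string metrics discussed earlier could be substituted.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] from difflib's matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def soft_lookup(candidate, dictionary, threshold=0.8):
    """Return (best entry, similarity) if it clears the threshold, else None."""
    best = max(dictionary, key=lambda name: similarity(candidate, name), default=None)
    if best is None:
        return None
    score = similarity(candidate, best)
    return (best, score) if score >= threshold else None
```

An exact-match lookup would miss a misspelled candidate such as "Willaim Cohen", while the soft lookup still maps it to the dictionary entry "William Cohen".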

Finding names you know about

• Problem: given dictionary of names, find them in email text
  – Exact match is unlikely to work well for informal text.
  – Problem is similar to record linkage
  – Hard to combine state of the art similarity metrics (as used in record linkage) with state of the art NER system due to representational mismatch:
    • Opening up the box, modern NER systems don’t really know anything about names....

IE as Sequential Word Classification

A trained IE system models the relative probability of labeled sequences of words.

To classify, find the most likely state sequence for the given words:

Yesterday Pedro Domingos spoke this example sentence.

Any words said to be generated by the designated “person name” state are extracted as a person name:

Person name: Pedro Domingos

States: person name, location name, background
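
Finding the most likely state sequence is typically done with Viterbi decoding. The sketch below uses a hand-set toy model: the lexicon, the emission probabilities, and the transition probabilities are all hypothetical, not a trained system.

```python
import math

STATES = ["person", "location", "background"]
# Hypothetical lexicon standing in for a learned emission model.
KNOWN = {"pedro": "person", "domingos": "person", "pittsburgh": "location"}
LOG_START = {s: math.log(1 / 3) for s in STATES}
LOG_TRANS = {p: {s: math.log(0.6 if p == s else 0.2) for s in STATES}
             for p in STATES}


def log_emit(state, word):
    """Toy emission score: words favour the state the lexicon assigns them."""
    tag = KNOWN.get(word.lower(), "background")
    return math.log(0.8 if tag == state else 0.1)


def viterbi(words, states, log_start, log_trans, emit):
    """Most likely state sequence for `words` under a first-order model."""
    score = [{s: log_start[s] + emit(s, words[0]) for s in states}]
    back = []
    for w in words[1:]:
        prev, col, ptr = score[-1], {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + log_trans[p][s])
            col[s] = prev[best_prev] + log_trans[best_prev][s] + emit(s, w)
            ptr[s] = best_prev
        score.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):  # follow backpointers from the last word
        path.append(ptr[path[-1]])
    return path[::-1]


def extract(words, path, state):
    """Join the words that were assigned to `state`."""
    return " ".join(w for w, s in zip(words, path) if s == state)
```

Calling `viterbi` on the example sentence and then `extract(words, path, "person")` yields the person-name words under this toy model.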

IE as Sequential Word Classification

Modern IE systems use a rich representation for words, and clever probabilistic models of how labels interact in a sequence, but do not explicitly represent the names extracted.

[Figure: graphical model over w_{t-1}, w_t, w_{t+1} and O_{t-1}, O_t, O_{t+1}, annotated with the features: is “Wisniewski”, part of noun phrase, ends in “-ski”]

• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• last person name was female
• next two words are “and Associates”
• …
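
A few of these per-word features can be sketched as a small feature function. The feature names and the `city_names` argument are hypothetical; features that need layout or parsing information (bold font, noun phrases, WordNet) are left out.

```python
def word_features(words, t, city_names=frozenset()):
    """Binary features of the word at position t, plus a little right context."""
    w = words[t]
    next_two = [x.lower() for x in words[t + 1:t + 3]]
    return {
        "identity=" + w.lower(): True,
        "ends_in_ski": w.lower().endswith("ski"),
        "is_capitalized": w[:1].isupper(),
        "in_city_list": w.lower() in city_names,
        "next_two_are_and_associates": next_two == ["and", "associates"],
    }
```

Note that every feature describes a single word and its surroundings; none of them describes a whole candidate name.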

Semi-Markov models for IE

• Train on sequences of labeled segments, not labeled words.
  S=(start,end,label)
• Build probability model of segment sequences, not word sequences
• Define features f of segments
  f(S) = words x_t ... x_u, length, previous words, case information, ..., distance to known name
• (Approximately) optimize feature weights on training data
  maximize: $\sum_{i=1}^{m} \log \Pr(S_i \mid x_i)$

with Sunita Sarawagi, IIT Bombay
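
Decoding over segments instead of words can be sketched as a Viterbi-style search that considers every segment length up to a bound. The scoring function, the one-entry dictionary, and the use of `difflib` as the “distance to known name” feature are hypothetical stand-ins for a trained model.

```python
from difflib import SequenceMatcher

DICTIONARY = ["William Cohen"]  # hypothetical list of known names


def toy_score(words, start, end, label, prev_label):
    """Hand-set segment score; a trained model would use weighted features."""
    text = " ".join(words[start:end]).lower()
    sim = max(SequenceMatcher(None, text, d.lower()).ratio() for d in DICTIONARY)
    if label == "person":
        return 2.0 * sim - 1.0  # reward segments close to a known name
    return 0.1 if end - start == 1 else -1.0  # "other" segments are single words


def segment_viterbi(words, labels, score, max_len):
    """Best list of (start, end, label) segments; end is exclusive."""
    n = len(words)
    # best[i][y] = (score, (start, prev_label)) for words[:i] ending in label y
    best = [dict() for _ in range(n + 1)]
    best[0] = {None: (0.0, None)}
    for end in range(1, n + 1):
        for label in labels:
            cands = []
            for start in range(max(0, end - max_len), end):
                for prev_label, (prev_score, _) in best[start].items():
                    total = prev_score + score(words, start, end, label, prev_label)
                    cands.append((total, (start, prev_label)))
            best[end][label] = max(cands, key=lambda c: c[0])
    label = max(best[n], key=lambda y: best[n][y][0])
    end, segments = n, []
    while end > 0:  # follow backpointers from the end of the sentence
        start, prev_label = best[end][label][1]
        segments.append((start, end, label))
        end, label = start, prev_label
    return segments[::-1]
```

Because the score is computed on a whole segment, the similarity between a multi-word candidate and a dictionary entry becomes an ordinary feature, which a word-by-word tagger cannot express directly.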

Details: Semi-Markov model

Conditional Semi-Markov models

CMM:

CSMM:
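
As a sketch under the standard definitions, and not necessarily in the slide's exact notation or conditioning: a conditional Markov model (CMM) factors the probability of a tag sequence $\mathbf{y}$ word by word, while a conditional semi-Markov model (CSMM) factors the probability of a segmentation $\mathbf{S} = S_1, \ldots, S_p$ segment by segment. The symbols $\mathbf{y}$, $\mathbf{S}$ and $\mathbf{x}$ are assumed names.

```latex
\text{CMM:}\quad  \Pr(\mathbf{y} \mid \mathbf{x}) = \prod_{i} \Pr(y_i \mid y_{i-1}, \mathbf{x})
\qquad
\text{CSMM:}\quad \Pr(\mathbf{S} \mid \mathbf{x}) = \prod_{j} \Pr(S_j \mid S_{j-1}, \mathbf{x})
```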

A training algorithm for CSMM’s (1)

Review: Collins’ perceptron training algorithm

Correct tags

Viterbi tags
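
The core of Collins' perceptron training is a single update: add the features of the correct tag sequence and subtract the features of the Viterbi (predicted) tag sequence. The sketch below shows only that update, with a hypothetical feature map; voting or averaging over the intermediate weight vectors is omitted.

```python
from collections import Counter


def sequence_features(words, tags):
    """Counts of (word, tag) and (previous tag, tag) features for a tagging."""
    feats = Counter()
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("word", w.lower(), t)] += 1
        feats[("trans", prev, t)] += 1
        prev = t
    return feats


def perceptron_update(weights, feats_correct, feats_viterbi, rate=1.0):
    """w <- w + rate * (F(correct tags) - F(Viterbi tags)).

    When the two taggings agree, the additions and subtractions cancel.
    """
    for k, v in feats_correct.items():
        weights[k] = weights.get(k, 0.0) + rate * v
    for k, v in feats_viterbi.items():
        weights[k] = weights.get(k, 0.0) - rate * v
    return weights
```

Training repeats this for each example: decode with the current weights, compare with the correct tags, and update only where they differ.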

A training algorithm for CSMM’s (2)

Variant of Collins’ perceptron training algorithm:

voted perceptron learner for TTRANS

like Viterbi

A training algorithm for CSMM’s (3)

Variant of Collins’ perceptron training algorithm:

voted perceptron learner for TSEGTRANS

like Viterbi

Sample CSMM features

Experimental results

• Baseline algorithms:
  – HMM-VP/1: tags are “in entity”, “other”
  – HMM-VP/4: tags are “begin entity”, “end entity”, “continue entity”, “unique”, “other”
  – SMM-VP: all features f(w) have versions for “f(w) true for some w in segment that is first (last, any) word of segment”
  – dictionaries: like Borthwick
    • HMM-VP/1: f_D(w)=“word w is in D”
    • HMM-VP/4: f_{D,begin}(w)=“word w begins entity in D”, etc, etc
• Dictionary lookup
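
The two HMM tag sets differ only in how an entity span is encoded as per-word tags. A minimal sketch, assuming entities are given as (start, end) word spans with `end` exclusive:

```python
def tags_vp1(n_words, entities):
    """HMM-VP/1 encoding: each word is 'in entity' or 'other'."""
    tags = ["other"] * n_words
    for start, end in entities:
        for i in range(start, end):
            tags[i] = "in entity"
    return tags


def tags_vp4(n_words, entities):
    """HMM-VP/4 encoding: begin / continue / end / unique, plus 'other'."""
    tags = ["other"] * n_words
    for start, end in entities:
        if end - start == 1:
            tags[start] = "unique"  # single-word entity
        else:
            tags[start] = "begin entity"
            tags[end - 1] = "end entity"
            for i in range(start + 1, end - 1):
                tags[i] = "continue entity"
    return tags
```

Under the first encoding two adjacent entities merge into one run of “in entity” tags, whereas the second keeps their boundaries distinct.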

Datasets used

Used small training sets (10% of available) in experiments.

Results

Results: varying history

Results: changing the dictionary

Results: vs CRF
