syntax the study of how words are ordered and grouped together key concept: constituent = a sequence...


Upload: cynthia-clemence-stanley

Post on 12-Jan-2016


Syntax

• The study of how words are ordered and grouped together

• Key concept: constituent = a sequence of words that acts as a unit

{ he | the man | the short man | the short man with the large hat }

went

{ home | to his house | out of the car | with her }


Phrase Structure

She saw a tall man with a telescope

S
  NP
    PN: She
  VP
    VBD: saw
    NP: a tall man
    PP
      PRP: with
      NP: a telescope


Noun Phrases

• Contains a noun plus descriptors, including:

– Determiner: the, a, this, that

– Adjective phrases: green, very tall

– Head: the main noun in the phrase

– Post-modifiers: prepositional phrases or relative clauses

That   old   green   couch   of yours   that I want to throw out
det    adj   adj     head    PP         relative clause


Verb Phrases

• Contains a verb (the head) with modifiers and other elements that depend on the verb

want   to throw out
head   PP

previously   saw    the man         in the park with her telescope
adv          head   direct object   PP

might   have   showed   his boss          the code        yesterday
modal   aux    head     indirect object   direct object   adverb


Prepositional Phrases

• Preposition as head and NP as complement

with   her grey poodle
head   complement

Adjective Phrases

• Adjective as head with modifiers

extremely   sure   that he would win
adv         head   relative clause


Shallow Parsing

• Extract phrases from text as ‘chunks’

• Flat, no tree structures

• Usually based on patterns of POS tags

• Full parsing can be conceived of as two steps:

– Chunking / Shallow parsing

– Attachment of chunks to each other


Noun Phrases

• Base Noun Phrase: A noun phrase that does not contain other noun phrases as a component

• Or, no modification to the right of the head

a large green cow

The United States Government

every poor shop-owner’s dream ?

other methods and techniques ?


Manual Methodology

• Build a regular expression over POS tags

• E.g.:

DT? (ADJ | VBG)* (NN)+

• Very hard to do accurately

• Lots of manual labor

• Cannot be easily tuned to a specific corpus
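As a rough illustration of the manual approach, the pattern above can be run over a string of POS tags. This is only a sketch: the helper name `chunk_noun_phrases`, the space-terminated tag encoding, and the choice to let `NN\w*` cover variants such as NN0 and NN1 are assumptions made for this example, not part of the slides.

```python
import re

# Pattern from the slide: DT? (ADJ | VBG)* (NN)+, written over a string in
# which every POS tag is followed by one space. NN\w* also accepts NN0, NN1...
# The lookbehind forces a match to begin at the start of a tag.
NP_PATTERN = re.compile(r"(?<!\S)(?:DT )?(?:(?:ADJ|VBG) )*(?:NN\w* )+")

def chunk_noun_phrases(tagged):
    """Return (start, end) token spans, end exclusive, of matched base NPs."""
    tag_string = "".join(tag + " " for _, tag in tagged)
    spans = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each tag contributes exactly one space, so counting spaces before a
        # character offset converts it into a token index.
        start = tag_string.count(" ", 0, match.start())
        end = tag_string.count(" ", 0, match.end())
        spans.append((start, end))
    return spans

sentence = [("the", "DT"), ("tall", "ADJ"), ("man", "NN1"), ("ran", "VBD"),
            ("with", "PRP"), ("blinding", "VBG"), ("speed", "NN0")]
spans = chunk_noun_phrases(sentence)
phrases = [" ".join(word for word, _ in sentence[s:e]) for s, e in spans]
print(phrases)
```

Even this small pattern shows why tuning by hand is laborious: every new tag variant or corpus convention means editing the expression again.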


Chunk Tags

• Represent NPs by tags:

[the tall man] ran with [blinding speed]

the   tall   man   ran   with   blinding   speed
DT    ADJ    NN1   VBD   PRP    VBG        NN0
I     I      I     O     O      I          I

• Need B tag for adjacent NPs:

On [Tuesday] [the company] went bankrupt

On   Tuesday   the   company   went   bankrupt
O    I         B     I         O      O
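The tagging scheme can be sketched as a pair of conversions between bracketed spans and I/O/B tags. The function names and the (start, end) span encoding with an exclusive end are assumptions for illustration; B is used only where one NP directly follows another, as in the example above.

```python
def spans_to_iob(n_tokens, spans):
    """Encode NP spans as I/O/B tags; B marks an NP that starts right after another."""
    tags = ["O"] * n_tokens
    prev_end = None
    for start, end in sorted(spans):
        for i in range(start, end):
            tags[i] = "I"
        if start == prev_end:      # adjacent to the previous NP
            tags[start] = "B"
        prev_end = end
    return tags

def iob_to_spans(tags):
    """Decode I/O/B tags back into (start, end) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag in ("O", "B") and start is not None:
            spans.append((start, i))   # close the NP that was open
            start = None
        if tag in ("I", "B") and start is None:
            start = i                  # open a new NP
    if start is not None:
        spans.append((start, len(tags)))
    return spans

# "On [Tuesday] [the company] went bankrupt"
tags = spans_to_iob(6, [(1, 2), (2, 4)])
print(tags)
print(iob_to_spans(tags))
```

Casting chunking as per-word tagging is what lets the learning methods on the following slides treat it as a classification problem.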


Transformational Learning

• Baseline tagger:

– Most frequent chunk tag for POS or word

• Rule templates (100 total):

Word/POS context                         Chunk-tag (ctag) context
current word/POS                         current ctag
word/POS 1 on left/right                 current and left ctag
current and left/right word/POS          current and right ctag
word/POS on left and on right            in two ctags to left
in two words/POSs on left/right          in two ctags to right
in three words/POSs on left/right


Some Rules Learned

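T = chunk tag, P = POS tag, W = word; the index gives the position relative to the current word.

1. (T1=O, P0=JJ): I → O

2. (T-2=I, T-1=I, P0=DT): → B

3. (T-2=O, T-1=I, P-1=DT): → I

4. (T-1=I, P0=WDT): I → B

5. (T-1=I, P0=PRP): I → B

6. (T-1=I, W0=who): I → B

7. (T-1=I, P0=CC, P1=NN): O → I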
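To make the rule notation concrete, here is a minimal sketch of applying two of the learned rules to a baseline tagging. The rule encoding (a dict of (kind, offset) conditions plus a from-tag and a to-tag), the function name `apply_rule`, the example sentence, and its baseline tags are all assumptions for illustration. Conditions are checked against the tags as they were before the rule fired, which is one possible application order.

```python
def apply_rule(words, pos, ctags, rule):
    """Apply one transformation rule at every position of a sentence."""
    conditions, from_tag, to_tag = rule
    context = {"W": words, "P": pos, "T": ctags}  # conditions read the old tags
    new_tags = list(ctags)
    for i in range(len(ctags)):
        # from_tag None means the rule rewrites whatever tag is present.
        if from_tag is not None and ctags[i] != from_tag:
            continue
        matched = True
        for (kind, offset), value in conditions.items():
            j = i + offset
            if not 0 <= j < len(ctags) or context[kind][j] != value:
                matched = False
                break
        if matched:
            new_tags[i] = to_tag
    return new_tags

# Rules 1 and 4 from the list: conditions, tag to change, replacement tag.
RULE_1 = ({("T", 1): "O", ("P", 0): "JJ"}, "I", "O")
RULE_4 = ({("T", -1): "I", ("P", 0): "WDT"}, "I", "B")

words = ["the", "report", "which", "was", "late", "yesterday"]
pos = ["DT", "NN", "WDT", "VBD", "JJ", "RB"]
baseline = ["I", "I", "I", "O", "I", "O"]  # made-up baseline tagger output

after_rule_1 = apply_rule(words, pos, baseline, RULE_1)
after_rule_4 = apply_rule(words, pos, after_rule_1, RULE_4)
print(after_rule_1)
print(after_rule_4)
```

Rules are applied in the order they were learned, each one correcting errors left by the baseline and by earlier rules.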


Results

Training      Prec.   Recall   Tag Acc.
Baseline      78.2    81.9     94.5
50K           89.8    90.4     96.9
100K          91.3    91.8     97.2
200K          91.8    92.3     97.4
200K nolex    90.5    90.7     97.0
950K          93.1    93.5     97.8

• Precision = fraction of NPs predicted that are correct

• Recall = fraction of actual NPs that are found
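The two definitions translate directly into set operations over chunk spans. In this sketch an NP counts as correct only if both boundaries match exactly; the span encoding and the example spans are hypothetical.

```python
def precision_recall(predicted, gold):
    """Chunk-level precision and recall over exact-match (start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold_spans = [(0, 3), (5, 7), (8, 9)]   # hypothetical annotated NPs
predicted_spans = [(0, 3), (5, 6)]      # hypothetical chunker output
precision, recall = precision_recall(predicted_spans, gold_spans)
print(precision, recall)
```

Tag accuracy, by contrast, is counted per word, which is why it is much higher than the chunk-level scores in the table.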


Memory-Based Learning

• Match test data to previously seen data and classify based on the most similar previously seen instances

• E.g:

the saw was

she saw the

boy saw three

boy saw the

boy ate the


k-Nearest Neighbor (kNN)

• Find k most similar training examples

• Let them ‘vote’ on the correct class for the test example

– Weight neighbors by distance from test

• Main problem: defining ‘similar’

– Shallow parsing – overlap of words and POS

– Use feature weighting...
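A minimal sketch of kNN with a weighted overlap distance follows. The three-word feature vectors reuse the trigrams from the memory-based learning example, but the class labels, the feature weights, and the 1 / (1 + distance) vote weighting are assumptions chosen for illustration.

```python
from collections import Counter

def overlap_distance(a, b, weights):
    """Sum the weights of the features on which two instances disagree."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

def knn_classify(memory, query, weights, k=3):
    """Let the k nearest stored instances vote, closer ones counting more."""
    ranked = sorted(memory,
                    key=lambda item: overlap_distance(item[0], query, weights))
    votes = Counter()
    for features, label in ranked[:k]:
        distance = overlap_distance(features, query, weights)
        votes[label] += 1.0 / (1.0 + distance)   # assumed weighting scheme
    return votes.most_common(1)[0][0]

# (previous word, current word, next word) -> hypothetical chunk tag
memory = [
    (("the", "saw", "was"), "I"),
    (("she", "saw", "the"), "O"),
    (("boy", "saw", "three"), "O"),
    (("boy", "ate", "the"), "O"),
]
weights = (1.0, 3.0, 1.0)   # the middle word is treated as most informative
label = knn_classify(memory, ("boy", "saw", "the"), weights, k=3)
print(label)
```

With these weights, a mismatch on the middle word costs more than a mismatch on both neighbors together, so "boy ate the" is ranked last.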


Information Gain

• Not all features are created equal (e.g. ‘saw’ in the previous example is more important)

• Weight the features by information gain = how much does f distinguish different classes

$$w(f_i) = \frac{H(C) - \sum_{v \in V(f_i)} P(f_i = v)\, H(C \mid f_i = v)}{H(V(f_i))}$$

$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$$
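The weight can be sketched in a few lines, assuming it is the reduction in class entropy obtained by knowing the feature value, divided by the entropy of the feature's own values. The function names and the toy data are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """H(X) = -sum P(x) log2 P(x), estimated from observed frequencies."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def feature_weight(values, classes):
    """(H(C) - sum_v P(f=v) H(C | f=v)) / H(V(f)) for one feature."""
    by_value = defaultdict(list)
    for value, cls in zip(values, classes):
        by_value[value].append(cls)
    n = len(values)
    conditional = sum(len(cs) / n * entropy(cs) for cs in by_value.values())
    split = entropy(values)
    if split == 0:          # a feature with a single value tells us nothing
        return 0.0
    return (entropy(classes) - conditional) / split

classes = ["I", "I", "O", "O"]
informative = ["x", "x", "y", "y"]     # value determines the class
uninformative = ["x", "y", "x", "y"]   # value is independent of the class
print(feature_weight(informative, classes))
print(feature_weight(uninformative, classes))
```

The resulting weights can be plugged into the overlap distance so that mismatches on informative features count for more.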


[Figure: classes C1, C2, C3 and C4 separated by a feature with high information gain, compared with a feature with low information gain]


Base Verb Phrase

• Verb phrase not including NPs or PPs

[NP Pierre Vinken NP] , [NP 61 years NP] old ,

[VP will soon be joining VP] [NP the board NP]

as [NP a nonexecutive director NP] .


Results

• Context: 2 words and POS on left and 1 word and POS on right (shown as 2 – 1)

Task   Context      Prec.   Recall   Acc.
bNP    curr. word   76      80       93
       curr. POS    80      82       95
       2 – 1        94      94       98
bVP    curr. word   68      73       96
       curr. POS    75      89       97
       2 – 1        94      96       99


Efficiency of MBL

• Finding the neighbors can be costly

• Possibility: Build decision tree based on information gain of features to index data = approximate kNN

[Figure: decision tree over the features W0, W-1, P-1 and P-2, with example branch values ‘saw’, ‘the’ and ‘boy’]


MBSL

• Memory-based technique relying on sequential nature of the data

– Use “tiles” of phrases in memory to “cover” a new candidate (and context), and compute a tiling score

went to the white house for dinner

VBD PRP [[ DT ADJ NN1 ]] PRP NN1

Tiles:

PRP [NP DT

[NP DT ADJ NN1

NN1 NP] PRP

PRP [NP DT ADJ

ADJ NN1 NP]


Tile Evidence

• Memory:

[NP DT NN1 NP] VBD [NP DT NN1 NN1 NP] [NP NN2 NP] .

[NP ADJ NN2 NP] AUX VBG PRP [NP DT ADJ NN1 NP] .

• Some tiles:

Tile             pos   neg
[NP DT           3     0
[NP DT NN1       2     0
DT NN1 NP]       1     1
NN1 NP]          3     1
NN1 NP] VBD      1     0

• Score tile t by ft(t) = pos / total, where total = pos + neg. Only keep tiles whose score ft(t) exceeds a threshold
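The evidence counts for the memory above can be reproduced with a small counting routine. This is a simplified sketch under one assumption about how evidence is counted: pos is the number of exact occurrences of the tile (brackets included) in memory, and total is the number of occurrences of the tile's POS sequence once brackets are ignored. The function names are made up for this example.

```python
BRACKETS = ("[NP", "NP]")

def count_occurrences(seq, sub):
    """Number of positions at which sub occurs contiguously in seq."""
    n = len(sub)
    return sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == sub)

def strip_brackets(tokens):
    return [t for t in tokens if t not in BRACKETS]

def tile_evidence(memory, tile):
    """Return (pos, neg) for a tile given bracketed memory sentences."""
    pos = sum(count_occurrences(sent, tile) for sent in memory)
    bare_tile = strip_brackets(tile)
    total = sum(count_occurrences(strip_brackets(sent), bare_tile)
                for sent in memory)
    return pos, total - pos

def tile_score(memory, tile):
    pos, neg = tile_evidence(memory, tile)
    return pos / (pos + neg) if pos + neg else 0.0

memory = [
    "[NP DT NN1 NP] VBD [NP DT NN1 NN1 NP] [NP NN2 NP] .".split(),
    "[NP ADJ NN2 NP] AUX VBG PRP [NP DT ADJ NN1 NP] .".split(),
]
for text in ["[NP DT", "[NP DT NN1", "DT NN1 NP]", "NN1 NP]", "NN1 NP] VBD"]:
    tile = text.split()
    print(text, tile_evidence(memory, tile), tile_score(memory, tile))
```

A tile such as DT NN1 NP] scores only 0.5 here, so whether it survives depends on where the threshold is set.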


Covers

• Tile t1 connects to t2 in a candidate if:

– t2 starts after t1

– there is no gap between them (may be overlap)

– t2 ends after t1

VBD PRP [[ DT ADJ NN1 ]] PRP NN1

PRP [NP DT

[NP DT ADJ

NN1 NP] PRP

• A sequence of tiles covers a candidate if

– each tile connects to the next

– the tiles collectively match the entire candidate including brackets and maybe some context
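The two definitions can be sketched with tiles represented by the token positions they match in the candidate. The (start, end) encoding with an exclusive end, counting the brackets [[ and ]] as tokens, and the function names are assumptions for illustration.

```python
def connects(t1, t2):
    """t2 starts after t1, leaves no gap (overlap allowed), and ends after t1."""
    return t1[0] < t2[0] <= t1[1] and t2[1] > t1[1]

def is_cover(tiles, cand_start, cand_end):
    """True if consecutive tiles connect and together span the whole candidate."""
    if not tiles:
        return False
    chained = all(connects(a, b) for a, b in zip(tiles, tiles[1:]))
    return chained and tiles[0][0] <= cand_start and tiles[-1][1] >= cand_end

# Tokens: 0 VBD, 1 PRP, 2 [[, 3 DT, 4 ADJ, 5 NN1, 6 ]], 7 PRP, 8 NN1
# The candidate with its brackets occupies positions 2..6, i.e. the span (2, 7).
tile_a = (1, 4)   # PRP [NP DT
tile_b = (2, 5)   # [NP DT ADJ
tile_c = (5, 8)   # NN1 NP] PRP

print(is_cover([tile_a, tile_b, tile_c], 2, 7))
print(is_cover([tile_a, tile_c], 2, 7))
```

Dropping the middle tile leaves position 4 (ADJ) unmatched, so the shorter sequence is not a cover.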


Cover Graph

VBD PRP [[ DT ADJ NN1 ]] PRP NN1

[Figure: cover graph running from START to END, with one node per matching tile: ‘PRP [NP DT’, ‘[NP DT ADJ NN1’, ‘NN1 NP] PRP’, ‘PRP [NP DT ADJ’ and ‘ADJ NN1 NP]’]


Measures of ‘Goodness’

• Number of different covers

• Size of smallest cover (fewest tiles)

• Maximum context in any cover (left + right)

• Maximum overlap of tiles in any cover

• Grand total positive evidence divided by grand total positive+negative evidence

Combine these measures by linear weighting


Scoring a Candidate

CandidateScore(candidate, T)

• G ← CoverGraph(candidate, T)

• Compute statistics by DFS on G

• Compute candidate score as linear function of statistics

Complexity (O(l) tiles in candidate of length l):

– Creating the cover graph is O(l^2)

– DFS is O(V+E) = O(l^2)


Full Algorithm

MBSL(sent, C, T)

1. For each subsequence of sent, do:

   1. Construct a candidate s by adding brackets [[ and ]] before and after the subsequence

   2. fC(s) ← CandidateScore(s, T)

   3. If fC(s) > C, then add s to candidate-set

2. For each c in candidate-set in decreasing order of fC(c), do:

   1. Remove all candidates overlapping with c from candidate-set

3. Return candidate-set as target instances
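Steps 2 and 3 amount to a greedy selection of non-overlapping candidates, which can be sketched as follows. The scores here are hypothetical stand-ins for the values CandidateScore would return, and the span encoding and function name are assumptions for illustration.

```python
def select_candidates(scored, threshold):
    """Greedily keep the best-scoring candidates that do not overlap."""
    kept = []
    # Visit candidates from highest to lowest score.
    for (start, end), score in sorted(scored.items(), key=lambda kv: -kv[1]):
        if score <= threshold:
            continue                      # fails the candidate threshold
        # Keep the span only if it is disjoint from every span kept so far.
        if all(end <= s or start >= e for s, e in kept):
            kept.append((start, end))
    return sorted(kept)

# Hypothetical candidate scores for (start, end) spans of one sentence.
scored = {(0, 3): 0.9, (1, 3): 0.7, (5, 7): 0.8, (4, 7): 0.65, (3, 4): 0.2}
selected = select_candidates(scored, threshold=0.5)
print(selected)
```

Skipping a candidate that overlaps an already kept one gives the same result as deleting all of its overlapping competitors when the higher-scoring candidate is visited.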


Results

Target Type   Context size   T     Prec.   Recall
NP            3              0.6   92      92
SV            3              0.6   89      85
VO            2              0.5   77      90