a statistical approach to anaphora resolution

10
A Statistical Approach to Anaphora Resolution Niyu Ge, John Hale and Eugene Charniak Dept. of Computer Science, Brown University, [nge I h [ c] ~cs. brown, edu Abstract This paper presents an algorithm for identi- fying pronominal anaphora and two experi- ments based upon this algorithm. We incorpo- rate multiple anaphora resolution factors into a statistic al framewo rk - - specifically the dis- tance between the pronoun and the proposed antecedent, gender/number/animaticity of the proposed antecedent, governing head informa- tion and noun phrase repetition. We combine them into a single probability that enables us to identify the referent. Our first experiment shows the relative contribution of each source Of information and demonstrates a success rate of 82.9% for all sources combined. The second experiment investigates a method for unsuper- vised learning of gender/number/animaticity information. We present some experiments il- lustrating the accuracy of the method and note that with this information added, our pronoun resolution method achieves 84.2% accuracy. 1 Introduction We present a statistical method for determin- ing pronoun anaphora. This prog ram differs from earlier work in its almost complete lack of hand-crafting, relying instead on a very small corpus of Penn Wall Street Journal Tree-bank text (Marcus et al., 1993) that has been marked with co-reference inform ation. The first sections of this paper describe this program: the proba- bi li st ic model behind it, its implement ation , and its performance. The second half of the paper describes a method for using (portions of) t~e aforemen- tioned program to learn automatically the typi- cal gender of English words, information that is itself used in the pronoun resolution program. In particular, the scheme infers the gender of a referent from the gender of the pronouns that 161 refer to it and selects referents using the pro- noun anaphora program. We present some typ- ical results as well as the more rigorous results of a blind evaluation of its output. 2 A Probabilistic Model There are many factors, both syntactic and se- mantic, upon which a pronoun resolution sys- tem rel ies . (Mitkov (1997) does a detailed st ud y on factors in anap hora resolution.) We first dis- cuss the training features we use and then derive the probability equations from them. The first piece of useful information we con- sider is the distance between the pronoun and the candidate antecedent. Obviously the greater the distance the lower the probability. Secondly, we look at the syntactic situation in which the pronoun finds itself. The most well studied constraints are those involving reflexive pronouns. One cla ssical app roa ch to resolving pronouns in text that takes some syntactic fac- tors into consideration is that of Hobbs (1976). This algorithm searches the parse tree in a left- to-right, breadth-first fashion that obeys the major reflexive pronoun constraints while giv- ing a preference to antecedents that are closer to the pronoun. In resolving inter- senten tial pronouns, the algorithm searches the previous sentence, again in left-to-right, breadth-first or- der. This implements the observed preferenc e for subject position antecedents. Next, the actual words in a proposed noun- phrase antecedent give us information regarding the gender, number, and animatici ty of the pro- posed referent. For example: nificance as one of the last women to be ezecuted in France. She became an abortionist because it enabled her to

Upload: triveni-pal

Post on 07-Apr-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 1/10

A S t a t i s t i c a l A p p r o a c h t o A n a p h o r a R e s o l u t i o n

N i y u G e , Jo h n H a l e a nd E u g e n e C h a r n i a k

Dept. of Computer Science,Brown University,

[ n g e I h [ c ] ~ c s . b r o w n , e d u

A b s t r a c t

This paper presents an algorithm for identi-

fying pronominal anaphora and two experi-

ments based upon this algorithm. We incorpo-

rate multiple anaphora resolution factors intoa statistical framework - - specifically the dis-

tance between the pronoun and the proposed

antecedent, gender/number/animaticity of the

proposed antecedent, governing head informa-

tion and noun phrase repetition. We combine

them into a single probability that enables us

to identify the referent. Our first experiment

shows the relative contribution of each source

Of information and demo nst rate s a success rate

of 82.9% for all sources combined. The second

experiment investigates a method for unsuper-

vised learning of gender/number/animaticityinformation. We present some exper iment s il-

lustrating the accuracy of the method and note

that with this information added, our pronoun

resolution meth od achieves 84.2% accuracy.

1 I n t r o d u c t i o n

We present a statistical method for determin-

ing pronoun anaphora. This prog ram differs

from earlier work in its almost complete lack of

hand-crafting, relying instead on a very small

corpus of Penn Wall Street Journal Tree-bank

text (Marcus et al., 1993) that has been markedwith co-reference information. The first sections

of this paper describe this program: the proba-

bilistic model behind it, its implementation , and

its performance.

The second half of the paper describes a

method for using (portions of) t~e aforemen-

tioned program to learn automatical ly the typi-

cal gender of English words, information that is

itself used in the pronoun resolution program.

In particular, the scheme infers the gender of a

referent from the gender of the pronouns that

161

refer to it and selects referents using the pro-

noun anaphora program. We present some typ-

ical results as well as the more rigorous results

of a blind evaluation of its o utp ut.

2 A P robab i l i s t i c Mode l

There are many factors, both syntactic and se-

mantic, upon which a pronoun resolution sys-

tem relies. (Mitkov (1997) does a detailed st udy

on factors in anap hora reso lution.) We first dis-

cuss the training features we use and then derive

the probability equations from th em.

The first piece of useful information we con-

sider is the distance between the pronoun

and the candidate antecedent. Obviously the

greater the distance the lower the probability.

Secondly, we look at the syntactic situation inwhich the pronoun finds itself. The most well

studied constraints are those involving reflexive

pronouns. One classical approach to resolving

pronouns in text that takes some syntactic fac-

tors into consideration is that of Hobbs (1976).

This algorithm searches the parse tree in a left-

to-right, breadth-first fashion that obeys the

major reflexive pronoun constraints while giv-

ing a preference to antecedents that are closer

to the pronoun. In resolving inter- senten tial

pronouns, the algorithm searches the previous

sentence, again in left-to-right, breadth-first or-der. This implements the observed preference

for subject position antecedents.

Next, the actual words in a proposed noun-

phrase antecedent give us information regard ing

the gender, number, and animatici ty of the pro-

posed referent. For example:

M a r i e G i r a u d c a r r i es h i s t o r ic a l s i g -

n i f ic a n c e a s o n e o f t h e l a s t w o m e n t o

b e e z e c u t e d i n F r a n c e . S h e b e c a m e

a n a b o r t i o n i s t b e c a u s e i t e n a b l e d h e r t o

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 2/10

b uy ja m , co co a a n d o th er w a r - ra t io n ed

goodies.

H e r e i t is h e l p f u l t o r e co g n i z e t h a t " M a r i e " i s

p r o b a b l y f e m a l e a n d t h u s i s u n l i k e l y t o b e r e -f e r r e d t o b y " h e " o r " i t " . G i v e n t h e w o r d s in t h e

p r o p o s e d a n t e c e d e n t w e w a n t t o f i n d t h e p r o b -

a b i l it y t h a t i t i s t h e r e f e r e n t o f t h e p r o n o u n i n

q u e s t i o n . W e c o ll e c t t h e s e p r o b a b i l i t i e s o n t h e

t r a i n i n g d a t a , w h i c h a r e m a r k e d w i t h r e f e r e n c e

l in k s . T h e w o r d s in th e a n t e c e d e n t s o m e t i m e s

a l so le t u s t e s t f o r n u m b e r a g r e e m e n t . G e n e r -

a l l y , a s i n g u l a r p r o n o u n c a n n o t r e f e r t o a p l u r a l

n o u n p h r a s e , s o t h a t i n r e s o l v i n g s u c h a p r o -

n o u n a n y p l u ra l c a n d i d a t e s s h o u l d b e r u l e d o u t .

H o w e v e r a s i n g u l a r n o u n p h r a s e c a n b e t h e r e f-

e r e n t o f a p l u ra l p r o n o u n , a s i l l u s t r a t e d b y t h ef o l l o w i n g e x a m p l e :

" I t h in k i f I te ll V i a c o m I n ee d m o r e

t ime , t h ey wi l l t a ke 'C o sb y ' a cro ss th e

s t ree t , " sa ys t h e g en era l ma n a g er o l a

n e t w o r k a ~ l i a t e .

I t i s a l s o u s e f u l t o n o t e t h e i n t e r a c t i o n b e -

t w e e n t h e h e a d c o n s t i t u e n t o f t h e p r o n o u n p

a n d t h e a n t e c e d e n t . F o r e x a m p l e :

A Ja p a n ese co mp a n y mig h t ma ke t e l e -

v is ion p ic ture tubes in Japan , asse m-b le t h e T V se t s i n Ma la ys ia a n d ex to r t

th em t o I n d o n es ia .

H e r e w e w o u l d c o m p a r e t h e d e g r e e t o w h i c h

e a c h p o ss i bl e c a n d i d a t e a n t e c e d e n t ( A J a p a n e s e

comp any, te lev is ion p ic ture tubes, Japan , T V

sets , a n d Ma la ys ia i n t h is e x a m p l e ) c o u l d s e r v e

a s t h e d i r e c t o b j e c t o f " e x p o r t " . T h e s e p r o b a -

b i l it i e s g iv e u s a w a y t o i m p l e m e n t s e l e c t i o n a l

r e s t ri c t i o n . A c a n o n i c a l e x a m p l e o f s e l e c ti o n a l

r e s t r ic t i o n i s t h a t o f t h e v e r b " e a t " , w h i c h s e-

l e c t s f o o d a s i t s d i r e c t o b j e c t . I n t h e c a s e o f" e x p o r t " t h e r e s t r i c t io n i s n o t a s c l e a r c u t . N e v -

e r t h e l e s s i t c a n s t i l l g i v e u s g u i d a n c e o n w h i c h

c a n d i d a t e s a r e m o r e p r o b ab l e t h a n o t h e r s.

T h e l a s t f a c t o r w e c o n s i d e r i s r e f e r e n t s ' m e n -

t ion count . N o u n p h r a s e s t h a t a r e m e n t i o n e d

r e p e a t e d l y a re p r e f er r e d . T h e t r a i n i n g c o r p u s i s

m a r k e d w i t h t h e n u m b e r o f t i m e s a r e f e r e n t h a s

b e e n m e n t i o n e d u p t o t h a t p o i n t i n t h e s t o r y .

H e r e w e a r e c o n c e r n e d w i t h t h e p r o b a b i l i t y t h a t

a p r o p o s e d a n t e c e d e n t is c o r r e c t g i v e n t h a t i t

h a s b e e n r e p e a t e d a c e r t a i n n u m b e r o f t i m e s .

162

I n e ff e ct , w e u s e th i s p r o b a b i l i t y i n f o r m a t i o n t o

i d e n t if y t h e t o p ic o f th e s e g m e n t w i t h t h e b e l ie f

t h a t t h e t o p i c i s m o r e l i k e l y t o b e r e f e r r e d t o b y

a p r o n o u n . T h e i d e a is s i m i l a r t o t h a t u s e d in

t h e c e n t e r i n g a p p r o a c h ( B r e n n a n e t a l. , 1 98 7 )w h e r e a c o n t i n u e d t o p i c i s t h e h i g h e s t - r a n k e d

c a n d i d a t e f o r p r o n o m i n a l i z a t i o n .

G i v e n t h e a b o v e p o s s ib l e s o u r c e s o f i n f o r m a r

t io n , w e a rr i v e a t t h e f o l lo w i n g e q u a t i o n , w h e r e

F ( p ) d e n o t e s a f u n c t i o n f r o m p r o n o u n s t o t h e i r

a n t e c e d e n t s :

F(p ) = a rg m a x P ( A( p) = a lp , h , l~ ', t , l, so , d~ A~')

w h e r e A ( p ) i s a r a n d o m v a r i a b l e d e n o t i n g t h e

r e f e r e n t o f t h e p r o n o u n p a n d a i s a p r o p o s e d

a n t e c e d e n t . I n t h e c o n d i t i o n i n g e v e n t s , h i s t h eh e a d c o n s t i t u e n t a b o v e p , l ~ i s t h e l is t o f c a n d i -

d a t e a n t e c e d e n t s t o b e c o n s i d e r e d , t i s t h e t y p e

o f p h r a s e o f t h e p r o p o s e d a n t e c e d e n t ( a lw a y s

a n o u n - p h r a s e i n t h i s s t u d y ) , I i s t h e t y p e o f

t h e h e a d c o n s t i t u e n t , s p d e s c r ib e s t h e s y n t a c t i c

s t r u c t u r e i n w h i c h p a p p e a r s , d s p e c i f i e s t h e d i s-

t a n c e o f e a c h a n t e c e d e n t f r o m p a n d M " i s t h e

n u m b e r o f t i m e s t h e r e f e r en t is m e n t i o n e d . N o t e

th a t 17r , d'~ a n d A ~ a r e v e c to r qu a n t i t i e s i n w h ic h

e a c h e n t r y c o r r e s p o n d s t o a p o s s ib l e a n t e c e d e n t .

W h e n v i e w e d i n t h i s w a y , a c a n b e r e g a r d e d a s

a n i n d e x i n t o t h e s e v e c t o r s t h a t s p e ci f ie s w h i c h

v a l u e i s r e l e v a n t t o t h e p a r t i c u l a r c h o i c e o f a n -

t e c e d e n t .

T h i s e q u a t i o n i s d e c o m p o s e d i n t o p i e c e s t h a t

c o r r e s p o n d t o a ll t h e a b o v e f a c t o r s b u t a r e m o r e

s t a ti s t ic a l ly m a n a g e a b l e . T h e d e c o m p o s i t i o n

m a k e s u s e o f B a y e s ' t h e o r e m a n d i s b a s e d o n

c e r t a i n i n d e p e n d e n c e a s s u m p t i o n s d i s c u s s e d b e -

low .

P ( A ( p ) = a l p , h , f i r , t , l , s p , d ~ . Q ' )

= P ( a l A ~ ) P ( p , h , f i r , t , l , s p , ~ a , 2 ~ ) ( 1 )

P ( p , h , f i r , t , t , s p , d i M )

o ¢ P C a l M ) P ( p , h , f i r , t , l , s p , ~ a , . Q ' ) ( 2 )

= P ( a [ : Q ) P ( . % , ~ a , :~ 'I )

P ( p , h , f ir , t , l l a , ~ , s p , i ) ( 3 )

= P ( a l l ~ ) P ( s p , d ~ a , . Q )

P C h , t , Z l a , ~ '0 " , o , i )

P C . . ~ l a , . ~ ' , s o , d , h , t , l ) ( 4 )

o c P ( a ] l ~ ) P ( S o , ~ a , M ' )

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 3/10

P ( p , 1 4 t in , ] Q , s o , d , h , t , I ) (5 )

= P ( a l . Q ) P ( s p , d ~ a , 3 ~ r )

P ( f f r l a , I ~ , s o , d , h , t , I ) . (6 )

P ( p l a . l ~ , s f , , d . h , t , l , l ~ )

cx P (a1 6 3 P(d t t la )P ( f f ' l h , t , I, a )

P (p lw° ) ( 7 )

E q u a t i o n ( 1) is s i m p l y a n a p p l i c a t i o n o f B a y e s '

r ul e. T h e d e n o m i n a t o r is e l i m i n a t e d in t h e

usua l f a sh ion , r e su l t i n g in e qu a t io n ( 2 ) . S e l e c -

t i v e l y a p p l y i n g t h e c h a i n r u l e r e s u l t s i n e q u a -

t i ons (3 ) a nd ( 4 ) . I n e qu a t io n ( 4 ) , t he t e r m

P ( h . t , l l a , . ~ , S o , d ) i s t h e s a m e f o r e v e r y a n -

t e c e d e n t a n d is t h u s re m o v e d . E q u a t i o n ( 6)f o ll o w s w h e n w e b r e a k t h e l a s t c o m p o n e n t o f

( 5 ) i n t o t w o p r o b a b i l i t y d i s t r i b u t i o n s . I n e q u a -

t i o n ( 7 ) w e m a k e t h e f o l l o w i n g i n d e p e n d e n c e a s -

s u m p t i o n s :

• G i v e n a p a r t i c u l a r c h o i c e o f t h e a n t e c e d e n t

c a n d i d a t e s , t h e d i s t a n c e i s i n d e p e n d e n t o f

d i s t a n c es o f c a n d i d a t e s o t h e r t h a n t h e a n -

t e c e d e n t ( a n d t h e d i s t a n c e t o n o n - r e f e r e n t s

c a n b e i g n o r e d ) :

P ( s o , d ~ a , 2 ~ ) o ¢ P ( s o , d o l a , I C 4 )

• T h e s y n t n c t i c s t r u c t u r e s t , a n d t h e d i s t a n c e

f r o m t h e p r o n o u n d a a r e i n d e p e n d e n t o f t h e

n u m b e r o f t i m e s t h e r e f e r e n t i s m e n t i o n e d .

T h u s

P ( s p , d o la , M ) = P ( s p , d . la )

T h e n w e c o m b i n e s p a n d d e i n t o o n e v a r i -

able d I t , H o b b s d i s t a n c e , s i n c e t h e H o b b s

a l g o r i t h m t a k e s b o t h t h e s y n t a x a n d d i s -

t a n c e i n t o a c c o u n t .

T h e w o r d s in t h e a n t e c e d e n t d e p e n d o n l y

o n t h e p a r e n t c o n s t i t u e n t h , t h e t y p e o f t h e

w o r d s t , a n d t h e t y p e o f t h e p a r e n t I. H e n c e

e ( f f ' l a , M , s p , ~ , h , t , l ) = P ( ~ l h , t , l , a )

• T h e c h o i c e p r o n o u n d e p e n d s o n l y o n t h e

w o r d s in t h e a n t e c e d e n t , i .e .

P { p l a , M , s p , d , h , t , l , ~ = P ( p l a , W )

1 6 3

• I f w e t r e a t a a s a n i n d e x i n t o t h e v e c t o r 1 ~ ,

t h e n ( a , I.V ') i s s i m p l y t h e a t h c a n d i d a t e i n

t h e l i st f fz . W e a s s u m e t h e s e l e c t i o n o f t h e

p r o n o u n i s i n d e p e n d e n t o f t h e c a n d i d a t e so t h e r t h a n t h e a n t e c e d e n t . H e n c e

P ( p l a , W ) = P ( p l w ,~ )

S inc e I ~" i s a ve c to r , w e ne e d to no r m a l -

iz e P ( f f ' l h , t , l , a ) t o o b t a i n t h e p r o b a b i l i t y o f

e a c h e l e m e n t in t h e v e c t o r . I t i s r e a s o n -

a b l e t o a s s u m e t h a t t h e a n t e c e d e n t s i n W a r e

i n d e p e n d e n t o f e a ch o t h e r ; i n o t h e r w o r d s ,

P ( w o + l l w o , h , t , l , a ) = P ( w o + l l h , t , l , a } . T h u s ,

w h e r e

n

P ( f f ' l h , t , l , a ) = 1 I P ( w i l h , t , l , a )i = l

P ( w d h , t , l , a ) = P ( w i l t ) if i # a

a n d

P ( w d h , t , l , a ) = P ( w o l h . t , l ) if i = a

T h e n w e ha v e ,

P ( f f ' l h , t , l , a ) = P ( w t l t ) . . . P ( w o l h , t , l ) . . . P ( w , l t )

T o g e t t h e p r o b a b i l i t y f o r e a c h c a n d i d a t e , w e

d i v i d e t h e a b o v e p r o d u c t b y :

f ( I ~ l h , t , l , a )

P ( w l l t ) . . . P ( w o l h , t , l ) . . . P ( w , l t JOC

e ( w ~ l t ) . . . P ( w ~ l t ) . . . P ( w , l t)

P ( w ~ l h , t , t )

P ( w ° l t )

N o w w e a r r i v e a t t h e f i n a l e q u a t i o n f o r c o m p u t -

i n g t h e p r o b a b i l i t y o f e a c h p r o p o s e d a n t e c e d e n t :

P(A (p) = W o) (S)

P { d H I a ) P ( p l w . ) P ~ l ) p ( a l m . )

W e o b t a i n P ( d H [ a ) b y r u n n i n g t h e H o b b s a l -

g o r i t h m o n t h e t r a i n in g d a t a . S i n c e t h e t r ai n -

i n g c o r p u s i s t a w e d w i t h r e fe r en c e i n f o rm a -

t i o n , t h e p r o b a b i l i t y P ( p l W o ) i s e a s i l y o b t a i n e d .

I n b u i l d i n g a s t a t i s t i c a l p a r s e r f o r t h e P e n n

T r e e - b a n k v a r i o u s s t a t L s t i c s h a v e b e e n c o l l e c t e d

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 4/10

( C h a r n i a k , 1 9 9 7 ), t w o o f w h i c h a r e P(w ~lh , t , l)

and P(w ~ l t , l ) . T o a v o i d t h e s p a r s e - d a t a p r o b -

l e m , t h e h e a d s h a re c l u st e r e d a c c o r d i n g t o h o w

t h e y b e h a v e i n P(w ~lh, t , l ). T h e p r o b a b i l i t y o f

w e is t h e n c o m p u t e d o n t h e b a s i s o f h 's c l u s -

t e r c ( h ) . O u r c o r p u s a ls o c o n t a i n s r e fe r e w t s '

r e p e t i t i o n i n f o r m a t i o n , f r o m w h i c h w e c a n d i -

r e c tl y c o m p u t e P(alrna ) . T h e f o u r c o m p o n e n t s

i n e q u a t i o n ( 8 ) c a n b e e s t i m a t e d i n a r e a s o n -

a b le f as h io n . T h e s y s t e m c o m p u t e s t h i s p r o d u c t

a n d r e t u r n s t h e a n t e c e d e n t t0o f o r a p r o n o u n p

t h a t m a x i m i z e s t h i s p ro b a b il i ty . M o r e fo r m a l l y ,

w e w a n t t h e p r o g r a m t o r e tu r n o u r a n t e c e d e n t

f u n c t i o n F ( p ) , w h e r e

F(p)

= arg maax P( A( p) = alp, h, 1~, t , l, sp, d: M )

= a r g m a x P ( d H [ a ) P ( p l w a )112a

e(walh, t,t) e(almo )P(wol t , t )

3 T h e I m p l e m e n t a t i o n

W e u s e a s m a l l p o r t i o n o f th e P e n n W a l l S t r e e t

J o u r n a l T r e e - b a n k a s o u r t r a in i n g c o r p u s . F r o m

t h i s d a t a , w e c o l l e c t t h e t h r e e s t a t i s t i c s d e t a i l e d

h a t h e f o l l o w i n g s u b s e c t i o n s .

3 .0 .1 T h e H o b b s a l g o r i t h m

T h e H o b b s a l g o r i t h m m a k e s a f e w a s s u m p t i o n s

a b o u t t h e s y n t a c t i c t r e e s u p o n w h i c h i t o p e r a t e s

t h a t a r e n o t s a ti s fi e d b y t h e t r e e - b a n k t r e e s t h a t

f o r m t h e s u b s t r a t e f o r o u r a l g o r i t h m . M o s t n o -

t a b l y , t h e H o b b s a l g o r i t h m d e p e n d s o n t h e e x -

i s t en c e o f a n / ~ " p a r s e - t r e e n o d e t h a t i s a b s e n t

f r o m t h e P e n n T r e e - b a n k t r ee s . W e h a v e i m -

p l e m e n t e d a s l i g h tl y m o d i f i e d v e r si o n o f H o b b s

a l g o r i t h m f o r t h e T r e e - b a n k p a r s e tr e e s . W e

a l s o t r a n s f o r m o u r t r e e s u n d e r c e r t a i n c o n d i -

t i o n s t o m e e t H o b b s ' a s s u m p t i o n s a s m u c h a s

p o s s i b le . W e h a v e n o t , h o w e v e r , b e e n a b l e t o

d u p l i c a t e e x a c t l y t h e s y n t a c t i c s t r u c t u r e s a s -

s u m e d b y H o b b s.

O n c e w e h a v e t h e t r e e s i n t h e p r o p e r f o r m

( t o t h e d e g r e e t h i s i s p o s s i b l e ) w e r u n H o b b s '

a l g o r i t h m r e p e a t e d l y f o r e a c h p r o n o u n u n t i l it

h a s p r o p o s e d n ( = 1 5 i n o u r e x p e r i m e n t ) c a n -

d i d a t e s . T h e i t h c a n d i d a t e is r e g a r d e d a s o c -

c u r r i n g a t " H o b b s d i s t a n c e " dH = i . T h e n t h e

p r o b a b i l i t y P(dH = i la ) i s s i m p l y :

P (d u -= i la)

164

I correct antecedent a t H obbs d is tance i i

[ correct antec edents 1

W e u s e [ z [ t o d e n o t e t h e n u m b e r o f t im e s z i s

o b s e r v e d i n o u r t r a i n i n g s e t .

3 .1 T h e g e n d e r / a n i m a t i c i t y s t a t i s t i c s

A f t e r w e h a v e id e n t if i ed t h e c o r r e c t a n t e c e d e n t s

i t i s a s i m p l e c o u n t i n g p r o c e d u r e t o c o m p u t e

P(p[wa) w h e r e w a is in t h e c o r r e c t a n t e c e d e n t

f o r t h e p r o n o u n p ( N o t e t h e p r o n o u n s a r e

g r o u p e d b y t h e i r g e n d e r ) :

[ wa in the an teceden t fo r p [P ( p l o ) =

W h e n t h e r e a r e m u l t i p l e r e l e v a n t w o r d s i n t h ea n t e c e d e n t w e a p p l y t h e l i k e li h o o d t e s t d e s i g n e d

b y D u n n i n g ( 19 9 3) o n a l l th e w o r d s i n t h e c a n d i -

d a t e N P . G i v e n o u r l i m i t e d d a t a , t h e D u n n i n g

t e s t t e l l s w h i c h w o r d i s t h e m o s t i n f o r m a t i v e ,

ca l l i t w i, a n d w e t h e n u s e P ( p [ w i ) .

3 . 1 .1 T h e m e n t i o n c o u n t s t a t i s t i c s

T h e r e fe r e nt s r a n g e f ro m b e i n g m e n t i o n e d o n l y

o n c e t o b e gi n m e n t i o n e d 1 20 t i m e s i n t h e t r a i n -

h ag e x a m p l e s . I n s t e a d o f c o m p u t i n g t h e p r o b a -

b U i ty fo r e a c h o n e o f th e m w e g r o u p t h e m i n t o

" b u c k e t s " , s o t h a t rrta iS t h e b u c k e t f o r t h e n u m -

b e r o f t i m e s t h a t a is m e n t i o n e d . W e a l s o o b -

s e r v e t h a t t h e p o s i t i o n o f a p r o n o u n i n a s t o r y

i n f lu e n c e s th e m e n t i o n c o u n t o f i t s r e f e r e n t . I n

o t h e r w o r d s , t h e n e a r e r t h e e n d o f t h e s t o r y a

p r o n o u n o c c u r s , th e m o r e p r o b a b l e i t is t h a t

i t s r e fe r e n t h as b e e n m e n t i o n e d s e v e r a l t i m e s .

W e m e a s u r e p o si ti o n b y t h e s e n t e n c e n u m b e r ,

j . T h e m e t h o d t o c o m p u t e t h i s p r o b a b i li t y is :

[ a is ante ceden t, rna, j I

P(a lm ~ , j ) = I m s , j l

( W e o m i t t e d j f r o m e q u a t i o n s ( 1- 7 ) to r e d u c et h e n o t a t i o n a l l o a d . )

3 .2 R e s o l v i n g P r o n o u n s

A f t e r c o l le c ti n g t h e s t a t is t i c s o n t h e t r a i n i n g e x -

a n ap l es , w e r u n t h e p r o g r a m o n t h e t e s t d a t a .

F o r a n y p r o n o u n w e co l l ec t n ( = 1 5 i n t h e e x -

p e r i m e n t ) c a n d i d a t e a n t e c e d e n t s p r o p o s e d b y

H o b b s ' a l g o r i th m . I t i s q u i t e p o s s ib l e t h a t a

w o r d a p p e a rs i n t h e t e s t d a t a t h a t t h e p r o g r a m

n e v e r s a w in t h e t r a i n i n g d a t a a n d l o w w h i c h i t

h e n c e h a s n o P ( p l w o ) p r o b a b i l i t y . I n t h i s c a s e

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 5/10

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 6/10

e r r o n e o u s c a n d i d a t e s . S p a r s e d a t a a l s o c a u s e s

a p r o b l e m i n t h i s s t a t i s t i c . C o n s e q u e n t l y , w e

o b s e r v e a r e l a t i v e l y s m a l l e n h a n c e m e n t t o t h e

s y s t e m .T h e m e n t i o n i n f o r m a t i o n g i v e s t h e s y s ~ e m

s o m e id e a o f t h e s t o r y ' s f o c u s . T h e m o r e f re -

q u e n t l y a n e n t i t y is r e p e a t e d , t h e m o r e l i ke l y i t

i s t o b e t h e t o p i c o f t h e s t o r y a n d t h u s t o b e

a c a n d i d a t e f o r p r o n o m i n a l i z a t i o n . O u r r e s u l t s

s h o w t h a t t h i s i s i n d e e d t h e c a s e . R e f e r e n c e s

b y p r o n o u n s a r e cl o s el y r e l a t e d t o t h e t o p i c o r

t h e c e n t e r o f t h e d is c o u r se . N P r e p e t i t i o n i s

o n e s i m p l e w a y o f a p p r o x i m a t e l y i d e n t if y i n g t h e

t o p i c . T h e m o r e a c c u r a t e l y t h e t o p i c o f a s e g -

m e n t c a n b e i d e n t i f i e d , t h e h i g h e r t h e s u c c e s s

r a t e w e e x p e c t a n a n a p h o r a r e s o l u t i o n s y s t e m

c a n a c h i e v e .

5 U n s u p e r v i s e d L e a r n i n g o f G e n d e r

I n f o r m a t i o n

T h e i m p o r t a n c e o f g e n d er i n f o r m a t i o n a s re -

v e a l e d i n t h e p r e v i o us e x p e r i m e n t s c a u s e d u s t o

c o n s id e r a u t o m a t i c m e t h o d s f o r e s t im a t i n g t h e

p r o b a b i l i t y t h a t n o u n s o c c u r r i n g in a l a r g e c o r -

p u s o f E n g l i s h t e x t d e o n o t e i n a n i m a t e , m a s c u -

l in e o r f e m i n i n e t h i n g s. T h e m e t h o d d e s c r i b e d

h e r e is b as e d o n s i m p l y c o u n t i n g c o - o c c u r r e n c e s

o f p r o n o u n s a n d n o u n p h r a s e s , a n d t h u s c a ne m p l o y a n y m e t h o d o f a n a l y si s o f t h e t e x t

s t r e a m t h a t r e s u l t s i n r e f e r e n t / p r o n o u n p a i r s

( c f. ( H a t z i v a s s il o g l o u a n d M c K e o w n , 1 9 9 7 )

f o r a n o t h e r a p p l i c a t i o n i n w h i c h n o e x p l ic i t

i n d i c a t o r s a r e a v a il a bl e i n t h e s t r e a m ) . W e

p r e s e n t t w o v e r y si m p l e m e t h o d s f o r f i n d i n g

r e f e r e n t / p r o n o u n p a i r s , a n d a l so g i v e a n a p p l i -

c a t i o n o f a s a l ie n c e s t a t i s t ic t h a t c a n i n d i c a t e

h o w c o n f i d e n t w e s h o u l d b e a b o u t t h e p r e d i c -

t i o n s t h e m e t h o d m a k e s . F o l l o w i n g t h is , w e

s h o w t h e r e s u l ts o f a p p l y i n g t h i s m e t h o d t o t h e

2 1 - m i l l i o n - w o r d 1 9 8 7 W a l l S t r e e t J o u r n a l c o r -p u s u s i n g t w o d i ff e r en t p r o n o u n r e f e r e n c e s t r a t e -

g i es o f v a r y i n g s o p h i st i c a ti o n , a n d e v a l u a t e t h e i r

p e r f o r m a n c e u s i n g h o n o r i f ic s a s r e l i ab l e g e n d e r

i n d i c a t o r s .

T h e m e t h o d is a v e r y s i m p l e m e c h a n i s m

f o r h a r v e s t i n g t h e k i n d o f g e n d e r i n f o r m a t i o n

p r e s e n t i n d i s c o u r s e f r a g m e n t s l i k e " K i m s l e p t .

S h e s l e p t f o r a l o ng t i m e . " E v e n i f K i m ' s g e n d e r

w a s u n k n o w n b e f o r e s e e i n g t h e f i r s t s e n t e n c e ,

a f t e r t h e s e c o n d s e n t e n c e , i t i s k n o w n .

T h e p r o b a b i l i t y t h a t a r e f e r e n t i s i n a p a r t i c -

166

u l a r g e n d e r c l as s i s j u s t t h e r e l a t i v e f r e q u e n c y

w i t h w h i c h t h a t r e f e r e n t i s r e f e r r e d t o b y a p r o -

n o u n p t h a t i s p a r t o f t h a t g e n d e r c la s s . T h a t i s,

t h e p r o b a b i l it y o f a r e f e r e n t r e f b e i n g i n g e n d e r

c lass gc~ is

P ( r e / E g ci )

= I r e fs t o r e f w i t h p e gci I (9 )

E l r ef s t o r e / w i t h p E gc j I

J

I n t h i s w o r k w e h a v e c o n s i d e r e d o n l y t h r e e

g e n d e r c l a ss e s , m a s c u l i n e , f e m i n i n e a n d i n a n i -

m a t e , w h i c h a r e i n d i c a t e d b y t h e i r t y p i c a l p r o -

n o u n s , H E , S H E , a n d I T . H o w e v e r , a v a r i e t y o f

p r o n o u n s i n d i c a t e t h e s a m e c l as s : P l u r a l p r o -

pr onoun ge nde r c l a s s

h e , h i m s e l f ,h i m , h i s H E

s h e , h e r s el f , h e r , h e r s S H E

i t , i t se l f , i t s IT

n o u n s li ke " t h e y " a n d " u s " r e v e a l n o g e n d e r in -

f o r m a t i o n a b o u t t h e i r r e f e r e n t a n d c o n s e q u e n t l y

a r e n ' t u s e fu l , a l t h o u g h t h i s m i g h t b e a w a y t o

l e a rn p lu r a l i za t i o n i n a n u n s u p e r v i s e d m a n n e r .

I n o r d e r t o g a t h e r s t a t i s t i c s o n t h e g e n d e r o f

r e f e re n t s i n a co r p u s , t h e r e m u s t b e s o m e w a y

o f i d e n t if y i n g t h e r e f e re n t s . I n a t t e m p t i n g t ob . o o t s t r a p l e x i c a l i n f o r m a t i o n a b o u t r e f e r e n t s '

g e n d e r , w e c o n s i d e r t w o s t r a t e g i e s , b o t h c o m -

p l e te l y b l i n d t o a n y k i n d o f s e m a n t i c s .

O n e o f th e m o s t n a i ve p r o n o u n r e fe r e nc e

s t r a t eg i e s i s t h e " p r e v i o u s n o u n " h e u r i s t ic . O n

t h e i n t u i t i o n p r o n o u n s c l o s e ly f ol l o w t h e i r r e f e r-

e n ts , t h i s h e u r i s t i c s i m p l y k e e p s t r a c k o f t h e l a s t

n o u n se e n a n d s u b m i t s t h a t n o u n a s t h e r ef e r-

e n t o f a n y p r o n o u n s f o l lo w i n g . T h i s s t r a t e g y i s

c e r t a i n l y s i m p l e - m i n d e d b u t , a s n o t e d e a r l i e r , i t

a c h ie v e s a n a c c u r a c y o f 4 3 % .

I n t h e p r e s e n t s y s t e m , a s t a t i s t i c a l p a r s e r i su s e d ( s e e ( C h a r n i a k , 1 9 9 7 ) ) s i m p l y a s a t a g -

g e r . T h i s a p p a r e n t p a r s e r o v e r k i l l i s a c o n t r o l

t o e n s u r e t h a t t h e p a r t - o f - s p e e c h t a g s a s s i g n e d

t o w o r d s a r e t h e s a m e w h e n w e u s e t h e p r e v i -

o u s n o u n h e u r i s t i c a n d t h e H o b b s a l g o r i t h m , t o

w h i ch w e w i sh t o c o m p a r e t h e p r e v i o u s n o u n

m e t h o d . I n f a c t , t h e o n l y p a r t- o f - s p e e c h t a g s

n e c e s s a r y a r e th o s e i n d i c a t i n g n o u n s a n d p r o -

n o u n s .

O b v i o u s l y a m u c h s u p e r i o r s t r a t e g y w o u l d

b e t o a p p l y t h e a n a p h o r a - r e s o l u t i o n s t r a t e g y

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 7/10

f r o m p r e v i o u s s e c t i o n s t o f i n d i n g p u t a t i v e r e f -

e r e n t s . H o w e v e r , w e c h o s e t o u s e o n l y t h e

H o b b s d i s ta n c e p o r t i o n t h e r e o f . W e d o n o t

u s e t h e " m e n t i o n " p r o b a b i l it i e s P(a lm a) , a st h e y a r e n o t g iv e n i n t h e u n m a r k e d t e x t . N o r

d o w e u s e t h e g e n d e r / a n i m i t i c i t y i n f o r m a t i o n

g a t h e r e d f r o m t h e m u c h s m a l l e r h a n d - m a r k e d

t e x t , b o t h b e c a u s e w e w e r e i n t e r e s t e d in s e e i ng

w h a t u n s u p e r v i s e d l e a r n i n g c o u l d a c c o m p l i s h ,

a n d b e c a u s e w e w e r e c o n c e r n e d w i t h i n h e r it -

i n g s t r o n g b i a s es f ro m t h e l i m i t e d h a n d - m a r k e d

d a t a . T h u s o u r s e c o n d m e t h o d o f f i n d in g t h e

p r o n o u n / n o u n c o - o c c u r r e n c e s i s s i m p l y t o p a r se

t h e t e x t an d t h e n a s s u m e t h a t t h e n o u n - p h r a s e

a t H o b b s d i s t a n c e o n e i s t h e a n t e c e d e n t .

G i v e n a p r o n o u n r e s o l u t i o n m e t h o d a n d a c o r-p u s , t h e r e s u l t i s a s e t o f p r o n o u n / r e f e r e n t p a i rs .

B y c o l la t in g b y r e f e re n t a n d a b s t r a c t i n g a w a y

t o t h e g e n d e r c l as s es o f p r o n o u n s , r a t h e r t h a n

i n d i v i d u a l p r o n o u n s , w e h a v e t h e r e l a t i v e f r e -

q u e n c i e s w i t h w h i c h a g i v e n r e f e r e n t i s r e f e r r e d

t o b y p r o n o u n s o f e a c h g e n d e r c la s s . W e w i ll

s a y t h a t t h e g e n d e r c l a s s f o r w h i c h t h i s r e l a t i v e

f r e q u e n c y is t h e h i g h e s t i s t h e g e n d e r c la s s t o

w h i c h t h e r e f e re n t m o s t p r o b a b l y b e l o n g s.

H o w e v e r , a n y s y n t a x - o n l y p r o n o u n r e s o lu t i o n

s t r a t e g y w i ll b e w r o n g s o m e o f t h e t i m e - t h e s e

m e t h o d s k n o w n o t h i n g a b o u t d i sc o u rs e b o u n d -a r ie s , i n t e n t i o n s , o r r e a l- w o r l d k n o w l e d g e . W e

w o u l d li ke t o k n o w , t h e r e f o r e , w h e t h e r t h e p a t -

t e r n o f p r o n o u n r e f e re n c e s t h a t w e o b s e rv e f o r

a g i v en r e f e r e n t is t h e r e s u l t o f o u r s u p p o s e d

" h y p o t h e s i s a b o u t p r o n o u n r e f e re n c e " - t h a t i s,

t h e p r o n o u n r e f e re n c e s t r a t e g y w e h a v e p r ov i -

s i o n a l l y a d o p t e d i n o r d e r t o g a t h e r s t a t i s t i c s -

o r w h e t h e r t h e r e s u lt o f s o m e o t h e r u n i d e n t if i e d

p r oc e s s .

T h i s d e c i s i o n i s m a d e b y r a n k i n g t h e r e f e r -

e n t s b y l o g - li k e l ih o o d r a t i o , t e r m e d s a l i e n c e , f o r

e a c h r e f e r e n t . T h e l ik e l i h o o d r a t i o is a d a p t e df r o m D u n n i n g ( 1 9 9 3 , p a g e 6 6 ) a n d u s e s t h e r a w

f r e q u e n c i e s o f e a c h p r o n o u n c l a ss i n t h e c o r -

p u s a s t h e n u l l h y p o t h e s i s , P r ( g c 0 i ) a s w e l l a s

P r ( r e f E gci) f r o m e q u a t i o n 9 .

s a l i e n c e ( r e / )

= - 2 l og

M a k i n g t h e u n r e a l i s t i c s i m p l i f y i n g a s s u m p t i o n

t h a t r e fe r e n ce s o f o n e g e n d e r c l a s s a r e c o m -

p l e t el y i n d e p e n d e n t o f r e fe r e n c e s f o r a n o t h e r

c l a s se s 1 , t he l i ke l ihood f un c t io n in t h i s c a se i sj u s t t h e p r o d u c t o v e r a ll cl a s s e s o f t h e p r o b a b i l -

i t ie s o f e ac h c l a s s o f r e f e r e n c e t o t h e p o w e r o f

t h e n u m b e r o f o b s e r v a t i o n s o f th i s c l a ss .

6 E v a l u a t i o n

W e r a n t h e p r o g r a m o n 2 1 m i l li o n w o r d s o f W a l l

S t r e e t J o u r n a l t e x t . O n e c a n j u d g e t h e p r o -

g r a m i n f o r m a l l y b y s i m p l y e x a m i n i n g t h e r e -

s u l t s a n d d e t e r m i n i n g i f t h e p r o g r a m ' s g e n d e r

d e c i s io n s a r e c o r r e c t ( o c c a s i o n a l l y l o o k i n g a t t h e

t e x t f o r d i f fi c u l t c a s e s ) . F i g u r e 1 s h o w s t h e 4 3

n o u n p h r a s e s w i t h t h e h i g h e s t s a l i e n c e f i g u r e s

( r u n u s in g t h e H o b b s a l g o r i t h m ) . A n e x a m i n a -

t i o n o f t h e s e s h o w t h a t a ll b u t t h r e e a r e c o r r e c t .

( T h e t h r e e m i s t a k e s a r e " h u s b a n d , " " w if e, " a n d

" y e a r s . " W e r e t u r n t o t h e s i g n i f i c an c e o f t h e s e

m i s t a k e s l a t e r . )

A s a m e a s u r e o f t h e u t i l i t y o f t h e s e r e su l t s , w e

a l s o r a n o u r p r o n o u n - a n a p h o r a p r o g r a m w i t h

t h e s e s t a t i s t i c s a d d e d . T h i s a c h i e v e d a n a c c u -

r a c y r a t e o f 8 4 . 2 % . T h i s i s o n l y a s m a l l i m p r o v e -

m e n t o v er w h a t w a s a c h i e v e d w i t h o u t t h e d a t a .

W e b e l ie v e , h o w e v e r , t h a t t h e r e a r e w a y s t o i m -

p r o v e t h e a c c u r a c y o f t h e l e a r n i n g m e t h o d a n d

t h u s i n c r e as e it s i n f lu e n c e o n p r o n o u n a n a p h o r a

r e s o l u t i o n .

F i n a l l y w e a t t e m p t e d a fu l l y a u t o m a t i c d i -

r e c t t es t o f t h e a c c u r a c y o f b o t h p r o n o u n m e t h -

o d s f o r g e n d e r d e t e r m i n a t i o n . T o t h a t e n d , w e

d e v i s e d a m o r e o b j e c t i v e t e s t , u s e f u l o n l y f o r

s c o r in g t h e s u b s e t o f r e f e r e n t s t h a t a r e n a m e s

o f p e o p le . I n p a r t i c u l a r , w e a s s u m e t h a t a n y

n o u n - p h r a s e w i t h t h e h o n o r if i c s " M r . " . " M r s . "

o r " M s . " m a y b e c o n f i d e n t l y a s s ig n e d t o g e n d e r

c l a ss e s H E , S H E , a n d S H E , r e s p e c t i v e l y . T h u s w e

c o m p u t e p r e c i s i o n a s f o l l o w s :

p r e c i s ion =

[ r a t t r i b . a s H E A M r . E r l +

[ r a t t r i b . a s S H E A M r s. or M s. E r [

I M r . , M r s . , o r M s . E r ]

H e r e r v a r i e s o v e r r e f e r e n t t y p e s , n o t t o k e n s .

T h e p r e ci s io n s c o r e c o m p u t e d o v e r a ll p h r a s e s

c o n t a i n i n g a n y o f t h e t a r g e t h o n o r if i cs a r e 6 6 .0 %

l I n ef f e ct , t h i s is t h e s a m e a s a d m i t t i n g t h a t a r e f -

e r e n t c a n b e i n d i f f e r e n t g e n d e r c l a s s e s a c r o s s d i f f e re n t

o b s e r v a t i o n s .

167

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 8/10

W o r d co u n t ( sa l i en ce ) p ( h e ) p ( sh e ) p ( i t )

C O M P A N Y 7 0 52 ( 1 62 9 .39 ) 0 .0 7 64 0 .0 06 0 0 .9 1 7 4W OM AN 2 50 ( 36 8 .2 67 ) 0 .1 72 0 .7 0 8 0 .1 2

P R E S I D E N T 9 3 :[ (3 5 6 .5 3 9 ) 0 .8 20 6 0 .0 1 3 9 0 . 1 65 4

G R O U P 1 0 9 6 (2 8 7 .31 9 ) 0 .0 6 02 0 .0 0 54 0 .9 343

M R . R E A GA N 53 , t (2 7 0 .8) . 88 2 02 2 0 .0 0 37 0 .1 1 42

M AN 441 ( 2 0 2 .1 0 2 ) 0 .8 480 0 .0 38 5 0 .1 1 33

P R E S I D E N T R E A G A N 4 5 5 (1 9 4 .9 2 8) 0 .8 43 9 0 .0 04 3 0 .1 5 1 6

G O V E R N M E N T 1 2 20 (1 9 4 .1 8 7) 0 .1 17 2 0 . 01 2 2 0 . 87 0 4

U.S . 969(188 .468) 0 .1021 0 .0041 0 .8937

BA N K 81(5(161 .23) 0 .0955 0 .0073 0 .8970

M O T H E R 1 1 3 ( 1 6 1 . 2 0 4 ) 0 . 3 0 0 8 0 . 6 5 4 8 0 . 0 4 4 2

C O L . N O R T H 2 5 8 ( 1 5 8 . 6 9 2 ) 0 . 9 2 6 3 0 . 0 0 7 7 0 . 0 6 5 8

M O O D Y 38 3 ( 1 52 .40 5 ) 0 .00 7 8 0 .0 0 52 0 .9 8 6 9S P O K E S W O M A N 1 3 9 (1 4 5 .6 2 7) 0 .1 22 3 0 . 5 82 7 0 . 29 4 9

M R S . A QU I N O 7 3 ( 1 42 .2 2 3 ) 0 .0 9 58 0 .8 356 0 .0 6 8 4

M R S . T H A T C H E R 6 8 (1 2 8 .3 0 6 ) 0 .0 7 3 5 0 .8 2 3 5 0 . 1 02 9G M 513(119 .664 ) 0 .0779 0 .0038 0 .9181

P L A N 51 4 ( 1 1 1 .1 34 ) 0 .0 856 0 .0 0 58 0 .9 0 8 5

M R . G O R B A C H E V 2 0 5 (1 0 8 .7 7 6 ) 0 .8 9 26 0 . 00 4 8 0 . 10 2 4

J U D G E B O R K 2 1 2 (1 0 8 .7 4 6 ) 0 .8 8 20 0 0 . 11 7 9

HU S B A ND 9 1 ( 1 0 7.438 ) 0 .36 2 6 0 .57 1 4 0 .0 6 59

JA P A N 450 ( 1 0 0 .7 2 7 ) 0 .0 755 0 .0 11 1 0 .9 1 33

A G E N C Y 4 7 6 (9 7 .4 0 1 6 ) 0 .0 8 4 0 0 . 01 4 7 0 . 90 1 2

W I F E 1 53 ( 9 3 .748 5 ) 0 .6 1 43 0 .2 8 75 0 .0 9 8 0

D O L L A R 6 2 1 ( 9 0.8 9 6 3 ) 0 .1 304 0 .0 0 9 6 0 .8 59 9S T A N D A R D P O O R 2 0 0( 90 .1 06 2 ) 0 0 1FA TH E R 1 46 ( 8 9 .41 7 8 - ) 0 .8 0 82 0 .1 438 0 .0 47 9

U TI L I T Y 2 42 ( 8 7 .1 8 2 1 ) 0 .0 247 0 0 .9 7 52

M R . T R U M P 1 2 9( 86 .5 3 4 5 ) 0 .9 4 5 7 0 . 00 7 7 0 .0 4 6 5

M R . B A K E R 1 8 7 (8 4 .2 7 96 ) 0 .8 556 0 .0 0 53 0 .1 39 0

I B M 31 6 ( 8 2 .436 1 ) 0 .0 69 6 0 0 .9 30 3

M A K E R 2 2 4 (8 2 .2 5 2 ) 0 .0 2 23 0 0 . 97 7 6

YE AR S 1 0 55 ( 8 2 .1 6 32 ) 0 .52 98 0 .0 81 5 0 .38 8 6

M R . M E E S E 1 6 6 (8 2 .10 0 7 ) 0 .87 34 0 0 .1 2 65

B R AZ I L 2 8 5 ( 7 9 .7 31 1 ) 0 .0 596 0 0 .9 40 3

S P O K E S M A N 6 6 5 (7 8 .3 4 4 1 ) 0 .6 0 75 0 . 00 4 5 0 . 3 87 9

M R . S I M ON 1 0 5 ( 7 2.6 446 ) 0 .9 523 0 0 .0 47 6DAUGHTER 47(71.3863) 0.2340 0 .7021 0.0638

FO R D 2 49 ( 7 1 .36 0 3 ) 0 .056 2 0 0 .9 437

M R . G R E E N S P A N 1 2 0( 68 .7 8 07 ) 0 .9 0 83 0 0 . 09 1 6

AT & T 1 9 8 (6 7 .96 6 8 ) 0 .02 52 0 .0 0 50 0 .9 6 9 6

M I N I S T E R 1 2 5 (6 7 .7 4 7 5 ) 0 .8 6 4 0 . 06 4 0 . 07 2

JU D G E - 2 39 ( 6 7 .58 9 9 ) 0 .7 154 0 .0 8 36 0 .2 0 0 8

F i g u r e 1 : T o p 4 3 n o u n p h r a s e s a c c o r d i n g t o s a l i e n c e

168

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 9/10

o

o~

o

l . O -

0 . 8 -

0 . 5 -

U

• • 0 •

0

• " ' ' ' " 1 " " " ' ' ' " |

10 100

N u m b e r o f r e fe r e n c es

O •

F i g u r e 2 : P r e c i s i o n u s i n g h o n o r i fi c s c o r i n g

s c h e m e w i t h s y n t a c t i c H o b b s a l g o r i t h m

f or t h e l a s t - n o u n m e t h o d a n d 7 0 . 3 % f o r t h e

H o b b s m e t h o d .

T h e r e a r e s e v e r a l t h i n g s t o n o t e a b o u t t h e s e

r e s u l t s . F i r s t , a s o n e m i g h t e x p e c t g i v e n t h e a l -

r e a d y n o t e d s u p e r i o r p e r f o r m a n c e o f t h e H o b b s

s c h e m e o v e r l a s t -n o u n , H o b b s a ls o p e r f o r m s b e t -

t e r a t d e t e r m i n i n g g e n d e r . S e c o n d l y , a t f i r st

g l a n c e , th e 7 0 .3 % a c c u r a c y o f t h e H o b b s m e t h o d

i s d i s a p p o i n t i n g , o n l y s l i g h t l y s u p e r i o r t o t h e

6 5 .3 % a c c u r a c y o f H o b b s a t f i n d i n g c o r r e c t r e f-

e r e n ts . I t m i g h t h a v e b e e n h o p e d t h a t t h e

s t a t is t i cs w o u l d m a k e t h i n g s c o n s i d e r a b l y m o r e

a c c u r a t e .

I n f a c t , t h e s t a ti s ti c s d o m a k e t h i n gs c o n s id -

e r a b l y m o r e a c cu r at e . F i g u r e 2 s h o w s a v e r a g ea c c u r a c y a s a f u n c ti o n o f n u m b e r o f r e f e re n c e s

f o r a g i v e n r e f e r e n t . I t c a n b e s e e n t h a t t h e r e i s

a s ig n if i ca n t i m p r o v e m e n t w i t h i n c r ea s e d r ef er -

e n t c o u n t . T h e r e a s o n t h a t t h e a v e r a g e o v e r al l

r e f e r en t s i s s o l o w i s t h a t t h e c o u n t s o n r e f e r e n t s

o b e y Z i p f 's l a w , s o t h a t t h e m o d e ~ f t h e di st ri -

b u t i o n o n c o u n t s is o n e . T h u s t h e 7 0 . 3 % o v e ra l l

a c c u r a c y i s a m i x o f r e la t i ve l y h i g h a c c u r a c y f o r

r e fe r en t s w i t h c o u n t s g r e a te r t h a n o n e , a n d r el -

a t iv e ly l o w a c c u r a c y f o r r e f e re n t s w i t h c o u n t s o f

e x a c t l y o n e .

7 P r e v i o u s W o r k

T h e l i t e r a t u r e o n p r o n o u n a n a p h o r a i s t o o e x -

t e n s iv e t o s u m m a r i z e , s o w e c o n c e n t r a t e h e r e o n

c o r p u s - b a s e d a n a p h o r a r e s e a r c h .A o n e a n d B e n n e t t ( 19 9 6) p r e s e n t an a p -

p r o a c h to a n a u t o m a t i c a l ly t r a i n a b le a n a p h o r a

r e s o lu t i o n s y s t e m . T h e y u s e J a p a n e s e n e w s p a -

p e r a r t i c l e s t a g g e d w i t h d i s c o u r s e i n f o r m a t i o n

a s t r a i n i n g e x a m p l e s f o r a m a c h i n e - l e a r n i n g a l -

g o r i t h m w h i c h i s t h e C 4 . 5 d e c i s i o n - t r e e a l g o -

r i t h m b y Q u i n l a n ( 1 99 3 ). T h e y t r a in t h e i r d e -

c i s i o n t r e e u s i n g (anaphora, antecedent) p a i r s

t o g e t h e r w i t h a s e t o f f e a t u r e v e c t o rs . A m o n g

t h e 6 6 f e a t u r e s a r e l e x i c a l , s y n t a c t i c , s e m a n -

t ic , a n d p o s i ti o n a l f e a t u r e s . T h e i r M a c h i n e

L e a r n i n g - b a s e d R e s o l v e r ( M L R ) i s t r a i n e d u s -i n g d e c i s io n t r e e s w i t h 1 9 7 1 a n a p h o r a s ( e x c l u d -

i n g t h o s e r e f e r ri n g t o m u l t ip l e d i s c o n t i n u o u s a n -

t e c e d e n t s ) a n d t h e y r e p o r t a n a v e r a g e s u c c es s

r a t e o f 7 4 . 8 % .

M i t k o v ( 1 9 9 7 ) d e s c r i b e s a n a p p r o a c h t h a t

u s e s a s e t o f f a c t o r s a s c o n s t r a i n t s a n d p r e f e r -

e n c e s. T h e c o n s t r a i n t s r u l e o u t i m p l a u si b le c a n -

d i d a t e s a n d t h e p r e f e r en c e s e m p h a s i z e t h e s e l ec -

t i o n o f t h e m o s t l ik e ly a n t e c e d e n t . T h e s y s t e m

i s n o t e n t i r e l y " s t a t i s t i c a l " i n t h a t i t c o n s i s t s o f

v a r i ou s t y p e s o f r u l e -b a s e d k n o w l e d g e - - s y n -

t ac ti c, s e m a n t i c , d o m a i n , d i s c ou r s e , a n d h e u r is -t ic . A s t at i s ti c a l a p p r o a c h i s p r e s e n t i n t h e d i s -

c o u r s e m o d u l e o n l y w h e r e i t i s u s e d t o d e te r -

m i n e t h e pr o ba b il i t y t h a t a n o u n ( v e r b ) p h r a s e

i s t h e c e n te r o f a s e nt e n c e . T h e s y s t e m a l s o c o n -

t a in s d o m a i n k n o w l e d g e i n cl u di n g t he d o m a i n

c o n c e p t s , s p e c i f ic l i st o f s u b j e c t s a n d v e r b s , a n d

t o p i c h e a d i ng s . T h e e v a l u at i o n w a s c o n d u c t e d

o n 1 3 3 p a r a g r a p h s o f a n n o t a t e d C o m p u t e r S ci -

e n c e t e x t . T h e r e s u l t s s h o w a n a c c u r a c y o f 8 3 %

f o r t h e 5 1 2 o c c u r r e n c e s o f it.

L a p p i n a n d L e a s s ( 1 9 9 4 ) r e p o r t o n a ( e ss e n -

t i al l y n o n - s t a t i s t i c a l ) a p p r o a c h t h a t r e l i es o ns a l i en c e m e a s u r e s d e r i v e d f r o m s y n t ac t i c s t r u c-

t u r e a n d a d y n a m i c m o d e l o f a t te n ti o n a l s t at e .

T h e s y s t e m e m p l o y s v a ri o us c o n s tr a in t s f or N P -

p r o n o u n n o n - c o r e f e r e n c e w i t h i n a s e n t en c e . I t

a l s o u s e s p e r s on , n u m b e r , a n d g e n d e r f e at u r e s

f o r r u l i n g o u t a n a p h o r i c d e p e n d e n c e o f a p r o-

n o u n o n a n N P . T h e a l g o r i t h m h a s a s op hi st i-

c a t e d m e c h a n i s m f o r a s s i gn i n g v a l u es t o s e v e ra l

s a li e n c e p a r a m e t e r s a n d f o r c o m p u t i n g g l ob a l

s a l i en c e v a l u es . A b l i n d t e s t w a s c o n d u c t e d

o n m a n u a l t e x t c o n t a i n i n g 3 6 0 p r o n o u n o c c u r -

169

8/6/2019 A Statistical Approach to Anaphora Resolution

http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 10/10

r e nc e s ; t he a l go r i t hm suc c e s s fu l l y i de n t i f i e d t he

a n t e c e d e n t o f t h e p r o n o u n i n 8 6 % o f t h e s e p r o -

n o u n o c c u rr e n ce s . T h e a d d i t i o n o f a m o d u l e

t h a t c o n t r i b u t e s s ta t i s t ic a l l y m e a s u r e d l e x Jc al

p r e f e r e n ce s t o t h e r a n g e o f f a c t o r s t h e a l g o r i t h m

c o n s id e r s i m p r o v e d t h e p e r f o r m a n c e b y 2 % .

8 C o n c l u s t i o n a n d F u t u r e R e s e a r c h

W e h a v e p r e s e n t e d a s t a t i s t i c a l m e t h o d f o r

p r o n o m i n a l a n a p h o r a t h a t a c h ie v e s a n a c c u r a c y

o f 8 4 .2 % . T h e m a i n a d v a n t a g e o f th e m e t h o d i s

i t s es s e n ti a l s i m p l ic i ty . E x c e p t f o r i m p l e m e n t i n g

t h e H o b b s r e f e r e n t - o r d e r i n g a l g o r i t h m , a l l o t h e r

s y s t e m k n o w l e d g e i s i m b e d d e d i n t a b l e s g iv i n g

t h e v a r i o u s c o m p o n e n t p r o b a b i l i t i e s u s e d in t h ep r o b a b i l i t y m o d e l .

W e b e l ie v e t h a t t h i s s im p l i c i t y o f m e t h o d w i l l

t r a n s l a t e i n t o c o m p a r a t i v e s i m p l i c i t y a s w e i m -

p r o v e t h e m e t h o d . S i n c e t h e r e s e a r c h d e s c r i b e d

h e r e in w e h a v e t h o u g h t o f o t h e r i n f lu e n c e s o n

a n a p h o r a r e s o l u t i o n a n d t h e i r s t a t i s t i c a l c o r r e -

l a te s . W e h o p e to i n c l u d e s o m e o f t h e m i n f u t u r e

w o r k .

A l s o , as i n d i c a t e d b y t h e w o r k o n u n s u p e r -

v i s e d le a r n i n g o f g e n d e r i n f o r m a t i o n , t h e r e i s a

g r o w i n g a r s e n a l o f l e a r n i n g t e c h n i q u e s t o b e a p -

p l ie d t o s t a t i s ti c a l p r o b l e m s . C o n s i d e r a g a i n t h et h r e e h i g h - s a li e n c e w o r d s t o w h i c h o u r u n s u p e r -

v i s e d l e a r n i n g p r o g r a m a s s i g n e d i n c o r r e c t g e n -

d e r : " h u s b a n d " , " w i f e ", a n d " y e a r s ." W e s u s -

p e c t t h a t h a d o u r p r o n o u n - a s s ig n m e n t m e t h o d

b e e n a b l e t o u s e t h e t o p i c i n f o r m a t i o n u s e d i n

t h e c o m p l e t e m e t h o d , t h e s e m i g h t w e l l h a v e

b e e n d e c i d e d c o r r ec t ly . T h a t i s, w e s u s p e c t

t h a t " h u s b a n d " , f o r e x a m p l e , w a s d e c i d e d i n -

c o r r e c t ly b e c a u s e t h e t o p i c o f t h e a r t ic l e w a s t h e

w o m a n , t h e re w a s a m e n t i o n o f h e r " h u s b a n d , "

b u t t h e a r t ic l e k e p t o n t al k i n g a b o u t t h e w o m a n

a n d u s e d th e p r o n o u n " s h e ." W h i l e o u r si m p l ep r o g r a m g o t c o n f u s e d , a p r o g r a m u s i n g b e t t e r

s t a t i s t ic s m i g h t n o t h a v e . T h i s t o o i s a t o p i c f o r

f u t u r e r e s e a r c h .

9 A c k n o w l e d g e m e n t s

T h e a u t h o r s w o u l d l i k e t o t h a n k M a r k J o h n s o n

a n d o t h e r m e m b e r s o f t h e B r o w n N L P g r o u p

f o r m a n y u s e f u l i d ea s a n d N S F a n d O N R f o r

s u p p o r t ( N S F g r a n t s I R I - 9 3 1 9 5 1 6 a n d S B R -

9 7 20 3 68 , O N R g r a n t N 0 0 1 4 - 9 6 - 1 -0 5 4 9 ) .

170

R e f e r e n c e s

C h i n a t s u A o n e a n d S c o t t W i l l i a m B e n n e t t ,

1996. Evaluat ing Autom ated an d Manual Ac-

quisit ion off An aph ora Res olutio n Strategies,p a g e s 3 0 2 - 3 1 5 . S p r i n g e r .

S u s a n E . B r e n n a n , M a r i l y n W a l k e r F r i e d m a n ,

a n d C a r l J . P o l l a r d . 1 9 8 7. A c e n t e r i n g ap -

p r o a c h t o p r o n o u n s . I n P r o c . 25 th AnnualMeet ing o f the A CL, p a g e s 1 5 5 - 1 6 2 . A s s o c i a -

t i o n o f C o m p u t a t i o n a l L i n g u i s t i c s .

E u g e n e C h a r n i a k . 1 9 9 7 . S t a t i s t i c a l p a r s i n g

w i t h a c o n t e x t - fr e e g r a m m a r a n d w o r d s ta t is -

t ic s . In Proceedings of the 1 4th N at ional Con-ference on Art i f icial Intel ligence, M e n l o P a r k ,

C A . A A A I P r e s s / M I T P r e ss .

T e d D u n n i n g . 1 9 93 . A c c u r a t e m e t h o d s f o r t h es t a t i s t i c s o f s u r p r i s e a n d c o i n c i d e n c e . Com-putat ional Linguist ics, 1 9 ( 1 ) , M a r c h .

V a s i l e i o s H a t z i v a s s i l o g l o u a n d K a t h l e e n R .

M c K e o w n . 1 9 97 . P r e d i c t i n g t h e s e m a n t i c o r i-

e n t a t i o n o f a d j e c ti v e s . I n Proc. 35 th AnnualMeet ing of the ACL, p a g e s 1 7 4 - 1 8 1 . A s s o c i a -

t i on o f C o m p u t a t i o n a l L i n g u is t ic s .

J e r r y R . H o b b s . 1 9 76 . P r o n o u n r es o l u ti o n .

T e c h n i c a l R e p o r t 7 6 - 1 , C i t y C o l l e g e , N e w

Y o r k .

S h a l o m L a p p i n a n d H e r b e r t J . L e a ss . 1 99 4 . A n

a l g o r i t h m f o r p r o n o m i n a l a n a p h o r a r e s o l u -t i o n . Computat ional Linguist ics, p a g e s 5 3 5 -

"561.

M i t c h e l l P . M a r c u s , B e a t r i c e S a n t o r i n i , a n d

M a r y A n n M a r c i n k i e w i c z . 1 9 9 3 . B u i l d i n g

a l a r g e a n n o t a t e d c o r p u s o f e n g l is h : t h e

p e n n t r e e b a n k . Computat ional Linguist ics,1 9 : 3 1 3 - 3 3 0 .

R u s l a n M i t k o v . 1 9 97 . F a c t o r s i n a n a p h o r a r es -

o l u t io n : t h e y a r e n o t t h e o n ly t h i n g s t h a t

m a t t e r , a c a s e s t u d y b a s e d o n t w o d i f f e r -

e n t a p p r o a c h e s . I n Proceedings o f the A CL

'g7/E A CL 'g7 W orkshop o n Operat ional Fac-tors in Practical, Robust Anaphora Resolu-tion.

J . R o s s Q u i n l a n . 1 9 9 3 . C~.5 Programs for Ma-chine Learning. M o r g a n K a u f m a n n P u b l i s h -

e rs .