
Page 1: Free construction of a free dictionary of synonyms using computer science Viggo Kann and Magnus Rosell KTH, Stockholm Talk given by Viggo at Amherst College

Free construction of a free dictionary of synonyms

using computer science

Viggo Kann and Magnus Rosell, KTH, Stockholm

Talk given by Viggo at Amherst College November 11, 2006


Examples of English synonyms

Smith: A Dictionary of Synonymous Words in the English Language [1889]

CLASS. Order. Rank. Degree. Classification. Grade.

Webster’s Dictionary of Synonyms [1942]

classify. Alphabetize, pigeonhole, assort, sort. Ana. Order, arrange, systematize, methodize, marshal.


Goals

To construct a Swedish dictionary of synonyms as a list of synonymous pairs

I don’t want to work a lot

I don’t want to pay anyone to work

The resulting list should be free


Ideas

Automatically construct a large set of word pairs that might be synonyms

Use tens of thousands of people, each willing to make a small contribution without payment, to check the word pairs


More ideas

Use the Lexin on-line Swedish-English dictionary web site, which had 9 million lookups each month (now 17 million)

Users visit Lexin to translate words, and are thus probably motivated to help me

Each time a user makes a lookup, give her the opportunity to decide whether two words are synonyms or not


My plan

1. Construct lots of possible synonyms

2. Sort out bad synonym pairs automatically

3. Ask lots of users if the rest of the pairs are good synonyms

4. Analyze the gradings done by the users and decide which pairs to keep


Step 1: Construct lots of possible synonyms

If we have access to a Swedish-English dictionary SE and an English-Swedish dictionary ES, try to translate each word to English and back again to Swedish

{(w,v) : ∃y : y ∈ SE(w) ∧ v ∈ ES(y)} or {(w,v) : ∃y : y ∈ SE(w) ∧ y ∈ SE(v)}

616 000 word pairs were generated
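The two set definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries `se` and `es` are invented toy data, pairs are treated as unordered, and the sketch takes the union of both definitions, which the slide does not specify.

```python
from collections import defaultdict

def candidate_pairs(se, es):
    """Return unordered Swedish word pairs linked through an English word.

    se maps a Swedish word to its English translations,
    es maps an English word to its Swedish translations.
    """
    pairs = set()
    # Round trip: w -> y (Swedish-English), then y -> v (English-Swedish).
    for w, translations in se.items():
        for y in translations:
            for v in es.get(y, ()):
                if v != w:
                    pairs.add(frozenset((w, v)))
    # Shared translation: w and v both have y among their English translations.
    by_english = defaultdict(set)
    for w, translations in se.items():
        for y in translations:
            by_english[y].add(w)
    for words in by_english.values():
        for w in words:
            for v in words:
                if w < v:
                    pairs.add(frozenset((w, v)))
    return pairs

# Toy dictionaries, invented for illustration only.
se = {"klass": {"class", "grade"}, "rang": {"rank", "grade"}, "sort": {"kind"}}
es = {"class": {"klass"}, "grade": {"klass", "årskurs"}, "kind": {"sort", "slag"}}

pairs = candidate_pairs(se, es)
```

Many of the pairs produced this way are not synonyms at all, because one English word often covers several unrelated Swedish senses; that is why the next steps filter the list.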


Step 2: Sort out bad synonym pairs automatically

Use RI (Random Indexing) [Kanerva, Kristoferson, Holst 2000] to measure the distance between words represented in a large vector space

Keep pairs that have small enough distance in the vector space
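A minimal sketch of this filtering step, assuming closeness is measured by cosine similarity (so a small distance means a high cosine). The function names, the toy vectors and the threshold value 0.1 are illustrative, and dropping pairs with a word that has no vector is an assumption of the sketch, not something the talk states.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def filter_pairs(pairs, vectors, threshold):
    """Keep the pairs whose words both have vectors with cosine >= threshold."""
    kept = []
    for w, v in pairs:
        # Assumption: a pair with a word missing from the corpus is dropped.
        if w in vectors and v in vectors:
            if cosine(vectors[w], vectors[v]) >= threshold:
                kept.append((w, v))
    return kept

# Toy vectors, invented for illustration only.
vectors = {"klass": [1, 0, 1], "rang": [1, 0, 0.5], "uppdrag": [-1, 1, 0]}
candidates = [("klass", "rang"), ("klass", "uppdrag"), ("klass", "okänd")]
kept = filter_pairs(candidates, vectors, threshold=0.1)
```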


Random Indexing

Each word w is assigned a random label vector Lw with on the order of a thousand elements

For each word w, construct a context vector Cw by adding the random label vectors of the words appearing in the context of each occurrence of w in a large corpus


Random Indexing settings

Context: 4 words to the left and 4 to the right

Stop words were removed

Dimensionality: 1800

5 corpora from different domains were used, for example newspapers and medical texts
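With these settings, the construction of context vectors might look roughly like the sketch below. Only the window size and the dimensionality come from the slides. The sparse +1/-1 label vectors (a common choice in Random Indexing), the constant `NONZERO`, removing stop words before the window is applied, and ignoring sentence boundaries are all assumptions made for illustration.

```python
import random
from collections import defaultdict

DIM = 1800     # dimensionality, from the slide
WINDOW = 4     # context words on each side, from the slide
NONZERO = 8    # assumed number of +1/-1 entries in each label vector

def label_vector(rng):
    """Sparse random label vector, stored as {index: +1 or -1}."""
    indices = rng.sample(range(DIM), NONZERO)
    return {i: rng.choice((-1, 1)) for i in indices}

def context_vectors(tokens, stop_words, seed=0):
    """Build a context vector for every non-stop word in the token list."""
    rng = random.Random(seed)
    # Assumption: stop words are removed before the window is applied.
    tokens = [t for t in tokens if t not in stop_words]
    labels = {}
    for t in tokens:
        if t not in labels:
            labels[t] = label_vector(rng)
    contexts = defaultdict(lambda: [0] * DIM)
    for pos, w in enumerate(tokens):
        lo = max(0, pos - WINDOW)
        hi = min(len(tokens), pos + WINDOW + 1)
        for j in range(lo, hi):
            if j == pos:
                continue
            # Add the neighbour's label vector to the context vector of w.
            for i, sign in labels[tokens[j]].items():
                contexts[w][i] += sign
    return dict(contexts)

# Toy corpus, invented for illustration only.
tokens = "the class has a high rank and the grade has a high rank".split()
stop_words = {"the", "a", "and", "has"}
vecs = context_vectors(tokens, stop_words)
```

Words that occur in similar contexts accumulate similar sums of label vectors, which is what makes the cosine between two context vectors usable as a rough similarity measure.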


Number of pairs for different cos thresholds (435 000 of 616 000 pairs occurred in corpus)

[Chart: number of pairs (y-axis, 0 to 450 000) for cosine thresholds from -0.05 to 0.4 (x-axis); the threshold 0.1 is marked with an asterisk]


Step 3: Ask lots of users if the rest of the pairs are good synonyms

When a user has sent a word to the Lexin dictionary he receives the translation followed by a question like:

Are 'spread' and 'lengthen' synonyms? Answer using a scale from 0 to 5 where 0 means 'I don’t agree' and 5 means 'I do fully agree', or answer 'I don’t know'


After answering, the user may

grade a new randomly chosen word pair

look up a word in the synonym dictionary

suggest a new synonymous word pair

download the synonym dictionary in XML


Step 4: Analyzing the gradings done by the users

1.2 million gradings were made in less than 2 months

Grading statistics were analyzed on several occasions

Some users sent comments


Keeping the users happy!

Many users said that there were too many bad pairs

Lots of pairs were graded 0 (not at all synonyms) by all users. After some weeks 25 000 such pairs were removed. Later 60 000 more pairs were removed, improving the quality of the remaining pairs considerably.
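A sketch of how such pairs could be found. The data layout (a dict from pair to a list of grades, with `None` standing for "I don't know") and the `min_gradings` parameter are assumptions for illustration; the talk does not say how many gradings were required before a pair was removed.

```python
def unanimous_zero_pairs(gradings, min_gradings=3):
    """Return the pairs that every grading user gave a 0.

    gradings maps a word pair to a list of grades 0..5,
    where None stands for the answer "I don't know".
    """
    bad = []
    for pair, grades in gradings.items():
        numeric = [g for g in grades if g is not None]
        # Assumption: require a few gradings before trusting the verdict.
        if len(numeric) >= min_gradings and all(g == 0 for g in numeric):
            bad.append(pair)
    return bad

# Toy gradings, invented for illustration only.
gradings = {
    ("klass", "uppdrag"): [0, 0, None, 0],
    ("klass", "rang"): [5, 4, 5],
    ("klass", "utbilda"): [0, 0],
}
to_remove = unanimous_zero_pairs(gradings)
```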


User gradings first two months

[Chart: share of gradings (y-axis, 0% to 60%) for each answer: 0, 1, 2, 3, 4, 5, don't know]


More interesting gradings 2006

[Chart: share of gradings (y-axis, 0% to 60%) for each answer: 0, 1, 2, 3, 4, 5, don't know; series: 2005 and 2006]


Distribution of mean gradings of word pairs after two months

[Chart: share of word pairs (y-axis, 0% to 40%) by mean grading from 0 to 5 in steps of 0.5; 217 000 pairs]


Distribution of mean gradings of word pairs 2006

[Chart: share of word pairs (y-axis, 0% to 40%) by mean grading from 0 to 5 in steps of 0.5; series: 2005 and 2006]


Analysis of the pairs graded 0

Distance (cosine) in RI space

[Chart: share of pairs (y-axis, 0% to 50%) by cosine from 0.1 to 0.6; series: pairs graded 0 and all pairs]


Some statistics (November 2006)

2.5 M user gradings done

67 000 pairs (graded ≥ 2) in dictionary

90 000 pairs suggested by users

50 000 unique pairs suggested

14 000 of them have been accepted
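The selection rule behind "graded ≥ 2" can be sketched as follows, assuming it refers to the mean grading of a pair. The data layout, the `min_gradings` value and the choice to ignore "I don't know" answers are assumptions of this sketch; the actual analysis is not described in detail on the slides.

```python
from statistics import mean

def accepted_pairs(gradings, min_mean=2.0, min_gradings=5):
    """Return {pair: mean grade} for pairs that pass both thresholds.

    gradings maps a word pair to a list of grades 0..5,
    where None stands for the answer "I don't know".
    """
    accepted = {}
    for pair, grades in gradings.items():
        numeric = [g for g in grades if g is not None]
        # Assumption: a pair needs several gradings before it can be accepted.
        if len(numeric) < min_gradings:
            continue
        m = mean(numeric)
        if m >= min_mean:
            accepted[pair] = m
    return accepted

# Toy gradings, invented for illustration only.
gradings = {
    ("klass", "rang"): [5, 5, 4, 5, None, 5],
    ("klass", "poäng"): [1, 0, 2, 1, 1],
    ("klass", "typ"): [2, 3],
}
dictionary = accepted_pairs(gradings)
```

Keeping the mean alongside each accepted pair allows the synonyms of a word to be listed from strongest to weakest.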


Example: Synonyms of klass (class)

5: rang (grade), rank (rank), slag (kind)

4: kategori (category), stånd (social class), årskurs (grade)

3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), ordning (order), skikt (layer), sort (sort), standard (standard), stil (style)

2: storleksordning (magnitude), typ (type)

1: poäng (point), stadga (stability)

0: uppdrag (mission), utbilda (educate)


How to prevent abuse?

Many gradings of a word pair are needed before it’s considered to be good

The pair to be graded is randomly picked from a very large list

Word pairs suggested by users are spell checked before they are added to the very large list
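Two of these safeguards are simple enough to sketch. The function names are hypothetical, and checking both words against a known word list is only a stand-in for the spell checker mentioned above, whose details the talk does not give.

```python
import random

def pick_pair(candidates, rng=random):
    """Pick the next pair to grade uniformly at random from the large list,
    so a user cannot choose which pair to influence."""
    return rng.choice(candidates)

def vet_suggestion(w, v, known_words):
    """Return a normalized pair if both words pass the check, else None.

    Assumption: membership in known_words stands in for a real spell checker.
    """
    w, v = w.strip().lower(), v.strip().lower()
    if w == v or w not in known_words or v not in known_words:
        return None
    return (w, v)

# Toy data, invented for illustration only.
known_words = {"klass", "rang", "sort", "slag"}
candidates = [("klass", "rang"), ("sort", "slag")]
chosen = pick_pair(candidates, random.Random(1))
ok = vet_suggestion(" Klass", "rang", known_words)
rejected = vet_suggestion("klas", "rang", known_words)
```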


People's definition of synonymy

Exact meaning of 'synonym' wasn’t defined

Users will grade using their intuitive understanding of the concept of synonymy and the words in the pair

The produced dictionary will use the people's own definition of synonymy. Hopefully this is exactly what they want!


The people’s synonym dictionary on the web

http://lexin.nada.kth.se/cgi-bin/synlex


Lessons learned

The list of suggested synonyms should be huge

Try to improve the quality of the list automatically as much as possible. Random indexing is useful for this; also try tagging and using other dictionaries

Use the 0 answers early to remove bad pairs that only irritate the users