Free construction of a free dictionary of synonyms
using computer science
Viggo Kann and Magnus Rosell, KTH, Stockholm
Talk given by Viggo at Amherst College November 11, 2006
Examples of English synonyms
Smith: A Dictionary of Synonymous Words in the English Language [1889]
CLASS. Order. Rank. Degree. Classification. Grade.
Webster’s Dictionary of Synonyms [1942]
classify. Alphabetize, pigeonhole, assort, sort. Ana. Order, arrange, systematize, methodize, marshal.
Goals
To construct a Swedish dictionary of synonyms as a list of synonymous pairs
I don’t want to work a lot
I don’t want to pay anyone to work
The resulting list should be free
Ideas
Automatically construct a large set of word pairs that might be synonyms
Use tens of thousands of people, each willing to make a small contribution without payment, to check the word pairs
More ideas
Use the Lexin on-line Swedish-English dictionary web site, which had 9 million (now 17 million) lookups each month
Users visit Lexin to translate words, and are thus probably motivated to help me
Each time a user makes a lookup, give her the opportunity to decide whether two words are synonyms or not
My plan
1. Construct lots of possible synonyms
2. Sort out bad synonym pairs automatically
3. Ask lots of users if the rest of the pairs are good synonyms
4. Analyze the gradings done by the users and decide which pairs to keep
Step 1: Construct lots of possible synonyms
If we have access to a Swedish-English dictionary SE and an English-Swedish dictionary ES, try to translate each word to English and back again to Swedish
$\{(w,v) : \exists y : y \in SE(w) \land v \in ES(y)\}$ or $\{(w,v) : \exists y : y \in SE(w) \land y \in SE(v)\}$
616 000 word pairs were generated
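The two set definitions above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the toy dictionaries `SE` and `ES` and the function names `round_trip_pairs` and `shared_translation_pairs` are assumptions made for this example, and pairs are stored in sorted order so that (w, v) and (v, w) count once.

```python
from itertools import combinations

# Hypothetical toy dictionaries; the real SE and ES would be full lexicons.
SE = {
    "klass": {"class", "grade"},
    "grad": {"degree", "grade"},
    "sort": {"kind", "sort"},
}
ES = {
    "class": {"klass"},
    "grade": {"klass", "grad", "årskurs"},
    "degree": {"grad"},
    "kind": {"sort", "slag"},
    "sort": {"sort"},
}


def round_trip_pairs(se, es):
    """Pairs (w, v) such that some y is in se[w] and v is in es[y]."""
    pairs = set()
    for w, translations in se.items():
        for y in translations:
            for v in es.get(y, ()):
                if v != w:  # a word is not its own synonym
                    pairs.add(tuple(sorted((w, v))))
    return pairs


def shared_translation_pairs(se):
    """Pairs (w, v) such that w and v share at least one translation y."""
    by_translation = {}
    for w, translations in se.items():
        for y in translations:
            by_translation.setdefault(y, set()).add(w)
    pairs = set()
    for words in by_translation.values():
        pairs.update(combinations(sorted(words), 2))
    return pairs


candidates = round_trip_pairs(SE, ES) | shared_translation_pairs(SE)
```

Taking the union of both definitions is a choice made for this sketch; the slide offers them as alternatives.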
Step 2: Sort out bad synonym pairs automatically
Use RI (Random Indexing) [Kanerva, Kristoferson, Holst 2000] to measure the distance between words represented in a large vector space
Keep pairs that have small enough distance in the vector space
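A minimal sketch of this filtering step, assuming each word already has a vector and that "small enough distance" is read as a cosine similarity at or above a threshold. The names `cosine` and `filter_pairs`, the default threshold of 0.1, the toy vectors, and the decision to drop pairs whose words have no vector are all assumptions of this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, vectors, threshold=0.1):
    """Keep pairs whose words both have vectors and are similar enough."""
    kept = []
    for w, v in pairs:
        if w in vectors and v in vectors:
            if cosine(vectors[w], vectors[v]) >= threshold:
                kept.append((w, v))
    return kept


# Hypothetical toy vectors, for illustration only.
vectors = {"klass": [1, 0, 1], "grad": [1, 0, 0.5], "uppdrag": [0, 1, 0]}
pairs = [("klass", "grad"), ("klass", "uppdrag"), ("klass", "okänd")]
kept = filter_pairs(pairs, vectors)
```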
Random Indexing
Each word w is assigned a random label vector Lw of about a thousand elements
For each word w, construct a context vector Cw by adding the random label vectors of the words appearing in the context of each occurrence of w in a large corpus
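The construction of label and context vectors can be sketched as follows. This is a simplified illustration under stated assumptions: label vectors are sparse with a few random +1/-1 entries (a common choice in Random Indexing that the slides do not specify), the corpus is a flat token list with stop words already removed, and the names `label_vector` and `random_indexing` and the `nonzeros` parameter are hypothetical.

```python
import random
from collections import defaultdict


def label_vector(dim, nonzeros, rng):
    """Sparse random label vector: `nonzeros` entries set to +1 or -1, rest 0."""
    vec = [0] * dim
    for i in rng.sample(range(dim), nonzeros):
        vec[i] = rng.choice((-1, 1))
    return vec


def random_indexing(tokens, dim=1800, nonzeros=8, window=4, seed=0):
    """Return (labels, contexts) for a list of tokens."""
    rng = random.Random(seed)
    labels = {}
    for w in tokens:
        if w not in labels:
            labels[w] = label_vector(dim, nonzeros, rng)

    contexts = defaultdict(lambda: [0] * dim)
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        ctx = contexts[w]
        for j in range(lo, hi):
            if j == i:
                continue  # the word itself is not part of its own context
            for k, x in enumerate(labels[tokens[j]]):
                ctx[k] += x
    return labels, dict(contexts)


tokens = "en klass är en grupp".split()
labels, contexts = random_indexing(tokens, dim=50, nonzeros=4, window=1)
```

With `window=1`, the context vector for "klass" is the sum of the label vectors of its two neighbours, "en" and "är". Words that occur in similar contexts end up with context vectors that point in similar directions.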
Random Indexing settings
Context: 4 words to the left and 4 to the right
Stop words were removed
Dimensionality: 1800
5 corpora from different domains were used, for example newspapers and medical texts
Number of pairs for different cos thresholds (435 000 of 616 000 pairs occurred in corpus)
[Chart: x-axis: cosine threshold, from -0.05 to 0.4 (0.1 marked with *); y-axis: number of pairs, from 0 to 450 000]
Step 3: Ask lots of users if the rest of the pairs are good synonyms
When a user has sent a word to the Lexin dictionary he receives the translation followed by a question like:
Are 'spread' and 'lengthen' synonyms? Answer using a scale from 0 to 5 where 0 means 'I don’t agree' and 5 means 'I do fully agree', or answer 'I don’t know'
After answering, the user may
grade a new randomly chosen word pair
look up a word in the synonym dictionary
suggest a new synonymous word pair
download the synonym dictionary in XML
Step 4: Analyzing the gradings done by the users
1.2 million gradings were made in less than 2 months
Grading statistics were analyzed on several occasions
Some users sent comments
Keeping the users happy!
Many users said that there were too many bad pairs
Lots of pairs were graded 0 (not at all synonyms) by all users. After some weeks 25 000 such pairs were removed. Later 60 000 more pairs were removed, improving the quality of the remaining pairs considerably.
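Removing the pairs that every user graded 0 could look roughly like this. It is a sketch, not the procedure actually used: the function name `unanimous_zero_pairs`, the data layout (a dict from pair to a list of grades, with "I don't know" answers excluded), and the `min_gradings` cutoff are assumptions of this example.

```python
def unanimous_zero_pairs(gradings, min_gradings=3):
    """Pairs with at least `min_gradings` grades, all of them 0."""
    return {
        pair
        for pair, grades in gradings.items()
        if len(grades) >= min_gradings and all(g == 0 for g in grades)
    }


# Hypothetical grading data, for illustration only.
gradings = {
    ("klass", "uppdrag"): [0, 0, 0, 0],
    ("klass", "rang"): [5, 4, 5],
    ("klass", "poäng"): [0, 1, 0],
    ("klass", "utbilda"): [0],  # too few gradings to judge yet
}
to_remove = unanimous_zero_pairs(gradings)
remaining = {p: g for p, g in gradings.items() if p not in to_remove}
```

Requiring a minimum number of gradings before removal keeps a single hasty answer from deleting a pair.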
User gradings first two months
[Chart: x-axis: grade (0, 1, 2, 3, 4, 5, don't know); y-axis: share of gradings, from 0% to 60%]
More interesting gradings 2006
[Chart: x-axis: grade (0, 1, 2, 3, 4, 5, don't know); y-axis: share of gradings, from 0% to 60%; series: 2005 and 2006]
Distribution of mean gradings of word pairs after two months
[Chart: x-axis: mean grading, from 0 to 5 in steps of 0.5; y-axis: share of pairs, from 0% to 40%; 217 000 pairs]
Distribution of mean gradings of word pairs 2006
[Chart: x-axis: mean grading, from 0 to 5 in steps of 0.5; y-axis: share of pairs, from 0% to 40%; series: 2005 and 2006]
Analysis of the pairs graded 0
Distance (cosine) in RI space
[Chart: x-axis: cosine, from 0.1 to 0.6; y-axis: share of pairs, from 0% to 50%; series: pairs graded 0 and all pairs]
Some statistics (November 2006)
2.5 M user gradings done
67 000 pairs (graded ≥ 2) in dictionary
90 000 pairs suggested by users
50 000 unique pairs suggested
14 000 of them have been accepted
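One way to read "graded ≥ 2" is as a cutoff on the mean grade of each pair. The sketch below follows that reading, which is an assumption: the slides do not say exactly how the cutoff was applied. The function name `build_dictionary` and the `min_gradings` parameter are also hypothetical.

```python
from statistics import mean


def build_dictionary(gradings, min_mean=2.0, min_gradings=1):
    """Map each accepted pair to its mean grade."""
    accepted = {}
    for pair, grades in gradings.items():
        if len(grades) < min_gradings:
            continue  # not enough evidence for this pair yet
        m = mean(grades)
        if m >= min_mean:
            accepted[pair] = m
    return accepted


# Hypothetical grading data, for illustration only.
gradings = {
    ("klass", "rang"): [5, 5, 4],
    ("klass", "typ"): [2, 2, 2],
    ("klass", "uppdrag"): [0, 0, 1],
}
dictionary = build_dictionary(gradings)
```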
Example: Synonyms to klass (class)
5: rang (grade), rank (rank), slag (kind)
4: kategori (category), stånd (social class), årskurs (grade)
3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), ordning (order), skikt (layer), sort (sort), standard (standard), stil (style)
2: storleksordning (magnitude), typ (type)
1: poäng (point), stadga (stability)
0: uppdrag (mission), utbilda (educate)
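A listing like the one above could be produced by grouping a word's partners by their mean grade rounded to the nearest integer. That rounding rule is a guess, not something the slides state, and the function name `synonyms_by_level` and the sample mean grades are made up for this sketch.

```python
import math
from collections import defaultdict


def synonyms_by_level(mean_grades, word):
    """Group the partners of `word` by mean grade rounded half up."""
    levels = defaultdict(list)
    for (a, b), m in mean_grades.items():
        if word == a or word == b:
            other = b if word == a else a
            levels[math.floor(m + 0.5)].append(other)
    # Highest level first, partners in alphabetical order within a level.
    return {
        level: sorted(words)
        for level, words in sorted(levels.items(), reverse=True)
    }


# Hypothetical mean grades, for illustration only.
mean_grades = {
    ("klass", "rang"): 4.7,
    ("klass", "slag"): 4.5,
    ("kategori", "klass"): 4.2,
    ("grad", "sort"): 3.0,
}
listing = synonyms_by_level(mean_grades, "klass")
```

`math.floor(m + 0.5)` is used instead of the built-in `round` because `round` rounds halves to the nearest even integer, which would put 4.5 at level 4.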
How to prevent abuse?
Many gradings of a word pair are needed before it’s considered to be good
The pair to be graded is randomly picked from a very large list
Word pairs suggested by users are spell checked before they are added to the very large list
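The three safeguards above can be sketched together. Everything here is illustrative: the function names, the `min_gradings` value, and the use of a plain word list as the "spell checker" are assumptions, not a description of the deployed system.

```python
import random


def pick_pair_to_grade(candidate_pairs, rng=random):
    """Pick a random pair, so a user cannot choose which pair to influence."""
    return rng.choice(candidate_pairs)


def has_enough_gradings(grades, min_gradings=5):
    """A pair is only judged once many users have graded it."""
    return len(grades) >= min_gradings


def accept_suggestion(pair, known_words, candidate_pairs):
    """Add a user-suggested pair only if both words pass the spell check."""
    w, v = pair
    if w == v or w not in known_words or v not in known_words:
        return False
    if pair in candidate_pairs or (v, w) in candidate_pairs:
        return False  # already in the list
    candidate_pairs.append(pair)
    return True


# Hypothetical data, for illustration only.
known_words = {"klass", "rang", "grupp", "sort"}
candidate_pairs = [("klass", "rang"), ("klass", "sort")]

ok = accept_suggestion(("klass", "grupp"), known_words, candidate_pairs)
misspelled = accept_suggestion(("klass", "grup"), known_words, candidate_pairs)
duplicate = accept_suggestion(("rang", "klass"), known_words, candidate_pairs)
chosen = pick_pair_to_grade(candidate_pairs, random.Random(0))
```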
People's definition of synonymy
Exact meaning of 'synonym' wasn’t defined
Users will grade using their intuitive understanding of the concept of synonymy and the words in the pair
The produced dictionary will use the people's own definition of synonymy
Hopefully this is exactly what they want!
The people’s synonym dictionary on the web
http://lexin.nada.kth.se/cgi-bin/synlex
Lessons learned
The list of suggested synonyms should be huge
Try to improve the quality of the list automatically as much as possible. Random indexing is useful for this; also try tagging and using other dictionaries
Use the 0 answers early to remove bad pairs that only irritate the users