Pedersen, NAACL 2010 Poster




Ted Pedersen
Department of Computer Science
University of Minnesota, Duluth
[email protected]
http://www.d.umn.edu/~tpederse

This poster presents an empirical comparison of similarity measures for pairs of concepts based on Information Content. It shows that using modest amounts of untagged text to derive Information Content results in higher correlation with human similarity judgments than using the largest available corpus of manually annotated sense-tagged data.

Ted Pedersen, University of Minnesota, Duluth

The hypothesis underlying this poster is that Information Content values can be reasonably calculated without sense-tagged text, and that sense-tagged text is not available in sufficient quantities to be considered reliable for this task. A number of experiments were conducted in which different amounts of raw text were used to estimate Information Content values, which were then used to compute semantic similarity. The automatically computed similarity values were measured for rank correlation with three manually created gold standard datasets. This poster shows the results using the Miller and Charles set of 30 word pairs. Overall there was no advantage to using sense-tagged text to compute Information Content values, and the Information Content based measures of semantic similarity improved upon the path based measures. Of particular interest was the fact that the add-1 smoothing method (without any text!) provided results almost as good as raw newspaper text. This shows that the real power of Information Content lies not in the frequency counts but in the structure of the taxonomy. This would have come as no surprise to Aristotle and Linnaeus.

A measure of semantic similarity quantifies the degree to which two concepts are “alike”.
● How much does c1 remind you of c2?

Concepts can be arranged in a hierarchy, where children “are a kind of” their parents – path lengths give a plausible approximation of similarity

...but path lengths are inconsistent as you move to more general or more specific concepts
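The path-based approach can be sketched in a few lines of Python. This is a minimal illustration only: the hierarchy and concept names are hypothetical, and the 1/(nodes on the shortest is-a path) convention matches the PATH scores reported in the experiments below.

```python
# Hypothetical is-a hierarchy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "instrument": "entity",
    "telescope": "instrument",
    "machine": "entity",
}

def ancestors(c):
    """The concept itself plus all its ancestors, most specific first."""
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out

def path_similarity(c1, c2):
    """1 / (number of nodes on the shortest is-a path joining c1 and c2)."""
    up1, up2 = ancestors(c1), ancestors(c2)
    shared = set(up1) & set(up2)
    # Route the path through whichever shared ancestor makes it shortest.
    nodes = min(up1.index(a) + up2.index(a) + 1 for a in shared)
    return 1.0 / nodes

print(path_similarity("telescope", "machine"))  # telescope-instrument-entity-machine: 0.25
```

The inconsistency noted above is visible even here: every extra layer of detail in one part of the hierarchy stretches paths there, while shallow regions keep paths artificially short.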

Resnik (1995) assigned a value of specificity to each concept (which corrects for inconsistent path lengths)
● Information Content: IC(c1) = -log prob(c1)
● Similarity of c1 and c2 is the IC of their least common subsumer (most specific shared ancestor)

Resnik (1995): res(c1, c2) = IC(LCS(c1, c2))

Jiang & Conrath (1997): jcn(c1, c2) = 1 / (IC(c1) + IC(c2) - 2 * res(c1, c2))

Lin (1998): lin(c1, c2) = 2 * res(c1, c2) / (IC(c1) + IC(c2))
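A minimal sketch of the three measures in Python. The toy hierarchy and the Information Content values are hypothetical, chosen only to show how the formulas fit together; real systems such as WordNet::Similarity derive IC from corpus counts over WordNet.

```python
# Hypothetical is-a hierarchy (child -> parent) with illustrative IC values;
# more specific concepts carry higher Information Content.
PARENT = {
    "entity": None,
    "instrument": "entity",
    "optical_instrument": "instrument",
    "telescope": "optical_instrument",
    "microscope": "optical_instrument",
    "machine": "entity",
}
IC = {
    "entity": 0.0, "instrument": 2.1, "optical_instrument": 4.0,
    "telescope": 6.5, "microscope": 6.8, "machine": 2.3,
}

def ancestors(c):
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out

def lcs(c1, c2):
    """Least common subsumer: the most specific shared ancestor."""
    seen = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in seen)

def res(c1, c2):
    return IC[lcs(c1, c2)]

def jcn(c1, c2):
    return 1.0 / (IC[c1] + IC[c2] - 2.0 * res(c1, c2))

def lin(c1, c2):
    return 2.0 * res(c1, c2) / (IC[c1] + IC[c2])

print(res("telescope", "microscope"))  # IC of optical_instrument: 4.0
```

Note that res depends only on the LCS, while jcn and lin also factor in how specific the two concepts themselves are.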

Measures of Semantic Similarity

J. Jiang and D. Conrath, 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan.

D. Lin, 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison.

P. Resnik, 1995. Using Information Content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal.

Experimental Data

L. Finkelstein, et al. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116-131. http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

G. A. Miller and W. G. Charles, 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28.

H. Rubenstein and J.B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8:627-633.

Software

T. Pedersen et al., 2004. WordNet::Similarity – measuring the relatedness of concepts. In Proceedings of HLT/NAACL 2004, pages 38-41, Boston, MA.

Is a dog similar to a human? Is a cow similar to a barn?

Is a time machine similar to a telescope? Is a computer similar to a surfboard?

Similar things share certain traits: X and Y both (have | are made of | like | think about | live in | studied at | play | hate | are a kind of | eat) Z

Aristotle, 384 BC – 322 BC
● Great Chain of Being – nature with ranks
● Law of Similarity

Carl Linnaeus, 1707–1778
● Nature as nested hierarchy
● Binomial Nomenclature: Genus + Species

This leads to TAXONOMY – concepts organized in a hierarchy joined by “is-a” relations

NLP needs to incorporate similarity for WSD, lexical selection in text generation, semantic search, recognizing textual entailment...

Semantic Similarity


References


Contact

Information Content Measures of Semantic Similarity Perform Better Without Sense-Tagged Text

Correlation with Miller & Charles Data

Corpus for IC           Size         Coverage   JCN   LIN   RES
SemCor (sense-tagged)   226,000      .24        .72   .73   .74
SemCor (raw)            670,000      .37        .82   .79   .76
xie199501 (Jan 1995)    1.2 million  .35        .78   .75   .73
xie199501-02            2.3 million  .39        .79   .75   .73
xie199501-12            16 million   .51        .87   .81   .75
xie1995-1998            73 million   .60        .89   .81   .75
XIE 1995-2001           133 million  .64        .88   .81   .76
AFE                     174 million  .66        .88   .80   .77
APW                     560 million  .75        .84   .79   .76
NYT                     963 million  .83        .84   .79   .77
ADD-1                   0            1.00       .85   .77   .76

Similarity Measures w/ Different Sources

                                            RES                     LIN                     JCN
Concept Pair                       PATH     SemCor  AFE    Add-1   SemCor  AFE    Add-1   SemCor  AFE    Add-1
telescope#n#1 / microscope#n#1     0.33     9.28    10.81  0.93    0.93    0.89   0.91    0.66    0.38   0.66
instrument#n#1 / machine#n#1       0.33     4.37    4.66   0.70    0.70    0.69   0.65    0.26    0.24   0.26
telescope#n#1 / time_machine#n#1   0.17     4.37    4.66   0.41    0.41    0.34   0.36    0.08    0.06   0.08
catapult#n#3 / time_machine#n#1    0.14     4.37    4.66   0       0       0.29   0.31    0       0.04   0.06

This is a small portion of WordNet, showing Information Content for concepts based on sense-tagged text (SemCor), newspaper text (AFE), and by assuming each sense occurs one time (ADD-1). If the text is not sense tagged, then each possible sense of a word (and its ancestors) is incremented with each occurrence.
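That counting scheme can be sketched as follows. The tiny taxonomy and word-to-sense mapping here are hypothetical; the point is that raw (untagged) text credits every sense of each observed word plus that sense's ancestors, while ADD-1 needs no text at all.

```python
import math
from collections import Counter

# Hypothetical taxonomy (child -> parent) and word-to-sense mapping;
# "scope" is ambiguous between its own sense and the telescope sense.
PARENT = {"entity": None, "device": "entity", "telescope": "device",
          "scope": "device", "animal": "entity", "dog": "animal"}
SENSES = {"telescope": ["telescope"], "scope": ["scope", "telescope"],
          "dog": ["dog"]}

def ancestors(c):
    while c is not None:
        yield c
        c = PARENT[c]

def counts_from_raw_text(words):
    """Untagged text: increment every sense of each word, and its ancestors."""
    counts = Counter()
    for w in words:
        for sense in SENSES.get(w, []):
            counts.update(ancestors(sense))
    return counts

def add1_counts():
    """ADD-1: pretend each concept occurred once; propagate up the taxonomy."""
    counts = Counter()
    for c in PARENT:
        counts.update(ancestors(c))
    return counts

def information_content(counts):
    """IC(c) = -log prob(c), with all probability mass rooted at 'entity'."""
    total = counts["entity"]
    return {c: -math.log(n / total) for c, n in counts.items()}

ic = information_content(add1_counts())
# The root always has IC 0; the most specific concepts score highest,
# which is why ADD-1 alone recovers so much of the taxonomy's structure.
```

Under ADD-1 the counts reflect nothing but the shape of the hierarchy, which is exactly the point made above about the power of the taxonomy itself.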