
Semantic similarity measure based on multiple resources

Imen Akermi
LARODEC, ISG Tunis, University of Tunis
Tunis, Tunisia
[email protected]

Rim Faiz
LARODEC, IHEC Carthage, University of Carthage
Carthage Présidence, Tunisia
[email protected]

Abstract — The ability to accurately judge the semantic similarity between words is critical to the performance of several applications such as Information Retrieval and Natural Language Processing. Therefore, in this paper we propose a semantic similarity measure that uses, on the one hand, an online English dictionary provided by the Semantic Atlas project of the French National Centre for Scientific Research (CNRS) and, on the other hand, a page-counts-based metric returned by a social website whose content is generated by users.

    Keywords: Similarity measures, Web search, Semantic similarity, User-generated content, Information Retrieval.

I. INTRODUCTION

With the development of the Web, a great amount of information is available on Web pages, which opens new perspectives for sharing information and allows the collaborative construction of content and the development of social networks. This shared information represents a collective intelligence, or what we prefer to call "human swarm knowledge". It is a remarkable potential that we take advantage of in our work in order to measure the similarity between words.

Measures of the semantic similarity of words have long been used in applications in Natural Language Processing and related areas, such as the automatic creation of thesauri [10], [19], text classification [8], word sense disambiguation [15], [16], and information extraction and retrieval [5], [30]. Indeed, one of the main goals of these applications is to facilitate user access to relevant information. According to Baeza-Yates and Ribeiro-Neto [3], an information retrieval system "should provide the user with easy access to the information in which he is interested".

Therefore, various methods have been proposed: corpus-based methods using lexical resources such as WordNet¹ and the Brown corpus² [18], [15], [12], [19], and Web-based metrics [20], [28], [1]. In this same context, we propose a method that uses information available on the Web to measure semantic similarity between a pair of words.

¹ http://wordnet.princeton.edu/.

² http://icame.uib.no/brown/.

The rest of the paper is organized as follows: Section 2 introduces related work on semantic similarity methods. In Section 3, we present our approach for measuring semantic similarity between words. In Section 4, we evaluate our method in order to demonstrate its effectiveness. In Section 5, we conclude with a few notes and some perspectives.

II. RELATED WORK

Many previous works on semantic similarity have used manually compiled taxonomies such as WordNet and large text corpora [18], [15], [12], [19].

Li et al. [17] combined structural semantic information from a lexical taxonomy and information content from a corpus in a nonlinear model. They proposed a similarity measure that uses shortest path length, depth, and local density in a taxonomy. Lin [19] defined the similarity between two concepts as the ratio between the information common to both concepts and the information contained in each individual concept. However, one major issue with taxonomy- and corpus-oriented approaches is that they might not capture similarity between proper names such as named entities (e.g., personal names, location names, product names) or new uses of existing words.

Other Web-based approaches have used Web content for measuring semantic similarity between words. Turney [28] defined a measure called point-wise mutual information (PMI-IR), using the number of hits returned by a web search engine, to recognize synonyms. A similar approach for measuring the similarity between words was proposed by Matsuo et al. [20], who used web hits for the extraction of communities on the Web. They measured the association between two personal names using the Simpson coefficient, which is calculated based on the number of web hits for each individual name and their conjunction (i.e., an AND query of the two names). However, using page counts alone as a measure of similarity is not enough, since even though two words appear in a page, they might not be related [4]. A classic example: page counts for the word apple include pages about apple as a fruit and Apple as a company. Moreover, given the scale and noise of the Web, some words might occur on some pages arbitrarily, i.e., by random chance. For these reasons, page counts alone are unreliable for measuring semantic similarity.


Sahami and Heilman [26] measured semantic similarity between two queries using the snippets returned for those queries by a search engine. For each query, they collect snippets from a search engine and represent each snippet as a TF-IDF-weighted term vector. Each vector is normalized, and the centroid of the set of vectors is computed. Semantic similarity between two queries is then defined as the inner product between the corresponding centroid vectors. Chen et al. [7] developed a double-checking model using text snippets returned by a web search engine to compute semantic similarity between words. For two words P and Q, they collect snippets for each word from a web search engine. Then they count the occurrences of word P in the snippets for word Q and the occurrences of word Q in the snippets for word P. These values are combined nonlinearly to compute the similarity between P and Q. However, this method depends heavily on the search engine's ranking algorithm: although two words P and Q might be very similar, there is no reason to believe that one will find Q in the snippets for P, or vice versa.
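To make the snippet-centroid computation of Sahami and Heilman concrete, here is a minimal Python sketch. It assumes the snippets have already been fetched for each query (retrieval itself is not shown), and it uses scikit-learn's TfidfVectorizer; the function name and interface are illustrative, not taken from [26].

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def centroid_similarity(snippets_p, snippets_q):
    """Inner product of L2-normalized TF-IDF centroids of two snippet sets."""
    vectorizer = TfidfVectorizer()
    # Build one shared vocabulary over both snippet sets.
    matrix = vectorizer.fit_transform(snippets_p + snippets_q).toarray()
    vecs_p, vecs_q = matrix[:len(snippets_p)], matrix[len(snippets_p):]

    def centroid(vecs):
        # Normalize each snippet vector, average them, renormalize the mean.
        norms = np.linalg.norm(vecs, axis=1, keepdims=True)
        c = (vecs / np.maximum(norms, 1e-12)).mean(axis=0)
        return c / max(np.linalg.norm(c), 1e-12)

    return float(centroid(vecs_p) @ centroid(vecs_q))
```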

    Bollegala et al. [4] developed a hybrid semantic similarity measure. They combined a WordNet based metric with retrieval of information from text snippets returned by a Web search engine. They automatically discover lexico-syntactic templates for semantically related and unrelated words using WordNet, and they train a support vector machine (SVM) classifier. The learned templates are used for extracting information from the text fragments returned by the search engine.

In our work, we explore the potential offered by social networks and Web content in order to measure semantic similarity. Therefore, in this paper we propose a method that uses both a page-counts-based similarity returned by a social website and a dictionary-based metric.

III. THE PROPOSED METHOD

Our method for measuring semantic similarity between words uses, on the one hand, an online English dictionary provided by the Semantic Atlas project (SA) [23] and, on the other hand, page counts returned by a social website whose content is generated by users (see Fig. 1).

    Figure 1. The proposed method for measuring semantic similarity between a given pair of words

    We proceed in three phases:

    1. The calculation of the similarity (SD) between two words based on the online dictionary provided by the Semantic Atlas project.

2. The calculation of the similarity (SP) between two words based on the page counts returned by the social website Digg.com³.

    3. Integration of the two similarity measures SD and SP.

A. Phase 1: The Semantic Atlas based similarity measure

In this phase, we extract synonyms for each word from the online English dictionary provided by the Semantic Atlas project [23] of the French National Centre for Scientific Research (CNRS).

What drew our attention is the fact that the SA is composed of several dictionaries and thesauri (including Roget's thesaurus), thus offering a wide range of senses for a given word. In fact, the SA is used for the automatic treatment of polysemy and semantic disambiguation [29]. It has also been used in psycholinguistics research at the University of Geneva [13], in a survey to define the age of word acquisition in French as well as degrees of familiarity of words. The SA is currently available in French and English versions. It can be consulted online via a flexible interface allowing interactive navigation at http://dico.isc.cnrs.fr⁴. The current online version allows users to set the search type for English words as (1) standard (narrow synonymy) or (2) enriched (broad synonymy). In this paper, we use the enriched search type for better precision.

In our work, we use the SA to extract the set of synonyms for each word. Once the two sets of synonyms are collected, we calculate the degree of similarity between them, which we call S(w1, w2), using the Jaccard coefficient:

S(w1, w2) = mc / (mw1 + mw2 - mc)   (1)

where
mc: the number of common words between the two synonym sets,
mw1: the number of words contained in the w1 synonym set,
mw2: the number of words contained in the w2 synonym set.

If the synonym set of word w1 explicitly contains the word w2, or vice versa, we directly assign the value 1 to S(w1, w2). For example, to find the semantic similarity between the words asylum and madhouse, we extract the two groups of synonyms for these words from the dictionary.

³ http://www.digg.com.

⁴ This site is the most consulted address of the domain of the French National Centre for Scientific Research (CNRS), one of the major research bodies in France.


According to TABLE I, each of the two words appears in the other's group of synonyms, meaning that asylum and madhouse are perfect synonyms.

TABLE I. THE GROUPS OF SYNONYMS FOR THE WORDS ASYLUM AND MADHOUSE

asylum: ark, bedlam, booby hatch, crazy house, cuckoo's nest, funny farm, funny house, home, hospital, infirmary, insane asylum, institution, loony bin, madhouse, mental home, mental hospital, mental institution, nuthouse, psychiatric hospital, sanatorium; harbour, haven, preserve, refuge, reserve, retreat, safehold, safety, sanctuary, shelter.

madhouse: asylum, bedlam, booby hatch, chaos, crazy house, cuckoo's nest, funny farm, funny house, insane asylum, loony bin, lunatic asylum, mental home, mental hospital, mental institution, nuthouse, psychiatric hospital, sanatorium.
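As a minimal sketch of Phase 1, the following Python function computes Eq. (1) over two synonym sets, including the shortcut for explicit mutual synonymy. The synonyms_of lookup is a hypothetical stand-in for querying the SA dictionary, which is not shown.

```python
def dictionary_similarity(w1, w2, synonyms_of):
    """Phase 1 (Eq. 1): Jaccard similarity over two synonym sets.

    `synonyms_of` is a hypothetical lookup returning the set of synonyms
    the Semantic Atlas gives for a word.
    """
    syn1, syn2 = set(synonyms_of(w1)), set(synonyms_of(w2))
    # Shortcut: each word explicitly listed as a synonym of the other.
    if w2 in syn1 or w1 in syn2:
        return 1.0
    mc = len(syn1 & syn2)               # common words
    denom = len(syn1) + len(syn2) - mc  # mw1 + mw2 - mc
    return mc / denom if denom else 0.0
```

Applied to the sets of TABLE I, asylum's synonym group contains madhouse, so the shortcut fires and S(asylum, madhouse) = 1.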

B. Phase 2: The page counts similarity measure

In this phase, we calculate the degree of similarity between the two words w1 and w2 using the WebPMI coefficient [4], which takes as parameters the numbers of pages returned by the social website Digg.com for the queries w1, w2, and (w1 AND w2). In fact, we explore the potential offered by a community site whose content is generated by users for measuring semantic similarity between words.

Barlow [2] identifies Digg.com as one of the most popular aggregators of articles published on the Web whose content is generated by users. It encourages users to participate in the sharing of useful information and the exchange of advice, offering a huge amount of information that would be impossible to obtain otherwise. A perfect example of what relational dynamics can produce, Digg.com is one of the most visited websites on the Web. Users publish on the site links to articles or pieces of information that seem interesting, with or without comments [22]. We use this social site to obtain the number of pages for a given query. Once the page counts for the queries w1, w2, and (w1 AND w2) are obtained, we calculate the WebPMI coefficient for the given pair of words:

WebPMI(w1, w2) = log2( (H(w1 AND w2) / N) / ( (H(w1) / N) * (H(w2) / N) ) )   (2)

where
H(w1 AND w2): the page count for the query (w1 AND w2),
H(w1): the page count for the query w1,
H(w2): the page count for the query w2.
We set N = 10^10, according to [4].

For the example mentioned in the previous section, the WebPMI coefficient is:

WebPMI(asylum, madhouse) = 0.800.
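A minimal sketch of Phase 2, assuming a hypothetical page_count helper that returns the number of results the social site reports for a query (the actual Digg.com lookup is not shown):

```python
import math

N = 1e10  # total page estimate, following [4]

def web_pmi(w1, w2, page_count):
    """Phase 2 (Eq. 2): WebPMI from page counts."""
    h1, h2 = page_count(w1), page_count(w2)
    h12 = page_count(f"{w1} AND {w2}")  # conjunctive query
    if h1 == 0 or h2 == 0 or h12 == 0:
        return 0.0  # no evidence of co-occurrence
    return math.log2((h12 / N) / ((h1 / N) * (h2 / N)))
```

Note that raw PMI is not bounded by [0, 1]; as stated with TABLE II, the reported figures are normalized to that interval for comparison.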

C. Phase 3: The overall similarity measure

In this last phase, we combine the two previously calculated measures using the following formula:

SimFA(w1, w2) = λ * S(w1, w2) + (1 - λ) * WebPMI(w1, w2)   (3)

with λ ∈ [0, 1]. First experiments on the Miller-Charles [21] and Rubenstein-Goodenough [25] datasets have shown that our measure performs best with λ = 0.6. We intend to perform further experiments on the WordSimilarity-353 Test Collection [9] in order to stabilize λ.
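As a sketch, Eq. (3) is a simple convex combination; both input scores are assumed to be normalized to [0, 1], and the parameter lam mirrors the weight λ above:

```python
def sim_fa(s_dictionary, s_webpmi, lam=0.6):
    """Phase 3 (Eq. 3): weighted combination of the two similarity scores.

    lam = 0.6 follows the paper's first experiments on the Miller-Charles
    and Rubenstein-Goodenough datasets.
    """
    return lam * s_dictionary + (1.0 - lam) * s_webpmi
```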

    IV. EXPERIMENTS

A. The benchmark dataset

We evaluate the proposed method against the Miller-Charles dataset [21], a dataset of 30 word pairs. Because of the omission of two word pairs in earlier versions of WordNet, most researchers have used only 28 pairs for evaluation; we also evaluated our method on Rubenstein-Goodenough's [25] original dataset of 65 word pairs. These two datasets are considered a reliable benchmark for evaluating semantic similarity measures. The word pairs are rated on a scale from 0 (no similarity) to 4 (perfect synonymy).

In TABLE II, we compare the results of our measure with Chen's Co-occurrence Double Checking (CODC) measure [7], the Sahami and Heilman metric [26], the Bollegala et al. measure [4], and four popular co-occurrence measures modified by [4]: WebJaccard, WebSimpson, WebDice, and WebPMI (point-wise mutual information). These measures use page counts returned by the Google⁵ search engine.

They define WebSimpson(P, Q) as:

WebSimpson(P, Q) = 0 if H(P AND Q) ≤ c, and H(P AND Q) / min(H(P), H(Q)) otherwise   (4)

They set the WebSimpson coefficient to zero if the page count for the query (P AND Q) is less than a threshold c⁶. Applied to the previous example:

WebSimpson(asylum, madhouse) = 0.024.

They define the WebDice coefficient as a variant of the Dice coefficient. WebDice(P, Q) is defined as:

WebDice(P, Q) = 0 if H(P AND Q) ≤ c, and 2 * H(P AND Q) / (H(P) + H(Q)) otherwise   (5)

Example: WebDice(asylum, madhouse) = 0.008.

⁵ http://www.google.com.
⁶ c = 5 in the experiments.

TABLE II. THE SIMFA SIMILARITY MEASURE COMPARED TO BASELINES ON THE MILLER-CHARLES DATASET.

Word pair | Miller-Charles | WebJaccard | WebSimpson | WebDice | WebPMI | Sahami et al. | Chen-CODC | Bollegala et al. | SimFA
cord-smile | 0.13 | 0.004 | 0.003 | 0.004 | 0.467 | 0.090 | 0 | 0 | 0.199
rooster-voyage | 0.08 | 0.000 | 0.000 | 0.000 | 0.000 | 0.197 | 0 | 0.017 | 0.227
noon-string | 0.08 | 0.002 | 0.002 | 0.002 | 0.450 | 0.082 | 0 | 0.018 | 0.195
glass-magician | 0.11 | 0.002 | 0.008 | 0.003 | 0.513 | 0.143 | 0 | 0.180 | 0.219
monk-slave | 0.55 | 0.006 | 0.003 | 0.007 | 0.609 | 0.095 | 0 | 0.375 | 0.262
coast-forest | 0.42 | 0.022 | 0.013 | 0.026 | 0.570 | 0.248 | 0 | 0.405 | 0.246
monk-oracle | 1.1 | 0.001 | 0.001 | 0.001 | 0.466 | 0.045 | 0 | 0.328 | 0.199
lad-wizard | 0.42 | 0.003 | 0.005 | 0.004 | 0.581 | 0.149 | 0 | 0.220 | 0.250
forest-graveyard | 0.84 | 0.002 | 0.005 | 0.002 | 0.556 | 0 | 0 | 0.547 | 0.244
food-rooster | 0.89 | 0.001 | 0.019 | 0.001 | 0.486 | 0.075 | 0 | 0.060 | 0.210
coast-hill | 0.87 | 0.020 | 0.009 | 0.024 | 0.522 | 0.293 | 0 | 0.874 | 0.234
car-journey | 1.16 | 0.015 | 0.028 | 0.018 | 0.474 | 0.189 | 0.290 | 0.286 | 0.215
crane-implement | 1.68 | 0.002 | 0.004 | 0.002 | 0.476 | 0.152 | 0 | 0.133 | 0.202
brother-lad | 1.66 | 0.003 | 0.014 | 0.004 | 0.561 | 0.236 | 0.379 | 0.344 | 0.257
bird-crane | 2.97 | 0.010 | 0.022 | 0.012 | 0.626 | 0.223 | 0 | 0.879 | 0.271
bird-cock | 3.05 | 0.003 | 0.003 | 0.004 | 0.475 | 0.058 | 0.502 | 0.593 | 0.214
food-fruit | 3.08 | 0.116 | 0.167 | 0.136 | 0.661 | 0.181 | 0.338 | 0.998 | 0.286
brother-monk | 2.82 | 0.006 | 0.018 | 0.007 | 0.577 | 0.267 | 0.547 | 0.377 | 0.849
asylum-madhouse | 3.61 | 0.007 | 0.024 | 0.008 | 0.800 | 0.212 | 0 | 0.773 | 0.946
furnace-stove | 3.11 | 0.028 | 0.017 | 0.034 | 0.736 | 0.310 | 0.928 | 0.889 | 0.917
magician-wizard | 3.5 | 0.020 | 0.020 | 0.024 | 0.691 | 0.233 | 0.671 | 1 | 0.901
journey-voyage | 3.84 | 0.029 | 0.033 | 0.035 | 0.658 | 0.524 | 0.417 | 0.996 | 0.884
coast-shore | 3.7 | 0.052 | 0.035 | 0.062 | 0.650 | 0.381 | 0.518 | 0.945 | 0.881
implement-tool | 2.95 | 0.045 | 0.047 | 0.053 | 0.556 | 0.419 | 0.419 | 0.684 | 0.839
boy-lad | 3.76 | 0.007 | 0.062 | 0.009 | 0.633 | 0.471 | 0 | 0.974 | 0.875
automobile-car | 3.92 | 0.124 | 0.397 | 0.144 | 0.684 | 1 | 0.686 | 0.980 | 0.897
midday-noon | 3.42 | 0.012 | 0.008 | 0.015 | 0.685 | 0.289 | 0.856 | 0.819 | 0.899
gem-jewel | 3.84 | 1 | 1 | 1 | 1 | 0.211 | 1 | 0.686 | 0.969
Correlation | - | 0.316 | 0.383 | 0.328 | 0.663 | 0.579 | 0.693 | 0.834 | 0.840

They define WebJaccard as a variant of the Jaccard index using page counts:

WebJaccard(P, Q) = 0 if H(P AND Q) ≤ c, and H(P AND Q) / (H(P) + H(Q) - H(P AND Q)) otherwise   (6)

Example: WebJaccard(asylum, madhouse) = 0.007.
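For reference, the three page-count baselines (Eqs. 4-6) as short Python functions; hp, hq, and hpq stand for the page counts H(P), H(Q), and H(P AND Q), with c = 5 as in the experiments:

```python
def web_jaccard(hp, hq, hpq, c=5):
    """Eq. 6: page-count variant of the Jaccard index."""
    return 0.0 if hpq <= c else hpq / (hp + hq - hpq)

def web_simpson(hp, hq, hpq, c=5):
    """Eq. 4: page-count variant of the Simpson (overlap) coefficient."""
    return 0.0 if hpq <= c else hpq / min(hp, hq)

def web_dice(hp, hq, hpq, c=5):
    """Eq. 5: page-count variant of the Dice coefficient."""
    return 0.0 if hpq <= c else 2 * hpq / (hp + hq)
```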

These measures are based on association ratios between words, computed from the frequency of co-occurrence of words in corpora. The main hypothesis of this approach is that two words are semantically related if their association ratio is high. All figures in TABLE II, except those of Miller-Charles, are normalized to the interval [0, 1] to facilitate comparison.

In TABLE III, we also evaluate our method on Rubenstein-Goodenough's dataset [25]. Rubenstein and Goodenough [25] obtained synonymy judgments from 51 human subjects on 65 pairs of words. The pairs ranged from highly synonymous to semantically unrelated, and the subjects were asked to rate them, on a scale of 0.0 to 4.0, according to their similarity of meaning [6].

We evaluate our method on Rubenstein and Goodenough's dataset against the following five measures: Hirst and St-Onge's [11], Jiang and Conrath's [12], Leacock and Chodorow's [14], Lin's [18], and Resnik's [24]. The first is claimed as a measure of semantic relatedness because it uses all noun relations in WordNet; the others are claimed only as measures of similarity because they use only the hyponymy relation. These measures were implemented by Budanitsky and Hirst [6], where the Brown Corpus was used as the basis for the frequency counts needed in the information-based approaches.

TABLE III. THE SIMFA SIMILARITY MEASURE COMPARED TO BASELINES ON RUBENSTEIN AND GOODENOUGH'S DATASET.

Method | Correlation
Human replication | 0.901
Resnik | 0.779
Lin (1998a) | 0.819
Li et al. (2003) | 0.891
Leacock & Chodorow | 0.838
Hirst & St-Onge | 0.786
Jiang & Conrath | 0.781
Bollegala et al. (2007) | 0.812
Proposed SimFA | 0.875

B. Results

According to TABLE II, our proposed measure SimFA earns the highest correlation of 0.840. The CODC measure reports zero similarity scores for many word pairs in the benchmark. One reason for this sparsity is that even though two words in a pair (P, Q) are semantically similar, we might not always find Q among the top snippets for P (and vice versa) [4]. The similarity measure proposed by Sahami and Heilman [26] is placed fifth, reflecting a correlation of 0.579. Among the four page-counts-based measures, WebPMI earns the highest correlation (r = 0.663).

As summarized in TABLE III, our proposed measure evaluated against Rubenstein-Goodenough's [25] dataset gives a correlation coefficient of 0.875, which we consider promising. Our measure outperforms simple WordNet-based approaches such as edge-counting and information-content measures, and it is comparable with the other methods.
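The correlation figures in TABLES II and III can be reproduced along these lines; the sketch assumes Pearson correlation, the usual choice for these benchmarks (as in [4]), and a dataset of (word1, word2, human_rating) triples:

```python
from scipy.stats import pearsonr

def evaluate(measure, dataset):
    """Pearson correlation between a similarity measure and human ratings.

    `dataset` is a list of (word1, word2, human_rating) triples, e.g. the
    28 Miller-Charles pairs; `measure(w1, w2)` returns a similarity score.
    """
    human = [rating for _, _, rating in dataset]
    predicted = [measure(w1, w2) for w1, w2, _ in dataset]
    r, _p = pearsonr(human, predicted)
    return r
```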

V. CONCLUSION AND FUTURE WORK

Semantic similarity between words is fundamental to various fields such as Cognitive Science, Natural Language Processing, and Information Retrieval. Therefore, relying on a robust semantic similarity measure is crucial.

In this paper, we introduced a new similarity measure between words that uses, on the one hand, an online English dictionary provided by the Semantic Atlas project of the French National Centre for Scientific Research (CNRS) and, on the other hand, page counts returned by the social website Digg.com, whose content is generated by users.

Experimental results on the Miller-Charles benchmark dataset show that the proposed measure is promising.

Our proposed measure lays the foundation for several lines of future work. We will incorporate it into other similarity-based applications to determine its ability to improve tasks such as the classification and clustering of text. In fact, we are working on integrating our metric into an opinion classification application [8].

We also intend to go further in this domain by developing new semantic similarity measures for sentences and documents.

REFERENCES

[1] I. Akermi and R. Faiz: Mesure de similarité sémantique entre les mots en utilisant le contenu du Web. Revue des Nouvelles Technologies de l'Information (RNTI), Hermann editions, D. Zighed and G. Venturini (eds.), EGC 2012, pp. 545-546, 2012.

[2] A. Barlow: Blogging America: The New Public Sphere. Praeger, 2008.

[3] R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval. Addison Wesley, 1999.

[4] D. Bollegala, Y. Matsuo, and M. Ishizuka: Measuring semantic similarity between words using web search engines. In Proceedings of the International Conference on the World Wide Web (WWW 2007), pp. 757-766, 2007.

[5] C. Buckley, G. Salton, J. Allan, and A. Singhal: Automatic query expansion using SMART: TREC 3. In Proceedings of the 3rd Text Retrieval Conference, pp. 69-80, 1994.

[6] A. Budanitsky and G. Hirst: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), pages 13-47, 2006.

[7] H. Chen, M. Lin and Y. Wei: Novel association measures using web search with double checking. In Proceedings of the COLING/ACL, pp. 1009-1016, 2006.

[8] A. Elkhlifi, R. Bouchlaghem and R. Faiz: Opinion Extraction and Classification Based on Semantic Similarities. In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2011), AAAI Press, 2011.

[9] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin: Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116-131, 2002.

[10] G. Grefenstette: Automatic thesaurus generation from raw text using knowledge-poor techniques. In Making Sense of Words, 9th Annual Conference of the UW Centre for the New OED and Text Research, 1993.

[11] G. Hirst and D. St-Onge: Lexical chains as representations of context for the detection and correction of malapropisms. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 305-332, 1998.

[12] J.J. Jiang and D.W. Conrath: Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X), 1998.

[13] C. Lachaud: La prégnance perceptive des mots parlés : une réponse au problème de la segmentation lexicale. Thèse de doctorat, Université de Genève, 2005.

[14] C. Leacock and M. Chodorow: Combining local context and WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 265-283, 1998.

[15] M.E. Lesk: Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the SIGDOC Conference, Toronto, 1986.

[16] H. Li and N. Abe: Word clustering and disambiguation based on co-occurrence data. In Proceedings of COLING-ACL, pp. 749-755, 1998.

[17] Y. Li, Z. Bandar and D. McLean: An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering, 15(4), pages 871-882, 2003.

[18] D. Lin: Automatic retrieval and clustering of similar words. In Proceedings of the 17th COLING, pages 768-774, 1998a.

[19] D. Lin: An information-theoretic definition of similarity. In Proceedings of the 15th ICML, pages 296-304, 1998b.

[20] Y. Matsuo, T. Sakaki, K. Uchiyama and M. Ishizuka: Graph-based word clustering using a web search engine. In Proceedings of EMNLP, 2006.

[21] G.A. Miller and W.G. Charles: Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), pages 1-28, 1991.

[22] F. Pisani and D. Piotet: Comment le web change le monde : L'alchimie des multitudes, 2008.

[23] S. Ploux, A. Boussidan and H. Ji: The Semantic Atlas: an interactive Model of Lexical Representation. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, 2010.

[24] P. Resnik: Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), 1995.

[25] H. Rubenstein and J.B. Goodenough: Contextual correlates of synonymy. Communications of the ACM, 8, pages 627-633, 1965.

[26] M. Sahami and T. Heilman: A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th International World Wide Web Conference (WWW 2006), 2006.

[27] S. Patwardhan, S. Banerjee and T. Pedersen: Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, pages 241-257, 2003.

[28] P.D. Turney: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of ECML, pages 491-502, 2001.

[29] F. Venant: Utiliser des classes de sélection distributionnelle pour désambiguïser les adjectifs. In Proceedings of TALN, Toulouse, France, 2007.

[30] J. Xu and B. Croft: Improving the effectiveness of information retrieval. ACM Transactions on Information Systems, 18(1), pages 79-112, 2000.
