Textometry and Information Discovery: A New Approach to Mining Textual Data on the Web

Erin MacMurray*, Marguerite Leenhardt**

SYLED/CLA2T EA2290, UFR ILPGA, Université Sorbonne Nouvelle Paris 3

* [email protected]

** [email protected]

ICAI’11 Workshop on Intelligent Linguistic Technologies

In a nutshell

• Introduction & background

• Textometry and Web Mining: why?

• Textometry and Web Mining: how?

• Textometry and Web Mining: application?

• Conclusion

22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 2

Introduction & background

Structure ?

Seth Grimes sees « three categories of data : (i) Quantities, whether measured, observed, or computed (ii) Content, which I’ll characterize as non-quantitative information (iii) Metadata describing quantities and content. Structured/unstructured is a false dichotomy. »

(July 2011 – IKS Semantic Workshop, France)

Man versus machine ?

Neil Glassman: « between those on one side who feel the accuracy of automated [content analysis] is sufficient and those on the other side who feel we can only rely on human analysis […] most in the field concur with the idea that we need to define a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis. » (May 2011 – Sentiment Analysis Symposium review)


Textometry and Web Mining: why ?

• Improving Linguistic Models

– Semantic complexity of simple units such as NE

– Identifying paraphrases of NE


NE: a heterogeneous category (Ehrmann M., 2008, Les EN de la linguistique au TAL : statut théorique et méthodes de désambiguïsation)

20GB

INTEL

Gone with the wind

Harry Potter

The 4th of July

JIF Peanut Butter

Dulles International Airport

Paris

Le Tour de France

www.nytimes.com

lyzozym

Paraphrases of a single NE

Nicolas Sarkozy

M. Sarkozy

Mr Sarkozy

Le président de la République

Sarko

Sarkoland

Sarkozyste

Sarkozysme


Textometry and Web Mining: why ?

• Text is considered to have its own internal structure

• Application of statistical and probabilistic calculations directly to the textual units of comparable texts in a corpus


Textometry and Web Mining: how?


July 4th 2011

July 5th 2011

Hypergeometric Distribution

Form    Specificness
b       23.43
b       12.68
b       5.57
b       5.66

Form    Specificness
d       13.73
d       21.86
d       7.75
d       6.55
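The specificness scores above rest on the hypergeometric distribution. A minimal sketch of how such a score could be computed is given here, assuming the score is the order of magnitude (base-10 logarithm) of the tail probability, signed positive for over-represented forms and negative for under-represented ones. The function names and the parameters `T`, `t`, `f`, `k` are illustrative and do not come from the slides; exact integer arithmetic is used, so the sketch suits small samples rather than full corpora.

```python
from math import comb, log10


def hypergeom_pmf(k, T, t, f):
    """P(X = k): probability that a form with f occurrences in a corpus of
    T tokens appears exactly k times in a part of t tokens."""
    return comb(f, k) * comb(T - f, t - k) / comb(T, t)


def specificness(k, T, t, f):
    """Signed order of magnitude of the hypergeometric tail probability.

    Positive: the form is over-represented in the part.
    Negative: the form is under-represented in the part.
    (Sign convention and log scale are assumptions of this sketch.)
    """
    if k * T >= f * t:
        # Observed count is at or above its expectation f * t / T:
        # probability of seeing k occurrences or more.
        p = sum(hypergeom_pmf(i, T, t, f) for i in range(k, min(f, t) + 1))
        return -log10(p)
    # Observed count is below its expectation:
    # probability of seeing k occurrences or fewer.
    p = sum(hypergeom_pmf(i, T, t, f) for i in range(0, k + 1))
    return log10(p)


# Toy figures (made up): a form occurring 20 times in a 1,000-token corpus,
# 10 of them inside a 100-token part, where only 2 would be expected.
score = specificness(k=10, T=1000, t=100, f=20)
```

A form whose count in a part sits close to its expected value gets a score near zero, so only forms with a large absolute score would be kept as characteristic of that part.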


Textometry and Web Mining: how?


Co-occurrence: two or more words that appear together within a predetermined span of text, forming lexical relationships around a pivot form (William Martinez, 2003)

Result: network of associative relationships

- - - A - - - C - - - B - - - D .

- - - B - - C - - A - - - E .

- - - C - - - A - - - D - - - H .

- - - E - - - B - - - D - - - A .

- - - F - - - C - - - B - - - D .

- - - B - - - C - - - H - - - E .

- - - E - - - B - - - D - - - F .

Figure: resulting network of associative relationships between the forms A, B, C and E.
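The toy sequences above can be used to sketch the co-occurrence count around a pivot form. The sketch below assumes that each line is one span (for instance a sentence) and keeps the forms that share a span with the pivot at least `min_count` times. This raw frequency threshold is a simplification chosen for illustration, not the probabilistic criterion of the cited method, and the names `cooccurrents` and `frequent_cooccurrents` are hypothetical.

```python
from collections import Counter

# The seven toy spans from the slide, one string per span.
SPANS = [
    "A C B D",
    "B C A E",
    "C A D H",
    "E B D A",
    "F C B D",
    "B C H E",
    "E B D F",
]


def cooccurrents(spans, pivot):
    """Count, for each form, the number of spans it shares with the pivot."""
    counts = Counter()
    for span in spans:
        forms = set(span.split())
        if pivot in forms:
            counts.update(forms - {pivot})
    return counts


def frequent_cooccurrents(spans, pivot, min_count=2):
    """Forms sharing at least min_count spans with the pivot (assumed threshold)."""
    return {form for form, n in cooccurrents(spans, pivot).items() if n >= min_count}


neighbours_of_a = frequent_cooccurrents(SPANS, "A")
```

Repeating the same count with each retained form as the new pivot is one way to grow the network of associative relationships step by step.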

Textometry and Web Mining : how?


1/ POINT OF ENTRY

NE (companies and people)

Company NE = Xerox

People NE = Nicolas Sarkozy

2/ CORPUS

Article selection

3/ TEXTOMETRIC ANALYSIS

Specificness

Co-occurrences

4/ INTERPRETATION OF RESULTS

Quantitative information to formulate qualitative interpretations.

Hypergeometric Distribution

160 articles: 184,761 occurrences / 13,075 forms / 5,194 hapax

103 articles: 197,341 occurrences / 17,807 forms / 9,416 hapax
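Corpus figures of this kind (occurrences, forms, hapax) can be obtained with a simple count. The sketch below is only an illustration: the tokenization rule (lowercased runs of word characters) and the function name `corpus_counts` are assumptions, and a different segmentation would give different totals.

```python
import re
from collections import Counter


def corpus_counts(texts):
    """Return the number of occurrences (tokens), forms (distinct tokens)
    and hapax (forms occurring exactly once) over a list of texts."""
    tokens = []
    for text in texts:
        # Assumed segmentation: lowercase, then keep runs of word characters.
        tokens.extend(re.findall(r"\w+", text.lower()))
    frequencies = Counter(tokens)
    return {
        "occurrences": len(tokens),
        "forms": len(frequencies),
        "hapax": sum(1 for n in frequencies.values() if n == 1),
    }


# Two made-up sentences standing in for articles.
counts = corpus_counts(["The cat saw the dog", "A dog barked"])
```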

Textometry and Web Mining: results?


Observing forms and repeated segments of « Nicolas Sarkozy » makes it possible to identify polarities of opinion in paraphrases, providing clues about how the NE is perceived.

Figure - Monthly variation of specificness for paraphrases of the NE « Nicolas Sarkozy » (paraphrase groups labelled in the figure: negative, contextually dependent).




As a current event is discussed in the media, the lexical network produced by the co-occurrence calculation is larger during the event than during periods of calm or low activity for the NE (the « buzz effect »).
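One way to observe this effect is to measure the size of the co-occurrence network period by period. The sketch below assumes sentence-wide spans, a raw frequency threshold `min_count`, and monthly labels. The sample sentences are made up for illustration and the function name `network_size_by_period` is hypothetical.

```python
from collections import Counter, defaultdict


def network_size_by_period(dated_spans, pivot, min_count=2):
    """For each period, count the forms that share at least min_count spans
    with the pivot. Periods in which the pivot never occurs are omitted."""
    per_period = defaultdict(Counter)
    for period, span in dated_spans:
        forms = set(span.lower().split())
        if pivot in forms:
            per_period[period].update(forms - {pivot})
    return {
        period: sum(1 for n in counts.values() if n >= min_count)
        for period, counts in per_period.items()
    }


# Made-up (period, sentence) pairs around the company NE "xerox".
SAMPLE = [
    ("2011-06", "xerox reports earnings"),
    ("2011-07", "xerox announces merger deal"),
    ("2011-07", "merger deal surprises xerox investors"),
    ("2011-07", "investors question xerox merger"),
]

sizes = network_size_by_period(SAMPLE, "xerox")
```

A period whose network is clearly larger than that of its neighbours would then be a candidate zone of activity for the analyst to inspect.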




Conclusion

• Two intelligence use-cases on Le Monde and The New York Times

• Two complementary approaches : specificness and co-occurrence analysis

• Three main contributions :

– Building corpus-driven linguistic resources (saving time and cost)

– Identifying trends with specificness calculation

– Targeting zones of activity or events through co-occurrence networks

• In sum, this method :

– Helps derive knowledge from corpora without predefined information models

– Provides adequate functions enabling interaction between the expertise of the user and processing tools


References

Bloom K., Stein S. & Argamon S., Appraisal extraction for news opinion analysis at NTCIR-6, Proceedings of NTCIR-6, 2007, p. 279-289.

Bollier, D. The Promise and Peril of Big Data. Washington, DC: The Aspen Institute, 2010.

Delanoë, A. Statistique textuelle et séries chronologiques sur un corpus de presse écrite. Le cas de la mise en application du principe de précaution. Proceedings, JADT’2010, 2010.

Delaplace R., Leenhardt M. & Wu L-C., Méthode de conception d’une application de veille et d’Analyse Linguistique Assistée par Ordinateur, VSST Conference, Toulouse, France, 2010.

Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

Feldman R. & Sanger J., The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, 422 p.

Firth, J.R. A Synopsis of Linguistic Theory 1930-1955, Linguistic Analysis Philological Society, Oxford, 1957.

Grishman, R. & Sundheim, B. Message Understanding Conference-6: A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING), I, Copenhagen, 1996, p. 466–471.

Kodratoff, Y. Knowledge discovery in texts: A definition and applications, Proceedings of the International Symposium on Methodologies for Intelligent Systems, 1999, volume LNAI 1609, p. 16–29.

Lebart, L. & Salem, A. Statistique textuelle. Paris, Dunod, 1994.

Lent, B., Agrawal, R., & Srikant, R. Discovering trends in text databases, Proceedings KDD’1997, AAAI Press, 14–17 p. 227–230.

MacMurray E. & Shen L., Textual Statistics and Information Discovery: Using Co-occurrences to Detect Events, VSST Conference, Toulouse, France, 2010.

Martin J.R. & White P.R.R., The language of evaluation: appraisal in English, Palgrave, London, 2005.

Martinez, W. Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels, Thèse pour le doctorat en Sciences du Langage, Université de la Sorbonne nouvelle - Paris 3, 2003.

Née, E. Insécurité et élections présidentielles dans le journal Le Monde, Lexicometrica numéro thématique « Explorations Textuelles », S. Fleury, A. Salem, 2008.

Poibeau T. Extraction automatique d’information. Du texte brut au web sémantique. Paris : Hermes Sciences, 2003.

Poibeau, T. Sur le statut référentiel des entités nommées, Proceedings TALN’05. Dourdan, France, 2005.

Salem A., Introduction à la résonance textuelle, In Actes des JADT 2004 (7èmes Journées internationales d’Analyse Statistique des Données Textuelles), 2004, p. 986-992.

Sandhaus, E. The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium, 2008.

Tufféry, S. Data mining et statistique décisionnelle : l'intelligence des données. Paris : Éditions Technip, 2007.

Wright, K. Using Open Source Common Sense Reasoning Tools in Text Mining Research, The International Journal of Applied Management and Technology, 2006, vol. 4, n°2, p. 349-387.

