Statistical Language Processing
TRANSCRIPT
-
8/2/2019 Statistical Language Processing
1/32
Statistical language processing
Concepts and Algorithms
A. Georgakis, PhD
ToC
Basic definitions
Text mining
Performance evaluation
References
Definitions
SLP is NLP on steroids
A move away from rule-based methods
Covers a wide area: automatic summarization, machine translation, named entity recognition, part-of-speech tagging, sentence boundary disambiguation, sentiment analysis, word sense disambiguation, etc.
Automatic summarization
"...transformation of source text to summary text through content reduction by selection, generalization and transformation" (S. Jones, 1999)
...but there are many more definitions; the term remains ambiguous
For additional info, see:
http://www.slideshare.net/dinel/orasan-ranlp2009
Machine translation
Substitution of source text into a target language
Usage of parallel corpora; the Internet is a vast source of such data
Pivot languages
Named entity recognition
Identify proper names and their types: Peter → person; Paris → city or person
Capitalization is not always a good cue: some languages do not use capitals the way English does (German capitalizes all nouns), and words at the beginning of sentences are capitalized regardless
Part-of-speech tagging
Determine the part of speech for words:
"Well, she and young John walk to school slowly"
English has 9 parts of speech: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection
...but as a linguist you will need to use somewhere between 50 and 150 tags
Sentence boundary disambiguation
Where does a sentence start and stop?
Punctuation marks are problematic
Rule-based method: precompiled list of abbreviations
90% of periods are sentence boundaries (Riley, 1999)
~47% of periods in the Wall Street Journal belong to abbreviations (Stammatos, 2009)
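A toy version of the rule-based method with a precompiled abbreviation list can be sketched in Python (the abbreviation set here is an assumption for illustration, not a real precompiled list):

```python
# Rule-based sentence boundary disambiguation in miniature: a period ends a
# sentence unless the preceding token is in a precompiled abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e", "vs"}

def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?":
            words = text[start:i].split()
            token = words[-1].rstrip(".").lower() if words else ""
            if ch == "." and token in ABBREVIATIONS:
                continue  # the period belongs to an abbreviation
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
```

Real systems go further, e.g. handling decimal numbers, ellipses, and quoted speech, which is why statistical approaches dominate.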
Sentiment analysis
Identify the polarity and emotional state for a given text:
positive or negative; angry, sad, unhappy
Rather tough problem to solve due to language ambiguity
Word sense disambiguation
Identify the sense of different words
ML on top of human knowledge
Thesauri, ontologies, corpora, ...
For more info, see:
http://www.dsi.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf
Basic tools I
Corpora: balanced and representative collections of documents
Stopping: removal of common words
"I will be at the park tomorrow evening" → "park tomorrow evening"
Stemming: removal of word inflection
walking → walk
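The stopping and stemming steps above can be sketched as follows (the stopword list and suffix rules are toy stand-ins; a real system would use a curated list and a proper stemmer such as Porter's):

```python
# Sketch of stopping (stopword removal) and naive suffix-stripping stemming.
STOPWORDS = {"i", "will", "be", "at", "the", "a", "an", "to"}
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]  # crude: over-stems e.g. "evening" -> "even"
    return word

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("I will be at the park tomorrow evening"))
```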
Basic tools II
N-grams: sequences of unigrams
Dimensionality reduction: PCA, SVD, NMF, ...
Language modelling: LSA, pLSA, LDA, ...
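Extracting n-grams as sequences of unigrams is a one-liner:

```python
# N-grams as sequences of unigrams: sliding a window of size n over the tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))  # bigrams
```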
Language analysis
Source text
Pre-processing
Tokenization
Disambiguation
Dim. reduction
Clustering
Syntactic
Semantic
Results
Text mining I
Keyword indexing
Big, REALLY big table: the term-to-document matrix
Bag-of-words
Used in IR, search engines, etc.
Unigram → n-gram transition
Text mining II
1968, Salton: Vector Space Model (VSM)
Scaling or normalization:
Term freq. × Inverse Document freq. (TF-IDF)
Log-entropy scaling
Document similarity: cosine or Euclidean distance
VSM shortcomings:
Inter- and intra-document context is lost
N-grams offer a partial solution
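A small sketch of TF-IDF weighting and cosine similarity in the vector space model (toy pre-tokenized documents; a real system would use a sparse term-to-document matrix):

```python
import math
from collections import Counter

# TF-IDF weighting of documents, then cosine similarity between the
# resulting document vectors (stored here as term -> weight dicts).
def tfidf(docs):
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document freq.
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["park", "tomorrow"], ["park", "evening"], ["walk", "school"]]
vecs = tfidf(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Documents sharing a term score above zero; documents with disjoint vocabularies score exactly zero, which is precisely the context-blindness shortcoming noted above.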
Text mining III
1990, Deerwester: Latent Semantic Analysis (LSA)
SVD on the term-by-document matrix
K-dim subspace (concepts)
Linear combinations of terms, analogous to frequencies in Fourier analysis
LSA shortcomings:
Computationally expensive; updating is equally expensive
Concepts are not intuitive
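LSA can be sketched as a truncated SVD of a toy term-by-document matrix (the matrix values and k = 2 are illustrative assumptions):

```python
import numpy as np

# LSA sketch: truncated SVD of a small term-by-document matrix X projects
# documents into a k-dimensional "concept" space.
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])   # rows = terms, columns = documents

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation
doc_coords = np.diag(s[:k]) @ Vt[:k, :]       # document coordinates per concept
print(np.round(X_k, 2))
```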
Text mining IV
1999, Hofmann: Probabilistic LSA (pLSA) or aspect model
Probabilistic topic models: statistical foundation, latent variable
cf. hidden states in HMMs
pLSA shortcomings: overfitting
pLSA. Source: Berry, 2010
Text mining V
Source: Blei, 2011
Text mining VI
Source: Blei, 2011
Text mining VII
Probabilistic topic models
Uncover the relationship between observed and hidden variables: pLSA, LDA
Ando's presentation
LDA extensions:
Relax statistical assumptions
Use meta-data
LDA. Source: Berry, 2010
For an introduction, see:
http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
Text mining VIII
Assumptions:
Word order irrelevant (bag-of-words): unrealistic but used extensively
Words are generated conditioned on previous words: Markov property
Order of documents in the corpus irrelevant
Word distribution static over time
Number of topics: known and fixed
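The Markov property assumption can be illustrated with a bigram model estimated from counts:

```python
from collections import Counter, defaultdict

# The Markov property in miniature: estimate P(w_i | w_{i-1}) from bigram
# counts, so each word depends only on the immediately preceding word.
def bigram_model(tokens):
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return {p: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for p, ctr in counts.items()}

model = bigram_model("the cat sat on the mat".split())
print(model["the"])
```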
Text mining IX
Meta-data: author, title, location, etc.
Author-topic model; Rosen-Zvi et al. 2004
Hyperlink analysis
Matrix factorization techniques I
SVD: X = U S V^T, where U and V hold the singular vectors and S the singular values
PCA: Y = X W, where W holds the eigenvectors of the covariance matrix (eigenvalues give the variance per component)
ICA: independence for the components (neither orthogonal nor in rank order)
NMF: X ≈ W H, with W and H constrained to be non-negative
Matrix factorization techniques II
SVD, PCA and ICA:
Eigenvalue based
Fast; converge under certain conditions
Sub-space is not intuitive
NMF:
Numerically unstable; converges to a local minimum
Iterative process
Sub-space is more natural
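The iterative nature of NMF can be seen in the Lee-Seung multiplicative updates (a sketch: the data, rank k, and iteration count are illustrative, and initialization and stopping criteria are simplified):

```python
import numpy as np

# Lee-Seung multiplicative updates for NMF (X ~ W H): an iterative procedure
# that is sensitive to initialization and converges only to a local minimum.
rng = np.random.default_rng(0)
X = rng.random((6, 4))          # non-negative data matrix
k = 2
W = rng.random((6, k)) + 0.1    # random non-negative initialization
H = rng.random((k, 4)) + 0.1

for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)   # update H, stays non-negative
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)   # update W likewise

print(np.linalg.norm(X - W @ H))  # Frobenius reconstruction error
```

Because the updates only ever multiply by non-negative ratios, non-negativity of W and H is preserved automatically, which is what makes the recovered sub-space "more natural" (parts-based) than signed eigenvector combinations.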
Source: Lee, 1999
26/32
Matrix factorization techniques III
Problems with NMF:
Initialization
Convergence speed (iterative)
Local minimum
Text streams
Detecting changes in sentiment: surprise, emerging trends
Text-to-number conversion
Time signatures
Temporal histograms (Teele's work)
Source: Berry, 2009
Performance evaluation I
Contingency matrix:

                  System output
                  Positive   Negative
True    Positive    TP         FN
output  Negative    FP         TN

Accuracy: A = (TP + TN) / m
Recall: R = TP / (TP + FN)
Precision: P = TP / (TP + FP)
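The three measures computed from the contingency counts, with illustrative numbers:

```python
# Accuracy, recall, and precision from the contingency counts, with
# m = TP + FN + FP + TN the total number of classified items.
def metrics(tp, fn, fp, tn):
    m = tp + fn + fp + tn
    accuracy = (tp + tn) / m
    recall = tp / (tp + fn)       # fraction of true positives found
    precision = tp / (tp + fp)    # fraction of positive outputs that are right
    return accuracy, recall, precision

print(metrics(40, 10, 5, 45))
```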
Performance evaluation II
Precision-Recall curve
Performance evaluation III
F-measure: the weighted harmonic mean of precision P and recall R

F = 1 / (α · (1/P) + (1 − α) · (1/R))

with α weighting precision against recall (α = 0.5 gives the balanced F1)
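The F-measure as a weighted harmonic mean of precision and recall can be computed directly:

```python
# F-measure as the weighted harmonic mean of precision P and recall R;
# alpha = 0.5 recovers the familiar balanced F1 score.
def f_measure(precision, recall, alpha=0.5):
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

print(f_measure(0.8, 0.8))
```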
References
A. Clark, C. Fox and S. Lappin, eds., The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010.
M. W. Berry and J. Kogan, Text Mining: Applications and Theory, Wiley, 2010.
J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
N. Indurkhya and F. J. Damerau, eds., Handbook of Natural Language Processing, CRC, 2010.
C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 2000.
R. Nisbet, J. Elder and G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.
M. T. Özsu, ed., Methods for Mining and Summarizing Text Conversations, Morgan & Claypool, 2011.
M. Song and Y.-F. B. Wu, Handbook of Research on Text and Web Mining Technologies, IGI, 2009.
References
D. M. Blei, A. Y. Ng, M. I. Jordan and J. Lafferty, Latent Dirichlet Allocation, J. Machine Learning Research, vol. 3, 2003.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, J. American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, Proc. of 20th Conf. on Uncertainty in Artificial Intelligence (UAI '04), 2004.
C. Orăsan, Automatic Summarisation in the Information Age, Int. Conf. on Recent Advances in Natural Language Processing (RANLP'09), 2009.
R. Navigli, Word Sense Disambiguation: A Survey, ACM Comput. Surv., vol. 41, no. 2, 2009.
D. M. Blei, Introduction to Probabilistic Topic Models, ACM Press, pp. 1-16, 2010.