Statistical Language Processing
TRANSCRIPT
-
8/2/2019 Statistical Language Processing
1/32
Statistical language processing
Concepts and Algorithms
A. Georgakis, PhD
ToC
Basic definitions
Text mining
Performance evaluation
References
Definitions
SLP is NLP on steroids
A move away from rule-based methods
Covers a wide area: automatic summarization, machine translation, named entity recognition, part-of-speech tagging, sentence boundary disambiguation, sentiment analysis, word sense disambiguation, etc.
Automatic summarization
"...transformation of source text to summary text through content reduction by selection, generalization and transformation" (S. Jones, 1999)
...but there are many more definitions; the term remains ambiguous
For additional info, see:
http://www.slideshare.net/dinel/orasan-ranlp2009
Machine translation
Substitution of source text into a target language
Usage of parallel corpora; the Internet is a vast source of such data
Pivot languages
Named entity recognition
Identify proper names and their types: Peter → person; Paris → city or person
Capitalization is not always a good cue: some languages do not use capitals the way English does (German capitalizes all nouns), and words at the beginning of sentences are capitalized regardless
Part-of-speech tagging
Determine the part of speech for words:
"Well, she and young John walk to school slowly"
English has 9 parts of speech: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection
...but as a linguist you will need to use somewhere between 50 and 150 tags
Sentence boundary disambiguation
Where does a sentence start and stop?
Punctuation marks are problematic
Rule-based method: precompiled list of abbreviations
90% of periods are sentence boundaries (Riley, 1999)
~47% of periods in the Wall Street Journal belong to abbreviations (Stammatos, 2009)
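A toy version of the rule-based method with a precompiled abbreviation list can be sketched in Python (the abbreviation set here is an assumption for illustration, not a real precompiled list):

```python
# Rule-based sentence boundary disambiguation in miniature: a period ends a
# sentence unless the preceding token is in a precompiled abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e", "vs"}

def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?":
            words = text[start:i].split()
            token = words[-1].rstrip(".").lower() if words else ""
            if ch == "." and token in ABBREVIATIONS:
                continue  # the period belongs to an abbreviation
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
```

Real systems go further, e.g. handling decimal numbers, ellipses, and quoted speech, which is why statistical approaches dominate.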
Sentiment analysis
Identify the polarity and emotional state for a given text:
positive or negative; angry, sad, unhappy
Rather tough problem to solve due to language ambiguity
Word sense disambiguation
Identify the sense of different words
ML on top of human knowledge
Thesauri, ontologies, corpora, ...
For more info, see:
http://www.dsi.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf
Basic tools I
Corpora: balanced and representative collections of documents
Stopping: removal of common words
"I will be at the park tomorrow evening" → "park tomorrow evening"
Stemming: removal of word inflection
walking → walk
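The stopping and stemming steps above can be sketched as follows (the stopword list and suffix rules are toy stand-ins; a real system would use a curated list and a proper stemmer such as Porter's):

```python
# Sketch of stopping (stopword removal) and naive suffix-stripping stemming.
STOPWORDS = {"i", "will", "be", "at", "the", "a", "an", "to"}
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]  # crude: over-stems e.g. "evening" -> "even"
    return word

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("I will be at the park tomorrow evening"))
```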
Basic tools II
N-grams: sequences of unigrams
Dimensionality reduction: PCA, SVD, NMF, ...
Language modelling: LSA, pLSA, LDA, ...
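Extracting n-grams as sequences of unigrams is a one-liner:

```python
# N-grams as sequences of unigrams: sliding a window of size n over the tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))  # bigrams
```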
Language analysis
Source text
Pre-processing
Tokenization
Disambiguation
Dim. reduction
Clustering
Syntactic
Semantic
Results
Text mining I
Keyword indexing
Big, REALLY big table: the term-to-document matrix
Bag-of-words
Used in IR, search engines, etc.
Unigram → n-gram transition
Text mining II
1968, Salton: Vector Space Model (VSM)
Scaling or normalization:
Term freq. × Inverse Document freq. (TF-IDF)
Log-entropy scaling
Document similarity: cosine or Euclidean distance
VSM shortcomings:
Inter- and intra-document context is lost
N-grams offer a partial solution
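A small sketch of TF-IDF weighting and cosine similarity in the vector space model (toy pre-tokenized documents; a real system would use a sparse term-to-document matrix):

```python
import math
from collections import Counter

# TF-IDF weighting of documents, then cosine similarity between the
# resulting document vectors (stored here as term -> weight dicts).
def tfidf(docs):
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document freq.
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["park", "tomorrow"], ["park", "evening"], ["walk", "school"]]
vecs = tfidf(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Documents sharing a term score above zero; documents with disjoint vocabularies score exactly zero, which is precisely the context-blindness shortcoming noted above.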
Text mining III
1990, Deerwester: Latent Semantic Analysis (LSA)
SVD on the term-by-document matrix
K-dim subspace (concepts)
Linear combinations of terms, analogous to frequencies in Fourier analysis
LSA shortcomings:
Computationally expensive; updating is equally expensive
Concepts are not intuitive
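LSA can be sketched as a truncated SVD of a toy term-by-document matrix (the matrix values and k = 2 are illustrative assumptions):

```python
import numpy as np

# LSA sketch: truncated SVD of a small term-by-document matrix X projects
# documents into a k-dimensional "concept" space.
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])   # rows = terms, columns = documents

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation
doc_coords = np.diag(s[:k]) @ Vt[:k, :]       # document coordinates per concept
print(np.round(X_k, 2))
```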
Text mining IV
1999, Hofmann: Probabilistic LSA (pLSA) or aspect model
Probabilistic topic models: statistical foundation, latent variable
cf. hidden states in HMMs
pLSA shortcomings: overfitting
pLSA. Source: Berry, 2010
Text mining V
Source: Blei, 2011
Text mining VI
Source: Blei, 2011
Text mining VII
Probabilistic topic models
Uncover the relationship between observed and hidden variables: pLSA, LDA
Ando's presentation
LDA extensions:
Relax statistical assumptions
Use meta-data
LDA. Source: Berry, 2010
For an introduction, see:
http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
Text mining VIII
Assumptions:
Word order irrelevant (bag-of-words): unrealistic but used extensively
Words are generated conditioned on previous words: Markov property
Order of documents in the corpus irrelevant
Word distribution static over time
Number of topics: known and fixed
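The Markov property assumption can be illustrated with a bigram model estimated from counts:

```python
from collections import Counter, defaultdict

# The Markov property in miniature: estimate P(w_i | w_{i-1}) from bigram
# counts, so each word depends only on the immediately preceding word.
def bigram_model(tokens):
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return {p: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for p, ctr in counts.items()}

model = bigram_model("the cat sat on the mat".split())
print(model["the"])
```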
Text mining IX
Meta-data: author, title, location, etc.
Author-topic model; Rosen-Zvi et al. 2004
Hyperlink analysis
Matrix factorization techniques I
SVD: X = U S V^T, where U and V hold the singular vectors and S the singular values
PCA: Y = X W, where W holds the eigenvectors of the covariance matrix (eigenvalues give the variance per component)
ICA: independence for the components (neither orthogonal nor in rank order)
NMF: X ≈ W H, with W and H constrained to be non-negative
Matrix factorization techniques II
SVD, PCA and ICA:
Eigenvalue based
Fast; converge under certain conditions
Sub-space is not intuitive
NMF:
Numerically unstable; converges to a local minimum
Iterative process
Sub-space is more natural
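The iterative nature of NMF can be seen in the Lee-Seung multiplicative updates (a sketch: the data, rank k, and iteration count are illustrative, and initialization and stopping criteria are simplified):

```python
import numpy as np

# Lee-Seung multiplicative updates for NMF (X ~ W H): an iterative procedure
# that is sensitive to initialization and converges only to a local minimum.
rng = np.random.default_rng(0)
X = rng.random((6, 4))          # non-negative data matrix
k = 2
W = rng.random((6, k)) + 0.1    # random non-negative initialization
H = rng.random((k, 4)) + 0.1

for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)   # update H, stays non-negative
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)   # update W likewise

print(np.linalg.norm(X - W @ H))  # Frobenius reconstruction error
```

Because the updates only ever multiply by non-negative ratios, non-negativity of W and H is preserved automatically, which is what makes the recovered sub-space "more natural" (parts-based) than signed eigenvector combinations.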
Source: Lee, 1999
26/32
Matrix factorization techniques III
Problems with NMF:
Initialization
Convergence speed (iterative)
Local minimum
Text streams
Detecting changes in sentiment: surprise, emerging trends
Text-to-number conversion
Time signatures
Temporal histograms (Teele's work)
Source: Berry, 2009
Performance evaluation I
Contingency matrix:

                  System output
                  Positive   Negative
True    Positive    TP         FN
output  Negative    FP         TN

Accuracy: A = (TP + TN) / m
Recall: R = TP / (TP + FN)
Precision: P = TP / (TP + FP)
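The three measures computed from the contingency counts, with illustrative numbers:

```python
# Accuracy, recall, and precision from the contingency counts, with
# m = TP + FN + FP + TN the total number of classified items.
def metrics(tp, fn, fp, tn):
    m = tp + fn + fp + tn
    accuracy = (tp + tn) / m
    recall = tp / (tp + fn)       # fraction of true positives found
    precision = tp / (tp + fp)    # fraction of positive outputs that are right
    return accuracy, recall, precision

print(metrics(40, 10, 5, 45))
```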
Performance evaluation II
Precision-Recall curve
Performance evaluation III
F-measure: the weighted harmonic mean of precision P and recall R

F = 1 / (α · (1/P) + (1 − α) · (1/R))

with α weighting precision against recall (α = 0.5 gives the balanced F1)
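The F-measure as a weighted harmonic mean of precision and recall can be computed directly:

```python
# F-measure as the weighted harmonic mean of precision P and recall R;
# alpha = 0.5 recovers the familiar balanced F1 score.
def f_measure(precision, recall, alpha=0.5):
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

print(f_measure(0.8, 0.8))
```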
References
A. Clark, C. Fox and S. Lappin, eds., The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010.
M. W. Berry and J. Kogan, Text Mining: Applications and Theory, Wiley, 2010.
J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
N. Indurkhya and F. J. Damerau, eds., Handbook of Natural Language Processing, CRC, 2010.
C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 2000.
R. Nisbet, J. Elder and G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.
M. T. Özsu, ed., Methods for Mining and Summarizing Text Conversations, Morgan & Claypool, 2011.
M. Song and Y.-F. B. Wu, Handbook of Research on Text and Web Mining Technologies, IGI, 2009.
References
D. M. Blei, A. Y. Ng, M. I. Jordan and J. Lafferty, Latent Dirichlet Allocation, J. Machine Learning Research, vol. 3, 2003.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, J. American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, Proc. of 20th Conf. on Uncertainty in Artificial Intelligence (UAI '04), 2004.
C. Orăsan, Automatic Summarisation in the Information Age, Int. Conf. on Recent Advances in Natural Language Processing (RANLP'09), 2009.
R. Navigli, Word Sense Disambiguation: A Survey, ACM Comput. Surv., vol. 41, no. 2, 2009.
D. M. Blei, Introduction to Probabilistic Topic Models, ACM Press, pp. 1-16, 2010.