Supervised and Unsupervised Learning for Natural Language Processing

Manaal Faruqui
Language Technologies Institute, SCS, CMU
Natural Language Processing

Linguistics + Computer Science
Natural Language Processing

But why?
• Humans cannot keep up with the vast amount of text data available
• Computers offer much, much faster information access
Natural Language Processing

How can this be done?
• Can you teach a computer?
Natural Language Processing = Mathematics

Using maths to learn language??? Are you kidding me!
Machine Learning

Teaching computers to make decisions like humans

Examples: computer vision, machine translation, clustering
Machine Learning

• Supervised: learning by examples
• Unsupervised: learning by patterns
• Semi-supervised: learning by patterns + examples
Formal & Informal address

• Most languages distinguish formal (V) and informal (T) address in direct speech (Brown and Gilman, 1960)
  • Formal address: neutrality, distance
  • Informal address: friends, subordinates
• The distinction is realized differently across languages
  • French: pronoun usage (Vous/Tu)
  • German: pronoun usage (Sie/Du)
  • Hindi: pronoun usage (Aap/Tum)
  • Japanese: verbal inflections
  • English: ???
Main goals of this work

• Goal 1: Determine whether English distinguishes between V & T consistently
  • If yes, what are the indicators?
• Goal 2: Develop a computational model that labels English sentences as T or V
  • Ideally without spending effort on annotation
Methodology

• Use a parallel corpus to analyze aligned sentences with overt (De) T/V choice and covert (En) T/V choice
• For Goal 1: Compare De & En sentences
• For Goal 2: Project De labels onto En sentences
Digression: Creation of a parallel corpus

• Current parallel corpora are not suitable
  • Europarl: overwhelmingly formal (99%)
  • Newswire: no dialogue
• Creation of a new corpus: De-En literary texts
  • 106 19th-century novels (Project Gutenberg)
  • Sentence-aligned with Gargantua (Braune & Fraser 2010)
  • POS-tagged (Schmid 1994)
• German sentences can be labeled as T, V, or None using orthographic rules
• Corpus: http://cs.cmu.edu/~mfaruqui
Goal 1: Compare De and En address

• Give English monolingual text to human annotators
  • Ask for a T/V judgment
• Their annotation provides the following information:
  • How well do annotators agree on English text? Does English monolingual text provide enough information to identify T/V? (1a)
  • How well do annotators agree with the copied labels? Is there a direct correspondence? (1b)
  • Only if this is the case is the copying of labels appropriate
Experiment 1: Human Annotation

• 200 randomly drawn English sentences
• Two annotators (“A1”, “A2”)
• Two conditions:
  – No context: just one sentence
  – In context: three sentences of pre- and post-context each
Results: Reliability

• Context improves reliability
  – Many sentences cannot be tagged with T/V in isolation

“And she is a sort of relation of your lordship’s,” said Dawson. “And perhaps sometime you may see her.”

• Reliability in context is reasonable: English does provide strong clues on T/V

| | No Context | In Context |
| --- | --- | --- |
| A1 vs. A2 | .75 (κ=.49) | .79 (κ=.58) |

Goal 1a ✓
Results: Correspondence

| | No Context | In Context |
| --- | --- | --- |
| (A1 ∩ A2) vs. Projection | .67 (κ=.34) | .79 (κ=.58) |

• Agreement with the German projected labels is again reasonable, but not perfect
• Error analysis showed a strong influence of social norms
  • Example: lovers in 19th-century novels use V (!)

[...] she covered her face with the other to conceal her tears. “Corinne!”, said Oswald, “Dear Corinne! My absence has then rendered you unhappy!”

Goal 1b ✓
Experiment 2: Prediction of T/V

• Copy German T/V labels onto English: no manual annotation needed
• Learn an L2-regularized logit classifier on the train set; optimize on the dev set; evaluate on the test set
• Feature candidates:
  – Lexical features (bag-of-words, χ² feature selection)
  – Distributional semantic word classes
    • 200 word classes clustered with the algorithm by Clark (2003)
  – Politeness theory (Brown & Levinson 1987)
    • Polite speech has specific features, which are inherited by V
Supervised Learning

Logistic regression classifier
• Linear combination of features
• Every feature is assigned a weight according to its importance
  • Higher weight = more importance
• L2 regularization to avoid overfitting
• Used the open-source toolkit Weka
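For illustration, a minimal sketch of this setup in Python using scikit-learn rather than Weka (which the slides used); the training sentences, labels, and hyperparameters are invented placeholders, not the original experimental code.

```python
# Minimal sketch of the T/V classifier described above, using scikit-learn
# instead of Weka; sentences, labels, and parameters are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# T/V labels as projected from the German side of the parallel corpus
train_sents = [
    "Thou art a good fellow, my friend.",
    "And perhaps sometime you may see her, my lord.",
    "Come here, boy!",
    "Would your lordship care to dine?",
]
train_labels = ["T", "V", "T", "V"]

model = Pipeline([
    ("bow", CountVectorizer()),             # bag-of-words lexical features
    ("chi2", SelectKBest(chi2, k="all")),   # chi-squared selection (k tuned on dev in practice)
    ("logit", LogisticRegression(penalty="l2", C=1.0)),  # L2-regularized logit
])
model.fit(train_sents, train_labels)
print(model.predict(["Dear Corinne! My absence has then rendered you unhappy!"]))
```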
Context

• As the human annotation showed: individual sentences are often insufficient for classification
• Simplest solution: compute features over a window of context sentences
  – Problem: the context typically includes non-speech sentences

“I am going to see his ghost!” Lorry quietly chafed the hands that held his arm.
Context

• Our solution: a simple “direct speech” recognizer
  – A CRF-based sequence tagger (Mallet) trained on 1,000 sentences
• Best results with 8 sentences of direct-speech context: +5% accuracy over no context

[Chart: accuracy as a function of window size, sentence context vs. speech context]

B-SP: “I am going to see his ghost!”
O: Lorry quietly chafed the hands that held his arm.
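As a rough illustration (not the original Mallet setup), such a direct-speech tagger can be sketched with the sklearn-crfsuite package; the features and the two-sentence training sequence are toy assumptions.

```python
# Toy sketch of a CRF-based direct-speech recognizer, using sklearn-crfsuite
# instead of Mallet; features and data are placeholders.
import sklearn_crfsuite

def sentence_features(sents):
    # one feature dict per sentence in the sequence
    return [{
        "starts_with_quote": s.lstrip().startswith("\u201c"),
        "ends_with_quote": s.rstrip().endswith("\u201d"),
        "has_exclamation": "!" in s,
    } for s in sents]

# Each training instance is a sequence of sentences labeled B-SP / I-SP / O
X_train = [sentence_features([
    "\u201cI am going to see his ghost!\u201d",
    "Lorry quietly chafed the hands that held his arm.",
])]
y_train = [["B-SP", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # expected: [['B-SP', 'O']]
```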
Quantitative results

| Model | Accuracy |
| --- | --- |
| Frequency BL (V) | 59.1 |
| Lexical features | 67.0 |
| Semantic class features | 57.5 |
| Politeness features | 59.6 |

• Only lexical features yield a significant improvement over the frequency baseline

Goal 2 ✓

(Faruqui & Pado, 2011; 2012)
Qualitative analysis: Lexical features
Top 10 lexical features
Conclusions

• Formal and informal language exists in English as well
  – Indicators are more dispersed across context
• Bootstrapping a T/V classifier for English is possible
• Results are still fairly modest
  – Asymmetry: V more marked than T → better features
  – Difficult to operationalize features with high recall (sociolinguistic features, first names, …)
References
• M. Faruqui & S. Pado. “I thou thee, thou traitor”: Predicting formal vs. informal address in English literature. ACL 2011.
• M. Faruqui & S. Pado. Towards a model of formal and informal address in English. EACL 2012.
• Roger Brown and Albert Gilman. 1960. The pronouns of power and solidarity. In Thomas A. Sebeok, editor, Style in Language, pages 253–277. MIT Press, Cambridge, MA.
• Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Number 4 in Studies in Interactional Sociolinguistics. Cambridge University Press.
• Fabienne Braune & Alexander Fraser. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. COLING 2010.
• Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.
• Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Unsupervised Learning
Learning by finding patterns in data
Clustering
Word clustering

Why?
• Feature reduction
  • From words to word classes
• Generalization to unseen words
  • Bangalore ~ Bengaluru
• Identification of words with similar meaning
  • Word-sense disambiguation
• Reduces the need for tagged data
Word clustering

How?
• Distributional similarity
  • How similar are the occurrence patterns of two words in a given corpus?

“You shall know a word by the company it keeps” – J. R. Firth

• Morphological similarity
  • How similar are two words orthographically?
  • Madras ~ Chennai … NO
  • Bangalore ~ Bengaluru … YES
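A toy sketch of the two similarity signals just described; the miniature corpus and the ±1-word context window are illustrative assumptions.

```python
# Distributional similarity via context-count vectors, and orthographic
# similarity via string matching; corpus and window size are toy choices.
from collections import Counter
from difflib import SequenceMatcher
import math

corpus = "he lives in bangalore . she lives in bengaluru . he works in madras .".split()

def context_vector(word, window=1):
    # count the words occurring within `window` positions of `word`
    ctx = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j != i:
                    ctx[corpus[j]] += 1
    return ctx

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Distributional similarity: similar contexts suggest similar meaning
print(cosine(context_vector("bangalore"), context_vector("bengaluru")))

# Morphological (orthographic) similarity: similar spelling
print(SequenceMatcher(None, "bangalore", "bengaluru").ratio())  # high
print(SequenceMatcher(None, "madras", "chennai").ratio())       # low
```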
Word clustering

Language modeling approach

1. Ranjitha cooks Uttapam.
2. Ranjitha cooks Rava masala dosa.
3. Ranjitha cooks Facebook.

How do you know which one is wrong?
Word clustering

Language modeling approach
• Maximize the probability of occurrence of a sequence of words

S: Ranjitha cooks Facebook

• P(S) = P(Ranjitha) * P(cooks|Ranjitha) * P(Facebook|cooks)
• P(Facebook|cooks) will be very near zero, or exactly zero!
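A minimal sketch of this bigram language model; the tiny corpus is an invented placeholder.

```python
# Bigram language model with maximum-likelihood estimates, mirroring the
# P(S) decomposition above; the corpus is a toy placeholder.
from collections import Counter

corpus = ("ranjitha cooks uttapam . ranjitha cooks dosa . "
          "mark uses facebook .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(words):
    p = unigrams[words[0]] / len(corpus)   # P(w1)
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)             # P(wi | wi-1)
    return p

print(p_sentence("ranjitha cooks dosa".split()))      # plausible: p > 0
print(p_sentence("ranjitha cooks facebook".split()))  # P(facebook|cooks) = 0
```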
Word clustering

[Diagram: words w1 w2 w3 w4 emitted from a chain of hidden classes c1 c2 c3 c4]

S: w1 w2 w3 w4

P(S) = P(C1) * P(w1|C1) * P(C2|C1) * P(w2|C2) * …

This is called a Hidden Markov Model (HMM). (Och, 1999)
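A sketch of this class-based decomposition in code: each word belongs to one class, and the sentence probability chains class transitions with word emissions. The cluster assignment and probability tables below are toy assumptions.

```python
# Class-based sentence scoring, following P(S) = P(C1)P(w1|C1)P(C2|C1)...;
# all tables are invented toy values.
word2class = {"ranjitha": "PERSON", "cooks": "VERB", "dosa": "FOOD"}
p_class_start = {"PERSON": 0.5}
p_trans = {("PERSON", "VERB"): 0.8, ("VERB", "FOOD"): 0.6}   # P(Ci | Ci-1)
p_emit = {("PERSON", "ranjitha"): 0.1, ("VERB", "cooks"): 0.2,
          ("FOOD", "dosa"): 0.05}                            # P(wi | Ci)

def p_sentence(words):
    classes = [word2class[w] for w in words]
    p = p_class_start.get(classes[0], 0.0) * p_emit.get((classes[0], words[0]), 0.0)
    for (c_prev, c), w in zip(zip(classes, classes[1:]), words[1:]):
        p *= p_trans.get((c_prev, c), 0.0) * p_emit.get((c, w), 0.0)
    return p

print(p_sentence(["ranjitha", "cooks", "dosa"]))  # 0.5*0.1*0.8*0.2*0.6*0.05
```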
Word clustering

Adding morphology (Clark, 2003)

[Diagram: the same HMM, with an additional morphological emission per word]

P(S) = P(C1) * P(w1|C1) * Pm(w1|C1) * P(C2|C1) * P(w2|C2) * Pm(w2|C2) * …
Word clustering

Implementation
• Initialization of clusters
  • Randomized
  • Heuristic-based
• Optimization algorithm (see the sketch below)
  • Greedy, as no closed-form solution exists
  • Transfer each word to the cluster with the highest improvement
• Termination
  • Until no more words are exchanged
  • Until fewer than a specified number of words are exchanged
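A toy sketch of this greedy exchange loop. The scoring function below is a stand-in for the real objective (the class-based language-model likelihood), so the whole example is illustrative only.

```python
# Greedy exchange clustering: random init, move each word to its
# best-scoring cluster, stop when no word is exchanged.
import random

def exchange_clustering(words, k, score, max_rounds=20):
    assign = {w: random.randrange(k) for w in words}  # random initialization
    for _ in range(max_rounds):
        moved = 0
        for w in words:
            # tentatively place w in every cluster; keep the best-scoring one
            best = max(range(k), key=lambda c: score(w, c, assign))
            if best != assign[w]:
                assign[w] = best
                moved += 1
        if moved == 0:  # terminate once no word is exchanged
            break
    return assign

# Stand-in objective: prefer clusters whose members share a first letter
def toy_score(word, cluster, assign):
    members = [w for w, c in assign.items() if c == cluster and w != word]
    return sum(w[0] == word[0] for w in members)

print(exchange_clustering(["bangalore", "bengaluru", "madras", "mumbai"], 2, toy_score))
```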
Word clustering

Application / Evaluation
• Named Entity Recognition (NER)
  • Identification and labeling of the names of people, places, organizations, etc.
  • A pre-processing task for many NLP applications
• Tags from the CoNLL-03 shared task on NER:
  • PERson, ORGanization, LOCation, MISCellaneous

(Sonia Gandhi)PER is an (Italian)MISC who lives in (India)LOC.
Named Entity Recognition

NER for German: Challenges
• Complex morphology: difficult lemmatization
• Sparse data: only one NE-tagged dataset (CoNLL 2003)
• Common-noun capitalization: no easy entity detection

→ Poor performance, in particular poor recall
Named Entity Recognition

NER for German: Challenges

| | Recall | Precision | F-Score |
| --- | --- | --- | --- |
| English | 88.5% | 89.0% | 88.8% |
| German | 63.7% | 83.9% | 72.4% |

Recall is a problem!
• More training data can help, but is expensive!
• Semantic generalization?
Named Entity Recognition

Word clustering
• Provides a way to achieve semantic generalization

But how can it help?

Deutschland (70) ~ Ostdeutschland (0) ~ Westdeutschland (5) → LOC
Named Entity Recognition

Experimental setup
• Cluster German words with Clark's clustering software on the basis of an untagged generalization corpus
  • HGC, deWac (Baroni et al., 2009)
• Stanford's CRF-based NER system (Finkel and Manning 2009)
  • Training on an NER-tagged corpus (CoNLL 2003 German train set, newswire)
• Evaluation on the CoNLL 2003 testb set (50k words, in-domain)
Named Entity Recognition

Results (Faruqui & Pado, 2010); model names give generalization-corpus size / number of clusters

| Model | Precision | Recall | F-Score |
| --- | --- | --- | --- |
| Florian et al. 2003 | 83.9% | 63.7% | 72.4% |
| Baseline (0/0) | 84.5% | 63.1% | 72.3% |
| HGC (175m/600) | 86.6% | 71.2% | 78.2% |
| deWac (175m/400) | 86.4% | 68.5% | 76.4% |
Multilingual word clustering

• Clustering words from two languages together
• If parallel data in the two languages is available
  • Word alignments can give additional information
• Additional constraints may give better clustering

English cluster: {I, You, We, They, She} ↔ German cluster: {Ich, Sie, Uns, Er}
Multilingual word clustering

[Figure: word clusters in Language 1 and Language 2, connected by word-alignment edges]
Multilingual word clustering

• Minimize the randomness of the clustering, i.e., minimize the entropy of the clustering
• If the clustering of L1 is represented by a random variable X (and that of L2 by Y)
• We want to minimize the entropy of one clustering given the other:
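The formula itself did not survive the transcript; with Y as the clustering of L2, the quantity meant here is presumably the standard conditional entropy:

```latex
H(Y \mid X) = -\sum_{x,\,y} p(x, y) \log p(y \mid x)
```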
Multilingual word clustering

• We optimize the monolingual and the multilingual objective together (a plausible form of the combined objective is sketched after this list)
• Further edge-filtering heuristics can be used
  • Words aligned with stop words are generally noisy
  • Low-frequency words are important
• Open question: is edge filtering language-dependent or not?
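The combined objective was shown only as an image; one plausible reconstruction of its general form (an assumption, not a quote from the slide) is to maximize the two monolingual clustering likelihoods while penalizing the conditional entropies between the clusterings:

```latex
\max_{\mathcal{C}_X,\,\mathcal{C}_Y}\; \mathcal{L}(\mathcal{C}_X) + \mathcal{L}(\mathcal{C}_Y) - \beta\,\bigl[H(Y \mid X) + H(X \mid Y)\bigr]
```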
References
• M. Faruqui & S. Pado. Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. KONVENS 2010.
• Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. JLRE, 43(3):209–226.
• Alexander Clark. 2003. Combining distributional and morphological information for part of speech induction. Proc. EACL, pages 59–66, Budapest, Hungary.
• Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. Proc. EMNLP, pages 141–150, Singapore.
• Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. Proc. CoNLL, pages 168–171, Edmonton, Canada.
• Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proc. CoNLL, pages 142–147, Edmonton, Canada.