LING 388: Computers and Language Lecture 22

Post on 23-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Source: elmo.sbs.arizona.edu/sandiway/ling388-20/lecture22.pdf

LING 388: Computers and Language

Lecture 22

Administrivia

• Homework 10 Review
• Term Project Proposal
• On Stylometry
• Homework 11: easy!
    • Due Sunday midnight

Homework 10

• Using text1, text2, …, text9 from nltk.book, explore the hypothesis that lexical diversity decreases with text length
    • from nltk.book import *
    • texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
• Hint:
    • can do it manually for each text or use sorted() with key=len.
    • https://docs.python.org/3/howto/sorting.html
• Rank text1, …, text9 in order of text length. Report lengths.
• Rank text1, …, text9 in order of lexical diversity. Report diversity.
• Is the claim true?

Homework 10 Review

>>> from nltk.book import *
>>> texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
>>> sorted(texts, key=len, reverse=True)
[<Text: Moby Dick by Herman Melville 1851>, <Text: Inaugural Address Corpus>, <Text: Sense and Sensibility by Jane Austen 1811>, <Text: Wall Street Journal>, <Text: The Man Who Was Thursday by G . K . Chesterton 1908>, <Text: Chat Corpus>, <Text: The Book of Genesis>, <Text: Monty Python and the Holy Grail>, <Text: Personals Corpus>]
>>> [len(text) for text in sorted(texts, key=len, reverse=True)]
[260819, 149797, 141576, 100676, 69213, 45010, 44764, 16967, 4867]

Homework 10 Review

• lexical diversity: len(set(text))/len(text)
>>> [text for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
[<Text: Personals Corpus>, <Text: Chat Corpus>, <Text: Monty Python and the Holy Grail>, <Text: Wall Street Journal>, <Text: The Man Who Was Thursday by G . K . Chesterton 1908>, <Text: Moby Dick by Herman Melville 1851>, <Text: Inaugural Address Corpus>, <Text: The Book of Genesis>, <Text: Sense and Sensibility by Jane Austen 1811>]
>>> ["{:.3f}".format(len(set(text))/len(text)) for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
['0.228', '0.135', '0.128', '0.123', '0.098', '0.074', '0.066', '0.062', '0.048']
>>> [len(text) for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
[4867, 45010, 16967, 100676, 69213, 260819, 149797, 44764, 141576]
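The ranking above can be packaged as two small helpers. This is a minimal sketch, not the lecture's code: the names lexical_diversity and rank_by_diversity and the toy token lists are made up for illustration, and it runs without NLTK. With nltk.book loaded, one would pass something like {"text1": text1, "text2": text2} instead of the toy dictionary.

```python
def lexical_diversity(tokens):
    """Distinct tokens divided by total tokens, as defined on the slide."""
    return len(set(tokens)) / len(tokens)

def rank_by_diversity(named_texts):
    """Return (name, length, diversity) rows, most diverse first."""
    rows = [(name, len(toks), lexical_diversity(toks))
            for name, toks in named_texts.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Toy stand-ins for text1..text9 (hypothetical data, not from nltk.book).
toy = {
    "short": "the cat sat".split(),
    "long": "the cat sat on the mat and the dog sat on the log".split(),
}

for name, length, diversity in rank_by_diversity(toy):
    print(f"{name:6s} {length:3d} {diversity:.3f}")
```

Note that the measure is case-sensitive and counts punctuation tokens, so 'The' and 'the' are distinct types unless the tokens are normalized first.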

Homework 10 Review

>>> import matplotlib.pyplot as plt
>>> x = [len(text) for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
>>> y = [len(set(text))/len(text) for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
>>> plt.xlabel("Text length (in words)")
Text(0.5, 0, 'Text length (in words)')
>>> plt.ylabel("Lexical diversity")
Text(0, 0.5, 'Lexical diversity')
>>> plt.scatter(x,y)
<matplotlib.collections.PathCollection object at 0x7f9dc0174dc0>
>>> plt.show()
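As a numeric complement to the scatter plot, one could compute a Spearman rank correlation between length and diversity. The sketch below is an addition, not part of the lecture: it uses the lengths and the rounded diversities reported above, the helper names ranks and spearman are hypothetical, and the simple formula assumes there are no tied values (true for these nine numbers).

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Values copied from the Homework 10 review (same text order in both lists).
lengths = [4867, 45010, 16967, 100676, 69213, 260819, 149797, 44764, 141576]
diversity = [0.228, 0.135, 0.128, 0.123, 0.098, 0.074, 0.066, 0.062, 0.048]

rho = spearman(lengths, diversity)
print(round(rho, 3))
```

For these values rho comes out at about -0.6: a negative trend, consistent with the hypothesis, but not a strictly monotonic one. Nine texts from very different genres is a small sample, so this is suggestive rather than conclusive.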

Term Project Proposal

• Lecture 1:
    • Term project (e.g. build some application) – 25% of the grade
• Ask yourself: what are you interested in exploring?
• Must involve some use of what we've covered in terms of programming, e.g. straight Python or NLTK
• Propose some task, experiment or application you plan to prototype or build (doesn't have to be a complete application)
• One page summary
• Send it to me ([email protected]) for project approval
• Soft deadline: due by end of this week

On stylometry

• THE CHARACTERISTIC CURVES OF COMPOSITION by T. C. Mendenhall (1887).

Course website: Mendenhall1887.pdf

On stylometry

• Charles Dickens' Oliver Twist

On stylometry: task 1

• Let's test Mendenhall's hypothesis on Moby Dick (text1) vs. Sense and Sensibility (text2)
• Task 1: write a list comprehension each for text1 and text2 that transforms words into length of words. Save the resulting list of numbers as len1 and len2, respectively.
• Note:
    • len(len1) and len(text1) should be the same, same for len2 and text2.
• Example:
    • ['This', 'is', 'a', 'test', '.'] transforms into [4, 2, 1, 4, 1]

On stylometry: task 1

>>> from nltk.book import *
>>> len1 = [len(word) for word in text1]
>>> len(len1)
260819
>>> len(text1)
260819
>>> len2 = [len(word) for word in text2]
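The same comprehension can be checked on the slide's five-token example without loading NLTK. The sketch below also shows an optional variant that skips punctuation tokens; that variant is an assumption added here, not part of the task (which wants len(len1) == len(text1)), and the name word_lengths is hypothetical.

```python
def word_lengths(tokens, words_only=False):
    """Map each token to its length; optionally keep alphabetic tokens only."""
    if words_only:
        # str.isalpha() also drops tokens such as "don't" or "well-known".
        tokens = [t for t in tokens if t.isalpha()]
    return [len(t) for t in tokens]

sample = ['This', 'is', 'a', 'test', '.']
print(word_lengths(sample))                   # [4, 2, 1, 4, 1]
print(word_lengths(sample, words_only=True))  # [4, 2, 1, 4]
```

Because the nltk.book texts include punctuation as separate tokens, length 1 is inflated by commas, periods and the like in the unfiltered version.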

On stylometry: task 2

• Task 2: Use nltk's FreqDist() to plot corpora len1 and len2
• Name some reasons why it's difficult to compare the two graphs.

On stylometry: task 2

>>> fd1 = FreqDist(len1)
>>> fd2 = FreqDist(len2)
>>> fd1.tabulate(10)
    3     1     4     2     5     6     7     8     9    10
50223 47933 42345 38513 26597 17111 14399  9966  6428  3528
>>> fd2.tabulate(10)
    3     2     1     4     5     6     7     8     9    10
28839 24826 23009 21352 11438  9507  8158  5676  3736  2596
>>> fd1.plot()
>>> fd2.plot()
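Two things make the FreqDist output hard to compare: samples are ordered by decreasing frequency, so the column (and x-axis) order differs between texts, and the counts are raw, so the longer text dominates. A minimal sketch of one way around both, using collections.Counter as a stand-in (NLTK's FreqDist is a Counter subclass); the helper name by_length and the toy lists are hypothetical:

```python
from collections import Counter

def by_length(lengths):
    """Return (word length, count, proportion) rows sorted by word length."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return [(k, counts[k], counts[k] / total) for k in sorted(counts)]

# Toy stand-ins for len1 and len2 (not real corpus data).
toy1 = [3, 1, 4, 1, 3, 3, 2]
toy2 = [2, 2, 3, 1, 5]

for name, toy in (("toy1", toy1), ("toy2", toy2)):
    print(name)
    for length, count, proportion in by_length(toy):
        print(f"  {length:2d} {count:3d} {proportion:.3f}")
```

Sorting by the key (word length) gives both texts the same x-axis, and the proportion column removes the effect of text size.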

On stylometry: task 2

x-axis order in the two plots (samples sorted by frequency): [3,1,4,2,5,6,7,…] for text1 vs. [3,2,1,4,5,6,7,…] for text2

On stylometry: task 2

• Let's use matplotlib.pyplot to plot them together.
>>> import matplotlib.pyplot as plt
>>> mx = max(max(fd2),max(fd1))
>>> mx          # largest length of word across the two fds
20
>>> plt.hist(len1,range(1,mx+1),histtype='step')
>>> plt.hist(len2,range(1,mx+1),histtype='step')
>>> ax = plt.gca()
>>> ax.set_xticks(range(1,mx+1))
>>> plt.show()

On stylometry: task 2

y-axis: raw counts
x-axis: word length

On stylometry: task 2

• https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.hist.html

On stylometry: task 2

• Using density=True gives us a normalized (proportional) y-axis instead of the raw counts
>>> plt.hist(len1,range(1,mx+1),histtype='step',density=True)
>>> plt.hist(len2,range(1,mx+1),histtype='step',density=True)
>>> ax = plt.gca()
>>> ax.set_xticks(range(1,mx+1))
>>> plt.show()
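To see what the bins and density arguments do, here is a pure-Python stand-in for the binning rule that matplotlib's hist (via NumPy) documents: a sequence of bins is read as bin edges, every bin is half-open except the last, which also includes its right edge, and density=True divides each count by the total count times the bin width. The function step_hist is a hypothetical illustration, not a matplotlib API.

```python
from bisect import bisect_right

def step_hist(values, edges, density=False):
    """Histogram with edges like [1, 2, 3]: bins [1, 2) and [2, 3]."""
    edges = list(edges)
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # values outside the edges are ignored
        # The last bin is closed on the right, so clamp to the final index.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    if not density:
        return counts
    total = sum(counts)
    return [c / (total * (edges[i + 1] - edges[i]))
            for i, c in enumerate(counts)]

toy = [1, 2, 2, 3, 3, 3]   # toy word lengths, largest length mx = 3
mx = max(toy)
print(step_hist(toy, range(1, mx + 1)))                # lengths 2 and 3 share a bin
print(step_hist(toy, range(1, mx + 2)))                # one bin per length
print(step_hist(toy, range(1, mx + 2), density=True))  # proportions
```

With unit-width bins the density values are simply proportions. One consequence of the edge rule is that range(1,mx+1) puts the two largest lengths in the same final bin; passing range(1,mx+2) would give each length its own bin. For these texts the longest words are rare, so the visual difference should be small.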

On stylometry: task 2

y-axis: proportion
x-axis: word length

On stylometry: task 3

• Task 3: Mendenhall's method uses groups of words, e.g. 10,000 at a time.
• text1 contains >260,000 words
• Using only the first 100,000 words of text1, let's divide them into 10 groups of 10,000.
• Produce lists l1, l2, …, l10 (length of words for each group of 10,000) from len1

On stylometry: task 3

• Let's do it manually (first):
>>> l1 = len1[0:10000]
>>> l2 = len1[10000:20000]
>>> l3 = len1[20000:30000]
etc.
• Let's do it with a loop:
>>> l = []
>>> for i in range(0,100000,10000):
...     l.append(len1[i:i+10000])
...
>>> len(l)
10
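The loop generalizes to a small helper built on a list comprehension. This is a sketch with a hypothetical function name (chunks); it assumes that a trailing group shorter than the requested size should be dropped, which matches the "first 100,000 words" setup but is a choice, not something the lecture specifies.

```python
def chunks(seq, size, limit=None):
    """Split seq[:limit] into consecutive full groups of `size` items."""
    end = len(seq) if limit is None else min(limit, len(seq))
    # Stop early enough that every slice has exactly `size` items.
    return [seq[i:i + size] for i in range(0, end - size + 1, size)]

toy = list(range(25))
print(chunks(toy, 10))            # two full groups; the last 5 items are dropped
print(chunks(toy, 10, limit=20))  # same two groups
```

For the task above, the equivalent call would be chunks(len1, 10000, limit=100000), which should yield the same ten groups as the loop.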

On stylometry: task 4

• Task 4: let's overlay the frequency plots to see if we see a consistent characteristic signature between the groups of 10,000 for text1.

On stylometry: task 4

• Let's do it with a loop on the list l (from task 3)
>>> mx = max(len1)
>>> mx
20
>>> for group in l:
...     plt.hist(group,bins=range(1,mx+1),histtype='step',density=True)
...
>>> ax = plt.gca()
>>> ax.set_xticks(range(1,mx+1))
>>> plt.show()
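The overlay is judged by eye; a rough numeric companion is to compute each group's proportion per word length and then see how far the groups disagree at each length. The sketch below is an addition, not Mendenhall's method or the lecture's code: length_profile and spread are hypothetical names, groups are assumed non-empty, and the toy groups stand in for the list l.

```python
from collections import Counter

def length_profile(group, mx):
    """Proportion of words of each length 1..mx within one group."""
    counts = Counter(group)
    n = len(group)
    return [counts[k] / n for k in range(1, mx + 1)]

def spread(groups, mx):
    """Per word length, the max minus min proportion across groups."""
    profiles = [length_profile(g, mx) for g in groups]
    return [max(col) - min(col) for col in zip(*profiles)]

# Toy stand-ins for the ten groups of 10,000 word lengths.
toy_groups = [[1, 2, 3, 3], [1, 3, 3, 3], [2, 2, 3, 3]]
print(spread(toy_groups, mx=3))
```

Small spreads at every length would correspond to curves that lie nearly on top of one another; what counts as "small" is a judgment call that this sketch does not settle.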

On stylometry: task 4

Homework 11

>>> from nltk.book import *
• text2: Sense and Sensibility by Jane Austen 1811
• text7: Wall Street Journal
• Compute task 4 for the first 100,000 words of text2 and text7.
• Divide up each text into groups of 10,000 words (as shown in class)
• In your opinion, do the characteristic curves of text2 seem significantly different from those of text7?
• Show your Python work
• Show your graphs
• Due date: Sunday midnight