ling 388: computers and languageelmo.sbs.arizona.edu/sandiway/ling388-20/lecture22.pdf · homework...

LING 388: Computers and Language

Lecture 22

Administrivia

• Homework 10 Review
• Term Project Proposal
• On Stylometry
• Homework 11: easy!
  • Due Sunday midnight

Homework 10

• Using text1, text2, …, text9 from nltk.book, explore the hypothesis that lexical diversity decreases with text length
  • from nltk.book import *
  • texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
• Hint:
  • can do it manually for each text or use sorted() with key=len.
  • https://docs.python.org/3/howto/sorting.html

• Rank text1, …, text9 in order of text length. Report lengths.
• Rank text1, …, text9 in order of lexical diversity. Report diversity.
• Is the claim true?


Homework 10 Review

>>> from nltk.book import *
>>> texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
>>> sorted(texts, key=len, reverse=True)
[<Text: Moby Dick by Herman Melville 1851>, <Text: Inaugural Address Corpus>, <Text: Sense and Sensibility by Jane Austen 1811>, <Text: Wall Street Journal>, <Text: The Man Who Was Thursday by G . K . Chesterton 1908>, <Text: Chat Corpus>, <Text: The Book of Genesis>, <Text: Monty Python and the Holy Grail>, <Text: Personals Corpus>]
>>> [len(text) for text in sorted(texts, key=len, reverse=True)]
[260819, 149797, 141576, 100676, 69213, 45010, 44764, 16967, 4867]

Homework 10 Review

• lexical diversity: len(set(text))/len(text)
>>> [text for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
[<Text: Personals Corpus>, <Text: Chat Corpus>, <Text: Monty Python and the Holy Grail>, <Text: Wall Street Journal>, <Text: The Man Who Was Thursday by G . K . Chesterton 1908>, <Text: Moby Dick by Herman Melville 1851>, <Text: Inaugural Address Corpus>, <Text: The Book of Genesis>, <Text: Sense and Sensibility by Jane Austen 1811>]
>>> ["{:.3f}".format(len(set(text))/len(text)) for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
['0.228', '0.135', '0.128', '0.123', '0.098', '0.074', '0.066', '0.062', '0.048']
>>> [len(text) for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
[4867, 45010, 16967, 100676, 69213, 260819, 149797, 44764, 141576]
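The two rankings can be tried without NLTK. This is a minimal sketch on made-up token lists; the helper name lexical_diversity and the toy texts are assumptions for illustration, standing in for text1, …, text9.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical toy "texts"; with NLTK these would be text1, ..., text9.
toy_texts = {
    "short": "the cat sat".split(),
    "medium": "the cat sat on the mat".split(),
    "long": "the cat sat on the mat and the dog sat on the log".split(),
}

# Rank the names by text length, longest first.
by_length = sorted(toy_texts, key=lambda name: len(toy_texts[name]), reverse=True)

# Rank the names by lexical diversity, most diverse first.
by_diversity = sorted(
    toy_texts, key=lambda name: lexical_diversity(toy_texts[name]), reverse=True
)

for name in by_diversity:
    tokens = toy_texts[name]
    print(f"{name:<8}{len(tokens):>4}{lexical_diversity(tokens):>8.3f}")
```

In this toy case the two rankings are exact opposites, because repeated function words such as "the" pile up as the text grows.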

Homework 10 Review

>>> import matplotlib.pyplot as plt
>>> x = [len(text) for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
>>> y = [len(set(text))/len(text) for text in sorted(texts, key=lambda x:len(set(x))/len(x), reverse=True)]
>>> plt.xlabel("Text length (in words)")
Text(0.5, 0, 'Text length (in words)')
>>> plt.ylabel("Lexical diversity")
Text(0, 0.5, 'Lexical diversity')
>>> plt.scatter(x,y)
<matplotlib.collections.PathCollection object at 0x7f9dc0174dc0>
>>> plt.show()
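One way to put a number on "Is the claim true?" is a rank correlation between length and diversity. Spearman's rank correlation is not part of the lecture; it is used here only as an assumed, simple summary. The two lists are copied from the review output above, and the helper names ranks and spearman are hypothetical.

```python
# Values copied from the Homework 10 review output (sorted by diversity).
lengths = [4867, 45010, 16967, 100676, 69213, 260819, 149797, 44764, 141576]
diversity = [0.228, 0.135, 0.128, 0.123, 0.098, 0.074, 0.066, 0.062, 0.048]

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1)), no ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

rho = spearman(lengths, diversity)
print(round(rho, 3))
```

The value comes out at about -0.6: longer texts tend to rank lower in diversity, but the relationship is far from perfect, and nine texts from different genres are too few to settle the hypothesis.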

Term Project Proposal

• Lecture 1:
  • Term project (e.g. build some application) – 25% of the grade
• Ask yourself: what are you interested in exploring?
• Must involve some use of what we've covered in terms of programming, e.g. straight Python or NLTK
• Propose some task, experiment or application you plan to prototype or build (doesn't have to be a complete application)
• One page summary
• Send it to me ([email protected]) for project approval
• Soft deadline: due by end of this week


On stylometry

• THE CHARACTERISTIC CURVES OF COMPOSITION by T. C. Mendenhall (1887).

Course website: Mendenhall1887.pdf

On stylometry

• Charles Dickens' Oliver Twist

On stylometry: task 1

• Let's test Mendenhall's hypothesis on Moby Dick (text1) vs. Sense and Sensibility (text2)
• Task 1: write a list comprehension each for text1 and text2, that transforms words into length of words. Save the resulting list of numbers as len1 and len2, respectively.
• Note:
  • len(len1) and len(text1) should be the same, same for len2 and text2.
• Example:
  • ['This', 'is', 'a', 'test', '.'] transforms into [4, 2, 1, 4, 1]
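The example can be checked directly with a list comprehension on a plain Python list. The variable names words and lengths are placeholders; with NLTK the input would be text1 or text2.

```python
# The example sentence from the slide, already split into tokens.
words = ['This', 'is', 'a', 'test', '.']

# One number per token: punctuation counts as a token of length 1.
lengths = [len(word) for word in words]

print(lengths)                      # [4, 2, 1, 4, 1]
print(len(lengths) == len(words))   # True: one length per token
```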


• Task 1: write a list comprehension each for text1 and text2, that transforms words into length of words. Save the resulting list of numbers as len1 and len2, respectively.

>>> from nltk.book import *
>>> len1 = [len(word) for word in text1]
>>> len(len1)
260819
>>> len(text1)
260819
>>> len2 = [len(word) for word in text2]




• Task 2: Use nltk's FreqDist() to plot corpora len1 and len2
• Name some reasons why it's difficult to compare the two graphs.
>>> fd1 = FreqDist(len1)
>>> fd2 = FreqDist(len2)
>>> fd1.tabulate(10)
    3     1     4     2     5     6     7     8     9    10
50223 47933 42345 38513 26597 17111 14399  9966  6428  3528
>>> fd2.tabulate(10)
    3     2     1     4     5     6     7     8     9    10
28839 24826 23009 21352 11438  9507  8158  5676  3736  2596
>>> fd1.plot()
>>> fd2.plot()


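Figure: the two FreqDist plots order the x-axis by frequency, not by word length: [3,1,4,2,5,6,7,…] for len1 and [3,2,1,4,5,6,7,…] for len2.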
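The ordering problem can be reproduced without NLTK, since nltk's FreqDist is a subclass of collections.Counter. This sketch uses only the ten counts shown by tabulate(10) above (not the full distributions); the names fd1_top and fd2_top are placeholders.

```python
from collections import Counter

# Counts copied from fd1.tabulate(10) and fd2.tabulate(10): word length -> count.
fd1_top = Counter({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597,
                   6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528})
fd2_top = Counter({3: 28839, 2: 24826, 1: 23009, 4: 21352, 5: 11438,
                   6: 9507, 7: 8158, 8: 5676, 9: 3736, 10: 2596})

# Frequency order differs between the texts, so x positions do not line up.
order1 = [length for length, _ in fd1_top.most_common()]
order2 = [length for length, _ in fd2_top.most_common()]
print(order1)
print(order2)

# Sorting by word length instead puts both texts on a shared axis.
for length in sorted(fd1_top):
    print(f"{length:>2}{fd1_top[length]:>8}{fd2_top[length]:>8}")
```

Even on a shared axis the raw counts are not comparable, because text1 is almost twice as long as text2.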


• Let's use matplotlib.pyplot to plot them together.
• import matplotlib.pyplot as plt
• mx = max(max(fd2),max(fd1))
• mx (largest length of word across the two fds)
• 20
• plt.hist(len1,range(1,mx+1),histtype='step')
• plt.hist(len2,range(1,mx+1),histtype='step')
• ax = plt.gca()
• ax.set_xticks(range(1,mx+1))
• plt.show()


Figure: y-axis: raw counts; x-axis: word length


• https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.hist.html



• Using density=True gives us a normalized (proportional) y-axis instead of the raw counts
• plt.hist(len1,range(1,mx+1),histtype='step',density=True)
• plt.hist(len2,range(1,mx+1),histtype='step',density=True)
• ax = plt.gca()
• ax.set_xticks(range(1,mx+1))
• plt.show()


Figure: y-axis: proportion; x-axis: word length
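Why normalizing helps can be seen on a small made-up example. With density=True, matplotlib scales the histogram so its area is 1; with bins of width 1, as here, each bar height is the share of tokens in that bin. The sketch below computes those shares by hand; the function name proportions and the toy lists are assumptions for illustration.

```python
from collections import Counter

def proportions(lengths):
    """Map each word length to its share of all tokens in the list."""
    counts = Counter(lengths)
    total = len(lengths)
    return {length: counts[length] / total for length in sorted(counts)}

# Hypothetical toy data: 'large' is 'small' repeated three times,
# i.e. the same style but three times as much text.
small = [1, 2, 2, 3, 3, 3, 4]
large = small * 3

# Raw counts differ with text size ...
print(Counter(small)[3], Counter(large)[3])

# ... but the proportions are the same for both.
print(proportions(small)[3], proportions(large)[3])
```

This is why the density=True plots of len1 and len2 can be compared directly, while the raw-count plots mostly show that text1 is longer.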


• Task 3: Mendenhall's method uses groups of words, e.g. 10,000 at a time.
• text1 contains >260,000 words
• Using only the 1st 100,000 words of text1, let's divide them into 10 groups of 10,000.
• Produce lists l1, l2, …, l10 (length of words for each group of 10,000) from len1


• Let's do it manually (first):
>>> l1 = len1[0:10000]
>>> l2 = len1[10000:20000]
>>> l3 = len1[20000:30000]
etc.
• Let's do it with a loop:
>>> l = []
>>> for i in range(0,100000,10000):
...     l.append(len1[i:i+10000])
...
>>> len(l)
10
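The loop generalizes to any group size. This is a sketch of a hypothetical helper named chunk, shown on a short toy list rather than len1; it is one possible way to write it, not the version used in class.

```python
def chunk(values, size, limit=None):
    """Split values[:limit] into consecutive groups of `size` items.

    The last group is shorter when the trimmed length is not a multiple
    of `size`. limit=None keeps the whole list.
    """
    trimmed = values[:limit]
    return [trimmed[i:i + size] for i in range(0, len(trimmed), size)]

toy = list(range(25))

# Only the first 20 items, in groups of 10 (compare: 100,000 words, groups of 10,000).
groups = chunk(toy, size=10, limit=20)
print(len(groups), [len(group) for group in groups])

# Without a limit the leftover items form a short final group.
print([len(group) for group in chunk(toy, size=10)])
```

With the lecture's data the equivalent call would be chunk(len1, 10000, 100000), which should give the same ten lists as the loop.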




• Task 4: let's overlay the frequency plots to see if we see a consistent characteristic signature between the groups of 10,000 for text1.
• Let's do it with a loop on the list l (from task 3)
>>> mx = max(len1)
>>> mx
20
>>> for group in l:
...     plt.hist(group,bins=range(1,mx+1),histtype='step',density=True)
...
>>> ax = plt.gca()
>>> ax.set_xticks(range(1,mx+1))
>>> plt.show()
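The same comparison can be read off as numbers instead of overlaid curves. This sketch prints one row of proportions per group; the function name profile and the three tiny groups are made up for illustration and stand in for the 10,000-word groups in l.

```python
from collections import Counter

def profile(lengths, max_length):
    """Proportion of tokens at each word length 1..max_length."""
    counts = Counter(lengths)
    total = len(lengths)
    return [counts[n] / total for n in range(1, max_length + 1)]

# Hypothetical toy groups standing in for the 10,000-word groups in l.
groups = [
    [1, 3, 3, 2, 4, 3, 5, 2, 3, 4],
    [3, 2, 3, 4, 1, 3, 2, 5, 4, 3],
    [2, 3, 4, 3, 3, 1, 5, 3, 2, 2],
]
mx = max(max(group) for group in groups)

# Header row of word lengths, then one row of proportions per group.
print("len " + " ".join(f"{n:>5}" for n in range(1, mx + 1)))
for number, group in enumerate(groups, start=1):
    row = " ".join(f"{p:>5.2f}" for p in profile(group, mx))
    print(f"g{number:<3}" + row)
```

Rows that look alike correspond to curves that lie on top of each other in the overlay plot; a row that stands out corresponds to a curve that departs from the rest.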

Homework 11

>>> from nltk.book import *
• text2: Sense and Sensibility by Jane Austen 1811
• text7: Wall Street Journal
• Compute task 4 for the first 100,000 words of text2 and text7.
• Divide up each text into groups of 10,000 words (as shown in class)
• In your opinion, do the characteristic curves of text2 seem significantly different from those of text7?
• Show your Python work
• Show your graphs
• Due date: Sunday midnight
