Text statistics 7

Day 30 - 11/05/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University

Course organization

03-Nov-2014 NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/

Chapter numbering

3.7. How to deal with non-English characters

4.5. How to create a pattern with Unicode characters

6. Control

Final project


Open Spyder


Review

ConditionalFreqDist

>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
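To make the idea behind the (condition, sample) pairs concrete, here is a minimal plain-Python sketch of what a conditional frequency distribution computes: one counter per condition. It is only an illustration, not NLTK's implementation; the name `cfd_sketch` and the toy pairs are invented for this example.

```python
from collections import Counter, defaultdict

# Toy (condition, sample) pairs standing in for the (category, word) pairs
# built from the Brown corpus; the words are invented for illustration.
pairs = [
    ("news", "the"), ("news", "will"), ("news", "the"),
    ("romance", "could"), ("romance", "the"), ("romance", "could"),
]

# One Counter per condition: the core idea of a conditional frequency distribution.
cfd_sketch = defaultdict(Counter)
for condition, sample in pairs:
    cfd_sketch[condition][sample] += 1

print(sorted(cfd_sketch))                     # the conditions
print(cfd_sketch["news"]["the"])              # count of 'the' under 'news'
print(cfd_sketch["romance"].most_common(1))   # most frequent sample under 'romance'
```

Indexing first by condition and then by sample mirrors how `cfd['news']['the']` is used with the NLTK object.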


Conditional frequency distribution


A more interesting example

           can  could   may  might  must  will
news        93     86    66     38    50   389
religion    82     59    78     12    54    71
hobbies    268     58   131     22    83   264
sci fi      16     49     4     12     8    16
romance     74    193    11     51    45    43
humor       16     30     8      8     9    13

Conditions = categories, samples = modal verbs

# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies',
...        'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might',
...        'must', 'will']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)
...            if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()



cfd.tabulate()

                  can  could   may  might  must  will
news               93     86    66     38    50   389
religion           82     59    78     12    54    71
hobbies           268     58   131     22    83   264
science_fiction    16     49     4     12     8    16
romance            74    193    11     51    45    43
humor              16     30     8      8     9    13
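One way to read the table is to ask which modal dominates each category and by how much. The sketch below copies the counts from the table into a plain dictionary and computes that; the helper name `top_modal` is an assumption made for this example and is not part of NLTK.

```python
# Counts copied from the table above (rows = categories, columns = modals).
modals = ["can", "could", "may", "might", "must", "will"]
counts = {
    "news":            [93, 86, 66, 38, 50, 389],
    "religion":        [82, 59, 78, 12, 54, 71],
    "hobbies":         [268, 58, 131, 22, 83, 264],
    "science_fiction": [16, 49, 4, 12, 8, 16],
    "romance":         [74, 193, 11, 51, 45, 43],
    "humor":           [16, 30, 8, 8, 9, 13],
}

def top_modal(category):
    """Return (modal, share) for the most frequent modal in a category."""
    row = counts[category]
    total = sum(row)
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest count
    return modals[best], row[best] / total

for category in counts:
    modal, share = top_modal(category)
    print(f"{category:<16} {modal:<6} {share:.0%}")
```

Dividing by the row total matters because the categories differ in size, so raw counts alone are not comparable across rows.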



cfd.plot()



Another example

The task is to find the frequency of 'America' and 'citizen' in NLTK's corpus of presidential inaugural addresses:

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']




First try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if w.lower() in keys]
cfd2 = ConditionalFreqDist(keyYear)
cfd2.plot()
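The `title[:4]` slice in the code above relies on every file id starting with a four-digit year. A small sketch of that slicing, using only the file ids shown in the `inaugural.fileids()` output; the variable names are chosen for this example.

```python
# File ids as listed in the fileids() output (only the ones shown there).
fileids = ['1789-Washington.txt', '1793-Washington.txt',
           '1797-Adams.txt', '2009-Obama.txt']

for fileid in fileids:
    year = fileid[:4]                    # first four characters
    president = fileid[5:-len('.txt')]   # between the hyphen and the extension
    print(year, president)
```

Because the year is kept as a string, it serves directly as the sample in each pair and needs no conversion.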



cfd2.plot()

Second try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(k, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           for k in keys
           if w.lower().startswith(k)]
cfd3 = ConditionalFreqDist(keyYear)
cfd3.plot()
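The difference between the first and second tries comes down to exact matching versus prefix matching. The sketch below contrasts the two on a handful of invented tokens (they are not taken from the inaugural corpus), so the effect is visible without loading any data.

```python
keys = ['america', 'citizen']

# Invented tokens for illustration; not taken from the inaugural corpus.
tokens = ['America', 'American', 'Americans', 'citizen', 'citizens', 'city']

# First try: the lowercased word must equal one of the keys.
exact = [w for w in tokens if w.lower() in keys]

# Second try: the lowercased word only has to start with a key,
# and the key itself (not the word) becomes the condition.
prefix = [(k, w) for w in tokens for k in keys if w.lower().startswith(k)]

print(exact)
print(prefix)
```

Prefix matching picks up inflected forms such as 'Americans' and 'citizens', and using the key as the condition keeps the plot to one line per key.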



cfd3.plot()



Stemming



Third try

from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if stemmer.stem(w) in keys]
cfd4 = ConditionalFreqDist(keyYear)
cfd4.plot()
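In the third try the first element of each pair is the surface form `w`, so differently capitalized or inflected forms that share a stem end up as separate conditions. The sketch below contrasts conditioning on the word with conditioning on its stem. It assumes a hypothetical `toy_stem` helper as a stand-in for `EnglishStemmer` (it is not the Snowball algorithm) and uses invented tokens.

```python
from collections import Counter

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase and drop a final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

keys = ['america', 'citizen']

# Invented tokens for illustration; not taken from the inaugural corpus.
tokens = ['Citizens', 'citizen', 'citizens', 'America', 'nations']

# Condition on the surface form: each spelling is counted separately.
by_word = Counter(w for w in tokens if toy_stem(w) in keys)

# Condition on the stem: all forms of a key are counted together.
by_stem = Counter(toy_stem(w) for w in tokens if toy_stem(w) in keys)

print(by_word)
print(by_stem)
```

Which grouping is preferable depends on the question being asked: surface forms show spelling variation, while stems give one line per key in the plot.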



cfd4.plot()



Next time

Twitter

