
Text statistics 7
Day 30 - 11/05/14
LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University


Course organization

03-Nov-2014 NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control


Final project


Open Spyder


Review


ConditionalFreqDist

>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
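To make the data structure concrete, here is a minimal sketch, using only Python's standard library, of what a conditional frequency distribution holds: one frequency counter per condition. The (category, word) pairs and the name toy_cfd are invented for illustration and are not Brown corpus data; NLTK's ConditionalFreqDist offers far more (tabulating, plotting) than this stand-in.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs; not taken from the Brown corpus.
pairs = [
    ("news", "the"), ("news", "will"), ("news", "the"),
    ("romance", "could"), ("romance", "the"), ("romance", "could"),
]

# One Counter per condition: condition -> sample -> count.
toy_cfd = defaultdict(Counter)
for condition, sample in pairs:
    toy_cfd[condition][sample] += 1

print(sorted(toy_cfd))                    # the conditions
print(toy_cfd["news"]["the"])             # count of 'the' under 'news'
print(toy_cfd["romance"].most_common(1))  # most frequent sample under 'romance'
```

The list comprehension on the slide builds exactly this kind of pair list, with the category as the condition and each word as the sample.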


Conditional frequency distribution


A more interesting example

                   can  could    may  might   must   will
news                93     86     66     38     50    389
religion            82     59     78     12     54     71
hobbies            268     58    131     22     83    264
sci fi              16     49      4     12      8     16
romance             74    193     11     51     45     43
humor               16     30      8      8      9     13


Conditions = categories, samples = modal verbs

# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)
...            if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()


cfd.tabulate()

                   can  could    may  might   must   will
news                93     86     66     38     50    389
religion            82     59     78     12     54     71
hobbies            268     58    131     22     83    264
science_fiction     16     49      4     12      8     16
romance             74    193     11     51     45     43
humor               16     30      8      8      9     13
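One way to read the table is to ask which modal dominates each category. The sketch below does that with plain Python; the counts are copied from the tabulated output above, while the variable names (counts, top_modal) are just illustrative choices. It does not need NLTK or the corpus.

```python
# Counts copied from the cfd.tabulate() output above.
modals = ["can", "could", "may", "might", "must", "will"]
counts = {
    "news":            [93, 86, 66, 38, 50, 389],
    "religion":        [82, 59, 78, 12, 54, 71],
    "hobbies":         [268, 58, 131, 22, 83, 264],
    "science_fiction": [16, 49, 4, 12, 8, 16],
    "romance":         [74, 193, 11, 51, 45, 43],
    "humor":           [16, 30, 8, 8, 9, 13],
}

top_modal = {}
for category, row in counts.items():
    total = sum(row)
    # Index of the largest count in this row.
    best = max(range(len(row)), key=row.__getitem__)
    top_modal[category] = modals[best]
    share = row[best] / total
    print(f"{category:<16} {modals[best]:<6} {share:.0%} of {total} modals")
```

Dividing by the row total matters because the categories differ in size, so raw counts alone are not comparable across rows.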


cfd.plot()


Another example

The task is to find the frequency of 'America' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']


cfd2.plot()


First try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if w.lower() in keys]
cfd2 = ConditionalFreqDist(keyYear)
cfd2.plot()


cfd2.plot()


Second try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# Pair the matching key (not the raw word) with the year of the address.
keyYear = [(k, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           for k in keys
           if w.lower().startswith(k)]
cfd3 = ConditionalFreqDist(keyYear)
cfd3.plot()
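The difference between the first and second tries can be seen on a handful of words without loading the corpus. The sketch below is only an illustration: the token list is hypothetical, not drawn from the inaugural addresses, and the names exact and prefix are made up here.

```python
keys = ["america", "citizen"]

# Hypothetical tokens; not drawn from the inaugural corpus.
tokens = ["America", "Americans", "citizens", "Citizen", "citizenship", "nation"]

# First try: keep a word only if its lowercase form equals a key.
exact = [w for w in tokens if w.lower() in keys]

# Second try: keep a word if its lowercase form starts with a key,
# and record which key it matched.
prefix = [(k, w) for w in tokens for k in keys if w.lower().startswith(k)]

print(exact)
print(prefix)
```

Exact matching misses plurals and derived forms, while prefix matching catches them but also admits words such as 'citizenship' that may or may not be wanted.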


cfd3.plot()


Stemming


Third try

from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if stemmer.stem(w) in keys]
cfd4 = ConditionalFreqDist(keyYear)
cfd4.plot()
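The idea behind filtering on a stem can be sketched without NLTK. The function toy_stem below is a deliberately crude, hypothetical stand-in that only lowercases a word and strips a plural 's'; it is not the Snowball algorithm, which applies many more rules and may treat some of these words differently. The token list is invented for illustration.

```python
from collections import Counter

def toy_stem(word):
    """Crude stand-in for a stemmer: lowercase and strip a plural 's'."""
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w

keys = ["america", "citizen"]

# Hypothetical tokens; not drawn from the inaugural corpus.
tokens = ["America", "Americans", "citizens", "Citizen", "citizenship", "nation"]

# Keep the raw word whenever its stem is one of the keys, as in the third try.
matched = [w for w in tokens if toy_stem(w) in keys]

# Alternative: count by stem, so that variant spellings share one entry.
by_stem = Counter(toy_stem(w) for w in tokens if toy_stem(w) in keys)

print(matched)
print(by_stem)
```

Note the design choice this exposes: pairing the raw word with the year gives each surface form ('Citizen', 'citizens') its own condition, whereas pairing the stem with the year would merge them into one.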


cfd4.plot()


Twitter

Next time
