Natural Language Processing with Python
CS372: Spring, 2021
Lecture 3
Accessing Text Corpora and Lexical Resources
Jong C. Park
School of Computing
Korea Advanced Institute of Science and Technology
ACCESSING TEXT CORPORA AND LEXICAL RESOURCES
• Accessing Text Corpora
• Conditional Frequency Distributions
• More Python: Reusing Code
• Lexical Resources
• WordNet
2021-03-09 CS372: NLP with Python
Introduction
Questions
• What are some useful text corpora and lexical resources, and how can we access them with Python?
• Which Python constructs are most helpful for this work?
• How do we avoid repeating ourselves when writing Python code?
Accessing Text Corpora
• Gutenberg Corpus
• Web and Chat Text
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
• Annotated Text Corpora
• Corpora in Other Languages
• Text Corpus Structure
• Loading Your Own Corpus
Gutenberg Corpus
The Project Gutenberg electronic text archive
• contains some 25,000 electronic books
• http://www.gutenberg.org/
Average sentence length and lexical diversity appear to be characteristics of particular authors.
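The two measures mentioned here can be sketched as follows. This is a minimal sketch modelled on the usual NLTK Gutenberg walkthrough, not necessarily the code shown on the original slide. The helper names `text_profile` and `gutenberg_profiles` are made up for illustration, and the corpus loop assumes NLTK is installed and `nltk.download('gutenberg')` has been run (the sentence splitting may also need the Punkt tokenizer data).

```python
def text_profile(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity).

    raw   -- the text as one string
    words -- the text as a list of word tokens
    sents -- the text as a list of sentences
    """
    num_chars = len(raw)
    num_words = len(words)
    num_sents = len(sents)
    # Vocabulary size ignores case, so "The" and "the" count once.
    num_vocab = len({w.lower() for w in words})
    return (
        round(num_chars / num_words),   # characters per word (whitespace included)
        round(num_words / num_sents),   # words per sentence
        round(num_words / num_vocab),   # average uses of each vocabulary item
    )


def gutenberg_profiles():
    """Yield (fileid, profile) for every text in NLTK's Gutenberg selection."""
    # Imported here so text_profile stays usable without NLTK installed.
    from nltk.corpus import gutenberg

    for fileid in gutenberg.fileids():
        yield fileid, text_profile(
            gutenberg.raw(fileid),
            gutenberg.words(fileid),
            gutenberg.sents(fileid),
        )
```

Because the character count comes from the raw string, the first figure includes spaces and so overstates the true word length by roughly one. Lexical diversity is taken here as tokens per vocabulary item, which is one of several possible definitions.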
The sents() function divides the text into its sentences, which are lists of words.
Most NLTK corpus readers include a variety of access methods in addition to words(), raw(), and sents().
Web and Chat Text
NLTK’s collection of web text includes
• content from a Firefox discussion forum;
• conversations overheard in New York;
• the movie script of Pirates of the Caribbean;
• personal advertisements; and
• wine reviews.
A corpus of instant messaging chat sessions:
• originally collected by the Naval Postgraduate School (nps) for research on automatic detection of Internet predators;
• contains over 10,000 posts, anonymized by replacing usernames with generic names of the form “UserNNN”, and manually edited to remove any other identifying information;
• organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom).
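The date and age group of each session are encoded in the file name, which a few lines of Python can unpack. This is a minimal sketch that assumes fileids of the form `10-19-20s_706posts.xml` (month and day, age group, number of posts). The helper names are hypothetical, and reading the posts assumes `nltk.download('nps_chat')` has been run.

```python
def parse_chat_fileid(fileid):
    """Split an nps_chat fileid such as '10-19-20s_706posts.xml'.

    Returns (date, age_group, post_count), e.g. ('10-19', '20s', 706).
    The naming convention is an assumption based on the corpus fileids.
    """
    stem = fileid.rsplit(".", 1)[0]            # drop the '.xml' extension
    session, posts = stem.split("_")           # '10-19-20s', '706posts'
    month, day, age_group = session.split("-")
    post_count = int(posts[: -len("posts")])   # strip the 'posts' suffix
    return f"{month}-{day}", age_group, post_count


def chatroom_posts(fileid="10-19-20s_706posts.xml"):
    """Return one session's posts, each as a list of tokens."""
    # Imported here so parse_chat_fileid works without NLTK installed.
    from nltk.corpus import nps_chat

    return nps_chat.posts(fileid)
```

Calling `nps_chat.fileids()` first is a simple way to confirm the actual file names before relying on the parser.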
Brown Corpus
The Brown Corpus
• the first million-word electronic corpus of English;
• created in 1961 at Brown University;
• contains text from 500 sources; and
• the sources have been categorized by genre, such as news, editorial, and so on.
See http://icame.uib.no/brown/bcm-los.html for a complete list.
We can access the corpus as a list of words or a list of sentences.
• We may optionally specify particular categories or files to read.
It is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.
Is there any other selection of words that one can try for similar stylistics?
Computing counts for each genre of interest.
• Use NLTK’s support for conditional frequency distributions.
What kind of observations can we make?
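One way to produce such per-genre counts is sketched below. The choice of modal verbs and genres follows the usual NLTK Brown Corpus walkthrough and is an assumption about what the slide showed. `modal_table` and `brown_modals` are hypothetical helper names, and the second function assumes `nltk.download('brown')` has been run.

```python
from collections import Counter

MODALS = ["can", "could", "may", "might", "must", "will"]


def modal_table(words_by_genre, modals=MODALS):
    """Map each genre to its modal-verb counts, in the order given by `modals`."""
    table = {}
    for genre, words in words_by_genre.items():
        # Lower-case so that "Can" and "can" are counted together.
        counts = Counter(w.lower() for w in words)
        table[genre] = [counts[m] for m in modals]
    return table


def brown_modals(genres=("news", "religion", "hobbies",
                         "science_fiction", "romance", "humor")):
    """Print a genre-by-modal table using NLTK's ConditionalFreqDist."""
    # Imported here so modal_table works without NLTK installed.
    import nltk
    from nltk.corpus import brown

    genres = list(genres)
    cfd = nltk.ConditionalFreqDist(
        (genre, word.lower())
        for genre in genres
        for word in brown.words(categories=genre)
    )
    cfd.tabulate(conditions=genres, samples=MODALS)
    return cfd
```

Swapping `MODALS` for another word list (for example wh-words such as what, when, where, who, and why) is one way to explore the earlier question about other word selections.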
Reuters Corpus
The Reuters Corpus
• It contains 10,788 news documents totaling 1.3 million words.
• The documents are classified into 90 topics, and grouped into two sets, “training”/“test”.
• For example, the text with fileid ‘test/14826’ is a document drawn from the test set.
• The split is for training and testing algorithms that automatically detect the topic of a document.
Unlike the Brown Corpus, categories in the Reuters Corpus overlap with each other.
Why?
We can specify the words or sentences we want in terms of files or categories.
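The overlap between categories becomes visible once the document-to-category labels are inverted, since a news story that covers several topics shows up under each of them. The sketch below uses made-up toy labels for the tested part. `fileids_by_category` and `reuters_demo` are hypothetical names, and the demo assumes `nltk.download('reuters')` has been run.

```python
from collections import defaultdict


def fileids_by_category(doc_categories):
    """Invert a {fileid: [categories]} mapping into {category: [fileids]}."""
    index = defaultdict(list)
    for fileid, categories in doc_categories.items():
        for category in categories:
            # A document appears under every category it is labelled with.
            index[category].append(fileid)
    return dict(index)


def reuters_demo():
    """Look up categories by file and files or words by category."""
    # Imported here so fileids_by_category works without NLTK installed.
    from nltk.corpus import reuters

    print(reuters.categories("training/9865"))              # topics of one document
    print(reuters.fileids(["barley", "corn"])[:5])           # documents in either topic
    print(reuters.words(categories=["barley", "corn"])[:10]) # words from those documents
```

The same `categories=` and `fileids=` arguments are accepted by `sents()`, so sentences can be selected in the same two ways.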
Inaugural Address Corpus
The Inaugural Address Corpus
• a collection of 55 texts, one for each presidential address;
• its time dimension is an interesting property.
‘2021-Biden.txt’ is not yet available.
Looking at how the words America and citizen are used over time.
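A sketch of how such a plot can be built: each fileid starts with a four-digit year (for example `1789-Washington.txt`), so the year can be sliced off and paired with every word that begins with one of the target strings. The helper names are hypothetical, and the plotting function assumes `nltk.download('inaugural')` has been run and matplotlib is installed.

```python
def target_year_pairs(words_by_fileid, targets=("america", "citizen")):
    """Yield (target, year) for every word that starts with a target prefix."""
    for fileid, words in words_by_fileid.items():
        year = fileid[:4]  # fileids look like '1789-Washington.txt'
        for word in words:
            for target in targets:
                # Prefix match, so "American" and "citizens" are counted too.
                if word.lower().startswith(target):
                    yield target, year


def plot_inaugural_usage():
    """Plot how often each target occurs in each address."""
    # Imported here so target_year_pairs works without NLTK installed.
    import nltk
    from nltk.corpus import inaugural

    words_by_fileid = {f: inaugural.words(f) for f in inaugural.fileids()}
    cfd = nltk.ConditionalFreqDist(target_year_pairs(words_by_fileid))
    cfd.plot()
    return cfd
```

The counts are raw, not normalized for the length of each address, which is worth keeping in mind when reading the plot.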
Annotated Text Corpora
Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth.
• Consult http://www.nltk.org/data for information about downloading them.
Corpora in Other Languages
NLTK comes with corpora for many languages, though in some cases we need to learn how to manipulate character encodings in Python.
the Floresta Sinta(c)tica Corpus, http://www.linguateca.pt/Floresta/ (cf. http://nltk.googlecode.com/svn/trunk/doc/howto/portuguese_en.html)
the CESS-ESP Treebank, with 6030 parsed sentences
bangla, hindi, marathi, telugu
The corpus, udhr, contains the Universal Declaration of Human Rights in over 300 languages.
• The fileids include information about the character encoding used in the file, such as UTF8 or Latin1.
Words having five or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
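Figures like these can be read off a cumulative word-length distribution. The sketch below computes the share directly for any word list and, separately, draws the cumulative plot with NLTK. The language fileids are quoted as I recall them from the NLTK book and should be checked against `udhr.fileids()`. The helper names are hypothetical, and plotting assumes `nltk.download('udhr')` has been run and matplotlib is installed.

```python
def share_at_most(words, max_len):
    """Fraction of word tokens whose length is at most `max_len` letters."""
    words = list(words)
    short = sum(1 for w in words if len(w) <= max_len)
    return short / len(words)


# Assumed fileid stems; each is combined with the '-Latin1' encoding suffix.
LANGUAGES = ["Chickasaw", "English", "German_Deutsch",
             "Greenlandic_Inuktikut", "Hungarian_Magyar", "Ibibio_Efik"]


def plot_word_lengths(languages=LANGUAGES):
    """Draw one cumulative word-length curve per language."""
    # Imported here so share_at_most works without NLTK installed.
    import nltk
    from nltk.corpus import udhr

    cfd = nltk.ConditionalFreqDist(
        (lang, len(word))
        for lang in languages
        for word in udhr.words(lang + "-Latin1")
    )
    cfd.plot(cumulative=True)
    return cfd
```

For a single language, `share_at_most(udhr.words('German_Deutsch-Latin1'), 5)` would give the corresponding fraction directly.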
Text Corpus Structure
Common structures
There is a difference between some of the corpus access methods:
Loading Your Own Corpus
Load your own collection of text files.
• Use your own path in place of /usr/share/dict.
Another example
• a corpus reader for corpora that consist of parenthesis-delineated parse trees
• use your own path in place of /corpora/penntreebank/parsed/mrg/wsj
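Both loading examples can be sketched with NLTK's `PlaintextCorpusReader` and `BracketParseCorpusReader`, each of which takes a root directory and a regular expression selecting files. The wrapper names below are hypothetical, the file pattern is the one commonly used for Penn Treebank WSJ files, and `matches` is only a rough stand-in (an assumption) for how a reader selects files by relative path.

```python
import re

# Pattern for treebank files such as '00/wsj_0001.mrg', relative to the corpus root.
FILE_PATTERN = r".*/wsj_.*\.mrg"


def open_plaintext_corpus(corpus_root, pattern=r".*"):
    """Wrap the plain-text files under `corpus_root` that match `pattern`."""
    # Imported here so the pattern helper below works without NLTK installed.
    from nltk.corpus import PlaintextCorpusReader

    return PlaintextCorpusReader(corpus_root, pattern)


def open_treebank_corpus(corpus_root, pattern=FILE_PATTERN):
    """Wrap a directory of bracketed (parenthesis-delineated) parse trees."""
    from nltk.corpus import BracketParseCorpusReader

    return BracketParseCorpusReader(corpus_root, pattern)


def matches(pattern, relative_path):
    """Check a candidate relative path against a file pattern."""
    return re.fullmatch(pattern, relative_path) is not None
```

A typical session would be `wordlists = open_plaintext_corpus('/usr/share/dict')` followed by `wordlists.fileids()`, with the path replaced by a directory that exists on your machine. The resulting readers offer the same `words()` and `sents()` access as the built-in corpora, and the treebank reader additionally provides `parsed_sents()`.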
Conditional Frequency Distributions
• Conditions and Events
• Counting Words by Genre
• Plotting and Tabulating Distributions
• Generating Random Text with Bigrams
Conditions and Events
While a frequency distribution counts observable events, a conditional frequency distribution needs to pair each event with a condition.
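The pairing of conditions with events can be illustrated in plain Python: each (condition, event) pair increments a counter kept separately for its condition. This is a sketch of the idea behind `nltk.ConditionalFreqDist`, not its implementation, and the function names are made up for illustration.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Group (condition, event) pairs into one counter per condition."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table


def with_nltk(pairs):
    """Build the equivalent NLTK object from the same pairs."""
    # Imported here so conditional_counts works without NLTK installed.
    import nltk

    cfd = nltk.ConditionalFreqDist(pairs)
    # cfd.conditions() lists the conditions; cfd[condition] is a FreqDist.
    return cfd
```

For example, with pairs such as `('news', 'The')` and `('romance', 'They')`, the conditions are the genres and the events are the words observed in each genre.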
Counting Words by Genre
Plotting and Tabulating Distributions
1,638 words of the English text have nine or fewer letters.
Generating Random Text with Bigrams
Create a table of bigrams using a conditional frequency distribution.
Example 2-1. Generating random text
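The idea can be sketched in plain Python, in the spirit of the example named above but not a copy of it: treat each word as a condition, count the words that follow it, and repeatedly step to the most frequent successor. The names `bigram_table` and `generate` are hypothetical. Note that, despite the title, this procedure is deterministic, so it can get stuck in a loop.

```python
from collections import Counter, defaultdict


def bigram_table(words):
    """Map each word to a counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table


def generate(table, word, num=15):
    """Return up to `num` words, always stepping to the most frequent successor."""
    output = []
    for _ in range(num):
        output.append(word)
        followers = table.get(word)
        if not followers:  # no successor was ever observed for this word
            break
        word = followers.most_common(1)[0][0]
    return output
```

With NLTK, the same table can be built as `nltk.ConditionalFreqDist(nltk.bigrams(text))`, and `cfd[word].max()` then returns the most likely next word. A text such as `nltk.corpus.genesis.words('english-kjv.txt')` is a common choice of input, assuming the genesis corpus has been downloaded.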
Summary (1/2)
Accessing Text Corpora
• Gutenberg Corpus
• Web and Chat Text
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
• Annotated Text Corpora
• Corpora in Other Languages
• Text Corpus Structure
• Loading Your Own Corpus
Summary (2/2)
Conditional Frequency Distributions
• Conditions and Events
• Counting Words by Genre
• Plotting and Tabulating Distributions
• Generating Random Text with Bigrams