SIMS 290-2: Applied Natural Language Processing, Marti Hearst, Sept 1, 2004



SIMS 290-2: Applied Natural Language Processing

Marti Hearst
Sept 1, 2004





– How shall we transform a huge text collection?
– Levels of Language
– Intro to NLTK and Python



The Enron Email Archive

Background
– Originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
  – ~500,000 messages
  – Salon article: http://www.salon.com/news/feature/2003/10/14/enron/index_np.html
– Later purchased by Leslie Kaelbling at MIT
– People at SRI, notably Melinda Gervasio, cleaned it up
  – No attachments
  – Some messages have been deleted "as part of a redaction effort due to requests from affected employees".
  – Invalid email addresses were converted to [email protected]
– Posted online for research on email by William Cohen at
  – http://www-2.cs.cmu.edu/~enron/
– Paper describing the dataset:
  – The Enron Corpus: A New Dataset for Email Classification Research, Klimt and Yang, ECML 2004 http://www-2.cs.cmu.edu/~bklimt/papers/2004_klimt_ecml.pdf



The Enron Email Archive

A valuable resource
– No other large open email corpus for research

A sensitive resource
– We need to be respectful and careful about how we treat this information

We can add value
– Idea: this class produces something more valuable and interesting than what we started with.
– Researchers and practitioners will build on our results



The Enron Email Archive

So … what’s in there?
– 500,000 messages.
– Let’s search (on a subset of the collection): http://quasi.berkeley.edu/anlp/enron_search.html

Now … what more would we like to have?


Slide adapted from Robert Berwick's

Levels of Language

Sound Structure (Phonetics and Phonology)
– The sounds of speech and their production
– The systematic way that sounds are differently realized in different environments.

Word Structure (Morphology)
– From morphos = shape (not transform, as in morph)
– Analyzes how words are formed from minimal units of meaning; also derivational rules
  – dog + s = dogs; eat, eats, ate

Phrase Structure (Syntax)
– From the Greek syntaxis, arrange together
– Describes grammatical arrangements of words into hierarchical structure


Slide adapted from Robert Berwick's

Levels of Language

Thematic Structure
– Getting closer to meaning
– Who did what to whom
  – Subject, object, predicate

Semantic Structure
– How the lower levels combine to convey meaning

Pragmatics and Discourse Structure
– How language is used across sentences.


Slide adapted from Robert Berwick's

Parsing at Every Level

Transforming from a surface representation to an underlying representation

It’s not straightforward to do any of these mappings!
– Ambiguity at every level
  – Word: is “saw” a verb or noun?
  – Phrase: “I saw the guy on the hill with the telescope.” Who is on the hill?
  – Semantic: which hill?



Python and NLTK

The following slides are from Diane Litman’s lecture


Slide by Diane Litman

Python and Natural Language Processing

Python is a great language for NLP:
– Simple
– Easy to debug:
  – Exceptions
  – Interpreted language
– Easy to structure
  – Modules
  – Object oriented programming
– Powerful string manipulation


Slide by Diane Litman

Modules and Packages

Python modules “package program code and data for reuse.” (Lutz)
– Similar to library in C, package in Java.

Python packages are hierarchical modules (i.e., modules that contain other modules).

Three commands for accessing modules:
1. import
2. from…import
3. reload


Slide by Diane Litman

Modules and Packages: import

The import command loads a module:

# Load the regular expression module
>>> import re

To access the contents of a module, use dotted names:

# Use the search function from the re module (text is any string)
>>> re.search(r'\w+', text)

To list the contents of a module, use dir:

>>> dir(re)
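As a minimal sketch of the three steps on this slide (load, dotted access, dir), the snippet below uses only the standard re module. The sample string and the variable names are assumptions made for illustration, not part of the original slides.

```python
import re

# Hypothetical sample string, chosen only for this illustration
text = "Applied Natural Language Processing"

# Dotted name: search is looked up inside the re module's namespace
match = re.search(r"\w+", text)
first_word = match.group()

# dir returns the list of names that the module defines
has_search = "search" in dir(re)

print(first_word)
print(has_search)
```

Because search is reached through the dotted name re.search, it cannot collide with a function named search defined elsewhere in the program.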



Slide by Diane Litman

Modules and Packages: from…import

The from…import command loads individual functions and objects from a module:

# Load the search function from the re module
>>> from re import search

Once an individual function or object is loaded with from…import, it can be used directly:

# Use the search function from the re module (text is any string)
>>> search(r'\w+', text)
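The following sketch shows the same idea end to end: names loaded with from…import are called without the module prefix. The sample sentence and variable names are assumptions for illustration; findall is a second function from the standard re module, loaded here only to show that several names can be imported at once.

```python
from re import search, findall

# Hypothetical sample string, chosen only for this illustration
text = "my dog likes his dog"

# Both functions are now plain names in the current namespace
first_match = search(r"\w+", text).group()
all_words = findall(r"\w+", text)

print(first_match)
print(all_words)
```

The trade-off is that these names now share a namespace with user-defined functions, so a local function called search would silently replace the imported one.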


Slide by Diane Litman

Import vs. from…import

import
– Keeps module functions separate from user functions.
– Requires the use of dotted names.
– Works with reload.

from…import
– Puts module functions and user functions together.
– More convenient names.
– Does not work with reload.


Slide by Diane Litman

Modules and Packages: reload

If you edit a module, you must use the reload command before the changes become visible in Python:

>>> import mymodule

>>> reload(mymodule)   # in Python 3: importlib.reload(mymodule)

The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
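The behavior described above can be checked with a small experiment. This is a sketch under stated assumptions: it targets Python 3, where reload is importlib.reload rather than a builtin; the module name mymodule_demo, the greet function, and the temporary directory are all hypothetical and created only for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Assumption for this demo: skip bytecode caching so the edited source is always re-read
sys.dont_write_bytecode = True

# Create a throwaway directory and make it importable
tmp_dir = tempfile.mkdtemp()
sys.path.insert(0, tmp_dir)
module_path = os.path.join(tmp_dir, "mymodule_demo.py")


def write_module(body):
    """Write the (hypothetical) module source and refresh the import caches."""
    with open(module_path, "w") as handle:
        handle.write(body)
    importlib.invalidate_caches()


# First version of the module
write_module("def greet():\n    return 'old'\n")
import mymodule_demo
from mymodule_demo import greet

# "Edit" the module, then reload it
write_module("def greet():\n    return 'new version'\n")
importlib.reload(mymodule_demo)

dotted_result = mymodule_demo.greet()  # looked up in the reloaded module
direct_result = greet()                # still bound to the function loaded earlier

print(dotted_result)
print(direct_result)
```

The dotted call sees the edited code, while the name bound by from…import keeps pointing at the function object that existed before the reload.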


Slide by Diane Litman

Introduction to NLTK

The Natural Language Toolkit (NLTK) provides:
– Basic classes for representing data relevant to natural language processing.
– Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
– Standard implementations of each task, which can be combined to solve complex problems.


Slide by Diane Litman

NLTK: Example Modules

– nltk.token: processing individual elements of text, such as words or sentences.
– nltk.probability: modeling frequency distributions and probabilistic systems.
– nltk.tagger: tagging tokens with supplemental information, such as parts of speech or wordnet sense tags.
– nltk.parser: high-level interface for parsing texts.
– nltk.chartparser: a chart-based implementation of the parser interface.
– nltk.chunkparser: a regular-expression based surface parser.


Slide by Diane Litman

NLTK: Top-Level Organization

NLTK is organized as a flat hierarchy of packages and modules.
Each module provides the tools necessary to address a specific task.
Modules contain two types of classes:
– Data-oriented classes are used to represent information relevant to natural language processing.
– Task-oriented classes encapsulate the resources and methods needed to perform a specific task.


Slide by Diane Litman

To the First Tutorials

– Tokens and Tokenization
– Frequency Distributions


Slide by Diane Litman

The Token Module

It is often useful to think of a text in terms of smaller elements, such as words or sentences.
The nltk.token module defines classes for representing and processing these smaller elements.
What might be other useful smaller elements?


Slide by Diane Litman

Tokens and Types

The term word can be used in two different ways:
1. To refer to an individual occurrence of a word
2. To refer to an abstract vocabulary item

For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items.
To avoid confusion use more precise terminology:
1. Word token: an occurrence of a word
2. Word type: a vocabulary item
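The token/type distinction can be checked directly on the example sentence. This is a plain-Python sketch that assumes simple whitespace splitting is good enough for this sentence; it does not use NLTK, and the variable names are illustrative.

```python
sentence = "my dog likes his dog"

# Tokens: every occurrence counts, so "dog" appears twice
tokens = sentence.split()

# Types: each distinct vocabulary item counts once
types = set(tokens)

num_tokens = len(tokens)
num_types = len(types)

print(num_tokens, num_types)
```

The two counts differ by one because the two occurrences of "dog" are separate tokens of a single type.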


Slide by Diane Litman

Tokens and Types (continued)

In NLTK, tokens are constructed from their types using the Token constructor:

>>> from nltk.token import *
>>> my_word = 'dog'
>>> my_word_token = Token(TEXT=my_word)
'dog'@[?]


Slide by Diane Litman

Text Locations

A text location @[s:e] specifies a region of a text:
– s is the start index
– e is the end index

The text location @[s:e] specifies the text beginning at s, and including everything up to (but not including) the text at e.
This definition is consistent with Python slice.
Think of indices as appearing between elements:

   I   saw   a   man
 0   1     2   3     4

Shorthand notation when location width = 1.
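Since the slide ties text locations to Python slices, the half-open convention can be illustrated with an ordinary list of words. This sketch uses plain Python rather than NLTK's location classes; the helper name text_at is hypothetical.

```python
words = "I saw a man".split()


def text_at(tokens, start, end):
    """Return the tokens in the region [start:end), end index excluded."""
    return tokens[start:end]


# Location [1:3] covers the words between index 1 and index 3
region = text_at(words, 1, 3)

# A location of width 1 selects a single word
single = text_at(words, 3, 4)

# Adjacent locations tile the text with no overlap and no gap
rejoined = text_at(words, 0, 2) + text_at(words, 2, 4)

print(region)
print(single)
```

Placing indices between elements is what makes adjacent regions such as [0:2] and [2:4] fit together exactly.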


Slide by Diane Litman

Text Locations (continued)

Indices can be based on different units:

Locations can be tagged with sources (files, other text locations – e.g., the first word of the first sentence in the file)

Location member functions:






Slide by Diane Litman

Tokenization

The simplest way to represent a text is with a single string.
– It is difficult to process text in this format.
Often, it is more convenient to work with a list of tokens.
The task of converting a text from a single string to a list of tokens is known as tokenization.


Slide by Diane Litman

Tokenization (continued)

Tokenization is harder than it seems
– I’ll see you in New York.
– The aluminum-export ban.

The simplest approach is to use “graphic words” (i.e., separate words using whitespace)
Another approach is to use regular expressions to specify which substrings are valid words.
NLTK provides a generic tokenization interface: TokenizerI
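The two approaches can be compared on the example sentences above. This is a sketch in plain Python, not NLTK's tokenizers; the regular expression is one possible choice (word characters with an optional apostrophe suffix), picked here as an assumption for illustration.

```python
import re

sentences = ["I'll see you in New York.", "The aluminum-export ban."]

# One possible word pattern: letters/digits, optionally followed by 'xx
WORD_PATTERN = re.compile(r"\w+(?:'\w+)?")

# "Graphic words": split on whitespace, punctuation stays attached
whitespace_tokens = [s.split() for s in sentences]

# Regular expression: keep only substrings that match the word pattern
regex_tokens = [WORD_PATTERN.findall(s) for s in sentences]

for ws, rx in zip(whitespace_tokens, regex_tokens):
    print(ws)
    print(rx)
```

Whitespace splitting leaves the final period attached ("York.", "ban."), while this pattern drops punctuation and splits "aluminum-export" in two. Neither treats "New York" as a single unit, which is part of why tokenization is harder than it seems.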


Slide by Diane Litman

TokenizerI defines a single method, tokenize, which takes a string and returns a list of tokens.
The tokenize method is independent of the level of tokenization and the implementation algorithm.
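To make the idea of a single-method interface concrete, here is a hypothetical sketch in plain Python. The class names below are invented for illustration and are not NLTK's actual classes; only the shape (one tokenize method that takes a string and returns a list) follows the slide.

```python
import re
from abc import ABC, abstractmethod


class TokenizerSketch(ABC):
    """Hypothetical stand-in for an interface like TokenizerI (not NLTK's class)."""

    @abstractmethod
    def tokenize(self, text):
        """Take a string and return a list of tokens."""


class WhitespaceTokenizerSketch(TokenizerSketch):
    """Word-level tokens obtained by splitting on whitespace."""

    def tokenize(self, text):
        return text.split()


class RegexpTokenizerSketch(TokenizerSketch):
    """Tokens are the substrings that match a caller-supplied pattern."""

    def __init__(self, pattern):
        self._pattern = re.compile(pattern)

    def tokenize(self, text):
        return self._pattern.findall(text)


def count_tokens(tokenizer, text):
    """Works with any tokenizer, whatever its level or algorithm."""
    return len(tokenizer.tokenize(text))


text = "I saw a man. He waved."
word_tokenizer = WhitespaceTokenizerSketch()
# Sentence level: a non-space start, then everything up to ., ! or ?
sentence_tokenizer = RegexpTokenizerSketch(r"[^.!?\s][^.!?]*[.!?]")

word_count = count_tokens(word_tokenizer, text)
sentences = sentence_tokenizer.tokenize(text)

print(word_count)
print(sentences)
```

The caller count_tokens never needs to know whether it received a word-level or sentence-level tokenizer, which is the independence the slide describes.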



For Next Week

Monday: holiday, no class

Kevin is still installing the software
– I will send email with details when ready
– Probably by the end of today

Sign up for the email list!
– Mail to: [email protected]
– in msg body: subscribe anlp

For Wed Sept 8
– Do exercises 1-3 in Tutorial 2 (Tokenizing)
