SIMS 290-2: Applied Natural Language Processing
Marti Hearst
Sept 8, 2004
Today
  Tokenizing using Regular Expressions
  Elementary Morphology
  Frequency Distributions in NLTK
Modified from Dorr and Habash (after Jurafsky and Martin)
Tokenizing in NLTK
The Whitespace Tokenizer doesn’t work very well
What are some of the problems?
NLTK provides an easy way to incorporate regex’s into your tokenizer
  Uses Python’s regex package (re)
  http://docs.python.org/lib/re-syntax.html
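To make the question about whitespace tokenizing concrete, here is a minimal sketch. It uses plain Python `str.split` as a stand-in for a whitespace tokenizer (an assumption, so that the example runs without NLTK), and the sample sentence is chosen only for illustration.

```python
# Whitespace-only tokenization, using str.split as a stand-in
# for a whitespace tokenizer (assumption: no NLTK required).
text = "This is the girl's depart- ment store."
tokens = text.split()
print(tokens)
# Two visible problems: the end-of-line hyphenation "depart- ment"
# is split into two tokens, and the final period stays attached
# to the last word instead of becoming its own token.
```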
Regex’s for Tokenizing
Build up your recognizer piece by piece
  Make a string of regex’s combined with OR’s
  Put each one in a group (surrounded by parens)
Things to recognize:
  URLs
  words with hyphens in them
  words in which hyphens should be removed (end of line hyphens)
  Numerical terms
  Words with apostrophes
Regex’s for Tokenizing
Here are some I put together:
  url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
  » Allows port number but no argument variables.
  hyphen = r'(\w+\-\s?\w+)'
  » Allows for a space after the hyphen
  apostro = r'(\w+\'\w+)'
  numbers = r'((\$|#)?\d+(\.)?\d+%?)'
  » Needs to handle large numbers with commas
  punct = r'([^\w\s]+)'
  wordr = r'(\w+)'
A nice Python trick:
  regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|")
  – Makes one string in which a “|” goes in between each substring
Regex’s for Tokenizing
More code:
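import string
from nltk.token import *
from nltk.tokenizer import *
# The token text includes "store" so that it matches the output shown below
t = Token(TEXT='This is the girl\'s depart- ment store.')
# url, hyphen, apostro, numbers, wordr, punct are the patterns from the previous slide
regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|")
RegexpTokenizer(regexp,SUBTOKENS='WORDS').tokenize(t)
print t['WORDS']
[<This>, <is>, <the>, <girl's>, <depart- ment>, <store>, <.>]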
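The same idea can be tried without NLTK. The sketch below is only an approximation of the slide code: it reuses the patterns from the previous slide, but `"|".join(...)` replaces `string.join`, and the standard library's `re.finditer` stands in for the RegexpTokenizer (both substitutions are assumptions made so the example runs on a current Python).

```python
import re

# Patterns copied from the slides.
url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
hyphen = r'(\w+\-\s?\w+)'
apostro = r'(\w+\'\w+)'
numbers = r'((\$|#)?\d+(\.)?\d+%?)'
punct = r'([^\w\s]+)'
wordr = r'(\w+)'

# One big alternation; order matters because the first alternative
# that matches at a given position wins.
regexp = "|".join([url, hyphen, apostro, numbers, wordr, punct])

def tokenize(text):
    """Return every non-overlapping match of the combined pattern."""
    return [m.group(0) for m in re.finditer(regexp, text)]

tokens = tokenize("This is the girl's depart- ment store.")
print(tokens)
```

Because `wordr` comes after `hyphen` and `apostro` in the list, "girl's" and "depart- ment" are kept together instead of being cut at the first word boundary.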
Tokenization Issues
Sentence Boundaries
  Include parens around sentences?
  What about quotation marks around sentences?
  Periods – end of line or not?
  – We’ll study this in detail in a couple of weeks.
Proper Names
  What to do about
  – “New York-New Jersey train”?
  – “California Governor Arnold Schwarzenegger”?
Clitics and Contractions
Morphology
Morphology:
  The study of the way words are built up from smaller meaning units.
Morphemes:
  The smallest meaningful units in the grammar of a language.
Contrasts:
  Derivational vs. Inflectional
  Regular vs. Irregular
  Concatenative vs. Templatic (root-and-pattern)
A useful resource:
  Glossary of linguistic terms by Eugene Loos
  http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
Examples (English)
“unladylike”
  3 morphemes, 4 syllables
  – un- ‘not’
  – lady ‘(well behaved) female adult human’
  – -like ‘having the characteristics of’
  Can’t break any of these down further without distorting the meaning of the units
“technique”
  1 morpheme, 2 syllables
“dogs”
  2 morphemes, 1 syllable
  – -s, a plural marker on nouns
Morpheme Definitions
Root
  The portion of the word that:
  – is common to a set of derived or inflected forms, if any, when all affixes are removed
  – is not further analyzable into meaningful elements
  – carries the principal portion of meaning of the words
Stem
  The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
Affix
  A bound morpheme that is joined before, after, or within a root or stem.
Clitic
  A morpheme that functions syntactically like a word, but does not appear as an independent phonological word
  – Spanish: un beso, las aguas
  – English: Hal’s (genitive marker)
Inflectional vs. Derivational
Word Classes
  Parts of speech: noun, verb, adjective, etc.
  Word class dictates how a word combines with morphemes to form new words
Inflection:
  Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
  – Doesn’t change the word class
  – Usually produces a predictable, nonidiosyncratic change of meaning.
Derivation:
  The formation of a new word or inflectable stem from another word or stem.
Inflectional Morphology
Adds: tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Examples
  come is inflected for person and number:
  – The pizza guy comes at noon.
  las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s
  – las manzanas rojas (‘the red apples’)
Derivational Morphology
Nominalization (formation of nouns from other parts of speech, primarily verbs in English):
  computerization
  appointee
  killer
  fuzziness
Formation of adjectives (primarily from nouns)
  computational
  clueless
  embraceable
Difficult cases:
  building: from which sense of “build”?
A resource:
  CatVar: Categorial Variation Database
  http://clipdemos.umiacs.umd.edu/catvar
Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
  hope+ing -> hoping
  hop -> hopping
Affixes
  Prefixes: Antidisestablishmentarianism
  Suffixes: Antidisestablishmentarianism
  Infixes: hingi (borrow) – humingi (borrower) in Tagalog
  Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languages
  uygarlaştıramadıklarımızdanmışsınızcasına
  uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
  Behaving as if you are among those whom we could not cause to become civilized
Templatic Morphology
Roots and Patterns
  Example: Hebrew verbs
Root:
  – Consists of 3 consonants CCC
  – Carries basic meaning
Template:
  – Gives the ordering of consonants and vowels
  – Specifies semantic information about the verb
    » Active, passive, middle voice
Example:
  – lmd (to learn or study)
    » CaCaC -> lamad (he studied)
    » CiCeC -> limed (he taught)
    » CuCaC -> lumad (he was taught)
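The root-and-pattern idea can be sketched in a few lines: each "C" slot in the template is filled, in order, with the next consonant of the root, and the template's vowels are copied through. The function name `apply_template` is hypothetical, and the sketch covers only the transliterated examples on this slide.

```python
def apply_template(root, template):
    """Fill each 'C' slot in the template with the next root consonant."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    # Vowels (and any other template letters) are copied unchanged.
    return "".join(next(consonants) if slot == "C" else slot
                   for slot in template)

for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```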
Nouns and Verbs (in English)
Nouns have simple inflectional morphology
  cat
  cat+s, cat+’s
Verbs have more complex morphology
Nouns and Verbs (in English)
Nouns
  Have simple inflectional morphology
  Cat/Cats
  Mouse/Mice, Ox/Oxen, Goose/Geese
Verbs
  More complex morphology
  Walk/Walked
  Go/Went, Fly/Flew
Regular (English) Verbs
Morphological Form Classes     Regularly Inflected Verbs
Stem                           walk     merge    try     map
-s form                        walks    merges   tries   maps
-ing form                      walking  merging  trying  mapping
Past form or -ed participle    walked   merged   tried   mapped
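The regular forms in the table can be produced by a handful of spelling rules. The sketch below is a deliberately simplified set, written only to reproduce the four verbs on this slide; the function names are hypothetical, and the consonant-doubling test (consonant-vowel-consonant ending) ignores stress, so it would wrongly double in words such as "visit".

```python
VOWELS = "aeiou"

def ends_consonant_y(stem):
    """True for stems like 'try' (consonant followed by final y)."""
    return len(stem) >= 2 and stem[-1] == "y" and stem[-2] not in VOWELS

def doubles_final_consonant(stem):
    """True for stems ending consonant-vowel-consonant, e.g. 'map'."""
    if len(stem) < 3:
        return False
    c1, v, c2 = stem[-3], stem[-2], stem[-1]
    return c1 not in VOWELS and v in VOWELS and c2 not in VOWELS + "wxy"

def s_form(stem):
    if ends_consonant_y(stem):
        return stem[:-1] + "ies"          # try -> tries
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    return stem + "s"

def ing_form(stem):
    if stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"          # merge -> merging
    if doubles_final_consonant(stem):
        return stem + stem[-1] + "ing"    # map -> mapping
    return stem + "ing"

def ed_form(stem):
    if stem.endswith("e"):
        return stem + "d"                 # merge -> merged
    if ends_consonant_y(stem):
        return stem[:-1] + "ied"          # try -> tried
    if doubles_final_consonant(stem):
        return stem + stem[-1] + "ed"     # map -> mapped
    return stem + "ed"

for stem in ("walk", "merge", "try", "map"):
    print(stem, s_form(stem), ing_form(stem), ed_form(stem))
```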
Irregular (English) Verbs
Morphological Form Classes    Irregularly Inflected Verbs
Stem                          eat      catch     cut
-s form                       eats     catches   cuts
-ing form                     eating   catching  cutting
Past form                     ate      caught    cut
-ed participle                eaten    caught    cut
“To love” in Spanish
Syntax and Morphology
Phrase-level agreement
  Subject-Verb
  – John studies hard (STUDY+3SG)
  Noun-Adjective
  – Las vacas hermosas
Sub-word phrasal structures
  שבספרינו
  ש+ב+ספר+ים+נו
  That+in+book+PL+Poss:1PL
  Which are in our books
Phonology and Morphology
Script Limitations
  Spoken English has 14 vowels
  – heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough
  English Alphabet has 5
  – Use vowel combinations: far fair fare
  – Consonantal doubling (hopping vs. hoping)
Computational Morphology
Approaches
  Lexicon only
  Rules only
  Lexicon and Rules
  – Finite-state Automata
  – Finite-state Transducers
Systems
  WordNet’s morphy
  PCKimmo
  – Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay
  – Accurate but complex
  – http://www.sil.org/pckimmo/
  Two-level morphology
  – Commercial version available from InXight Corp.
Background
  Chapter 3 of Jurafsky and Martin
  A short history of Two-Level Morphology
  – http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/
Porter Stemmer
Discount morphology
  So not all that accurate
Uses a series of cascaded rewrite rules
  ATIONAL -> ATE (relational -> relate)
  ING -> ε if stem contains vowel (motoring -> motor)
Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
  (m>0) ATIONAL -> ATE    relational -> relate
  (m>0) TIONAL -> TION    conditional -> condition
                          rational -> rational
  (m>0) ENCI -> ENCE      valenci -> valence
  (m>0) ANCI -> ANCE      hesitanci -> hesitance
  (m>0) IZER -> IZE       digitizer -> digitize
  (m>0) ABLI -> ABLE      conformabli -> conformable
  (m>0) ALLI -> AL        radicalli -> radical
  (m>0) ENTLI -> ENT      differentli -> different
  (m>0) ELI -> E          vileli -> vile
  (m>0) OUSLI -> OUS      analogousli -> analogous
  (m>0) IZATION -> IZE    vietnamization -> vietnamize
  (m>0) ATION -> ATE      predication -> predicate
  (m>0) ATOR -> ATE       operator -> operate
  (m>0) ALISM -> AL       feudalism -> feudal
  (m>0) IVENESS -> IVE    decisiveness -> decisive
  (m>0) FULNESS -> FUL    hopefulness -> hopeful
  (m>0) OUSNESS -> OUS    callousness -> callous
  (m>0) ALITI -> AL       formaliti -> formal
  (m>0) IVITI -> IVE      sensitiviti -> sensitive
  (m>0) BILITI -> BLE     sensibiliti -> sensible
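To show how one of these cascaded steps can work, here is a rough sketch covering only a subset of the rules listed above. It assumes the usual reading of the condition: m counts the vowel-consonant sequences in the stem that remains after the suffix is removed, a "y" counts as a vowel when it follows a consonant, the longest matching suffix is chosen, and nothing else is tried if its condition fails. The function names are hypothetical; this is not a complete Porter stemmer.

```python
# A subset of the suffix rules from the slide: (suffix, replacement).
RULES = [
    ("ational", "ate"),
    ("tional", "tion"),
    ("izer", "ize"),
    ("ization", "ize"),
    ("ation", "ate"),
    ("ator", "ate"),
    ("fulness", "ful"),
]
# Try longer suffixes first so that "ational" wins over "tional".
RULES.sort(key=lambda rule: len(rule[0]), reverse=True)

def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u,
    or a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. the m in [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m

def apply_step(word):
    """Apply the first (longest) matching suffix rule if m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
            return word  # suffix matched but the (m>0) condition failed
    return word

for w in ("relational", "conditional", "rational", "digitizer"):
    print(w, "->", apply_step(w))
```

The "rational" case shows why the condition matters: removing ATIONAL leaves the stem "r", which has m = 0, so the word is left alone.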
Porter Stemmer
Errors of Omission
  European        Europe
  analysis        analyzes
  matrices        matrix
  noise           noisy
  explain         explanation
Errors of Commission
  organization    organ
  doing           doe
  generalization  generic
  numerical       numerous
  university      universe
Computational Morphology
WORD       STEM (+FEATURES)*
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
ducks      (duck +N +PL) or (duck +V +3SG)
merging    merge +V +PRES-PART
caught     (catch +V +PAST-PART) or (catch +V +PAST)
Lexicon-only Morphology
acclaim acclaim $N$
acclaim acclaim $V+0$
acclaimed acclaim $V+ed$
acclaimed acclaim $V+en$
acclaiming acclaim $V+ing$
acclaims acclaim $N+s$
acclaims acclaim $V+s$
acclamation acclamation $N$
acclamations acclamation $N+s$
acclimate acclimate $V+0$
acclimated acclimate $V+ed$
acclimated acclimate $V+en$
acclimates acclimate $V+s$
acclimating acclimate $V+ing$
• The lexicon lists all surface level and lexical level pairs
• No rules …
• Analysis/Generation is easy
• Very large for English
• What about
  – Arabic or
  – Turkish or
  – Chinese?
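A lexicon-only analyzer is essentially a lookup table. The sketch below stores a few of the surface/lexical pairs from this slide in a Python dict; the names `LEXICON`, `analyze`, and `generate` are hypothetical, and a real lexicon would need one entry per surface form, which is the size problem noted above.

```python
# Surface form -> list of (lemma, tag) analyses, taken from the slide.
LEXICON = {
    "acclaim": [("acclaim", "N"), ("acclaim", "V+0")],
    "acclaimed": [("acclaim", "V+ed"), ("acclaim", "V+en")],
    "acclaiming": [("acclaim", "V+ing")],
    "acclaims": [("acclaim", "N+s"), ("acclaim", "V+s")],
    "acclamation": [("acclamation", "N")],
}

def analyze(surface):
    """Analysis: look the surface form up; unknown words get no analysis."""
    return LEXICON.get(surface, [])

def generate(lemma, tag):
    """Generation: find every surface form listed with this lemma and tag."""
    return [surface for surface, analyses in LEXICON.items()
            if (lemma, tag) in analyses]

print(analyze("acclaims"))
print(generate("acclaim", "V+ing"))
```

Ambiguity comes for free (a surface form simply lists several analyses), but any word missing from the table cannot be analyzed at all.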
For Next Week
Software status:
  Software on 3 lab machines, more coming
Lecture on Monday Sept 13:
  Part of speech tagging
For Wed Sept 15
  Do exercises 1-3 in Tutorial 2 (Tokenizing)
  Do the following exercises from Tutorial 3 (Tagging)
  – 1a-h
  – 2, 3, 4, 5a-b
  Turn them in online (I’ll have something available for this by then)