CS60057 Speech & Natural Language Processing
Lecture 1, 7/21/2005 Natural Language Processing 1
CS60057 Speech & Natural Language Processing
Autumn 2007
Lecture 2
26 July 2007
Why is NLP difficult?
Because natural language is highly ambiguous.
Syntactic ambiguity
  The president spoke to the nation about the problem of drug use in the schools from one coast to the other.
  has 720 parses. For example:
  “to the other” can attach to any of the previous NPs (e.g. “the problem”) or to the head verb: 6 places
  “from one coast” has 5 places to attach
  …
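The figure of 720 is consistent with a simple product of attachment choices. A minimal sketch, assuming that each further phrase has one fewer attachment site than the one before (the slide only gives the first two counts and elides the rest):

```python
from math import prod

# 6 and 5 are the counts given above; 4, 3, 2, 1 are an assumption
# that the same pattern continues for the remaining phrases.
attachment_sites = [6, 5, 4, 3, 2, 1]

# Independent attachment decisions multiply.
total_parses = prod(attachment_sites)
print(total_parses)
```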
Why is NLP difficult?
Word category ambiguity
  book --> verb? or noun?
Word sense ambiguity
  bank --> financial institution? building? or river side?
Words can mean more than the sum of their parts
  make up a story
Fictitious worlds
  People on Mars can fly.
Defining scope
  People like ice-cream.
  Does this mean that all (or some?) people like ice cream?
Language is changing and evolving
  I’ll email you my answer.
  This new S.U.V. has a compartment for your mobile phone.
  Googling, …
Dealing with Ambiguity
Four possible approaches:
Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels.
Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures.
Resolve Ambiguities
We will introduce models and algorithms to resolve ambiguities at different levels.
part-of-speech tagging -- deciding whether duck is a verb or a noun.
word-sense disambiguation -- deciding whether make means create or cook.
lexical disambiguation -- resolving part-of-speech ambiguity and resolving word-sense ambiguity are two important kinds of lexical disambiguation.
syntactic ambiguity -- “her duck” is an example of syntactic ambiguity, which can be addressed by probabilistic parsing.
Resolve Ambiguities (cont.)
I made her duck
(S (NP I) (VP (V made) (NP her) (NP duck)))
(S (NP I) (VP (V made) (NP (DET her) (N duck))))
Dealing with Ambiguity
Three approaches:
Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels.
Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures.
  Syntax proposes/semantics disposes approach
Probabilistic approaches based on making the most likely choices
Models and Algorithms
By models I mean the formalisms that are used to capture the various kinds of linguistic knowledge we need.
Algorithms are then used to manipulate the knowledge representations needed to tackle the task at hand.
Models to Represent Linguistic Knowledge
Different formalisms (models) are used to represent the required linguistic knowledge.
State Machines -- FSAs, HMMs, ATNs, RTNs
Formal Rule Systems -- Context Free Grammars, Unification Grammars, Probabilistic CFGs.
Logic-based Formalisms -- first order predicate logic, some higher order logic.
Models of Uncertainty -- Bayesian probability theory.
Algorithms
Many of the algorithms that we’ll study will turn out to be transducers: algorithms that take one kind of structure as input and output another.
Unfortunately, ambiguity makes this process difficult. This leads us to employ algorithms that are designed to handle ambiguity of various kinds.
Algorithms
In particular...
State-space search
  To manage the problem of making choices during processing when we lack the information needed to make the right choice
Dynamic programming
  To avoid having to redo work during the course of a state-space search
  CKY, Earley, Minimum Edit Distance, Viterbi, Baum-Welch
State Space Search
States represent pairings of partially processed inputs with partially constructed representations.
Goals are inputs paired with completed representations that satisfy some criteria.
As with most interesting problems, the spaces are normally too large to explore exhaustively.
  We need heuristics to guide the search
  Criteria to trim the space
Dynamic Programming
Don’t do the same work over and over.
Avoid this by building and making use of solutions to sub-problems that must be invariant across all parts of the space.
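Minimum edit distance, one of the algorithms named earlier in the lecture, is a compact illustration of reusing sub-problem solutions. The sketch below assumes unit cost for insertion, deletion and substitution (the slides do not fix the costs); the function name and table layout are illustrative.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] holds the cost of turning source[:i] into target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            # Each cell reuses three already-solved sub-problems.
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]


print(min_edit_distance("kitten", "sitting"))
```

Each table cell is computed once and then shared by every alignment that passes through it, which is the "don't do the same work over and over" idea.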
Languages
Languages: 39,000 languages and dialects (22,000 dialects in India alone)
Top languages: Chinese/Mandarin (885M), Spanish (332M), English (322M), Bengali (189M), Hindi (182M), Portuguese (170M), Russian (170M), Japanese (125M)
  Source: www.sil.org/ethnologue, www.nytimes.com
Internet: English (128M), Japanese (19.7M), German (14M), Spanish (9.4M), French (9.3M), Chinese (7.0M)
Usage: English (1999: 54%, 2001: 51%, 2003: 46%, 2005: 43%)
  Source: www.computereconomics.com
The Description of Language
Language = Words and Rules
  Dictionary (vocabulary) + Grammar
Dictionary
  set of words defined in the language
  open (dynamic)
  Traditional - paper based
  Electronic - machine readable dictionaries; can be obtained from paper-based
Grammar
  set of rules which describe what is allowable in a language
  Classic Grammars
    meant for humans who know the language
    definitions and rules are mainly supported by examples
    no (or almost no) formal description tools; cannot be programmed
  Explicit Grammar (CFG, Dependency Grammars, Link Grammars,...)
    formal description
    can be programmed & tested on data (texts)
Levels of (Formal) Description
6 basic levels (more or less explicitly present in most theories):
  and beyond (pragmatics/logic/...)
  meaning (semantics)
  (surface) syntax
  morphology
  phonology
  phonetics/orthography
Each level has an input and output representation
  output from one level is the input to the next (upper) level
  sometimes levels might be skipped (merged) or split
Phonetics/Orthography
Input:
  acoustic signal (phonetics) / text (orthography)
Output:
  phonetic alphabet (phonetics) / text (orthography)
Deals with:
  Phonetics:
    consonant & vowel (& others) formation in the vocal tract
    classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles
    intonation
  Orthography: normalization, punctuation, etc.
Phonology
Input:
  sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes]
Output:
  sequence of phonemes (~ (lexical) letters; in an abstract alphabet)
Deals with:
  relation between sounds and phonemes (units which might have some function on the upper level)
  e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)
Morphology
Input:
  sequence of phonemes (~ (lexical) letters)
Output:
  sequence of pairs (lemma, (morphological) tag)
Deals with:
  composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)
  e.g. quotations ~ quote/V + -ation(der.V->N) + NNS.
(Surface) Syntax
Input:
  sequence of pairs (lemma, (morphological) tag)
Output:
  sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms
Deals with:
  the relation between lemmas & morphological categories and the sentence structure
  uses syntactic categories such as Subject, Verb, Object,...
  e.g.: I/PP1 see/VB a/DT dog/NN ~ ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S
Meaning (semantics)
Input:
  sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)
Output:
  sentence structure (tree) with annotated nodes (semantic lemmas, (morphosyntactic) tags, deep functions)
Deals with:
  relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other categories
  e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)
...and Beyond
Input:
  sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
Output:
  logical form, which can be evaluated (true/false)
Deals with:
  assignment of objects from the real world to the nodes of the sentence structure
  e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...])[Time:bef 99/9/27/14:15][Place:39°19’40”N76°37’10”W]
Three Views
Three equivalent formal ways to look at what we’re up to (not including tables)
Regular Expressions
Regular Languages
Finite State Automata
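To make the equivalence concrete, the sketch below describes one regular language in two of these ways: as a regular expression and as a hand-written finite state automaton. The language itself ("b", then two or more "a", then "!") and all names are illustrative assumptions, not taken from the slides.

```python
import re

# View 1: a regular expression for the (assumed) example language.
SHEEP_REGEX = re.compile(r"baa+!")

# View 2: a finite state automaton for the same language.
# Keys are (state, input symbol); values are the next state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START_STATE = 0
ACCEPTING_STATES = {4}


def regex_accepts(text: str) -> bool:
    return SHEEP_REGEX.fullmatch(text) is not None


def fsa_accepts(text: str) -> bool:
    state = START_STATE
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no arc for this symbol: reject
            return False
    return state in ACCEPTING_STATES


for candidate in ["baa!", "baaaa!", "ba!", "baa", "abaa!"]:
    print(candidate, regex_accepts(candidate), fsa_accepts(candidate))
```

Both recognizers give the same verdict on every string, since they describe the same set of strings (the regular language).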
Transition
Finite-state methods are particularly useful in dealing with a lexicon.
Lots of devices, some with limited memory, need access to big lists of words.
So we’ll switch to talking about some facts about words and then come back to computational methods
MORPHOLOGY
Morphology
Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes (morph = shape, logos = word)
We can usefully divide morphemes into two classes
  Stems: The core meaning-bearing units
  Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions
    Prefix: un-, anti-, etc
    Suffix: -ity, -ation, etc
    Infix: inserted inside the stem
      Tagalog: um + hingi --> humingi
    Circumfix: precedes and follows the stem
English doesn’t stack many affixes. But Turkish can have words with a lot of suffixes.
Languages such as Turkish, which tend to string affixes together, are called agglutinative languages.
Surface and Lexical Forms
The surface level of a word represents the actual spelling of that word.
  geliyorum   eats   cats   kitabım
The lexical level of a word represents a simple concatenation of morphemes making up that word.
  gel +PROG +1SG   eat +AOR   cat +PLU   kitap +P1SG
Morphological processors try to find correspondences between lexical and surface forms of words.
  Morphological recognition/analysis – surface to lexical
  Morphological generation/synthesis – lexical to surface
Morphology: Morphemes & Order
Handles what is an isolated form in written text
Grouping of phonemes into morphemes
  sequence deliverables --> deliver, able and s (3 units)
Morpheme Combination
  certain combinations/sequencing possible, others not: deliver+able+s, but not able+derive+s; noun+s, but not noun+ing
  typically fixed (in any given language)
Inflectional & Derivational Morphology
We can also divide morphology up into two broad classes
  Inflectional
  Derivational
Inflectional morphology concerns the combination of stems and affixes where the resulting word
  Has the same word class as the original
  Serves a grammatical/semantic purpose different from the original
After a combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change.
  eat / eats   pencil / pencils
After a combination with a derivational morpheme, the meaning and the class of the actual stem usually change.
  compute / computer   do / undo   friend / friendly   Uygar / uygarlaş   kapı / kapıcı
Irregular changes may occur with derivational affixes.
Morphological Parsing
Morphological parsing is to find the lexical form of a word from its surface form.
  cats -- cat +N +PLU
  cat -- cat +N +SG
  goose -- goose +N +SG or goose +V
  geese -- goose +N +PLU
  gooses -- goose +V +3SG
  catch -- catch +V
  caught -- catch +V +PAST or catch +V +PP
There can be more than one lexical level representation for a given word. (ambiguity)
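The surface-to-lexical mapping above can be sketched as a plain lookup table. This is only a toy stand-in, assuming every word form is listed explicitly; the function names `parse` and `generate` are illustrative, and a real analyzer would use the finite-state machinery introduced later.

```python
from typing import Dict, List

# Surface form -> all of its lexical forms, copied from the examples above.
ANALYSES: Dict[str, List[str]] = {
    "cats": ["cat +N +PLU"],
    "cat": ["cat +N +SG"],
    "goose": ["goose +N +SG", "goose +V"],
    "geese": ["goose +N +PLU"],
    "gooses": ["goose +V +3SG"],
    "catch": ["catch +V"],
    "caught": ["catch +V +PAST", "catch +V +PP"],
}


def parse(surface: str) -> List[str]:
    """Surface to lexical: return every analysis, or [] if unknown."""
    return ANALYSES.get(surface.lower(), [])


def generate(lexical: str) -> List[str]:
    """Lexical to surface: return every surface form with this analysis."""
    return [surface for surface, forms in ANALYSES.items() if lexical in forms]


print(parse("caught"))              # two analyses: the word is ambiguous
print(generate("goose +N +PLU"))
```

Returning a list rather than a single string keeps the ambiguity visible to whatever processing level comes next.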
Morphological Analysis
Analyzing words into their linguistic components (morphemes).
Morphemes are the smallest meaningful units of language.
  cars --> car+PLU
  giving --> give+PROG
  AsachhilAma --> AsA+PROG+PAST+1st (I/We was/were coming)
Ambiguity: more than one alternative
  flies --> fly(VERB)+PROG
  flies --> fly(NOUN)+PLU
  mAtAlakare
Fly + s --> flys --> flies (y --> i rule)
Duckling
Go-getter --> get + er
Doer --> do + er
Beer --> ?
What knowledge do we need?
How do we represent it?
How do we compute with it?
Knowledge needed
Knowledge of stems or roots
  Duck is a possible root, not duckl
  We need a dictionary (lexicon)
Only some endings go on some words
  Do + er – ok
  Be + er – not ok
In addition, spelling change rules that adjust the surface form
  Get + er – double the t – getter
  Fox + s – insert e – foxes
  Fly + s – insert e – flys – y to i – flies
  Chase + ed – drop e – chased
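The four spelling changes listed above can be sketched as ordered string rules. The triggering conditions (which letters count as sibilant endings, when a final consonant doubles) are simplifying assumptions chosen to cover the slide's examples, not a complete account of English spelling; the function name `attach` is illustrative.

```python
VOWELS = set("aeiou")


def attach(stem: str, suffix: str) -> str:
    """Join stem and suffix, applying one spelling-change rule if it fits."""
    # Fly + s: consonant + y becomes i, and e is inserted (flies).
    if suffix == "s" and len(stem) >= 2 and stem[-1] == "y" and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"
    # Fox + s: insert e after x, s, z, ch or sh (foxes).
    if suffix == "s" and stem.endswith(("x", "s", "z", "ch", "sh")):
        return stem + "es"
    # Chase + ed: drop the stem-final e before a vowel-initial suffix (chased).
    if stem.endswith("e") and suffix[:1] in VOWELS:
        return stem[:-1] + suffix
    # Get + er: double the final consonant of a three-letter
    # consonant-vowel-consonant stem before a vowel-initial suffix (getter).
    if (
        len(stem) == 3
        and suffix[:1] in VOWELS
        and stem[0] not in VOWELS
        and stem[1] in VOWELS
        and stem[2] not in VOWELS
        and stem[2] not in "wxy"
    ):
        return stem + stem[-1] + suffix
    # No rule applies: plain concatenation.
    return stem + suffix


for stem, suffix in [("get", "er"), ("fox", "s"), ("fly", "s"), ("chase", "ed"), ("do", "er")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```

The rules only adjust spelling. Whether an ending may attach at all (do + er but not be + er) still has to come from the lexicon.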
Put all this in a big dictionary (lexicon)
  Turkish – approx. $600 \times 10^6$ forms
  Finnish – $10^7$
  Hindi, Bengali, Telugu, Tamil?
Besides, novel forms can always be constructed
  Anti-missile
  Anti-anti-missile
  Anti-anti-anti-missile
  ……..
Compounding of words – Sanskrit, German
Morphology: From Morphemes to Lemmas & Categories
Lemma: lexical unit, “pointer” to lexicon
  typically is represented as the “base form”, or “dictionary headword”
  possibly indexed when ambiguous/polysemous: state1 (verb), state2 (state-of-the-art), state3 (government)
  from one or more morphemes (“root”, “stem”, “root+derivation”, ...)
Categories: non-lexical
  small number of possible values (< 100, often < 5-10)
Morphology Level: The Mapping
Formally: $A^+ \to 2^{(L, C_1, C_2, \ldots, C_n)}$
  $A$ is the alphabet of phonemes ($A^+$ denotes any non-empty sequence of phonemes)
  $L$ is the set of possible lemmas, uniquely identified
  $C_i$ are morphological categories, such as:
    grammatical number, gender, case
    person, tense, negation, degree of comparison, voice, aspect, ...
    tone, politeness, ...
    part of speech (not quite morphological category, but...)
  $A$, $L$ and $C_i$ are obviously language-dependent
Morphological Analysis (cont.)
Relatively simple for English. But for many Indian languages, it may be more difficult.
Examples: Inflectional and Derivational Morphology.
Common tools: Finite-state transducers
Simple Rules
Adding in the Words
Derivational Rules
Parsing/Generation vs. Recognition
Recognition is usually not quite what we need.
  Usually if we find some string in the language we need to find the structure in it (parsing)
  Or we have some structure and we want to produce a surface form (production/generation)
Example
  From “cats” to “cat +N +PL” and back
Finite State Transducers
The simple story
  Add another tape
  Add extra symbols to the transitions
  On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.
FSTs
Transitions
c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing on the other
+PL:s means read +PL and write an s
c:c a:a t:t +N:ε +PL:s
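The five transitions above can be written down as a small two-tape machine that runs in either direction. This is a minimal sketch for this single path only; the state numbering, the arc tuple layout and the function names are assumptions made for illustration, and ε is represented by the empty string.

```python
# Each arc: (source state, lexical symbol, surface symbol, target state).
# An empty surface symbol stands for epsilon (write or read nothing).
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),
    (4, "+PL", "s", 5),
]
START_STATE = 0
FINAL_STATE = 5


def generate(lexical_symbols):
    """Read the lexical tape, write the surface tape (lexical to surface)."""
    state, output = START_STATE, []
    for symbol in lexical_symbols:
        for src, lex, surf, dst in ARCS:
            if src == state and lex == symbol:
                output.append(surf)
                state = dst
                break
        else:
            return None  # no arc matches: input rejected
    return "".join(output) if state == FINAL_STATE else None


def analyze(surface):
    """Read the surface tape, write the lexical tape (surface to lexical)."""
    state, position, output = START_STATE, 0, []
    while state != FINAL_STATE:
        for src, lex, surf, dst in ARCS:
            # An epsilon surface symbol matches without consuming input.
            if src == state and surface.startswith(surf, position):
                output.append(lex)
                position += len(surf)
                state = dst
                break
        else:
            return None
    return output if position == len(surface) else None


print(generate(["c", "a", "t", "+N", "+PL"]))
print(analyze("cats"))
```

The same arc list serves both directions; only the choice of which symbol of each pair is read and which is written changes.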
Typical Uses
Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
And we’ll write to the second tape using the other symbols on the transitions.
Ambiguity
Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
  Didn’t matter which path was actually traversed
In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result
Ambiguity
What’s the right parse for
  Unionizable
  Union-ize-able
  Un-ion-ize-able
Each represents a valid path through the derivational morphology machine.
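Both parses can be found by exhaustively trying every split into known pieces. The sketch below assumes a tiny hand-made morpheme inventory (not from the slides) and treats "iz" as the surface shape of -ize before -able, since the e is dropped in the spelling.

```python
# Surface piece -> morpheme it realizes (an assumed toy inventory).
SURFACE_TO_MORPHEME = {
    "un": "un",
    "ion": "ion",
    "union": "union",
    "iz": "ize",    # the e of -ize is dropped before -able
    "able": "able",
}


def segmentations(word):
    """Return every way to split word into known morphemes."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphemes
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in SURFACE_TO_MORPHEME:
            for rest in segmentations(word[end:]):
                results.append([SURFACE_TO_MORPHEME[piece]] + rest)
    return results


for parse in segmentations("unionizable"):
    print("-".join(parse))
```

Each returned list corresponds to one path through the machine; choosing among them needs knowledge from outside the morphology.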
Ambiguity
There are a number of ways to deal with this problem
  Simply take the first output found
  Find all the possible outputs (all paths) and return them all (without choosing)
  Bias the search so that only one or a few likely paths are explored
The Gory Details
Of course, it’s not as easy as “cat +N +PL” <-> “cats”
As we saw earlier there are geese, mice and oxen
But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
  Cats vs Dogs
  Fox and Foxes
Multi-Tape Machines
To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next
So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols
Generativity
Nothing really privileged about the directions.
We can write from one and read from the other or vice-versa.
One way is generation, the other way is analysis
Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
Lexical to Intermediate Level
Intermediate to Surface
The add-an-“e” rule, as in fox^s# <-> foxes#
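The rule can be approximated with a single regular-expression rewrite over the intermediate tape, where ^ marks a morpheme boundary and # the end of the word. The triggering context used here (x, s or z before ^s#) is an assumption; the slide only shows the fox example. Function names are illustrative.

```python
import re


def e_insertion(intermediate: str) -> str:
    """Rewrite x^s#, s^s#, z^s# as xes#, ses#, zes#; leave other input alone."""
    return re.sub(r"([xsz])\^s#", r"\1es#", intermediate)


def strip_boundaries(form: str) -> str:
    """Erase the boundary symbols to obtain the plain spelling."""
    return form.replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "fox#"]:
    rewritten = e_insertion(form)
    print(form, "->", rewritten, "->", strip_boundaries(rewritten))
```

Forms that do not match the context, such as cat^s#, pass through `e_insertion` unchanged.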
Foxes
Note
A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply.
Meaning that they are written out unchanged to the output tape.
Turns out the multiple tapes aren’t really needed; they can be compiled away.
Overall Scheme
We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
  Lexical level to intermediate forms
We have a larger set of machines that capture orthographic/spelling rules.
  Intermediate forms to surface forms
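The two stages can be sketched as two functions applied one after the other, in the generation direction. The lexicon entries, the single spelling rule and all names are toy assumptions; in a real system both stages would be transducers and could be composed into one machine.

```python
import re

# Stage 1 (assumed toy lexicon): lexical form -> intermediate form.
# Regular plurals get a boundary-marked suffix; irregulars are stored whole.
LEXICON = {
    "fox +N +SG": "fox#",
    "fox +N +PL": "fox^s#",
    "cat +N +PL": "cat^s#",
    "goose +N +PL": "geese#",
}


def lexical_to_intermediate(lexical: str) -> str:
    return LEXICON[lexical]


def intermediate_to_surface(intermediate: str) -> str:
    # Stage 2: one spelling rule (e-insertion after x, s, z), then
    # removal of the boundary symbols ^ and #.
    form = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)
    return form.replace("^", "").replace("#", "")


def generate(lexical: str) -> str:
    """Run the cascade: lexical -> intermediate -> surface."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))


for lexical in LEXICON:
    print(lexical, "->", lexical_to_intermediate(lexical), "->", generate(lexical))
```

Keeping irregular forms in the lexicon stage means the spelling-rule stage never has to know about them.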
Overall Scheme