Categorizing Unknown Words:
Using Decision Trees to Identify Names and Misspellings
Janine Toole
Simon Fraser University
Burnaby, BC, Canada
From ANLP-NAACL Proceedings, April 29-May 4, 2000 (pp. 173-179)

Goal: automatic categorization of unknown words
Unknown Words (UknWrds): word not contained in lexicon of NLP system
"unknown-ness" - property relative to NLP system

Motivation
Degraded system performance in presence of unknown words
Disproportionate effect possible
  • Min (1996) - only 0.6% of words in 300 e-mails misspelled
  • Result - 12% of the sentences contained an error (discussed in Min and Wilson, 1998)
Difficulties translating live closed captions (CC)
  • 5 seconds to transcribe dialogue, no post-edit

Reasons for unknown words
Proper name
Misspelling
Abbreviation or number
Morphological variant

And my favorite...
Misspoken words
Examples (courtesy H. K. Longmore):
  – *I'll phall you on the cone (call, phone)
  – *I did a lot of hiking by mysummer this self (myself this summer)

What to do?
Identify class of unknown word
Take action based on goals of system and class of word
  • Correct spelling
  • Expand abbr.
  • Convert number format

Overall System Architecture
Multiple components, one per category
Return confidence measure (Elworthy, 1998)
Evaluate results from each component to determine category
One reason for approach: take advantage of existing research

Simplified Version: Names & Spelling Errors
Decision tree architecture
  • combine multiple types of evidence about word
Results combined using weighted voting procedure
Evaluation: Live CC data - replete with wide variety of UknWrds

Name Identifier
Proper names ==> proper name bucket
Others ==> discard
PN: person, place, concept, typically requiring Caps in English

Problems
CC is ALL CAPS!
No confidence measure with existing PN Recognizers
Perhaps future PNRs will work?

Solution
Build custom PNR

Decision Trees
Highly explainable - readily understand features affecting analysis
Well suited for combining a variety of info.
Don't grow tree from seed - use IBM's Intelligent Miner suite
Ignore DT algorithm - point is application of DT

Proper Names - Features
10 features specified per UknWrd
  • POS and detailed POS of the UknWrd and of the 2 words on either side (+ and - 2 words)
  • Rule-based system for detailed tags
  • In-house statistical parser for POS
Would include feature indicating presence of Initial Upper Case if data had it
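
A minimal sketch of how the 10 features could be assembled, assuming each token arrives as a (word, POS, detailed POS) triple. The tag names, the `PAD` placeholder and the function name are hypothetical; the slides do not show the actual tag sets or the parser output format.

```python
from typing import Dict, List, Tuple

PAD = "NONE"  # hypothetical filler for positions outside the sentence


def name_features(tagged: List[Tuple[str, str, str]], i: int) -> Dict[str, str]:
    """POS and detailed POS for the word at index i and the 2 words on either side."""
    feats: Dict[str, str] = {}
    for offset in range(-2, 3):  # offsets -2, -1, 0, +1, +2
        j = i + offset
        if 0 <= j < len(tagged):
            _, pos, detailed = tagged[j]
        else:
            pos = detailed = PAD
        feats[f"pos[{offset:+d}]"] = pos
        feats[f"detailed_pos[{offset:+d}]"] = detailed
    return feats


# Illustrative all-caps caption fragment with made-up tags.
sentence = [
    ("SHARES", "NOUN", "noun-plural"),
    ("OF", "PREP", "prep"),
    ("XYLOGEN", "UNK", "unknown"),
    ("ROSE", "VERB", "verb-past"),
    ("TODAY", "ADV", "adv-time"),
]
features = name_features(sentence, 2)
```

Five positions times two tag types gives the 10 features; 8 of them describe the neighbouring words rather than the unknown word itself.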

Misspellings
Unintended, orthographically incorrect representation
Relative to NLP system
1 or more additions, deletions, substitutions, reversals, punctuation

Orthography
Word: orthography
or.thog.ra.phy \o.r-'tha:g-r*-fe-\ n
1a: the art of writing words with the proper letters according to standard usage
1b: the representation of the sounds of a language by written or printed symbols
2: a part of language study that deals with letters and spelling

Misspellings - Features
Derived from prior research (including own)
Abridged list of features used
  • Corpus freq., word length, edit distance, Ispell info, char seq. freq., Non-Engl. chars

Misspellings Features (cont.)
Word length - (Agirre et al., 1998)
Predictions for correct spelling more accurate if |w| > 4

Misspellings Features (cont.)
Edit distance
  • 1 edit distance == 1 substitution, addition, deletion, reversal
  • 80% of errors w/in 1 edit distance of intended word
  • 70% w/in 1 edit distance of intended word
Unix spell checker: ispell
  • edit distance = distance from UknWrd to closest ispell suggestion, or 30
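
The edit distance feature can be sketched as below. This assumes the suggestions have already been obtained from ispell and are passed in as a plain list, and it reads "or 30" as the value used when there is no suggestion at all; both are assumptions, as are the function names. The distance counts a reversal of two adjacent characters as a single edit (the "optimal string alignment" variant).

```python
from typing import List

NO_SUGGESTION = 30  # assumed fallback when the spell checker offers nothing


def edit_distance(a: str, b: str) -> int:
    """Substitutions, additions, deletions and adjacent reversals each cost 1."""
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # addition
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # reversal
    return d[len(a)][len(b)]


def edit_distance_feature(word: str, suggestions: List[str]) -> int:
    """Distance from the unknown word to its closest suggestion, or the fallback."""
    if not suggestions:
        return NO_SUGGESTION
    return min(edit_distance(word, s) for s in suggestions)
```

For example, `edit_distance_feature("temt", ["tempt", "tent"])` looks at both candidates and keeps the smaller distance.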

Misspellings Features (cont.)
Char. Seq. Freq.
  • wful, rql, etc.
  • composite of individual char. seq.
  • relevance to 1 tree vs. many
  • Non-English - Transmission noise in CC case, or Foreign names
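
One way to picture a character sequence frequency feature is sketched here. The slides do not say how the composite is formed, so the choices below (character trigrams, boundary markers, a plain average of corpus counts) are assumptions, and all names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def trigrams(word: str) -> List[str]:
    """Character trigrams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_counts(corpus_words: Iterable[str]) -> Counter:
    """Count how often each trigram occurs across the corpus words."""
    counts: Counter = Counter()
    for w in corpus_words:
        counts.update(trigrams(w))
    return counts


def char_seq_score(word: str, counts: Counter) -> float:
    """Composite score: average corpus count of the word's trigrams."""
    grams = trigrams(word)
    if not grams:
        return 0.0
    return sum(counts[g] for g in grams) / len(grams)


counts = build_counts(["awful", "lawful", "useful", "report"])
common = char_seq_score("awful", counts)
rare = char_seq_score("rql", counts)
```

A sequence such as "rql" never occurs in the toy corpus, so it scores lower than "awful"; a real corpus would be needed for the scores to mean anything.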

Decision Time
Misspelling module says not a misspelling, PNR says it's a name -> name
Both negative -> neither misspelling nor name
What if both are positive?
  • One with highest confidence measure wins
  • Confidence measure
    – per leaf, calculated from training data
    – correct predictions / total # of predictions at leaf
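
The decision rule above can be sketched as follows. The `Prediction` type and function names are hypothetical, the case where only the misspelling module is positive is filled in by symmetry, and sending exact ties to "misspelling" is an arbitrary choice the slides do not address.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    positive: bool     # did the component claim the word for its class?
    confidence: float  # confidence of the leaf that made the prediction


def leaf_confidence(correct: int, total: int) -> float:
    """Correct predictions / total predictions at a leaf, from training data."""
    return correct / total if total else 0.0


def categorize(misspelling: Prediction, name: Prediction) -> str:
    """Combine the two components' outputs into one category."""
    if misspelling.positive and name.positive:
        # Both claim the word: the higher confidence wins (ties: arbitrary).
        return "name" if name.confidence > misspelling.confidence else "misspelling"
    if name.positive:
        return "name"
    if misspelling.positive:
        return "misspelling"
    return "neither"


both = categorize(
    Prediction(True, leaf_confidence(60, 100)),
    Prediction(True, leaf_confidence(90, 100)),
)
```

Here both components are positive, and the name prediction wins because its leaf was right more often in training.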

Evaluation - Dataset
7000 cases of UknWrds
2.6 million word corpus
Live business news captions
70.4% manually ID'd as names
21.3% as misspellings
Rest - other types of UknWrds

Dataset (cont.)
70% of Dataset randomly selected as training corpus
Remainder (2100) for test corpus
Test data - 10 samples, random selection with replacement
Total of 10 test datasets
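
The split can be sketched like this. The slides do not give the size of each of the 10 test samples, so drawing samples as large as the test pool is an assumption, as are the function name, its parameters and the fixed seed.

```python
import random
from typing import List, Sequence, Tuple


def split_and_resample(
    cases: Sequence[int],
    train_frac: float = 0.7,
    n_samples: int = 10,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[List[int]]]:
    """Random train/test split, then test samples drawn with replacement."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    train, test_pool = shuffled[:cut], shuffled[cut:]
    # Assumed sample size: the same as the test pool itself.
    samples = [rng.choices(test_pool, k=len(test_pool)) for _ in range(n_samples)]
    return train, test_pool, samples


# Case IDs stand in for the 7000 unknown-word cases.
train, test_pool, samples = split_and_resample(range(7000))
```

With 7000 cases this gives 4900 training cases and a 2100-case test pool, matching the counts on the slide.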

Evaluation - Training
Train a DT with misspelling module
Train a DT with misspelling & name module
Train a DT with name module
Train a DT with name & misspelling module

Misspelling DT Results - Table 3
baseline - no recall
1st decision tree - 73.8% recall
2nd decision tree - increase in precision, decrease in recall by similar amount
name features not predictive for ID'ing misspellings in this domain
not surprising - 8 of 10 features deal with information external to word itself

Misspelling DT failures
2 classes of omissions
Misidentifications
  • Foreign words

Omission type 1
Words with typical characteristics of English words
Differ from intended word by addition or deletion of a syllable
  • creditability for credibility
  • coordinatored for coordinated
  • representives for representatives

Omission type 2
Words differing from intended word by deletion of a blank
  • webpage
  • crewmembers
  • rainshower

Fixes
Fix for 2nd type
  • feature to specify whether UknWrd can be split into 2 known words
Fix for 1st type more difficult
  • homophonic relationship
  • phonetic distance feature
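
The proposed feature for the 2nd omission type could look like the sketch below, assuming the lexicon is available as a set of lowercase words. The function name and the toy lexicon are hypothetical.

```python
from typing import Set


def splits_into_known_words(word: str, lexicon: Set[str]) -> bool:
    """True if some split point yields two words that are both in the lexicon."""
    w = word.lower()
    return any(w[:i] in lexicon and w[i:] in lexicon for i in range(1, len(w)))


lexicon = {"web", "page", "crew", "members", "rain", "shower"}
flags = {w: splits_into_known_words(w, lexicon)
         for w in ["WEBPAGE", "CREWMEMBERS", "RAINSHOWER", "SXETION"]}
```

The three run-together examples from the slide come out as splittable, while an unusual sequence such as "sxetion" does not.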

Name DT Results - Table 4
1st tree
  • precision is large improvement
  • recall is excellent
2nd tree
  • increased recall & precision
  • unlike 2nd misspelling DT - why?

Name DT failures
Not ID'd as a name - Names with determiners
  • the steelers, the pathfinder
Adept at individual people, places
  • trouble with names having similar distributions to common nouns

Name DT failures (cont.)
Incorrectly ID'd as name
  • Unusual character sequences: sxetion, fwlamg
Misspelling identifier correctly ID's as misspellings
Decision-making component needs to resolve these

Unknown Word Categorizer
Precision = # of correct misspelling or name categorizations / total number of times a word was identified as misspelling or name
Recall = # of times system correctly ID's misspelling or name / # of misspellings and names existing in data
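
The two definitions above translate directly into code. This sketch assumes gold and predicted labels are parallel lists of strings; the label names and the tiny example are made up for illustration and are not results from the paper.

```python
from typing import List, Tuple

TARGET = ("misspelling", "name")  # the two categories the system assigns


def precision_recall(gold: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Precision and recall over misspelling and name categorizations."""
    identified = sum(1 for p in predicted if p in TARGET)
    correct = sum(1 for g, p in zip(gold, predicted) if p in TARGET and g == p)
    existing = sum(1 for g in gold if g in TARGET)
    precision = correct / identified if identified else 0.0
    recall = correct / existing if existing else 0.0
    return precision, recall


gold = ["name", "name", "misspelling", "other", "misspelling"]
pred = ["name", "misspelling", "misspelling", "name", "neither"]
precision, recall = precision_recall(gold, pred)
```

In this toy case 4 words are identified as a misspelling or name, 2 of those are correct, and 4 misspellings and names exist in the data, so both precision and recall are 0.5.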

Confusion matrix of tie-breaker
Table 5 - good results
5% of cases needed confidence measure
Majority of cases decision-maker rules in favor of name prediction

Confusion matrix (cont.)
Name DT has better results, likely to have higher confidence measures
UknWrd as Name when it is a misspelling (37 cases)
Phonetic relation with intended word - temt, tempt; floyda, Florida

Encouraging Results
Productive approach
Future focus
  • Improve existing components
    – features sensitive to distinction between names & misspellings
  • Develop components to ID remaining types
    – abbr., morph variants, etc.
  • Alternative decision-making process

Portability
Few linguistic resources required
  • Corpus of new domain (language)
  • Spelling suggestions
    – ispell avail. for many languages
  • POS tagger

Possible portability problems
Edit distance
  • Words consist of alphabetic chars. having undergone subst/add/del
  • Less useful for Chinese, Japanese
General approach still transferable
  • consider means by which misspellings differ from intended words
  • identify features to capture differences

Related Research
Assume all UknWrds are misspellings
Rely on capitalization
Expectations from scripts
  • Rely on world knowledge of situation
    – e.g. naval ship-to-shore messages

Related Research (cont.)
(Baluja et al., 1999)
DT classifier to ID PNs in text
3 features: word level, dictionary level, POS information
Highest F-score: 95.2%
  • slightly higher than name module

But...
Different tasks
  • ID all words & phrases that are PNs
  • vs. ID only those words which are UknWrds
Different data - Case information
If word-level features (case) excluded
F-score of 79.7%

Conclusion
UknWrd Categorizer to ID misspellings & names
Individual components, specializing in identifying a particular class of UknWrd
2 existing components use DTs
Encouraging results in a challenging domain (live CC transcripts)!