Re-organization of IR/CSC team
  Hongchao He
    Conf. follow up: TREC-10, NTCIR
    Paper follow up: ICCLP, SIGIR paper
  Guihong Cao
    MSKK-III - Clustering for technique transfer
  Yang Wen
    MSKK-III - Distance word dependency
  Min Zhang
    MSKK/CSC - Entropy-based pruning for applications of (Pinyin/Hiragana) input system

Chinese Spelling Checking
(or, the Big CSC)
  Jianfeng Gao
  NLC Group, MSRCN

Outline
  Introduction
  Chinese spelling checking
  Our approach
  Key techniques and experiments
  Milestones

Introduction
  Chinese spelling errors using MS-Pinyin input system
  Chinese spelling error patterns
  English spelling checking
  Why CSC is difficult?

Goal: Automatically correct Chinese spelling errors made when using the MS-Pinyin (MSPY) input system
  Text in the brain
  Syllable
  Key stroke (Typing)
  Converted text

Chinese spelling errors using MSPY
  Pinyin (phonetic) errors
  Typographic errors
  System errors

Chinese spelling error patterns
  Substitution errors
    Pinyin error
    System error (includes Pinyin error in some systems)
  Non-substitution errors
    Word segmentation errors
    Typographic errors - insertion/deletion/transposition

English spelling checking
  Non-word error detection (“the” mistyped as “hte”)
    N-gram (letter) analysis
    Dictionary lookup
  Real-word error detection (“from” mistyped as “form”)
    NLP - parser driven
    Statistical approach - data/error driven
      Local - n-gram language model, depends on a pre-defined confusion set
      Global - Winnow, Bayesian, TBL, etc.
    Problem - lack of error detection
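
As a rough illustration of the two English cases above, the sketch below flags non-word errors by dictionary lookup and resolves a real-word confusion with a local bigram count. The lexicon, corpus, confusion set, and function names are all toy assumptions made up for this example. They are not taken from any system described in the slides.

```python
from collections import Counter

# Toy lexicon, corpus, and confusion set (illustrative assumptions only).
LEXICON = {"the", "letter", "came", "from", "form", "a", "fill", "in"}
CORPUS = "the letter came from the office fill in the form".split()
CONFUSION_SETS = [{"from", "form"}]

# Bigram counts collected from the toy corpus.
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))


def non_word_errors(tokens):
    """Dictionary lookup: flag every token that is missing from the lexicon."""
    return [t for t in tokens if t not in LEXICON]


def best_in_confusion_set(prev_word, word):
    """Local real-word check: among the members of the word's confusion set,
    pick the one whose bigram with the previous word is most frequent."""
    for cset in CONFUSION_SETS:
        if word in cset:
            return max(sorted(cset), key=lambda w: BIGRAMS[(prev_word, w)])
    return word  # word is not in any confusion set; leave it unchanged
```

Note that the real-word check only fires for words already listed in a confusion set, which is the dependency the slide points out for the local approach.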

Why CSC is difficult?
  Word segmentation
    Ambiguity
    OOV - Proper noun detection (personal name, location, organization, etc.)
    Segmentation error propagation
  Non-word errors (in the English sense) do not exist
  MSPY already makes good use of a word trigram language model

Chinese spelling checking
  CSC - related work
    Template matching - long distance, e.g. <之所以> <是因为>
    Pattern matching - long words (n>=3), e.g. 一文不明 -> 一文不名, 忠耿耿 -> 忠心耿耿
    N-gram models - substitution errors
  CSC - challenges
    Long distance, coverage issue of template/pattern set
    Frequently used confusion sets, e.g. {像,象} {在,再}
    OOV, especially proper nouns
    N-gram models have already been fully used by MSPY

Chinese spelling error patterns in MSPY
  Proper noun
    Personal name
    Location
    Organization
  Non-word errors: context independent
    Insertion/deletion/transposition/substitution
    E.g. 一文不明 -> 一文不名, 忠耿耿 -> 忠心耿耿
  Real-word errors: context sensitive
    E.g. 像 vs. 象, 在 vs. 再, 实施 vs. 事实

Flowchart of our approach
  Text with errors
  Word segmentation
  Non-word error correction
  Real-word error correction
  Proper noun detection
  Word fuzzy matching
  Trigger: single-char string, low prob.
  Context sensitive disambiguation

Word segmentation and proper noun detection
  Language model based word segmentation
  Class-based language model
    P(W) = P_outside(W) * P_inside(W|<PN>)^a, a = ?
  Outside probability - PN tagged training data
    Using NLPWIN to tag the corpus
    Filtering, rule base
    EM?
  Inside probability - PN list training data
    Using cache (or, dynamic dictionary)

Experiments and Findings
  Measure: precision/recall - definition
  Training data - People's Daily
  Tag tool - NLPWIN
  Test data - spec.
  Results and Findings
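
The slide leaves the precision/recall definition open. A minimal sketch using the standard definitions is given below, assuming that both the system output and the gold standard are sets of comparable items. The `(position, text)` pairs and the function name are hypothetical choices for this example.

```python
def precision_recall(proposed, gold):
    """Standard precision and recall over two collections of items.

    precision = correct proposals / all proposals
    recall    = correct proposals / all gold items
    """
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: items are (character position, proposed text) pairs.
system_output = {(3, "名"), (7, "再"), (9, "象"), (12, "是")}
gold_standard = {(3, "名"), (15, "心")}
print(precision_recall(system_output, gold_standard))
```

Representing each item as a pair means a proposal only counts as correct when both the position and the proposed text match the gold standard.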

Long word fuzzy matching
  Definition of Distance(s1, s2)
    Long word, n>=3
    Sum of delete/insert/substitute a character
  Fast fuzzy matching
    Global - Lei Zhang’s ACL
    Local - trigger (single char, or low n-gram probability)
  Search - error detection/correction
    Viterbi
    Simplified version
      Long word + Local matching
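
Read literally, the Distance(s1, s2) definition above (a sum of single-character deletions, insertions, and substitutions) is the Levenshtein edit distance. The sketch below assumes that reading and matches a candidate string against a toy lexicon of long words by brute force. The lexicon, the `max_distance` threshold, and the function names are assumptions, and the fast global/local matching and Viterbi search mentioned on the slide are not reproduced.

```python
def distance(s1, s2):
    """Levenshtein distance: the minimum number of single-character
    deletions, insertions, and substitutions that turn s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # delete c1
                            curr[j - 1] + 1,             # insert c2
                            prev[j - 1] + (c1 != c2)))   # substitute c1 by c2
        prev = curr
    return prev[-1]


# Toy long-word lexicon built from the examples in the slides (assumption).
LONG_WORDS = ["一文不名", "忠心耿耿"]


def fuzzy_match(candidate, lexicon=LONG_WORDS, max_distance=1):
    """Return lexicon words with n >= 3 characters that are close to,
    but not identical with, the candidate string."""
    return [w for w in lexicon
            if len(w) >= 3 and 0 < distance(candidate, w) <= max_distance]
```

With this toy lexicon, `fuzzy_match("一文不明")` proposes 一文不名 (one substitution) and `fuzzy_match("忠耿耿")` proposes 忠心耿耿 (one insertion).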

Experiments and Findings
  Contact: 100 people, 3000-5000 characters/person
  Error analysis
  Algorithm ...
  Measure: precision/recall
  Large lexicon, acquisition
  Trigger/threshold?
  Results and Findings

Context sensitive disambiguation
  Building confusion set - specific to MSPY
  Feature selection - Context vector
    Collocation - contiguous POS or words/characters
    Context words - words/characters within a K-size window
    Triple?
  Weighting schema and Classifier
    Context Vector, TFIDF
    Winnow, Bayesian, TBL, etc.
  Scaling up
    Enlarge confusion set
    Feature pruning
    Adaptation
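
One way to picture the context-word features and a Bayesian classifier from the list above is a naive Bayes model with add-one smoothing over the words inside a K-size window. This is a sketch of one of the listed options, not the method the team settled on. The window size, training sentences, and function names are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

K = 2  # context window size on each side (assumed value)


def context_words(tokens, i, k=K):
    """Words within a k-size window around position i, excluding tokens[i]."""
    return tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]


def train(sentences, confusion_set):
    """Count how often each confusion-set member occurs and with which
    context words."""
    prior = Counter()
    feats = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in confusion_set:
                prior[tok] += 1
                feats[tok].update(context_words(tokens, i))
    return prior, feats


def disambiguate(tokens, i, confusion_set, prior, feats):
    """Pick the confusion-set member with the highest naive Bayes score
    (log prior plus add-one-smoothed log likelihood of each context word)."""
    vocab = {w for counts in feats.values() for w in counts}

    def score(cand):
        total = sum(feats[cand].values())
        s = math.log(prior[cand] + 1)
        for w in context_words(tokens, i):
            s += math.log((feats[cand][w] + 1) / (total + len(vocab) + 1))
        return s

    return max(sorted(confusion_set), key=score)


# Hypothetical pre-segmented training sentences for the confusion set {在, 再}.
TRAIN = [["我", "在", "北京", "工作"], ["他", "在", "家", "休息"],
         ["请", "再", "说", "一遍"], ["明天", "再", "来", "一次"]]
PRIOR, FEATS = train(TRAIN, {"在", "再"})
```

For example, `disambiguate(["请", "在", "说", "一遍"], 1, {"在", "再"}, PRIOR, FEATS)` prefers 再, because the surrounding words were seen with 再 in the toy training data.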

Experiments and Findings
  Measure: precision/recall
  Training data
  Test data (XXX confusion set)
  Results and Findings

Experiments and Findings
  Current Work
    Pseudo-training set based on MSPY IME
      Preliminary data processing (400M PD)
      Unigram error model (10,000 words useful)
        使: 是/69484, 市/10289, 诗/2394 ……
      Trigram error pattern (980,000 useful)
        共[度]难关 => 渡
        不够[英], => 硬
    Experiments based on basic approaches
      Pseudo-test set from 南方周末
      Continuous pair (Recall = 50%, Precision = 25%)
      Pattern Matching (??)
  Future Work
    Hybrid approaches
      Pattern Clustering + Continuous pair
    Function-word error detection
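
The unigram error model listed under Current Work can be pictured as a table of confusion counts turned into relative frequencies. The sketch below uses the three counts shown on the slide. Reading them as "how often the character on the left was confused with each character on the right" is an assumption, and because the slide truncates the list, the resulting probabilities are relative to these three entries only. The function names are hypothetical.

```python
# Counts copied from the slide; their interpretation is an assumption,
# and the slide's list is truncated after the third entry.
ERROR_COUNTS = {"使": {"是": 69484, "市": 10289, "诗": 2394}}


def error_probabilities(counts):
    """Normalize raw confusion counts into relative frequencies per character."""
    model = {}
    for char, confusions in counts.items():
        total = sum(confusions.values())
        model[char] = {c: n / total for c, n in confusions.items()}
    return model


def rank_candidates(char, model):
    """Return the confusable characters for char, most probable first."""
    confusions = model.get(char, {})
    return sorted(confusions, key=confusions.get, reverse=True)


MODEL = error_probabilities(ERROR_COUNTS)
print(rank_candidates("使", MODEL))
```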

System evaluation - put it all together
  Evaluation toolset
  Measure: precision/recall
  Training data
  Test data
  Results and Findings

Prototype
  Demo ...
  Online & offline CSC
  Right click
    Spelling error detection/correction
    Proper noun detection/correction

Assignment
  Jianfeng Gao - overall, fuzzy matching
  Mu Li - context sensitive disambiguation
  Jian Sun - PN detection
  Yang Wen - system evaluation
  Yulin Kang - demo
  Lei Zhang - senior consultant

Milestones
  Oct. 2001, Ming says “Yes” (TAB demo)
  Dec. 2001, Dong says “Yes” (Transfer)
  Aug. 2002, HJ says “Yes” (Party)

Information
  Access at \\msrcn4p3\rootD\gaojf\spell
  Contact me if any problems
    Jianfeng Gao, Tel: 86-10-62617711-5778, Email: [email protected]