Re-organization of IR/CSC team


Upload: cody-mcconnell

Post on 27-Mar-2015


Page 1: Re-organization of IR/CSC team

Hongchao He
  Conf. follow up: TREC-10, NTCIR
  Paper follow up: ICCLP, SIGIR paper
Guihong Cao
  MSKK-III – Clustering for technique transfer
Yang Wen
  MSKK-III – Distance word dependency
Min Zhang
  MSKK/CSC – Entropy based pruning for applications of (Pinyin/Hiragana) input system

Page 2: Chinese Spelling Checking (or, the Big CSC)

Jianfeng Gao
NLC Group, MSRCN

Page 3: Outline

Introduction
Chinese spelling checking
Our approach
Key techniques and experiments
Milestones

Page 4: Introduction

Chinese spelling errors using MS-Pinyin input system
Chinese spelling error patterns
English spelling checking
Why is CSC difficult?
Goal: Automatically correct Chinese spelling errors made when using the MS-Pinyin (MSPY) input system

Page 5: Chinese spelling errors using MSPY

Input process: Text in the brain -> Syllable -> Key stroke (typing) -> Converted text
Error types:
  Pinyin (phonetic) errors
  Typographic errors
  System errors

Page 6: Chinese spelling error patterns

Substitution errors
  Pinyin error
  System error (includes Pinyin error in some systems)
Non-substitution errors
  Word segmentation errors
  Typographic errors – insertion/deletion/transposition

Page 7: English spelling checking

Non-word error detection (“the” -> “hte”)
  N-gram (letter) analysis
  Dictionary lookup
Real-word error detection (“from” -> “form”)
  NLP – parser driven
  Statistical approach – data/error driven
    Local – n-gram language model, depends on a pre-defined confusion set
    Global – Winnow, Bayesian, TBL, etc.
  Problem – lack of error detection
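The dictionary-lookup idea above can be illustrated with a minimal sketch. The toy lexicon `LEXICON` and the helper `find_non_words` are hypothetical names introduced here for illustration; they are not from the slides.

```python
import re

# Hypothetical toy lexicon; a real checker would load a large word list.
LEXICON = {"the", "cat", "sat", "on", "mat", "from", "form"}


def find_non_words(text, lexicon=LEXICON):
    """Return the tokens of `text` that are missing from the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in lexicon]


print(find_non_words("Hte cat sat on the mat"))    # flags "hte"
print(find_non_words("The cat sat form the mat"))  # flags nothing
```

The second call shows why real-word errors need context: "form" is a valid word, so a pure lookup cannot see that "from" was probably intended.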

Page 8: Why is CSC difficult?

Word segmentation
  Ambiguous
  OOV – Proper noun detection (personal name, location, organization, etc.)
  Segmentation error propagation
Non-word errors (in the sense of English) do not exist
MSPY makes good use of a word trigram language model

Page 9: Chinese spelling checking

CSC – related works
  Template matching – long distance, e.g. <之所以> <是因为>
  Pattern matching – long words (n>=3), e.g. 一文不明 -> 一文不名, 忠耿耿 -> 忠心耿耿
  N-gram models – substitution errors
CSC – challenges
  Long distance, coverage issue of template/pattern set
  Frequently used confusion sets, e.g. {像,象} {在,再}
  OOV, especially proper nouns
  N-gram, has been fully used by MSPY

Page 10: Chinese spelling error patterns in MSPY

Proper noun
  Personal name
  Location
  Organization
Non-word errors: context independent
  Insertion/deletion/transposition/substitution
  E.g. 一文不明 -> 一文不名, 忠耿耿 -> 忠心耿耿
Real-word errors: context sensitive
  E.g. 像 -> 象, 在 -> 再, 实施 -> 事实

Page 11: Flowchart of our approach

Text with errors
  -> Word segmentation (proper noun detection)
  -> Non-word error correction (word fuzzy matching; trigger: single char string, low prob)
  -> Real-word error correction (context sensitive disambiguation)

Page 12: Word segmentation and proper noun detection

Language model based word segmentation
Class-based language model
  $P(W) = P_{outside}(W) \cdot P_{inside}^{a}(W \mid \langle PN \rangle)$, a = ?
Outside probability – PN tagged training data
  Using NLPWIN to tag the corpus
  Filtering, rule base
  EM?
Inside probability – PN list training data
  Using cache (or, dynamic dictionary)
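To make language-model-based word segmentation concrete, here is a simplified sketch. It assumes a plain unigram model rather than the class-based model with a proper-noun class described above. The probabilities in `UNIGRAM`, the `OOV_LOGP` penalty, and `MAX_WORD_LEN` are made-up values for illustration only.

```python
import math

# Hypothetical unigram probabilities; a real system estimates them from a corpus.
UNIGRAM = {"研究": 0.02, "研究生": 0.01, "生命": 0.02, "命": 0.005,
           "生": 0.005, "的": 0.05, "起源": 0.01}
OOV_LOGP = math.log(1e-8)  # assumed penalty for an unknown single character
MAX_WORD_LEN = 4           # assumed upper bound on word length


def segment(sentence):
    """Return the segmentation with the highest unigram log-probability."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for sentence[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = sentence[start:end]
            if word in UNIGRAM:
                logp = math.log(UNIGRAM[word])
            elif len(word) == 1:
                logp = OOV_LOGP     # unknown characters stand alone
            else:
                continue            # unknown multi-character strings are skipped
            score = best[start] + logp
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow the back-pointers from the end of the sentence.
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]


print(segment("研究生命的起源"))
```

With these toy numbers the ambiguous prefix is resolved as 研究 / 生命 rather than 研究生 / 命, because the first path has the higher total log-probability.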

Page 13: Experiments and Findings

Measure: precision/recall – definition
Training data – People's Daily
Tag tool – NLPWIN
Test data – spec.
Results and Findings
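The slide leaves the precision/recall definition open. A minimal sketch under the usual definitions follows, assuming each detection or correction is represented as a hypothetical `(position, correction)` pair; the sample data is invented.

```python
def precision_recall(proposed, gold):
    """Compute precision and recall of proposed items against gold items.

    Both arguments are collections of hashable items,
    for example (position, correction) pairs.
    """
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: two true errors, four proposals, one of them right.
gold = {(3, "名"), (10, "再")}
proposed = {(3, "名"), (5, "象"), (7, "在"), (12, "是")}
print(precision_recall(proposed, gold))
```

Here one of four proposals is right (precision 0.25) and one of two true errors is found (recall 0.5).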

Page 14: Long word fuzzy matching

Definition of Distance(s1, s2)
  Long word, n>=3
  Number of character deletions/insertions/substitutions
Fast fuzzy matching
  Global – Lei Zhang’s ACL
  Local – trigger (single char, or low n-gram probability)
Search – error detection/correction
  Viterbi
  Simplified version
    Long word + Local matching
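One plausible reading of Distance(s1, s2) is the standard Levenshtein edit distance. The sketch below assumes that reading and pairs it with a naive linear scan over a hypothetical toy lexicon; it does not reproduce the fast matching method the slide refers to.

```python
def distance(s1, s2):
    """Levenshtein distance: minimum number of single-character
    deletions, insertions and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_match(word, lexicon, max_dist=1, min_len=3):
    """Return long lexicon entries (length >= min_len) within max_dist of word."""
    return [w for w in lexicon
            if len(w) >= min_len and distance(word, w) <= max_dist]


LEXICON = ["一文不名", "忠心耿耿", "之所以"]  # hypothetical toy lexicon
print(distance("一文不明", "一文不名"))  # one substitution
print(distance("忠耿耿", "忠心耿耿"))    # one insertion
print(fuzzy_match("忠耿耿", LEXICON))
```

Restricting candidates to long words (n>=3) keeps the number of spurious near matches small, since short words have many neighbours at distance 1.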

Page 15: Experiments and Findings

Contact: 100 persons, 3000-5000 characters/person
Error analysis
Algorithm …
Measure: precision/recall
Large lexicon, acquisition
Trigger/threshold?
Results and Findings

Page 16: Context sensitive disambiguation

Building confusion set – specific to MSPY
Feature selection – Context vector
  Collocation – contiguous POS or words/characters
  Context words – words/characters within a K-size window
  Triple?
Weighting schema and Classifier
  Context Vector, TFIDF
  Winnow, Bayesian, TBL, etc.
Scaling up
  Enlarge confusion set
  Feature pruning
  Adaptation
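As an illustration of context-word features with a Bayesian classifier, one of the options listed above, here is a minimal Naive Bayes sketch. The class name, the add-one smoothing, and the English toy data are assumptions made for readability; the slides do not specify these details.

```python
import math
from collections import Counter, defaultdict


def context_words(tokens, index, k=2):
    """Words within a k-size window around tokens[index], target excluded."""
    return tokens[max(0, index - k):index] + tokens[index + 1:index + 1 + k]


class ConfusionSetClassifier:
    """Naive Bayes with add-one smoothing over context-word features."""

    def __init__(self, confusion_set, k=2):
        self.confusion_set = list(confusion_set)
        self.k = k
        self.word_counts = Counter()                # occurrences of each candidate
        self.feature_counts = defaultdict(Counter)  # candidate -> context-word counts
        self.vocab = set()

    def train(self, sentences):
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in self.confusion_set:
                    self.word_counts[tok] += 1
                    for feat in context_words(tokens, i, self.k):
                        self.feature_counts[tok][feat] += 1
                        self.vocab.add(feat)

    def predict(self, tokens, index):
        """Pick the confusion-set member that best fits the context."""
        total = sum(self.word_counts.values())
        best, best_score = None, -math.inf
        for cand in self.confusion_set:
            score = math.log((self.word_counts[cand] + 1)
                             / (total + len(self.confusion_set)))
            denom = sum(self.feature_counts[cand].values()) + len(self.vocab) + 1
            for feat in context_words(tokens, index, self.k):
                score += math.log((self.feature_counts[cand][feat] + 1) / denom)
            if score > best_score:
                best, best_score = cand, score
        return best


clf = ConfusionSetClassifier(["from", "form"])
clf.train([["a", "letter", "from", "my", "friend"],
           ["fill", "in", "the", "form", "below"],
           ["came", "from", "the", "city"],
           ["this", "form", "is", "long"]])
print(clf.predict(["a", "gift", "form", "my", "friend"], 2))
```

The classifier is asked which member of the confusion set fits position 2; the surrounding words "a", "my" and "friend" were seen with "from" in training, so it proposes "from" in place of the typed "form".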

Page 17: Experiments and Findings

Measure: precision/recall
Training data
Test data (XXX confusion set)
Results and Findings

Page 18: Experiments and Findings

Current Work
  Pseudo-training set based on MSPY IME
    Preliminary data processing (400M PD)
    Unigram error model (10,000 Words useful)
      使 是/69484 市/10289 诗/2394 ……
    Trigram error pattern (980,000 useful)
      共[度]难关 => 渡 / 不够[英], => 硬
  Experiments based on basic approaches
    Pseudo-test set from 南方周末 (Southern Weekly)
    Continuous pair (Recall = 50%, Precision = 25%)
    Pattern Matching (??)
Future Work
  Hybrid approaches
    Pattern Clustering + Continuous pair
  Functional words error detection
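A unigram error model of the kind mentioned under Current Work could be collected roughly as sketched below. This assumes the pseudo-training set consists of pairs of original text and the text produced by converting its pinyin back through the IME; the slides do not state how the set was built. The function name and the data are hypothetical.

```python
from collections import Counter, defaultdict


def build_unigram_error_model(pairs):
    """Count character substitutions between reference and converted text.

    pairs: iterable of (reference, converted) strings.
    Returns {converted_char: Counter({reference_char: count})}.
    """
    model = defaultdict(Counter)
    for reference, converted in pairs:
        if len(reference) != len(converted):
            continue  # skip pairs that cannot be aligned character by character
        for ref_ch, conv_ch in zip(reference, converted):
            if ref_ch != conv_ch:
                model[conv_ch][ref_ch] += 1
    return model


# Hypothetical toy data: (original text, text after the pinyin round trip).
pairs = [("共渡难关", "共度难关"), ("不够硬", "不够英"), ("共渡难关", "共度难关")]
model = build_unigram_error_model(pairs)
print(model["度"].most_common())
```

Each entry maps a character seen in converted text to the characters that should have appeared, with counts that can later rank correction candidates.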

Page 19: System evaluation – put it all together

Evaluation toolset
Measure: precision/recall
Training data
Test data
Results and Findings

Page 20: Prototype

Demo …
Online & offline CSC
Right click
  Spelling error detection/correction
  Proper noun detection/correction

Page 21: Assignment

Jianfeng Gao – overall, fuzzy matching
Mu Li – context sensitive disambiguation
Jian Sun – PN detection
Yang Wen – system evaluation
Yulin Kang – demo
Lei Zhang – senior consultant

Page 22: Milestones

Oct. 2001, Ming says “Yes” (TAB demo)
Dec. 2001, Dong says “Yes” (Transfer)
Aug. 2002, HJ says “Yes” (Party)

Page 23: Information

Access at \\msrcn4p3\rootD\gaojf\spell
Contact me if any problems
  Jianfeng Gao, Tel: 86-10-62617711-5778, Email: [email protected]