CS621: Artificial Intelligence
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lecture 36, 37: Part of Speech Tagging and HMM
21st and 25th Oct, 2010
(forward, backward computation and Baum Welch Algorithm will be done later)



Part of Speech Tagging
- POS tagging is the process of attaching to each word in a sentence a suitable grammatical tag (noun, verb, etc.) from a given set of tags.
- The set of tags is called the tag-set.
- Standard tag-set: Penn Treebank (for English).

POS: A kind of sequence labeling task
Other such tasks:
- Marking tags on genomic sequences
- Training for predicting protein structure: labels are primary (P), secondary (S), tertiary (T)
- Named entity labels
  Washington_PLACE voted Washington_PERSON to power_PERS (Puja bought flowers for worshipping)
- Shallow parsing (noun phrase marking)
  The_B little_I boy_I sprained his_B ring_I finger_I.

POS Tags
- NN: Noun; e.g. Dog_NN
- VM: Main Verb; e.g. Run_VM
- VAUX: Auxiliary Verb; e.g. Is_VAUX
- JJ: Adjective; e.g. Red_JJ
- PRP: Pronoun; e.g. You_PRP
- NNP: Proper Noun; e.g. John_NNP
- etc.

POS Tag Ambiguity
In English:
- I bank1 with the bank2 on the river bank3.
- bank1 is a verb; the other two banks are nouns.
- {Aside: generator of humour (incongruity theory)}: A man returns to his parked car and finds the sticker "Parking fine". He goes and thanks the policeman for appreciating his parking skill. fine_adverb vs. fine_noun
For Hindi:
- Rama achhaa gaata hai. (hai is VAUX: auxiliary verb); Ram sings well
- Rama achha ladakaa hai. (hai is VCOP: copula verb); Ram is a good boy

Process
- List all possible tags for each word in the sentence.
- Choose the most suitable tag sequence.

Example
People jump high.
- People: Noun/Verb
- jump: Noun/Verb
- high: Noun/Verb/Adjective
We can start with probabilities.
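The first step of the process, listing every possible tag sequence, can be sketched as below. This is only an illustration using the candidate tags given for "People jump high."; the function name `all_tag_sequences` is made up for this sketch.

```python
from itertools import product

# Candidate tags per word, as listed for the example "People jump high."
candidates = {
    "People": ["Noun", "Verb"],
    "jump": ["Noun", "Verb"],
    "high": ["Noun", "Verb", "Adjective"],
}


def all_tag_sequences(words, candidates):
    """Return every possible tag sequence (one tag per word)."""
    return list(product(*(candidates[w] for w in words)))


words = ["People", "jump", "high"]
sequences = all_tag_sequences(words, candidates)
print(len(sequences))  # 2 * 2 * 3 = 12 candidate sequences
```

The number of candidates grows multiplicatively with sentence length, which is why the second step needs a principled way of scoring sequences rather than inspecting each one by hand.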

Challenge of POS tagging
Example from Indian Language
- Tagging of jo, vaha, kaun and their inflected forms in Hindi, and their equivalents in multiple languages

DEM and PRON labels
- Jo_DEM ladakaa kal aayaa thaa, vaha cricket acchhaa khel letaa hai
- Jo_PRON kal aayaa thaa, vaha cricket acchhaa khel letaa hai

Disambiguation rule-1
If jo is followed by a noun
Then DEM
Else
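A minimal sketch of rule-1 follows. Two things here are assumptions, not part of the slides: the tiny noun lexicon `NOUNS` is hypothetical, and the else-branch is taken to return PRON, since the slide leaves it blank.

```python
# Hypothetical mini-lexicon of nouns, for illustration only.
NOUNS = {"ladakaa"}


def tag_jo_rule1(tokens, i, nouns=NOUNS):
    """Rule-1: tag tokens[i] (a jo-form) as DEM if the next token is a noun.

    Returning PRON in the else-branch is an assumption of this sketch.
    """
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    return "DEM" if nxt in nouns else "PRON"


s1 = "jo ladakaa kal aayaa thaa".split()
s2 = "jo kal aayaa thaa".split()
tag1 = tag_jo_rule1(s1, 0)
tag2 = tag_jo_rule1(s2, 0)
print(tag1)  # DEM
print(tag2)  # PRON
```

Because the rule looks only at the immediately following token, any material between jo and its noun changes the answer.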

False Negative
When there is an arbitrary amount of text between the jo and the noun:
- Jo_??? bhaagtaa huaa, haftaa huaa, rotaa huaa, chennai academy a koching lenevaalaa ladakaa kal aayaa thaa, vaha cricket acchhaa khel letaa hai

False Positive
- Jo_DEM (wrong!) duniyadarii samajhkar chaltaa hai,
- Jo_DEM/PRON? manushya manushyoM ke biich ristoM naatoM ko samajhkar chaltaa hai, (ambiguous)

False Positive for Bengali
- Je_DEM (wrong!) bhaalobaasaa paay, sei bhaalobaasaa dite paare (one who gets love can give love)
- Je_DEM (right!) bhaalobaasa tumi kalpanaa korchho, taa e jagat e sambhab nay (the love that you are imagining is impossible in this world)

Will fail
- In similar situations for jis, jin, vaha, us, un
- All these forms add to the corpus count

Disambiguation rule-2
If jo is oblique (attached with ne, ko, se etc.)
Then it is PRON
Else

Will fail (false positive)
- In the case of languages that demand agreement between the jo-form and the noun it qualifies
- E.g. Sanskrit
  Yasya_PRON (wrong!) baalakasya aananam drshtyaa (jis ladake kaa muha dekhkar)
  Yasya_PRON (wrong!) kamaniyasya baalakasya aananam drshtyaa

Will also fail for
- Rules that depend on whether the noun following jo/vaha/kaun or its form is oblique or not
- Because the case marker can be far from the noun:
  ladakii jise piliya kii bimaarii ho gayiii thii ko
- Needs discussions across languages

Remark on DEM and PRON
- DEM vs. PRON cannot be disambiguated IN GENERAL at the level of the POS tagger, i.e.
  - Cannot assume parsing
  - Cannot assume semantics

Mathematics of POS tagging
Derivation of POS tagging formula
Best tag sequence = T* = argmax P(T|W) = argmax P(T) P(W|T) (by Bayes' Theorem)
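The Bayes step can be spelled out as follows. The only extra fact used is that P(W) does not depend on T, so dropping it leaves the argmax unchanged.

```latex
\begin{aligned}
T^* &= \arg\max_T P(T \mid W) \\
    &= \arg\max_T \frac{P(T)\, P(W \mid T)}{P(W)} \\
    &= \arg\max_T P(T)\, P(W \mid T)
\end{aligned}
```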

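$$
\begin{aligned}
P(T) &= P(t_0 = \text{\textasciicircum},\ t_1, t_2, \ldots, t_n,\ t_{n+1} = \text{.}) \\
     &= P(t_0)\, P(t_1 \mid t_0)\, P(t_2 \mid t_1 t_0)\, P(t_3 \mid t_2 t_1 t_0) \cdots P(t_n \mid t_{n-1} t_{n-2} \ldots t_0)\, P(t_{n+1} \mid t_n t_{n-1} \ldots t_0) \\
     &= P(t_0)\, P(t_1 \mid t_0)\, P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})\, P(t_{n+1} \mid t_n) \\
     &= \prod_{i=0}^{n+1} P(t_i \mid t_{i-1}) \qquad \text{(Bigram Assumption)}
\end{aligned}
$$

Here the $i = 0$ factor $P(t_0 \mid t_{-1})$ stands for $P(t_0)$.

Lexical Probability Assumption

$$
P(W \mid T) = P(w_0 \mid t_0 \ldots t_{n+1})\, P(w_1 \mid w_0,\ t_0 \ldots t_{n+1})\, P(w_2 \mid w_1 w_0,\ t_0 \ldots t_{n+1}) \cdots P(w_n \mid w_0 \ldots w_{n-1},\ t_0 \ldots t_{n+1})\, P(w_{n+1} \mid w_0 \ldots w_n,\ t_0 \ldots t_{n+1})
$$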

Assumption: A word is determined completely by its tag. This is inspired by speech recognition

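$$
P(W \mid T) = P(w_0 \mid t_0)\, P(w_1 \mid t_1) \cdots P(w_{n+1} \mid t_{n+1}) = \prod_{i=0}^{n+1} P(w_i \mid t_i) = \prod_{i=1}^{n+1} P(w_i \mid t_i) \qquad \text{(Lexical Probability Assumption)}
$$

The $i = 0$ factor can be dropped because $w_0$ and $t_0$ are both the start symbol ^, so $P(w_0 \mid t_0) = 1$.

Generative Model
[Figure: lattice of candidate tags between ^ and . for the tagged sentence ^_^ People_N Jump_V High_R ._., annotated "Lexical Probabilities" and "Bigram Probabilities"]
- This model is called a generative model. Here words are observed from tags, which act as states.
- This is similar to an HMM.

Parts of Speech Tags (Simplified situation)
- Noun (N): boy
- Verb (V): sing
- Adjective (A): red
- Adverb (R): loudly
- Preposition (P): to
- Article (T): a, an
- Conjunction (C): and
- Wh-word (W): who
- Pronoun (U): he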
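Under the bigram and lexical probability assumptions, the score P(T)P(W|T) of one candidate tag sequence is a product of bigram and lexical factors. The sketch below illustrates this for ^ People Jump High . with the tags ^ N V R . from the generative-model example. All probability values are toy numbers made up for illustration, not estimates from any corpus, and the function name `score` is hypothetical.

```python
# Toy numbers, made up for illustration only; not estimated from a corpus.
bigram = {("^", "N"): 0.6, ("N", "V"): 0.5, ("V", "R"): 0.2, ("R", "."): 0.4}
lexical = {
    ("People", "N"): 0.01,
    ("Jump", "V"): 0.02,
    ("High", "R"): 0.05,
    (".", "."): 1.0,
}


def score(words, tags, bigram, lexical):
    """Return P(T) * P(W|T) under the bigram and lexical assumptions.

    Both sequences start with '^'; P(t_0) and P(w_0|t_0) are taken as 1.
    Unseen pairs get probability 0.0 (no smoothing in this sketch).
    """
    p = 1.0
    for i in range(1, len(tags)):
        p *= bigram.get((tags[i - 1], tags[i]), 0.0)  # P(t_i | t_{i-1})
        p *= lexical.get((words[i], tags[i]), 0.0)    # P(w_i | t_i)
    return p


words = ["^", "People", "Jump", "High", "."]
tags = ["^", "N", "V", "R", "."]
p = score(words, tags, bigram, lexical)
print(p)
```

Choosing the best tag sequence then amounts to computing this score for each candidate sequence and keeping the maximum.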

Hidden Markov Model and POS tagging
- Parts of Speech tags are states
- Words are observations

S = {N, V, A, R, P, C, T, W, U}
O = {Words of language}

Example
Test sentence: ^ People laugh aloud $

Transition Table

Tag\tag     N      V      A      R      U
N           #/#N
V           #/#V
A
R
...
U

Lexical or Word Probabilities

Tag\words   Buy    Apple  People Going
N           #/#N
V           #/#V
A
R
...
U

Corpus
- Collection of coherent text
- ^_^ People_N laugh_V aloud_A $_$
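One way to fill both tables is by counting over a tagged corpus. The sketch below reads each table cell as a ratio of counts (for example, the number of N-to-V transitions divided by the number of N tags); that reading, and the function name `estimate`, are assumptions of this sketch. It uses the single tagged sentence given above as its corpus.

```python
from collections import Counter

# The one tagged sentence from the corpus example; tokens are word_tag.
corpus = ["^_^ People_N laugh_V aloud_A $_$"]


def estimate(corpus):
    """Estimate transition and lexical probabilities by relative counts."""
    tag_count = Counter()    # how often each tag occurs
    trans_count = Counter()  # (previous tag, current tag) pairs
    emit_count = Counter()   # (tag, word) pairs
    for sentence in corpus:
        pairs = [tok.rsplit("_", 1) for tok in sentence.split()]
        tags = [tag for _, tag in pairs]
        for word, tag in pairs:
            tag_count[tag] += 1
            emit_count[(tag, word)] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans_count[(prev, cur)] += 1
    transition = {k: c / tag_count[k[0]] for k, c in trans_count.items()}
    lexical = {k: c / tag_count[k[0]] for k, c in emit_count.items()}
    return transition, lexical


transition, lexical = estimate(corpus)
print(transition[("N", "V")])     # count(N followed by V) / count(N)
print(lexical[("N", "People")])   # count(People tagged N) / count(N)
```

With a single sentence every observed ratio is 1.0; a realistic corpus gives fractional values, and unseen pairs would need smoothing, which this sketch does not attempt.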

Corpus
- Spoken: Switchboard Corpus
- Written: Brown, BNC