Augmented Human Communication Laboratory, Graduate School of Information Science
Sequential Data Modeling
Tomoki Toda, Graham Neubig, Sakriani Sakti


Course Goals

The aim of this course is to learn basic knowledge of sequential data modeling techniques that can be applied to sequential data such as speech signals, biological signals, videos of moving objects, or natural language text. In particular, it will focus on deepening knowledge of methods based on probabilistic models, such as hidden Markov models or linear dynamical systems.

Credits and Grading

• 1 credit course
• Score will be graded by
  • Assignment report in every class
• Prerequisites
  • Fundamental Mathematics for Optimization (最適化数学基礎)
  • Calculus (微分積分学)
  • Basic Data Analysis (データ解析基礎)

Materials

• Textbook
  • There is no textbook for this course.
• Lecture slides
  • Handouts will be distributed in each class.
  • PDF slides are available from http://ahclab.naist.jp/lecture/2016/sdm/index.html (internal access only)
• Reference materials
  • C. M. Bishop: Pattern Recognition and Machine Learning, Springer Science+Business Media, LLC, 2006
  • C. M. Bishop (translated by Motoda, Kurita, Higuchi, Matsumoto, Murata): Pattern Recognition and Machine Learning, Vols. 1 and 2 (Japanese edition), Springer Japan, 2008

Office Hours

• NAIST Lecturers: Graham Neubig, Sakriani Sakti (Augmented Human Communication Laboratory)
  • Office: B714
  • Office hour: by appointment (send an email first)
  • Email: [email protected], [email protected]
• Other contact
  • Tomoki Toda
    • Email: [email protected]

TA Members (Email: [email protected])

• Rui Hiraoka [email protected]
• Yoko Ishikawa [email protected]

Schedule

• 1st slot on every Friday, 9:20-10:50, in room L1

[Two calendars, for June and July/August, marking the class dates]

Syllabus

Date  Course description                               Lecturer
6/03  Basics of sequential data modeling 1             Graham Neubig
6/10  Basics of sequential data modeling 2             Graham Neubig
6/17  Discrete latent variable models 1                Tomoki Toda
6/24  Discrete latent variable models 2                Tomoki Toda
7/1   Continuous latent variable models 1              Tomoki Toda
7/15  Discriminative models for sequential labeling 1  Sakriani Sakti
7/29  Continuous latent variable models 2              Tomoki Toda
8/1   Discriminative models for sequential labeling 2  Sakriani Sakti

1st and 2nd Classes (6/03 and 6/10)

• Lecturer: Graham Neubig
• Contents: Basics of sequential data modeling
  • Markov process
  • Latent variables
  • Mixture models
  • Expectation-maximization (EM) algorithm

[Graphical models: a Markov chain over x1 … x4, and latent variables z1 … z4 emitting x1 … x4]

3rd and 4th Classes (6/17 and 6/24)

• Lecturer: Tomoki Toda
• Contents: Discrete latent variable models
  • Hidden Markov models
  • Forward-backward algorithm
  • Viterbi algorithm
  • Training algorithm

[Figure: a left-to-right HMM state diagram for /s/]

5th and 7th Classes (7/1 and 7/29)

• Lecturer: Tomoki Toda
• Contents: Continuous latent variable models
  • Factor analysis
  • Linear dynamical systems
  • Prediction and update
  • (Training algorithm)

[Graphical models: a linear dynamical system with states z1 … z4 emitting x1 … x4, and a factor analysis model with factors z1 … zN and observation x]

6th and 8th Classes (7/15 and 8/1)

• Lecturer: Sakriani Sakti
• Contents: Discriminative models for sequential labeling
  • Structured perceptron
  • Conditional Random Fields
  • Training algorithm

Example (word segmentation of 「今日は晴れだ。」, "It is sunny today."):
  今日/は/晴れ/だ/。
  今/日/は/晴れ/だ/。

Sequential Data Modeling
1st class: "Basics of sequential data modeling 1"
Graham Neubig

Question

After this class, you can answer these questions!

(A-san and B-san, each with a log of their menus; the dish pictures on the original slide were lost in extraction.)

One day, someone ate the following menu. [pictures lost]

Q1. A-san or B-san?
Q2. If this is A-san's menu, which dish is "?"? (the two candidate dishes were shown as pictures)
Q3. …

Sequential Data

• Data examples
  • Time series (speech, actions, moving objects, exchange rates, …)
  • Character strings (word sequences, symbol strings, …)
• Various lengths of data, e.g.:
  Data sample 1 (length = 5): {1, 0, 1, 1, 0}
  Data sample 2 (length = 8): {1, 1, 1, 0, 1, 1, 0, 0}
  Data sample 3 (length = 3): {0, 0, 1}
  Data sample 4 (length = 6): {0, 1, 0, 1, 1, 0}
• Probabilistic approach to modeling sequential data
  • Consistent framework for the quantification and manipulation of uncertainty
  • Effective for dealing with real data

How to Represent Sequential Data?

• A sequential data sample is represented in a high-dimensional space ("# of dimensions" = "length of the sequential data sample").
  • A sample of length 2 is represented by a 2-dimensional vector (x1, x2); a sample of length 3 by a 3-dimensional vector (x1, x2, x3); and a sample of length N by an N-dimensional vector (x1, …, xN).

We need to model probability distributions in these high-dimensional spaces!

Rules of Probability (1)

• Assume two random variables, X and Y
  • X: x1 = "Bread", x2 = "Rice", or x3 = "Noodle"
  • Y: y1 = "Home" or y2 = "Restaurant"
• Assume the following data samples {X, Y}: {Bread, Home}, {Rice, Restaurant}, {Noodle, Home}, {Bread, Restaurant}, {Rice, Restaurant}, {Noodle, Home}, {Bread, Home}, {Rice, Home}, and {Bread, Home}
• Make the following table showing the number of samples (N = 9 in total; e.g., the Noodle/Home cell holds the # of samples of {Noodle, Home}):

                Bread  Rice  Noodle
  Home            3      1     2
  Restaurant      1      2     0

• From such a table we can read off:
  • Joint probability: p(X = xi, Y = yj) = nij / N
  • Sum rule of probability: p(X = xi) = Σj p(X = xi, Y = yj)
  • Marginal probability: p(X = xi) = ci / N
  • Conditional probability: p(Y = yj | X = xi) = nij / ci
  • Product rule of probability: p(X = xi, Y = yj) = p(Y = yj | X = xi) p(X = xi)
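The counting rules above can be sketched in code. A minimal example using the slide's nine meal samples (X: meal, Y: place):

```python
from collections import Counter

# The nine {X, Y} samples from the slide.
samples = [
    ("Bread", "Home"), ("Rice", "Restaurant"), ("Noodle", "Home"),
    ("Bread", "Restaurant"), ("Rice", "Restaurant"), ("Noodle", "Home"),
    ("Bread", "Home"), ("Rice", "Home"), ("Bread", "Home"),
]

N = len(samples)
joint_counts = Counter(samples)                # n_ij, one count per table cell
x_counts = Counter(x for x, _ in samples)      # c_i, the column sums

def joint(x, y):
    """Joint probability p(X=x, Y=y) = n_ij / N."""
    return joint_counts[(x, y)] / N

def marginal_x(x):
    """Marginal probability p(X=x) = c_i / N (sum rule over Y)."""
    return x_counts[x] / N

def conditional(y, x):
    """Conditional probability p(Y=y | X=x) = n_ij / c_i."""
    return joint_counts[(x, y)] / x_counts[x]

# Product rule: p(X, Y) = p(Y | X) p(X)
assert abs(joint("Noodle", "Home")
           - conditional("Home", "Noodle") * marginal_x("Noodle")) < 1e-12

print(joint("Noodle", "Home"))        # 2/9
print(conditional("Home", "Noodle"))  # 1.0 (both Noodle samples were at Home)
```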

Rules of Probability (2)

• Notation for the table of counts
  • xi: the i-th value of a random variable X, where i = 1, …, M
  • yj: the j-th value of a random variable Y, where j = 1, …, L
  • nij: # of samples in the corresponding cell {X = xi, Y = yj}
  • ci: # of samples in the i-th column
  • rj: # of samples in the j-th row
• The rules of probability
  • Sum rule: p(X = xi) = Σ_{j=1}^{L} p(X = xi, Y = yj)
  • Product rule: p(X = xi, Y = yj) = p(Y = yj | X = xi) p(X = xi)
• Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X)

Rules of Probability (3): Probability Densities

• Probabilities with respect to continuous variables
  • Probability density over a real-valued variable x: p(x)
    • Probability that x will lie in an interval (a, b): p(x ∈ (a, b)) = ∫_a^b p(x) dx
    • Conditions to be satisfied: p(x) ≥ 0 and ∫_{-∞}^{∞} p(x) dx = 1
  • Cumulative distribution function P(x)
    • Probability that x lies in the interval (-∞, z): P(z) = ∫_{-∞}^{z} p(x) dx

How to Model Joint Probability?

• Length of sequential data (# of data points over a sequence) varies…
  i.e., # of dimensions of the joint probability distribution also varies…
• The joint probability distribution can be represented with conditional probability distributions of individual data points!
  i.e., # of distributions varies but # of dimensions of each distribution is fixed:

  p(x_1, …, x_N) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) … p(x_N | x_1, …, x_{N-1})
                 = p(x_1) ∏_{n=2}^{N} p(x_n | x_1, …, x_{n-1})

• However, the conditional probability distribution of a present data point given all past data points needs to be modeled…

How can we effectively model the joint probability distribution of sequential data?

Two Basic Approaches

• Markov process
• Latent variables

[Graphical models: a chain over x1 … x4 for the Markov process, and latent variables z1 … z4 each emitting x1 … x4]

Markov Process

• Assume that the conditional probability distribution of the present state depends on only a few past states
  e.g., it depends on only one past state:

  1st order Markov chain: p(x_n | x_1, …, x_{n-1}) = p(x_n | x_{n-1})
  2nd order Markov chain: p(x_n | x_1, …, x_{n-1}) = p(x_n | x_{n-2}, x_{n-1})
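Under the 1st order assumption, the joint probability of a sequence factorizes into pairwise transitions. A minimal sketch with an illustrative two-state chain (the states and probabilities below are made up for illustration, not from the lecture):

```python
# A first-order Markov chain: p(x_1, ..., x_N) = p(x_1) * prod_n p(x_n | x_{n-1}).
# Illustrative initial and transition probabilities (not from the lecture).
initial = {"A": 0.6, "B": 0.4}
trans = {
    "A": {"A": 0.7, "B": 0.3},
    "B": {"A": 0.4, "B": 0.6},
}

def sequence_prob(seq):
    """Probability of a sequence under the 1st-order Markov assumption."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

print(round(sequence_prob(["A", "A", "B"]), 6))  # 0.6 * 0.7 * 0.3 = 0.126
```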

Example of 1st Order Markov Process

• How many probability distributions are needed if we model English text using the 1st order Markov process?
  If only using 27 characters including "space":
  P("This sentence is represented by this…")
  = P(T) P(h|T) P(i|h) P(s|i) P(-|s) P(s|-) P(e|s) P(n|e) P(t|n) P(e|t) …

[Figure: P(x), P(x, y), and P(y|x), where x is the 1st letter and y the 2nd letter; probability is shown by the areas of white squares. David J. C. MacKay, "Information Theory, Inference, and Learning Algorithms," Cambridge University Press, pp. 22-24]

State Transition Diagram / Matrix

• Without explicit initial and final states (weather example; 晴: fine, 曇: cloudy, 雨: rain)
  State transition matrix A (rows: past state; columns: present state; each row sums to one):

        晴    曇    雨
  晴   0.6   0.3   0.1
  曇   0.3   0.3   0.4
  雨   0.1   0.5   0.4

  e.g., p(雨|晴) = 0.1

• With explicit initial state /s/ and final state /e/ (起: wake up, 寝: sleep)
  State transition matrix A (rows and columns in order /s/, 起, 寝, /e/):

        /s/   起    寝   /e/
  /s/    0    1     0     0
  起     0   0.7   0.1   0.2
  寝     0   0.6   0.3   0.1
  /e/    0    0     0     1

  The initial state /s/ has no incoming transitions; the final state /e/ has a self-transition of 1 (final state transition).
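A state transition matrix is just a row-stochastic table indexed by (past state, present state). A minimal sketch using one plausible reading of the slide's weather matrix (the exact values are a reconstruction from the garbled extraction):

```python
# Weather transition matrix (rows: past state, columns: present state).
# States in order: fine (晴), cloudy (曇), rain (雨).
states = ["fine", "cloudy", "rain"]
A = [
    [0.6, 0.3, 0.1],  # from fine
    [0.3, 0.3, 0.4],  # from cloudy
    [0.1, 0.5, 0.4],  # from rain
]

def transition_prob(past, present):
    """p(present | past), read directly from the matrix."""
    return A[states.index(past)][states.index(present)]

# Every row of a state transition matrix sums to one ("sum becomes one").
for row in A:
    assert abs(sum(row) - 1.0) < 1e-9

print(transition_prob("fine", "rain"))  # p(雨|晴) = 0.1
```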

Example of Applications

• Language model (modeling discrete variables)
  • Modeling a word (morpheme) sequence with a Markov model (n-gram)
    e.g., 「学校に行く」 ("go to school") → /s/, 学校, に, 行, く, /e/
    Decompose into morphemes, then model with a bi-gram (2-gram):
    p(/s/, 学校, に, 行, く, /e/) = p(学校|/s/) p(に|学校) p(行|に) p(く|行) p(/e/|く)
• Autoregressive model (modeling continuous variables)
  • Predicting the present data point from the past M data points, as a linear combination of the past M data points

Model Training (Maximum Likelihood Estimation)

• Training of the conditional probability distributions from sequential data samples given as training data
• Likelihood function (a function of the model parameter set λ):
  L(λ) = ∏ p(w_c | w_p, λ),
  where the product runs over all adjacent pairs {w_p, w_c} of past and current symbols in the training data
• Determine the conditional probability distributions that maximize the (log-scaled) likelihood function, subject to the constraint that normalizes the estimates as probabilities:
  Σ_{w_c} p(w_c | w_p, λ) = 1
• Maximum likelihood estimate:
  p(w_c | w_p) = (# of samples of {w_p, w_c}) / (# of samples of w_p)
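The count-ratio estimate above can be sketched directly. Here the training data is just the slide's single example sentence 「学校に行く」, so most estimates trivially come out as 1.0; with more sentences the counts would spread out:

```python
from collections import Counter

# MLE of bigram probabilities: p(w_c | w_p) = count(w_p, w_c) / count(w_p).
sentences = [["/s/", "学校", "に", "行", "く", "/e/"]]

pair_counts = Counter()
prev_counts = Counter()
for sent in sentences:
    for w_p, w_c in zip(sent, sent[1:]):
        pair_counts[(w_p, w_c)] += 1
        prev_counts[w_p] += 1

def p(w_c, w_p):
    """Maximum likelihood estimate of p(w_c | w_p)."""
    return pair_counts[(w_p, w_c)] / prev_counts[w_p]

print(p("学校", "/s/"))  # 1.0 with this one-sentence training set
```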

Example of MLE

• MLE of conditional probabilities from the log of A-san's menu (the dish pictures on the original slide were lost in extraction). The estimated conditional probabilities are 1/4 and 3/4 from one past dish; 2/4, 1/4, and 1/4 from another; and 1/8, 4/8, and 3/8 from a third.

[State transition diagram labeled with these estimated probabilities]

Methods for Evaluating Models

• Use of a test data set {w_1, …, w_N} not included in the training data
• Evaluation metrics
  • Likelihood: p(w_1, …, w_N | λ) = ∏_{n=1}^{N} p(w_n | w_{n-1}, λ)
  • Log-scaled likelihood: log_2 p(w_1, …, w_N | λ) = Σ_{n=1}^{N} log_2 p(w_n | w_{n-1}, λ)
  • Entropy: H = -(1/N) log_2 p(w_1, …, w_N | λ) = -(1/N) Σ_{n=1}^{N} log_2 p(w_n | w_{n-1}, λ)
  • Perplexity: PP = 2^H — a measure of the effective "branching factor"
    e.g., if a uniform distribution is set to all n-gram probabilities for M words:
    H = -(1/N) Σ_{n=1}^{N} log_2 (1/M) = log_2 M, so PP = 2^{log_2 M} = M
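The entropy and perplexity definitions can be sketched in a few lines; the uniform-distribution sanity check from the slide (PP = M) makes a convenient test:

```python
import math

# Entropy H = -(1/N) * sum_n log2 p(w_n | w_{n-1}); perplexity PP = 2^H.
def perplexity(probs):
    """probs: the conditional probabilities p(w_n | w_{n-1}) of a test sequence."""
    N = len(probs)
    H = -sum(math.log2(p) for p in probs) / N
    return 2 ** H

# Sanity check from the slide: a uniform distribution over M words
# gives H = log2 M and PP = M (here M = 8).
M = 8
uniform = [1 / M] * 10
print(perplexity(uniform))  # 8.0
```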

Classification / Inference / Generation

• Classification: given sequential data and multiple models, calculate the conditional probability under each model and select the model with the maximum posterior probability → classification result
• Inference (data or probability): given sequential data and a model, infer (predict) unobserved data or its probability
• Generation: generate data from a model based on the joint probability distribution

Classification w/ Maximum A Posteriori

• Select the model that maximizes the posterior probability:

  î = argmax_i p(λ_i | x_1, …, x_N)
    = argmax_i p(x_1, …, x_N | λ_i) p(λ_i)

  (posterior probability ∝ likelihood function × prior probability; the denominator p(x_1, …, x_N) is constant over models)

• If the prior probability is given by a uniform distribution, the model is selected by

  î = argmax_i p(x_1, …, x_N | λ_i)

Classification

• Comparison of model likelihoods (起: wake up, 寝: sleep)

Models (state transition probabilities; both start with p(起|/s/) = 1):
  Model 1: p(起|起) = 0.7, p(寝|起) = 0.1, p(/e/|起) = 0.2; p(起|寝) = 0.8, p(寝|寝) = 0.1, p(/e/|寝) = 0.1
  Model 2: p(起|起) = 0.6, p(寝|起) = 0.2, p(/e/|起) = 0.2; p(起|寝) = 0.4, p(寝|寝) = 0.4, p(/e/|寝) = 0.2

Q. Which model is this data sample classified to?
Observed data: /s/ 起 起 寝 起 /e/

Likelihood: p(起|/s/) p(起|起) p(寝|起) p(起|寝) p(/e/|起)
  Model 1: 1 × 0.7 × 0.1 × 0.8 × 0.2 = 0.0112
  Model 2: 1 × 0.6 × 0.2 × 0.4 × 0.2 = 0.0096

(Trellis: expansion of the state transition graph over the time axis.)

A. Classified to model 1.
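The likelihood comparison above can be sketched as follows; the transition probabilities are the values reconstructed from the slide's two wake(起)/sleep(寝) models:

```python
# Transition probabilities of the two models, keyed by (past, present).
model1 = {
    ("/s/", "起"): 1.0, ("/s/", "寝"): 0.0,
    ("起", "起"): 0.7, ("起", "寝"): 0.1, ("起", "/e/"): 0.2,
    ("寝", "起"): 0.8, ("寝", "寝"): 0.1, ("寝", "/e/"): 0.1,
}
model2 = {
    ("/s/", "起"): 1.0, ("/s/", "寝"): 0.0,
    ("起", "起"): 0.6, ("起", "寝"): 0.2, ("起", "/e/"): 0.2,
    ("寝", "起"): 0.4, ("寝", "寝"): 0.4, ("寝", "/e/"): 0.2,
}

def likelihood(model, seq):
    """Product of transition probabilities along the observed sequence."""
    p = 1.0
    for a, b in zip(seq, seq[1:]):
        p *= model.get((a, b), 0.0)
    return p

obs = ["/s/", "起", "起", "寝", "起", "/e/"]
print(round(likelihood(model1, obs), 6))  # 0.0112
print(round(likelihood(model2, obs), 6))  # 0.0096
# Model 1 has the higher likelihood, so the data is classified to model 1.
```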

Marginalization for Unobserved Data

• Likelihood calculation with marginalization even if a part of the data is not observed.

Observed data: /s/ ? 寝 ? /e/

Likelihood (consider all possible data samples):
  Σ_{x_1 ∈ {起,寝}} Σ_{x_3 ∈ {起,寝}} p(x_1|/s/) p(寝|x_1) p(x_3|寝) p(/e/|x_3)
  = {Σ_{x_1 ∈ {起,寝}} p(x_1|/s/) p(寝|x_1)} × {Σ_{x_3 ∈ {起,寝}} p(x_3|寝) p(/e/|x_3)}

Q. Which model is this data sample classified to?
  Model 1: (1 × 0.1 + 0 × 0.1) × (0.8 × 0.2 + 0.1 × 0.1) = 0.1 × 0.17 = 0.017
  Model 2: (1 × 0.2 + 0 × 0.4) × (0.4 × 0.2 + 0.4 × 0.2) = 0.2 × 0.16 = 0.032

(Trellis: all paths through 起 and 寝 compatible with the observation are summed.)

A. Classified to model 2.
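The marginalization above can be sketched by brute-force enumeration of the unobserved points x1 and x3 (the transition probabilities are the values reconstructed from the slides):

```python
from itertools import product

model1 = {
    ("/s/", "起"): 1.0, ("/s/", "寝"): 0.0,
    ("起", "起"): 0.7, ("起", "寝"): 0.1, ("起", "/e/"): 0.2,
    ("寝", "起"): 0.8, ("寝", "寝"): 0.1, ("寝", "/e/"): 0.1,
}
model2 = {
    ("/s/", "起"): 1.0, ("/s/", "寝"): 0.0,
    ("起", "起"): 0.6, ("起", "寝"): 0.2, ("起", "/e/"): 0.2,
    ("寝", "起"): 0.4, ("寝", "寝"): 0.4, ("寝", "/e/"): 0.2,
}

def marginal_likelihood(model):
    """Sum the path probability of /s/ x1 寝 x3 /e/ over all x1, x3."""
    total = 0.0
    for x1, x3 in product(["起", "寝"], repeat=2):
        seq = ["/s/", x1, "寝", x3, "/e/"]
        p = 1.0
        for a, b in zip(seq, seq[1:]):
            p *= model.get((a, b), 0.0)
        total += p
    return total

print(round(marginal_likelihood(model1), 6))  # 0.017
print(round(marginal_likelihood(model2), 6))  # 0.032
# Model 2 has the higher marginal likelihood.
```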

?

Inference• Calculationofposteriorprobability

098.02.07.07.011 =××××=/s/ 起 起 起 /e/

860.0016.0098.0

098.0≅

+

Observeddata: /s/起 ?起 /e/

{ }∑∈

====

寝起

起起

起起起起起起

,2

312

2

),,,,(),,,,(),|(

xexsp

espxxxp

寝016.02.08.01.011 =××××=

140.0016.0098.0

016.0≅

+

),,,,( esp 起起起

),,,,( esp 起寝起

==== ),|( 312 起起起 xxxp

==== ),|( 312 起起寝 xxxpA.“?”is“起” with86%ofprobability.

/s/

起 寝

/e/

0.8

0.1

1

0.70.1

0.2 0.1

Model1

Posteriorprobability:

Q.Which“起” or“寝” islikelyobservedat“?”point?

0.8

1 0.7

0.1

0.20.7

20

?

?
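The posterior computation above can be sketched as follows, again using the model 1 transition probabilities reconstructed from the slides:

```python
# Posterior probability of the hidden point x2 in /s/ 起 ? 起 /e/ under model 1.
model1 = {
    ("/s/", "起"): 1.0, ("/s/", "寝"): 0.0,
    ("起", "起"): 0.7, ("起", "寝"): 0.1, ("起", "/e/"): 0.2,
    ("寝", "起"): 0.8, ("寝", "寝"): 0.1, ("寝", "/e/"): 0.1,
}

def path_prob(seq, model):
    """Joint probability of one fully specified state sequence."""
    p = 1.0
    for a, b in zip(seq, seq[1:]):
        p *= model.get((a, b), 0.0)
    return p

# Joint probability of each completion, then normalize to get the posterior.
joint = {x2: path_prob(["/s/", "起", x2, "起", "/e/"], model1)
         for x2 in ["起", "寝"]}
total = sum(joint.values())
posterior = {x2: joint[x2] / total for x2 in joint}

print(round(joint["起"], 6), round(joint["寝"], 6))  # 0.098 0.016
print(round(posterior["起"], 3))                     # 0.86
```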

Generation

• Random generation of data samples from the model (Model 1)
  1. Sample x_1 from p(x | /s/)
  2. Sample x_2 from p(x | x_1)
  3. Sample x_3 from p(x | x_2)
  4. Sample x_4 from p(x | x_3)
  …
  End if the final state /e/ is sampled.

Various lengths of data are generated, e.g., /s/ 起 寝 起 … /e/.
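The sampling loop above can be sketched as follows (model 1's transition probabilities as reconstructed from the slides; a `max_len` cap is added as a safety guard, not part of the slide):

```python
import random

# Random generation from model 1: repeatedly sample the next state from the
# current state's transition distribution until the final state /e/ is drawn.
model1 = {
    "/s/": {"起": 1.0},
    "起": {"起": 0.7, "寝": 0.1, "/e/": 0.2},
    "寝": {"起": 0.8, "寝": 0.1, "/e/": 0.1},
}

def generate(model, max_len=100):
    state, seq = "/s/", []
    while len(seq) < max_len:
        nxt = random.choices(list(model[state]),
                             weights=list(model[state].values()))[0]
        if nxt == "/e/":
            break
        seq.append(nxt)
        state = nxt
    return seq

random.seed(0)
print(generate(model1))  # a variable-length sequence of 起/寝
```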

Maximum Likelihood Data Generation

• Data generation by maximizing the likelihood under the condition that the length of data is given (Model 1, data length 3)

Dynamic programming:
1. Store the best path at each state
   t = 1: score(起) = p(起|/s/) = 1; score(寝) = p(寝|/s/) = 0
   t = 2: score(起) = max(1 × 0.7, 0 × 0.8) = 0.7; score(寝) = max(1 × 0.1, 0 × 0.1) = 0.1
   t = 3: score(起) = max(0.7 × 0.7, 0.1 × 0.8) = 0.49; score(寝) = max(0.7 × 0.1, 0.1 × 0.1) = 0.07
   Final state: max(0.49 × 0.2, 0.07 × 0.1) = 0.098
2. Backtrack the best path from the final state

/s/ 起 起 起 /e/ is generated if setting the data length to 3.
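The two dynamic-programming steps above (store the best path, then backtrack) can be sketched as a small Viterbi-style routine over model 1's transition probabilities as reconstructed from the slides:

```python
# Maximum likelihood generation of a fixed-length sequence from model 1:
# dynamic programming over the trellis, then backtracking.
trans = {
    "/s/": {"起": 1.0, "寝": 0.0},
    "起": {"起": 0.7, "寝": 0.1, "/e/": 0.2},
    "寝": {"起": 0.8, "寝": 0.1, "/e/": 0.1},
}
states = ["起", "寝"]

def best_sequence(length):
    # delta[s]: best path score ending in state s; psi: backpointers per step.
    delta = {s: trans["/s/"][s] for s in states}
    psi = []
    for _ in range(length - 1):
        new_delta, back = {}, {}
        for s in states:
            prev = max(states, key=lambda q: delta[q] * trans[q][s])
            new_delta[s] = delta[prev] * trans[prev][s]
            back[s] = prev
        psi.append(back)
        delta = new_delta
    # Final transition into /e/, then backtrack from the best last state.
    last = max(states, key=lambda s: delta[s] * trans[s]["/e/"])
    score = delta[last] * trans[last]["/e/"]
    seq = [last]
    for back in reversed(psi):
        seq.append(back[seq[-1]])
    return list(reversed(seq)), score

seq, score = best_sequence(3)
print(seq, round(score, 6))  # ['起', '起', '起'] 0.098
```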

You Can Answer the Questions!

Let's use Markov models to answer them!

(A-san and B-san, each with a log of their menus; the dish pictures on the original slide were lost in extraction.)

One day, someone ate the following menu. [pictures lost]

Q1. A-san or B-san?
Q2. If this is A-san's menu, which dish is "?"? (the two candidate dishes were shown as pictures)
Q3. …