mallet tutorial
TRANSCRIPT
![Page 1: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/1.jpg)
Machine Learning with MALLET
http://mallet.cs.umass.edu
David Mimno
Information Extraction and Synthesis Laboratory, Department of CS
UMass, Amherst
![Page 2: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/2.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 4: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/4.jpg)
Who?
• Andrew McCallum (most of the work)
• Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia…
• Fernando Pereira, others at Penn…
![Page 5: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/5.jpg)
Who am I?
• Chief maintainer of MALLET
• Primary author of MALLET topic modeling package
![Page 6: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/6.jpg)
Why?
• Motivation: text classification and information extraction
• Commercial machine learning (Just Research, WhizBang)
• Analysis and indexing of academic publications: Cora, Rexa
![Page 7: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/7.jpg)
What?
• Text focus: data is discrete rather than continuous, even when values could be continuous:
double value = 3.0
![Page 8: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/8.jpg)
How?
• Command line scripts:
– bin/mallet [command] --[option] [value] …
– Text User Interface (“tui”) classes
• Direct Java API (most of this talk)
– http://mallet.cs.umass.edu/api
![Page 9: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/9.jpg)
History
• Version 0.4: c. 2004
– Classes in edu.umass.cs.mallet.base.*
• Version 2.0: c. 2008
– Classes in cc.mallet.*
– Major changes to finite state transducer package
– bin/mallet vs. specialized scripts
– Java 1.5 generics
![Page 10: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/10.jpg)
Learning More
• http://mallet.cs.umass.edu
– “Quick Start” guides, focused on command line processing
– Developers’ guides, with Java examples
• mallet‐[email protected]
– Low volume, but can be bursty
![Page 11: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/11.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 12: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/12.jpg)
Models for Text Data
• Generative models (Multinomials)
– Naïve Bayes
– Hidden Markov Models (HMMs)
– Latent Dirichlet Topic Models
• Discriminative Regression Models
– MaxEnt / Logistic regression
– Conditional Random Fields (CRFs)
![Page 13: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/13.jpg)
Representations
• Transform text documents to vectors x1, x2, …
• Retain meaning of vector indices
• Ideally sparsely
Call me Ishmael. …
Document
![Page 14: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/14.jpg)
Representations
• Transform text documents to vectors x1, x2, …
• Retain meaning of vector indices
• Ideally sparsely
Call me Ishmael. …
Document
1.0 0.0 … 0.0 6.0 0.0 … 3.0 …
xi
![Page 15: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/15.jpg)
Representations
• Elements of vector are called feature values
• Example: Feature at row 345 is number of times “dog” appears in document
1.0 0.0 … 0.0 6.0 0.0 … 3.0 …
xi
![Page 16: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/16.jpg)
Documents to Vectors
Call me Ishmael.
Document
![Page 17: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/17.jpg)
Documents to Vectors
Call me Ishmael.
Document
Call me Ishmael
Tokens
![Page 18: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/18.jpg)
Documents to Vectors
Call me Ishmael
Tokens
call me ishmael
Tokens
![Page 19: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/19.jpg)
Documents to Vectors
call me ishmael
Tokens
473, 3591, 17
Features
17 ishmael … 473 call … 3591 me
![Page 20: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/20.jpg)
Documents to Vectors
17 1.0, 473 1.0, 3591 1.0
Features (bag)
473, 3591, 17
Features (sequence)
17 ishmael … 473 call … 3591 me
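The tokenize, lowercase, index, and count steps on these slides can be sketched in plain Java. This is an illustration only, not MALLET code: the class `DocumentToVectorSketch` and its methods are hypothetical, and indices are assigned in order of first appearance rather than the 17 / 473 / 3591 values shown on the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the document-to-vector steps; not part of MALLET.
class DocumentToVectorSketch {
    // Stand-in for an alphabet: token string -> integer index.
    private final Map<String, Integer> index = new HashMap<>();

    // Split on non-letters and lowercase each token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    // Feature sequence: one index per token, order preserved.
    public int[] toSequence(List<String> tokens) {
        int[] seq = new int[tokens.size()];
        for (int i = 0; i < seq.length; i++) {
            String tok = tokens.get(i);
            Integer id = index.get(tok);
            if (id == null) {
                id = index.size();
                index.put(tok, id);
            }
            seq[i] = id;
        }
        return seq;
    }

    // Feature bag: index -> count, order lost, duplicates counted.
    public static Map<Integer, Double> toBag(int[] seq) {
        Map<Integer, Double> bag = new TreeMap<>();
        for (int f : seq) bag.merge(f, 1.0, Double::sum);
        return bag;
    }

    public static void main(String[] args) {
        DocumentToVectorSketch sketch = new DocumentToVectorSketch();
        int[] seq = sketch.toSequence(tokenize("Call me Ishmael."));
        System.out.println(toBag(seq));
    }
}
```

The sequence form keeps word order (needed for sequence tagging and topic models), while the bag form is the sparse count vector used by classifiers.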
![Page 21: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/21.jpg)
Instances
Email message, web page, sentence, journal abstract…
• Name: What is it called?
• Data: What is the input?
• Target/Label: What is the output?
• Source: What did it originally look like?
![Page 22: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/22.jpg)
Instances
• Name
• Data
• Target
• Source
String
TokenSequence: ArrayList<Token>
FeatureSequence: int[]
FeatureVector: int -> double map
cc.mallet.types
![Page 23: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/23.jpg)
Alphabets
TObjectIntHashMap map
ArrayList entries
int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)
cc.mallet.types, gnu.trove
17 ishmael … 473 call … 3591 me
![Page 24: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/24.jpg)
Alphabets
TObjectIntHashMap map
ArrayList entries
int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)
cc.mallet.types, gnu.trove
17 ishmael … 473 call … 3591 me
for
![Page 25: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/25.jpg)
Alphabets
TObjectIntHashMap map
ArrayList entries
cc.mallet.types, gnu.trove
17 ishmael … 473 call … 3591 me
void stopGrowth()
void startGrowth()
Do not add entries for new Objects -- default is to allow growth.
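The two-way mapping described on the Alphabets slides (a hash map from object to index plus a list from index to object) can be sketched as follows. This is a minimal stand-in written for illustration, not MALLET's source: the class name `AlphabetSketch` is hypothetical, it uses `java.util.HashMap` instead of the Trove map, and returning -1 for an unknown object is this sketch's own convention.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the idea behind cc.mallet.types.Alphabet.
class AlphabetSketch {
    private final Map<Object, Integer> map = new HashMap<>();  // object -> index
    private final List<Object> entries = new ArrayList<>();    // index -> object
    private boolean growthStopped = false;

    // Returns the index of o. Adds o if it is new, shouldAdd is true, and
    // growth has not been stopped; otherwise returns -1 for unknown objects.
    public int lookupIndex(Object o, boolean shouldAdd) {
        Integer i = map.get(o);
        if (i != null) return i;
        if (!shouldAdd || growthStopped) return -1;
        map.put(o, entries.size());
        entries.add(o);
        return entries.size() - 1;
    }

    public Object lookupObject(int index) {
        return entries.get(index);
    }

    public void stopGrowth() { growthStopped = true; }

    public void startGrowth() { growthStopped = false; }

    public int size() { return entries.size(); }
}
```

Stopping growth is typically useful when processing test data, so that words unseen in training do not enlarge the feature space the model was trained on.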
![Page 26: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/26.jpg)
Creating Instances
• Instance constructor method
• Iterators
new Instance(data, target, name, source)
Iterator<Instance>
FileIterator(File[], …)
CsvIterator(FileReader, Pattern…)
ArrayIterator(Object[])
…
cc.mallet.pipe.iterator
![Page 27: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/27.jpg)
Creating Instances
• FileIterator
cc.mallet.pipe.iterator
/data/bad/
/data/good/
Label from dir name
Each instance in its own file
![Page 28: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/28.jpg)
Creating Instances
• CsvIterator
cc.mallet.pipe.iterator
Name, label, data from regular expression groups.
“CSV” is a lousy name. LineRegexIterator?
Each instance on its own line
1001 Melville Call me Ishmael. Some years ago…
1002 Dickens It was the best of times, it was…
^([^\t]+)\t([^\t]+)\t(.*)
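To see what the three regular expression groups capture, the pattern from the slide can be tried with `java.util.regex` alone. This sketch assumes the sample lines are tab-separated (the slide text does not show the delimiters); the class `LineRegexSketch` is hypothetical and does not use CsvIterator itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the line pattern; independent of MALLET.
class LineRegexSketch {
    // Same pattern as on the slide, written as a Java string literal.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    // Returns {name, label, data}, or null if the line does not match.
    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("1001\tMelville\tCall me Ishmael.");
        System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
    }
}
```

Group 1 becomes the instance name, group 2 the label, and group 3 the data; a line without two tabs does not match at all.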
![Page 29: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/29.jpg)
Instance Pipelines
• Sequential transformations of instance fields (usually Data)
• Pass an ArrayList<Pipe> to SerialPipes
cc.mallet.pipe
// “data” is a String
CharSequence2TokenSequence // tokenize with regexp
TokenSequenceLowercase // modify each token’s text
TokenSequenceRemoveStopwords // drop some tokens
TokenSequence2FeatureSequence // convert token Strings to ints
FeatureSequence2FeatureVector // lose order, count duplicates
![Page 30: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/30.jpg)
Instance Pipelines
• A small number of pipes modify the “target” field
• There are now two alphabets: data and label
cc.mallet.pipe, cc.mallet.types
// “target” is a String
Target2Label // convert String to int
// “target” is now a Label
Alphabet > LabelAlphabet
![Page 31: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/31.jpg)
Label objects
• Weights on a fixed set of classes
• For training data, weight for correct label is 1.0, all others 0.0
cc.mallet.types
implements Labeling
int getBestIndex()
Label getBestLabel()
You cannot create a Label; they are only produced by LabelAlphabet
![Page 32: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/32.jpg)
Instance Lists
• A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet
cc.mallet.types
InstanceList instances = new InstanceList(pipe);
instances.addThruPipe(iterator);
![Page 33: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/33.jpg)
Putting it all together
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Target2Label());
pipeList.add(new CharSequence2TokenSequence());
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());
InstanceList instances = new InstanceList(new SerialPipes(pipeList));
instances.addThruPipe(new FileIterator(. . .));
![Page 34: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/34.jpg)
Persistent Storage
• Most MALLET classes use Java serialization to store models and data
java.io
ObjectOutputStream oos = new ObjectOutputStream(…);
oos.writeObject(instances);
oos.close();
Pipes, data objects, labelings, etc. all need to implement Serializable.
Be sure to include custom classes in classpath, or you get a StreamCorruptedException
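The write side shown on the slide has a matching read side using `ObjectInputStream`. The sketch below round-trips a small object through an in-memory byte array so it runs without any files; the `Doc` class is a hypothetical stand-in for an InstanceList, and only standard `java.io` calls are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical round-trip demo of Java serialization; independent of MALLET.
class SerializationSketch {
    // Anything written with writeObject must implement Serializable.
    public static class Doc implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String name;
        public final int[] features;

        public Doc(String name, int[] features) {
            this.name = name;
            this.features = features;
        }
    }

    // Writes the object to bytes and reads it back as a new copy.
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T roundTrip(T object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
                oos.writeObject(object);
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Doc copy = roundTrip(new Doc("1001", new int[] {473, 3591, 17}));
        System.out.println(copy.name + " has " + copy.features.length + " features");
    }
}
```

With files, the same pattern applies with `FileOutputStream` and `FileInputStream` in place of the byte-array streams.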
![Page 35: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/35.jpg)
Review
• What are the four main fields in an Instance?
• What are two ways to generate Instances?
• How do we modify the value of Instance fields?
• Name some classes that appear in the “data” field.
![Page 39: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/39.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 40: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/40.jpg)
Classifier objects
• Classifiers map from instances to distributions over a fixed set of classes
• MaxEnt, Naïve Bayes, Decision Trees…
cc.mallet.classify
Given data: watery
Which class is best? NN JJ PRP VB CC
[Figure: “(this one!)” points at one of the classes]
![Page 41: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/41.jpg)
Classifier objects
• Classifiers map from instances to distributions over a fixed set of classes
• MaxEnt, Naïve Bayes, Decision Trees…
cc.mallet.classify
// classify() returns a Classification, which wraps the Labeling
Labeling labeling = classifier.classify(instance).getLabeling();
Label l = labeling.getBestLabel();
System.out.print(instance + "\t");
System.out.println(l);
![Page 42: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/42.jpg)
Training Classifier objects
cc.mallet.classify
ClassifierTrainer trainer = new MaxEntTrainer();
Classifier classifier = trainer.train(instances);
• Each type of classifier has one or more ClassifierTrainer classes
![Page 43: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/43.jpg)
Training Classifier objects
cc.mallet.optimize
log P(Labels | Data) =
log f(label1, data1, w) +
log f(label2, data2, w) +
log f(label3, data3, w) + …
• Some classifiers require numerical optimization of an objective function.
Maximize w.r.t. w!
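Written compactly, the objective on this slide is a sum over training instances. The slide leaves f unspecified; as an assumption for illustration, the second line gives the standard MaxEnt (multinomial logistic regression) form, where w_y is the weight vector for class y and x is the feature vector.

```latex
\log P(\text{Labels} \mid \text{Data}) = \sum_{i} \log f(\text{label}_i, \text{data}_i, w)
\qquad\text{with, for MaxEnt,}\qquad
f(y, x, w) = \frac{\exp(w_y \cdot x)}{\sum_{y'} \exp(w_{y'} \cdot x)}
```

Because the sum has no closed-form maximizer in w, a numerical optimizer is needed.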
![Page 44: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/44.jpg)
Parameters w
• Association between feature, class label
• How many parameters for K classes and N features?
action NN 0.13
action VB -0.1
action JJ -0.21
SUFF-tion NN 1.3
SUFF-tion VB -2.1
SUFF-tion JJ -1.7
SUFF-on NN 0.01
SUFF-on VB -0.02
…
![Page 45: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/45.jpg)
Training Classifier objects
cc.mallet.optimize
interface Optimizer // Limited-memory BFGS, Conjugate gradient…
boolean optimize()
interface Optimizable // Specific objective functions
interface ByValue
interface ByValueGradient
![Page 46: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/46.jpg)
Training Classifier objects
cc.mallet.classify
MaxEntOptimizableByLabelLikelihood
// For Optimizable interface
double[] getParameters()
void setParameters(double[] parameters)
…
// Log likelihood and its first derivative
double getValue()
void getValueGradient(double[] buffer)
![Page 47: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/47.jpg)
Evaluation of Classifiers
• Create random test/train splits
cc.mallet.types
InstanceList[] instanceLists =
instances.split(new Randoms(), new double[] {0.9, 0.1, 0.0});
90% training
10% testing
0% validation
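What a proportional random split does can be sketched without MALLET: shuffle a copy of the list, then cut it at cumulative proportions. The class `SplitSketch` is hypothetical, and the rounding rule is this sketch's choice; `InstanceList.split` may cut at slightly different positions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a proportional random split; not MALLET code.
class SplitSketch {
    // Shuffles a copy of items and cuts it into consecutive parts sized by proportions.
    public static <T> List<List<T>> split(List<T> items, Random rng, double[] proportions) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, rng);
        double total = 0.0;
        for (double p : proportions) total += p;
        List<List<T>> parts = new ArrayList<>();
        int start = 0;
        double cumulative = 0.0;
        for (double p : proportions) {
            cumulative += p;
            // Cut at the rounded cumulative fraction, clamped to a valid range.
            int end = (int) Math.round(shuffled.size() * cumulative / total);
            end = Math.max(start, Math.min(end, shuffled.size()));
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 100; i++) items.add(i);
        List<List<Integer>> parts = split(items, new Random(), new double[] {0.9, 0.1, 0.0});
        System.out.println(parts.get(0).size() + " training, " + parts.get(1).size() + " testing");
    }
}
```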
![Page 48: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/48.jpg)
Evaluation of Classifiers
• The Trial class stores the results of classifications on an InstanceList (testing or training)
cc.mallet.classify
Trial(Classifier c, InstanceList list)
double getAccuracy()
double getAverageRank()
double getF1(int/Label/Object)
double getPrecision(…)
double getRecall(…)
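The quantities reported by these methods have standard definitions, sketched below over plain arrays of gold and predicted label indices. `TrialSketch` is a hypothetical class for illustration, not the MALLET Trial class, and returning 0.0 when a denominator is zero is this sketch's convention.

```java
// Hypothetical illustration of accuracy, precision, recall, and F1; not MALLET code.
class TrialSketch {
    private final int[] gold;
    private final int[] predicted;

    public TrialSketch(int[] gold, int[] predicted) {
        if (gold.length != predicted.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        this.gold = gold;
        this.predicted = predicted;
    }

    // Fraction of instances whose predicted label equals the gold label.
    public double accuracy() {
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == predicted[i]) correct++;
        }
        return gold.length == 0 ? 0.0 : (double) correct / gold.length;
    }

    // Of the instances predicted as label, the fraction that truly are label.
    public double precision(int label) {
        int truePositives = 0, predictedAsLabel = 0;
        for (int i = 0; i < gold.length; i++) {
            if (predicted[i] == label) {
                predictedAsLabel++;
                if (gold[i] == label) truePositives++;
            }
        }
        return predictedAsLabel == 0 ? 0.0 : (double) truePositives / predictedAsLabel;
    }

    // Of the instances that truly are label, the fraction predicted as label.
    public double recall(int label) {
        int truePositives = 0, goldLabel = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == label) {
                goldLabel++;
                if (predicted[i] == label) truePositives++;
            }
        }
        return goldLabel == 0 ? 0.0 : (double) truePositives / goldLabel;
    }

    // Harmonic mean of precision and recall.
    public double f1(int label) {
        double p = precision(label), r = recall(label);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```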
![Page 49: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/49.jpg)
Review
• I have invented a new classifier: David regression.
– What class should I implement to classify instances?
![Page 50: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/50.jpg)
Review
• I have invented a new classifier: David regression.
– What class should I implement to train a David regression classifier?
![Page 51: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/51.jpg)
Review
• I have invented a new classifier: David regression.
– I want to train using ByValueGradient. What mathematical functions do I need to code up, and what class should I put them in?
![Page 52: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/52.jpg)
Review
• I have invented a new classifier: David regression.
– How would I check whether my new classifier works better than Naïve Bayes?
![Page 53: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/53.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 54: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/54.jpg)
Sequence Tagging
• Data occurs in sequences
• Categorical labels for each position
• Labels are correlated
DET NN VBS VBG
the dog likes running
![Page 55: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/55.jpg)
Sequence Tagging
• Data occurs in sequences
• Categorical labels for each position
• Labels are correlated
?? ?? ?? ??
the dog likes running
![Page 56: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/56.jpg)
Sequence Tagging
• Classification: n-way
• Sequence Tagging: n^T-way
[Figure: candidate tags NN, JJ, PRP, VB, CC stacked above each position]
or red dogs on blue trees
![Page 57: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/57.jpg)
Avoiding Exponential Blowup
• Markov property
• Dynamic programming
Andrei Markov
![Page 58: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/58.jpg)
Avoiding Exponential Blowup
• Markov property
• Dynamic programming
This one
Given this one
Is independent of these
DET JJ NN VB
Andrei Markov
![Page 59: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/59.jpg)
Avoiding Exponential Blowup
• Markov property
• Dynamic programming
[Figure: candidate tags NN, JJ, PRP, VB, CC stacked above each position]
or red dogs on blue trees
Andrei Markov
![Page 60: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/60.jpg)
Avoiding Exponential Blowup
• Markov property
• Dynamic programming
[Figure: candidate tags NN, JJ, PRP, VB, CC stacked above each position]
red dogs on blue trees
Andrei Markov
![Page 61: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/61.jpg)
Avoiding Exponential Blowup
• Markov property
• Dynamic programming
[Figure: candidate tags NN, JJ, PRP, VB, CC stacked above each position]
dogs on blue trees
Andrei Markov
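One standard dynamic program that exploits the Markov property is Viterbi decoding, sketched below. The slides do not name the algorithm, so treat this as an illustration of the idea rather than MALLET's implementation; the class `ViterbiSketch` and the score arrays are hypothetical. With n labels and T positions it does on the order of T·n² work instead of scoring all n^T label sequences.

```java
// Hypothetical Viterbi decoder over log scores; not MALLET's implementation.
class ViterbiSketch {
    // start[s]:    log score of beginning in state s
    // trans[p][s]: log score of moving from state p to state s
    // emit[t][s]:  log score of state s at position t given the observation
    // Returns the highest-scoring state sequence.
    public static int[] decode(double[] start, double[][] trans, double[][] emit) {
        int T = emit.length;
        int n = start.length;
        if (T == 0) return new int[0];

        double[][] best = new double[T][n]; // best score of any path ending in s at t
        int[][] back = new int[T][n];       // previous state on that best path

        for (int s = 0; s < n; s++) best[0][s] = start[s] + emit[0][s];

        for (int t = 1; t < T; t++) {
            for (int s = 0; s < n; s++) {
                int arg = 0;
                double max = Double.NEGATIVE_INFINITY;
                // Markov property: only the previous state matters.
                for (int p = 0; p < n; p++) {
                    double v = best[t - 1][p] + trans[p][s];
                    if (v > max) { max = v; arg = p; }
                }
                best[t][s] = max + emit[t][s];
                back[t][s] = arg;
            }
        }

        // Pick the best final state, then follow back-pointers.
        int last = 0;
        for (int s = 1; s < n; s++) {
            if (best[T - 1][s] > best[T - 1][last]) last = s;
        }
        int[] path = new int[T];
        path[T - 1] = last;
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }
}
```

Each column of the table depends only on the column before it, which mirrors how the figures on these slides consume the sentence one position at a time.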
![Page 62: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/62.jpg)
Hidden Markov Models and Conditional Random Fields
• Hidden Markov Model: fully generative
P(Labels | Data) = P(Data, Labels) / P(Data)
• Conditional Random Field: conditional
P(Labels | Data)
![Page 63: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/63.jpg)
Hidden Markov Models and Conditional Random Fields
• Hidden Markov Model: simple (independent) output space
“NSF-funded”
• Conditional Random Field: arbitrarily complicated outputs
“NSF-funded” CAPITALIZED HYPHENATED ENDS-WITH-ed ENDS-WITH-d …
![Page 64: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/64.jpg)
Hidden Markov Models and Conditional Random Fields
• Hidden Markov Model: simple (independent) output space
FeatureSequence: int[]
• Conditional Random Field: arbitrarily complicated outputs
FeatureVectorSequence: FeatureVector[]
![Page 65: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/65.jpg)
Importing Data
• SimpleTagger format: one word per line, with instances delimited by a blank line
Call VB
me PPN
Ishmael NNP
. .

Some JJ
years NNS
…
![Page 66: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/66.jpg)
Importing Data
• SimpleTagger format: one word per line, with instances delimited by a blank line
Call SUFF-ll VB
me TWO_LETTERS PPN
Ishmael BIBLICAL_NAME NNP
. PUNCTUATION .

Some CAPITALIZED JJ
years TIME SUFF-s NNS
…
![Page 67: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/67.jpg)
Importing Data
LineGroupIterator
SimpleTaggerSentence2TokenSequence() // String to Tokens, handles labels
TokenSequence2FeatureVectorSequence() // Token objects to FeatureVectors
cc.mallet.pipe, cc.mallet.pipe.iterator
![Page 68: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/68.jpg)
Importing Data
LineGroupIterator
SimpleTaggerSentence2TokenSequence() // String to Tokens, handles labels
[Pipes that modify tokens]
TokenSequence2FeatureVectorSequence() // Token objects to FeatureVectors
cc.mallet.pipe, cc.mallet.pipe.iterator
![Page 69: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/69.jpg)
Importing Data
// Ishmael
TokenTextCharSuffix("C2=", 2)
// Ishmael C2=el
RegexMatches("CAP", Pattern.compile("\\p{Lu}.*")) // must match entire string
// Ishmael C2=el CAP
LexiconMembership("NAME", new File("names"), false) // file: one name per line; false: ignore case?
// Ishmael C2=el CAP NAME
cc.mallet.pipe.tsf
![Page 70: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/70.jpg)
Sliding window features
a red dog on a blue tree
red@-1 a@-2 on@1 a@-2_&_red@-1
![Page 76: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/76.jpg)
Importing Data
int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 }; // previous position
conjunctions[1] = new int[] { 1 }; // next position
conjunctions[2] = new int[] { -2, -1 }; // previous two
OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 on@1
cc.mallet.pipe.tsf
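How an offsets array turns into feature names such as red@-1 or a@-2_&_red@-1 can be sketched in plain Java. This mimics the naming shown on the slides and is not the OffsetConjunctions source; the class `OffsetConjunctionSketch` is hypothetical, it works on token text only, and skipping conjunctions that reach outside the sentence is this sketch's choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of offset conjunction feature names; not MALLET code.
class OffsetConjunctionSketch {
    // For the token at position pos, builds one feature name per conjunction,
    // joining the pieces of a multi-offset conjunction with "_&_".
    public static List<String> features(String[] tokens, int pos, int[][] conjunctions) {
        List<String> out = new ArrayList<>();
        for (int[] offsets : conjunctions) {
            StringBuilder name = new StringBuilder();
            boolean inRange = true;
            for (int k = 0; k < offsets.length; k++) {
                int j = pos + offsets[k];
                if (j < 0 || j >= tokens.length) {
                    inRange = false; // conjunction reaches outside the sentence
                    break;
                }
                if (k > 0) name.append("_&_");
                name.append(tokens[j]).append('@').append(offsets[k]);
            }
            if (inRange) out.add(name.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = "a red dog on a blue tree".split(" ");
        int[][] conjunctions = { { -1 }, { 1 }, { -2, -1 } };
        // Features for "dog" at position 2.
        System.out.println(features(tokens, 2, conjunctions));
    }
}
```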
![Page 77: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/77.jpg)
Importing Data
int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 }; // previous position
conjunctions[1] = new int[] { 1 }; // next position
conjunctions[2] = new int[] { -2, -1 }; // previous two
TokenTextCharSuffix("C1=", 1)
OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 a@-2_&_C1=d@-1
cc.mallet.pipe.tsf
![Page 78: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/78.jpg)
Finite State Transducers
• Finite state machine over two alphabets (observed, hidden)
![Page 79: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/79.jpg)
Finite State Transducers
• Finite state machine over two alphabets (observed, hidden)
DET
P(DET)
![Page 80: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/80.jpg)
Finite State Transducers
• Finite state machine over two alphabets (observed, hidden)
DET
the
P(the | DET)
![Page 81: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/81.jpg)
Finite State Transducers
• Finite state machine over two alphabets (observed, hidden)
DET NN
the
P(NN | DET)
![Page 82: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/82.jpg)
Finite State Transducers
• Finite state machine over two alphabets (observed, hidden)
DET NN
the dog
P(dog | NN)
![Page 83: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/83.jpg)
Finite State Transducers
• Finite state machine over two alphabets (observed, hidden)
DET NN VBS
the dog
P(VBS | NN)
![Page 84: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/84.jpg)
How many parameters?
• Determines efficiency of training
• Too many leads to overfitting
Trick: Don’t allow certain transitions
P(VBS | DET) = 0
![Page 85: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/85.jpg)
How many parameters?
• Determines efficiency of training
• Too many leads to overfitting
[Figure: three copies of the same lattice]
DET NN VBS
the dog runs
![Page 86: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/86.jpg)
Finite State Transducers
abstract class Transducer
CRF
HMM
abstract class TransducerTrainer
CRFTrainerByLabelLikelihood
HMMTrainerByLikelihood
cc.mallet.fst
![Page 87: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/87.jpg)
Finite State Transducers
cc.mallet.fst
First order: one weight for every pair of labels and observations.
CRF crf = new CRF(pipe, null);
crf.addFullyConnectedStates(); // or
crf.addStatesForLabelsConnectedAsIn(instances);
DET NN VBS
the dog runs
![Page 88: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/88.jpg)
Finite State Transducers
cc.mallet.fst
“Three-quarter” order: one weight for every pair of labels and observations.
crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);
DET NN VBS
the dog runs
![Page 89: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/89.jpg)
Finite State Transducers
cc.mallet.fst
Second order: one weight for every triplet of labels and observations.
crf.addStatesForBiLabelsConnectedAsIn(instances);
DET NN VBS
the dog runs
![Page 90: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/90.jpg)
Finite State Transducers
cc.mallet.fst
“Half” order: equivalent to independent classifiers, except some transitions may be illegal.
crf.addStatesForHalfLabelsConnectedAsIn(instances);
DET NN VBS
the dog runs
![Page 91: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/91.jpg)
Training a transducer
CRF crf = new CRF(pipe, null);
crf.addStatesForLabelsConnectedAsIn(trainingInstances);
CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);
trainer.train(trainingInstances); // train() takes the training InstanceList
cc.mallet.fst
![Page 92: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/92.jpg)
Evaluating a transducer
CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(transducer);
TransducerEvaluator evaluator = new TokenAccuracyEvaluator(testing, "testing");
trainer.addEvaluator(evaluator);
trainer.train(training); // "training" stands for the training InstanceList
cc.mallet.fst
![Page 93: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/93.jpg)
Applying a transducer
Sequence output = transducer.transduce(input);
// Print each input token with its predicted label, e.g. token/LABEL
for (int index = 0; index < input.size(); index++) {
    System.out.print(input.get(index) + "/");
    System.out.print(output.get(index) + " ");
}
cc.mallet.fst
![Page 94: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/94.jpg)
Review
• How do you add new features to TokenSequences?
• What are three factors that affect the number of parameters in a model?
![Page 96: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/96.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 97: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/97.jpg)
Topics: “Semantic Groups”
News Article
“Sports”: team, player, game
“Negotiation”: strike, deadline, union
![Page 101: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/101.jpg)
Series Yankees Sox Red World League game Boston team games baseball Mets Game series won Clemens Braves Yankee teams
![Page 102: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/102.jpg)
players League owners league baseball union commissioner Baseball Association labor Commissioner Football major teams Selig agreement strike team bargaining
![Page 103: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/103.jpg)
Training a Topic Model
cc.mallet.topics
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
![Page 104: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/104.jpg)
Evaluating a Topic Model
cc.mallet.topics
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
MarginalProbEstimator evaluator = lda.getProbEstimator();
double logLikelihood = evaluator.evaluateLeftToRight(testing, 10, false, null);
![Page 105: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/105.jpg)
Inferring topics for new documents
cc.mallet.topics
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
TopicInferencer inferencer = lda.getInferencer();
double[] topicProbs = inferencer.getSampledDistribution(instance, 100, 10, 10);
![Page 106: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/106.jpg)
More than words…
• Text collections mix free text and structured data
David Mimno
Andrew McCallum
UAI
2008
…
![Page 107: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/107.jpg)
More than words…
• Text collections mix free text and structured data
David Mimno
Andrew McCallum
UAI
2008
“Topic models conditioned on arbitrary features using Dirichlet-multinomial regression. …”
![Page 108: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/108.jpg)
Dirichlet-multinomial Regression (DMR)
The corpus specifies a vector of real-valued features (x) for each document, of length F.
Each topic has an F-length vector of parameters.
![Page 109: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/109.jpg)
Topic parameters for feature “published in JMLR”
user, users, user interface, interactive, interface: -1.44
web, web pages, web page, world wide web, web sites: -1.36
retrieval, information retrieval, query, query expansion: -1.23
strategies, strategy, adaptation, adaptive, driven: -1.21
agent, agents, multiagent, autonomous agents: -1.12
nearest neighbor, boosting, nearest neighbors, adaboost: 1.37
blind source separation, source separation, separation, channel: 1.40
reinforcement learning, learning, reinforcement: 1.41
bounds, vc dimension, bound, upper bound, lower bounds: 1.74
kernel, kernels, rational kernels, string kernels, fisher kernel: 2.27
![Page 110: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/110.jpg)
Feature parameters for RL topic
<default>: -3.76
COLING: -1.64
IEEE Trans. PAMI: -1.54
CVPR: -1.47
ACL: -1.38
Machine Learning Journal: 2.19
ECML: 2.45
Kenji Doya: 2.56
ICML: 2.88
Sridhar Mahadevan: 2.99
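To read these numbers, it helps to work one case through. The calculation below assumes, following the DMR paper, that the features are binary indicators, that `<default>` acts as an intercept present in every document, and that a topic's prior weight is the exponential of the summed parameters of the features a document has. Under those assumptions, a paper whose only feature is the venue ICML and one whose only feature is the venue COLING get these prior weights for the RL topic (values rounded):

```latex
\begin{align*}
\alpha_{\text{ICML}}   &= \exp(-3.76 + 2.88) = \exp(-0.88) \approx 0.41 \\
\alpha_{\text{COLING}} &= \exp(-3.76 - 1.64) = \exp(-5.40) \approx 0.0045
\end{align*}
```

Before any words are seen, the ICML paper therefore carries roughly ninety times more prior weight on the RL topic than the COLING paper.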
![Page 111: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/111.jpg)
Topic parameters for feature “published in UAI”
nearest neighbor, boosting, nearest neighbors, adaboost: -1.50
descriptions, description, top, bottom, top bottom: -1.50
workshop report, invited talk, international conference, report: -1.37
digital libraries, digital library, digital, library: -1.36
shape, deformable, shapes, contour, active contour: -1.29
reasoning, logic, default reasoning, nonmonotonic reasoning: 2.11
uncertainty, symbolic, sketch, primal sketch, uncertain, […]
probability, probabilities, probability distributions: 2.25
qualitative, reasoning, qualitative reasoning, […]
bayesian networks, bayesian network, belief networks: 2.88
![Page 112: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/112.jpg)
Feature parameters for Bayes nets topic
<default>: -3.36
ICRA: -2.24
Neural Networks: -1.50
COLING: -1.38
Probabilistic Semantics for Nonmonotonic Reasoning (Pearl, KR, 1989): -1.16
Loopy Belief Propagation for Approximate Inference (Murphy, Weiss, and Jordan, UAI, 1999): 2.04
Philippe Smets: 2.15
Ashraf M. Abdelbar: 2.23
Mary-Anne Williams: 2.41
UAI: 2.88
![Page 113: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/113.jpg)
Dirichlet-multinomial Regression
• Arbitrary observed features of documents
• Target contains FeatureVector
DMRTopicModel dmr = new DMRTopicModel(numTopics);
dmr.addInstances(training);
dmr.estimate();
dmr.writeParameters(new File("dmr.parameters"));
![Page 114: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/114.jpg)
Polylingual Topic Modeling
• Topics exist in more languages than you could possibly learn
• Topically comparable documents are much easier to get than translation sets
• Translation dictionaries
– cover pairs, not sets of languages
– miss technical vocabulary
– aren’t available for low-resource languages
![Page 115: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/115.jpg)
Topics from European Parliament Proceedings
![Page 116: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/116.jpg)
Topics from European Parliament Proceedings
![Page 117: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/117.jpg)
Topics from Wikipedia
![Page 118: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/118.jpg)
Aligned instance lists
dog… chien… hund…
cat… chat…
pig… schwein…
![Page 119: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/119.jpg)
Polylingual Topics
InstanceList[] training = new InstanceList[] { english, german, arabic, mahican };
PolylingualTopicModel pltm = new PolylingualTopicModel(numTopics);
pltm.addInstances(training);
![Page 120: Mallet Tutorial](https://reader031.vdocuments.us/reader031/viewer/2022013102/54770e26b4af9f04118b4577/html5/thumbnails/120.jpg)
MALLET hands-on tutorial
http://mallet.cs.umass.edu/mallet-handson.tar.gz