mallet tutorial
![Page 1: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/1.jpg)
Machine Learning with MALLET
http://mallet.cs.umass.edu
David Mimno
Information Extraction and Synthesis Laboratory, Department of CS
UMass, Amherst
![Page 2: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/2.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 4: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/4.jpg)
Who?
• Andrew McCallum (most of the work)
• Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia…
• Fernando Pereira, others at Penn…
![Page 5: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/5.jpg)
Who am I?
• Chief maintainer of MALLET
• Primary author of MALLET topic modeling package
![Page 6: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/6.jpg)
Why?
• Motivation: text classification and information extraction
• Commercial machine learning (Just Research, WhizBang)
• Analysis and indexing of academic publications: Cora, Rexa
![Page 7: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/7.jpg)
What?
• Text focus: data is discrete rather than continuous, even when values could be continuous:
double value = 3.0
![Page 8: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/8.jpg)
How?
• Command line scripts:
– bin/mallet [command] --[option] [value] …
– Text User Interface (“tui”) classes
• Direct Java API (most of this talk)
– http://mallet.cs.umass.edu/api
![Page 9: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/9.jpg)
History
• Version 0.4: c. 2004
– Classes in edu.umass.cs.mallet.base.*
• Version 2.0: c. 2008
– Classes in cc.mallet.*
– Major changes to finite state transducer package
– bin/mallet vs. specialized scripts
– Java 1.5 generics
![Page 10: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/10.jpg)
Learning More
• http://mallet.cs.umass.edu
– “Quick Start” guides, focused on command line processing
– Developers’ guides, with Java examples
• mallet‐[email protected]
– Low volume, but can be bursty
![Page 11: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/11.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 12: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/12.jpg)
Models for Text Data
• Generative models (Multinomials)
– Naïve Bayes
– Hidden Markov Models (HMMs)
– Latent Dirichlet Topic Models
• Discriminative Regression Models
– MaxEnt/Logistic regression
– Conditional Random Fields (CRFs)
![Page 13: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/13.jpg)
Representations
• Transform text documents to vectors x1, x2, …
• Retain meaning of vector indices
• Ideally sparsely
Document: Call me Ishmael. …
xi = (1.0, 0.0, …, 0.0, 6.0, 0.0, …, 3.0, …)
![Page 15: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/15.jpg)
Representations
• Elements of vector are called feature values
• Example: Feature at row 345 is number of times “dog” appears in document
xi = (1.0, 0.0, …, 0.0, 6.0, 0.0, …, 3.0, …)
![Page 16: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/16.jpg)
Documents to Vectors
Document: Call me Ishmael.
Tokens: Call me Ishmael
![Page 18: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/18.jpg)
Documents to Vectors
Tokens: Call me Ishmael
Tokens (lowercased): call me ishmael
![Page 19: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/19.jpg)
Documents to Vectors
Tokens: call me ishmael
Features: 473, 3591, 17
Alphabet: 17 ishmael … 473 call … 3591 me
![Page 20: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/20.jpg)
Documents to Vectors
Features (sequence): 473, 3591, 17
Features (bag): 17 1.0, 473 1.0, 3591 1.0
Alphabet: 17 ishmael … 473 call … 3591 me
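The document-to-vector steps can be sketched in plain Java. This is only an illustration that does not use MALLET: the class `DocumentToVector` and its methods are hypothetical names, the tokenizer is a simple letter-based split, and ids are assigned in order of first appearance (0, 1, 2, …) rather than the 473, 3591, 17 shown on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not a MALLET class): document -> tokens -> feature ids.
class DocumentToVector {
    // Lowercase the text and split on any run of non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Feature sequence: one integer id per token, order preserved.
    // New tokens are added to the vocabulary with the next free id.
    static int[] toFeatureSequence(List<String> tokens, Map<String, Integer> vocab) {
        int[] ids = new int[tokens.size()];
        for (int i = 0; i < ids.length; i++) {
            Integer id = vocab.get(tokens.get(i));
            if (id == null) {
                id = vocab.size();
                vocab.put(tokens.get(i), id);
            }
            ids[i] = id;
        }
        return ids;
    }

    // Bag: id -> count. Order is lost, duplicates are counted.
    static Map<Integer, Double> toBag(int[] sequence) {
        Map<Integer, Double> bag = new TreeMap<>();
        for (int id : sequence) bag.merge(id, 1.0, Double::sum);
        return bag;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocab = new HashMap<>();
        int[] sequence = toFeatureSequence(tokenize("Call me Ishmael."), vocab);
        System.out.println(Arrays.toString(sequence)); // [0, 1, 2]
        System.out.println(toBag(sequence));           // {0=1.0, 1=1.0, 2=1.0}
    }
}
```

The sequence form keeps word order (needed for sequence tagging and topic models), while the bag form keeps only counts (enough for most classifiers).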
![Page 21: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/21.jpg)
Instances
Email message, web page, sentence, journal abstract…
• Name: What is it called?
• Data: What is the input?
• Target/Label: What is the output?
• Source: What did it originally look like?
![Page 22: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/22.jpg)
Instances
• Name
• Data
• Target
• Source
Possible Data types (cc.mallet.types):
String
TokenSequence: ArrayList<Token>
FeatureSequence: int[]
FeatureVector: int -> double map
![Page 23: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/23.jpg)
Alphabets
cc.mallet.types, gnu.trove
TObjectIntHashMap map
ArrayList entries
int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)
17 ishmael … 473 call … 3591 me
![Page 25: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/25.jpg)
Alphabets
cc.mallet.types, gnu.trove
TObjectIntHashMap map
ArrayList entries
void stopGrowth()
void startGrowth()
stopGrowth(): do not add entries for new Objects. The default is to allow growth.
17 ishmael … 473 call … 3591 me
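The two-way mapping behind an alphabet can be sketched as follows. This is a simplified stand-in, not the MALLET `Alphabet` class: `SimpleAlphabet` is a hypothetical name, it uses `java.util` collections instead of gnu.trove, and returning -1 for an entry that cannot be added is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of an alphabet: object -> index and index -> object.
class SimpleAlphabet {
    private final Map<Object, Integer> map = new HashMap<>();
    private final List<Object> entries = new ArrayList<>();
    private boolean growthStopped = false; // default: growth allowed

    // Returns the index of the entry. If it is unknown, adds it when
    // shouldAdd is true and growth is allowed; otherwise returns -1.
    int lookupIndex(Object entry, boolean shouldAdd) {
        Integer index = map.get(entry);
        if (index != null) return index;
        if (!shouldAdd || growthStopped) return -1;
        int newIndex = entries.size();
        map.put(entry, newIndex);
        entries.add(entry);
        return newIndex;
    }

    Object lookupObject(int index) { return entries.get(index); }

    void stopGrowth() { growthStopped = true; }

    void startGrowth() { growthStopped = false; }

    int size() { return entries.size(); }

    public static void main(String[] args) {
        SimpleAlphabet alphabet = new SimpleAlphabet();
        System.out.println(alphabet.lookupIndex("call", true));    // 0
        System.out.println(alphabet.lookupIndex("me", true));      // 1
        alphabet.stopGrowth();
        System.out.println(alphabet.lookupIndex("ishmael", true)); // -1
        System.out.println(alphabet.lookupObject(1));              // me
    }
}
```

Stopping growth is typically useful at test time, so that words never seen in training do not silently extend the feature space.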
![Page 26: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/26.jpg)
Creating Instances
cc.mallet.pipe.iterator
• Instance constructor method
new Instance(data, target, name, source)
• Iterators
Iterator<Instance>
FileIterator(File[], …)
CsvIterator(FileReader, Pattern…)
ArrayIterator(Object[])
…
![Page 27: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/27.jpg)
Creating Instances
• FileIterator
cc.mallet.pipe.iterator
/data/bad/
/data/good/
Label from dir name
Each instance in its own file
![Page 28: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/28.jpg)
Creating Instances
• CsvIterator
cc.mallet.pipe.iterator
Each instance on its own line:
1001 Melville Call me Ishmael. Some years ago…
1002 Dickens It was the best of times, it was…
Name, label, data from regular expression groups:
^([^\t]+)\t([^\t]+)\t(.*)
“CSV” is a lousy name. LineRegexIterator?
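To see what the regular expression does, the sketch below applies it with `java.util.regex` alone. `LineRegexDemo` is a hypothetical class, not part of MALLET, and it assumes the three fields are tab-separated with group 1 as the name, group 2 as the label, and group 3 as the data, as in the example lines.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: split one line into name, label, and data with regex groups.
class LineRegexDemo {
    // Same pattern as on the slide: three tab-separated fields.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    // Returns {name, label, data}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("1001\tMelville\tCall me Ishmael. Some years ago");
        System.out.println(Arrays.toString(fields));
        // [1001, Melville, Call me Ishmael. Some years ago]
    }
}
```

Because the first two groups exclude tabs while the last group is `.*`, any further tabs in the text end up in the data field.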
![Page 29: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/29.jpg)
Instance Pipelines
cc.mallet.pipe
• Sequential transformations of instance fields (usually Data)
• Pass an ArrayList<Pipe> to SerialPipes
// “data” is a String
CharSequence2TokenSequence // tokenize with regexp
TokenSequenceLowercase // modify each token’s text
TokenSequenceRemoveStopwords // drop some tokens
TokenSequence2FeatureSequence // convert token Strings to ints
FeatureSequence2FeatureVector // lose order, count duplicates
![Page 30: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/30.jpg)
Instance Pipelines
cc.mallet.pipe, cc.mallet.types
• A small number of pipes modify the “target” field
• There are now two alphabets: data and label
// “target” is a String
Target2Label // convert String to int
// “target” is now a Label
Alphabet > LabelAlphabet (LabelAlphabet is a subclass of Alphabet)
![Page 31: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/31.jpg)
Label objects
cc.mallet.types
• Weights on a fixed set of classes
• For training data, weight for correct label is 1.0, all others 0.0
implements Labeling
int getBestIndex()
Label getBestLabel()
You cannot create a Label; they are only produced by LabelAlphabet
![Page 32: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/32.jpg)
InstanceLists
cc.mallet.types
• A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet
InstanceList instances = new InstanceList(pipe);
instances.addThruPipe(iterator);
![Page 33: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/33.jpg)
Putting it all together
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Target2Label());
pipeList.add(new CharSequence2TokenSequence());
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());
InstanceList instances =
new InstanceList(new SerialPipes(pipeList));
instances.addThruPipe(new FileIterator(. . .));
![Page 34: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/34.jpg)
Persistent Storage
java.io
• Most MALLET classes use Java serialization to store models and data
ObjectOutputStream oos = new ObjectOutputStream(…);
oos.writeObject(instances);
oos.close();
Pipes, data objects, labelings, etc. all need to implement Serializable.
Be sure to include custom classes in classpath, or you get a StreamCorruptedException
![Page 35: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/35.jpg)
Review
• What are the four main fields in an Instance?
• What are two ways to generate Instances?
• How do we modify the value of Instance fields?
• Name some classes that appear in the “data” field.
![Page 39: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/39.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 40: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/40.jpg)
Classifier objects
cc.mallet.classify
• Classifiers map from instances to distributions over a fixed set of classes
• MaxEnt, Naïve Bayes, Decision Trees…
[Figure: given data (“watery”), which class is best? Candidate classes NN, JJ, PRP, VB, CC, with one marked “(this one!)”]
![Page 41: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/41.jpg)
Classifier objects
cc.mallet.classify
// classify() returns a Classification; its Labeling holds the class distribution
Labeling labeling = classifier.classify(instance).getLabeling();
Label l = labeling.getBestLabel();
System.out.print(instance + "\t");
System.out.println(l);
![Page 42: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/42.jpg)
Training Classifier objects
cc.mallet.classify
• Each type of classifier has one or more ClassifierTrainer classes
ClassifierTrainer trainer = new MaxEntTrainer();
Classifier classifier = trainer.train(instances);
![Page 43: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/43.jpg)
Training Classifier objects
cc.mallet.optimize
• Some classifiers require numerical optimization of an objective function.
log P(Labels | Data) =
log f(label1, data1, w) +
log f(label2, data2, w) +
log f(label3, data3, w) +
…
Maximize w.r.t. w!
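As a concrete instance of this objective, the sketch below computes the sum of log f(label, data, w) terms under the assumption that f is a softmax over per-class linear scores, which is the usual form for MaxEnt / logistic regression. The class `LogLikelihoodSketch` and the layout `w[class][feature]` are choices made for this illustration, not MALLET's internal representation.

```java
// Hypothetical sketch: label log-likelihood of a softmax (MaxEnt-style) model.
class LogLikelihoodSketch {
    // log f(label, x, w), where w[k] is the weight vector of class k.
    static double logProb(int label, double[] x, double[][] w) {
        double[] scores = new double[w.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < w.length; k++) {
            for (int j = 0; j < x.length; j++) scores[k] += w[k][j] * x[j];
            max = Math.max(max, scores[k]);
        }
        // Subtract the max score before exponentiating for numerical stability.
        double sumExp = 0.0;
        for (double s : scores) sumExp += Math.exp(s - max);
        return scores[label] - max - Math.log(sumExp);
    }

    // log P(Labels | Data) = sum over instances of log f(label_i, data_i, w).
    static double logLikelihood(int[] labels, double[][] data, double[][] w) {
        double total = 0.0;
        for (int i = 0; i < labels.length; i++) {
            total += logProb(labels[i], data[i], w);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] labels = {0, 1};
        double[][] data = {{1.0, 0.0}, {0.0, 1.0}};
        double[][] zero = {{0.0, 0.0}, {0.0, 0.0}};
        double[][] good = {{1.0, 0.0}, {0.0, 1.0}};
        System.out.println(logLikelihood(labels, data, zero)); // 2 * log(0.5)
        System.out.println(logLikelihood(labels, data, good)); // higher (closer to 0)
    }
}
```

Training then amounts to searching for the w that makes this value as large as possible.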
![Page 44: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/44.jpg)
Parameters w
• Association between feature, class label
• How many parameters for K classes and N features?
Feature  Class  Weight
action  NN  0.13
action  VB  -0.1
action  JJ  -0.21
SUFF-tion  NN  1.3
SUFF-tion  VB  -2.1
SUFF-tion  JJ  -1.7
SUFF-on  NN  0.01
SUFF-on  VB  -0.02
…
![Page 45: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/45.jpg)
Training Classifier objects
cc.mallet.optimize
interface Optimizer // Limited-memory BFGS, Conjugate gradient…
boolean optimize()
interface Optimizable // Specific objective functions
interface ByValue
interface ByValueGradient
![Page 46: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/46.jpg)
Training Classifier objects
cc.mallet.classify
MaxEntOptimizableByLabelLikelihood
// For Optimizable interface
double[] getParameters()
void setParameters(double[] parameters)
…
// Log likelihood and its first derivative
double getValue()
void getValueGradient(double[] buffer)
![Page 47: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/47.jpg)
Evaluation of Classifiers
cc.mallet.types
• Create random test/train splits
InstanceList[] instanceLists =
instances.split(new Randoms(),
new double[] {0.9, 0.1, 0.0});
// 90% training, 10% testing, 0% validation
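The idea behind a proportional random split can be sketched without MALLET: shuffle, then cut at cumulative proportions. `SplitSketch` is a hypothetical helper, and its rounding rule (round each cumulative boundary to the nearest index) is an assumption of this sketch, not necessarily what `InstanceList.split` does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: shuffle a list and cut it into parts by proportion.
class SplitSketch {
    static <T> List<List<T>> split(List<T> items, Random random, double[] proportions) {
        List<T> shuffled = new ArrayList<>(items); // leave the input untouched
        Collections.shuffle(shuffled, random);
        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (double proportion : proportions) {
            cumulative += proportion;
            // Boundary of this part, clamped to [start, size].
            int end = (int) Math.min(shuffled.size(),
                                     Math.round(cumulative * shuffled.size()));
            end = Math.max(end, start);
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 100; i++) items.add(i);
        List<List<Integer>> parts =
            split(items, new Random(1), new double[] {0.9, 0.1, 0.0});
        // Sizes of the training, testing, and validation parts.
        System.out.println(parts.get(0).size() + " "
            + parts.get(1).size() + " " + parts.get(2).size()); // 90 10 0
    }
}
```

Passing a seeded `Random` makes the split reproducible across runs.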
![Page 48: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/48.jpg)
Evaluation of Classifiers
cc.mallet.classify
• The Trial class stores the results of classifications on an InstanceList (testing or training)
Trial(Classifier c, InstanceList list)
double getAccuracy()
double getAverageRank()
double getF1(int/Label/Object)
double getPrecision(…)
double getRecall(…)
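For reference, the quantities that these methods report can be computed from gold and predicted label indices as in the sketch below. `TrialSketch` is a hypothetical class, not MALLET's `Trial`, and returning 0.0 when a denominator is zero is a convention chosen here.

```java
// Hypothetical sketch of the metrics, computed from gold and predicted labels.
class TrialSketch {
    // Fraction of positions where the prediction equals the gold label.
    static double accuracy(int[] gold, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == predicted[i]) correct++;
        }
        return gold.length == 0 ? 0.0 : (double) correct / gold.length;
    }

    // Of the instances predicted as `label`, the fraction that really are `label`.
    static double precision(int[] gold, int[] predicted, int label) {
        int predictedCount = 0, truePositives = 0;
        for (int i = 0; i < gold.length; i++) {
            if (predicted[i] == label) {
                predictedCount++;
                if (gold[i] == label) truePositives++;
            }
        }
        return predictedCount == 0 ? 0.0 : (double) truePositives / predictedCount;
    }

    // Of the instances that really are `label`, the fraction predicted as `label`.
    static double recall(int[] gold, int[] predicted, int label) {
        int goldCount = 0, truePositives = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == label) {
                goldCount++;
                if (predicted[i] == label) truePositives++;
            }
        }
        return goldCount == 0 ? 0.0 : (double) truePositives / goldCount;
    }

    // Harmonic mean of precision and recall.
    static double f1(int[] gold, int[] predicted, int label) {
        double p = precision(gold, predicted, label);
        double r = recall(gold, predicted, label);
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int[] gold = {0, 0, 1, 1, 1};
        int[] predicted = {0, 1, 1, 1, 0};
        System.out.println(accuracy(gold, predicted));     // 0.6
        System.out.println(precision(gold, predicted, 1)); // 2 of 3 predictions correct
        System.out.println(recall(gold, predicted, 1));    // 2 of 3 gold instances found
    }
}
```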
![Page 49: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/49.jpg)
Review
• I have invented a new classifier: David regression.
– What class should I implement to classify instances?
– What class should I implement to train a David regression classifier?
– I want to train using ByValueGradient. What mathematical functions do I need to code up, and what class should I put them in?
– How would I check whether my new classifier works better than Naïve Bayes?
![Page 53: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/53.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 54: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/54.jpg)
Sequence Tagging
• Data occurs in sequences
• Categorical labels for each position
• Labels are correlated
DET NN VBS VBG
the dog likes running
![Page 55: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/55.jpg)
Sequence Tagging
?? ?? ?? ??
the dog likes running
![Page 56: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/56.jpg)
Sequence Tagging
• Classification: n-way
• Sequence Tagging: n^T-way
[Figure: a column of candidate tags (NN, JJ, PRP, VB, CC) at each position of “or red dogs on blue trees”]
![Page 57: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/57.jpg)
Avoiding Exponential Blowup
• Markov property
• Dynamic programming
[Figure: portrait of Andrei Markov; tag sequence DET JJ NN VB, annotated “This one, given this one, is independent of these”]
![Page 59: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/59.jpg)
Avoiding Exponential Blowup
• Markov property
• Dynamic programming
[Figure, three frames: the lattice of candidate tags (NN, JJ, PRP, VB, CC) over “or red dogs on blue trees” shrinks to “red dogs on blue trees” and then to “dogs on blue trees”]
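The standard way to combine the Markov property with dynamic programming is the Viterbi algorithm, which finds the best of the n^T label sequences in time proportional to T·n². The sketch below is a generic first-order version in plain Java; `ViterbiSketch` and its argument layout are assumptions made for illustration and are not MALLET's transducer code. It expects at least one position.

```java
import java.util.Arrays;

// Hypothetical sketch of first-order Viterbi decoding over log scores.
class ViterbiSketch {
    // logStart[s]:      log score of starting in state s
    // logTrans[p][s]:   log score of moving from state p to state s
    // logEmit[t][s]:    log score of state s given the observation at position t
    static int[] bestPath(double[] logStart, double[][] logTrans, double[][] logEmit) {
        int length = logEmit.length;
        int states = logStart.length;
        double[][] best = new double[length][states]; // best score ending in s at t
        int[][] back = new int[length][states];       // predecessor achieving it

        for (int s = 0; s < states; s++) best[0][s] = logStart[s] + logEmit[0][s];

        for (int t = 1; t < length; t++) {
            for (int s = 0; s < states; s++) {
                best[t][s] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < states; p++) {
                    // Markov property: only the previous state matters.
                    double score = best[t - 1][p] + logTrans[p][s] + logEmit[t][s];
                    if (score > best[t][s]) {
                        best[t][s] = score;
                        back[t][s] = p;
                    }
                }
            }
        }

        // Pick the best final state, then follow the back pointers.
        int last = 0;
        for (int s = 1; s < states; s++) {
            if (best[length - 1][s] > best[length - 1][last]) last = s;
        }
        int[] path = new int[length];
        path[length - 1] = last;
        for (int t = length - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        double[] start = {Math.log(0.5), Math.log(0.5)};
        double[][] trans = {{Math.log(0.9), Math.log(0.1)},
                            {Math.log(0.1), Math.log(0.9)}};
        double[][] emit = {{Math.log(0.9), Math.log(0.1)},
                           {Math.log(0.4), Math.log(0.6)},
                           {Math.log(0.9), Math.log(0.1)}};
        // Position 1 alone prefers state 1, but the transitions favor staying put.
        System.out.println(Arrays.toString(bestPath(start, trans, emit))); // [0, 0, 0]
    }
}
```

The example shows why labels are not chosen independently: the middle position would pick state 1 on its own, but the sequence as a whole scores higher with state 0 throughout.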
![Page 62: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/62.jpg)
Hidden Markov Models and Conditional Random Fields
• Hidden Markov Model: fully generative
P(Labels | Data) = P(Data, Labels) / P(Data)
• Conditional Random Field: conditional
P(Labels | Data)
![Page 63: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/63.jpg)
Hidden Markov Models and Conditional Random Fields
• Hidden Markov Model: simple (independent) output space
“NSF-funded”
• Conditional Random Field: arbitrarily complicated outputs
“NSF-funded” CAPITALIZED HYPHENATED ENDS-WITH-ed ENDS-WITH-d …
![Page 64: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/64.jpg)
Hidden Markov Models and Conditional Random Fields
• Hidden Markov Model: simple (independent) output space
FeatureSequence: int[]
• Conditional Random Field: arbitrarily complicated outputs
FeatureVectorSequence: FeatureVector[]
![Page 65: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/65.jpg)
Importing Data
• SimpleTagger format: one word per line, with instances delimited by a blank line
Call VB
me PPN
Ishmael NNP
. .

Some JJ
years NNS
…
![Page 66: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/66.jpg)
Importing Data
• SimpleTagger format: one word per line, with instances delimited by a blank line
Call SUFF-ll VB
me TWO_LETTERS PPN
Ishmael BIBLICAL_NAME NNP
. PUNCTUATION .

Some CAPITALIZED JJ
years TIME SUFF-s NNS
…
![Page 67: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/67.jpg)
Importing Data
cc.mallet.pipe, cc.mallet.pipe.iterator
LineGroupIterator
SimpleTaggerSentence2TokenSequence() // String to Tokens, handles labels
[Pipes that modify tokens]
TokenSequence2FeatureVectorSequence() // Token objects to FeatureVectors
![Page 69: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/69.jpg)
Importing Data
cc.mallet.pipe.tsf
// Ishmael
TokenTextCharSuffix("C2=", 2)
// Ishmael C2=el
RegexMatches("CAP", Pattern.compile("\\p{Lu}.*")) // must match entire string
// Ishmael C2=el CAP
LexiconMembership("NAME", new File("names"), false) // file has one name per line; last argument: ignore case?
// Ishmael C2=el CAP NAME
![Page 70: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/70.jpg)
Sliding window features
a red dog on a blue tree
Features added for “dog”: red@-1, a@-2, on@1, a@-2_&_red@-1
![Page 76: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/76.jpg)
Importing Data
cc.mallet.pipe.tsf
int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 }; // previous position
conjunctions[1] = new int[] { 1 }; // next position
conjunctions[2] = new int[] { -2, -1 }; // previous two
OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 on@1
![Page 77: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/77.jpg)
Importing Data
cc.mallet.pipe.tsf
int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 }; // previous position
conjunctions[1] = new int[] { 1 }; // next position
conjunctions[2] = new int[] { -2, -1 }; // previous two
TokenTextCharSuffix("C1=", 1)
OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 a@-2_&_C1=d@-1
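How such feature names arise can be sketched in plain Java. This is a simplified illustration, not MALLET's `OffsetConjunctions`: the hypothetical `WindowFeatureSketch` uses only the token text at each offset (the real pipe works on all features already attached to neighboring tokens), and it skips any conjunction that reaches outside the sentence, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: build "token@offset" features joined by "_&_".
class WindowFeatureSketch {
    static List<String> featuresAt(String[] tokens, int position, int[][] conjunctions) {
        List<String> features = new ArrayList<>();
        for (int[] offsets : conjunctions) {
            StringBuilder name = new StringBuilder();
            boolean inRange = true;
            for (int offset : offsets) {
                int j = position + offset;
                if (j < 0 || j >= tokens.length) { // window falls off the sentence
                    inRange = false;
                    break;
                }
                if (name.length() > 0) name.append("_&_");
                name.append(tokens[j]).append('@').append(offset);
            }
            if (inRange && name.length() > 0) features.add(name.toString());
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "red", "dog", "on", "a", "blue", "tree"};
        int[][] conjunctions = { { -1 }, { 1 }, { -2, -1 } };
        // Features for "dog" (position 2).
        System.out.println(featuresAt(tokens, 2, conjunctions));
        // [red@-1, on@1, a@-2_&_red@-1]
    }
}
```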
![Page 78: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/78.jpg)
Finite State Transducers
• Finite state machine over two alphabets (observed, hidden)
Built up step by step for “the dog …”:
– Start in DET: P(DET)
– Emit “the”: P(the | DET)
– Move to NN: P(NN | DET)
– Emit “dog”: P(dog | NN)
– Move to VBS: P(VBS | NN)
![Page 84: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/84.jpg)
How many parameters?
• Determines efficiency of training
• Too many leads to overfitting
Trick: Don’t allow certain transitions
P(VBS | DET) = 0
![Page 85: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/85.jpg)
How many parameters?
[Figure: three copies of the tag chain DET NN VBS over “the dog runs”]
![Page 86: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/86.jpg)
Finite State Transducers
cc.mallet.fst
abstract class Transducer
CRF
HMM
abstract class TransducerTrainer
CRFTrainerByLabelLikelihood
HMMTrainerByLikelihood
![Page 87: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/87.jpg)
Finite State Transducers
cc.mallet.fst
First order: one weight for every pair of labels and observations.
CRF crf = new CRF(pipe, null);
crf.addFullyConnectedStates();
// or
crf.addStatesForLabelsConnectedAsIn(instances);
DET NN VBS
the dog runs
![Page 88: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/88.jpg)
Finite State Transducers
cc.mallet.fst
“Three-quarter” order: one weight for every pair of labels and observations.
crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);
DET NN VBS
the dog runs
![Page 89: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/89.jpg)
Finite State Transducers
cc.mallet.fst
Second order: one weight for every triplet of labels and observations.
crf.addStatesForBiLabelsConnectedAsIn(instances);
DET NN VBS
the dog runs
![Page 90: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/90.jpg)
Finite State Transducers
cc.mallet.fst
“Half” order: equivalent to independent classifiers, except some transitions may be illegal.
crf.addStatesForHalfLabelsConnectedAsIn(instances);
DET NN VBS
the dog runs
![Page 91: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/91.jpg)
Training a transducer
cc.mallet.fst
CRF crf = new CRF(pipe, null);
crf.addStatesForLabelsConnectedAsIn(trainingInstances);
CRFTrainerByLabelLikelihood trainer =
new CRFTrainerByLabelLikelihood(crf);
trainer.train();
![Page 92: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/92.jpg)
Evaluating a transducer
cc.mallet.fst
CRFTrainerByLabelLikelihood trainer =
new CRFTrainerByLabelLikelihood(transducer);
TransducerEvaluator evaluator =
new TokenAccuracyEvaluator(testing, "testing");
trainer.addEvaluator(evaluator);
trainer.train();
![Page 93: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/93.jpg)
Applying a transducer
cc.mallet.fst
Sequence output = transducer.transduce(input);
for (int index = 0; index < input.size(); index++) {
System.out.print(input.get(index) + "/");
System.out.print(output.get(index) + " ");
}
![Page 94: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/94.jpg)
Review
• How do you add new features to TokenSequences?
• What are three factors that affect the number of parameters in a model?
![Page 96: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/96.jpg)
Outline
• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling
![Page 97: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/97.jpg)
Topics: “Semantic Groups”
[Figure: a news article linked to two topics, “Sports” and “Negotiation”; words shown from the article: team, player, game, strike, deadline, union]
![Page 101: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/101.jpg)
Series Yankees Sox Red World League game Boston team
games baseball Mets Game series won Clemens Braves
Yankee teams
![Page 102: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/102.jpg)
players League owners league baseball union commissioner
Baseball Association labor Commissioner Football major
teams Selig agreement strike team bargaining
![Page 103: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/103.jpg)
Training a Topic Model
cc.mallet.topics
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
![Page 104: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/104.jpg)
Evaluating a Topic Model
cc.mallet.topics
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
MarginalProbEstimator evaluator = lda.getProbEstimator();
double logLikelihood =
evaluator.evaluateLeftToRight(testing, 10, false, null);
![Page 105: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/105.jpg)
Inferring topics for new documents
cc.mallet.topics
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();
TopicInferencer inferencer = lda.getInferencer();
double[] topicProbs =
inferencer.getSampledDistribution(instance, 100, 10, 10);
![Page 106: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/106.jpg)
More than words…
• Text collections mix free text and structured data
David Mimno
Andrew McCallum
UAI
2008
“Topic models conditioned on arbitrary features using Dirichlet-multinomial regression. …”
![Page 108: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/108.jpg)
Dirichlet-multinomial Regression (DMR)
The corpus specifies a vector of real-valued features (x) for each document, of length F.
Each topic has an F-length vector of parameters.
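As described in the DMR paper cited on the previous slide, the two vectors combine into a document-specific Dirichlet prior over topics. The symbols λ, α, and θ below come from that paper rather than from the slide, and T (the number of topics) is introduced here for illustration:

```latex
\alpha_{d,t} = \exp\left(\mathbf{x}_d^{\top} \boldsymbol{\lambda}_t\right),
\qquad
\theta_d \sim \mathrm{Dirichlet}\left(\alpha_{d,1}, \ldots, \alpha_{d,T}\right)
```

A positive parameter for a feature raises the prior weight of that topic in documents that have the feature; a negative parameter lowers it.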
![Page 109: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/109.jpg)
Topic parameters for feature “published in JMLR”
user, users, user interface, interactive, interface   -1.44
web, web pages, web page, world wide web, web sites   -1.36
retrieval, information retrieval, query, query expansion   -1.23
strategies, strategy, adaptation, adaptive, driven   -1.21
agent, agents, multiagent, autonomous agents   -1.12
nearest neighbor, boosting, nearest neighbors, adaboost   1.37
blind source separation, source separation, separation, channel   1.40
reinforcement learning, learning, reinforcement   1.41
bounds, vc dimension, bound, upper bound, lower bounds   1.74
kernel, kernels, rational kernels, string kernels, fisher kernel   2.27
![Page 110: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/110.jpg)
Feature parameters for RL topic
<default>   -3.76
COLING   -1.64
IEEE Trans. PAMI   -1.54
CVPR   -1.47
ACL   -1.38
Machine Learning Journal   2.19
ECML   2.45
Kenji Doya   2.56
ICML   2.88
Sridhar Mahadevan   2.99
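One way to read these numbers is to turn them into the prior weight the RL topic would receive in a particular document. The sketch below assumes the log-linear form from the DMR paper, where the weight is the exponential of the summed parameters of the features present, and assumes <default> acts as an intercept that every document has. The class name, method name, and feature ordering are hypothetical; the three parameter values are taken from the list above.

```java
final class DmrPriorSketch {
    /** For every topic t, computes alpha[t] = exp(sum over f of x[f] * lambda[t][f]). */
    static double[] documentPrior(double[] x, double[][] lambda) {
        double[] alpha = new double[lambda.length];
        for (int t = 0; t < lambda.length; t++) {
            if (lambda[t].length != x.length) {
                throw new IllegalArgumentException("feature vector length mismatch");
            }
            double dot = 0.0;
            for (int f = 0; f < x.length; f++) {
                dot += x[f] * lambda[t][f];
            }
            alpha[t] = Math.exp(dot);
        }
        return alpha;
    }

    public static void main(String[] args) {
        // Assumed feature order: <default>, ICML, COLING (one topic: RL).
        double[][] lambda = { { -3.76, 2.88, -1.64 } };
        double[] icmlPaper = { 1.0, 1.0, 0.0 };   // default + published in ICML
        double[] colingPaper = { 1.0, 0.0, 1.0 }; // default + published in COLING
        System.out.println("ICML paper:   " + documentPrior(icmlPaper, lambda)[0]);
        System.out.println("COLING paper: " + documentPrior(colingPaper, lambda)[0]);
    }
}
```

Under these assumptions the ICML paper gets a weight of exp(-0.88), roughly 0.41, while the COLING paper gets exp(-5.40), roughly 0.0045, so the RL topic is far more likely a priori in the ICML paper.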
![Page 111: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/111.jpg)
Topic parameters for feature “published in UAI”
nearest neighbor, boosting, nearest neighbors, adaboost   -1.50
descriptions, description, top, bottom, top bottom   -1.50
workshop report, invited talk, international conference, report   -1.37
digital libraries, digital library, digital, library   -1.36
shape, deformable, shapes, contour, active contour   -1.29
reasoning, logic, default reasoning, nonmonotonic reasoning   2.11
uncertainty, symbolic, sketch, primal sketch, uncertain, …
probability, probabilities, probability distributions   2.25
qualitative, reasoning, qualitative reasoning, …
bayesian networks, bayesian network, belief networks   2.88
![Page 112: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/112.jpg)
Feature parameters for Bayes nets topic
<default>   -3.36
ICRA   -2.24
Neural Networks   -1.50
COLING   -1.38
Probabilistic Semantics for Nonmonotonic Reasoning (Pearl, KR, 1989)   -1.16
Loopy Belief Propagation for Approximate Inference (Murphy, Weiss, and Jordan, UAI, 1999)   2.04
Philippe Smets   2.15
Ashraf M. Abdelbar   2.23
Mary-Anne Williams   2.41
UAI   2.88
![Page 113: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/113.jpg)
Dirichlet-multinomial Regression
• Arbitrary observed features of documents
• Target contains FeatureVector
DMRTopicModel dmr = new DMRTopicModel(numTopics);
dmr.addInstances(training);
dmr.estimate();
dmr.writeParameters(new File("dmr.parameters"));
![Page 114: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/114.jpg)
Polylingual Topic Modeling
• Topics exist in more languages than you could possibly learn
• Topically comparable documents are much easier to get than translation sets
• Translation dictionaries
– cover pairs, not sets of languages
– miss technical vocabulary
– aren’t available for low-resource languages
![Page 115: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/115.jpg)
Topics from European Parliament Proceedings
![Page 116: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/116.jpg)
Topics from European Parliament Proceedings
![Page 117: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/117.jpg)
Topics from Wikipedia
![Page 118: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/118.jpg)
Aligned instance lists
dog…   chien…   hund…
cat…   chat…
pig…            schwein…
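The slide suggests that each language gets its own list, that position i in every list refers to the same underlying document, and that a document may be missing in some languages. A plain-Java sketch of building such aligned lists follows. It assumes that missing documents are represented by an empty placeholder so that all lists keep the same length; the class name, method name, and document ids are hypothetical, and strings stand in for MALLET instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class AlignedListsSketch {
    /**
     * Builds one list per language. Every list has ids.size() entries, and
     * entry i of each list belongs to document ids.get(i). A document that
     * is missing in a language is stored as an empty string.
     */
    static List<List<String>> align(List<String> ids, List<Map<String, String>> docsByLanguage) {
        List<List<String>> aligned = new ArrayList<>();
        for (Map<String, String> docs : docsByLanguage) {
            List<String> column = new ArrayList<>();
            for (String id : ids) {
                column.add(docs.getOrDefault(id, ""));
            }
            aligned.add(column);
        }
        return aligned;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("d1", "d2", "d3"); // hypothetical document ids
        Map<String, String> english = Map.of("d1", "dog…", "d2", "cat…", "d3", "pig…");
        Map<String, String> french = Map.of("d1", "chien…", "d2", "chat…");
        Map<String, String> german = Map.of("d1", "hund…", "d3", "schwein…");
        for (List<String> column : align(ids, List.of(english, french, german))) {
            System.out.println(column);
        }
    }
}
```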
![Page 119: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/119.jpg)
Polylingual Topics
InstanceList[] training = new InstanceList[] { english, german, arabic, mahican };
PolylingualTopicModel pltm = new PolylingualTopicModel(numTopics);
pltm.addInstances(training);
![Page 120: Mallet Tutorial](https://reader033.vdocuments.us/reader033/viewer/2022052218/544b129caf79599c438b4d6d/html5/thumbnails/120.jpg)
MALLET hands-on tutorial
http://mallet.cs.umass.edu/mallet-handson.tar.gz