CMUSphinx and pocketSphinx
Windows install
• Make a subdirectory named CMUSphinx
  – Mine is d:\Stephans\CMUSphinx
• From http://cmusphinx.sourceforge.net/wiki/download/, download snapshots of pocketsphinx, sphinxbase, and sphinxtrain to your CMUSphinx directory
  – I have made Windows binaries. They are available from the class web page.
• If you get the binaries, you still need the full sphinxtrain file as well (so you will need to download two versions of sphinxtrain)
  – First, get and decompress the complete version
  – Second, get the executables. Put them in SphinxTrain\bin\Release (you will need to make this directory)
  – This way the directory and file structure is the same as if you had compiled the files
• Put the sphinxbase binaries in CMUSphinx/sphinxbase/bin/Release
• Put the pocketsphinx binaries in CMUSphinx/pocketsphinx/bin/Release
• To run on Android, you need the full version of pocketsphinx, but it only compiles on Linux. We will do this later
• Building from source, instead of using the prebuilt Windows binaries, requires MS Visual Studio 2010 (I used Visual Studio 2010 Ultimate)
  – If you are a student, you can get it for free from
    » http://e5.onthehub.com/WebStore/ProductsByMajorVersionList.aspx?ws=29950cc3-3670-e011-971f-0030487d8897&vsro=8&JSEnabled=1
    » Or find the MSDNAA link at https://www.eecis.udel.edu/wiki/ececis-docs/index.php/FAQ/Applications
  – You will also need an eecis account. You can sign up for one.
• The snapshots include .sln Visual Studio 2010 project files (earlier versions of Visual Studio will not work)
  – Open Visual Studio. File->Open->Project/Solution. Navigate to and select the .sln file. To build: Build->Build Solution
  – First download and build sphinxbase
    » Before building, switch to Release: select Build -> Configuration Manager and, under "Active solution configuration:", change from Debug to Release
  – Then build pocketsphinx and sphinxtrain
• You need Perl and Python
  – For Perl, get ActiveState's ActivePerl: http://www.activestate.com/activeperl/downloads
  – For Python, get v2.7.X: http://www.python.org/getit/
• Once Python is installed, add its directory to your path
• Add the path to the sphinxtrain binaries
– If downloaded binaries, then add path to where
Running pocketsphinx
• Note the audio file CMUSphinx\pocketsphinx\test\data\goforward.raw
• Open a terminal and
  – Change directory to d:\Stephans\CMUSphinx\pocketsphinx\bin\Release
  – pocketsphinx_batch.exe should be there, unless the compile failed
• Make a file ctlFile.txt containing the name of the file we will decode
  – goforward
• Make a file called argFile.txt with these contents (more about them later)
  – -hmm ../../model/hmm/en_US/hub4wsj_sc_8k
  – -lm ../../model/lm/en/turtle.DMP
  – -dict ../../model/lm/en/turtle.dic
• Copy
  – CMUSphinx/sphinxbase/bin/Release/sphinxbase.dll
  – to CMUSphinx/pocketsphinx/bin/Release
• Copy
  – CMUSphinx\pocketsphinx\test\data\goforward.raw
  – to CMUSphinx\pocketsphinx\bin\Release\goforward.raw
• Run
  – pocketsphinx_batch.exe -argfile argFile.txt -cepdir ../../test/data -ctl ctlFile.txt -cepext .raw -adcin true -hyp out.txt
• Note: the command line arguments must be in this order!
• Where
  – -argfile argFile.txt names the arguments file. These arguments are displayed on the screen when the program runs, so you can check that they match
  – -cepdir ../../test/data defines the path to the files to be processed
    » -cepdir must come before -ctl
  – -ctl ctlFile.txt defines the ctlFile, which contains the names of the files to process. These names cannot include the path or the extension
  – -cepext .raw defines the extension of the files listed in the ctlFile
  – -adcin true means that the files are audio files
  – -hyp out.txt defines the output file
  – More details on the parameters are at http://manpages.ubuntu.com/manpages/lucid/man1/pocketsphinx_batch.1.html
• After running, the output file contains
  – go forward ten meters (goforward -26532)
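The setup steps on this slide can also be scripted. The following is a minimal Python sketch, not part of the original slides: it writes ctlFile.txt and argFile.txt with the contents listed above and builds the command line in the stated order. The helper name prepare_batch_decode is hypothetical, and the sketch only returns the command rather than running the decoder.

```python
import os
import tempfile

def prepare_batch_decode(work_dir, utterance="goforward"):
    """Write ctlFile.txt and argFile.txt, then return the command as a list.

    Hypothetical helper for illustration; it does not run the decoder.
    """
    ctl_path = os.path.join(work_dir, "ctlFile.txt")
    arg_path = os.path.join(work_dir, "argFile.txt")

    # The control file lists utterance names without path or extension.
    with open(ctl_path, "w") as ctl:
        ctl.write(utterance + "\n")

    # The argument file holds the model settings shown on the slide.
    with open(arg_path, "w") as arg:
        arg.write("-hmm ../../model/hmm/en_US/hub4wsj_sc_8k\n")
        arg.write("-lm ../../model/lm/en/turtle.DMP\n")
        arg.write("-dict ../../model/lm/en/turtle.dic\n")

    # Same argument order as on the slide: -cepdir comes before -ctl.
    return [
        "pocketsphinx_batch.exe",
        "-argfile", "argFile.txt",
        "-cepdir", "../../test/data",
        "-ctl", "ctlFile.txt",
        "-cepext", ".raw",
        "-adcin", "true",
        "-hyp", "out.txt",
    ]

if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    print(" ".join(prepare_batch_decode(demo_dir)))
```

To decode for real, one would use pocketsphinx\bin\Release as work_dir and pass the returned list to subprocess.call from that directory.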
Make and decode a new audio file
• Open Windows Sound Recorder
• Record "go forward ten meters"
  – Save as myGoForward.wma
  – Sound Recorder saves as a .wma file
• Get a WMA-to-WAV converter
  – Save as c:\pocketsphinx\test\data\myGoForward.wav
  – I use the 4Musics Multiformat Converter. Other converters should work
• Change ctlFile.txt to
  – myGoForward
• In the terminal run
  – pocketsphinx_batch.exe -argfile argFile.txt -ctl ctlFile.txt -cepdir ./ -cepext .wav -adcin true -hyp out2.txt
• Check that out2.txt says go forward ten meters
Make your own acoustic model and language model
• We will go over what is going on later. But first, let's try the process.
  – Alternatively, you can read about what is going on first and then return to this section
• Download data
  – http://www.speech.cs.cmu.edu/databases/an4/index.html
  – Get the mswav version
  – Save it to your CMUSphinx directory
  – Decompress
models
• Three types of models are used
• Acoustic model
  – Used to model the sound of a phone
  – Typically, an HMM is used
  – Each phone has an HMM
  – Mapping from HMMs to phones
  – Since the acoustic model is an HMM, in CMU Sphinx the HMM is the same as the acoustic model
• Phonetic dictionary
  – Maps phones to words
  – In CMU Sphinx, .dic files are dictionary files
• Language model
  – Used to determine which sequences of words are allowed. For example, "he super run the sally" is not allowed in the language model
Set up config file
• From CMUSphinx\SphinxTrain\etc
  – Copy
    » feat.params
    » sphinx_train.cfg
  – To CMUSphinx\an4\etc
• sphinx_train.cfg is the main configuration file
• Open sphinx_train.cfg in an editor
  – Line 6: $CFG_DB_NAME = "an4";
  – Line 7: $CFG_BASE_DIR = "d:\\stephans\\CMUSphinx\\an4";
  – Line 8: $CFG_SPHINXTRAIN_DIR = "d:\\Stephans\\CMUSphinx\\SphinxTrain";
  – Line 11: $CFG_BIN_DIR = "d:\\Stephans\\CMUSphinx\\sphinxbase\\bin\\Release";
  – Line 13: $CFG_SCRIPT_DIR = "d:\\Stephans\\CMUSphinx\\SphinxTrain\\scripts";
  – Check out lines 19-21. These say where the wav files are and that we are using mswav, which is what we downloaded
  – Line 232: $DEC_CFG_DB_NAME = 'an4';
  – Line 233: $DEC_CFG_BASE_DIR = 'd:\\Stephans\\CMUSphinx\\an4';
  – Line 234 does not seem to matter
  – Line 239: $DEC_CFG_BIN_DIR = "d:\\Stephans\\CMUSphinx\\pocketsphinx\\bin\\Release";
  – Save sphinx_train.cfg
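The edits above are all of the form `$NAME = "value";`, so they can be applied by script instead of by hand. Below is a minimal Python sketch under that assumption. The function name set_cfg_value is hypothetical, and the sketch matches settings by name rather than by line number, since line numbers may differ between snapshots.

```python
import re

def set_cfg_value(cfg_text, name, value):
    """Return cfg_text with `$name = ...;` replaced by `$name = "value";`.

    Backslashes in value are doubled, as in the double-quoted settings above.
    Hypothetical helper for illustration only.
    """
    escaped = value.replace("\\", "\\\\")
    pattern = re.compile(r"^(\s*\$" + re.escape(name) + r"\s*=\s*).*?;", re.MULTILINE)
    # A function replacement keeps the backslashes in the new value literal.
    new_text, count = pattern.subn(lambda m: m.group(1) + '"' + escaped + '";', cfg_text)
    if count == 0:
        raise KeyError("setting not found: " + name)
    return new_text

if __name__ == "__main__":
    # A two-line stand-in for sphinx_train.cfg (placeholder defaults).
    sample = '$CFG_DB_NAME = "db";\n$CFG_BASE_DIR = "/some/default";\n'
    sample = set_cfg_value(sample, "CFG_DB_NAME", "an4")
    sample = set_cfg_value(sample, "CFG_BASE_DIR", "d:\\Stephans\\CMUSphinx\\an4")
    print(sample)
```

Matching on the full variable name keeps $CFG_DB_NAME and $DEC_CFG_DB_NAME separate, which matters because both appear in the file.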
Other changes
• Copy sphinxbase.dll from
  – CMUSphinx\sphinxbase\bin\Release
  – to CMUSphinx\SphinxTrain\bin\Release
• In the CMUSphinx\an4\etc directory, copy or rename
  – an4.ug.lm.DMP to an4.lm.DMP
• Open CMUSphinx\SphinxTrain\scripts\sphinxtrain.in in an editor
  – Line 3: sphinxpath="d:\\Stephans\\CMUSphinx"
  – /lib/sphinxtrain appears in many places. Change each one to /SphinxTrain
• Copy files
  – From CMUSphinx\pocketsphinx\bin\Release, copy pocketsphinx_batch.exe and pocketsphinx.dll to CMUSphinx\SphinxTrain\bin\Release
  – Try skipping this and setting line 243 of the .cfg
check
• Open a cmd prompt
  – Type path and make sure that the directories for these are there
    » python
    » SphinxTrain\bin\Release
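The same check can be done from Python. This is a small sketch, assuming only that the PATH environment variable holds the search path; the function name missing_from_path and the C:\Python27 location are hypothetical.

```python
import os

def missing_from_path(required_dirs, path_value=None):
    """Return the entries of required_dirs that are not listed in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def normalize(entry):
        # Ignore quoting, case (on Windows), and trailing separators.
        return os.path.normcase(os.path.normpath(entry.strip().strip('"')))

    listed = set(normalize(p) for p in path_value.split(os.pathsep) if p.strip())
    return [d for d in required_dirs if normalize(d) not in listed]

if __name__ == "__main__":
    required = [
        "C:\\Python27",  # hypothetical Python install directory
        "d:\\Stephans\\CMUSphinx\\SphinxTrain\\bin\\Release",
    ]
    for directory in missing_from_path(required):
        print("not in PATH: " + directory)
```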
Run training
• Change to the CMUSphinx\an4 directory
• Run
  – python ..\SphinxTrain\scripts\sphinxtrain.in run
• This will take a while (15 minutes)
  – The test results are a sentence error rate of 45% (nearly half of the sentences had at least one error) and a word error rate of 15.7% (15.7% of the words were incorrectly estimated)
• This can fail because Python was not installed or the path to Python was not set
• Or because the path to SphinxTrain was not set
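To make the two error rates concrete, here is a short Python sketch. It assumes the usual edit-distance definition of word error rate (substitutions, insertions, and deletions divided by the number of reference words), which the slides do not spell out. The function names and the sample sentences are illustrative, not output from the AN4 run.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        current = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]

def error_rates(pairs):
    """Return (word error rate, sentence error rate) for (ref, hyp) pairs."""
    total_words = sum(len(ref.split()) for ref, _ in pairs)
    total_errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    wrong_sentences = sum(1 for ref, hyp in pairs if ref.split() != hyp.split())
    return float(total_errors) / total_words, float(wrong_sentences) / len(pairs)

if __name__ == "__main__":
    pairs = [
        ("go forward ten meters", "go forward ten meters"),
        ("go forward ten meters", "go forward two meters"),
    ]
    wer, ser = error_rates(pairs)
    print("WER %.3f, SER %.3f" % (wer, ser))
```

One wrong word makes the whole sentence count as an error, which is why the sentence error rate is much higher than the word error rate.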
Check log
• Open an4.html
• Check for errors
  – MODULE: 30 Training Context Dependent models
    » A few errors of type: "Failed to align audio to trancript: final state of the search is not reached" are acceptable
  – MODULE: 50 Training Context dependent models
    » A few errors of type: "Failed to align audio to trancript: final state of the search is not reached" are acceptable
• At the very end is the test decoding
  – Open the log file
  – Note the parameters for running decoding, specifically where the hmm, dic, and lm files are
Test with your own voice sample
• Record a sample
• Convert to .wav
• Run pocketsphinx_batch
• pocketsphinx_batch
  – -hmm d:\Stephans\CMUSphinx\an4/model_parameters/an4.cd_cont_200
  – -lw 10 -feat 1s_c_d_dd
  – -beam 1e-80 -wbeam 1e-40
  – -dict d:\Stephans\CMUSphinx\an4/etc/an4.dic
  – -lm d:\Stephans\CMUSphinx\an4/etc/an4.lm.DMP
  – -wip 0.2 -ctl d:\Stephans\CMUSphinx\an4/myTest/ctlFile.txt
  – -ctloffset 0
  – -ctlcount 130
  – -cepdir d:\Stephans\CMUSphinx\an4/myTest -cepext .wav
  – -hyp d:\Stephans\CMUSphinx\an4/myTest/results.txt
  – -agc none
  – -varnorm no
  – -cmn current
  – -adcin true
test
background
• To a first approximation, words are sequences of sounds, where each sound is a phone.
• However, the exact pronunciation of a phone depends on the phones before and after.
• Diphones are two phones. Diphones are less impacted by the phones that come before or after.
• Triphones and quinphones are possible. The general name is senone
• While there are many phones, not every combination of phones is a word. Thus, we should not simply recognize phones, but recognize words as sequences of phones
• Besides phones, there are fillers (e.g., breath, "um"). An utterance is a sequence of words and fillers
• Utterances are separated by a pause
Running with other models
• Many acoustic and language models are available at
  – http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/
Building Your Own Acoustic Model and Language Model
• Building your own models is time consuming
• Acoustic models require
  – Lots of recordings of people saying words and sentences
    » Not that difficult to do
  – Accurate transcription of the recordings
    » Time consuming
  – There are many acoustic models available online
  – It is possible to take an existing model and quickly adapt it to a particular speaker
• Language model
  – Different systems need different language models
    » A voice control for your TV needs to recognize only a few words like "volume up," "change channel," …
    » A voice driven email composer needs to recognize a different set of words
  – The performance of the recognizer is improved if your language model only considers the relevant words.
  – You can take an existing language model and trim it to what you need, or make one from scratch
    » Many models are available from http://www.ldc.upenn.edu/Catalog/index.jsp
example
• To explore acoustic and language models, get the AN4 database
  – http://www.speech.cs.cmu.edu/databases/an4/index.html
  – Save it to your CMUSphinx directory
  – Decompress
• Also, explore the PDA dataset
  – http://www.speech.cs.cmu.edu/databases/pda/index.html
• This data is made up of letters and numbers, e.g., "A", "B", "19"
• We can test this system by saying things like "A", "B", etc.
Acoustic model
• The acoustic model is used to translate recorded sounds into labeled phones,
  – e.g., the recorded sound in file asc.wav is "AH"
• Roughly speaking, an acoustic model takes the sound sample as input and gives the quality of fit as output
  – asc.wav -> AH-Model -> -12
  – asc.wav -> AY-Model -> -14
  – …
  – AH-Model gives a better fit of the recorded sound
• Making an acoustic model is called training
  – Inputs to training are audio files and transcriptions
• Challenge: usually the audio file has many phones, not just one
  – E.g., from the AN4 data set, an audio file contains a recording of the words "TWO SIX EIGHT FOUR FOUR ONE EIGHT"
    » CMUSphinx\an4\wav\an4_clstk\fash\cen7-fash-b.wav
  – E.g., from the PDA data set, an audio file might contain a recording of the words "MARGINS HISTORICALLY HAVE PEAKED BY MID YEAR HE SAYS"
    » CMUSphinx\PDA\PDAs\001\PDAs01_001_1.wav
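The fit scores on this slide (-12 for the AH model, -14 for the AY model) are compared by picking the highest, that is, the least negative, value. A tiny Python sketch of that comparison follows; the function name best_model is hypothetical, and a real decoder does much more than this single lookup.

```python
def best_model(scores):
    """Return the label whose model reports the highest (least negative) fit."""
    return max(scores, key=scores.get)

if __name__ == "__main__":
    # Scores from the slide for the sound in asc.wav.
    scores = {"AH": -12, "AY": -14}
    print(best_model(scores))
```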
Transcriptions
• Approach one: the recording from the PDA set is transcribed as: M AA R JH AX N Z SIL HH IX S T AO R IX K AX L IY SIL ...
• Two problems with approach one
  – If the word margins is in other files, we need to enter the pronunciation of the word twice
  – There are two ways that people pronounce historically
    » HH IX S T AO R IX K AX L IY
    » HH IX S T AO R IX K L IY (this one actually says historicly, which is incorrect)
• Two-stage transcriptions (results in many files)
  – Transcription file: gives the words spoken
    » This file contains one line for each file used in training
    » The line contains the text of the words spoken and the filename (without an extension such as .wav)
    » The AN4 dataset includes the file an4_train_transcription and it includes the line: <s> TWO SIX EIGHT FOUR FOUR ONE EIGHT </s> (cen7-fash-b)
    » The PDA dataset includes the file PDAs.train_all.sent and it includes the line: MARGINS HISTORICALLY HAVE PEAKED BY MID YEAR HE SAYS (PDAs01_041)
    » Hmm, this is missing the <s> and </s>. I think that the software requires <s> and </s>. To use the PDA data set, add <s> and </s>
  – Dictionary file
    » A mapping from words to phones (elementary spoken sounds)
    » Allows words to have multiple pronunciations
    » E.g., the AN4 dataset includes the file an4.dic and it includes the lines
      ELEVEN IH L EH V AH N
      ELEVEN(2) IY L EH V AH N
      E IY
  – By combining the transcript file and the dictionary file, the sounds in each recorded audio file can be determined
    » However, it is a bit tricky to determine which part of the audio file corresponds to which sound.
    » This is a major challenge facing training
    » Recall, the overall goal of training is to find models for each sound. But to make the training process easier for the users, we only provide recordings of words and sentences.
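The two-stage lookup described on this slide can be sketched in a few lines of Python. The dictionary lines are the an4.dic lines quoted above. The transcript line and its file id are hypothetical, built only from words that appear in those dictionary lines. The function names are illustrative, and the sketch always picks the first pronunciation, which a real trainer would not be limited to.

```python
import re

def parse_dictionary(lines):
    """Map each word to a list of pronunciations (each a list of phones)."""
    pronunciations = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # ELEVEN(2) is an alternate pronunciation of ELEVEN.
        word = re.sub(r"\(\d+\)$", "", fields[0])
        pronunciations.setdefault(word, []).append(fields[1:])
    return pronunciations

def parse_transcript_line(line):
    """Split '<s> WORDS </s> (file-id)' into (words, file_id)."""
    match = re.match(r"^(.*)\((\S+)\)\s*$", line.strip())
    if match is None:
        raise ValueError("no (file-id) at end of line: " + line)
    words = [w for w in match.group(1).split() if w not in ("<s>", "</s>")]
    return words, match.group(2)

def first_phone_sequence(words, pronunciations):
    """Concatenate the first listed pronunciation of each word."""
    phones = []
    for word in words:
        phones.extend(pronunciations[word][0])
    return phones

if __name__ == "__main__":
    dictionary = parse_dictionary([
        "ELEVEN IH L EH V AH N",
        "ELEVEN(2) IY L EH V AH N",
        "E IY",
    ])
    # Hypothetical transcript line, not taken from the AN4 files.
    words, file_id = parse_transcript_line("<s> E ELEVEN </s> (hypothetical-file-id)")
    print(file_id + ": " + " ".join(first_phone_sequence(words, dictionary)))
```

The output is a phone sequence for the whole file, with no information about where each phone starts or ends in the audio, which is the alignment challenge noted above.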
Training
• Files needed
  – your_db_train.fileids - List of files used for training
    » E.g., AN4 includes an4_train.fileids
    » Format: path/filename (without extension!)
    » The path is relative to where the SphinxTrain program is executed
    » E.g., the an4_train.fileids paths are relative to where the AN4 /etc directory is. So SphinxTrain needs to be run from this directory
  – your_db_train.transcription - Transcription for training (described on previous slide)
  – your_db.dic - Phonetic dictionary (described on previous slide)
  – your_db.filler - List of fillers and what they map to
    » Fillers are things like silence, breathing, "um" etc.
    » Fillers should also be used in the transcript, e.g., <s> TWO +UM+ SIX EIGHT FOUR FOUR ONE EIGHT </s> (fillers have a + sign before and after)
    » During training, models for fillers will be computed
    » Decoding is more complicated: fillers are allowed to be added, but there is some penalty, and the fillers are ignored when computing the probability of a sequence of words. E.g., the language model might tell us that "go to bed" is common and "go up bed" is uncommon. If the decoder detects "go um to bed" it translates it to "go to bed"
    » For some reason, fillers are not used in the an4 and PDA transcript files. <s>, </s>, and SIL are silence and are included. SMACK is listed in the PDA filler file, but not in the transcript
    » File format
      </s> SIL
      <s> SIL
      <sil> SIL
      ++INHALE++ +INHALE+
  – your_db.phone - Phoneset file
    » A list of all labels of phones used (sounds), including fillers
    » E.g., an4.phone: AA, AE, AH, …
    » Every phone label used in the dictionary must be in the .phone file, and so must the filler labels
• Must have sphinxtrain/bin/debug in path
• Must copy sphinxbase.dll to sphinxtrain/bin/debug or set path to
• Move the pocketsphinx exe and dll
• Edit sphinxtrain.in to remove /log and set prefix to path
• Must use Python 2.7
• Delete an4.html before running
  – This is a log file. It will not exist before the first run. If you run and find errors, you can check it. But make sure to delete it before running again so that you see only the new errors
• Change an4.ug.lm.DMP to an4.lm.DMP
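The rule that every phone used in the dictionary and every filler label must appear in the .phone file can be checked before training. The Python sketch below assumes the .phone file lists one label per line, which the slide does not state. The function name undeclared_phones is hypothetical, and the demo phone list is deliberately incomplete.

```python
def undeclared_phones(dictionary_lines, filler_lines, phone_lines):
    """Return phone labels used in the dictionary or filler file but absent
    from the phone list (assumed to hold one label per line)."""
    declared = set(line.strip() for line in phone_lines if line.strip())
    used = set()
    for line in list(dictionary_lines) + list(filler_lines):
        fields = line.split()
        used.update(fields[1:])  # everything after the word or filler name
    return sorted(used - declared)

if __name__ == "__main__":
    dictionary = ["ELEVEN IH L EH V AH N", "E IY"]
    fillers = ["</s> SIL", "<s> SIL", "<sil> SIL", "++INHALE++ +INHALE+"]
    # Deliberately incomplete phone list for the demonstration.
    phones = ["AH", "EH", "IH", "IY", "L", "N", "SIL", "V"]
    print(undeclared_phones(dictionary, fillers, phones))
```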
Language model
• Language models define which combinations of words are allowed.
  – And which combinations are more common or less common
• A language model defines
  – How often a word appears
    » Words: Go, stop, hi, bye
  – How often combinations of words appear
    » Combinations with 2 words: Go forward; go back; …
    » Note that the length of these sequences can be 2, 3, ..
    » The language model cannot specify all combinations of any length. So only combinations up to some length (e.g., 2 or 3) are specified
• .ARPA files specify the language model in a particular format
  – See http://msdn.microsoft.com/en-us/library/hh378460.aspx for some details
  – See next slide
• There is an online language model maker that takes sentences, counts the combinations of words, and makes an ARPA file
• If you make your own ARPA file,
  – you must sort it before using it
    » sphinx_lm_sort < unsorted.arpa > sorted.arpa
  – Then convert to lm
    » sphinx_lm_convert -i sorted.arpa -o sorted.lm.DMP
    » Note that sometimes files that end in .lm are in the ARPA format
  – The DMP can be used to decode
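The counting step that a language model maker performs can be illustrated with a short Python sketch. It only counts word combinations up to a chosen length, adding the <s> and </s> markers around each sentence. It does not compute the probabilities or weights that an ARPA file holds. The function name count_ngrams and the sample sentences are illustrative.

```python
from collections import Counter

def count_ngrams(sentences, max_order=2):
    """Count word sequences of length 1..max_order, with <s> and </s> added."""
    counts = dict((n, Counter()) for n in range(1, max_order + 1))
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, max_order + 1):
            for start in range(len(words) - n + 1):
                counts[n][tuple(words[start:start + n])] += 1
    return counts

if __name__ == "__main__":
    counts = count_ngrams(["go forward", "go back"])
    for order in sorted(counts):
        for ngram, count in sorted(counts[order].items()):
            print(order, " ".join(ngram), count)
```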
ARPA format
  <header - information ignored by applications>
  \data\
  ngram 1=9
  ngram 2=11
  ngram 3=3
  \1-grams:
  -0.8953 <unk> -0.7373
  -0.7404 </s> -0.6515
  -0.7861 <s> -0.1764
  -1.0414 When -0.4754
  -1.0414 will -0.1315
  -0.9622 the 0.0080
  -1.4393 Stock -0.3100
  -1.0414 Go -0.3852
  -0.9622 Up -0.1286
  \2-grams:
  -0.3626 <s> When -0.1736
  -1.2765 <s> the 0.0000
  -1.2765 <s> Up 0.0000
  -0.2359 When will 0.1011
  -1.0212 will </s> 0.0000
  -0.4191 will the 0.0000
  -1.1004 the </s> 0.0000
  -1.1004 the Go 0.0000
  -0.6232 Stock Go 0.0000
  -0.2359 Go Up 0.0587
  -0.4983 Up </s>
  \3-grams:
  -0.4260 <s> When will
  -0.6601 When will the
  -0.6601 Go Up </s>
  \end\
• \data\ specifies how many entries of each order there are
• The numbers are log10 of probabilities
• For the 3-gram entry
  – -1.2 go to bed -.1
  – The first number, -1.2, is log10 of the probability that the last word (bed) occurs given that the first two words have occurred
    » There might be other 3-grams like go to sleep, etc.
  – The second number is the probability that no words occur after this 3-gram
• For the 2-gram entry
  – -.2 go to -10.1
  – The first number is the log10 of the probability that to occurs after go
  – The second number is the probability that no words will come after go to
    » Not so likely
• For the 1-gram
  – -1.041 go -0.27
  – The first number is the log10 of the probability that go occurs
    » Go can occur by itself
  – The second number is not the log10 of a probability, but is log10 of a weight (it could be log10 of a probability, but does not have to be)
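A small parser makes the layout of an ARPA file easier to see. The Python sketch below is illustrative and does not show how Sphinx reads the file. It collects the declared counts from the \data\ section and, for each entry, the leading log10 probability, the words, and the optional trailing number (the ARPA format calls this a backoff weight). The sample is a trimmed set of entries from the example above with its counts adjusted to match, and the function name parse_arpa is hypothetical.

```python
def parse_arpa(text):
    """Parse ARPA text into (declared_counts, ngrams_by_order).

    ngrams_by_order[n] maps a tuple of n words to (log10_prob, weight),
    where weight is None when the entry has no trailing number.
    """
    declared, ngrams, order = {}, {}, None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])
            ngrams[order] = {}
        elif line.startswith("ngram ") and order is None:
            n, count = line[len("ngram "):].split("=")
            declared[int(n)] = int(count)
        elif order is not None:
            fields = line.split()
            words = tuple(fields[1:1 + order])
            weight = float(fields[1 + order]) if len(fields) > 1 + order else None
            ngrams[order][words] = (float(fields[0]), weight)
    return declared, ngrams

# Trimmed sample: a few entries from the slide, counts adjusted to match.
SAMPLE = "\n".join([
    "\\data\\",
    "ngram 1=3",
    "ngram 2=2",
    "\\1-grams:",
    "-0.7861 <s> -0.1764",
    "-1.0414 Go -0.3852",
    "-0.9622 Up -0.1286",
    "\\2-grams:",
    "-1.2765 <s> Up 0.0000",
    "-0.2359 Go Up 0.0587",
    "\\end\\",
])

if __name__ == "__main__":
    declared, ngrams = parse_arpa(SAMPLE)
    for n in sorted(declared):
        print("order %d: declared %d, found %d" % (n, declared[n], len(ngrams[n])))
    log_prob, _ = ngrams[2][("Go", "Up")]
    print("P(Up | Go) is about %.3f" % (10 ** log_prob))
```

Comparing the declared and found counts is a quick sanity check on a hand-edited ARPA file before sorting and converting it.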
Running pocketsphinx on Android
• I could only get this working on Linux.
  – Windows might be possible (I didn't try Mac)
• The instructions at http://cmusphinx.sourceforge.net/2011/05/building-pocketsphinx-on-android/ are almost correct
• Follow the instructions for getting and compiling sphinxbase and pocketsphinx
• Get PocketSphinxDemo.tar.gz
  – Import it into Eclipse
    » File->Import->Existing Projects into Workspace->(Next)
    » Select "Select archive file"
    » Browse and select PocketSphinxDemo.tar.gz
• In an editor, open eclipse/workspace/PocketSphinxDemo/jni/Android.mk
• In the second to last line
  – Change
    » LOCAL_STATIC_LIBRARIES := sphinxutil sphinxfe sphinxfeat sphinxlm pocketsphinx
  – To
    » LOCAL_STATIC_LIBRARIES := pocketsphinx sphinxlm sphinxfeat sphinxfe sphinxutil
• (back to the instructions from the web page)
• Build
  – Change directory to eclipse/workspace/PocketSphinxDemo/jni
  – Android/android-ndk-r7b/ndk-build -B
• Adjust Properties->Builders as described on the web page
  – I'm not sure how important this is.
• Swig makes an interface between Java and C++, but these files have already been downloaded.
• ndk is run from the command line
On phone
• (the directory should be /mnt/sdcard/Android/data/edu.cmu.pocketsphinx)
• adb shell
• mkdir /mnt/sdcard/Android/data/edu.cmu.pocketsphinx
• cd /mnt/sdcard/Android/data/edu.cmu.pocketsphinx
• Make the directory structure as shown on the web page
  – /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm
  – /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/en_US
  – /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/hub4wsj_sc_8k
    » Not sure if this is needed.
  – /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm
  – /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US
• cd to CMUSphinx/pocketsphinx/model/hmm/en_US/
  – Android/android-sdk/platform-tools/adb push ./hub4wsj_sc_8k /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/en_US/hub4wsj_sc_8k
• cd to CMUSphinx/pocketsphinx/model/lm
  – Android/android-sdk/platform-tools/adb push ./en_US /mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US/
In Eclipse
• In RecognizerTask.java, change the code to include the correct path
  – This path must match the path where the model files are located

    pocketsphinx.setLogfile("/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/pocketsphinx.log");
    Config c = new Config();
    /*
     * In 2.2 and above we can use getExternalFilesDir() or whatever it's called
     */
    c.setString("-hmm", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/en_US/hub4wsj_sc_8k");
    c.setString("-dict", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US/hub4.5000.dic");
    c.setString("-lm", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx/lm/en_US/hub4.5000.DMP");
    c.setString("-rawlogdir", "/mnt/sdcard/Android/data/edu.cmu.pocketsphinx"); // Only use it to store the audio

• Note that these lines must also be changed if you use different models
• Build, run and test
Windows install
• Requires the Android NDK
• Flex for Windows: http://gnuwin32.sourceforge.net/packages/flex.htm
• Bison for Windows: http://gnuwin32.sourceforge.net/packages/bison.htm
• Get CMUSphinx from here: ??
  – Note that this contains the
• Follow the directions from
  – http://cmusphinx.sourceforge.net/2011/05/building-pocketsphinx-on-android/
  – Or google: pocketSphinx android
• Or:
  – But the order of the libs at the end needs to be reversed
  – Only compiles on Linux, because it needs yacc
• Resources:
  – http://www.speech.cs.cmu.edu/sphinxman/
Voice activity detection
• VAD is used to detect if anyone is speaking