    NISTIR 4930

    DARPA TIMIT

    Acoustic-Phonetic Continuous Speech Corpus CD-ROM

    NIST Speech Disc 1-1.1

    John S. Garofolo

    Lori F. Lamel

    William M. Fisher

    Jonathan G. Fiscus

    David S. Pallett

    Nancy L. Dahlgren

    U.S. DEPARTMENT OF COMMERCE
    Technology Administration
    National Institute of Standards and Technology
    Computer Systems Laboratory
    Advanced Systems Division
    Gaithersburg, MD 20899

    CD-ROM Released October, 1990
    Documentation Published February, 1993

    U.S. DEPARTMENT OF COMMERCE
    Ronald H. Brown, Secretary

    NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
    John W. Lyons, Director

    Abstract

    The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence.

    This release of TIMIT contains several improvements over the Prototype CD-ROM released in December, 1988: (1) full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.

    The TIMIT CD-ROM has resulted from the joint efforts of several sites under sponsorship from the Defense Advanced Research Projects Agency - Information Science and Technology Office (DARPA-ISTO) [now the Software and Intelligent Systems Technology Office (SISTO)]. Text corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI), and Texas Instruments (TI). The speech was recorded at TI, transcribed at MIT, and the data has been verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).

    Certain commercial products are identified in this document in order to adequately specify procedures described. In no case does such identification imply recommendation or endorsement by the National Institute of Standards and Technology, the U.S. Department of Commerce or the United States Federal Government, nor does it imply that the material identified is necessarily the best for the purpose.

    Table of Contents

    1 Introduction

    2 CD-ROM Contents and File Structure
        2.1 Reading the CD-ROM
        2.2 CD-ROM Contents
        2.3 TIMIT Directory and File Structure
            2.3.1 Organization
            2.3.2 File Types
            2.3.3 On-line Documentation
        2.4 SPHERE Software (version 1.5)
        2.5 Convert Software

    3 The TIMIT Corpus
        3.1 Corpus Speaker Selection and Distribution
        3.2 Recording Conditions and Procedures
        3.3 Corpus Text Material
        3.4 Suggested Training/Test Subdivision
            3.4.1 Core Test Set
            3.4.2 Complete Test Set
            3.4.3 Training Set
            3.4.4 Distributional Properties of the Training and Test Subsets
        3.5 Transcriptions

    4 TIMIT Lexicon
        4.1 Format of the Lexicon
        4.2 Pronunciation Conventions
            4.2.1 Vowel Variability
            4.2.2 Stress Differences
            4.2.3 Syllabics
        4.3 Phonetic and Phonemic Symbol Codes
        4.4 Errata

    5 Transcription Protocols
        5.1 Reprint of a Publication Describing TIMIT Transcription Conventions
        5.2 Notes on Checking the Phonetic Transcriptions
            5.2.1 Acoustic-Phonetic Labels
            5.2.2 Boundaries
            5.2.3 Disclaimer
        5.3 Notes on Automatic Generation of Word Boundaries
            5.3.1 General Methodology
            5.3.2 Alignment Procedure
            5.3.3 Phonological Rule Post-Processing

    6 Reprints of Selected Articles

    7 References

    List of Tables

    2.1: Utterance-associated file types
    3.1: Dialect distribution of speakers
    3.2: TIMIT speech material
    3.3: Speakers in the core test set
    3.4: Dialect distribution of speakers in complete test set
    3.5: Dialect distribution of speakers in the training set
    3.6: Distributional properties of training and test subsets

    1 Introduction

    The NIST Speech Disc CD1-1.1 contains the complete Texas Instruments/Massachusetts Institute of Technology (TIMIT) acoustic-phonetic corpus of read speech. TIMIT was designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT has resulted from the joint efforts of several sites under sponsorship from the Defense Advanced Research Projects Agency - Information Science and Technology Office (DARPA-ISTO), and Defense Science Office (DARPA-DSO). Text corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI), and Texas Instruments (TI). The speech was recorded at TI, transcribed at MIT, and has been maintained, verified, and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). This publication and the disc were prepared by NIST with assistance by Lori Lamel.

    TIMIT contains a total of 6300 utterances, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. 70% of the speakers are male and 30% are female. More information on the selection and distribution of speakers is given in Section 3.1. The recording conditions are described in Section 3.2.

    The text material in the TIMIT prompts consists of 2 dialect "shibboleth" sentences designed at SRI, 450 phonemically-compact sentences designed at MIT, and 1890 phonetically-diverse sentences selected at TI. Each speaker read the 2 dialect sentences, 5 of the phonemically-compact sentences, and 3 of the phonetically-diverse sentences. See Section 3.3 for more information on the corpus text material and Section 6 for reprints of publications on the design of TIMIT.

    The speech material in TIMIT has been subdivided into dialect-balanced portions for training and testing with complete phonemic coverage. The criteria for the subdivision are described in Section 3.4. A "core" test set contains speech data from 24 speakers, 2 male and 1 female from each dialect region, and a "complete" test set contains 1344 utterances spoken by 168 speakers, accounting for about 27% of the total speech material in the corpus.

    Each sentence has an associated orthographic transcription, a time-aligned word boundary transcription (provided by NIST, see Section 5.3), and a time-aligned phonetic transcription (provided by MIT, see Sections 5.1 and 5.2).

    The CD-ROM contains a hierarchical tree-structured directory system which allows the disc to be easily perused. The TIMIT speech and transcription material is located in the "/timit/train" and "/timit/test" directories and on-line documentation pertaining to the corpus is located in the "/timit/doc" directory. Version 1.5 of the NIST SPeech HEader REsources (SPHERE) software is in the "/sphere" directory and ESPRIT SAM software (CONVERT) to convert TIMIT speech files into a SAM compatible format can be found in the "/convert" directory. Each of these directories contains a "readme.doc" file which may be consulted for further information.

    The remainder of this document is structured as follows:

    Section 2 contains a description of the CD-ROM structure and format, including information on how to mount and read the CD-ROM and a brief description of the SPHERE and CONVERT software.

    Section 3 describes the TIMIT corpus in more detail and the criteria used to divide the speech data into training and test subsets.

    Section 4 provides a description of the accompanying phonemic lexicon and the phonemic and phonetic symbols used in the lexicon and the phonetic transcriptions.

    Section 5 includes a reprint of the article by Seneff and Zue, "Transcription and Alignment of the TIMIT Database", and notes on checking the phonetic transcriptions and on the time-aligned word boundaries.

    Section 6 contains reprints of articles on the design of TIMIT.

    2 CD-ROM Contents and File Structure

    The CD-ROM, NIST Speech Disc CD1-1.1, contains the complete TIMIT acoustic-phonetic speech corpus. Also included on the disc are a new version (1.5) of the NIST SPeech HEader REsources (SPHERE) software and ESPRIT SAM software (CONVERT) for converting TIMIT speech files into a SAM compatible format.

    2.1 Reading the CD-ROM

    The TIMIT CD-ROM and all NIST speech discs are formatted according to the ISO-9660 international standard for CD-ROM volume and file structure (ISO, 1988). The ISO-9660 format allows the CD-ROM to be read on any computer platform which supports the standard. To date, ISO-9660 drivers have been implemented for a wide variety of computer systems, from personal computers to mainframes. These drivers permit an ISO-9660 disc to emulate a read-only Winchester disk, allowing virtually seamless integration of the CD-ROM. The TIMIT CD-ROM was designed to be usable on any system which supports ISO-9660. The disc contains only data files and source code (with the exception of the PC executable software in "/convert") which can be easily imported into any speech research environment.

    The TIMIT CD-ROM has been designed to be easily browsed or searched programmatically. The TIMIT corpus and documentation in "/timit" is structured into a directory hierarchy which reflects the organization of the corpus. Several computer-searchable text files in "/timit/doc/*.txt" contain tabular corpus-related information. In addition to TIMIT, a set of software tools, "SPeech HEader REsources" (SPHERE), in "/sphere" is included to ease importation of speech waveform files. The remainder of Section 2 contains more information on the CD-ROM directory and file structure.

    2.2 CD-ROM Contents

    The following files and subdirectories are located in the top-level directory of the CD-ROM. Each of the subdirectories contains a "readme.doc" file which may be consulted for more detailed information.

    convert/ - directory containing version 1.2 of the ESPRIT SAM software (CONVERT) for converting TIMIT speech files into a SAM compatible format

    readme.doc - general information file

    sphere/ - directory containing version 1.5 of the NIST SPeech HEader REsources (SPHERE) software. SPHERE is a set of "C" library routines and programs for manipulating the NIST header structure prepended to the TIMIT waveform files

    timit/ - directory containing the TIMIT corpus as well as TIMIT-related documentation

    2.3 TIMIT Directory and File Structure

    This section describes the organization of the files in the "/timit" directory. Descriptions of the file types and a summary of the on-line documentation are given in Sections 2.3.2 and 2.3.3.

    2.3.1 Organization

    On-line documentation and computer-searchable tabular text files are located in the directory "/timit/doc". A brief description of each file in this directory can be found at the end of this section. The speech and associated data are organized on the CD-ROM in the "/timit" directory according to the following hierarchy:

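    /<CORPUS>/<USAGE>/<DIALECT>/<SEX><SPEAKER_ID>/<SENTENCE_ID>.<FILE_TYPE>

    where,

    CORPUS :== timit
    USAGE :== train | test
    DIALECT :== dr1 | dr2 | dr3 | dr4 | dr5 | dr6 | dr7 | dr8
        (See Table 3.1 for a description of the dialect codes)
    SEX :== m | f
    SPEAKER_ID :== <INITIALS><DIGIT>

    where,

    INITIALS :== speaker initials, 3 letters
    DIGIT :== number 0-9 to differentiate speakers with identical initials

    SENTENCE_ID :== <TEXT_TYPE><SENTENCE_NUMBER>

    where,

    TEXT_TYPE :== sa | si | sx
        (See Section 3.3 for the description of sentence text types)
    SENTENCE_NUMBER :== 1 ... 2342

    FILE_TYPE :== wav | txt | wrd | phn
        (See Table 2.1 for a description of the file types)

    Examples:

    /timit/train/dr1/fcjf0/sa1.wav

    (TIMIT corpus, training set, dialect region 1, female speaker, speaker-ID "cjf0", sentence text "sa1", speech waveform file)

    /timit/test/dr5/mbpm0/sx407.phn

    (TIMIT corpus, test set, dialect region 5, male speaker, speaker-ID "bpm0", sentence text "sx407", phonetic transcription file)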
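The naming hierarchy above lends itself to mechanical parsing. The following is a minimal sketch in Python, not part of the CD-ROM software; the names `TimitPath` and `parse_timit_path` are hypothetical, and the pattern assumes lower-case, slash-separated paths exactly as written in the examples.

```python
import re
from typing import NamedTuple


class TimitPath(NamedTuple):
    usage: str            # "train" or "test"
    dialect: int          # dialect region number, 1..8
    sex: str              # "m" or "f"
    speaker_id: str       # 3 initials + 1 digit, e.g. "cjf0"
    text_type: str        # "sa", "si" or "sx"
    sentence_number: int  # e.g. 407
    file_type: str        # "wav", "txt", "wrd" or "phn"


# One group per field of /<CORPUS>/<USAGE>/<DIALECT>/<SEX><SPEAKER_ID>/<SENTENCE_ID>.<FILE_TYPE>
_PATTERN = re.compile(
    r"^/timit/(train|test)/dr([1-8])/([mf])([a-z]{3}[0-9])/"
    r"(sa|si|sx)([0-9]+)\.(wav|txt|wrd|phn)$"
)


def parse_timit_path(path: str) -> TimitPath:
    """Split an utterance path into its fields; raise ValueError if it does not fit."""
    m = _PATTERN.match(path.lower())
    if m is None:
        raise ValueError(f"not a TIMIT utterance path: {path!r}")
    usage, dr, sex, spk, ttype, num, ftype = m.groups()
    return TimitPath(usage, int(dr), sex, spk, ttype, int(num), ftype)
```

Applied to the two example paths, the sketch yields dialect 1, female, speaker "cjf0", text "sa" number 1 for the first, and dialect 5, male, speaker "bpm0", text "sx" number 407 for the second. Paths outside the train/test trees, such as files under "/timit/doc", are rejected.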

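    2.3.2 File Types

    The TIMIT corpus includes several files associated with each utterance. In addition to a speech waveform file (.wav), there are three associated transcription files (.txt, .wrd, .phn) for each utterance. These associated files have the form:

    <BEGIN_SAMPLE> <END_SAMPLE> <TEXT><new-line>
        .
        .
    <BEGIN_SAMPLE> <END_SAMPLE> <TEXT><new-line>

    where,

    BEGIN_SAMPLE :== The beginning integer sample number for the segment
        (Note: The first BEGIN_SAMPLE of each .txt and .phn file is always 0)

    END_SAMPLE :== The ending integer sample number for the segment
        (Note: The last END_SAMPLE in each transcription file may be less than the actual last sample in the corresponding .wav file)

    TEXT :== <ORTHOGRAPHY> | <WORD_LABEL> | <PHONETIC_LABEL>

    where,

    ORTHOGRAPHY :== Complete orthographic text transcription
    WORD_LABEL :== Single word from the orthography
    PHONETIC_LABEL :== Single phonetic transcription code

    (See Section 4.3 for a description of the phone codes)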

    Table 2.1: Utterance-associated file types

    File Type   Description
    ---------   -----------
    .wav        SPHERE-headered speech waveform file. (See Section 2.4 for a description of the speech file manipulation utilities.)

    .txt        Associated orthographic transcription of the words the person said. (This is usually the same as the prompt, but in a few cases the orthography and prompt disagree.)

    .wrd        Time-aligned word transcription. The word boundaries were aligned with the phonetic segments using a dynamic programming string alignment program. (See Section 5.3 for information on the alignment procedure.)

    .phn        Time-aligned phonetic transcription. (See Sections 5.1 and 5.2 for more details on the phonetic transcription protocols.)

    Example transcriptions from the utterance in "/timit/test/dr5/fnlp0/sa1.wav":

    Orthography (.txt):

    0 61748 She had your dark suit in greasy wash water all year.

    Word label (.wrd):

    7470 11362 she
    11362 16000 had
    15420 17503 your
    17503 23360 dark
    23360 28360 suit
    28360 30960 in
    30960 36971 greasy
    36971 42290 wash
    43120 47480 water
    49021 52184 all
    52184 58840 year

    Phonetic label (.phn):
    (Note: beginning and ending silence regions are marked with h#)

    0 7470 h#
    7470 9840 sh
    9840 11362 iy
    11362 12908 hv
    12908 14760 ae
    14760 15420 dcl
    15420 16000 jh
    16000 17503 axr
    17503 18540 dcl
    18540 18950 d
    18950 21053 aa
    21053 22200 r
    22200 22740 kcl
    22740 23360 k
    23360 25315 s
    25315 27643 ux
    27643 28360 tcl
    28360 29272 q
    29272 29932 ih
    29932 30960 n
    30960 31870 gcl
    31870 32550 g
    32550 33253 r
    33253 34660 iy
    34660 35890 z
    35890 36971 iy
    36971 38391 w
    38391 40690 ao
    40690 42290 sh
    42290 43120 epi
    43120 43906 w
    43906 45480 ao
    45480 46040 dx
    46040 47480 axr
    47480 49021 q
    49021 51348 ao
    51348 52184 l
    52184 54147 y
    54147 56654 ih
    56654 58840 axr
    58840 61680 h#
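Because the .txt, .wrd and .phn files share one line layout (begin sample, end sample, text), a single reader can serve all three. The sketch below is an illustration only; `Segment`, `parse_transcription` and `duration_seconds` are hypothetical names, and the conversion to seconds assumes the 16 kHz sample rate of the waveforms on this disc.

```python
from typing import List, NamedTuple

SAMPLE_RATE_HZ = 16000  # assumed: rate of the distributed TIMIT waveforms


class Segment(NamedTuple):
    begin_sample: int
    end_sample: int
    text: str  # orthography, a single word, or a single phone code


def parse_transcription(content: str) -> List[Segment]:
    """Parse the contents of a .txt, .wrd or .phn file into segments."""
    segments = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Split on the first two runs of whitespace only, so that the
        # orthography of a .txt file keeps its internal spaces.
        begin, end, text = line.split(None, 2)
        segments.append(Segment(int(begin), int(end), text))
    return segments


def duration_seconds(seg: Segment, rate: int = SAMPLE_RATE_HZ) -> float:
    """Length of a segment in seconds at the given sample rate."""
    return (seg.end_sample - seg.begin_sample) / rate
```

For instance, the segment "7470 9840 sh" spans 2370 samples, which is roughly 0.15 seconds at 16 kHz. A line with fewer than three fields raises ValueError, which seems a reasonable way to surface a damaged file.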

    2.3.3 On-line Documentation

    Compact on-line documentation is located in the "/timit/doc" directory. Files in this directory with a ".doc" extension contain freeform descriptive text, and files with a ".txt" extension contain tables of formatted text which can be searched programmatically. Lines in the ".txt" files beginning with a semicolon are comments and should be ignored on searches. The following is a brief description of each file:

    phoncode.doc - List of phone symbols used in the phonemic dictionary and the phonetic transcriptions

    prompts.txt - Table of sentence prompts and corresponding sentence numbers

    spkrinfo.txt - Table of speaker attributes

    spkrsent.txt - Table of sentence-ID numbers for each speaker

    testset.doc - Description of the suggested train/test subdivision

    timitdic.doc - Description of the phonemic lexicon

    timitdic.txt - Phonemic dictionary of all the orthographic words in the prompts
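The comment convention for the ".txt" tables can be honored with a small filter. This is a sketch under assumptions: the column layout of the individual tables is not given in this section, so the hypothetical helper `table_rows` simply splits each data line on whitespace, and it also treats a semicolon preceded by spaces as a comment marker.

```python
from typing import Iterable, Iterator, List


def table_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield whitespace-separated fields for every non-comment, non-blank line."""
    for line in lines:
        stripped = line.strip()
        # Lines beginning with ';' are comments and are skipped on searches.
        if not stripped or stripped.startswith(";"):
            continue
        yield stripped.split()
```

A typical use would be `with open(path) as f: rows = list(table_rows(f))`, where `path` points at one of the tables in "/timit/doc". Tables whose fields contain spaces (sentence prompts, free-text comments) would need a layout-specific split instead.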

    2.4 SPHERE Software (version 1.5)

    The NIST SPHERE header format was designed to facilitate the exchange of speech signal data on various media, particularly on CD-ROM. The NIST header is an object-oriented, 1024-byte blocked structure prepended to the waveform data. See the file "/sphere/headers.doc" for a description of the header format.

    NIST SPeech HEader REsources (SPHERE) is a software package for manipulating the NIST-headered speech waveform (.wav) files. The software consists of a library of C-language functions and a set of C-language system-level utilities which can be used to create or modify speech file headers in memory and to read/write the headers from/to disk. See the file "/sphere/readme.doc" for more information on the SPHERE software, including usage, installation on UNIX systems, and some sample programs.

    Please note: The SPHERE library and utilities are modified periodically. The most up-to-date version of the software is available via anonymous ftp from "ssi.ncsl.nist.gov" under the "pub" directory in the compressed tar-formatted file "sphere-<RELEASE-NUMBER>.tar.Z". Users are encouraged to acquire the most recent source code and documentation.

    The SPHERE C-language library contains the following functions:

    struct header_t *sp_create_header()
        Returns a pointer to an empty header structure.

    struct header_t *sp_open_header(fp,parse_flag,error)
        Reads an existing header in from file pointer "fp". The file pointer is assumed to be positioned at the beginning of a speech file with a header in NIST SPHERE format. On success, "fp" is positioned at the end of the header (ready to read samples) and a pointer to a header structure is returned. On failure, argument "error" will point to a string describing the problem. If "parse_flag" is false (zero), the fields in the header will not be parsed and inserted into the header structure; the structure will contain zero fields. This is useful for operations on files when the contents of the header are not important, for example when stripping the header.

    int sp_clear_fields(h)
        Deletes all fields from the header pointed to by "h". Returns a negative value on failure.

    int sp_close_header(h)
        Unlinks the header pointed to by "h" and releases the space allocated for the header. First reclaims all space allocated for the header's fields, if any exist. Returns a negative value on failure.

    int sp_get_nfields(h)
        Returns the number of fields stored in the specified header "h". Returns a negative value on failure.

    int sp_get_fieldnames(h,n,v)
        Fills in an array "v" of character pointers with addresses of the fields in the specified header "h". No more than "n" pointers in the array will be set. Returns the number of pointers set.

    int sp_get_field(h,name,type,size)
        Returns the "type" and "size" (in bytes) of the specified header field "name" in the specified header "h". Types are T_INTEGER, T_REAL, T_STRING (defined in "header.h").

        The size of a T_INTEGER field is sizeof(long).
        The size of a T_REAL field is sizeof(double).
        The size of a string is variable and does not include a null-terminator byte (null bytes are allowed in a string).

        Returns a negative value upon failure.

    int sp_get_type(h,name)
        Returns the type of the specified header field "name" of the specified header "h". Types are T_INTEGER, T_REAL, T_STRING (defined in "header.h"). Returns a negative value on failure.

    int sp_get_size(h,name)
        Returns the size (in bytes) of the specified header field "name" of the specified header "h".

        The size of a T_INTEGER field is sizeof(long).
        The size of a T_REAL field is sizeof(double).
        The size of a string is variable and does not include a null-terminator byte (null bytes are allowed in a string).

        Returns a negative value upon failure.

    int sp_get_data(h,name,buf,len)
        Returns the value of the specified header field "name" in the specified header "h" in "buf". No more than "len" bytes are copied; "len" must be positive. It really doesn't make much sense to ask for part of a long or double, but it's not illegal. Remember that strings are not null-terminated. Returns a negative value upon failure.

    int sp_add_field(h,name,type,p)
        Adds the field "name" to the header specified by "h". Argument "type" is T_INTEGER, T_REAL, or T_STRING. Argument "p" is a pointer to a character pointer, or a long integer or a double cast to a character pointer.

        The specified field must not already exist in the header. Returns a negative value on failure.

    int sp_delete_field(h,name)
        Deletes field "name" from the header specified by "h". The field must exist in the header. Returns a negative value on failure.

    int sp_change_field(h,name,type,p)
        Changes an existing field "name" in header "h" to a new type and/or value. The field must already exist in the header. Returns a negative value on failure.

    int sp_is_std(name)
        Returns TRUE if the specified field name is a "standard" field, FALSE otherwise. Standard fields are listed in stdfield.c. The notion of standard fields is now archaic.

    int sp_set_dealloc(n)
        Turns on (n > 0) or off (n = 0) memory deallocation. The default is on.

    int sp_get_dealloc()
        Returns the state of memory deallocation.

    int sp_write_header(fp,h,hbytes,databytes)
        Prints the specified header "h" to stream "fp" in the standard SPHERE header format. The number of bytes in the header block (a multiple of 1024) is returned in "hbytes"; the number of actual header data bytes used is returned in "databytes". Returns a negative value on failure.

    int sp_print_lines(h,fp)
        Prints the specified header "h" to stream "fp" in a human-readable format. Returns a negative value on failure.

    int sp_fpcopy(fp,outfp)
        Copies stream "fp" to stream "outfp" until end-of-file. Returns a negative value on failure.

    The SPHERE system-level utilities are:

    h_read [options] file ...
        reads headers from the files listed on the command line; by default, output is lines of tuples consisting of all fieldnames and values; many options modify the program's behavior; see the manual page "h_read.1";

    h_add inputfile outputfile
        adds an empty header to the "raw" unheadered speech samples in inputfile and stores the result in outputfile;

    h_strip inputfile outputfile
        strips the SPHERE header from inputfile and stores the remaining data in outputfile; if outputfile is "-", writes the sample data to "stdout";

    h_edit [-uf] [-D dir] -opchar fieldname=value file ...
    h_edit [-uf] [-o outfile] -opchar fieldname=value file
        edits specified header fields in the specified file(s). In the first form, it either modifies the file(s) in place or copies them to the specified directory "dir". In the second form, it either modifies the file in place or copies it to the specified file "outfile".

        The "-u" option causes the original files to be unlinked (deleted) after modification. The "-f" option forces the program to continue after reporting any errors.

        The "opchar" must be either "S", "I", or "R" to denote string, integer, or real field types respectively.

    h_delete [-uf] [-D dir] -F fieldname file ...
    h_delete [-uf] [-o outfile] -F fieldname file
        deletes specified header fields in the specified file(s). In the first form, it either modifies the file(s) in place or copies them to the specified directory "dir". In the second form, it either modifies the file in place or copies it to the specified file "outfile".

        The "-u" option causes the original files to be unlinked (deleted) after modification. The "-f" option forces the program to continue after reporting any errors.

    Example TIMIT SPHERE-formatted speech waveform header from the waveform file, "/timit/train/dr1/fcjf0/sa1.wav":

    NIST_1A
       1024
    database_id -s5 TIMIT
    database_version -s3 1.0
    utterance_id -s8 cjf0_sa1
    channel_count -i 1
    sample_count -i 46797
    sample_rate -i 16000
    sample_min -i -2191
    sample_max -i 2790
    sample_n_bytes -i 2
    sample_byte_format -s2 01
    sample_sig_bits -i 16
    end_head

    (The speech data follows the header block)
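The header is plain ASCII, so it can be inspected without the C library. The following Python sketch is an illustration, not a substitute for SPHERE; `parse_sphere_header` is a hypothetical name, and the parsing rules (a "NIST_1A" line, a header-size line, then "name -type value" lines up to "end_head", with types "-i", "-r" and "-sN") are inferred from the example header and the field types listed earlier in this section rather than taken from "/sphere/headers.doc".

```python
from typing import Dict, Tuple, Union

Value = Union[int, float, str]


def parse_sphere_header(raw: bytes) -> Tuple[int, Dict[str, Value]]:
    """Return (header_size_in_bytes, fields) for data starting with a NIST header."""
    magic, size_line, _rest = raw.split(b"\n", 2)
    if magic.strip() != b"NIST_1A":
        raise ValueError("missing NIST_1A identification line")
    header_size = int(size_line.strip())  # 1024 in the example above

    text = raw[:header_size].decode("ascii", errors="replace")
    fields: Dict[str, Value] = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line:
            continue
        name, ftype, value = line.split(None, 2)
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        elif ftype.startswith("-s"):
            # "-sN" announces a string of N bytes
            fields[name] = value[: int(ftype[2:])]
        else:
            raise ValueError(f"unknown field type {ftype!r} for field {name!r}")
    return header_size, fields
```

The sample data begin at byte offset `header_size`. With the fields of the example header, `sample_count / sample_rate` gives the utterance length in seconds.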

    2.5 Convert Software

    The directory "/convert" contains European Strategic PRoject on Information Technology (ESPRIT) Speech Input/Output Assessment Methodology and Standardization (SAM) Project software (version 1.2) for converting TIMIT speech files to a SAM-compatible format. The software was developed at the Institut de la Communication Parlee, Grenoble, France, in cooperation with NIST. Some minor modifications to the software were made at NIST to enable the software to run with the TIMIT CD-ROM file structure.

    SAM file naming conventions differ from those used in TIMIT. A mapping file, "/convert/spkr_map.sam", has been supplied by NIST on the CD-ROM to be used for automatic filename conversion when the CD-ROM is on-line. The Convert software removes the SPHERE header from the file (since SAM speech files contain no header information) and produces 2 SAM files for each TIMIT utterance. The first file is the signal file, and the second contains the orthographic transcription and speaker information. More details about Convert and examples of how to use the package are given in the file "/convert/readme.doc".

    3 The TIMIT Corpus

    The TIMIT corpus of read speech has been designed to provide the speech research community with a standardized corpus for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. The creation of any reasonably-sized speech corpus is very labor intensive. With this in mind, TIMIT was designed so as to balance utility and manageability, containing small amounts of speech from a relatively diverse speaker population and a range of phonetic environments. This section provides more detailed information on the contents of TIMIT and on the division of the TIMIT speech material into subsets for training and testing purposes.

    3.1 Corpus Speaker Selection and Distribution

    TIMIT contains a total of 6300 utterances, 10 sentences spoken by each of 630 speakers from 8 major dialect divisions of the United States. The 10 sentences represent roughly 30 seconds of speech material per speaker. In total, the corpus contains approximately 5 hours of speech. All speakers are native speakers of American English and were judged by a professional speech pathologist to have no clinical speech pathologies. (Some speech or hearing abnormalities of subjects are noted in the speaker information file "/timit/doc/spkrinfo.txt", which lists speaker-specific information.) In addition to these 630 speakers, a small number of speakers with foreign accents or other extreme speech and/or hearing abnormalities were recorded as "auxiliary" subjects, but they are not included on the CD-ROM.

    The speakers were primarily TI personnel, many of whom were new to TI and the Dallas area. They were selected to be representative of different geographical dialect regions of the U.S. A speaker's dialect region was defined as the geographical area of the U.S. where he or she lived during their childhood years (age 2 to 10). The geographical areas correspond with recognized dialect regions of the U.S. (Language Files, Ohio State University Linguistics Dept., 1982), with the exception of the Western dialect region (dr7), in which dialect boundaries are not known with any confidence, and dialect region 8, where the speakers moved around a lot during their childhood. The dialect regions are illustrated by the lines on the map shown in Figure 3.1. The locale of each speaker's childhood is indicated by a color-coded marker on the map.

    Note: For more information on American English dialectology see, for example, Atwood 1980; Bailey and Robinson 1973; Bronstein 1960; Davis 1983; Kurath 1949; and Williamson and Burke 1971.

    TI attempted to recruit speakers who equally represented the 8 dialect regions, but this was found to be impractical given the constraints of time and recording location. As a result, the regions dr1, dr6, and dr8 are less well-represented than the others. Table 3.1 shows the total number of speakers as well as the number of male and female speakers for each of the dialect regions. The percentages are given in parentheses.

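    Table 3.1: Dialect distribution of speakers

    Dialect Region Name         Code (dr)   # Male Speakers   # Female Speakers   Total
    -------------------------   ---------   ---------------   -----------------   ----------
    New England                 1           31 (63%)          18 (37%)            49 (8%)
    Northern                    2           71 (70%)          31 (30%)            102 (16%)
    North Midland               3           79 (77%)          23 (23%)            102 (16%)
    South Midland               4           69 (69%)          31 (31%)            100 (16%)
    Southern                    5           62 (63%)          36 (37%)            98 (16%)
    New York City               6           30 (65%)          16 (35%)            46 (7%)
    Western                     7           74 (74%)          26 (26%)            100 (16%)
    Army Brat (moved around)    8           22 (67%)          11 (33%)            33 (5%)

    Total # speakers:                       438 (70%)         192 (30%)           630 (100%)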


    Figure 3.1: Map of TIMIT Dialect Regions

    [Map of the United States showing the dialect regions New England, Northern, North Midland, South Midland, Southern, New York City, and Western. Courtesy of Texas Instruments, Inc.]

    The on-line file "/timit/doc/spkrinfo.txt" contains a table of speaker attributes. For each speaker, the information includes the ID (speaker's initials), Sex (male or female), DR (dialect region), Use (train or test), RecDate (recording date), BirthDate, Ht (height), Race, Edu (education level), and optional comments listing interesting speaker attributes or abnormalities.

    3.2 Recording Conditions and Procedures

    Recordings were made in a noise-isolated recording booth at TI, using a semi-automatic computer system (STEROIDS) to control the presentation of prompts to the speaker and the recording. Two-channel recordings were made using a Sennheiser HMD 414 headset-mounted microphone and a Bruel & Kjaer 1/2" far-field pressure microphone (#4165). Only the speech data recorded with the Sennheiser microphone is included on this CD-ROM.

    The speech was directly digitized at a sample rate of 20 kHz using a Digital Sound Corporation DSC 200 with the anti-aliasing filter at 10 kHz. The speech was then digitally filtered, debiased, and downsampled to 16 kHz. (For more information on the recording conditions and the post-processing of the speech signals, see the article by Fisher et al. in Section 6.)

    Subjects were seated in the recording booth and prompts were presented on a monitor. The subjects wore earphones through which a low level (approximately 53 dB SPL) of background noise was played to eliminate the unusual voice quality produced by the "dead room" effect. TI attempted to keep both the recording gain and the level of noise in the subject's earphones constant during the collection. At the beginning of each recording day, a standard calibration tone was recorded from each microphone and the voltage at the subject's earphones was checked and adjusted as necessary.

    The speakers were given minimal instructions and asked to read the prompts in a "natural" voice. The recordings were monitored, and any suspected mispronunciations were flagged for verification. Verification consisted of listening to the utterance by both the monitor and the speaker. When a pronunciation error was detected, the sentence was re-recorded. Variant pronunciations were not counted as mistakes.

    3.3 Corpus Text Material

    The text material in the TIMIT prompts, found in the file "/timit/doc/prompts.txt", consists of 2 dialect "shibboleth" sentences designed at SRI, 450 phonetically-compact sentences designed at MIT, and 1890 phonetically-diverse sentences selected at TI. Table 3.2 summarizes the speech material in TIMIT. The on-line file "/timit/doc/spkrsent.txt" lists the sentence texts read by each speaker.

    The dialect sentences (the SA sentences) were meant to expose the dialectal variants of the speakers and were read by all 630 speakers. The two dialect sentences are "She had your dark suit in greasy wash water all year." and "Don't ask me to carry an oily rag like that." Some expected variations occur in the pronunciation of the word "greasy" (with /s/ or /z/) and the vowel color in the word "water". (For a study of such dialect phenomena see the article by Cohen et al. in Section 6.)

    The phonetically-compact sentences (the SX sentences) were hand-designed to be comprehensive as well as compact. The objective was to provide good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either difficult or of particular interest. (See the article by Lamel et al. in Section 6 for more information on the design of these sentences.) Each speaker read 5 of these sentences and each text was spoken by 7 different speakers.

    The phonetically-diverse sentences (the SI sentences) were selected from existing text sources - the Brown Corpus (Kucera and Francis, 1967) and a collection of dialogs from recent stage plays (Hultzen et al., 1964) - so as to add diversity in sentence types and phonetic contexts. The selection criteria maximized the variety of allophonic contexts found in the texts. (See the article by Fisher et al. in Section 6 for more information on the selection of these sentences.) Each speaker read 3 of these sentences, with each sentence being read by only a single speaker.

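    Table 3.2: TIMIT speech material

    Sentence Type   #Sentences   #Speakers/Sentence   Total   #Sentences/Speaker
    -------------   ----------   ------------------   -----   ------------------
    Dialect (SA)    2            630                  1260    2
    Compact (SX)    450          7                    3150    5
    Diverse (SI)    1890         1                    1890    3

    Total           2342                              6300    10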
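The totals in the speech material table follow directly from the per-type figures, which makes for a quick consistency check. The sketch below uses hypothetical names (`TABLE`, `totals`) and only the numbers quoted in this section: 2 SA texts read by all 630 speakers, 450 SX texts read by 7 speakers each, and 1890 SI texts read by 1 speaker each.

```python
# (sentence type, number of texts, speakers per text, sentences per speaker)
TABLE = [
    ("SA", 2, 630, 2),
    ("SX", 450, 7, 5),
    ("SI", 1890, 1, 3),
]
N_SPEAKERS = 630


def totals(table):
    """Return (distinct texts, recorded utterances, sentences per speaker)."""
    n_texts = sum(n for _, n, _, _ in table)
    n_utterances = sum(n * speakers for _, n, speakers, _ in table)
    per_speaker = sum(p for _, _, _, p in table)
    return n_texts, n_utterances, per_speaker


n_texts, n_utterances, per_speaker = totals(TABLE)
# Counting by text and counting by speaker must agree.
assert n_utterances == N_SPEAKERS * per_speaker
```

Each row balances on its own as well: the number of texts times the speakers per text equals 630 times the sentences each speaker reads of that type.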

    3.4 Suggested Training/Test Subdivision

    The texts and speakers in TIMIT have been subdivided into suggested training and test sets using the following criteria:

    1 - Roughly 20 to 30% of the corpus should be used for testing purposes, leaving the remaining 70 to 80% for training.

    2 - No speaker should appear in both the training and testing portions.

    3 - All the dialect regions should be represented in both subsets, with at least 1 male and 1 female speaker from each dialect.

    4 - The amount of overlap of text material in the two subsets should be minimized; if possible the training set and test set should have no sentence texts in common.

    5 - All the phonemes should be covered in the test material; preferably each phoneme should occur multiple times in different contexts.
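Criteria 1 to 4 can be checked mechanically for any proposed split. The sketch below is not the procedure used to build the suggested subsets; `Utt` and `check_split` are hypothetical names, and criterion 5 (phoneme coverage) is left out because it needs the lexicon described in Section 4.

```python
from typing import Dict, Iterable, NamedTuple


class Utt(NamedTuple):
    speaker: str  # e.g. "cjf0"
    sex: str      # "m" or "f"
    dialect: int  # 1..8
    text_id: str  # e.g. "sx407"
    usage: str    # "train" or "test"


def check_split(utts: Iterable[Utt], dialects=range(1, 9)) -> Dict[str, object]:
    """Summarize how a train/test assignment fares against criteria 1 to 4."""
    utts = list(utts)
    if not utts:
        raise ValueError("no utterances given")
    speakers = {"train": set(), "test": set()}
    texts = {"train": set(), "test": set()}
    covered = {"train": set(), "test": set()}
    for u in utts:
        speakers[u.usage].add(u.speaker)
        texts[u.usage].add(u.text_id)
        covered[u.usage].add((u.dialect, u.sex))
    n_test = sum(1 for u in utts if u.usage == "test")
    missing = sorted(
        (use, d, s)
        for use in ("train", "test")
        for d in dialects
        for s in "mf"
        if (d, s) not in covered[use]
    )
    return {
        "test_fraction": n_test / len(utts),                     # criterion 1
        "shared_speakers": speakers["train"] & speakers["test"],  # criterion 2
        "missing_dialect_sex": missing,                          # criterion 3
        "shared_texts": texts["train"] & texts["test"],          # criterion 4
    }
```

A split that satisfies the criteria would report a test fraction between roughly 0.2 and 0.3, no shared speakers, no missing dialect/sex combinations, and as few shared texts as possible.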

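    The next three subsections provide more details on the training and test partitions of TIMIT. In order to ensure adequate coverage in the test subset, the test material was selected from the entire corpus according to the above criteria. Two test sets were selected. The "core" test set, containing a minimal usable set of test data, is described in Section 3.4.1. A description of the larger test set, the "complete" test set, is given in Section 3.4.2. After exclusion of the selected test material, the remainder of the corpus was designated as the training set. Some properties of the training partition are specified in Section 3.4.3.

    NOTE: This subdivision has no correspondence with the division of "training" material distributed on the prototype CD-ROM. The original division of training and test material was based ONLY on dialect and distribution, without other considerations. In contrast, the training and test division on this CD-ROM is based on more criteria and is better balanced. Therefore only the defined training material on CD-ROM "1-1.1" should be used for training purposes.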

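    3.4.1 Core Test Set

    Using the above criteria, 2 male speakers and 1 female speaker from each dialect were selected, providing a "core" test set of 24 speakers. Each speaker read a completely different set of 5 SX sentence texts (i.e. each sentence was read by only one speaker). These texts did not impose constraints on selecting the texts or speakers.

    The selected texts were checked to ensure that the set included at least one occurrence of each phoneme. The phonemic analysis was based on concatenated phonemic transcriptions of the words in the sentence, not the actual realized phonetic transcription. Thus the phonetic allophones found in the test data may be expected to differ from the underlying phonemic forms in accordance with typical phonological variations.

    The core test set contains 192 different texts (5 SX + 3 SI sentences x 24 speakers). To avoid overlap with the training material, the 2 SA sentences have been excluded from the core and complete test sets.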

    NOTE: The SA sentences for the test speakers are included on the CD-ROM for completeness. However, they should not be used for training or test purposes if the suggested training and test subsets are used, since they exist for both training and test speakers.

    Table 3.3 lists the speakers in the core test set for each dialect region. This set is the minimum recommended set for test purposes.

    Table 3.3: Speakers in the core test set

    Dialect   Male          Female   #Texts/Speaker   Total Texts
    -------   -----------   ------   --------------   -----------
    1         DAB0, WBT0    ELC0     8                24
    2         TAS1, WEW0    PAS0     8                24
    3         JMP0, LNT0    PKT0     8                24
    4         LLL0, TLS0    JLM0     8                24
    5         BPM0, KLT0    NLP0     8                24
    6         CMJ0, JDH0    MGD0     8                24
    7         GRT0, NJM0    DHC0     8                24
    8         JLN0, PAM0    MLD0     8                24

    Total:    16            8                         192

    3.4.2 Complete Test Set

    The complete test set was formed by including all 7 repetitions of the SX texts in the core test set. Thus the utterances from 144 (6x24) additional speakers were added, including the 3 unique SI sentences spoken by each speaker. This insured that no sentence text appeared in both the training and test material. The 168 speakers in the complete test set represent 27% of the total number of speakers in the corpus. The resulting dialect distribution of the complete speaker test set is given in Table 3.4. As in the entire TIMIT corpus, dialects 1, 6, and 8 are less represented than the other dialects.

    Table 3.4: Dialect distribution of speakers in complete test set

    Dialect   #Male   #Female   Total
    -------   -----   -------   -----
    1         7       4         11
    2         18      8         26
    3         23      3         26
    4         16      16        32
    5         17      11        28
    6         8       3         11
    7         15      8         23
    8         8       3         11

    Total     112     56        168

    The complete test set contains a total of 1344 sentences, 8 sentences from each of the 168 speakers. In this set there are 120 distinct SX texts and 504 different SI texts. Thus roughly 27% (624) of the texts have been reserved for the test material.

    The minimum recommended test material is the core test set, consisting of 2 male speakers and 1 female speaker from each dialect region, and 192 unique texts. Those wishing to perform more extensive testing should use the complete test set.
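
    The subset sizes quoted above follow from a few multiplications. As a minimal sketch (the function name and its parameters are illustrative only and are not part of the TIMIT distribution), the arithmetic can be checked as follows:

```python
def test_set_sizes(core_speakers=24, sx_per_speaker=5, si_per_speaker=3,
                   sx_repetitions=7, corpus_texts=2342):
    """Recompute the core and complete test set counts from the design rules."""
    texts_per_speaker = sx_per_speaker + si_per_speaker    # SA sentences excluded
    core_texts = core_speakers * texts_per_speaker         # all core texts are distinct
    # Each SX text is read by 7 speakers; the document counts 6 x 24 added speakers.
    added_speakers = (sx_repetitions - 1) * core_speakers
    complete_speakers = core_speakers + added_speakers
    distinct_sx = core_speakers * sx_per_speaker           # the SX texts of the core set
    distinct_si = complete_speakers * si_per_speaker       # each SI text has one speaker
    distinct_texts = distinct_sx + distinct_si
    return {
        "core_texts": core_texts,
        "added_speakers": added_speakers,
        "complete_speakers": complete_speakers,
        "complete_sentences": complete_speakers * texts_per_speaker,
        "distinct_sx": distinct_sx,
        "distinct_si": distinct_si,
        "distinct_texts": distinct_texts,
        "share_of_corpus_texts": distinct_texts / corpus_texts,
    }


sizes = test_set_sizes()
print(sizes)
```

    With the default arguments the share of corpus texts comes out a little under 27%, which is consistent with the "roughly 27%" quoted in the text.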

    3.4.3 Training Set

    The training material consists of all the speech data NOT included in either the "core" or "complete" test sets. There are 462 speakers in the training set, comprising 73% of the speakers. The training material contains a total of 4620 utterances, with 10 utterances/speaker. The dialect distribution of the training speakers is given in Table 3.5.

    The training material contains 1718 unique texts: the 2 SA texts, 330 different SX texts and 1386 distinct SI texts. The 2 SA texts were spoken by all the speakers in the corpus. Each of the SX sentence texts was read by 7 speakers and each SI text was spoken by a single speaker. With the exception of the 2 SA sentences, there is no overlap between the texts read by the test speakers and those read by the training speakers.

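    NOTE: The SA sentences should not be used for training or test purposes if the suggested training and test subsets are used, since they exist for both training and test speakers. Even if they are used in training, the SA sentences might skew training models, since the words they contain would be over-represented. Their suggested use is for comparative dialectal research.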

    Table 3.5: Dialect distribution of speakers in the training set

    Dialect   #Male   #Female   Total
    1         24      14        38
    2         53      23        76
    3         56      20        76
    4         53      15        68
    5         45      25        70
    6         22      13        35
    7         59      18        77
    8         14      8         22
    Total     326     136       462

    3.4.4 Distributional Properties of the Training and Test Subsets

    Table 3.6 shows some of the distributional properties of the training and test subsets. All of the 45 phonemes are found in the three text subsets, as determined by lookup of each word in the lexicon supplied on the CD-ROM. (See Section 4 for more information on the lexicon.) The total number of distinct words in the TIMIT scripts is 6099. In the core test set 92 distinct words occur, 403 of which also occur in the training texts. The complete test set contains 624 different texts and 237 distinct words - 08 of these words also occur in the training texts. Approximately 45% of the words in the texts of the test material also occur in the texts of the training material. The remaining words in the test material are "new". This is due in part to the design of the corpus itself. TIMIT was designed to provide a corpus of acoustic-phonetic speech data for the evaluation of recognition systems at the phonemic level. Because the primary focus in the design of the corpus was the coverage of phonemic elements, emphasis was placed on providing multiple contextual environments of the phonemes during text selection. In order to provide contextual and lexical variation, new words were preferentially chosen over old words during the generation of the phonemically-compact SX sentences. The phonetically-diverse SI sentences were selected so as to maximize allophonic contexts and thus also favored selection of texts containing new words or word sequences.
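
    The word-overlap figures discussed above are the kind of statistic that can be recomputed from the prompt texts. The following is a minimal sketch only: the tokenizer and the function names are assumptions made for illustration, and the toy prompts are not a claim about how the counts in Table 3.6 were produced.

```python
import re

# Hypothetical tokenizer: lower-case words, keeping apostrophes and inner hyphens.
WORD_RE = re.compile(r"[a-z']+(?:-[a-z']+)*")


def distinct_words(texts):
    """Return the set of distinct word types found in a list of prompt texts."""
    words = set()
    for text in texts:
        words.update(WORD_RE.findall(text.lower()))
    return words


def overlap_stats(train_texts, test_texts):
    """Count test-set word types and how many of them also occur in training."""
    train = distinct_words(train_texts)
    test = distinct_words(test_texts)
    shared = test & train
    return {
        "test_words": len(test),
        "shared": len(shared),
        "new": len(test - train),
        "shared_fraction": len(shared) / len(test) if test else 0.0,
    }


train = ["She had your dark suit in greasy wash water all year.",
         "Don't ask me to carry an oily rag like that."]
test = ["She had a dark oily rag."]
print(overlap_stats(train, test))
```

    Applied to the real training and test prompt lists, the same two functions would give the distinct-word and shared-word counts for each subset.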

    Table 3.6: Distributional properties of training and test subsets

                         Entire              Test
                         Corpus    Train     Core    Complete
    Sentences            6300      4620      192     1344
    Distinct Texts       2342      1718      192     624
    Distinct Words       6099      489       92      237
    Distinct Phonemes    45        45        45      45

    3.5 Transcriptions

    The TIMIT corpus includes several transcription files associated with each utterance. These files contain an orthographic transcription, a time-aligned word transcription and a time-aligned phonetic transcription. Details on the file formats are given in Section 2.3. The orthographic transcription contains the text of the sentence the speaker said. The orthographic transcription is usually the same as the prompt, but in a few cases they disagree. Word boundaries were assigned using a dynamic programming string alignment program (see Section 5.3) which aligned the word pronunciations found in the lexicon (see Section 4) with the phonetic segments. Information on the phonetic transcription conventions can be found in the article by Seneff and Zue in Section 5.1 and in the notes on checking the phonetic transcriptions in Section 5.2.

    4 TIMIT Lexicon

    The lexicon found in the file "timit/doc/timitdic.txt" contains entries for all of the words in the TIMIT prompts. There are a total of 6229 entries in the dictionary. The lexicon was derived in part from the MIT-adapted version of the Merriam-Webster Pocket Dictionary of 1964 ("Pocket") and a preliminary version of a general English dictionary under development at CMU. The pronunciations in the MIT Pocket lexicon have been verified and modified over the years. However, many of the words in the TIMIT scripts did not appear in the Pocket lexicon and needed to be added. These include other forms of words found in Pocket and words not found in any form. Rules were used to generate pronunciations in the former case, and the derived pronunciations were hand-checked. In the latter case, consisting mainly of proper names and abbreviated forms such as "takin'" instead of "taking" or "'em" for "them", the pronunciations were added by hand.

    The symbols in the lexical representation are abstract, quasi-phonemic marks representing the underlying sounds and typically correspond to a variety of different sounds in the actual recordings. The term quasi-phonemic is used because some differences represented in the lexicon are not phonemically distinctive in English, such as the contrast between /er/ and /axr/. Since the former always occurs with stress and the latter never occurs with it, as in "burner" /b er1 n axr/, the two are in complementary distribution and could be considered different allophones of the same phoneme.

    4.1 Format of the Lexicon

    All entries have been converted to lower case. Stress is represented as a "1" for primary stress and a "2" for secondary stress appended to the end of the vowel symbol. Hyphenated words such as "head-in-the-clouds" can be found both as a single entry and as the individual words "head", "in", "the" and "clouds" which result when the hyphens are replaced by spaces. This was done to allow more flexibility in the parsing of sentences into their constituent lexical items. If these parts of hyphenated words occur only as bound forms and never as free words, the hyphen is left in their entry, as in "knick-" and "-knack" from "knick-knack". Due to vagaries of English orthography, this procedure sometimes results in lexical entries that are neither words nor proper constituents of words, such as "-upmanship" from "one-upmanship".

    3 This number is greater than the number of distinct words listed in Table 3.6 because the dictionary includes entries for compound words and their components.

    4 The terms "lexicon" and "dictionary" are used interchangeably throughout this publication.

    One pronunciation is provided per entry, except in the case where the same orthography corresponds to different parts of speech with different pronunciations and both forms exist in the TIMIT prompts. To differentiate these words, multiple entries are given with the syntactic class following the "~" symbol. The classes found in the lexicon are:

    n      noun
    v      verb
    adj    adjective
    pres   present tense
    past   past tense

    An example is the word "live" with the entries:

    live~v     /l ih1 v/
    live~adj   /l ay1 v/
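
    The entry layout described above is simple enough to parse mechanically. The sketch below is illustrative only: it assumes one entry per line, a pronunciation enclosed in slashes, a syntactic class attached to the word with "~", and stress digits 1 or 2 appended to vowel symbols. The names `Entry` and `parse_entry` are hypothetical and are not part of the TIMIT distribution.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    word: str                  # orthography, lower case, may start or end with "-"
    word_class: Optional[str]  # e.g. "v" or "adj"; None when no class is given
    phones: List[str]          # phoneme symbols with stress digits removed
    stress: List[int]          # 1 = primary, 2 = secondary, 0 = no stress marker


# Assumed line layout: <word>[~class]  /<space-separated symbols>/
ENTRY_RE = re.compile(r"^(?P<head>\S+)\s+/(?P<pron>[^/]*)/\s*$")


def parse_entry(line: str) -> Entry:
    """Parse one lexicon line into word, optional class, phones and stress."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a lexicon entry: {line!r}")
    word, _, word_class = match.group("head").partition("~")
    phones, stress = [], []
    for token in match.group("pron").split():
        if token[-1] in "12":              # stress digit appended to a vowel
            phones.append(token[:-1])
            stress.append(int(token[-1]))
        else:
            phones.append(token)
            stress.append(0)
    return Entry(word, word_class or None, phones, stress)


print(parse_entry("live~v  /l ih1 v/"))
print(parse_entry("live~adj  /l ay1 v/"))
```

    Keeping the stress markers in a separate list makes it easy to compare the phoneme string with a phonetic transcription, which carries no stress information.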

    4.2 Pronunciation Conventions

    The pronunciation is specified using the CMU symbol set (see Section 4.3 for a description of the symbols). While we realize that representing only one pronunciation is often not sufficient to cover commonly observed pronunciations, many of the alternate pronunciations may be predicted by use of phonological rules and may be highly dialect dependent. Using only one pronunciation per word forced the somewhat vexing decision of which one to use. We did not put extensive study into such issues and do not make any claims of the theoretical correctness of our decisions on particular words. Our tendency has been to use the more marked alternate because we think it is harder to predict. We tried to make the pronunciations as consistent as possible. In a number of cases we referred to the authorities Kenyon and Knott, 1953 and Webster's Third New International Dictionary, 1966.

    4.2.1 Vowel Variability

    Many of the pronunciation differences for vowels occur in semi-vowel environments and in unstressed syllables.

    The vowel in words like "for", "pour" and "more" is often represented using either the vowel /ow/ or the vowel /ao/. This lexicon uses /ao/.

    The vowel in words like "air" and "care" has been represented using /ae/ to differentiate this vowel from the /eh/ in "berry". Some speakers actually make a three-way distinction (Mary, merry, marry), with the vowel in "Mary" being somewhat in between an /eh/, /ae/ and /ey/. These speakers may use the same vowel in words like "care".

    The vowel /ih/ (as opposed to /iy/) has been systematically used in the representation for words like "fear" and "year".

    unstressed schwa alternation /ix/ /ax/: /ix/ is usually used for schwas between 2 alveolars ("roses" /r ow z ix z/), otherwise /ax/ is used ("ahead" /ax hh eh d/).

    /r/ following the diphthongs /aw/ ("hour") and /ay/ ("fire") has been represented as /axr/, except where the /r/ is syllable-initial as in words like "irate" and "virus".

    vowel reduction: In some cases the pronunciation of a word may alternate between a full vowel and a highly reduced one. In these cases preference was given to the pronunciation with the more marked vowel instead of the schwa. For example, the pronunciation of "accept" is given as /ae k s eh p t/, not /ax k s eh p t/.

    4.2.2 Stress Differences

    /er/ /axr/ alternation -- /axr/ is used in unstressed syllables and /er/ in stressed syllables.

    /ih/ /ix/, /ah/ /ax/ -- once again the distinction is based on stress; the forms /ix/ and /ax/ are used in unstressed syllables.

    /y uw/ /y uh/ -- the tendency is to use /y uw/ in stressed positions, as in "attribution" /ae t r ih b y uw sh ix n/, and /y uh/ in unstressed position, as in "attribute~v" /ax t r ih2 b y uh t/.

    4.2.3 Syllabics

    The syllabics /em/, /en/ and /el/ are used frequently in the phonemic representations even though they may be pronounced as a sequence of a schwa followed by /m/, /n/ or /l/. For example, words ending in "ism" are represented as /ih z em/ even though a short schwa often appears in the transition from the /z/ to the /em/.

    /en/ must follow a coronal except in rare occurrences such as "cap'n" /k ae p en/ and "haven't" /hh ae1 v en t/.

    In general the syllabic /el/ is used instead of /ax l/ except before a stressed vowel. Some exceptions are found in words ending in the "-ly" suffix; for example, "angrily" is represented /ae ng g r ax l iy/, not /ae ng g r el iy/. The only occurrences of /el l/ are found in compound words such as "jungle-like" and "liberal-led".

    4.3 Phonetic and Phonemic Symbol Codes

    The following table contains a list of all the phonemic and phonetic symbols used in the TIMIT lexicon and in the phonetic transcriptions. These include the stress markers {1, 2} found only in the lexicon and the following symbols which occur only in the transcriptions:

    1) the closure intervals of stops, which are distinguished from the stop release. The closure symbols for the stops /b,d,g,p,t,k/ are /bcl,dcl,gcl,pcl,tcl,kcl/, respectively. The closure portions of /jh/ and /ch/ are /dcl/ and /tcl/.

    2) allophones that do not occur in the lexicon. The use of a given allophone may be dependent on the speaker, dialect, speaking rate, and phonemic context, among other factors. Since the use of these allophones is difficult to predict, they have not been used in the phonemic transcriptions in the lexicon.

    flap /dx/, as in words "muddy" or "dirty"

    nasal flap /nx/, as in "winner"

    glottal stop /q/, which may be an allophone of /t/, or may mark an initial vowel or a vowel-vowel boundary

    voiced-h /hv/, a voiced allophone of /h/, typically found intervocalically

    fronted-u /ux/, an allophone of /uw/, typically found in an alveolar context

    devoiced-schwa /ax-h/, a very short, devoiced vowel, typically seen when reduced vowels are surrounded by voiceless consonants

    3) other symbols include two types of silence: "pau", marking a pause, and "epi", denoting the epenthetic silence often found between a fricative and a semivowel or nasal, as in "slow", and "h#", used to mark the silence and/or non-speech events found at the beginning and end of the signal.

    Symbol   Example Word     Possible Phonetic Transcription   Comment

    Stops:
    b        bee              BCL B iy
    d        day              DCL D ey
    g        gay              GCL G ey
    p        pea              PCL P iy
    t        tea              TCL T iy
    k        key              KCL K iy
    dx       muddy, dirty     m ah DX iy, dcl d er DX iy        flap
    q        bat              bcl b ae Q                        glottal stop

    Affricates:
    jh       joke             DCL JH ow kcl k
    ch       choke            TCL CH ow kcl k

    Fricatives:
    s        sea              S iy
    sh       she              SH iy
    z        zone             Z ow n
    zh       azure            ae ZH er
    f        fin              F ih n
    th       thin             TH ih n
    v        van              V ae n
    dh       then             DH eh n

    Nasals:
    m        mom              M aa M
    n        noon             N uw N
    ng       sing             s ih NG
    em       bottom           b aa dx EM
    en       button           b ah q EN
    eng      washington       w aa sh ENG tcl t ax n
    nx       winner           w ih NX axr                       nasal flap

    Semivowels and Glides:
    l        lay              L ey
    r        ray              R ey
    w        way              W ey
    y        yacht            Y aa tcl t
    hh       hay              HH ey
    hv       ahead            ax HV eh dcl d
    el       bottle           bcl b aa dx EL

    Vowels:
    iy       beet             bcl b IY tcl t
    ih       bit              bcl b IH tcl t
    eh       bet              bcl b EH tcl t
    ey       bait             bcl b EY tcl t
    ae       bat              bcl b AE tcl t
    aa       bott             bcl b AA tcl t
    aw       bout             bcl b AW tcl t
    ay       bite             bcl b AY tcl t
    ah       but              bcl b AH tcl t
    ao       bought           bcl b AO tcl t
    oy       boy              bcl b OY
    ow       boat             bcl b OW tcl t
    uh       book             bcl b UH kcl k
    uw       boot             bcl b UW tcl t
    ux       toot             tcl t UX tcl t
    er       bird             bcl b ER dcl d
    ax       about            AX bcl b aw tcl t
    ix       debit            dcl d eh bcl b IX tcl t
    axr      butter           bcl b ah dx AXR
    ax-h     suspect          s AX-H s pcl p eh kcl k tcl t

    Symbol   Description

    Others:
    pau      pause
    epi      epenthetic silence
    h#       begin/end marker (non-speech events)
    1        primary stress
    2        secondary stress
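
    Because stop closures and silences appear only in the phonetic transcriptions, a rough comparison with the lexicon needs them folded away first. The following is a simplified sketch of that idea under stated assumptions: the function name is hypothetical, an unreleased stop (closure with no matching release) is kept as the plain stop symbol, and the allophones /dx/, /nx/, /q/, /hv/, /ux/ and /ax-h/ are deliberately left untouched. It is not the procedure used to prepare the corpus.

```python
# Closure symbol -> stop symbol, as listed in Section 4.3.
CLOSURE_TO_STOP = {"bcl": "b", "dcl": "d", "gcl": "g",
                   "pcl": "p", "tcl": "t", "kcl": "k"}
# Affricates share the closures of /d/ and /t/.
AFFRICATE_CLOSURE = {"jh": "dcl", "ch": "tcl"}
# Symbols that mark silence or non-speech events.
SILENCES = {"pau", "epi", "h#"}


def collapse_closures(phones):
    """Merge closure + release pairs into one symbol and drop silence symbols."""
    out = []
    i = 0
    while i < len(phones):
        phone = phones[i]
        if phone in CLOSURE_TO_STOP:
            nxt = phones[i + 1] if i + 1 < len(phones) else None
            if nxt == CLOSURE_TO_STOP[phone] or AFFRICATE_CLOSURE.get(nxt) == phone:
                out.append(nxt)                      # closure followed by its release
                i += 2
                continue
            out.append(CLOSURE_TO_STOP[phone])       # closure only: unreleased stop
        elif phone not in SILENCES:
            out.append(phone)
        i += 1
    return out


print(collapse_closures("h# bcl b iy h#".split()))
print(collapse_closures("dcl jh ow kcl k".split()))
print(collapse_closures("bcl b ih gcl b oy".split()))
```

    The last call shows the unreleased-stop case: the closure "gcl" has no "g" release after it, so it is kept as a plain "g".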

    4.4 Errata

    A few errors were found in the phonemic lexicon file "timit/doc/timitdic.txt" after the CD-ROM was pressed. The corrections are as follows:

    delee"-knacks n ae k s"-upmansip a p m ax n s i p"-ups a p s"zagged ae1 g d"bdied aa d iy d(ese aren wrds r cmbining fms

    2 cange "casrbeans ae s axr b iy n z "casrbeans ae s axr b iy n z

    3 cange "fasclsing f ae s a w z i ng "fas-clsing f ae s k w z i ng

    4 cange "clverleaf a w v axr iy2 "clvereaf w v axr iy2 f

    5 cang "cnsany a aa n s n iy "cnsany aa n s n iy

    6 cange "cunyside a a n r iy s ay2 d

    "cunryside a n r iy s ay2 d7 cange "nancys n ae n a iy z

    "nancys n ae n s iy z

    cange "singers s ng g axr z "singers s i ng axr z

    9 cange "uncmfrable a n a a m f axr ax b el "uncmfrable a n k a m f axr ax b el

    0. cange "backward ae k w er d z "backward ae k w er d

    11 cange "cleaners al iy n axr z "cleaners iy n axr z

    cge crey r w I y'o crey r w iy

    3 cge deecle d e k elo deece d i e k e

    cge disic d i s i g 'o disic d s i g k

    5 cge elipsoids I i P s oy do elpsods P s oy d z'

    6 cge eiy e iyo ey e iy

    7 cge owee / e

    ow iy

    o lowee / e ow w iy

    8 cge edrers / e d k w r z'o edqrers / e d k w r r z'

    9 cge ideifed y d e f y'o defied y d e f y d

    0 cge isc s g o isc s i g k

    cge mscl m w z k elo msic m y w z i k el'

    cge preseed p r z e i d'o preseed r z e d'

    3 cge werigly w ey r o werigy w ey r g I iy'

    The following are less surely errors, but probably should be changed:

    cge poocemic f ow k e m k elo poocemic f ow ow k e m k e'

    cge poogrps f ow ow g r e f so poogrps f ow g r e f s'

    3 age reorgazao r y2 ao r g ay z ey s to "reorgaizatio r iy a2 r g i ay z ey s x

    4. age "tyray t r ae yto tyray t 1 r ax y

    5 Transcription Protocols

    This section includes information pertaining to the protocols used in obtaining the transcription files associated with each utterance. Section 5.1 reprints an article on the phonetic transcription methodology. Additional details are given in the notes in Section 5.2. The word boundary alignment procedure is described in Section 5.3.

    5.1 Reprint of a Publication Describing TIMIT Transcription Conventions

    This section contains a reprint of the article "Transcription and Alignment of the TIMIT Database" by Stephanie Seneff and Victor W. Zue. The paper was presented at the Second Symposium on Advanced Man-Machine Interface through Spoken Language, Oahu, Hawaii, November 20-22, 1988.

    Transcription and Alignment of the TIMIT Database
    Victor W. Zue and Stephanie Seneff

    Spoken Language Systems Group, Laboratory for Computer Science
    Massachusetts Institute of Technology, Cambridge, MA 02139, USA

    ABSTRACT

    The TIMIT acoustic-phonetic database was designed jointly by researchers at MIT, TI and SRI. It was intended to provide a rich collection of acoustic phonetic and phonological data to be used for basic research as well as the development and evaluation of speech recognition systems. The database consists of a total of 6,300 sentences from 630 speakers, representing over 5 hours of speech material, and was recorded by researchers at TI. This paper describes the transcription and alignment of the TIMIT database, which was performed at MIT.

    1. BACKGROUND

    When the DARPA Strategic Computing speech program was first formulated in 1984,

    on-line Webster's Pocket Dictionary containing nearly 20,000 words. Words or word-sequences containing particular phone pairs could be accessed from this dictionary automatically, which greatly facilitated the database design process. We performed a detailed analysis of the resulting sentence set, as well as the SI sentences that make up the remainder of the database. The interested reader should consult Lamel et al. [3] for further information about the

    2. THE ACOUSTIC PHONETIC LABEL SET

    All of

    Reduced vowels are represented by four separate allophones: back schwa, front schwa, retroflexed schwa, and voiceless schwa. The decision for back schwa vs. front schwa is based on whether the second formant is closer to the first

    acoustic events can be accessed conveniently based on the transcription. We must stress, however, that the aligned transcription is intended to establish correspondence between the transcription and important acoustic landmarks. One should not directly associate a region between two time markers as a distinct phonetic unit, since the encoding of phonetic information in the speech signal is extremely complicated.

    In most cases, the boundaries between two acoustic-phonetic events are clear and well-defined, such as that between a stop closure and its release. However, there are a number of cases where the exact placement of a boundary is problematic (as is the case between a semivowel and a vowel), or cases where it is not clear whether a region should be represented as one or two acoustic-phonetic units (as is the case for diphthongs). In these cases, we tried to define a set of criteria that would be systematic and least subject to human error, in order to produce boundary positionings that were as consistent as possible.

    As mentioned previously, we decided that the boundary between the closure interval and the release of a stop is an important one that should be assigned. It is certainly a very distinct landmark in the waveform. Anyone interested in studying the burst characteristics of a stop would then be able to focus on just that region that includes only the released portion. In a strictly phonemic transcription, the closure and release would be represented as a single unit, and therefore that initial boundary would remain unmarked.

    A problematic boundary is one that separates a prevocalic stop from a following semivowel, as in "truck." Typically part of the /r/ is devoiced, and therefore is absorbed into the aspiration portion of the stop. If listening were the only criterion, then the left boundary of the /r/ would occur somewhere in the aspiration, and

    In both Stages 1 and 3, the labeller makes her/his acousti

    system well with those produced by human transcribers. For example, over 75% of the automatically generated boundaries were within 10 msec of a boundary entered by a trained phonetician.

    Figure 2 displays the output for the sentence, "She had your dark suit in greasy wash water all year." The transcription and boundaries are overlaid on the spectrogram for ease of examination. For this example, most of the boundaries have been found correctly by CASPAR. Note, however, that boundaries are missing in a sequence within "She had." The waveform displays the word "dark" and the [s] of "suit". Note that the initial boundary of the first [d] is slightly too far forward in time.

    4.3 Post-Processing

    The final step is to correct by hand any errors in the automatically aligned acoustic-phonetic sequence. Some of the errors are due to the fact that CASPAR is not able to determine certain boundaries, such as some of those between two vowels. In other cases the boundaries may have been misplaced.

    Hand correction of the aligned transcription is based on critical listening of portions of the utterance as well as visual examination of the spectrogram and the waveform. The spectrogram covers close to 3 seconds worth of speech at one time, whereas the waveform is displayed on a much more expanded time scale. For example, to accurately mark the onset of the release of a stop, the cursor is first positioned on the spectrogram at the approximate point in time. The waveform display automatically moves to synchronize in time with the cursor, and a fine tuning of the boundary can be achieved by mousing the exact time point in the waveform.

    The mouse can be used with ease to move an existing boundary to a new point in time, to erase a boundary, or to insert a boundary. Furthermore, a specified mouse click on any segment allows the labeller to change the acoustic phonetic label associated with that segment. This step is sometimes necessary to correct an error of judgment in stage 1.

    An example of the screen layout used for the correction process is shown in Figure 3. The boundary for the [d] burst onset has been corrected. Missing boundaries were inserted for the sequence noted above. In addition, the boundaries associated with the first [w] were extended on both sides, and an epenthetic silence was inserted betw

    graphic transcription as well as the intermediate phonemic transcription. A time-aligned orthographic transcription is useful when searching for a specific word, while a time-aligned phonemic transcription can be used to relate the lexical representation of words to their acoustic realizations. For example, the lexical representation of the word sequence "gas shortage" contains a word-final /s/ and a word-initial /sh/, whereas its acoustic realization may simply be a long [sh]. In this case, the time-aligned phonemic transcription will map the long [sh] to both the underlying fricatives. Researchers interested in studying the frequency of occurrence of certain low level phonological rules will thus be able to derive the information from these transcriptions.

    We have developed a system that maps a time-aligned acoustic-phonetic transcription to the phonemic and orthographic transcriptions [7]. However, the alignment effort for these transcriptions lags somewhat behind the phonetic alignment. In the interest of expeditiously making as much data available to the interested

    of Technology. May 1986,

    Figure 1: SPIRE layout for

    Figure 3: SPIRE layout showing the aligned transcription following post-processing.

    5.2 Notes on Checking the Phonetic Transcriptions

    The phonetic and orthographic transcriptions have been re-checked before this release. The aim in checking these transcriptions was to correct any blatant errors, often due to mousing, and to make the transcriptions a bit more consistent. The phonetic transcriptions were checked at MIT using the SPIRE system (Zue et al., 198). The phonetic transcription of an utterance is highly subjective, particularly with regard to fine phonetic distinctions such as the exact vowel color and the partial devoicing of voiced consonants.

    The orthographic transcriptions were checked primarily for spelling errors and to ensure that the transcriptions were accurate. Occasionally an orthographic transcription differs from its corresponding text prompt.

    All comments received on the phonetic transcriptions were reviewed and taken into consideration, although the transcriptions were not necessarily changed accordingly. We would like to thank all of the people who took the time to send comments to us, particularly Mike Riley and his colleagues at Bell Labs, who sent us the most extensive remarks.

    The following notes, written by Lori Lamel (who checked the phonetic transcriptions for the CD-ROM), summarize the most common changes made to the phonetic transcriptions and attempt to fill in details missing in the article by Seneff and Zue. They are not meant to be a comprehensive description of the transcription process; see the article in the previous section for more details on the transcription process and protocols.

    5.2.1 Acoustic-Phonetic Labels

    Stops:

    Stop-stop sequences are often realized with a single closure and a single release. For example, "big boy" is often realized as /bcl b ih gcl b oy/, with the /g/ being unreleased. Generally the closure interval is given to the first stop and the release to the second, unless it is clear that there was no gesture towards the first stop. For example, if there are clear labial transitions at the end of the /ih/, the sequence would be transcribed as /bcl b ih bcl b oy/.

    The glottal stop /q/ is used to denote several different types of acoustic-phonetic events, leading to some apparent confusions. A stop, typically /t/, may be realized as a glottal stop. In this case, the signal is carefully reviewed to ensure that there are no alveolar formant transitions and that one really hears a glottal stop. When a /t/ is transcribed as a glottal stop, no closure interval is marked.

    /q/ is also used to mark the glottalization found at the beginning of a word starting with a vowel, or the glottal stop or glottalization that may be used to mark a vowel-vowel boundary. The /q/ is not used to mark non-event-specific glottalization, such as may be found at the end of sentences, or as may be characteristic of the speech of some speakers.

    Stop closures are not marked after pauses, except in the few cases where there was a pause followed by clear prevoicing for a voiced stop.

    Nasal-stop sequences are sometimes transcribed without a closure interval for the stop. Thus, "undo" may be found as /ah n dcl d uw/ or as /ah n d uw/. The latter case occurs when there is no visible weakening in the nasal murmur prior to the stop release.

    There is a relatively broad use of flaps in the transcriptions. Flaps may be as long as 40-0 ms at times, or even contain a weak release, if they are heard as a flap.

    Fricative-like allophones of voiced stops are transcribed as having only a closure interval, since there is no visible release. While it might be more realistic to transcribe fricative-like voiceless stops with only a release portion, they are typically transcribed as only a closure for consistency with the voiced stops.

    Nasals:

    Sometimes, particularly when followed by a voiceless consonant as in words like "can't" and "dance", there is no segment corresponding to the nasal murmur and the only evidence of the underlying nasal is found in the nasalization of the vowel. Since there is no symbol in the set to mark the nasal in this way, a small nasal segment is marked when a nasal is heard, even if it is almost impossible to locate the nasal in the signal. In this case the last 1-2 pitch periods of the preceding vowel are labeled as the nasal.

    Liquids and Glides:

    Postvocalic /r/ is typically transcribed using the syllabic symbols, /er/ or /axr/, depending on the stress. This convention is used since the postvocalic /r/ is acoustically more similar to the syllabic than the consonantal form. This does not necessarily mean that there are two syllables present. When a postvocalic /r/ occurs intervocalically it is transcribed as /r/ if there are good initial /r/ transitions into the following vowel.

    Vowels:

    Since fine distinctions in vowel color are highly subjective, the vowel color was left unchanged except when the verifier strongly disagreed with the label. Thus there may still be many differences of opinion in whether a given vowel is an /aa/, /ao/ or /ow/, and in distinguishing between the use of /iy/ and /ih/, etc.

    In general, /r/-coloring greatly affects the quality of the vowel. It is hard to distinguish between /ow/ and /ao/. The vowel preceding /r/ in words like "fear" and "year" has been systematically transcribed as /ih/. Occasionally a speaker pronounced an extreme /iy/ in this context and the vowel is so marked. Under these conditions the word may be pronounced with two syllables (/y iy er/). Similarly the vowel in words like "wear" has been labeled /eh/.

    Schwas are liberally used in the transcriptions to represent unstressed and reduced vowels. As a reminder, four types of schwas are used. The use of /ax/ /ix/ is based on the position of the second formant: if it is close to the third formant /ix/ is used, otherwise /ax/ is used. A devoiced schwa (/ax-h/) is marked when there is no voiced portion or when there are only 1-2 pitch periods visible in the waveform. /axr/ marks the observation that the third formant is low, indicating retroflexion. Schwa off-glides are marked only when they can be heard or seen as a syllable.

    The fronted-/uw/ (/ux/) is found in the transcriptions even though it is not phonemically distinct in English. Sometimes the first part of the vowel is /ux/-like and the end part is /uw/-like. In some of these cases, the vowel had been transcribed as a sequence of two vowels /ux uw/. Since there is really only one vowel present, these transcriptions were changed. If the /ux/-like portion was longer than the /uw/-like portion, or if the second formant never really got close enough to the first formant, the symbol /ux/ was used. In the other cases the vowel was labeled /uw/ despite the fronting at the onset.

    Fricatives:

    The labels used for the fricatives were not discussed in the Seneff/Zue article. Some comments may help to clarify the transcriptions.

    Voiced fricatives have a tendency to be devoiced in English, with the primary cue to voicing carried in the segment duration. Thus voiced fricatives are labelled as such, even though vocal fold vibration is not present throughout the segment, if at least one of the following holds:

    (1) there is evidence of vocal fold vibration during part of the segment, typically found at the beginning

    (2) the segment duration is short relative to the voiceless fricatives in the sentence

    (3) the duration of the preceding vowel is lengthened

    In some cases the voicing characteristic may be very hard to determine and disagreement may arise, particularly when the fricative is part of a cluster or sentence final.

    Fricative-fricative sequences often show modification. For example, the sequence /s sh/ is most often seen as a long /sh/. When the /s/ is visible it is labelled. Similarly, a voiced fricative preceding a voiceless one is often devoiced; /z s/ is usually seen as a long /s/. The /z/ is marked only when voicing is evident, and the /s/ begins where the periodicity ends.

    Some speakers tend to produce stop-like allophones of the weak fricatives. These are typically transcribed as the weak fricative, recognizing this as an allophone. Sometimes, however, there is evidence of a very clear stop closure followed by a stop-release-like fricative. In these cases a stop closure has been marked in the transcription. A /dcl/ is used before /dh/ and a /tcl/ before /th/; for the other fricatives a homorganic stop closure is used.

    There may be a small period of silence between a nasal and a fricative. Homorganic stop closures have been used to mark this interval in the place of epenthetic silence. Sometimes a stop release is also inserted. This process is known as homorganic stop insertion and is transcribed as a stop.

    5.2.2 Boundaries

    Few adjustments were made to segment boundary locations. Severe misalignments (a relatively rare happening) were corrected. The most common boundary change was the adjustment of the start location of a stop release, which occasionally cut off the beginning of the release.

    5.2.3 Disclaimer

    Phonetic transcriptions are inherently extremely subjective. Thus, we expect that there will always be disagreement with some of the decisions made in transcribing and checking TIMIT. Our goal was to provide a relatively broad acoustic-phonetic transcription where the most reliable acoustic landmarks have been marked. The re-checking of TIMIT, aimed at correcting relatively blatant errors and not at making fine distinctions, represents about 200 hours of human-interactive time, and as such is a task subject to error. We hope to have minimized the transcription errors in TIMIT and to have made the transcriptions more consistent.

    5.3 Notes on Automatc Geneaton of Wo Bounaes

    Th ecton decrbe the program ed to atomatcaly aate word bondare wthphonetc egment. he program mlar n concept to the wor of Kael (986)

    531 Geneal Methoolo

    he atomatc generaton of word bondare accomplhed g the folowngalgorthm:

    () A phonemc trancrpton of a entence generated from an orthographyby concatenatng the phonemc form of the lexcal entry for each word.

    (2) The reltng trng then agned wth the phonetc trancrpton nga dynamc programmngbaed trng algnment program (aaabe omNS) wth weght baed on phonetc featre

    (3) After agnment, the word bondare n the phonetc trng are nferredfom the phonemc trng by appyng a et of phonologcal rle

    Atomatcally-generated word bondare ng the aboe algorthm agreed wth 96% ofthe aaabe hmancheced bondae on a ampe of 4000 entence

    5.3.2 Alignment Procedure

    The alignment of phonemes and phones is performed using a dynamic programming string alignment algorithm to determine a mapping from phonemes to phones which minimizes a distance function. The distance function which was used, technically a weighted Levenshtein metric, is a weighted sum of all insertion, deletion, and substitution operations necessary to edit the phoneme string into the phone string. The weight of each elementary operation is the sum of the phonetic features that are different between the mapped phoneme and phone. By convention, deleted phonemes and inserted phones are mapped to "nil", a symbol defined as having no phonetic features, so that their contribution to the distance is the number of phonetic features defining them. The alignment code using this concept of phonological distance was reported on at ICASSP-90 (Pallett et al., 1990) and is available from NIST.
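
    A minimal sketch of such a feature-weighted alignment follows. It is not the NIST code: the feature sets in FEATURES are invented for illustration, and the tie-breaking between equal-cost operations is an arbitrary choice of this sketch.

```python
# Hypothetical feature sets; the real feature inventory is not given in the text.
FEATURES = {
    "t": {"consonant", "stop", "alveolar"},
    "d": {"consonant", "stop", "alveolar", "voiced"},
    "s": {"consonant", "fricative", "alveolar"},
    "iy": {"vowel", "high", "front"},
    "ih": {"vowel", "high", "front", "lax"},
}

def feature_distance(a, b):
    """Count features that differ; None plays the role of "nil" (no features)."""
    fa = FEATURES[a] if a is not None else set()
    fb = FEATURES[b] if b is not None else set()
    return len(fa ^ fb)

def align(phonemes, phones):
    """Weighted Levenshtein alignment; returns (distance, list of pairs)."""
    n, m = len(phonemes), len(phones)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # delete every phoneme
        cost[i][0] = cost[i - 1][0] + feature_distance(phonemes[i - 1], None)
        back[i][0] = "del"
    for j in range(1, m + 1):  # insert every phone
        cost[0][j] = cost[0][j - 1] + feature_distance(None, phones[j - 1])
        back[0][j] = "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p, q = phonemes[i - 1], phones[j - 1]
            options = [
                (cost[i - 1][j - 1] + feature_distance(p, q), "sub"),
                (cost[i - 1][j] + feature_distance(p, None), "del"),
                (cost[i][j - 1] + feature_distance(None, q), "ins"),
            ]
            cost[i][j], back[i][j] = min(options)
    # Trace back from the end to recover the phoneme-to-phone mapping.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        if op == "sub":
            pairs.append((phonemes[i - 1], phones[j - 1]))
            i, j = i - 1, j - 1
        elif op == "del":
            pairs.append((phonemes[i - 1], None))
            i -= 1
        else:
            pairs.append((None, phones[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs

distance, pairs = align(["s", "iy", "t"], ["s", "ih", "d"])
print(distance, pairs)
```

    With these toy features, substituting "ih" for "iy" or "d" for "t" costs one feature each, while inserting or deleting a segment costs its full feature count, so the alignment prefers pairing phonetically similar segments.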

    5.3.3 Phonological Rule Post-Processing

    The following rules are applied to the aligned phonemic and phonetic strings. If the rule's preconditions are met, the rule is activated ("fired") and the word boundary is modified. This modification can add, delete, orphan, or share phones at the boundary. The rules are listed below in the order of precedence in which they are applied. Only one rule is active at a time. The rule format is:

    precondition : [phone sequence] -> [phoneme sequence]

    where "->" means "is mapped to".

    Disclaimer: While this rule set is likely to be incomplete, we feel it provides adequate agreement with human-checked boundaries.

    Rule 1: Orphanization of silence/stop voids
    The phones in the set {h#, pau, epi} were orphaned unless the alignment routine matched them with a word-final phoneme.

    [any context] : (pau) -> ()

    Rule 2: Orphanization of glottal stop insertions
    If a glottal stop was inserted between a word-final vowel and the following word-initial vowel, it was left as an orphan phone.

    [(vow q vow)] : (q) -> ()

    Rule 3: Stop Closure and Release Merging
    The phonetic transcription and phonemic representations differ with respect to the representation of stops. In the phonemic transcription the stop is a single entity, while in the phonetic transcription stop closures are marked separately from stop releases. This rule searches for a closure-release pair (e.g. pcl-p, bcl-b, dcl-d) that spans the inferred word boundary. If this condition occurs, the boundary is shifted to include both phones in the proper word.

    [any context] : (pcl p) -> (p)
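
    Rule 3 can be sketched as follows. The boundary representation (the index of the first phone of the right-hand word) and the flag saying which word owns the stop phoneme are assumptions of this sketch; in the actual program that ownership would come from the phoneme-to-phone alignment.

```python
# TIMIT closure symbols paired with their stop releases.
CLOSURE_OF = {"p": "pcl", "t": "tcl", "k": "kcl",
              "b": "bcl", "d": "dcl", "g": "gcl"}

def merge_stop_across_boundary(phones, boundary, stop_in_left_word):
    """Shift a word boundary that splits a closure-release pair.

    `boundary` is the index of the first phone of the right-hand word.
    `stop_in_left_word` (hypothetical input) says which word the stop
    phoneme belongs to according to the alignment.
    """
    if 0 < boundary < len(phones):
        closure, release = phones[boundary - 1], phones[boundary]
        if CLOSURE_OF.get(release) == closure:
            # Move the boundary so closure and release stay together.
            return boundary + 1 if stop_in_left_word else boundary - 1
    return boundary

phones = ["ae", "dcl", "d", "y"]
print(merge_stop_across_boundary(phones, 2, True))   # stop kept in left word
print(merge_stop_across_boundary(phones, 2, False))  # stop kept in right word
```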


    Rule 4: Sharing of geminate phones
    MIT's phonetic transcription convention for geminate phones, i.e. where the word-final and word-initial phones were identical, was to mark them as a single segment. This rule adjusts the word boundaries to allow this single phone to be shared.

    [geminate phonemes (m m)] : (m) -> (m m)

    Rule 5: Sharing word-final and initial vowels
    Word-final vowels followed by word-initial vowels were occasionally transcribed as a single vowel segment. Typically at least one of the 2 vowels was unstressed. This rule searches for a missing vowel at the inferred word boundary, then forces the remaining vowel to be shared.

    [phonemes (vow1 vow2)] : (vow1) -> (vow1 vow2)

    Rule 6: Palatalization sharing
    A fairly common phonological transformation is y-palatalization of stops and fricatives across word boundaries. In this case the underlying phonemic sequence of (d y) may be manifest phonetically as (dcl jh).

    The following set of rules accounts for these phenomena:

    a. [phonemes (d y)] : (dcl jh) -> (d)
                          (jh) -> (y)

    b. [phonemes (t y)] : (tcl ch) -> (t)
                          (ch) -> (y)

    c. [phonemes (s y)] : (sh) -> (s)
                          (sh) -> (y)

    d. [phonemes (z y)] : (zh) -> (z)
                          (zh) -> (y)

    In cases (a) and (b), sometimes the closure interval is missing and the d, t are aligned only with jh, ch.


    6 Reprints of Selected Articles

    This section includes reprints of the following three TIMIT-related articles that appeared in the proceedings of DARPA Speech Recognition Workshops.

    Fisher, William M., Doddington, George R., and Goudie-Marshall, Kathleen M. (1986), "The DARPA Speech Recognition Research Database: Specifications and Status," Proceedings of the DARPA Speech Recognition Workshop, Report No. SAIC-86/1546, February 1986, Palo Alto.

    Lamel, Lori F., Kassel, Robert H., and Seneff, Stephanie (1986), "Speech Database Development: Design and Analysis of the Acoustic-Phonetic Corpus," Proceedings of the DARPA Speech Recognition Workshop, Report No. SAIC-86/1546, February 1986, Palo Alto.

    Cohen, Michael, Baldwin, Gay, Bernstein, Jared, Murveit, Hy, and Weintraub, Mitchel (1987), "Studies for an Adaptive Recognition Lexicon," Proceedings of the DARPA Speech Recognition Workshop, Report No. SAIC-87/1644, March 1987, San Diego.


    THE DARPA SPEECH RECOGNITION RESEARCH DATABASE:
    SPECIFICATIONS AND STATUS

    William M. Fisher
    George R. Doddington
    Kathleen M. Goudie-Marshall

    Texas Instruments Inc.
    Computer Science Center
    P.O. Box 226015, MS 238
    Dallas, Texas 75266, USA
    Tel. (214) 996-0394

    ABSTRACT

    This paper describes general specifications and current status of the speech databases that Texas Instruments (TI) is collecting to support the DARPA speech recognition research effort. Emphasis is placed on the portion of the database development work that TI is specifically responsible for. We give specifications in general, our recording procedures, theoretical and practical aspects of sentence selection, selected characteristics of selected sentences, and our progress in recording.

    1. INTRODUCTION

    This paper is a report on the specification and current status of the work done by Texas Instruments, Inc. (TI) on DARPA-funded Acoustic Phonetic Database development as of the early part of February, 1986. It is meant to be complementary to similar reports from other groups included in this volume.

    2. GENERAL SPECIFICATIONS

    Originally three data bases were planned: "stress," "acoustic-phonetic," and "task-specific." The stress data base was to investigate variations of speech with stress and would be done primarily by AFAMRL. The acoustic-phonetic data base, to be done by TI in collaboration with MIT and SRI, was intended to uncover general acoustic-phonetic facts about all major dialects of continental U.S. English. And the task-specific data base, providing data for the study of the effect on speech recognition of limiting domain of discourse, would be defined later. At our last meeting, there was a consensus that the task-specific data base should be folded into the acoustic-phonetic data base, becoming one of the later phases.

    The acoustic-phonetic data base is phased so that a small amount of speech is initially recorded from a large number of subjects, followed by successively larger durations of speech from fewer subjects, culminating in two hours recorded from each of two subjects. MIT and SRI have helped us design the material to be read by subjects. Figure 1 below shows the current general specifications for this data base.

    3. RECORDING PROCEDURES

    3.1 STEROIDS

    This large scale database collection would be difficult or impossible to collect without the VAX Fortran automated speech data collection system developed here at TI, called the STEReO automatic Interactive Data collection System, or STEROIDS. Use of STEROIDS requires a stereo DSC 200 sound system directly connected to 2 DSC 240 audio control boxes, one for each of the 2 channels of stereo input.


    Recording conditions:
      o Low noise (acceptable to NBS)
      o 2 channel recording: 1 noise-cancelling (Sennheiser) mike,
        1 far-field pressure (Bruel and Kjaer) mike.
      o Subjects exposed to 76 dB SPL noise through earphones

    Style:
      o Read from prompts

    Material:
      Phase   Speech/Subject   Subjects
      1       30 sec.          630
      2       2 min.           160
      3       5 min.           40
      4       30 min.          10
      5       2 hr.            2

      Contents, etc.: Broad Phonetic Coverage / Standard Paragraph / Explicit Variations / Interview Format

    Figure 1. General Specifications of Acoustic-Phonetic Database.

    3.2 GENERAL PROCEDURE

    We created and ran a program which read sentences and sentence assignments and made 630 VML files. Our recording procedure then takes five steps:

    1. At the beginning of each day, calibration tones are recorded from both channels;
    2. For each subject, one of the 630 VML files is copied to his named sub-directory and STEROIDS is used to collect his data;
    3. At the end of the day, a REDUCE procedure is run on all data collected that day, which produces the files that we send out, by splitting the initial stereo file into two mono files, de-biasing each, high-pass filtering the BK file at 70 Hz, and down-sampling each to 16,000 samples per second;
    4. A backup procedure is then run, which makes three tape copies of the VML file, the calibration tone files and all the speech files recorded on that day; and
    5. The disk is cleaned up for re-use by deleting the files that were put onto tape.

    One copy of the back-up tape is then sent to NBS.

    Data on each subject recorded in each session is added to an ASCII text file for documentation.

    3.3 NOISE

    After the sound booth was moved to the third floor of the North Building, a very large noise signal was observed coming from the combination BK power supply and preamplifier. At first, this noise was thought to be the result of a defect in the amplifier, but the BK service center could find no problem. It was then that we realized that the noise was actually an acoustical signal being picked up by the microphone. Figure 2 shows the spectrum of the noise signal below 500 Hz for a 5 second segment of "silence". The spectrum is flat from 300 Hz up to 10 kHz. (The spectrum of the signal from the Sennheiser noise-cancelling microphone is flat from DC to 10 kHz, which indicates that the noise-cancellation property and the low-frequency roll-off of the Sennheiser is adequate to render the acoustic rumble of no consequence for this microphone.)

    Figure 2. Amplitude Spectrum of Acoustic Rumble. Recorded in the TI sound booth over a 5 second period of "silence."


    With consultation from an acoustical engineering consultant, it was judged that the acoustical noise in our double-walled sound booth is being introduced by mechanical vibrations transmitted through the floor. Opinion varies as to the amount of reduction that may be achieved by better isolation from the floor, from less than 3 dB to more than 20 dB. Current plans are to install an air suspension vibration isolation mount system under the sound booth to reduce the rumble as much as possible.

    As an interim solution, a 1581-point FIR filter has been designed to provide a high-pass filter function, with a cut-off at 70 Hz and an in-band ripple of less than 0.1 dB above 100 Hz. Using this filter, reasonably acceptable S/N ratios have been achieved during data collection.

    The following S/N ratios have been measured, using seventeen subjects' (nine men and eight women) utterances of sentence SA1.

      Condition    ENrms    SNavg    SNpk
      No HP        421      8 dB     16 dB
      70 Hz HP     95       21 dB    29 dB
      200 Hz HP    4        48 dB    56 dB

    Table 1. Raw S/N Ratios.

    Explanatory notes for Table 1: ENrms is the RMS energy of the noise; SNavg is the average S/N ratio, signal energy being computed as the average RMS signal value over the entire utterance; SNpk is the peak S/N ratio, signal energy being computed as the peak RMS signal value in a 30 msec. Hamming-weighted window slid across the utterance.
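
    The two S/N measures defined in these notes can be sketched as below. The text does not say how the windowed RMS was normalized or how far the window advanced at each step; dividing by the window's energy and sliding one sample at a time are assumptions of this sketch, as are all function names.

```python
import math

def rms(samples):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def hamming(n):
    """Standard Hamming window coefficients of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def peak_windowed_rms(signal, window_len):
    """Largest Hamming-weighted RMS over all window positions.

    Normalizing by the window energy is an assumption, chosen so that a
    constant-amplitude signal yields its own amplitude.
    """
    w = hamming(window_len)
    norm = sum(c * c for c in w)
    best = 0.0
    for start in range(len(signal) - window_len + 1):
        energy = sum((c * signal[start + k]) ** 2 for k, c in enumerate(w))
        best = max(best, math.sqrt(energy / norm))
    return best

def snr_db(signal_rms, noise_rms):
    """S/N ratio in dB from two RMS amplitudes."""
    return 20 * math.log10(signal_rms / noise_rms)

# Toy data: a loud burst between two quiet stretches, plus unit-level noise.
signal = [1.0] * 20 + [10.0] * 20 + [1.0] * 20
noise = [1.0, -1.0] * 30
sn_avg = snr_db(rms(signal), rms(noise))
sn_pk = snr_db(peak_windowed_rms(signal, 8), rms(noise))
print(round(sn_avg, 1), round(sn_pk, 1))
```

    At the 16 kHz sampling rate mentioned in the text, a 30 msec window would be 480 samples long; the 8-sample window here only keeps the toy example fast.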

    During this tabulation it was noticed that the RMS energy for men's utterances averaged 4 dB greater than that for women. There are variations of signal level with speaker and with utterance, of course, and the weakest of the seventeen utterances used for this tabulation showed an average S/N ratio of minus 1 dB for the original signal.

    The effective S/N ratios for speech processing and listening or perceptual purposes will be somewhat higher for the no high-pass and 70 Hz high-pass results listed above, because typically preemphasis is performed on the speech signal before further processing, and because the human ear is progressively less sensitive to sound frequencies below 200 Hz. For a preemphasis constant of 1.0 (at a sampling frequency of 16 kHz), the S/N ratios were measured as follows:

      Condition    ENrms    SNavg    SNpk
      No HP        6        36 dB    46 dB
      70 Hz HP     4        39 dB    50 dB
      200 Hz HP    3        41 dB    52 dB

    Table 2. Pre-emphasized S/N Ratios. Symbols are same as in Table 1.

    4. ACOUSTIC-PHONETIC DATA BASE PHASE 1

    4.1 GENERAL

    The sentences constituting the phase 1 material will have a mean value of expected reading time of 3 seconds, so that each of the 630 subjects reading ten sentences will give us the specified 30 seconds per subject of speech data. Altogether 630 x 10 = 6300 sentence tokens will be collected. The sentence types are divided into three sorts:

    1. Two "dialect" or "calibration" sentences;
    2. 450 "MIT" sentences; and
    3. 1890 "TI" sentences.

    Each subject reads both the dialect sentences, a selection of five of the MIT sentences, and a selection of three TI sentences. Each MIT sentence will be read by seven speakers and each TI sentence by one. This variation in the number of subjects reading different sentences is a compromise between the desiderata of breadth and depth of phonetic coverage across subjects.

    The dialect sentences were devised by SRI and the MIT sentences by MIT, who will report separately on their design.

    4.2 THE TI NATURAL PHONETIC SENTENCES

    Our strategy in selecting our 1890 sentences was almost identical to one we reported on earlier [1]: use a computer procedure to select from a large or infinite set of sentences a subset that meets certain feasibility criteria, trying to optimize an objective function of the selected sentences. The ideal set of sentences to draw from in this case is the set of normal, acceptable American English sentences. Lacking an off-the-shelf grammar of sufficient generality, we approximate this set with the largest set of American English sentences in computer readable form that we know of, the "Brown Corpus [2]." Responding to concerns of some in the DARPA Database SIG that these sentences were "written" English instead of "spoken" English, we augmented our final pool of sentences from this corpus with 136 sentences of playwrights' dialog from the corpus published by Hultzen et al. [3]. (We are not concerned that our sentences are too "written": the alternative, "naturally spoken" sentences, are replete with run-on sentences, self-corrections, and ungrammaticality.)


    There may be some slight discrepancy between the original written form of these Hultzen sentences and the form in which we use them, since we reconstructed their spellings from the phonemic transcriptions published in the Hultzen book using TI off-the-shelf speech-to-text technology.

    A series of programs was executed that produced a file of pointers to the beginnings of sentences in the Brown corpus, then filtered out sentences from this set until about 10,000 were left in the selection pool. Sentences were eliminated if they were over 80 characters long, included any proscribed words, or included characters other than letters and punctuation. This pool was augmented with 136 Hultzen sentences.

    The fixed set of sentences -- the two dialect sentences and the revised set of 450 sentences that TI received from MIT in the middle of November -- were transcribed phonemically by TI's best off-the-shelf text-to-phoneme program and, after careful checking by two experts in phonetics and phonology, files of allophonic transcription of them were computed as described below. The selection program assumed this set of utterances as a base to build on in the selection of the 1890 TI sentences.

    The selection pool of 10,000 sentences was prepared in a similar way, except that it was not feasible to hand-check the transcriptions.

    The selection program accesses these allophonic transcription files in addition to a file of pointers to sentences that have previously been selected and one of pointers to sentences that have been manually zapped (ruled out). It produces a new version of the sentence selection file. Both the sentence selection file and the zapped sentence file are in ASCII text file format so that they can be manipulated with a text editor. One of the program's typed-in parameters tells it how many sentences to select. The program was run in a series of batch jobs, each typically selecting an additional 100 or so sentences. The additional sentences selected in each batch run were examined, and unacceptable ones were stricken from the selected sentence file and added to the zapped sentence file before the next run. The internal procedure used by the program is this:

    1. Build the initial version of the data structure holding phonetic data on the selected sentences by reading in the dialect sentences weighted by 630, the MIT sentences weighted by 7, and the previously selected TI sentences weighted by 1;

    2. Repeat this until this run's quota of sentences has been selected:

       a. Scan through a list of prospective sentences from the pool of unselected and unzapped sentences, calculating for each the increase in the phonetic objective function under the hypothesis that the sentence is added to the selected set, remembering the one producing the highest value;

       b. Add the remembered sentence to the selected sentence list;

    3. Write out the new version of sentence selections.
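
    The procedure above is a greedy selection loop, which can be sketched as follows. This is not TI's program: the data layout, the function names, and in particular the stand-in objective (a count of distinct phonetic units covered) are assumptions for illustration. Any function of the unit counts can be passed in as the objective.

```python
from collections import Counter

def coverage(counts):
    """Stand-in objective (an assumption): number of distinct units seen."""
    return sum(1 for n in counts.values() if n > 0)

def greedy_select(base, pool, zapped, quota, objective=coverage):
    """Greedy sentence selection.

    base:   list of (units, weight) for the fixed and previously selected sentences
    pool:   dict mapping sentence id -> list of phonetic units
    zapped: set of sentence ids that were manually ruled out
    quota:  number of sentences to select in this run
    """
    # Step 1: build the weighted unit counts for the base set.
    counts = Counter()
    for units, weight in base:
        for u in units:
            counts[u] += weight
    selected = []
    # Step 2: repeatedly add the sentence giving the largest objective increase.
    for _ in range(quota):
        current = objective(counts)
        best_id, best_gain = None, None
        for sid, units in pool.items():
            if sid in zapped or sid in selected:
                continue
            gain = objective(counts + Counter(units)) - current
            if best_gain is None or gain > best_gain:
                best_id, best_gain = sid, gain
        if best_id is None:
            break  # pool exhausted
        counts.update(pool[best_id])
        selected.append(best_id)
    # Step 3: the caller would write `selected` back out.
    return selected

base = [(["a", "b"], 630)]
pool = {"s1": ["a", "b"], "s2": ["a", "c", "d"], "s3": ["c"], "s4": ["e", "f", "g", "h"]}
print(greedy_select(base, pool, zapped={"s4"}, quota=2))
```

    In this toy run the zapped sentence "s4" is never considered even though it would add the most new units, mirroring the manual veto described in the text.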

    The program knows two basic ways of making a list of sentences from the pool for examination: 1. take N (typically 400) random grabs; and 2. look at them all. This option is selectable by the user, and both were used in actual runs selecting sentences. The first is faster and less optimal than the second.

    4.3 CONTROL OF AVERAGE UTTERANCE DURATION

    In order to control the average duration of utterances, a heuristic was used: the expected speech duration of each sentence was calculated using the formula

    SPDUR = -0.0928 + 0.06302 * NLETTS
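
    As a small worked sketch, the formula can be evaluated directly from a sentence's spelling. Counting only alphabetic characters as letters is an assumption of this sketch, and the function name is hypothetical.

```python
def expected_duration(sentence):
    """Expected speech duration in seconds from the letter count (SPDUR formula)."""
    n_letters = sum(1 for ch in sentence if ch.isalpha())  # assumed definition of NLETTS
    return -0.0928 + 0.06302 * n_letters

example = "She had your dark suit in greasy wash water all year."
print(round(expected_duration(example), 3))
```

    Solving the formula for a 3 second duration gives roughly 49 letters, which is consistent with the 80-character cap placed on pool sentences.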

    Here NLETTS is the number of letters in the spelling of the sentence and SPDUR is the speech duration of the sentence in units of seconds. This formula was derived by the least-squared-error fit of a linear function to speech duration data obtained from a previously collected data base of continuous speech: 750 sentences from each of eight subjects, half male and half female. The mean value of speech utterance duration of the current selected sentence set was kept track of, and if it was lower than the target duration (three seconds) minus a tolerance, the next sentence selection was taken from a list of longer-than-average pool sentences; if the mean speech duration was greater than the target plus a tolerance, the next selection was from the subset of short pool sentences; and if within the tolerances, any of the 10,000 pool sentences could be selected. The tolerance used in the final selection was [value illegible in source].

    4.4 OBJECTIVE FUNCTION

    The function that is used to measure the aggregate phonetic coverage of the set of selected utterances, called "allophone information", is:

    Iala = SUM(Ni*LOG2(Ni/Ntot))


    where Ni is the frequency of phonetic unit i and Ntot is the total number of phonetic units in the utterance set. A user-specified switch determines whether the function is used in its absolute form as given above, or normalized by dividing by the number of letters in the sentences. Most of the later runs were made using the relative form of the function.

    Following most authorities on phonetics, we take the relevant set of phonetic units to be phones, allophones or variants of phonemes of American English [4,5], roughly equivalent to Pike's "speech sounds" [6, pp. 42]. The problem of calculating or defining the complete set of allophones is equivalent to defining the set of possible phonological rules. The first-order approximation to this that was used is: an allophone is a variant of a phoneme that is distinguished by the phone on its immediate left, the phone on its immediate right, and, if it is syllabic, by a binary mark of stressed or non-stressed; part of the allophonic representation, also, is whether there are word boundaries on its immediate right or left before the adjacent segments. For the purposes of this specification, left and right environmental phones are the segmental phonemes with vowels marked as stressed or nonstressed and the complex phonemes /ch/, /jh/ written as [t sh] and [d zh]. (This is a correction and generalization of a proposal for psycholinguistic units of speech recognition made by Wickelgren some years ago [7, chap. 6,7].)

    It is important to use phones instead of phonemes as possible phonetic conditioning environments for several reasons.

    Complex phones condition phonetically according to their separate parts. If you think, as we do, that the vowels of "chew" and "shoe" are phonetically identical, then always counting phones as different if they have different adjacent phonemes misses the mark: the two vowels have different phonemes on their left -- /ch/ vs. /sh/ -- but the identical phone, [sh]. And /oy/ and /aw/ probably cause rounding assimilation on different ends, /oy/ at the beginning and /aw/ at the end, although there is no principled way to distinguish them with the phonological feature of rounding if they are regarded as holistic segments.

    In general, conditioning phones should also be marked redundantly for features that can assimilate over an intervening segment. Only if the /t/ of "stew" is marked for lip rounding will the /s/ be in an environment that will cause it to become rounded, but lip rounding is not phonemic in English consonants. If you think, as we do, that the /s/'s of "stew" and "sty" are phonetically different, then the relevant conditioning environment cannot be just the immediately following phoneme.

    Of course, supra-segmental features affect phonetics also. As a first approximation to this, we mark vowels as being stressed or non-stressed and include word and utterance boundaries in conditioning environments. Something like this must be done if you think, as we do, that the /t/'s of "deter" and "veto" are phonetically different, and that the /ay/'s of "Nye trait" and "night rate" are also different.

    Because of exigencies of time and resources available, the allophonic codes actually used were 4-byte integers consisting of these bit patterns:

    EACH ALLOPHONE CODE:
      o 6 bits for segmental phone code
      o 6 bits for segmental phone on left
      o 6 bits for segmental phone on right
      o 1 bit for word boundary on left
      o 1 bit for word boundary on right

    where segmental phone codes are classical phonemes except:
      o Vowels marked stressed/unstressed
      o Complex phonemes