A new Golden Age of phonetics?
Mark Liberman, University of Pennsylvania
[email protected]


A bit of technical history

Recording devices:
- "visible speech"
- wax cylinders
- wire recorders
- tape recorders
- digital audio

Analysis devices:
- mechanical (oscillograph, ...)
- electro-mechanical (spectrograph, ...)
- digital (...)

A quick course in acoustic phonetics
- Letters, phonemes, segments
- Duration
- Pitch
- Vowel quality
- Source-filter model
- Formant model

The promise

We see that the computer has opened up to linguists a host of challenges, partial insights, and potentialities. We believe these can be aptly compared with the challenges, problems, and insights of particle physics. Certainly, language is second to no phenomenon in importance. And the tools of computational linguistics are considerably less costly than the multibillion-volt accelerators of particle physics. The new linguistics presents an attractive as well as an extremely important challenge.

There is every reason to believe that facing up to this challenge will ultimately lead to important contributions in many fields.

Language and Machines: Computers in Translation and Linguistics, report by the Automatic Language Processing Advisory Committee (ALPAC), National Academy of Sciences, 1966

The paradox

ALPAC (and Pierce 1969):
- computers → new language science
- language science → language engineering

What actually happened:
- computers → new language engineering
- engineering → new language science ???

Focusing on speech science

Plenty of computer use:
- minicomputers in the 1960s
- micro- and super-computers in the 1980s
- ubiquitous laptops today

Applications:
- replaced tape splicing
- replaced the sound spectrograph
- easier pitch tracking, formant tracking
- more convenient statistical analysis
- and so on

BUT ...

No phonetic quantum mechanics

Great speech science by smart people, but surprisingly little change:
- in the style and scale of research, 1966-2009
- in the scientific questions about speech
- in the rate of progress compared to 1946-1966 (the first golden age of phonetics)
at least on the acoustic analysis side.

The Peterson & Barney (1951) data is still relevant, and many contemporary publications are similar in style and scale.

Challenges, partial insights, and potentialities

In speech science:
- there are many unmet challenges,
- not enough new insights,
- and the potentialities are still mostly potential.

Something was missing in 1966:
- adequate, accessible digital speech data
- tools for large-scale automated analysis
- applicable research paradigms

We now have two out of three.

Challenges

Variation and invariants:
- individual
- contextual
- social
- communicative

The problem of correlated variables: factorial design vs. MLM; laboratory vs. natural data.

Descriptive dimensions? e.g. F0/amplitude/time/etc. for prosody

Partial insights [in fact we know a lot about speech]

Potentialities
- 4-6 orders of magnitude more speech: found data as well as gifts from DARPA
- Purely bottom-up analysis: F0, voice/noise/silence, speaking rate; new descriptive dimensions
- Analysis based on forced alignment: good results from OK transcripts; qualitative (pronunciation modeling) and quantitative (old and new dimensions)
- Better statistical methods

Important contributions in many fields
- Social sciences: sociolinguistics and dialect geography
- New approaches to survey data
- Phonetics of rhetoric
- Speech pathology/therapy
- Language teaching/learning

Two kinds of science

1. Explore, observe, explain
   Yogi Berra: "Sometimes you can observe a lot just by watching."
   Botanizing / exploratory data analysis

2. Hypothesize and test

Data from CallHome M/F conversations; about 1M F0 values per category.

11,700 conversational sides; overall mean = 173, sd = 27. Male mean 174.3, female 172.6: difference 1.7, effect size d = 0.06.
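The tiny effect size above can be reproduced from the reported summary statistics. A minimal sketch of Cohen's d, taking the sd of 27 across conversational sides as the pooled standard deviation:

```python
# Cohen's d for the male/female difference in per-side F0 statistics,
# using the summary numbers reported on the slide.
male_mean = 174.3
female_mean = 172.6
pooled_sd = 27.0  # sd across 11,700 conversational sides

d = (male_mean - female_mean) / pooled_sd
print(round(d, 2))  # → 0.06
```

An effect size this small means the male/female distributions of this statistic almost completely overlap, which is the point of the slide.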

Data from Switchboard; phrases defined by silent pauses (Yuan, Liberman & Cieri, ICSLP 2006)
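Defining phrases by silent pauses, as in Yuan, Liberman & Cieri (2006), can be sketched as follows. This is an illustrative implementation, not the paper's code: the input format (word, start, end) and the 100 ms pause threshold are assumptions.

```python
def pause_phrases(words, min_pause=0.10):
    """Group time-aligned words into phrases, starting a new phrase
    whenever the silent gap between words exceeds min_pause seconds."""
    phrases = []
    current = []
    prev_end = None
    for label, start, end in words:
        if prev_end is not None and start - prev_end > min_pause:
            phrases.append(current)
            current = []
        current.append(label)
        prev_end = end
    if current:
        phrases.append(current)
    return phrases

# Illustrative aligned words: (label, start_time, end_time) in seconds.
words = [("we", 0.00, 0.20), ("see", 0.22, 0.50),
         ("that", 0.95, 1.10), ("yes", 1.12, 1.40)]
print(pause_phrases(words))  # → [['we', 'see'], ['that', 'yes']]
```

The word start/end times come from forced alignment, so this segmentation needs no manual annotation beyond the transcript.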

Evanini, Isard & Liberman, Automatic formant extraction for sociolinguistic analysis of large corpora, Interspeech 2009

Yuan & Liberman, Interspeech 2009

Orthographically-transcribed natural speech is available in very large quantities. With pronunciation modeling and forced alignment, we can use this data for phonetics research. Automatic acoustic measures based on simple statistical models can sometimes be helpful.

Here we examine the distribution of /l/-darkness in ~26 (out of ~9,000) hours of U.S. Supreme Court oral arguments: ~22,000 tokens of /l/.

Introduction

English /l/ is traditionally classified into at least two allophones: dark /l/, which appears in syllable rimes, and clear /l/, which appears in syllable onsets.

Sproat and Fujimura (1993): the clear and dark allophones are not categorically distinct; a single phonological entity /l/ involves two gestures, a vocalic dorsal gesture and a consonantal apical gesture.

The two gestures are inherently asynchronous: the vocalic gesture is attracted to the nucleus of the syllable, and the consonantal gesture is attracted to the margin ("gestural affinity").

In a syllable-final /l/, the tongue-dorsum gesture shifts left toward the syllable nucleus, so that the vocalic gesture precedes the consonantal, tongue-apex gesture. In a syllable-initial /l/, the apical gesture precedes the dorsal gesture.

Clear /l/ has a relatively high F2 and a low F1; dark /l/ has a lower F2 and a higher F1; intervocalic /l/s are intermediate between the clear and dark variants (Lehiste 1964).

An important piece of evidence for the gestural affinity proposal: Sproat and Fujimura (1993) found that the backness of a pre-boundary intervocalic /l/ (in /i/-/l/ sequences such as beel) is correlated with the duration of the pre-boundary rime: the /l/ in longer rimes is darker.

S&F (1993) devised a set of boundaries with a variety of strengths, to elicit different rime durations in laboratory speech:
- Major intonation boundary (|): Beel, equate the actors.
- VP phrase boundary (V): Beel equates the actors.
- Compound-internal boundary (C): The beel-equator's amazing.
- # boundary: The beel-ing men are actors.
- No boundary (%): Mr. Beelik wants actors.

Figure 1 in Sproat and Fujimura (1993): the relation between F2-F1 (in Hz) and pre-boundary rime duration (in s) for (a) speaker CS and (b) speaker RS.

Figure 4 in Sproat and Fujimura (1993): a schematic illustration of the effects of rime duration on pre-boundary, post-nuclear /l/.

Huffman (1997) showed that onset [l]s also vary in backness: the dorsum gesture for intervocalic onset [l]s (e.g., in below) may be shifted leftward in time relative to the apical gesture, which makes a dark(er) /l/.

The data used in these studies (both S&F and Huffman) comprised only a few hundred tokens of /l/ in laboratory speech.

The relation of duration and backness can be complicated by differences in the coarticulatory effects of neighboring vowels, or by speaker-specific constraints on the absolute degree of backness (Huffman 1997).
=> Our study uses a very large speech corpus, in which these complications average out.

Automatic formant tracking is error-prone, and measuring formants by hand is time-consuming.
=> We develop a new method to quantify /l/ backness without formant tracking.

Our Data

The SCOTUS corpus includes more than 50 years of oral arguments from the Supreme Court of the United States, nearly 9,000 hours in total. For this study, we used only the Justices' speech (25.5 hours) from the 2001-term arguments, along with the orthographic transcripts.

The phone boundaries were automatically aligned with the Penn Phonetics Lab (P2FA) forced aligner, trained on the same data using the HTK toolkit and the CMU pronouncing dictionary.
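The pronunciation lookup step can be sketched from the CMU pronouncing dictionary's plain-text format (word followed by an ARPAbet phone sequence, with stress digits on vowels). The two-entry sample below stands in for the full dictionary file:

```python
# Two entries in CMUdict's plain-text format: WORD  PH1 PH2 ...
cmudict_sample = """\
LIKE  L AY1 K
FULL  F UH1 L
"""

def load_cmudict(text):
    """Parse CMUdict-format lines into a word -> phone-list mapping."""
    pron = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, *phones = line.split()
        pron[word] = phones
    return pron

pron = load_cmudict(cmudict_sample)
print(pron["FULL"])  # → ['F', 'UH1', 'L']
```

The aligner uses these phone sequences to build the HMM network that is matched against the audio.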

The dataset contains 21,706 tokens of /l/, including

3,410 word-initial [l]s,

7,565 word-final [l]s, and

10,731 word-medial [l]s.

The Penn Phonetics Lab Forced Aligner

The aligner's acoustic models are GMM-based monophone HMMs on 39 PLP coefficients. The monophones include:
- speech segments: /t/, /l/, /aa1/, /ih0/, ... (ARPAbet)
- non-speech segments: {sil} silence; {LG} laugh; {NS} noise; {BR} breath; {CG} cough; {LS} lip smack
- {sp}, a "tee" model with a direct transition from the entry node to the exit node of the HMM (so sp can have zero length), used for handling possible inter-word silence.

The mean absolute difference between manual and automatically-aligned phone boundaries in TIMIT is about 12 milliseconds.

http://www.ling.upenn.edu/phonetics/p2fa/

Forced Alignment Architecture

Word and phone boundaries located

Method

To measure the darkness of /l/ through forced alignment, we first split /l/ into two phones, L1 for the clear /l/ and L2 for the dark /l/, and retrained the acoustic models for the new phone set.

In training, word-initial [l]s (e.g., like, please) were categorized as L1 (clear); word-final [l]s (e.g., full, felt) were categorized as L2 (dark). All other [l]s were ambiguous: they could be either L1 or L2.
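The initial labeling rule can be sketched as a function over a word's phone sequence. This is a simplification: it labels only the literally first and last phones, whereas the slide's felt example suggests the actual rule also covered /l/ in word-final clusters.

```python
def label_l(phones, i):
    """Label the /l/ at index i of a word's ARPAbet phone sequence:
    L1 (clear) if word-initial, L2 (dark) if word-final,
    otherwise ambiguous (left for the aligner to resolve in training)."""
    assert phones[i] == "L"
    if i == 0:
        return "L1"
    if i == len(phones) - 1:
        return "L2"
    return "L1|L2"  # ambiguous: either model may win during training

print(label_l(["L", "AY1", "K"], 0))         # like  → L1
print(label_l(["F", "UH1", "L"], 2))         # full  → L2
print(label_l(["B", "IH0", "L", "OW1"], 2))  # below → L1|L2
```

Encoding the ambiguous cases as alternative pronunciations lets the standard forced-alignment machinery choose between L1 and L2 token by token.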

During each iteration of training, the actual pronunciations of the ambiguous [l]s were automatically determined, and then the acoustic models of L1 and L2 were updated.

The new acoustic models were tested on both the training data and a data subset that had been set aside for testing. During the tests, all [l]s were treated as ambiguous; the aligner determined whether a given [l] was L1 or L2.

An example of L1/L2 classification through forced alignment:

If we use word-initial vs. word-final position as the gold standard, the accuracy of /l/ classification by forced alignment is 93.8% on the training data and 92.8% on the test data.

Training data (rows: gold standard by word position; columns: classified by the aligner):

        L1     L2
L1    2987    235
L2     414   6757

Test data:

        L1     L2
L1     169     19
L2      23    371

These results suggest that acoustic fit to clear/dark allophones in forced alignment is a plausible way to estimate the darkness of /l/.
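The reported accuracies follow directly from the confusion matrices above:

```python
def accuracy(confusion):
    """Overall accuracy from a 2x2 confusion matrix
    [[L1->L1, L1->L2], [L2->L1, L2->L2]] (rows: gold, columns: predicted)."""
    correct = confusion[0][0] + confusion[1][1]
    total = sum(sum(row) for row in confusion)
    return correct / total

train = [[2987, 235], [414, 6757]]
test = [[169, 19], [23, 371]]
print(round(100 * accuracy(train), 1))  # → 93.8
print(round(100 * accuracy(test), 1))   # → 92.8
```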

To compute a metric for the degree of /l/-darkness, we therefore ran forced alignment twice: all [l]s were first aligned with the L1 model, and then with the L2 model. The difference in log likelihood scores between the L2 and L1 alignments (the D score) measures the darkness of [l]: the larger the D score, the darker the [l].
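Given the per-token alignment log likelihoods from the two passes, the D score is just their difference. The numbers below are made up for illustration; only the arithmetic and the sign convention are from the slides:

```python
def d_score(loglik_L2, loglik_L1):
    """D score: log likelihood of the dark-/l/ (L2) alignment minus
    that of the clear-/l/ (L1) alignment. Larger D = darker [l]."""
    return loglik_L2 - loglik_L1

# Illustrative log likelihoods for two hypothetical tokens:
print(round(d_score(-310.2, -318.7), 1))  # → 8.5 (positive: dark)
print(d_score(-295.4, -288.1))            # negative: clear
```

Because both passes use the same segmentation machinery, the D score needs no formant tracking, which was the motivation given earlier.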

The histograms of the D scores:

Results

To study the relation between rime duration and /l/-darkness, we use the [l]s that follow a primary-stress vowel (denoted "1").

Such [l]s can precede a word boundary (#), a consonant (C), or a non-stress vowel (0) within the word.

From the figure we can see that:

The [l]s in longer rimes have larger D scores, and hence are darker. This result is consistent with Sproat and Fujimura (1993).

The [l]s preceding a non-stress vowel (1_L_0) are less dark than the [l]s preceding a word boundary (1_L_#) or a consonant (1_L_C).

The relation between duration and darkness for 1_L_C is non-linear. The /l/ reaches maximum darkness when the stressed vowel plus /l/ is about 150-200 ms.

The syllable-final [l]s are always dark, even in very short rimes. This contradicts Sproat and Fujimura (1993)'s finding that the syllable-final /l/ in very short rimes is as clear as the canonical clear /l/.

To further examine the difference between clear and dark /l/, we compare the intervocalic syllable-final (or "ambisyllabic") [l]s (1_L_0) with the intervocalic syllable-initial [l]s (0_L_1).

The rime duration here means the duration of the previous vowel plus the duration of [l], regardless of putative syllabic affinity.

From the figures we can see that:

The intervocalic syllable-final [l]s have positive D scores whereas the intervocalic syllable-initial [l]s have negative D scores.

There is a positive correlation between darkness and rime duration (i.e., the duration of /l/ and its preceding vowel) for the intervocalic syllable-final [l]s, but no correlation for the intervocalic syllable-initial [l]s.

For the intervocalic syllable-final /l/, there is a positive correlation between /l/ duration and darkness. No correlation between /l/ duration and darkness was found, however, for the intervocalic syllable-initial /l/.

These results suggest that there is a clear difference between the intervocalic syllable-final and syllable-initial /l/s.

Conclusions

We found a strong correlation between rime duration and /l/-darkness for syllable-final /l/. This result is consistent with Sproat and Fujimura (1993). We found no correlation between /l/ duration and darkness for syllable-initial /l/. This result differs from Huffman (1997).

We found a clear difference in /l/ darkness between the 0_1 and 1_0 stress contexts, across all values of V+/l/ duration and of /l/ duration.

We found that the syllable-final /l/ preceding a non-stress vowel was less dark than preceding a consonant or a word boundary. Also, there was a non-linear relationship between timing and quality for the /l/ preceding a consonant and following a primary-stress vowel: these segments reach a peak of darkness when the duration of the stressed vowel plus /l/ is about 150-200 ms. Further research is needed to confirm and explain these results.

Meta-Conclusion

Large "found" collections of speech can be used effectively in phonetics research.

Better pronunciation modeling and better forced alignment will be helpful.

But the existing technology is good enough to start with.