Variation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in between
Tyler Schnoebelen
Stanford University

Upload: tyler-schnoebelen

Post on 14-Apr-2017



Goals
- The social meanings of tempo
  - Who uses it, how's it understood?
- Indexical field construction
  - New approaches to structuring indexical fields
- Between fast speech and slow speech: burstiness
  - A way to measure William Shatner (and everybody else)

Kendall (2009)

A great place to begin thinking about speech rate (or what I'll be calling tempo) is Tyler Kendall's work.

Region

Looking at 27,000+ tokens, Kendall finds significant effects for:
- Region
- Ethnicity
- Gender
- Age
- Ethnicity * Gender interaction
- Utterance length
On a by-speaker model:
- Region
- Gender
- Region * Utterance length
- Age
- Median pause duration

Ethnicity


Gender


Age


Utterance length


Median pause

On the by-speaker model, ethnicity drops out. The significant effects are region, gender, region * utterance length, age, and median pause duration.

But we aren't robots

Social meaning
- Speech rates are not stable by demographic category
- They vary all over the place
- Conveying and creating identities and attitudes

Accommodation by ethnicity

From Kendall's discussion of his dissertation work at Stanford in 2010.

Moreover, speech rate is connected to how people attribute emotions and personality characteristics to their interlocutors.

I've put together a lot of essays and reading notes about language and emotion here: http://www.stanford.edu/~tylers/emotions.shtml

Psychology, phonetics

Speech rate and pitch are among the most commonly studied cues to emotion in the psychology and phonetics literature. (But note that there is a tremendous amount of indeterminacy: fast speech rate is connected to anger, happiness, and fear, for example.)

Top table from Scherer 2003, based on Johnstone and Scherer 2000

Bottom table from Ververidis and Kotropoulos (2006: 1171).


Musicologists are also interested in how music has the effects that it does.

Fast tempo in music

Emotional states corresponding to fast tempo (Scherer and Oshinsky 1977: 340). Those in parentheses had other characteristics ranked higher than tempo. For example, "anger" is expressed mainly through many harmonics and then by fast tempo.

Scherer (1981: 206) adds confidence and indifference; Collins (1989: 45) would have us add excitement but probably wouldn't put up happiness. Fear is especially associated with a highly irregular tempo (Collins 1989: 45).

Psychology, phonetics, musicology

Juslin and Laukka (2003)
104 studies of speech, 41 studies of music

Juslin and Laukka (2003) review 104 studies of speech and 41 studies of music having to do with cues to emotion. This comes from pg 792.

Note that the vast majority of the studies of vocal emotion use ACTED speech. It is only in recent years that any naturalistic data has started to influence the emotionology literature.


Intraspeaker variation

One of the major points is that speakers themselves vary their speech tempo. Conveying emotion is one part of it, but speech tempo has interactional effects.

Style

This gets us into the realm of style, where we look at the meaning a variable has in interaction, not just according to broad demographic groups.

Indexical fields
Variables aren't fixed but are located in a constellation of ideologically related meanings (Eckert 2008)

One of the most useful concepts is that of the indexical field.

3 steps to an indexical field
1. Statistically significant correlations with fast speech rate in psychology and linguistics literature
2. Corpora ("talk fast" and "are fast talkers")
   - Corpus of Contemporary American English
   - Bing! search results
3. Survey
   - 50 participants across the US via Mechanical Turk

Here are the steps I took to come up with the indexical field of fast talk on the next slide (I did the same for slow talk, but don't talk about that here; see my 2009 paper). Step 1 involved reviewing several dozen studies. The odder items in the indexical field ("not very powerful") usually come from those.

Who talks fast?

The 40-odd items that you see in this figure demonstrate the indexical field for fast talking. These attributes come from how people in several corpora talk about fast speech and from research by psychologists and others who have found statistically significant relationships between, say, fast speech and listeners' perceptions of intelligence. It is augmented with attributes from a pilot survey I ran on 50 people, asking what sorts of people talk fast.

Tempo doesn't mean one thing; nor can we say that fast tempo means something particular. A fast tempo has multiple meanings. Depending upon what other variables it combines with, it can mean "happy", "New Yorker", "angry", "con artist", and a number of other things. Its meaning is indeterminate and requires other cues, some of which are linguistic and others not. This doesn't stop people from explicitly commenting on what fast talkers are like or what kinds of people talk fast.

Unsatisfying?

If these characteristics were presented as a list, we might be distracted by all the contradictions: how can fast speech be about both rage and joy? How does it signal an honest person and a con-man? Laying out the characteristics in a field, however, allows clusters to form. Some of these we sense intuitively: in America, there is a cultural association between "New Yorker" and "Jewish", for example, which can be traced both to historical settlement patterns and current demographics, but which also works out to be ideological judgments about shared outlook and behavioral patterns. The connections between New Yorkers and other elements are also relatively straightforward. As the stereotype goes, New Yorkers are in a hurry. Portrayals of New Yorkers often show them to be aggressive, an attribute that connects to "active" and "emphatic", and which can be read as a version of "angry", too.

The structure of indexical fields
- It should be possible to relate items within the field
- And this should allow us to understand constraints
- And how different meanings come to attach to different variables
- My assumptions:
  - Indexical fields expand and contract over time
  - New meanings rely upon what's already there
- The "no teleportation" hypothesis
  - In principle, sadness and fast talk could come to be related but the path is unlikely

The "no teleportation" hypothesis is that a new meaning for a variable relies on what's already there: there are constraints on what a variable can mean, and they are based on what's there. New meanings proceed from the existing.

Clustering
- Take the indexical field (41 items)
- Also add "fast-talkers" and "slow-talkers"
- Ask for pair-wise judgments of how much overlap there is between each pair
  - i.e., New Yorkers overlap with Northerners, but teachers don't overlap with con-men/hustlers
- 20 judgments for each pair (840 pairs)
- 245 Americans surveyed via Amazon Mechanical Turk
- Hierarchical clustering based on correlation patterns (but non-hierarchical methods give similar results)

Actually I got judgments for both "a overlap b" and "b overlap a", 10 of each, 1680 pairs, but in the end I collapsed these and took the averages. It'd be better NOT to do this, but I'm still trying to figure out a way to do clustering that is lopsided. Suggestions welcome.

Hierarchical clustering assumes that there is a true relationship underneath. This assumption is problematic and may ultimately lead us to prefer other clustering techniques. Therefore, I am not committed to the particular clustering technique but to the concept itself as a way of understanding structure. And ultimately understanding how meaning works in interaction and how variables change in meaning over time.
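The collapsing step described above can be sketched roughly as follows. This is an illustration in Python rather than the R used for the original analysis; the ratings, their scale, and the function name `collapse_directional` are all hypothetical. It averages the two directions of each pair and also keeps the gap between them, which is the lopsidedness that plain averaging throws away.

```python
from statistics import mean

# Hypothetical directional ratings, invented for illustration only.
# Key (a, b) holds judgments of "how much do a overlap with b?"
ratings = {
    ("New Yorkers", "Northerners"): [5, 4, 5],
    ("Northerners", "New Yorkers"): [3, 3, 2],
    ("teachers", "con-men/hustlers"): [1, 1, 2],
    ("con-men/hustlers", "teachers"): [1, 2, 1],
}

def collapse_directional(ratings):
    """Average both directions of each pair and record the directional gap.

    Assumes every pair was rated in both directions (KeyError otherwise).
    """
    collapsed = {}
    for a, b in ratings:
        key = tuple(sorted((a, b)))  # one entry per unordered pair
        if key in collapsed:
            continue
        forward = mean(ratings[(key[0], key[1])])
        backward = mean(ratings[(key[1], key[0])])
        collapsed[key] = {
            "overlap": (forward + backward) / 2,  # the collapsed, symmetric score
            "asymmetry": forward - backward,      # what the averaging discards
        }
    return collapsed

result = collapse_directional(ratings)
for pair, stats in result.items():
    print(pair, stats)
```

Keeping the asymmetry alongside the average does not solve the lopsided-clustering problem, but it at least shows which pairs lose the most information when collapsed.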

Sample of the data

                    active    angry; in a rage    anxious
active               2.11         -0.373          -0.156
angry; in a rage    -0.433         2.08           -0.0549
anxious             -0.298        -0.0875          2.08
auctioneers          0.500        -1.331          -0.898

This is a sample of the table before turning it into correlations. These are z-scores, and because different participants answered slightly different questions, active-active, angry-angry, and anxious-anxious don't have precisely the same score. But the "a overlaps with a" scores really are the highest, as we would expect/wish.

Basically, we are looking for clusters of relationships, so "active" and "auctioneers" are more alike than "anxious" and "auctioneers" in this mini-table; i.e., "active" and "auctioneers" have similar positive/negative correlation patterns with the columns: +, -, -, while "anxious" is -, -, +.

The true table is 43x43 (the 41 indexical field items plus fast talkers and slow talkers).
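The "similar correlation patterns" idea can be checked on the sample rows with a few lines of Python. This is only a sketch: the numbers are copied from the sample table above, but the helper names `pearson` and `correlation_distance` are introduced here for illustration, and using 1 - r as the distance is an assumption about how "clustering based on correlation patterns" might be set up, not a description of the original R code.

```python
from math import sqrt

# Rows from the sample table; columns are active, angry/in a rage, anxious.
rows = {
    "active":      [2.11, -0.373, -0.156],
    "anxious":     [-0.298, -0.0875, 2.08],
    "auctioneers": [0.500, -1.331, -0.898],
}

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [value - mean_x for value in x]
    dy = [value - mean_y for value in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / (sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy)))

def correlation_distance(x, y):
    """Assumed distance for clustering: 0 for identical patterns, 2 for opposite ones."""
    return 1.0 - pearson(x, y)

for name in ("active", "anxious"):
    r = pearson(rows[name], rows["auctioneers"])
    d = correlation_distance(rows[name], rows["auctioneers"])
    print(f"{name} vs auctioneers: r = {r:.3f}, distance = {d:.3f}")
```

On these three columns, "active" correlates strongly and positively with "auctioneers" while "anxious" correlates negatively, which matches the +, -, - versus -, -, + reading in the notes.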

Predictions
The main clusters that emerge will be connected to two inter-related notions:
- Ideologies of time
- Emotional arousal

This notion of activity (also known as arousal) is crucial for us to understand. I believe it is a core part of how indexical fields for tempo get constructed. Activity is, of course, measurable in terms of heart-rate, respiration, glucose uptake, epinephrine release, etc. When we look at the indexical fields for different tempos, we see that the active emotions (rage, joy) are associated with fast talk while the passive emotions (sadness, boredom) are associated with slow talk.

[Figure: clustering of the indexical field, with cluster labels Overwhelmed, Other-oriented, Overwhelming, Authoritative, Persuading]

This is created using R and a tool called Dendroscope that biologists use. I'm happy to give a brief tutorial on it; it isn't hard to do, especially if you know R already.

The way clustering works is that ANYTHING you place in it will find a home, even if it doesn't really belong. I could put in "people who keep whales for pets" and you'd see it show up somewhere. The important thing is where.

We expect most of these attributes to connect with fast speech tempo (since that's the indexical field we're building). People in this experiment had no idea what they were doing beyond saying how much overlap there was between two given terms. It would have been problematic if "slow talkers" had patterned inside any of the other clusters. It is meaningful (and reassuring) that they branch off.

My high-level labels are an attempt to describe what the clusters are about, how they seem to work in interactional evaluation. The upper left clusters seem to be very much "this fast talk is more about the speaker than me, the listener"; contrast this with the other clusters, which seem to have more of an interactional aspect foregrounded. (The overwhelming and overwhelmed clusters have significant interactional effects, too, but I am trying to tease apart how the speaker's style gets assessed, and I think the "how much is this about me" part is relevant. Nonetheless, this is all rather tentative, as you can imagine. I think the crucial work is to build clusters for other indexical fields and see if similar types of clusters emerge.)

What's this showing?
- Time and emotional arousal may well underlie the field
- But a different axis is much more apparent:
  - Speaker-orientation vs. listener-orientation
  - Fast speech is about time, but are you talking fast for me or for yourself?
- There's a parallel to IN vs. ING
  - Do I take your IN as a sign of friendliness or as evidence of laziness?

Tempo varies

As I worked on fast and slow speech tempo, it occurred to me that not only do people vary speed across utterances, but within them, too.

Who, I asked, who uses tempo at the opposite extreme, who?

William Shatner, of course, is known for his unusual phrasing.

A few examples. This one is among my favorites. Kirk beamed down to a planet where an old flame switched bodies with him (she was crazy from having been denied a starship captainship due to sexism; even in the future, sexism!). So Shatner is playing Janice Lester pretending to be Kirk. Mostly, this is just fast speech, but listen for some tempo adjustments toward the end.

We've got to get Spock to Vulcan!

The following clips are among the very most bursty, a term I'll define in a moment.

Tunnel of terror

And some others

Notice the wide range of emotions that use burstiness.

The second clip on this is actually Kevin Pollak doing an impersonation. His rates aren't so terribly different from Shatner/Kirk at his burstiest, BUT Pollak does it in the captain's log, which Shatner/Kirk doesn't (since it's not a particularly emotional genre for Kirk).

William Shatner impersonation

Here is what Shatner and his impersonators look like at their burstiest. "William Shatner talks in fast bursts and moves his head a lot it's true he does", http://www.stanford.edu/~tylers/notes/socioling/Sounds/Kids_impersonation_of_Shatner_converted.wav

Leonard Nimoy's Mr. Spock is even

Versus Leonard Nimoy's logical, no-emotions Mr. Spock: http://www.stanford.edu/~tylers/notes/socioling/Sounds/Spock_Our_involvement.wav

A cue from the Internet

Packets traveling through the Internet sometimes don't come at an even pace. They come with bursts and lags (which can mess up video/audio quality, for example). I take measurements from this domain and apply them to speech.

Burstiness
- Variance / (syllables * 0.5)
- Variance gets us dispersion of the data
- The denominator helps us see how spread out the data is
- The bigger the ratio, the more it is characterized by clusters (bursts)

Do we use intonational phrases, breath units, sentences? The results are largely the same, though I used sentences. This obscures the part of burstiness that Kirk/Shatner gets from deleting pauses between phrases.

We can draw a parallel to network traffic engineering, which attempts to measure the burstiness of packets traveling across the Internet (such burstiness affects the quality of some applications, like voice and video). Of the several measurements I have tried, the one that seems to work best is variance/mean. That is, I input how much time there is between syllables and calculate the variance (the dispersion of the data). Then I divide the variance by the mean number of syllables in the utterance. This provides a measure of how spread out the data is; the bigger the ratio, the more it is characterized by clusters, what we'd call "bursts."
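One reading of the burstiness measure can be sketched in Python as below. Several things here are assumptions rather than the method as actually run: the input is taken to be syllable onset times in seconds, the variance is the sample variance of the gaps between successive onsets, and the denominator follows the slide's "syllables * 0.5". The function names and the onset times are invented for illustration.

```python
from statistics import variance

def inter_syllable_intervals(onsets):
    """Time gaps (seconds) between successive syllable onsets."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]

def burstiness(onsets):
    """Sample variance of the inter-syllable gaps divided by (syllable count * 0.5).

    Assumed reading of the slide formula; other denominators are possible.
    """
    intervals = inter_syllable_intervals(onsets)
    if len(intervals) < 2:
        raise ValueError("need at least three syllables to compute a variance")
    return variance(intervals) / (len(onsets) * 0.5)

# Hypothetical onset times, not measurements from any actual clip.
even_onsets = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4]
bursty_onsets = [0.0, 0.08, 0.16, 0.24, 0.9, 0.98, 1.06, 1.4]

print("even:  ", burstiness(even_onsets))
print("bursty:", burstiness(bursty_onsets))
```

Both toy utterances have eight syllables over 1.4 seconds, so their overall speech rate is identical; only the bursty one, with its runs of quick syllables separated by long gaps, gets a score well above zero.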

Burstiness and emotionality
- 48 Americans judged the emotional intensity of 228 utterances
- Utterances taken from 8 episodes, focusing on:
  - Captain Kirk
  - Mr. Spock
  - Lt. Sulu
  - Dr. (Bones) McCoy
- Each utterance judged by 3-5 people
- Scores were normalized per judge and then averaged
- Top 30, bottom 30 and 63 randomly chosen in between were analyzed for speech rate and burstiness
- Restricted to utterances that were at least 5 syllables

Nivja de Jong and Ton Wempe's Syllable Nuclei script was used for a first pass, but basically every file had to be done by hand.
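The "normalized per judge and then averaged" step on the slide above might look something like the following Python sketch. It assumes the normalization is a z-score within each judge, which the slide does not spell out; the judges, utterance labels, and ratings are invented.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (judge, utterance, raw rating) triples, invented for illustration.
judgments = [
    ("judge_1", "utt_a", 2), ("judge_1", "utt_b", 4), ("judge_1", "utt_c", 6),
    ("judge_2", "utt_a", 5), ("judge_2", "utt_b", 6), ("judge_2", "utt_c", 7),
    ("judge_3", "utt_a", 1), ("judge_3", "utt_b", 3), ("judge_3", "utt_c", 5),
]

def normalize_per_judge(judgments):
    """Convert each rating to a z-score relative to that judge's own ratings.

    Assumes each judge gave at least two ratings that are not all identical.
    """
    by_judge = defaultdict(list)
    for judge, _, rating in judgments:
        by_judge[judge].append(rating)
    stats = {judge: (mean(r), stdev(r)) for judge, r in by_judge.items()}
    return [
        (judge, utterance, (rating - stats[judge][0]) / stats[judge][1])
        for judge, utterance, rating in judgments
    ]

def average_per_utterance(normalized):
    """Average the normalized scores each utterance received."""
    by_utterance = defaultdict(list)
    for _, utterance, score in normalized:
        by_utterance[utterance].append(score)
    return {utterance: mean(scores) for utterance, scores in by_utterance.items()}

scores = average_per_utterance(normalize_per_judge(judgments))
print(scores)
```

In the toy data the three judges use very different parts of the scale but agree on the ordering, so after per-judge normalization the utterances receive the same scores from everyone.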

Emotional speech in Star Trek is bursty speech

Burstiness and emotionality correlate

Better than speech rate
- Among factors tested:
  - Burstiness
  - Speech rate
  - Syllable count
  - Interactions among these
- Only burstiness is significant (in a simple linear regression model or an ordinary least squares model, p ≈ 0.0125)
- But note that the r-squared isn't all that great: 0.05044

If we added in pitch, intensity, etc., we'd have a better r-squared, presumably.

Speech rate, syllables, and interactions don't play a role (speech rate model p=0.224).

> data.lm <- lm(Emotionality ~ Burstiness, data = data)
> summary(data.lm)

Call:
lm(formula = Emotionality ~ Burstiness, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.0913 -0.9090 -0.1158  1.0076  2.1043

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.13846    0.09392   1.474   0.1430
Burstiness   8.36138    3.29794   2.535   0.0125 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.007 on 121 degrees of freedom
  (102 observations deleted due to missingness)
Multiple R-squared: 0.05044, Adjusted R-squared: 0.0426
F-statistic: 6.428 on 1 and 121 DF, p-value: 0.01251
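The figures in this summary hang together arithmetically, which can be verified with a short Python calculation. It relies only on standard identities for a regression with a single predictor (t = estimate / standard error, F = t squared, R-squared = F / (F + residual df)); the input numbers are copied from the output above, and the variable names are introduced here for illustration.

```python
# Values copied from the summary(data.lm) output above.
estimate = 8.36138     # Burstiness coefficient
std_error = 3.29794    # its standard error
residual_df = 121      # residual degrees of freedom

# With one predictor, the slope's t value squared equals the model F-statistic.
t_value = estimate / std_error
f_statistic = t_value ** 2

# R-squared recovered from F and the residual degrees of freedom.
r_squared = f_statistic / (f_statistic + residual_df)

# Adjusted R-squared penalizes for the one predictor; n is the observations used.
n_observations = residual_df + 2
adjusted_r_squared = 1 - (1 - r_squared) * (n_observations - 1) / residual_df

print(f"t = {t_value:.3f}")
print(f"F = {f_statistic:.3f}")
print(f"R-squared = {r_squared:.5f}")
print(f"Adjusted R-squared = {adjusted_r_squared:.4f}")
print(f"observations used = {n_observations}")
```

The recovered 123 observations also matches the 30 + 30 + 63 utterances selected for analysis and the "Number of obs: 123" line in the mixed model.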

A better approach is to use a mixed model, where speaker is a random effect. This allows us to see that Kirk and Bones use burstiness, while Sulu and Spock don't.

Kirk    0.4371045
Bones   0.1710811
Sulu   -0.1518260
Spock  -0.4563595

Mixed model

> data.lmer=lmer(Emotionality ~ Burstiness + (1|Speaker), data)
> ranef(data.lmer)
$Speaker
     (Intercept)
Bone   0.1710811
Kirk   0.4371045
Spoc  -0.4563595
Sulu  -0.1518260

> print(data.lmer, corr=FALSE)
Linear mixed model fit by REML
Formula: Emotionality ~ Burstiness + (1 | Speaker)
   Data: data
   AIC   BIC logLik deviance REMLdev
 341.4 352.7 -166.7    336.7   333.4
Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 0.21810  0.46701
 Residual             0.87055  0.93303
Number of obs: 123, groups: Speaker, 4

Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.07646    0.27625 -0.2768
Burstiness   7.07226    3.07077  2.3031

Spock in reversal

Bursty, but not emotional
Emotional, but not bursty

Spock isn't very bursty and isn't *supposed* to be emotional, but the clips here are the most bursty and most emotional, respectively.

Note that there are episodes where Spock loses his logic ("Amok Time"), but I have set those to the side. It would be good to add them in to the analysis.

Emotionality by Burstiness and Speaker

> pvals.fnc(data.lmer)$fixed
            Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)  -0.0765  -0.0845    -0.8558     0.6649 0.8174   0.7824
Burstiness    7.0723   7.1217     1.0825    13.1220 0.0194   0.0230

We estimate the p-value for a mixed-effects model (in green) using MCMC.

MCMC (Markov chain Monte Carlo) sampling works this way:
- Each sample contains one number for each parameter in the model.
- With lots of samples, we get a posterior distribution of the parameters.
- We can estimate the p-values and confidence intervals
- This is 10,000 samples
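To make the last two steps concrete, here is a rough Python sketch of how a p-value and a 95% interval can be read off a set of posterior samples. Everything in it is an assumption for illustration: the samples are simulated from a normal distribution loosely centered on the Burstiness estimate rather than drawn from the fitted model, the p-value uses one common definition (twice the smaller share of samples on either side of zero), and the helper names are invented. It is not a reimplementation of pvals.fnc.

```python
import math
import random

def pmcmc(samples):
    """Two-sided p-value: twice the smaller share of samples on either side of zero."""
    share_above = sum(1 for s in samples if s > 0) / len(samples)
    return 2 * min(share_above, 1 - share_above)

def hpd_interval(samples, mass=0.95):
    """Shortest interval that contains the requested share of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = math.ceil(mass * n)  # number of samples the interval must cover
    start = min(
        range(n - window + 1),
        key=lambda i: ordered[i + window - 1] - ordered[i],
    )
    return ordered[start], ordered[start + window - 1]

# Simulated stand-in for 10,000 posterior samples of one coefficient.
# The mean and spread are assumptions loosely based on the estimate above.
rng = random.Random(0)
samples = [rng.gauss(7.07, 3.07) for _ in range(10_000)]

low, high = hpd_interval(samples)
print(f"95% HPD interval: {low:.2f} to {high:.2f}")
print(f"pMCMC: {pmcmc(samples):.4f}")
```

An interval that excludes zero and a small p-value tell the same story: almost all of the sampled coefficient values fall on one side of zero.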

Summary
- We can move beyond the "who" of variation and into "how" and "why"
- Indexical fields are a useful conceptual tool and we can use them to understand constraints on meaning
- It seems likely that many indexical fields are structured by axes like self/other-orientation
  - Which are made visible to listeners and appraised by them
- Rate is not the only thing that matters for emotion
  - Burstiness also communicates the drama of the situation
  - It is unlikely that people go to the extent that Shatner does
  - But there's reason to believe that tempo may be as useful as, or more useful than, simple rates

Thank you!

Collins, S. (1989). Subjective and autonomic responses to Western classical music. Unpublished doctoral dissertation, University of Manchester, UK.
de Jong, N. H., and T. Wempe. (2009). Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods, 41(2), 385.
Eckert, P. (2008). Variation and the indexical field. Journal of Sociolinguistics, 12(4), 453-476.
Huson, D., D. Richter, C. Rausch, T. Dezulian, M. Franz, and R. Rupp. (2007). Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics, 8:460. Software freely available from www.dendroscope.org
Juslin, P. N., and P. Laukka. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129(5), 770-814.
Kendall, T. (2009). Speech Rate, Pause, and Linguistic Variation: An Examination Through the Sociolinguistic Archive and Analysis Project. Doctoral dissertation. Durham, NC: Duke University.
Kendall, T. (2010). Language Variation and Sequential Temporal Patterns of Talk. Linguistics Department, Stanford University: Palo Alto, CA. February.
Scherer, K. R. (1981). Speech and emotional states. Speech evaluation in psychiatry, 189-220.
Scherer, K. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40, 227-256.
Scherer, K., and J. Oshinsky. (1977). Cue utilization in emotion attribution from auditory stimuli. Motiv. Emot. 1, 331-346.
Schnoebelen, T. (2009). The social meaning of tempo. http://www.stanford.edu/~tylers/notes/socioling/Social_meaning_tempo_Schnoebelen_3-23-09.pdf
Ververidis, D., and C. Kotropoulos. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9), 1162-1181.

Special thanks to Penny Eckert, John Rickford, Kate Geenberg, Kyuwon Moon, Roey Gafter, and Mathew Lodge

Also, fwiw, I've put together a lot of essays and reading notes about language and emotion here: http://www.stanford.edu/~tylers/emotions.shtml

Appendix

Why look at TV/movies for style?

Actors are a good source for studies of style since they make vivid the cues that are more mixed and grey in real life. "The act that one does, the act that one performs, is, in a sense, an act that has been going on before one arrived on the scene" (Butler 1988: 526). Butler is talking about gender, but this idea applies to acting as well. Actors don't really create anything out of whole cloth. They assemble bits and pieces. It would be difficult to analyze the acoustic signal of "wooden" acting, since we'd be measuring perceptions of lack, but even histrionic, scene-chewing samples offer us speech cues associated with various social categories. Again, the assumption is that actors use stylistic resources that their audiences can be expected to understand. If audiences uniformly agree on what a performance expresses, it doesn't necessarily matter what the intention was. We're after that shared social meaning and the components that comprise it, though we may be giving up the psychophysiological effects on the voice that happen under natural conditions.

Writers create scenes of dramatic interest, so there is also a higher proportion of arousal in a scene than in daily life.

From a PCA of the indexical field pairwise overlap ratings.

[Figure: alternative view of the indexical field clusters, with labels Overwhelmed, Other-oriented, Overwhelming, Authoritative, Persuading]

A different way of visualizing the data.

Yet another visualization method from Dendroscope.

This uses SplitsTree's NeighborNet algorithm for relating attributes. Unlike hierarchical clustering, this technique models the indeterminacy: the more webbing there is, the more different ways there are to cluster the data. Distance is also meaningful: the further away two things are, the less they are related. This may ultimately be the best way to show the indexical fields.

I have a tutorial on how to use SplitsTree for language classification that should be relatively straightforward to adapt to this stuff: http://www.stanford.edu/~tylers/notes/qp/Linguistic_phylogenetics_4-23-09.pdf

Huson, D., and D. Bryant. (2006). Application of Phylogenetic Networks in Evolutionary Studies. Molecular Biology and Evolution, 23(2): 254-267. www.splitstree.org
Huson, D. (1998). SplitsTree: A program for analyzing and visualizing evolutionary data. Bioinformatics, 14(10): 68-73.