states and times or cycles and frequencies?

Phonology and Phonetics of Rhythm:

States and Timesor

Cycles and Frequencies?

Dafydd Gibbon

Bielefeld UniversityJinan University

13th Chinese Phonetics ConferenceGuangzhou, November 2018

13th Chinese Phonetics Conference, 2018 D. Gibbon, States and Times or Cycles and Frequencies 2

Theses: YARD (Yet Another Discussion of Rhythm)

In speech (and in music):1) Rhythms are oscillations, not sequences of isochronous states2) Rhythmic oscillations are essentially linear (‘finite state’)3) Rhythms occur in time domains of different sizes, each of them

linear4) The time domains are associated with ranks of units of

speech/language from discourse to phoneme

So rhythms appear to be hierarchical, but the hierarchy● is layered● has limited depth● has linear layers with iterative cycles● is not a general recursive hierarchy


Linear Layered Discourse Rhythms

57 seconds of a BBC news broadcast, from Aix-MARSEC corpus (A0101B)

Two Frequency zones:1) Approximately 120 – 220 Hz2) Approximately 300 – 400 Hz (on ‘paratone’ onsets)

Finite depth linear 2-cycle iterative layered structure(FS, cf. Pierrhumbert, not a recursive hierarchy)

● Two independently motivated but structurally dependent linear layers(like hours & minutes on a clock)

● Each gradually declining and periodically resetting


1. Denotation, Reference

2.symbols, Icons,

Indices inSpace-Time

semiotic relation

3.Categorial

simple and structured

forms

Multimodal Interpretation

hierarchicalpatterning

FocusContrast

Emphasis

linearpatterns

Three ‘hierarchies’, not one, with a ternary semiotic basis for signs at all ranks.

Semantic Interpretation

The Multilinear Rank Interpretation Model


Discourse: Monologue, Dialogue

Utterance: turn, IPU, ...

Sentence, clause, phrase

Word: simple, inflected, compound, derived

Multimodal

Rank

Interpretation

Architecture

MultilinearCategory Ranks

Ranks: Categories and their Interpretations



The discourse dimension: people communicating with their multimodal rank interpretation architectures



States and Times or Cycles and Frequencies? - Overview

1.Rhythm Communities

2. Phonological cycles: abstract, iterative rhythm

3. Phonetic states: sequences of isochronous states as rhythm

4.Phonetic cycles: rhythm as oscillation:– Production as amplitude and frequency modulation– Perception as amplitude and frequency demodulation

5. Pitch cycles: the role of F0 in rhythm– AM and FM similarities– Some problems with F0 estimators (aka ‘pitch trackers’)

6. More roles of F0 in discourse rhythms


Rhythm Communities – a Selection

Conversation Analysis

Linguistic Analysis

Phonetic Annotation Analysis

PhoneticSignal

Analysis




‘Authentic’ everyday dialogue dataEthnomethodology: task scenarioSystematic but subjective judgmentsFocus on interpretation:

● discourse grammar: framing● semantics: topic development● pragmatics: speaker, addressee

Linguistic Analysis

Constructed data paradigmsRules for verbal categoriesAbstract stress position patterns:

● linear structures, grids● hierarchical structures, trees

● words● sentences


Standardised reading data● sentences● stories (The North Wind…)

Annotation:● manual, semi-automatic● syllables, feet, C or V chunks

Focus on isochrony

Phonetic Signal Analysis

Smoothing models● splines, polynomials

Oscillation models● production oriented● perception oriented

● Amplitude Modulation Spectrum● Frequency Modulation Spectrum




‘Authentic’ everyday dialogue dataEthnomethodology: task scenarioSystematic but subjective judgmentsFocus on interpretation:

● discourse grammar: framing● semantics: topic development● pragmatics: speaker, addressee

Linguistic Analysis

Constructed data paradigmsRules for verbal categoriesAbstract stress position patterns:

● linear structures, grids● hierarchical structures, trees

● words● sentences


Standardised reading data● sentences● stories (The North Wind…)

Annotation:● manual, semi-automatic● syllables, feet, C or V chunks

Focus on isochrony

Phonetic Signal Analysis

Smoothing models● splines, polynomials

Oscillation models● production oriented● perception oriented

● Amplitude Modulation Spectrum● Frequency Modulation Spectrum

Tree Structures

Linguistics: deductive, top-down● paradigmatic (classificatory)● syntagmatic (compositional)

Phonetics: inductive, bottom-up● classificatory (CART etc.)● compositional


Phonological Iteration as Abstract Oscillation:

Iterative intonation

Iterative tonal sandhi (Niger Congo)

Iterative tonal sandhi (Tianjin dialect)

Note: by ‘cycle’ I do not mean tone cycles in theparadigmatic , classificatory sense,

but syntagmatic, compositional iterations


Empirical overgeneration1) Accents in a sequence tend to be all H* or all L*2) Global contours tend to be rising with L*

accents, falling with H* accents3) Global contours may span more than 1 turn

Empirical undergeneration1) Paratone hierarchy not included2) No time constraints

Pierrehumbert’s regular grammar / finite state transition network

Phonological iteration as a layered hierarchyof linear abstract oscillations

Not the first (cf. Reich,’t Hart et al., Fujisaki, …)

But linguistically the most interesting.


From an allotonic point of view:

● 3 cycles● 1-tape (1-level) transition

network

Niger-Congo Iterative Tonal Sandhi (the most general case)


At the most abstract level, just one node with H and L cycling around it.


From an allotonic point of view:

● 3 cycles● 2-tape (= 2-level) transition

network




From phonetic signal processing point of view:

● 3 cycles● 3-tape (= 3-level) transition

network




Martin Jansche 1998Tianjin Mandarin tone sandhi

Tianjin Dialect Iterative Tonal Sandhi



Phonetics, the Search for Rhythm I

States and Times: the Isochrony Approach

1-dimensional2-dimensional3-dimensional


One-dimensional annotation mining of time-stamp durations

One-dimensional because the result of the analysis is a single scale. The results are all comparable to variance or standard deviation, but differ in detail.

For example, with the nPVI, subtraction is between neighbouring data in a moving window (so a kind of AMDF, Average Magnitude Difference Function), not between mean and data, thus factoring out tempo variations to some extent.

(or Standard Deviation)


Wagner, Petra (2007). “Visualizing levels of rhythmic organisation.” Proc. International Congress of Phonetic Sciences, Saarbrücken 2007, pp. 1113-1116, 2007

Two-dimensional because duration relations are represented in a z-scored scatter plot, not as a single scale.Result, visualising the scale in two dimensions:

Mandarin: means are scattered relatively evenly around the centreEnglish: e.g. count(short-short) > count(long-long), not binary!

LONG-LONG

LONG-SHORTSHORT-SHORT

SHORT-LONG

Two-dimensional annotation mining of time-stamp durations

LONG-LONGSHORT-LONG

LONG-SHORTSHORT-SHORT


Gibbon, Dafydd. 2006. “Time types and time trees: Prosodic mining and alignment of temporally annotated data”. In: Stefan Sudhoff, et al., eds. Methods in Empirical Prosody Research. Berlin: Walter de Gruyter, pp. 281–209, 2006.

Three-dimensional because alternative trees are possible, depending on the algorithm settings:

- binary/nonbinary, lower/higher percolated- related to phrasal and discourse patterns

Induction of compositional tree structures

Two-dimensional annotation mining of time-stamp durations


All these approaches

● are based on annotated time-stamps, not the signal● model static durations● use pairwise relations● focus on isochrony, equal timing of durations

● completely ignore the essential property of rhythm, which is alternation of units with approximately equal duration

● in other words: oscillation

Time Stamps:

AnnotationMining

Time Stamps:

1D Isochrony

Time Stamps:

2DRelations

Time Stamps:

3DTime Trees

Time Stamps:

AnnotationMining

Time Stamps:

1D Isochrony

Time Stamps:

2DRelations

Time Stamps:

3DTime Trees


Phonetics, The Search for Rhythm II

Cycles and Frequencies: Oscillation Approaches

Rhythm Production as Amplitude Modulation (AM)Rhythmic Amplitude Envelope + Carrier Frequency

There are many studies of oscillation in production

I will concentrate on ...

Rhythm Perception as Amplitude DemodulationRhythmic Amplitude Envelope – Carrier Frequency


Selected Work on Amplitude Envelope Demodulation Spectra[1] Cummins, Fred, Felix Gers and Jürgen Schmidhuber. “Language identification from prosody

without explicit features.” Proc. Eurospeech. 1999.[2] He, Lei and Volker Dellwo. “A Praat-Based Algorithm to Extract the Amplitude Envelope and

Temporal Fine Structure Using the Hilbert Transform.” In: Proc. Interspeech 2016, San Francisco, pp. 530-534, 2016.

[3] Hermansky, Hynek. “History of modulation spectrum in ASR.” Proc. ICASSP 2010.[4] Leong, Victoria and Usha Goswami. “Acoustic-Emergent Phonology in the Amplitude Envelope of

Child-Directed Speech.” PLoS One 10(12), 2015.[5] Leong, Victoria, Michael A. Stone, Richard E. Turner, and Usha Goswami. “A role for amplitude

modulation phase relationships in speech rhythm perception.” JAcSocAm, 2014.[6] Liss, Julie M., Sue LeGendre, and Andrew J. Lotto. “Discriminating Dysarthria Type From

Envelope Modulation Spectra.” Journal of Speech, Language and Hearing Research 53(5):1246–1255, 2010.

[7] Ludusan, Bogdan Antonio Origlia, Francesco Cutugno. “On the use of the rhythmogram for automatic syllabic prominence detection.” Proc. Interspeech, pp. 2413-2416, 2011.

[8] Ojeda, Ariana, Ratree Wayland, and Andrew Lotto. “Speech rhythm classification using modulation spectra (EMS).” Poster presentation at the 3rd Annual Florida Psycholinguistics Meeting, 21.10.2017, U Florida. 2017.

[9] Tilsen Samuel and Keith Johnson. “Low-frequency Fourier analysis of speech rhythm.” Journal of the Acoustical Society of America. 2008; 124(2):EL34–EL39. [PubMed: 18681499]

[10] Tilsen, Samuel and Amalia Arvaniti. “Speech rhythm analysis with decomposition of the amplitude envelope: Characterizing rhythmic patterns within and across languages.” The Journal of the Acoustical Society of America 134, p. 628 .2013.

[11] Todd, Neil P. McAngus and Guy J. Brown. “A computational model of prosody perception.” Proc. ICSLP 94, pp. 127-130, 1994.

[12] Varnet, Léo, Maria Clemencia Ortiz-Barajas, Ramón Guevara Erra, Judit Gervain, and Christian Lorenzi. “A cross-linguistic study of speech modulation spectra.” JAcSocAm 142 (4), 1976–1989, 2017.

☛

☛

☛

☛


INFORMATION

AMPLITUDEMODULATION

CARRIERFREQUENCY

NOISEFREQUENCIES

+ ✕ FILTERCOEFFICIENTS

INFORMATION

FREQUENCYMODULATION

SPECTRALANALYSESin different

frequency zones,COORDINATION

AMPLITUDE DEMODULATIONin different time zonesrectification, LP filtering

envelope detection

FREQUENCYDEMODULATION

in different frequency zonespitch tracking, formant tracking

AM Multiple

Oscillators,productionemulation

Rhythmis

oscillationand iteration

AM Spectral Zones,

perceptionemulation

FMdiscourse

turnsemotion


Modulation Regularities – Maybe the Long-Term Spectrum?

Rhythm: thelong-term spectrum?



F0




F0 F0 F0harmonics

Long-term spectrum:

not so useful




?

What is happening

here?


Modulation Regularities – Maybe F0?

And what is the role of

F0?


Rhythm as iteration

The Reality of Rhythm!

13th Chinese Phonetics Conference, 2018 D. Gibbon, States and Times or Cycles and Frequencies 31Amplitude Envelope Modulation Spectrum (AEMS, AMS, EMS) Frequency Zones

Amplitude Envelope Modulation

↓Amplitude Envelope Demodulation

absolute value of Hilbert transform(or rectification & peak-picking / LP filtering)

↓Spectral slice (FFT)

↓Spectral Zone Edge Detection

Amplitude Modulation, Demodulation and Envelope Spectrum


Amplitude Demodulation and the Amplitude Envelope Spectrum

Amplitude ModulationSpectrum


AM Spectrum and the AM Difference Spectrum

Amplitude ModulationDifference Spectrum

(pioneered by Wiktor Jassem)


The Role of F0

If a spectrum can be derived from the AM envelope,

why not derive a spectrum from the FM track

and see whether they correlate?

Preliminary answer:

Yes, they do correlate to some extent, but not overwhelmingly strongly!This is not very surprising, of course, since they are partly co-extensive locally and globally, though locally not too similar.

I will look at both AM and FM spectra.


Mandarin, female30 sec, < 1Hz

AM and FM Demodulation


Mandarin, female30 sec, < 1Hz

0.2 Hz / 5s:the rhythm ofinterpausal units?

AM and FM Demodulation


AM and FM Demodulation and Spectrum Analysis: Calibration


English (RP)Edinburgh corpus

“The North Wind and the Sun”

Beijing MandarinYu corpus

“bei3 feng1 gen1 tai4 yang2”

Envelope Demodulation: Extending to Discourse Spectra


English (RP)Edinburgh corpus

“The North Wind and the Sun”

Beijing MandarinYu corpus

“bei3 feng1 gen1 tai4 yang2”

Short phrases

Short IPUs

ParatoneIPUs

IPU layers

PhrasesIPUs

1 Hz

Envelope Demodulation: Extending to Discourse Spectra


SpectralFrequency

ZoneBoundaries EnglishNewsreading

EnglishStory

Extending to Discourse Spectra: English Genres


Contrast

elderly male Englishyoung female Mandarin

Extending to discourse spectra: English-Mandarin


Rhythms in Syntagmatic, Compositional Spectral Trees


L-strong, > L-strong, < R-strong, > R-strong, <

Frequency Domain Trees: Spectral Zone Hierarchies

EnglishNewsreading

AM and FM Demodulation and Spectral Tree Induction


L-strong, <

AEMS Frequency Tree

EnglishNewsreading



L-strong, <

AEMS Frequency Tree

EnglishNorth Wind & Sun



L-strong,

<

AEMS Frequency TreeMandarinNorth Wind &

Sun



Rhythms in Paradigmatic, Classificatory Spectral Trees


Data:Chunks from “The North Wind and the Sun”

Male, English: 40s

Female, Mandarin: 40s

Method:Comparison of non-overlapping adjacent 5s audio chunks– offsets into recording: 0, 5, 10, 15, 20, 25, 30, 35– AEMS for each chunk– Inter-speaker comparison (AEMS pointwise means, r=0.82)– Classification by hierarchical similarity / distance:

● Pearson Distance: 1 – r , range 0 … 2● comparison of 7 classifiers● similar classification methods used in dialectometry,

stylometry, language typology

Consistency of AM Spectra: Rhythm Classification


Similarity criterion:>3 adjacent same-speaker settings

Largest per speaker score: (4+3)/16

Largest cluster:4/16

Consistency of Spectra: Rhythm Classification


Highest speaker-specific total:1 Nrst.Pt. (4+3+3)/10

Largest cluster:5 UPGMC 5/10

Consistency of Spectra: Rhythm Classification


Discourse Rhythms: Long FM contours

Thesis: in evolution,– frequency modulation and rhythm came first

● emotional cries● turn-taking came before grammar,

Levinson, “Turn-taking in Human Communication – Origins and Implications for Language Processing”, 2015

Note: in infant speech,– frequency modulation and rhythm also come first

● emotional criesWermke, Sebastian-Galles

● turn-takingcf. the ‘bootstrapping’ literature

the infant ‘twin-talk’ videos on YouTube ☺


Answer: falling utterance contour

Question+Answer: rising-falling adjacency pair contoursyntagmatic entrainment

Question: rising utterance contour



Answer: falling utterance contour

L* sequence, global rise, final rise

H* sequence, global fall

H* sequence, global fall

L* sequence, global fall, final rise

Response Continuation

Interview start Question Question



But there are Methodological Problems in F0 / Pitch estimation

1. Terminology:– articulation rate (production)– F0 (acoustic transmission)– pitch (perception)

2. Measurement:– F0 estimation implementations yield slightly different results

● Autocorrelation● Normalised Cross-correlation● Average Magnitude Difference Function (AMDF)● FFT peak detection● Cepstrum

– Environment differences● Preprocessing: low-pass filter; centre-clipping● Postprocessing: moving median


RAPT (Robust Algorithm for Pitch Tracking)

RAPTC binary

David Talkin


RAPT (Python emulation)

RAPTPython

emulation

Daniel Gaspari


Reaper (Robust Epoch And Pitch EstimatoR)

Reapercompiled C

binary

David Talkin


Praat

Praatscript+binary

Paul Boersma


YIN (as opposed to YANG, Python emulation)

YINPython

emulation

Patrice Guyot


YAAPT (Yet Another Algorithm for Pitch Tracking, Python emulation)

YAAPTPython

emulation

Bernardo J. B. Schmitt


SWIPE (Square Wave Inspired Pitch Estimator, Python emulation)

SWIPEPython

emulation

Disha Garg


F0 – Pitch: Methodological Problems

1. Terminology:– articulation rate (production)– F0 (acoustic transmission)– pitch (perception)

2. Measurement:– F0 estimation implementations yield slightly different results

● Autocorrelation● Normalised Cross-correlation● Average Magnitude Difference Function (AMDF)● FFT peak detection● Cepstrum

– Environment differences● Preprocessing: low-pass filter; centre-clipping● Postprocessing: moving median

So let’s do somthing about it

● do-it-yourself F0 estimation● with full control over all parameters.


F0 – Pitch: A Constructive Do-It-Yourself Strategy

Time domain:

AMDF

A kind of auto-correlation, but with subtraction minima not correlation maxima

Preprocessing:● centre-clipper● low-pass filterPostprocessing:● moving median

Simple: no● voice detection● candidate weighting

(code on GitHub)

Frequency domain:

FFT+spectrum peak-picking

Finding the lowest frequency spectral peak in the Fourier transform

Preprocessing:● centre-clipper● low-pass filterPostprocessing:● moving median

Simple - no● voice detection● candidate weighting

(code on GitHub)


How about AMDF (Average Magnitude Difference Function, Python)

Dafydd Gibbon

AMDFnaive difference minima, Python


FFTpeak (Simple F0 Tracker, Python)

S0FTnaive FFT

peak, Python

Dafydd Gibbon*

* inspired by a snippet from ‘Jonathan’, gist.github.com/endolith/255291


Comparing F0 estimators with RAPT as ‘gold standard’

Benchmark against standard F0 estimators

Encouraging for FFTpeak (problems remain, of course):● correlation ignores some relevant properties such as

overall difference in pitch height● (slightly positively biased) idea of the relationship● not enough test data● not as robust as RAPT● but suggests that RAPT is fit for purpose

F0 estimator pair Correlation

RAPT:PyRAPT 0,8902RAPT:Praat 0,8657RAPT:FFTpeak 0,9096RAPT:f0AMDF 0,8605

FFTpeak:RAPT 0,9096FFTpeak:Praat 0,8409FFTpeak:PyRAPT 0,8352FFTpeak:f0AMDF 0,8016


RAPT (Robust Algorithm for Pitch Tracking)

RAPTC binary

David Talkin


Continuing with FM anyway: Emotive Rhythms

Thesis 1:In the evolutionary time domain: emotive ‘animal’ modulations came before structural modulations

Thesis 2:In the beginning was “Wow!” (Or “Aaah!”)

Thesis 3:Or the wolf whistle (it’s not simply ‘cat-calling’)

Thesis 4:Other primates wowed, aahed and whistled first – we humans continued the custom

Is this why in some societies there is a taboo on whistling?


Many Uses of Frequency Modulation in Discourse Rhythms


Alibaba FM


哇

Cantonese region (Guangzhou) Wu region (Shanghai)

‘Tone 6’ ☺Tone 4

Emotive Exclamations


啊

Cantonese region (Shenzhen)

Emotive Exclamations


FM Rhythms in Teleglossia


Thank you!谢谢 !

states and times or cycles and frequencies?

Documents