Forensic Phonetics: Issues in speaker identification evidence

Andrew Butcher

Centre for Human Communication Research

Flinders Medical Research Institute, Flinders University, Adelaide, Australia

Abstract

The field of forensic phonetics has developed over the last 20 years or so and embraces a number of areas involving analysis of the recorded human voice. The area in which expert opinion is most frequently sought is that of speaker identification – the question of whether two or more recordings of speech (from suspect and perpetrator) are from the same speaker. Automated analysis (in which Australia is a world leader) is only possible where recording conditions are identical. In the most frequently encountered real-world forensic situation, comparison is required between a police interview recording and recordings made via telephone intercepts or listening devices. This necessitates a complex procedure, involving auditory and acoustic comparison of both linguistic and non-linguistic features of the speech samples in order to build up a profile of the speaker. The most commonly used measures are average fundamental frequency and the first and second formant frequencies of vowels. Much work is still needed to develop appropriate statistical procedures for the evaluation of phonetic evidence. This means estimating the probability of finding the observed differences between samples from the same speaker and the probability of finding those same differences between samples from two different speakers. Thus there needs to be an acceptance that the outcome will not be an absolute identification or exclusion of the suspect. By itself, your voice is not a complete giveaway.

1. The field of forensic phonetics

The use of phonetics as a forensic tool has developed over the past 20 years or so (Hollien 1990;

Baldwin & French 1991), but with the rapid expansion in the number of cases depending on the

evidence of covert audio and video recordings in recent years, forensic phonetics now plays a

crucial role in an increasing number of criminal trials. A forensic phonetician may be asked to

prepare reports in a number of areas, of which the following four are the most frequently

encountered:

1.1 Speaker identification. This is by far the most commonly required task and the subject of the

remainder of this paper.

1.2 Disputed utterances. In view of the usually very poor quality of covert police recordings

(especially those made via a listening device), there is often ample scope for a defendant to

challenge the prosecution’s version of what was actually said in the course of a recorded

conversation. Forensic phoneticians may be asked to prepare a report on the quality of the

recording and the intelligibility of the speech. They may also be asked to prepare an ‘objective’

transcript of the recording.

1.3 Tape authentication. Occasionally a defendant (or a civil litigant) may have cause to question

whether an audio recording has been tampered with in some way. Usually the claim is that

certain sections have been excised or perhaps transposed. It is not generally within the

competence of a phonetician to give an opinion as to the physical condition of a tape, but there

may be evidence within the acoustic signal (‘pops’ or abrupt changes in either the signal itself or

the background noise) which would be indicative of electronic editing. However, currently

available software makes ‘seamless’ editing comparatively easy, and a phonetician may be

needed to give an opinion on the only remaining evidence of any tampering – linguistic evidence

in the form of unnatural changes in rhythm, tempo or intonation.

1.4 Voice line-ups. The practice of confronting witnesses of a crime with a tape recorded ‘voice

line-up’, where the voice of a suspect is included amongst a series of ‘foils’, may be used to

obtain evidence of identification in cases where, in the course of committing a crime, an unseen

or masked perpetrator has spoken in the presence of the witnesses. This recording is played to

the witness(es) and they are asked to state whether they can identify any of the voices as that of

the perpetrator. In order to be entirely fair to the suspect, there are a number of criteria which

need to be observed (Broeders & Rietveld 1995; Hollien, Huntley, Künzel & Hollien P 1995). As

with visual identification parades, it is a general principle of fairness in the conducting of voice

line-ups that there should be no feature of any of the voices or the recordings which would

cause non-witnesses to pick out a particular speaker (whether suspect or foil) as being different

from the rest. A phonetician may be consulted on aspects of the construction of the tape and the

administration of the confrontation.

2. Speaker Identification: analysis and measurement

I would estimate that at least 90% of my work as a forensic phonetician is concerned with the

identity of speakers in audio recordings. There is a good deal of misunderstanding surrounding

the capabilities of speech technology in this area. Some of this misunderstanding dates from the

1960s, when the “Voiceprint” technique became a favourite tool of certain police forces, most


notably in the USA. This methodology, which involved the visual inspection and impressionistic

comparison of sound spectrograms, was regarded sceptically by the scientific community at the

time, and has since been entirely discredited (Hollien 1990, 2002; Gruba & Poza 1995). The

term “Voiceprint” suggests that the technique is analogous to forensic techniques such as

fingerprinting or DNA analysis. There are a number of reasons why this is an inappropriate

analogy. Firstly, there is no single feature of the voice which is unique to every speaker. Unlike

the vanishingly small possibility in the case of fingerprints or DNA molecules, it is quite

possible for two speakers to be, for all practical purposes, identical in some respect. Secondly,

most (if not all) of the features of the voice which are measurable in recordings of the quality

typically encountered in the forensic context are capable of being consciously changed by the

speaker. These include voice pitch, aspects of voice quality, consonantal articulation, and vowel

quality. At present it is not impossible for a skilled mimic to defeat the forensic voice

identification procedure. Thirdly, for most of the voice features, we do not have sufficient data

on the normal population to know what the chances are of two speakers being similar or identical

with respect to that feature. Finally, acoustic parameters vary as a consequence of differences in

recording conditions as well as of differences in the voice itself. Australia leads the world in the

technology of automatic speaker recognition (in 2001 a team from the RCSAVT Speech

Research Lab at Queensland University of Technology won two of the categories for single

speaker detection tasks in the National Institute of Standards & Technology’s benchmark tests

on speaker recognition), but automatic speaker recognition is not yet able to separate out

variation due to speaker differences from variation due to recording conditions (and it is doubtful

whether it will ever be able to). Thus automatic speaker recognition techniques are of limited use

in the typical forensic situation, where a voice recorded over the telephone or via a listening

device is to be compared with a voice recorded in a police interview room. The intervention of a

phonetically and linguistically qualified human operator is required. The main components of the

procedure are an auditory analysis and an acoustic analysis, each of which in turn has a number

of component parts. Voice ID is therefore more appropriately compared with a technique such as

a ‘photo-fit’ type of procedure, where a number of features are considered as part of an overall

profile.

2.1. Auditory analysis


This part of the analysis involves careful and repeated listening by the expert, noting features of

the voices in question under four basic headings. Firstly, voice quality features are ascertained.

This means describing ‘voice’ in the technical sense – i.e. the sound made by the vibration of the

vocal folds – and ignoring for the moment any variations contributed by the resonances of the

throat, mouth and nasal passages above. It can be done using one of a number of descriptive

frameworks (e.g. Isshiki & Takeuchi 1970; Laver 1980; Wendler, Rauhut & Krüger 1986; Oates

& Russell 1998), whereby aspects of the voice can be quantified according to parameters such as

‘roughness’, ‘strain’, ‘creakiness’, ‘breathiness’ and so on – terms which are meaningful to other

phoneticians and speech scientists and which describe in as accurate and objective a way as

possible the auditory impressions of the listener. Secondly, the investigator attends to the non-

linguistic characteristics of the speech which are not produced by the larynx. This means

listening to the effects of the long-term setting of the throat, the tongue and lips and the

resonances of the nasal passages and sinuses. This is known as the articulatory setting, and here

too, established descriptive frameworks are available (Laver 1980; Esling 1994) which rate the

voice according to such parameters as ‘hypernasality’, ‘pharyngealisation’, ‘labialisation’, as

well as vertical position of the larynx. The third set of parameters relate to aspects of (mainly

vowel) articulation which provide clues to the speaker’s geographical and social background. In

long-established linguistic communities such as in the United Kingdom and Europe, this part of

the analysis can provide very useful information. In a recently established community such as

(non-Aboriginal) Australia, the information which can be gleaned is usually quite scanty.

Australian English accents are traditionally classified on a three-point scale as being ‘Broad’,

‘General’ or ‘Cultivated’ (Mitchell & Delbridge 1965), but there are very few features which

enable us to pinpoint the speaker’s geographical origins with any accuracy. One or two

pronunciations are peculiar to Queensland and another one or two distinguish speakers with a

South Australian background. A more recent phenomenon is the “pan-ethnic” accent (sometimes

known as “wogspeak”) which has developed among second- and subsequent-generation

Australians of non-English-speaking background (Warren 1999). The final component of the

auditory analysis is the identification of any idiosyncratic pronunciation features which may be

present. The more commonly occurring idiosyncrasies involve the articulation of consonants,

and include various types of ‘lisp’, the labialising of ‘r’ (‘rabbit’ becomes something like


‘wabbit’) and the pronunciation of ‘th’ as ‘v’. Apart from this, speakers may exhibit various

kinds of dysfluency, including stuttering, ‘cluttering’ and slurring of words.

2.2 Acoustic analysis

In order to carry out an acoustic analysis the recording must be digitised to a computer hard

drive or compact disc (a sampling rate of 22.05 kHz and a 16-bit resolution are normally used).

The recordings are usually edited so as to contain only the voice of the speaker under

investigation. Published recommended minimal sample sizes for forensic speaker comparison

range from 15 s to 120 s. With regard to fundamental frequency measurement (F0), one recent

review of the forensic phonetic literature concludes: “If the communicative behaviour may be

considered ‘normal’, 15-20 sec of speech will be sufficient to calculate speaker F0” (Braun

1995). The analyses described below can be performed using any one of a number of currently

available speech analysis software packages.

2.2.1 Fundamental frequency

The rate of vibration of the vocal folds during voiced segments of speech is what the listener

perceives as the pitch of the voice. This is known as the fundamental frequency, and is

measured in cycles per second or ‘Hertz’ (Hz). Obviously this is capable of variation by the

speaker, and indeed this is one of the main ways of conveying both grammatical and emotional

meaning in speech.
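
The idea of measuring the rate of vocal fold vibration can be illustrated with a minimal sketch. It assumes a single voiced frame and uses autocorrelation peak picking, which is one common approach and not necessarily the method used by any particular speech analysis package. The function name, the 60-400 Hz search range and the synthetic 113 Hz test tone are assumptions made purely for illustration.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak.

    Hypothetical helper for illustration; assumes the frame is voiced and
    is longer than the longest period searched.
    """
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]        # remove any DC offset
    lag_min = int(sample_rate / f0_max)    # shortest period considered
    lag_max = int(sample_rate / f0_min)    # longest period considered
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        n = len(x) - lag
        # Divide by the overlap length so that longer lags are not penalised
        score = sum(x[i] * x[i + lag] for i in range(n)) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic tone at 113 Hz, sampled at the 22.05 kHz rate mentioned in 2.2
sample_rate = 22050
frame = [math.sin(2 * math.pi * 113.0 * t / sample_rate) for t in range(2048)]
f0 = estimate_f0(frame, sample_rate)
print(f"estimated F0: {f0:.1f} Hz")
```

Because the lag is a whole number of samples, the estimate is only accurate to within a hertz or so at this sampling rate. Real recordings would also need voicing detection and averaging over many frames.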


Figure 1: Waveform (above) and pitch contour (below) of the utterance “We went to Woolloomooloo”

Figure 1 shows a waveform and pitch contour for an Australian English sentence. The waveform

at the top represents the tiny variations in air pressure caused by the transmission of the sound

waves. The bottom trace shows the variation in frequency of those vibrations over time: the

fundamental frequency. Each speaker has a particular range of fundamental frequency which

s/he habitually uses and within which s/he feels most comfortable and this is an important

measure for forensic purposes, because it is one of the few measures for which we know the

distribution amongst the adult population at large. The average speaking fundamental frequency

for an adult Caucasian male is 113 Hz, and 50% of the male population lie somewhere between

100 and 130 Hz in spontaneous speech (Künzel 1989). The corresponding average for females is

225 Hz. Figure 2 shows how this measure may be used in building up a voice profile. In this case

the voice of a person issuing a ransom demand over the telephone is compared with the voices of

two suspects (Butcher & Moody 1999). Clearly the fundamental frequency of suspect 1 is much

closer to that of the perpetrator than is the fundamental frequency of suspect 2. Furthermore both

the perpetrator and suspect 1 differ markedly from the population mean and in the same

direction.


Figure 2: Graph of mean fundamental frequency (Hz) of three speakers (perpetrator, suspect 1 and suspect 2) in a number of recordings. The vertical lines represent one standard deviation either side of the mean. The dashed line represents the mean for the adult male population.

2.2.2. Long term average spectrum

A spectrum is a plot of energy against frequency. It shows the distribution of energy throughout

the frequency range during a very small time ‘slice’ of sound. A long-term spectral energy

profile is derived by averaging a large number of spectral slices over a longer sample of speech,

thus eliminating information on the details of individual sounds. This is, in theory, the best

measure of what we perceive as voice quality and vocal effort, as well as the overall effects of

long-term articulatory settings. It is this kind of measure which is used in most automated

speaker recognition procedures (Butcher & Moody 1999). Unfortunately, such measures also

reflect differences in recording conditions, and often these may be sufficiently large to mask any

similarities between speakers. Figure 3 illustrates this problem. In Figure 3a the voice of a

suspect recorded via a mobile telephone is compared with unknown voices from four other calls

made from the same phone. Clearly there is a high degree of similarity between the voices. In

Figure 3b, however, the voice of the suspect is shown under three different conditions: recorded

on standard audio cassette via telephone, recorded in free field on microcassette and recorded in

free field on VHS-C cassette. In this case the three spectra look quite different – in particular

there is a large discrepancy between the telephone recording and the two free-field recordings,

which represent the two recording conditions most commonly offered for comparison in the

forensic situation. Clearly this measure can only be used in the limited number of situations

where the conditions under which recordings have been made can be assumed to be similar.
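
The averaging of spectral slices described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular analysis package: the function names, the Hann window, the frame length of 128 samples and the 50% overlap are all my own choices, and a naive discrete Fourier transform is used so that the example needs only the standard library.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum of one short 'slice' (DFT bins 0 .. N/2)."""
    n = len(frame)
    # A Hann window reduces leakage caused by the abrupt frame edges
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def long_term_average_spectrum(samples, frame_len=128, hop=64):
    """Average the power spectra of overlapping frames; return dB re the peak.

    Assumes the signal is not silent (the peak power must be non-zero).
    """
    starts = range(0, len(samples) - frame_len + 1, hop)
    spectra = [power_spectrum(samples[s:s + frame_len]) for s in starts]
    n_bins = frame_len // 2 + 1
    mean_power = [sum(sp[k] for sp in spectra) / len(spectra)
                  for k in range(n_bins)]
    peak = max(mean_power)
    # A small floor avoids taking the logarithm of zero in empty bins
    return [10 * math.log10(max(p, 1e-12 * peak) / peak) for p in mean_power]


# Illustrative signal: a 500 Hz tone sampled at 8 kHz (bin spacing 62.5 Hz)
sample_rate = 8000
tone = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(1024)]
ltas = long_term_average_spectrum(tone)
peak_bin = ltas.index(max(ltas))
print(f"strongest bin: {peak_bin} ({peak_bin * sample_rate / 128:.1f} Hz)")
```

Note that nothing in this calculation distinguishes energy contributed by the speaker from energy shaped by the telephone channel or the recording medium, which is precisely the limitation discussed above.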

Figure 3: Long-term average spectra (energy in dB against frequency in Hz) (a) from 5 separate phone calls, allegedly by a single speaker, and (b) from the same speaker, recorded under three different conditions.

2.2.3 Vowel formant frequencies

When a speaker pronounces a vowel sound, a number of resonances are produced in the vocal

tract (the tube formed by the mouth and throat cavities). These are known as formants. The

frequencies of the lowest two or three formants change according to the ‘colour’, ‘quality’ or

‘timbre’ of the vowel. Formants can be measured from a sound spectrogram, which is a kind of

three-dimensional spectrum. As with the spectrum, the distribution of energy is shown over the

frequency range, but in this case we can see how this distribution varies as a function of time.

Frequency is shown on the vertical axis and time on the horizontal axis, whilst the amount of

energy present is represented by the darkness of the shading. The formants appear as dark

horizontal bands, whose vertical position varies according to the nature of the vowel. This is

illustrated in Figure 4. For example, if a number of speakers pronounce the short ‘a’ vowel in

words such as ‘cat’, ‘bad’, ‘sack’ etc, one might expect to find some small, but consistent

differences between speakers, if the sample is large enough, and likewise for each of the other

vowels of the language.

Figure 4: A sound spectrogram of the words ‘head, had, hard’, spoken by an adult male (frequency range 0 to 4.5 kHz, duration 1.775 s). The dark horizontal bands (F1, F2, F3) in the vowels represent areas of higher energy known as FORMANTS.

A useful way of summarising vowel formant frequency data from a given speaker is to plot the

mean values of the first formant against the mean values of the second formant for all the

vowels. This provides a characteristic pattern or ‘vowel space’ for the speaker, as shown in

Figure 5, which is based on data measured from the voice of a murder suspect during interview.

In this figure the first formant frequency is shown on the vertical axis and the second formant

frequency on the horizontal axis. The origins of the axes are placed in the top right hand corner,

so that the positions of the points on the chart relate approximately to the position of the tongue

and jaw: vowels pronounced with a forward position of the tongue and spread lips appear on the

left and those with a retracted tongue and rounded lips appear on the right. Vowels with a raised

tongue and closed jaw are at the top and vowels with a lower tongue and open jaw at the bottom.

The individual letters represent a point positioned at the intersection of the means of the first and

second formant frequencies of the vowel in question. The ellipses represent a distance of two

standard deviations around the mean for that vowel, i.e. the area which would include 95% of the

speaker’s vowels of that type.

Figure 5: Formant plot of short vowels of a suspect in a police interview recording. The phonetic symbols

represent a point positioned at the intersection of the mean first and second formant frequencies of the vowel in question and the ellipses represent two standard deviations around the means. From left to right, the symbols represent ‘i’ as in ‘ring’, ‘e’ as in ‘left’, ‘a’ as in ‘that’, ‘u’ as in ‘up’, ‘o’ as in ‘got’, and ‘oo’ as in ‘good’.

Figure 6: Comparison of short vowels from a suspect in a police interview recording with corresponding vowels of

a speaker recorded via a listening device. The ellipses are the same as in Figure 5 – i.e. they represent two standard deviations around the means for the suspect’s voice. The phonetic symbols represent individual short vowels from the unknown speaker.

In Figure 6 the same ellipses are superimposed on a set of data points representing the formant

frequencies of vowels from an unknown speaker recorded via a listening device. The degree of

overlap between the two speakers can be roughly quantified by calculating the proportion of

vowel points from the unknown speaker which fall within the appropriate ellipse of the suspect

speaker. In this particular diagram, only 50% of the unknown speaker’s vowels fall within the

corresponding ellipse of the suspect speaker. Based on this data alone, there would have to be

considerable doubt that the speakers are the same.
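
The rough overlap measure just described can be sketched as follows. This is a minimal illustration, assuming that each ellipse is defined by the covariance of the suspect's F1 and F2 values for that vowel, so that a point lies inside the ellipse when its Mahalanobis distance from the mean is no more than two. The text does not specify how the ellipses were constructed, so this definition, the function names and the formant values are all assumptions made for illustration.

```python
from statistics import mean


def mean_and_covariance(tokens):
    """Mean and sample covariance of a list of (F1, F2) pairs in Hz.

    Needs at least three tokens that do not all lie on one straight line.
    """
    m1 = mean(t[0] for t in tokens)
    m2 = mean(t[1] for t in tokens)
    n = len(tokens)
    s11 = sum((t[0] - m1) ** 2 for t in tokens) / (n - 1)
    s22 = sum((t[1] - m2) ** 2 for t in tokens) / (n - 1)
    s12 = sum((t[0] - m1) * (t[1] - m2) for t in tokens) / (n - 1)
    return (m1, m2), (s11, s12, s22)


def inside_ellipse(point, centre, cov, n_sd=2.0):
    """True if the point lies within n_sd standard deviations of the centre."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 * s12
    d1, d2 = point[0] - centre[0], point[1] - centre[1]
    # Squared Mahalanobis distance, using the inverse of the 2x2 covariance
    dist_sq = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return dist_sq <= n_sd ** 2


def proportion_inside(suspect_tokens, unknown_tokens):
    """Share of the unknown speaker's vowels inside the suspect's ellipses."""
    hits = total = 0
    for vowel, tokens in unknown_tokens.items():
        if vowel not in suspect_tokens:
            continue  # no ellipse available for this vowel
        centre, cov = mean_and_covariance(suspect_tokens[vowel])
        for token in tokens:
            total += 1
            if inside_ellipse(token, centre, cov):
                hits += 1
    return hits / total


# Invented (F1, F2) values in Hz, for illustration only
suspect = {"a": [(700, 1500), (720, 1550), (680, 1450), (710, 1480), (690, 1520)]}
unknown = {"a": [(700, 1500), (900, 1900)]}
overlap = proportion_inside(suspect, unknown)
print(f"proportion inside: {overlap:.0%}")
```

A figure of this kind is only a rough summary. It takes no account of how many tokens were measured, nor of how far outside the ellipse the remaining points fall.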

Data from a different case are shown in Figures 7 and 8. In these plots the mean frequencies of

the vowel sets are compared. In Figure 7 the combined mean values from a perpetrator’s vowels

in a number of phone calls are compared with the values for the equivalent vowels spoken by 20

adult male speakers of General Australian English from the Australian National Database of

Spoken Language (Millar, Vonwiller, Harrington & Dermody 1994). The two patterns look quite

different, and the overall mean difference between the values of the perpetrator and those of

this sample of the general population is 12.2%. Figure 8 shows the same set of perpetrator

vowels compared with those of a suspect. The degree of similarity between the two patterns

appears much greater, and indeed the mean difference between the values for the perpetrator and

those for the suspect is 3.3%. Thus the formant frequencies of the perpetrator are considerably

closer to those of the suspect than they are to those of the general population. Experience

suggests that a variation of 5% or less is of the order expected within a single speaker.
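
One way in which such an overall mean difference might be calculated is sketched below. The text does not state the exact formula behind the 12.2% and 3.3% figures, so this is an assumption: the absolute difference in each mean formant value is expressed as a percentage of the reference value and then averaged over all vowels and both formants. The function name and the formant values are invented for illustration.

```python
def mean_percent_difference(test_means, reference_means):
    """Average absolute percentage difference between two sets of vowel means.

    Each argument maps a vowel label to its (mean F1, mean F2) in Hz.
    Only vowels present in both sets are compared.
    """
    differences = []
    for vowel in test_means.keys() & reference_means.keys():
        for test, ref in zip(test_means[vowel], reference_means[vowel]):
            differences.append(abs(test - ref) / ref * 100.0)
    return sum(differences) / len(differences)


# Invented mean (F1, F2) values in Hz, for illustration only
perpetrator = {"i": (330, 2200), "a": (770, 1650)}
reference = {"i": (300, 2000), "a": (700, 1500)}
difference = mean_percent_difference(perpetrator, reference)
print(f"mean difference: {difference:.1f}%")
```

Under this definition the result depends on which data set is taken as the reference, so the same convention would need to be used for every comparison.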

These, then, are the major parameters that may be used to build up a profile of two or more

voices for the purposes of forming an opinion as to their overall similarity.


Figure 7: Comparison of vowels from a perpetrator with vowels from the Australian National Database of Spoken Language (ANDOSL). Each phonetic symbol represents the mean for that vowel in one of the two sets of data. All means for a given data set are connected by a line (one line for the perpetrator, one for the ANDOSL data).

Figure 8: Comparison of vowels from a perpetrator with vowels from a suspect. Each phonetic symbol represents the mean for that vowel in one of the two sets of data. All means for a given data set are connected by a line (one line for the perpetrator, one for the suspect).


3. Presenting the evidence

3.1 Problems with ‘Probability’

Having carried out the analyses and formed an opinion, the phonetician must now present the

evidence to the court and express his opinions based upon it. The usual expectation of lawyers

appears to be that the expert give his opinion in the form of an answer – preferably in numerical

terms – to the question “Given the degree of similarity between the speech samples, what is the

probability of the two voices belonging to the same speaker?” And the answer that is required is

something along the lines of: “Given the high degree of similarity between the two speech

samples, there is a very high (90%) probability that these two samples are from the same

speaker”. Some expert (but non-phonetician) witnesses in the field appear to be prepared to

make statements of this kind. This is, however, highly inappropriate and has no statistical basis.

The witness is in fact expressing the probability of a hypothesis, given the evidence. This is

not only logically incorrect, but, according to my understanding, also legally incorrect, as this is

ultimately the job of the court and not of the expert witness. Essentially what forensic

phoneticians have traditionally been asked to do is akin to answering the question “Given that

this creature has wings, what are the chances of it being a bird?” The question the expert witness

should be answering, however, is the equivalent of “Given that this creature is a bird, what are

the chances of it having wings?” Translated back into the real world, this means “If we assume

that the two speech samples are from the same speaker, what is the likelihood of them displaying

this degree of similarity?” In other words s/he should be expressing the probability of the

evidence, given the hypothesis.

3.2 The Likelihood Ratio

Ideally the evidence of the expert witness should be expressed within the framework of Bayesian

statistics (Robertson & Vignaux 1995). This means answering a question of the type “How much

more likely is this creature to have wings if it were a bird than if it were not a bird?” or in reality

“How much more likely is the given degree of similarity between samples if they were by the

same speaker than if they were by different speakers?”. This involves the use of a likelihood

ratio, which is arrived at in the following way (Rose 2002). The phonetician observes and

quantifies a certain degree of similarity (X) between the perpetrator and suspect speech samples.

Let’s assume, for the sake of argument, that published research has shown that X degree of

similarity is found in 85% of paired samples from the same speaker, but in only 15% of paired

samples from different speakers. This means that the probability of observing X degree of

similarity between samples from the same speaker would be 85% and the probability of finding

X degree of similarity between different speakers would be 15%. The likelihood ratio is then 85

divided by 15, or 5.67.
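
The arithmetic of this example can be written out as a short sketch. The function name is my own, and the two probabilities are the hypothetical figures used in the text, not values from any published study.

```python
def likelihood_ratio(p_evidence_if_same, p_evidence_if_different):
    """Probability of the evidence under the same-speaker hypothesis divided
    by its probability under the different-speaker hypothesis."""
    if p_evidence_if_different <= 0:
        raise ValueError("probability under the different-speaker hypothesis must be positive")
    return p_evidence_if_same / p_evidence_if_different


# Hypothetical figures from the worked example in the text
lr = likelihood_ratio(0.85, 0.15)
print(f"likelihood ratio: {lr:.2f}")
```

Both inputs are probabilities of the evidence given a hypothesis. Neither is the probability that the suspect is the speaker.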

A likelihood ratio greater than 1.0 supports the prosecution hypothesis – i.e. shows that the

degree of similarity found between the speech samples is more likely if they were by the same

speaker than if they were by different speakers. A likelihood ratio less than 1.0 supports the

defence hypothesis – i.e. shows that the degree of similarity found between the speech samples is

more likely if they were by different speakers than if they were by the same speaker. The value

of the likelihood ratio thus quantifies the strength of the evidence, and likelihood ratios from

different areas of forensic evidence can be combined. Each successive likelihood ratio should be

evaluated in terms of the degree of confidence in the assertion of guilt before consideration of

the evidence in question (the so-called ‘prior odds’) (Robertson & Vignaux 1995).
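
The way in which successive likelihood ratios act on the prior odds can be sketched as follows. This is a minimal illustration of the odds form of Bayes' theorem (posterior odds = prior odds × likelihood ratio). Multiplying several likelihood ratios together assumes that the items of evidence are independent of one another, which is an assumption of this sketch, not a claim made in the text. The function names, the prior odds and the second likelihood ratio are invented for illustration.

```python
def update_odds(prior_odds, likelihood_ratios):
    """Apply each likelihood ratio in turn; the odds after one item of
    evidence serve as the prior odds for the next."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds


def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1.0 + odds)


# Invented example: even prior odds, the phonetic likelihood ratio of 5.67
# from the worked example, and a second, hypothetical likelihood ratio of 2.0
posterior_odds = update_odds(1.0, [5.67, 2.0])
print(f"posterior odds: {posterior_odds:.2f}")
print(f"posterior probability: {odds_to_probability(posterior_odds):.3f}")
```

Setting the prior odds is a matter for the court. The expert contributes only the likelihood ratio for his or her own evidence.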

4. Speaker Identification: where we are now

At the beginning of the previous subsection I used the word “ideally” and indeed the subsequent

paragraphs describe the ideal situation. The key sentence is the one beginning “Let’s assume, for

the sake of argument, that published research has shown …”. Unfortunately, however, we cannot

assume any such thing at this time. Our knowledge of what is ‘normal’ or ‘average’ for the

population is severely lacking in most areas and the data that we do have is inevitably limited to

the majority population groups – i.e. in the case of Australia to the Anglo-Celtic community –

and to somewhat artificial ‘laboratory’ conditions. Furthermore, the statistical modelling of the

highly complex variation that occurs in speech is still in its infancy, and is still a long way from

being able to cope with the distinction between variation due to speaker differences and variation

due to differences in recording conditions – as we have seen, a crucial requirement in the

forensic context. Thus statements as to probability made by forensic phoneticians are at this

stage limited by these two significant constraints. Whilst every scrap of available quantitative

data on the general population will be taken into account, such statements will inevitably rely

heavily on the extensive experience and accumulated knowledge of the individual expert.


References

BALDWIN J & FRENCH P (1991) Forensic Phonetics. London & New York: Pinter.

BRAUN A (1995) Fundamental frequency – how speaker-specific is it? In: A BRAUN & J-P KÖSTER (eds) Studies in Forensic Phonetics. Trier: Wissenschaftlicher Verlag Trier, 9-23.

BROEDERS APA & RIETVELD ACM (1995) Speaker identification by earwitness. In: A BRAUN & J-P KÖSTER (eds) Studies in Forensic Phonetics. Trier: Wissenschaftlicher Verlag Trier.

BUTCHER AR & MOODY MP (1999) The case of the ‘third voice’: a rare opportunity for closed set comparison in the forensic context. Paper presented at the Annual Conference of the International Association for Forensic Phonetics, York, England.

ESLING JH (1994) Voice quality. In: RE ASHER & JMY SIMPSON (eds) The Encyclopedia of Language and Linguistics. Oxford: Pergamon Press, 4950-4953.

GRUBA JS & POZA FT (1995) Voicegram identification evidence. 54 American Jurisprudence Trials 1.

HOLLIEN H (1990) The Acoustics of Crime. New York & London: Plenum.

HOLLIEN H (2002) Forensic Voice Identification. San Diego: Academic Press.

HOLLIEN H, HUNTLEY RA, KÜNZEL HJ & HOLLIEN PA (1995) Criteria for earwitness lineups. Forensic Linguistics 2, 143-153.

ISSHIKI N & TAKEUCHI Y (1970) Factor analysis of hoarseness. Studia Phonologica 5, 37-44.

KÜNZEL HJ (1989) How well does average fundamental frequency correlate with speaker height and weight? Phonetica 46, 117-125.

LAVER J (1980) The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.

MILLAR JB, VONWILLER J, HARRINGTON JM & DERMODY P (1994) The Australian National Database of Spoken Language. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Adelaide, 67-100.

MITCHELL AG & DELBRIDGE A (1965) The Pronunciation of English in Australia (revised edition). Sydney: Angus and Robertson.

OATES JM & RUSSELL A (1998) Learning voice analysis using an interactive multi-media package: Development and preliminary evaluation. Journal of Voice 12, 500-512.

ROBERTSON B & VIGNAUX GA (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. New York: John Wiley & Sons.

ROSE P (2002) Forensic Speaker Identification. London: Taylor & Francis.

WARREN J (1999) ‘Wogspeak’: transformations of Australian English. Journal of Australian Studies 62, 86-94.

WENDLER J, RAUHUT A & KRÜGER H (1986) Classification of voice qualities. Journal of Phonetics 14, 483-488.
