051013 speaker identification-ocr

Attempted Speaker Identification

Florida vs. Zimmerman

Report to:

Richard W. Mantei

Assistant State Attorney

Fourth Judicial Circuit of Florida

220 E. Bay St., Jacksonville, FL 32202

March 20, 2013

Report Prepared by:

Harry Hollien PhD

James D. Harnsberger PhD

Senior Consultants

Forensic Communication Associates


REPORT ON ATTEMPTED SPEAKER IDENTIFICATION

Florida vs. Zimmerman

INTRODUCTION

Personnel at Forensic Communication Associates (FCA) were contacted by Mr. Richard

Mantei, Assistant State Attorney, Fourth Judicial Circuit of Florida, Jacksonville, regarding

recordings associated with the above cited case. It was requested that an attempt be made to

discover if the male voice found on a 9-11 recording (i.e., the unknown voice) was the same

as the one recorded on an "exemplar" CD (the known voice). The person speaking on the

exemplar was Mr. George Zimmerman. Later exemplar recordings were requested for Mr.

Trayvon Martin; the speech on them was to be compared to the 9-11 utterances also.

MATERIALS RECEIVED

Two CD recordings were received at FCA. One was of the relevant 9-11 call. It was

labeled "911 witness call," an address and [illegible]. The second CD contained the voice of

George Zimmerman. It was dated 2/26/12, dispatch call [illegible]. Both were labeled

with an FCA number, and digital copies were made on laboratory equipment and a computer.

FCA personnel then requested additional voice samples both of G. Zimmerman and T.

Martin. At various later dates, three Zimmerman CDs were received (they were jail calls,

4/20/12; video interview, 2/27/12; and reenactment audio, 3/22/12).

Finally, two DVDs taken from Trayvon Martin's phone (only markings) were received at yet

a later date. Identifying marks were placed on these recordings also and digital copies made of

them (i.e., via computer input). The undersigned and a senior technician listened to them in

their entirety several times. Analysis CDs were constructed (of evidence-exemplar sets and

pairs) and the samples they contained processed by means of several aural-perceptual speaker

identification techniques (see below and Hollien, H., Acoustics of Crime, Plenum, 1990;

Hollien and Hollien, Improving Aural-Perceptual Speaker Identification Techniques, Studies in

Forensic Phonetics, 1995, Wissenschaftlicher Verlag, Trier, and Hollien, H., Forensic Voice

Identification. Academic Press, 2001).

THE RECORDINGS

The Evidence Recording.

As expected, the samples on the 9-11 evidence recording were not at all suitable for

ordinary speaker identification analyses. First, they were mostly short grunts, calls or cries; a

few gave the illusion of speech, mostly "help" or "help me." Second, with only two exceptions,

they were rather faint. Third, since they were recorded at a 9-11 center, other voices were heard

(they were much louder, of course). In many instances, these voices obliterated and/or

overlapped those in the background. Fourth, 16 utterances in all could be identified. However,

only six were found to be potentially useful and some of their extent was lost when they were

extracted. Taken as a whole, only a little over 8 sec. of speech was found to be available for

assessment.

Exemplars.

On the other hand, the energy levels of the utterances on the exemplar recordings were

sufficient and the overall quality of those produced was quite good in all instances. They were

of the type suitable for speaker identification purposes; that is, they were intelligible and,

although noise was intermittently present, it rarely masked the speech. The problem, of course,

was that very few of the utterances they contained actually were suitable for comparison as

those involving very short samples produced under high stress were quite rare. Although a

number of procedures were tried, ultimately the judgments made contrasting the Zimmerman re-created

cries as exemplars (i.e., when compared to the six 9-11 samples) were the most useful.

For Martin, several of his high frequency laughs, exclamations and mocking utterances were

employed.

Selection of the Unknown Samples.

As stated, the major problem was that very little speech/voice material was available for

processing; a second problem was that they were all calls or cries; the third was that they were

very faint and, the fourth, that most were at least partly masked by the speech of other talkers. In

all, 16 short calls/cries were identified and very little intelligible speech was available: i.e., only

one or two instances of "help" or "help me." Of these 16 samples, only six provided at least

500 ms of a clear call and, even in these instances, part of the total call had to be

removed. As stated, just a little more than 8 sec. of phonation was available. Samples this brief

rarely lead to attempts at speaker identification. Ordinarily, 10 words or 10 seconds of speech

constitute a bare minimum. However, they were the only unknown samples available and the

task involved making a determination between but two speakers (i.e., G. Zimmerman and T.

Martin).

Preparation of the Recordings.

Samples of the "unknown" (U) and the "known" (K) speakers were prepared for the

aural-perceptual comparisons. These procedures included the selection of three sets of samples

for three different analyses. The first set was mostly for familiarization purposes. It involved a

compilation of all the calls/cries onto a single recording. It was compared (serially) to a group of

short speech passages drawn from the several interviews/calls made by Mr. Zimmerman (K).

Later on, this procedure was independently applied to samples from Mr. Martin's telephone

calls (K).

The second procedure was to create six separate CDs, each with a different call or cry

from the 9-11 recordings and individually compare them to a variety of short speech samples

from the K recordings (first to Zimmerman samples, then separately to those by Martin). The

final procedure (i.e., the third one) was most important. These same cries/calls were individually

compared to the cries/calls from the reenactment recording. In Mr. Martin's case, samples of

laughter, mocking, and high-pitched exclamations were employed. As would be expected, all

samples were of the best available quality, with noise at a minimum. Both the

known and the unknown samples were band-pass filtered with both the high-pass and low-pass

filtering cutoffs set outside the speech range. This procedure was carried out in order to 1)

minimize any situational differences, 2) reduce distractions and 3) eliminate some of the non-speech

artifacts present on the recordings. Thus, the highest quality samples possible were

made available.
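
The band-pass step described above can be illustrated with a minimal sketch. The report does not state the filter type or the cutoff frequencies, so everything specific here is an assumption: the 60 Hz and 8000 Hz cutoffs are hypothetical placeholders for "outside the speech range," and the filter is a cascade of two generic second-order sections using the widely published Audio EQ Cookbook coefficient formulas, not the laboratory equipment actually used.

```python
import math

def biquad_coefficients(kind, cutoff_hz, sample_rate_hz, q=1 / math.sqrt(2)):
    """Return normalized (b0, b1, b2, a1, a2) for a second-order section."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2 * q)
    if kind == "lowpass":
        b0, b1, b2 = (1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2
    elif kind == "highpass":
        b0, b1, b2 = (1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2
    else:
        raise ValueError("kind must be 'lowpass' or 'highpass'")
    a0, a1, a2 = 1 + alpha, -2 * cos_w0, 1 - alpha
    return b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0

def apply_biquad(samples, coeffs):
    """Run one second-order section over a list of float samples."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def band_pass(samples, sample_rate_hz, low_cutoff_hz=60.0, high_cutoff_hz=8000.0):
    """High-pass then low-pass; both cutoffs are hypothetical (not from the report)."""
    hp = biquad_coefficients("highpass", low_cutoff_hz, sample_rate_hz)
    lp = biquad_coefficients("lowpass", high_cutoff_hz, sample_rate_hz)
    return apply_biquad(apply_biquad(samples, hp), lp)

def tone(freq_hz, sample_rate_hz, seconds):
    """Unit-amplitude sine wave used only to exercise the filter."""
    n = int(sample_rate_hz * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]
```

With these assumed cutoffs, a 1 kHz test tone passes essentially unchanged while a 10 Hz rumble is strongly attenuated, which is the intended effect of placing both cutoffs outside the band that carries the voice.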

AURAL-PERCEPTUAL SPEAKER IDENTIFICATION

The aural-perceptual speaker identification procedures employed are those where an

unknown voice (U), drawn from an evidence recording, is compared to exemplars of a known

voice (K). As stated above, samples of a number of U-K combinations were placed on a CD in

pairs for direct and repeated comparison. The undersigned then carried out evaluations which

were based on a number of heard parameters. In this instance, they only included comparisons

of: 1) fundamental frequency, 2) voice quality, 3) vocal intensity (variability) patterns, 4)

vowels, and 5) nasality. Subjective impressions also were logged for consideration. This entire

procedure was completed, then repeated in its entirety some time later -- usually the next day.

In this case, a speaker identification procedure had to be employed in which an attempt

had to be made to match - or not match - Mr. Zimmerman's vocalizations to the six usable cries

found on the 9-11 call with utterances (as similar as possible) from exemplar recordings. As

stated, the process was carried out as follows. The greatest extent of the cry possible was

isolated; all extraneous noise was removed; the cry was repeated 8-10 times. A comparison

recording of a number of Mr. Zimmerman's exemplar utterances was compared in turn - and,

individually - with each of the eight samples. Voice quality, pitch, vowel quality, nasality and

intensity inflections were the primary judgmental features. Finally, the entire process was

repeated using the mimicked cries by Mr. Zimmerman. It then was carried out twice for Mr.

Martin. First with general speech samples, then with the available stress units.

It also should be noted that, for this case, the usual procedure was further modified.

Ordinarily, evaluating an identification parameter (pitch, say) was carried out by playing the

pairs over and over until a decision could be made. Here, the single identification parameter

remained the same but the specific cry (No. 8, say) was compared to a variety of exemplar

samples (again, over and over) until the judgment was made. It then was repeated for the five

other U utterances. Thus, the process here more closely parallels six separate speaker

identifications with the product for each summed both by cry and feature.
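
As a rough illustration of what summarizing "both by cry and feature" could look like, the sketch below groups individual 0-10 judgments first by cry number and then by feature, reporting a mean and a range for each group. The judgment values are hypothetical placeholders, not data from this report, and the `summarize` helper is an assumed name introduced only for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judgments: (cry number, feature, 0-10 score). Not data from the report.
judgments = [
    (14, "pitch", 5.0), (14, "voice quality", 6.5), (14, "nasality", 6.0),
    (16, "pitch", 6.0), (16, "voice quality", 7.5), (16, "nasality", 7.0),
]

def summarize(records, key_index):
    """Group scores by the chosen field (0 = cry, 1 = feature); return mean and range."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return {
        key: {"mean": round(mean(scores), 2), "range": (min(scores), max(scores))}
        for key, scores in groups.items()
    }

by_cry = summarize(judgments, 0)
by_feature = summarize(judgments, 1)
```

The same records feed both summaries, so a cry-level mean and a feature-level mean are simply two views of one set of judgments.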

Please note that the two investigators worked independently and did not compare results

until after all evaluations had been completed.

As implied, the (individual) assessments ordinarily obtained are summarized on a

continuum like the one found in Figure 1. In general, the range of scores making up each

continuum can be divided roughly as follows: 1) any mean scores in the 0-3 range suggest that a

match cannot be made and the samples were produced by two different individuals, 2) a scoring

of 4-6 is generally neutral but somewhat on the positive side (i.e., toward a match) and 3) those

that fall within the 7-10 range indicate a positive-to-strong match. It should be stressed again

that the listed parameters were evaluated one at a time with the complete procedure

independently replicated a number of times. This method of presentation was adapted for these

evaluations (see Figures 2-5).
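
The three score bands just described can be written out as a small lookup. This is only a sketch: the bands are stated for whole numbers (0-3, 4-6, 7-10), so the handling of fractional means is an assumption. Here any mean below 4.0 is treated as a non-match and any mean of 7.0 or above as a match, which is consistent with how fractional means are discussed in the Results section.

```python
def interpret_score(score):
    """Map a 0-10 mean judgment to one of the three bands described in the text.

    Boundary handling for fractional scores (below 4.0, below 7.0) is an assumption.
    """
    if not 0 <= score <= 10:
        raise ValueError("score must lie on the 0-10 continuum")
    if score < 4.0:
        return "no match"
    if score < 7.0:
        return "neutral"
    return "match"

# Example use with a few illustrative scores.
for value in (2.4, 4.7, 6.9, 7.0):
    print(value, interpret_score(value))
```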

RESULTS

The prepared samples were played (repeatedly) on high quality laboratory equipment.

The findings and impressions of the undersigned resulted in differing conclusions depending on

which of the U samples were compared to which of those uttered by the two known (K)

speakers. As stated, a maximum of only five speech/voice parameters (plus an overall

judgment) could be used to permit U-K judgments.

The Bases of the Comparisons.

1. Pitch. Perceived pitch is the psychophysical correlate of fundamental frequency

usage. In this case, it refers to the level of those tones produced by the speaker. It

proved to be one of the weaker contrasts in this evaluation.

2. Voice Quality. This dimension is a little difficult to define but rather easy to

demonstrate. Any hearing individual would have little difficulty differentiating a

violin from a saxophone even though both were played at the same fundamental

frequency and intensity. The relative differences among the partials (frequencies)

within the complex musical sounds are what make this discrimination possible. This

characteristic proved to be a major factor for these assessments as the same is true

for human speakers.

3. Vocal Intensity Patterns. Absolute vocal intensity levels are very difficult to detect

because even slight differences in the environmental situation, microphone position,

talker distance, etc., can result in large differences in the absolute level of

measured or perceived loudness. As with pitch, the intensity variability patterns

proved to be one of the lesser identification features. Yet they aided in some

judgments.

4. Nasality. Detection of the amount of nasality in the cries and exemplar samples

proved to be helpful.

5. Vowels. In some cases, vowel formant comparisons of the calls with exemplar

samples provided enough information to permit graded same-different judgments.

6. Finally, each of our evaluators provided a general overall assessment of the U-K

samples. In many cases, these efforts aided in the decision making.

Specific Results.

The first of the three sets of judgments (i.e., general speech) for Mr. Zimmerman was

simply inconclusive and will not be included in the results. The second provided some insight so

its results are presented as Figure 2. The third (i.e., the U-K comparisons of the 9-11

calls/cries vs. those from the reenactment) was the most important. Please see Figure 3. The two

sets for Mr. Martin parallel those of Mr. Zimmerman to some extent - i.e., short speech

samples and more relevant samples. They will be presented as Figures 4 and 5. Over two

thousand specific judgments were required to permit the following decisions to be made.

As can be seen from consideration of the four figures, no robust matches were obtained.

On the other hand, several rather strong tendencies were found. First, please note the following.

Call No. 11 proved almost impossible to judge once the masking (of other) voices was trimmed

from its borders. Accordingly, data from this sample will not be included on any figures.

Data on Mr. Zimmerman. The scores for cries/calls Numbers 1 and 8 were so low (see

Figures 2 and 3) that little-to-no evidence that Mr. Zimmerman made them appeared to exist.

His scores for call No. 13 were rather mixed - and they were quite variable. Thus, even though

their mean was close to 5.0, the judgment had to be that they were inconclusive. That is, while

they graded above the 0-3 range, they still fell far short of a match. On the other hand, the data

for cry No. 14 and (more so) for cry No. 16 proved to be more toward - but in most instances

not quite reaching - a match. Indeed, as may be seen from the range data, several of the

individual scores exceeded the border of the match category - and the mean for No. 16 (see

Figure 3) also came very close to a match. In short, there is a very good possibility that, under

normal circumstances, cry No. 14 and, especially, cry No. 16 would be judged to be a match -

i.e., that Mr. Zimmerman had, indeed, made one or both of those two utterances. In this

instance, the confidence level only reaches about 65-70%. Nevertheless, it is even much less

likely that he (George Zimmerman) was not the person who made these two cries.

Data for Mr. Martin. The data for Mr. Martin are similar in extent but different in

pattern. Of course, the judgments here are even more difficult to make as they were drawn from

a telephone call and, unlike those for Mr. Zimmerman, no reenactment samples were available.

In judging Figures 4 and 5 (based also on two separate analyses), it can be noted 1) that no

judgments were possible for call No. 11, and 2) that data for calls No. 13, 14, and 16 quite clearly

demonstrate that he did not make them. That is, the means of all the many hundreds of

judgments usually ranged from 1.0 to 3.1. And, even though one score reached 5.5, very few of

the individual judgments exceeded the non-match category. Thus, even with these restricted

judgments, there is too little evidence suggesting that he uttered any of these three calls. On the

other hand, there is some evidence that he was responsible for the first two calls/cries (i.e., No. 1

and No. 8). Note their mean scores on Figures 4 and 5. They range from 5.8-6.5 for the first

identification run (Fig. 4) and 6.4-6.5 for the second (Fig. 5). Note also, that several of the

individual judgments are in the 7.0 or above (i.e., match) category. Thus, it may be concluded

with a nearly 70% confidence level that Mr. Martin produced the first two calls. Again, while

they did not reach the "definitely match" category, the data do not provide any real evidence

that he did not make these utterances.
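
For readers who want to see the two speakers' results side by side, the following sketch tabulates the mean judgments reported in Figure 3 (Mr. Zimmerman, reenacted cries) and Figure 5 (Mr. Martin, selected shouts/cries) and reports, for each cry, which speaker's mean was higher and by how much. It is a reading aid only: a higher mean is not an identification, and the report itself treats cry No. 13 as inconclusive. The function name is an assumption introduced for this example.

```python
# Mean judgments as reported in Figure 3 and Figure 5 (cry No. 11 was not scored).
zimmerman_reenactment = {1: 2.0, 8: 2.1, 13: 4.8, 14: 6.0, 16: 6.9}
martin_selected = {1: 6.5, 8: 6.4, 13: 1.1, 14: 3.1, 16: 2.8}

def higher_scoring_speaker(cry):
    """Return (speaker with the higher mean, absolute difference) for one cry."""
    z = zimmerman_reenactment[cry]
    m = martin_selected[cry]
    name = "Zimmerman" if z > m else "Martin"
    return name, round(abs(z - m), 1)

for cry in sorted(zimmerman_reenactment):
    print(f"No. {cry}:", higher_scoring_speaker(cry))
```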

DISCUSSION

While the evidence suggests that Mr. Martin produced the first two utterances and Mr.

Zimmerman made the last two, the confidence level for these relationships is not very robust.

Yet, conclusions of these low magnitudes are hardly surprising, given the limits and difficulty of

the evaluation process. It is possible, of course, that more robust data could have been obtained

if we had been supported in conducting two additional sets of procedures. The first of these

procedures would have included comparative acoustic analyses of the listed U-K samples. The

second would have been a perceptual experiment to compare the evidence recordings to

appropriately sized samples of male speakers who were matched in age, gender, and linguistic

background to, alternatively, Trayvon Martin and George Zimmerman. These two groups of

speakers would produce utterances similar to those found on the 9-11 and exemplar recordings.

The results of these procedures would have aided the undersigned in confirming or not

confirming the findings reviewed above.

CONCLUSIONS

The opinions to follow are based primarily on the aural-perceptual evaluations described

above. As was stated, even though many problems were evident, the evidence recording

provided minimum-to-marginal material for identification purposes. Moreover, the exemplar

recordings contained enough material to permit a number of different judgments to be made.

Based on the many analyses carried out, the undersigned had to conclude that, while there is

evidence to suggest that Mr. Martin made the first two calls/cries (Nos. 1 and 8) and that Mr.

Zimmerman made those identified as 14 and 16, none of these conclusions reached the criterion

for a match. Neither speaker could be identified as being responsible for the others.

Finally, it must be conceded that the aural-perceptual method of speaker identification,

while reasonably well organized and extensive in this case, is somewhat subjective in nature

and, hence, the possibility of error exists. Nonetheless, the reported data can be defended on the

basis of the rigorous procedures employed and, hence, the conclusions drawn can be viewed as

reasonable.

Respectfully submitted,

Forensic Communication Associates

James D. Harnsberger, PhD

Senior Consultant

Harry Hollien, Ph.D.

Senior Consultant

Figure 1. A sample of the type of summary figure employed in ordinary aural-perceptual
speaker identification. The structuring of Figures 2-5 is patterned on this one.

FORENSIC COMMUNICATION ASSOCIATES
Aural-perceptual Approach to Speaker Identification Score Sheet

Case Name:                      FCA REF:

0 = U-K least alike; 10 = U-K most alike

1. PITCH
   a. Level                     0 . . . . 5 . . . . 10
   b. Variability               0 . . . . 5 . . . . 10
   c. Patterns                  0 . . . . 5 . . . . 10
2. VOICE QUALITY
   a. General                   0 . . . . 5 . . . . 10
   b. Vocal Fry                 0 . . . . 5 . . . . 10
   c. Other                     0 . . . . 5 . . . . 10
3. INTENSITY
   a. Variability               0 . . . . 5 . . . . 10
4. DIALECT
   a. Regional                  0 . . . . 5 . . . . 10
   b. Foreign                   0 . . . . 5 . . . . 10
   c. Idiolect                  0 . . . . 5 . . . . 10
5. ARTICULATION
   a. Vowels                    0 . . . . 5 . . . . 10
   b. Consonants                0 . . . . 5 . . . . 10
   c. Misarticulations          0 . . . . 5 . . . . 10
   d. Nasality                  0 . . . . 5 . . . . 10
6. PROSODY
   a. Rate                      0 . . . . 5 . . . . 10
   b. Speech Bursts             0 . . . . 5 . . . . 10
   c. Other                     0 . . . . 5 . . . . 10

SCORE RANGE:
MEAN:

Figure 2. Comparison of Mr. Zimmerman's general (but short) samples with the six cries
or calls drawn from the 9-11 telephone call. Twenty such samples were
compared to each cry. The data (X) on the continuum are the means of at least
four of the five features plus a general assessment.

Cry Number   Perception          Mean Judgment (X on 0-10 continuum)   Range
No. 1        "wow"               2.4                                   1.0-4.0
No. 8        "ow"                1.9                                   0-4.5
No. 11       (mainly "cherp")    N/A                                   null
No. 13       "wyra"              4.7                                   2.5-5.5
No. 14       "owa"               6.1                                   4.0-6.0
No. 16       "swaa"              6.6                                   4.0-7.0

Figure 3. Comparison of Mr. Zimmerman's reenacted cries with each of the six cries/calls.
Twenty such samples were matched with each cry. The data (X) on the
continuum are the means of at least four of the five features plus a general
assessment.

Cry Number   Perception          Mean Judgment (X on 0-10 continuum)   Range
No. 1        "wow"               2.0                                   0-4.0
No. 8        "ow"                2.1                                   1.0-3.5
No. 11       (mainly "cherp")    N/A                                   null
No. 13       "wyra"              4.8                                   2.0-5.5
No. 14       "owa"               6.0                                   5.0-7.0
No. 16       "swaa"              6.9                                   4.5-7.5

Figure 4. Comparison of Mr. Martin's general (but short) samples with the six cries or
calls drawn from the 9-11 telephone call. Twenty such samples were matched
with each cry. The data (X) on the continuum are the means of at least four of the
five features plus a general assessment.

Cry Number   Perception          Mean Judgment (X on 0-10 continuum)   Range
No. 1        "wow"               5.8                                   3.0-7.0
No. 8        "ow"                6.5                                   4.0-7.5
No. 11       (mainly "cherp")    N/A                                   null
No. 13       "wyra"              1.2                                   0-3.2
No. 14       "owa"               3.9                                   2.5-5.5
No. 16       "swaa"              2.5                                   1.0-4.0

Figure 5. Comparison of Mr. Martin's selected samples (shouts/cries) with the six cries or
calls drawn from the 9-11 telephone call. Twenty such samples were matched
with each cry. The data (X) on the continuum are the means of at least four of the
five features plus a general assessment.

Cry Number   Perception          Mean Judgment (X on 0-10 continuum)   Range
No. 1        "wow"               6.5                                   4.5-7.5
No. 8        "ow"                6.4                                   4.5-7.0
No. 11       (mainly "cherp")    N/A                                   null
No. 13       "wyra"              1.1                                   0.5-3.2
No. 14       "owa"               3.1                                   2.0-5.0
No. 16       "swaa"              2.8                                   1.0-4.5
