051013 speaker identification-ocr
TRANSCRIPT
-
7/30/2019 051013 Speaker Identification-ocr
1/17
Attempted Speaker Identification
Florida vs. Zimmerman
Report to:
Richard W. Mantei
Assistant State Attorney
Fourth Judicial Circuit of Florida
220 E. Bay St., Jacksonville, FL 32202
March 20,2013
Report Prepared by:
Harry Hollien PhD
James D. Harnsberger PhD
Senior ConsultantsForensic Communication Associates
-
7/30/2019 051013 Speaker Identification-ocr
2/17
REPORT ON ATTEMPTED SPEAKER IDENTIFICATION
Florida vs. Zimmerman
INTRODUCTION
Personnel at Forensic Communication Associates (FCA) were contacted by Mr. Richard
Mantei, Assistant State Attorney, Fourth Judicial Circuit of Florida, Jacksonville, regarding
recordings associated with the above cited case. It was requested that an attempt be made to
discover if the male voice found on a 9-11 recording (i.e., the unknown voice) was the same
as the one recorded on an exemplar5 CD (the known voice). The person speaking on the
exemplar was Mr. George Zimmerman. Later exemplar recordings were requested for Mr.
Trayvon Martin; the speech on them was to be compared to the 9-11 utterances also.
MATERIALS RECEIVED
Two CD recordings were received at FCA. One was of the relevant 9-11 call. It was
labeled 911 witness call, an address and^m^^i^. The second CD contained the voice of
George Zimmerman. It was dated 2/26/12, dispatch callfllMHHSfc. Both were labeled
with an FCA number and digital copies made on laboratory equipment and a computer.
FCA personnel then requested additional voice samples both of G. Zimmerman and T.
Martin. At various later dates, three Zimmerman CDs were received (they were jail calls,
4/20/12; video interview, 2/27/12; and reenactment audio, 3/22/12
Finally, two DVDs taken from Trayvon Martins phone (only markings) were received at yet
a later date. Identifying marks were placed on these recordings also and digital copies made of
Page 2 of 17
-
7/30/2019 051013 Speaker Identification-ocr
3/17
them (i.e., via computer input). The undersigned and a senior technician listened to them in
their entirety several times. Analysis CDs were constructed (of evidence-exemplar sets and
pairs) and the samples they contained processed by means of several aural-perceptual speaker
identification techniques (see below and Hollien, H., Acoustics of Crime, Plenum, 1990;
Hollien and Hollien, Improving Aura-perception Speaker Identification Techniques, Studies in
Forensic Phonetics, 1995, Wissenchaftlicher, Trier, and Hollien, H., Forensic Voice
Identification. Academic Press, 2001).
THE RECORDINGS
The Evidence Recording.
As expected, the samples on the 9-11 evidence recording were not at all suitable for
ordinary speaker identification analyses. First, they were mostly short grunts, calls or cries; a
few gave the illusion of speech, mostly help or help me. Second, with only two exceptions,
they were rather faint. Third, since they were recorded at a 9-11 center, other voices were heard
(they were much louder, of course). In many instances, these voices obliterated and/or
overlapped those in the background. Fourth, 16 utterances in all could be identified. However,
only six were found to be potentially useful and some of their extent was lost when they were
extracted. Taken as a whole, only a little over 8 sec. of speech was found to be available for
assessment.
Exemplars.
On the other hand, the energy levels of the utterances on the exemplar recordings were
sufficient and the overall quality of those produced was quite good in all instances. They were
Page 3 of 17
-
7/30/2019 051013 Speaker Identification-ocr
4/17
of the type suitable for speaker identification purposes; that is, they were intelligible and,
although noise was intermittently present, it rarely masked the speech. The problem, of course,
was that very few of the utterances they contained actually were suitable for comparison as
those involving very short samples produced under high stress were quite rare. Although a
number of procedures were tried, ultimately the judgments made contrasting the Zimmerman re
created cries as exemplars (i.e., when compared to the six 9-11 samples) were the most useful.
For Martin, several of his high frequency laughs, exclamations and mocking utterances were
employed.
Selection of the Unknown Samples.
As stated, the major problem was that very little speech/voice material was available for
processing; a second problem was that they were all calls or cries; the third was that they were
very faint and, the fourth, that most were at least partly masked by the speech of other talkers. In
all, 16 short calls/cries were identified and very little intelligible speech was available: i.e., only
one or two instances of help or help me. Of these 16 samples, only six provided at least
500ms or more of a clear call and, even in these instances, part of the total call had to be
removed. As stated, just a little more than 8 sec. of phonation was available. Samples this brief
rarely lead to attempts at speaker identification. Ordinarily, 10 words or 10 seconds of speech
constitute a bare minimum. However, they were the only unknown samples available and the
task involved making a determination between but two speakers (i.e., G. Zimmerman and T.
Martin).
Page 4 of 17
-
7/30/2019 051013 Speaker Identification-ocr
5/17
Preparation of the Recordings.
Samples of the "unknown" (U) and the "known" (K) speakers were prepared for the
aural-perceptual comparisons. These procedures included the selection of three sets of samples
for three different analyses. The first set was mostly for familiarization purposes. It involved a
compilation of all the calls/cries onto a single recording. It was compared (serially) to a group of
short speech passages drawn from the several interviews/calls made by Mr. Zimmerman (K).
Later on, this procedure was independently applied to samples from Mr. Martins telephone
calls (K).
The second procedure was to create six separate CDs, each with a different call or cry
from the 9-11 recordings and individually compare them to a variety of short speech samples
from the K recordings (first to Zimmerman samples, then separately to those by Martin). The
final procedure (i.e., the third one) was most important. These same cries/calls were individually
compared to the cries/calls from the reenactment recording. In Mr. Martins case, samples of
laughter, mocking, and high pitched exclamations were employed. As would be expected, all
samples were of the best available quality and where noise was at a minimum. Both the
known and the unknown samples were band-pass filtered with both the high pass and low
pass filtering cutoffs set outside the speech range. This procedure was carried out in order to 1)
minimize any situational differences, 2) reduce distractions and 3) eliminate some of the non
speech artifacts present on the recordings. Thus, the highest quality samples possible were
made available.
AURAL-PERCEPTUAL SPEAKER IDENTIFICATION
The aural-perceptual speaker identification procedures employed are those where an
Page 5 of 17
-
7/30/2019 051013 Speaker Identification-ocr
6/17
unknown voice (U), drawn from an evidence recording, is compared to exemplars of a known
voice (K). As stated above, samples of a number of U-K combinations were placed on a CD in
pairs for direct and repeated comparison. The undersigned then carried out evaluations which
were based on a number of heard parameters. In this instance, they only included comparisons
of: I) fundamental frequency, 2) voice quality, 3) vocal intensity (variability) patterns, 4)
vowels, and 5) nasality. Subjective impressions also were logged for consideration. This entire
procedure was completed, then repeated in its entirety some time later -- usually the next day.
In this case, a speaker identification procedure had to be employed in which an attempt
had to be made to match - or not match - Mr. Zimmermans vocalizations to the six usable cries
found on the 9-11 call with utterances (as similar as possible) from exemplar recordings. As
stated, the process was carried out as follows. The greatest extent of the cry possible was
isolated; all extraneous noise was removed; the cry was repeated 8-10 times. A comparison
recording of a number of Mr. Zimmermans exemplar utterances was compared in turn - and,
individually - with each of the eight samples. Voice quality, pitch, vowel quality, nasality and
intensity inflections were the primary judgmental features. Finally, the entire process was
repeated using the mimicked cries by Mr. Zimmerman. It then was carried out twice for Mr.
Martin. First with general speech samples, then with the available stress units.
It also should be noted that, for this case, the usual procedure was further modified.
Ordinarily, evaluating an identification parameter (pitch say) was carried out by playing the
pairs over and over until a decision could be made. Here, the single identification parameter
remained the same but the specific cry (No. 8 say) was compared to a variety of exemplar
samples (again, over and over) until the judgment is made. It then was repeated for the five
other U utterances. Thus, the process here more closely parallels six separate speaker
Page 6 of 17
-
7/30/2019 051013 Speaker Identification-ocr
7/17
identifications with the product for each summed both by cry andfeature.
Please note that the two investigators worked independently and did not compare results
until afterall evaluations had been completed.
As implied, the (individual) assessments ordinarily obtained are summarized on a
continua like the one found in Figure 1. In general, the range of scores making up each
continuum can be divided roughly as follows: 1) any mean scores in the 0-3 range suggest that a
match cannot be made and the samples were produced by two different individuals, 2) a scoring
of 4-6 is generally neutral but somewhat on the positive side (i.e., toward a match) and 3) those
that fall within the 7-10 range indicate a positive-to-strong match. It should be stressed again
that the listed parameters were evaluated one at a time with the complete procedure
independently replicated a number of times. This method of presentation was adapted for these
evaluations (see Figures 2-5).
RESULTS
The prepared samples were played (repeatedly) on high quality laboratory equipment.
The findings and impressions of the undersigned resulted in differing conclusions depending on
which of the U samples were compared to which of those uttered by the two known (K)
speakers. As stated, a maximum of only five speech/voice parameters (plus an overall
judgment) could be used to permit U-K judgments.
The Bases of the Comparisons.
1. Pitch. Perceived pitch is the psychophysical correlate of fundamental frequency
Page 7 of 17
-
7/30/2019 051013 Speaker Identification-ocr
8/17
usage. In this case, it refers to the level of those tones produced by the speaker. It
proved to be one of the weaker contrasts in this evaluation.
2. Voice Quality. This dimension is a little difficult to define but rather easy to
demonstrate. Any hearing individual would have little difficulty differentiating a
violin from a saxophone even though both were played at the same fundamental
frequency and intensity. The relative differences among the partials (frequencies)
within the complex musical sounds are what make this discrimination possible. This
characteristic proved to be a major factor for these assessments as the same is true
for human speakers.
3. Vocal Intensity Patterns. Absolute vocal intensity levels are very difficult to detect
because even slight differences in the environmental situation, microphone position,
talker distance, etc., can result in large differences in the absolute level of
measured or perceived loudness. As with pitch, the intensity variability patterns
proved to be one of the lesser identification features. Yet they aided in some
judgments.
4. Nasality. Detection of the amount of nasality in the cries and exemplar samples
proved to be helpful.
5. Vowels. In some cases, vowel format comparisons of the calls with exemplar
samples provided enough information to permit graded same-different judgments.
Page 8 of 17
-
7/30/2019 051013 Speaker Identification-ocr
9/17
6. Finally, each of our evaluators provided a general overall assessment of the U-K
samples. In many cases, these efforts aided in the decision making.
Specific Results.
The first of the three sets of judgments (i.e., general speech) for Mr. Zimmerman was
simply inconclusive and will not be included in the results. The second provided some insight so
its results will be established as Figure 2. The third (i.e., the U-K comparisons of the 9-11 call
calls/cries vs. those from the reenactment) was the most important. Please see Figure 3. The two
sets for Mr. Martin parallel those of Mr. Zimmermans to some extent - i.e., short speech
samples and more relevant samples. They will be presented as Figures 4 and 5. Over two
thousand specific judgments were required to permit the following decisions to be made.
As can be seen from consideration of the four figures, no robust matches were obtained.
On the other hand, several rather strong tendencies were found. First, please note the following.
Call No. 11 proved almost impossible to judge once the masking (of other) voices was trimmed
from its borders. Accordingly, data from this sample will not be included on any figures.
Data on Mr. Zimmerman. The scores for cries/calls Numbers 1 and 8 were so low (see
Figures 2 and 3) that Iittle-to-no evidence that Mr. Zimmerman made them appeared to exist.
His scores for call No. 13 were rather mixed - and they were quite variable. Thus, even though
their mean was close to 5.0, the judgment had to be that they were inconclusive. That is, while
they graded above the 0-3 range, they still fell far short of a match. On the other hand, the data
for cry No. 14 and (more so) for cry No. 16 proved to be more toward - but in most instances
not quite reaching - a match. Indeed, as may be seen from the range data, several of the
Page 9 of 17
-
7/30/2019 051013 Speaker Identification-ocr
10/17
individual scores exceeded the border of the match category - and the mean for No. 16 (see
Figure 3) also came very close to a match. In short, there is a very good possibility that, under
normal circumstances, cry No. 14 and, especially, cry No. 16 would be judged to be a match -
i.e., that Mr. Zimmerman had, indeed, made one or both of those two utterances. In this
instance, the confidence level only reaches about 65-70%. Nevertheless, it is even much less
likely that he (George Zimmerman) was notthe person who made these two cries.
Data for Mr. Martin. The data for Mr. Martin are similar in extent but different in
pattern. Of course, the judgments here are even more difficult to make as they were drawn from
a telephone call and, unlike those for Mr. Zimmerman, no reenactment samples were available.
In judging Figures 4 and 5 (based also on two separate analyses), it can be noted 1) that no
judgments were possible for call No. 11, and data for calls No. 13, 14, and 16 quite clearly
demonstrate that he did not make them. That is, the means of all the many hundreds of
judgments usually ranged from 1.0 to 3.1. And, even though one score reached 5.5, very few of
the individual judgments exceeded the non-match category. Thus, even with these restricted
judgments, there is too little evidence suggesting that he uttered any of these three calls. On the
other hand, there is some evidence that he was responsible for the first two calls/cries (i.e., No. 1
and No. 8). Note their mean scores on Figures 4 and 5. They range from 5.8-6.5 for the first
identification run (Fig. 4) and 6.4-6.5 for the second (Fig. 5). Note also, that several of the
individual judgments are in the 7.0 or above (i.e., match) category. Thus, it may be concluded
with a nearly 70% confidence level that Mr. Martin produced the first two calls. Again, while
they did not reach the definitely match category, the data do not provide any real evidence
that he did notmake these utterances.
Page 10 of 17
-
7/30/2019 051013 Speaker Identification-ocr
11/17
DISCUSSION
While the evidence suggests that Mr. Martin produced the first two utterances and Mr.
Zimmerman made the last two, the confidence level for these relationships is not very robust.
Yet, conclusions of these low magnitudes are hardly surprising, given the limits and difficulty of
the evaluation process. It is possible, of course, that more robust data could have been obtained
if we had been supported in conducting two additional sets of procedures. The first of these
procedures would have included comparative acoustic analyses of the listed U-K samples. The
second would have been a perceptual experiment to compare the evidence recordings to an
appropriately-sized samples of male speakers that were matched in age, gender, and linguistic
background to, alternatively, Trayvon Martin and George Zimmerman. These two groups of
speakers would produce utterances similar to those found on the 9-11 and exemplar recordings.
The results of these procedures would have aided the undersigned in confirming or not
confirming the findings reviewed above.
CONCLUSIONS
The opinions to follow are based primarily on the aural-perceptual evaluations described
above. As was stated, even though many problems were evident, the evidence recording
provided minimum-to-marginal material for identification purposes. Moreover, the exemplar
recordings contained enough material to permit a number of different judgments to be made.
Based on the many analyses carried out, the undersigned had to conclude that, while there is
evidence to suggest that Mr. Martin made the first two calls/cries (Nos. 1 and 8) and that Mr.
Zimmerman made those identified as 14 and 16, none of these conclusions reached the criterion
for a match. Neither speaker could be identified as being responsible for the others.
Page 11 of 17
-
7/30/2019 051013 Speaker Identification-ocr
12/17
Finally, it must be conceded that the aural-perceptual method of speaker identification,
while reasonably well organized and extensive in this case, is somewhat subjective in nature
and, hence, the possibility of error exists. Nonetheless, the reported data can be defended on the
basis of the rigorous procedures employed and, hence, the conclusions drawn can be viewed as
reasonable.
Respectively submitted,
Forensic Communication Associates
James D. Hamsberger, PhD
Senior Consultant
Harry Hollien, Ph.D.
Senior Consultant
Page 12 of 17
-
7/30/2019 051013 Speaker Identification-ocr
13/17
Figure 1.
Case Name:
A sample of the type of summary figure employed in ordinary aural-perceptual
speaker identification. The structuring of figures 2-5 is patterned on this one.
FORENSIC COMMUNICATION ASSOCIATES
Aural-perceptual Approach to Speaker Identification Score Sheet
0 = U-K least alike; 10 = U-K most alike
FCA REF:
1. PITCH
a. Level
b. Variability
c. Patterns
2. VOICE QUALITY
a. General
b. Vocal Fry
c. Other
3. INTENSITY
a. Variability
4. DIALECT
a. Regional
b. Foreign
c. Idiolect
5. ARTICULATION
a. Vowels
b. Consonants
c. Misarticulations
d. Nasality
6. PROSODY
a. Rate
b. Speech Bursts
c. Other
10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
0 . . . . 5 . . . . 10
SCORE RANGE
MEAN
Page 13 of 17
-
7/30/2019 051013 Speaker Identification-ocr
14/17
Figure 2. Comparison of Mr. Zimmermans general (but short) samples with the six cries
or calls drawn from the 9-11 telephone call. Twenty such samples were
compared to each cry. The data (x) on the continuum are the means of at least
four of the five features plus a general assessment.
Cry Number Perception Mean Judgment Range
No. 1 wow 0. .X.. 5.. . 10 = 2.4 1.0-4.0
No. 8 ow 0.X..5... . 10= 1.9 0-4.5
No. 11 (mainly cherp) 0 .... 5 .... 10 = N/A null
No. 13 wyra 0... .X5.. . 10 = 4.7 2.5 - 5.5
No. 14 owa 0 .... 5 X .. . 10 = 6.1 4.0 - 6.0
No. 16 swaa 0 .... 5 .X.. . 10 = 6.6 4.0-7.0
Page 14 of 17
-
7/30/2019 051013 Speaker Identification-ocr
15/17
Figure 3. Comparison of Mr. Zimmermans reenacted cries with each of the six cries/calls.
Twenty such samples were matched with each cry. The data (x) on the
continuum are the means of at least four of the Five features plus a general
assessment.
Cry Number Perception Mean Judgment Range
No. 1 wow 0 . X . . 5 . . . . 1 0 = 2 . 0 0-4.0
No. 8 ow 0 .X .. 5. . . . 1 0 = 2 .1 1.0-3.5
No. 11 (mainly cherp) 0 . . . . 5 . . . . 1 0 = N / A null
N o . 1 3 wyra 0 . . . . X 5 . .
< . y>., {
. 1 0 = 4 . 8 2.0-5.5
No. 14 owa 0 . . . . 5 X . .. 1 0 = 6 . 0 5 . 0 - 7 . 0
ti,.'ij Site f * . iN o . 16 swaa 0 . . . . 5 . X .. 1 0 = 6 .9 4 . 5- 7 .5
Page 15 of 17
-
7/30/2019 051013 Speaker Identification-ocr
16/17
Figure 4. Comparison of Mr. Martins general (but short) samples with the six cries or
calls drawn from the 9-11 telephone call. Twenty such samples were matched
with each cry. The data (x) on the continuum are the means of at least four of the
five features plus a general assessment.
Cry Number Perception Mean Judgment Range
No. 1 wow o Ui
X o II
bo
3.0-7.0
No. 8 "ow 0....5.X... 10 = 6.5 4.0 - 7.5
No. 11 (mainly cherp) 0 . . . . 5 . . . . 1 0 = N /A null
HiNo. 13 wyra O X . . . 5 . . . . 1 0 = 1 . 2 0-3.2
No. 14 owa 0 . . . X 5 . . . . 1 0 = 3 . 9 2.5-5.5
No. 16 "swaa 0 . . X . . 5 . . . . 1 0 = 2 . 5 1.0-4.0
Page 16 of 17
-
7/30/2019 051013 Speaker Identification-ocr
17/17
Figure 5. Comparison of Mr. Martins selected samples (shouts/cries) with the six cries or
calls drawn from the 9-11 telephone call. Twenty such samples were matched
with each cry. The data (x) on the continuum are the means of at least four of the
five features plus a general assessment.
Cry Number Perception Mean Judgment Range
No. 1 wow 0 . . . . 5 . X . . . 10 = 6. 5 4 . 5 - 7 .5
No. 8 ow 0 . . . . 5 . X . . . 1 0 = 6 . 4 4 . 5 - 7 . 0
No. 11 (mainly cherp) 0___5____1 0 = N / A n u l l
No. 1 3 wyra O X . . . 5 . . . . 1 0 = 1 . 1 0 . 5 - 3 . 2
No. 14 owa 0 . . X . 5 . . . . 1 0 = 3 . 1 2 . 0 - 5 . 0
No. 16 swaa 0 . . X . 5 . . . . 1 0 = 2 . 8 1 . 0 - 4 . 5. . _ ^ ' '
Page 17 of 17