

Automatic detection and classification of Microchiropteran echolocation calls: Why the current technology is wrong and what can be done about it

Mark D. Skowronski and John G. Harris
Computational Neuro-Engineering Lab, University of Florida, Gainesville, FL

ABSTRACT

Existing methods for automatic detection and classification of bat calls suffer from two significant limitations: inadequate feature extraction and simplistic modeling. A 10-ms call sampled at 200 kHz is represented by 2000 time-domain samples, yet feature extraction reduces the same call to fewer than 10 global feature values (minimum and maximum frequency, duration, shape features, energy). Such a reduction discards discriminating information from the call, which increases detection and classification errors. Furthermore, detection is typically performed using an energy threshold, which excludes frequency information altogether, and classification is performed using discriminant function analysis (DFA), which uses a single Gaussian kernel per class to model the probability density function of call features. For multi-modal distributions of features (e.g., Tadarida brasiliensis uses FM, CF, and FM-CF calls), a uni-modal Gaussian kernel is woefully inadequate.

What can be done to address these limitations? A viable solution, developed and refined over the past 3 decades, comes from automatic speech recognition (ASR). The ASR research community has shifted from expert-driven models to data-driven models (machine learning), primarily because machine learning methods employ superior statistical models which better account for the variations of human speech. The role of experts in the machine learning paradigm of ASR has focused on incorporating knowledge about the production and perception of speech into feature extraction algorithms.

We have recently applied two machine learning algorithms, a hidden Markov model (HMM) and a Gaussian mixture model (GMM), to the problem of automatic bat call detection and classification in an experiment using about 3000 hand-labeled calls from 5 species (Pipistrellus bodenheimeri, Molossus molossus, Lasiurus borealis, Lasiurus cinereus semotus, and Tadarida brasiliensis). We applied two techniques common in ASR to improve performance: a noise-reduction algorithm called spectral mean subtraction, and the use of temporal derivatives to add local shape information to the feature vectors. For detection, we compared a GMM to a baseline energy method. At equal sensitivity and specificity, the accuracy of the GMM was 96%, while the accuracy of an energy baseline was 68%. For classification, we compared the machine learning algorithms to a baseline DFA classifier using a cross-validation experiment in which 50% of the calls were used to train the models and the remaining 50% were used to test them. Over 20 trials, classification accuracy for the GMM and HMM was 99.4 ± 0.2%, while the accuracy of DFA was 83.1 ± 1.1% (mean ± st. dev.). The experimental results demonstrate the superior performance of machine learning algorithms, which reduce detection and classification errors by roughly an order of magnitude compared to the existing methods (detection error from 32% to 4%, classification error from 16.9% to 0.6%). Machine learning methods have the potential to profoundly impact the use of acoustic studies in bat research.
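The two ASR techniques named in the abstract can be sketched as follows. This is a minimal Python/NumPy illustration, not the authors' implementation: the function names (`frame_features`, `add_deltas`), the Hann window, the frame and hop lengths, and the exact form of spectral mean subtraction (removing each frequency bin's mean log magnitude over all frames) are assumptions.

```python
import numpy as np

def frame_features(x, fs, frame_len, hop):
    """Per-frame spectral peak amplitude (dB) and frequency at the peak (Hz)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    log_spec = np.empty((n_frames, len(freqs)))
    for k in range(n_frames):
        frame = x[k * hop : k * hop + frame_len] * window
        log_spec[k] = 20.0 * np.log10(np.abs(np.fft.rfft(frame)) + 1e-12)
    # Spectral mean subtraction (assumed form): remove each frequency bin's
    # mean over all frames, which suppresses stationary background noise.
    log_spec -= log_spec.mean(axis=0, keepdims=True)
    peak_bin = log_spec.argmax(axis=1)
    peak_amp = log_spec[np.arange(n_frames), peak_bin]
    return np.column_stack([peak_amp, freqs[peak_bin]])

def add_deltas(feats):
    """Append first- and second-order temporal derivatives to each frame."""
    d1 = np.gradient(feats, axis=0)   # local slope of each feature over time
    d2 = np.gradient(d1, axis=0)      # local curvature
    return np.hstack([feats, d1, d2])
```

For a 10-ms call at 200 kHz with 1-ms frames and a 0.5-ms hop, this yields 19 frames of 6 values each, so the time course of the call is retained rather than collapsed into a handful of global features.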

DETECTION

$$E(k) = \frac{1}{L} \sum_{n=1}^{L} x_k^2(n)$$

BIBLIOGRAPHY

[1] M. K. Obrist, “Flexible bat echolocation: the influence of individual, habitat and conspecifics on sonar signal design,” Behav. Ecol. Sociobiol., vol. 36, pp. 207-219, 1995

[2] S. Parsons and G. Jones, “Acoustic identification of twelve species of echolocating bat by discriminant function analysis and artificial neural networks,” J. Exp. Biol., vol. 203, pp. 2641-2656, 2000

[3] M. D. Skowronski and J. G. Harris, “Acoustic detection and classification of microchiroptera using machine learning: lessons learned from automatic speech recognition,” J. Acoust. Soc. Am., 2005, submitted

[4] M. B. Fenton and G. P. Bell, “Recognition of species of insectivorous bats by their echolocation calls,” J. Mammal., vol. 62, no. 2, pp. 233-243, May 1981

[5] M. J. O'Farrell, B. W. Miller, and W. L. Gannon, “Qualitative identification of free-flying bats using the Anabat detector,” J. Mammal., vol. 80, no. 1, pp. 11-23, Jan. 1999

[6] M. K. Obrist, “Flexible bat echolocation: the influence of individual, habitat and conspecifics on sonar signal design,” Behav. Ecol. Sociobiol., vol. 36, pp. 207-219, 1995

[7] M. K. Obrist, R. Boesch, and P. F. Fluckiger, “Variability in echolocation call design of 26 Swiss bat species: consequences, limits and options for automated field identification with a synergetic pattern recognition approach,” Mammalia, vol. 68, no. 4, pp. 307-322, Dec. 2004

[8] R. F. Lance, B. Bollich, C. L. Callahan, and P. L. Leberg, “Surveying forest-bat communities with Anabat detectors,” in Bats and Forests Symposium, R. M. R. Barclay and R. M. Brigham, eds., Res. Br., B.C. Min. For., Victoria, B.C., CA, pp. 175-184, 1996

[9] D. Russo and G. Jones, “Identification of twenty-two bat species (Mammalia: Chiroptera) from Italy by analysis of time-expanded recordings of echolocation calls,” J. Zool., Lond., vol. 258, no. 1, pp. 91-103, Sept. 2002

[10] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Readings in Speech Recognition, A. Waibel and K.-F. Lee, eds., Kaufmann, San Mateo, CA, pp. 267-296, 1990

Conventional method [1,2]:

x_k(n) - frame k of raw signal x(n)
E(k) - energy in frame k
L - frame length (~1 ms)
d(k) - detection decision
θ - energy threshold

$$d(k) = \begin{cases} 1, & E(k) > \theta \\ 0, & E(k) \le \theta \end{cases}$$
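The conventional energy detector can be sketched in a few lines of Python/NumPy. This is an illustration under assumptions: the function names are hypothetical, and frames are taken as non-overlapping blocks of L samples, which the poster does not specify.

```python
import numpy as np

def frame_energy(x, frame_len):
    """E(k): mean squared amplitude of each non-overlapping frame of length L."""
    n_frames = len(x) // frame_len
    frames = np.reshape(x[: n_frames * frame_len], (n_frames, frame_len))
    return np.mean(frames ** 2, axis=1)

def energy_detector(x, frame_len, theta):
    """d(k) = 1 where the frame energy exceeds the threshold theta, else 0."""
    return (frame_energy(x, frame_len) > theta).astype(int)
```

Because the decision depends only on E(k), any loud broadband noise passes the threshold as easily as a call does.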

Gaussian mixture model (GMM) [3]:

$$p(x_k \mid \omega_i) = \sum_{m=1}^{M} w_{i,m}\, G(x_k, \mu_{i,m}, \Sigma_{i,m})$$

x_k - input features for frame k: spectral peak amplitude, frequency at peak amplitude, first- and second-order temporal derivatives
ω_i - class of signal: i = 1 for background frames, i = 2 for call frames
p(x_k|ω_i) - class-conditional probability density for frame k of input feature vector x given class ω_i
G - Gaussian kernel with mean vector μ and covariance matrix Σ, estimated from hand-labeled data
w_{i,m}, μ_{i,m}, Σ_{i,m} - mixture weight, mean, and covariance of the m-th kernel for class ω_i
d(k) - detection decision for frame k
θ - likelihood threshold

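$$d(k) = \begin{cases} 1, & \log p(x_k \mid \omega_2) - \log p(x_k \mid \omega_1) > \theta \\ 0, & \log p(x_k \mid \omega_2) - \log p(x_k \mid \omega_1) \le \theta \end{cases}$$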
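The GMM detection rule amounts to a per-frame log-likelihood ratio test between the call model and the background model. A minimal NumPy sketch follows; the function names and the tuple layout `(weights, means, covs)` are assumptions, and the mixture parameters are taken as already estimated from hand-labeled data.

```python
import numpy as np

def gmm_log_density(X, weights, means, covs):
    """log p(x | class) for each row of X under a Gaussian mixture."""
    X = np.atleast_2d(X)
    n, d = X.shape
    log_terms = np.empty((n, len(weights)))
    for m, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = X - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
        log_terms[:, m] = np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    # Log-sum-exp over the M kernels for numerical stability.
    peak = log_terms.max(axis=1, keepdims=True)
    return (peak + np.log(np.exp(log_terms - peak).sum(axis=1, keepdims=True))).ravel()

def gmm_detector(X, call_gmm, background_gmm, theta=0.0):
    """d(k) = 1 where log p(x_k|call) - log p(x_k|background) exceeds theta."""
    llr = gmm_log_density(X, *call_gmm) - gmm_log_density(X, *background_gmm)
    return (llr > theta).astype(int)
```

Sweeping `theta` trades sensitivity against specificity, which is how an ROC curve for the detector would be traced.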

Receiver operator characteristic (ROC) curve: [figure]

Confusion matrices at equal sensitivity and specificity: 105,090 detection blocks (20 ms). Rows give the detector output (1 or 0); columns T and F give the hand label.

DETECTOR            Output    T        F        Sensitivity
GMM                 1         4314     3985     0.96
                    0         176      96615
Peak energy         1         4132     7819     0.92
                    0         358      92781
Broadband energy    1         3047     31891    0.68
                    0         1443     68709

CLASSIFICATION

Conventional method:

Features [2,4-8]: min frequency, max frequency, frequency at peak amplitude, and duration, extracted from hand-labeled calls using noise-robust methods [3].

Classifier [2,7-9]: discriminant function analysis (DFA) with stratified covariance matrices (quadratic)
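A quadratic DFA with stratified (per-class) covariance matrices fits one Gaussian to each species and assigns a call to the class with the highest posterior score. The sketch below is a generic NumPy illustration of that idea, not the software used in the cited studies; the function names are hypothetical and class priors are assumed to be the training proportions.

```python
import numpy as np

def fit_qda(X, y):
    """Estimate one mean, covariance, and prior per class."""
    params = {}
    for label in np.unique(y):
        Xc = X[y == label]
        params[label] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def qda_predict(X, params):
    """Assign each row of X to the class with the largest quadratic score."""
    labels = list(params)
    scores = np.empty((len(X), len(labels)))
    for j, label in enumerate(labels):
        mean, cov, prior = params[label]
        cov = np.atleast_2d(cov)
        diff = X - mean
        _, logdet = np.linalg.slogdet(cov)
        maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
        # Single Gaussian per class: log prior minus half (log|cov| + Mahalanobis).
        scores[:, j] = np.log(prior) - 0.5 * (logdet + maha)
    return np.array(labels)[scores.argmax(axis=1)]
```

With only one kernel per class, a species whose features form several clusters (for example FM, CF, and FM-CF calls) is forced into a single broad ellipsoid.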

Gaussian mixture model (GMM) classifier:

Same as the GMM detector, except that each ω_i represents a species. The averaged log likelihood over all K frames of a call was calculated for each class, and the classifier output was the label of the class with the maximum averaged log likelihood.

Hidden Markov model (HMM) classifier [10]:

State model of a nonstationary signal; each state represents a pseudo-stationary probability density function with a GMM. One model for each species was trained using the Baum-Welch algorithm on hand-labeled calls. Testing was performed using the Viterbi dynamic programming algorithm, which determines the log likelihood of the single most likely state sequence through a model.
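The Viterbi scoring step can be sketched as below. This is an illustrative NumPy version, assuming the per-frame, per-state log densities (for example from each state's GMM) have already been computed; the function names and argument layout are hypothetical, and Baum-Welch training is not shown.

```python
import numpy as np

def viterbi_log_likelihood(log_emis, log_trans, log_init):
    """Log likelihood of the single most likely state sequence.

    log_emis[t, s]  - log density of frame t under state s
    log_trans[r, s] - log probability of moving from state r to state s
    log_init[s]     - log probability of starting in state s
    """
    delta = log_init + log_emis[0]
    for t in range(1, len(log_emis)):
        # Best predecessor for each state, then add the frame's emission score.
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_emis[t]
    return float(np.max(delta))

def pick_species(scores):
    """Return the species label whose model gave the highest score."""
    return max(scores, key=scores.get)
```

A call would be scored against every species model, and `pick_species` then selects the label with the maximum log likelihood.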

Figure panels, one per species [figures]: Pipistrellus bodenheimeri; Tadarida brasiliensis; Lasiurus cinereus semotus; Lasiurus borealis; Molossus molossus.

Detector output examples [figure]: Each gray column is a hand-labeled call from a pass of 25 calls from L. borealis. The black horizontal line represents θ for equal sensitivity and specificity.
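The equal sensitivity and specificity operating point can be checked against the counts in the detection confusion matrices. The short Python sketch below assumes the four counts per detector are, in order, true positives, false positives, false negatives, and true negatives; that reading is an interpretation of the flattened table, chosen because it reproduces the 105,090 block total and the reported sensitivities.

```python
# (true positives, false positives, false negatives, true negatives), assumed order
counts = {
    "GMM": (4314, 3985, 176, 96615),
    "Peak energy": (4132, 7819, 358, 92781),
    "Broadband energy": (3047, 31891, 1443, 68709),
}

def sensitivity_specificity(tp, fp, fn, tn):
    """Fraction of call blocks detected and fraction of non-call blocks rejected."""
    return tp / (tp + fn), tn / (tn + fp)

for name, cells in counts.items():
    sens, spec = sensitivity_specificity(*cells)
    print(f"{name}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Under this reading, 4,490 of the 105,090 blocks contain a call, so even a small drop in specificity produces thousands of false alarms.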

Classification confusion matrices:

Average and st. dev. over 20 trials of randomly selected test and train calls, 50% test, 50% train. The GMM and HMM results were statistically indistinguishable (t-test, p>0.9). Values are in percent. Pb = Pipistrellus bodenheimeri, Mm = Molossus molossus, Lb = Lasiurus borealis, Lc = Lasiurus cinereus semotus, Tb = Tadarida brasiliensis.

HMM     Pb            Mm            Lb            Lc            Tb            Total
Pb      99.8 ± 0.3    0 ± 0         0 ± 0         0 ± 0         0.2 ± 0.3     99.8 ± 0.3
Mm      0.03 ± 0.2    95.6 ± 1.6    0 ± 0         0 ± 0         4.3 ± 1.6     95.6 ± 1.6
Lb      0 ± 0         0 ± 0         99.8 ± 0.1    0.2 ± 0.1     0 ± 0         99.8 ± 0.1
Lc      0 ± 0         0 ± 0         0.2 ± 0.2     99.8 ± 0.2    0 ± 0         99.8 ± 0.2
Tb      0 ± 0         0.2 ± 0.2     0 ± 0         0 ± 0         99.8 ± 0.2    99.8 ± 0.2
Total                                                                         99.4 ± 0.2

DFA     Pb            Mm            Lb            Lc            Tb            Total
Pb      97.1 ± 0.8    0.2 ± 0.2     2.7 ± 0.8     0 ± 0         0 ± 0         97.1 ± 0.8
Mm      0.6 ± 0.5     76.7 ± 3      4.1 ± 2       17.3 ± 3      1.3 ± 0.6     76.7 ± 3
Lb      1.2 ± 0.4     16.9 ± 1.5    79.6 ± 1.3    0.3 ± 0.3     2.1 ± 0.5     79.6 ± 1.3
Lc      0 ± 0         1.1 ± 0.9     0.3 ± 0.5     89.7 ± 1.4    8.8 ± 0.9     89.7 ± 1.4
Tb      0 ± 0         6.6 ± 1.5     5.4 ± 1.4     16.5 ± 3      71.4 ± 3      71.4 ± 3
Total                                                                         83.1 ± 1.1
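The evaluation protocol (20 trials, each with a fresh random 50/50 split, summarized as mean ± st. dev.) can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the split is assumed to be unstratified, and the nearest-centroid classifier is only a stand-in so that the example runs, not one of the classifiers compared on the poster.

```python
import numpy as np

def repeated_holdout(X, y, fit, predict, n_trials=20, test_fraction=0.5, seed=0):
    """Mean and sample st. dev. of test accuracy over repeated random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_fraction * n))
    accuracies = []
    for _ in range(n_trials):
        order = rng.permutation(n)
        test_idx, train_idx = order[:n_test], order[n_test:]
        model = fit(X[train_idx], y[train_idx])
        accuracies.append(np.mean(predict(X[test_idx], model) == y[test_idx]))
    return float(np.mean(accuracies)), float(np.std(accuracies, ddof=1))

# Stand-in classifier (assumption): assign each call to the nearest class mean.
def fit_centroids(X, y):
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_centroids(X, centroids):
    labels = np.array(list(centroids))
    dists = np.stack([np.linalg.norm(X - centroids[l], axis=1) for l in labels], axis=1)
    return labels[dists.argmin(axis=1)]
```

Any classifier exposing a `fit` and `predict` pair with these signatures could be passed in, so the same harness would compare DFA, GMM, and HMM models on identical splits.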