RECOGNIZING PHONEMES AND THEIR DISTINCTIVE FEATURES
IN THE BRAIN
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF
ELECTRICAL ENGINEERING AND THE
COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Rui Wang
March, 2011
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/wj289qm5838
© 2011 by Wang Rui. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Patrick Suppes, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Stephen Boyd
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Bernard Widrow
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
How the human brain processes phonemes has been a subject of interest for
linguists and neuroscientists for a long time. Electroencephalography (EEG) offers a
promising approach to observe neural activities of phoneme processing in the brain,
thanks to its high temporal resolution, low cost and noninvasiveness. The studies on
Mismatch Negativity (MMN) effects in EEG activities in the 1990s suggested the
existence of a language-specific central phoneme representation in the brain. Recent
findings using magnetoencephalography (MEG) also suggested that the brain encodes
the complex acoustic-phonetic information of speech into the representations of
phonological features before the lexical information is retrieved. However, very little
success has yet been reported in classifying the brain activities associated with
phoneme processing.
In my work, I proposed a classification framework which incorporates Principal
Components Analysis (PCA), cross-validation and support vector machine (SVM)
methods. The initial classification rates were not very good. Progress was made by using a bootstrap aggregation (Bagging) scheme and introducing phase calculations. To
calculate phase, I computed the Discrete Fourier Transform (DFT) of the original
time-domain signal and kept the angles of the finite sample of frequencies. The
resulting EEG spectral representation contains only the phase and frequency
information and ignores the amplitudes. Using this method, the accuracy of classifying averaged test samples of eight consonants improved from 41% to 51%.
Furthermore, the qualitative analysis of the similarities between the EEG
representations, derived from the confusion matrices, illustrates the invariance of brain
and perceptual representations of phonemes. For brain and perceptual representations of consonants, voicing is the most distinguishable feature among voicing, continuant and place of articulation, and vowel-height is more robust than vowel-backness in both brain and perceptual representations of vowels.
By extending and further refining these methods, it is likely that significant classification of other phonemes and features can be achieved.
Acknowledgements
First of all, I would like to express my gratitude to my principal advisor,
Professor Patrick Suppes, for directing me to this interesting area and giving me his
invaluable support and guidance throughout my study. The enthusiasm he has for
research is infectious and encouraging. I want to thank Professor Bernard Widrow and Professor Stephen Boyd for their helpful advice on both my research and academic progress, and for their very insightful comments on the draft of this dissertation. I would also
like to thank Professor Christopher Potts for serving as the chairman of my oral exam
and giving valuable suggestions from the perspective of linguistics.
I am very fortunate to pursue my Ph.D. degree in a supportive and inspiring
environment at Stanford University. Being able to work closely with a group of
outstanding researchers has been important in making my Ph.D. pursuit productive and
enjoyable. I am especially grateful to the members of the Suppes Brain Lab. In particular, I would like to acknowledge Marcos Perreau Guimaraes, who gave helpful advice and tips on the SVM-with-Bagging methods of EEG classification and the similarity analysis
discussed in this dissertation. Dik Kin Wong, Logan Grosenick, Claudio Carvalhaes,
Acacio de Barros and Lene Harbott gave lots of thoughtful ideas and asked motivating
questions in group discussions. Blair Bohannan and Duc Nguyen helped me collect EEG data.
Finally, I would like to thank my family and my parents for their love and support.
Table of Contents
Chapter 1 Introduction ........................................................................................ 1
1.1 Phonemes and distinctive features ............................................................ 1
1.2 Brain activities in phoneme perception .................................................... 7
1.2.1 Measurements of brain activities ...................................................... 7
1.2.2 Brain activities in phoneme perception ............................................. 8
1.3 Motivation and Contribution .................................................................... 9
1.4 Outline of the thesis ................................................................................ 12
Chapter 2 Relevant EEG Data .......................................................................... 13
2.1 Syllables-I data ....................................................................................... 13
2.2 Syllables-III data .................................................................................... 14
2.3 Isolated-vowels data ............................................................................... 17
Chapter 3 Signal Processing Methods for Classifying EEG Data ................. 18
3.1 EEG pre-processing ................................................................................ 18
3.2 Classifiers based on brain-speech mapping ............................................ 22
3.2.1 Methodology ................................................................................... 22
3.2.1.1 Diagram of the classification model ........................................... 22
3.2.1.2 Speech features ........................................................................... 24
3.2.1.3 Parameters search ....................................................................... 25
3.2.1.4 Significance level: p-value ......................................................... 25
3.2.2 Experimental results ........................................................................ 26
3.3 Support Vector Machine (SVM) classifiers ........................................... 29
3.3.1 Methodology ................................................................................... 29
3.3.1.1 SVM with Bootstrap aggregating ............................................... 29
3.3.1.2 Diagram of the classifier ............................................................ 32
3.3.2 Classification results ....................................................................... 38
3.3.2.1 Linear vs. Nonlinear Kernels...................................................... 38
3.3.2.2 Leave-one-subject-out experiment ............................. 41
3.3.2.3 Experiment on the number of trials to calculate average ........... 42
3.3.2.4 Experiment on classifying individual EEG trials using data from
single channel. .................................................................................................. 43
3.4 Summary ................................................................................................. 44
Chapter 4 Frequency Analysis of EEG Signals ............................................... 46
4.1 EEG signals in frequency domain .......................................................... 46
4.2 EEG spectral features ............................................................................. 48
4.3 Classification results ............................................................................... 51
4.3.1 Compare the EEG features based on DFT ...................................... 51
4.3.2 Frequency selection ......................................................................... 54
Chapter 5 Invariant Similarities between Brain and Perceptual
Representations of Phonemes .................................................................................... 58
5.1 Psychological experiments on phoneme perception ............................... 58
5.2 Similarity measurements ........................................................................ 59
5.2.1 Semi-Order and Invariant Partial Order of similarities ................... 59
5.2.2 Partition tree of similarities ............................................................. 61
5.3 Experimental data analysis ..................................................................... 62
5.3.1 Vowels ............................................................................................. 62
5.3.2 Consonants ...................................................................................... 66
Chapter 6 Classifiers Based on Distinctive Features ...................................... 71
6.1 Classifying the distinctive features ......................................................... 71
6.2 Distinctive-feature-based classifiers ....................................................... 74
6.3 Parallel structure vs. Hierarchical structure ............................................ 75
Chapter 7 Conclusion and Prospects ............................................................... 82
List of References ................................................................................................ 84
List of Tables
Table 2.1: The traditional phonological features of the 8 consonants and 4 vowels
...................................................................................................................................... 15
Table 2.2: Chomsky-Halle's distinctive features of the 8 initial consonants ...... 15
Table 3.1: Results of classifying the 4 consonants of Syllables-I data using brain-
speech mapping method ............................................................................................... 27
Table 3.2: Phoneme classification results using SVM-with-Bagging method with
linear or non-linear kernels ........................................................................................... 40
Table 3.3: Leave-one-subject-out classification results using SVM-with-Bagging
method .......................................................................................................................... 41
Table 4.1: Comparing the classification rates of 4 EEG spectral features ........... 52
Table 4.2: SVM-with-Bagging classification results using the EEG phase feature
in the frequency range from 2Hz to 9Hz ...................................................................... 57
Table 5.1: Normalized confusion matrices of 4 vowels ....................................... 63
Table 5.2: Normalized confusion matrices of 8 consonants ................................ 66
Table 6.1: Classifying the distinctive features ..................................................... 73
Table 6.2: Vowels classification results using DF-based classifiers .................... 79
Table 6.3: Initial consonants classification results using DF-based classifiers .... 80
Table 6.4: The results of classifying the combination of voicing and continuant
using SVM-with-Bagging model ................................................................................. 80
List of Figures
Figure 1.1: Spectrograms of the English syllables /pɑ/ and /fɑ/ ............................... 4
Figure 1.2: Comparing the English syllables /fɑ/ and /vɑ/ .................................... 5
Figure 2.1: EEG international 10-20 sensor location system. .............................. 14
Figure 2.2: The layout of EGI-128 sensors system .............................................. 16
Figure 3.1: Example of EEG artifact removal .................................. 20
Figure 3.2: Independent Components Analysis ................................................... 21
Figure 3.3: Diagram of classifying brainwaves of speech stimuli by estimating
the mapping between EEG and speech signal .............................................................. 22
Figure 3.4: Classifying the 4 consonants /p/, /t/, /b/, /g/ of the Syllables-III data
using brain-speech mapping method ............................................................................ 28
Figure 3.5: Diagram of SVM with bootstrap aggregating ................................... 32
Figure 3.6: Diagram of the SVM-with-Bagging EEG classifier .......................... 33
Figure 3.7: Mean validation accuracy on parameter search grid (K,C) using linear
kernel ............................................................................................................................ 39
Figure 3.8: The changing of 8 initial consonants classification rates with respect
to the number of trials to calculate averages ................................................................ 42
Figure 3.9: Performance of SVM-with-Bagging method on classifying 4 initial
consonants using single channel data ........................................................................... 44
Figure 4.1: Average power spectral densities of EEG signal sampled at 250Hz . 47
Figure 4.2: Magnitude and phase frequency response of 1 Hz high-pass filter ... 50
Figure 4.3: Mean classification rates of parameter pair (L, H) obtained from 10-
fold cross validation ..................................................................................................... 56
Figure 5.1: The similarities of brain representation and perceptual representation
of 4 vowels ................................................................................................................... 65
Figure 5.2: Invariant partial order between brainwave and perceptual confusions
of the vowels ................................................................................................................. 65
Figure 5.3: The similarities of brain and perceptual representation of 8
consonants .................................................................................................................... 67
Figure 5.4: Invariant partial order between brainwave confusions and perceptual
confusions of the consonants ........................................................................................ 69
Figure 6.1: Classifying 4 vowels in F1-F2 space ................................................. 76
Figure 6.2: Hierarchical models for classifying 8 classes .................................... 78
Chapter 1
Introduction
1.1 Phonemes and distinctive features
Natural languages are organized hierarchically: sentences are built from phrases,
phrases from words, words from syllables and syllables from phonemes. A phoneme is the smallest segmental unit of speech that differentiates meaningful words (Handbook of IPA, 1999). For example, in American English, the words light and right are pronounced differently only in the initial consonants /l/ and /r/; thus /l/ and /r/ are different phonemes in American English. Two sounds that belong to separate
phonemes in one language or dialect may be variants of one phoneme in another
language or dialect. (The two sounds are called allophones if they belong to the same
phoneme in the language.) It is widely recognized that phonemes are language-
specific. All the phonemes studied in this thesis are American English phonemes. In
most languages, the number of phonemes ranges from twenty to sixty. Although the
pronunciation of a phoneme can be slightly different in various contexts, a phoneme
has relatively stable articulatory and acoustic properties. Thus besides being used to
derive and describe phonological rules, the concept of phoneme is also extensively
used in building computational models of natural speech. Most modern large-vocabulary speech recognition systems and speech synthesis systems are based on
statistical models of acoustic features of phonemes. Phonemes are also very important
for modeling the brain activities of speech production or perception.
Linguists have proposed that phonemes can be further decomposed into
distinctive features. Phonological features such as voice, nasal and stop had been used
to describe speech sounds for a long time before the concept of distinctive feature was
proposed. Those features are commonly referred to as traditional features. Such a feature relates to either articulatory or acoustic properties of the sound. The traditional features are not necessarily binary and so may have more than two values. Ladefoged (1982) gave a good summary of the traditional features in his book. Around the middle of the 20th century, Jakobson and Halle introduced the notion of 'distinctive features' as the smallest language components that are able to differentiate meaningful units (Jakobson & Halle, 1956). Unlike phonemes, distinctive features can overlap
in time, thus they are the suprasegmental elements of language that carry lexical
contrasts. Jakobson and Halle also proposed a set of distinctive features and gave both
acoustic and articulatory descriptions of them. The Jakobson-Halle distinctive features
are binary, which means each feature has two relative values. The most commonly used distinctive-feature system in today's phonology literature is largely taken from Chomsky and Halle's work 'The Sound Pattern of English' (1968). Chomsky and Halle
proposed in total 27 distinctive features. Each feature takes two values: a positive
value, [+], denotes the presence of a feature, while a negative value, [-], indicates its
absence. Their feature set is considered to be “universal”, which means they
“represented the phonetic capabilities of man” and are therefore the same for all
languages. Any phoneme can be represented as a set of distinctive features. For
example, according to the Chomsky-Halle system, /p/ can be represented as: "[-vocalic] [+consonantal] [-high] [-back] [-low] [+anterior] [-coronal] [-voice] [-continuant] [-nasal] [-strident]" (Chomsky & Halle, 1968).
Limited by the availability of brain data, we cannot explore all the distinctive
features in this thesis. We will focus only on the brain representations of phonemes
associated with the following features:
Height and backness of vowels:
To describe the vowels, we use the vowel features of the International Phonetic
Alphabet (IPA) chart: height and backness.
Vowel height is named for the vertical position of the tongue relative to either the
roof of the mouth or the aperture of the jaw. In high vowels, such as /i/ and /u/, the
tongue is positioned high in the mouth, whereas in low vowels, such as /ɑ/, the tongue is positioned low in the mouth. In the IPA chart, the terms close and open are used to describe the jaw as being relatively open or closed. Although described using articulatory terms, vowel-height is nowadays defined as an acoustic quality according to the relative frequency of the first formant (F1).[1] The higher the F1 value, the lower (more open) the vowel is. Height is thus inversely correlated to F1.
Vowel-backness refers to the position of the tongue during the articulation of a
vowel. In front vowels, such as /i/, the tongue is positioned forward in the mouth,
whereas in back vowels, such as /u/, the tongue is positioned towards the back of the
mouth. Similar to vowel-height, vowel-backness is defined according to the frequency
of the second formant (F2). The back vowels have lower F2 values and front vowels
have higher F2 values. Thus vowel-backness is inversely correlated to F2.
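Since both vowel features reduce to formant frequencies, they can be checked directly on the audio. Below is a rough sketch of estimating F1 and F2 by LPC root-finding, assuming the librosa library is available; the file path and function name are hypothetical, and a production analysis would instead use a dedicated tool such as Praat.

    import numpy as np
    import librosa

    def first_two_formants(wav_path, order=12):
        """Rough F1/F2 estimate of a vowel recording via LPC root-finding."""
        y, sr = librosa.load(wav_path, sr=None)
        a = librosa.lpc(y, order=order)       # LPC polynomial coefficients
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]     # keep one root per conjugate pair
        freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
        return freqs[0], freqs[1]             # two lowest resonances: F1, F2

By the relations above, a high vowel such as /i/ should yield a low F1, and a front vowel a high F2.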
Continuant:
Continuant/non-continuant is a feature to describe the manner of articulation. In
the production of a continuant sound, the primary constriction of the vocal tract is not
completely closed, so the air flow past the constriction is not blocked. The fricatives
such as /s/ or /z/ are continuant sounds. When we articulate a fricative sound, the oral
tract is held narrow enough, so that the airflow generates turbulent noise. In speech
spectrograms, this friction noise often shows some clear power concentration in a
specific frequency range. The non-continuant sounds include plosive stops, such as /p/,
/t/ and /g/, and nasal sounds, such as /m/, /n/. In this thesis, we will only focus on brain
representations of plosive stops and fricatives. Plosive stops are characterized by a
spectrographic "burst" with an abrupt onset. Figure 1.1 compares the spectrogram of
the plosive stop /p/ and the fricative consonant /f/, followed by the same vowel /ɑ/.
The spectrogram of the plosive stop /p/ has a sudden burst of energy across the whole frequency range after a short closure at the beginning of the articulation. The formant pattern of the vowel /ɑ/ emerges shortly after the burst, which shows that the duration of the plosive stop is very short. The spectrogram of /f/ is characterized by high-frequency noise with a gradual onset. In addition, the duration of the fricative /f/ is much longer than that of /p/.

[1] Formants are defined by Fant (1960) as "the spectral peaks of the sound spectrum". They are produced by resonances of the vocal tract. The lowest resonant frequency is called the first formant (F1), the second F2, and the third F3.
Figure 1.1: Spectrograms of the English syllables /pɑ/ and /fɑ/ [2]
Voicing:
The feature voicing is used to characterize the vibration of the vocal folds, which creates a periodic source wave during articulation. Voiced sounds are produced with vibration of the vocal folds, and voiceless sounds are produced without it. Periodicity is the main characteristic that distinguishes voiced sounds from voiceless sounds. Figure 1.2 shows the waveforms and the spectrograms of /fɑ/ and /vɑ/. The waveform of /v/ has an obvious periodic structure, which comes from the vibration of the vocal folds. The formant-like low-frequency energy distribution pattern in the spectrogram of /v/ is another indicator of voicing.

[2] The spectrograms in Figure 1.1 and Figure 1.2(b) were generated using Praat (Boersma & Weenink, 2011).
(a) Speech waveforms of syllables /fɑ/ and /vɑ/
(b) Speech spectrograms of syllables /fɑ/ and /vɑ/
Figure 1.2: Comparing the English syllables /fɑ/ and /vɑ/
Phoneticians have also found that voice onset time (VOT), which denotes the time interval between the release of the articulatory occlusion and the onset of low-frequency periodicity, is the primary perceptual cue for distinguishing voiced stops from voiceless stops (Lisker & Abramson, 1964). The voiceless stops in English, such as /p/, /t/ and /k/, feature a short VOT of around 20ms, but the voiced stops usually have negative VOTs, which means that voicing onset leads the articulatory release. Negative VOTs are characterized by a low buzz noise during the consonant closure time. Another phonetic attribute that distinguishes the English voiceless stops /p/, /t/ and /k/ from the voiced stops /b/, /d/ and /g/ is aspiration. Aspiration is very important in separating the two sets in initial position, because both sets are commonly produced with silent closure intervals in such cases (Lisker & Abramson, 1964).
Place of articulation:
Traditionally, the place of articulation of consonants refers to the place and
manner of the obstruction of the airflow going through the vocal tract. In English, the
obstruction may occur at many places along the oral tract, from bilabial (between the
lips) to velar (between back of the tongue and the soft palate). The production of
speech sounds can be simulated as a stimulation source, either periodic for voiced sounds or white noise for voiceless sounds, passing through a filter that reflects the shape of the vocal tract. The different places of articulation modify the frequency response of the vocal-tract filter and change the spectral properties of the output. In Jakobson & Halle's distinctive feature system, the place of articulation is denoted by several features describing the spectrum of the sound, such as grave/acute, flat and sharp. Chomsky & Halle's distinctive feature system uses a series of cavity features, coronal, anterior, high, low, back, etc., to characterize the shape of the oral tract for both consonant and vowel articulation.
For plosive stops, the primary acoustic cue of the place of articulation is mostly in
the transition portion of the F2 of the vowel that follows. The place of articulation in
fricatives changes the resonant frequency of the front vocal cavity and is reflected in the position and shape of the peak in the speech spectrum (Johnson, 2003).
1.2 Brain activities in phoneme perception
1.2.1 Measurements of brain activities
When a neuron is firing, it generates action potentials, which are discrete
electrical pulses, and postsynaptic potentials, which typically last tens or even
hundreds of milliseconds. The summation of postsynaptic potentials of thousands of
approximately synchronized cortical neurons can induce potential fluctuations on the scalp. Thus cortical brain activities can be roughly observed by placing an electrode sensor on the scalp and recording the amplified signal. This technology is
called electroencephalography, or EEG.
The magnetic field produced by the electrical activities of cortical neurons can also be measured; this technique is called magnetoencephalography, or MEG. Both EEG and MEG have high temporal resolution and can record activities at 1kHz or higher sampling rates. However, the blurring of the potentials caused by the skull, which is a high-resistance conductor, can be avoided by recording the magnetic field. Thus MEG has better spatial resolution than EEG and provides more precise localization. On the other hand, since the MEG signals are on the order of a few femtoteslas, shielding from external magnetic signals, including the Earth's magnetic field, is necessary. The magnetic shielding equipment, usually a magnetically shielded room, is very expensive and not portable.
Since the development of the functional Magnetic Resonance Imaging (fMRI) technique in the early 1990s, fMRI has rapidly come to dominate the brain-mapping field for its non-invasiveness and high spatial resolution, up to 1mm. fMRI measures the increased blood flow to regions of increased neural activity, marked by the blood-oxygen-level dependent (BOLD) signal in magnetic resonance imaging (MRI) scans. The BOLD response follows the increase in neural activity with a delay of approximately 1 to 5 seconds and rises to a peak over 4 to 5 seconds. Therefore fMRI has very low temporal resolution. As we know, English speech is delivered at a rate of roughly 3 words per second. Thus fMRI by itself cannot be used to observe the details of fast-changing brain activities, such as the processing of phonological or lexical information.
1.2.2 Brain activities in phoneme perception
How the human brain processes phonemes has been a subject of interest for
linguists and neuroscientists for a long time. Historically, behavioral experiments of
phoneme perception were carried out to explore the psychological discrimination of
phonemes under various conditions (Miller & Nicely, 1955; Pickett, 1957; Wang &
Bilger, 1973; Phatak et al., 2008). More detailed introductions to the behavioral
experiments can be found in Chapter 5. Since the discovery of Mismatch Negativity
(MMN) effects in EEG activities (Näätänen et al., 1978), MMN and its magnetic equivalent, MMNm, have been used extensively to measure the neural activities reflecting subjects' ability to discriminate phonemes (see Näätänen, 2001 for a review). These results also suggested the existence of a language-specific central phoneme representation in the brain and pointed out its probable left-hemisphere locus (Näätänen, 1997). More recently, using MEG recordings, human brain activities indicating the perception of acoustic cues and more complex phonological features were examined (Obleser et al., 2004; Eulitz et al., 2007; Frye et al., 2007).
findings suggested that the brain encodes the complex acoustic-phonetic information
of speech into the representations of phonological features before the lexical
information is retrieved. Invasive recordings of animal neural responses to human speech also demonstrate the temporal and spatial characteristics of the cortical activities reflecting the distinctive features of phonemes (Steinschneider et al., 1995; Steinschneider et al., 2003). These recordings also show that the discriminability of the neural-activity patterns matches the animals' behavioral discrimination of phonemes (Engineer, 2008) as well as human psychological confusions of phonemes (Mesgarani, 2008). fMRI also provides a non-invasive method to pinpoint the location of cortical activities of phoneme perception in the healthy human brain (Liebenthal et al., 2005). Formisano (2008) reported success in classifying brain activities of isolated vowels using fMRI. Considering the limited temporal resolution of the fMRI technique, it would be difficult to extend this work to phonemes that have a more complex time course than vowels presented in isolation.
1.3 Motivation and Contribution
Among the most commonly used technologies for observing human brain activities, EEG provides a promising method to examine the brain activities of natural language processing because of its low cost, high temporal resolution and non-invasiveness. To study the brain activities of language processing using EEG signals, we need to solve two problems.
First, the EEG recordings are usually large amounts of data contaminated by a lot of noise. Appropriate signal-processing or statistical methods are needed to reduce the noise and extract the meaningful components of the signal that carry the target information. The ideal scenario is that the data can be compressed into parameters, called EEG feature parameters, without losing much useful information.
Second, we need to develop mathematical models to describe the properties and distributions of the EEG feature parameters of the language-processing activity in the brain. The complexity and the computational cost of constructing the mathematical model are highly related to the number of EEG feature parameters. Generally, the smaller the EEG parameter list, the simpler the mathematical model required to describe it.
To demonstrate the effectiveness of an EEG feature parameter or a mathematical
model, one of the most convincing approaches is to test whether the unknown EEG
samples, represented as the feature parameters, can be classified using the
mathematical model. Researchers in our lab have been working on the statistical problem of classifying EEG brainwaves associated with stimuli of language constituents since the 1990s. We successfully classified brainwaves of sentences (Suppes & Han, 1998; Suppes & Han, 1999; Wong, Perreau-Guimaraes, et al., 2004), words (Suppes & Lu, 1997) and syllables (Suppes, Han, et al., 1999). Classifying the brainwaves of auditory stimuli is often more challenging than classifying those of visually presented linguistic stimuli (Suppes & Han, 1999). An experiment in classifying the brainwaves of phonemes was also reported (Suppes, Perreau-Guimaraes et al., 2009); in this experiment, 42% of trials of 4 consonants were correctly classified. However, the classification method was tested on the syllable data of the 1997 experiment, which collected only 6 channels of 800 trials from each subject. The size of that dataset is insufficient to test a more complicated classification model.
With this consideration in mind, I designed and implemented a new experiment to
collect EEG data of syllables. The experiment focused on 8 consonants and 4 vowels,
which were carefully selected to represent 5 distinctive features: voicing, continuant,
place of articulation, vowel-height and vowel-backness. The new dataset includes in
total 21540 trials for all the 32 syllables. The number of trials from each subject
ranges from 3584 to 7168. A supplemental dataset of isolated vowels was also
collected in 2010.
The phoneme recognition results reported by Suppes, Perreau-Guimaraes et al. (2009) were obtained using Singular Value Decomposition (SVD) and Linear Discriminant Classification (LDC) methods in a framework with two-layer cross-validation. I kept the original EEG pre-processing module of the framework and modified the classification methods to implement out-of-sample testing, classification of averaged trials, and classification using SVM with bootstrap aggregating (Bagging). By introducing SVM, we were able to implement non-linear classification. However, the classification results show that the non-linear methods cannot improve the classification accuracy. The modified algorithm with a linear kernel can classify 46% of 426 averaged test samples of 8 consonants and 69% of 141 averaged test samples of 4 isolated vowels.
I also proposed a new approach to classify the brainwaves of auditory stimuli:
classifying by estimating the mapping relations between the speech signal and the
EEG brainwave signal. A preliminary study on estimating the linear transformation between the brainwaves and speech stimuli has been carried out. For the best subject of the EEG data collected in 1997, the classification model can
recognize 45% of individual test trials of 4 consonants, which is slightly better than
the result of SVD-LDC methods.
Furthermore, using the classification model with Bagging SVM, I explored the
frequency-domain representations of EEG brainwaves evoked by phoneme stimuli. I
found that the EEG signals can be classified without loss of accuracy when the
amplitude information of DFTs is eliminated. For classifying the averaged test
samples of 8 consonants, the accuracy rate increased to 51% if only the phase pattern
of frequency components from 2Hz to 9Hz is used.
I analyzed the similarities between the EEG representations, derived from the confusion matrices obtained using the Bagging SVM methods, and demonstrated the invariant similarities of brain and perceptual representations of phonemes. For brain and perceptual representations of consonants, voicing is the most distinguishable feature among voicing, continuant and place of articulation, and vowel-height is more robust than vowel-backness in both brain and perceptual representations of vowels.
I further refined the Bagging SVM classification model based on the findings
that the brainwaves evoked by different phonemes with similar phonological
properties are close to each other in the EEG feature domain. A simplified
classification model based on distinctive features was proposed. In this model,
brainwaves of phonemes are classified using the ensemble of binary classifiers, one
for each distinctive feature. The binary classifiers can be organized hierarchically.
This simplified classifier can recognize 47% of test samples of the 8 consonants and
65% of test samples of the 4 isolated vowels, which is slightly worse than the original
Bagging SVM classification model. However, the distinctive-feature-based classifier
can be directly extended to classify more phonemes.
1.4 Outline of the thesis
Chapter 2 gives the detailed description of the EEG data used in my work and the
experiment setup for collecting the EEG recordings.
In Chapter 3, I introduce two models to classify brainwaves of phonemes: One is
the brain-speech mapping method and the other is the classifier with Bagging SVM.
The first method was tested on classifying the individual trials of 4 consonants using
Syllables-I and Syllables-III data. The second method, which focuses on classifying
averaged test samples, was tested using Syllables-III and isolated-vowels data.
In Chapter 4, I examine EEG representations of phonemes in the frequency
domain by classifying the EEG responses to phonemes using four EEG spectral features. The feature DFT consists of the Discrete Fourier Transform (DFT) coefficients of the EEG time-domain signals, computed channel by channel. The feature AMP is composed of the amplitudes of all the frequency components of the DFT coefficients. In the features PHS-1 and PHS-2, the amplitudes of the DFT coefficients are eliminated and only the phase information is kept. The classification results of the four spectral features are discussed. I also identify the frequency range of the rhythmic EEG activities related to phoneme perception using our experimental data.
I analyze the similarities between the brainwave representations using the
classification confusion matrices in Chapter 5. Graphs of semiorders and hierarchical trees are used to illustrate the similarities. The brain similarities of the
phonemes are compared with perceptual similarities of phonemes obtained from
psychological experiments.
In Chapter 6, the results of classifying distinctive features using Bagging SVM
methods are discussed. I also extend the Bagging SVM algorithm to classify speech
stimuli based on distinctive features and present the experimental results.
Chapter 7 concludes the thesis.
Chapter 2
Relevant EEG Data
Three datasets of EEG recordings of phoneme perception are used in our study.
All these EEG experimental data were collected in our laboratory.
2.1 Syllables-I data
These EEG recordings of auditory syllable stimuli were collected in November 1998 as an exploratory experiment. The experiment addressed 8 consonant-vowel (CV) format syllables and 24 syllable pairs made up of 4 consonants (/p/, /t/, /b/ and /g/) and 3 vowels (/ɑ/ as in spa, /u/ as in zoo and /oʊ/ as in boat). The stimulus syllables are listed below:
/tu/, /pɑ/, /gu/, /bɑ/, /toʊ/, /pu/, /goʊ/, /bu/
/bɑbu/, /bɑpɑ/, /bubɑ/, /goʊgu/, /goʊtu/, /gugoʊ/, /gutoʊ/, /gutoʊ/
/pɑpu/, /pubɑ/, /pupɑ/, /tugoʊ/, /tugu/, /tutoʊ/, /toʊgoʊ/, /toʊtu/
/bɑpu/, /bupɑ/, /bupu/, /goʊtoʊ/, /pɑbɑ/, /pɑbu/, /pubu/, /toʊgu/
All the 32 speech stimuli were spoken by a male American-English native speaker, who is also the speaker of the stimuli in the other two experiments. We presented the auditory stimuli to participants via stereo speakers. The 32 stimuli were randomized and presented to the subject 12 times as the first part of the session. Then, after a short break, all the stimuli were presented again 13 times as the second part. Nine subjects participated in the experiment, but only the data from 3 subjects are used in this thesis. The subjects were instructed to listen to the stimuli attentively; no behavioral response was required. The trial length, measured from the onset of one syllable to the onset of the next, is 2050ms. In total, 800 trials were collected from each subject. The Model-12 Grass amplifiers and Neuroscan's Version 3.0 software were used to measure and record the EEG data. Sensors were attached to the scalp of the subjects according to the standard EEG 10-20 system, as shown in Figure 2.1.
Figure 2.1: EEG international 10-20 sensor location system.
Only 6 sensors, C3, C4, T3, T4, T5 and T6, were connected in the first part. In the
second part, an additional sensor, Cz, was also connected. Previous analysis results on this dataset were reported in Suppes (1999) and Suppes (2009).
2.2 Syllables-III data
In 2008, we collected a new dataset of EEG recordings of the perception of 32 CV-format syllables, each made of one of the eight consonants /p/, /t/, /b/, /g/, /f/, /s/, /v/ and /z/ and one of the four vowels /i/ (see), /æ/ (cat), /u/ (zoo) and /ɑ/ (spa). The experiment was designed with several considerations in mind. First of all, we wanted to check whether the significant classification accuracies on initial consonants using the Syllables-I EEG data (Suppes, 2009) are repeatable. Second, we further extended the initial consonants from the 4 plosive stops to a set of 8 consonants to investigate three major phonological features of consonants: voicing, continuant (stop versus fricative) and place of articulation. We also carefully selected the vowels so that they lie at the corners of the American-English vowel space and hence are acoustically
well separated. Table 2.1 and Table 2.2 list the phonological features of the consonants and vowels. Moreover, the EEG collection techniques have been significantly improved in recent years; the newest equipment, which supports up to 128 sensors, can record EEG activities with much higher spatial resolution.
Table 2.1: The traditional phonological features of the 8 consonants and 4 vowels
                 voiceless            voiced
                 Labial   Alveolar    Labial   Alveolar/Velar
    stop         p        t           b        g
    fricative    f        s           v        z

                          height
    backness              open     close
    front                 æ        i
    back                  ɑ        u
Table 2.2: Chomsky-Halle's distinctive features of the 8 initial consonants

                            p    t    b    g    f    s    v    z
    Cavity    High          -    -    -    +    -    -    -    -
              Back          -    -    -    +    -    -    -    -
              Coronal       -    +    -    -    -    +    -    +
              Anterior      +    +    +    -    +    +    +    +
    Source    Voiced        -    -    +    +    -    -    +    +
              Strident      -    -    -    -    +    +    +    +
    Manner    Continuant    -    -    -    -    +    +    +    +
We recorded the Syllables-III data using EGI's Geodesic EEG System (GES) 300 platform. In order to take the variation of pronunciation into account, the auditory stimuli include 7 repetitions of each of the 32 syllables read by a male American-English native speaker. The recordings are saved as 44.1kHz mono WAV files. In a brainwave collection session, all 224 sound stimuli were pseudo-randomly presented to the subjects 4 times using stereo speakers. The participating subjects were instructed to listen to the sounds attentively while looking at a focus point on the computer screen. We recorded the EEG data at a sampling rate of 1000Hz using the
EGI 128-sensor system, with 124 monopolar channels and a common reference at Cz. Two bipolar reference channels for eye movements were also recorded. The locations of the sensors are shown in Figure 2.2. The length of one trial of brainwave recording is one second. In total, 24 sessions from 4 subjects were collected. The number of trials from one subject ranges from 3584 to 7168. The complete dataset includes about 672 brainwave recordings of each syllable. Therefore we have approximately 672×4 = 2688 recordings for each consonant and 672×8 = 5376 recordings for each vowel.
Figure 2.2: The layout of EGI-128 sensors system
2.3 Isolated-vowels data
The isolated-vowels data, recorded in 2010, are complementary to the Syllables-III data. We recorded 7 repetitions of the 4 vowels used in the Syllables-III data, spoken by the same speaker. In one EEG collection session, the 28 stimuli were presented to the subject randomly 32 times, using the same experimental setup as for Syllables-III. We recorded 8 sessions from one subject and collected 1792 trials for each isolated vowel.
Chapter 3
Signal Processing Methods for
Classifying EEG Data
3.1 EEG pre-processing
The potential changes on the scalp generated by cortical neuron activities are as small as a few microvolts. The EEG signals of interest are usually submerged in a large amount of electrical noise of two major types: noise from the equipment and environment, and noise from other biological sources. The environmental noise includes AC power-supply noise at 50 or 60Hz, noise from the computers used for presenting stimuli and recording EEG data, and noise from the analog amplifiers, which amplify the EEG signal by several orders of magnitude. The biological sources include eye blinks, heart beats and muscle contractions. Therefore, before applying any analysis or classification methods, we need to pre-process the EEG data to obtain cleaner signals.
We used digital filters to remove most of the environmental noise. A high-pass filter with a cut-off frequency of 1Hz can remove the DC offset of the equipment and the slow artifacts associated with skin-conductance fluctuations. The AC electricity noise can be removed by a notch filter at 60Hz. Our previous studies on EEG responses to language stimuli show that the frequency components between 2 and 30Hz are more important for classification (Suppes, 1999). Thus, in the present research, we usually down-sample the EEG signals to 50-60Hz after applying anti-aliasing filters. The down-sampling significantly reduces the dimension of the data to be analyzed and removes the high-frequency noise as well.
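As a concrete illustration, here is a minimal sketch of this pre-processing chain using SciPy. The cut-off values follow the text; the function name, array layout and exact filter orders are assumptions.

    import numpy as np
    from scipy import signal

    def preprocess_eeg(eeg, fs=1000.0):
        """Pre-process one EEG recording (channels x samples) as described above."""
        # 1 Hz high-pass (4th-order Butterworth, zero-phase) removes the DC
        # offset and slow skin-conductance drifts
        b_hp, a_hp = signal.butter(4, 1.0, btype="highpass", fs=fs)
        eeg = signal.filtfilt(b_hp, a_hp, eeg, axis=-1)

        # 60 Hz notch filter removes AC power-line noise
        b_n, a_n = signal.iirnotch(60.0, Q=30.0, fs=fs)
        eeg = signal.filtfilt(b_n, a_n, eeg, axis=-1)

        # Anti-aliased down-sampling in two stages of 4 (1000 Hz -> 62.5 Hz)
        eeg = signal.decimate(eeg, 4, axis=-1, zerophase=True)
        return signal.decimate(eeg, 4, axis=-1, zerophase=True)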
The noise from other biological activities has a different character. Figure 3.1(a) shows 4 seconds of EEG recordings from the first 60 sensors of the EGI-128 sensor system, sampled at 1kHz. The muscle-contraction noise is characterized by a burst of high-frequency noise and usually disappears after low-pass filtering, as seen in the down-sampled data in Figure 3.1(b). Eye-blink artifacts are short, high-amplitude peak waves, commonly seen at the prefrontal electrodes.
Figure 3.1(a): Original EEG recording with 1kHz sampling rate
Figure 3.1(b): EEG signal after high-pass filtering and down-sampling to 62.5Hz
Figure 3.1(c): Resulting EEG signal after removing eye artifacts
Figure 3.1: Example of EEG artifact removal
The eye-movement artifacts can be removed by visually inspecting the trials and rejecting the contaminated ones. But this is not practical in our study considering the large amount of data involved; for instance, more than 20000 trials were collected in the Syllables-III experiment. Since eye blinks and movements are usually independent of the brain responses to the stimuli, we can eliminate the artifacts from eye movements efficiently using Independent Component Analysis (ICA).
The ICA method solves the problem illustrated in Figure 3.2. Assume there are n independent signal sources in the target region, $s_1, s_2, \ldots, s_n$, and the source signals are transmitted instantaneously to the m receptors on the scalp, $r_1, r_2, \ldots, r_m$. At each receptor, the received signal is a weighted mixture of the sources:

$r_i = \sum_{j=1}^{n} a_{ij} s_j, \quad i = 1, 2, \ldots, m$    (3.1)

Then we have $r = As$, where $r \in R^m$, $s \in R^n$ and $A \in R^{m \times n}$. A is often referred to as the mixing matrix. When $n = m$ and A is invertible, let $W = A^{-1}$; then the sources can be recovered as $s = Wr$. Here W is called the un-mixing matrix. In practice, A is always unknown, and the ultimate goal of ICA is to find the un-mixing matrix that maximizes the statistical independence of the sources.
In our study, we took all the signals from the monopolar channels as the received signal r and estimated the un-mixing matrix using the Infomax method (Jung et al., 2000; Bell & Sejnowski, 1995). Next, we calculated the correlation coefficients between the derived sources and the signals from each of the reference channels, which were placed around the eyes to record the horizontal and vertical eye movements. If the method works well, most of the correlation coefficients should be very low. We then remove the sources that are highly correlated with the reference channels by setting them to zero; more specifically, we removed all the independent sources with a correlation coefficient higher than 0.2 in our experiments. Finally, the remaining sources are re-mixed to reconstruct the monopolar signals. Figure 3.1(c) shows the reconstructed EEG monopolar signals using the signals in Figure 3.1(b) as the input. We can see that the eye-blink artifacts were removed.
Figure 3.2: Independent Components Analysis
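A minimal sketch of this artifact-removal step is shown below. The thesis uses the Infomax algorithm; here scikit-learn's FastICA stands in as a readily available ICA implementation, the 0.2 correlation threshold follows the text, and the variable names and data shapes are assumptions.

    import numpy as np
    from sklearn.decomposition import FastICA

    def remove_eye_artifacts(eeg, eog, threshold=0.2):
        """eeg: (samples, channels) monopolar EEG; eog: (samples, k) eye references."""
        ica = FastICA(n_components=eeg.shape[1], max_iter=1000)
        sources = ica.fit_transform(eeg)      # estimated independent sources

        # Zero out every source that correlates strongly with an eye channel
        for ref in eog.T:
            for k in range(sources.shape[1]):
                if abs(np.corrcoef(sources[:, k], ref)[0, 1]) > threshold:
                    sources[:, k] = 0.0

        # Re-mix the remaining sources to reconstruct cleaned monopolar signals
        return ica.inverse_transform(sources)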
3.2 Classifiers based on brain-speech mapping
3.2.1 Methodology
This section introduces the preliminary study of classifying EEG brainwaves of
phoneme stimuli by estimating the mapping relations between the speech signal and
the EEG brainwave signal. The basic idea underlying this approach is to consider the
whole phoneme perception process in the brain as the activity of a system in a “black
box”. The only observable aspects of the system are the input, which is the sound
waves of speech stimuli, and the EEG brainwave as the output. Hence, if we could
estimate the inverse system, we would be able to map brainwaves back to approximate
speech inputs, and classify the brainwaves by comparing the estimated inputs to the
speech prototype candidates.
3.2.1.1 Diagram of the classification model
The classification procedure is shown in Figure 3.3.
Figure 3.3: Diagram of classifying brainwaves of speech stimuli by estimating the
mapping between EEG and speech signal
In the pre-processing phase, EEG signals are down-sampled and filtered. The speech waves of the stimuli are represented by feature vectors of reduced size, and at the same time a prototype of the speech signal is created for each phoneme; details of the speech signal processing are given later. Then the EEG data are randomly divided into a training set and a test set. The training/test partition is balanced over all the stimuli; in other words, the numbers of training trials associated with each stimulus are equal. We compute the optimal mapping relation $\hat{F}$, which minimizes the mean-square estimation error between F(x) and y. Figure 3.3 shows the scheme of estimating one global transformation that is applied to all the classes. Alternatively, we could also assume that the transformation between brainwaves and speech is unique to each phoneme. In this case, N transformations should be estimated for the N-class classification problem.
test sample x is classified as:
2
,,1
~)(F̂minargˆk
Nk
Yk
x
for global transformation (3.2)
or
2
,,1
~)(F̂minargˆkk
Nk
Yk
x
for class-specific transformations (3.3)
For exploratory purposes, we assume the transformation is linear, i.e. $F(x) = Ax$. If we estimate a linear transformation using m training samples $(x^{(i)}, y^{(i)}), i = 1, \ldots, m$, where $x^{(i)} \in R^n$ denotes the observed EEG signal and $y^{(i)} \in R^p$ the features of the associated speech stimulus, then the optimal linear transformation $A \in R^{p \times n}$ is the solution of the least-squares optimization problem

$\min_A \sum_{i=1}^{m} \| A x^{(i)} - y^{(i)} \|^2$    (3.4)

which can be easily calculated as:

$A = \left( \sum_{i=1}^{m} y^{(i)} x^{(i)T} \right) \left( \sum_{i=1}^{m} x^{(i)} x^{(i)T} \right)^{-1}$    (3.5)
When the number of training samples m is too small compared to the number of variables in x, the matrix $\sum_{i=1}^{m} x^{(i)} x^{(i)T}$ will be close to singular and non-invertible. Thus, to get an accurate estimation of the transformation matrix, we need sufficient training samples, and the EEG observation vector cannot be too long.
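A compact numerical sketch of this classifier in its global-transformation variant follows: equation (3.5) is solved with a standard least-squares routine, and test trials are labeled by the nearest speech prototype as in (3.2). The array shapes and names are assumptions.

    import numpy as np

    def fit_linear_mapping(X_train, Y_train):
        """Least-squares estimate of A in y = Ax (equations 3.4-3.5).

        X_train: (m, n) EEG observations; Y_train: (m, p) speech features.
        np.linalg.lstsq solves min ||X A^T - Y||^2, equivalent to (3.5).
        """
        A_T, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)
        return A_T.T                          # A has shape (p, n)

    def classify(X_test, A, prototypes):
        """Assign each test trial to the nearest prototype Y~_k (equation 3.2)."""
        Y_hat = X_test @ A.T                  # estimated speech features
        # Squared Euclidean distance to every prototype, arg-min over k
        d = ((Y_hat[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)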
3.2.1.2 Speech features
To appropriately represent the speech stimuli, we hope to find speech features of a size comparable to the EEG brainwave features that are also able to distinguish different phonemes. The Mel-Frequency Cepstral Coefficients (MFCC), which describe the temporal-spectral distributions of speech, have proved to be successful features for representing speech signals and are commonly used in modern speech recognition systems (Rabiner & Juang, 1993). So we use MFCCs as the speech features and to construct the prototypes of the phoneme stimuli. The speech pre-processing procedure
includes the following steps:
1) We manually examine the audio files of the speech stimuli and mark the beginning and end times of the phonemes. Because of co-articulation, the boundaries between adjacent phonemes are not well defined; as a result, the segmentation of phonemes can only be roughly determined.
2) The speech segments of the targeted phonemes are cut into 30ms short frames, with 20ms overlap.
3) Calculate 12th-order MFCC speech features for each frame.
4) For each stimulus, compute the average of the feature vectors across all the frames of the targeted phoneme. This average vector is the training target vector Y.
5) Average the MFCC features of all the frames corresponding to the initial consonant k to get the prototype $\tilde{Y}_k$ (a sketch of steps 2-5 follows this list).
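The following is a minimal sketch of steps 2-5, assuming the librosa library for the MFCC computation; the file path, segment times and helper names are hypothetical.

    import numpy as np
    import librosa

    def phoneme_mfcc_target(wav_path, t_start, t_end, n_mfcc=12):
        """Average MFCC vector of one marked phoneme segment (steps 2-4)."""
        y, sr = librosa.load(wav_path, sr=None)
        seg = y[int(t_start * sr):int(t_end * sr)]
        # 30 ms frames with 20 ms overlap, i.e. a 10 ms hop
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=int(0.030 * sr),
                                    hop_length=int(0.010 * sr))
        return mfcc.mean(axis=1)              # average across frames -> Y

    def build_prototype(target_vectors_for_k):
        """Step 5: prototype of consonant k = mean of its target vectors."""
        return np.mean(target_vectors_for_k, axis=0)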
3.2.1.3 Parameters search
Our previous studies show that when we classify EEG signals in the time domain, the classification rates may be improved if we use only the observations within a given temporal interval (Wong, 2004). But the best temporal interval is data-specific and task-specific. In our experiments, we used Q-fold cross-validation to search for the best interval on a parameter grid. The two parameters to be optimized are the start point of the interval, s, and the interval duration, d. The possible parameter candidates form a search grid (s, d). In cross-validation, all the training trials are randomly divided into Q even groups. At each step of the validation, one of the Q groups is used for testing and the other Q-1 groups are combined for training. A classification rate is obtained for each point of the parameter search grid. The optimal parameters are chosen by the criterion of maximizing the average number of correctly classified trials across the Q validation tests.
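A schematic sketch of this Q-fold grid search is given below. The classifier itself is abstracted as a callable that trains on one fold split and returns the number of correctly classified validation trials; grid values and names are hypothetical.

    import numpy as np

    def cv_interval_search(trials, labels, grid_s, grid_d, fit_score, Q=8):
        """Pick the interval (start s, duration d) by Q-fold cross-validation.

        trials: (m, channels, samples) array of training trials.
        """
        rng = np.random.default_rng(0)
        folds = np.array_split(rng.permutation(len(trials)), Q)
        best, best_sd = -1, None
        for s in grid_s:
            for d in grid_d:
                X = trials[:, :, s:s + d]     # restrict to candidate interval
                correct = 0
                for q in range(Q):
                    val = folds[q]
                    train = np.concatenate([folds[j] for j in range(Q) if j != q])
                    correct += fit_score(X[train], labels[train], X[val], labels[val])
                if correct > best:            # maximize correct validation trials
                    best, best_sd = correct, (s, d)
        return best_sd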
3.2.1.4 Significance level: p-value
P-value is a statistical measure of the significance of an experimental result. Consider coin-flipping experiments. If in one experiment we get 7 heads out of 10 flips, while in another experiment 70 heads show up in 100 flips, then although both experiments show an observed head frequency of 70%, we are more assured in claiming that the coin used in the second one is biased. The p-value is the probability that the outcome is at least as extreme as the actually observed value, assuming the null hypothesis is true. In this example, the null hypothesis (H0) is that the coin is fair, i.e., the chance of observing a head in one flip is 0.5. Then the p-value of the first experiment is:

$\Pr(\text{heads} \ge 7 \mid H_0) = \sum_{i=7}^{10} \binom{10}{i} 0.5^i (1 - 0.5)^{10-i} \approx 0.1719$    (3.6)

For the second experiment:

$\Pr(\text{heads} \ge 70 \mid H_0) = \sum_{i=70}^{100} \binom{100}{i} 0.5^i (1 - 0.5)^{100-i} \approx 3.93 \times 10^{-5}$    (3.7)
The smaller the p-value, the more confident we are in rejecting the null hypothesis, and hence the more significant the result is.

In the N-class EEG classification problem, the null hypothesis is that the classifier cannot recognize any test sample and randomly assigns a label to each sample. Under the null hypothesis, the probability that one test sample is correctly recognized is p = 1/N. Thus, if k of m test samples are classified correctly in one experiment, the p-value of the result is:

$\text{p-value} = \sum_{i=k}^{m} \binom{m}{i} p^i (1 - p)^{m-i}$    (3.8)
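Equation (3.8) is the survival function of a binomial distribution, so it can be computed directly; a one-line sketch with SciPy:

    from scipy.stats import binom

    def classification_p_value(k, m, n_classes):
        """P(at least k of m test samples correct under chance), equation (3.8)."""
        return binom.sf(k - 1, m, 1.0 / n_classes)

    # The coin examples above: 7/10 heads -> ~0.1719; 70/100 heads -> ~3.93e-5
    print(binom.sf(6, 10, 0.5), binom.sf(69, 100, 0.5))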
3.2.2 Experimental results
We first tested the classifier based on brainwave-speech mapping by classifying the 4 initial consonants, /p/, /t/, /b/ and /g/, of the Syllables-I data. The six bipolar-channel data, namely C3-T5, C4-T6, T3-C3, T4-C4, T5-T3 and T6-T4, were down-sampled to 50Hz and passed through a 4th-order Butterworth band-pass filter with cut-off frequencies of 2Hz and 20Hz. The consonants were classified using data from each channel and each subject separately. For each subject, we collected 24 EEG trials of each stimulus. We randomly drew 16 trials for training and used the remaining 8 for testing. Since there are 8 syllables that start with a given consonant, we have in total 16×8 = 128 training trials and 8×8 = 64 test trials for each class. The total number of test trials is 256. The EEG interval, defined by the start time s and duration d, is optimized using 8-fold validation. Table 3.1 summarizes the classification rates and significance levels of the results.
We can see that the classification accuracies show large variations among subjects. For subject AB, 44.9% of the 256 test trials were correctly classified using the best channels, with a significance level of p-value < 10^-11. Subject PS got slightly lower classification accuracy, 39.8% with p-value < 10^-6. The significance levels of these results are high enough to prove the effectiveness of the model on those subjects. However, the classification model barely works for subject SO. The classifier estimating class-specific transformations works better than the classifier using a global transformation.
Table 3.1: Results of classifying the 4 consonants of Syllables-I data using brain-speech mapping method
                             class-specific transformation    global transformation
    subject    channel       rate       p-value               rate       p-value
    AB         C3-T5         37.9%      <10^-5                31.6%      0.0098
               C4-T6         44.9%      <10^-11               35.5%      <10^-3
               T3-C3         44.9%      <10^-11               33.2%      0.0020
               T4-C4         44.9%      <10^-11               31.6%      0.0098
               T5-T3         39.8%      <10^-6                32.8%      0.0030
               T6-T4         44.9%      <10^-11               28.1%      0.1399
    PS         C3-T5         37.5%      <10^-5                28.1%      0.1399
               C4-T6         31.3%      0.0141                30.5%      0.0275
               T3-C3         37.1%      <10^-4                33.2%      0.0020
               T4-C4         39.8%      <10^-6                31.6%      0.0098
               T5-T3         38.7%      <10^-6                34.4%      <10^-3
               T6-T4         36.3%      <10^-4                34.0%      <10^-3
    SO         C3-T5         25.4%      0.4665                28.5%      0.1110
               C4-T6         29.7%      0.0504                28.5%      0.1110
               T3-C3         24.2%      0.6369                27.0%      0.2558
               T4-C4         30.5%      0.0275                32.0%      0.0068
               T5-T3         26.2%      0.3553                24.6%      0.5812
               T6-T4         25.0%      0.5240                27.0%      0.2558
To check how the brain-speech mapping method performs when a large amount
of training data is available, we classified the same 4 initial consonants of the
Syllables-III data using the classifier with class-specific transformation matrices. We
combined all 8 sessions from subject LK and obtained 32 trials for each stimulus, 24 of
which were used for training and 8 for testing. Hence each transformation matrix can be
estimated using 672 instances, and in total 896 trials are available to test the
classification accuracy.
The classification was run on each of the 124 monopolar channels separately, and the
classification rates of all the channels are shown in a brain map in Figure 3.4. Each
number on the brain map denotes the classification rate using the monopolar-channel
data collected from the sensor at the corresponding scalp location. Although the
classification accuracy did not improve (36% for the best channels), the
significance of the results is very high (p-value < 10^{-11}) because more test trials
were available. The brain map also shows that the signal from channels located over the
left hemisphere of the scalp carries more information about the phoneme than that
from the right channels. The best rates were obtained from the channels around the left
ear.
Figure 3.4: Classifying the 4 consonants /p/, /t/, /b/, /g/ of the Syllables-III data using
brain-speech mapping method
3.3 Support Vector Machine (SVM) classifiers
3.3.1 Methodology
This section proposes a different approach to classify the EEG signals of
phoneme stimuli. This method follows the traditional pattern classification strategy.
The trials from each class, i.e. each phoneme, are given a unique class label. For
example, to classify the 8 initial consonants in the Syllables-III data, we label the
8 classes with the numbers 1 to 8, standing for /p/, /t/, /b/, /g/, /f/, /s/, /v/ and /z/
respectively. In other words, neither acoustic nor phonological information about the
speech stimuli is taken into account in the classification.
The main scheme underlying the statistical classification approach is SVM with
bootstrap aggregating. I will first introduce the idea of SVM with bootstrap aggregating,
and then describe the diagram of the classifier.
3.3.1.1 SVM with Bootstrap aggregating
We use a soft-margin SVM (Cortes & Vapnik, 1995) as the basic classification unit in this classification
model. The original SVM is a binary classifier that looks for a
separating hyperplane maximizing the margin, i.e., the
distance from the hyperplane to the nearest training data points of either
class. If the training data consist of m samples $(x^{(i)}, y^{(i)}),\ i = 1, \dots, m$, with $x^{(i)} \in \mathbb{R}^n$
and $y^{(i)} \in \{-1, +1\}$ denoting the two class labels, we write the hyperplane
as the set of points that satisfy:

w^T x + b = 0    (3.9)
When the training data are separable, the optimal hyperplane can be found by solving the
optimization problem:

\max_{\gamma, w, b} \ \gamma \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \ge \gamma, \quad i = 1, \dots, m    (3.10)

where $\gamma$ is the margin. With the scaling constraint $\gamma \|w\| = 1$, the optimization problem is
equivalent to

\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1, \quad i = 1, \dots, m    (3.11)

which can be efficiently solved.
However, the solution is very sensitive to outliers when the training data are noisy,
and it cannot be applied to non-linearly separable cases. Therefore a soft margin is
introduced to allow training samples with margin less than 1 or even negative.
If a sample $(x^{(i)}, y^{(i)})$ has margin $1 - \xi_i$, the objective function
increases by the slack $\xi_i$ weighted by a cost factor C. The optimization problem is reformulated as:

\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m    (3.12)
SVM can implement non-linear classification by simply applying the kernel trick.
With a feature mapping $\phi$, the optimization problem becomes

\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y^{(i)} (w^T \phi(x^{(i)}) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m    (3.13)

It can be solved by optimizing the dual problem

\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - \mathbf{1}^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, m    (3.14)

where Q is an m-by-m positive semi-definite matrix with

Q_{ij} = y^{(i)} y^{(j)} K(x^{(i)}, x^{(j)})    (3.15)
and $K(x^{(i)}, x^{(j)})$ is the kernel. The kernel is the inner product of $x^{(i)}$ and $x^{(j)}$ in the linear
case. The decision function for a test sample x is:

h(x) = \mathrm{sgn}\left( \sum_{i=1}^{m} \alpha_i\, y^{(i)} K(x^{(i)}, x) + b \right)    (3.16)

The predicted class label of the test sample is +1 if the decision function is greater
than zero and -1 if the decision function is less than zero.
The following kernels were tested in our study:
Linear kernel: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T \mathbf{z}$
Gaussian radial basis function (RBF): $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \|\mathbf{x} - \mathbf{z}\|^2)$
Polynomial kernel: $K(\mathbf{x}, \mathbf{z}) = (\gamma\, \mathbf{x}^T \mathbf{z} + C_0)^d$ for d = 2 and d = 3.
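Written out directly (a small illustrative sketch in Python; the parameter names gamma, c0 and d follow the notation above), the three kernels are:

    import numpy as np

    def linear_kernel(x, z):
        return x @ z                                   # x^T z

    def rbf_kernel(x, z, gamma):
        return np.exp(-gamma * np.sum((x - z) ** 2))   # exp(-gamma ||x - z||^2)

    def polynomial_kernel(x, z, gamma, c0, d):
        return (gamma * (x @ z) + c0) ** d             # (gamma x^T z + C0)^d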
To solve an N-Class classification problem, we construct a binary SVM for each
pair of the N classes and predict a test sample as belonging to the class that wins the
maximum number of “votes” from the binary classifiers. In total N(N-1)/2 “one-
against-one” binary classifiers are needed.
We use the Matlab toolbox libsvm (Chang & Lin, 2001) to implement SVM
training and predicting. The toolbox trains the SVM using a Sequential-Minimal-
Optimization (SMO)-type decomposition method. Since the solution is often sub-optimal,
and the noisy training data may not represent the structure of unseen
data, the performance of a single SVM can be very unstable. A solution to this problem is
Bootstrap Aggregating (Bagging). Bagging is a method that generates multiple versions
of a classifier via the bootstrap sampling approach and uses these to get an aggregate
classification (Breiman, 1996; Kim, 2002). The scheme of SVM with Bagging is
shown in Figure 3.5.
Figure 3.5: Diagram of SVM with bootstrap aggregating
Let $\mathrm{TR} = \{z_1, z_2, \dots, z_m\}$, with $z_i = (x^{(i)}, y^{(i)})$, denote the training set. The bootstrap
method randomly draws observations from TR with replacement to produce a replicate dataset of TR,
denoted $\mathrm{TR}^{(j)}$, and repeats this drawing process B times. Each
replication is drawn independently and is used to train an SVM classifier. Thus we get a
set of B SVM classifiers, each trained independently on a replication of
the training set. To predict the classification of a test sample, we aggregate the SVMs
via majority voting. We first test the sample with all the SVMs and obtain a vector of
prediction labels $C = (c_1, c_2, \dots, c_B)$. Since the prediction errors of the SVMs should be
random and independent, the final prediction label of the test sample is selected as the
class that occurs most often in C. Empirical studies have shown that the Bagging method
generally outperforms a single SVM trained on the original training set TR (Wang
et al., 2009).
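The following sketch illustrates the Bagging scheme in Python, with scikit-learn's SVC standing in for the Matlab libsvm toolbox actually used; it is a simplified illustration (small non-negative integer class labels assumed), not the thesis code:

    import numpy as np
    from sklearn.svm import SVC

    def train_bagged_svms(X, y, B=35, seed=0):
        """Train B SVMs, each on an independent bootstrap replication
        TR^(j) of the training set, drawn with replacement."""
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(B):
            idx = rng.integers(0, len(X), size=len(X))  # bootstrap replication
            models.append(SVC(kernel="linear").fit(X[idx], y[idx]))
        return models

    def predict_majority(models, X_test):
        """Aggregate the B SVMs by majority voting over their predictions."""
        votes = np.stack([m.predict(X_test) for m in models])  # shape (B, n_test)
        # the most frequent label per test sample
        return np.array([np.bincount(col).argmax() for col in votes.T])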
3.3.1.2 Diagram of the classifier
The diagram of the EEG classifier based on SVM is shown in Figure 3.6. The
pre-processing steps include high-pass filtering with 1Hz cutoff frequency, down-
sampling and removing artifacts using ICA. For each trial of EEG recording, we
concatenated the data from all the channels to create an observation vector. The length
of the observation vector is the product of the number of channels and the trial length.
In addition, all the individual EEG trials are relabeled using the class index. i.e. The
33
EEG trials are labeled using number 1 to 8 for classifying 8 consonants and using
number 1 to 4 for classifying 4 vowels. This means at this stage, the classification is a
blind process without knowing any information of other phonemes presented in the
syllable stimuli.
Figure 3.6: Diagram of the SVM-with-Bagging EEG classifier. (a) Overall structure: the raw EEG is pre-processed, a PCA transformation matrix is estimated on the training set (TR) and applied to both TR and the test set (TE); TR is passed to bootstrap repetitions #1, ..., #B, each producing one SVM model, and the B SVMs are aggregated by "majority voting" on averaged test samples generated with the sample-without-replacement method. (b) Bootstrap repetition #i: TR is split into a replication TR(i) and a held-out set TE(i), averaged samples are created by sampling with replacement, parameters are optimized via cross-validation, and the SVM model is trained and tested.
We divided the EEG trials randomly into a training set (TR) and an Out-Of-
Sample (OOS) test set (TE). The classifier parameters are estimated using the training
set only, hence independent of the OOS test set. The OOS test set is used to test the
classification accuracy and generate the confusion matrices. Besides the SVM with
Bagging, the classification model also makes use of the following statistical methods:
Principal Components Analysis, averaging and cross-validation. Next, each module
of the classification model will be explained in detail.
Principal Components Analysis (PCA)
If the observed data include a large number of variables, it is very likely that some
variables are correlated. For EEG signals, data from adjacent channels are highly
correlated with each other. Thus when we classify EEG using data from multiple
channels, we apply PCA to reduce the number of variables. The PCA algorithm
rotates the data to a new coordinate system via an orthogonal linear transformation. The resulting
data have the greatest variance aligned with the first coordinate, the second greatest
variance aligned with the second coordinate, and so on (Jolliffe, 2002). In pattern
classification, PCA can be used to reduce the feature size because of the underlying
assumption that variables with very small variances are trivial for separating data from
different classes. Thus we can truncate the transformed data and keep only the first K
principal components without losing much information for classification. This is
equivalent to projecting the data onto a reduced subspace with only K coordinates. The
number of principal components kept in the feature vector, K, needs to be optimized
using empirical data.
The PCA orthogonal linear transformation can be calculated with the following
algorithm. Suppose we have m observed samples $x^{(i)},\ i = 1, \dots, m$, each with n
variables, $x^{(i)} \in \mathbb{R}^n$. First, we zero out the mean of the data by replacing each $x^{(i)}$ with
$x^{(i)} - \mu$, where

\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}    (3.17)

Then the empirical covariance matrix of x is calculated as:

\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T}    (3.18)

Next, we find the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ and the unit-length eigenvectors
$v_1, v_2, \dots, v_n$ of $\Sigma$. The matrix $V = [v_1\ v_2\ \cdots\ v_n]$ diagonalizes the covariance matrix as
$V^{-1} \Sigma V = D$, in which $D = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$. Rearrange the order of the columns of V so
that the diagonal elements of D are in descending order. Then the transformation
$y^{(i)} = V^T x^{(i)}$ rotates the original data to their principal components.
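A compact NumPy sketch of this algorithm (illustrative only; the function name pca_transform is hypothetical):

    import numpy as np

    def pca_transform(X, K):
        """PCA via eigendecomposition of the empirical covariance matrix
        (equations 3.17-3.18). X holds one observation per row; the first
        K principal components of each observation are returned."""
        Xc = X - X.mean(axis=0)              # zero out the mean (3.17)
        Sigma = (Xc.T @ Xc) / X.shape[0]     # empirical covariance (3.18)
        eigvals, V = np.linalg.eigh(Sigma)   # eigh, since Sigma is symmetric
        order = np.argsort(eigvals)[::-1]    # descending eigenvalues
        return Xc @ V[:, order[:K]]          # keep the top-K components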
In practice, PCA is usually calculated using a Singular Value Decomposition
(SVD) of the centered data matrix (Wall et al., 2003). In this classification model, the PCA transformation
matrix is estimated using all the individual trials of the training set as the first step of
training. We apply the transformation to both TR and TE to convert the original
observations to principal components.
Bootstrap repetitions
After applying PCA, all the training trials, represented as principal components,
are passed to B independent bootstrap repetitions to train B SVM classifiers
independently. The structure of each bootstrap repetition is illustrated in Figure 3.6(b).
In the ith bootstrap repetition, we randomly draw 80% of the trials in the training set
TR to create a bootstrap replication of TR, denoted TR(i). The remaining 20% of the
trials in TR are used as a test set TE(i) to monitor the SVM classifier's accuracy,
although this accuracy rate is not directly related to the final result and is not
reported.
Averaging
Computing the average of multiple trials in a given condition to extract the common structure of a
class of signals is a widely accepted technique in EEG studies, and we apply the
same technique in our research. Averaging cancels out
the uncorrelated noise and improves the signal-to-noise ratio. However, the averaging
procedure considerably reduces the number of training and testing samples and
produces a data-deficiency problem when sophisticated classification models are
estimated. Therefore, when we compute averages, we have to reuse the individual trials
efficiently without biasing the classification accuracy. In our
classification model, we use a sample-with-replacement scheme to randomly select the
trials for computing averages. The sampling is done for the training and
testing sets separately, which means an individual trial used to calculate an averaged
training sample cannot be used to compute an averaged testing sample.
For instance, to calculate the average of M individual EEG trials of initial
consonant $o_i$ as a training sample, we randomly draw M trials from a pool consisting
of all $n_i$ training trials whose corresponding auditory stimuli start with the
consonant $o_i$, and calculate their mean. The M trials are then put back into the pool for
calculating other averages. We repeat this procedure until a sufficient number of
averaged samples is obtained. When the number of trials in the pool, $n_i$, is much greater
than M, it is very unlikely to obtain two identical averaged samples.
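As an illustration, here is a sketch of this sample-with-replacement averaging scheme (hypothetical helper; within one average the M trials are distinct, and they are returned to the pool between draws):

    import numpy as np

    def averaged_samples(trials, M=25, n_samples=300, seed=0):
        """trials: (n_i, n_features) array of one phoneme's individual trials.
        Each averaged sample is the mean of M distinct trials; the M trials
        go back into the pool before the next draw."""
        rng = np.random.default_rng(seed)
        out = []
        for _ in range(n_samples):
            idx = rng.choice(len(trials), size=M, replace=False)
            out.append(trials[idx].mean(axis=0))
        return np.array(out)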
Note that the PCA algorithm rotates the coordinate system and aligns the axes
with the directions in which the signal has greater variance. If a set of data has large
variance in one direction, their averages also have large variance in that direction,
given that the number of trials in the pool is much greater than the number of trials
used to calculate one average. Thus we estimate and apply the PCA
transformation to the individual trials, which avoids repeated SVD calculations that
would dramatically slow down the computation.
Moreover, when we compute the p-value, we use the binomial distribution, which
assumes that the testing of each sample is independent. If the averaged test samples were
constructed using the sample-with-replacement scheme, two samples could share the
same source individual trial and hence would not be statistically independent. Therefore we
apply the sample-without-replacement scheme when calculating the averaged OOS test
samples, for accurate p-value estimation, as shown in Figure 3.6(a). Consequently, no
more than $\lfloor n_i / M \rfloor$ averaged samples can be created for the phoneme $o_i$, where $n_i$ is
the number of individual trials belonging to class $o_i$ in the OOS test set.
Optimizing SVM parameters
Choosing appropriate parameters of the SVM model, such as the number of
principal components K to keep and the cost factor C, is crucial for obtaining a high-performance
classifier. In each bootstrap repetition, we determine the optimal
parameters of each SVM classifier via nested Q-fold cross-validation using TR(i).
Suppose t independent parameters $(\theta_1, \theta_2, \dots, \theta_t)$ need to be optimized, with
$(m_1, m_2, \dots, m_t)$ candidate values respectively. All the candidate
parameter values form a search grid with $m_1 \times \cdots \times m_t$ points, denoted
$P_k,\ k = 1, \dots, m_1 \cdots m_t$. The procedure of cross-validation is:
(1) The individual EEG trials of TR(i) are randomly divided into Q groups with
approximately equal numbers of trials.
(2) Repeat the following for each cross-validation loop:
(2.1) For the jth cross-validation loop, use the jth group of training trials as the
validation test set (VTE(j)) and combine the other Q-1 groups as the
validation training set (VTR(j)). The averaged training and testing samples
are calculated using the sample-with-replacement scheme from VTR(j) and
VTE(j) respectively.
(2.2) For each point $P_k$ of the parameter search grid, we estimate an SVM
classifier, configured with the candidate parameter values, using the
averaged samples of VTR(j), and test its accuracy using the averaged
samples of VTE(j). A classification rate is obtained for $P_k$, denoted
$r^{(j)}(P_k)$.
(3) The mean cross-validation accuracy rate of each $P_k$ is calculated and the
optimal parameter set is chosen as:

\hat{P} = \arg\max_{P_k} \frac{1}{Q} \sum_{j=1}^{Q} r^{(j)}(P_k)    (3.19)
Next, we construct the averaged samples of TR(i) and train the SVM classifier
of the ith bootstrap repetition using the optimal parameter configuration $\hat{P}$.
The computational cost of cross-validation increases exponentially with the number
of parameters to be optimized; thus we cannot afford to search over more than three
parameters. Since the parameters are optimized independently in each bootstrap
repetition, the resulting SVMs may have different structures.
As the last step, the SVM classifiers are aggregated via "majority voting" and
tested on the averaged test samples, which are generated using the sample-without-replacement
scheme.
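A simplified sketch of the (K, C) grid search with Q-fold cross-validation (equation 3.19) is given below; it omits the within-fold averaging step for brevity and assumes X already holds the PCA-rotated observations, so keeping K principal components is just truncating to the first K columns:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    def grid_search_cv(X, y, Ks, Cs, Q=5):
        """Return the (K, C) pair maximizing the mean Q-fold validation rate."""
        mean_rate = np.zeros((len(Ks), len(Cs)))
        for tr, va in KFold(n_splits=Q, shuffle=True).split(X):
            for a, K in enumerate(Ks):
                for b, C in enumerate(Cs):
                    clf = SVC(kernel="linear", C=C).fit(X[tr][:, :K], y[tr])
                    mean_rate[a, b] += clf.score(X[va][:, :K], y[va]) / Q
        a, b = np.unravel_index(mean_rate.argmax(), mean_rate.shape)
        return Ks[a], Cs[b]

    # e.g. K_hat, C_hat = grid_search_cv(X, y, Ks=list(range(5, 205, 5)),
    #                                    Cs=[2.0 ** e for e in range(-8, 0)])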
3.3.2 Classification results
We tested the SVM classification model using the Syllables-III data and the
Isolated-vowels data. The 1 kHz raw brainwave data were down-sampled 16 times
to 62.5Hz. For classifying the initial consonants, only the first 32 samples of each trial,
representing 512ms of EEG signal, were used in classification. The full-length trials
of 62 samples were used to classify the vowels.
3.3.2.1 Linear vs. Nonlinear Kernels
First, we combined the EEG trials from all four subjects of the Syllables-III data and
tested the performance of the SVM-with-Bagging classifier using linear and non-linear
kernels. We classified the 8 consonants and the 4 vowels as two independent classification
problems. We also tested the classifier on recognizing the 4 vowels in the Isolated-vowels
data. The classification experiment was configured as follows:
All 124 monopolar channels are concatenated as the EEG observation vector.
The training set TR included half of the individual EEG trials and the OOS test set
TE contained the other half; the training/OOS-testing partition is random.
35 SVMs were built using the Bagging scheme.
Each averaged sample was computed from 25 individual trials.
For the linear-kernel SVM model, there are only two parameters to be optimized:
the size of the principal-component vector, K, and the cost factor of the SVM, C.
Thus the linear-kernel SVM was tested first. For the 8-consonant classification, we
used 5-fold cross-validation to choose the number of principal components K
from [5,10,15,...,200] and the cost factor C from
$\{2^{-8}, 2^{-7}, \dots, 2^{-1}\}$. The mean of the validation rates across all the bootstrap repetitions is
plotted in Figure 3.7. We can see that if C is fixed and K increases from 5, the
recognition accuracy improves dramatically at first, and the
growth rate decreases after K reaches a certain level. The cost factor C affects the
sensitivity of the classification rate with respect to K. Although a larger principal-component
size K leads to a better recognition rate, it also considerably increases the
computation cost. With an appropriate C, a better recognition rate can be achieved
with a smaller number of principal components.
Figure 3.7: Mean validation accuracy on parameter search grid (K,C) using linear
kernel
Now we consider the computation cost of training the SVM-with-Bagging
classifier with linear kernel. The parameters were optimized via 5-fold validation,
searching a grid of 40×8 = 320 points. Thus around 320×5+1 = 1601 SVM
training runs are needed to estimate each SVM classifier. An ensemble of 35 such SVMs was
used to make the final predictions. Therefore, in total 1601×35 = 56,035 SVM
training runs are required to construct the EEG classification model with linear
kernel. For the non-linear kernel experiments, since more parameters need to be
optimized, a full search of the parameter grid becomes infeasible. Thus we used a
fixed cost factor C = 0.02, corresponding to the fastest ascending slope of the mean
validation rate with respect to K in the linear-kernel experiment, and K = 200, while
optimizing only $\gamma$ for the Gaussian-kernel experiment and $\gamma$ and $C_0$ for
the polynomial-kernel experiment. The classification accuracy rates and significance
levels (p-values) are shown in Table 3.2.
Table 3.2: Phoneme classification results using the SVM-with-Bagging method with linear and non-linear kernels

  Task                     8 consonants in        4 vowels in           4 vowels
                           CV syllables           CV syllables          (isolated)
  Number of test samples   426                    426                   140
  Kernel                   rate      p-value      rate      p-value     rate      p-value
  Linear                   46.0%     <10^{-64}    41.5%     <10^{-13}   68.8%     <10^{-26}
  Gaussian                 42.7%     <10^{-53}    36.9%     <10^{-7}    65.3%     <10^{-23}
  Quadratic                42.7%     <10^{-53}    34.3%     <10^{-5}    62.4%     <10^{-20}
  Cubic                    43.7%     <10^{-56}    41.5%     <10^{-13}   65.9%     <10^{-23}
The results show that the linear SVM-with-Bagging model correctly classified 46%
of the 426 consonant test samples (p-value < 10^{-64}). The classification rates of the 4
vowels in the CV syllables are much lower than the consonant results, with a 41.5%
accuracy rate using the linear kernel (p-value < 10^{-13}). However, we find that the model
works well on the same 4 vowels presented in isolation, achieving a classification rate
of 68.8% with the linear kernel. The high significance levels demonstrate the effectiveness
of the SVM-with-Bagging classification method in modeling the averaged EEG
recordings of auditory phoneme stimuli. But it works much better on the phonemes
presented at the beginning of the stimuli than on the following phonemes. As mentioned
in Chapter 1, the EEG recording reflects postsynaptic potentials, which may last
longer than the actual duration of the sound stimuli. Thus the EEG response
to the initial consonant may impose extra noise on the brainwave of vowel perception
and make it unintelligible.
Theoretically, the non-linear kernels should achieve at least the same performance
as the linear kernel. However, limited by the computational capabilities of our
computers, we could not run a full parameter grid search to find the optimal
parameters of the non-linear kernels. As a result, the classifiers with non-linear kernels
did not outperform the classifier with the linear kernel in this experiment.
3.3.2.2 Leave-one-subject-out experiment
With the same experiment setup, we tested the invariance of EEG representations
among subjects using the Syllables-III data. We used trials from one subject to create
test samples and trials from the other three subjects to train the linear SVM model.
This procedure was repeated for each of the four subjects. The resulting classification
rates and p-values are shown in Table 3.3.
Table 3.3: Leave-one-subject-out classification results using the SVM-with-Bagging method

                      8 consonants                          4 vowels
  Subject for         Number of                             Number of
  testing             test samples  rate     p-value        test samples  rate     p-value
  DS                  138           26.8%    <10^{-5}       141           35.5%    0.0036
  SA                  176           33.5%    <10^{-12}      176           30.7%    0.051
  LK                  280           30.4%    <10^{-10}      284           37.0%    <10^{-5}
  LH                  248           25.0%    <10^{-7}       248           40.7%    <10^{-7}
The classification rates for the 8 consonants range from 25.0% (p-value < 10^{-7}) to 33.5%
(p-value < 10^{-12}). Although these rates are considerably lower than the results of the
previous experiment, they are highly significant and demonstrate that the EEG
representations of consonants are approximately invariant across subjects. In
contrast, the vowel classification results, which vary from 30.7% to 40.7%, are
comparable to the results obtained by mixing the trials of all the subjects. This
suggests that the EEG representations of vowels have stronger inter-subject invariance
than the EEG representations of consonants.
3.3.2.3 Experiment on the number of trials used to calculate averages
In all the experiments examined above in this section, we trained and tested the
classification model using averaged samples, with the number of individual trials per
average, M, fixed at 25. To explore how the averaging process affects
the classification, we also classified the initial consonants using the linear SVM with
various values of M, and plotted the relation between M and the percentage of test samples
correctly classified in Figure 3.8.
Figure 3.8: Classification rates for the 8 initial consonants as a function of the
number of trials used to calculate averages
The figure shows that when classifying individual trials, only 17.3% of the test trials
were correctly classified. The accuracy rates increase rapidly as M is increased, and the
ascent slows slightly once M is greater than 20. Although our data size
is insufficient to determine when the classification rates saturate, given these
results we conclude that averaging can efficiently improve the signal-to-noise ratio,
which verifies that the noise in the EEG signals is mainly uncorrelated across
trials.
3.3.2.4 Experiment on classifying individual EEG trials using data from a single
channel
We also classified the individual EEG trials of the 4 initial consonants /p/, /t/, /b/ and
/g/ from subject LK using the SVM-with-Bagging model, so that its performance can
be compared with the brain-speech mapping classification results shown in Figure 3.4.
In this experiment we combined the 8 sessions from subject LK in the Syllables-III
data. We randomly drew 75% of the individual EEG trials associated with the targeted
phonemes as the training set and used the remaining 25% as the OOS test set. To match
the experimental setup of the brain-speech mapping classification, we trained and tested the
SVM-with-Bagging classifier using individual EEG trials from each of the 124
monopolar channels separately. The parameter setup of the experiment was:
We used 32 samples, representing approximately the first 500ms of brain response of each EEG trial, as
the observation data vector; PCA is not necessary in this case.
The linear kernel was adopted in the SVM-with-Bagging classification model.
35 SVMs were built using the Bagging scheme.
The cost factor of the soft-margin SVMs was optimized via nested 5-fold cross-validation
loops and chosen from $\{2^{-10}, 2^{-9}, \dots, 2^{2}\}$.
The resulting classification rates are shown in the brain map in Figure 3.9. The
numbers denote the classification rates using the monopolar-channel data collected
from the corresponding scalp locations.
Figure 3.9: Performance of SVM-with-Bagging method on classifying 4 initial
consonants using single channel data
We can see that this brain map matches the brain map in Figure 3.4 very well. Both
experiments obtained the highest classification rates from the channels around the left ear,
and the best single-channel classification rates are the same, 36%. The major
difference between the two brain maps is that the SVM-with-Bagging method can classify
the consonants reasonably well using some channels from the right hemisphere, which
is not seen in the classification results of the brain-speech mapping model.
3.4 Summary
In this chapter, we proposed two different approaches to classifying the EEG
brainwaves evoked by auditory phoneme stimuli. The first method makes use of the acoustic
properties of the auditory stimuli and examines the relations between the brainwaves
and the speech sound waves. The second approach follows the traditional pattern
recognition strategy and makes use of statistical signal processing methods such as
PCA and SVM-with-Bagging. We used these methods to classify both individual EEG
trials and averaged EEG trials. Both methods achieved significant results, especially
in classifying the initial consonants. The performances of the two classification models are
similar when classifying individual trials of the 4 initial consonants. The classification
rates could be further improved by bringing the two models together; for example, the
brain-speech mapping method could take advantage of the Bagging scheme and PCA
as well. Using the SVM-with-Bagging method, we showed that non-linear
methods did not outperform the linear method in our experiments and that the EEG
representation of phonemes is approximately subject-invariant.
Using the second method as the baseline, in the next chapters we examine how
phonetic differences are reflected in EEG brainwaves and explore whether the classification
methods can be improved by introducing the phonological information of the stimuli into the
classification.
Chapter 4
Frequency Analysis of EEG Signals
4.1 EEG signals in frequency domain
The long history of studying EEG in the frequency domain started almost at the
same time as the first successful recordings of human EEG in the 1920s. Researchers
found that EEG signals contain rhythmic activities across a spectrum of frequencies.
Oscillations in a certain frequency range may reflect a specific cognitive state of the
brain. For example, alpha waves, in the frequency range of 8 to 12 Hz, are believed to
be related to wakeful relaxation with closed eyes. Beta waves, observed from
frontal sensors and ranging from 12Hz to 30Hz, are closely linked to motor activities
(Pfurtscheller, 1999). Gamma activities in the frequency range from 30 to 100Hz seem
related to the binding of neural processes in different brain areas for carrying out a
coherent cognitive or motor activity (Tallon-Baudry & Bertrand, 1999). However, in
the literature on EEG related to phoneme perception, the focus of this thesis,
researchers have been more interested in temporal information. Very few reports on
frequency analysis of EEG activities evoked by phoneme stimuli have been published.
In this chapter, we address the question of whether frequency analysis can
extract attributes of EEG associated with auditory phoneme perceptual activities.
domain, we first plot the power spectral densities (PSD) of our recordings. Only one
EEG session on syllables from subject LK in Syllable-III experiment is examined. The
EEG signals were passed through a 1Hz high-pass filter and down-sampled four times
to 250Hz. The PSD of each monopolar channels was computed using the covariance
method and then the average PSD across all 124 monopolar channels was calculated.
This average PSD from 0Hz to the Nyquist frequency 125Hz is plotted in the dB scale
in Figure 4.1.
47
Figure 4.1: Average power spectral densities of EEG signal sampled at 250Hz
The plot shows that, besides the dominant 60 Hz AC power-supply noise, the
power of the EEG signal is mainly distributed in the low frequency range from 0 to
20Hz, and the power is inversely related to frequency. The 20Hz component shows
more than 20dB of power decay compared to the maximum at 2Hz. It is natural to think
that the essential information about brain activities is carried by the frequency components
with higher energies. Hence reducing the size of the data by down-sampling to
62.5Hz, as we did in Chapter 3, does not lose much useful information. In the remainder
of this chapter, we focus only on the lower frequency range from 0Hz to
approximately 31Hz.
4.2 EEG spectral features
The Discrete Fourier Transform (DFT) is commonly used to convert a finite
discrete time-domain signal to the frequency domain. For a time-domain signal
$x(n),\ n = 0, \dots, N-1$, the N-point DFT consists of the N complex numbers:

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i \frac{2\pi}{N} nk}, \quad k = 0, \dots, N-1    (4.1)

And $x(n)$ can be reconstructed from $X(k)$ using the inverse transformation:

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{i \frac{2\pi}{N} nk}, \quad n = 0, \dots, N-1    (4.2)

A real time-domain signal $x(n)$ has a conjugate-symmetric spectrum, which
means

X(k) = X^{*}(N-k), \quad k = 1, \dots, N-1    (4.3)

and both $X(0)$ and $X(N/2)$ are real when N is even.
Now we consider only the case where $x(n)$ is real and N is even. If we write the
complex DFT coefficient $X(k)$ as

X(k) = A_k e^{i \varphi_k}    (4.4)

then the inverse DFT can be reformulated as:

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} A_k e^{i(\frac{2\pi}{N} kn + \varphi_k)} = \frac{1}{N} \left[ A_0 \cos\varphi_0 + A_{N/2} \cos(\pi n + \varphi_{N/2}) + \sum_{k=1}^{N/2-1} 2 A_k \cos\!\left(\frac{2\pi}{N} kn + \varphi_k\right) \right]    (4.5)

Equation (4.5) shows that the DFT represents the time-domain signal as a
superposition of a series of discrete sinusoidal functions, each of which is defined by
three attributes: frequency, amplitude and phase.
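Equation (4.5) can be verified numerically; the following NumPy check (illustrative only) reconstructs a random real signal from the amplitudes and phases of its DFT:

    import numpy as np

    N = 64
    x = np.random.randn(N)           # any real signal of even length
    X = np.fft.fft(x)
    A, phi = np.abs(X), np.angle(X)  # amplitudes A_k and phases phi_k

    n = np.arange(N)
    x_rec = (A[0] * np.cos(phi[0])
             + A[N // 2] * np.cos(np.pi * n + phi[N // 2])
             + sum(2 * A[k] * np.cos(2 * np.pi * k * n / N + phi[k])
                   for k in range(1, N // 2))) / N
    assert np.allclose(x, x_rec)     # superposition of sinusoids, eq. (4.5)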
To explore the relation between these spectral attributes and the phoneme
perception process in the brain, we constructed several EEG features based on the DFT
that reflect all or some of the spectral attributes. Then, using the SVM classifier introduced
in Chapter 3, we tested whether the frequency-domain features can be used to predict the
brain representation of phoneme stimuli.
DFT of EEG
The first feature is the DFT itself, calculated separately for each trial and each
channel. Since the time-domain signal $x(n)$ is real, half of the N-point DFT is
redundant. We represent $x(n)$ by

X_{\mathrm{DFT}} = \left[ X(0),\ \mathrm{Re}\,X(1), \dots, \mathrm{Re}\,X(N/2-1),\ \mathrm{Im}\,X(1), \dots, \mathrm{Im}\,X(N/2-1),\ X(N/2) \right]    (4.6)

which is a vector of N non-redundant real numbers. Theoretically, $X_{\mathrm{DFT}}$ should be
equivalent to the time-domain signal and achieve the same classification accuracy.

Amplitude of DFT
The amplitude feature of EEG keeps only the amplitude of each
sinusoidal component. From the conjugate-symmetry property, we have

A_k = A_{N-k}, \quad k = 1, \dots, N-1    (4.7)

Thus the non-redundant representation of the amplitudes is:

X_{\mathrm{AMP}} = [A_0, A_1, \dots, A_{N/2}]    (4.8)

EEG features based on phase
Similarly, when we define EEG features based on the phases of the DFT, only N/2
values need to be included:

X_{\mathrm{PHS}} = [\varphi_1, \dots, \varphi_{N/2}]    (4.9)
The EEG signal was passed through several filters at the pre-processing stage to
remove noise and artifacts. A low-pass filter with zero-phase response was used
for anti-aliasing before down-sampling, but the 4th-order Butterworth
high-pass filter with the cutoff frequency at 1Hz has a non-zero phase response. The
frequency response of the high-pass filter in the low frequency range is shown in
Figure 4.2. The high-pass filter introduces a non-linear phase distortion that needs to be
compensated: if the high-pass filter generates a phase delay $\delta_k$ at the frequency
of the kth sinusoidal component, the phase $\varphi_k$ of that component should be
replaced by $\varphi_k - \delta_k$.
Figure 4.2: Magnitude and phase frequency response of 1 Hz high-pass filter
Furthermore, all the elements of $X_{\mathrm{PHS}}$ are angular values, which means $\varphi_k$ is
identical to $\varphi_k + 2\pi$. The linear methods used in classification, such as
averaging and the linear separating hyperplane, do not work appropriately for angular
observation values. Here we propose two different approaches to overcome this
problem.
The first method is to describe the phase angle $\varphi_k$ as a unit-length vector in the
complex plane and use its real and imaginary parts, $\cos\varphi_k$ and $\sin\varphi_k$, as the
observed values. Then the EEG feature is written as:

X_{\mathrm{PHS\text{-}1}} = \left[ \sin\varphi_1, \cos\varphi_1, \dots, \sin\varphi_{N/2}, \cos\varphi_{N/2} \right]    (4.10)
In the other approach, we keep only the phase information of the DFT and transform
it back to the time domain using the inverse DFT. More specifically, for each
element of $X(k),\ k = 0, \dots, N-1$, the modified DFT is defined as:

\tilde{X}(k) = \begin{cases} e^{i\varphi_k} & \text{if } A_k \ne 0 \\ 0 & \text{if } A_k = 0 \end{cases}    (4.11)

And the EEG feature PHS-2 is

X_{\mathrm{PHS\text{-}2}} = \mathrm{IDFT}(\tilde{X}(k))    (4.12)

Obviously $\tilde{X}(k)$ is also conjugate symmetric, so the derived time-domain signal
is real. Thus $X_{\mathrm{PHS\text{-}2}}$ is a vector of N real numbers, which may be longer than
the original EEG data vector. Although $X_{\mathrm{PHS\text{-}2}}$ is a time-domain signal, it is constructed
from only the phase pattern of the original signal, and all the amplitude differences
of the non-zero sinusoidal components are eliminated. Hence $X_{\mathrm{PHS\text{-}2}}$ is still
considered a phase feature, while all the time-domain signal processing methods can
also be applied to it.
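The three feature constructions can be sketched in a few lines of NumPy (illustrative only; the phase compensation for the high-pass filter is omitted here):

    import numpy as np

    def spectral_features(x):
        """AMP, PHS-1 and PHS-2 features (equations 4.8-4.12) of one real
        EEG trial x of even length N."""
        N = len(x)
        X = np.fft.fft(x)
        A, phi = np.abs(X), np.angle(X)

        amp = A[:N // 2 + 1]                               # X_AMP, eq. (4.8)
        phs1 = np.column_stack((np.sin(phi[1:N // 2 + 1]),
                                np.cos(phi[1:N // 2 + 1]))).ravel()  # eq. (4.10)

        X_tilde = np.where(A > 0, np.exp(1j * phi), 0.0)   # eq. (4.11)
        phs2 = np.fft.ifft(X_tilde).real                   # eq. (4.12)
        return amp, phs1, phs2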
4.3 Classification results
4.3.1 Comparing the EEG features based on DFT
To test how well the EEG spectral features describe the brain activities of
phoneme perception, we computed the four proposed EEG spectral features for the
Syllables-III data and the Isolated-vowels data. We used the features to classify the
brain representations of the phoneme stimuli with the linear-kernel SVM model
introduced in Section 3.3. The classification scheme is identical to the one described in
Figure 3.6, except that the EEG time-domain signal of each trial is converted to the
spectral features immediately after the ICA cleaning in the pre-processing stage. For
EEG data sampled at 62.5Hz, we used 62 samples as the time-domain observations of
one second of EEG activity and zero-padded them for a 64-point DFT. Thus the
frequency resolution of the DFT is approximately 1Hz. The classifiers were trained and
tested using the following configuration:
EEG channels: 124 monopolar channels
Number of SVMs for Bagging: 35
Number of trials for averaging: 25
The number of principal components used for classification is optimized in nested
5-fold cross-validation loops and chosen from [5,10,15,...,200].
The cost factor of the SVMs is optimized in nested 5-fold cross-validation loops and
chosen from $\{2^{-10}, 2^{-9}, \dots, 2^{2}\}$.
The percentages of test samples classified correctly are summarized in Table 4.1.

Table 4.1: Comparing the classification rates of 4 EEG spectral features

                                    8 consonants   4 vowels in     4 isolated
                                                   CV syllables    vowels
  Number of test samples            426            426             140
  Temporal signal (TIME)            41.5%          35.5%           68.8%
  Spectral     DFT                  38.0%          31.5%           70.2%
  features     AMP                  10.3%          28.5%           27.7%
               PHS-1                35.2%          29.7%           51.8%
               PHS-2                39.2%          27.8%           55.3%
The initial-consonant classification results show that the DFT spectral features
achieve slightly lower classification rates than the full-length temporal signal.
The AMP features barely worked for distinguishing the EEG representations of initial
consonants, achieving only 10.3% accuracy in the 8-consonant classification, which is
no better than the chance level. Hence the amplitudes of the DFT carry very
little information for distinguishing the initial-consonant brain images. Among the four
spectral features, PHS-2 gave the best classification rate, 39.2%, which is better than
that of the full DFT representation. These results demonstrate that
eliminating the amplitude information can improve the initial-consonant classification
rates. A similar performance-difference pattern is found in the isolated-vowel brain-image
classification, except that the phase-related features cannot classify the vowels
as well as the temporal and DFT features. Since the total number of test samples of
isolated vowels is 140, much less than the 426 test samples of initial
consonants, the isolated-vowel classification rates are not as robust as the rates of the
initial-consonant classification.
We also found that the superiority of phases over the amplitudes of the DFT is not
shown clearly in classifying the EEG brainwaves of the four vowels in CV syllables. This
is because the actual start times of the vowels presented at the non-initial position
differ, due to the varying durations of the preceding consonants. Therefore, if the
distinctions between the EEG images of phonemes are reflected in the phases of the DFT,
which describe temporal delays of the sinusoidal components of the signal, these
distinctions can be contaminated by the different delays when the phonemes are not
presented as the initial phoneme of a stimulus. These results also help explain
why the classification rates of vowels in CV syllables are significantly lower
compared to the others, and they indicate a possible direction for improving the rates.
Moreover, although the DFT feature is mathematically equivalent to the TIME
feature, meaning that each is fully determined by the other, the DFT feature
does not reach as high a rate as TIME under the linear classification methods. Similar
results were obtained for the PHS-1 and PHS-2 features. This may suggest that the
proposed SVM classification algorithm works better on the EEG signal when it is
represented in the time domain.
In short, the remarkable differences in classification rates using spectral features
show that the phoneme brain representations are nearly independent of the amplitudes
of the sinusoidal components of EEG but are much more strongly reflected in their phase pattern.
4.3.2 Frequency selection
Now we know that the phases of the DFT can describe the EEG of the phoneme
perception process rather well. This is shown by the classification experiments
discussed above, which used the spectral properties across the frequency range from
DC to the Nyquist frequency as the observation features. However, it is natural to think
that the frequency components contribute differently to the classification: those that
are unrelated to the target neural activities may impose extra noise on the classifiers
and reduce the classification rate. By optimizing the choice of frequency components
with respect to maximizing the classification rate, we may be able to find the
frequency range of EEG activities that is more directly related to the brain's processing of
phonemes in humans.
The ideal approach would be to include a full search of possible frequency choices while
training the Bagging SVMs, as we did for the number of principal components and the
SVM cost factor, but this would increase the computation time to an impractical level.
Thus we look for an approximately optimal frequency range via a 10-fold cross-validation
using only the training set. Since the number of trials in the Isolated-vowels
data is insufficient, we only examine the best frequency range for classifying initial
consonants using the Syllables-III data.
First of all, we assume that the frequency components carrying the information to
distinguish the initial consonants lie in a continuous range from $f_L$ to $f_H$,
corresponding to the frequency indices L and H of an N-point DFT. Our purpose is to
find the optimal parameter pair (L, H) in the search grid

\left\{ (L, H) : L \le H;\ 0 \le L, H \le \frac{N}{2} \right\}    (4.13)

which maximizes the mean classification rate of cross-validation. In this experiment
we down-sampled the EEG data to 62.5Hz and applied a 64-point DFT. The
optimization procedure is as follows:
(1). All the training trials (TR) are randomly divided into 10 groups with
approximately the same number of trials.
(2). Repeat the following steps for each pair (L,H) of the grid.
(2.1) Calculate the modified PHS-2 feature for the frequency band between L and
H and use it as the observation vector of the trial. If $X(k),\ k = 0, \dots, N-1$, is the DFT of the EEG signal collected from one channel, we calculate:

\tilde{X}(k) = \begin{cases} e^{i\varphi_k} & \text{if } A_k \ne 0 \text{ and } L \le k \le H \\ e^{i\varphi_k} & \text{if } A_k \ne 0 \text{ and } L \le N-k \le H \\ 0 & \text{otherwise} \end{cases}    (4.14)

The band-limited PHS-2 feature is the IDFT of $\tilde{X}(k)$.
(2.2) Transform the modified PHS-2 features into their principal components
using PCA. Only the first 200 principal components are used in the
subsequent computation. Each EEG trial is now reduced to a vector of 200
elements that depend only on the phase pattern of the DFT within the
frequency band $[f_L, f_H]$.
(2.3) Repeat the following for 10 cross-validation loops:
(2.3.1) In cross-validation loop i, use group i as the test set ($\mathrm{VTE}^{(i)}$)
and combine all the other 9 groups as the training set ($\mathrm{VTR}^{(i)}$).
(2.3.2) Use the sample-with-replacement method discussed in 3.3.1.2 to
create averaged test and training samples; each sample is the mean
of the PHS-2 features of 25 individual trials. The training set
$\mathrm{VTR}^{(i)}$ and the test set $\mathrm{VTE}^{(i)}$ are sampled separately. The total
number of training samples equals the number of individual
trials in $\mathrm{VTR}^{(i)}$, and the total number of test samples equals the
number of individual trials in $\mathrm{VTE}^{(i)}$.
(2.3.3) Train an 8-class linear SVM classifier using the training samples of
$\mathrm{VTR}^{(i)}$ with the cost factor $C = 2^{-6}$. Then use it to predict the class
labels of the test samples and compute the percentage of samples
classified correctly, denoted $r^{(i)}_{(L,H)}$.
(2.4) The mean classification accuracy of the parameter pair (L, H) is defined as:

\bar{r}_{(L,H)} = \frac{1}{10} \sum_{i=1}^{10} r^{(i)}_{(L,H)}    (4.15)

And the optimal pair (L, H) is chosen as $(\hat{L}, \hat{H}) = \arg\max_{(L,H)} \bar{r}_{(L,H)}$.
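The band-limited PHS-2 feature of step (2.1) can be sketched as follows (illustrative NumPy; the conjugate-symmetric mirror bins are kept so that the inverse DFT is real):

    import numpy as np

    def band_limited_phs2(x, L, H):
        """Modified PHS-2 feature of equation (4.14): unit-modulus phase
        terms for the DFT bins in [L, H] and their mirrors; zero elsewhere."""
        N = len(x)
        X = np.fft.fft(x)
        k = np.arange(N)
        keep = ((k >= L) & (k <= H)) | ((N - k >= L) & (N - k <= H))
        X_tilde = np.where(keep & (np.abs(X) > 0), np.exp(1j * np.angle(X)), 0.0)
        return np.fft.ifft(X_tilde).real

    # e.g. with the optimal band found below: feature = band_limited_phs2(trial, 2, 9)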
The mean classification accuracies of all the candidate (L, H) pairs in the
search grid $\{(L, H) : L \le H;\ 0 \le L \le 4,\ 0 \le H \le 20\}$ are shown in Figure 4.3. We find that
the parameter pair (2, 9) gives the best mean classification rate, 45.5%, for 8 classes.
These DFT indices correspond approximately to the frequency range 2Hz to 9Hz.
Figure 4.3: Mean classification rates of parameter pair (L, H) obtained from 10-fold
cross validation
To test whether the optimal frequency band generalizes to the OOS test trials and to
compare the result with the other temporal and spectral representations of EEG, we
classified the EEG signals of phonemes, represented as the modified PHS-2 feature
with the limited bandwidth [2Hz, 9Hz], using the linear-kernel SVM-with-Bagging
classifier. Although the best frequency range was obtained using the EEG responses to
initial consonants, we also applied it to classifying the isolated vowels to check whether any
improvement could be made.
The experiments were configured as follows:
EEG channels: 124 monopolar channels
Number of SVMs for Bagging: 35
Number of trials for averaging: 25
EEG observations: modified PHS-2 for a 64-point DFT with L=2, H=9
Number of principal components used for classification: 200
The cost factor of the SVMs is optimized in nested 5-fold cross-validation loops and
chosen from $\{2^{-10}, 2^{-9}, \dots, 2^{2}\}$.
The classification results are summarized in Table 4.2, in comparison with the
classification rates obtained using the temporal signal and the phase features across the full
frequency range from DC to the Nyquist frequency.

Table 4.2: SVM-with-Bagging classification results using the EEG phase feature in the frequency range from 2Hz to 9Hz

                         8 initial consonants            4 isolated vowels
  Temporal signal        46.0% (500ms); 41.5% (1 sec)    68.8%
  PHS-2                  39.2%                           55.3%
  PHS-2 [2Hz, 9Hz]       51.4%                           73.8%
For the 8 initial consonants, 219 out of 426 test samples were classified correctly.
The accuracy rate is 51.4%, with a p-value less than 10^{-82}. The result is significantly
better than classifying the EEG of initial consonants using the time-domain signal. The
classification rate on isolated vowels also improved, from 68.8% to 73.8%.
In conclusion, EEG features built on the phase pattern of the DFT can describe
the brain image of a phoneme as well as the original time-domain EEG signal.
Eliminating the amplitude information of the DFT does not diminish the distinctions between the
EEG representations of different phonemes at the initial position of the auditory stimuli.
The phase pattern of the sinusoidal components in the frequency range from 2Hz to 9Hz is
more important than the other frequency components in distinguishing the EEG images of
phonemes.
Chapter 5
Invariant Similarities between Brain
and Perceptual Representations of
Phonemes
Our degree of success in classifying brain representations of phonemes supports
investigating brain activities at a level below phonemes, i.e., the brainwaves
reflecting the distinctive features of phonemes, compared with perceived phonological
features. Intuitively, if two phonemes are perceptually close, the brainwaves evoked
by them should also be close. In this chapter, we derive the similarities between the
brainwaves of phonemes from the confusion matrices of the classification results
and compare them with the perceptual similarities obtained from corresponding
perceptual experiments.
5.1 Psychological experiments on phoneme perception
The phonological features introduced in Chapter 1 are not equally efficient in
discriminating phonemes perceptually. Historically, researchers have studied the
effectiveness of distinctive features in separating phonemes via psychological
experiments. In these experiments, the auditory speech tokens are presented to the
listeners, who are instructed to identify the phonemes they heard. The perceptual
confusion between each pair of phonemes is recorded. In the typical experiment
settings, the utterances are presented via a noisy speech channel with frequency
distortions to create the necessary confusions. One of the first psychological
experiments on consonant confusion is the renowned work of Miller and Nicely (1955).
They recorded the perceptual confusions in identifying 16 consonants, which were
filtered and presented at different signal-to-noise ratios (SNR), and used the
confusion data to determine the robustness of the distinctive features under filtering or
noise-masking conditions. They found that some features, voicing and nasality for
instance, are very robust, but the discernibility of the place of articulation is likely to
be affected. The Miller-Nicely experimental results were reliably reproduced in 2005
using modern computerized techniques and digital audio recordings (Phatak & Allen,
2008). Wang and Bilger (1973) conducted a similar but more thorough experiment.
They calculated the perceptual confusion matrices of 24
consonants, which cover all the distinctive consonant sounds in most English dialects,
in CV or VC syllables with different vowels, and evaluated the robustness of
phonological features in a variety of contexts and listening conditions. Relatively less
work on vowel confusions has been carried out. Besides frequency distortion
and noise masking, researchers have also studied phoneme perceptual confusions under
other conditions, such as short-term memory (Wickelgren, 1966) and impaired
hearing (Munson, 2002).
5.2 Similarity measurements
We applied the similarity analysis tools, semiorders and hierarchical partition
trees, to interpret both the brainwave confusion and the psychological confusion data.
Then the invariance between the brainwave similarity of phoneme images and the
corresponding perceptual similarity can be derived. We now briefly describe these
methods, which follow those of Suppes, Perreau-Guimaraes & Wong (2009).
5.2.1 Semi-Order and Invariant Partial Order of
similarities
When we classify the brainwaves of phonemes, we count the number of test
samples of phoneme $o_j$ that are classified as belonging to phoneme $o_i$ and normalize
it by the total number of test samples of phoneme $o_j$. This gives the estimated
conditional probability $\hat{p}(o_i^+ \mid o_j^-)$, where "+" and "-" denote the prototypes and the
test samples respectively. If we repeat this for each pair (i, j), a conditional probability
matrix is obtained. The normalized confusion matrix of the classification results provides
empirical evidence for ordering the similarity differences of the brainwave representations
of phonemes. Briefly speaking, it is natural to say that the phoneme $o_i^-$ is more
similar to the prototype $o_j^+$ than the phoneme $o_{i'}^-$ is to the prototype $o_{j'}^+$ if and only if

\hat{p}(o_j^+ \mid o_i^-) > \hat{p}(o_{j'}^+ \mid o_{i'}^-)    (5.1)
We denote this similarity-difference relation by

o_j^+ | o_i^- \succ o_{j'}^+ | o_{i'}^-    (5.2)

Because the confusion matrices are generally not symmetric, the similarity
differences are not necessarily symmetric. In practice, the difference between the
similarity of $o_j^+$ to $o_i^-$ and the similarity of $o_{j'}^+$ to $o_{i'}^-$ is considered statistically
insignificant if $\hat{p}(o_j^+ \mid o_i^-)$ and $\hat{p}(o_{j'}^+ \mid o_{i'}^-)$ are close enough. Here we introduce a
numerical threshold $\varepsilon$ such that

o_j^+ | o_i^- \succ o_{j'}^+ | o_{i'}^- \quad \text{iff} \quad \hat{p}(o_j^+ \mid o_i^-) > \hat{p}(o_{j'}^+ \mid o_{i'}^-) + \varepsilon    (5.3)
It can be proven that the similarity-difference ordering defined by the estimated
conditional probabilities with the numerical threshold is a semiorder on

A = \{ o_j^+ | o_i^- : i = 1, \dots, N;\ j = 1, \dots, N \}

i.e., $\succ$ is irreflexive, strongly transitive and an interval order on A.
In our study, we also need to compare the structural invariance between two
semiorders: the similarity-difference ordering of brainwaves, denoted $\succ_{br}$, and the
perceptual similarity-difference ordering of phonemes, denoted $\succ_{per}$. The invariance
is given by the intersection of the two semiorders

\succ_{inv}\ =\ \succ_{br} \cap \succ_{per}    (5.4)

which is a strict partial order.
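Computationally, the semiorders and their intersection follow directly from the two normalized confusion matrices; a minimal sketch (hypothetical helper names; P[j, i] estimates p(o_i+ | o_j-)):

    import numpy as np

    def semiorder(P, eps=0.01):
        """Boolean matrix over all pairs o_i+|o_j-: entry [r, s] is True iff
        the conditional probability of pair r exceeds that of pair s by more
        than eps, as in (5.3)."""
        p = P.ravel()                         # flatten the N*N probabilities
        return p[:, None] > p[None, :] + eps

    def invariant_partial_order(P_brain, P_perc, eps=0.01):
        """Intersection (5.4) of the brain and perceptual semiorders,
        a strict partial order."""
        return semiorder(P_brain, eps) & semiorder(P_perc, eps)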
To graph the semiorders and the invariant partial order, the relation
$o_j^+|o_i^- \succ o_{j'}^+|o_{i'}^-$ is illustrated by an arrow from the vertex denoting $o_j^+|o_i^-$ to the vertex
denoting $o_{j'}^+|o_{i'}^-$. To further simplify the graph, we define the congruence relation $\equiv$ as:

a \equiv b \quad \text{iff for all } c: \quad \text{(i)}\ a \succ c \text{ iff } b \succ c, \quad \text{(ii)}\ c \succ a \text{ iff } c \succ b    (5.5)

The congruence relation is an equivalence relation, i.e., reflexive, symmetric
and transitive. In the graph of the invariant partial order, we put $o_j^+|o_i^-$ and $o_{j'}^+|o_{i'}^-$ in
the same vertex if

o_j^+ | o_i^- \equiv o_{j'}^+ | o_{i'}^-    (5.6)

Given that the phonemes' prototypes are always on the left of the similarity
notation and their test samples on the right, the + and - signs can be omitted in the
graph without generating any confusion.
5.2.2 Partition tree of similarities
The similarity-difference ordering is the basis for generating a qualitative partition
tree, which shows a hierarchical partition of the combined set of test samples and
prototypes $O = \{o_1^+, \dots, o_N^+, o_1^-, \dots, o_N^-\}$ in a binary tree structure. First, we define the
"merged product" of two subsets $O_I$ and $O_J$ of O as:

O_I \otimes O_J = \{ o_j^+ | o_i^- : (o_j^+ \in O_I\ \&\ o_i^- \in O_J)\ \text{or}\ (o_j^+ \in O_J\ \&\ o_i^- \in O_I) \}    (5.7)

The inductive procedure starts from a partition P0 consisting of the 2N singleton
subsets of O. In the kth inductive step, two subsets of the partition Pk-1 are
chosen to be merged such that the least pair of their merged product under the
similarity-difference ordering is maximized among all possible merges.
Consequently, subsets with greater similarity are merged earlier than subsets
with smaller similarity in the inductive steps. Each step of the recursive procedure
reduces the cardinality of the partition by 1, so the (2N-1)th step reaches a partition
with only one block, which is the set O itself. The similarity tree is constructed by using the
2N hierarchical partitions in reverse order. The root node of the tree denotes the single
set O. The two branches from the root node lead to the partition of the (2N-2)th step,
which has two blocks. The same procedure continues until all the leaves of the tree are
the elements of O. The partition tree provides a fairly intuitive way of
summarizing the similarities of the test samples and prototypes contained in a matrix of
conditional probabilities. (Further details of the semiorder and similarity tree
can be found in Suppes, Perreau-Guimaraes & Wong, 2009.)
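A rough sketch of this merging procedure follows; for illustration it scores each candidate merge directly by the least conditional probability of the merged product, rather than by the thresholded semiorder used in the actual method of Suppes, Perreau-Guimaraes & Wong (2009):

    import numpy as np

    def partition_tree(P):
        """Greedy hierarchical partition of O = {o_1+,...,o_N+, o_1-,...,o_N-}.
        Element t < N is prototype o_t+; element t >= N is test set o_(t-N)-.
        P[j, i] estimates p(o_i+ | o_j-)."""
        N = P.shape[0]
        blocks = [[t] for t in range(2 * N)]   # the 2N singletons of P0
        merges = []

        def least_pair(I, J):
            # least conditional probability over the merged product (5.7)
            vals = [P[j - N, i] for i in I for j in J if i < N <= j]
            vals += [P[j - N, i] for i in J for j in I if i < N <= j]
            return min(vals) if vals else -np.inf

        while len(blocks) > 1:
            a, b = max(((a, b) for a in range(len(blocks))
                        for b in range(a + 1, len(blocks))),
                       key=lambda ab: least_pair(blocks[ab[0]], blocks[ab[1]]))
            merges.append((blocks[a], blocks[b]))
            blocks[a] = blocks[a] + blocks[b]
            del blocks[b]
        return merges   # read in reverse order to draw the tree from the root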
5.3 Experimental data analysis
5.3.1 Vowels
Since the classifiers predict the EEG images of isolated vowels much more
accurately than those of the vowels in CV syllables, we use the Isolated-vowels data
to generate the confusion matrix of the EEG representations of vowels. If the sample-without-replacement
scheme were used to calculate the averaged test samples, only
140 samples would be available, which is not enough to produce a confusion matrix
with a reliable off-diagonal structure. Thus, when constructing the vowel confusion
matrix, we took the time-domain signal as the EEG observation vector and created
300 averaged test samples for each vowel from the OOS test set using the sample-with-replacement
method. In this experiment, the PHS-2 EEG feature with the
limited frequency range from 2 to 9 Hz is used to represent the EEG signal. The class
labels of the test samples were predicted using the Bagging SVM model with linear
kernel. As a result, 826 of the 1200 test samples were correctly classified, a
classification rate of 68.8%. The normalized confusion matrix of classifying the EEG
images of vowels is shown in Table 5.1(a). The ith element in the jth row is the
probability that the test samples of the phoneme $o_j$ are classified as $o_i$; each row
sums to 100%.
We compare the EEG confusion matrix of vowels with the results of the vowel
perception experiments conducted by Pickett in 1957. The Pickett experiment
presented 12 English vowels in artificial syllables of the form bVb, spoken in a short
carrier phrase, and reported the perceptual confusion matrices of the vowels when the
utterances were masked by noise in various frequency ranges. Considering that the
brainwave data were collected in quiet office surroundings, only the perceptual
confusion matrix for flat noise is examined here. The perceptual conditional
probabilities are estimated by taking the elements associated with the four targeted
vowels from Table I(B) in Pickett (1957) to form a sub-matrix, and then dividing each
element of the sub-matrix by the sum of the corresponding row to get the
conditional probabilities shown in Table 5.1(b). The overall perceptual accuracy for
these 4 vowels is 82.8%.
Table 5.1: Normalized confusion matrices of 4 vowels. (a) The confusion matrix of EEG isolated-vowel classification. (b) The confusion matrix of the Pickett (1957) vowel-perception experiment.

  (a)  %    i      æ      u      ɑ
       i    66.3   6.0    21.0   6.7
       æ    6.0    79.0   5.3    9.7
       u    22.0   6.0    65.0   7.0
       ɑ    8.3    19.3   7.3    65.0

  (b)  %    i      æ      u      ɑ
       i    87.0   0.2    11.8   1.1
       æ    0.2    92.6   0      7.2
       u    45.3   0.2    53.6   0.9
       ɑ    0      1.9    0      98.1
Figure 5.1 compares the similarity trees derived from the brain and perceptual
confusion matrices. Looking at the similarity tree for the brain representation of
vowels, we can make several remarks. First, any vowel test set is more similar to its own
prototype than to any other vowel. A more interesting finding is the separation
between the open vowels /æ/ and /ɑ/ and the close vowels /i/ and /u/. The tree suggests that the
brain representation of vowel height is more robust than that of vowel backness. Since the
vowel height reflects the frequency of the first formant (F1) and the vowel backness is
inversely correlated with the second formant (F2), the results suggest that the EEG
activity is more sensitive to the low-frequency contrast around the F1 range (less than
1000Hz) than to the higher-frequency contrast around F2 (1000-2500Hz). This finding is
consistent with the fact that the human cochlea, where the sound-wave pressure is
converted to the original neural signals, has higher resolution at low frequencies.
The merging pattern of the perceptual similarity tree is almost identical to that of the
brain similarity tree. The slight differences between the brainwave confusions and the
perceptual confusions of vowels can be found only in the confusion matrices. In the
psychological experiment, although overall 82.8% of the vowels were perceived
accurately, the close vowels /u/ and /i/ were confused much more often than the open vowels
/æ/ and /ɑ/. This distinction is not found in classifying the brainwaves of vowels. As
Pickett mentioned, the perceptual intelligibilities of the four vowels are highly related to
their intensities (Pickett 1957, Table II). We think the strong perceptual confusions
between /i/ and /u/ are mainly due to the fact that vowels of low intensity
are less intelligible when masking noise is present. Therefore the acoustic
distinctions between the vowels, reflected by their locations in the F1-F2 space, are
mirrored qualitatively better in the similarity differences derived from the statistical
model of EEG images than in the perceptual confusions generated by the masking
noise.
The graph of the invariant partial order between the brain and perceptual confusions
of vowels is shown in Figure 5.2. We computed the intersection using a threshold of
eps = 0.01. We notice that the pairs sharing the same height, æ+|ɑ-, ɑ+|æ-, u+|i-, and
i+|u-, generally rank higher than the pairs sharing the same backness, æ+|i-, i+|æ-,
u+|ɑ-, and ɑ+|u-. The graph thus demonstrates that the greater robustness of
vowel-height over vowel-backness in distinguishing the vowels is invariant across the
perceptual and brain representations of vowels.
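The intersection itself can be sketched in a few lines. This is one plausible reading of the partial-order construction of Suppes et al. (2009), not a transcription of it: the similarity values for each labelled pair (e.g. 'ae+|a-') are assumed to be on a common probability scale, and differences below eps count as ties.

```python
from itertools import combinations

def invariant_partial_order(sim_brain, sim_percept, eps=0.01):
    """sim_*: dicts mapping a labelled pair such as 'ae+|a-' to its
    similarity value. The relation p > q survives the intersection only
    if it holds in both orderings beyond the threshold eps."""
    order = []
    for p, q in combinations(sim_brain, 2):
        d1 = sim_brain[p] - sim_brain[q]
        d2 = sim_percept[p] - sim_percept[q]
        if d1 > eps and d2 > eps:
            order.append((p, q))      # p ranks above q in both
        elif -d1 > eps and -d2 > eps:
            order.append((q, p))
    return order
```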
The four vowels /i/, /æ/, /u/ and /ɑ/ are labeled as "i", "ae", "u" and "a" respectively. (a) The similarity tree of the brainwave representation of vowels, derived from the classification results of the linear SVM model. The set of test samples and the prototype of a phoneme are denoted with "-" and "+" respectively. (b) The perceptual similarity tree of vowels, derived from the results of Pickett's psychological experiments.
Figure 5.1: The similarities of brain representation and perceptual representation of 4 vowels
Figure 5.2: Invariant partial order between brainwave and perceptual confusions of
the vowels
5.3.2 Consonants
Among all the experiments in classifying EEG images of consonants, the best
result was obtained when we classified the PHS-2 spectral feature, limited to the
frequency range of 2 to 9 Hz, using Bagging SVMs with a linear kernel. Here we use the
same classification model to generate the confusion matrix of the brainwaves of
consonants. We constructed 300 averaged test samples for each consonant from the OOS
test trials using the sample-with-replacement scheme. The classifier correctly
predicted the class labels of 1185 of the 2400 test samples, an accuracy rate of
49.4%. The normalized confusion matrix is shown in Table 5.2(a), and the
resulting similarity tree in Figure 5.3(a).
Table 5.2: Normalized confusion matrices of 8 consonants
(a) The confusion matrix of EEG consonant classification. (b) The perceptual confusion matrix from the Miller-Nicely experiment. The ith element in the jth row is the probability that test samples of the phoneme o_j are classified as o_i. Each row sums to 100%.
(a)
%      p      t      b      g      f      s      v      z
p    38.7   31.0    2.3    7.0    7.3    3.7    6.7    3.3
t    31.7   44.0    2.0    6.0    4.0    3.0    3.7    5.7
b     1.3    1.3   60.0   23.3    0.3    0.3   10.3    3.0
g     6.7    7.7   29.3   40.3    0.7    2.0    8.0    5.3
f     3.0    6.3    0.7    2.0   55.7   15.7   13.3    3.3
s     8.7    5.3    0.7    1.0   13.0   59.7    5.0    6.7
v    11.0    7.3   12.7   11.3    6.7    3.3   38.0    9.7
z     2.7    3.0    8.0   11.7    1.7    7.0    7.3   58.7
(b)
%      p      t      b      g      f      s      v      z
p    45.5   33.3    1.0    0.7   13.5    4.2    1.4    0.4
t    40.4   42.2    0.9    0.3    7.5    7.5    0.3    0.9
b     1.3    0.5   52.3    7.2    6.1    2.9   24.3    5.3
g     1.4    0.5   10.8   44.6    0.5    2.4    8.9   31.0
f    12.5    8.7    2.5    0.3   66.2    6.6    2.5    0.8
s     8.3    6.9    2.4    3.1   19.4   55.4    1.4    3.1
v     0.0    0.3   22.6    7.5    3.7    1.6   57.5    6.9
z     2.1    0.4   10.6   19.2    0.7    5.3   14.2   47.5
(a) The similarity tree of brainwave representations of the consonants. The set of test samples and the prototype of a phoneme are denoted with "-" and "+" respectively. (b) The similarity tree of perceptual representations of the consonants, derived from the results of (Miller & Nicely, 1955).
Figure 5.3: The similarities of brain and perceptual representation of 8 consonants
Here we compare the similarities of the brainwave representation of consonants
with the perceptual confusion data from the Miller and Nicely (1955) experiment. Only
the confusion matrices for the frequency response of 200 Hz-6500 Hz were inspected,
to match the experimental setup of the brainwave data. We summed the
matrices of Table II and Table III in (Miller & Nicely, 1955), which are the perceptual
confusions in the listening conditions of SNR = -12 dB and SNR = -6 dB respectively,
and extracted the elements for each pair of the eight target consonants from the summed
matrix to construct the confusion matrix for the targeted consonants. The accuracy rate
of the perceptual confusion matrix, that is, the ratio of the sum of the diagonal
elements to the sum of all the elements, is 52.3%. It is very close to the
classification rates on the brainwaves of consonants and provides a good foundation for
studying the invariance between brain and perceptual representations. The normalized
perceptual confusion matrix and the resulting similarity tree are shown in Table
5.2(b) and Figure 5.3(b).
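The construction of the perceptual confusion matrix can be summarized in a few lines. In this sketch, table2 and table3 stand in for the count matrices of Tables II and III in (Miller & Nicely, 1955), which the reader must supply, and targets holds the row/column indices of our eight consonants in those tables:

```python
import numpy as np

def perceptual_submatrix(table2, table3, targets):
    """Pool the Miller & Nicely confusion counts from the two SNR
    conditions, extract the eight target consonants, and row-normalize."""
    combined = np.asarray(table2) + np.asarray(table3)   # sum the two SNRs
    sub = combined[np.ix_(targets, targets)]             # 8x8 target sub-matrix
    accuracy = np.trace(sub) / sub.sum()                 # 52.3% in our case
    norm = 100.0 * sub / sub.sum(axis=1, keepdims=True)  # rows sum to 100%
    return norm, accuracy
```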
We make the following remarks about the similarity trees of consonants.
(1) Among the three distinctive features examined, voicing, continuant, and
place of articulation, voicing is the most robust feature for both the brain and the
perceptual representation of consonants, as shown by the fact that the voiced
and voiceless consonants join only at the last merge in both trees. The
robustness of voicing for brainwaves suggests that the temporal structure of the
auditory input, such as the voice onset time (VOT), which is the primary acoustic cue
for the voicing contrast (Lisker & Abramson, 1964), is well preserved in the brain
representation.
(2) For the voiceless consonants, the continuant feature is more distinctive than the
place of articulation for both brain and perceptual representations of consonants. In
fact, place of articulation is the most confused feature for brainwave representations,
since 3 of the 4 pairs of consonants that differ only in place of articulation, /p/ and
/t/, /b/ and /g/, as well as /f/ and /s/, are merged first.
(3) The major difference lies in the grouping structure of the voiced consonants.
Unlike in the brainwave results, where /b/ is mainly attracted to /g/, perceptually /b/
is more confused with the voiced fricative /v/, which shares the same place of
articulation. The contrast between /b/ and /g/ lies mostly in the transition portion
of the F2 of the vowel that follows (Miller & Nicely, 1955), while the primary
perceptual cues distinguishing /b/ and /v/ are the abrupt onset of the stop /b/ and
the turbulent frication noise of /v/ (Fujimura & Erickson, 1997). We also notice
that although the attraction between /b/ and /v/ is commonly seen in perceptual
consonant-categorization data collected with masking noise (Miller & Nicely, 1955;
Wang & Bilger, 1973; Phatak et al., 2008), it is not clearly shown in the perceptual
experiment on short-term memory (Wickelgren, 1966) or in the neural discrimination
of animals' cortical responses to human speech (Mesgarani, 2008).
Consequently, a possible explanation for the mismatch between the brainwave and
perceptual confusions is that frication is more perceptually distorted by white
noise than the formant transitions are.
The significant invariance between the similarities of the brainwave and perceptual
confusions of consonants is further illustrated by the invariant partial-order graph in
Figure 5.4. It shows that the similarity differences between the voiceless stops /p/ and
/t/, p+|t- and t+|p-, are very small for both brainwave and perception, and lie at the top
of the graph. Although the brain representation of /b/ is mainly confused with /g/,
/v/ has a strong attraction to /b/ as well. Combined with the fact that /v/ and /b/ are
highly confused in the perceptual experiment, the similarities v+|b- and b+|v- rank high
in the invariant partial-order graph.
Figure 5.4: Invariant partial order between brainwave confusions and perceptual
confusions of the consonants
Finally, let us revisit the classification rates. As remarked in Chapter 3, the
classifier achieves higher classification rates for the initial consonants than for the
vowels. For the vowels in CV syllables, this difference may arise because the
cognitive processing of the initial consonant lasts longer than the actual duration of its
sound, imposing extra noise on the brainwave of vowel perception and making it less
intelligible; the classification model of the averaged trials is more sensitive to the
beginning of the auditory stimulus than to its later portion. However, the
classification rates on isolated vowels are not as significant as the results on
consonants either. By examining the similarity differences of the EEG image
representations of phonemes, we found that the EEG observations reflect the temporal
distinctions of auditory stimuli, such as VOT, more accurately than the spectral
distinctions, such as the formant transitions. This can be another reason that the
classifier performs very well on consonants but less well on vowels. Considering
the generally accepted tonotopic organization of the human auditory cortex
(Talavage, 2004), this may be due to the relatively low spatial resolution of the EEG
signals. Extracting more spatial information from EEG, or combining it with other,
more space-sensitive technologies such as MEG or fMRI, may improve the
classification rates significantly.
Chapter 6
Classifiers Based on Distinctive Features
6.1 Classifying the distinctive features
The results of Chapter 5 show that the phonological distinctions of phonemes,
as interpreted by the distinctive features, are revealed in the brainwaves of phonemes
and captured efficiently by our EEG classification model. The similarity analysis of
the phoneme classification results shows that brainwaves of phonemes that differ
on some phonological feature, for instance voicing, are not likely to be confused with
each other when represented in the EEG feature space. The similarities of the
brain representation and the perceptual representation of phonemes are
approximately invariant. This finding naturally leads to the question of whether we
can predict the phonological features of EEG brainwaves using the same classification
model.
To answer this question, we ran an experiment to classify distinctive features,
which take binary values. We classified three distinctive features of the initial
consonants, voicing, continuant and place of articulation, using 8 sessions of
brainwave recordings from the Syllables-III data, and classified two features of the
vowels, height and backness, using 8 sessions of the Isolated-vowels data. All the
brainwaves used in this experiment were collected from one subject (LK). In total,
7168 trials were available for each classification task. To classify one distinctive
feature, the EEG trials were grouped into two classes taking opposite values on that
feature, for example voiced and voiceless. Since our choices of phonemes are balanced
on the distinctive features, the number of trials in each class is around 3584. The
binary grouping of phonemes for each feature is shown in Table 6.1.
As mentioned in Chapter 1, some distinctive features, such as voicing, continuant
and nasal, are widely adopted in most feature systems. But place of articulation is a
more complicated property of a sound, because the obstruction may occur at many
places along the oral tract. In this experiment, we tested two kinds of grouping for
the place of articulation. In the traditional definition of place of articulation, the
8 initial consonants take three different values: /p/, /b/, /f/ and /v/ are labial;
/t/, /s/ and /z/ are alveolar; /g/ is velar. We followed this approach and combined
the alveolar and velar consonants into a "non-labial" group, as opposed to the labial
consonants. We also tested the feature "coronal", proposed by Chomsky and Halle.
Coronal sounds are produced with the blade of the tongue raised from its neutral
position. Among the 8 initial consonants, /t/, /s/ and /z/ are coronal while /p/,
/b/, /g/, /f/ and /v/ are non-coronal.
The SVM-with-Bagging model with a linear kernel was used for the classification.
The classifiers were trained and tested using the following configuration:
EEG channels: 124 monopolar channels
EEG feature: PHS-2 spectral feature with a limited frequency band of 2 to 9 Hz
Number of SVMs for Bagging: 35
Number of trials for averaging: 25
Number of principal components used for classification: 100
Cost factor of the SVMs: optimized in nested 5-fold cross-validation loops and
chosen from $\{2^{-10}, 2^{-9}, \ldots, 2^{9}, 2^{10}\}$.
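A minimal sketch of the nested cross-validation for the cost factor is given below, with random stand-in data in place of the PCA-reduced PHS-2 features; the actual pipeline additionally includes the PCA projection, trial averaging, and Bagging. The names X and y are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))     # stand-in for PCA-reduced PHS-2 features
y = rng.integers(0, 2, size=200)    # stand-in binary feature labels

grid = {'C': [2.0 ** k for k in range(-10, 11)]}         # 2^-10 ... 2^10
inner = GridSearchCV(SVC(kernel='linear'), grid, cv=5)   # inner loop selects C
outer_scores = cross_val_score(inner, X, y, cv=5)        # outer loop estimates accuracy
print(outer_scores.mean())
```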
The binary classification accuracies and p-values are also shown in Table 6.1.
Table 6.1: Classifying the distinctive features

feature           grouping                           rate     p-value
consonant
voicing           voiceless /p/ /t/ /f/ /s/          92.1%    <10^-26
                  voiced /b/ /g/ /v/ /z/
continuant        stop /p/ /t/ /b/ /g/               81.4%    <10^-13
                  fricative /f/ /s/ /v/ /z/
place (labial)    labial /p/ /b/ /f/ /v/             69.3%    <10^-5
                  non-labial /t/ /g/ /s/ /z/
place (coronal)   coronal /t/ /s/ /z/                77.9%    <10^-10
                  non-coronal /p/ /b/ /g/ /f/ /v/
vowel
height            open /æ/ /ɑ/                       83.8%    <10^-16
                  close /i/ /u/
backness          front /æ/ /i/                      71.8%    <10^-7
                  back /ɑ/ /u/
We found that, for all the features under investigation, the classification rates are
well above chance level and the p-values are less than 10^-7. Among the phonological
features of consonants, the classification of voicing achieved the highest accuracy,
92.1%. The classification of continuant was slightly worse, at 81.4%.
For place of articulation, classification of the coronal feature achieved a
significantly better rate than classification of labial, which indicates that
brainwaves of consonants that differ on coronal are more separated in the EEG feature
space than brainwaves of consonants that differ on labial. For the features of vowels,
vowel-height can be classified more accurately than vowel-backness. The differences
in the classification rates are consistent with the similarity analysis results of
Chapter 5.
The success in classifying binary distinctive features shows that the brainwaves of
phonemes that share the same value on a distinctive feature are clustered in the
EEG feature space. This suggests a new approach to classifying the brainwaves evoked
by auditory stimuli of language constituents. If we code the phonemes, syllables,
words, etc., as binary features, we can use a small set of binary classifiers to separate
them. For example, only 5 binary classifiers are needed to distinguish the 32 syllables
in the Syllables-III data. This approach generalizes to all phonemes and syllables
without making the classification model overly complicated. In fact, according to
Chomsky and Halle's work, all the phonemes in human speech can be represented using
27 binary phonological features, with some degree of redundancy; far fewer features
are needed to represent a particular language such as American English.
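The coding idea can be made concrete with a small sketch. The bit assignment below follows the groupings of Table 6.1 (voiced, fricative, labial), but the ordering of the bits is illustrative; with these three features every 3-bit pattern codes a consonant, while sparser codes can leave some patterns non-decodable.

```python
FEATURES = {            # (voiced, fricative, labial), per Table 6.1
    'p': (0, 0, 1), 't': (0, 0, 0),
    'b': (1, 0, 1), 'g': (1, 0, 0),
    'f': (0, 1, 1), 's': (0, 1, 0),
    'v': (1, 1, 1), 'z': (1, 1, 0),
}
CODE2PHONE = {bits: ph for ph, bits in FEATURES.items()}

def decode(bits):
    """Map predicted feature bits back to a phoneme; returns None when
    the bit pattern codes no stimulus."""
    return CODE2PHONE.get(tuple(bits))
```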
Since our brainwave data cover only a small subset of the distinctive features, we
can offer only a preliminary study of this approach.
6.2 Distinctive-feature-based classifiers
The overall framework of the distinctive-feature (DF)-based classification model for
the brainwaves of phonemes is identical to the SVM-with-Bagging classifier introduced
in Section 3.3, and is kept intact as in Figure 3.6. To implement the N-class SVM, we
use an ensemble of binary SVM classifiers based on the phonological distinctive
features of the stimuli instead of N(N-1)/2 "one-against-one" binary SVM classifiers.
Suppose the N speech stimuli can be distinguished using k phonological features,
each taking two values, denoted 0 and 1. Then each stimulus can be coded as a unique
k-bit binary number $b_1 b_2 \cdots b_k$, where $b_i \in \{0, 1\}$, with each bit
denoting the value of one phonological feature. If we use a two-class SVM classifier
to predict one bit of the code, then k SVMs are needed to classify the N stimuli. The
number of distinctive features cannot be less than $\log_2 N$. However, since the
code is not randomly assigned but reflects phonological properties of the speech
stimuli, the binary coding may not be very compact for a specific classification task;
that is, the number of distinctive features may be much greater than $\log_2 N$.
Moreover, if the speech stimuli do not cover all possible combinations of the
distinctive features, it is likely that the combination of predicted labels from the k
binary classifiers corresponds to no stimulus. In our classification model, since the
classification results of the N-class SVMs are aggregated via majority voting, we can
simply drop any non-decodable result from the SVMs during aggregation.
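The following sketch shows the parallel DF-based N-class classifier and the vote with non-decodable codes dropped. It is a simplified illustration, assuming scikit-learn in place of LIBSVM and omitting the PCA and trial-averaging steps; the class and function names are hypothetical.

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

class DFClassifier:
    """One linear SVM per distinctive feature; predictions are
    concatenated into a k-bit code and decoded to a stimulus label."""
    def __init__(self, k, code2class):
        self.svms = [SVC(kernel='linear') for _ in range(k)]
        self.code2class = code2class          # valid k-bit code -> label

    def fit(self, X, codes):                  # codes: (n, k) 0/1 array
        codes = np.asarray(codes)
        for i, svm in enumerate(self.svms):
            svm.fit(X, codes[:, i])           # each SVM uses all training data
        return self

    def predict_one(self, x):
        bits = tuple(int(svm.predict(x[None])[0]) for svm in self.svms)
        return self.code2class.get(bits)      # None when non-decodable

def bagged_vote(classifiers, x):
    """Majority vote across bootstrap-trained DFClassifiers, dropping
    non-decodable codes before voting."""
    votes = [c.predict_one(x) for c in classifiers]
    votes = [v for v in votes if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None
```

Note the point made above: each binary SVM sees all the training samples, which is why the scheme remains usable when training data are scarce.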
To test the performance of the DF-based classification model and compare the
results with those of previous chapters, we classified the 8 initial consonants using all
24 sessions of the Syllables-III data and the 4 vowels using the Isolated-vowels data.
The first three features in Table 6.1, voicing, continuant and place of articulation
(labial), are used to distinguish the 8 consonants; vowel-height and vowel-backness
characterize the 4 vowels. The parameters of the classification model are configured as
in the experiments of Section 6.1. The model correctly classified 39.0% of the test
samples of initial consonants, with a p-value less than 10^-42, and 53.5% of the test
samples of isolated vowels, with a p-value less than 10^-11.
Although the classification rates are not as good as the results obtained in Chapter
5, the success of classifying the brainwaves of phonemes using distinctive features
indicates that distinctive features may be the underlying mechanism by which the brain
parses and retrieves phonemic information when it processes speech input. The
algorithm also works when only a small amount of training data is available, since each
binary SVM can be estimated using all the training samples.
6.3 Parallel structure vs. Hierarchical structure
In the DF-based classification model, the binary SVMs that predict the values
of the distinctive features of each test sample are trained in a parallel manner. This
rests on an underlying assumption that all the distinctive features are represented in
the brainwaves independently. The independence assumption can be written as: for any
$i \neq j$, the optimal separating hyperplane between the class $(b_i = 0, b_j = 0)$
and the class $(b_i = 1, b_j = 0)$ approximately coincides with the optimal separating
hyperplane between the class $(b_i = 0, b_j = 1)$ and the class $(b_i = 1, b_j = 1)$.
The assumption is very strict and usually false in practice.
(a) Classifying the vowel-height and vowel-backness independently
(b) Classifying the vowel-height and vowel-backness hierarchically
Figure 6.1: Classifying 4 vowels in F1-F2 space
For instance, as we mentioned in Chapter 1, the phonological features vowel-height
and vowel-backness are closely related to the first formant F1 and the second formant
F2 respectively. We cut the auditory stimuli of the Syllables-III data into frames of
10 ms length and calculated the F1 and F2 of each frame within the vowel segments.
We then plotted the frames of the 4 vowels, /i/, /æ/, /u/ and /ɑ/, in the F1-F2 space as
shown in Figure 6.1, and looked for the optimal hyperplanes that separate the
vowels differing in vowel-height or vowel-backness in the F1-F2 domain. The blue
solid lines in Figures 6.1(a) and 6.1(b) show the optimal separation line between the
open vowels (/æ/ and /ɑ/) and the close vowels (/i/ and /u/), estimated using a linear
soft-margin SVM model with C = 0.15. Only 1.1% of the data points lie on the wrong
side of the separation line. The blue dashed line in Figure 6.1(a) illustrates the optimal
separation line for the feature vowel-backness, regardless of whether the vowels are
open or close. The samples of front vowels are located below the line and those of back
vowels above it, with 4.7% exceptions. In Figure 6.1(b) the hyperplanes that divide the
front and back vowels are estimated separately for the open vowels and the close vowels,
shown as a blue dashed line and a green dashed line respectively. The two dashed lines
are clearly apart from each other: the separation hyperplane for vowel-backness is
different for open vowels and for close vowels. Therefore the distinctive features of
vowels, vowel-height and vowel-backness, do not satisfy the independence assumption,
and the brainwave representations of the distinctive features will not be independent
either.
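The dependence argument can be reproduced with synthetic formant data; the sketch below uses rough textbook formant means rather than measurements from the Syllables-III stimuli, so the exact boundaries will differ from Figure 6.1, but the qualitative conclusion is the same.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Approximate (F1, F2) means in Hz; illustrative values, not our stimuli.
means = {'i': (280, 2250), 'u': (310, 870), 'ae': (660, 1700), 'a': (710, 1100)}
X = {v: rng.normal(m, (40, 120), size=(200, 2)) for v, m in means.items()}

def backness_line(front, back):
    """Fit a linear soft-margin SVM (C = 0.15, as in Figure 6.1) that
    separates a front vowel from a back vowel; return (w, b)."""
    data = np.vstack([X[front], X[back]])
    y = np.r_[np.zeros(200), np.ones(200)]
    svm = SVC(kernel='linear', C=0.15).fit(data, y)
    return svm.coef_[0], svm.intercept_[0]

w_close, b_close = backness_line('i', 'u')    # close vowels only
w_open,  b_open  = backness_line('ae', 'a')   # open vowels only
# The two (w, b) pairs differ noticeably: the backness boundary depends
# on vowel-height, so the two features are not represented independently.
```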
With this in mind, I proposed a DF-based classification model with a hierarchical
structure, in which the decision rules for the distinctive features are allowed to be
mutually dependent.
Suppose a test sample $\mathbf{x}$ belongs to the class labeled $y$, which is coded
using $k$ binary distinctive features as $y = b_1 b_2 \cdots b_k$, where
$b_i \in \{0, 1\}$ for $i = 1, \ldots, k$. We use a two-class classifier to predict the
value of each feature, and write the decision rule of the ith classifier as
$h_i(\mathbf{x}) = \hat{b}_i$. The classification model with the parallel structure is
then described as

$$h_1(\mathbf{x}) = \hat{b}_1,\quad h_2(\mathbf{x}) = \hat{b}_2,\quad \ldots,\quad
h_k(\mathbf{x}) = \hat{b}_k \;\Rightarrow\;
\hat{y} = \hat{b}_1 \hat{b}_2 \cdots \hat{b}_k. \tag{6.1}$$
For the hierarchical classification model, however, the values of the distinctive
features are predicted sequentially, and the classification of the ith distinctive
feature depends on the predicted labels of the previous features, i.e.,
$\hat{b}_1 \cdots \hat{b}_{i-1}$. The classification process for a sample $\mathbf{x}$
then has a binary-tree structure. Figure 6.2 shows an 8-class classifier with the
hierarchical structure.
Figure 6.2: Hierarchical models for classifying 8 classes
Using the hierarchical structure, an N-class classification problem can be solved
with N-1 binary classifiers. Since errors propagate from top to bottom in this
structure, the crucial step in constructing such a classifier is to find the optimal
ordering of the features. Intuitively, the distinctive feature that achieved the highest
classification rate in the binary classification experiments should be predicted first,
to provide a good foundation for the later predictions.
[Figure 6.2 depicts a three-level binary tree. The root SVM predicts DF #1
($h(\mathbf{x}) = \hat{b}_1$), splitting the classes into 0xx and 1xx; the two
second-level SVMs predict DF #2 conditional on $\hat{b}_1$ (00x, 01x, 10x, 11x);
the four third-level SVMs predict DF #3, terminating in the eight leaf codes
000 through 111.]
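A hedged sketch of the hierarchical model follows: the classifier for feature i is trained, and applied, separately for each prefix of earlier feature bits, mirroring the tree of Figure 6.2. As before, scikit-learn stands in for the actual implementation and the names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

class HierarchicalDF:
    """Hierarchical DF-based model: one SVM per node of the binary tree,
    each conditioned on the predicted prefix of earlier feature bits."""
    def __init__(self, k):
        self.k = k
        self.nodes = {}                        # prefix tuple -> fitted SVC

    def fit(self, X, codes):                   # codes: (n, k) 0/1 int array
        codes = np.asarray(codes)
        for i in range(self.k):
            for prefix in {tuple(c) for c in codes[:, :i]}:
                mask = np.all(codes[:, :i] == prefix, axis=1)
                Xi, yi = X[mask], codes[mask, i]
                if len(np.unique(yi)) == 2:    # node needs both bit values
                    self.nodes[prefix] = SVC(kernel='linear').fit(Xi, yi)
        return self

    def predict_one(self, x):
        bits = ()
        for i in range(self.k):
            svm = self.nodes.get(bits)
            b = int(svm.predict(x[None])[0]) if svm else 0  # default for unseen prefix
            bits += (b,)                       # condition on the predicted prefix
        return bits
```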
The DF-based classifier with the hierarchical structure was tested and compared with
the classifier with the parallel structure using both the Isolated-vowels data and the
initial-consonant data of the Syllables-III experiment. Only two distinctive features
need to be classified to distinguish the four isolated vowels. We tested the two
possible orderings of the distinctive features, denoted height→backness and
backness→height. The percentage of test samples that were classified correctly and
the significance levels are summarized in Table 6.2.
Table 6.2: Vowel classification results using DF-based classifiers

                      rates    p-values
parallel              53.5%    <10^-11
height→backness       65.2%    <10^-23
backness→height       57.0%    <10^-15
We found that the hierarchical model which classifies vowel-height prior to
vowel-backness correctly classifies 65.2% of the test samples, much higher than the
parallel classifier and the hierarchical classifier with the order backness→height.
The results are consistent with our prediction that the distinctive feature with the
higher binary classification rate should be ranked higher in the hierarchical
structure.
Hence, for classifying the 8 initial consonants, we put voicing, the distinctive
feature that can be predicted with an accuracy rate of 92.1% by the binary model, at
the top of the tree structure. Table 6.3 shows the classification results of the parallel
classifier and of the hierarchical classifiers with the orders voicing→continuant→place
and voicing→place→continuant. The hierarchical classifier with the DF order
voicing→continuant→place achieves the best classification rate, 47.2%.
Table 6.3: Initial-consonant classification results using DF-based classifiers

                              rates    p-values
parallel                      39.0%    <10^-42
voicing→continuant→place      47.2%    <10^-67
voicing→place→continuant      40.8%    <10^-47
Although the best performance of the DF-based classifiers is slightly worse than
the best results obtained using N(N-1)/2 one-against-one SVM classifiers, the DF-based
model is simpler and can easily be extended to more complicated speech stimuli. We can
also use the DF-based classification model to analyze the relations among the
distinctive features as they are processed in the brain. For example, we tested a
4-class classification task using 8 sessions of the Syllables-III data from LK. Each
class contains 2 initial consonants that differ only in place of articulation: the
voiceless stops /p/ and /t/, the voiceless fricatives /f/ and /s/, the voiced stops /b/
and /g/, and the voiced fricatives /v/ and /z/. The four classes are thus distinguished
by two distinctive features, voicing and continuant. We tested the classification
accuracies of the classification models with the parallel and hierarchical structures.
The classification results, including the binary classification rates for these two
features, are summarized in Table 6.4.
Table 6.4: Results of classifying the combination of voicing and continuant using the SVM-with-Bagging model

classification tasks                              rates
2 classes, voiced vs. voiceless                   92.1%
2 classes, stop vs. fricative                     81.4%
4 classes, voicing + continuant (parallel)        74.3%
4 classes, voicing→continuant (hierarchical)      75.7%
4 classes, continuant→voicing (hierarchical)      76.4%
We found that the 4-class classification accuracy rates of the parallel classifier and
the two hierarchical classifiers are about the same. They are close to the product of the
rates obtained from the two binary classification experiments,
$92.1\% \times 81.4\% \approx 75.0\%$. The results indicate that although the
feature continuant cannot be predicted as accurately as the feature voicing in
brainwaves, the brain may process the two features independently.
The significant results of classifying the EEG images of phonemes using
distinctive features suggest that the human brain may use a distinctive-feature-based
parallel computation mechanism to process phonemes.
The hierarchical DF-based classification model can be improved by adopting
algorithms that optimize decision trees. In particular, when we classify more
phonemes, more distinctive features are involved, and it becomes impractical to find
the optimal tree structure by examining all possible orderings. Efficient data-driven
methods for designing decision trees can be used in that case.
Chapter 7
Conclusion and Prospects
A mathematical model that can recognize the brain activities of phoneme
processing is not only essential for developing a language-based brain-computer
interface; it can also provide a powerful method for studying the mechanisms the
human brain uses to process language.
To achieve the goal of this work, developing a mathematical-statistical method to
recognize the EEG brainwaves of phonemes, two major problems needed to be solved.
One is to compress the redundant and noisy EEG data into compact features that
retain the crucial information about phoneme perception. The other is to develop
statistical methods that can identify the phonemes as represented by those features.
I started my work by solving the latter problem using EEG time-domain signals.
Three classification approaches were studied in this thesis. In the first approach, the
brain-speech mapping method, we considered the brain as a dynamic system that takes
speech, described using acoustic features, as input and produces EEG brainwaves
as output. Linear transformations were estimated to simulate the inverse system; the
EEG brainwaves can be classified by passing them through the inverse system and
comparing the estimated speech input with speech prototypes. In the second approach,
purely statistical methods, PCA and SVM with Bagging, are used to construct a
classifier. The third approach is a modification of the Bagging SVM method: it
classifies phonemes by classifying their distinctive features. All three methods
were implemented and tested using EEG recordings collected in our lab. The brain-
speech-mapping method can classify 44.9% of individual EEG trials of the 4 initial
consonants of the Syllables-I data. The SVM-with-Bagging method achieved an accuracy
of 46.0% in classifying the averaged test trials of 8 consonants and an accuracy of
68.8% in classifying the averaged test trials of 4 isolated vowels when the linear
kernel was used. However, these methods show limitations in classifying vowels in
CV syllables. How to use knowledge of the preceding phonemes to classify the
phonemes at non-initial positions of the stimuli is one of the major challenges in
extending these methods to classify syllables and words.
The three approaches can be brought together to further improve the classification
accuracy. For example, the bootstrap aggregating method can also be incorporated into
the brain-speech mapping method. Moreover, the results of classifying initial
consonants using data from single channels show that some channels do not
contribute to classifying phonemes. The classification model can be further
improved by using only the channels closely related to phoneme perception or by
introducing more sophisticated spatial analysis methods.
Using the SVM-with-Bagging method, I was able to address the first problem and
examine the frequency-domain decomposition of EEG. I found that the phase pattern of
brainwave oscillations in the frequency range from 2 Hz to 9 Hz is highly related to
phoneme processing. Using the phases of the sinusoidal components from 2 Hz to 9 Hz,
the classifier can recognize 51.4% of test samples of the 8 initial consonants,
improved from 41.5% when the EEG time-domain signal is used.
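A minimal sketch of such a phase-only spectral feature for a single channel is given below. It is an illustration of the idea, not the exact PHS-2 implementation, which may differ in windowing and in how frequency bins are handled.

```python
import numpy as np

def phase_feature(x, fs, f_lo=2.0, f_hi=9.0):
    """Phase-only spectral feature: DFT the time-domain signal and keep
    only the angles of the frequency bins between f_lo and f_hi (2-9 Hz),
    discarding the amplitudes."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return np.angle(spec[band])               # phases in radians
```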
In this thesis, I also studied the ordinal similarity differences of the brainwave
representations of phonemes, derived from the confusion matrices of the classification
results, and compared them with the perceptual similarities of phonemes. The
robustness of the feature voicing is found in both the brain and the perceptual
representations of consonants, and the feature vowel-height is more distinct than
vowel-backness in both the brain and the perceptual representations of vowels. The
invariant similarity in the brain and perceptual representations of phonemes supports
the claim that the brain activities of perceiving phonological features can be
effectively observed by measuring EEG activity and are captured by our detailed model.
List of References
[1] Bell, A. J., & Sejnowski, T. J. (1995). An Information-Maximization
Approach to Blind Separation and Blind Deconvolution. Neural
Computation , 7 (6), 1129-1159.
[2] Boersma, P., & Weenink, D. (2011). Praat: doing phonetics by
computer [Computer program] Version 5.2.11. Retrieved January 2011,
from http://www.praat.org/
[3] Breiman, L. (1996). Bagging Predictors. Machine Learning, 24 (2),
123-140.
[4] Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector
machines. Retrieved from http://www.csie.ntu.edu.tw/~cjlin/libsvm
[5] Chomsky, N., & Halle, M. (1968). The sound pattern of English.
[6] Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine
Learning , 20 (3), 273-297.
[7] Delorme, A., & Makeig, S. (2004). EEGLAB: an open source toolbox for
analysis of single-trial EEG dynamics. Journal of Neuroscience
Methods, 134, 9-21.
[8] Engineer, C. T., Perez, C. A., Chen, Y. H., Carraway, R. S., Reed, A. C.,
Shetake, J. A., et al. (2008). Cortical activity patterns predict speech
discrimination ability. Nat. Neurosci. , 11, 603-608.
[9] Eulitz, C., & Obleser, J. (2007). Perception of acoustically complex
phonological features in vowels is reflected in the induced brain-
magnetic activity. Behavioral and Brain Functions , 3 (26).
[10] Fant, G. (1960). Acoustic Theory of Speech Production. Netherlands:
Mouton & Co.
[11] Formisano, E., Martino, F. D., Bonte, M., & Goebel, R. (2008). “Who"
Is Saying "What"? Brain-Based Decoding of Human Voice and Speech.
Science , 322 (5903), 970-973.
[12] Frye, R. E., Fisher, J. M., Coty, A., & Zarella, M. (2007). Linear Coding
of Voice Onset Time. Journal of Cognitive Neuroscience , 19 (9), 1476-
1487.
[13] Handbook of the International Phonetic Association. (1999). Cambridge
University Press.
[14] Jakobson, R., & Halle, M. (1956). Fundamentals of Language.
[15] Johnson, K. (2003). Acoustic & auditory phonetics. Blackwell
Publishing.
[16] Jolliffe, I. T. (2002). Principal Component Analysis. NY: Springer.
[17] Jung, T., Makeig, S., Humphries, C., Lee, T., Mckeown, M. J., Iragui,
V., et al. (2000). Removing electroencephalographic artifacts by blind
source separation. Psychophysiology , 37, 163-178.
[18] Ladefoged, P. (1982). A Course in Phonetics.
[19] Liebenthal, E., Binder, J. R., Spitzer, S. M., Possing, E. T., & Medler, D.
A. (2005). Neural Substrates of Phonemic Perception. Cerebral Cortex ,
15 (10), 1621-1631.
[20] Lisker, L., & Abramson, A. S. (1964). A Cross-Language Study of
Voicing in Initial Stops: Acoustic Measurements. Word, 20, 384-422.
[21] Mesgarani, N. (2008). Phoneme representation and classification in
primary auditory cortex. J. Acoust. Soc. Am. , 123 (2), 899-909.
[22] Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual
confusions among some English consonants. J. acoust. Soc. Am. , 27,
338–352 .
[23] Munson, B., & Nelson, P. B. (2005). Phonetic identification in quiet and
in noise by listeners with cochlear implants. J. Acoust. Soc. Am., 118
(4), 2607-2671.
[24] Näätänen, R., et al. (1997). Language-specific phoneme representations
revealed by electric and magnetic brain responses. Nature, 382, 431-
434.
[25] Näätänen, R. (2001). The perception of speech sounds by the human
brain as reflected by the mismatch negativity (MMN) and its magnetic
equivalent (MMNm). Psychophysiology , 38 (1), 1-21.
[26] Näätänen, R., Gaillard, A., & Mantysalo, S. (1978). Early selective-
attention effect on evoked potential reinterpreted. Acta Psychologica ,
42 (4), 313-329.
[27] Obleser, J., Lahiri, A., & Eulitz, C. (2004). Magnetic Brain Response
Mirrors Extraction of Phonological Features from Spoken Vowels.
Journal of Cognitive Neuroscience , 16 (1), 31-39.
[28] Pfurtscheller, G., & Silva, F. H. (1999). Event-related EEG/MEG
synchronization and desynchronization: basic principles. Clinical
Neurophysiology , 110 (11), 1842-1857.
[29] Phatak, S. A., Lovitt, A., & Allen, J. B. (2008). Consonant Confusions
in White Noise. J. Acoust. Soc. Am., 1220-1233.
[30] Pickett, J. (1957). Perception of vowels heard in noises of various
spectra. J. Acoust. Soc. Am., 29, 613-620.
[31] Rabiner, L., & Juang, B.-H. (1993). Fundamentals of Speech Recognition.
Prentice Hall.
[32] Steinschneider, M., Fishman, Y., & Arezzo, J. (2003). Representation of
the voice onset time (VOT) speech parameter in population responses
within primary auditory cortex of the awake monkey. J. Acoust. Soc.
Am. , 114, 307-321.
[33] Steinschneider, M., Reser, D., Schroeder, C., & Arezzo, J. (1995).
Tonotopic organization of responses reflecting stop consonant place of
articulation in primary auditory cortex (A1) of the monkey. Brain Res. ,
674, 147-152.
[34] Suppes, P., Han, B., Epelboim, J., & Lu, Z.-L. (1999). Invariance
between subjects of brain wave representations of language. Proceedings
of the National Academy of Sciences, 12953-12958.
[35] Suppes, P., Lu, Z.-L., & Han, B. (1997). Brain wave recognition of
words. PNAS , 94, 14965-14969.
[36] Suppes, P., Perreau-Guimaraes, M., & Wong, D. K. (2009). Partial
Order of Similarity Differences Invariant Between EEG-recorded Brain
and Perceptual Representations of Language. Neural Computation , 21,
3228-3269.
[37] Tallon-Baudry, C., & Bertrand, O. (1999). Oscillatory gamma activity in
humans and its role in object representation. Trends in Cognitive
Science , 3 (4), 151-162.
[38] Wall, M. E., Rechtsteiner, A., & Rocha, L. M. (2003). Singular value
decomposition and principal component analysis. In D. Berrar, W.
Dubitzky, & M. Granzow, A Practical Approach to Microarray Data
Analysis (pp. 91-109).
[39] Wang, M. D., & Bilger, R. C. (1973). Consonant confusions in noise: a
study of perceptual features. J. Acoust. Soc. Am. , 54 (5), 1248-1266.
[40] Wang, S.-J., et al. (2009). Empirical analysis of support vector
machine ensemble classifiers. Expert Systems with Applications, 36 (3),
6466-6476.
[41] Wickelgren, W. A. (1966). Distinctive features and errors in short-term
memory of English consonants. The Journal of the Acoustical Society of
America, 39 (2), 388-398.
[42] Wong, D. K. (2004). Multichannel Classification of Brain-wave
Representations of Language by Perceptron-based Models and
Independent Component Analysis (Ph.D. Dissertation). Stanford
University.
[43] Wong, D. K., Perreau-Guimaraes, M., Uy, E. T., & Suppes, P. (2004).
Classification of individual trials based on the best independent
component of EEG-recorded sentences. Neurocomputing , 479-484.