RECOGNIZING PHONEMES AND THEIR DISTINCTIVE FEATURES
IN THE BRAIN
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF
ELECTRICAL ENGINEERING AND THE
COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Rui Wang
March, 2011
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/wj289qm5838
© 2011 by Wang Rui. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Patrick Suppes, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Stephen Boyd
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Bernard Widrow
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
How the human brain processes phonemes has been a subject of interest for
linguists and neuroscientists for a long time. Electroencephalography (EEG) offers a
promising approach to observe neural activities of phoneme processing in the brain,
thanks to its high temporal resolution, low cost and noninvasiveness. The studies on
Mismatch Negativity (MMN) effects in EEG activities in the 1990s suggested the
existence of a language-specific central phoneme representation in the brain. Recent
findings using magnetoencephalography (MEG) also suggested that the brain encodes
the complex acoustic-phonetic information of speech into the representations of
phonological features before the lexical information is retrieved. However, very little
success has yet been reported in classifying the brain activities associated with
phoneme processing.
In my work, I proposed a classification framework which incorporates Principal
Components Analysis (PCA), cross-validation and support vector machine (SVM)
methods. The initial classification rates were not very good. Progress was made by using a bootstrap aggregation (Bagging) scheme and introducing phase calculations. To
calculate phase, I computed the Discrete Fourier Transform (DFT) of the original
time-domain signal and kept the angles of the finite sample of frequencies. The
resulting EEG spectral representation contains only the phase and frequency
information and ignores the amplitudes. Using this method, the accuracy of classifying averaged test samples of eight consonants improved from 41% to 51%.
Furthermore, the qualitative analysis of the similarities between the EEG
representations, derived from the confusion matrices, illustrates the invariance of brain
and perceptual representations of phonemes. For brain and perceptual representations of consonants, voicing is the most distinguishable feature among voicing, continuant and place of articulation, and vowel-height is more robust than vowel-backness in both brain and perceptual representations of vowels.
By extending and further refining these methods, it is likely that significant classification of other phonemes and features can be achieved.
Acknowledgements
First of all, I would like to express my gratitude to my principal advisor,
Professor Patrick Suppes, for directing me to this interesting area and giving me his
invaluable support and guidance throughout my study. The enthusiasm he has for
research is infectious and encouraging. I want to thank Professor Bernard Widrow and Professor Stephen Boyd for their helpful advice on both my research and academic progress, and for their very insightful comments on the draft of this dissertation. I would also
like to thank Professor Christopher Potts for serving as the chairman of my oral exam
and giving valuable suggestions from the perspective of linguistics.
I am very fortunate to pursue my Ph.D. degree in a supportive and inspiring
environment at Stanford University. Being able to work closely with a group of
outstanding researchers has been important in making my Ph.D. pursuit productive and
enjoyable. I am especially grateful to the members of the Suppes Brain Lab. In particular, I would like to acknowledge Marcos Perreau Guimaraes, who gave helpful advice and tips on the SVM-with-Bagging methods of EEG classification and the similarity analysis
discussed in this dissertation. Dik Kin Wong, Logan Grosenick, Claudio Carvalhaes,
Acacio de Barros and Lene Harbott gave lots of thoughtful ideas and asked motivating
questions in group discussions. Blair Bohannan and Duc Nguyen helped me collect EEG data.
Finally, I would like to thank my family and my parents for their love and support.
Table of Contents
Chapter 1 Introduction ........................................................................................ 1
1.1 Phonemes and distinctive features ............................................................ 1
1.2 Brain activities in phoneme perception .................................................... 7
1.2.1 Measurements of brain activities ...................................................... 7
1.2.2 Brain activities in phoneme perception ............................................. 8
1.3 Motivation and Contribution .................................................................... 9
1.4 Outline of the thesis ................................................................................ 12
Chapter 2 Relevant EEG Data .......................................................................... 13
2.1 Syllables-I data ....................................................................................... 13
2.2 Syllables-III data .................................................................................... 14
2.3 Isolated-vowels data ............................................................................... 17
Chapter 3 Signal Processing Methods for Classifying EEG Data ................. 18
3.1 EEG pre-processing ................................................................................ 18
3.2 Classifiers based on brain-speech mapping ............................................ 22
3.2.1 Methodology ................................................................................... 22
3.2.1.1 Diagram of the classification model ........................................... 22
3.2.1.2 Speech features ........................................................................... 24
3.2.1.3 Parameters search ....................................................................... 25
3.2.1.4 Significance level: p-value ......................................................... 25
3.2.2 Experimental results ........................................................................ 26
3.3 Support Vector Machine (SVM) classifiers ........................................... 29
3.3.1 Methodology ................................................................................... 29
3.3.1.1 SVM with Bootstrap aggregating ............................................... 29
3.3.1.2 Diagram of the classifier ............................................................ 32
3.3.2 Classification results ....................................................................... 38
3.3.2.1 Linear vs. Nonlinear Kernels...................................................... 38
3.3.2.2 Leave-one-subject-out experiment ............................. 41
3.3.2.3 Experiment on the number of trials to calculate average ........... 42
3.3.2.4 Experiment on classifying individual EEG trials using data from
single channel. .................................................................................................. 43
3.4 Summary ................................................................................................. 44
Chapter 4 Frequency Analysis of EEG Signals ............................................... 46
4.1 EEG signals in frequency domain .......................................................... 46
4.2 EEG spectral features ............................................................................. 48
4.3 Classification results ............................................................................... 51
4.3.1 Compare the EEG features based on DFT ...................................... 51
4.3.2 Frequency selection ......................................................................... 54
Chapter 5 Invariant Similarities between Brain and Perceptual
Representations of Phonemes .................................................................................... 58
5.1 Psychological experiments on phoneme perception ............................... 58
5.2 Similarity measurements ........................................................................ 59
5.2.1 Semi-Order and Invariant Partial Order of similarities ................... 59
5.2.2 Partition tree of similarities ............................................................. 61
5.3 Experimental data analysis ..................................................................... 62
5.3.1 Vowels ............................................................................................. 62
5.3.2 Consonants ...................................................................................... 66
Chapter 6 Classifiers Based on Distinctive Features ...................................... 71
6.1 Classifying the distinctive features ......................................................... 71
6.2 Distinctive-feature-based classifiers ....................................................... 74
6.3 Parallel structure vs. Hierarchical structure ............................................ 75
Chapter 7 Conclusion and Prospects ............................................................... 82
List of References ................................................................................................ 84
List of Tables
Table 2.1: The traditional phonological features of the 8 consonants and 4 vowels
...................................................................................................................................... 15
Table 2.2: Chomsky-Halle's distinctive features of the 8 initial consonants ...... 15
Table 3.1: Results of classifying the 4 consonants of Syllables-I data using brain-
speech mapping method ............................................................................................... 27
Table 3.2: Phoneme classification results using SVM-with-Bagging method with
linear or non-linear kernels ........................................................................................... 40
Table 3.3: Leave-one-subject-out classification results using SVM-with-Bagging
method .......................................................................................................................... 41
Table 4.1: Comparing the classification rates of 4 EEG spectral features ........... 52
Table 4.2: SVM-with-Bagging classification results using the EEG phase feature
in the frequency range from 2Hz to 9Hz ...................................................................... 57
Table 5.1: Normalized confusion matrices of 4 vowels ....................................... 63
Table 5.2: Normalized confusion matrices of 8 consonants ................................ 66
Table 6.1: Classifying the distinctive features ..................................................... 73
Table 6.2: Vowels classification results using DF-based classifiers .................... 79
Table 6.3: Initial consonants classification results using DF-based classifiers .... 80
Table 6.4: The results of classifying the combination of voicing and continuant
using SVM-with-Bagging model ................................................................................. 80
List of Figures
Figure 1.1: Spectrograms of the English syllables /pɑ/ and /fɑ/ ............................... 4
Figure 1.2: Comparing the English syllables /fɑ/ and /vɑ/ .................................... 5
Figure 2.1: EEG international 10-20 sensor location system. .............................. 14
Figure 2.2: The layout of EGI-128 sensors system .............................................. 16
Figure 3.1: Example of EEG artifact removal .................................. 20
Figure 3.2: Independent Components Analysis ................................................... 21
Figure 3.3: Diagram of classifying brainwaves of speech stimuli by estimating
the mapping between EEG and speech signal .............................................................. 22
Figure 3.4: Classifying the 4 consonants /p/, /t/, /b/, /g/ of the Syllables-III data
using brain-speech mapping method ............................................................................ 28
Figure 3.5: Diagram of SVM with bootstrap aggregating ................................... 32
Figure 3.6: Diagram of the SVM-with-Bagging EEG classifier .......................... 33
Figure 3.7: Mean validation accuracy on parameter search grid (K,C) using linear
kernel ............................................................................................................................ 39
Figure 3.8: The changing of 8 initial consonants classification rates with respect
to the number of trials to calculate averages ................................................................ 42
Figure 3.9: Performance of SVM-with-Bagging method on classifying 4 initial
consonants using single channel data ........................................................................... 44
Figure 4.1: Average power spectral densities of EEG signal sampled at 250Hz . 47
Figure 4.2: Magnitude and phase frequency response of 1 Hz high-pass filter ... 50
Figure 4.3: Mean classification rates of parameter pair (L, H) obtained from 10-
fold cross validation ..................................................................................................... 56
Figure 5.1: The similarities of brain representation and perceptual representation
of 4 vowels ................................................................................................................... 65
Figure 5.2: Invariant partial order between brainwave and perceptual confusions
of the vowels ................................................................................................................. 65
Figure 5.3: The similarities of brain and perceptual representation of 8
consonants .................................................................................................................... 67
Figure 5.4: Invariant partial order between brainwave confusions and perceptual
confusions of the consonants ........................................................................................ 69
Figure 6.1: Classifying 4 vowels in F1-F2 space ................................................. 76
Figure 6.2: Hierarchical models for classifying 8 classes .................................... 78
Chapter 1
Introduction
1.1 Phonemes and distinctive features
Natural languages are organized hierarchically: sentences are built from phrases,
phrases from words, words from syllables and syllables from phonemes. A phoneme is the smallest segmental unit of speech that differentiates meaningful words (Handbook of IPA, 1999). For example, in American English, the words light and right are pronounced differently only in the initial consonants /l/ and /r/; thus /l/ and /r/ are different phonemes in American English. Two sounds that belong to separate
phonemes in one language or dialect may be variants of one phoneme in another
language or dialect. (The two sounds are called allophones if they belong to the same
phoneme in the language.) It is widely recognized that phonemes are language-
specific. All the phonemes studied in this thesis are American English phonemes. In
most languages, the number of phonemes ranges from twenty to sixty. Although the
pronunciation of a phoneme can be slightly different in various contexts, a phoneme
has relatively stable articulatory and acoustic properties. Thus besides being used to
derive and describe phonological rules, the concept of phoneme is also extensively
used in building computational models of natural speech. Most modern large-vocabulary speech recognition systems and speech synthesis systems are based on
statistical models of acoustic features of phonemes. Phonemes are also very important
for modeling the brain activities of speech production or perception.
Linguists have proposed that phonemes can be further decomposed into
distinctive features. Phonological features such as voice, nasal and stop had been used
to describe speech sounds for a long time before the concept of distinctive feature was
proposed. Those features are commonly referred to as traditional features. Such a feature relates to either articulatory or acoustic properties of the sound. The traditional features are not necessarily binary and so may have more than two values. Ladefoged (1982) gave a good summary of the traditional features in his book. Around the middle of the 20th century, Jakobson and Halle introduced the notion of 'distinctive features' as the smallest language components that are able to differentiate meaningful units (Jakobson & Halle, 1956). Unlike phonemes, distinctive features can overlap
in time, thus they are the suprasegmental elements of language that carry lexical
contrasts. Jakobson and Halle also proposed a set of distinctive features and gave both
acoustic and articulatory descriptions of them. The Jakobson-Halle distinctive features
are binary, which means each feature has two relative values. The most commonly used distinctive-feature system in today's phonology literature is largely taken from Chomsky and Halle's work 'The Sound Pattern of English' (1968). Chomsky and Halle
proposed in total 27 distinctive features. Each feature takes two values: a positive
value, [+], denotes the presence of a feature, while a negative value, [-], indicates its
absence. Their feature set is considered to be “universal”, which means they
“represented the phonetic capabilities of man” and are therefore the same for all
languages. Any phoneme can be represented as a set of distinctive features. For
example, according to the Chomsky-Halle system, /p/ can be represented as: "[-vocalic] [+consonantal] [-high] [-back] [-low] [+anterior] [-coronal] [-voice] [-continuant] [-nasal] [-strident]" (Chomsky & Halle, 1968).
Limited by the availability of brain data, we cannot explore all the distinctive
features in this thesis. We will focus only on the brain representations of phonemes
associated with the following features:
Height and backness of vowels:
To describe the vowels, we use the vowel features of the International Phonetic
Alphabet (IPA) chart: height and backness.
Vowel height is named for the vertical position of the tongue relative to either the
roof of the mouth or the aperture of the jaw. In high vowels, such as /i/ and /u/, the
tongue is positioned high in the mouth, whereas in low vowels, such as /ɑ/, the tongue is positioned low in the mouth. In the IPA chart, the terms close and open are used to describe the jaw as being relatively open or closed. Although described using articulatory terms, vowel-height is nowadays defined as an acoustic quality according to the relative frequency of the first formant (F1).[1] The higher the F1 value, the lower (more open) the vowel is. Height is thus inversely correlated to F1.
Vowel-backness refers to the position of the tongue during the articulation of a
vowel. In front vowels, such as /i/, the tongue is positioned forward in the mouth,
whereas in back vowels, such as /u/, the tongue is positioned towards the back of the
mouth. Similar to vowel-height, vowel-backness is defined according to the frequency
of the second formant (F2). The back vowels have lower F2 values and front vowels
have higher F2 values. Thus vowel-backness is inversely correlated to F2.
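Since both vowel features reduce to formant frequencies, they can be checked directly on the audio. Below is a rough sketch of estimating F1 and F2 by LPC root-finding, assuming the librosa library is available; the file path and function name are hypothetical, and a production analysis would instead use a dedicated tool such as Praat.

    import numpy as np
    import librosa

    def first_two_formants(wav_path, order=12):
        """Rough F1/F2 estimate of a vowel recording via LPC root-finding."""
        y, sr = librosa.load(wav_path, sr=None)
        a = librosa.lpc(y, order=order)       # LPC polynomial coefficients
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]     # keep one root per conjugate pair
        freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
        return freqs[0], freqs[1]             # two lowest resonances: F1, F2

By the relations above, a high vowel such as /i/ should yield a low F1, and a front vowel a high F2.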
Continuant:
Continuant/non-continuant is a feature to describe the manner of articulation. In
the production of a continuant sound, the primary constriction of the vocal tract is not
completely closed, so the air flow past the constriction is not blocked. The fricatives
such as /s/ or /z/ are continuant sounds. When we articulate a fricative sound, the oral
tract is held narrow enough, so that the airflow generates turbulent noise. In speech
spectrograms, this friction noise often shows some clear power concentration in a
specific frequency range. The non-continuant sounds include plosive stops, such as /p/,
/t/ and /g/, and nasal sounds, such as /m/, /n/. In this thesis, we will only focus on brain
representations of plosive stops and fricatives. Plosive stops are characterized by a
spectrographic "burst" with an abrupt onset. Figure 1.1 compares the spectrogram of
the plosive stop /p/ and the fricative consonant /f/, followed by the same vowel /ɑ/.
The spectrogram of the plosive stop /p/ has a sudden burst of energy across the whole frequency range after a short closure at the beginning of the articulation. The formant pattern of the vowel /ɑ/ emerges shortly after the burst, which shows that the duration of the plosive stop is very short. The spectrogram of /f/ is characterized by high-frequency noise with a gradual onset. In addition, the duration of the fricative /f/ is much longer than that of /p/.

[1] Formants are defined by Fant (1960) as "the spectral peaks of the sound spectrum". They are produced by resonances of the vocal tract. The lowest resonant frequency is called the first formant (F1), the second F2, and the third F3.
Figure 1.1: Spectrograms of the English syllables /pɑ/ and /fɑ/ [2]
Voicing:
The feature voicing is used to characterize the vibration of the vocal folds, which creates a periodic source wave during articulation. Voiced sounds are produced with vibration of the vocal folds, and voiceless sounds are produced without it. Periodicity is the main characteristic that distinguishes voiced sounds from voiceless sounds. Figure 1.2 shows the waveforms and the spectrograms of /fɑ/ and /vɑ/. The waveform of /v/ has an obvious periodic structure, which comes from the vibration of the vocal folds. The formant-like low-frequency energy distribution pattern in the spectrogram of /v/ is another indicator of voicing.

[2] The spectrograms in Figure 1.1 and Figure 1.2(b) were generated using Praat (Boersma & Weenink, 2011).
(a) Speech waveforms of syllables /fɑ/ and /vɑ/
(b) Speech spectrograms of syllables /fɑ/ and /vɑ/
Figure 1.2: Comparing the English syllables /fɑ/ and /vɑ/
Phoneticians have also found that voice onset time (VOT), which denotes the time interval between the release of the articulatory occlusion and the onset of low-frequency periodicity, is the primary perceptual cue for distinguishing voiced stops from voiceless stops (Lisker & Abramson, 1964). The voiceless stops in English, such as /p/, /t/ and /k/, feature a short VOT of around 20ms, but the voiced stops usually have negative VOTs, which means that voicing onset leads the articulatory release. Negative VOTs are characterized by a low buzz noise during the consonant closure time. Another phonetic attribute that distinguishes the English voiceless stops /p/, /t/ and /k/ from the voiced stops /b/, /d/ and /g/ is aspiration. Aspiration is very important in separating the two sets in initial position, because both sets are commonly produced with silent closure intervals in such cases (Lisker & Abramson, 1964).
Place of articulation:
Traditionally, the place of articulation of consonants refers to the place and
manner of the obstruction of the airflow going through the vocal tract. In English, the
obstruction may occur at many places along the oral tract, from bilabial (between the
lips) to velar (between back of the tongue and the soft palate). The production of
speech sounds can be simulated as a stimulation source, either periodic for voiced sounds or white noise for voiceless sounds, passing through a filter that reflects the shape of the vocal tract. The different places of articulation modify the frequency response of the vocal-tract filter and change the spectral properties of the output. In Jakobson & Halle's distinctive feature system, the place of articulation is denoted by several features describing the spectrum of the sound, such as grave/acute, flat and sharp. Chomsky & Halle's distinctive feature system uses a series of cavity features, coronal, anterior, high, low, back, etc., to characterize the shape of the oral tract for both consonant and vowel articulation.
For plosive stops, the primary acoustic cue of the place of articulation is mostly in
the transition portion of the F2 of the vowel that follows. The place of articulation in
fricatives changes the resonant frequency of the front vocal cavity and is reflected in the position and shape of the peak in the speech spectrum (Johnson, 2003).
1.2 Brain activities in phoneme perception
1.2.1 Measurements of brain activities
When a neuron is firing, it generates action potentials, which are discrete
electrical pulses, and postsynaptic potentials, which typically last tens or even
hundreds of milliseconds. The summation of postsynaptic potentials of thousands of
approximately synchronized cortical neurons can induce potential fluctuations on the scalp. Thus cortical brain activities can be roughly observed by placing an electrode sensor on the scalp and recording the amplified signal. This technology is
called electroencephalography, or EEG.
The magnetic field produced by the electrical activities of cortical neurons can also be measured; this technique is called magnetoencephalography, or MEG. Both EEG and MEG have high temporal resolution and can record activities at 1kHz or higher sampling rates. However, the blurring of the potentials caused by the skull, which is a high-resistance conductor, can be avoided by recording the magnetic field. Thus MEG has better spatial resolution than EEG and provides more precise localization. On the other hand, since the MEG signals are on the order of a few femtoteslas, shielding from external magnetic signals, including the Earth's magnetic field, is necessary. The magnetic shielding equipment, usually a magnetically shielded room, is very expensive and not portable.
Since the development of the functional Magnetic Resonance Imaging (fMRI) technique in the early 1990s, fMRI has rapidly come to dominate the brain-mapping field for its non-invasiveness and high spatial resolution, up to 1mm. fMRI measures the increased blood flow to regions of increased neural activity, marked by the blood-oxygen-level dependent (BOLD) signal in magnetic resonance imaging (MRI) scans. The BOLD response follows the increase in neural activity with a delay of approximately 1 to 5 seconds and rises to a peak over 4 to 5 seconds. Therefore fMRI has very low temporal resolution. As we know, English speech is delivered at a rate of roughly 3 words per second. Thus fMRI by itself cannot be used to observe the details of fast-changing brain activities, such as the processing of phonological or lexical information.
1.2.2 Brain activities in phoneme perception
How the human brain processes phonemes has been a subject of interest for
linguists and neuroscientists for a long time. Historically, behavioral experiments of
phoneme perception were carried out to explore the psychological discrimination of
phonemes under various conditions (Miller & Nicely, 1955; Pickett, 1957; Wang &
Bilger, 1973; Phatak et al., 2008). More detailed introductions to the behavioral
experiments can be found in Chapter 5. Since the discovery of Mismatch Negativity
(MMN) effects in EEG activities (Näätänen et al., 1978), MMN and its magnetic equivalent, MMNm, have been used extensively to measure the neural activities reflecting subjects' ability to discriminate phonemes (see Näätänen, 2001 for a review). These results also suggested the existence of a language-specific central phoneme representation in the brain and pointed out its probable left-hemisphere locus (Näätänen, 1997). More recently, using MEG recordings, human brain activities indicating the perception of acoustic cues and more complex phonological features were examined (Obleser et al., 2004; Eulitz et al., 2007; Frye et al., 2007).
findings suggested that the brain encodes the complex acoustic-phonetic information
of speech into the representations of phonological features before the lexical
information is retrieved. Invasive recordings of animal neural responses to human speech also demonstrate the temporal and spatial characteristics of the cortical activities reflecting the distinctive features of phonemes (Steinschneider et al., 1995; Steinschneider et al., 2003). These recordings also show that the discriminability of the neural-activity patterns matches the animals' behavioral discrimination of phonemes (Engineer, 2008) as well as human psychological confusions of phonemes (Mesgarani, 2008). fMRI also provides a non-invasive method to pinpoint the location of cortical activities of phoneme perception in the healthy human brain (Liebenthal et al., 2005). Formisano (2008) reported success in classifying brain activities of isolated vowels using fMRI. Considering the limited temporal resolution of the fMRI technique, it would be difficult to extend this work to phonemes that have a more complex time course than vowels presented in isolation.
1.3 Motivation and Contribution
Among the most commonly used technologies for observing human brain activities, EEG provides a promising method to examine the brain activities of natural language processing because of its low cost, high temporal resolution and non-invasiveness. To study the brain activities of language processing using EEG signals, we need to solve two problems.
First, the EEG recordings are usually large amounts of data contaminated by a lot of noise. Appropriate signal-processing or statistical methods are needed to reduce the noise and extract the meaningful components of the signal that carry the target information. The ideal scenario is that the data can be compressed into parameters, called EEG feature parameters, without losing much useful information.
Second, we need to develop mathematical models to describe the properties and distributions of the EEG feature parameters of the language-processing activity in the brain. The complexity and the computational cost of constructing the mathematical model are highly related to the number of EEG feature parameters. Generally, the smaller the EEG parameter list, the simpler the mathematical model required to describe it.
To demonstrate the effectiveness of an EEG feature parameter or a mathematical
model, one of the most convincing approaches is to test whether the unknown EEG
samples, represented as the feature parameters, can be classified using the
mathematical model. Researchers in our lab have been working on the statistical problem of classifying EEG brainwaves associated with stimuli of language constituents since the 1990s. We successfully classified brainwaves of sentences (Suppes & Han, 1998; Suppes & Han, 1999; Wong, Perreau-Guimaraes, et al., 2004), words (Suppes & Lu, 1997) and syllables (Suppes, Han, et al., 1999). Classifying the brainwaves of auditory stimuli is often more challenging than classifying those of visually presented linguistic stimuli (Suppes & Han, 1999). An experiment in classifying the brainwaves of phonemes was also reported (Suppes, Perreau-Guimaraes et al., 2009); in this experiment, 42% of trials of 4 consonants were correctly classified. However, the classification method was tested on the syllable data of the 1997 experiment, which collected only 6 channels of 800 trials from each subject. The size of that dataset is insufficient to test a more complicated classification model.
With this consideration in mind, I designed and implemented a new experiment to
collect EEG data of syllables. The experiment focused on 8 consonants and 4 vowels,
which were carefully selected to represent 5 distinctive features: voicing, continuant,
place of articulation, vowel-height and vowel-backness. The new dataset includes in
total 21540 trials for all the 32 syllables. The number of trials from each subject
ranges from 3584 to 7168. A supplemental dataset of isolated vowels was also
collected in 2010.
The phoneme recognition results reported by Suppes, Perreau-Guimaraes et al. (2009) were obtained using Singular Value Decomposition (SVD) and Linear Discriminant Classification (LDC) methods in a framework with two-layer cross-validation. I kept the original EEG pre-processing module of the framework and modified the classification methods to implement out-of-sample testing, classification of averaged trials, and classification using SVM with bootstrap aggregating (Bagging). By introducing SVM, we were able to implement non-linear classification. However, the classification results show that the non-linear methods cannot improve the classification accuracy. The modified algorithm with a linear kernel can classify 46% of 426 averaged test samples of 8 consonants and 69% of 141 averaged test samples of 4 isolated vowels.
I also proposed a new approach to classify the brainwaves of auditory stimuli:
classifying by estimating the mapping relations between the speech signal and the
EEG brainwave signal. A preliminary study on estimating the linear transformation between the brainwaves and speech stimuli has been carried out. For the best subject of the EEG data collected in 1997, the classification model can
recognize 45% of individual test trials of 4 consonants, which is slightly better than
the result of SVD-LDC methods.
Furthermore, using the classification model with Bagging SVM, I explored the
frequency-domain representations of EEG brainwaves evoked by phoneme stimuli. I
found that the EEG signals can be classified without loss of accuracy when the
amplitude information of DFTs is eliminated. For classifying the averaged test
samples of 8 consonants, the accuracy rate increased to 51% if only the phase pattern
of frequency components from 2Hz to 9Hz is used.
I analyzed the similarities between the EEG representations, derived from the confusion matrices obtained using the Bagging SVM methods, and demonstrated the invariant similarities of brain and perceptual representations of phonemes. For brain and perceptual representations of consonants, voicing is the most distinguishable feature among voicing, continuant and place of articulation, and vowel-height is more robust than vowel-backness in both brain and perceptual representations of vowels.
I further refined the Bagging SVM classification model based on the findings
that the brainwaves evoked by different phonemes with similar phonological
properties are close to each other in the EEG feature domain. A simplified
classification model based on distinctive features was proposed. In this model,
brainwaves of phonemes are classified using the ensemble of binary classifiers, one
for each distinctive feature. The binary classifiers can be organized hierarchically.
This simplified classifier can recognize 47% of test samples of the 8 consonants and
65% of test samples of the 4 isolated vowels, which is slightly worse than the original
Bagging SVM classification model. However, the distinctive-feature-based classifier
can be directly extended to classify more phonemes.
1.4 Outline of the thesis
Chapter 2 gives the detailed description of the EEG data used in my work and the
experiment setup for collecting the EEG recordings.
In Chapter 3, I introduce two models to classify brainwaves of phonemes: One is
the brain-speech mapping method and the other is the classifier with Bagging SVM.
The first method was tested on classifying the individual trials of 4 consonants using
Syllables-I and Syllables-III data. The second method, which focuses on classifying
averaged test samples, was tested using Syllables-III and isolated-vowels data.
In Chapter 4, I examine EEG representations of phonemes in the frequency
domain by classifying the EEG responses to phonemes using four EEG spectral features. The feature DFT consists of the Discrete Fourier Transform (DFT) coefficients of the EEG time-domain signals, computed channel by channel. The feature AMP is composed of the amplitudes of all the frequency components of the DFT coefficients. In the features PHS-1 and PHS-2, the amplitudes of the DFT coefficients are eliminated and only the phase information is kept. The classification results of the four spectral features are discussed. I also identify the frequency range of the rhythmic EEG activities related to phoneme perception using our experimental data.
I analyze the similarities between the brainwave representations using the
classification confusion matrices in Chapter 5. Graphs of semiorders and hierarchical trees are used to illustrate the similarities. The brain similarities of the
phonemes are compared with perceptual similarities of phonemes obtained from
psychological experiments.
In Chapter 6, the results of classifying distinctive features using Bagging SVM
methods are discussed. I also extend the Bagging SVM algorithm to classify speech
stimuli based on distinctive features and present the experimental results.
Chapter 7 concludes the thesis.
Chapter 2
Relevant EEG Data
Three datasets of EEG recordings of phoneme perception are used in our study.
All these EEG experimental data were collected in our laboratory.
2.1 Syllables-I data
These EEG recordings of auditory syllable stimuli were collected in November 1998 as an exploratory experiment. The experiment addressed 8 consonant-vowel (CV) format syllables and 24 syllable pairs made up of 4 consonants (/p/, /t/, /b/ and /g/) and 3 vowels (/ɑ/ as in spa, /u/ as in zoo and /oʊ/ as in boat). The stimulus syllables are listed below:
/tu/, /pɑ/, /gu/, /bɑ/, /toʊ/, /pu/, /goʊ/, /bu/
/bɑbu/, /bɑpɑ/, /bubɑ/, /goʊgu/, /goʊtu/, /gugoʊ/, /gutoʊ/, /gutoʊ/
/pɑpu/, /pubɑ/, /pupɑ/, /tugoʊ/, /tugu/, /tutoʊ/, /toʊgoʊ/, /toʊtu/
/bɑpu/, /bupɑ/, /bupu/, /goʊtoʊ/, /pɑbɑ/, /pɑbu/, /pubu/, /toʊgu/
All the 32 speech stimuli were spoken by a male American-English native speaker, who is also the speaker of the stimuli in the other two experiments. We presented the auditory stimuli to participants via stereo speakers. The 32 stimuli were randomized and presented to the subject 12 times as the first part of the session. Then, after a short break, all the stimuli were presented again 13 times as the second part. Nine subjects participated in the experiment, but only the data from 3 subjects are used in this thesis. The subjects were instructed to listen to the stimuli attentively; no behavioral response was required. The trial length, measured from the onset of one syllable to the onset of the next, is 2050ms. In total, 800 trials were collected from each subject. The Model-12 Grass amplifiers and Neuroscan's Version 3.0 software were used to measure and record the EEG data. Sensors were attached to the scalp of the subjects according to the standard EEG 10-20 system, as shown in Figure 2.1.
Figure 2.1: EEG international 10-20 sensor location system.
Only 6 sensors, C3, C4, T3, T4, T5 and T6, were connected in the first part. In the
second part, an additional sensor, Cz, was also connected. Previous analysis results on this dataset were reported in Suppes (1999) and Suppes (2009).
2.2 Syllables-III data
In 2008, we collected a new dataset of EEG recordings of the perception of 32 CV-format syllables, each made of one of the eight consonants /p/, /t/, /b/, /g/, /f/, /s/, /v/ and /z/ and one of the four vowels /i/ (see), /æ/ (cat), /u/ (zoo) and /ɑ/ (spa). The experiment was designed with several considerations in mind. First of all, we wanted to check whether the significant classification accuracies on initial consonants using the Syllables-I EEG data (Suppes, 2009) are repeatable. Second, we further extended the initial consonants from the 4 plosive stops to a set of 8 consonants to investigate three major phonological features of consonants: voicing, continuant (stop versus fricative) and place of articulation. We also carefully selected the vowels so that they lie at the corners of the American-English vowel space and hence are acoustically
well separated. Table 2.1 and Table 2.2 list the phonological features of the consonants and vowels. Moreover, the EEG collection techniques have been significantly improved in recent years; the newest equipment, which supports up to 128 sensors, can record EEG activities with much higher spatial resolution.
Table 2.1: The traditional phonological features of the 8 consonants and 4 vowels
                 voiceless            voiced
                 Labial   Alveolar    Labial   Alveolar/Velar
    stop         p        t           b        g
    fricative    f        s           v        z

                          height
    backness              open     close
    front                 æ        i
    back                  ɑ        u
Table 2.2: Chomsky-Halle's distinctive features of the 8 initial consonants

                            p    t    b    g    f    s    v    z
    Cavity    High          -    -    -    +    -    -    -    -
              Back          -    -    -    +    -    -    -    -
              Coronal       -    +    -    -    -    +    -    +
              Anterior      +    +    +    -    +    +    +    +
    Source    Voiced        -    -    +    +    -    -    +    +
              Strident      -    -    -    -    +    +    +    +
    Manner    Continuant    -    -    -    -    +    +    +    +
We recorded the Syllables-III data using EGI's Geodesic EEG System (GES) 300 platform. In order to take the variation of pronunciation into account, the auditory stimuli include 7 repetitions of each of the 32 syllables read by a male American-English native speaker. The recordings are saved as 44.1kHz mono WAV files. In a brainwave collection session, all 224 sound stimuli were pseudo-randomly presented to the subjects 4 times using stereo speakers. The participating subjects were instructed to listen to the sounds attentively while looking at a focus point on the computer screen. We recorded the EEG data at a sampling rate of 1000Hz using the
EGI 128-sensor system, with 124 monopolar channels and a common reference at Cz. Two bipolar reference channels for eye movements were also recorded. The locations of the sensors are shown in Figure 2.2. The length of one trial of brainwave recording is one second. In total, 24 sessions from 4 subjects were collected. The number of trials from one subject ranges from 3584 to 7168. The complete dataset includes about 672 brainwave recordings of each syllable. Therefore we have approximately 672×4 = 2688 recordings for each consonant and 672×8 = 5376 recordings for each vowel.
Figure 2.2: The layout of EGI-128 sensors system
2.3 Isolated-vowels data
The isolated-vowels data, recorded in 2010, are complementary to the Syllables-III data. We recorded 7 repetitions of the 4 vowels used in the Syllables-III data, spoken by the same speaker. In one EEG collection session, the 28 stimuli were presented to the subject randomly 32 times, using the same experimental setup as for Syllables-III. We recorded 8 sessions from one subject and collected 1792 trials for each isolated vowel.
Chapter 3
Signal Processing Methods for
Classifying EEG Data
3.1 EEG pre-processing
The potential changes on the scalp generated by cortical neuron activities are as small as a few microvolts. The EEG signals of interest are usually submerged in a large amount of electrical noise of two major types: noise from the equipment and environment, and noise from other biological sources. The environmental noise includes AC power-supply noise at 50 or 60Hz, noise from the computers used for presenting stimuli and recording EEG data, and noise from the analog amplifiers, which amplify the EEG signal by several orders of magnitude. The biological sources include eye blinks, heart beats and muscle contractions. Therefore, before applying any analysis or classification methods, we need to pre-process the EEG data to obtain cleaner signals.
We used digital filters to remove most of the environmental noise. A high-pass filter with a cut-off frequency of 1Hz can remove the DC offset of the equipment and the slow artifacts associated with skin-conductance fluctuations. The AC electricity noise can be removed by a notch filter at 60Hz. Our previous studies on EEG responses to language stimuli show that the frequency components between 2 and 30Hz are more important for classification (Suppes, 1999). Thus, in the present research, we usually down-sample the EEG signals to 50-60Hz after applying anti-aliasing filters. The down-sampling significantly reduces the dimension of the data to be analyzed and removes the high-frequency noise as well.
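As a concrete illustration, here is a minimal sketch of this pre-processing chain using SciPy. The cut-off values follow the text; the function name, array layout and exact filter orders are assumptions.

    import numpy as np
    from scipy import signal

    def preprocess_eeg(eeg, fs=1000.0):
        """Pre-process one EEG recording (channels x samples) as described above."""
        # 1 Hz high-pass (4th-order Butterworth, zero-phase) removes the DC
        # offset and slow skin-conductance drifts
        b_hp, a_hp = signal.butter(4, 1.0, btype="highpass", fs=fs)
        eeg = signal.filtfilt(b_hp, a_hp, eeg, axis=-1)

        # 60 Hz notch filter removes AC power-line noise
        b_n, a_n = signal.iirnotch(60.0, Q=30.0, fs=fs)
        eeg = signal.filtfilt(b_n, a_n, eeg, axis=-1)

        # Anti-aliased down-sampling in two stages of 4 (1000 Hz -> 62.5 Hz)
        eeg = signal.decimate(eeg, 4, axis=-1, zerophase=True)
        return signal.decimate(eeg, 4, axis=-1, zerophase=True)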
The noise from other biological activities has a different character. Figure 3.1(a) shows 4 seconds of EEG recordings from the first 60 sensors of the EGI-128 sensor system, sampled at 1kHz. The muscle-contraction noise is characterized by a burst of high-frequency noise and usually disappears after low-pass filtering, as seen in the down-sampled data in Figure 3.1(b). Eye-blink artifacts are short, high-amplitude peak waves, commonly seen at the prefrontal electrodes.
Figure 3.1(a): Original EEG recording with 1kHz sampling rate
Figure 3.1(b): EEG signal after high-pass filtering and down-sampling to 62.5Hz
Figure 3.1(c): Resulting EEG signal after removing eye artifacts
Figure 3.1: Example of EEG artifact removal
The eye-movement artifacts can be removed by visually inspecting the trials and rejecting the contaminated ones. But this is not practical in our study considering the large amount of data involved; for instance, more than 20000 trials were collected in the Syllables-III experiment. Since eye blinks and movements are usually independent of the brain responses to the stimuli, we can eliminate the artifacts from eye movements efficiently using Independent Component Analysis (ICA).
The ICA method solves the problem illustrated in Figure 3.2. Assume there are n independent signal sources in the target region, $s_1, s_2, \ldots, s_n$, and the source signals are transmitted instantaneously to the m receptors on the scalp, $r_1, r_2, \ldots, r_m$. At each receptor, the received signal is a weighted mixture of the sources:

$r_i = \sum_{j=1}^{n} a_{ij} s_j, \quad i = 1, 2, \ldots, m$    (3.1)

Then we have $r = As$, where $r \in R^m$, $s \in R^n$ and $A \in R^{m \times n}$. A is often referred to as the mixing matrix. When $n = m$ and A is invertible, let $W = A^{-1}$; then the sources can be recovered as $s = Wr$. Here W is called the un-mixing matrix. In practice, A is always unknown, and the ultimate goal of ICA is to find the un-mixing matrix that maximizes the statistical independence of the sources.
In our study, we took all the signals from the monopolar channels as the received signal r and estimated the un-mixing matrix using the Infomax method (Jung et al., 2000; Bell & Sejnowski, 1995). Next, we calculated the correlation coefficients between the derived sources and the signals from each of the reference channels, which were placed around the eyes to record the horizontal and vertical eye movements. If the method works well, most of the correlation coefficients should be very low. We then remove the sources that are highly correlated with the reference channels by setting them to zero; more specifically, we removed all the independent sources with a correlation coefficient higher than 0.2 in our experiments. Finally, the remaining sources are re-mixed to reconstruct the monopolar signals. Figure 3.1(c) shows the reconstructed EEG monopolar signals using the signals in Figure 3.1(b) as the input. We can see that the eye-blink artifacts were removed.
Figure 3.2: Independent Components Analysis
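A minimal sketch of this artifact-removal step is shown below. The thesis uses the Infomax algorithm; here scikit-learn's FastICA stands in as a readily available ICA implementation, the 0.2 correlation threshold follows the text, and the variable names and data shapes are assumptions.

    import numpy as np
    from sklearn.decomposition import FastICA

    def remove_eye_artifacts(eeg, eog, threshold=0.2):
        """eeg: (samples, channels) monopolar EEG; eog: (samples, k) eye references."""
        ica = FastICA(n_components=eeg.shape[1], max_iter=1000)
        sources = ica.fit_transform(eeg)      # estimated independent sources

        # Zero out every source that correlates strongly with an eye channel
        for ref in eog.T:
            for k in range(sources.shape[1]):
                if abs(np.corrcoef(sources[:, k], ref)[0, 1]) > threshold:
                    sources[:, k] = 0.0

        # Re-mix the remaining sources to reconstruct cleaned monopolar signals
        return ica.inverse_transform(sources)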
3.2 Classifiers based on brain-speech mapping
3.2.1 Methodology
This section introduces the preliminary study of classifying EEG brainwaves of
phoneme stimuli by estimating the mapping relations between the speech signal and
the EEG brainwave signal. The basic idea underlying this approach is to consider the
whole phoneme perception process in the brain as the activity of a system in a “black
box”. The only observable aspects of the system are the input, which is the sound
waves of speech stimuli, and the EEG brainwave as the output. Hence, if we could
estimate the inverse system, we would be able to map brainwaves back to approximate
speech inputs, and classify the brainwaves by comparing the estimated inputs to the
speech prototype candidates.
3.2.1.1 Diagram of the classification model
The classification procedure is shown in Figure 3.3.
Figure 3.3: Diagram of classifying brainwaves of speech stimuli by estimating the
mapping between EEG and speech signal
In the pre-processing phase, EEG signals are down-sampled and filtered. The speech waves of the stimuli are represented by feature vectors of reduced size, and at the same time a prototype of the speech signal is created for each phoneme; details of the speech signal processing are given later. Then the EEG data are randomly divided into a training set and a test set. The training/test partition is balanced over all the stimuli; in other words, the numbers of training trials associated with each stimulus are equal. We compute the optimal mapping relation $\hat{F}$, which minimizes the mean-square estimation error between F(x) and y. Figure 3.3 shows the scheme of estimating one global transformation that is applied to all the classes. Alternatively, we could also assume that the transformation between brainwaves and speech is unique to each phoneme. In this case, N transformations should be estimated for the N-class classification problem.
test sample x is classified as:
2
,,1
~)(F̂minargˆk
Nk
Yk
x
for global transformation (3.2)
or
2
,,1
~)(F̂minargˆkk
Nk
Yk
x
for class-specific transformations (3.3)
For exploratory purposes, we assume the transformation is linear, i.e. $F(x) = Ax$. If we estimate a linear transformation using m training samples $(x^{(i)}, y^{(i)}), i = 1, \ldots, m$, where $x^{(i)} \in R^n$ denotes the observed EEG signal and $y^{(i)} \in R^p$ the features of the associated speech stimulus, then the optimal linear transformation $A \in R^{p \times n}$ is the solution of the least-squares optimization problem

$\min_A \sum_{i=1}^{m} \| A x^{(i)} - y^{(i)} \|^2$    (3.4)

which can be easily calculated as:

$A = \left( \sum_{i=1}^{m} y^{(i)} x^{(i)T} \right) \left( \sum_{i=1}^{m} x^{(i)} x^{(i)T} \right)^{-1}$    (3.5)
When the number of training samples m is too small compared to the number of variables in x, the matrix $\sum_{i=1}^{m} x^{(i)} x^{(i)T}$ will be close to singular and non-invertible. Thus, to get an accurate estimation of the transformation matrix, we need sufficient training samples, and the EEG observation vector cannot be too long.
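A compact numerical sketch of this classifier in its global-transformation variant follows: equation (3.5) is solved with a standard least-squares routine, and test trials are labeled by the nearest speech prototype as in (3.2). The array shapes and names are assumptions.

    import numpy as np

    def fit_linear_mapping(X_train, Y_train):
        """Least-squares estimate of A in y = Ax (equations 3.4-3.5).

        X_train: (m, n) EEG observations; Y_train: (m, p) speech features.
        np.linalg.lstsq solves min ||X A^T - Y||^2, equivalent to (3.5).
        """
        A_T, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)
        return A_T.T                          # A has shape (p, n)

    def classify(X_test, A, prototypes):
        """Assign each test trial to the nearest prototype Y~_k (equation 3.2)."""
        Y_hat = X_test @ A.T                  # estimated speech features
        # Squared Euclidean distance to every prototype, arg-min over k
        d = ((Y_hat[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)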
3.2.1.2 Speech features
To appropriately represent the speech stimuli, we hope to find speech features of a size comparable to the EEG brainwave features that are also able to distinguish different phonemes. The Mel-Frequency Cepstral Coefficients (MFCC), which describe the temporal-spectral distributions of speech, have proved to be successful features for representing speech signals and are commonly used in modern speech recognition systems (Rabiner & Juang, 1993). So we use MFCCs as the speech features and to construct the prototypes of the phoneme stimuli. The speech pre-processing procedure
includes the following steps:
1) We manually examine the audio files of the speech stimuli and mark the beginning and end times of the phonemes. Because of co-articulation, the boundaries between adjacent phonemes are not well defined; as a result, the segmentation of phonemes can only be roughly determined.
2) The speech segments of the targeted phonemes are cut into 30ms short frames, with 20ms overlap.
3) Calculate 12th-order MFCC speech features for each frame.
4) For each stimulus, compute the average of the feature vectors across all the frames of the targeted phoneme. This average vector is the training target vector Y.
5) Average the MFCC features of all the frames corresponding to the initial consonant k to get the prototype $\tilde{Y}_k$ (a sketch of steps 2-5 follows this list).
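The following is a minimal sketch of steps 2-5, assuming the librosa library for the MFCC computation; the file path, segment times and helper names are hypothetical.

    import numpy as np
    import librosa

    def phoneme_mfcc_target(wav_path, t_start, t_end, n_mfcc=12):
        """Average MFCC vector of one marked phoneme segment (steps 2-4)."""
        y, sr = librosa.load(wav_path, sr=None)
        seg = y[int(t_start * sr):int(t_end * sr)]
        # 30 ms frames with 20 ms overlap, i.e. a 10 ms hop
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=int(0.030 * sr),
                                    hop_length=int(0.010 * sr))
        return mfcc.mean(axis=1)              # average across frames -> Y

    def build_prototype(target_vectors_for_k):
        """Step 5: prototype of consonant k = mean of its target vectors."""
        return np.mean(target_vectors_for_k, axis=0)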
3.2.1.3 Parameters search
Our previous studies show that when we classify EEG signals in the time domain, the classification rates may be improved if we use only the observations within a given temporal interval (Wong, 2004). But the best temporal interval is data-specific and task-specific. In our experiments, we used Q-fold cross-validation to search for the best interval on a parameter grid. The two parameters to be optimized are the start point of the interval, s, and the interval duration, d. The possible parameter candidates form a search grid (s, d). In cross-validation, all the training trials are randomly divided into Q even groups. At each step of the validation, one of the Q groups is used for testing and the other Q-1 groups are combined for training. A classification rate is obtained for each point of the parameter search grid. The optimal parameters are chosen by the criterion of maximizing the average number of correctly classified trials across the Q validation tests.
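A schematic sketch of this Q-fold grid search is given below. The classifier itself is abstracted as a callable that trains on one fold split and returns the number of correctly classified validation trials; grid values and names are hypothetical.

    import numpy as np

    def cv_interval_search(trials, labels, grid_s, grid_d, fit_score, Q=8):
        """Pick the interval (start s, duration d) by Q-fold cross-validation.

        trials: (m, channels, samples) array of training trials.
        """
        rng = np.random.default_rng(0)
        folds = np.array_split(rng.permutation(len(trials)), Q)
        best, best_sd = -1, None
        for s in grid_s:
            for d in grid_d:
                X = trials[:, :, s:s + d]     # restrict to candidate interval
                correct = 0
                for q in range(Q):
                    val = folds[q]
                    train = np.concatenate([folds[j] for j in range(Q) if j != q])
                    correct += fit_score(X[train], labels[train], X[val], labels[val])
                if correct > best:            # maximize correct validation trials
                    best, best_sd = correct, (s, d)
        return best_sd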
3.2.1.4 Significance level: p-value
P-value is a statistical measure of the significance of an experimental result. Consider coin-flipping experiments. If in one experiment we get 7 heads out of 10 flips, while in another experiment 70 heads show up in 100 flips, then although both experiments show an observed head frequency of 70%, we are more assured in claiming that the coin used in the second one is biased. The p-value is the probability that the outcome is at least as extreme as the actually observed value, assuming the null hypothesis is true. In this example, the null hypothesis (H0) is that the coin is fair, i.e., the chance of observing a head in one flip is 0.5. Then the p-value of the first experiment is:

$\Pr(\text{heads} \ge 7 \mid H_0) = \sum_{i=7}^{10} \binom{10}{i} 0.5^i (1 - 0.5)^{10-i} \approx 0.1719$    (3.6)

For the second experiment:

$\Pr(\text{heads} \ge 70 \mid H_0) = \sum_{i=70}^{100} \binom{100}{i} 0.5^i (1 - 0.5)^{100-i} \approx 3.93 \times 10^{-5}$    (3.7)
The smaller the p-value, the more confident we are in rejecting the null hypothesis, and hence the more significant the result is.

In the N-class EEG classification problem, the null hypothesis is that the classifier cannot recognize any test sample and randomly assigns a label to each sample. Under the null hypothesis, the probability that one test sample is correctly recognized is p = 1/N. Thus, if k of m test samples are classified correctly in one experiment, the p-value of the result is:

$\text{p-value} = \sum_{i=k}^{m} \binom{m}{i} p^i (1 - p)^{m-i}$    (3.8)
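Equation (3.8) is the survival function of a binomial distribution, so it can be computed directly; a one-line sketch with SciPy:

    from scipy.stats import binom

    def classification_p_value(k, m, n_classes):
        """P(at least k of m test samples correct under chance), equation (3.8)."""
        return binom.sf(k - 1, m, 1.0 / n_classes)

    # The coin examples above: 7/10 heads -> ~0.1719; 70/100 heads -> ~3.93e-5
    print(binom.sf(6, 10, 0.5), binom.sf(69, 100, 0.5))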
3.2.2 Experimental results
We first tested the classifier based on brainwave-speech mapping by classifying the 4 initial consonants, /p/, /t/, /b/ and /g/, of the Syllables-I data. The six bipolar-channel data, namely C3-T5, C4-T6, T3-C3, T4-C4, T5-T3 and T6-T4, were down-sampled to 50Hz and passed through a 4th-order Butterworth band-pass filter with cut-off frequencies of 2Hz and 20Hz. The consonants were classified using data from each channel and each subject separately. For each subject, we collected 24 EEG trials of each stimulus. We randomly drew 16 trials for training and used the remaining 8 for testing. Since there are 8 syllables that start with a given consonant, we have in total 16×8 = 128 training trials and 8×8 = 64 test trials for each class. The total number of test trials is 256. The EEG interval, defined by the start time s and duration d, is optimized using 8-fold validation. Table 3.1 summarizes the classification rates and significance levels of the results.
We can see that the classification accuracies show large variations among subjects. For subject AB, 44.9% of the 256 test trials were correctly classified using the best channels, with a significance level of p-value < 10^-11. Subject PS got slightly lower classification accuracy, 39.8% with p-value < 10^-6. The significance levels of these results are high enough to prove the effectiveness of the model on those subjects. However, the classification model barely works for subject SO. The classifier estimating class-specific transformations works better than the classifier using a global transformation.
Table 3.1: Results of classifying the 4 consonants of Syllables-I data using brain-speech mapping method
                             class-specific transformation    global transformation
    subject    channel       rate       p-value               rate       p-value
    AB         C3-T5         37.9%      <10^-5                31.6%      0.0098
               C4-T6         44.9%      <10^-11               35.5%      <10^-3
               T3-C3         44.9%      <10^-11               33.2%      0.0020
               T4-C4         44.9%      <10^-11               31.6%      0.0098
               T5-T3         39.8%      <10^-6                32.8%      0.0030
               T6-T4         44.9%      <10^-11               28.1%      0.1399
    PS         C3-T5         37.5%      <10^-5                28.1%      0.1399
               C4-T6         31.3%      0.0141                30.5%      0.0275
               T3-C3         37.1%      <10^-4                33.2%      0.0020
               T4-C4         39.8%      <10^-6                31.6%      0.0098
               T5-T3         38.7%      <10^-6                34.4%      <10^-3
               T6-T4         36.3%      <10^-4                34.0%      <10^-3
    SO         C3-T5         25.4%      0.4665                28.5%      0.1110
               C4-T6         29.7%      0.0504                28.5%      0.1110
               T3-C3         24.2%      0.6369                27.0%      0.2558
               T4-C4         30.5%      0.0275                32.0%      0.0068
               T5-T3         26.2%      0.3553                24.6%      0.5812
               T6-T4         25.0%      0.5240                27.0%      0.2558
To check how the brain-speech mapping method performs when a large amount
of training data is available, we classified the same 4 initial consonants of the
Syllables-III data using the classifier with class-specific transformation matrices. We
combined all 8 sessions from subject LK and obtained 32 trials for each stimulus, 24 of
which were used for training and 8 for testing. Hence each transformation matrix can be
estimated using 672 instances, and in total 896 trials are available to test the
classification accuracy.
The classification was run on each of the 124 monopolar channels separately, and the
classification rates of all the channels are shown in a brain map in Figure 3.4. Each
number on the brain map denotes the classification rate using the monopolar-channel
data collected from the sensor at the corresponding scalp location. Although the
classification accuracy did not improve (36% for the best channels), the
significance of the results is very high (p-value < 10^{-11}) because more test trials
were available. The brain map also shows that the signal from channels located over the
left hemisphere of the scalp carries more information about the phoneme than that
from the right channels. The best rates were obtained from the channels around the left
ear.
Figure 3.4: Classifying the 4 consonants /p/, /t/, /b/, /g/ of the Syllables-III data using
brain-speech mapping method
3.3 Support Vector Machine (SVM) classifiers
3.3.1 Methodology
This section proposes a different approach to classify the EEG signals of
phoneme stimuli. This method follows the traditional pattern classification strategy.
The trials from each class, i.e. each phoneme, are given a unique class label. For
example, to classify the 8 initial consonants in the Syllables-III data, we label the
8 classes with the numbers 1 to 8, standing for /p/, /t/, /b/, /g/, /f/, /s/, /v/ and /z/
respectively. In other words, neither acoustic nor phonological information about the
speech stimuli is taken into account in the classification.
The main scheme underlying the statistical classification approach is SVM with
bootstrap aggregating. I will first introduce the idea of SVM with bootstrap aggregating,
and then describe the diagram of the classifier.
3.3.1.1 SVM with Bootstrap aggregating
We use a soft-margin SVM (Cortes & Vapnik, 1995) as the basic classification unit in this classification
model. The original SVM is a binary classifier that looks for a
separating hyperplane maximizing the margin, i.e., the
distance from the hyperplane to the nearest training data points of either
class. If the training data consist of m samples $(x^{(i)}, y^{(i)}),\ i = 1, \dots, m$, with $x^{(i)} \in \mathbb{R}^n$
and $y^{(i)} \in \{-1, +1\}$ denoting the two class labels, we write the hyperplane
as the set of points that satisfy:

w^T x + b = 0    (3.9)
When the training data are separable, the optimal hyperplane can be found by solving the
optimization problem:

\max_{\gamma, w, b} \ \gamma \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \ge \gamma, \quad i = 1, \dots, m    (3.10)

where $\gamma$ is the margin. With the scaling constraint $\gamma \|w\| = 1$, the optimization problem is
equivalent to

\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1, \quad i = 1, \dots, m    (3.11)

which can be efficiently solved.
However, the solution is very sensitive to outliers when the training data are noisy,
and it cannot be applied to non-linearly separable cases. Therefore a soft margin is
introduced to allow training samples with margin less than 1 or even negative.
If a sample $(x^{(i)}, y^{(i)})$ has margin $1 - \xi_i$, the objective function
increases by the slack $\xi_i$ weighted by a cost factor C. The optimization problem is reformulated as:

\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m    (3.12)
SVM can implement non-linear classification by simply applying the kernel trick.
With a feature mapping $\phi$, the optimization problem becomes

\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y^{(i)} (w^T \phi(x^{(i)}) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m    (3.13)

It can be solved by optimizing the dual problem

\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - \mathbf{1}^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, m    (3.14)

where Q is an m-by-m positive semi-definite matrix with

Q_{ij} = y^{(i)} y^{(j)} K(x^{(i)}, x^{(j)})    (3.15)
and $K(x^{(i)}, x^{(j)})$ is the kernel. The kernel is the inner product of $x^{(i)}$ and $x^{(j)}$ in the linear
case. The decision function for a test sample x is:

h(x) = \mathrm{sgn}\left( \sum_{i=1}^{m} \alpha_i\, y^{(i)} K(x^{(i)}, x) + b \right)    (3.16)

The predicted class label of the test sample is +1 if the decision function is greater
than zero and -1 if the decision function is less than zero.
The following kernels were tested in our study:
Linear kernel: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T \mathbf{z}$
Gaussian radial basis function (RBF): $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \|\mathbf{x} - \mathbf{z}\|^2)$
Polynomial kernel: $K(\mathbf{x}, \mathbf{z}) = (\gamma\, \mathbf{x}^T \mathbf{z} + C_0)^d$ for d = 2 and d = 3.
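Written out directly (a small illustrative sketch in Python; the parameter names gamma, c0 and d follow the notation above), the three kernels are:

    import numpy as np

    def linear_kernel(x, z):
        return x @ z                                   # x^T z

    def rbf_kernel(x, z, gamma):
        return np.exp(-gamma * np.sum((x - z) ** 2))   # exp(-gamma ||x - z||^2)

    def polynomial_kernel(x, z, gamma, c0, d):
        return (gamma * (x @ z) + c0) ** d             # (gamma x^T z + C0)^d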
To solve an N-Class classification problem, we construct a binary SVM for each
pair of the N classes and predict a test sample as belonging to the class that wins the
maximum number of “votes” from the binary classifiers. In total N(N-1)/2 “one-
against-one” binary classifiers are needed.
We use the Matlab toolbox libsvm (Chang & Lin, 2001) to implement SVM
training and predicting. The toolbox trains the SVM using a Sequential-Minimal-
Optimization (SMO)-type decomposition method. Since the solution is often sub-optimal,
and the noisy training data may not represent the structure of unseen
data, the performance of a single SVM can be very unstable. A solution to this problem is
Bootstrap Aggregating (Bagging). Bagging is a method that generates multiple versions
of a classifier via the bootstrap sampling approach and uses these to get an aggregate
classification (Breiman, 1996; Kim, 2002). The scheme of SVM with Bagging is
shown in Figure 3.5.
Figure 3.5: Diagram of SVM with bootstrap aggregating
Let $\mathrm{TR} = \{z_1, z_2, \dots, z_m\}$, with $z_i = (x^{(i)}, y^{(i)})$, denote the training set. The bootstrap
method randomly draws observations from TR with replacement to produce a replicate dataset of TR,
denoted $\mathrm{TR}^{(j)}$, and repeats this drawing process B times. Each
replication is drawn independently and is used to train an SVM classifier. Thus we get a
set of B SVM classifiers, each trained independently on a replication of
the training set. To predict the classification of a test sample, we aggregate the SVMs
via majority voting. We first test the sample with all the SVMs and obtain a vector of
prediction labels $C = (c_1, c_2, \dots, c_B)$. Since the prediction errors of the SVMs should be
random and independent, the final prediction label of the test sample is selected as the
class that occurs most often in C. Empirical studies have shown that the Bagging method
generally outperforms a single SVM trained on the original training set TR (Wang
et al., 2009).
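The following sketch illustrates the Bagging scheme in Python, with scikit-learn's SVC standing in for the Matlab libsvm toolbox actually used; it is a simplified illustration (small non-negative integer class labels assumed), not the thesis code:

    import numpy as np
    from sklearn.svm import SVC

    def train_bagged_svms(X, y, B=35, seed=0):
        """Train B SVMs, each on an independent bootstrap replication
        TR^(j) of the training set, drawn with replacement."""
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(B):
            idx = rng.integers(0, len(X), size=len(X))  # bootstrap replication
            models.append(SVC(kernel="linear").fit(X[idx], y[idx]))
        return models

    def predict_majority(models, X_test):
        """Aggregate the B SVMs by majority voting over their predictions."""
        votes = np.stack([m.predict(X_test) for m in models])  # shape (B, n_test)
        # the most frequent label per test sample
        return np.array([np.bincount(col).argmax() for col in votes.T])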
3.3.1.2 Diagram of the classifier
The diagram of the EEG classifier based on SVM is shown in Figure 3.6. The
pre-processing steps include high-pass filtering with 1Hz cutoff frequency, down-
sampling and removing artifacts using ICA. For each trial of EEG recording, we
concatenated the data from all the channels to create an observation vector. The length
of the observation vector is the product of the number of channels and the trial length.
In addition, all the individual EEG trials are relabeled using the class index. i.e. The
33
EEG trials are labeled using number 1 to 8 for classifying 8 consonants and using
number 1 to 4 for classifying 4 vowels. This means at this stage, the classification is a
blind process without knowing any information of other phonemes presented in the
syllable stimuli.
Figure 3.6: Diagram of the SVM-with-Bagging EEG classifier. (a) Overall structure: the raw EEG is pre-processed, a PCA transformation matrix is estimated on the training set (TR) and applied to both TR and the test set (TE); TR is passed to bootstrap repetitions #1, ..., #B, each producing one SVM model, and the B SVMs are aggregated by "majority voting" on averaged test samples generated with the sample-without-replacement method. (b) Bootstrap repetition #i: TR is split into a replication TR(i) and a held-out set TE(i), averaged samples are created by sampling with replacement, parameters are optimized via cross-validation, and the SVM model is trained and tested.
We divided the EEG trials randomly into a training set (TR) and an Out-Of-
Sample (OOS) test set (TE). The classifier parameters are estimated using the training
set only, hence independent of the OOS test set. The OOS test set is used to test the
classification accuracy and generate the confusion matrices. Besides the SVM with
Bagging, the classification model also makes use of the following statistical methods:
Principal Components Analysis, averaging and cross-validation. Next, each module
of the classification model will be explained in detail.
Principal Components Analysis (PCA)
If the observed data include a large number of variables, it is very likely that some
variables are correlated. For EEG signals, data from adjacent channels are highly
correlated with each other. Thus when we classify EEG using data from multiple
channels, we apply PCA to reduce the number of variables. The PCA algorithm
rotates the data to a new coordinate system via an orthogonal linear transformation. The resulting
data have the greatest variance aligned with the first coordinate, the second greatest
variance aligned with the second coordinate, and so on (Jolliffe, 2002). In pattern
classification, PCA can be used to reduce the feature size because of the underlying
assumption that variables with very small variances are trivial for separating data from
different classes. Thus we can truncate the transformed data and keep only the first K
principal components without losing much information for classification. This is
equivalent to projecting the data onto a reduced subspace with only K coordinates. The
number of principal components kept in the feature vector, K, needs to be optimized
using empirical data.
The PCA orthogonal linear transformation can be calculated with the following
algorithm. Suppose we have m observed samples $x^{(i)},\ i = 1, \dots, m$, each with n
variables, $x^{(i)} \in \mathbb{R}^n$. First, we zero out the mean of the data by replacing each $x^{(i)}$ with
$x^{(i)} - \mu$, where

\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}    (3.17)

Then the empirical covariance matrix of x is calculated as:

\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T}    (3.18)

Next, we find the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ and the unit-length eigenvectors
$v_1, v_2, \dots, v_n$ of $\Sigma$. The matrix $V = [v_1\ v_2\ \cdots\ v_n]$ diagonalizes the covariance matrix as
$V^{-1} \Sigma V = D$, in which $D = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$. Rearrange the order of the columns of V so
that the diagonal elements of D are in descending order. Then the transformation
$y^{(i)} = V^T x^{(i)}$ rotates the original data to their principal components.
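A compact NumPy sketch of this algorithm (illustrative only; the function name pca_transform is hypothetical):

    import numpy as np

    def pca_transform(X, K):
        """PCA via eigendecomposition of the empirical covariance matrix
        (equations 3.17-3.18). X holds one observation per row; the first
        K principal components of each observation are returned."""
        Xc = X - X.mean(axis=0)              # zero out the mean (3.17)
        Sigma = (Xc.T @ Xc) / X.shape[0]     # empirical covariance (3.18)
        eigvals, V = np.linalg.eigh(Sigma)   # eigh, since Sigma is symmetric
        order = np.argsort(eigvals)[::-1]    # descending eigenvalues
        return Xc @ V[:, order[:K]]          # keep the top-K components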
In practice, PCA is usually calculated using a Singular Value Decomposition
(SVD) of the centered data matrix (Wall et al., 2003). In this classification model, the PCA transformation
matrix is estimated using all the individual trials of the training set as the first step of
training. We apply the transformation to both TR and TE to convert the original
observations to principal components.
Bootstrap repetitions
After applying PCA, all the training trials, represented as principal components,
are passed to B independent bootstrap repetitions to train B SVM classifiers
independently. The structure of each bootstrap repetition is illustrated in Figure 3.6(b).
In the ith bootstrap repetition, we randomly draw 80% of the trials in the training set
TR to create a bootstrap replication of TR, denoted TR(i). The remaining 20% of the
trials in TR are used as a test set TE(i) to monitor the SVM classifier's accuracy,
although this accuracy rate is not directly related to the final result and is not
reported.
Averaging
Computing the average of multiple trials in a given condition to extract the common structure of a
class of signals is a widely accepted technique in EEG studies, and we apply the
same technique in our research. Averaging cancels out
the uncorrelated noise and improves the signal-to-noise ratio. However, the averaging
procedure considerably reduces the number of training and testing samples and
produces a data-deficiency problem when sophisticated classification models are
estimated. Therefore, when we compute averages, we have to reuse the individual trials
efficiently without biasing the classification accuracy. In our
classification model, we use a sample-with-replacement scheme to randomly select the
trials for computing averages. The sampling is done for the training and
testing sets separately, which means an individual trial used to calculate an averaged
training sample cannot be used to compute an averaged testing sample.
For instance, to calculate the average of M individual EEG trials of initial
consonant $o_i$ as a training sample, we randomly draw M trials from a pool consisting
of all $n_i$ training trials whose corresponding auditory stimuli start with the
consonant $o_i$, and calculate their mean. The M trials are then put back into the pool for
calculating other averages. We repeat this procedure until a sufficient number of
averaged samples is obtained. When the number of trials in the pool, $n_i$, is much greater
than M, it is very unlikely to obtain two identical averaged samples.
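As an illustration, here is a sketch of this sample-with-replacement averaging scheme (hypothetical helper; within one average the M trials are distinct, and they are returned to the pool between draws):

    import numpy as np

    def averaged_samples(trials, M=25, n_samples=300, seed=0):
        """trials: (n_i, n_features) array of one phoneme's individual trials.
        Each averaged sample is the mean of M distinct trials; the M trials
        go back into the pool before the next draw."""
        rng = np.random.default_rng(seed)
        out = []
        for _ in range(n_samples):
            idx = rng.choice(len(trials), size=M, replace=False)
            out.append(trials[idx].mean(axis=0))
        return np.array(out)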
Note that the PCA algorithm rotates the coordinate system and aligns the axes
with the directions in which the signal has greater variance. If a set of data has large
variance in one direction, their averages also have large variance in that direction,
given that the number of trials in the pool is much greater than the number of trials
used to calculate one average. Thus we estimate and apply the PCA
transformation to the individual trials, which avoids repeated SVD calculations that
would dramatically slow down the computation.
Moreover, when we compute the p-value, we use the binomial distribution, which
assumes that the testing of each sample is independent. If the averaged test samples were
constructed using the sample-with-replacement scheme, two samples could share the
same source individual trial and hence would not be statistically independent. Therefore we
apply the sample-without-replacement scheme when calculating the averaged OOS test
samples, for accurate p-value estimation, as shown in Figure 3.6(a). Consequently, no
more than $\lfloor n_i / M \rfloor$ averaged samples can be created for the phoneme $o_i$, where $n_i$ is
the number of individual trials belonging to class $o_i$ in the OOS test set.
Optimizing SVM parameters
Choosing appropriate parameters of the SVM model, such as the number of
principal components K to keep and the cost factor C, is crucial for obtaining a high-performance
classifier. In each bootstrap repetition, we determine the optimal
parameters of each SVM classifier via nested Q-fold cross-validation using TR(i).
Suppose t independent parameters $(\theta_1, \theta_2, \dots, \theta_t)$ need to be optimized, with
$(m_1, m_2, \dots, m_t)$ candidate values respectively. All the candidate
parameter values form a search grid with $m_1 \times \cdots \times m_t$ points, denoted
$P_k,\ k = 1, \dots, m_1 \cdots m_t$. The procedure of cross-validation is:
(1) The individual EEG trials of TR(i) are randomly divided into Q groups with
approximately equal numbers of trials.
(2) Repeat the following for each cross-validation loop:
(2.1) For the jth cross-validation loop, use the jth group of training trials as the
validation test set (VTE(j)) and combine the other Q-1 groups as the
validation training set (VTR(j)). The averaged training and testing samples
are calculated using the sample-with-replacement scheme from VTR(j) and
VTE(j) respectively.
(2.2) For each point $P_k$ of the parameter search grid, we estimate an SVM
classifier, configured with the candidate parameter values, using the
averaged samples of VTR(j), and test its accuracy using the averaged
samples of VTE(j). A classification rate is obtained for $P_k$, denoted
$r^{(j)}(P_k)$.
(3) The mean cross-validation accuracy rate of each $P_k$ is calculated and the
optimal parameter set is chosen as:

\hat{P} = \arg\max_{P_k} \frac{1}{Q} \sum_{j=1}^{Q} r^{(j)}(P_k)    (3.19)
Next, we construct the averaged samples of TR(i) and train the SVM classifier
of the ith bootstrap repetition using the optimal parameter configuration $\hat{P}$.
The computational cost of cross-validation increases exponentially with the number
of parameters to be optimized; thus we cannot afford to search over more than three
parameters. Since the parameters are optimized independently in each bootstrap
repetition, the resulting SVMs may have different structures.
As the last step, the SVM classifiers are aggregated via "majority voting" and
tested on the averaged test samples, which are generated using the sample-without-replacement
scheme.
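A simplified sketch of the (K, C) grid search with Q-fold cross-validation (equation 3.19) is given below; it omits the within-fold averaging step for brevity and assumes X already holds the PCA-rotated observations, so keeping K principal components is just truncating to the first K columns:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    def grid_search_cv(X, y, Ks, Cs, Q=5):
        """Return the (K, C) pair maximizing the mean Q-fold validation rate."""
        mean_rate = np.zeros((len(Ks), len(Cs)))
        for tr, va in KFold(n_splits=Q, shuffle=True).split(X):
            for a, K in enumerate(Ks):
                for b, C in enumerate(Cs):
                    clf = SVC(kernel="linear", C=C).fit(X[tr][:, :K], y[tr])
                    mean_rate[a, b] += clf.score(X[va][:, :K], y[va]) / Q
        a, b = np.unravel_index(mean_rate.argmax(), mean_rate.shape)
        return Ks[a], Cs[b]

    # e.g. K_hat, C_hat = grid_search_cv(X, y, Ks=list(range(5, 205, 5)),
    #                                    Cs=[2.0 ** e for e in range(-8, 0)])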
3.3.2 Classification results
We tested the SVM classification model using the Syllables-III data and the
Isolated-vowels data. The 1 kHz raw brainwave data were down-sampled 16 times
to 62.5Hz. For classifying the initial consonants, only the first 32 samples of each trial,
representing 512ms of EEG signal, were used in classification. The full-length trials
of 62 samples were used to classify the vowels.
3.3.2.1 Linear vs. Nonlinear Kernels
First, we combined the EEG trials from all four subjects of the Syllables-III data and
tested the performance of the SVM-with-Bagging classifier using linear and non-linear
kernels. We classified the 8 consonants and the 4 vowels as two independent classification
problems. We also tested the classifier on recognizing the 4 vowels in the Isolated-vowels
data. The classification experiment was configured as follows:
All 124 monopolar channels are concatenated as the EEG observation vector.
The training set TR included half of the individual EEG trials and the OOS test set
TE contained the other half; the training/OOS-testing partition is random.
35 SVMs were built using the Bagging scheme.
Each averaged sample was computed from 25 individual trials.
For the linear-kernel SVM model, there are only two parameters to be optimized:
the size of the principal-component vector, K, and the cost factor of the SVM, C.
Thus the linear-kernel SVM was tested first. For the 8-consonant classification, we
used 5-fold cross-validation to choose the number of principal components K
from [5,10,15,...,200] and the cost factor C from
$\{2^{-8}, 2^{-7}, \dots, 2^{-1}\}$. The mean of the validation rates across all the bootstrap repetitions is
plotted in Figure 3.7. We can see that if C is fixed and K increases from 5, the
recognition accuracy improves dramatically at first, and the
growth rate decreases after K reaches a certain level. The cost factor C affects the
sensitivity of the classification rate with respect to K. Although a larger principal-component
size K leads to a better recognition rate, it also considerably increases the
computation cost. With an appropriate C, a better recognition rate can be achieved
with a smaller number of principal components.
Figure 3.7: Mean validation accuracy on parameter search grid (K,C) using linear
kernel
Now we consider the computation cost of training the SVM-with-Bagging
classifier with linear kernel. The parameters were optimized via 5-fold validation,
searching a grid of 40×8 = 320 points. Thus around 320×5+1 = 1601 SVM
training runs are needed to estimate each SVM classifier. An ensemble of 35 such SVMs was
used to make the final predictions. Therefore, in total 1601×35 = 56,035 SVM
training runs are required to construct the EEG classification model with linear
kernel. For the non-linear kernel experiments, since more parameters need to be
optimized, a full search of the parameter grid becomes infeasible. Thus we used a
fixed cost factor C = 0.02, corresponding to the fastest ascending slope of the mean
validation rate with respect to K in the linear-kernel experiment, and K = 200, while
optimizing only $\gamma$ for the Gaussian-kernel experiment and $\gamma$ and $C_0$ for
the polynomial-kernel experiment. The classification accuracy rates and significance
levels (p-values) are shown in Table 3.2.
Table 3.2: Phoneme classification results using the SVM-with-Bagging method with linear and non-linear kernels

  Task                     8 consonants in        4 vowels in           4 vowels
                           CV syllables           CV syllables          (isolated)
  Number of test samples   426                    426                   140
  Kernel                   rate      p-value      rate      p-value     rate      p-value
  Linear                   46.0%     <10^{-64}    41.5%     <10^{-13}   68.8%     <10^{-26}
  Gaussian                 42.7%     <10^{-53}    36.9%     <10^{-7}    65.3%     <10^{-23}
  Quadratic                42.7%     <10^{-53}    34.3%     <10^{-5}    62.4%     <10^{-20}
  Cubic                    43.7%     <10^{-56}    41.5%     <10^{-13}   65.9%     <10^{-23}
The results show that the linear SVM-with-Bagging model correctly classified 46%
of the 426 consonant test samples (p-value < 10^{-64}). The classification rates of the 4
vowels in the CV syllables are much lower than the consonant results, with a 41.5%
accuracy rate using the linear kernel (p-value < 10^{-13}). However, we find that the model
works well on the same 4 vowels presented in isolation, achieving a classification rate
of 68.8% with the linear kernel. The high significance levels demonstrate the effectiveness
of the SVM-with-Bagging classification method in modeling the averaged EEG
recordings of auditory phoneme stimuli. But it works much better on the phonemes
presented at the beginning of the stimuli than on the following phonemes. As mentioned
in Chapter 1, the EEG recording reflects postsynaptic potentials, which may last
longer than the actual duration of the sound stimuli. Thus the EEG response
to the initial consonant may impose extra noise on the brainwave of vowel perception
and make it unintelligible.
Theoretically, the non-linear kernels should achieve at least the same performance
as the linear kernel. However, limited by the computational capabilities of our
computers, we could not run a full parameter grid search to find the optimal
parameters of the non-linear kernels. As a result, the classifiers with non-linear kernels
did not outperform the classifier with the linear kernel in this experiment.
3.3.2.2 Leave-one-subject-out experiment
With the same experiment setup, we tested the invariance of EEG representations
among subjects using the Syllables-III data. We used trials from one subject to create
test samples and trials from the other three subjects to train the linear SVM model.
This procedure was repeated for each of the four subjects. The resulting classification
rates and p-values are shown in Table 3.3.
Table 3.3: Leave-one-subject-out classification results using the SVM-with-Bagging method

                      8 consonants                          4 vowels
  Subject for         Number of                             Number of
  testing             test samples  rate     p-value        test samples  rate     p-value
  DS                  138           26.8%    <10^{-5}       141           35.5%    0.0036
  SA                  176           33.5%    <10^{-12}      176           30.7%    0.051
  LK                  280           30.4%    <10^{-10}      284           37.0%    <10^{-5}
  LH                  248           25.0%    <10^{-7}       248           40.7%    <10^{-7}
The classification rates for the 8 consonants range from 25.0% (p-value < 10^{-7}) to 33.5%
(p-value < 10^{-12}). Although these rates are considerably lower than the results of the
previous experiment, they are highly significant and demonstrate that the EEG
representations of consonants are approximately invariant across subjects. In
contrast, the vowel classification results, which vary from 30.7% to 40.7%, are
comparable to the results obtained by mixing the trials of all the subjects. This
suggests that the EEG representations of vowels have stronger inter-subject invariance
than the EEG representations of consonants.
3.3.2.3 Experiment on the number of trials used to calculate averages
In all the experiments examined above in this section, we trained and tested the
classification model using averaged samples, with the number of individual trials per
average, M, fixed at 25. To explore how the averaging process affects
the classification, we also classified the initial consonants using the linear SVM with
various values of M, and plotted the relation between M and the percentage of test samples
correctly classified in Figure 3.8.
Figure 3.8: Classification rates for the 8 initial consonants as a function of the
number of trials used to calculate averages
The figure shows that when classifying individual trials, only 17.3% of the test trials
were correctly classified. The accuracy rates increase rapidly as M is increased, and the
ascent slows slightly once M is greater than 20. Although our data size
is insufficient to determine when the classification rates saturate, given these
results we conclude that averaging can efficiently improve the signal-to-noise ratio,
which verifies that the noise in the EEG signals is mainly uncorrelated across
trials.
3.3.2.4 Experiment on classifying individual EEG trials using data from a single
channel
We also classified the individual EEG trials of the 4 initial consonants /p/, /t/, /b/ and
/g/ from subject LK using the SVM-with-Bagging model, so that its performance can
be compared with the brain-speech mapping classification results shown in Figure 3.4.
In this experiment we combined the 8 sessions from subject LK in the Syllables-III
data. We randomly drew 75% of the individual EEG trials associated with the targeted
phonemes as the training set and used the remaining 25% as the OOS test set. To match
the experimental setup of the brain-speech mapping classification, we trained and tested the
SVM-with-Bagging classifier using individual EEG trials from each of the 124
monopolar channels separately. The parameter setup of the experiment was:
We used 32 samples, representing approximately the first 500ms of brain response of each EEG trial, as
the observation data vector; PCA is not necessary in this case.
The linear kernel was adopted in the SVM-with-Bagging classification model.
35 SVMs were built using the Bagging scheme.
The cost factor of the soft-margin SVMs was optimized via nested 5-fold cross-validation
loops and chosen from $\{2^{-10}, 2^{-9}, \dots, 2^{2}\}$.
The resulting classification rates are shown in the brain map in Figure 3.9. The
numbers denote the classification rates using the monopolar-channel data collected
from the corresponding scalp locations.
Figure 3.9: Performance of SVM-with-Bagging method on classifying 4 initial
consonants using single channel data
We can see that this brain map matches the brain map in Figure 3.4 very well. Both
experiments obtained the highest classification rates from the channels around the left ear,
and the best single-channel classification rates are the same, 36%. The major
difference between the two brain maps is that the SVM-with-Bagging method can classify
the consonants reasonably well using some channels from the right hemisphere, which
is not seen in the classification results of the brain-speech mapping model.
3.4 Summary
In this chapter, we proposed two different approaches to classifying the EEG
brainwaves evoked by auditory phoneme stimuli. The first method makes use of the acoustic
properties of the auditory stimuli and examines the relations between the brainwaves
and the speech sound waves. The second approach follows the traditional pattern
recognition strategy and makes use of statistical signal processing methods such as
PCA and SVM-with-Bagging. We used these methods to classify both individual EEG
trials and averaged EEG trials. Both methods achieved significant results, especially
in classifying the initial consonants. The performances of the two classification models are
similar when classifying individual trials of the 4 initial consonants. The classification
rates could be further improved by bringing the two models together; for example, the
brain-speech mapping method could take advantage of the Bagging scheme and PCA
as well. Using the SVM-with-Bagging method, we showed that non-linear
methods did not outperform the linear method in our experiments and that the EEG
representation of phonemes is approximately subject-invariant.
Using the second method as the baseline, in the next chapters we examine how
phonetic differences are reflected in EEG brainwaves and explore whether the classification
methods can be improved by introducing the phonological information of the stimuli into the
classification.
Chapter 4
Frequency Analysis of EEG Signals
4.1 EEG signals in frequency domain
The long history of studying EEG in the frequency domain started almost at the
same time as the first successful recordings of human EEG in the 1920s. Researchers
found that EEG signals contain rhythmic activities across a spectrum of frequencies.
Oscillations in a certain frequency range may reflect a specific cognitive state of the
brain. For example, alpha waves, in the frequency range of 8 to 12 Hz, are believed to
be related to wakeful relaxation with closed eyes. Beta waves, observed from
frontal sensors and ranging from 12Hz to 30Hz, are closely linked to motor activities
(Pfurtscheller, 1999). Gamma activities in the frequency range from 30 to 100Hz seem
related to the binding of neural processes in different brain areas for carrying out a
coherent cognitive or motor activity (Tallon-Baudry & Bertrand, 1999). However, in
the literature on EEG related to phoneme perception, the focus of this thesis,
researchers have been more interested in temporal information. Very few reports on
frequency analysis of EEG activities evoked by phoneme stimuli have been published.
In this chapter, we address the question of whether frequency analysis can
extract attributes of EEG associated with auditory phoneme perceptual activities.
domain, we first plot the power spectral densities (PSD) of our recordings. Only one
EEG session on syllables from subject LK in Syllable-III experiment is examined. The
EEG signals were passed through a 1Hz high-pass filter and down-sampled four times
to 250Hz. The PSD of each monopolar channels was computed using the covariance
method and then the average PSD across all 124 monopolar channels was calculated.
This average PSD from 0Hz to the Nyquist frequency 125Hz is plotted in the dB scale
in Figure 4.1.
47
Figure 4.1: Average power spectral densities of EEG signal sampled at 250Hz
The plot shows that, besides the dominant 60 Hz AC power-supply noise, the
power of the EEG signal is mainly distributed in the low frequency range from 0 to
20Hz, and the power is inversely related to frequency. The 20Hz component shows
more than 20dB of power decay compared to the maximum at 2Hz. It is natural to think
that the essential information about brain activities is carried by the frequency components
with higher energies. Hence reducing the size of the data by down-sampling to
62.5Hz, as we did in Chapter 3, does not lose much useful information. In the remainder
of this chapter, we focus only on the lower frequency range from 0Hz to
approximately 31Hz.
4.2 EEG spectral features
The Discrete Fourier Transform (DFT) is commonly used to convert a finite
discrete time-domain signal to the frequency domain. For a time-domain signal
$x(n),\ n = 0, \dots, N-1$, the N-point DFT consists of the N complex numbers:

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i \frac{2\pi}{N} nk}, \quad k = 0, \dots, N-1    (4.1)

And $x(n)$ can be reconstructed from $X(k)$ using the inverse transformation:

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{i \frac{2\pi}{N} nk}, \quad n = 0, \dots, N-1    (4.2)

A real time-domain signal $x(n)$ has a conjugate-symmetric spectrum, which
means

X(k) = X^{*}(N-k), \quad k = 1, \dots, N-1    (4.3)

and both $X(0)$ and $X(N/2)$ are real when N is even.
Now we consider only the case where $x(n)$ is real and N is even. If we write the
complex DFT coefficient $X(k)$ as

X(k) = A_k e^{i \varphi_k}    (4.4)

then the inverse DFT can be reformulated as:

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} A_k e^{i(\frac{2\pi}{N} kn + \varphi_k)} = \frac{1}{N} \left[ A_0 \cos\varphi_0 + A_{N/2} \cos(\pi n + \varphi_{N/2}) + \sum_{k=1}^{N/2-1} 2 A_k \cos\!\left(\frac{2\pi}{N} kn + \varphi_k\right) \right]    (4.5)

Equation (4.5) shows that the DFT represents the time-domain signal as a
superposition of a series of discrete sinusoidal functions, each of which is defined by
three attributes: frequency, amplitude and phase.
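Equation (4.5) can be verified numerically; the following NumPy check (illustrative only) reconstructs a random real signal from the amplitudes and phases of its DFT:

    import numpy as np

    N = 64
    x = np.random.randn(N)           # any real signal of even length
    X = np.fft.fft(x)
    A, phi = np.abs(X), np.angle(X)  # amplitudes A_k and phases phi_k

    n = np.arange(N)
    x_rec = (A[0] * np.cos(phi[0])
             + A[N // 2] * np.cos(np.pi * n + phi[N // 2])
             + sum(2 * A[k] * np.cos(2 * np.pi * k * n / N + phi[k])
                   for k in range(1, N // 2))) / N
    assert np.allclose(x, x_rec)     # superposition of sinusoids, eq. (4.5)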
To explore the relation between these spectral attributes and the phoneme
perception process in the brain, we constructed several EEG features based on the DFT
that reflect all or some of the spectral attributes. Then, using the SVM classifier introduced
in Chapter 3, we tested whether the frequency-domain features can be used to predict the
brain representation of phoneme stimuli.
DFT of EEG
The first feature is the DFT itself, calculated separately for each trial and each
channel. Since the time-domain signal $x(n)$ is real, half of the N-point DFT is
redundant. We represent $x(n)$ by

X_{\mathrm{DFT}} = \left[ X(0),\ \mathrm{Re}\,X(1), \dots, \mathrm{Re}\,X(N/2-1),\ \mathrm{Im}\,X(1), \dots, \mathrm{Im}\,X(N/2-1),\ X(N/2) \right]    (4.6)

which is a vector of N non-redundant real numbers. Theoretically, $X_{\mathrm{DFT}}$ should be
equivalent to the time-domain signal and achieve the same classification accuracy.

Amplitude of DFT
The amplitude feature of EEG keeps only the amplitude of each
sinusoidal component. From the conjugate-symmetry property, we have

A_k = A_{N-k}, \quad k = 1, \dots, N-1    (4.7)

Thus the non-redundant representation of the amplitudes is:

X_{\mathrm{AMP}} = [A_0, A_1, \dots, A_{N/2}]    (4.8)

EEG features based on phase
Similarly, when we define EEG features based on the phases of the DFT, only N/2
values need to be included:

X_{\mathrm{PHS}} = [\varphi_1, \dots, \varphi_{N/2}]    (4.9)
The EEG signal was passed through several filters at the pre-processing stage to
remove noise and artifacts. A low-pass filter with zero-phase response was used
for anti-aliasing before down-sampling, but the 4th-order Butterworth
high-pass filter with the cutoff frequency at 1Hz has a non-zero phase response. The
frequency response of the high-pass filter in the low frequency range is shown in
Figure 4.2. The high-pass filter introduces a non-linear phase distortion that needs to be
compensated: if the high-pass filter generates a phase delay $\delta_k$ at the frequency
of the kth sinusoidal component, the phase $\varphi_k$ of that component should be
replaced by $\varphi_k - \delta_k$.
Figure 4.2: Magnitude and phase frequency response of 1 Hz high-pass filter
Furthermore, all the elements of $X_{\mathrm{PHS}}$ are angular values, which means $\varphi_k$ is
identical to $\varphi_k + 2\pi$. The linear methods used in classification, such as
averaging and the linear separating hyperplane, do not work appropriately for angular
observation values. Here we propose two different approaches to overcome this
problem.
The first method is to describe the phase angle $\varphi_k$ as a unit-length vector in the
complex plane and use its real and imaginary parts, $\cos\varphi_k$ and $\sin\varphi_k$, as the
observed values. Then the EEG feature is written as:

X_{\mathrm{PHS\text{-}1}} = \left[ \sin\varphi_1, \cos\varphi_1, \dots, \sin\varphi_{N/2}, \cos\varphi_{N/2} \right]    (4.10)
In the other approach, we keep only the phase information of the DFT and transform
it back to the time domain using the inverse DFT. More specifically, for each
element of $X(k),\ k = 0, \dots, N-1$, the modified DFT is defined as:

\tilde{X}(k) = \begin{cases} e^{i\varphi_k} & \text{if } A_k \ne 0 \\ 0 & \text{if } A_k = 0 \end{cases}    (4.11)

And the EEG feature PHS-2 is

X_{\mathrm{PHS\text{-}2}} = \mathrm{IDFT}(\tilde{X}(k))    (4.12)

Obviously $\tilde{X}(k)$ is also conjugate symmetric, so the derived time-domain signal
is real. Thus $X_{\mathrm{PHS\text{-}2}}$ is a vector of N real numbers, which may be longer than
the original EEG data vector. Although $X_{\mathrm{PHS\text{-}2}}$ is a time-domain signal, it is constructed
from only the phase pattern of the original signal, and all the amplitude differences
of the non-zero sinusoidal components are eliminated. Hence $X_{\mathrm{PHS\text{-}2}}$ is still
considered a phase feature, while all the time-domain signal processing methods can
also be applied to it.
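The three feature constructions can be sketched in a few lines of NumPy (illustrative only; the phase compensation for the high-pass filter is omitted here):

    import numpy as np

    def spectral_features(x):
        """AMP, PHS-1 and PHS-2 features (equations 4.8-4.12) of one real
        EEG trial x of even length N."""
        N = len(x)
        X = np.fft.fft(x)
        A, phi = np.abs(X), np.angle(X)

        amp = A[:N // 2 + 1]                               # X_AMP, eq. (4.8)
        phs1 = np.column_stack((np.sin(phi[1:N // 2 + 1]),
                                np.cos(phi[1:N // 2 + 1]))).ravel()  # eq. (4.10)

        X_tilde = np.where(A > 0, np.exp(1j * phi), 0.0)   # eq. (4.11)
        phs2 = np.fft.ifft(X_tilde).real                   # eq. (4.12)
        return amp, phs1, phs2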
4.3 Classification results
4.3.1 Comparing the EEG features based on DFT
To test how well the EEG spectral features describe the brain activities of
phoneme perception, we computed the four proposed EEG spectral features for the
Syllables-III data and the Isolated-vowels data. We used the features to classify the
brain representations of the phoneme stimuli with the linear-kernel SVM model
introduced in Section 3.3. The classification scheme is identical to the one described in
Figure 3.6, except that the EEG time-domain signal of each trial is converted to the
spectral features immediately after the ICA cleaning in the pre-processing stage. For
EEG data sampled at 62.5Hz, we used 62 samples as the time-domain observations of
one second of EEG activity and zero-padded them for a 64-point DFT. Thus the
frequency resolution of the DFT is approximately 1Hz. The classifiers were trained and
tested using the following configuration:
EEG channels: 124 monopolar channels
Number of SVMs for Bagging: 35
Number of trials for averaging: 25
The number of principal components used for classification is optimized in nested
5-fold cross-validation loops and chosen from [5,10,15,...,200].
The cost factor of the SVMs is optimized in nested 5-fold cross-validation loops and
chosen from $\{2^{-10}, 2^{-9}, \dots, 2^{2}\}$.
The percentages of test samples classified correctly are summarized in Table 4.1.

Table 4.1: Comparing the classification rates of 4 EEG spectral features

                                    8 consonants   4 vowels in     4 isolated
                                                   CV syllables    vowels
  Number of test samples            426            426             140
  Temporal signal (TIME)            41.5%          35.5%           68.8%
  Spectral     DFT                  38.0%          31.5%           70.2%
  features     AMP                  10.3%          28.5%           27.7%
               PHS-1                35.2%          29.7%           51.8%
               PHS-2                39.2%          27.8%           55.3%
The initial-consonant classification results show that the DFT spectral features
achieve slightly lower classification rates than the full-length temporal signal.
The AMP features barely worked for distinguishing the EEG representations of initial
consonants, achieving only 10.3% accuracy in the 8-consonant classification, which is
no better than the chance level. Hence the amplitudes of the DFT carry very
little information for distinguishing the initial-consonant brain images. Among the four
spectral features, PHS-2 gave the best classification rate, 39.2%, which is better than
that of the full DFT representation. These results demonstrate that
eliminating the amplitude information can improve the initial-consonant classification
rates. A similar performance-difference pattern is found in the isolated-vowel brain-image
classification, except that the phase-related features cannot classify the vowels
as well as the temporal and DFT features. Since the total number of test samples of
isolated vowels is 140, much less than the 426 test samples of initial
consonants, the isolated-vowel classification rates are not as robust as the rates of the
initial-consonant classification.
We also found that the superiority of phases over the amplitudes of the DFT is not
shown clearly in classifying the EEG brainwaves of the four vowels in CV syllables. This
is because the actual start times of the vowels presented at the non-initial position
differ, due to the varying durations of the preceding consonants. Therefore, if the
distinctions between the EEG images of phonemes are reflected in the phases of the DFT,
which describe temporal delays of the sinusoidal components of the signal, these
distinctions can be contaminated by the different delays when the phonemes are not
presented as the initial phoneme of a stimulus. These results also help explain
why the classification rates of vowels in CV syllables are significantly lower
compared to the others, and they indicate a possible direction for improving the rates.
Moreover, although the DFT feature is mathematically equivalent to the TIME
feature, meaning that each is fully determined by the other, the DFT feature
does not reach as high a rate as TIME under the linear classification methods. Similar
results were obtained for the PHS-1 and PHS-2 features. This may suggest that the
proposed SVM classification algorithm works better on the EEG signal when it is
represented in the time domain.
In short, the remarkable differences in classification rates using spectral features
show that the phoneme brain representations are nearly independent of the amplitudes
of the sinusoidal components of EEG but are much more strongly reflected in their phase pattern.
4.3.2 Frequency selection
Now we know that the phases of the DFT can describe the EEG of the phoneme
perception process rather well. This is shown by the classification experiments
discussed above, which used the spectral properties across the frequency range from
DC to the Nyquist frequency as the observation features. However, it is natural to think
that the frequency components contribute differently to the classification: those that
are unrelated to the target neural activities may impose extra noise on the classifiers
and reduce the classification rate. By optimizing the choice of frequency components
with respect to maximizing the classification rate, we may be able to find the
frequency range of EEG activities that is more directly related to the brain's processing of
phonemes in humans.
The ideal approach would be to include a full search of possible frequency choices while
training the Bagging SVMs, as we did for the number of principal components and the
SVM cost factor, but this would increase the computation time to an impractical level.
Thus we look for an approximately optimal frequency range via a 10-fold cross-validation
using only the training set. Since the number of trials in the Isolated-vowels
data is insufficient, we only examine the best frequency range for classifying initial
consonants using the Syllables-III data.
First of all, we assume that the frequency components carrying the information to
distinguish the initial consonants lie in a continuous range from $f_L$ to $f_H$,
corresponding to the frequency indices L and H of an N-point DFT. Our purpose is to
find the optimal parameter pair (L, H) in the search grid

\left\{ (L, H) : L \le H;\ 0 \le L, H \le \frac{N}{2} \right\}    (4.13)

which maximizes the mean classification rate of cross-validation. In this experiment
we down-sampled the EEG data to 62.5Hz and applied a 64-point DFT. The
optimization procedure is as follows:
(1). All the training trials (TR) are randomly divided into 10 groups with
approximately the same number of trials.
(2). Repeat the following steps for each pair (L,H) of the grid.
(2.1) Calculate the modified PHS-2 feature for the frequency band between L and
H and use it as the observation vector of the trial. If $X(k),\ k = 0, \dots, N-1$, is the DFT of the EEG signal collected from one channel, we calculate:

\tilde{X}(k) = \begin{cases} e^{i\varphi_k} & \text{if } A_k \ne 0 \text{ and } L \le k \le H \\ e^{i\varphi_k} & \text{if } A_k \ne 0 \text{ and } L \le N-k \le H \\ 0 & \text{otherwise} \end{cases}    (4.14)

The band-limited PHS-2 feature is the IDFT of $\tilde{X}(k)$.
(2.2) Transform the modified PHS-2 features into their principal components
using PCA. Only the first 200 principal components are used in the
subsequent computation. Each EEG trial is now reduced to a vector of 200
elements that depend only on the phase pattern of the DFT within the
frequency band $[f_L, f_H]$.
(2.3) Repeat the following for 10 cross-validation loops:
(2.3.1) In cross-validation loop i, use group i as the test set ($\mathrm{VTE}^{(i)}$)
and combine all the other 9 groups as the training set ($\mathrm{VTR}^{(i)}$).
(2.3.2) Use the sample-with-replacement method discussed in 3.3.1.2 to
create averaged test and training samples; each sample is the mean
of the PHS-2 features of 25 individual trials. The training set
$\mathrm{VTR}^{(i)}$ and the test set $\mathrm{VTE}^{(i)}$ are sampled separately. The total
number of training samples equals the number of individual
trials in $\mathrm{VTR}^{(i)}$, and the total number of test samples equals the
number of individual trials in $\mathrm{VTE}^{(i)}$.
(2.3.3) Train an 8-class linear SVM classifier using the training samples of
$\mathrm{VTR}^{(i)}$ with the cost factor $C = 2^{-6}$. Then use it to predict the class
labels of the test samples and compute the percentage of samples
classified correctly, denoted $r^{(i)}_{(L,H)}$.
(2.4) The mean classification accuracy of the parameter pair (L, H) is defined as:

\bar{r}_{(L,H)} = \frac{1}{10} \sum_{i=1}^{10} r^{(i)}_{(L,H)}    (4.15)

And the optimal pair (L, H) is chosen as $(\hat{L}, \hat{H}) = \arg\max_{(L,H)} \bar{r}_{(L,H)}$.
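The band-limited PHS-2 feature of step (2.1) can be sketched as follows (illustrative NumPy; the conjugate-symmetric mirror bins are kept so that the inverse DFT is real):

    import numpy as np

    def band_limited_phs2(x, L, H):
        """Modified PHS-2 feature of equation (4.14): unit-modulus phase
        terms for the DFT bins in [L, H] and their mirrors; zero elsewhere."""
        N = len(x)
        X = np.fft.fft(x)
        k = np.arange(N)
        keep = ((k >= L) & (k <= H)) | ((N - k >= L) & (N - k <= H))
        X_tilde = np.where(keep & (np.abs(X) > 0), np.exp(1j * np.angle(X)), 0.0)
        return np.fft.ifft(X_tilde).real

    # e.g. with the optimal band found below: feature = band_limited_phs2(trial, 2, 9)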
The mean classification accuracies of all the candidate (L, H) pairs in the
search grid $\{(L, H) : L \le H;\ 0 \le L \le 4,\ 0 \le H \le 20\}$ are shown in Figure 4.3. We find that
the parameter pair (2, 9) gives the best mean classification rate, 45.5%, for 8 classes.
These DFT indices correspond approximately to the frequency range 2Hz to 9Hz.
Figure 4.3: Mean classification rates of parameter pair (L, H) obtained from 10-fold
cross validation
To test whether the optimal frequency band generalizes to the OOS test trials and to
compare the result with the other temporal and spectral representations of EEG, we
classified the EEG signals of phonemes, represented as the modified PHS-2 feature
with the limited bandwidth [2Hz, 9Hz], using the linear-kernel SVM-with-Bagging
classifier. Although the best frequency range was obtained using the EEG responses to
initial consonants, we also applied it to classifying the isolated vowels to check whether any
improvement could be made.
The experiments were configured as follows:
EEG channels: 124 monopolar channels
Number of SVMs for Bagging: 35
Number of trials for averaging: 25
EEG observations: modified PHS-2 for a 64-point DFT with L=2, H=9
Number of principal components used for classification: 200
The cost factor of the SVMs is optimized in nested 5-fold cross-validation loops and
chosen from $\{2^{-10}, 2^{-9}, \dots, 2^{2}\}$.
The classification results are summarized in Table 4.2, in comparison with the
classification rates obtained using the temporal signal and the phase features across the full
frequency range from DC to the Nyquist frequency.

Table 4.2: SVM-with-Bagging classification results using the EEG phase feature in the frequency range from 2Hz to 9Hz

                         8 initial consonants            4 isolated vowels
  Temporal signal        46.0% (500ms); 41.5% (1 sec)    68.8%
  PHS-2                  39.2%                           55.3%
  PHS-2 [2Hz, 9Hz]       51.4%                           73.8%
For the 8 initial consonants, 219 out of 426 test samples were classified correctly.
The accuracy rate is 51.4%, with a p-value less than 10^{-82}. The result is significantly
better than classifying the EEG of initial consonants using the time-domain signal. The
classification rate on isolated vowels also improved, from 68.8% to 73.8%.
In conclusion, EEG features built on the phase pattern of the DFT can describe
the brain image of a phoneme as well as the original time-domain EEG signal.
Eliminating the amplitude information of the DFT does not diminish the distinctions between the
EEG representations of different phonemes at the initial position of the auditory stimuli.
The phase pattern of the sinusoidal components in the frequency range from 2Hz to 9Hz is
more important than the other frequency components in distinguishing the EEG images of
phonemes.
Chapter 5
Invariant Similarities between Brain
and Perceptual Representations of
Phonemes
Our degree of success in classifying brain representations of phonemes supports
investigating brain activities at a level below phonemes, i.e., the brainwaves
reflecting the distinctive features of phonemes, compared with perceived phonological
features. Intuitively, if two phonemes are perceptually close, the brainwaves evoked
by them should also be close. In this chapter, we derive the similarities between the
brainwaves of phonemes from the confusion matrices of the classification results
and compare them with the perceptual similarities obtained from corresponding
perceptual experiments.
5.1 Psychological experiments on phoneme perception
The phonological features introduced in Chapter 1 are not equally efficient in
discriminating phonemes perceptually. Historically, researchers have studied the
effectiveness of distinctive features in separating phonemes via psychological
experiments. In these experiments, the auditory speech tokens are presented to the
listeners, who are instructed to identify the phonemes they heard. The perceptual
confusion between each pair of phonemes is recorded. In the typical experiment
settings, the utterances are presented via a noisy speech channel with frequency
distortions to create the necessary confusions. One of the first psychological
experiments on consonant confusion is the renowned work of Miller and Nicely (1955).
They recorded the perceptual confusions in identifying 16 consonants, which were
filtered and presented at different signal-to-noise ratios (SNR), and used the
confusion data to determine the robustness of the distinctive features under filtering or
noise-masking conditions. They found that some features, voicing and nasality for
instance, are very robust, but the discernibility of the place of articulation is likely to
be affected. The Miller-Nicely experimental results were reliably reproduced in 2005
using modern computerized techniques and digital audio recordings (Phatak & Allen,
2008). Wang and Bilger (1973) conducted a similar but more thorough experiment.
They calculated the perceptual confusion matrices of 24
consonants, which cover all the distinctive consonant sounds in most English dialects,
in CV or VC syllables with different vowels, and evaluated the robustness of
phonological features in a variety of contexts and listening conditions. Relatively less
work on vowel confusions has been carried out. Besides frequency distortion
and noise masking, researchers have also studied phoneme perceptual confusions under
other conditions, such as short-term memory (Wickelgren, 1966) and impaired
hearing (Munson, 2002).
5.2 Similarity measurements
We applied the similarity analysis tools, semiorders and hierarchical partition
trees, to interpret both the brainwave confusion and the psychological confusion data.
Then the invariance between the brainwave similarity of phoneme images and the
corresponding perceptual similarity can be derived. We now briefly describe these
methods, which follow those of Suppes, Perreau-Guimaraes & Wong (2009).
5.2.1 Semi-Order and Invariant Partial Order of
similarities
When we classify the brainwaves of phonemes, we count the number of test
samples of phoneme $o_j$ that are classified as belonging to phoneme $o_i$ and normalize
it by the total number of test samples of phoneme $o_j$. This gives the estimated
conditional probability $\hat{p}(o_i^+ \mid o_j^-)$, where "+" and "-" denote the prototypes and the
test samples respectively. If we repeat this for each pair (i, j), a conditional probability
matrix is obtained. The normalized confusion matrix of the classification results provides
empirical evidence for ordering the similarity differences of the brainwave representations
of phonemes. Briefly speaking, it is natural to say that the phoneme $o_i^-$ is more
similar to the prototype $o_j^+$ than the phoneme $o_{i'}^-$ is to the prototype $o_{j'}^+$ if and only if

\hat{p}(o_j^+ \mid o_i^-) > \hat{p}(o_{j'}^+ \mid o_{i'}^-)    (5.1)
We denote this similarity-difference relation by

o_j^+ | o_i^- \succ o_{j'}^+ | o_{i'}^-    (5.2)

Because the confusion matrices are generally not symmetric, the similarity
differences are not necessarily symmetric. In practice, the difference between the
similarity of $o_j^+$ to $o_i^-$ and the similarity of $o_{j'}^+$ to $o_{i'}^-$ is considered statistically
insignificant if $\hat{p}(o_j^+ \mid o_i^-)$ and $\hat{p}(o_{j'}^+ \mid o_{i'}^-)$ are close enough. Here we introduce a
numerical threshold $\varepsilon$ such that

o_j^+ | o_i^- \succ o_{j'}^+ | o_{i'}^- \quad \text{iff} \quad \hat{p}(o_j^+ \mid o_i^-) > \hat{p}(o_{j'}^+ \mid o_{i'}^-) + \varepsilon    (5.3)
It can be proven that the similarity-difference ordering defined by the estimated
conditional probabilities with the numerical threshold is a semiorder on

A = \{ o_j^+ | o_i^- : i = 1, \dots, N;\ j = 1, \dots, N \}

i.e., $\succ$ is irreflexive, strongly transitive and an interval order on A.
In our study, we also need to compare the structural invariance between two
semiorders: the similarity-difference ordering of brainwaves, denoted $\succ_{br}$, and the
perceptual similarity-difference ordering of phonemes, denoted $\succ_{per}$. The invariance
is given by the intersection of the two semiorders

\succ_{inv}\ =\ \succ_{br} \cap \succ_{per}    (5.4)

which is a strict partial order.
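Computationally, the semiorders and their intersection follow directly from the two normalized confusion matrices; a minimal sketch (hypothetical helper names; P[j, i] estimates p(o_i+ | o_j-)):

    import numpy as np

    def semiorder(P, eps=0.01):
        """Boolean matrix over all pairs o_i+|o_j-: entry [r, s] is True iff
        the conditional probability of pair r exceeds that of pair s by more
        than eps, as in (5.3)."""
        p = P.ravel()                         # flatten the N*N probabilities
        return p[:, None] > p[None, :] + eps

    def invariant_partial_order(P_brain, P_perc, eps=0.01):
        """Intersection (5.4) of the brain and perceptual semiorders,
        a strict partial order."""
        return semiorder(P_brain, eps) & semiorder(P_perc, eps)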
To graph the semiorders and the invariant partial order, the relation
$o_j^+|o_i^- \succ o_{j'}^+|o_{i'}^-$ is illustrated by an arrow from the vertex denoting $o_j^+|o_i^-$ to the vertex
denoting $o_{j'}^+|o_{i'}^-$. To further simplify the graph, we define the congruence relation $\equiv$ as:

a \equiv b \quad \text{iff for all } c: \quad \text{(i)}\ a \succ c \text{ iff } b \succ c, \quad \text{(ii)}\ c \succ a \text{ iff } c \succ b    (5.5)

The congruence relation is an equivalence relation, i.e., reflexive, symmetric
and transitive. In the graph of the invariant partial order, we put $o_j^+|o_i^-$ and $o_{j'}^+|o_{i'}^-$ in
the same vertex if

o_j^+ | o_i^- \equiv o_{j'}^+ | o_{i'}^-    (5.6)

Given that the phonemes' prototypes are always on the left of the similarity
notation and their test samples on the right, the + and - signs can be omitted in the
graph without generating any confusion.
5.2.2 Partition tree of similarities
The similarity-difference ordering is the basis for generating a qualitative partition
tree, which shows a hierarchical partition of the combined set of test samples and
prototypes $O = \{o_1^+, \dots, o_N^+, o_1^-, \dots, o_N^-\}$ in a binary tree structure. First, we define the
"merged product" of two subsets $O_I$ and $O_J$ of O as:

O_I \otimes O_J = \{ o_j^+ | o_i^- : (o_j^+ \in O_I\ \&\ o_i^- \in O_J)\ \text{or}\ (o_j^+ \in O_J\ \&\ o_i^- \in O_I) \}    (5.7)

The inductive procedure starts from a partition P0 consisting of the 2N singleton
subsets of O. In the kth inductive step, two subsets of the partition Pk-1 are
chosen to be merged such that the least pair of their merged product under the
similarity-difference ordering is maximized among all possible merges.
Consequently, subsets with greater similarity are merged earlier than subsets
with smaller similarity in the inductive steps. Each step of the recursive procedure
reduces the cardinality of the partition by 1, so the (2N-1)th step reaches a partition
with only one block, which is the set O itself. The similarity tree is constructed by using the
2N hierarchical partitions in reverse order. The root node of the tree denotes the single
set O. The two branches from the root node lead to the partition of the (2N-2)th step,
which has two blocks. The same procedure continues until all the leaves of the tree are
the elements of O. The partition tree provides a fairly intuitive way of
summarizing the similarities of the test samples and prototypes contained in a matrix of
conditional probabilities. (Further details of the semiorder and similarity tree
can be found in Suppes, Perreau-Guimaraes & Wong, 2009.)
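A rough sketch of this merging procedure follows; for illustration it scores each candidate merge directly by the least conditional probability of the merged product, rather than by the thresholded semiorder used in the actual method of Suppes, Perreau-Guimaraes & Wong (2009):

    import numpy as np

    def partition_tree(P):
        """Greedy hierarchical partition of O = {o_1+,...,o_N+, o_1-,...,o_N-}.
        Element t < N is prototype o_t+; element t >= N is test set o_(t-N)-.
        P[j, i] estimates p(o_i+ | o_j-)."""
        N = P.shape[0]
        blocks = [[t] for t in range(2 * N)]   # the 2N singletons of P0
        merges = []

        def least_pair(I, J):
            # least conditional probability over the merged product (5.7)
            vals = [P[j - N, i] for i in I for j in J if i < N <= j]
            vals += [P[j - N, i] for i in J for j in I if i < N <= j]
            return min(vals) if vals else -np.inf

        while len(blocks) > 1:
            a, b = max(((a, b) for a in range(len(blocks))
                        for b in range(a + 1, len(blocks))),
                       key=lambda ab: least_pair(blocks[ab[0]], blocks[ab[1]]))
            merges.append((blocks[a], blocks[b]))
            blocks[a] = blocks[a] + blocks[b]
            del blocks[b]
        return merges   # read in reverse order to draw the tree from the root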
5.3 Experimental data analysis
5.3.1 Vowels
Since the classifiers predict the EEG images of isolated vowels much more
accurately than those of the vowels in CV syllables, we use the Isolated-vowels data
to generate the confusion matrix of the EEG representations of vowels. If the sample-without-replacement
scheme were used to calculate the averaged test samples, only
140 samples would be available, which is not enough to produce a confusion matrix
with a reliable off-diagonal structure. Thus, when constructing the vowel confusion
matrix, we took the time-domain signal as the EEG observation vector and created
300 averaged test samples for each vowel from the OOS test set using the sample-with-replacement
method. In this experiment, the PHS-2 EEG feature with the
limited frequency range from 2 to 9 Hz is used to represent the EEG signal. The class
labels of the test samples were predicted using the Bagging SVM model with linear
kernel. As a result, 826 of the 1200 test samples were correctly classified, a
classification rate of 68.8%. The normalized confusion matrix of classifying the EEG
images of vowels is shown in Table 5.1(a). The ith element in the jth row is the
probability that the test samples of the phoneme $o_j$ are classified as $o_i$; each row
sums to 100%.
We compare the EEG confusion matrix of vowels with the results of the vowel
perception experiments conducted by Pickett in 1957. The Pickett experiment
presented 12 English vowels in artificial syllables of the form bVb, spoken in a short
carrier phrase, and reported the perceptual confusion matrices of the vowels when the
utterances were masked by noise in various frequency ranges. Considering that the
brainwave data were collected in quiet office surroundings, only the perceptual
confusion matrix for flat noise is examined here. The perceptual conditional
probabilities are estimated by taking the elements associated with the four targeted
vowels from Table I(B) in Pickett (1957) to form a sub-matrix, and then dividing each
element of the sub-matrix by the sum of the corresponding row to get the
conditional probabilities shown in Table 5.1(b). The overall perceptual accuracy for
these 4 vowels is 82.8%.
Table 5.1: Normalized confusion matrices of 4 vowels. (a) The confusion matrix of EEG isolated-vowel classification. (b) The confusion matrix of the Pickett (1957) vowel-perception experiment.

  (a)  %    i      æ      u      ɑ
       i    66.3   6.0    21.0   6.7
       æ    6.0    79.0   5.3    9.7
       u    22.0   6.0    65.0   7.0
       ɑ    8.3    19.3   7.3    65.0

  (b)  %    i      æ      u      ɑ
       i    87.0   0.2    11.8   1.1
       æ    0.2    92.6   0      7.2
       u    45.3   0.2    53.6   0.9
       ɑ    0      1.9    0      98.1
Figure 5.1 compares the similarity trees derived from the brain and perceptual
confusion matrices. Looking at the similarity tree for the brain representation of
vowels, we can make several remarks. First, any vowel test set is more similar to its own
prototype than to any other vowel. A more interesting finding is the separation
between the open vowels /æ/ and /ɑ/ and the close vowels /i/ and /u/. The tree suggests that the
brain representation of vowel height is more robust than that of vowel backness. Since the
vowel height reflects the frequency of the first formant (F1) and the vowel backness is
inversely correlated with the second formant (F2), the results suggest that the EEG
activity is more sensitive to the low-frequency contrast around the F1 range (less than
1000Hz) than to the higher-frequency contrast around F2 (1000-2500Hz). This finding is
consistent with the fact that the human cochlea, where the sound-wave pressure is
converted to the original neural signals, has higher resolution at low frequencies.
The merging pattern of the perceptual similarity tree is almost identical to that of the
brain similarity tree. The slight differences between the brainwave confusions and the
perceptual confusions of vowels can be found only in the confusion matrices. In the
psychological experiment, although overall 82.8% of the vowels were perceived
accurately, the close vowels /u/ and /i/ were confused much more often than the open vowels
/æ/ and /ɑ/. This distinction is not found in classifying the brainwaves of vowels. As
Pickett mentioned, the perceptual intelligibilities of the four vowels are highly related to
their intensities (Pickett 1957, Table II). We think the strong perceptual confusions
between /i/ and /u/ are mainly due to the fact that vowels of low intensity
are less intelligible when masking noise is present. Therefore the acoustic
distinctions between the vowels, reflected by their locations in the F1-F2 space, are
mirrored qualitatively better in the similarity differences derived from the statistical
model of EEG images than in the perceptual confusions generated by the masking
noise.
The graph of the invariant partial order between the brain and perceptual confusions
of vowels is shown in Figure 5.2. We computed the intersection using a threshold of
eps = 0.01. We notice that the pairs sharing the same height, æ+|ɑ-, ɑ+|æ-, u+|i-, and
i+|u-, generally rank higher than the pairs sharing the same backness, æ+|i-, i+|æ-,
u+|ɑ-, and ɑ+|u-. The graph thus demonstrates that the greater robustness of
vowel-height over vowel-backness in distinguishing the vowels is invariant across the
perceptual and brain representations of vowels.
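The intersection itself can be sketched in a few lines. This is one plausible reading of the partial-order construction of Suppes et al. (2009), not a transcription of it: the similarity values for each labelled pair (e.g. 'ae+|a-') are assumed to be on a common probability scale, and differences below eps count as ties.

```python
from itertools import combinations

def invariant_partial_order(sim_brain, sim_percept, eps=0.01):
    """sim_*: dicts mapping a labelled pair such as 'ae+|a-' to its
    similarity value. The relation p > q survives the intersection only
    if it holds in both orderings beyond the threshold eps."""
    order = []
    for p, q in combinations(sim_brain, 2):
        d1 = sim_brain[p] - sim_brain[q]
        d2 = sim_percept[p] - sim_percept[q]
        if d1 > eps and d2 > eps:
            order.append((p, q))      # p ranks above q in both
        elif -d1 > eps and -d2 > eps:
            order.append((q, p))
    return order
```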
The four vowels /i/, /æ/, /u/ and /ɑ/ are labeled as "i", "ae", "u" and "a" respectively. (a) The similarity tree of the brainwave representation of vowels, derived from the classification results of the linear SVM model. The set of test samples and the prototype of a phoneme are denoted with "-" and "+" respectively. (b) The perceptual similarity tree of vowels, derived from the results of Pickett's psychological experiments.
Figure 5.1: The similarities of brain representation and perceptual representation of 4 vowels
Figure 5.2: Invariant partial order between brainwave and perceptual confusions of
the vowels
5.3.2 Consonants
Among all the experiments in classifying EEG images of consonants, the best
result was obtained when we classified the PHS-2 spectral feature, limited to the
frequency range of 2 to 9 Hz, using Bagging SVMs with a linear kernel. Here we use the
same classification model to generate the confusion matrix of the brainwaves of
consonants. We constructed 300 averaged test samples for each consonant from the OOS
test trials using the sample-with-replacement scheme. The classifier correctly
predicted the class labels of 1185 of the 2400 test samples, an accuracy rate of
49.4%. The normalized confusion matrix is shown in Table 5.2(a), and the
resulting similarity tree in Figure 5.3(a).
Table 5.2: Normalized confusion matrices of 8 consonants
(a) The confusion matrix of EEG consonant classification. (b) The perceptual confusion matrix from the Miller-Nicely experiment. The ith element in the jth row is the probability that test samples of the phoneme o_j are classified as o_i. Each row sums to 100%.
(a)
%      p      t      b      g      f      s      v      z
p    38.7   31.0    2.3    7.0    7.3    3.7    6.7    3.3
t    31.7   44.0    2.0    6.0    4.0    3.0    3.7    5.7
b     1.3    1.3   60.0   23.3    0.3    0.3   10.3    3.0
g     6.7    7.7   29.3   40.3    0.7    2.0    8.0    5.3
f     3.0    6.3    0.7    2.0   55.7   15.7   13.3    3.3
s     8.7    5.3    0.7    1.0   13.0   59.7    5.0    6.7
v    11.0    7.3   12.7   11.3    6.7    3.3   38.0    9.7
z     2.7    3.0    8.0   11.7    1.7    7.0    7.3   58.7
(b)
%      p      t      b      g      f      s      v      z
p    45.5   33.3    1.0    0.7   13.5    4.2    1.4    0.4
t    40.4   42.2    0.9    0.3    7.5    7.5    0.3    0.9
b     1.3    0.5   52.3    7.2    6.1    2.9   24.3    5.3
g     1.4    0.5   10.8   44.6    0.5    2.4    8.9   31.0
f    12.5    8.7    2.5    0.3   66.2    6.6    2.5    0.8
s     8.3    6.9    2.4    3.1   19.4   55.4    1.4    3.1
v     0.0    0.3   22.6    7.5    3.7    1.6   57.5    6.9
z     2.1    0.4   10.6   19.2    0.7    5.3   14.2   47.5
(a) The similarity tree of brainwave representations of the consonants. The set of test samples and the prototype of a phoneme are denoted with "-" and "+" respectively. (b) The similarity tree of perceptual representations of the consonants, derived from the results of (Miller & Nicely, 1955).
Figure 5.3: The similarities of brain and perceptual representation of 8 consonants
Here we compare the similarities of the brainwave representation of consonants
with the perceptual confusion data from the Miller and Nicely (1955) experiment. Only
the confusion matrices for the frequency response of 200 Hz-6500 Hz were inspected,
to match the experimental setup of the brainwave data. We summed the
matrices of Table II and Table III in (Miller & Nicely, 1955), which are the perceptual
confusions in the listening conditions of SNR = -12 dB and SNR = -6 dB respectively,
and extracted the elements for each pair of the eight target consonants from the summed
matrix to construct the confusion matrix for the targeted consonants. The accuracy rate
of the perceptual confusion matrix, that is, the ratio of the sum of the diagonal
elements to the sum of all the elements, is 52.3%. It is very close to the
classification rates on the brainwaves of consonants and provides a good foundation for
studying the invariance between brain and perceptual representations. The normalized
perceptual confusion matrix and the resulting similarity tree are shown in Table
5.2(b) and Figure 5.3(b).
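The construction of the perceptual confusion matrix can be summarized in a few lines. In this sketch, table2 and table3 stand in for the count matrices of Tables II and III in (Miller & Nicely, 1955), which the reader must supply, and targets holds the row/column indices of our eight consonants in those tables:

```python
import numpy as np

def perceptual_submatrix(table2, table3, targets):
    """Pool the Miller & Nicely confusion counts from the two SNR
    conditions, extract the eight target consonants, and row-normalize."""
    combined = np.asarray(table2) + np.asarray(table3)   # sum the two SNRs
    sub = combined[np.ix_(targets, targets)]             # 8x8 target sub-matrix
    accuracy = np.trace(sub) / sub.sum()                 # 52.3% in our case
    norm = 100.0 * sub / sub.sum(axis=1, keepdims=True)  # rows sum to 100%
    return norm, accuracy
```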
We make the following remarks about the similarity trees of consonants.
(1) Among the three distinctive features examined, voicing, continuant, and
place of articulation, voicing is the most robust feature for both the brain and the
perceptual representation of consonants, as shown by the fact that the voiced
and voiceless consonants join only at the last merge in both trees. The
robustness of voicing for brainwaves suggests that the temporal structure of the
auditory input, such as the voice onset time (VOT), which is the primary acoustic cue
for the voicing contrast (Lisker & Abramson, 1964), is well preserved in the brain
representation.
(2) For the voiceless consonants, the continuant feature is more distinctive than the
place of articulation for both brain and perceptual representations of consonants. In
fact, place of articulation is the most confused feature for brainwave representations,
since 3 of the 4 pairs of consonants that differ only in place of articulation, /p/ and
/t/, /b/ and /g/, as well as /f/ and /s/, are merged first.
(3) The major difference lies in the grouping structure of the voiced consonants.
Unlike in the brainwave results, where /b/ is mainly attracted to /g/, perceptually /b/
is more confused with the voiced fricative /v/, which shares the same place of
articulation. The contrast between /b/ and /g/ lies mostly in the transition portion
of the F2 of the vowel that follows (Miller & Nicely, 1955), while the primary
perceptual cues distinguishing /b/ and /v/ are the abrupt onset of the stop /b/ and
the turbulent frication noise of /v/ (Fujimura & Erickson, 1997). We also notice
that although the attraction between /b/ and /v/ is commonly seen in perceptual
consonant-categorization data collected with masking noise (Miller & Nicely, 1955;
Wang & Bilger, 1973; Phatak et al., 2008), it is not clearly shown in the perceptual
experiment on short-term memory (Wickelgren, 1966) or in the neural discrimination
of animals' cortical responses to human speech (Mesgarani, 2008).
Consequently, a possible explanation for the mismatch between the brainwave and
perceptual confusions is that frication is more perceptually distorted by white
noise than the formant transitions are.
The significant invariance between the similarities of the brainwave and perceptual
confusions of consonants is further illustrated by the invariant partial-order graph in
Figure 5.4. It shows that the similarity differences between the voiceless stops /p/ and
/t/, p+|t- and t+|p-, are very small for both brainwave and perception, and lie at the top
of the graph. Although the brain representation of /b/ is mainly confused with /g/,
/v/ has a strong attraction to /b/ as well. Combined with the fact that /v/ and /b/ are
highly confused in the perceptual experiment, the similarities v+|b- and b+|v- rank high
in the invariant partial-order graph.
Figure 5.4: Invariant partial order between brainwave confusions and perceptual
confusions of the consonants
Finally, let us revisit the classification rates. As remarked in Chapter 3, the
classifier achieves higher classification rates for the initial consonants than for the
vowels. For the vowels in CV syllables, this difference may arise because the
cognitive processing of the initial consonant lasts longer than the actual duration of its
sound, imposing extra noise on the brainwave of vowel perception and making it less
intelligible; the classification model of the averaged trials is more sensitive to the
beginning of the auditory stimulus than to its later portion. However, the
classification rates on isolated vowels are not as significant as the results on
consonants either. By examining the similarity differences of the EEG image
representations of phonemes, we found that the EEG observations reflect the temporal
distinctions of auditory stimuli, such as VOT, more accurately than the spectral
distinctions, such as the formant transitions. This can be another reason that the
classifier performs very well on consonants but less well on vowels. Considering
the generally accepted tonotopic organization of the human auditory cortex
(Talavage, 2004), this may be due to the relatively low spatial resolution of the EEG
signals. Extracting more spatial information from EEG, or combining it with other,
more space-sensitive technologies such as MEG or fMRI, may improve the
classification rates significantly.
Chapter 6
Classifiers Based on Distinctive Features
6.1 Classifying the distinctive features
The results of Chapter 5 show that the phonological distinctions of phonemes,
as interpreted by the distinctive features, are revealed in the brainwaves of phonemes
and captured efficiently by our EEG classification model. The similarity analysis of
the phoneme classification results shows that brainwaves of phonemes that differ
on some phonological feature, for instance voicing, are not likely to be confused with
each other when represented in the EEG feature space. The similarities of the
brain representation and the perceptual representation of phonemes are
approximately invariant. This finding naturally leads to the question of whether we
can predict the phonological features of EEG brainwaves using the same classification
model.
To answer this question, we ran an experiment to classify distinctive features,
which take binary values. We classified three distinctive features of the initial
consonants, voicing, continuant and place of articulation, using 8 sessions of
brainwave recordings from the Syllables-III data, and classified two features of the
vowels, height and backness, using 8 sessions of the Isolated-vowels data. All the
brainwaves used in this experiment were collected from one subject (LK). In total,
7168 trials were available for each classification task. To classify one distinctive
feature, the EEG trials were grouped into two classes taking opposite values on that
feature, for example voiced and voiceless. Since our choices of phonemes are balanced
on the distinctive features, the number of trials in each class is around 3584. The
binary grouping of phonemes for each feature is shown in Table 6.1.
As mentioned in Chapter 1, some distinctive features, such as voicing, continuant
and nasal, are widely adopted in most feature systems. But place of articulation is a
more complicated property of a sound, because the obstruction may occur at many
places along the oral tract. In this experiment, we tested two kinds of grouping for
the place of articulation. In the traditional definition of place of articulation, the
8 initial consonants take three different values: /p/, /b/, /f/ and /v/ are labial;
/t/, /s/ and /z/ are alveolar; /g/ is velar. We followed this approach and combined
the alveolar and velar consonants into a "non-labial" group, as opposed to the labial
consonants. We also tested the feature "coronal", proposed by Chomsky and Halle.
Coronal sounds are produced with the blade of the tongue raised from its neutral
position. Among the 8 initial consonants, /t/, /s/ and /z/ are coronal while /p/,
/b/, /g/, /f/ and /v/ are non-coronal.
The SVM-with-Bagging model with a linear kernel was used for the classification.
The classifiers were trained and tested using the following configuration:
EEG channels: 124 monopolar channels
EEG feature: PHS-2 spectral feature with a limited frequency band of 2 to 9 Hz
Number of SVMs for Bagging: 35
Number of trials for averaging: 25
Number of principal components used for classification: 100
Cost factor of the SVMs: optimized in nested 5-fold cross-validation loops and
chosen from $\{2^{-10}, 2^{-9}, \ldots, 2^{9}, 2^{10}\}$.
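A minimal sketch of the nested cross-validation for the cost factor is given below, with random stand-in data in place of the PCA-reduced PHS-2 features; the actual pipeline additionally includes the PCA projection, trial averaging, and Bagging. The names X and y are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))     # stand-in for PCA-reduced PHS-2 features
y = rng.integers(0, 2, size=200)    # stand-in binary feature labels

grid = {'C': [2.0 ** k for k in range(-10, 11)]}         # 2^-10 ... 2^10
inner = GridSearchCV(SVC(kernel='linear'), grid, cv=5)   # inner loop selects C
outer_scores = cross_val_score(inner, X, y, cv=5)        # outer loop estimates accuracy
print(outer_scores.mean())
```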
The binary classification accuracies and p-values are also shown in Table 6.1.
Table 6.1: Classifying the distinctive features

feature           grouping                           rate     p-value
consonant
voicing           voiceless /p/ /t/ /f/ /s/          92.1%    <10^-26
                  voiced /b/ /g/ /v/ /z/
continuant        stop /p/ /t/ /b/ /g/               81.4%    <10^-13
                  fricative /f/ /s/ /v/ /z/
place (labial)    labial /p/ /b/ /f/ /v/             69.3%    <10^-5
                  non-labial /t/ /g/ /s/ /z/
place (coronal)   coronal /t/ /s/ /z/                77.9%    <10^-10
                  non-coronal /p/ /b/ /g/ /f/ /v/
vowel
height            open /æ/ /ɑ/                       83.8%    <10^-16
                  close /i/ /u/
backness          front /æ/ /i/                      71.8%    <10^-7
                  back /ɑ/ /u/
We found that, for all the features under investigation, the classification rates are
well above chance level and the p-values are less than 10^-7. Among the phonological
features of consonants, the classification of voicing achieved the highest accuracy,
92.1%. The classification of continuant was slightly worse, at 81.4%.
For place of articulation, classification of the coronal feature achieved a
significantly better rate than classification of labial, which indicates that
brainwaves of consonants that differ on coronal are more separated in the EEG feature
space than brainwaves of consonants that differ on labial. For the features of vowels,
vowel-height can be classified more accurately than vowel-backness. The differences
in the classification rates are consistent with the similarity analysis results of
Chapter 5.
The success in classifying binary distinctive features shows that the brainwaves of
phonemes that share the same value on a distinctive feature are clustered in the
EEG feature space. This suggests a new approach to classifying the brainwaves evoked
by auditory stimuli of language constituents. If we code the phonemes, syllables,
words, etc., as binary features, we can use a small set of binary classifiers to separate
them. For example, only 5 binary classifiers are needed to distinguish the 32 syllables
in the Syllables-III data. This approach generalizes to all phonemes and syllables
without making the classification model overly complicated. In fact, according to
Chomsky and Halle's work, all the phonemes in human speech can be represented using
27 binary phonological features, with some degree of redundancy; far fewer features
are needed to represent a particular language such as American English.
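The coding idea can be made concrete with a small sketch. The bit assignment below follows the groupings of Table 6.1 (voiced, fricative, labial), but the ordering of the bits is illustrative; with these three features every 3-bit pattern codes a consonant, while sparser codes can leave some patterns non-decodable.

```python
FEATURES = {            # (voiced, fricative, labial), per Table 6.1
    'p': (0, 0, 1), 't': (0, 0, 0),
    'b': (1, 0, 1), 'g': (1, 0, 0),
    'f': (0, 1, 1), 's': (0, 1, 0),
    'v': (1, 1, 1), 'z': (1, 1, 0),
}
CODE2PHONE = {bits: ph for ph, bits in FEATURES.items()}

def decode(bits):
    """Map predicted feature bits back to a phoneme; returns None when
    the bit pattern codes no stimulus."""
    return CODE2PHONE.get(tuple(bits))
```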
Since our brainwave data cover only a small subset of the distinctive features, we
can offer only a preliminary study of this approach.
6.2 Distinctive-feature-based classifiers
The overall framework of the distinctive-feature (DF)-based classification model for
the brainwaves of phonemes is identical to the SVM-with-Bagging classifier introduced
in Section 3.3, and is kept intact as in Figure 3.6. To implement the N-class SVM, we
use an ensemble of binary SVM classifiers based on the phonological distinctive
features of the stimuli instead of N(N-1)/2 "one-against-one" binary SVM classifiers.
Suppose the N speech stimuli can be distinguished using k phonological features,
each taking two values, denoted 0 and 1. Then each stimulus can be coded as a unique
k-bit binary number $b_1 b_2 \cdots b_k$, where $b_i \in \{0, 1\}$, with each bit
denoting the value of one phonological feature. If we use a two-class SVM classifier
to predict one bit of the code, then k SVMs are needed to classify the N stimuli. The
number of distinctive features cannot be less than $\log_2 N$. However, since the
code is not randomly assigned but reflects phonological properties of the speech
stimuli, the binary coding may not be very compact for a specific classification task;
that is, the number of distinctive features may be much greater than $\log_2 N$.
Moreover, if the speech stimuli do not cover all possible combinations of the
distinctive features, it is likely that the combination of predicted labels from the k
binary classifiers corresponds to no stimulus. In our classification model, since the
classification results of the N-class SVMs are aggregated via majority voting, we can
simply drop any non-decodable result from the SVMs during aggregation.
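The following sketch shows the parallel DF-based N-class classifier and the vote with non-decodable codes dropped. It is a simplified illustration, assuming scikit-learn in place of LIBSVM and omitting the PCA and trial-averaging steps; the class and function names are hypothetical.

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

class DFClassifier:
    """One linear SVM per distinctive feature; predictions are
    concatenated into a k-bit code and decoded to a stimulus label."""
    def __init__(self, k, code2class):
        self.svms = [SVC(kernel='linear') for _ in range(k)]
        self.code2class = code2class          # valid k-bit code -> label

    def fit(self, X, codes):                  # codes: (n, k) 0/1 array
        codes = np.asarray(codes)
        for i, svm in enumerate(self.svms):
            svm.fit(X, codes[:, i])           # each SVM uses all training data
        return self

    def predict_one(self, x):
        bits = tuple(int(svm.predict(x[None])[0]) for svm in self.svms)
        return self.code2class.get(bits)      # None when non-decodable

def bagged_vote(classifiers, x):
    """Majority vote across bootstrap-trained DFClassifiers, dropping
    non-decodable codes before voting."""
    votes = [c.predict_one(x) for c in classifiers]
    votes = [v for v in votes if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None
```

Note the point made above: each binary SVM sees all the training samples, which is why the scheme remains usable when training data are scarce.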
To test the performance of the DF-based classification model and compare the
results with those of previous chapters, we classified the 8 initial consonants using all
24 sessions of the Syllables-III data and the 4 vowels using the Isolated-vowels data.
The first three features in Table 6.1, voicing, continuant and place of articulation
(labial), are used to distinguish the 8 consonants; vowel-height and vowel-backness
characterize the 4 vowels. The parameters of the classification model are configured as
in the experiments of Section 6.1. The model correctly classified 39.0% of the test
samples of initial consonants, with a p-value less than 10^-42, and 53.5% of the test
samples of isolated vowels, with a p-value less than 10^-11.
Although the classification rates are not as good as the results obtained in Chapter
5, the success of classifying the brainwaves of phonemes using distinctive features
indicates that distinctive features may be the underlying mechanism by which the brain
parses and retrieves phonemic information when it processes speech input. The
algorithm also works when only a small amount of training data is available, since each
binary SVM can be estimated using all the training samples.
6.3 Parallel structure vs. Hierarchical structure
In the DF-based classification model, the binary SVMs that predict the values
of the distinctive features of each test sample are trained in a parallel manner. This
rests on an underlying assumption that all the distinctive features are represented in
the brainwaves independently. The independence assumption can be written as: for any
$i \neq j$, the optimal separating hyperplane between the class $(b_i = 0, b_j = 0)$
and the class $(b_i = 1, b_j = 0)$ approximately coincides with the optimal separating
hyperplane between the class $(b_i = 0, b_j = 1)$ and the class $(b_i = 1, b_j = 1)$.
The assumption is very strict and usually false in practice.
(a) Classifying the vowel-height and vowel-backness independently
(b) Classifying the vowel-height and vowel-backness hierarchically
Figure 6.1: Classifying 4 vowels in F1-F2 space
For instance, as we mentioned in Chapter 1, the phonological features vowel-height
and vowel-backness are closely related to the first formant F1 and the second formant
F2 respectively. We cut the auditory stimuli of the Syllables-III data into frames of
10 ms length and calculated the F1 and F2 of each frame within the vowel segments.
We then plotted the frames of the 4 vowels, /i/, /æ/, /u/ and /ɑ/, in the F1-F2 space as
shown in Figure 6.1, and looked for the optimal hyperplanes that separate the
vowels differing in vowel-height or vowel-backness in the F1-F2 domain. The blue
solid lines in Figures 6.1(a) and 6.1(b) show the optimal separation line between the
open vowels (/æ/ and /ɑ/) and the close vowels (/i/ and /u/), estimated using a linear
soft-margin SVM model with C = 0.15. Only 1.1% of the data points lie on the wrong
side of the separation line. The blue dashed line in Figure 6.1(a) illustrates the optimal
separation line for the feature vowel-backness, regardless of whether the vowels are
open or close. The samples of front vowels are located below the line and those of back
vowels above it, with 4.7% exceptions. In Figure 6.1(b) the hyperplanes that divide the
front and back vowels are estimated separately for the open vowels and the close vowels,
shown as a blue dashed line and a green dashed line respectively. The two dashed lines
are clearly apart from each other: the separation hyperplane for vowel-backness is
different for open vowels and for close vowels. Therefore the distinctive features of
vowels, vowel-height and vowel-backness, do not satisfy the independence assumption,
and the brainwave representations of the distinctive features will not be independent
either.
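The dependence argument can be reproduced with synthetic formant data; the sketch below uses rough textbook formant means rather than measurements from the Syllables-III stimuli, so the exact boundaries will differ from Figure 6.1, but the qualitative conclusion is the same.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Approximate (F1, F2) means in Hz; illustrative values, not our stimuli.
means = {'i': (280, 2250), 'u': (310, 870), 'ae': (660, 1700), 'a': (710, 1100)}
X = {v: rng.normal(m, (40, 120), size=(200, 2)) for v, m in means.items()}

def backness_line(front, back):
    """Fit a linear soft-margin SVM (C = 0.15, as in Figure 6.1) that
    separates a front vowel from a back vowel; return (w, b)."""
    data = np.vstack([X[front], X[back]])
    y = np.r_[np.zeros(200), np.ones(200)]
    svm = SVC(kernel='linear', C=0.15).fit(data, y)
    return svm.coef_[0], svm.intercept_[0]

w_close, b_close = backness_line('i', 'u')    # close vowels only
w_open,  b_open  = backness_line('ae', 'a')   # open vowels only
# The two (w, b) pairs differ noticeably: the backness boundary depends
# on vowel-height, so the two features are not represented independently.
```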
With this in mind, I proposed a DF-based classification model with a hierarchical
structure, in which the decision rules for the distinctive features are allowed to be
mutually dependent.
Suppose a test sample $\mathbf{x}$ belongs to the class labeled $y$, which is coded
using $k$ binary distinctive features as $y = b_1 b_2 \cdots b_k$, where
$b_i \in \{0, 1\}$ for $i = 1, \ldots, k$. We use a two-class classifier to predict the
value of each feature, and write the decision rule of the ith classifier as
$h_i(\mathbf{x}) = \hat{b}_i$. The classification model with the parallel structure is
then described as

$$h_1(\mathbf{x}) = \hat{b}_1,\quad h_2(\mathbf{x}) = \hat{b}_2,\quad \ldots,\quad
h_k(\mathbf{x}) = \hat{b}_k \;\Rightarrow\;
\hat{y} = \hat{b}_1 \hat{b}_2 \cdots \hat{b}_k. \tag{6.1}$$
For the hierarchical classification model, however, the values of the distinctive
features are predicted sequentially, and the classification of the ith distinctive
feature depends on the predicted labels of the previous features, i.e.,
$\hat{b}_1 \cdots \hat{b}_{i-1}$. The classification process for a sample $\mathbf{x}$
then has a binary-tree structure. Figure 6.2 shows an 8-class classifier with the
hierarchical structure.
Figure 6.2: Hierarchical models for classifying 8 classes
Using the hierarchical structure, an N-class classification problem can be solved
with N-1 binary classifiers. Since errors propagate from top to bottom in this
structure, the crucial step in constructing such a classifier is to find the optimal
ordering of the features. Intuitively, the distinctive feature that achieved the highest
classification rate in the binary classification experiments should be predicted first,
to provide a good foundation for the later predictions.
[Figure 6.2 depicts a three-level binary tree. The root SVM predicts DF #1
($h(\mathbf{x}) = \hat{b}_1$), splitting the classes into 0xx and 1xx; the two
second-level SVMs predict DF #2 conditional on $\hat{b}_1$ (00x, 01x, 10x, 11x);
the four third-level SVMs predict DF #3, terminating in the eight leaf codes
000 through 111.]
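A hedged sketch of the hierarchical model follows: the classifier for feature i is trained, and applied, separately for each prefix of earlier feature bits, mirroring the tree of Figure 6.2. As before, scikit-learn stands in for the actual implementation and the names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

class HierarchicalDF:
    """Hierarchical DF-based model: one SVM per node of the binary tree,
    each conditioned on the predicted prefix of earlier feature bits."""
    def __init__(self, k):
        self.k = k
        self.nodes = {}                        # prefix tuple -> fitted SVC

    def fit(self, X, codes):                   # codes: (n, k) 0/1 int array
        codes = np.asarray(codes)
        for i in range(self.k):
            for prefix in {tuple(c) for c in codes[:, :i]}:
                mask = np.all(codes[:, :i] == prefix, axis=1)
                Xi, yi = X[mask], codes[mask, i]
                if len(np.unique(yi)) == 2:    # node needs both bit values
                    self.nodes[prefix] = SVC(kernel='linear').fit(Xi, yi)
        return self

    def predict_one(self, x):
        bits = ()
        for i in range(self.k):
            svm = self.nodes.get(bits)
            b = int(svm.predict(x[None])[0]) if svm else 0  # default for unseen prefix
            bits += (b,)                       # condition on the predicted prefix
        return bits
```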
The DF-based classifier with the hierarchical structure was tested and compared with
the classifier with the parallel structure using both the Isolated-vowels data and the
initial-consonant data of the Syllables-III experiment. Only two distinctive features
need to be classified to distinguish the four isolated vowels. We tested the two
possible orderings of the distinctive features, denoted height→backness and
backness→height. The percentage of test samples that were classified correctly and
the significance levels are summarized in Table 6.2.
Table 6.2: Vowel classification results using DF-based classifiers

                      rates    p-values
parallel              53.5%    <10^-11
height→backness       65.2%    <10^-23
backness→height       57.0%    <10^-15
We found that the hierarchical model which classifies vowel-height prior to
vowel-backness correctly classifies 65.2% of the test samples, much higher than the
parallel classifier and the hierarchical classifier with the order backness→height.
The results are consistent with our prediction that the distinctive feature with the
higher binary classification rate should be ranked higher in the hierarchical
structure.
Hence, for classifying the 8 initial consonants, we put voicing, the distinctive
feature that can be predicted with an accuracy rate of 92.1% by the binary model, at
the top of the tree structure. Table 6.3 shows the classification results of the parallel
classifier and of the hierarchical classifiers with the orders voicing→continuant→place
and voicing→place→continuant. The hierarchical classifier with the DF order
voicing→continuant→place achieves the best classification rate, 47.2%.
Table 6.3: Initial-consonant classification results using DF-based classifiers

                              rates    p-values
parallel                      39.0%    <10^-42
voicing→continuant→place      47.2%    <10^-67
voicing→place→continuant      40.8%    <10^-47
Although the best performance of the DF-based classifiers is slightly worse than
the best results obtained using N(N-1)/2 one-against-one SVM classifiers, the DF-based
model is simpler and can easily be extended to more complicated speech stimuli. We can
also use the DF-based classification model to analyze the relations among the
distinctive features as they are processed in the brain. For example, we tested a
4-class classification task using 8 sessions of the Syllables-III data from LK. Each
class contains 2 initial consonants that differ only in place of articulation: the
voiceless stops /p/ and /t/, the voiceless fricatives /f/ and /s/, the voiced stops /b/
and /g/, and the voiced fricatives /v/ and /z/. The four classes are thus distinguished
by two distinctive features, voicing and continuant. We tested the classification
accuracies of the classification models with the parallel and hierarchical structures.
The classification results, including the binary classification rates for these two
features, are summarized in Table 6.4.
Table 6.4: Results of classifying the combination of voicing and continuant using the SVM-with-Bagging model

classification tasks                              rates
2 classes, voiced vs. voiceless                   92.1%
2 classes, stop vs. fricative                     81.4%
4 classes, voicing + continuant (parallel)        74.3%
4 classes, voicing→continuant (hierarchical)      75.7%
4 classes, continuant→voicing (hierarchical)      76.4%
We found that the 4-class classification accuracy rates of the parallel classifier and
the two hierarchical classifiers are about the same. They are close to the product of the
rates obtained from the two binary classification experiments,
$92.1\% \times 81.4\% \approx 75.0\%$. The results indicate that although the
feature continuant cannot be predicted as accurately as the feature voicing in
brainwaves, the brain may process the two features independently.
The significant results of classifying the EEG images of phonemes using
distinctive features suggest that the human brain may use a distinctive-feature-based
parallel computation mechanism to process phonemes.
The hierarchical DF-based classification model can be improved by adopting
algorithms that optimize decision trees. In particular, when we classify more
phonemes, more distinctive features are involved, and it becomes impractical to find
the optimal tree structure by examining all possible orderings. Efficient data-driven
methods for designing decision trees can be used in that case.
Chapter 7
Conclusion and Prospects
A mathematical model that can recognize the brain activities of phoneme
processing is not only essential for developing a language-based brain-computer
interface; it can also provide a powerful method for studying the mechanisms the
human brain uses to process language.
To achieve the goal of this work, developing a mathematical-statistical method to
recognize the EEG brainwaves of phonemes, two major problems needed to be solved.
One is to compress the redundant and noisy EEG data into compact features that
retain the crucial information about phoneme perception. The other is to develop
statistical methods that can identify the phonemes as represented by those features.
I started my work by solving the latter problem using EEG time-domain signals.
Three classification approaches were studied in this thesis. In the first approach, the
brain-speech mapping method, we considered the brain as a dynamic system that takes
speech, described using acoustic features, as input and produces EEG brainwaves
as output. Linear transformations were estimated to simulate the inverse system; the
EEG brainwaves can be classified by passing them through the inverse system and
comparing the estimated speech input with speech prototypes. In the second approach,
purely statistical methods, PCA and SVM with Bagging, are used to construct a
classifier. The third approach is a modification of the Bagging SVM method: it
classifies phonemes by classifying their distinctive features. All three methods
were implemented and tested using EEG recordings collected in our lab. The brain-
speech-mapping method can classify 44.9% of individual EEG trials of the 4 initial
consonants of the Syllables-I data. The SVM-with-Bagging method achieved an accuracy
of 46.0% in classifying the averaged test trials of 8 consonants and an accuracy of
68.8% in classifying the averaged test trials of 4 isolated vowels when the linear
kernel was used. However, these methods show limitations in classifying vowels in
CV syllables. How to use knowledge of the preceding phonemes to classify the
phonemes at non-initial positions of the stimuli is one of the major challenges in
extending these methods to classify syllables and words.
The three approaches can be brought together to further improve the classification
accuracy. For example, the bootstrap aggregating method can also be incorporated into
the brain-speech mapping method. Moreover, the results of classifying initial
consonants using data from single channels show that some channels do not
contribute to classifying phonemes. The classification model can be further
improved by using only the channels closely related to phoneme perception or by
introducing more sophisticated spatial analysis methods.
Using the SVM-with-Bagging method, I was able to address the first problem and
examine the frequency-domain decomposition of EEG. I found that the phase pattern of
brainwave oscillations in the frequency range from 2 Hz to 9 Hz is highly related to
phoneme processing. Using the phases of the sinusoidal components from 2 Hz to 9 Hz,
the classifier can recognize 51.4% of test samples of the 8 initial consonants,
improved from 41.5% when the EEG time-domain signal is used.
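A minimal sketch of such a phase-only spectral feature for a single channel is given below. It is an illustration of the idea, not the exact PHS-2 implementation, which may differ in windowing and in how frequency bins are handled.

```python
import numpy as np

def phase_feature(x, fs, f_lo=2.0, f_hi=9.0):
    """Phase-only spectral feature: DFT the time-domain signal and keep
    only the angles of the frequency bins between f_lo and f_hi (2-9 Hz),
    discarding the amplitudes."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return np.angle(spec[band])               # phases in radians
```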
In this thesis, I also studied the ordinal similarity differences of the brainwave
representations of phonemes, derived from the confusion matrices of the classification
results, and compared them with the perceptual similarities of phonemes. The
robustness of the feature voicing is found in both the brain and the perceptual
representations of consonants, and the feature vowel-height is more distinct than
vowel-backness in both the brain and the perceptual representations of vowels. The
invariant similarity in the brain and perceptual representations of phonemes supports
the claim that the brain activities of perceiving phonological features can be
effectively observed by measuring EEG activity and are captured by our detailed model.
List of References
[1] Bell, A. J., & Sejnowski, T. J. (1995). An Information-Maximization
Approach to Blind Separation and Blind Deconvolution. Neural
Computation , 7 (6), 1129-1159.
[2] Boersma, P., & Weenink, D. (2011). Praat: doing phonetics by
computer [Computer program] Version 5.2.11. Retrieved January 2011,
from http://www.praat.org/
[3] Breiman, L. (1996). Bagging Predictors. Machine Learning, 24 (2),
123-140.
[4] Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector
machines. Retrieved from http://www.csie.ntu.edu.tw/~cjlin/libsvm
[5] Chomsky, N., & Halle, M. (1968). The sound pattern of English.
[6] Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine
Learning , 20 (3), 273-297.
[7] Delorme, A., & Makeig, S. (2004). EEGLAB: an open source toolbox for
analysis of single-trial EEG dynamics. Journal of Neuroscience
Methods, 134, 9-21.
[8] Engineer, C. T., Perez, C. A., Chen, Y. H., Carraway, R. S., Reed, A. C.,
Shetake, J. A., et al. (2008). Cortical activity patterns predict speech
discrimination ability. Nat. Neurosci. , 11, 603-608.
[9] Eulitz, C., & Obleser, J. (2007). Perception of acoustically complex
phonological features in vowels is reflected in the induced brain-
magnetic activity. Behavioral and Brain Functions , 3 (26).
[10] Fant, G. (1960). Acoustic Theory of Speech Production. Netherlands:
Mouton & Co.
[11] Formisano, E., Martino, F. D., Bonte, M., & Goebel, R. (2008). “Who"
Is Saying "What"? Brain-Based Decoding of Human Voice and Speech.
Science , 322 (5903), 970-973.
[12] Frye, R. E., Fisher, J. M., Coty, A., & Zarella, M. (2007). Linear Coding
of Voice Onset Time. Journal of Cognitive Neuroscience , 19 (9), 1476-
1487.
[13] Handbook of the International Phonetic Association. (1999). Cambridge
University Press.
[14] Jakobson, R., & Halle, M. (1956). Fundamentals of Language.
[15] Johnson, K. (2003). Acoustic & auditory phonetics. Blackwell
Publishing.
[16] Jolliffe, I. T. (2002). Principal Component Analysis. NY: Springer.
[17] Jung, T., Makeig, S., Humphries, C., Lee, T., Mckeown, M. J., Iragui,
V., et al. (2000). Removing electroencephalographic artifacts by blind
source separation. Psychophysiology , 37, 163-178.
[18] Ladefoged, P. (1982). A Course in Phonetics.
[19] Liebenthal, E., Binder, J. R., Spitzer, S. M., Possing, E. T., & Medler, D.
A. (2005). Neural Substrates of Phonemic Perception. Cerebral Cortex ,
15 (10), 1621-1631.
[20] Lisker, L., & Abramson, A. S. (1964). A Cross-Language Study of
Voicing in Initial Stops: Acoustic Measurements. Word, 20, 384-422.
[21] Mesgarani, N. (2008). Phoneme representation and classification in
primary auditory cortex. J. Acoust. Soc. Am. , 123 (2), 899-909.
[22] Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual
confusions among some English consonants. J. acoust. Soc. Am. , 27,
338–352 .
[23] Munson, B., & Nelson, P. B. (2005). Phonetic identification in quiet and
in noise by listeners with cochlear implants. J. Acoust. Soc. Am., 118
(4), 2607-2671.
[24] Näätänen, R., et al. (1997). Language-specific phoneme representations
revealed by electric and magnetic brain responses. Nature, 382, 431-
434.
[25] Näätänen, R. (2001). The perception of speech sounds by the human
brain as reflected by the mismatch negativity (MMN) and its magnetic
equivalent (MMNm). Psychophysiology , 38 (1), 1-21.
[26] Näätänen, R., Gaillard, A., & Mantysalo, S. (1978). Early selective-
attention effect on evoked potential reinterpreted. Acta Psychologica ,
42 (4), 313-329.
[27] Obleser, J., Lahiri, A., & Eulitz, C. (2004). Magnetic Brain Response
Mirrors Extraction of Phonological Features from Spoken Vowels.
Journal of Cognitive Neuroscience , 16 (1), 31-39.
[28] Pfurtscheller, G., & Silva, F. H. (1999). Event-related EEG/MEG
synchronization and desynchronization: basic principles. Clinical
Neurophysiology , 110 (11), 1842-1857.
[29] Phatak, S. A., Lovitt, A., & Allen, J. B. (2008). Consonant Confusions
in White Noise. J. Acoust. Soc. Am., 1220-1233.
[30] Pickett, J. (1957). Perception of vowels heard in noises of various
spectra. J. Acoust. Soc. Am., 29, 613-620.
[31] Rabiner, L., & Juang, B.-H. (1993). Fundamentals of Speech Recognition.
Prentice Hall.
[32] Steinschneider, M., Fishman, Y., & Arezzo, J. (2003). Representation of
the voice onset time (VOT) speech parameter in population responses
within primary auditory cortex of the awake monkey. J. Acoust. Soc.
Am. , 114, 307-321.
[33] Steinschneider, M., Reser, D., Schroeder, C., & Arezzo, J. (1995).
Tonotopic organization of responses reflecting stop consonant place of
articulation in primary auditory cortex (A1) of the monkey. Brain Res. ,
674, 147-152.
[34] Suppes, P., Han, B., Epelboim, J., & Lu, Z.-L. (1999). Invariance
between subjects of brain wave representations of language. Proceedings
of the National Academy of Sciences, 12953-12958.
[35] Suppes, P., Lu, Z.-L., & Han, B. (1997). Brain wave recognition of
words. PNAS , 94, 14965-14969.
[36] Suppes, P., Perreau-Guimaraes, M., & Wong, D. K. (2009). Partial
Order of Similarity Differences Invariant Between EEG-recorded Brain
and Perceptual Representations of Language. Neural Computation , 21,
3228-3269.
[37] Tallon-Baudry, C., & Bertrand, O. (1999). Oscillatory gamma activity in
humans and its role in object representation. Trends in Cognitive
Science , 3 (4), 151-162.
[38] Wall, M. E., Rechtsteiner, A., & Rocha, L. M. (2003). Singular value
decomposition and principal component analysis. In D. Berrar, W.
Dubitzky, & M. Granzow, A Practical Approach to Microarray Data
Analysis (pp. 91-109).
[39] Wang, M. D., & Bilger, R. C. (1973). Consonant confusions in noise: a
study of perceptual features. J. Acoust. Soc. Am. , 54 (5), 1248-1266.
[40] Wang, S.-J., et al. (2009). Empirical analysis of support vector
machine ensemble classifiers. Expert Systems with Applications, 36 (3),
6466-6476.
[41] Wickelgren, W. A. (1966). Distinctive features and errors in short-term
memory of English consonants. The Journal of the Acoustical Society of
America, 39 (2), 388-398.
[42] Wong, D. K. (2004). Multichannel Classification of Brain-wave
Representations of Language by Perceptron-based Models and
Independent Component Analysis (Ph.D. Dissertation). Stanford
University.
[43] Wong, D. K., Perreau-Guimaraes, M., Uy, E. T., & Suppes, P. (2004).
Classification of individual trials based on the best independent
component of EEG-recorded sentences. Neurocomputing , 479-484.