
Master's Thesis in Computational Linguistics

Jan Lundberg

Department of Linguistics Göteborg University

SE 405 30 Göteborg, Sweden

Göteborg, May 2005

Supervisor: Ph.D. Anders Eriksson, Department of Linguistics

Classifying Dialects Using Cluster Analysis


Contents

1 Introduction
2 Background
2.1 Fundamentals of Speech Analysis
2.1.1 Source-Filter Theory of Speech
2.2 Representation of Speech
2.2.1 Mel Frequency Cepstral Coefficients
2.2.2 Bark filter
2.2.3 Cochleagram
2.2.4 Formant Tracks
2.3 Comparing Segments
2.4 Data Analysis
2.4.1 Cluster Analysis
2.4.2 Principal Component Analysis
2.4.3 Multidimensional Scaling
2.5 Motivation
3 Method
3.1 Pre-processing the Data
3.1.1 Sample Set
3.1.2 Feature Extraction
3.1.3 Normalization
3.2 Cluster Analysis
3.2.1 Research Objective
3.2.2 Research design
3.2.3 Data Validation
3.2.4 Deriving Clusters
3.3 Validation
3.3.1 Visual Verification
3.3.2 Coincidence Testing
3.3.3 Statistical Significance of Clusters
3.3.4 Interpretation
3.4 Tools
3.4.1 Spotfire DecisionSite
3.4.2 XLSTAT-Pro
3.4.3 Acuity 4.0
4 Empirical study
4.1 Pre-Processing the Data
4.1.1 Sample Set
4.1.2 Feature Extraction
4.1.3 Data Validation
4.2 Cluster Analysis
4.2.1 Research Objective
4.2.2 Research design
4.2.3 Estimating the Number of Data Clusters
4.2.4 Assumptions
4.2.5 Deriving Clusters
4.3 Validation and Interpretation
4.4 Discussion
5 Conclusion and Further Research
6 References
Appendix
Understanding Cepstral analysis
HTK: hcopy.conf
MFCC Profiles


Abstract

Cluster analysis is the name for a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess. Clustering has been applied in many contexts and by researchers in many disciplines. This reflects its broad appeal and usefulness as one of the steps in exploratory data analysis.

In this thesis I explore cluster analysis as a means to investigate relationships among speech samples extracted from the SweDia 2000 dialect database. By using mel frequency cepstral coefficients (MFCC) to represent the data, I also show that the cepstrum is a reliable metric for measuring acoustic distance.

The methods applied in this thesis may contribute to the validation of dialect distances, and to finding the optimal number of clusters in a dialect data set. The results show that this approach may represent an effective tool to support speech analysis applications.

Sammanfattning

Cluster analysis is the name for a group of multivariate statistical methods whose main purpose is to group objects on the basis of their characteristics. Clustering has been used extensively by researchers in many different fields, and constitutes a useful tool in exploratory data analysis.

This master's thesis investigates cluster analysis as a way of analysing relationships among speech samples extracted from the SweDia 2000 dialect material. By letting the speech samples be represented by mel frequency cepstral coefficients (MFCC), it is also shown that the cepstrum is an adequate measure of acoustic distance.

The methods used in this thesis should be able to contribute to validating the acoustic distance between different dialects, as well as to finding the optimal number of clusters in a dialect data set. The results indicate that this approach should be a useful aid in various speech analysis applications.


Acknowledgements

I would like to thank:

Ph.D. Torbjörn Lager, who gave me the idea for this thesis, and who also put me in touch with Anders Eriksson.

Ph.D. Anders Eriksson, who came up with the idea of using cepstral coefficients to measure spectral distances as a basis for the analysis. I am grateful that I was offered this opportunity to analyse such novel material. I will never forget the excitement of plotting the first clustering results.

Lots of people at Spotfire for supporting me and providing me with information, especially Tomas Andersson for always taking time to listen to my ideas, and for providing me with invaluable feedback.

My family, Petra and little Albin, for cheering me on and patiently waiting for me to finish.

Andreas Stiebe, who made me realize that this thesis was something that needed to be done for more reasons than I could imagine.


1 Introduction Phonology is the study of how sounds are organized and used in natural language, and seeks to explain how these sounds are interpreted by the native speaker. As native speakers of a language, we are well aware of the variations of pronunciation within our mother tongue, and our ability to interpret these variations is quite extraordinary. There are many methods for computationally exploring the dialect continuum [1]. Some methods work on a textual level, i.e. they require some form of transcription before they can be applied, whereas other methods work on an acoustic level.

It seems beneficial to be able to work directly on the acoustic representation of the speech, without having to transcribe everything in advance. Not only is transcription a tedious task per se, but it also introduces subjectivity in the sense that transcribers can interpret speech differently, or even make (human) mistakes during transcription. Given a flexible and efficient representation of speech, we could determine the relationship between dialects simply by measuring the distance between their acoustic vectors, and apply classification methods using a suitable similarity measure.

One such representation, widely used in speech analysis, is the cepstrum, which is a common transform used to gain information from a person's speech signal. It can be used to separate the excitation signal (originating from our vocal cords) from the transfer function that retains information relating to the vocal tract shape. The transfer function, represented by 12 mel-frequency cepstral coefficients (MFCC), has proven to be quite useful for language identification.

The objective of this work will be to evaluate clustering as a method of classifying dialects, using cepstral coefficients as the representation of the data. I will explore different clustering algorithms and visually validate the results on a dialect map. It should be said that these methods are, by nature, exploratory rather than confirmatory analyses of data, and it can be argued that unsupervised algorithms provide limited information about the accuracy of the results. Nevertheless, in most cases, as I will show, the results conform to conventional hypotheses about the relationship between various Swedish pronunciations.


2 Background

2.1 Fundamentals of Speech Analysis The variation in speech between speakers of a language is caused by factors that can be categorized as either extrinsic or intrinsic [5]. Extrinsic factors are based primarily on cultural and emotional influence, for instance, environment, level of education or the speaker's state of mind. Intrinsic factors, on the other hand, are related to the anatomy of the speaker. These factors are a consequence of the variations in the size and shape of the vocal tract, and produce changes in the resonance characteristics. This means that differences in vocal tract dimensions will cause differences in spectra, even when the speakers are producing a sound that we perceive as the same phoneme.

There are chiefly two reasons why analysing speech is a challenging task: first, the variability of the spectral shapes, even for the same phoneme, makes each class more spread out; second, since different phonemes uttered by different speakers typically have similar formants, the classes tend to overlap. Speaker adaptation and normalization techniques attempt to reduce the effects of speaker variability on the performance of speech recognition systems.

2.1.1 Source-Filter Theory of Speech

The recorded speech signal can be considered as the output from a linear system, which consists of a source of excitation, convolved with the impulse response of a filter (Fant 1960). The filter represents the acoustic effect of the vocal tract, which depends on the positions of the articulators (jaw, tongue, lips, etc.) and corresponds to the uttered sound.

Source spectrum + Filter response = Output energy spectrum (on a logarithmic scale; in linear terms, the source spectrum is multiplied by the filter's transfer function)

The most energy will be transferred between the excitation and the output at the resonant frequencies of this resonance tube model. These resonant frequencies are the so-called formant frequencies of the speech signal, and can be identified by high-energy peaks in the spectral envelope. They are usually referred to as F1, F2, F3, etc.

2.2 Representation of Speech The speech signal must be converted from analogue to a digital representation before it can be processed by a computer. To that end, the analogue signal must first be band-limited, and then sampled at fixed time intervals. The sampling frequency usually ranges from 8 kHz (telephone quality speech) to 48 kHz (digital audio tapes). The maximum spectral frequency that can be represented is half the sampling frequency, e.g. 24 kHz for digital audio quality. Sampling resolution varies from 8 to 16 bits [4].

The raw sampled waveform is still not suited as direct input for a recognition system. The amount of data involved in using the speech signal directly is huge, suggesting that it contains a lot of redundant information. The commonly adopted approach is to convert the sampled waveform into a sequence of feature vectors, which is described in detail in section 3.1.2.


2.2.1 Mel Frequency Cepstral Coefficients Cepstral analysis allows the excitation source energy to be separated from the frequency response characteristics of the vocal tract1. Mel-frequency cepstral coefficients (MFCCs) are motivated by the behaviour of the human auditory system: they use a bank of Mel-scaled filters, modelling the hair-cell spacing along the basilar membrane of the ear [16].

For language identification, however, only the first 12 coefficients of the (mel-weighted) cepstrum are calculated, thereby retaining information relating to the shape of the vocal tract, while ignoring the quickly varying excitation signal.

An advantage of MFCC is that the cepstrum accounts for the whole of the relevant part of the spectrum, rather than just one or two frequencies of its spectral peaks. Thus, there is theoretically more information in a cepstral than a formant comparison, and there is therefore a greater chance of picking up important similarities or differences between samples with the cepstrum. [2]

2.2.2 Bark filter The Bark scale provides a perceptually motivated alternative to the Mel-scale. Speech perception in humans begins with spectral analysis performed by the basilar membrane (BM). Each point on the BM can be considered a band pass filter having a bandwidth equal to one critical bandwidth, or one Bark. The bandwidths of several auditory filters were empirically measured and used to formulate the Bark scale [6].

In the most commonly used type of spectrogram, the linear Hertz frequency scale is used, for which the difference between 100 Hz and 200 Hz is the same as the difference between 1000 Hz and 1100 Hz. However, our ear evaluates frequency differences not absolutely but relatively, namely in a logarithmic manner. Therefore the Bark filter uses the Bark scale, which correlates better with perception than the Hertz scale.

2.2.3 Cochleagram The cochleagram is also based on the Bark filter, but may be even more similar to human perception. The cochleagram uses the same frequency scale as the Bark filter, but in the cochleagram the perceived loudnesses, rather than the intensities, are given. In a cochleagram, the reference intensities are the intensities of a frequency of 1000 Hz. This is the basis for the measurement of loudness in phon. If a given sound is perceived to be as loud as a 60 dB sound at 1000 Hz, then it is said to have a loudness of 60 phon.

Bark filter and the cochleagram representation seem to be equally useful for finding distances between vowels, but cochleagrams differ from the Bark filter representation by virtue of the sharper distinction between voiceless and voiced sounds [1].

2.2.4 Formant Tracks Yet another way to analyse the acoustic signal is to investigate formants. Essential for perceiving vowels is that the ear recognizes spectral peaks, known as formants. When using a spectrogram with a large analysis window (about 20 ms), the frequency resolution will be high. Individual harmonics will show up as horizontal lines in the spectrogram.

In the IPA vowel quadrilateral, the height corresponds to F1 and the advancement corresponds to F2. Thus, the formant track representation can be used for finding segment distances, and especially for finding vowel distances [1].

2.3 Comparing Segments Once we have decided on a suitable representation for speech, we need a way to compare the segments to be able to gain knowledge about the sample. For instance, we could compare the acoustic vector against a collection of vectors, known as a codebook, to identify which phoneme the segment belongs to.

1 A short description of the mathematical basis of cepstral analysis can be found in the Appendix.


One method for measuring the similarity between segments is to calculate the Euclidean distance between the corresponding representations [1]. This method can be applied for all representations mentioned above.

In section 3.2.2.4 different methods for measuring similarity will be explained in more detail.

2.4 Data Analysis A general question facing researchers in many fields is how to organize observed data into meaningful structures. The goal is to extract the essential information, always keeping the objective of the study in mind, and to present the results in an easily comprehensible way.

2.4.1 Cluster Analysis Clustering involves grouping data points together according to some measure of similarity. One goal of clustering is to extract trends and information from raw data sets. An alternative goal is to develop a compact representation of a data set by creating a set of models that represent it. The former is generally the goal in geographic information systems, the latter generally the goal of pattern recognition systems. Both fields use similar, or identical techniques for clustering data sets.

Among the many approaches to clustering, the most common ones are hierarchical and partitional clustering. Hierarchical methods follow either a bottom-up or a top-down approach. In a typical bottom-up (agglomerative) approach, the algorithm begins with each data record represented by a separate cluster. Similar objects are then grouped together to form larger groups. This process proceeds until the final step, when the entire data set is represented by a single cluster. Top-down (divisive) methods reverse this process, starting with one big cluster and then dividing it into smaller and smaller groups. One advantage of the hierarchical approach is that the results can be presented in a tree-like structure called a dendrogram (see section 3.2.4.1.1). When applied, the tree can be cut at a level appropriate to the analysis, thereby creating a certain number of clusters [19].

In contrast to hierarchical methods, partitional methods start by making an assumption about the number of clusters in the data and their centre points (centroids). These centroids can be represented by actual samples in the data, or by calculated ones. The remaining data records are then assigned to the nearest cluster centre based on Euclidean distance or some other measure of similarity. Depending on the resulting distribution, the researcher may then adjust the assumptions and repeat the process.

In any cluster analysis, regardless of whether we are using hierarchical or non-hierarchical methods, we need to make decisions with respect to [9]:

Objects; which data characterise the objects, i.e. which variables should we use in the analysis?

Similarity; what does similarity mean for a particular set of data, how should objects be compared?

Method; what kind of grouping are we looking for, i.e. which clustering algorithm is appropriate?

Presentation; how should we present the results of the cluster analysis, can we verify against a hypothesis?


2.4.2 Principal Component Analysis Principal component analysis (PCA) is a method primarily used for reducing the dimensionality of the data, but it can also be a useful tool for understanding the relationships in a data set [15]. In essence, the axes of the coordinate system are rotated so that they become aligned with the directions of variation in the data set. The result will be a set of principal components, where the first component captures the primary direction of variation in the data set, the second component shows the next direction of variation, and so on.

By projecting the results from a cluster analysis onto the first three principal components, a 3D plot can be used to visualize and evaluate a particular clustering.

2.4.3 Multidimensional Scaling Multidimensional scaling (MDS) is a multivariate statistical technique, just like PCA, used for producing a lower-dimensional representation of the data suitable for visualisation, while still preserving the most prominent relationships of the original data. In an MDS plot, related objects are close to each other, while radically different objects are plotted far away from each other [20].

The concept that underlies multidimensional scaling can be illustrated by the following example. If you were given a table of pairwise distances between three cities, and were asked to draw a map locating the three cities as points representing the relative distances in the table, you might start by arbitrarily placing two points on the map to represent two of the cities. You could then draw in a third point so that the distances from the third city to the first two cities were proportional to the distances given in the table. If the table only consisted of three cities this would not be too hard, but for more cities the task would become extremely complex. Multidimensional scaling accomplishes this task by taking a table of similarities (e.g. distances) and iteratively placing points on a map so that the original table is represented as faithfully as possible. To achieve better precision we could use more than two dimensions; however, it is difficult to graph and interpret solutions that have more than three dimensions.
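As a minimal sketch of this idea, assuming scikit-learn is available, the snippet below feeds a made-up three-city distance table to an off-the-shelf MDS implementation; the distances are invented for illustration:

```python
# Given only pairwise distances, MDS places points in 2-D so that the
# distances between them reproduce the table as faithfully as possible.
import numpy as np
from sklearn.manifold import MDS

# Symmetric distance table for three hypothetical cities.
D = np.array([[0, 460, 275],
              [460, 0, 390],
              [275, 390, 0]])

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)   # 2-D "map" coordinates, one row per city
print(coords)
```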

2.5 Motivation This thesis was originally inspired by the work of John Nerbonne and Wilbert Heeringa, who have explored the Levenshtein distance to analyse the dialect borders of Dutch. In his doctoral thesis [1], Heeringa proposes a method using acoustic representations for quantifying distances in pronunciation between dialects using the Levenshtein distance. However, he did not consider Mel-frequency cepstral coefficients (MFCC), but instead used other spectral measures. In this thesis, I will use MFCCs, and apply some of the methods presented by Heeringa to validate this representation.

Turning to a different field: researchers in the discipline of functional genomics study the relationship between gene profiles, i.e. time series of gene expressions, and similar statistical methods can be applied to speech vectors represented by 12-dimensional cepstral coefficients.

So, if the clustering of genes based on their expression levels is useful for identifying groups of similar behaviour in various biological systems, can the clustering of speech segments based on their spectral components be an effective tool in the classification of dialects?


3 Method The method described in this chapter proposes a way of applying multivariate statistical techniques to classify speech segments. This includes validating the set of variables to cluster on, selecting a suitable similarity measure, selecting a suitable clustering algorithm, and determining the optimal number of clusters in the dialect data.

General procedure The method can be divided into the following steps:

1. Pre-processing data; this includes feature extraction, normalization and conditioning of the data.
2. Validation of data; multidimensional scaling will be used as a method of verifying the speech data.
3. Cluster analysis, with the goal of obtaining an objective classification of speech samples. This step will be divided into several stages [9]:
a. defining the research objective
b. research design
c. assumptions
d. deriving clusters
e. interpretation
4. Validation; the results will be validated geographically by creating dialect maps.

3.1 Pre-processing the Data Feature extraction, i.e. the choice of a suitable representation for the data items, is a key step in the analysis. Unsupervised methods attempt to find structures in the data set, and the structures are ultimately determined by the features chosen to represent the data items. The better the features can be tailored to reflect the requirements of the task, the better the results will be. The task of refinement requires considerable expertise both in the application area and in the data analysis methodology.

3.1.1 Sample Set There are no statistical methods that can make up for a badly planned study; therefore, understanding the data is key to any data analysis. This method is designed to cluster speech data on a segment level, and the data set therefore has to be restricted to contain only individual segments of speech, and conditioned so as to avoid bias and maximize distinguishability.

3.1.2 Feature Extraction To be able to compare the speech samples we need a parametric representation of speech that carries information about the short-time spectrum of the signal.

As described in section 2.2, segments may also be compared using other types of spectral representation, but for this method, Mel-frequency cepstral coefficients (MFCC) will be used to represent the data2. Cepstral coefficients are the most common representation of the spectral characteristics in the field of both speech and speaker recognition, since they provide a good representation of speech in both clean and noisy conditions.

2 The automatic feature extraction procedure described here is not critical to this method, and could thus be substituted by some other way of parameterizing the speech signal, e.g. by using linear predictive coding (LPC).


The MFCC feature extraction procedure3 looks as follows [5], [10]:

1. The input speech signal is digitised at a sampling rate of 16 kHz.
2. A pre-emphasis filter (using a coefficient of 0.97) is applied to the speech samples to focus on the spectral properties of the vocal tract.
3. Hamming windows of 25.6 ms duration are applied to the pre-emphasized speech samples to extract the speech waveform.
4. The power spectrum of the windowed signal in each frame is computed using a Fast Fourier Transform (FFT).
5. Mel-scaled bandpass filters are applied to approximate the frequency resolution of the human ear.
6. Log compression is applied in order to make the statistics of the power spectrum approximately Gaussian.
7. For each 10 ms time frame, 12 mel-frequency cepstral coefficients (MFCCs) are computed using a discrete cosine transform (DCT). This reduces the number of spectral parameters while retaining the relevant information of the speech signal.

The main advantage of DCT is that it is an orthogonal transformation, which efficiently decorrelates the spectral coefficients, i.e., it converts statistically dependent spectral coefficients into 12 independent MFC coefficients [4].
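In the thesis the extraction itself is done with HTK (see the hcopy.conf example in the Appendix); the following NumPy/SciPy sketch is only meant to make the seven steps concrete. The frame length (410 samples is roughly 25.6 ms at 16 kHz), FFT size and filter count are illustrative assumptions:

```python
# A sketch of the seven-step MFCC procedure listed above.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=410, hop=160, n_filters=26, n_ceps=12):
    # Step 2: pre-emphasis with coefficient 0.97.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Step 3: overlapping frames (10 ms hop), Hamming-windowed.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Step 4: power spectrum via FFT.
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Step 5: triangular Mel-scaled bandpass filterbank.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[i, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    # Step 6: log compression of the filterbank energies.
    log_energies = np.log(power @ fbank.T + 1e-10)
    # Step 7: DCT, keeping 12 coefficients (c1..c12, dropping c0).
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]
```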

3.1.3 Normalization To account for pitch-related bias, the speech data should be normalized. This can be done by applying a warping function, which can be thought of as a mapping between two spectra. An easy way of compensating for pitch-related bias is to include only speakers that have approximately the same fundamental frequency. Furthermore, the median MFCC vector for each utterance can be calculated to avoid influences (deviant values) from adjacent segments and to reduce the amount of data.
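A minimal sketch of the median step, assuming (purely for illustration) that the MFCC frames for each utterance are held in a dictionary of NumPy arrays:

```python
# Reduce each utterance to a single 12-dimensional profile: the median
# MFCC vector over all of its frames.
import numpy as np

def median_profiles(utterances):
    # utterances: dict mapping utterance id -> (n_frames, 12) MFCC array
    return {uid: np.median(frames, axis=0) for uid, frames in utterances.items()}
```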

3.2 Cluster Analysis The primary goal of cluster analysis is to partition a set of objects into groups based on their similarity for a set of specified characteristics. Since cluster analysis involves making decisions based on subjective assumptions about the data, different approaches will yield different results for the same data set. Thus, cluster analysis can be considered more of an art than a science. [9]

3.2.1 Research Objective The results of the cluster analysis are highly dependent on how the research question is formulated, and all decisions made during the cluster analysis should always be checked against the objective of the analysis.

3.2.1.1 Selecting Variables The selection of variables to cluster on is governed solely by the objective of the analysis, and should be based on our knowledge about the data. Selecting the right variables is key to a successful analysis, as they constitute the data upon which we characterise the objects that we want to organize.

3 An example of the HTK configuration file, hcopy.conf, can be found in the Appendix.



3.2.2 Research design

3.2.2.1 Dealing with Outliers Outliers are observations in a distribution of data that are not representative of the general population, i.e. they deviate so much from the other observations that one might suspect they were generated by a different mechanism. Another cause of outliers might be sparse data, i.e. an undersampling of the actual groups in the population. In both cases, the outliers distort the true structure and make the derived clusters unrepresentative of the true population structure.

To avoid the risk of having outliers distort the cluster analysis, a preliminary screening is necessary. This is done by visually checking for deviant profiles for each location. If found, these samples will be excluded from the analysis.

3.2.2.2 Estimating the Number of Data Clusters There is no panacea for determining the number of clusters in a cluster analysis. The number of clusters should be governed by the objective of the analysis, and based on the inherent structure of the data [9].

One way to explore the question of how many clusters there are is simply to try a number of different clusterings and see which one provides the most distinct result.

Another method for mathematically estimating the number of clusters in a data set is the Gap statistic [8]. It compares the within-cluster sum of squares of a given clustering with an average obtained from reference data sets generated randomly over the range of the original data. In the implementation that I have used (a demo version of Acuity 4.0, developed by Axon Instruments Inc.), the reference distribution is generated by applying a box aligned with the principal components of the data, thereby taking into account the shape of the data distribution.

Figure (illustration by Andy M. Yip, 2004 [18]): a reference box aligned with the principal components of the data.

The initial procedure is to let the dataset $x_{ij}$ consist of $n$ individual observations and $p$ features, where $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, p$. Let $d_{ii'}$ denote the distance between observations $i$ and $i'$, usually calculated as the squared Euclidean distance

$$d_{ii'} = \sum_j (x_{ij} - x_{i'j})^2.$$

For a dataset grouped into $k$ clusters $C_1, C_2, \dots, C_k$, where $C_r$ denotes the indices of the observations in cluster $r$ and $n_r$ refers to the number of observations in $C_r$, the sum of pairwise distances for all data points in cluster $r$ is denoted

$$D_r = \sum_{i, i' \in C_r} d_{ii'}.$$

The pooled within-cluster sum of squares around the cluster means can then be described as

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r,$$

which is a measure of compactness when using $k$ clusters.


The Gap Statistic can now be calculated using the following procedure [8]:

1. Cluster the observed data, varying the number of groups from $k = 1, 2, \dots, K$, thereby generating within-dispersion measures $W_k$.

2. Generate $B$ reference data sets (using Monte Carlo simulation), and cluster each one, giving within-dispersion measures $W^*_{kb}$, where $b = 1, 2, \dots, B$ and $k = 1, 2, \dots, K$. The Gap is then defined as

$$\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log(W^*_{kb}) - \log(W_k).$$

Effectively, the Gap Statistic estimates $\log(W_k)$ and compares it with its expectation under an appropriate null reference distribution of the data. Thus, the Gap Statistic estimate of the optimal number of clusters in the dataset is the value of $k$ for which $\log(W_k)$ falls the farthest below this reference curve.

3. Let

$$\bar{l} = \frac{1}{B} \sum_{b=1}^{B} \log(W^*_{kb}),$$

$$sd_k = \left[ \frac{1}{B} \sum_{b=1}^{B} \left( \log(W^*_{kb}) - \bar{l} \right)^2 \right]^{1/2}, \qquad s_k = sd_k \sqrt{1 + 1/B},$$

where $\bar{l}$ is the average of the log pooled within-cluster sums of squares from the $B$ samples, $sd_k$ is their standard deviation, and $s_k$ is a form of standard error estimate.

The sample datasets are drawn from the reference distribution via the Monte Carlo method of randomly generating data points over the range of the original dataset. Since the expectations $E^*_n\{\log(W^*_k)\}$ are estimated from randomly generated samples, the sampling distribution must be taken into account. Thus the estimate $\hat{k}$, i.e. the optimal estimated number of clusters, will be the value maximising $\mathrm{Gap}(k)$ after the adjustment for the sampling distribution in $E^*_n\{\log(W^*_k)\}$. This means that for a cluster size $k$ and standard error $s_k$ in the reference distribution, the optimal cluster size is the smallest $k$ such that $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$.
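The following sketch implements this procedure with scikit-learn's k-means as the clusterer and a uniform box aligned with the principal components as the reference distribution, as in the Acuity implementation described above; B, K and the random seed are illustrative choices:

```python
# Gap statistic [8] with a PCA-aligned uniform reference distribution.
import numpy as np
from sklearn.cluster import KMeans

def log_within_dispersion(X, k):
    # log(W_k): within-cluster sum of squares around the cluster means.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, K=10, B=20, rng=np.random.default_rng(0)):
    mu = X.mean(axis=0)
    # Align the reference box with the principal components of the data.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    Xp = (X - mu) @ Vt.T
    lo, hi = Xp.min(axis=0), Xp.max(axis=0)

    log_W = np.array([log_within_dispersion(X, k) for k in range(1, K + 1)])
    log_Wb = np.empty((B, K))
    for b in range(B):
        Zp = rng.uniform(lo, hi, size=Xp.shape)   # uniform sample in the box
        Z = Zp @ Vt + mu                          # rotate back to data space
        log_Wb[b] = [log_within_dispersion(Z, k) for k in range(1, K + 1)]

    gap = log_Wb.mean(axis=0) - log_W
    s = log_Wb.std(axis=0) * np.sqrt(1.0 + 1.0 / B)
    # Optimal k: smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(K - 1):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k + 1, gap
    return K, gap
```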

3.2.2.3 Standardize the Data The most common form of standardization is the conversion of each variable to standard scores, known as Z scores, which quantify how many standard deviations the original values are from the mean of the distribution. The Z is calculated by subtracting the mean and dividing by the standard deviation for each variable, formally expressed as

$$Z = \frac{x - \bar{x}}{s}$$

where $\bar{x}$ is the mean and $s$ the standard deviation of the variable.

A negative Z score means that the original score was below the mean; a positive Z score means that the original score was above the mean. Thus, the actual value corresponds to the number of standard deviations the score is from the mean in the direction of the sign.
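A one-line NumPy sketch of this standardization, assuming one row per sample and one column per variable:

```python
# Convert each variable (column) to Z scores: zero mean, unit standard deviation.
import numpy as np

def z_scores(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)
```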


3.2.2.4 Select a Similarity Measure The concept of similarity is fundamental to cluster analysis. To be able to decide how similarity should be measured, one must determine whether the variables are metric or nonmetric. Because our cluster variables consist of metric data, only correlational measures and distance measures will be considered. 4

3.2.2.4.1 Correlational Measures Correlation provides a single number (the correlation coefficient) that summarises how strongly two profiles vary together. The correlation measure does not look at the magnitudes of the values; instead it compares the patterns of the variables.

The most common measure of correlation is Pearson's correlation, designated by the coefficient r. The simplest way to think about Pearson's r is to plot the two vectors x and y as profiles, with r telling you how similar the shapes of the two profiles are. The Pearson correlation coefficient varies between -1 and 1, with 1 meaning that the two series are identical, 0 meaning they are completely independent, and -1 meaning they are perfect opposites. The correlation between two profiles, x and y, with k dimensions is calculated as

$$r(x,y) = \frac{\mathrm{cov}(x,y)}{sd(x)\,sd(y)} = \frac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_k (x_k - \bar{x})^2} \sqrt{\sum_k (y_k - \bar{y})^2}}$$

3.2.2.4.2 Euclidean Distance Measures This group measures similarity based on proximity across the variables included in the cluster analysis [9].

Euclidean distance This metric measures the absolute distance between two points in space, in this case defined by vectors x and y. Unlike correlation-based distance measures, the Euclidean distance takes the magnitude of the variables into account, thereby preserving more information about the speech characteristics than correlational measures. Euclidean distance is calculated as:

$$d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Squared Euclidean distance In some cases one may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is calculated as:

$$d(x,y) = \sum_{i=1}^{n} (x_i - y_i)^2$$

City-block (Manhattan) distance This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that for this measure the effect of single large differences (outliers) is dampened, since they are not squared. The city-block distance is calculated as:

$$d(x,y) = \sum_{i=1}^{n} |x_i - y_i|$$

4 For non-metric data, using a nominal or an ordinal scale, association measures are used. An example of an association measure would be the percentage of times there was agreement across a set of questions in a questionnaire.


Chebychev distance This distance measure may be appropriate in cases when one wants to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is calculated as:

$$d(x,y) = \max_i |x_i - y_i|$$

3.2.2.4.3 Pattern or proximity? The next question that needs to be answered is whether to focus on patterns or proximities. Since Pearson's correlation is based on the angle between two vectors, rather than the distance, it is insensitive to the amplitude of changes that may be seen in the profiles. This is not the case with Euclidean distance, which measures the absolute distance between two points in space, thus taking into account both the direction and the magnitude of the vectors.

The cepstral coefficients are decorrelated from one another, which make Euclidean distance a suitable similarity measure for acoustic data.
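The measures above are straightforward to express in NumPy; the sketch below assumes x and y are two 12-dimensional MFCC profiles:

```python
# The four distance measures from section 3.2.2.4, plus Pearson's r.
import numpy as np

def euclidean(x, y):          # absolute distance between two points
    return np.sqrt(np.sum((x - y) ** 2))

def squared_euclidean(x, y):  # weights distant objects progressively more
    return np.sum((x - y) ** 2)

def city_block(x, y):         # sum of absolute differences; dampens outliers
    return np.sum(np.abs(x - y))

def chebychev(x, y):          # objects differ if any single dimension differs
    return np.max(np.abs(x - y))

def pearson_r(x, y):          # pattern similarity, insensitive to magnitude
    return np.corrcoef(x, y)[0, 1]
```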

3.2.3 Data Validation Multidimensional scaling can be used to show the relations between the sounds in three-dimensional space. This allows us to compare the ordering of the sounds with the way in which they are ordered in the IPA system. Since the relative positions of the vowels correspond to those in the IPA quadrilateral, it can be asserted that the MFCC representation is useful for finding segment distances [1].

3.2.3.1 Calculate the Mean Vector for Each Vowel For each vowel, we need to pick out the vector that best represents the pronunciation for the entire population. This is done by applying a profile search tool, which calculates the similarity to the average vowel based on Euclidean distance. The vectors are then ranked according to their similarity to the average, and the vector with the best rank gets to represent the vowel.
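A sketch of this ranking step, assuming for illustration that the tokens of one vowel are stacked in an (n_tokens, 12) NumPy array:

```python
# Rank all tokens of a vowel by Euclidean distance to the average profile,
# and let the best-ranked token represent the vowel.
import numpy as np

def representative_vector(tokens):
    mean = tokens.mean(axis=0)
    dists = np.linalg.norm(tokens - mean, axis=1)
    return tokens[np.argmin(dists)]
```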

3.2.3.2 Apply Multidimensional Scaling The purpose of multidimensional scaling (MDS) is to provide a visual representation of the pattern of proximities (i.e., similarities or distances) among a set of objects. Thus by applying multidimensional scaling, we will be able to study the relationship between the vowels.

By first calculating a distance matrix based on the average vowel vectors, a set of points mapped in two or three dimensions is returned, so that the distances between the points approximate the original distances.

3.2.3.3 Verify the Relative Positions By generating scatter plots and mapping the resulting dimensions from the MDS on to the axes of the plots, we can compare the ordering of the sounds with the way in which they are ordered in the IPA system. [1]


3.2.4 Deriving Clusters Clustering represents a set of unsupervised methods that can be used to decide if the observations fall into distinct groups. The natural basis for organizing samples in this way is the assumption that similar cepstral profiles might indicate similar articulation. Under these circumstances, the clustering process can also be helpful in identifying which coefficients are instrumental in characterizing a certain dialect.

3.2.4.1 Select a Cluster Algorithm When the variables have been selected and validated, and we know what similarity measure to use, it is time to select a clustering algorithm and decide on the number of clusters to be formed. All cluster algorithms attempt to maximize the differences between clusters while at the same time minimizing the variation within the clusters. Unsupervised methods can chiefly be classified into two general categories: hierarchical and nonhierarchical (partitional) clustering [13].

Figure: taxonomy of clustering methods: hierarchical (agglomerative, divisive) and partitional (k-means, SOM, PCA).

3.2.4.1.1 Hierarchical Clustering Hierarchical clustering is an agglomerative method, which means that the cluster analysis begins with each record in a separate cluster; in subsequent steps, the two clusters that are most similar are combined into a new aggregate cluster [9]. The number of clusters is thereby reduced by one in each iteration step. Eventually, all records are grouped into one large cluster.

The HC algorithm (as described in the Spotfire documentation):

1. The similarity between all possible combinations of any two vectors is calculated using a selected similarity measure.
2. Each vector is placed in a separate cluster.
3. The two most similar clusters are grouped together and form a new cluster.
4. The similarity between the new cluster and all remaining clusters is recalculated using a selected clustering method.
5. Steps 3 and 4 are repeated until all records eventually end up in one large cluster.


The resulting tree, called a dendrogram, is a multilevel hierarchy, where clusters at one level are joined as clusters at the next higher level. In Spotfire, you use the distance slider to examine the natural groupings in the hierarchical tree, and then set the cut-off at an appropriate point to partition the data.

One way to measure how well the cluster tree generated by the linkage function reflects your data is to compare the so-called cophenetic distances with the original distance data. If the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the distance vector. (Copyright: MathWorks Inc.)
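A sketch of this workflow using SciPy's hierarchical clustering routines; the UPGMA ('average') linkage, the random stand-in data and the four-cluster cut are illustrative choices:

```python
# Build the linkage tree, check it against the original distances via the
# cophenetic correlation, and cut it into a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(30, 12)           # stand-in for 12-dim MFCC profiles
d = pdist(X, metric='euclidean')     # step 1: all pairwise similarities
tree = linkage(d, method='average')  # steps 2-5: agglomerative merging

# Cophenetic correlation: how faithfully the tree reflects the distances.
c, _ = cophenet(tree, d)
print(f"cophenetic correlation: {c:.3f}")

labels = fcluster(tree, t=4, criterion='maxclust')  # cut into 4 clusters
```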

3.2.4.1.2 Pairwise Distance Metrics There are a number of agglomerative procedures used to derive the similarities between clusters. These rules differ in how the distance between clusters is calculated, and the choice of which method to use is highly dependent on the structure of the data [9], [13].

Single Link This technique is also referred to as nearest neighbours. Using the single-link distance function, groups are formed from the minimum distance between two items. When applying this type of distance measure, chaining effects may be observed. This refers to the tendency to incorporate intermediate profiles into an existing group rather than form a new one.

Complete Link This technique is also referred to as furthest neighbours. Complete link produces compact groups that do not exceed some threshold: the maximum distance between the vectors determines the difference between two groups. Complete link algorithms require well-separated vectors; as this is rarely the case, the method performs well only when the data form naturally distinct clusters.

Unweighted Pair-Group Average (UPGMA) This method calculates the distance between two clusters as the average distance between all pairs of vectors in the two different clusters.

Weighted Pair-Group Average (WPGMA) Same as UPGMA except that in the computations, the size of the clusters is used to weight their relative importance. This method should be used when the cluster sizes are suspected to be greatly uneven.

Ward's Method This method attempts to minimize the sum of squares of any two clusters that can be formed at each step. Ward's method is regarded as very efficient and a good choice if one wants to avoid the chaining problems found in linkage methods. However, it tends to create clusters of small size and may be sensitive to outliers.


3.2.4.1.3 K-means K-means clustering can best be described as a partitioning method. That is, the algorithm partitions the observations into k mutually exclusive clusters, and returns a vector of indices indicating to which of the k clusters it has assigned each observation. Unlike the hierarchical clustering methods used in linkage, k-means does not create a tree structure to describe the groupings, but instead it creates a single level of clusters. Another difference is that k-means clustering uses the actual objects or individuals in the data, and not just their proximities. For these reasons k-means is more suitable for clustering large amounts of data [17].

K-means clustering requires a parameter, k, the number of expected clusters. Cluster centers (centroids) can be initialized randomly, or they can be seeded based on a priori knowledge about the structure of the data.

The default initialisation method in Spotfire is a data-centroid-based search, where the average of all profiles in the data set is chosen to be the first centroid. The similarity between this centroid and all profiles is calculated using the defined similarity measure. The profile that fits least well in this group, i.e. is least similar to the centroid, is then assigned to be the centroid for the second cluster. The similarity between the second centroid and all the remaining profiles is then calculated, and all profiles that are more similar to the second centroid than to the first are assigned to the second cluster. Of the remaining profiles, the least similar profile is then chosen to be the third centroid, and the above process is repeated until the number of clusters specified by the user is reached.

Then, in each iteration of the algorithm, all of the profiles are assigned to the cluster whose centre they are nearest to (using the distance metric), and each cluster centre $\bar{x}_k$ is recalculated based on the profiles within the cluster. The centroids are recalculated until a steady state is reached, i.e. when profiles no longer change cluster and the centroids no longer vary.

Figure: the locations of the cluster centres k1 to k4 move from one step to another, causing an object to be reassigned from k2 to k4.
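A minimal k-means run with scikit-learn, mirroring the iterative reassignment described above. Note that scikit-learn seeds centroids with k-means++ by default rather than Spotfire's data-centroid search; k = 4 and the random data are illustrative:

```python
# Partition 12-dimensional profiles into k mutually exclusive clusters.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 12)                      # stand-in MFCC profiles
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

labels = km.labels_                # cluster index assigned to each profile
centroids = km.cluster_centers_    # final steady-state centres
```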


3.2.4.1.4 Self Organizing Maps Self-Organizing Maps (SOMs) are a special class of artificial neural networks based on competitive learning. The algorithm produces a two-dimensional grid, where similar records appear close to each other, and less similar records appear more distant. From this map it is possible to visually investigate how records are related. In this sense, SOMs provide a form of clustering.

A formal account of how to compute the maps can be found in [12]. The following is a non-mathematical introduction to Self-Organizing Maps (as described in the Spotfire documentation):

1. Initialisation; a two-dimensional rectangular grid is set up, spanned by the two principal eigenvectors of the data. Each node in the grid is assigned a prototype vector. This vector has the same number of dimensions as the input data.
2. Sampling; a record, called the input vector, is randomly picked from the data set.
3. Similarity matching; the input vector is compared to the weight vector of each node, and the node whose weight vector is most similar to the input vector is declared the best matching unit (BMU).

Figure (copyright: Agilent Technologies): the map at step i and step i + 1.

4. Updating; the weight vector of each node is modified. Neighbour nodes adjacent to the winner have their weight vectors modified to approach the input vector, while nodes far from the winner are less affected, or not affected at all.
5. Iteration; the algorithm is repeated from step 2.
6. Best match; after a number of iterations, the training ends. Each record in the data set is assigned to the node whose weight vector most closely resembles it, using Euclidean distance.

The learning phase (step 1) enforces a global ordering of the map, while in the training phase (steps 2-6) the final accurate state of the map is formed gradually.
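The following NumPy sketch condenses steps 2 to 5 into a training loop; the grid size, iteration count and decay schedules are illustrative assumptions, and the eigenvector-based initialisation of step 1 is replaced by random initialisation for brevity:

```python
# A compact SOM training loop: random sampling, BMU search, and
# neighbourhood-weighted updates with decaying radius and learning rate.
import numpy as np

def train_som(data, rows=6, cols=6, n_iter=2000, rng=np.random.default_rng(0)):
    n, dim = data.shape
    weights = rng.random((rows, cols, dim))        # prototype vectors
    grid = np.dstack(np.mgrid[0:rows, 0:cols])     # node grid coordinates
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1.0 - frac)                    # decaying learning rate
        radius = max(rows, cols) / 2.0 * (1.0 - frac) + 1e-3
        x = data[rng.integers(n)]                  # step 2: sample a record
        # Step 3: find the best matching unit (Euclidean distance).
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Step 4: pull neighbours toward the input, weighted by grid distance.
        grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
        h = np.exp(-grid_d2 / (2.0 * radius ** 2))[:, :, None]
        weights += lr * h * (x - weights)
    return weights
```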

Mapping precision The mapping precision measure describes how accurately the nodes correspond to the original data set, i.e. the smaller the precision error, the better the agreement. The number of data vectors is usually larger than the number of nodes, and the precision error is thus always greater than 0.

Topology preservation Unlike other clustering algorithms, SOM attempts to preserve the topology, i.e. the local neighbourhood relations in the data, thereby giving the opportunity to use continuous coloring to represent the relationship between the clusters [11]. The topology preservation measure describes how well the SOM preserves the topology of the studied data set. The topographic error is calculated as:

$$s_t = \frac{1}{N} \sum_{k=1}^{N} u(x_k)$$

where $u(x_k)$ is 1 if the first and second BMUs of $x_k$ are not next to each other, and otherwise it is 0.


3.2.4.1.5 Principal Component Analysis (PCA) The PCA algorithm takes a high-dimensional data set as input, and produces a new data set consisting of fewer variables, known as principal components. These components are linear combinations of the original variables, which makes it possible to assign meaning to what they represent. Rotation and scaling are linear operations, so the PCA transformation maintains all linear relationships.

Figure: a trivial case of two dimensions being reduced to one. Seen in reference to the rotated coordinate system, we have a set of points that vary significantly only along x1. We can therefore project the points onto this new axis, letting that axis represent our first principal component, and ignore the comparatively small variation along y1.

The first principal component is the linear combination of the original variables which accounts for the maximum amount of variance in a single dimension. The second principal component is that line which is orthogonal to the first principal component (corresponding to y1 in the above figure) and accounts for the maximum amount of the remaining variance in the data. The first two components therefore represent the plane of best fit through the data. By mapping the first two or three components using a scatter plot, we can get a sense of what most closely reflects the disposition of the points in the full n-dimensional space [15].

The eigenvalues obtained from Principal Component Analysis are equal to the variance explained by each of the principal components, in decreasing order of importance (see the eigenvalue table in section 4.2.2.2). The eigenvalues associated with each principal component can be examined by means of a Scree plot, which shows the eigenvalues from the first to the last principal component. The plot will generally fall off sharply from the first component and then level off, which gives us an idea of how many independent factors there are in the data set. The idea is that once the plot levels off, the remaining principal components do not explain much about the data and can therefore be disregarded.

The eigenvectors are weightings which, when applied to the original data, yield principal component scores for the observations. A large positive or negative value indicates a variable that is correlated, in either a positive or a negative way, with the component. This can be viewed using a loadings plot, which is a profile chart where each component is represented by a profile. It can also be viewed using a biplot (see the last figure of section 4.2.5.6).

In this study, PCA will primarily be used as a mechanism for evaluating clustering results. I will use 3D plots to inspect the first three principal components visually, and see if coherent clusters correspond to the structure suggested by the clustering algorithms. This may also give a sense as to how many natural groupings exist in the distribution.
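As a rough illustration of the mechanics, here is a compact PCA via eigendecomposition of the covariance matrix, returning the scores, the eigenvectors (loadings), and the explained-variance percentages that a Scree plot would display. This is only a sketch; the actual analyses were performed in XLSTAT and Spotfire.

import numpy as np

def pca(data):
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                     # principal component scores
    explained = 100.0 * eigvals / eigvals.sum()     # % of variance per component
    return scores, eigvecs, explained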



3.3 Validation
An important question is how reliable the obtained clusters are. There are a number of methods that can be used to examine the statistical significance of clusters.

3.3.1 Visual Verification
By geographically examining the results from different clustering algorithms, one can establish whether distinct patterns appear for the different algorithms. Ideally, one would like to verify the results against a gold standard [1], i.e. an expert classification of speech varieties. In the absence of such a standard, one can gain a fairly good impression of the cluster results by visually comparing the outcome against
a. results from different cluster algorithms,
b. principal components or MDS results, using scatter plots to see how well the clusters can be distinguished.

Heeringa et al. have presented a method of validating acoustic distances perceptually, by comparing correlation coefficients of computational distances to perceptual distances based on 15 Norwegian varieties [1]. For details, see Conclusion and Further Research.

3.3.2 Coincidence Testing
Coincidence testing can be used to investigate whether the similarity between the outcomes of two cluster analyses is a coincidence or not. The results are presented as probability p-values.

Suppose that we have performed clustering using two different methods, and we now want to know how well the two methods agree on the classification of each record. The table below, extracted from the Spotfire user documentation, shows the identifiers and cluster classifications for some records A to F. Performing a coincidence test on the two clustering columns produces the Coincidence column that contains the p-values.

Identifier   Hierarchical clustering   K-means clustering   Coincidence   Interpretation
A            1                         3                    0.2           Good match
B            1                         3                    0.2           Good match
C            1                         2                    0.95          Worst match
D            2                         2                    0.2           Good match
E            2                         2                    0.2           Good match
F            3                         1                    0.166666...   Best match

The records for which the cluster classifications correlate get the lowest p-values, i.e. a low probability that the agreement is a coincidence. The group with records A and B and the group with records D and E showed quite good matching. C received a high p-value since the two clusterings disagree about its classification.

Thus, by filtering out the worst matches of two (radically) different clustering methods, we should be able to see cluster results of high correlation, which could serve as a measure of objectivity in the analysis.

3.3.3 Statistical Significance of Clusters
A widely used approach is bootstrap analysis, where the basic idea is to resample the distribution to produce a random set of samples. Each of these samples is known as a bootstrap sample, and each provides an estimate of the parameter of interest. Repeating the sampling a large number of times provides information on the variability of the clustering process. When accuracy is high, the bootstrap estimates will be more like the original estimates and the bootstrap clustering will be more like the original clustering [13].
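A hedged sketch of the idea follows: resample the records with replacement, re-cluster each bootstrap sample, and measure how often pairs of records that share a cluster in the original solution also share one in the bootstrap solutions. scikit-learn's KMeans stands in here for whichever clustering method is being validated; the details of the procedure in [13] differ.

import numpy as np
from sklearn.cluster import KMeans

def bootstrap_stability(data, k=3, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    base = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
    agreements = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(data), len(data))   # sample with replacement
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(data[idx])
        # Co-membership of each pair of resampled records, in both solutions.
        same_base = base[idx][:, None] == base[idx][None, :]
        same_boot = labels[:, None] == labels[None, :]
        agreements.append((same_base == same_boot).mean())
    return float(np.mean(agreements))   # close to 1.0 means a stable clustering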


3.3.4 Interpretation
As a final step, we add semantics to the clusters by assigning labels that describe the nature of the groups. This can be a tricky task, particularly if unintuitive clusters are formed for which we cannot find a rational explanation.

3.4 Tools

3.4.1 Spotfire DecisionSite
Spotfire DecisionSite is an analytics platform that provides tools for multivariate statistical data analysis. DecisionSite uses Spotfire's patented visualization technology, which allows the researcher to interactively query large amounts of data and instantly visualize the results as scatter plots, bar charts, profile charts, and a number of other modes. Spotfire DecisionSite stores data internally in a proprietary data format, which makes it very fast in terms of response times and user interaction. In the empirical study, Spotfire DecisionSite has been used extensively to generate plots.

3.4.2 XLSTAT-Pro
XLSTAT-Pro is a statistics add-in for Microsoft Excel that includes a wide range of functionalities covering most of the requirements for data analysis and statistics. In this thesis, XLSTAT has primarily been used to perform multidimensional scaling, and to generate various plots relating to MDS.

3.4.3 Acuity 4.0
Acuity, developed by Axon Instruments, is a platform for microarray analysis. It includes a suite of analysis methods, among them clustering algorithms like k-means and k-medians. In this thesis, these algorithms are used to compute the Gap statistic.


4 Empirical Study
This chapter accounts for the application of the previously described method to actual dialect samples from native speakers of Swedish. As stated earlier, these methods are by nature exploratory rather than confirmatory analyses of data, and there is always the risk of selecting cluster methods based on the things we want to see, thereby presenting potentially non-existent patterns. One must therefore be careful when interpreting the results.

4.1 Pre-Processing the Data

4.1.1 Sample Set
The speech material used in this study has been extracted from speech databases of the dialect project SweDia 2000 (http://swedia.ling.umu.se). The recordings were made mainly in the speakers' homes, where measures had been taken to avoid acoustic disturbances, such as echo effects and other noises. The recordings were made on portable DAT recorders and later transferred to digital workstations [7].

The speech samples of the SweDia 2000 project were produced by 10 to 12 speakers of both genders, from a younger and an older age group (20-30 years old, and 55-75 years old), and recorded at 97 different locations all over Sweden and the coastal regions of Finland. The words were elicited by having the speaker answer a question with the target word and then repeat it up to five times.

As vowels are longer in duration and are produced with a nearly static vocal tract shape, they are more easily and reliably recognized than consonants. Vowels are spectrally well defined, which makes them suitable for a comparative study. The sample set in this study includes the four corner vowels of the IPA quadrilateral, [iː], [uː], [a], and [ɑː], as well as the referential vowels [yː], [eː], [øː], [ɛː], and [ʉː].

Phone   Swedish word (Eng. transl.)
[iː]    'dis' (haze)
[uː]    'sot' (soot)
[a]     'tack' (thanks)
[ɑː]    'lat' (lazy)
[yː]    'typ' (type)
[eː]    'leta' (search)
[øː]    'söt' (sweet)
[ɛː]    'läsa' (read)
[ʉː]    'lus' (louse)

All of these phones will be included in the validation of the data set, whereas the actual cluster analysis will focus exclusively on the vowel [ɑː], as represented in the word 'lat' (lazy).

4.1.1.1 Speaker Normalisation
To account for pitch-related bias, the data set in this analysis has been restricted to include only male speakers from the older age group, under the assumption that these speakers all have an average fundamental frequency around 110 Hz.

4.1.1.2 Sample Normalization
For part of the speech material, the vowels have been cut out in such a way as not to include any part of the preaspiration. However, some samples still contain co-articulation, i.e. influences or deviant values from adjacent segments, not representative of the speaker's pronunciation of the vowel examined.

Therefore, to account for any influence related to segmental context, the median for each sample is calculated. The median can be thought of as the "average not affected by outliers", and is the middle observation in the (sorted) distribution. Admittedly, it is quite a blunt method as it is restricted to a centering of the data and does not correct for intensity-dependent biases.
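In code, the normalisation amounts to one median vector per utterance; the sketch below assumes each utterance is a block of ten 12-dimensional MFCC rows, and drops the first and last two rows to reduce influence from adjacent segments (as in step 2 of section 4.1.2.1).

import numpy as np

def median_vector(utterance):
    # utterance: array of shape (10, 12), ten MFCC frames for one sample
    return np.median(utterance[2:-2], axis=0)   # median over the six inner rows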


Although the median normalization implies a certain loss of information, it is also motivated by the need to reduce the amount of data. Since each utterance is represented by ten rows, each consisting of a 12-dimensional vector, this would make the data set unmanageable:

3 (speakers) * 97 (locations) * 10 (vowels) * 3 (samples) * 10 (rows) = 87,300 rows of data

After centering, the dataset will consist of approximately 10,000 records (the number of speakers varies, as does the number of samples and locations).

4.1.2 Feature Extraction
To achieve a compact representation of the spectral characteristics of the speech, the recorded speech is converted into a sequence of acoustic feature vectors, using the procedure described in section 3.1.2. For historical reasons the result is called cepstral coefficients, or Mel Frequency Cepstral Coefficients (MFCCs). The raw data resulting from the cepstral analysis in HTK⁵ is a set of files that have the following format:

------------------------------------ Samples: 0->-1 ------------------------------------
  0:  0.000  0.000  0.000  0.000 -0.000 -0.000  0.000  0.000 -0.000  0.000  0.000  0.000
  1:  0.000  0.000  0.000  0.000 -0.000 -0.000  0.000  0.000 -0.000  0.000  0.000  0.000
  2:  0.000  0.000  0.000  0.000 -0.000 -0.000  0.000  0.000 -0.000  0.000  0.000  0.000
  ...
  9:  0.000  0.000  0.000  0.000 -0.000 -0.000  0.000  0.000 -0.000  0.000  0.000  0.000
 10: -4.117  3.897 -2.895 -8.888  3.737  5.654 -4.516 -7.549  9.197 -0.522 -5.789  2.367
 11: -2.948  2.306 -1.594  1.354 -7.193 -6.618 -3.195 -6.195 -2.782 -9.075 -11.394 -4.905
  ...
 34: -7.775  6.727  9.146 -0.052 -11.523 -5.715  5.645 -14.259 -13.435 -3.575 -3.124  2.907
 35: -11.079  6.157  5.410 -7.714 -6.438 -3.758 -1.945 -15.445 -3.941 -5.020 -8.647  4.496
 36:  0.000  0.000  0.000  0.000 -0.000 -0.000  0.000  0.000 -0.000  0.000  0.000  0.000
  ...
153:  0.000  0.000  0.000  0.000 -0.000 -0.000  0.000  0.000 -0.000  0.000  0.000  0.000
----------------------------------------- END ------------------------------------------

The raw data is partitioned so that each file contains a number of samples of one particular vowel from one particular speaker. For instance, the file "ank_om_1.dis.txt" contains 4 samples of the phone [iː] as in 'dis', pronounced by an older man (speaker number 1) from Ankarsrum (ank).

4.1.2.1 Conditioning the Data
The data need to be converted into a more manageable format before we start the analysis:

1. First, we run a Perl script⁶ to get rid of redundant line breaks and align the data into tab-separated files. The twelve MFCCs are named C1, C2, ..., C12.

⁵ The tool HCopy available in HTK is used to copy one or more source (e.g. WAV) files to an output file. Mechanisms are also provided for extracting segments and concatenating files. By setting the appropriate configuration variables, all input files can be converted to parametric form (i.e. MFCCs) as they are read in [10]. Thus, simply copying each file in this manner performs the required encoding. The configuration used here is available in the Appendix, as HTK: hcopy.conf.
⁶ All Perl scripts can be obtained from the author upon request.


Vector   C1       C2      C3       C4       C5      C6      C7       C8       C9      C10      C11      C12
0        0.000    0.000   0.000    0.000   -0.000   -0.000  0.000    0.000   -0.000   0.000    0.000    0.000
1        0.000    0.000   0.000    0.000   -0.000   -0.000  0.000    0.000   -0.000   0.000    0.000    0.000
...
9        0.000    0.000   0.000    0.000   -0.000   -0.000  0.000    0.000   -0.000   0.000    0.000    0.000
10       -4.117   3.897   -2.895   -8.888  3.737    5.654   -4.516   -7.549  9.197    -0.522   -5.789   2.367

2. Then we calculate the median for each sample. The first two and the last two rows of each 10-row sample are omitted so as to avoid influence from other segments.

3. To prepare the data for geographical plotting, a column "Location" is added and populated with names derived from the locational prefix for each sample.

4. To be able to work on different samples, a column "Sample" is added and populated with the word derived from the sample identifier, e.g. from "ank_om_1.dis_median_1" we get "dis".

The speech data has now been converted to the following format:

Vector                  Location    Sample   C1       ...   C10      C11      C12
ank_om_1.dis_median_1   Ankarsrum   dis      -2.697   ...   -2.904   -1.181   -2.208
ank_om_1.dis_median_2   Ankarsrum   dis      -2.149   ...   -4.442    1.715   -5.249
...

5. As an optional step to further bring down the size of the data, we could calculate the mean vectors for each speaker (see section 4.2.2.1). However, this step should be preceded by first checking the data for outliers.

4.1.3 Data Validation
Multidimensional scaling can be used to show the relations between the sounds in three-dimensional space. This allows us to compare the ordering of the sounds with the way in which they are ordered in the IPA system.

4.1.3.1 Calculate the Mean Vector for Each Phoneme
For each vowel, we need to pick out the vector that best represents the pronunciation for the entire population. This is done by applying a profile search tool, which, based on Euclidean distance, calculates the similarity to the average vowel. The vectors are then ranked according to their similarity to the average.

The profile with the highest rank (marked) is the one that best represents the vowel.


4.1.3.2 Apply Multidimensional Scaling
The purpose of applying multidimensional scaling (MDS) is to provide a visual representation of the distances among a set of elements, and in this case, it shows the relationship between the vowels.

By first calculating a similarity matrix based on the mean pronunciations of the vowels, a set of points mapped in two or three dimensions is returned, such that the distances between the points are approximations to the original distances. The matrix is calculated using Kendall's correlation coefficient, which measures similarity within the interval [-1,+1]. This choice of measure is motivated by the stress value, which is explained in the next section.
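A sketch of this step, assuming mean_vectors holds one mean MFCC vector per vowel: build the Kendall similarity matrix, convert it to dissimilarities, and embed the points. scikit-learn's MDS stands in for the XLSTAT implementation actually used, so stress values will not match exactly.

import numpy as np
from scipy.stats import kendalltau
from sklearn.manifold import MDS

def mds_from_kendall(mean_vectors, n_dims=3):
    n = len(mean_vectors)
    sim = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            tau, _ = kendalltau(mean_vectors[i], mean_vectors[j])
            sim[i, j] = sim[j, i] = tau
    dissim = 1.0 - sim   # map similarity in [-1, +1] to a non-negative distance
    mds = MDS(n_components=n_dims, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(dissim)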

        lus     sot     söt     tack    läs     typ     tak     leta    lat     dis
lus     1.000   0.121   0.152  -0.364   0.121   0.606  -0.333   0.333  -0.242   0.333
sot     0.121   1.000  -0.061  -0.030  -0.030   0.152   0.424   0.182   0.455   0.242
söt     0.152  -0.061   1.000   0.061   0.545   0.061  -0.091   0.212  -0.061  -0.091
tack   -0.364  -0.030   0.061   1.000   0.091  -0.455   0.545  -0.242   0.394  -0.303
läs     0.121  -0.030   0.545   0.091   1.000   0.091   0.061   0.303   0.152   0.000
typ     0.606   0.152   0.061  -0.455   0.091   1.000  -0.303   0.545  -0.273   0.545
tak    -0.333   0.424  -0.091   0.545   0.061  -0.303   1.000  -0.091   0.788  -0.091
leta    0.333   0.182   0.212  -0.242   0.303   0.545  -0.091   1.000  -0.121   0.576
lat    -0.242   0.455  -0.061   0.394   0.152  -0.273   0.788  -0.121   1.000  -0.121
dis     0.333   0.242  -0.091  -0.303   0.000   0.545  -0.091   0.576  -0.121   1.000

In bold, significant values (except the diagonal) at the significance level alpha = 0.050 (two-tailed test).

4.1.3.3 Generate Scatter Plots
By generating scatter plots and mapping the resulting dimensions from the MDS onto the axes of the plots, we can compare the ordering of the sounds with the way in which they are ordered in the IPA system. However, it is important to understand that the positions in the MDS plots are not absolute. Rather, they describe the relative proximities between the observations.

The stress value measures the vertical discrepancy between the map distances and the transformed data points (uncolored markers). When the stress is zero, the markers lie on top of each other. Kendall correlation is used since this measure returned the smallest stress value, which suggests that it is the best candidate for describing the similarities using three dimensions.

Similarity measure      Stress (based on 3 dimensions)
Kendall correlation     0.043
Euclidean distance      0.076
Pearson correlation     0.094

When looking at an MDS plot it is important to understand that the axes, as such, are artefacts and that the orientation of the picture is arbitrary. Thus an MDS representation of, for example, distances between cities need not be oriented such that north is up and east is right. All that matters in an MDS plot is the relative distances between the plotted objects.


On the left, a 2D multidimensional scaling plot obtained from the similarity matrix, Stress = 0.101. The first dimension (Dim1) corresponds to height and the second dimension (Dim2) to advancement of the tongue. The Shepard diagram on the right is a scatter plot of input proximities against output distances for every pair of items scaled.

Multidimensional scaling applied on three dimensions, Stress = 0.043. The height and advancement can be appreciated by connecting the corner vowels with an imaginary plate. The relative positions of the other vowels seem to correspond fairly well with their positions in the IPA quadrilateral. It could be argued that it would be more correct to let 'tak' (instead of 'lat') represent the corner vowel [ɑː], since it has a context more similar to 'tack'. The reason for using 'lat' is that 'tak' is largely undersampled, and therefore not reliable.


4.2 Cluster Analysis

4.2.1 Research Objective
The goal of this analysis is to obtain an objective classification of a pre-selected vowel, pronounced by native speakers of Swedish. By scientifically determining the number of clusters inherent in the data set, the results may also contribute to generating a hypothesis related to the structure of the speech material.

4.2.1.1 Select Variables
The selection of variables to cluster on should have bearing on the objective, which in this case is to classify speech samples based on their acoustic properties.

The speech samples will be clustered using 12 MFC coefficients, named C1, C2, ..., C12. The rationale for using MFCCs as cluster variables is that the first 12 cepstral coefficients represent the shape of the vocal tract, and are therefore useful for differentiating speech [2].

4.2.2 Research Design

4.2.2.1 Limitations
Performing clustering on the data with all vowels present would make little sense, as it would be like comparing apples to oranges. To get something useful out of the analysis we need to focus on one vowel at a time. This cluster analysis will focus exclusively on the vowel [ɑː], as represented in the word 'lat' (lazy). The selection is motivated by the fact that no diphthongization occurs for this phoneme, which makes it stable and particularly suitable for a comparative analysis. Consequently, the analyses and graphs in the following sections have been performed using the limited data set, consisting only of samples of [ɑː].

4.2.2.2 Dealing with Outliers
Outliers might have a big influence on the outcome of the cluster analysis. It is therefore necessary to conduct a preliminary screening by visually inspecting profile charts, to ensure that no location has profiles that are drastically inconsistent with the rest of the profiles for that location. If such profiles are found, one must decide whether or not to include them in the analysis. Profile charts for all locations are available in the Appendix.

By linking the profile chart and a scatter plot showing the corresponding principal components, outliers can be easily detected. However, one should be careful not to rely solely on the mapping of the principal components, as the plot might be misleading. The cumulative eigenvalues from the PCA show that the first three components capture only about half of the variability present in the data set. The Scree plot shows that more components would give us a better depiction, but for visualization purposes three components is plenty.

Principal Component   Cumulative Eigenvalue (%)
PC (1)                19.281
PC (2)                35.808
PC (3)                48.808
PC (4)                59.689
PC (5)                69.096
PC (6)                75.803
PC (7)                81.925
PC (8)                86.884
PC (9)                91.736
PC (10)               95.693

Eigenvalues calculated for the first 10 principal components and plotted on what is known as a Scree plot. The eigenvalue specifies how much of the variability is preserved in the PCA.


PCA scatter plot and Profile chart of location Nora. The numbers designate speech samples made by speaker 1 and 2 respectively. The outlier at the bottom cannot be considered a representative sample compared to the rest of the samples made by speaker 1.

If we are going to calculate the means for each location, we want to make sure that the mean is not affected by outliers. Say for instance that the samples of two different speakers from one location are tightly correlated, and the samples from a third speaker are tightly grouped, but deviate from the others; then calculating the mean vector for all three speakers would give us a result that may well be an artefact, i.e., it does not represent either of the pronunciations.

PCA results for location Ankarsrum. In this case the set of samples made by speaker 2 deviates from the rest of the population, and may be considered outliers.

We will not actually delete any outliers in this analysis, but instead confine ourselves to having identified that they may exist. Thus, when calculating the mean vectors, each of the 95 locations in the data set will have one MFCC vector representing all samples recorded at that location.


4.2.3 Estimating the Number of Data Clusters

The choice of the optimal number of clusters is not obvious. Intuitively, one can appreciate that too many clusters will produce an incomprehensible dialect map, whereas too few clusters will not provide enough useful information. There are several methods used to statistically determine an appropriate number of clusters in a data set, for instance "Silhouette" and "Gap". The silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters. The "Gap" test uses the Gap statistic to determine how well data are clustered. The Silhouette method will not be investigated further; instead we have chosen to explore the Gap statistic as implemented in Acuity 4.0.

The Gap statistic compares the dispersion of clusters generated from the data to that derived from a sample of null hypothesis sets. The null hypothesis sets are uniformly randomly distributed data in the box defined by the principal components of the input data. To reiterate what was formally described in section 3.2.2.2, the optimal number of clusters determined by the Gap statistic is not the one with the largest Gap value, but instead the smallest number of clusters whose Gap value is within one standard error of the Gap value of the next one.
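A compact sketch of the procedure is given below. For brevity the reference sets are drawn uniformly in the bounding box of the raw variables rather than in the principal-component box described above, and scikit-learn's KMeans stands in for the Acuity implementation.

import numpy as np
from sklearn.cluster import KMeans

def log_dispersion(data, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    return np.log(km.inertia_)   # log of within-cluster sum of squares

def gap_optimal_k(data, k_max=10, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        ref = [log_dispersion(rng.uniform(lo, hi, data.shape), k)
               for _ in range(n_refs)]
        gaps.append(np.mean(ref) - log_dispersion(data, k))
        errs.append(np.std(ref) * np.sqrt(1 + 1 / n_refs))   # standard error s_k
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:   # Gap(k) >= Gap(k+1) - s_{k+1}
            return k
    return k_max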

The Gap statistic as implemented in Acuity 4.0. The graph shows the Gap values calculated on all samples for [ɑː] as in 'lat'. The algorithm suggests an optimal size of 3 clusters. The standard error s_k is displayed as vertical error bars.

The Gap statistic calculated on the means of samples for [ɑː] per location. The Gap values have changed, but the optimal number of clusters is still 3.


In Acuity one can choose between the following distance metrics: Euclidean squared, City block, and Canberra. The City block and the Euclidean measures give similar results, but the City block is less affected by extreme outliers (as the values are not squared). I have used Euclidean squared, as this metric is considered suitable for measuring similarity between MFC coefficients.

The clusters were calculated using both the k-means and the k-medians algorithms as implemented in Acuity; the Gap analysis returned 3 as the optimal number of clusters in both cases. The same value (3) was returned when I analyzed the means for each location.

Later in the cluster analysis we shall verify the number of clusters suggested by the Gap statistic against the results of hierarchical clustering and PCA, to see if similar results can be obtained.

4.2.3.1 Select a Similarity Measure
The MFC coefficients consist of continuous metric data; hence only correlational measures and distance measures will be considered.

Since the analysis has an exploratory objective, I will investigate the results from using different similarity measures. Pearson's r will be used for measuring correlation, and Euclidean distance for measuring proximity.

4.2.3.2 Standardizing the Data
Euclidean distances are additive in the sense that variables contribute independently to the measure of distance [13]. This means that coefficients that have a wider range tend to dominate the metric. To reduce this effect when using Euclidean distance, we need to take Z-scores, i.e., standardize the values of each MFCC by the standard deviation of each coefficient respectively (see section 3.2.2.3 for details).
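The standardization itself is a one-liner, applied column-wise over the twelve MFCCs:

import numpy as np

def z_scores(data):
    # Subtract each coefficient's mean and divide by its standard deviation.
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)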

4.2.4 Assumptions
Before commencing with the actual clustering, we want to make sure that the derived sample is representative of the population. In the previous section, multidimensional scaling was applied to the acoustic similarities and scaled to two and three dimensions respectively. The MFCC representation of the vowels produced a pattern reminiscent of the IPA quadrilateral; hence the cepstral vectors can be considered useful for measuring vowel distances.

At this stage we also identify any occurrence of multicollinearity, that is, whether two (or more) variables are related, or measure the same thing. Since the cepstral coefficients make up an orthogonal set, we know that multicollinearity is not a problem, and it can therefore be disregarded in this analysis.

4.2.5 Deriving Clusters
With the variables selected and similarity measures identified, we are now in a position to begin the clustering. In this approach I will utilize a combination of hierarchical and partitional methods. To keep down the complexity of the visualizations, the analysis will be carried out on the mean vectors, rather than on the entire data set with all samples included.

4.2.5.1 Producing a Reference
To get an initial sense of the dispersion and to provide a reference, we start the analysis by calculating the first three dimensions using multidimensional scaling based on Pearson correlation, as this measure returns the lowest stress value (0.164). This allows us to visualize the global context by way of a dynamic 3D visualization.

Similarity measure      Stress (based on 3 dimensions)
Pearson correlation     0.164
Kendall correlation     0.177
Euclidean distance      0.201


Then we calculate the average cepstral vector, and rank the samples based on how similar (in terms of correlation) they are to the average vector. We will use this ranking to identify the most deviating samples, and see how different clustering algorithms deal with the distribution in general, and these deviants specifically. Thus, based on the rank we do an 80/20 percent split⁷, resulting in two groups: one larger group representing the mainstream pronunciation, and another smaller group that contains the samples that stand out and for this reason should be easily identified.

Based on the similarity-to-average rank, the distribution has been split into a mainstream group (white) and a deviant group (black), represented as scattered records in the 3-dimensional MDS plot. The geographical distribution is shown to the right. The rotation of the arrows (0-90 degrees) corresponds to the level of similarity to the average sample; the arrow for the sample closest to the average points straight up, while the one for the most deviant sample points straight to the side.

From the MDS analysis we can discern that we are dealing with one coherent set of samples and another group that is more spread out. This will be our referential hypothesis going forward. As this reference is based on correlation, and the clustering algorithms will use Euclidean distance as a similarity measure, we will be able to cross-check the results.

4.2.5.2 Hierarchical Clustering
As a first step, hierarchical clustering (HC) will be used to investigate the number of clusters, and to verify the results against the referential hypothesis. The HC results will later be used as input to nonhierarchical procedures to fine-tune the results.

Since Euclidean distance will be used as the similarity measure (see section 4.2.3.2), we need to normalize the data by calculating Z scores and use these as value columns in the analysis. The hierarchical clustering is then initiated based on the 12 Z score columns selected as cluster variables. The clustering method is set to WPGMA (weighted average), and Euclidean distance is selected as the similarity measure. In the following figures, the dendrogram and the map represent the results of the hierarchical clustering of the standardized means per location of the vowel [ɑː].
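A sketch of this step with SciPy, whose "weighted" linkage corresponds to WPGMA; the dendrogram is cut into three clusters as suggested by the Gap statistic. This stands in for the Spotfire implementation actually used.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def hc_clusters(z_data, n_clusters=3):
    dists = pdist(z_data, metric="euclidean")   # pairwise Euclidean distances
    tree = linkage(dists, method="weighted")    # "weighted" == WPGMA in SciPy
    return tree, fcluster(tree, t=n_clusters, criterion="maxclust")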

⁷ The ratio 80/20 is arbitrary. However, the rationale for choosing this split is that we want a smaller group that contains samples that are significantly different from the rest.


Clustering method: WPGMA (weighted average)
Similarity measure: Euclidean distance calculated on Z-scores
Ordering function: Average value

Since the Gap statistic suggested 3 as an optimal number of clusters, the cluster slider (the dashed vertical line) is set to intersect three lines of the dendrogram, thereby dividing the data into three distinct clusters. The length of the horizontal lines in the dendrogram that connect two clusters (nodes) is a measure of their relative closeness.

As can be seen from the 3D plot and the map, the clusters also correlate fairly well with the MDS reference in the previous section. However, since the WPGMA generated an unbalanced tree with one cluster containing only 2 records, we will also explore Ward's method, known for producing clusters with roughly the same number of observations in each cluster (SAS, 1990, p. 56).

Since the second cluster (white) accounts for over 84 percent of the records, one might consider treating it separately, and do a sub-analysis only on these records.


Clustering method: Ward's method
Similarity measure: Half square Euclidean distance
Ordering function: Average value

Judging by the horizontal distances in the dendrogram, Ward's method suggests that either 2 or 4 clusters would be a more suitable number of clusters than 3. For comparative reasons, we will stick to 3 for now, but we will bear this in mind for further analyses. It is also worth noticing that Ward's method produces a more balanced tree in terms of how records are merged.

Comparing this classification to the MDS hypothesis, we can see that Ward's method groups the data in clusters of more equal size, attempting to partition the mainstream samples while still capturing some of the scattered records in one cluster (black).


4.2.5.3 K-means Clustering
In this approach we will use a data-centroid-based search to initialize the cluster centers (centroids). To be able to compare the results with those of hierarchical clustering, we will continue to use 3 clusters and Euclidean distance applied on Z scores.

MDS and geographical plot of K-means applied on Z scores (Euclidean dist., Data centroid based search). Compared to WPGMA, only the circled locations have changed clusters. (Iterations: 6; Total score: 168.6)

It turns out that the k-means algorithm partitions the data in a fashion similar to the WPGMA method, generating 3 clusters containing 15, 78, and 2 samples respectively. The analogous result of two such different methods is likely to originate from the normalized data. Therefore, we shall perform the same clustering on the original cepstral coefficients.

K-means applied directly on the MFCCs produces a rather different cut, grouping the observations into clusters of size 16, 55, and 24. The result is more comparable to that generated by Ward's method, which can be explained by Euclidean distance being sensitive to outliers.

To better examine the clustering, aside from MDS, we will also map the clusters against principal components as a mechanism for verifying the distinctiveness of the clusters.


PCA, MDS and geographical plot of K-means applied on the original MFCCs. The centroids for each cluster are marked in red. (Euclidean dist., Data centroid based search, Iterations: 9; Total score: 11 584)

As can be seen in the 3-dimensional PCA plot, the cluster centers are quite close to each other, suggesting that a better inter-cluster dispersion could be achieved. In the next section we will fine-tune the results.

4.2.5.4 Combining HC and K-means
In this step we will refine the results by utilizing centroids obtained from hierarchical clustering as a basis for generating the seed points for k-means [9]. First, we need to calculate the mean for each of the HC clusters generated by Ward's method in section 4.2.5.2.

For each HC cluster, we identify the sample closest to the average profile, based on Euclidean distance. These are represented by the following locations: Dragsfjärd, Skog, and Väckelsesång.


The 3 cluster centers from hierarchical clustering using Ward's method.

We now use these centroids as seed points to initialize the k-means algorithm. Since Ward's method was applied using Euclidean distance, the same setting will be used for k-means. However, to be able to compare the result with the data-centroid-based search in the previous section, we will use the original MFCC data.
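The seeding itself can be sketched as follows; here the mean vector of each HC cluster serves as an initial centre (the text uses the sample nearest that mean, which is the same idea), and scikit-learn's KMeans stands in for the Spotfire implementation.

import numpy as np
from sklearn.cluster import KMeans

def seeded_kmeans(data, hc_labels):
    ks = np.unique(hc_labels)
    seeds = np.array([data[hc_labels == k].mean(axis=0) for k in ks])
    km = KMeans(n_clusters=len(ks), init=seeds, n_init=1).fit(data)
    return km.labels_, km.cluster_centers_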

The result of k-means clustering initialized with the 3 HC centroids. (Cluster data: Original MFCC columns; Iterations: 6; Total score: 11 584)


The HC centroid for the black cluster, Dragsfjärd, seems to be a better choice, since it is located further from the others in the principal component space. Furthermore, the seeded clustering needed only 6 iterations to reach a final state, compared to 9 iterations for the data-centroid-based approach, suggesting that these seeded centroids have a better fit.

Comparing Data centroid based vs. HC seeded K-means clustering. The map on the left shows the clusters from K-means initialized with data centroids search. To the right are the results of K-means seeded with cluster centroids from HC using Ward's method.

Cluster1 (black): 16 records
  Cluster score: 2581
  Cluster radius: 16.91
  Average centroid similarity: 12.35
  Nearest neighbour: 2
  Nearest neighbour similarity: 15.02

Cluster2 (gray): 55 records
  Cluster score: 6461
  Cluster radius: 18.1
  Average centroid similarity: 10.54
  Nearest neighbour: 3
  Nearest neighbour similarity: 10.07

Cluster3 (white): 24 records
  Cluster score: 2542
  Cluster radius: 20.21
  Average centroid similarity: 9.784
  Nearest neighbour: 2
  Nearest neighbour similarity: 10.07

Seeded_Cluster1 (black): 15 records
  Cluster score: 2322
  Cluster radius: 16.74
  Average centroid similarity: 12.07
  Nearest neighbour: 2
  Nearest neighbour similarity: 15.23

Seeded_Cluster2 (gray): 34 records
  Cluster score: 3906
  Cluster radius: 15.74
  Average centroid similarity: 10.46
  Nearest neighbour: 3
  Nearest neighbour similarity: 9.6

Seeded_Cluster3 (white): 46 records
  Cluster score: 5348
  Cluster radius: 21.68
  Average centroid similarity: 10.3
  Nearest neighbour: 2
  Nearest neighbour similarity: 9.6

From the cluster radii, we can see that the HC-seeded clusters are more compact, except for Seeded_Cluster3, which on the other hand contains many more records. The Nearest neighbour similarity shows how well separated the clusters are from their neighbours. This confirms Dragsfjärd as a "better" choice of centroid for the black cluster, whereas the centers of the other clusters have moved closer together.

4.2.5.5 Self-Organizing Maps
Although k-means clustering and SOM are both partitional clustering methods, they are used in different ways. In the case of k-means, the number of clusters is fixed by the choice of k, whereas for the SOM algorithm the number of reference vectors can be chosen to be much larger, regardless of the number of clusters [11]. To examine this we will perform two instances of SOM: one using a 2 x 2 grid, resulting in 4 clusters, and another with a 4 x 4 grid, resulting in 16 clusters. We will verify the Self-Organizing Maps by monitoring the output parameters, precision error and topographic error, which should both be as small as possible (see section 3.2.4.1.4).

The first trials using unnormalized vectors returned high mapping precision errors (ranging from 4.024 to 8.363), and we will therefore use Z scores throughout the SOM analysis.

The following general settings will be used:
Neighborhood function: Bubble
Radius (begin x end): 2.5 x 0
Learning function: Linear
Initial rate: 0.05
Number of training steps: 12 500

The first instance using a 2 x 2 grid generates a map similar to Ward's and k-means using Data centroid based search. However, since SOM has topology preserving properties, it allows us to use continuous coloring to represent the relationship between the clusters.

Self-Organizing Maps using a 2 x 2 grid (4 clusters), calculated on Z scores.

Mapping precision: 0.131
Topology preservation: 0


The normalization had a favorable effect on the mapping precision, and the error is now at an acceptable level. The topographic error being 0 is an effect of using too small a grid: the nodes in a 2 x 2 grid are all neighbours, so u(x_k) always returns 0.

Using continuous coloring on 4 clusters still only allows us to use 4 colors, and therefore we shall produce a 4 x 4 grid, which should also produce a more accurate map.

Results of applying Self-Organizing Maps using a 4 x 4 grid (16 clusters).
Mapping precision: 0.077
Topology preservation: 0.1158

The precision error has now dropped further, which gives an indication that this SOM is a better adaptation to the input vectors. In addition, the continuous coloring also gives a more balanced picture of the variations in the data, enabling us to plot a dialect map⁸. The black regions in the southwest of Sweden and most of Finland stand out, as does the white region around Mälardalen.

To establish which of the various MFC coefficients are instrumental in identifying a particular dialect, we will use the last of the non-hierarchical methods, PCA.

⁸ Different methods for visualizing SOMs have been suggested (Kohonen, 1995; Kaski, 1999), for instance the cluster display, where the distances between reference vectors are used to determine the gray levels in the middle of the corresponding map units, resulting in a continuous grey scale map.


4.2.5.6 PCA
PCA has already been applied along with the other experiments as a mechanism for evaluating clustering results. However, PCA can also be used to establish the degree of relationship among different MFC coefficients, and may also be helpful in identifying which coefficients are instrumental in characterizing a certain dialect. This may also give a sense as to how many natural groupings exist in the distribution.

Principal Component   Cumulative Eigenvalue (%)
PC (1)                26.797
PC (2)                45.864
PC (3)                56.023
PC (4)                65.002
PC (5)                72.569
PC (6)                80.039
PC (7)                85.536
PC (8)                89.782
PC (9)                93.101
PC (10)               96.095

Eigenvalues and Scree plot showing the result of the PCA performed on the mean samples of /lat/. The cumulative variability preserved by the first two and the first three components is 45.864% and 56.023%, respectively.

Ideally, the first two or three eigenvalues should account for a high percentage of the variance, ensuring that the maps based on these components are a good-quality projection of the original data. In this example, the first two factors allow us to represent 45.864% of the initial variability of the data. This is a fair result, but one needs to be careful when interpreting the maps, as more information is hidden in the subsequent components. Specifically, there is a bend in the Scree plot at PC(3), suggesting that the information gain decreases after this point.

To confirm which variables are well linked with a PC axis, we look at the squared cosines for the first 3 PC axes; the greater the squared cosine, the greater the link with the corresponding axis. The closer the squared cosine of a given variable is to zero, the more careful one has to be when interpreting the results in terms of trends on the corresponding axis.

Radar plot displaying the squared cosines for the first 3 principal components. PC(1) is primarily dominated by coefficients C2, C4, C9, and C11; PC(2) by C2, C5, C6, and C7. We can also see that to be able to account for the effects of C10, we would need to include PC(3).


Next, we plot PC(1) against PC(2) for a combination of observations and variables. This enables us to look at the data on a two-dimensional map, and to identify trends. The biplot is useful in interpreting the meaning of the axes.

Biplot combining the observation space and variable space. In addition, the observations are assigned colors according to k-means clustering.

The above visualization may well represent the ultimate goal of the analysis: a plot in which we are simultaneously showing the results of clustering, PCA projecting the observations, and a superimposed correlation plot projecting the MFCCs in the variable space. The correlation plot is achieved by first transposing the data so that the first row represents the PC values for C1, the second row the PC values for C2, etc.

Provided that variables C1 to C12 are far from the center, the following interpretation can be used:

If variables are close to each other, they are significantly positively correlated (r close to 1);

If they are orthogonal (perpendicular), they are not correlated (r close to 0);

If they are on either end of one axis, then they are significantly negatively correlated (r close to -1).

These trends may be helpful in understanding what underlies the groupings performed by the k-means algorithm. In this case we can see that C4, C9, and also the negatively correlated C2, play a great role in identifying the observations of the black cluster. As for the other two clusters, C5, C6, and C7 seem to be instrumental in the classification process.

However, when the variables are close to the center, it does not mean that they do not contribute; rather, it tells us that some information is carried on other axes. For instance, to better study the effects of C10, we would need to plot PC(1) against PC(3), since C10 is more prominently expressed in the 3rd principal axis.
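A minimal matplotlib sketch of such a biplot, assuming scores, eigvecs, and labels come from the PCA and clustering sketches above; the arrow scale is arbitrary and only affects readability.

import matplotlib.pyplot as plt

def biplot(scores, eigvecs, labels, var_names, scale=3.0):
    plt.scatter(scores[:, 0], scores[:, 1], c=labels, cmap="gray")  # observations
    for i, name in enumerate(var_names):          # one arrow per MFCC (C1..C12)
        x, y = eigvecs[i, 0] * scale, eigvecs[i, 1] * scale
        plt.arrow(0, 0, x, y, head_width=0.05, color="red")
        plt.annotate(name, (x, y))
    plt.xlabel("PC(1)")
    plt.ylabel("PC(2)")
    plt.show()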


4.3 Validation and Interpretation
In section 3.3 various methods of validation were described. Visual verification of clustering results has been employed throughout the analysis, both geographically and spatially.

As a final verification method, we will apply coincidence testing on two different clustering methods: one hierarchical and one non-hierarchical method.

To refer back to previous results in the analysis, we will compare the four clusters generated by Ward's method (section 4.2.5.2) against the four clusters generated by the Self-Organizing Map (section 4.2.5.5). The coincidence can be viewed using a pie chart mapped onto two dimensions.

A two-dimensional pie chart showing the level of coincidence; lighter shades represent coincidental matches, whereas darker colors represent a higher level of significance.

If we filter out all observations with a p-value higher than 0.01, we are left with the 51 most significant matches in terms of how the two algorithms agree on the clustering. Plotted geographically, using the continuous coloring of the 4 x 4 SOM (in section 4.2.5.5), we get a dialect map for which some level of objectivity has been introduced. To get a better understanding of the groups, we show the results on 3 separate maps: one showing the dark clusters, a second showing the mid range, and a third showing the lighter shades.

On the next page are dialect maps, representing only the best clustering matches between SOM and Ward's method. Continuous coloring is set according to the 4 x 4 Self Organizing Map generated in section 4.2.5.5.


The first cluster represents the broad [ɑː] sound. It forms a diagonal across Finland and northern Sweden, and is also present in dialects of Skåne and Gotland. Grangärde also falls into this cluster, due to its high C9. (For details, see the global profile chart in the Appendix.)

(10 records)

The second cluster represents a more rounded [ɑː], pronounced further back in the mouth. This cluster is characterized by the Närke/Södermanland group, together with observations along the Swedish borders. Here Arjeplog constitutes an outlier in the north, explained by the distinctly expressed C10.

(14 records)

Finally, the third cluster represents the standard Swedish pronunciation of [ɑː]. Here we can see the white streak stretching from Uppland right across Södermanland into Värmland. This cluster also stretches along the western parts of Sweden, all the way up to Jämtland/Härjedalen. Observations of this group are characterized by an extraordinarily low C4.

(27 records)


4.4 Discussion
Despite an initial assumption that the choice of clustering algorithm would be the most critical decision of the analysis, this turned out not to be the case. Different methods certainly had an impact on the result, but the choice of similarity measure and the changing of measurement scales made far more significant changes to the resulting clustering structure.

By using a data set consisting of 12 MFC coefficients, the choice of cluster variables was a given. On the other hand, the decision to normalize the data was not an obvious one, and it could be argued that by taking Z scores on cepstral coefficients, information critical for discriminating speech is lost. However, since clustering is an exploratory method, different approaches should be investigated, and evaluation of the results should be left to experts in the field.

One general problem with cluster analysis is that there is no precise and objective way to rank or choose among the different permutations of similarity measures, algorithms, and cluster data. Particularly dangerous is the fact that one often finds what one is looking for by manipulating the features of the analysis. If there is a strong prior notion as to the underlying cluster structure, repeated experimentation with the parameters of the analysis is likely to produce something resembling that outcome.

It should be evident from this study that it is virtually impossible to say something definite about the classification of dialects using cluster analysis. On the other hand, the resulting dialect maps of the empirical study may well serve as a reference when trying to establish resemblances among groups of observations, and for creating hypotheses. Moreover, by merging the results of essentially different algorithms, this reference may be considered objective and could serve as an alternative to a reference based on, for instance, perceptual distances.

The results of this thesis show that the clustering of speech segments based on their spectral components may indeed be an effective tool in the classification of dialects. Still, one must keep in mind that, like all statistical methods, it is likely to yield the most when applied without preconceived notions as to the results.


5 Conclusion and Further Research
This thesis was originally inspired by the work of John Nerbonne and Wilbert Heeringa, who have explored the Levenshtein distance to analyse the dialect borders of Dutch. In his doctoral thesis [1], Heeringa acknowledges acoustic representations as useful, and recommends future refinement of the acoustic processing, verified on the basis of more speakers.

In this study we analysed the speech spectrum by means of 12 cosinusoidal components, each defined by its cepstral coefficient. This automatic approach treats the speech wave as a time-varying signal that can be processed statistically. The dialect data is based on samples from an average of three speakers at each of 95 locations, resulting in material from close to 300 speakers.

Using multidimensional scaling, we concluded that the MFCC representation of the vowels in the data set produced a pattern reminiscent of the IPA quadrilateral, thus suggesting that cepstral vectors can be considered a reliable metric for measuring acoustic distances. Since the IPA model is international, the methods suggested in this thesis may also be applied to languages other than Swedish. (Thus, it would be interesting to verify the methods on, e.g., Dutch, and compare the results with existing dialect maps.)

Unsupervised clustering methods were then applied to show similarities between the mean pronunciations of /lat/ for different locations. In particular, we used PCA to establish which MFC coefficients were instrumental in identifying a particular dialect, and the results were visualized using a Gabriel biplot. During the clustering phase we obtained similar clusters using different algorithms. These clustering results were then visually verified against one another, and by means of coincidence testing.

It would be interesting to perform an interdisciplinary study to see what phenomena underlie the regional patterns generated by the different cluster algorithms. These could be related to infrastructure, general migration, or other historical events.

One possible application of the methods in this thesis may be in an automatic speech recognition (ASR) system, where performance is a critical factor. If the ASR system could identify that the speaker is from a certain region, then expedient algorithms with respect to dialect could be applied to enhance the performance of the system. Live audio requires extremely fast processing and the algorithms in this thesis would therefore need to be optimised for performance.

Gooskens et al. have performed a perception experiment⁹ in which a group of high school students listened to 15 Norwegian dialects. Each pupil then graded the distance of the corresponding dialect from his or her own dialect on a scale from 1 to 10. The final result was a 15 x 15 perceptual distance matrix, which was then used to study the correlation between acoustic distance and perceptual distance. A similar perceptual study could be set up on dialect data from SweDia 2000, to validate the clustering results in this thesis.

⁹ A detailed account of this experiment is available in [1].


6 References
[1] Heeringa, W. (2004). Measuring Dialect Pronunciation Differences using Levenshtein Distance. Rijksuniversiteit Groningen, Groningen.

[2] Rose, Philip (2003). The technical comparison of forensic voice samples. In Freckelton & Selby, Expert Evidence, Ch. 99. Sydney: Thomson Lawbook Co.

[3] Jain, Murty, and Flynn (1999). Data clustering: a review. ACM Computing Surveys.

[4] Johan Koolwaaij (2000), Automatic Speaker Verification in Telephony: a probabilistic approach. University of Nijmegen

[5] Evandro B. Gouvêa (1998), Acoustic-feature-based frequency warping for speaker normalization. Pittsburgh, PA: Carnegie Mellon University

[6] Ben J. Shannon, Kuldip K. Paliwal (2003), A Comparative Study of Filter Bank Spacing for Speech Recognition, 1 2. Microelectronic Engineering Research Conference 2003.

[7] Tronnier, M. (2002), Preaspiration in Southern Swedish Dialects. Vol. 44 Fonetik 2002. Lund: Department of Linguistics and Phonetics, Lund University

[8] Hastie, T., Tibshirani R., and Walther G. (2000), Estimating the number of data clusters via the Gap statistic.Tech. report. Published in JRSSB 2000.

[9] J.F. Hair, R.E. Anderson, R.L. Tatham, W. Black (1995), Multivariate Data Analysis: With Readings. New Jersey: Prentice Hall, Englewood Cliffs, 4th edition.

[10] Young et al (2002), The HTK Book (version 3.2.1), Cambridge: Cambridge University Engineering Department.

[11] Kaski, S. (1997). Data Exploration Using Self-Organizing Maps. Helsinki: Helsinki University of Technology

[12] Kohonen, Teuvo. (1995). Self-Organizing Maps, Third Edition. Heidelberg: Springer.

[13] Technical Document (2003), Strategies for Clustering Microarray Gene Expression Data Orion Integrated Biosciences

[14] Murshudov, Garib. (2003), Principal component analysis (PCA), Lecture notes on Biology B/K/066. Heslington: University of York.

[15] Wishart, D. (1998), Clustan PCA. User documentation for ClustanGraphics 7.04. Clustan Ltd. Edinburgh,UK.

[16] Motlícek, P. (2003). Modeling of Spectra and Temporal Trajectories in Speech Processing, 2, 9 12. Brno University of Technology

[17] Documentation for MathWorks Products. (1994-2005), Statistics Toolbox. Natick, MA: The MathWorks, Inc.

[18] Andy M. Yip (2004), Estimating the Number of Data Clusters via the Gap Statistic, Presentation from BIOSTAT M278. Statistical Analysis of DNA Microarray Data http://www.genetics.ucla.edu/labs/horvath/Biostat278/GapStatistics.ppt, Stanford CA: Department of Biostatistics, Stanford University.


[19] Paye, B. (2003). Clustering: Making the Most of Information Overload. Second Moment, resources for applied analytics. http://www.secondmoment.org/articles/clustering.php. Stone Analytics, Inc.

[20] Young, F. W., and Hamer, R. M. (1987). Multidimensional Scaling: History, Theory and Applications. Hillsdale, NJ: Lawrence Erlbaum.


Appendix

Understanding Cepstral Analysis

By Fourier's theorem, we know that any signal can be represented mathematically as an infinite collection of sine waves of various frequencies, magnitudes and phases. The following description is taken from lecture notes of the course Computer Speech & Hearing, given by The Centre for Speech Technology Research, University of Edinburgh.

Assume that the observed speech signal s(t) is the result of convolving the energy source e(t) with the vocal tract filter v(t). The procedure for calculating the real cepstrum is then as follows:

1. Original time-domain signal:

s(t) = e(t) * v(t)

where * denotes convolution.

2. By taking Fourier transforms of both sides, we arrive in the frequency domain, where convolution is turned into multiplication (uppercase variables represent the complex spectra of the lowercase variables in time).

S(f) = E(f) · V(f)

3. Take the magnitudes to get rid of the phase information from the original signal

|S(f)| = |E(f)| · |V(f)|

4. Apply a filterbank of Mel-scaled triangular filters

Human hearing does not perceive frequencies over ~1 kHz in a linear fashion. The frequency resolution of the ear, measured in terms of the critical bandwidth, gets broader as frequency increases. Psychoacoustic experiments of this phenomenon have provided us with a normalised frequency scale, measured in Mels.
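The lecture notes do not give an explicit formula, but a common approximation of the Mel scale (and the one used in HTK [10]) is:

Mel(f) = 2595 · log10(1 + f / 700)

where f is the frequency in Hz.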

5. Take logs to convert multiplication into addition:

log|S(f)| = log|E(f)| + log|V(f)|

6. To separate the source and filter components, transform to the quefrency domain using an inverse Fourier transform (IDFT):

F⁻¹[log|S(f)|] = F⁻¹[log|E(f)|] + F⁻¹[log|V(f)|]

Note that the transform is linear, so the additivity is preserved. This last transform takes the function back into the time domain, but it is not the same as the time of the original signal; in fact, it is a measure of the rate of change of the spectral magnitudes. This domain is called the cepstrum (an anagram of spectrum), and the time axis is often referred to as the quefrency axis.
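The steps above translate almost directly into code. A minimal sketch of the real-cepstrum computation in Python (omitting the Mel filterbank stage of step 4), applied to a hypothetical test signal:

import numpy as np

def real_cepstrum(signal):
    """Inverse FFT of the log magnitude spectrum (steps 2, 3, 5 and 6)."""
    spectrum = np.fft.fft(signal)              # step 2: to the frequency domain
    magnitude = np.abs(spectrum)               # step 3: discard phase
    log_magnitude = np.log(magnitude + 1e-12)  # step 5: log (guard against log 0)
    return np.fft.ifft(log_magnitude).real     # step 6: back to quefrency domain

fs = 16000                                     # hypothetical 16 kHz sampling rate
t = np.arange(fs) / fs
pulse_train = np.sign(np.sin(2 * np.pi * 100 * t))  # crude 100 Hz source signal
cepstrum = real_cepstrum(pulse_train)
print(cepstrum[:13])  # low quefrencies describe the spectral envelope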


HTK: hcopy.conf

The file hcopy.conf defines the parameters used when performing cepstral analysis in HTK. The following configuration was used to generate the MFCC data set for this study.

SOURCEKIND = WAVEFORM
SOURCEFORMAT = WAV
# SOURCERATE = 625
ZMEANSOURCE = FALSE

TARGETKIND = MFCC
# TARGETFORMAT = HTK
TARGETRATE = 100000

SAVECOMPRESSED = TRUE
SAVEWITHCRC = TRUE

WINDOWSIZE = 256250.0
USEHAMMING = TRUE
PREEMCOEF = 0.95

# USEPOWER = FALSE
NUMCHANS = 25
# LOFREQ = -1.0
# HIFREQ = -1.0

# LPCORDER = 12

CEPLIFTER = 22
NUMCEPS = 12

# RAWENERGY = TRUE
ENORMALISE = TRUE
# ESCALE = 1.0
# SILFLOOR = 50.0

DELTAWINDOW = 4
ACCWINDOW = 4

# USESILDET = TRUE
# SPEECHTHRESH = 0.0
# SILTHRESH = 0.0
# MEASURESIL = TRUE

OUTSILWARN = TRUE
# SILMEAN = 0.0
# SILSTD = 0.0
# AUDIOSIG = 0
# VQTABLE = ""
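The configuration is applied with HTK's HCopy tool; the file names in this example are hypothetical:

HCopy -C hcopy.conf recording.wav recording.mfc

This reads the WAV source, performs the windowing, filterbank and cepstral analysis specified above, and writes the 12 MFCCs per frame to an HTK parameter file.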


MFCC Profiles

Profile chart showing the acoustic vectors for [ ] in 'lat'. Each profile represents the mean pronunciation of one speaker, plotted by location.